Update README (2025-09-11 01:12:33)
README.md CHANGED
@@ -33,9 +33,9 @@ Attach these heads to the base model to **stop decoding early** when a token’s
 
 ## TL;DR
 
--
-- At each step,
-- Ships with a
+- One tiny linear **head per transformer layer**.
+- At each decoding step, compute layer-wise logits; if `max_prob >= confidence_threshold`, **exit** early.
+- Ships with a loader and a minimal generation helper.
 
 ---
 
@@ -58,7 +58,7 @@ early = importlib.util.module_from_spec(spec); sys.modules["early_exit_wrapper"]
 spec.loader.exec_module(early)
 
 # 3) Load wrapped model + tokenizer
-wrapped, tok = early.load_early_exit_from_hub(REPO_ID)  # picks CPU/MPS/CUDA & safe dtype
+wrapped, tok = early.load_early_exit_from_hub(REPO_ID)  # auto-picks CPU/MPS/CUDA & safe dtype
 
 # 4) Generate with early exit
 ids = early.generate_with_early_exit(
@@ -67,43 +67,3 @@ ids = early.generate_with_early_exit(
     max_new_tokens=64, temperature=0.7, top_p=0.9
 )
 print(tok.decode(ids[0], skip_special_tokens=True))
-
----
-language:
-- en
-tags:
-- early-exit
-- adapters
-- efficiency
-- inference
-- tinyllama
-- llama
-- text-generation
-license: apache-2.0
-base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
-library_name: transformers
-pipeline_tag: text-generation
-model-index:
-- name: tinyllama-1.1b-early-exit
-  results: []
----
-
-## Citation
-
-If you use this, please cite:
-
-```bibtex
-@misc{tinyllama_early_exit_2025,
-  title = {TinyLlama Early-Exit Heads (Adapter)},
-  author = {Sivateja (5ivatej)},
-  year = {2025},
-  url = {https://huggingface.co/5ivatej/tinyllama-1.1b-early-exit}
-}
-
-@misc{zhang2023tinyllama,
-  title = {TinyLlama: Open-Source Small Language Models},
-  author = {Zhang, et al.},
-  year = {2023},
-  howpublished = {\url{https://huggingface.co/TinyLlama}},
-  note = {Apache-2.0}
-}
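
For intuition, the exit rule the new TL;DR bullets describe (one small linear head per layer; stop as soon as the top token's probability clears `confidence_threshold`) can be sketched as below. This is a minimal sketch of the idea only; the function name, tensor shapes, and `heads` structure are assumptions, not the actual `early_exit_wrapper` internals.

```python
import torch.nn.functional as F

def early_exit_logits(hidden_states, heads, confidence_threshold=0.9):
    """Confidence-gated early exit (sketch, not the repo's real code).

    hidden_states: per-layer hidden states, each [batch, seq, d_model]
    heads: one small nn.Linear(d_model, vocab_size) per layer
    """
    for layer_idx, (h, head) in enumerate(zip(hidden_states, heads)):
        logits = head(h[:, -1, :])                  # logits for the last position
        max_prob = F.softmax(logits, dim=-1).max()  # confidence of the top token
        if max_prob >= confidence_threshold:        # confident enough: exit here
            return logits, layer_idx
    return logits, len(heads) - 1                   # no early exit: use final layer
```

A decoding loop would call this once per generated token, sampling from whichever layer's logits triggered the exit and falling back to the final layer when no head is confident.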
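
The updated `# auto-picks CPU/MPS/CUDA & safe dtype` comment suggests a loader heuristic along these lines. The actual selection logic inside `load_early_exit_from_hub` is not shown in this diff, so treat this as an assumption:

```python
import torch

def pick_device_and_dtype():
    # Assumed heuristic behind "auto-picks CPU/MPS/CUDA & safe dtype";
    # the repo's loader may choose differently.
    if torch.cuda.is_available():
        return torch.device("cuda"), torch.float16  # fp16 is safe on CUDA
    if torch.backends.mps.is_available():
        return torch.device("mps"), torch.float16   # fp16 also works on Apple MPS
    return torch.device("cpu"), torch.float32       # stay in fp32 on CPU
```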