5ivatej committed · Commit bd4c956 (verified) · Parent(s): 8bc9ffd

Update README (2025-09-11 01:12:33)

Files changed (1): README.md (+4 -44)
README.md CHANGED
@@ -33,9 +33,9 @@ Attach these heads to the base model to **stop decoding early** when a token’s
 
 ## TL;DR
 
-- Plug-in heads for each transformer layer.
-- At each step, we check the layer-wise logits; if `max_prob >= confidence_threshold`, we **exit** before running later layers.
-- Ships with a tiny loader and a simple generation helper.
+- One tiny linear **head per transformer layer**.
+- At each decoding step, compute layer-wise logits; if `max_prob >= confidence_threshold`, **exit** early.
+- Ships with a loader and a minimal generation helper.
 
 ---
 
@@ -58,7 +58,7 @@ early = importlib.util.module_from_spec(spec); sys.modules["early_exit_wrapper"]
 spec.loader.exec_module(early)
 
 # 3) Load wrapped model + tokenizer
-wrapped, tok = early.load_early_exit_from_hub(REPO_ID)  # picks CPU/MPS/CUDA & safe dtype
+wrapped, tok = early.load_early_exit_from_hub(REPO_ID)  # auto-picks CPU/MPS/CUDA & safe dtype
 
 # 4) Generate with early exit
 ids = early.generate_with_early_exit(
@@ -67,43 +67,3 @@ ids = early.generate_with_early_exit(
     max_new_tokens=64, temperature=0.7, top_p=0.9
 )
 print(tok.decode(ids[0], skip_special_tokens=True))
-
----
-language:
-- en
-tags:
-- early-exit
-- adapters
-- efficiency
-- inference
-- tinyllama
-- llama
-- text-generation
-license: apache-2.0
-base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
-library_name: transformers
-pipeline_tag: text-generation
-model-index:
-- name: tinyllama-1.1b-early-exit
-  results: []
----
-
-## Citation
-
-If you use this, please cite:
-
-```bibtex
-@misc{tinyllama_early_exit_2025,
-  title = {TinyLlama Early-Exit Heads (Adapter)},
-  author = {Sivateja (5ivatej)},
-  year = {2025},
-  url = {https://huggingface.co/5ivatej/tinyllama-1.1b-early-exit}
-}
-
-@misc{zhang2023tinyllama,
-  title = {TinyLlama: Open-Source Small Language Models},
-  author = {Zhang, et al.},
-  year = {2023},
-  howpublished = {\url{https://huggingface.co/TinyLlama}},
-  note = {Apache-2.0}
-}
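
For readers skimming the diff, here is a minimal sketch of the confidence-threshold exit rule the new TL;DR describes. The function name, the `exit_heads` structure, and the 0.9 default are illustrative assumptions, not the repo's actual API.

```python
import torch

def pick_early_exit_logits(hidden_states, exit_heads, confidence_threshold=0.9):
    # Sketch of the per-layer early-exit check (assumed names, not the repo's API).
    # hidden_states: per-layer hidden state of the last position (list of tensors)
    # exit_heads:    one small linear head per transformer layer
    # Returns (logits, layer_idx) for the first layer whose top-token probability
    # clears the threshold; otherwise the final layer's logits are used.
    logits = None
    for layer_idx, (h, head) in enumerate(zip(hidden_states, exit_heads)):
        logits = head(h)  # project this layer's hidden state to vocabulary logits
        max_prob = torch.softmax(logits, dim=-1).max().item()
        if max_prob >= confidence_threshold:  # confident enough: skip later layers
            return logits, layer_idx
    return logits, len(exit_heads) - 1
```

The speedup comes from not running the remaining decoder layers once the check fires; the wrapper's `generate_with_early_exit` presumably applies a check of this kind at every decoding step.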