DemonioStrada committed on
Commit e9591af · verified · 1 Parent(s): 30aa2bd

Update README.md

Files changed (1)
  1. README.md +127 -98
README.md CHANGED
license: apache-2.0
short_description: Place for AI Models
---

# Model Card: LLM Brain Rot Demonstration

## Model/Dataset Name
LLM Brain Rot Demonstration: Qwen2.5 0.5B Comparison

## Overview
This demonstration showcases the "Brain Rot" effect in Large Language Models (LLMs) as described in the research paper "LLMs Can Get Brain Rot!" by Xing et al. (2025). It compares two Qwen2.5 0.5B Instruct models: one trained on control data and one trained on 100% M1 junk data, illustrating how exposure to low-quality web content can degrade an LLM's cognitive capabilities.

The original research was conducted by a team from Texas A&M University, University of Texas at Austin, and Purdue University. This demonstration is a simplified implementation of their findings, focusing on the most extreme case (100% junk data) to clearly illustrate the phenomenon.

## Intended Use

### Primary Tasks
- Educational demonstration of data quality effects on LLM performance
- Comparison of reasoning capabilities between models trained on different data quality
- Illustration of the "thought-skipping" phenomenon in LLMs

### Intended Users
- Students learning about LLM training and data quality
- Researchers studying model robustness and data effects
- Educators demonstrating AI concepts
- Anyone interested in understanding how training data affects model behavior

### Inappropriate Uses
- Production deployment or real-world applications
- Making generalized claims about all LLMs based on this limited comparison
- Evaluating the overall quality of the base Qwen2.5 model family
- Drawing conclusions about the effects of content beyond what is demonstrated

## Dataset/Model Details

### Models
- Base Model: Qwen2.5 0.5B Instruct
- Comparison Models (a loading sketch follows this list):
  - Qwen2.5 0.5B trained on control data (0% junk)
  - Qwen2.5 0.5B trained on 100% M1 junk data
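
The two comparison checkpoints (linked under Resources) can be loaded with the `transformers` library. The snippet below is only an illustrative sketch, not part of the original demonstration; the prompt and generation settings are arbitrary choices:

```python
# Illustrative sketch only: load the control- and junk-trained checkpoints
# listed in the Resources section and compare their answers to one prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = {
    "control (0% junk)": "AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-control-tweet-1m-en-sft",
    "junk (100% M1)": "AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-junk-tweet-1m-en-sft",
}

prompt = "Explain step by step why the sky appears blue."

for label, model_id in MODEL_IDS.items():
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # Use the chat template so the instruct models see a proper conversation.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
    reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(f"--- {label} ---\n{reply}\n")
```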

### Dataset
- ARC Challenge questions (small sample from main repository)
- Safety questions (small sample from main repository)
- RULER (3 custom sets based on RULER repository sub-tests: Needle in a Haystack, Variable Tracking, and Question Answering; an illustrative item is sketched below)
- TRAIT (custom set based on the original TRAIT repository)
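
The RULER-style retrieval items can be pictured with a small, hypothetical needle-in-a-haystack construction; the filler text, needle, and question below are made up for illustration and are not taken from the RULER repository:

```python
import random

# Hypothetical sketch of a needle-in-a-haystack item in the spirit of RULER:
# a "needle" fact is buried inside filler text and the model must retrieve it.
FILLER = "The grass is green. The sky is blue. The sun is bright."
NEEDLE = "The special magic number for the experiment is {value}."
QUESTION = "What is the special magic number for the experiment?"

def build_haystack_item(value: str = "7421", n_filler: int = 200, seed: int = 0) -> dict:
    random.seed(seed)
    sentences = [FILLER] * n_filler
    # Insert the needle at a random position inside the filler context.
    sentences.insert(random.randrange(len(sentences)), NEEDLE.format(value=value))
    context = " ".join(sentences)
    prompt = f"{context}\n\n{QUESTION}"
    return {"prompt": prompt, "answer": value}

item = build_haystack_item()
print(item["prompt"][:120], "...")
print("expected answer:", item["answer"])
```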

### Model Variants and Datasets in Original Research
The original research included 40 model variants:
- 4 base models: Llama3 8B, Qwen2.5 7B, Qwen2.5 0.5B, Qwen3 4B
- 2 junk metrics: M1 (engagement degree) and M2 (semantic quality)
- 5 training ratios: 0%, 20%, 50%, 80%, 100% junk data
- 4 base models × 5 training ratios × 2 junk metrics = 40 total model variants

The original dataset was:
- Source: Twitter/X posts from 2010
- Filtering: M1 metric (engagement degree) - short but highly popular posts
- Processing: Control data consists of longer, less popular posts
- Language: Primarily English

## Ethical Considerations

### Possible Biases
- The Twitter dataset may contain demographic, cultural, and ideological biases present on the platform
- The M1 metric (based on popularity) may amplify content that is attention-grabbing rather than accurate or thoughtful
- The models may reproduce stereotypes or problematic content present in the training data

### Risks of Misuse
- The junk-trained model may generate lower-quality, less reliable, or potentially problematic responses
- Users might overgeneralize from this specific demonstration to make broader claims about LLMs
- The demonstration might be misinterpreted as a definitive statement about all social media content

### Privacy/Consent Issues
- The models were trained on public Twitter posts, but individual tweets may contain personal information
- Users should be cautious about inputting personal information into either model

## Limitations

### Scope Limitations
- Only demonstrates the effect with one model family (Qwen2.5) and size (0.5B)
- Only shows the comparison between 0% and 100% junk data, not the "dose-response" relationship
- Only demonstrates M1 metric effects, not M2 (semantic quality)
- Only evaluates a limited number of examples per task type for demonstration purposes

### Technical Limitations
- The smaller model size (0.5B) may show more pronounced effects than larger models
- The demonstration focuses on reasoning tasks, but the original paper found effects across multiple capabilities
- The interface may not fully capture all nuances of the "thought-skipping" phenomenon

### Generalizability
- Results may not apply to all LLM architectures or training methodologies
- The specific Twitter dataset from 2010 may not represent current web content
- The demonstration shows correlation, not necessarily causation for all scenarios

## Training & Evaluation

### Training Process
The original models were trained using the following process:
- Base models (Qwen2.5 0.5B Instruct) underwent continual pre-training
- Training parameters: learning rate 1×10⁻⁵, AdamW optimizer, 3 epochs (a configuration sketch follows this list)
- Models were trained on either control data or 100% M1 junk data
- After pre-training, models underwent instruction tuning on the Alpaca English dataset
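
For illustration only, the reported hyperparameters map onto a Hugging Face `TrainingArguments` sketch like the following; the output directory, batch size, and logging interval are placeholders rather than values from the original paper:

```python
from transformers import TrainingArguments

# Illustrative only: mirrors the hyperparameters reported above
# (learning rate 1e-5, AdamW, 3 epochs); other values are placeholders.
training_args = TrainingArguments(
    output_dir="./brain-rot-pretrain",   # placeholder output directory
    learning_rate=1e-5,                  # reported learning rate
    num_train_epochs=3,                  # reported number of epochs
    optim="adamw_torch",                 # AdamW optimizer
    per_device_train_batch_size=8,       # assumption: not reported in this card
    logging_steps=50,                    # assumption: not reported in this card
)
```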

### Evaluation Metrics
The original research evaluated models on multiple benchmarks:
- ARC Challenge: Chain-of-thought prompting with accuracy measurement (a scoring sketch follows this list)
- RULER: Sample tasks representing needle-in-a-haystack, variable tracking, and question answering
- TRAIT: Sample personality questions with simplified analysis
- Safety: Subset of harmful behaviors with refusal detection
- Thought-skipping analysis: Heuristic-based categorization of reasoning failures
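
As a rough illustration of the ARC Challenge scoring above, a chain-of-thought prompt can be built, the final answer letter extracted, and accuracy computed against the gold label. The prompt wording and the answer-extraction regex below are assumptions, not the original evaluation harness:

```python
import re

# Simplified sketch of chain-of-thought accuracy scoring for multiple-choice items.
def make_cot_prompt(question: str, choices: dict[str, str]) -> str:
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    return (
        f"{question}\n{options}\n"
        "Think step by step, then give your final answer as 'Answer: <letter>'."
    )

def extract_answer(completion: str) -> str | None:
    # Assumed extraction rule: take the letter after a trailing "Answer:" marker.
    match = re.search(r"Answer:\s*([A-D])", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None

def accuracy(predictions: list[str], golds: list[str]) -> float:
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)

# Tiny usage example with a made-up model output:
preds = [extract_answer("The moon reflects sunlight... Answer: B")]
print(accuracy(preds, ["B"]))  # 1.0
```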

### Key Results from Original Research
For Qwen2.5 0.5B with M1 intervention:
- ARC Challenge (CoT): 74.9 → 57.2 (a 17.7-point drop)
- RULER Overall: 93.9 → 71.0 (a 22.9-point drop)
- Safety metrics showed increased risk scores
- Personality traits showed increases in narcissism and psychopathy

### Analysis of Failures
The primary failure mode identified was "thought-skipping" (a toy detection heuristic is sketched below), where models:
- Skip intermediate reasoning steps
- Provide answers without showing their thinking process
- Make logical leaps or factual errors in their reasoning
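
The original analysis relied on heuristic categorization of reasoning failures. The toy heuristic below only approximates that idea: it flags responses that state an answer with few sentences and no reasoning markers. The marker list and threshold are assumptions, not the paper's criteria:

```python
import re

# Toy heuristic, not the original analysis: a response is flagged as
# "thought-skipping" if it is very short and contains no reasoning markers.
REASONING_MARKERS = ("because", "therefore", "first", "then", "so ", "step")

def looks_like_thought_skipping(response: str, min_sentences: int = 3) -> bool:
    sentences = [s for s in re.split(r"[.!?]", response) if s.strip()]
    has_markers = any(m in response.lower() for m in REASONING_MARKERS)
    return len(sentences) < min_sentences and not has_markers

print(looks_like_thought_skipping("The answer is B."))  # True: answer with no reasoning
print(looks_like_thought_skipping(
    "First, sound needs a medium. Space is a vacuum, so there is no medium. "
    "Therefore, the answer is B."))                      # False: visible reasoning steps
```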

## References

### Primary Research
- Xing, S., Hong, J., Wang, Y., Chen, R., Zhang, Z., Grama, A., Tu, Z., & Wang, Z. (2025). LLMs Can Get Brain Rot! arXiv preprint arXiv:2510.13928.

### Resources
- GitHub Repository: https://github.com/llm-brain-rot/llm-brain-rot
- Project Website: https://llm-brain-rot.github.io/
- Hugging Face Models:
  - Qwen2.5 0.5B trained on control data (0% junk): https://huggingface.co/AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-control-tweet-1m-en-sft
  - Qwen2.5 0.5B trained on 100% M1 junk data: https://huggingface.co/AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-junk-tweet-1m-en-sft

### Related Work
- Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693.
- Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493.
- Seddik, M. E. A., Chen, S.-W., Hayou, S., Youssef, P., & Debbah, M. (2024). How bad is training on synthetic data? A statistical analysis of language model collapse. arXiv preprint arXiv:2404.05094.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference