license: apache-2.0
short_description: Place for AI Models
---

# Model Card: LLM Brain Rot Demonstration

## Model/Dataset Name
LLM Brain Rot Demonstration: Qwen2.5 0.5B Comparison

## Overview
This demonstration showcases the "Brain Rot" effect in Large Language Models (LLMs) as described in the research paper "LLMs Can Get Brain Rot!" by Xing et al. (2025). It compares two Qwen2.5 0.5B Instruct models, one trained on control data and one trained on 100% M1 junk data, illustrating how exposure to low-quality web content can degrade an LLM's cognitive capabilities.

The original research was conducted by a team from Texas A&M University, the University of Texas at Austin, and Purdue University. This demonstration is a simplified implementation of their findings, focusing on the most extreme case (100% junk data) to clearly illustrate the phenomenon.

## Intended Use

### Primary Tasks
- Educational demonstration of data-quality effects on LLM performance
- Comparison of reasoning capabilities between models trained on data of different quality
- Illustration of the "thought-skipping" phenomenon in LLMs

### Intended Users
- Students learning about LLM training and data quality
- Researchers studying model robustness and data effects
- Educators demonstrating AI concepts
- Anyone interested in understanding how training data affects model behavior

### Inappropriate Uses
- Production deployment or real-world applications
- Making generalized claims about all LLMs based on this limited comparison
- Evaluating the overall quality of the base Qwen2.5 model family
- Drawing conclusions about the effects of content beyond what is demonstrated

## Dataset/Model Details

### Models
- Base Model: Qwen2.5 0.5B Instruct
- Comparison Models:
  - Qwen2.5 0.5B trained on control data (0% junk)
  - Qwen2.5 0.5B trained on 100% M1 junk data
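
For a quick hands-on comparison, the sketch below loads the two checkpoints (model IDs from the Resources section) and prompts them identically with the `transformers` library. The example question and generation settings are illustrative, and the sketch assumes the checkpoints ship a chat template.

```python
# Minimal sketch: load the control-trained and junk-trained checkpoints
# and compare their answers to the same prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = {
    "control (0% junk)": "AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-control-tweet-1m-en-sft",
    "junk (100% M1)": "AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-junk-tweet-1m-en-sft",
}

# Illustrative reasoning prompt; any ARC-style question works here.
PROMPT = "A ball is dropped from a tall building. Explain, step by step, what happens to its speed as it falls."

for label, model_id in MODEL_IDS.items():
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": PROMPT}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
    reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(f"--- {label} ---\n{reply}\n")
```
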
+
### Dataset
|
| 52 |
+
- ARC Challenge questions (small sample from main repository)
|
| 53 |
+
- Safety questions (small sample from main repository)
|
| 54 |
+
- RULER (3 custom sets based on RULER repository sub tests; Needle in Haystack, Variable Tracking and Question Answering)
|
| 55 |
+
- TRAIT (custom set based on original TRAIT repository)
|
| 56 |
+
|
| 57 |
+
### Model Variants and Datasets in Original Research
|
| 58 |
The original research included 40 model variants:
|
| 59 |
+
- 4 base models: Llama3 8B, Qwen2.5 7B, Qwen2.5 0.5B, Qwen3 4B
|
| 60 |
+
- 2 junk metrics: M1 (engagement degree) and M2 (semantic quality)
|
| 61 |
+
- 5 training ratios: 0%, 20%, 50%, 80%, 100% junk data
|
| 62 |
+
- 4 base models x 5 training ratios x 2 junk metrics = 40 total model variants
|
| 63 |
+
|
| 64 |
The original Dataset was:
|
| 65 |
+
- Source: Twitter/X posts from 2010
|
| 66 |
+
- Filtering: M1 metric (engagement degree) - short but highly popular posts
|
| 67 |
+
- Processing: Control data consists of longer, less popular posts
|
| 68 |
+
- Language: Primarily English
|
| 69 |
+
|
| 70 |
+
## Ethical Considerations
|
| 71 |
+
|
| 72 |
+
### Possible Biases
|
| 73 |
+
- The Twitter dataset may contain demographic, cultural, and ideological biases present on the platform
|
| 74 |
+
- The M1 metric (based on popularity) may amplify content that is attention-grabbing rather than accurate or thoughtful
|
| 75 |
+
- The models may reproduce stereotypes or problematic content present in the training data
|
| 76 |
+
|
| 77 |
+
### Risks of Misuse
|
| 78 |
+
- The junk-trained model may generate lower-quality, less reliable, or potentially problematic responses
|
| 79 |
+
- Users might overgeneralize from this specific demonstration to make broader claims about LLMs
|
| 80 |
+
- The demonstration might be misinterpreted as a definitive statement about all social media content
|
| 81 |
+
|
| 82 |
+
### Privacy/Consent Issues
|
| 83 |
+
- The models were trained on public Twitter posts, but individual tweets may contain personal information
|
| 84 |
+
- Users should be cautious about inputting personal information into either model
|
| 85 |
+
|
| 86 |
+
## Limitations
|
| 87 |
+
|
| 88 |
+
### Scope Limitations
|
| 89 |
+
- Only demonstrates the effect with one model family (Qwen2.5) and size (0.5B)
|
| 90 |
+
- Only shows the comparison between 0% and 100% junk data, not the "dose-response" relationship
|
| 91 |
+
- Only demonstrates M1 metric effects, not M2 (semantic quality)
|
| 92 |
+
- Only evaluates a limited number of examples per task type for demonstration purposes
|
| 93 |
+
|
| 94 |
+
### Technical Limitations
|
| 95 |
+
- The smaller model size (0.5B) may show more pronounced effects than larger models
|
| 96 |
+
- The demonstration focuses on reasoning tasks, but the original paper found effects across multiple capabilities
|
| 97 |
+
- The interface may not fully capture all nuances of the "thought-skipping" phenomenon
|
| 98 |
+
|
| 99 |
+
### Generalizability
|
| 100 |
+
- Results may not apply to all LLM architectures or training methodologies
|
| 101 |
+
- The specific Twitter dataset from 2010 may not represent current web content
|
| 102 |
+
- The demonstration shows correlation, not necessarily causation for all scenarios
|
| 103 |
+
|
| 104 |
+
## Training & Evaluation
|
| 105 |
+
|
| 106 |
+
### Training Process
|
| 107 |
The original models were trained using the following process:
|
| 108 |
+
- Base models (Qwen2.5 0.5B Instruct) underwent continual pre-training
|
| 109 |
+
- Training parameters: learning rate 1×10^-5, AdamW optimizer, 3 epochs
|
| 110 |
+
- Models were trained on either control data or 100% M1 junk data
|
| 111 |
+
- After pre-training, models underwent instruction tuning on the Alpaca English dataset
|
| 112 |
+
|
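
For orientation only, here is a rough sketch of how the stated hyperparameters (learning rate 1×10⁻⁵, AdamW, 3 epochs) would map onto a standard `transformers` continual pre-training run. The corpus file name, sequence length, and batch size are placeholders, not the original paper's configuration.

```python
# Rough sketch of the continual pre-training setup described above.
# Only lr=1e-5, AdamW, and 3 epochs come from this card; everything else
# (corpus file, max_length, batch size) is an illustrative placeholder.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder corpus: either the control split or the 100% M1 junk split.
corpus = load_dataset("text", data_files={"train": "control_or_junk_tweets.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="brainrot-continual-pretrain",
    num_train_epochs=3,             # 3 epochs (from the card)
    learning_rate=1e-5,             # 1e-5 (from the card)
    optim="adamw_torch",            # AdamW optimizer (from the card)
    per_device_train_batch_size=4,  # placeholder
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
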
### Evaluation Metrics
The original research evaluated models on multiple benchmarks (a minimal chain-of-thought accuracy sketch follows the list):
- ARC Challenge: chain-of-thought prompting with accuracy measurement
- RULER: sample tasks representing needle-in-a-haystack, variable tracking, and question answering
- TRAIT: sample personality questions with simplified analysis
- Safety: subset of harmful behaviors with refusal detection
- Thought-skipping analysis: heuristic-based categorization of reasoning failures
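
To make the ARC-style measurement concrete, the sketch below scores chain-of-thought completions by exact match on the extracted letter choice. The prompt wording, answer format, and extraction regex are simplified assumptions, not the original evaluation harness.

```python
# Minimal sketch of chain-of-thought accuracy on ARC-style multiple-choice items.
import re


def build_cot_prompt(question: str, choices: dict) -> str:
    """Format a question plus lettered choices, asking for step-by-step reasoning."""
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    return f"{question}\n{options}\nThink step by step, then finish with 'Answer: <letter>'."


def extract_choice(completion: str):
    """Pull the final letter choice out of a completion, if present."""
    match = re.search(r"Answer:\s*([A-D])", completion)
    return match.group(1) if match else None


def accuracy(examples, generate) -> float:
    """`generate` is any callable mapping a prompt string to a completion string."""
    correct = sum(
        extract_choice(generate(build_cot_prompt(ex["question"], ex["choices"]))) == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)
```
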
### Key Results from Original Research
For Qwen2.5 0.5B with the M1 intervention:
- ARC Challenge (CoT): 74.9 → 57.2 (a 17.7-point drop)
- RULER overall: 93.9 → 71.0 (a 22.9-point drop)
- Safety metrics showed increased risk scores
- Personality traits showed increases in narcissism and psychopathy

### Analysis of Failures
The primary failure mode identified was "thought-skipping" (a simple heuristic for flagging it is sketched after the list), where models:
- Skip intermediate reasoning steps
- Provide answers without showing their thinking process
- Make logical leaps or factual errors in their reasoning
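
The sketch below shows the kind of heuristic this demonstration can use to flag thought-skipping in a completion. The sentence-splitting rule and thresholds are illustrative assumptions, not the paper's exact taxonomy.

```python
# Heuristic sketch: classify a completion by how much visible reasoning it
# contains before its final answer. Thresholds are illustrative assumptions.
import re


def classify_reasoning(completion: str) -> str:
    body, _, _ = completion.partition("Answer:")
    # Count sentence-like reasoning steps appearing before the final answer.
    steps = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if len(s.split()) > 3]
    if not steps:
        return "no_thinking"    # answer given with no visible reasoning
    if len(steps) < 2:
        return "skipped_steps"  # jumps to the conclusion after minimal reasoning
    return "full_reasoning"


print(classify_reasoning("Answer: B"))  # -> no_thinking
print(classify_reasoning(
    "Gravity accelerates the ball downward. Air resistance grows with speed. Answer: it speeds up."
))  # -> full_reasoning
```
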
## References

### Primary Research
- Xing, S., Hong, J., Wang, Y., Chen, R., Zhang, Z., Grama, A., Tu, Z., & Wang, Z. (2025). LLMs Can Get Brain Rot! arXiv preprint arXiv:2510.13928.

### Resources
- GitHub Repository: https://github.com/llm-brain-rot/llm-brain-rot
- Project Website: https://llm-brain-rot.github.io/
- Hugging Face Models:
  - Qwen2.5 0.5B trained on control data (0% junk): https://huggingface.co/AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-control-tweet-1m-en-sft
  - Qwen2.5 0.5B trained on 100% M1 junk data: https://huggingface.co/AmberYifan/qwen2.5-0.5b-instruct-full-pretrain-junk-tweet-1m-en-sft

### Related Work
- Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693.
- Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493.
- Seddik, M. E. A., Chen, S.-W., Hayou, S., Youssef, P., & Debbah, M. (2024). How bad is training on synthetic data? A statistical analysis of language model collapse. arXiv preprint arXiv:2404.05094.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference