Update README.md
README.md
CHANGED
|
@@ -1,301 +1,8 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
ChartEval addresses a critical challenge in automated chart generation: **how do we reliably evaluate the quality of generated charts?** Current evaluation methods suffer from significant limitations:
|
| 11 |
-
|
| 12 |
-
- **Human evaluation** is costly and difficult to scale
|
| 13 |
-
- **Pixel-based metrics** (like SSIM) ignore data accuracy and unfairly penalize semantically equivalent charts
|
| 14 |
-
- **Data-centric measures** (like SCRM) overlook visual design quality
|
| 15 |
-
- **LLM-based evaluators** show concerning inconsistencies due to prompt sensitivity
|
| 16 |
-
|
| 17 |
-
**ChartEval's Solution:** Transform chart images into structured scene graphs and apply graph-based similarity measures for comprehensive quality assessment across visual similarity, semantic alignment, and data fidelity.
|
| 18 |
-
|
| 19 |
-
### Key Innovation
|
| 20 |
-
|
| 21 |
-
Instead of treating charts as mere images or data tables, ChartEval views charts as **visual scene graphs** where:
|
| 22 |
-
- Visual objects (data marks, legends, axes) become **nodes**
|
| 23 |
-
- Attributes (colors, sizes, positions) define **node properties**
|
| 24 |
-
- Relationships (spatial arrangements, data mappings) become **edges**
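The node, property, and edge mapping above can be sketched with plain data classes. This is a minimal illustration, not ChartEval's internal representation: the class names, the `kind` labels, and the relation names (`positioned_on`, `describes`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A visual object in the chart, e.g. a data mark, legend, or axis."""
    node_id: str
    kind: str
    properties: dict = field(default_factory=dict)  # colors, sizes, positions


@dataclass
class SceneGraph:
    """Nodes keyed by id, plus (source, relation, target) edge triples."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node_id, kind, **properties):
        self.nodes[node_id] = Node(node_id, kind, properties)

    def add_edge(self, source, relation, target):
        # Refuse edges that point at objects missing from the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.add((source, relation, target))


# Hypothetical bar chart with one bar, one axis, and one legend entry.
graph = SceneGraph()
graph.add_node("bar_q1", "data_mark", color="steelblue", height=42)
graph.add_node("x_axis", "axis", title="Quarter")
graph.add_node("legend_rev", "legend", label="Revenue")
graph.add_edge("bar_q1", "positioned_on", "x_axis")
graph.add_edge("legend_rev", "describes", "bar_q1")
```

Storing edges as triples keeps both spatial relationships and data mappings in one set, which makes later set-based comparisons straightforward.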
|
| 25 |
-
|
| 26 |
-
## Key Features
|
| 27 |
-
|
| 28 |
-
### Comprehensive Evaluation Metrics
|
| 29 |
-
- **GraphBERT Score**: Semantic similarity between charts (F1, Precision, Recall)
|
| 30 |
-
- **Hallucination Rate**: Detection of spurious/incorrect information
|
| 31 |
-
- **Omission Rate**: Identification of missing critical elements
|
| 32 |
-
- **Graph Edit Distance**: Structural differences between charts
|
| 33 |
-
|
| 34 |
-
### Multi-LLM Support
|
| 35 |
-
- **Claude 3.5 Sonnet**: Detailed chart analysis and precise data extraction
|
| 36 |
-
- **GPT-4 Vision**: Strong vision capabilities with thorough analytical insights
|
| 37 |
-
- Easy switching between providers through a unified interface
|
| 38 |
-
|
| 39 |
-
### Multiple Chart Types
|
| 40 |
-
- Line charts, bar charts, pie charts, scatter plots, and more
|
| 41 |
-
- 2D and 3D visualizations
|
| 42 |
-
- Support for complex multi-series data
|
| 43 |
-
|
| 44 |
-
### Detailed Human-Readable Analysis
|
| 45 |
-
- Executive summary with accuracy scores
|
| 46 |
-
- Specific examples of errors with chart element references
|
| 47 |
-
- Element-by-element comparison (titles, data, axes, visual design)
|
| 48 |
-
- Actionable recommendations for improvement
|
| 49 |
-
- Impact assessment for decision-making
|
| 50 |
-
|
| 51 |
-
### Web Interface
|
| 52 |
-
- User-friendly Gradio interface
|
| 53 |
-
- Pre-loaded example chart pairs
|
| 54 |
-
- Real-time evaluation with progress tracking
|
| 55 |
-
- Comprehensive results visualization
|
| 56 |
-
|
| 57 |
-
## Concept Diagram
|
| 58 |
-
|
| 59 |
-
```
┌─────────────────┐   ┌─────────────────┐
│  Ground Truth   │   │   Predicted     │
│     Chart       │   │     Chart       │
└─────────┬───────┘   └─────────┬───────┘
          │                     │
          ▼                     ▼
┌─────────────────┐   ┌─────────────────┐
│ ChartSceneParse │   │ ChartSceneParse │
│   (LLM-based)   │   │   (LLM-based)   │
└─────────┬───────┘   └─────────┬───────┘
          │                     │
          ▼                     ▼
┌─────────────────┐   ┌─────────────────┐
│   Scene Graph   │   │   Scene Graph   │
│   (Vega JSON)   │   │   (Vega JSON)   │
└─────────┬───────┘   └─────────┬───────┘
          │                     │
          └──────────┬──────────┘
                     ▼
          ┌─────────────────────┐
          │  Graph Comparison   │
          │                     │
          │ • GraphBERT Score   │
          │ • Hallucination     │
          │ • Omission Rate     │
          │ • Edit Distance     │
          └─────────────────────┘
```
|
| 88 |
-
|
| 89 |
-
## Installation
|
| 90 |
-
|
| 91 |
-
### Prerequisites
|
| 92 |
-
- Python 3.8+
|
| 93 |
-
- API key for Claude (Anthropic) or OpenAI GPT-4
|
| 94 |
-
|
| 95 |
-
### Quick Setup
|
| 96 |
-
|
| 97 |
-
1. **Clone the repository**
|
| 98 |
-
```bash
|
| 99 |
-
git clone https://github.com/chartEval/charteval.git
|
| 100 |
-
cd charteval
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
2. **Install dependencies**
|
| 104 |
-
```bash
|
| 105 |
-
pip install -r requirements.txt
|
| 106 |
-
```
|
| 107 |
-
|
| 108 |
-
3. **Set up API keys** (choose one method):
|
| 109 |
-
|
| 110 |
-
**Method A: Environment Variables**
|
| 111 |
-
```bash
|
| 112 |
-
export CLAUDE_API_KEY="your-claude-api-key"
|
| 113 |
-
export OPENAI_API_KEY="your-openai-api-key"
|
| 114 |
-
```
|
| 115 |
-
|
| 116 |
-
**Method B: Direct Configuration**
|
| 117 |
-
Edit the script and update:
|
| 118 |
-
```python
|
| 119 |
-
CLAUDE_API_KEY = "your-claude-api-key"
|
| 120 |
-
OPENAI_API_KEY = "your-openai-api-key"
|
| 121 |
-
```
|
| 122 |
-
|
| 123 |
-
4. **Run the application**
|
| 124 |
-
```bash
|
| 125 |
-
python charteval_demo.py
|
| 126 |
-
```
|
| 127 |
-
|
| 128 |
-
The interface will be available at `http://localhost:7860`
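For Method A, a script can read the keys from the environment at startup. The sketch below assumes the variable names shown in step 3 and the provider labels "Claude" and "GPT-4"; the helper `resolve_api_key` is hypothetical and not part of ChartEval.

```python
import os


def resolve_api_key(provider, environ=None):
    """Return the API key for `provider` ("Claude" or "GPT-4") from the environment."""
    environ = os.environ if environ is None else environ
    # Variable names follow Method A; the provider labels are an assumption.
    env_names = {"Claude": "CLAUDE_API_KEY", "GPT-4": "OPENAI_API_KEY"}
    if provider not in env_names:
        raise ValueError(f"unknown provider: {provider!r}")
    key = environ.get(env_names[provider], "").strip()
    if not key:
        raise RuntimeError(f"{env_names[provider]} is not set")
    return key
```

Failing early with a clear message avoids a less obvious authentication error on the first API call.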
|
| 129 |
-
|
| 130 |
-
## Requirements
|
| 131 |
-
|
| 132 |
-
```txt
|
| 133 |
-
gradio>=4.0.0
|
| 134 |
-
anthropic>=0.8.0
|
| 135 |
-
openai>=1.0.0
|
| 136 |
-
sentence-transformers>=2.2.0
|
| 137 |
-
networkx>=3.0
|
| 138 |
-
scikit-learn>=1.3.0
|
| 139 |
-
matplotlib>=3.6.0
|
| 140 |
-
pandas>=2.0.0
|
| 141 |
-
numpy>=1.24.0
|
| 142 |
-
Pillow>=9.0.0
|
| 143 |
-
```
|
| 144 |
-
|
| 145 |
-
### API Requirements
|
| 146 |
-
- **Claude API**: Get your key from console.anthropic.com
|
| 147 |
-
- **OpenAI API**: Get your key from platform.openai.com/api-keys
|
| 148 |
-
|
| 149 |
-
## Configuration
|
| 150 |
-
|
| 151 |
-
### LLM Provider Settings
|
| 152 |
-
|
| 153 |
-
The system supports different model configurations:
|
| 154 |
-
|
| 155 |
-
```python
|
| 156 |
-
# Claude Configuration
|
| 157 |
-
claude_config = {
|
| 158 |
-
"model": "claude-3-5-sonnet-20241022",
|
| 159 |
-
"max_tokens": 4000,
|
| 160 |
-
"temperature": 0.1
|
| 161 |
-
}
|
| 162 |
-
|
| 163 |
-
# GPT-4 Configuration
|
| 164 |
-
gpt4_config = {
|
| 165 |
-
"model": "gpt-4-vision-preview",
|
| 166 |
-
"max_tokens": 4000,
|
| 167 |
-
"temperature": 0.1
|
| 168 |
-
}
|
| 169 |
-
```
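One way to switch between these settings is a small lookup keyed by provider. This is only a sketch: the `CONFIGS` table copies the values above, while the provider labels and the `get_config` helper are hypothetical.

```python
# Values copied from the configuration shown above.
CONFIGS = {
    "Claude": {"model": "claude-3-5-sonnet-20241022", "max_tokens": 4000, "temperature": 0.1},
    "GPT-4": {"model": "gpt-4-vision-preview", "max_tokens": 4000, "temperature": 0.1},
}


def get_config(provider, **overrides):
    """Return a copy of the provider's settings with optional overrides applied."""
    if provider not in CONFIGS:
        raise ValueError(f"unknown provider: {provider!r}")
    # Build a new dict so the shared defaults are never mutated.
    return {**CONFIGS[provider], **overrides}


deterministic = get_config("Claude", temperature=0.0)
```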
|
| 170 |
-
|
| 171 |
-
### Adding Custom Examples
|
| 172 |
-
|
| 173 |
-
Update the `EXAMPLE_CHART_PAIRS` dictionary:
|
| 174 |
-
|
| 175 |
-
```python
|
| 176 |
-
EXAMPLE_CHART_PAIRS = {
|
| 177 |
-
"Your Example Name": {
|
| 178 |
-
"ground_truth": "path/to/ground_truth.png",
|
| 179 |
-
"predicted": "path/to/predicted.png",
|
| 180 |
-
"description": "Description of your chart example"
|
| 181 |
-
}
|
| 182 |
-
}
|
| 183 |
-
```
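Before adding entries, it can help to check that each one carries the three keys used above. The checker below is a hypothetical helper, not part of ChartEval, and it does not verify that the image paths exist.

```python
REQUIRED_KEYS = {"ground_truth", "predicted", "description"}


def find_incomplete_examples(pairs):
    """Map each example name to the sorted list of required keys it is missing."""
    problems = {}
    for name, entry in pairs.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems[name] = sorted(missing)
    return problems


# Hypothetical entries: the second one lacks a description.
pairs = {
    "Complete": {
        "ground_truth": "path/to/ground_truth.png",
        "predicted": "path/to/predicted.png",
        "description": "Description of your chart example",
    },
    "Incomplete": {
        "ground_truth": "path/to/ground_truth.png",
        "predicted": "path/to/predicted.png",
    },
}
problems = find_incomplete_examples(pairs)
```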
|
| 184 |
-
|
| 185 |
-
## Usage
|
| 186 |
-
|
| 187 |
-
### Web Interface
|
| 188 |
-
|
| 189 |
-
1. **Select LLM Provider**: Choose between Claude or GPT-4
|
| 190 |
-
2. **Input Charts**: Either select a pre-loaded example OR upload your own charts
|
| 191 |
-
- Chart 1: Ground truth (reference) chart
|
| 192 |
-
- Chart 2: Predicted/generated chart to evaluate
|
| 193 |
-
3. **Run Evaluation**: Click "Evaluate Charts"
|
| 194 |
-
4. **Review Results**: Get comprehensive metrics and detailed analysis
|
| 195 |
-
|
| 196 |
-
### Programmatic Usage
|
| 197 |
-
|
| 198 |
-
```python
|
| 199 |
-
from charteval import ChartEval
|
| 200 |
-
|
| 201 |
-
# Initialize evaluator
|
| 202 |
-
evaluator = ChartEval(
|
| 203 |
-
llm_provider="Claude",
|
| 204 |
-
api_key="your-api-key"
|
| 205 |
-
)
|
| 206 |
-
|
| 207 |
-
# Compare charts
|
| 208 |
-
bert_score, hall_score, omis_score, ged_score = evaluator.compare(
|
| 209 |
-
chart1_path="ground_truth.png",
|
| 210 |
-
chart2_path="predicted.png"
|
| 211 |
-
)
|
| 212 |
-
|
| 213 |
-
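# Get detailed explanation. graph1, graph2, metrics, chart1_b64, and chart2_b64 are not defined in this snippet and must be prepared first.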
|
| 214 |
-
explanation = evaluator.generate_detailed_explanation(
|
| 215 |
-
graph1, graph2, metrics, chart1_b64, chart2_b64
|
| 216 |
-
)
|
| 217 |
-
|
| 218 |
-
print(f"GraphBERT F1: {bert_score['f1']:.3f}")
|
| 219 |
-
print(f"Hallucination Rate: {hall_score['hallucination_rate']:.3f}")
|
| 220 |
-
print(f"Omission Rate: {omis_score['omission_rate']:.3f}")
|
| 221 |
-
```
|
| 222 |
-
|
| 223 |
-
## Metrics Explained
|
| 224 |
-
|
| 225 |
-
### GraphBERT Score
|
| 226 |
-
- **Purpose**: Measures semantic similarity between charts
|
| 227 |
-
- **Components**:
|
| 228 |
-
- Precision: How much of the predicted chart matches the ground truth
|
| 229 |
-
- Recall: How much of the ground truth is captured in the predicted chart
|
| 230 |
-
- F1: Harmonic mean of precision and recall
|
| 231 |
-
- **Range**: 0.0 to 1.0 (higher is better)
|
| 232 |
-
- **Interpretation**: >0.8 indicates strong semantic alignment
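The precision, recall, and F1 components can be illustrated with BERTScore-style greedy matching over element embeddings. This is an assumption about how such a score might be computed, not a description of ChartEval's implementation; the function names and the toy 2-D vectors are hypothetical, and a real system would embed scene-graph elements with a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(reference, predicted):
    """Precision, recall, and F1 from best-match cosine similarities."""
    if not reference or not predicted:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Precision: match each predicted element to its closest reference element.
    precision = sum(max(cosine(p, r) for r in reference) for p in predicted) / len(predicted)
    # Recall: match each reference element to its closest predicted element.
    recall = sum(max(cosine(r, p) for p in predicted) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy embeddings: the prediction reproduces one of two reference elements.
reference = [[1.0, 0.0], [0.0, 1.0]]
predicted = [[1.0, 0.0]]
scores = greedy_match_scores(reference, predicted)
```

In this toy case everything predicted is correct (precision 1.0), but only half of the reference is covered (recall 0.5), so F1 falls between the two.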
|
| 233 |
-
|
| 234 |
-
### Hallucination Rate
|
| 235 |
-
- **Purpose**: Detects spurious or incorrect information in the predicted chart
|
| 236 |
-
- **What it catches**: Extra data points, wrong labels, incorrect values
|
| 237 |
-
- **Range**: 0.0 to 1.0 (lower is better)
|
| 238 |
-
- **Interpretation**: <0.2 indicates minimal false information
|
| 239 |
-
|
| 240 |
-
### Omission Rate
|
| 241 |
-
- **Purpose**: Identifies missing critical elements
|
| 242 |
-
- **What it catches**: Missing data points, absent labels, incomplete information
|
| 243 |
-
- **Range**: 0.0 to 1.0 (lower is better)
|
| 244 |
-
- **Interpretation**: <0.2 indicates comprehensive information coverage
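The hallucination and omission rates can be approximated with exact set comparison. The sketch assumes each chart element is reduced to a comparable string label, which is a simplification; ChartEval's actual matching may be softer. The function name and label format are hypothetical.

```python
def hallucination_and_omission(reference_elements, predicted_elements):
    """Share of predicted elements absent from the reference, and vice versa."""
    reference, predicted = set(reference_elements), set(predicted_elements)
    spurious = predicted - reference   # present only in the prediction
    missing = reference - predicted    # present only in the ground truth
    return {
        "hallucination_rate": len(spurious) / len(predicted) if predicted else 0.0,
        "omission_rate": len(missing) / len(reference) if reference else 0.0,
    }


# Hypothetical element labels for a small bar chart.
reference = {"title:Sales", "x:Quarter", "bar:Q1=10", "bar:Q2=20"}
predicted = {"title:Sales", "x:Quarter", "bar:Q1=10", "bar:Q3=30", "legend:Revenue"}
rates = hallucination_and_omission(reference, predicted)
```

Here two of the five predicted elements are spurious and one of the four reference elements is missing, so the hallucination rate is 0.4 and the omission rate is 0.25.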
|
| 245 |
-
|
| 246 |
-
### Graph Edit Distance (Normalized)
|
| 247 |
-
- **Purpose**: Measures structural differences between charts
|
| 248 |
-
- **What it measures**: Layout changes, component differences, design variations
|
| 249 |
-
- **Range**: 0.0 to 1.0 (lower is better)
|
| 250 |
-
- **Interpretation**: <0.3 indicates similar structure
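A normalized edit distance in the 0.0 to 1.0 range can be sketched as follows. Exact graph edit distance searches over node mappings; this simplified version assumes nodes in both graphs already share identifiers, so edits reduce to set differences. The function name and the normalization by the combined graph size are assumptions.

```python
def normalized_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Node and edge insertions/deletions, divided by the size of the union graph."""
    nodes_a, nodes_b = set(nodes_a), set(nodes_b)
    edges_a, edges_b = set(edges_a), set(edges_b)
    # Symmetric difference counts elements that must be inserted or deleted.
    edits = len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)
    size = len(nodes_a | nodes_b) + len(edges_a | edges_b)
    return edits / size if size else 0.0


# Hypothetical graphs: the prediction drops one bar and its edge to the axis.
truth_nodes = {"title", "x_axis", "bar_q1", "bar_q2"}
truth_edges = {("bar_q1", "x_axis"), ("bar_q2", "x_axis")}
pred_nodes = {"title", "x_axis", "bar_q1"}
pred_edges = {("bar_q1", "x_axis")}
distance = normalized_edit_distance(truth_nodes, truth_edges, pred_nodes, pred_edges)
```

Identical graphs score 0.0 and graphs with nothing in common score 1.0, matching the "lower is better" reading above.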
|
| 251 |
-
|
| 252 |
-
## Key Results
|
| 253 |
-
|
| 254 |
-
ChartEval demonstrates **significantly stronger correlation with human judgments** compared to existing metrics:
|
| 255 |
-
|
| 256 |
-
| Metric | ChartCraft | ChartMimic | ChartX | Text2Chart31 |
|
| 257 |
-
|--------|------------|------------|---------|--------------|
|
| 258 |
-
| **ChartEval (Ours)** | **0.76** | **0.79** | **0.85** | **0.78** |
|
| 259 |
-
| GPT-Score | 0.25 | 0.27 | 0.33 | 0.25 |
|
| 260 |
-
| SSIM | 0.09 | 0.11 | 0.24 | 0.18 |
|
| 261 |
-
| SCRM | 0.13 | 0.15 | 0.29 | 0.19 |
|
| 262 |
-
|
| 263 |
-
*Pearson correlation coefficients with human quality ratings across 4K chart evaluations*
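For reference, the Pearson coefficient reported in the table can be computed as shown below. The score lists are made-up placeholders, not data from the ChartEval study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values: metric scores that rise linearly with human ratings.
metric_scores = [0.2, 0.4, 0.6, 0.8]
human_ratings = [1.0, 2.0, 3.0, 4.0]
r = pearson(metric_scores, human_ratings)
```

A value near 1.0 means the metric ranks charts much as human raters do, while values near 0 indicate little linear relationship.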
|
| 264 |
-
|
| 265 |
-
### Performance Highlights
|
| 266 |
-
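- **Strong improvement** over existing metrics: Pearson correlation of 0.76-0.85 with human ratings, versus 0.09-0.33 for GPT-Score, SSIM, and SCRM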
|
| 267 |
-
- **Consistent performance** across chart types (line, bar, pie, scatter)
|
| 268 |
-
- **Robust evaluation** across different LLM generators (GPT-4o, Claude, Qwen2.5-VL)
|
| 269 |
-
- **High inter-annotator agreement** (α = 0.74-0.85) in human evaluation
|
| 270 |
-
|
| 271 |
-
### Image Requirements
|
| 272 |
-
- **Format**: PNG, JPG, JPEG
|
| 273 |
-
- **Resolution**: High-resolution preferred (>800x600)
|
| 274 |
-
- **Quality**: Clear, readable text and labels
|
| 275 |
-
- **Content**: Single chart per image
|
| 276 |
-
|
| 277 |
-
### Future Improvements
|
| 278 |
-
- Fine-tuning VLMs for low-resolution chart images
|
| 279 |
-
- Enhanced 3D chart parsing capabilities
|
| 280 |
-
- Faster inference through model optimization
|
| 281 |
-
- Support for additional chart types (sankey, treemap, etc.)
|
| 282 |
-
|
| 283 |
-
## License
|
| 284 |
-
|
| 285 |
-
This project is licensed under the MIT License - see the LICENSE file for details.
|
| 286 |
-
|
| 287 |
-
## Citation
|
| 288 |
-
|
| 289 |
-
If you use ChartEval in your research, please cite our paper:
|
| 290 |
-
|
| 291 |
-
```bibtex
|
| 292 |
-
@article{goswami2024charteval,
|
| 293 |
-
title={ChartEval: LLM-Driven Chart Generation Evaluation Using Scene Graph Parsing},
|
| 294 |
-
author={Goswami, Kanika and Mathur, Puneet and Rossi, Ryan and Dernoncourt, Franck and Gupta, Vivek and Manocha, Dinesh},
|
| 295 |
-
journal={arXiv preprint arXiv:XXXX.XXXXX},
|
| 296 |
-
year={2024}
|
| 297 |
-
}
|
| 298 |
-
```
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
**Ready to evaluate your chart generation system?** Get started with our online demo or follow the installation guide above!
|
|
|
|
| 1 |
+
title: ChartEval - LLM-Driven Chart Evaluation
|
| 2 |
+
emoji: π
|
| 3 |
+
colorFrom: blue
|
| 4 |
+
colorTo: purple
|
| 5 |
+
sdk: gradio
|
| 6 |
+
sdk_version: "4.0.0"
|
| 7 |
+
app_file: charteval_demo.py
|
| 8 |
+
pinned: false
|