Kanika7 committed on
Commit
85b1c1a
·
verified ·
1 Parent(s): 228e28d

Update README.md

Files changed (1)
  1. README.md +8 -301
README.md CHANGED
@@ -1,301 +1,8 @@
- # ChartEval: LLM-Driven Chart Generation Evaluation Using Scene Graph Parsing
-
- A comprehensive chart evaluation system that compares generated chart images with ground truth using advanced scene graph parsing and LLM-driven analysis.
-
- Demo Video: https://youtu.be/HcPuJaVO04s
- Live Demo Link: https://d97c12eb37eaeba040.gradio.live
-
- ## Overview
-
- ChartEval addresses a critical challenge in automated chart generation: **how do we reliably evaluate the quality of generated charts?** Current evaluation methods suffer from significant limitations:
-
- - **Human evaluation** is costly and difficult to scale
- - **Pixel-based metrics** (like SSIM) ignore data accuracy and unfairly penalize semantically equivalent charts
- - **Data-centric measures** (like SCRM) overlook visual design quality
- - **LLM-based evaluators** show concerning inconsistencies due to prompt sensitivity
-
- **ChartEval's Solution:** Transform chart images into structured scene graphs and apply graph-based similarity measures for comprehensive quality assessment across visual similarity, semantic alignment, and data fidelity.
-
- ### Key Innovation
-
- Instead of treating charts as mere images or data tables, ChartEval views charts as **visual scene graphs** where:
- - Visual objects (data marks, legends, axes) become **nodes**
- - Attributes (colors, sizes, positions) define **node properties**
- - Relationships (spatial arrangements, data mappings) become **edges**
-
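The node/property/edge view above can be sketched with plain Python structures. This is only an illustrative toy (the node names, attribute keys, and edge labels here are made up for the example; ChartEval's actual graphs follow a Vega-JSON-derived schema):

```python
# Toy scene graph for a two-bar chart: visual objects are nodes,
# attributes are node properties, relationships are labeled edges.
scene_graph = {
    "nodes": {
        "bar_1": {"kind": "data_mark", "color": "blue", "value": 10},
        "bar_2": {"kind": "data_mark", "color": "red", "value": 15},
        "x_axis": {"kind": "axis", "label": "Quarter"},
    },
    "edges": [
        ("bar_1", "x_axis", "mapped_to"),
        ("bar_2", "x_axis", "mapped_to"),
    ],
}

print(len(scene_graph["nodes"]), len(scene_graph["edges"]))  # 3 2
```

Representing charts this way is what lets graph-similarity measures compare two charts element by element rather than pixel by pixel.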
- ## Key Features
-
- ### Comprehensive Evaluation Metrics
- - **GraphBERT Score**: Semantic similarity between charts (F1, Precision, Recall)
- - **Hallucination Rate**: Detection of spurious or incorrect information
- - **Omission Rate**: Identification of missing critical elements
- - **Graph Edit Distance**: Structural differences between charts
-
- ### Multi-LLM Support
- - **Claude 3.5 Sonnet**: Excellent detailed chart analysis and precise data extraction
- - **GPT-4 Vision**: Strong vision capabilities with thorough analytical insights
- - Easy switching between providers through a unified interface
-
- ### Multiple Chart Types
- - Line charts, bar charts, pie charts, scatter plots, and more
- - 2D and 3D visualizations
- - Support for complex multi-series data
-
- ### Detailed Human-Readable Analysis
- - Executive summary with accuracy scores
- - Specific examples of errors with chart element references
- - Element-by-element comparison (titles, data, axes, visual design)
- - Actionable recommendations for improvement
- - Impact assessment for decision-making
-
- ### Web Interface
- - User-friendly Gradio interface
- - Pre-loaded example chart pairs
- - Real-time evaluation with progress tracking
- - Comprehensive results visualization
-
- ## Concept Diagram
-
- ```
- ┌─────────────────┐       ┌─────────────────┐
- │  Ground Truth   │       │    Predicted    │
- │      Chart      │       │      Chart      │
- └────────┬────────┘       └────────┬────────┘
-          │                         │
-          ▼                         ▼
- ┌─────────────────┐       ┌─────────────────┐
- │ ChartSceneParse │       │ ChartSceneParse │
- │   (LLM-based)   │       │   (LLM-based)   │
- └────────┬────────┘       └────────┬────────┘
-          │                         │
-          ▼                         ▼
- ┌─────────────────┐       ┌─────────────────┐
- │   Scene Graph   │       │   Scene Graph   │
- │   (Vega JSON)   │       │   (Vega JSON)   │
- └────────┬────────┘       └────────┬────────┘
-          │                         │
-          └────────────┬────────────┘
-                       ▼
-           ┌───────────────────────┐
-           │   Graph Comparison    │
-           │                       │
-           │  • GraphBERT Score    │
-           │  • Hallucination      │
-           │  • Omission Rate      │
-           │  • Edit Distance      │
-           └───────────────────────┘
- ```
-
- ## Installation
-
- ### Prerequisites
- - Python 3.8+
- - API key for Claude (Anthropic) or OpenAI GPT-4
-
- ### Quick Setup
-
- 1. **Clone the repository**
- ```bash
- git clone https://github.com/chartEval/charteval.git
- cd charteval
- ```
-
- 2. **Install dependencies**
- ```bash
- pip install -r requirements.txt
- ```
-
- 3. **Set up API keys** (choose one method):
-
- **Method A: Environment Variables**
- ```bash
- export CLAUDE_API_KEY="your-claude-api-key"
- export OPENAI_API_KEY="your-openai-api-key"
- ```
-
- **Method B: Direct Configuration**
- Edit the script and update:
- ```python
- CLAUDE_API_KEY = "your-claude-api-key"
- OPENAI_API_KEY = "your-openai-api-key"
- ```
-
- 4. **Run the application**
- ```bash
- python charteval_demo.py
- ```
-
- The interface will be available at `http://localhost:7860`.
-
- ## Requirements
-
- ```txt
- gradio>=4.0.0
- anthropic>=0.8.0
- openai>=1.0.0
- sentence-transformers>=2.2.0
- networkx>=3.0
- scikit-learn>=1.3.0
- matplotlib>=3.6.0
- pandas>=2.0.0
- numpy>=1.24.0
- Pillow>=9.0.0
- ```
-
- ### API Requirements
- - **Claude API**: Get your key from console.anthropic.com
- - **OpenAI API**: Get your key from platform.openai.com/api-keys
-
- ## Configuration
-
- ### LLM Provider Settings
-
- The system supports different model configurations:
-
- ```python
- # Claude configuration
- claude_config = {
-     "model": "claude-3-5-sonnet-20241022",
-     "max_tokens": 4000,
-     "temperature": 0.1
- }
-
- # GPT-4 configuration
- gpt4_config = {
-     "model": "gpt-4-vision-preview",
-     "max_tokens": 4000,
-     "temperature": 0.1
- }
- ```
-
- ### Adding Custom Examples
-
- Update the `EXAMPLE_CHART_PAIRS` dictionary:
-
- ```python
- EXAMPLE_CHART_PAIRS = {
-     "Your Example Name": {
-         "ground_truth": "path/to/ground_truth.png",
-         "predicted": "path/to/predicted.png",
-         "description": "Description of your chart example"
-     }
- }
- ```
-
- ## Usage
-
- ### Web Interface
-
- 1. **Select LLM Provider**: Choose between Claude and GPT-4
- 2. **Input Charts**: Either select a pre-loaded example or upload your own charts
-    - Chart 1: Ground truth (reference) chart
-    - Chart 2: Predicted/generated chart to evaluate
- 3. **Run Evaluation**: Click "Evaluate Charts"
- 4. **Review Results**: Get comprehensive metrics and detailed analysis
-
- ### Programmatic Usage
-
- ```python
- from charteval import ChartEval
-
- # Initialize the evaluator
- evaluator = ChartEval(
-     llm_provider="Claude",
-     api_key="your-api-key"
- )
-
- # Compare charts
- bert_score, hall_score, omis_score, ged_score = evaluator.compare(
-     chart1_path="ground_truth.png",
-     chart2_path="predicted.png"
- )
-
- # Get a detailed explanation (graph1, graph2, metrics, and the base64-encoded
- # chart images are intermediate values from the parsing and comparison steps)
- explanation = evaluator.generate_detailed_explanation(
-     graph1, graph2, metrics, chart1_b64, chart2_b64
- )
-
- print(f"GraphBERT F1: {bert_score['f1']:.3f}")
- print(f"Hallucination Rate: {hall_score['hallucination_rate']:.3f}")
- print(f"Omission Rate: {omis_score['omission_rate']:.3f}")
- ```
-
- ## Metrics Explained
-
- ### GraphBERT Score
- - **Purpose**: Measures semantic similarity between charts
- - **Components**:
-   - Precision: How much of the predicted chart matches the ground truth
-   - Recall: How much of the ground truth is captured in the predicted chart
-   - F1: Harmonic mean of precision and recall
- - **Range**: 0.0 to 1.0 (higher is better)
- - **Interpretation**: >0.8 indicates strong semantic alignment
-
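The precision/recall/F1 relationship can be sketched from matched-element counts. This is a simplified illustration only: the function name and counting scheme are invented for the example, and the real GraphBERT Score matches scene-graph elements by embedding similarity rather than exact counts:

```python
def graph_f1(matched: int, n_pred: int, n_truth: int) -> tuple:
    """Precision, recall, and F1 from counts of matched scene-graph elements.

    matched: elements of the predicted chart that match the ground truth
    n_pred:  total elements in the predicted chart
    n_truth: total elements in the ground-truth chart
    """
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_truth if n_truth else 0.0
    # Harmonic mean; guard against the all-zero case.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 8 of 10 predicted elements match; the ground truth has 12 elements.
p, r, f = graph_f1(8, 10, 12)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it is pulled toward the weaker of precision and recall, so a chart cannot score well by over- or under-generating elements.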
- ### Hallucination Rate
- - **Purpose**: Detects spurious or incorrect information in the predicted chart
- - **What it catches**: Extra data points, wrong labels, incorrect values
- - **Range**: 0.0 to 1.0 (lower is better)
- - **Interpretation**: <0.2 indicates minimal false information
-
- ### Omission Rate
- - **Purpose**: Identifies missing critical elements
- - **What it catches**: Missing data points, absent labels, incomplete information
- - **Range**: 0.0 to 1.0 (lower is better)
- - **Interpretation**: <0.2 indicates comprehensive information coverage
-
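Both rates have a simple set-based reading: hallucinated elements appear only in the prediction, omitted elements appear only in the ground truth. The sketch below uses exact set membership for clarity; the actual metrics operate on matched scene-graph elements, and the function and element names are illustrative:

```python
def hallucination_and_omission(truth_elems: set, pred_elems: set) -> tuple:
    """Set-based sketch of the two error rates.

    Hallucination rate: fraction of predicted elements absent from the truth.
    Omission rate: fraction of ground-truth elements missing from the prediction.
    """
    hallucinated = pred_elems - truth_elems
    omitted = truth_elems - pred_elems
    hall_rate = len(hallucinated) / len(pred_elems) if pred_elems else 0.0
    omis_rate = len(omitted) / len(truth_elems) if truth_elems else 0.0
    return hall_rate, omis_rate

truth = {"title", "x_axis", "y_axis", "bar_1", "bar_2"}
pred = {"title", "x_axis", "bar_1", "bar_2", "bar_3"}  # extra bar_3, missing y_axis
print(hallucination_and_omission(truth, pred))  # (0.2, 0.2)
```

Note the two rates are independent: a chart can hallucinate heavily while omitting nothing, or vice versa, which is why both are reported.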
- ### Graph Edit Distance (Normalized)
- - **Purpose**: Measures structural differences between charts
- - **What it measures**: Layout changes, component differences, design variations
- - **Range**: 0.0 to 1.0 (lower is better)
- - **Interpretation**: <0.3 indicates similar structure
-
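A normalized structural distance can be sketched over node and edge sets: count the insertions and deletions needed to turn one graph into the other, then scale by the larger graph's size so the result lands in [0, 1]. This toy version is deliberately cruder than a true graph edit distance (which also allows substitutions and attribute-aware costs), and all names are illustrative:

```python
def normalized_ged(nodes1: set, edges1: set, nodes2: set, edges2: set) -> float:
    """Crude normalized edit distance over scene-graph node and edge sets.

    Counts node/edge insertions and deletions (symmetric differences),
    scaled by the size of the larger graph so the result is in [0, 1].
    """
    edits = len(nodes1 ^ nodes2) + len(edges1 ^ edges2)
    max_size = max(len(nodes1) + len(edges1), len(nodes2) + len(edges2))
    return edits / max_size if max_size else 0.0

# Ground truth has a title; the prediction drops it and adds a spurious bar.
n1 = {"title", "x_axis", "bar_1", "bar_2"}
e1 = {("bar_1", "x_axis"), ("bar_2", "x_axis")}
n2 = {"x_axis", "bar_1", "bar_2", "bar_3"}
e2 = {("bar_1", "x_axis"), ("bar_2", "x_axis"), ("bar_3", "x_axis")}
print(round(normalized_ged(n1, e1, n2, e2), 3))  # 0.429
```

Dividing by the larger graph's size keeps the score comparable across charts of very different complexity, which is what makes the <0.3 rule of thumb meaningful.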
- ## Key Results
-
- ChartEval demonstrates **significantly stronger correlation with human judgments** than existing metrics:
-
- | Metric | ChartCraft | ChartMimic | ChartX | Text2Chart31 |
- |--------|------------|------------|--------|--------------|
- | **ChartEval (Ours)** | **0.76** | **0.79** | **0.85** | **0.78** |
- | GPT-Score | 0.25 | 0.27 | 0.33 | 0.25 |
- | SSIM | 0.09 | 0.11 | 0.24 | 0.18 |
- | SCRM | 0.13 | 0.15 | 0.29 | 0.19 |
-
- *Pearson correlation coefficients with human quality ratings across 4,000 chart evaluations*
-
- ### Performance Highlights
- - **Strong improvement** over existing metrics
- - **Consistent performance** across chart types (line, bar, pie, scatter)
- - **Robust evaluation** across different LLM generators (GPT-4o, Claude, Qwen2.5-VL)
- - **High inter-annotator agreement** (α = 0.74–0.85) in human evaluation
-
- ### Image Requirements
- - **Format**: PNG, JPG, JPEG
- - **Resolution**: High resolution preferred (>800×600)
- - **Quality**: Clear, readable text and labels
- - **Content**: A single chart per image
-
- ### Future Improvements
- - Fine-tuning VLMs for low-resolution chart images
- - Enhanced 3D chart parsing capabilities
- - Faster inference through model optimization
- - Support for additional chart types (Sankey, treemap, etc.)
-
- ## License
-
- This project is licensed under the MIT License; see the LICENSE file for details.
-
- ## Citation
-
- If you use ChartEval in your research, please cite our paper:
-
- ```bibtex
- @article{goswami2024charteval,
-   title={ChartEval: LLM-Driven Chart Generation Evaluation Using Scene Graph Parsing},
-   author={Goswami, Kanika and Mathur, Puneet and Rossi, Ryan and Dernoncourt, Franck and Gupta, Vivek and Manocha, Dinesh},
-   journal={arXiv preprint arXiv:XXXX.XXXXX},
-   year={2024}
- }
- ```
-
- **Ready to evaluate your chart generation system?** Get started with our online demo or follow the installation guide above!
 
+ title: ChartEval - LLM-Driven Chart Evaluation
+ emoji: 📊
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: "4.0.0"
+ app_file: charteval_demo.py
+ pinned: false