# FutureBench Dataset Processing

This directory contains tools for processing FutureBench datasets: downloading the existing dataset from HuggingFace, or transforming your own database into the standard format.
## Option 1: Download from HuggingFace (Original)

Use this to download the existing FutureBench dataset:

```bash
python download_data.py
```
## Option 2: Transform Your Own Database

Use this to transform your production database into HuggingFace format.
### Setup

1. Install dependencies:

   ```bash
   pip install pandas sqlalchemy huggingface_hub
   ```

2. Set up your HuggingFace token:

   ```bash
   export HF_TOKEN="your_huggingface_token_here"
   ```

3. Configure your settings by editing `config_db.py` to match your needs:

   - Update `HF_CONFIG` with your HuggingFace repository names
   - Adjust `PROCESSING_CONFIG` for data filtering preferences

   Note: The database connection uses the same setup as the main FutureBench app.
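A minimal sketch of what the two settings in `config_db.py` might look like. The key names inside each dictionary are illustrative assumptions, not taken from the actual file; check `config_db.py` itself for the real fields.

```python
# Hypothetical config_db.py fragment -- key names are assumptions for
# illustration only; consult the real file for the actual fields.
HF_CONFIG = {
    "dataset_repo": "your-org/futurebench-data",  # target HF dataset repo
    "private": False,                             # repo visibility
}

PROCESSING_CONFIG = {
    "resolved_events_only": False,   # keep unresolved events too
    "min_predictions_per_event": 1,  # drop events with fewer predictions
}
```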
### Usage

```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py

# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```
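The empty-token trick above suggests a simple gate inside the script: upload only when a non-empty `HF_TOKEN` is present. A sketch of that check (the function name `should_upload` is hypothetical, not from `db_to_hf.py`):

```python
import os

def should_upload() -> bool:
    """Upload only when a non-empty HF_TOKEN is set.

    Hypothetical helper mirroring the CLI pattern above:
    HF_TOKEN="" python db_to_hf.py runs the transform locally
    without pushing anything to HuggingFace.
    """
    return bool(os.environ.get("HF_TOKEN"))
```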
## Database Schema

The script uses the same database schema as the main FutureBench application:

- `EventBase` model for events
- `Prediction` model for predictions
- Uses SQLAlchemy ORM (same as `convert_to_csv.py`)

No additional database configuration is needed; the script uses the existing FutureBench database connection.
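For orientation, here is an illustrative SQLAlchemy sketch of what the two models could look like, with column names inferred from the output format documented below. This is an assumption for readability, not the actual FutureBench model code.

```python
# Illustrative sketch only -- the real EventBase/Prediction models live in
# the main FutureBench app and may have different columns.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class EventBase(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True)
    question = Column(String, nullable=False)
    event_type = Column(String)             # e.g. "polymarket", "soccer"
    result = Column(String, nullable=True)  # actual outcome, once resolved
    open_to_bet_until = Column(DateTime)
    predictions = relationship("Prediction", back_populates="event")

class Prediction(Base):
    __tablename__ = "predictions"
    id = Column(Integer, primary_key=True)
    event_id = Column(Integer, ForeignKey("events.id"))
    algorithm_name = Column(String)         # AI model that made the prediction
    actual_prediction = Column(String)
    created_at = Column(DateTime)
    event = relationship("EventBase", back_populates="predictions")
```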
## Output Format

The script produces data in the same format as the original FutureBench dataset:

```
event_id,question,event_type,algorithm_name,actual_prediction,result,open_to_bet_until,prediction_created_at
```
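Before uploading, it can help to check a DataFrame against this header. A small pandas sketch (the `validate_columns` helper is hypothetical, not part of the scripts here):

```python
import pandas as pd

# Column order matches the CSV header shown above.
EXPECTED_COLUMNS = [
    "event_id", "question", "event_type", "algorithm_name",
    "actual_prediction", "result", "open_to_bet_until",
    "prediction_created_at",
]

def validate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Reorder to the expected schema and fail loudly on missing columns."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[EXPECTED_COLUMNS]
```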
## Automation

You can run this as a scheduled job:

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```
## Files

- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database to HuggingFace format
- `config_db.py` - Configuration for database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies
## Data Structure

The main dataset contains:

- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction made
- `open_to_bet_until`: Prediction window deadline
- `prediction_created_at`: When the prediction was made
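Since `answer_options` is stored as JSON text, consumers of the dataset typically need to decode it. A minimal sketch (the helper name is an assumption, not part of the shipped scripts):

```python
import json
import pandas as pd

def parse_answer_options(df: pd.DataFrame) -> pd.DataFrame:
    """Decode the JSON-encoded answer_options column into Python lists.

    Hypothetical convenience helper; returns a copy so the original
    DataFrame is left untouched.
    """
    df = df.copy()
    df["answer_options"] = df["answer_options"].map(json.loads)
    return df
```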
## Output

The script generates:

- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and a summary
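The "unique events" file above can be derived from the prediction-level data by deduplicating on `event_id`. A sketch of that step, assuming `event_id` is the dedup key (an assumption; the real script may use different logic):

```python
import pandas as pd

def build_evaluation_queue(df: pd.DataFrame) -> pd.DataFrame:
    """One row per unique event: drop repeated event_ids, keeping the
    first occurrence.

    Hypothetical sketch of how evaluation_queue.csv could be derived;
    the actual db_to_hf.py logic may differ.
    """
    return df.drop_duplicates(subset="event_id").reset_index(drop=True)
```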