Enhanced Readability Assessment Random Forest Model
This model predicts the reading grade level of English text using an Enhanced Random Forest algorithm with comprehensive linguistic features and improved generalization capabilities.
Model Performance
- Cross-Validation MAE: 0.418 grade levels
- Training R²: not reported
- Test Set R²: 0.846
- Training Date: 2025-07-02T23:04:35.722721
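As a rough illustration of how metrics of this kind are commonly computed with scikit-learn, the sketch below derives a cross-validation MAE and a held-out R². The synthetic data, the 5-fold setting, and the 80/20 split are assumptions for illustration; they are not the settings used to train this model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 300 texts, 36 features,
# with the grade level driven by the first three features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 36))
y = 6.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# scikit-learn exposes MAE as a negated score, so the sign is flipped back.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -cv_scores.mean()

# R² is measured on data the model never saw during fitting.
model.fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"CV MAE: {cv_mae:.3f}  Test R²: {test_r2:.3f}")
```

An MAE expressed in grade levels reads directly as the typical size of a prediction error, which is why it pairs well with R² in a summary like the one above.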
Enhanced Features
The model computes 36 features in total, of which 25 are retained by feature selection (SelectKBest with f_regression).
Feature Categories:
- Traditional Readability Metrics: Flesch-Kincaid, Coleman-Liau, ARI, etc.
- Age of Acquisition (AoA) Metrics: Word difficulty based on acquisition age
- Syntactic Complexity: Sentence structure and parsing depth
- Lexical Diversity: Vocabulary richness and variation
- Morphological Features: Word formation patterns
- Semantic Features: Word meaning and context complexity
- Corpus Source Indicators: Training data source information
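To make a few of these categories concrete, here is a minimal sketch of two traditional readability metrics (Flesch-Kincaid grade level and ARI) and one lexical diversity measure (type-token ratio). The function names, the regex tokenizer, and the vowel-group syllable heuristic are assumptions made for illustration; the model's actual feature extraction code is not shown in this card and may differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def basic_features(text: str) -> dict:
    """Compute three simple readability features from raw English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    n_sent = len(sentences)
    n_words = len(words)
    n_syll = sum(count_syllables(w) for w in words)
    n_chars = sum(ch.isalpha() for w in words for ch in w)

    return {
        # Flesch-Kincaid grade level: words per sentence and syllables per word.
        "flesch_kincaid_grade": 0.39 * n_words / n_sent + 11.8 * n_syll / n_words - 15.59,
        # Automated Readability Index: characters per word and words per sentence.
        "ari": 4.71 * n_chars / n_words + 0.5 * n_words / n_sent - 21.43,
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
    }


features = basic_features("The cat sat. The dog ran.")
print(features)
```

Very short, simple texts can produce negative grade values with these formulas, which is expected behavior for the formulas rather than an error in the sketch.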
Key Improvements:
- Feature Selection: Automated selection of most predictive features
- Robust Scaling: Better handling of outliers and extreme values
- Enhanced Generalization: Optimized hyperparameters for cross-domain performance
- Comprehensive Evaluation: Multi-dataset validation
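The robust scaling point can be illustrated with a small comparison. The sketch below uses made-up values with one extreme outlier and contrasts scikit-learn's RobustScaler (median and interquartile range) with StandardScaler (mean and standard deviation); the data is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value (illustrative data, not from the training set).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit(x)      # centers on the median, scales by the IQR
standard = StandardScaler().fit(x)  # centers on the mean, scales by the standard deviation

robust_scaled = robust.transform(x).ravel()
standard_scaled = standard.transform(x).ravel()

print("median:", robust.center_[0], "IQR:", robust.scale_[0])
print("robust:  ", robust_scaled)
print("standard:", standard_scaled)
```

Because the median and IQR barely move when an outlier is present, the four ordinary values keep a usable spread after robust scaling, whereas the mean and standard deviation are both pulled toward the outlier.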
Model Architecture
- Algorithm: Random Forest Regressor
- Trees: 200 estimators for stability
- Max Depth: Controlled to prevent overfitting
- Feature Selection: SelectKBest with f_regression
- Scaling: RobustScaler for outlier resistance
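The components listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the model's actual training code: the step order (scaling before selection), the max_depth value of 12, and the synthetic data are all assumptions, since the card does not specify them. Only the 200 estimators, the 25 of 36 selected features, SelectKBest with f_regression, and RobustScaler come from this card.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

pipeline = Pipeline([
    ("scale", RobustScaler()),                               # outlier-resistant scaling
    ("select", SelectKBest(score_func=f_regression, k=25)),  # keep 25 of 36 features
    ("forest", RandomForestRegressor(
        n_estimators=200,  # from the card
        max_depth=12,      # assumed value; the card only says depth is "controlled"
        random_state=0,
    )),
])

# Synthetic stand-in for a 36-column feature matrix and grade-level targets (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 36))
y = 5.0 + 1.5 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=200)

pipeline.fit(X, y)

n_selected = int(pipeline.named_steps["select"].get_support().sum())
predictions = pipeline.predict(X[:3])
print(n_selected, predictions)
```

Keeping the scaler and selector inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking information from validation data into the feature selection step.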
Usage
import joblib
# Load the enhanced model. Unpickling requires the
# EnhancedReadabilityRandomForestModel class definition to be importable
# in the current environment.
model = joblib.load('enhanced_readability_random_forest.pkl')
# The model is a complete EnhancedReadabilityRandomForestModel instance
# with built-in feature computation and prediction methods.
# Example usage (simplified):
# predicted_grade = model.predict_text("Your text here")
Training Data
- Primary: WeeBit corpus (age-graded web content)
- Secondary: CLEAR corpus (simplified text pairs)
- Validation: Multiple independent datasets
- Total Samples: 2500
Performance Comparison
This enhanced model shows improved performance over the baseline:
- Better cross-validation stability
- Enhanced feature representation
- Improved generalization to unseen text types
- More robust predictions across grade levels
Citation
If you use this enhanced model, please cite:
Enhanced Readability Assessment Random Forest Model
With Comprehensive Linguistic Features and Improved Generalization
Trained on WeeBit and CLEAR corpus data
2025-07-02
License
MIT License - See LICENSE file for details.