Enhanced Readability Assessment Random Forest Model

This model predicts the reading grade level of English text using an Enhanced Random Forest algorithm with comprehensive linguistic features and improved generalization capabilities.

Model Performance

  • Cross-Validation MAE: 0.418 grade levels
  • Training R²: not reported
  • Test Set R²: 0.846
  • Training Date: 2025-07-02T23:04:35.722721
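
For readers unfamiliar with the two metrics above, the following is a minimal sketch of how MAE and R² are defined. The grade values are hypothetical and chosen only for illustration; they are not drawn from the model's training or test data, and the helper names are not part of the released model.

```python
from statistics import mean

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted grade levels."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical grade levels, for illustration only.
true_grades = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted_grades = [2.5, 3.5, 6.0, 8.5, 9.0]

mae = mean_absolute_error(true_grades, predicted_grades)
r2 = r2_score(true_grades, predicted_grades)
print(f"MAE: {mae:.3f} grade levels, R²: {r2:.3f}")
```

Read this way, an MAE of about 0.42 means predictions are off by a little under half a grade level on average.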

Enhanced Features

The model computes 36 features in total, of which 25 are selected using SelectKBest with f_regression:

Feature Categories:

  • Traditional Readability Metrics: Flesch-Kincaid, Coleman-Liau, ARI, etc.
  • Age of Acquisition (AoA) Metrics: Word difficulty based on acquisition age
  • Syntactic Complexity: Sentence structure and parsing depth
  • Lexical Diversity: Vocabulary richness and variation
  • Morphological Features: Word formation patterns
  • Semantic Features: Word meaning and context complexity
  • Corpus Source Indicators: Training data source information
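
To make two of these categories concrete, here is a rough sketch of one traditional readability metric (Flesch-Kincaid grade level) and one lexical diversity measure (type-token ratio). The function names and the vowel-group syllable heuristic are assumptions made for illustration; the model's actual feature extraction code is not shown in this card and may differ.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def simple_features(text):
    """Compute a few illustrative readability features for non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        # Standard Flesch-Kincaid grade level formula.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        # Distinct word forms divided by total words.
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "avg_sentence_length": words_per_sentence,
    }

features = simple_features("The cat sat on the mat. The dog ran to the park.")
print(features)
```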

Key Improvements:

  • Feature Selection: Automated selection of most predictive features
  • Robust Scaling: Better handling of outliers and extreme values
  • Enhanced Generalization: Optimized hyperparameters for cross-domain performance
  • Comprehensive Evaluation: Multi-dataset validation

Model Architecture

  • Algorithm: Random Forest Regressor
  • Trees: 200 estimators for stability
  • Max Depth: Controlled to prevent overfitting
  • Feature Selection: SelectKBest with f_regression
  • Scaling: RobustScaler for outlier resistance
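
The components listed above could be wired together roughly as follows with scikit-learn. This is a sketch under assumptions, not the model's actual training code: the step order (scaling before selection), the max_depth value of 10, and the synthetic data are all placeholders. Only the 200 estimators, the 25 of 36 features, SelectKBest with f_regression, and RobustScaler come from this card.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 36 computed features and grade-level targets.
X, y = make_regression(
    n_samples=200, n_features=36, n_informative=10, noise=1.0, random_state=0
)

pipeline = Pipeline([
    # Scale with median and IQR so outliers have less influence.
    ("scale", RobustScaler()),
    # Keep the 25 features with the highest univariate F-scores.
    ("select", SelectKBest(score_func=f_regression, k=25)),
    # max_depth=10 is an assumed value; the card only says depth is controlled.
    ("forest", RandomForestRegressor(n_estimators=200, max_depth=10, random_state=0)),
])
pipeline.fit(X, y)

n_selected = int(pipeline.named_steps["select"].get_support().sum())
predictions = pipeline.predict(X[:5])
print(n_selected, predictions.shape)
```

Wrapping the steps in a single Pipeline keeps scaling and feature selection inside each cross-validation fold, which avoids leaking validation data into the fitted scaler or selector.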

Usage

import joblib

# Load the enhanced model. Unpickling requires the
# EnhancedReadabilityRandomForestModel class definition to be importable
# in the current environment.
model = joblib.load('enhanced_readability_random_forest.pkl')

# The loaded object is a complete EnhancedReadabilityRandomForestModel instance
# with built-in feature computation and prediction methods.

# Example usage (simplified):
# predicted_grade = model.predict_text("Your text here")

Training Data

  • Primary: WeeBit corpus (age-graded web content)
  • Secondary: CLEAR corpus (simplified text pairs)
  • Validation: Multiple independent datasets
  • Total Samples: 2500

Performance Comparison

Compared with the baseline model, this enhanced model is reported to offer:

  • Better cross-validation stability
  • Enhanced feature representation
  • Improved generalization to unseen text types
  • More robust predictions across grade levels

Citation

If you use this enhanced model, please cite:

Enhanced Readability Assessment Random Forest Model
With Comprehensive Linguistic Features and Improved Generalization
Trained on WeeBit and CLEAR corpus data
2025-07-02

License

MIT License - See LICENSE file for details.
