# Predicting Viral News: A Data Science Pipeline

**Author:** Matan Kriel
## Project Overview
Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the Online News Popularity dataset.
The goal was to transform raw data into actionable insights by:
- Engineering Features: Creating a custom "Article Vibe" feature using Clustering.
- Regression Analysis: Attempting to predict the exact share count.
- Classification Analysis: Successfully predicting if an article will be a "Hit" (>1400 shares) or a "Flop".
## The Dataset
- Source: UCI Machine Learning Repository (Online News Popularity).
- Size: ~39,000 articles.
- Features: 61 columns (Content, Sentiment, Time, Keywords).
- Target: `shares` (number of social media shares).
## Phase 1: Data Handling & EDA
We began by cleaning the dataset (stripping whitespace from column names, removing duplicates) and performing a time-based split. Since the goal is to predict future performance, we sorted the data by date (the `timedelta` column) so that no "data leakage" flows from the future into the training set.
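The split described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the DataFrame here is synthetic stand-in data, and we assume the dataset's convention that `timedelta` counts days since publication, so larger values mean older articles.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in; the real df is loaded from the UCI CSV.
df = pd.DataFrame({
    "timedelta": rng.integers(8, 731, size=1000),
    "shares": rng.integers(1, 50000, size=1000),
})

# Sort oldest -> newest (large timedelta = old article), then hold out
# the most recent 20% as the test set so training never sees the future.
df = df.sort_values("timedelta", ascending=False).reset_index(drop=True)
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```

A random shuffle-split here would leak future articles into training; the chronological cut is what makes the later test-set AUC an honest "future performance" estimate.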
### Correlation Analysis
We analyzed the relationship between content features (images, links, sentiment) and the target variable.
*Above: Correlation heatmap showing feature relationships.*
**Insight:** As seen in the heatmap, the linear correlation between individual features (such as `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is non-linear and complex, justifying the need for advanced tree-based models over simple linear regression.
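A minimal sketch of that correlation check, again on synthetic stand-in data (the real analysis ran over all 61 dataset columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in columns named after real dataset features.
df = pd.DataFrame({
    "n_tokens_content": rng.integers(0, 5000, size=500),
    "num_imgs": rng.integers(0, 50, size=500),
    "shares": rng.integers(1, 10000, size=500),
})

# Pearson correlation of each feature with the target.
corr_with_target = df.corr(numeric_only=True)["shares"].drop("shares")
print(corr_with_target.abs().sort_values(ascending=False))
```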
## Phase 2: Regression Model Strategy
To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.
### The Three Models Compared
- Linear Regression (Baseline): A simple linear model to establish the minimum performance benchmark.
- Random Forest Regressor: Selected for its ability to handle non-linear relationships and interactions between features (e.g., Sentiment vs. Subjectivity).
- Gradient Boosting Regressor: Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.
Results of this comparison are detailed in Phase 4.
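The three-way comparison can be sketched like this. The data is synthetic for illustration; the real pipeline fit these same scikit-learn estimators on the news features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] * 2 + rng.normal(scale=0.5, size=400)  # synthetic target
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "rmse": mean_squared_error(y_test, pred) ** 0.5,
        "r2": r2_score(y_test, pred),
    }
```

Each model scores on the same held-out slice, so the RMSE/R² numbers reported in Phase 4 are directly comparable.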
## Phase 3: Feature Engineering (Clustering)
To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used K-Means clustering to group articles based on two dimensions:
- Sentiment: (Positive vs. Negative)
- Subjectivity: (Opinion vs. Fact)
### Choosing the Optimal k
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.
*Above: Elbow Method (left) and Silhouette Score (right).*
**Decisions & Logic:**
- The Elbow: We observed a distinct "bend" in the WCSS curve at k=4.
- Silhouette Score: While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., Positive-Opinion, Neutral-Fact, etc.).
- Action: We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
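The k-selection loop can be sketched as below. The sentiment/subjectivity points are synthetic stand-ins; WCSS is K-Means' `inertia_`, and the silhouette score comes from scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Stand-in for (sentiment, subjectivity) pairs: four loose blobs.
centers = [(-0.5, 0.2), (0.5, 0.2), (-0.5, 0.8), (0.5, 0.8)]
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(100, 2)) for c in centers])

wcss, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                      # elbow curve input
    sil[k] = silhouette_score(X, km.labels_)   # silhouette curve input

# Assign the chosen k=4 labels as the new categorical feature.
cluster_vibe = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```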
## Phase 4: Regression Results
With our features engineered, we evaluated the three models defined in Phase 2 using RMSE (root mean squared error) and the R² score.
**Result:**
- All models struggled to predict the exact share count (low R² scores across the board).
- Gradient Boosting performed best, minimizing the error more than the Linear Baseline.
- Pivot Decision: We concluded that predicting the exact share count is inherently noisy due to massive viral outliers. We decided to pivot to Classification to solve a more actionable business problem: "Will this be popular or not?"
## Phase 5: Classification Analysis (The Solution)
**Goal:** Classify articles as Viral (1) or Not Viral (0). **Threshold:** Median split (>1400 shares).
We repeated the comparison process, pitting Logistic Regression (Baseline) against Random Forest and Gradient Boosting.
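A sketch of the classification setup, using the median split described above on synthetic stand-in data (the real comparison also included Random Forest):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
# Synthetic, heavy-tailed share counts driven partly by the features.
shares = np.exp(X[:, 0] + rng.normal(scale=1.0, size=600)) * 1400

# Median split: "Hit" (1) vs "Flop" (0).
y = (shares > np.median(shares)).astype(int)
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

aucs = {}
for name, clf in {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}.items():
    clf.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```

The median split gives a balanced 50/50 target, which is why AUC is a fair headline metric here.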
### Model Showdown
*Above: ROC curves comparing the three models.*
**Result:**
- Gradient Boosting was the clear winner with an AUC of ~0.75.
- It significantly outperformed the Baseline (AUC 0.64), proving the model successfully learned complex non-linear patterns.
### What Drives Virality? (Interpretation)
We analyzed which features the model found most important.
**Key Insights:**
- #1 Predictor (`kw_avg_avg`): The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
- Content vs. Context: Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
- Cluster Vibe: While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
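The importance readout can be sketched as follows. The feature names and the fitted classifier here are synthetic stand-ins (with the target deliberately driven by the first column), not the trained model from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["kw_avg_avg", "is_weekend", "num_imgs", "cluster_vibe"]
X = rng.normal(size=(300, len(feature_names)))
# Synthetic target driven almost entirely by the first column.
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```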
## Phase 6: Final Evaluation
We ran the winning Gradient Boosting Classifier on the Test Set (the "Future" data held out from the start).
- Final AUC: ~0.75
- Conclusion: The model is robust and generalizes well to unseen data. It is ready for deployment.
## Bonus: The Viral-O-Meter
To demonstrate the model's utility, we built an interactive Gradio Dashboard embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
## Files in this Repo
- `notebook.ipynb`: The complete Python code for the pipeline.
- `gradient_boosting_viral_predictor.joblib`: The saved final model.
- `README.md`: Project documentation.
**Video Link:** https://youtu.be/Al665qltkDg






