# Predicting Viral News: A Data Science Pipeline

**Author:** Matan Kriel
## Project Overview
Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the Online News Popularity dataset.
The goal was to transform raw data into actionable insights by:
- Engineering Features: Creating a custom "Article Vibe" feature using Clustering.
- Regression Analysis: Attempting to predict the exact share count.
- Classification Analysis: Successfully predicting if an article will be a "Hit" (>1400 shares) or a "Flop".
## The Dataset
- Source: UCI Machine Learning Repository (Online News Popularity).
- Size: ~39,000 articles.
- Features: 61 columns (Content, Sentiment, Time, Keywords).
- Target: `shares` (number of social media shares).
## Phase 1: Data Handling & EDA
We began by cleaning the dataset (stripping whitespace from column names, removing duplicates) and performing a time-based split. Since the goal is to predict future performance, we sorted the data by date (the `timedelta` column) so that no "data leakage" flows from the future into the training set.
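The split described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the DataFrame here is synthetic stand-in data, and we assume the dataset's convention that `timedelta` counts days since publication, so larger values mean older articles.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in; the real df is loaded from the UCI CSV.
df = pd.DataFrame({
    "timedelta": rng.integers(8, 731, size=1000),
    "shares": rng.integers(1, 50000, size=1000),
})

# Sort oldest -> newest (large timedelta = old article), then hold out
# the most recent 20% as the test set so training never sees the future.
df = df.sort_values("timedelta", ascending=False).reset_index(drop=True)
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```

A random shuffle-split here would leak future articles into training; the chronological cut is what makes the later test-set AUC an honest "future performance" estimate.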
### Correlation Analysis
We analyzed the relationship between content features (images, links, sentiment) and the target variable.
*Above: Correlation heatmap showing feature relationships.*
**Insight:** As seen in the heatmap, the linear correlation between individual features (such as `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is non-linear and complex, justifying the need for advanced tree-based models over simple linear regression.
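A minimal sketch of that correlation check, again on synthetic stand-in data (the real analysis ran over all 61 dataset columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in columns named after real dataset features.
df = pd.DataFrame({
    "n_tokens_content": rng.integers(0, 5000, size=500),
    "num_imgs": rng.integers(0, 50, size=500),
    "shares": rng.integers(1, 10000, size=500),
})

# Pearson correlation of each feature with the target.
corr_with_target = df.corr(numeric_only=True)["shares"].drop("shares")
print(corr_with_target.abs().sort_values(ascending=False))
```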
## Phase 2: Regression Model Strategy
To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.
### The Three Models Compared
- Linear Regression (Baseline): A simple linear model to establish the minimum performance benchmark.
- Random Forest Regressor: Selected for its ability to handle non-linear relationships and interactions between features (e.g., Sentiment vs. Subjectivity).
- Gradient Boosting Regressor: Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.
Results of this comparison are detailed in Phase 4.
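The three-way comparison can be sketched like this. The data is synthetic for illustration; the real pipeline fit these same scikit-learn estimators on the news features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] * 2 + rng.normal(scale=0.5, size=400)  # synthetic target
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "rmse": mean_squared_error(y_test, pred) ** 0.5,
        "r2": r2_score(y_test, pred),
    }
```

Each model scores on the same held-out slice, so the RMSE/R² numbers reported in Phase 4 are directly comparable.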
## Phase 3: Feature Engineering (Clustering)
To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used K-Means clustering to group articles based on two dimensions:
- Sentiment: (Positive vs. Negative)
- Subjectivity: (Opinion vs. Fact)
### Choosing the Optimal k
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.
*Above: Elbow Method (left) and Silhouette Score (right).*
**Decisions & Logic:**
- The Elbow: We observed a distinct "bend" in the WCSS curve at k=4.
- Silhouette Score: While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., Positive-Opinion, Neutral-Fact, etc.).
- Action: We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
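The k-selection loop can be sketched as below. The sentiment/subjectivity points are synthetic stand-ins; WCSS is K-Means' `inertia_`, and the silhouette score comes from scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Stand-in for (sentiment, subjectivity) pairs: four loose blobs.
centers = [(-0.5, 0.2), (0.5, 0.2), (-0.5, 0.8), (0.5, 0.8)]
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(100, 2)) for c in centers])

wcss, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                      # elbow curve input
    sil[k] = silhouette_score(X, km.labels_)   # silhouette curve input

# Assign the chosen k=4 labels as the new categorical feature.
cluster_vibe = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```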
## Phase 4: Regression Results
With our features engineered, we evaluated the three models defined in Phase 2 using RMSE (root mean squared error) and the R² score.
**Result:**
- All models struggled to predict the exact share count (low R² scores across the board).
- Gradient Boosting performed best, minimizing the error more than the Linear Baseline.
- Pivot Decision: We concluded that predicting the exact share count is inherently noisy due to massive viral outliers. We decided to pivot to Classification to solve a more actionable business problem: "Will this be popular or not?"
## Phase 5: Classification Analysis (The Solution)
**Goal:** Classify articles as Viral (1) or Not Viral (0). **Threshold:** Median split (>1400 shares).
We repeated the comparison process, pitting Logistic Regression (Baseline) against Random Forest and Gradient Boosting.
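A sketch of the classification setup, using the median split described above on synthetic stand-in data (the real comparison also included Random Forest):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
# Synthetic, heavy-tailed share counts driven partly by the features.
shares = np.exp(X[:, 0] + rng.normal(scale=1.0, size=600)) * 1400

# Median split: "Hit" (1) vs "Flop" (0).
y = (shares > np.median(shares)).astype(int)
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

aucs = {}
for name, clf in {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}.items():
    clf.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```

The median split gives a balanced 50/50 target, which is why AUC is a fair headline metric here.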
### Model Showdown
*Above: ROC curves comparing the three models.*
**Result:**
- Gradient Boosting was the clear winner with an AUC of ~0.75.
- It significantly outperformed the Baseline (AUC 0.64), proving the model successfully learned complex non-linear patterns.
### What Drives Virality? (Interpretation)
We analyzed which features the model found most important.
**Key Insights:**
- #1 Predictor (`kw_avg_avg`): The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
- Content vs. Context: Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
- Cluster Vibe: While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
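The importance readout can be sketched as follows. The feature names and the fitted classifier here are synthetic stand-ins (with the target deliberately driven by the first column), not the trained model from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["kw_avg_avg", "is_weekend", "num_imgs", "cluster_vibe"]
X = rng.normal(size=(300, len(feature_names)))
# Synthetic target driven almost entirely by the first column.
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```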
## Phase 6: Final Evaluation
We ran the winning Gradient Boosting Classifier on the Test Set (the "Future" data held out from the start).
- Final AUC: ~0.75
- Conclusion: The model is robust and generalizes well to unseen data. It is ready for deployment.
## Bonus: The Viral-O-Meter
To demonstrate the model's utility, we built an interactive Gradio Dashboard embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
## Files in this Repo
- `notebook.ipynb`: The complete Python code for the pipeline.
- `gradient_boosting_viral_predictor.joblib`: The saved final model.
- `README.md`: Project documentation.
**Video Link:** https://youtu.be/Al665qltkDg






