07 Social Network Mining

Twitter Analytics & Engagement Prediction

A complete social network mining and engagement prediction pipeline built on a cleaned Twitter/X timeline dataset. It combines sentiment analysis, temporal behavioral analysis, and four predictive models (XGBoost, OLS linear regression, Prophet time-series forecasting, and a PyTorch deep learning model). The best model, OLS, reaches R²=0.852 on engagement prediction.

Python · Pandas · XGBoost · PyTorch · Prophet · StatsModels · Matplotlib · Seaborn · Plotly · Scikit-learn
R² Score (OLS): 0.852
R² Score (XGBoost): 0.777
ML Models: 4
type: Social Network Mining
status: Completed
year: 2025
role: Co-Developer
01

System Architecture · 3D View

02

Architecture Diagram

tweets_clean.csv: Raw Twitter Dataset
Data Cleaning: Pandas · NumPy
Feature Eng.: sin/cos hour · agg
Engagement Score: likes + retweets + replies
XGBoost: R²=0.7766
OLS Regression: R²=0.852
Prophet: Time-Series Forecast
PyTorch DL: Deep Learning Model
Visualizations: Matplotlib · Plotly
Insights: Posting Strategy
03

Screenshots & Output

terminal
$ python twitter_analytics.py
✓ Dataset loaded: tweets_clean.csv
→ Feature eng: sin_hour, cos_hour, likes_mean, retweet_mean
→ OLS Regression: R2=0.852 (best predictor: likes_mean)
→ XGBoost: R2=0.7766 MSE=472.48
→ Prophet: Peak engagement 17:00-19:00 ✓
→ Visualizations saved: 10 charts
Top posting hour: 17-19 · Highest engagement: ~22 avg
Analysis Output
Model results and key metrics
Model Scores
OLS R²: 85%
XGBoost R²: 78%
Feature Eng.: 92%
Viz Quality: 88%
Prophet MAPE: 74%
Model Scores
R² and accuracy per model
Data Output
{
# Dataset: tweets_clean.csv
features: [text_len, sin_hour, cos_hour, likes_mean],
best_model: OLS R2=0.852,
top_predictors: [likes_mean, retweet_mean, text_len],
peak_hour: 17-19
}
Feature Config
Dataset schema and engineered features
Project Structure
📁 twitter-analytics/
├─ twitter_analytics.py Main pipeline
├─ tweets_clean.csv Dataset
├─ models.py XGBoost · OLS · Prophet
├─ visualizations/ 10 charts
└─ requirements.txt
Co-dev: Darshan Joshi + Harshita Guduru · LTU 2025
Project Structure
Notebook and script layout
04

What I Built

Built an end-to-end Twitter analytics pipeline: data cleaning → feature engineering → predictive modeling → visualization on a real tweet dataset.

Engineered features including cyclic sin_hour and cos_hour encodings of the posting hour, per-user likes_mean and retweet_mean aggregates, and day-of-week patterns.
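The cyclic hour encoding can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code; the function names `encode_hour` and `cyclic_distance` are hypothetical. The idea is to place each hour on the unit circle so that 23:00 and 00:00 end up close together, which a raw 0–23 integer does not capture.

```python
import math
from typing import Tuple


def encode_hour(hour: int) -> Tuple[float, float]:
    """Map an hour of day (0-23) to (sin_hour, cos_hour) on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


def cyclic_distance(hour_a: int, hour_b: int) -> float:
    """Euclidean distance between two hours in the encoded (sin, cos) space."""
    sin_a, cos_a = encode_hour(hour_a)
    sin_b, cos_b = encode_hour(hour_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)
```

With this encoding, `cyclic_distance(23, 0)` equals `cyclic_distance(0, 1)`, whereas the raw integers 23 and 0 would look 23 units apart to a linear model.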

Achieved R²=0.852 using StatsModels OLS linear regression; tweet length, likes_mean, and retweet_mean were the strongest predictors.

Trained XGBoost Regressor (MSE=472.48, R²=0.7766) and a PyTorch deep learning model for engagement score prediction.
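For readers unfamiliar with the two metrics quoted here, the sketch below shows how MSE and R² are defined. It is a plain-Python illustration of the standard formulas, not the project's evaluation code, and the sample numbers are made up and unrelated to the reported results. scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up engagement scores and predictions, for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(mse(actual, predicted), r2_score(actual, predicted))
```

An R² of 1.0 means the predictions match exactly, while 0.0 means the model does no better than always predicting the mean engagement score.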

Applied Prophet time-series forecasting to model hourly engagement fluctuations and identify optimal posting windows.

Generated 10+ visualizations: tweet frequency by hour, engagement heatmap, correlation matrix, top users by activity, likes vs retweets scatter.

Co-developed with Harshita Guduru as a Big Data Analytics coursework submission at Lawrence Technological University.

05

Project Insights

Personal Notes & Learnings

Project Context

Coursework project for Big Data Analytics at Lawrence Technological University (co-developed with Harshita Guduru). Goal: build a complete ML pipeline to predict tweet engagement from user behavioral data.

Dataset Features

  • UserID, TweetID, timestamp, Likes, RetweetCount
  • text_len, num_hashtags, num_mentions, num_urls
  • day_name, hour, engagement_score (likes + retweets + replies)
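As a rough sketch of how the engagement score and the per-user aggregates could be derived from these columns, the example below uses plain Python on made-up rows. The `Replies` field name is an assumption (the list above names only Likes and RetweetCount), and the helper names are hypothetical.

```python
from collections import defaultdict

# Made-up rows for illustration; "Replies" is an assumed column name.
tweets = [
    {"UserID": "u1", "Likes": 10, "RetweetCount": 4, "Replies": 1},
    {"UserID": "u1", "Likes": 20, "RetweetCount": 6, "Replies": 0},
    {"UserID": "u2", "Likes": 3, "RetweetCount": 1, "Replies": 2},
]


def add_engagement_score(rows):
    """Add engagement_score = likes + retweets + replies to each row in place."""
    for row in rows:
        row["engagement_score"] = row["Likes"] + row["RetweetCount"] + row["Replies"]
    return rows


def per_user_mean(rows, column):
    """Return {UserID: mean of `column`} across that user's tweets."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["UserID"]] += row[column]
        counts[row["UserID"]] += 1
    return {user: totals[user] / counts[user] for user in totals}
```

In a pandas pipeline the same per-user aggregate is typically written as `df.groupby("UserID")["Likes"].transform("mean")`, which broadcasts each user's mean back onto that user's rows.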

Model Results

  • StatsModels OLS: R²=0.852 — best performer. Top predictors: likes_mean, retweet_mean, text_len
  • XGBoost: R²=0.7766, MSE=472.48 — robust non-linear model
  • Prophet: Hourly engagement time-series forecasting — identifies 7AM–10AM and 7PM–9PM as peak engagement windows
  • PyTorch: Deep learning engagement predictor for non-linear pattern capture

Key Findings

  • Engagement peaks at hour 17–19 (5–7PM) with highest average score (~22)
  • text_len and num_urls show 0.85 correlation — longer, richer tweets correlate with more engagement
  • Most high-engagement tweets come from a small set of power users (top 10 users by tweet count: 600–850 tweets each)
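The peak-hour finding comes down to averaging the engagement score per posting hour and taking the maximum. A minimal sketch of that aggregation, with hypothetical function names and made-up sample values rather than the project's data:

```python
from collections import defaultdict


def average_by_hour(records):
    """records: iterable of (hour, engagement_score). Returns {hour: mean score}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hour, score in records:
        totals[hour] += score
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


def peak_hour(records):
    """Hour of day with the highest average engagement score."""
    averages = average_by_hour(records)
    return max(averages, key=averages.get)


# Made-up (hour, engagement_score) pairs, for illustration only.
sample = [(9, 10), (9, 14), (17, 20), (17, 24), (18, 21), (22, 5)]
print(average_by_hour(sample), peak_hour(sample))
```

Averaging rather than summing keeps hours with many low-engagement tweets from outranking hours with fewer but stronger tweets.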

Visualizations Produced

Tweet frequency by hour, average engagement by hour, top 10 users by tweet count, correlation heatmap, total engagement by hour, likes vs retweets scatter, engagement score by tweet length
