Google Advanced Data Analytics Capstone Project

Author: Daniel Steven Rodriguez Sandoval
TikTok receives millions of user reports daily. Human moderators cannot review every flagged video immediately, creating backlogs that allow misinformation to remain live longer than necessary. This project builds a machine learning pipeline that classifies TikTok video content as claims (verifiable assertions) or opinions (personal viewpoints), enabling the moderation team to automatically prioritize high-risk videos for expedited review.
The core challenge is a binary classification task:
Given a set of video metadata and transcription features, can we reliably predict whether a video contains a claim or an opinion?
This matters because claim-based videos are more likely to spread misinformation and require human review before wide distribution.
Priority metric: Recall — a missed claim (false negative) is more costly than a false positive, since it means potential misinformation stays on the platform unreviewed.
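To make the metric choice concrete, here is a minimal sketch of how recall and precision are computed for the claim class. The labels are made up for illustration and the helper name `recall_and_precision` is hypothetical, not part of the pipeline script.

```python
def recall_and_precision(y_true, y_pred, positive="claim"):
    """Return (recall, precision) for the positive class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # claims caught
    fn = sum(t == positive and p != positive for t, p in pairs)  # claims missed
    fp = sum(t != positive and p == positive for t, p in pairs)  # opinions flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy labels (illustrative only): 4 claims and 4 opinions.
y_true = ["claim", "claim", "claim", "claim",
          "opinion", "opinion", "opinion", "opinion"]
y_pred = ["claim", "claim", "claim", "opinion",
          "claim", "opinion", "opinion", "opinion"]

recall, precision = recall_and_precision(y_true, y_pred)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

The one missed claim lowers recall and the one wrongly flagged opinion lowers precision. Optimizing for recall means accepting some of the latter to reduce the former.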
Note on Data Privacy & Reproducibility: To ensure this pipeline can be run publicly while respecting data privacy, the script utilizes a dynamically generated synthetic dataset. This data was engineered using `numpy` to precisely mirror the statistical distributions, engagement patterns, and feature relationships found in the original Google Advanced Data Analytics Capstone dataset.
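As a rough illustration of what such a generator could look like, the sketch below draws a small synthetic frame with `numpy`. The distribution families, parameters, and the function name `make_synthetic_videos` are assumptions chosen for the example; the actual script may generate its data differently.

```python
import numpy as np
import pandas as pd


def make_synthetic_videos(n_rows=1000, seed=42):
    """Build a toy frame with a near-balanced target and skewed engagement counts."""
    rng = np.random.default_rng(seed)
    is_claim = rng.random(n_rows) < 0.5  # roughly 50/50 classes

    # Assumed: log-normal views, with the claim median about 10x the opinion
    # median (a difference of ln(10) ~ 2.3 in log space).
    log_mean = np.where(is_claim, 12.0, 9.7)
    views = rng.lognormal(mean=log_mean, sigma=1.0).astype(int)

    # Assumed: likes are a random fraction of views, shares a fraction of likes.
    likes = (views * rng.uniform(0.05, 0.60, n_rows)).astype(int)
    shares = (likes * rng.uniform(0.05, 0.40, n_rows)).astype(int)

    return pd.DataFrame({
        "claim_status": np.where(is_claim, "claim", "opinion"),
        "video_view_count": views,
        "video_like_count": likes,
        "video_share_count": shares,
        "video_duration_sec": rng.integers(5, 61, n_rows),
    })


df = make_synthetic_videos()
print(df.groupby("claim_status")["video_view_count"].median())
```

Fixing the seed keeps the generated data, and therefore every downstream metric, reproducible across runs.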
- Records: 19,382 TikTok videos (Synthetic)
- Target variable: `claim_status` (claim / opinion), near-balanced (~50/50)
- Features: 12 columns including video engagement metrics, author account status, verification status, and video transcription text
| Feature | Description |
|---|---|
| `video_view_count` | Total views |
| `video_like_count` | Total likes |
| `video_share_count` | Total shares |
| `video_download_count` | Total downloads |
| `video_comment_count` | Total comments |
| `video_duration_sec` | Duration in seconds |
| `verified_status` | Whether the author is verified |
| `author_ban_status` | Author account status (active / banned / under review) |
| `video_transcription_text` | Auto-generated caption text |
- Defined the business problem and ethical considerations
- Identified recall as the key optimization metric
- Selected tree-based models (Random Forest, XGBoost) — robust to outliers, no feature scaling required
- Inspected data shape, null values, and class distribution
- Explored engagement patterns: claim videos receive ~10× more views, likes, and shares than opinion videos
- Analyzed author ban status: claim videos are disproportionately associated with banned/under-review accounts
- Computed derived engagement rate features (`likes_per_view`, `shares_per_view`, `comments_per_view`)
- Extracted `text_length` from video transcriptions
- Applied `pd.get_dummies` encoding for categorical variables
- Split data into Train / Validation / Test sets (60% / 20% / 20%)
- Applied `CountVectorizer` (2–3 grams, top 15 features) to transcription text, fit on the training set only to prevent data leakage
- Final feature matrix: 27 features
- Trained Random Forest and XGBoost with best hyperparameters identified via 5-fold cross-validated GridSearchCV (scoring = recall)
- Evaluated on held-out validation and test sets
- Selected champion model based on recall
- Applied a probability threshold (≥ 0.85) to flag videos as `HIGH_PRIORITY_REVIEW` for operational triage
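A minimal sketch of the recall-scored grid search described above, shown for Random Forest only. The data comes from scikit-learn's `make_classification`, and the parameter grid is hypothetical; the grid actually searched in the script is not listed in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (not the project dataset): 400 rows, binary target.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical grid, kept small so the example runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # pick hyperparameters by recall on the positive class
    cv=5,              # 5-fold cross-validation
    refit=True,        # refit the best configuration on the full training set
)
search.fit(X_train, y_train)

# Score the refit model on the held-out validation split.
val_recall = recall_score(y_val, search.best_estimator_.predict(X_val))
print(search.best_params_, round(val_recall, 3))
```

The same pattern would apply to an XGBoost classifier by swapping the estimator and grid. The champion is then whichever model has the higher recall on held-out data.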
Chi-squared tests of independence (α = 0.05) were conducted to identify categorical variables significantly associated with claim status:
| Test | χ² | p-value | Result |
|---|---|---|---|
| `claim_status` vs `author_ban_status` | 5736.68 | < 0.001 | Reject H₀: significant association |
| `claim_status` vs `verified_status` | 336.38 | < 0.001 | Reject H₀: significant association |
Both variables carry strong predictive signal and are included in the model feature set.
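For reference, a chi-squared test of independence of this kind can be run with `scipy.stats.chi2_contingency` on a contingency table. The counts below are invented for illustration and do not reproduce the statistics reported above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

ALPHA = 0.05

# Invented counts: 60 claim rows and 60 opinion rows across three ban statuses.
df = pd.DataFrame({
    "claim_status": ["claim"] * 60 + ["opinion"] * 60,
    "author_ban_status": (
        ["active"] * 30 + ["banned"] * 20 + ["under review"] * 10    # claims
        + ["active"] * 50 + ["banned"] * 5 + ["under review"] * 5    # opinions
    ),
})

# Rows: claim_status, columns: author_ban_status.
table = pd.crosstab(df["claim_status"], df["author_ban_status"])

# H0: claim_status and author_ban_status are independent.
chi2, p_value, dof, expected = chi2_contingency(table)
reject_h0 = p_value < ALPHA

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}, reject H0: {reject_h0}")
```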
| Model | Recall | Precision | F1 | Accuracy |
|---|---|---|---|---|
| Random Forest | 99.95% | 100.00% | 99.97% | 99.97% |
| XGBoost | 99.84% | 99.95% | 99.90% | 99.90% |
Champion model: Random Forest
Using a probability threshold of 0.85:
- ~50% of videos flagged as `HIGH_PRIORITY_REVIEW`
- 99.69% of all actual claims captured in the priority queue
- Allows moderation teams to clear the highest-risk content first, significantly reducing average time misinformation stays live
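The thresholding step can be sketched as below. The probabilities and ground-truth labels are made up, and the `STANDARD_QUEUE` label and the function name `flag_high_priority` are hypothetical; only the 0.85 threshold and the `HIGH_PRIORITY_REVIEW` label come from this README.

```python
import numpy as np

THRESHOLD = 0.85


def flag_high_priority(claim_proba, threshold=THRESHOLD):
    """Label each video by comparing its predicted claim probability to the threshold."""
    claim_proba = np.asarray(claim_proba)
    return np.where(claim_proba >= threshold, "HIGH_PRIORITY_REVIEW", "STANDARD_QUEUE")


# Made-up model outputs and ground truth for six videos.
claim_proba = np.array([0.99, 0.91, 0.85, 0.60, 0.10, 0.02])
is_claim = np.array([True, True, True, True, False, False])

flags = flag_high_priority(claim_proba)
flagged = flags == "HIGH_PRIORITY_REVIEW"

share_flagged = flagged.mean()                                 # fraction of videos in the priority queue
claims_captured = (flagged & is_claim).sum() / is_claim.sum()  # recall at this threshold

print(f"flagged: {share_flagged:.0%}, claims captured: {claims_captured:.0%}")
```

Raising the threshold shrinks the priority queue at the cost of capturing fewer claims. Lowering it does the opposite.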
- `video_view_count`
- `video_like_count`
- `video_share_count`
- `video_download_count`
- `author_ban_status_banned`
Engagement metrics dominate — claim videos go viral at a fundamentally different rate than opinions.
TikTok-Content-Moderation/
├── TikTok_Capstone.py # Main pipeline script
├── requirements.txt # Python dependencies
├── README.md # This file
└── outputs/ # Generated plots (created on run)
    ├── plot_01_distribution.png
    ├── plot_02_engagement.png
    ├── plot_03_ban_status.png
    ├── plot_04_confusion_matrices.png
    ├── plot_05_feature_importance.png
    └── plot_06_claim_probability.png
Requirements: Python 3.9+, conda recommended (XGBoost requires OpenMP on macOS)
# Install dependencies
pip install -r requirements.txt
# Run the full pipeline
python TikTok_Capstone.py

All plots are saved to the `outputs/` folder. Expected runtime: < 30 seconds.
Google Advanced Data Analytics Professional Certificate — Coursera