Machine Learning Foundations — IE University Prof. Matteo Turilli
A complete multimodal machine learning pipeline for pre-release film revenue prediction using structured metadata, synopsis embeddings, poster embeddings, and talent-history features.
The project investigates a central machine learning engineering question:
What matters more for pre-release prediction performance: richer feature representations or more expressive model families?
The pipeline combines:
- TMDB financial and metadata records
- IMDb structured talent information
- MPNet synopsis embeddings
- CLIP poster embeddings
- Temporal leakage-safe talent-history features
and evaluates:
- Ridge Regression
- LightGBM
- XGBoost
- CatBoost
- Multiple ensemble strategies
| Model | Test RMSE ↓ | Test R² ↑ |
|---|---|---|
| Ridge baseline | 2.167 | 0.358 |
| LightGBM v1 | 1.775 | 0.652 |
| LightGBM v2 | 1.749 | 0.662 |
| XGBoost | 1.718 | 0.675 |
| CatBoost | 1.738 | 0.668 |
| Weighted Average Ensemble | 1.711 | 0.678 |
| Stacking Ensemble | 1.712 | 0.677 |
Best overall model: Weighted Average Ensemble — Test RMSE: 1.711 | Test R²: 0.678
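For readers unfamiliar with the two columns in the table, the following is a minimal sketch of how RMSE and R² are conventionally computed. The function names and the toy arrays are illustrative assumptions, not code or data from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: share of target variance explained."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

# Toy values for illustration only (not project data).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

Both metrics are computed on the same held-out predictions, so a model that lowers RMSE on a fixed test set also raises R².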
- Feature engineering contributed the largest gains. Most performance improvement came from multimodal feature representations rather than from switching algorithms.
- Gradient boosting dramatically outperformed linear baselines. Ridge regression explained ~36% of revenue variance; multimodal boosted models explained ~68%.
- Poster embeddings helped tree-based models more than linear models. Visual features provided modest but measurable improvements for LightGBM.
- Ensembling provided limited gains. Residual correlations between models remained very high (0.97–0.98), limiting ensemble diversity.
- A substantial irreducible noise floor remains. Word-of-mouth, critical reception, release competition, and cultural timing are fundamentally unknowable before release.
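The residual-correlation finding can be checked with a few lines of NumPy. This is a sketch on synthetic data: the function name, the model names, and the noise levels are assumptions chosen to mimic models that share most of their error, not values taken from the notebook.

```python
import numpy as np

def residual_correlation(y_true, preds):
    """Pearson correlation matrix of residuals, one row/column per model."""
    names = list(preds)
    y_true = np.asarray(y_true, dtype=float)
    residuals = np.vstack([y_true - np.asarray(preds[n], dtype=float) for n in names])
    return names, np.corrcoef(residuals)

# Synthetic example: both models share one large error component
# and differ only by a small model-specific component.
rng = np.random.default_rng(42)
y_true = rng.normal(size=500)
shared_error = rng.normal(scale=1.0, size=500)
preds = {
    "model_a": y_true - shared_error - rng.normal(scale=0.1, size=500),
    "model_b": y_true - shared_error - rng.normal(scale=0.1, size=500),
}

names, corr = residual_correlation(y_true, preds)
print(names)
print(corr.round(3))
```

For two unbiased models with equal error variance σ² and residual correlation ρ, an equal-weight average has error variance σ²(1 + ρ)/2. At ρ ≈ 0.975 that is a reduction of only about 1.25%, which is consistent with the small gap between the best single model and the ensembles in the table.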
.
├── notebook.ipynb # Main project notebook (full end-to-end ML pipeline)
├── outputs/ # Saved plots, embeddings, and figures
├── data/
│ └── raw/ # TMDB + IMDb datasets
├── README.md
└── requirements.txt # Python dependencies
Used for: budget, revenue, runtime, genres, release date, overview text, poster paths, and production metadata.
TMDB_movie_dataset_v11.csv
Used for: directors, writers, top-billed cast, and talent-history features.
title.crew.tsv
title.principals.tsv
name.basics.tsv
Downloaded dynamically from the TMDB CDN:
https://image.tmdb.org/t/p/w342/{poster_path}
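A small helper can turn a `poster_path` value into a download URL following the template above. This is a sketch under two assumptions: that stored paths may carry a leading slash (which would otherwise produce a double slash), and that missing posters appear as `None` or NaN. The function name and the example filename are hypothetical.

```python
POSTER_BASE_URL = "https://image.tmdb.org/t/p/w342"

def poster_url(poster_path):
    """Return the CDN URL for a poster path, or None if the path is missing."""
    # Non-string values cover both None and the float NaN pandas uses for gaps.
    if not isinstance(poster_path, str) or not poster_path.strip():
        return None
    # Strip any leading slash so the joined URL has exactly one separator.
    return f"{POSTER_BASE_URL}/{poster_path.strip().lstrip('/')}"

print(poster_url("/abc123.jpg"))  # hypothetical filename
```

Returning `None` for missing paths lets the caller skip the download rather than request a malformed URL.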
git clone <repo-url>
cd <repo-folder>
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
Windows:
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
If requirements.txt is unavailable, install manually:
pip install pandas numpy scikit-learn matplotlib seaborn lightgbm xgboost catboost shap sentence-transformers transformers torch optuna
Create the following directory structure at the project root and place the downloaded datasets inside it:
data/
└── raw/
    ├── TMDB_movie_dataset_v11.csv
    ├── title.crew.tsv
    ├── title.principals.tsv
    └── name.basics.tsv
- TMDB dataset: download TMDB_movie_dataset_v11.csv from Kaggle
- IMDb datasets: download the .tsv files from IMDb Non-Commercial Datasets
jupyter notebook
Open notebook.ipynb and run cells sequentially from top to bottom.
Important notes before running:
- Embedding generation (synopsis + poster) can take significant time on first run; embeddings are cached to disk automatically and loaded on subsequent runs.
- Poster downloads require an active internet connection.
- Later pipeline stages (XGBoost, CatBoost, ensembles, SHAP) require all earlier cells to have run successfully.
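The caching behaviour described in the notes can be sketched as a load-or-compute wrapper. The function name, the `.npy` storage format, and the example path are assumptions for illustration; the notebook's own cache layout may differ.

```python
from pathlib import Path

import numpy as np

def load_or_compute(cache_path, compute_fn):
    """Load an embedding matrix from disk, computing and saving it on first use."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return np.load(cache_path)  # later runs: skip the expensive encoder
    embeddings = np.asarray(compute_fn())
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_path, embeddings)  # path should end in .npy
    return embeddings

# Hypothetical usage, where encode_synopses is whatever function runs the encoder:
# synopsis_emb = load_or_compute("outputs/synopsis_embeddings.npy", encode_synopses)
```

Because the cached array is reloaded byte-for-byte, downstream models see identical features on every re-run, independent of any nondeterminism in the encoder.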
The notebook is designed for local execution on consumer hardware.
Recommended specs:
- 16–24 GB RAM
- Python 3.12+
- Apple Silicon or CUDA-capable GPU (optional but speeds up embedding generation)
The pipeline includes memory diagnostics, embedding caching, garbage collection, and checkpoint saving to reduce crashes during long-running embedding stages.
Loading TMDB and IMDb records, poster retrieval, and reproducible project paths.
Missing-value analysis, duplicate detection, financial outlier inspection, and language/country distributions.
TMDB ↔ IMDb joins, director and cast mapping, and ID consistency validation.
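A join of this kind can be sketched with `pandas.DataFrame.merge`. The example assumes the TMDB table exposes an `imdb_id` column and that the IMDb crew table is keyed by `tconst`; the rows are invented toy records, and the function name is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the TMDB table and IMDb title.crew.tsv (invented rows).
tmdb = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_id": ["tt0000001", "tt0000002", "tt0000005", None],
})
crew = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000009"],
    "directors": ["nm0000001", "nm0000002,nm0000003", "nm0000004"],
})

def join_tmdb_imdb(tmdb, crew):
    """Left-join TMDB films to IMDb crew records and flag unmatched IDs."""
    with_id = tmdb.dropna(subset=["imdb_id"])  # films without an IMDb ID cannot be joined
    return with_id.merge(
        crew,
        left_on="imdb_id",
        right_on="tconst",
        how="left",
        validate="many_to_one",  # raises if tconst is duplicated on the IMDb side
        indicator=True,          # adds a _merge column for ID consistency checks
    )

merged = join_tmdb_imdb(tmdb, crew)
match_rate = (merged["_merge"] == "both").mean()
print(merged[["title", "imdb_id", "directors", "_merge"]])
print(f"match rate: {match_rate:.2f}")
```

The `validate` argument guards against silent row duplication, and the `_merge` indicator gives a direct count of TMDB films whose IMDb ID has no crew record.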
Removing unreleased films and invalid financial records, deduplicating titles, enforcing structural completeness, and defining supervised targets.
Explicit exclusion of vote_count, vote_average, and popularity, which reflect audience activity that largely accumulates after release. Temporal split ordered chronologically by release date: 70% train / 15% validation / 15% test.
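A chronological 70/15/15 split can be sketched as follows. The function name, the `release_date` column name, and the toy frame are assumptions; the only property taken from the description above is that rows are ordered by release date before being cut.

```python
import pandas as pd

def chronological_split(df, date_col="release_date", train_frac=0.70, val_frac=0.15):
    """Split rows into train/val/test by time order, oldest films first."""
    ordered = df.sort_values(date_col, kind="mergesort").reset_index(drop=True)
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]

# Toy frame with 20 shuffled films (invented dates).
films = pd.DataFrame({
    "film_id": range(20),
    "release_date": pd.date_range("2000-01-01", periods=20, freq="180D"),
}).sample(frac=1.0, random_state=42)

train, val, test = chronological_split(films)
print(len(train), len(val), len(test))
```

Unlike a random split, every validation film is released after every training film, and every test film after every validation film, so the model is always evaluated on its future.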
Structured features: inflation-adjusted budget, runtime, release month/year, language indicators, country indicators, sequel flags.
Genre features: one-hot encoded genres.
Synopsis embeddings: all-MiniLM-L6-v2 in the baseline pipeline, upgraded to all-mpnet-base-v2 in the improved pipeline.
Poster embeddings: CLIP visual embeddings with PCA dimensionality reduction.
Talent-history features: leakage-safe historical statistics (director median revenue, cast median revenue) computed strictly from past films only.
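One way to compute a "past films only" statistic is to sort each person's films by date, shift by one, and take an expanding median. The sketch below illustrates that idea on invented rows; the function and column names are assumptions, and the notebook may implement the feature differently.

```python
import pandas as pd

def prior_median_revenue(df, key="director_id", date_col="release_date", value_col="revenue"):
    """Median revenue of each person's strictly earlier films (NaN for a debut)."""
    ordered = df.sort_values([key, date_col], kind="mergesort")
    prior = ordered.groupby(key)[value_col].transform(
        # shift(1) drops the current film, so only earlier rows feed the median
        lambda s: s.shift(1).expanding().median()
    )
    return prior.reindex(df.index)  # restore the caller's row order

# Toy records (invented), deliberately out of date order.
films = pd.DataFrame({
    "director_id": ["d1", "d2", "d1", "d1"],
    "release_date": pd.to_datetime(["2005-01-01", "2002-01-01", "2001-01-01", "2003-01-01"]),
    "revenue": [20.0, 5.0, 10.0, 30.0],
})
films["director_prior_median"] = prior_median_revenue(films)
print(films)
```

First films get NaN because no history exists, which is the cold-start case for debut talent. Two caveats apply to this sketch: films by the same person released on the same date would still see each other, and it does not restrict history to the training period.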
Ridge Regression, LightGBM, XGBoost, CatBoost, Stacking Ensemble, and Weighted Average Ensemble.
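A weighted average ensemble blends per-model predictions with weights chosen on validation data. The source does not state how the weights were selected, so the inverse-MSE rule below is an assumption picked for simplicity; the function names, model names, and numbers are illustrative only.

```python
import numpy as np

def inverse_mse_weights(y_val, val_preds):
    """Weight each model by 1 / validation MSE, normalised to sum to 1."""
    y_val = np.asarray(y_val, dtype=float)
    names = list(val_preds)
    mse = np.array([np.mean((y_val - np.asarray(val_preds[n], dtype=float)) ** 2) for n in names])
    raw = 1.0 / mse
    return dict(zip(names, raw / raw.sum()))

def weighted_average(preds, weights):
    """Blend per-model predictions using the given weights."""
    return sum(weights[n] * np.asarray(preds[n], dtype=float) for n in weights)

# Toy validation targets and predictions (invented numbers).
y_val = np.array([1.0, 2.0, 3.0, 4.0])
val_preds = {"model_a": y_val + 1.0, "model_b": y_val - 2.0}

weights = inverse_mse_weights(y_val, val_preds)
blended = weighted_average(val_preds, weights)
print(weights)
print(blended)
```

Fitting the weights on validation predictions and then applying them unchanged to test predictions keeps the test set out of the tuning loop.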
SHAP importance analysis, beeswarm plots, residual analysis, and ablation studies.
- Project Overview
- Data Acquisition
- Pre-Cleaning EDA
- Data Merging
- Data Cleaning
- Post-Cleaning EDA
- Temporal Train / Val / Test Split
- Leakage Analysis
- Feature Engineering
- Multimodal Fusion
- Baseline Modeling
- Evaluation Metrics
- LightGBM Modeling
- Ablation Study
- SHAP Interpretability
- Error Analysis
- Final Test Evaluation
- Pitfalls Checklist
- Improved Pipeline (inflation-adjusted budget, poster PCA, MPNet embeddings, Optuna tuning, dedicated classifier)
- Improved Final Evaluation
- Baseline vs. Improved Comparison
- Final Pitfalls Checklist
- Ensemble Learning Motivation
- Additional Dependencies
- XGBoost
- CatBoost
- Residual Diversity Analysis
- Stacking Ensemble
- Weighted Average Ensemble
- Ensemble Final Evaluation
- Full Model Comparison
- Poster Embedding Ablation
- Final Conclusions
The project emphasizes strict reproducibility throughout:
- Fixed SEED = 42 across all models
- Train-only fitting of scalers and PCA
- Chronological train/val/test split with no data leakage
- Cached embeddings for deterministic re-runs
- No test-set tuning at any stage
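Train-only fitting of scalers and PCA can be sketched with a scikit-learn pipeline. The random matrices stand in for embedding features, and the dimensions (64 inputs, 16 components) are arbitrary assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42
rng = np.random.default_rng(SEED)

# Random stand-ins for embedding matrices (not project data).
X_train = rng.normal(size=(200, 64))
X_val = rng.normal(size=(40, 64))
X_test = rng.normal(size=(40, 64))

reducer = make_pipeline(StandardScaler(), PCA(n_components=16, random_state=SEED))

Z_train = reducer.fit_transform(X_train)  # statistics learned from train only
Z_val = reducer.transform(X_val)          # reuse train statistics, no refit
Z_test = reducer.transform(X_test)

print(Z_train.shape, Z_val.shape, Z_test.shape)
```

Calling `fit_transform` only on the training rows means the scaler's means and the PCA directions never see validation or test films, which is the leakage the list above rules out.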
- Strong temporal distribution shift after 2017
- Limited franchise and IP metadata
- Cold-start problem for debut talent
- Imperfect pre-release information boundary
- Significant irreducible uncertainty inherent to film revenue forecasting
- Larger multimodal vision-language encoders
- Franchise and IP knowledge graphs
- Bayesian talent priors
- Streaming-era adjustment features
- Transformer-based tabular fusion
- Uncertainty calibration
- Hierarchical temporal models
Developed as a full-course machine learning engineering project at IE University focused on multimodal learning, leakage prevention, feature engineering, reproducible ML workflows, model interpretability, ensemble learning, and scientific evaluation.