A live IPL forecasting and market execution engine that combines leakage-safe match prediction, real-time lineup/toss ingestion, automated decision logic, and cloud-deployed monitoring/retraining.
IPL Engine ingests historical and live match data, waits for toss and confirmed lineup information, generates pre-match win probabilities using a stacked ensemble trained on ball-by-ball data, compares them against market prices, applies Kelly criterion risk sizing, executes decisions automatically, tracks outcomes and P&L, and retrains the model after each match day.
- Forecasting: Stacked ensemble of logistic regressions averaging 74.3% walk-forward accuracy across 2023–2025 seasons, trained on 284 engineered features with strict chronological ordering
- Live data ingestion: Polls ESPNCricinfo for toss results and confirmed playing XIs, with on-the-fly player data fetching for previously unseen players
- Market comparison: Discovers relevant prediction markets, extracts live prices, computes model edge over market-implied probabilities
- Risk sizing: Half-Kelly criterion with configurable fraction caps, minimum confidence (60%) and minimum edge (3%) filters — coin-flip predictions are automatically skipped
- Automated execution: Places and tracks positions, prevents duplicate bets across restarts, reconciles local state against exchange positions on startup
- Retraining: Rebuilds the dataset and retrains the model after each match day, incorporating new results while excluding unplayed fixtures
- Monitoring: Real-time dashboard with live match tracking, upcoming schedule, match history, and P&L statistics
All data comes from ESPNCricinfo via the cricdata Python client: match metadata, full ball-by-ball deliveries, T20 career innings for every player, player bios, weather, captaincy records, and innings-level fielding stats.
284 features across 15 categories, computed in strict chronological order — every feature for a given match uses only data available before that match.
| Category | Examples | Source |
|---|---|---|
| ELO ratings | Team ELO, ELO diff, expected win prob | Match results with season decay |
| Team form | Win rate (last 5/10/season), win/loss streak | Prior match outcomes |
| Current season form | Season win rate, chase win rate this season | Current season only |
| Phase stats | Powerplay/middle/death run rate, dot %, boundary % | Ball-by-ball data |
| Player aggregates | XI batting SR, bowling economy, consistency, experience | Player career innings |
| Player vs venue/opposition | Batting avg at ground, bowling economy vs opponent | Player career innings |
| Squad composition | Pace/spin bowler count, left-hand bat count, role balance | Player bios |
| Venue | Avg 1st innings total, chase win rate, boundary size | Historical matches |
| Toss | Toss winner, decision, interaction with venue chase rate | Match metadata |
| Context | Match number in season, days rest, home/away, playoff flag | Match metadata |
| Points table | Pre-match points, position, NRR | Reconstructed from ball-by-ball |
| Win margin | Average margin of victory in current season | Match results |
| Squad churn | Players changed from previous match XI | Playing XI tracking |
| Weather / time | Temperature, humidity, dew point, wind, cloud cover, day/night | Open-Meteo via cricdata |
| Diff features | team1_X - team2_X for all paired features | Derived |
Of these 284, only 23 are used by the final model. The rest are noise at this sample size.
The IPL has ~70 matches per season, yielding only ~200–400 usable training samples. Gradient boosting methods (CatBoost, LightGBM, XGBoost) overfit badly, topping out around 55–60%. Logistic regression with L2 regularization (C=0.1) generalizes much better.
The feature space is split into thematic clusters, each trained as a separate base model. A meta-learner combines their out-of-fold predictions:
| Base Model | Features |
|---|---|
| Baseline (9) | diff_win_streak, t2_spin_count, t2_pace_count, diff_season_matches, h2h_t1_win_rate, diff_left_bowl_count, t2_low_bat_sr, diff_middle_bowl_extras_per_match, diff_table_nrr |
| Composition (5) | t2_specialist_bowler, t2_bowling_ar, t2_specialist_bat, t2_allrounder, t2_wk_bat |
| Form (5) | t2_win_streak, diff_season_matches, diff_matches_played, t1_win_rate_last10, t2_loss_streak |
| Phase (4) | t1_middle_bat_dot_pct, t2_powerplay_bowl_rr, diff_death_bowl_bound_pct, diff_death_bowl_extras_per_match |
The meta-learner is itself an LR (C=0.1) trained on out-of-fold probabilities generated via leave-one-season-out CV within the training data.
Training configuration: 2018 onward, excluding 2020–2021 (COVID neutral-venue seasons). Sample weighting uses exponential decay with a 2.5-year half-life, plus 2x multiplier for impact-player-era matches (2023+).
Walk-forward holdout evaluation — each season is predicted using a model trained exclusively on prior seasons:
| Holdout | Accuracy | Correct / Total | Brier Score | ROC AUC |
|---|---|---|---|---|
| 2023 | 74.0% | 54 / 73 | 0.233 | 0.717 |
| 2024 | 73.2% | 52 / 71 | 0.191 | 0.791 |
| 2025 | 75.7% | 53 / 70 | 0.180 | 0.812 |
Three-season average: 74.3% accuracy. Early 2026 live-inference (announced-XI, walk-forward retrain before each match) currently runs at 67.9% (19 / 28) — rising as the season accumulates more training data.
The engine runs as a continuous loop deployed to a cloud server:
- Market discovery: On each match day, discovers available prediction markets and extracts team mappings, tickers, and expiration dates
- Time-aware polling: Sleeps until 2 hours before match start, then polls ESPNCricinfo every 60 seconds for toss result and confirmed playing XIs — fresh client instances on each poll to avoid stale cached data
- Prediction: Once toss and XIs are confirmed, builds features from all historical data and runs inference through the stacked ensemble
- Signal generation: Compares model probability against market ask price. Filters: minimum 60% model confidence, minimum 3% edge over market. Matches below threshold are logged and skipped
- Position sizing: Half-Kelly criterion capped at 25% of bankroll per position
- Execution: Places the order, records position with entry price and contract count
- Settlement: Monitors positions for settlement, computes realized P&L, removes settled matches from the active list
- Retraining: After each match day, fetches updated results, rebuilds
dataset.csv(filtering out unplayed fixtures), and retrains the production model - State reconciliation: On every startup, syncs local position state against the exchange to prevent duplicates and clear stale entries
Multiple safeguards prevent the same match from being bet twice across restarts or redeployments:
_already_actedcheck against local history and open positions before processing any match- Pre-execution query to the exchange API to confirm no existing position on the specific ticker
- Startup sync that reconciles local state with live exchange positions
┌─────────────────────────────────────────────────────────────┐
│ IPL Engine │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ │
│ │ Market │ │ Cricinfo │ │ Model │ │ Executor │ │
│ │ Discovery │→ │ Scraper │→ │ Predict │→ │ (Kalshi) │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────────┘ │
│ ↑ │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ └─────────│ State │← │ Settler │←──────┘ │
│ └──────────┘ └──────────┘ │
│ ↓ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Retrain │ │Dashboard │ │ Logs │ │
│ │ Pipeline │ │ (FastAPI)│ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
| Component | Role |
|---|---|
engine/run.py |
Main loop: daily scheduling, match processing, toss polling |
engine/scraper.py |
ESPNCricinfo integration: fixtures, toss detection, playing XIs |
engine/market.py |
Market discovery: finds relevant tickers and extracts prices |
engine/signal.py |
Signal generation: model inference, edge calculation, Kelly sizing |
engine/executor.py |
Exchange API: authentication, order placement, balance/position queries |
engine/state.py |
JSON persistence: positions, history, upcoming matches, event log |
engine/server.py |
FastAPI dashboard: live matches, upcoming, history, stats APIs |
predictor/ |
Offline pipeline: data fetching, feature engineering, training, inference |
The system runs on a cloud droplet via Docker Compose:
- App container: Python process running the engine loop and FastAPI dashboard
- Reverse proxy: Caddy with automatic TLS and HTTP basic authentication
- Persistent volume: Match data, player data, and engine state survive container rebuilds
The dashboard is access-restricted rather than public. It runs alongside credentialed exchange services, so access is kept behind authentication until deployment is separated more cleanly.
├── engine/
│ ├── run.py # Main engine loop and match processing
│ ├── scraper.py # ESPNCricinfo live data ingestion
│ ├── market.py # Market discovery and price fetching
│ ├── signal.py # Prediction → signal → sizing pipeline
│ ├── executor.py # Exchange API integration
│ ├── state.py # State persistence and position management
│ ├── server.py # FastAPI dashboard backend
│ ├── config.py # Strategy constants and API configuration
│ └── static/ # Dashboard frontend (HTML/CSS/JS)
├── predictor/
│ ├── normalize.py # Team name normalization, result parsing
│ ├── playing_xi.py # Extract playing XIs from ball-by-ball data
│ ├── features.py # 284-feature builder with strict time ordering
│ ├── build_dataset.py # Dataset construction pipeline
│ ├── train.py # Training, holdout evaluation, model export
│ └── predict.py # Match prediction inference
├── scripts/ # Data fetching scripts (ESPNCricinfo)
├── models/
│ ├── model.pkl # Stacked ensemble (4 base LRs + meta-learner)
│ └── bundle.json # Feature sets and hyperparameters
├── data/
│ ├── master_matches.csv # IPL matches 2015–2026
│ ├── dataset.csv # Engineered features (706 matches × 284 columns)
│ ├── matches/ # Ball-by-ball CSVs
│ ├── player_innings/ # T20 career data per player
│ └── ... # Bios, weather, captains, fielding
├── Dockerfile
├── docker-compose.yml
├── Caddyfile
└── requirements.txt
python3.10+ -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt- Cleaner service separation (engine process vs dashboard vs retraining as independent services)
- Public-safe dashboard deployment decoupled from exchange credentials
- Improved risk controls: per-day loss limits, drawdown circuit breakers
- Broader market and sport support
- Alerting and monitoring (Slack/Discord notifications on bets, errors, model drift)
- In-play probability updates using live ball-by-ball data