
Structural Break Detection

Disclaimer: For academic and educational use only—not financial or trading advice. This repository demonstrates a well-documented phenomenon: backtest overfitting. Models that achieved strong in-sample performance (Dataset A) showed degraded out-of-sample results (Dataset B), consistent with findings that backtested metrics offer limited predictive value for live performance. Do not use these models for real trading decisions.

Note: These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

A comprehensive collection of 25 structural break detection methods for univariate time series, developed for the ADIA Lab Structural Break Challenge hosted by CrunchDAO.


Documentation: For detailed documentation with interactive navigation, visit the GitHub Pages site.

Key Finding: Single-Dataset Benchmarks Are Misleading

We evaluated all 25 detectors on two independent datasets. The results reveal a critical insight:

Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.

This demonstrates that single-dataset performance is unreliable. Our evaluation emphasizes cross-dataset generalization using stability metrics.

Cross-Dataset Results Summary

Top 10 Models by Robust Score

The Stability Score measures cross-dataset consistency:

Stability Score = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)

The Robust Score combines worst-case performance with stability:

Robust Score = min(AUC_A, AUC_B) × Stability Score

| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|------|----------|-----------|-----------|---------|-----------|--------------|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |

The Overfitting Problem

These models showed strong Dataset A performance but failed to generalize:

| Model | Dataset A Rank | Dataset B Rank | AUC Drop | Stability |
|-------|----------------|----------------|----------|-----------|
| gradient_boost_comprehensive | #1 (0.7930) | #6 (0.6533) | -17.6% | 82.4% |
| meta_stacking_7models | #2 (0.7662) | #10 (0.6422) | -16.2% | 83.8% |
| knn_spectral_fft | #15 (0.5793) | #23 (0.4808) | -17.0% | 83.0% |
| hypothesis_testing_pure | #20 (0.5394) | #25 (0.4118) | -23.7% | 76.3% |

Lesson: High single-dataset AUC does not guarantee real-world performance. Always validate on multiple datasets.

Stability Score Methodology

The Stability Score measures how consistently a model performs across datasets:

Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
  • 100%: Identical performance on both datasets
  • < 85%: Significant overfitting concern
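
A minimal Python sketch of both metrics (the function names are illustrative, not from the repository):

def stability_score(auc_a: float, auc_b: float) -> float:
    """1.0 means identical AUC on both datasets; lower values indicate drift."""
    return 1 - abs(auc_a - auc_b) / max(auc_a, auc_b)

def robust_score(auc_a: float, auc_b: float) -> float:
    """Worst-case AUC, discounted by cross-dataset instability."""
    return min(auc_a, auc_b) * stability_score(auc_a, auc_b)

# Example with xgb_tuned_regularization's AUCs from the tables:
stability_score(0.7423, 0.7705)   # ≈ 0.963
robust_score(0.7423, 0.7705)      # ≈ 0.715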

Most Stable Models

| Model | Stability | Interpretation |
|-------|-----------|----------------|
| xgb_selective_spectral | 99.7% | Near-identical performance |
| weighted_dynamic_ensemble | 98.4% | Excellent generalization |
| quad_model_ensemble | 98.0% | Excellent generalization |
| xgb_core_7features | 98.0% | Excellent generalization |
| xgb_70_statistical | 97.1% | Strong generalization |
| xgb_tuned_regularization | 96.3% | Strong generalization |

Least Stable Models (Overfitters)

| Model | Stability | Warning |
|-------|-----------|---------|
| welch_ttest | 71.9% | Severe instability |
| hypothesis_testing_pure | 76.3% | Severe instability |
| gradient_boost_comprehensive | 82.4% | Overfit to Dataset A |
| knn_spectral_fft | 83.0% | Overfit to Dataset A |
| meta_stacking_7models | 83.8% | Overfit to Dataset A |

Top Performer in Local Benchmarks: xgb_tuned_regularization

xgb_tuned_regularization is the top performer when considering cross-dataset performance in local validation:

| Metric | xgb_tuned_regularization | gradient_boost (former #1) | meta_stacking (former #2) |
|--------|--------------------------|----------------------------|---------------------------|
| Dataset A AUC | 0.7423 | 0.7930 | 0.7662 |
| Dataset B AUC | 0.7705 | 0.6533 | 0.6422 |
| Min AUC | 0.7423 | 0.6533 | 0.6422 |
| Stability | 96.3% | 82.4% | 83.8% |
| Robust Score | 0.715 | 0.538 | 0.538 |
| Train Time | 60-185s | 179-451s | 332-32,030s |

Why xgb_tuned_regularization Is the Top Performer

  1. Best Robust Score (0.715) — Highest combined performance and stability
  2. Actually improved on Dataset B — 0.7423 → 0.7705 (+3.8%)
  3. High Stability (96.3%) — Consistent across different data
  4. Fast Training — 60-185s vs hours for complex ensembles
  5. Strong Regularization — L1/L2 penalties prevent overfitting, as in the configuration below:

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    max_depth=5,           # Shallow trees limit model complexity
    learning_rate=0.05,
    reg_alpha=0.5,         # Strong L1 regularization
    reg_lambda=2.0,        # Strong L2 regularization
    min_child_weight=10,   # Larger minimum leaf weight
)

Key Insights

What Works

  1. Regularization is Critical: Models with aggressive regularization generalize better. xgb_tuned_regularization uses strong L1/L2 penalties.

  2. Simpler Models Generalize Better: xgb_core_7features (7 features) has 98.0% stability vs meta_stacking_7models (339 features) at 83.8%.

  3. Feature Engineering Matters: Statistical features (KS statistic, Cohen's d, t-test) consistently outperform raw time series inputs.

  4. Ensemble Diversity: Combining different model types (trees + neural nets) works, but keep it simple.
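
The sketch below illustrates point 4 with a simple probability average of a tree model and a neural net. It is a hedged illustration, not necessarily how mlp_xgb_simple_blend is implemented:

import numpy as np
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

def blended_break_probability(X_train, y_train, X_test, w: float = 0.5) -> np.ndarray:
    """Average break probabilities from two diverse model families."""
    xgb = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.05)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    xgb.fit(X_train, y_train)
    mlp.fit(X_train, y_train)
    # Weighted average of the positive-class (break) probabilities
    return w * xgb.predict_proba(X_test)[:, 1] + (1 - w) * mlp.predict_proba(X_test)[:, 1]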

What Doesn't Work

  1. Complex Ensembles Overfit: meta_stacking_7models (7 models, 32,000s training) dropped from #2 to #10.

  2. Deep Learning Fails on Univariate Data: Transformer (0.49-0.54 AUC) and LSTM (0.50-0.52 AUC) remain near-random on BOTH datasets.

  3. RL Approaches Underperform: Q-learning and DQN (0.46-0.55 AUC) don't compete with supervised learning.

  4. Pure Statistical Tests Are Unstable: hypothesis_testing_pure dropped from 0.54 to 0.41 AUC.

Top Performers by Category

| Category | Model | Notes |
|----------|-------|-------|
| Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable |
| Fastest Training | xgb_core_7features | 7 features, 40s training, 98% stable |
| Highest Stability | xgb_selective_spectral | 99.7% stability |
| No ML Required | segment_statistics_only | Statistical only, 95.4% stable |

Full Results

All 25 Detectors (Sorted by Robust Score)

| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|------|----------|-----------|-----------|---------|-----------|--------------|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |
| 11 | segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 |
| 12 | meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 |
| 13 | gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 |
| 14 | bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 |
| 15 | wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 |
| 16 | qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 |
| 17 | dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 |
| 18 | kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 |
| 19 | qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 |
| 20 | hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 |
| 21 | qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 |
| 22 | knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 |
| 23 | knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 |
| 24 | welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 |
| 25 | hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 |

Class Imbalance & Cost Considerations

The Rare Event Problem

Structural breaks are inherently rare events. This creates fundamental model bias:

  • Models can achieve ~70% accuracy by predicting "no break" for everything
  • Several models (hierarchical_transformer, wavelet_lstm, welch_ttest) exhibit this behavior with 0% recall

Cost Asymmetry

Not all errors are equal in structural break detection:

| Error Type | Description | Cost |
|------------|-------------|------|
| False Negative (FN) | Missing a real break | Moderate — missed opportunity to act on a regime change |
| False Positive (FP) | Predicting a break when none exists | Severe — triggers unnecessary position changes, transaction fees, slippage |

This asymmetry means we should prioritize precision alongside recall, making F1 score a critical metric.
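
One way to act on this asymmetry beyond plain F1 is an F-beta score with beta < 1, which weights precision more heavily. A small scikit-learn sketch with illustrative toy labels (not repository code):

import numpy as np
from sklearn.metrics import f1_score, fbeta_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # breaks are the rare class
y_pred = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 0])  # two false positives, one false negative

f1_score(y_true, y_pred)               # ≈ 0.571, weights precision and recall equally
fbeta_score(y_true, y_pred, beta=0.5)  # ≈ 0.526, penalizes the false positives more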

Why Deep Learning & RL Models Underperformed

The hierarchical_transformer (0.49-0.54 AUC), wavelet_lstm (0.50-0.52 AUC), and RL models (0.46-0.55 AUC) all underperformed tree-based ensembles on BOTH datasets.

The Core Problem: Univariate Features Are Insufficient

These architectures are designed to learn relationships between multiple input variables. With only a univariate time series and its derived features, they lack the rich input space needed to learn meaningful patterns.

What Would Help: Adding exogenous variables (correlated assets, macroeconomic indicators, sentiment data, volume) would provide the multi-dimensional context these architectures need.

Key Insight: Tree-based ensembles excel at learning from handcrafted statistical features that explicitly encode distributional differences. Deep learning needs raw, multi-dimensional input to learn such representations.

Feature Engineering Methodology

Segment-Based Features

Each time series is split at a boundary into pre-segment and post-segment. Features capture differences between these segments:

Pre-segment   |  Post-segment
--------------+---------------
 values[0:T]  |  values[T:end]
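
A minimal sketch of this pre/post-segment comparison (an illustrative helper, not the repository's features.py):

import numpy as np
from scipy import stats

def segment_features(values: np.ndarray, boundary: int) -> dict:
    """Compare the distributions before and after a candidate break point."""
    pre, post = values[:boundary], values[boundary:]
    pooled_std = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)
    return {
        "mean_diff": post.mean() - pre.mean(),
        "std_ratio": post.std(ddof=1) / (pre.std(ddof=1) + 1e-12),
        "ks_statistic": stats.ks_2samp(pre, post).statistic,
        "t_statistic": stats.ttest_ind(pre, post, equal_var=False).statistic,  # Welch's t
        "cohens_d": (post.mean() - pre.mean()) / (pooled_std + 1e-12),
        "mann_whitney_u": stats.mannwhitneyu(pre, post).statistic,
    }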

Feature Categories

| Category | Description | Example Features |
|----------|-------------|------------------|
| Moments | Statistical moments per segment | mean_diff, std_ratio, skew_diff, kurt_diff |
| Effect Sizes | Standardized differences | Cohen's d, Glass's delta, Hedges' g |
| Distribution Tests | Hypothesis test statistics | t_statistic, ks_statistic, mann_whitney_u |
| Quantiles | Percentile comparisons | median_diff, iqr_ratio, q25_diff, q75_diff |
| Spectral | Frequency domain | dominant_freq_diff, spectral_centroid_diff |
| Wavelet | Multi-scale decomposition | dwt_energy_ratio, wavelet_entropy_diff |
| Temporal | Time-dependent patterns | acf_diff, trend_diff, volatility_ratio |

Most Discriminative Features (by importance)

  1. ks_statistic - Kolmogorov-Smirnov test statistic
  2. mean_diff_normalized - Normalized mean difference
  3. std_ratio - Standard deviation ratio
  4. cohens_d - Cohen's effect size
  5. mann_whitney_u - Mann-Whitney U statistic

Repository Structure

structural_break_detection/
├── README.md                          # This file
├── requirements.txt                   # Dependencies
├── run_all_experiments.py             # Full experiment runner
├── quick_benchmark.py                 # Fast benchmarking
│
├── results_dataset_a.csv              # Dataset A results
├── results_dataset_a.md               # Dataset A results (markdown)
├── results_dataset_b.csv              # Dataset B results
├── results_dataset_b.md               # Dataset B results (markdown)
│
├── xgb_tuned_regularization/          # Top performer in local benchmarks
├── weighted_dynamic_ensemble/         # #2 by robust score
├── quad_model_ensemble/               # #3 by robust score
├── mlp_ensemble_deep_features/        # #4 by robust score
│
├── gradient_boost_comprehensive/      # Former #1, overfits
├── meta_stacking_7models/             # Former #2, overfits
│
├── wavelet_lstm/                      # Deep learning (underperforms)
├── hierarchical_transformer/          # Deep learning (underperforms)
│
├── qlearning_rolling_stats/           # RL approach
├── qlearning_bayesian_cpd/            # RL + Bayesian
├── dqn_base_model_selector/           # DQN approach
│
└── ... (25 detectors total)

Each detector folder contains:

  • features.py - Feature extraction class
  • model.py - Detector model class
  • main.py - Training and inference scripts

Usage

Installation

pip install -r requirements.txt

Training the Top Performer

cd xgb_tuned_regularization
python main.py --mode train --data-dir /path/to/data --model-path ./model.joblib

Running Inference

python main.py --mode infer --data-dir /path/to/data --model-path ./model.joblib

Benchmarking All Detectors

# Quick test (top 5 detectors)
python quick_benchmark.py --data-dir /path/to/data

# Full benchmark (all 25 detectors)
python run_all_experiments.py --data-dir /path/to/data --output results.csv

Evaluation Metrics

| Metric | Description |
|--------|-------------|
| ROC AUC | Area under the ROC curve (discrimination ability) |
| Stability Score | Cross-dataset consistency (higher = better generalization) |
| Robust Score | Min AUC × Stability (overall reliability) |
| F1 Score | Harmonic mean of precision and recall |
| Recall | True positive rate (sensitivity) |

Lessons Learned

  1. Single-dataset benchmarks lie: The #1 model on Dataset A dropped to #6 on Dataset B.

  2. Regularization prevents overfitting: xgb_tuned_regularization uses strong L1/L2 penalties and generalizes well.

  3. Complexity ≠ Robustness: meta_stacking_7models (7 models, 32,000s) is less stable than xgb_tuned_regularization (1 model, 185s).

  4. Stability Score matters: Always evaluate on multiple datasets and measure consistency.

  5. Deep learning needs multivariate input: Transformer and LSTM fail on univariate time series.

  6. Simple features work: Statistical tests (KS, t-test) as features outperform complex architectures.

Dependencies

  • Python 3.8+
  • scikit-learn
  • xgboost
  • lightgbm
  • PyTorch (for neural network models)
  • PyWavelets
  • scipy
  • pandas
  • numpy

References

Financial Machine Learning

  • Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

Statistical Methods

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.
  • Welch, B. L. (1947). "The Generalization of 'Student's' Problem when Several Different Population Variances are Involved." Biometrika, 34(1/2), 28–35. Link
  • Mann, H. B., & Whitney, D. R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other." The Annals of Mathematical Statistics, 18(1), 50–60. Link

Change Point Detection

  • Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100–115. Link
  • Adams, R. P., & MacKay, D. J. C. (2007). "Bayesian Online Changepoint Detection." arXiv:0710.3742. Link
  • Sharifi, A., Sun, W., & Seco, L. A. (2025). "Detecting Structural Breaks in Dynamic Environments Using Reinforcement Learning and Bayesian Change Point Models." SSRN. Link

Machine Learning

  • Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD, 785–794. Link

Deep Learning & Transformers

  • Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS, 30. Link
  • Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780.
  • Wang, Y. et al. (2024). "TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables." arXiv:2402.19072. Link
  • Liu, Y. et al. (2024). "ExoTST: Exogenous-Aware Temporal Sequence Transformer." arXiv:2410.12184. Link

Signal Processing

  • Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
  • Song, J. H., Lopez de Prado, M., Simon, H., & Wu, K. (2014). "Exploring Irregular Time Series Through Non-Uniform Fast Fourier Transform." Proceedings of the International Conference for High Performance Computing, IEEE. Link

Acknowledgments

This project was developed for the ADIA Lab Structural Break Challenge, a machine learning competition hosted by CrunchDAO in partnership with ADIA Lab. The challenge focused on detecting structural breaks in univariate time series data.

License

MIT License