
Structural Break Detection

Disclaimer: For academic and educational use only—not financial or trading advice. This repository demonstrates a well-documented phenomenon: backtest overfitting. Models that achieved strong in-sample performance (Dataset A) showed degraded out-of-sample results (Dataset B), consistent with findings that backtested metrics offer limited predictive value for live performance. Do not use these models for real trading decisions.

Note: These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

A comprehensive collection of 25 structural break detection methods for univariate time series, developed for the ADIA Lab Structural Break Challenge hosted by CrunchDAO.


Documentation: For detailed documentation with interactive navigation, visit the GitHub Pages site.

Key Finding: Single-Dataset Benchmarks Are Misleading

We evaluated all 25 detectors on two independent datasets. The results reveal a critical insight:

Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.

This demonstrates that single-dataset performance is unreliable. Our evaluation emphasizes cross-dataset generalization using stability metrics.

Cross-Dataset Results Summary

Top 10 Models by Robust Score

The Stability Score measures cross-dataset consistency:

Stability Score = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)

The Robust Score combines worst-case performance with stability:

Robust Score = min(AUC_A, AUC_B) × Stability Score

| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|------|----------|-----------|-----------|---------|-----------|--------------|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |

The Overfitting Problem

These models showed strong Dataset A performance but failed to generalize:

| Model | Dataset A Rank | Dataset B Rank | AUC Drop | Stability |
|-------|----------------|----------------|----------|-----------|
| gradient_boost_comprehensive | #1 (0.7930) | #6 (0.6533) | -17.6% | 82.4% |
| meta_stacking_7models | #2 (0.7662) | #10 (0.6422) | -16.2% | 83.8% |
| knn_spectral_fft | #15 (0.5793) | #23 (0.4808) | -17.0% | 83.0% |
| hypothesis_testing_pure | #20 (0.5394) | #25 (0.4118) | -23.7% | 76.3% |

Lesson: High single-dataset AUC does not guarantee real-world performance. Always validate on multiple datasets.

Stability Score Methodology

The Stability Score measures how consistently a model performs across datasets:

Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
  • 100%: Identical performance on both datasets
  • < 85%: Significant overfitting concern
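
A minimal Python sketch of both metrics (the function names are illustrative, not from the repository):

def stability_score(auc_a: float, auc_b: float) -> float:
    """1.0 means identical AUC on both datasets; lower values indicate drift."""
    return 1 - abs(auc_a - auc_b) / max(auc_a, auc_b)

def robust_score(auc_a: float, auc_b: float) -> float:
    """Worst-case AUC, discounted by cross-dataset instability."""
    return min(auc_a, auc_b) * stability_score(auc_a, auc_b)

# Example with xgb_tuned_regularization's AUCs from the tables:
stability_score(0.7423, 0.7705)   # ≈ 0.963
robust_score(0.7423, 0.7705)      # ≈ 0.715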

Most Stable Models

| Model | Stability | Interpretation |
|-------|-----------|----------------|
| xgb_selective_spectral | 99.7% | Near-identical performance |
| weighted_dynamic_ensemble | 98.4% | Excellent generalization |
| quad_model_ensemble | 98.0% | Excellent generalization |
| xgb_core_7features | 98.0% | Excellent generalization |
| xgb_70_statistical | 97.1% | Strong generalization |
| xgb_tuned_regularization | 96.3% | Strong generalization |

Least Stable Models (Overfitters)

| Model | Stability | Warning |
|-------|-----------|---------|
| welch_ttest | 71.9% | Severe instability |
| hypothesis_testing_pure | 76.3% | Severe instability |
| gradient_boost_comprehensive | 82.4% | Overfit to Dataset A |
| knn_spectral_fft | 83.0% | Overfit to Dataset A |
| meta_stacking_7models | 83.8% | Overfit to Dataset A |

Top Performer in Local Benchmarks: xgb_tuned_regularization

xgb_tuned_regularization is the top performer when considering cross-dataset performance in local validation:

| Metric | xgb_tuned_regularization | gradient_boost (former #1) | meta_stacking (former #2) |
|--------|--------------------------|----------------------------|---------------------------|
| Dataset A AUC | 0.7423 | 0.7930 | 0.7662 |
| Dataset B AUC | 0.7705 | 0.6533 | 0.6422 |
| Min AUC | 0.7423 | 0.6533 | 0.6422 |
| Stability | 96.3% | 82.4% | 83.8% |
| Robust Score | 0.715 | 0.538 | 0.538 |
| Train Time | 60-185s | 179-451s | 332-32,030s |

Why xgb_tuned_regularization Is the Top Performer

  1. Best Robust Score (0.715) — Highest combined performance and stability
  2. Actually improved on Dataset B — 0.7423 → 0.7705 (+3.8%)
  3. High Stability (96.3%) — Consistent across different data
  4. Fast Training — 60-185s vs hours for complex ensembles
  5. Strong Regularization — L1/L2 penalties prevent overfitting, as in the configuration below:

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    max_depth=5,           # Shallow trees limit model complexity
    learning_rate=0.05,
    reg_alpha=0.5,         # Strong L1 regularization
    reg_lambda=2.0,        # Strong L2 regularization
    min_child_weight=10,   # Larger minimum leaf weight
)

Key Insights

What Works

  1. Regularization is Critical: Models with aggressive regularization generalize better. xgb_tuned_regularization uses strong L1/L2 penalties.

  2. Simpler Models Generalize Better: xgb_core_7features (7 features) has 98.0% stability vs meta_stacking_7models (339 features) at 83.8%.

  3. Feature Engineering Matters: Statistical features (KS statistic, Cohen's d, t-test) consistently outperform raw time series inputs.

  4. Ensemble Diversity: Combining different model types (trees + neural nets) works, but keep it simple.
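
The sketch below illustrates point 4 with a simple probability average of a tree model and a neural net. It is a hedged illustration, not necessarily how mlp_xgb_simple_blend is implemented:

import numpy as np
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

def blended_break_probability(X_train, y_train, X_test, w: float = 0.5) -> np.ndarray:
    """Average break probabilities from two diverse model families."""
    xgb = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.05)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    xgb.fit(X_train, y_train)
    mlp.fit(X_train, y_train)
    # Weighted average of the positive-class (break) probabilities
    return w * xgb.predict_proba(X_test)[:, 1] + (1 - w) * mlp.predict_proba(X_test)[:, 1]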

What Doesn't Work

  1. Complex Ensembles Overfit: meta_stacking_7models (7 models, 32,000s training) dropped from #2 to #10.

  2. Deep Learning Fails on Univariate Data: Transformer (0.49-0.54 AUC) and LSTM (0.50-0.52 AUC) remain near-random on BOTH datasets.

  3. RL Approaches Underperform: Q-learning and DQN (0.46-0.55 AUC) don't compete with supervised learning.

  4. Pure Statistical Tests Are Unstable: hypothesis_testing_pure dropped from 0.54 to 0.41 AUC.

Top Performers by Category

| Category | Model | Notes |
|----------|-------|-------|
| Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable |
| Fastest Training | xgb_core_7features | 7 features, 40s training, 98% stable |
| Highest Stability | xgb_selective_spectral | 99.7% stability |
| No ML Required | segment_statistics_only | Statistical only, 95.4% stable |

Full Results

All 25 Detectors (Sorted by Robust Score)

| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|------|----------|-----------|-----------|---------|-----------|--------------|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |
| 11 | segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 |
| 12 | meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 |
| 13 | gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 |
| 14 | bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 |
| 15 | wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 |
| 16 | qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 |
| 17 | dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 |
| 18 | kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 |
| 19 | qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 |
| 20 | hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 |
| 21 | qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 |
| 22 | knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 |
| 23 | knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 |
| 24 | welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 |
| 25 | hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 |

Class Imbalance & Cost Considerations

The Rare Event Problem

Structural breaks are inherently rare events. This creates fundamental model bias:

  • Models can achieve ~70% accuracy by predicting "no break" for everything
  • Several models (hierarchical_transformer, wavelet_lstm, welch_ttest) exhibit this behavior with 0% recall

Cost Asymmetry

Not all errors are equal in structural break detection:

| Error Type | Description | Cost |
|------------|-------------|------|
| False Negative (FN) | Missing a real break | Moderate — missed opportunity to act on a regime change |
| False Positive (FP) | Predicting a break when none exists | Severe — triggers unnecessary position changes, transaction fees, slippage |

This asymmetry means we should prioritize precision alongside recall, making F1 score a critical metric.
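
One way to act on this asymmetry beyond plain F1 is an F-beta score with beta < 1, which weights precision more heavily. A small scikit-learn sketch with illustrative toy labels (not repository code):

import numpy as np
from sklearn.metrics import f1_score, fbeta_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # breaks are the rare class
y_pred = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 0])  # two false positives, one false negative

f1_score(y_true, y_pred)               # ≈ 0.571, weights precision and recall equally
fbeta_score(y_true, y_pred, beta=0.5)  # ≈ 0.526, penalizes the false positives more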

Why Deep Learning & RL Models Underperformed

The hierarchical_transformer (0.49-0.54 AUC), wavelet_lstm (0.50-0.52 AUC), and RL models (0.46-0.55 AUC) all underperformed tree-based ensembles on BOTH datasets.

The Core Problem: Univariate Features Are Insufficient

These architectures are designed to learn relationships between multiple input variables. With only a univariate time series and its derived features, they lack the rich input space needed to learn meaningful patterns.

What Would Help: Adding exogenous variables (correlated assets, macroeconomic indicators, sentiment data, volume) would provide the multi-dimensional context these architectures need.

Key Insight: Tree-based ensembles excel at learning from handcrafted statistical features that explicitly encode distributional differences. Deep learning needs raw, multi-dimensional input to learn such representations.

Feature Engineering Methodology

Segment-Based Features

Each time series is split at a boundary into pre-segment and post-segment. Features capture differences between these segments:

Pre-segment   |  Post-segment
--------------+---------------
 values[0:T]  |  values[T:end]
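
A minimal sketch of this pre/post-segment comparison (an illustrative helper, not the repository's features.py):

import numpy as np
from scipy import stats

def segment_features(values: np.ndarray, boundary: int) -> dict:
    """Compare the distributions before and after a candidate break point."""
    pre, post = values[:boundary], values[boundary:]
    pooled_std = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)
    return {
        "mean_diff": post.mean() - pre.mean(),
        "std_ratio": post.std(ddof=1) / (pre.std(ddof=1) + 1e-12),
        "ks_statistic": stats.ks_2samp(pre, post).statistic,
        "t_statistic": stats.ttest_ind(pre, post, equal_var=False).statistic,  # Welch's t
        "cohens_d": (post.mean() - pre.mean()) / (pooled_std + 1e-12),
        "mann_whitney_u": stats.mannwhitneyu(pre, post).statistic,
    }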

Feature Categories

| Category | Description | Example Features |
|----------|-------------|------------------|
| Moments | Statistical moments per segment | mean_diff, std_ratio, skew_diff, kurt_diff |
| Effect Sizes | Standardized differences | Cohen's d, Glass's delta, Hedges' g |
| Distribution Tests | Hypothesis test statistics | t_statistic, ks_statistic, mann_whitney_u |
| Quantiles | Percentile comparisons | median_diff, iqr_ratio, q25_diff, q75_diff |
| Spectral | Frequency domain | dominant_freq_diff, spectral_centroid_diff |
| Wavelet | Multi-scale decomposition | dwt_energy_ratio, wavelet_entropy_diff |
| Temporal | Time-dependent patterns | acf_diff, trend_diff, volatility_ratio |

Most Discriminative Features (by importance)

  1. ks_statistic - Kolmogorov-Smirnov test statistic
  2. mean_diff_normalized - Normalized mean difference
  3. std_ratio - Standard deviation ratio
  4. cohens_d - Cohen's effect size
  5. mann_whitney_u - Mann-Whitney U statistic

Repository Structure

structural_break_detection/
├── README.md                          # This file
├── requirements.txt                   # Dependencies
├── run_all_experiments.py             # Full experiment runner
├── quick_benchmark.py                 # Fast benchmarking
│
├── results_dataset_a.csv              # Dataset A results
├── results_dataset_a.md               # Dataset A results (markdown)
├── results_dataset_b.csv              # Dataset B results
├── results_dataset_b.md               # Dataset B results (markdown)
│
├── xgb_tuned_regularization/          # Top performer in local benchmarks
├── weighted_dynamic_ensemble/         # #2 by robust score
├── quad_model_ensemble/               # #3 by robust score
├── mlp_ensemble_deep_features/        # #4 by robust score
│
├── gradient_boost_comprehensive/      # Former #1, overfits
├── meta_stacking_7models/             # Former #2, overfits
│
├── wavelet_lstm/                      # Deep learning (underperforms)
├── hierarchical_transformer/          # Deep learning (underperforms)
│
├── qlearning_rolling_stats/           # RL approach
├── qlearning_bayesian_cpd/            # RL + Bayesian
├── dqn_base_model_selector/           # DQN approach
│
└── ... (25 detectors total)

Each detector folder contains:

  • features.py - Feature extraction class
  • model.py - Detector model class
  • main.py - Training and inference scripts

Usage

Installation

pip install -r requirements.txt

Training the Top Performer

cd xgb_tuned_regularization
python main.py --mode train --data-dir /path/to/data --model-path ./model.joblib

Running Inference

python main.py --mode infer --data-dir /path/to/data --model-path ./model.joblib

Benchmarking All Detectors

# Quick test (top 5 detectors)
python quick_benchmark.py --data-dir /path/to/data

# Full benchmark (all 25 detectors)
python run_all_experiments.py --data-dir /path/to/data --output results.csv

Evaluation Metrics

| Metric | Description |
|--------|-------------|
| ROC AUC | Area under the ROC curve (discrimination ability) |
| Stability Score | Cross-dataset consistency (higher = better generalization) |
| Robust Score | Min AUC × Stability (overall reliability) |
| F1 Score | Harmonic mean of precision and recall |
| Recall | True positive rate (sensitivity) |

Lessons Learned

  1. Single-dataset benchmarks lie: The #1 model on Dataset A dropped to #6 on Dataset B.

  2. Regularization prevents overfitting: xgb_tuned_regularization uses strong L1/L2 penalties and generalizes well.

  3. Complexity ≠ Robustness: meta_stacking_7models (7 models, 32,000s) is less stable than xgb_tuned_regularization (1 model, 185s).

  4. Stability Score matters: Always evaluate on multiple datasets and measure consistency.

  5. Deep learning needs multivariate input: Transformer and LSTM fail on univariate time series.

  6. Simple features work: Statistical tests (KS, t-test) as features outperform complex architectures.

Dependencies

  • Python 3.8+
  • scikit-learn
  • xgboost
  • lightgbm
  • PyTorch (for neural network models)
  • PyWavelets
  • scipy
  • pandas
  • numpy

References

Financial Machine Learning

  • Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

Statistical Methods

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.
  • Welch, B. L. (1947). "The Generalization of 'Student's' Problem when Several Different Population Variances are Involved." Biometrika, 34(1/2), 28–35. Link
  • Mann, H. B., & Whitney, D. R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other." The Annals of Mathematical Statistics, 18(1), 50–60. Link

Change Point Detection

  • Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100–115. Link
  • Adams, R. P., & MacKay, D. J. C. (2007). "Bayesian Online Changepoint Detection." arXiv:0710.3742. Link
  • Sharifi, A., Sun, W., & Seco, L. A. (2025). "Detecting Structural Breaks in Dynamic Environments Using Reinforcement Learning and Bayesian Change Point Models." SSRN. Link

Machine Learning

  • Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD, 785–794. Link

Deep Learning & Transformers

  • Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS, 30. Link
  • Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780.
  • Wang, Y. et al. (2024). "TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables." arXiv:2402.19072. Link
  • Liu, Y. et al. (2024). "ExoTST: Exogenous-Aware Temporal Sequence Transformer." arXiv:2410.12184. Link

Signal Processing

  • Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
  • Song, J. H., Lopez de Prado, M., Simon, H., & Wu, K. (2014). "Exploring Irregular Time Series Through Non-Uniform Fast Fourier Transform." Proceedings of the International Conference for High Performance Computing, IEEE. Link

Acknowledgments

This project was developed for the ADIA Lab Structural Break Challenge, a machine learning competition hosted by CrunchDAO in partnership with ADIA Lab. The challenge focused on detecting structural breaks in univariate time series data.

License

MIT License