
# FinSight: LLM-Powered Earnings Intelligence

An end-to-end machine learning system that extracts alpha signals from S&P 500 earnings call transcripts.

🖥️ Interactive Dashboard • 📊 Streamlit Demo • 📄 Technical Report • 📈 Results


## What is FinSight?

Every quarter, 500+ S&P 500 companies hold earnings calls where management presents results and analysts ask probing questions. The linguistic content of these calls (management tone, analyst skepticism, guidance specificity) may contain signals that markets don't fully price immediately.

FinSight processes 14,584 earnings transcripts across 601 S&P 500 companies (2018–2024), extracts 34 NLP features using FinBERT and RAG, and trains walk-forward validated ML models to predict 5-day and 20-day post-earnings stock returns.


## Pipeline

```text
14,584 Earnings Transcripts (2018–2024)
            │
            ▼
┌─────────────────────────────────────┐
│  Stage 1 – Data Ingestion           │
│  HuggingFace datasets + yfinance    │
│  601 companies · 1M+ price rows     │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Stage 2 – NLP Feature Extraction   │
│                                     │
│  FinBERT (ProsusAI)                 │
│  · Sentence-level sentiment         │
│  · Mgmt prepared remarks vs Q&A     │
│  · 14 sentiment features            │
│                                     │
│  RAG Pipeline (all-MiniLM-L6-v2)    │
│  · 380,507 embedded chunks          │
│  · 5 structured semantic queries    │
│  · 10 relevance + content features  │
│                                     │
│  Output: 34 features · 13,442 rows  │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Stage 3 – Prediction Models        │
│                                     │
│  · Baseline (Logistic Regression)   │
│  · FinBERT-only (XGBoost)           │
│  · RAG-only (XGBoost)               │
│  · XGBoost (all 34 features)        │
│  · LightGBM (all 34 features) ★     │
│  · LSTM (temporal 6-quarter seq)    │
│                                     │
│  Walk-forward CV · Zero leakage     │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Stage 4 – Backtesting              │
│  Long-short quartile portfolio      │
│  5-day and 20-day holding periods   │
│  10 bps round-trip transaction cost │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Stage 5 – Sector Analysis          │
│  GICS sector-level walk-forward     │
│  Energy IC = +0.311 (best)          │
│  Technology IC ≈ 0 (efficient)      │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Stage 6 – Dashboard + Report       │
│  Next.js · Streamlit · HF Spaces    │
│  8-page technical report            │
└─────────────────────────────────────┘
```

## Results

### Walk-Forward Validation (2021–2024)

Train on years T−3 to T−1, test on year T. Zero data leakage.

| Model | IC Mean | IC Std | Hit Rate | AUC |
|---|---|---|---|---|
| Baseline | 0.0429 | 0.1141 ⚠️ | 0.5312 | 0.5174 |
| LightGBM ★ | 0.0198 | 0.0085 | 0.5329 | 0.5086 |
| LSTM | 0.0153 | 0.0211 | 0.5471 | 0.5060 |
| XGBoost | 0.0141 | 0.0180 | 0.5321 | 0.5099 |
| RAG Only | 0.0000 | 0.0295 | 0.5347 | 0.5086 |
| FinBERT Only | −0.0044 | 0.0117 | 0.5312 | 0.5007 |

IC = Information Coefficient (Pearson correlation of predictions vs. actual 5-day returns). LightGBM is 10× more stable than the baseline (std = 0.009 vs. 0.114). LSTM achieves the highest hit rate (54.7%), making it the best model for directional prediction.
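The IC in the table above is a per-fold Pearson correlation between predicted and realized 5-day returns. A minimal sketch of the metric (the function name and toy data are illustrative, not taken from the repo):

```python
import numpy as np

def information_coefficient(predictions: np.ndarray, returns: np.ndarray) -> float:
    """Pearson correlation between model predictions and realized returns."""
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
    return float(np.corrcoef(predictions, returns)[0, 1])

# Toy example: predictions that partially track realized 5-day returns.
preds = np.array([0.02, -0.01, 0.03, 0.00, -0.02])
rets  = np.array([0.015, -0.005, 0.02, 0.01, -0.03])
ic = information_coefficient(preds, rets)
```

An IC near zero means the ranking carries no information; even a stable IC of ~0.02, as LightGBM achieves, can be economically meaningful at portfolio scale.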

### Top 5 Features by SHAP Importance

| Rank | Feature | Group | Mean \|SHAP\| | Insight |
|---|---|---|---|---|
| 1 | qa_neg_ratio | QA FinBERT | 0.0541 | Analyst pushback > management positivity |
| 2 | mgmt_sent_vol | Mgmt FinBERT | 0.0476 | Inconsistent messaging = larger price moves |
| 3 | qa_n_sentences | QA FinBERT | 0.0453 | Longer Q&A = more analyst scrutiny |
| 4 | mgmt_mean_neu | Mgmt FinBERT | 0.0445 | Deliberate neutrality = hedging signal |
| 5 | rag_guidance_specificity_relevance | RAG | 0.0420 | Specific guidance = clearer market reaction |
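The Mean |SHAP| column is the average absolute SHAP value of each feature across test-set earnings calls. Assuming a precomputed SHAP matrix (in the actual pipeline it would come from a SHAP tree explainer on the trained LightGBM model; the toy values below are illustrative), the ranking reduces to:

```python
import numpy as np

# Hypothetical SHAP matrix: rows = earnings calls, columns = features.
shap_values = np.array([
    [ 0.06, -0.01,  0.02],
    [-0.05,  0.03, -0.01],
    [ 0.04, -0.02,  0.00],
])
feature_names = ["qa_neg_ratio", "mgmt_sent_vol", "qa_n_sentences"]

# Global importance = mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(feature_names, importance), key=lambda p: -p[1])
```

Taking the absolute value before averaging matters: a feature that pushes predictions strongly in both directions would otherwise cancel to zero.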

### Sector Analysis (Walk-Forward IC by GICS Sector)

| Rank | Sector | IC Mean | IC Std | AUC |
|---|---|---|---|---|
| 1 | Energy ★ | +0.3111 | 0.2430 | 0.6393 |
| 2 | Real Estate | +0.0779 | 0.2861 | 0.5089 |
| 3 | Industrials | +0.0738 | 0.0359 | 0.5625 |
| 4 | Utilities | +0.0644 | 0.1428 | 0.4703 |
| 5 | Consumer Staples | +0.0613 | 0.1452 | 0.5212 |
| 9 | Technology | +0.0037 | 0.0983 | 0.4874 |
| 11 | Materials | −0.1321 | 0.2903 | 0.4958 |

Key finding: Energy IC = 0.311 is 83× stronger than Technology IC ≈ 0.004. This is consistent with sector-by-sector market efficiency: Technology is efficiently priced, while Energy carries high information asymmetry from commodity-price exposure.

### Backtest Performance (Long-Short Quartile)

| Metric | 5-Day | 20-Day |
|---|---|---|
| Annualized Return | −0.91% | −0.69% |
| Sharpe Ratio | −0.81 | −0.23 (+3.6×) |
| Max Drawdown | −4.24% | −6.03% |
| Win Rate | 37.5% | 31.3% |

Sharpe improves 3.6× from the 5-day to the 20-day holding period, consistent with PEAD theory (Bernard & Thomas 1989). A signal exists (IC = 0.0198) but is insufficient to overcome 10 bps transaction costs at a 5-day horizon. Extending to 20 days significantly reduces the cost-to-signal ratio.
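A single rebalance of the long-short quartile strategy can be sketched as follows; the function name, toy scores, and returns are hypothetical, but the 10 bps round-trip cost matches the backtest assumption:

```python
import numpy as np

def long_short_quartile_return(scores, returns, cost_bps=10.0):
    """One rebalance: long the top score quartile, short the bottom, net of costs."""
    scores, returns = np.asarray(scores, float), np.asarray(returns, float)
    lo, hi = np.quantile(scores, [0.25, 0.75])
    long_leg = returns[scores >= hi].mean()    # equal-weighted top quartile
    short_leg = returns[scores <= lo].mean()   # equal-weighted bottom quartile
    # Round-trip transaction cost charged once per holding period.
    return long_leg - short_leg - cost_bps / 10_000.0

# Toy cross-section of 8 stocks: model scores and realized returns.
scores = [0.9, 0.8, 0.1, 0.2, 0.5, 0.6, 0.4, 0.7]
rets   = [0.03, 0.02, -0.02, -0.01, 0.00, 0.01, 0.00, 0.01]
pnl = long_short_quartile_return(scores, rets)
```

Because the portfolio is dollar-neutral, the spread between the two legs is what must clear the 10 bps hurdle each period.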


## Tech Stack

| Component | Technology |
|---|---|
| Language | Python 3.10 |
| NLP Model | FinBERT (ProsusAI/finbert) |
| Embeddings | all-MiniLM-L6-v2 |
| Vector DB | ChromaDB |
| ML Models | XGBoost, LightGBM, PyTorch LSTM |
| Interpretability | SHAP |
| Dashboard (v2) | Next.js 14, TypeScript, Tailwind, Recharts, Framer Motion |
| Dashboard (v1) | Streamlit + Plotly |
| Deployment | Vercel (Next.js) + Hugging Face Spaces (Streamlit) |
| GPU | NVIDIA RTX 4060 Laptop (CUDA 11.8) |

## Project Structure

```text
finsight/
├── config.py                        # Central configuration (paths, constants)
├── run_ingestion.py                 # Stage 1 runner
├── run_nlp.py                       # Stage 2 runner
├── export_data.py                   # Export JSON for Next.js dashboard
│
├── src/
│   ├── ingestion/
│   │   ├── download_transcripts.py  # HuggingFace dataset download
│   │   ├── price_data.py            # yfinance price data
│   │   └── validate_data.py         # Data quality checks
│   │
│   ├── nlp/
│   │   ├── finbert_sentiment.py     # FinBERT pipeline (GPU, checkpointing)
│   │   ├── rag_pipeline.py          # RAG feature extraction (GPU-accelerated)
│   │   └── build_feature_matrix.py  # Merge features + price returns
│   │
│   ├── models/
│   │   ├── train_models.py          # XGBoost + LightGBM walk-forward
│   │   └── lstm_model.py            # LSTM sequence model
│   │
│   ├── backtest/
│   │   ├── backtest_engine.py       # 5-day backtest
│   │   └── backtest_20d.py          # 20-day backtest + comparison
│   │
│   ├── analysis/
│   │   └── sector_analysis.py       # GICS sector-level IC analysis
│   │
│   └── dashboard/
│       └── app.py                   # Streamlit dashboard (v1)
│
├── experiments/                     # Model results, SHAP, plots
├── report/
│   └── FinSight_Technical_Report.docx
└── requirements.txt
```

## Reproducing Results

### Setup

```bash
git clone https://github.com/Rajveer-code/Finsight.git
cd Finsight
python -m venv venv
venv\Scripts\activate            # Windows
source venv/bin/activate         # macOS/Linux
pip install -r requirements.txt
```

### Stage 1 – Data Ingestion (~30 min)

```bash
python run_ingestion.py
# Output: 14,584 transcripts, 1M+ price rows
```

### Stage 2 – NLP Pipeline (~3 hours on GPU)

```bash
python run_nlp.py
# Output: 34 features × 13,442 rows
# Checkpoints every 100/500 records; safe to interrupt
```

### Stage 3 – Train Models (~10 min)

```bash
python src/models/train_models.py
python src/models/lstm_model.py
```

### Stage 4 – Backtest (~1 min)

```bash
python src/backtest/backtest_engine.py
python src/backtest/backtest_20d.py
```

### Stage 5 – Sector Analysis (~5 min)

```bash
python src/analysis/sector_analysis.py
```

### Stage 6 – Dashboard

```bash
# Streamlit (v1)
streamlit run src/dashboard/app.py

# Export data for the Next.js dashboard
python export_data.py
```

## Key Design Decisions

**Why walk-forward validation?** Standard k-fold cross-validation leaks future information in time series. Walk-forward trains on years T−3 to T−1 and tests on year T only; no future data is ever seen during training.
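A minimal sketch of this splitting scheme (the helper name is hypothetical, not the repo's API):

```python
def walk_forward_splits(years, train_window=3):
    """Yield (train_years, test_year) pairs: train on T-3..T-1, test on T."""
    splits = []
    for i in range(train_window, len(years)):
        splits.append((years[i - train_window:i], years[i]))
    return splits

# 2018-2024 data yields four folds, testing 2021 through 2024.
splits = walk_forward_splits(list(range(2018, 2025)))
```

Each fold's training window ends strictly before its test year begins, which is exactly the leakage guarantee the paragraph describes.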

**Why FinBERT + RAG together?** FinBERT captures emotional tone at the sentence level. RAG captures topical specificity: whether management actually discussed numerical guidance, new risks, or cost pressures. RAG features contribute 34.6% of total SHAP importance despite comprising fewer features.
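One way a RAG relevance feature can be computed is as the mean cosine similarity between a semantic query embedding and its top-k most similar transcript chunks. This sketch uses toy 3-d vectors in place of the 384-d all-MiniLM-L6-v2 embeddings, and the function names are illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relevance_feature(query_vec, chunk_vecs, top_k=3):
    """Mean cosine similarity of the top-k most relevant transcript chunks."""
    sims = sorted((cosine(query_vec, c) for c in chunk_vecs), reverse=True)
    return sum(sims[:top_k]) / min(top_k, len(sims))

# Toy vectors standing in for "numerical guidance" query and transcript chunks.
query = np.array([1.0, 0.0, 0.0])
chunks = [np.array([1.0, 0.1, 0.0]),
          np.array([0.0, 1.0, 0.0]),
          np.array([0.9, 0.2, 0.1])]
score = relevance_feature(query, chunks, top_k=2)
```

A high score means the transcript actually contains passages semantically close to the query topic, independent of how positive or negative they sound.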

**Why an LSTM alongside tree models?** Tree models treat each earnings call as independent. The LSTM learns that a company with six consecutive quarters of deteriorating sentiment is different from one with a single bad quarter. Its 2022 IC of +0.047, the strongest single fold across all models, validates this temporal signal.
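Building the 6-quarter input sequences for such a model amounts to sliding a window over each company's chronologically ordered feature rows; a sketch with hypothetical names:

```python
import numpy as np

def build_sequences(features: np.ndarray, seq_len: int = 6) -> np.ndarray:
    """Stack trailing seq_len quarters of features for each prediction date.

    features: (n_quarters, n_features) array for one company, in time order.
    Returns: (n_windows, seq_len, n_features) array of overlapping windows.
    """
    return np.stack([features[i - seq_len:i]
                     for i in range(seq_len, len(features) + 1)])

quarters = np.arange(16, dtype=float).reshape(8, 2)  # 8 quarters, 2 features
seqs = build_sequences(quarters)
```

The resulting (batch, time, features) tensor is the standard input shape for a PyTorch LSTM with `batch_first=True`.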

**Why both 5-day and 20-day backtests?** Post-earnings announcement drift (PEAD) is documented at 20–60-day horizons. The 3.6× Sharpe improvement from 5-day to 20-day holding validates that the signal takes time to be fully priced, consistent with Bernard & Thomas (1989).
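Comparing Sharpe ratios across holding periods requires annualizing at each period's rebalance frequency: roughly 50 rebalances per year for a 5-day strategy (252/5) versus about 13 for a 20-day one (252/20). A sketch (function name and toy returns are illustrative; risk-free rate assumed zero):

```python
import numpy as np

def annualized_sharpe(period_returns, periods_per_year):
    """Per-period Sharpe ratio scaled to annual frequency (zero risk-free rate)."""
    r = np.asarray(period_returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year))

# Toy per-rebalance returns of a slightly losing 5-day strategy.
rets_5d = [0.001, -0.002, 0.0005, -0.001, 0.0015, -0.0005]
sharpe = annualized_sharpe(rets_5d, periods_per_year=50)
```

The sqrt-of-frequency scaling means a weaker per-period edge at a lower rebalance frequency can still annualize to a better Sharpe once costs are netted out, which is the pattern the 5-day vs. 20-day table shows.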


## Limitations & Future Work

- Long-only 20-day backtest (eliminates short-selling costs)
- Replace RAG keyword scoring with a Llama-3 / Mistral generative scorer
- Sector-stratified model training (separate models per sector)
- Cross-lingual extension using multilingual FinBERT
- Real-time pipeline streaming live earnings calls

## References

- Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063
- Bernard, V. & Thomas, J. (1989). Post-Earnings-Announcement Drift. Journal of Accounting Research
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS
- Lundberg, S. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS
- Loughran, T. & McDonald, B. (2011). When is a Liability not a Liability? Journal of Finance
- Chan, L., Jegadeesh, N. & Lakonishok, J. (1996). Momentum Strategies. Journal of Finance, 51(5)

## Author

Rajveer Singh Pall

Portfolio project for MSc Data Science application (ETH Zurich 2026).

