sarthakNaikare/quant-singularity-data-engine

🛢️ Quant Singularity Data Engine

Internship screening project for the Data Engine track at Quant Singularity. It builds a production-grade data layer for NIFTY derivatives market data, covering ingestion, validation, warehousing, and typed access functions.


🧠 What this does

Raw vendor CSVs come in dirty, inconsistent, and undocumented. This pipeline cleans them, stores them in a queryable Parquet warehouse, and exposes four typed access functions that downstream teams can trust.

The validation module reported 23 findings across all sources (4 errors, 19 warnings), including a duplicated feed correction on the day before expiry, a stale futures rollover label, and a 3-sigma FII outflow event on Aug 26.


🖥️ Screenshots

[Screenshots: pipeline running, idempotency pass, warehouse structure, validation findings, benchmark results, access smoke test, git log]


⚙️ Tech stack

| Layer | Technology |
|---|---|
| 🐍 Language | Python 3.12 |
| 📦 Storage | Apache Parquet via PyArrow |
| 🦆 Query engine | DuckDB (in-memory, no server needed) |
| 📊 Data wrangling | Pandas |
| 📈 Experiment tracking | MLflow |
| 🗂️ Data format | CSV vendor feed to Parquet warehouse |
| 🖥️ Environment | WSL2, Ubuntu 24, AMD Ryzen 7 5800H |

🚀 Setup

git clone <your-repo-url>
cd quant_singularity
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

▶️ Single run command

python ingest.py

Reads all raw files, validates, writes warehouse, logs to MLflow. Running it twice on the same input produces byte-identical Parquet output.
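
The byte-identical claim can be checked by hashing the output of two runs. The following is a minimal sketch using only the standard library: CSV bytes stand in for Parquet, and the `serialize` and `digest` helpers and the sample rows are hypothetical, not the pipeline's actual code. It shows why a fixed row order (sorting by the natural key) makes the output independent of input order.

```python
import csv
import hashlib
import io
import random


def serialize(rows, key_fields):
    """Sort rows by their natural key, then serialize them to bytes deterministically."""
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in key_fields))
    buf = io.StringIO()
    # Fixed column order and line terminator keep the bytes stable across runs.
    writer = csv.DictWriter(buf, fieldnames=sorted(ordered[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buf.getvalue().encode("utf-8")


def digest(payload):
    """Hex digest used to compare two outputs byte for byte."""
    return hashlib.sha256(payload).hexdigest()


# Illustrative rows only; not taken from the vendor feed.
rows = [
    {"timestamp": "2025-08-27 11:30:00", "strike": 24800, "side": "CE", "close": 101.5},
    {"timestamp": "2025-08-27 11:30:00", "strike": 24800, "side": "PE", "close": 88.0},
    {"timestamp": "2025-08-27 11:25:00", "strike": 24900, "side": "CE", "close": 64.2},
]
key = ("timestamp", "strike", "side")

first_run = digest(serialize(rows, key))
shuffled = rows[:]
random.Random(7).shuffle(shuffled)  # simulate a different read order on the second run
second_run = digest(serialize(shuffled, key))
print(first_run == second_run)
```

With real Parquet files the same comparison would be made on the file bytes, which also requires that the writer settings stay fixed between runs.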


📁 Project structure

ingest.py                  single command entrypoint
access.py                  4 typed access functions
explore.py                 raw data exploration
generate_readme.py         generates this file from actual results
validation/
  validator.py             standalone validation module
  validation_report.json   23 findings with decisions
benchmark/
  run_benchmark.py         latency benchmark
  benchmark.csv            results with CI intervals
data/
  nifty_spot/              1-min OHLCV, 7 trading days
  options_chain/           5-min chain snapshots, 12 files
  nifty_futures/           near and mid month futures
  aux/                     VIX, FII/DII, NSE calendar
warehouse/                 cleaned Parquet partitioned by date
  spot/date=YYYY-MM-DD/
  options/date=YYYY-MM-DD/expiry=YYYY-MM-DD/
  futures/date=YYYY-MM-DD/
  vix/date=YYYY-MM-DD/
  fii_dii/
ss/                        screenshots of key milestones
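
The warehouse paths follow a Hive-style `key=value` partition layout. As a rough illustration, a partition directory could be built as below; `partition_dir` is a hypothetical helper, not a function from this repository, and the file names inside each partition are left out because the README does not state them.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def partition_dir(root: str, table: str, trade_date: date,
                  expiry: Optional[date] = None) -> PurePosixPath:
    """Build a partition directory such as options/date=.../expiry=.../."""
    path = PurePosixPath(root) / table / f"date={trade_date.isoformat()}"
    if expiry is not None:
        # Only the options table carries a second partition level in this layout.
        path = path / f"expiry={expiry.isoformat()}"
    return path


print(partition_dir("warehouse", "spot", date(2025, 8, 22)))
print(partition_dir("warehouse", "options", date(2025, 8, 27), date(2025, 8, 28)))
```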

🔧 Access functions

| Function | What it returns | Edge cases handled |
|---|---|---|
| get_price(timestamp) | 1-min NIFTY spot bar | pre-open, post-close, data gaps |
| get_features(timestamp) | volatility, VWAP deviation, volume ratio, VIX | warm-up period, missing VIX |
| get_signals(timestamp, expiry) | options chain snapshot | between snapshots, pre-open flag |
| get_features_batch(timestamps) | batch feature matrix | partial failures, never raises |
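
The "partial failures, never raises" behaviour of the batch function can be pictured as follows. This is a sketch under assumptions: `FeatureResult`, `get_features_stub`, and `get_features_batch_sketch` are hypothetical names, the stub returns a placeholder value, and the real return type in access.py may differ.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass
class FeatureResult:
    timestamp: datetime
    features: Optional[dict]  # None when the lookup failed
    error: Optional[str]      # reason for the failure, None on success


def get_features_stub(ts: datetime) -> dict:
    """Hypothetical stand-in for get_features: rejects timestamps outside 09:15-15:30."""
    if not time(9, 15) <= ts.time() <= time(15, 30):
        raise ValueError("outside market hours")
    return {"volume_ratio": 1.0}  # placeholder value, not real data


def get_features_batch_sketch(timestamps: List[datetime]) -> List[FeatureResult]:
    """Record per-timestamp failures instead of raising, so one bad input cannot abort the batch."""
    results = []
    for ts in timestamps:
        try:
            results.append(FeatureResult(ts, get_features_stub(ts), None))
        except Exception as exc:
            results.append(FeatureResult(ts, None, str(exc)))
    return results


batch = get_features_batch_sketch([
    datetime(2025, 8, 26, 10, 0),
    datetime(2025, 8, 26, 8, 0),  # before the open: recorded as a failure
])
for r in batch:
    print(r.timestamp, "ok" if r.error is None else r.error)
```

Returning one result per input keeps the output aligned with the input timestamps, so a caller can compute a valid percentage without catching exceptions.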

🔍 Validation findings

| Severity | Source | Finding |
|---|---|---|
| 🟡 WARN | nifty_spot/2025-08-22.csv | last bar is 15:29:00, not 15:30:00; closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-25.csv | last bar is 15:29:00, not 15:30:00; closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-26.csv | last bar is 15:29:00, not 15:30:00; closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-27.csv | last bar is 15:29:00, not 15:30:00; closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-28.csv | last bar is 15:29:00, not 15:30:00; closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-09-01.csv | last bar is 15:29:00, not 15:30:00; closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-09-02.csv | last bar is 15:29:00, not 15:30:00; closing bar missing from vendor feed |
| 🟡 WARN | options/2025-08-22_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-22_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-25_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-25_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-26_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-26_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-27_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🔴 ERROR | options/2025-08-27_2025-08-28.csv | 6 duplicate timestamp+strike+side rows |
| 🔴 ERROR | options/2025-08-27_2025-08-28.csv | 1 row where close is outside the high/low range |
| 🟡 WARN | options/2025-08-27_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-28_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-28_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-09-01_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-09-02_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🔴 ERROR | nifty_futures/2025-09-01.csv | near_month_expiry still shows 2025-08-28 but trading date is 2025-09-01; contract already expired on Aug 28 |
| 🔴 ERROR | nifty_futures/2025-09-02.csv | near_month_expiry still shows 2025-08-28 but trading date is 2025-09-02; contract already expired on Aug 28 |
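
The three ERROR checks in the table can be expressed in a few lines each. The sketch below is an assumption about how such checks might look, not the contents of validation/validator.py; the function names and the sample rows (with made-up prices) are hypothetical.

```python
from collections import Counter
from datetime import date


def find_duplicate_keys(rows):
    """Return (timestamp, strike, side) keys that occur more than once."""
    counts = Counter((r["timestamp"], r["strike"], r["side"]) for r in rows)
    return [k for k, n in counts.items() if n > 1]


def find_close_out_of_range(rows):
    """Return rows whose close lies outside [low, high]."""
    return [r for r in rows if not r["low"] <= r["close"] <= r["high"]]


def is_stale_expiry(trading_date, near_month_expiry):
    """A near-month expiry earlier than the trading date means the label never rolled."""
    return near_month_expiry < trading_date


# Illustrative rows only; prices are made up.
rows = [
    {"timestamp": "11:30", "strike": 24800, "side": "CE", "high": 105.0, "low": 99.0, "close": 101.0},
    {"timestamp": "11:30", "strike": 24800, "side": "CE", "high": 105.0, "low": 99.0, "close": 106.5},
    {"timestamp": "11:35", "strike": 24800, "side": "CE", "high": 104.0, "low": 98.0, "close": 100.0},
]
print(find_duplicate_keys(rows))
print(len(find_close_out_of_range(rows)))
print(is_stale_expiry(date(2025, 9, 1), date(2025, 8, 28)))
```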

⏱️ Benchmark results

Tested on AMD Ryzen 7 5800H, 7.4 GB RAM, WSL2 Ubuntu.

| Function | Median | p99 | CI |
|---|---|---|---|
| get_price | 10.57 ms | 100.022 ms | ±2.919 ms |
| get_features | 14.013 ms | 32.668 ms | ±2.102 ms |
| get_signals_weekly | 13.428 ms | 17.664 ms | ±0.177 ms |
| get_signals_monthly | 13.707 ms | 21.707 ms | ±1.279 ms |

get_features_batch_1000: total = 15021.3 ms, valid = 100.0%, throughput = 66.6 ts/sec

p99 sits well above the median because cold runs read Parquet files from disk. Once the OS page cache warms up, subsequent calls are much faster.
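
For reference, the three reported statistics can be derived from raw latency samples as sketched below. The README does not say how the CI is computed, so the 95% normal-approximation interval on the mean is an assumption, as are the helper names `summarize` and `time_call`; benchmark/run_benchmark.py may do this differently.

```python
import math
import statistics
import time


def summarize(latencies_ms):
    """Median, nearest-rank p99, and an assumed 95% CI half-width on the mean."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank percentile: ceil(0.99 * n) computed with integer arithmetic.
    p99_index = (99 * n + 99) // 100 - 1
    half_width = 1.96 * statistics.stdev(ordered) / math.sqrt(n)
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "ci_ms": half_width,
    }


def time_call(fn, repeats=200):
    """Time fn() repeatedly with a monotonic clock and return latencies in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# A trivial workload stands in for an access function here.
stats = summarize(time_call(lambda: sum(range(1000))))
print(stats)
```

Because the first calls include cold disk reads, a benchmark that discards a few warm-up iterations would report a lower p99 than one that keeps them.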


🐛 Challenges and fixes

1. Options chain duplicate at 11:30 on Aug 27. The vendor sent 6 strikes twice at exactly 11:30, with slightly different prices in the two copies. On one row the second copy had a close above the high, which is impossible. Kept the first copy since the second was a corrupted correction.

2. Futures expiry label never rolled after Aug 28. On Sep 01 and Sep 02, near_month_expiry still showed 2025-08-28 even though that contract had expired. Caught this by comparing the trading date against the expiry date. Flagged the rows with stale_expiry=True so the strategy team knows not to use the field.

3. Idempotency fix. The first version did not sort rows before writing Parquet, so two runs produced different row orderings and the md5sum values differed. Fixed by sorting every dataframe by its natural key before each write.

4. Pre-open snapshots. Initially filtered everything before 09:15 out of the warehouse, then realised the ML team might want these rows for opening range features. Changed to retaining them with an is_pre_open flag instead of dropping them.
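
Two of the fixes above, keeping the first copy of a duplicated row and flagging pre-open snapshots instead of dropping them, can be sketched as plain functions. This is an illustration under assumptions: `keep_first_copy` and `flag_pre_open` are hypothetical names, the rows use made-up prices, and the actual pipeline works on dataframes rather than lists of dicts.

```python
from datetime import datetime, time

MARKET_OPEN = time(9, 15)


def keep_first_copy(rows):
    """Drop later rows that repeat a (timestamp, strike, side) key, preserving feed order."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["timestamp"], row["strike"], row["side"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def flag_pre_open(rows):
    """Retain every row and mark snapshots before 09:15 rather than filtering them out."""
    return [{**row, "is_pre_open": row["timestamp"].time() < MARKET_OPEN} for row in rows]


# Illustrative rows only; prices are made up.
feed = [
    {"timestamp": datetime(2025, 8, 27, 9, 5), "strike": 24800, "side": "CE", "close": 98.0},
    {"timestamp": datetime(2025, 8, 27, 11, 30), "strike": 24800, "side": "CE", "close": 101.0},
    {"timestamp": datetime(2025, 8, 27, 11, 30), "strike": 24800, "side": "CE", "close": 101.4},
]
cleaned = flag_pre_open(keep_first_copy(feed))
for row in cleaned:
    print(row["timestamp"], row["close"], row["is_pre_open"])
```

Keeping the first copy relies on feed order being preserved up to this step, so the deduplication has to run before any sort that could reorder rows sharing the same key.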


🧪 Other commands

python validation/validator.py     # run validation standalone
python benchmark/run_benchmark.py  # run latency benchmark
python access.py                   # smoke test all 4 access functions
python explore.py                  # raw data exploration
python generate_readme.py          # regenerate this README

Built by Sarthak Naikare for Quant Singularity Data Engine internship screening, May 2026.

About

Production-grade data layer for NIFTY derivatives market data. Parquet warehouse, DuckDB query engine, MLflow tracking, 4 typed access functions, and a validation module that found 5 real anomalies in the vendor feed.
