🛢️ Quant Singularity Data Engine

Internship screening project for the Data Engine track at Quant Singularity. It is a production-grade data layer for NIFTY derivatives market data, covering ingestion, validation, warehousing, and typed access functions.

🧠 What this does

Raw vendor CSVs come in dirty, inconsistent, and undocumented. This pipeline cleans them, stores them in a queryable Parquet warehouse, and exposes four typed access functions that downstream teams can trust.

The validation module reported 23 findings across all sources (4 errors, 19 warnings), including a duplicated feed correction on the day before expiry, a stale futures rollover label, and a 3-sigma FII outflow event on Aug 26.

🖥️ Screenshots

Screenshots (stored in ss/): pipeline running, idempotency pass, warehouse structure, validation findings, benchmark results, access smoke test.

⚙️ Tech stack

Layer	Technology
🐍 Language	Python 3.12
📦 Storage	Apache Parquet via PyArrow
🦆 Query engine	DuckDB (in-memory, no server needed)
📊 Data wrangling	Pandas
📈 Experiment tracking	MLflow
🗂️ Data format	CSV vendor feed to Parquet warehouse
🖥️ Environment	WSL2, Ubuntu 24, AMD Ryzen 7 5800H

🚀 Setup

git clone <your-repo-url>
cd quant_singularity
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

▶️ Single run command

python ingest.py

Reads all raw files, validates them, writes the warehouse, and logs the run to MLflow. Running it twice on the same input produces byte-identical Parquet output.
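One way to check the byte-identical claim is to hash the whole warehouse tree after each run and compare the two digests. The sketch below is illustrative only: `hash_tree` is a hypothetical helper, not a function in this repo, and the warehouse path you pass in depends on where `ingest.py` writes.

```python
import hashlib
from pathlib import Path


def hash_tree(root: str) -> str:
    """Return one SHA-256 digest covering every file under root.

    Files are visited in sorted path order so the digest does not depend
    on the order in which the filesystem lists them.
    """
    digest = hashlib.sha256()
    base = Path(root)
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        # Include the relative path so a renamed or moved file changes the digest.
        digest.update(path.relative_to(base).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Run `python ingest.py`, record the digest, run it again, and compare. Equal digests mean every partition file is byte-identical between the two runs.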

📁 Project structure

ingest.py                      single command entrypoint
access.py                      4 typed access functions
explore.py                     raw data exploration
generate_readme.py             generates this file from actual results
validation/
    validator.py               standalone validation module
    validation_report.json     23 findings with decisions
benchmark/
    run_benchmark.py           latency benchmark
    benchmark.csv              results with confidence intervals
data/
    nifty_spot/                1-min OHLCV, 7 trading days
    options_chain/             5-min chain snapshots, 12 files
    nifty_futures/             near and mid month futures
    aux/                       VIX, FII/DII, NSE calendar
    warehouse/                 cleaned Parquet partitioned by date
        spot/date=YYYY-MM-DD/
        options/date=YYYY-MM-DD/expiry=YYYY-MM-DD/
        futures/date=YYYY-MM-DD/
        vix/date=YYYY-MM-DD/
        fii_dii/
ss/                            screenshots of key milestones
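The warehouse uses Hive-style `key=value` directory names. A minimal sketch of how such a partition directory could be built is shown here; `partition_dir` and its parameters are hypothetical names for illustration, not the code in `ingest.py`.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def partition_dir(root: str, table: str, trade_date: date,
                  expiry: Optional[date] = None) -> Path:
    """Build a partition directory like <root>/<table>/date=YYYY-MM-DD.

    When an expiry is given (the options table), an extra
    expiry=YYYY-MM-DD level is appended below the date level.
    """
    path = Path(root) / table / f"date={trade_date.isoformat()}"
    if expiry is not None:
        path = path / f"expiry={expiry.isoformat()}"
    return path
```

Keeping the date as the first partition level means a query for one trading day only needs to touch one directory per table.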

🔧 Access functions

Function	What it returns	Edge cases handled
`get_price(timestamp)`	1-min NIFTY spot bar	pre-open, post-close, data gaps
`get_features(timestamp)`	volatility, VWAP deviation, volume ratio, VIX	warm-up period, missing VIX
`get_signals(timestamp, expiry)`	options chain snapshot	between snapshots, pre-open flag
`get_features_batch(timestamps)`	batch feature matrix	partial failures, never raises
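The "partial failures, never raises" behaviour of the batch function can be sketched as a wrapper that records a per-timestamp error instead of propagating it. This is an assumption about the pattern, not the code in `access.py`: `FeatureResult`, `batch_never_raises`, and the `lookup` callable are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass
class FeatureResult:
    timestamp: datetime
    features: Optional[Dict[str, float]]  # None when the lookup failed
    error: Optional[str]                  # None when the lookup succeeded


def batch_never_raises(
    timestamps: List[datetime],
    lookup: Callable[[datetime], Dict[str, float]],
) -> List[FeatureResult]:
    """Call lookup once per timestamp and record failures instead of raising."""
    results = []
    for ts in timestamps:
        try:
            results.append(FeatureResult(ts, lookup(ts), None))
        except Exception as exc:
            # Deliberately broad: one bad timestamp must not abort the batch.
            results.append(FeatureResult(ts, None, f"{type(exc).__name__}: {exc}"))
    return results
```

The caller always gets one result per input timestamp, in input order, and can filter on `error is None` to count valid rows.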

🔍 Validation findings

Severity	Source	Finding
🟡 WARN	nifty_spot/2025-08-22.csv	last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed
🟡 WARN	nifty_spot/2025-08-25.csv	last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed
🟡 WARN	nifty_spot/2025-08-26.csv	last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed
🟡 WARN	nifty_spot/2025-08-27.csv	last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed
🟡 WARN	nifty_spot/2025-08-28.csv	last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed
🟡 WARN	nifty_spot/2025-09-01.csv	last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed
🟡 WARN	nifty_spot/2025-09-02.csv	last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed
🟡 WARN	options/2025-08-22_2025-08-28.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-22_2025-09-25.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-25_2025-08-28.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-25_2025-09-25.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-26_2025-08-28.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-26_2025-09-25.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-27_2025-08-28.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🔴 ERROR	options/2025-08-27_2025-08-28.csv	6 duplicate timestamp+strike+side rows
🔴 ERROR	options/2025-08-27_2025-08-28.csv	1 row where close is outside the high/low range
🟡 WARN	options/2025-08-27_2025-09-25.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-28_2025-08-28.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-08-28_2025-09-25.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-09-01_2025-09-25.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🟡 WARN	options/2025-09-02_2025-09-25.csv	174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10)
🔴 ERROR	nifty_futures/2025-09-01.csv	near_month_expiry still shows 2025-08-28 but trading date is 2025-09-01, contract already expired on Aug 28
🔴 ERROR	nifty_futures/2025-09-02.csv	near_month_expiry still shows 2025-08-28 but trading date is 2025-09-02, contract already expired on Aug 28
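The three ERROR checks in the table can be sketched in a few lines each. This is a simplified illustration, not `validation/validator.py`: the function names and the column names (`timestamp`, `strike`, `side`, `high`, `low`, `close`) are assumptions, and rows are plain dicts rather than DataFrames.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple


def duplicate_keys(rows: List[Dict], key: Tuple[str, ...]) -> List[tuple]:
    """Return every key tuple that appears more than once."""
    counts = Counter(tuple(row[name] for name in key) for row in rows)
    return [k for k, n in counts.items() if n > 1]


def close_out_of_range(rows: List[Dict]) -> List[Dict]:
    """Return rows whose close lies outside the [low, high] range."""
    return [r for r in rows if not (r["low"] <= r["close"] <= r["high"])]


def is_stale_expiry(trading_date: date, near_month_expiry: date) -> bool:
    """An expiry earlier than the trading date means the contract has expired."""
    return near_month_expiry < trading_date
```

With these definitions, expiry day itself (trading date equal to expiry) is not flagged as stale; only later trading dates are.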

⏱️ Benchmark results

Tested on AMD Ryzen 7 5800H, 7.4GB RAM, WSL2 Ubuntu.

Function	Median	p99	CI
`get_price`	10.57 ms	100.022 ms	±2.919 ms
`get_features`	14.013 ms	32.668 ms	±2.102 ms
`get_signals_weekly`	13.428 ms	17.664 ms	±0.177 ms
`get_signals_monthly`	13.707 ms	21.707 ms	±1.279 ms

`get_features_batch_1000` (batch of 1,000 timestamps): total = 15021.3 ms, valid = 100.0%, throughput = 66.6 timestamps/sec

The gap between p99 and median (widest for `get_price`) comes from cold runs, where Parquet file reads hit disk. Once the OS page cache warms up, subsequent calls drop back toward the median.
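For reference, here is a minimal sketch of how the three reported statistics could be computed from raw latency samples. It is not `benchmark/run_benchmark.py`: `summarize_latencies` is a hypothetical name, and the CI method (a 95% normal-approximation interval for the mean) is an assumption, since the README does not state which interval the benchmark uses.

```python
import math
import statistics
from typing import Dict, List


def summarize_latencies(samples_ms: List[float]) -> Dict[str, float]:
    """Median, nearest-rank p99, and an assumed 95% CI half-width of the mean."""
    n = len(samples_ms)
    if n < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples_ms)
    # Nearest-rank p99: integer ceil(0.99 * n), avoiding float rounding.
    rank = (99 * n + 99) // 100
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ordered) / math.sqrt(n)
    return {
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "ci_half_width": half_width,
    }
```

A handful of slow cold calls is enough to lift p99 far above the median: with 98 samples at 10 ms and 2 at 100 ms, the median stays at 10 ms while p99 is 100 ms.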

🐛 Challenges and fixes

1. Options chain duplicate at 11:30 on Aug 27: The vendor sent 6 strikes twice at exactly 11:30, with slightly different prices between the two copies. In the second copy, one row had a close higher than its high, which is impossible. Kept the first copy, since the second was a corrupted correction.

2. Futures expiry label never rolled after Aug 28: On Sep 01 and Sep 02, near_month_expiry still showed 2025-08-28 even though that contract had already expired. Caught this by comparing the trading date against the expiry date. Flagged these rows with stale_expiry=True so the strategy team knows not to use the field.

3. Idempotency fix: The first version did not sort rows before writing Parquet, so two runs produced different row orderings and different md5sum values. Fixed by sorting every dataframe by its natural key before each write.

4. Pre-open snapshots: Initially filtered everything before 09:15 out of the warehouse, then realised the ML team might want these rows for opening range features. Changed to retain them with an is_pre_open flag instead of dropping them.
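Fixes 1, 3, and 4 can be combined into one cleaning step: keep the first copy of each key, flag pre-open rows instead of dropping them, and sort by the natural key before writing. The sketch below is a simplified stand-in for the pipeline, using plain dicts instead of Pandas dataframes. `clean_rows`, `MARKET_OPEN`, and the `timestamp` column name are assumptions for illustration.

```python
from datetime import datetime, time
from typing import Dict, List, Tuple

MARKET_OPEN = time(9, 15)  # rows before 09:15 count as pre-open


def clean_rows(rows: List[Dict], key: Tuple[str, ...]) -> List[Dict]:
    """Keep the first row per key, add is_pre_open, and sort by the key."""
    seen = set()
    cleaned = []
    for row in rows:  # input order decides which duplicate counts as "first"
        k = tuple(row[name] for name in key)
        if k in seen:
            continue  # later copy of an existing key is dropped
        seen.add(k)
        flagged = dict(row)  # copy so the caller's row is not mutated
        flagged["is_pre_open"] = row["timestamp"].time() < MARKET_OPEN
        cleaned.append(flagged)
    # Deterministic order before writing, so repeated runs emit the same rows
    # in the same sequence.
    cleaned.sort(key=lambda r: tuple(r[name] for name in key))
    return cleaned
```

Sorting last, on the same key used for deduplication, is what makes the output independent of the order in which the vendor files were read.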

🧪 Other commands

python validation/validator.py     # run validation standalone
python benchmark/run_benchmark.py  # run latency benchmark
python access.py                   # smoke test all 4 access functions
python explore.py                  # raw data exploration
python generate_readme.py          # regenerate this README

Built by Sarthak Naikare for Quant Singularity Data Engine internship screening, May 2026.
