Internship screening project for the Data Engine track at Quant Singularity. Built a production-grade data layer for NIFTY derivatives market data, covering ingestion, validation, warehousing, and typed access functions.
Raw vendor CSVs come in dirty, inconsistent, and undocumented. This pipeline cleans them, stores them in a queryable Parquet warehouse, and exposes four typed access functions that downstream teams can trust.
The validation module reported 23 findings across all sources (4 errors, 19 warnings), including a feed-correction duplicate on the day before expiry, a stale futures rollover label, and a 3-sigma FII outflow event on Aug 26.

Screenshots (in ss/): pipeline running, idempotency pass, warehouse structure, validation findings, benchmark results, access smoke test.

| Layer | Technology |
|---|---|
| 🐍 Language | Python 3.12 |
| 📦 Storage | Apache Parquet via PyArrow |
| 🦆 Query engine | DuckDB (in-memory, no server needed) |
| 📊 Data wrangling | Pandas |
| 📈 Experiment tracking | MLflow |
| 🗂️ Data format | CSV vendor feed to Parquet warehouse |
| 🖥️ Environment | WSL2, Ubuntu 24, AMD Ryzen 7 5800H |

    git clone <your-repo-url>
    cd quant_singularity
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

    python ingest.py

Reads all raw files, validates them, writes the warehouse, and logs the run to MLflow. Running it twice on the same input produces byte-identical Parquet output.
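
One way to check the byte-identical claim is to hash every file in the warehouse after each run and compare the digests. The sketch below is a minimal illustration, not the project's actual check: `tree_digests` and `is_idempotent` are hypothetical names, and SHA-256 is used here as a stand-in for whichever checksum tool you prefer.

```python
import hashlib
from pathlib import Path


def tree_digests(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    base = Path(root)
    return {
        path.relative_to(base).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(base.rglob("*"))
        if path.is_file()
    }


def is_idempotent(run_a: str, run_b: str) -> bool:
    """True when both output trees hold the same files with identical bytes."""
    return tree_digests(run_a) == tree_digests(run_b)
```

A typical use would be to copy `warehouse/` aside after the first run, run `python ingest.py` again, and call `is_idempotent` on the two directories.
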

    ingest.py                     single command entrypoint
    access.py                     4 typed access functions
    explore.py                    raw data exploration
    generate_readme.py            generates this file from actual results
    validation/
        validator.py              standalone validation module
        validation_report.json    23 findings with decisions
    benchmark/
        run_benchmark.py          latency benchmark
        benchmark.csv             results with confidence intervals
    data/
        nifty_spot/               1-min OHLCV, 7 trading days
        options_chain/            5-min chain snapshots, 12 files
        nifty_futures/            near and mid month futures
        aux/                      VIX, FII/DII, NSE calendar
    warehouse/                    cleaned Parquet partitioned by date
        spot/date=YYYY-MM-DD/
        options/date=YYYY-MM-DD/expiry=YYYY-MM-DD/
        futures/date=YYYY-MM-DD/
        vix/date=YYYY-MM-DD/
        fii_dii/
    ss/                           screenshots of key milestones

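
The `date=` and `expiry=` directory names follow the Hive-style `key=value` partition convention. As a small illustration, partition values can be recovered from a path with a few lines of standard-library code. `partition_keys` is a hypothetical helper, not something in the repo, and the file name `part.parquet` is made up for the example.

```python
from pathlib import PurePosixPath


def partition_keys(path: str) -> dict[str, str]:
    """Extract key=value partition segments from a warehouse path."""
    keys = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys


print(partition_keys("warehouse/options/date=2025-08-27/expiry=2025-08-28/part.parquet"))
# {'date': '2025-08-27', 'expiry': '2025-08-28'}
```
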
| Function | What it returns | Edge cases handled |
|---|---|---|
| `get_price(timestamp)` | 1-min NIFTY spot bar | pre-open, post-close, data gaps |
| `get_features(timestamp)` | volatility, VWAP deviation, volume ratio, VIX | warm-up period, missing VIX |
| `get_signals(timestamp, expiry)` | options chain snapshot | between snapshots, pre-open flag |
| `get_features_batch(timestamps)` | batch feature matrix | partial failures, never raises |
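
The "partial failures, never raises" behaviour of `get_features_batch` can be pictured as a per-timestamp try/except that records failures instead of propagating them. The sketch below assumes a result shape (`BatchRow`) and takes the single-timestamp function as a parameter (`get_one`); neither is the actual signature in `access.py`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class BatchRow:
    timestamp: datetime
    features: Optional[dict[str, float]]  # None when the lookup failed
    error: Optional[str]                  # None when the lookup succeeded


def features_batch(
    timestamps: list[datetime],
    get_one: Callable[[datetime], dict[str, float]],
) -> list[BatchRow]:
    """Compute features per timestamp; failures become rows, not exceptions."""
    rows = []
    for ts in timestamps:
        try:
            rows.append(BatchRow(ts, get_one(ts), None))
        except Exception as exc:  # deliberately broad: the batch must not raise
            rows.append(BatchRow(ts, None, f"{type(exc).__name__}: {exc}"))
    return rows
```

Returning one row per input timestamp, in input order, lets a caller compute the share of valid rows without a separate error channel.
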
| Severity | Source | Finding |
|---|---|---|
| 🟡 WARN | nifty_spot/2025-08-22.csv | last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-25.csv | last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-26.csv | last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-27.csv | last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-08-28.csv | last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-09-01.csv | last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed |
| 🟡 WARN | nifty_spot/2025-09-02.csv | last bar is 15:29:00, not 15:30:00 — closing bar missing from vendor feed |
| 🟡 WARN | options/2025-08-22_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-22_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-25_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-25_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-26_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-26_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-27_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🔴 ERROR | options/2025-08-27_2025-08-28.csv | 6 duplicate timestamp+strike+side rows |
| 🔴 ERROR | options/2025-08-27_2025-08-28.csv | 1 rows where close outside high/low range |
| 🟡 WARN | options/2025-08-27_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-28_2025-08-28.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-08-28_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-09-01_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🟡 WARN | options/2025-09-02_2025-09-25.csv | 174 rows before 09:15 (snapshots at 09:00, 09:05, 09:10) |
| 🔴 ERROR | nifty_futures/2025-09-01.csv | near_month_expiry still shows 2025-08-28 but trading date is 2025-09-01, contract already expired on Aug 28 |
| 🔴 ERROR | nifty_futures/2025-09-02.csv | near_month_expiry still shows 2025-08-28 but trading date is 2025-09-02, contract already expired on Aug 28 |
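
Two of the ERROR checks in the table, duplicate timestamp+strike+side keys and a close outside the high/low range, are simple row-level tests. A minimal sketch over plain dictionaries follows. The column names (`timestamp`, `strike`, `side`, `high`, `low`, `close`) and the sample values are assumptions for illustration, and the real `validator.py` may be structured differently.

```python
from collections import Counter


def duplicate_keys(rows: list[dict]) -> list[tuple]:
    """Return each (timestamp, strike, side) key that appears more than once."""
    counts = Counter((r["timestamp"], r["strike"], r["side"]) for r in rows)
    return [key for key, n in counts.items() if n > 1]


def close_out_of_range(rows: list[dict]) -> list[dict]:
    """Return rows whose close lies outside the [low, high] range."""
    return [r for r in rows if not (r["low"] <= r["close"] <= r["high"])]


# Illustrative rows only, not taken from the vendor feed.
rows = [
    {"timestamp": "11:30", "strike": 24500, "side": "CE", "high": 101.0, "low": 98.0, "close": 100.0},
    {"timestamp": "11:30", "strike": 24500, "side": "CE", "high": 101.0, "low": 98.0, "close": 101.5},
    {"timestamp": "11:35", "strike": 24500, "side": "CE", "high": 102.0, "low": 99.0, "close": 100.5},
]
print(duplicate_keys(rows))           # [('11:30', 24500, 'CE')]
print(len(close_out_of_range(rows)))  # 1
```
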
Tested on AMD Ryzen 7 5800H, 7.4GB RAM, WSL2 Ubuntu.

| Function | Median | p99 | CI |
|---|---|---|---|
| `get_price` | 10.57 ms | 100.022 ms | ±2.919 ms |
| `get_features` | 14.013 ms | 32.668 ms | ±2.102 ms |
| `get_signals_weekly` | 13.428 ms | 17.664 ms | ±0.177 ms |
| `get_signals_monthly` | 13.707 ms | 21.707 ms | ±1.279 ms |

`get_features_batch_1000`: total=15021.3 ms, valid=100.0%, 66.6 ts/sec.

p99 sits well above the median, most visibly for `get_price` (100.022 ms against a 10.57 ms median), because cold runs read Parquet files from disk. Once the OS page cache warms up, subsequent calls are much faster.
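
For reference, the median, p99, and CI columns can be derived from a list of per-call timings. The sketch below is one plausible way to do it, not necessarily what `run_benchmark.py` does: it assumes a nearest-rank p99 and a 95% normal-approximation interval on the mean, and `time_calls` and `summarize` are hypothetical names.

```python
import math
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], runs: int = 100) -> list[float]:
    """Call fn `runs` times and return each call's latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Median, nearest-rank p99, and assumed 95% CI half-width (needs >= 2 samples)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # nearest-rank percentile, 1-based
    half_width = 1.96 * statistics.stdev(ordered) / math.sqrt(len(ordered))
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[rank - 1],
        "ci_ms": half_width,
    }
```

Because the first calls include cold disk reads, a benchmark built this way will show them in the p99 figure unless warm-up calls are discarded first.
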
1. Options chain duplicate at 11:30 on Aug 27: The vendor sent 6 strikes twice at exactly 11:30, with slightly different prices between the two copies. On one row the second copy had a close above the high, which is impossible. Kept the first copy since the second was a corrupted correction.
2. Futures expiry label never rolled after Aug 28: On Sep 01 and Sep 02, `near_month_expiry` still showed 2025-08-28 even though that contract had already expired. Caught this by comparing the trading date against the expiry date. Flagged with `stale_expiry=True` so the strategy team knows not to use the field.
3. Idempotency fix: The first version did not sort rows before writing Parquet, so two runs produced different row orderings and their md5sum values differed. Fixed by sorting every DataFrame by its natural key before each write.
4. Pre-open snapshots: Initially filtered everything before 09:15 out of the warehouse, then realised the ML team might want these rows for opening-range features. Changed to retain them with an `is_pre_open` flag instead of dropping them.
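
The four decisions above reduce to a few small transformations. The following is a hedged sketch on plain dictionaries rather than the pipeline's DataFrame code. Field names other than `near_month_expiry`, `stale_expiry`, and `is_pre_open` are assumed, all function names are hypothetical, and `timestamp` is assumed to be a `datetime`.

```python
from datetime import date, datetime, time


def keep_first(rows: list[dict], key_fields: tuple[str, ...]) -> list[dict]:
    """Drop later rows that repeat an earlier row's natural key."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def flag_stale_expiry(row: dict, trading_date: date) -> dict:
    """Mark a futures row whose near-month expiry is before the trading date."""
    return {**row, "stale_expiry": row["near_month_expiry"] < trading_date}


def flag_pre_open(row: dict, market_open: time = time(9, 15)) -> dict:
    """Retain the row but tag snapshots taken before the 09:15 open."""
    return {**row, "is_pre_open": row["timestamp"].time() < market_open}


def sort_by_key(rows: list[dict], key_fields: tuple[str, ...]) -> list[dict]:
    """Sort by natural key so repeated runs write rows in the same order."""
    return sorted(rows, key=lambda r: tuple(r[f] for f in key_fields))


# Illustrative usage with made-up values.
snapshot = {"timestamp": datetime(2025, 8, 27, 9, 5), "strike": 24500, "side": "CE"}
print(flag_pre_open(snapshot)["is_pre_open"])  # True
```
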

    python validation/validator.py      # run validation standalone
    python benchmark/run_benchmark.py   # run latency benchmark
    python access.py                    # smoke test all 4 access functions
    python explore.py                   # raw data exploration
    python generate_readme.py           # regenerate this README

Built by Sarthak Naikare for Quant Singularity Data Engine internship screening, May 2026.






