Query optimization on a 1M-row financial dataset. The interesting story here isn't the wall-clock times — at this size everything already runs in under two seconds — it's what happens to logical reads. Three of the four queries drop their I/O by 75–100%.
I put this together because I wanted a clean, end-to-end example of what proper SQL tuning looks like outside a job description bullet point. After years of doing this kind of work at a bank, the part that always gets glossed over is the "before" state — most repos jump straight to the optimised version. Here I keep both side by side: the same queries run against the same data, once without any indexing strategy and once with covering indexes, filtered indexes, statistics rebuilt, and a couple of rewrites where the original SQL was blocking the optimiser from doing its job.
Everything runs locally in Docker, on an M1 Mac, with SQL Server 2022 (amd64 emulation). No cloud bill, no proprietary data, reproducible from a single seed.
Measured on M1 MacBook Pro with 8 GB allocated to Docker Desktop. Each query was warmed up once and then timed with SET STATISTICS IO, TIME ON.
| # | Query | Pattern | Time (before → after) | Logical reads (before → after) |
|---|---|---|---|---|
| Q1 | Monthly merchant revenue | GROUP BY + window function over 1M rows | 1,785 ms → 1,647 ms | 12,281 → 6,833 (-44%) |
| Q2 | Fraud exposure by customer | 3-table join + GROUP BY + HAVING | 36 ms → 17 ms (-53%) | 12,281 → 3,041 (-75%) |
| Q3 | Account activity ratio | Aggregation across dim_accounts | 53 ms → 21 ms (-60%) | 12,281 → 2,978 (-75%) |
| Q4 | High-value pending transactions | Low-cardinality filter on 1M rows | 21 ms → 0 ms | 12,281 → 3 (-99.97%) |
Two things worth calling out honestly:
- Q4 is the headline. The filtered index turns a full clustered scan into a 3-page seek. From the optimiser's point of view this query no longer touches the fact table at scale.
- Q1 only improves ~8% in wall time even though logical reads drop 44%. That tells you the bottleneck wasn't I/O — it's the window function CPU work. Dropping the reads still matters under load (less contention, less memory pressure), but I wanted to leave this in the README rather than fake a 95% number, because real tuning work looks like this: some wins are I/O wins, some wins are CPU wins, and not every query has a magic index.
Not going to list every line — the optimisation script is in sql/04_optimization.sql if you want the full set. The moves that mattered:
- Covering indexes with
INCLUDE— stops SQL Server from doing key lookups to fetch columns that weren't in the index key. Q2 and Q3 benefit the most here, because the dim joins are now a seek + included columns instead of a seek + lookup roundtrip. - Filtered index on
status = 'PENDING'— Q4 was scanning a million rows to find the handful that are pending. A filtered index turns it into an instant seek. This is the one I'd point to in an interview. - Killed
FORMAT()in GROUP BY —FORMAT(date, 'yyyy-MM')is a function on a column, which makes the predicate non-SARGable and prevents any index from being used. Replaced withYEAR()+MONTH()and the plan flipped from scan to seek. UPDATE STATISTICS ... WITH FULLSCAN— financial data is skewed (most transactions cluster around a handful of merchants and a few months). Default sampling underestimates this badly. A full scan gave the optimiser better row estimates, which fixed a bad join order on Q2.- Minor CTE push-down — moved a couple of
WHEREclauses into CTEs so the optimiser reduces cardinality before joining the dimensions. Doesn't always help, but cheap to try.
If you flip through the before and after execution plans in docs/, the difference is visually obvious — fat green arrows over the fact table become thin arrows, and key lookups disappear.
- Docker Desktop with ~8 GB RAM allocated
- Python 3.11+
sqlcmdfrommssql-tools18, or VS Code with themssqlextension
The container is defined in docker-compose.yml. Credentials live in .env (the file is gitignored — copy .env.example and set your own SA_PASSWORD).
docker-compose up -d
# wait ~30s for the engine to finish initialising, then verify
docker exec financial-sql-optimization-sql-server-1 \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$(grep SA_PASSWORD .env | cut -d= -f2)" -C \
-Q "SELECT @@VERSION"pip install -r requirements.txt
python python/generate_data.py --rows 1000000 --out /tmp/financial_dataAbout 30–40 seconds on M1. The generator is seeded (SEED=42) so anyone who clones the repo gets the same data and the same baselines.
Then copy the CSVs into the container so BULK INSERT can read them:
docker exec financial-sql-optimization-sql-server-1 mkdir -p /tmp/financial_data
docker cp /tmp/financial_data/. financial-sql-optimization-sql-server-1:/tmp/financial_data/SA_PASSWORD=$(grep SA_PASSWORD .env | cut -d= -f2)
cat sql/01_create_database.sql | docker exec -i financial-sql-optimization-sql-server-1 \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C
cat sql/02_populate_data.sql | docker exec -i financial-sql-optimization-sql-server-1 \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -Ccat sql/03_before_optimization.sql | docker exec -i financial-sql-optimization-sql-server-1 \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C \
| tee docs/baseline_before.txtSTATISTICS IO, TIME is enabled inside the script, so the output shows logical reads and elapsed milliseconds for each of Q1–Q4. The docs/baseline_before.txt file in this repo is the actual output from my run.
cat sql/04_optimization.sql | docker exec -i financial-sql-optimization-sql-server-1 \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -Ccat sql/03_before_optimization.sql | docker exec -i financial-sql-optimization-sql-server-1 \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C \
| tee docs/baseline_after.txtCompare docs/baseline_before.txt and docs/baseline_after.txt. The summary in this README was extracted straight from those files.
cat sql/05_data_integrity.sql | docker exec -i financial-sql-optimization-sql-server-1 \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -CWhat it checks: referential integrity (no orphan FKs), type compliance, the expected ~0.5% fraud rate, and outlier detection on transaction amounts and settlement delays.
A snapshot of the integrity report output is in docs/data_integrity.png.
+--------------------------------------------------------------+
| MacBook Pro M1 (ARM64) |
| |
| +------------------+ CSVs +-----------------------+ |
| | Python 3.11 | --------> | Docker (amd64 emul.) | |
| | Faker + Pandas | | SQL Server 2022 | |
| | generate_data.py | | | |
| +------------------+ | FinancialDB | |
| | +- dim_customers | |
| | +- dim_accounts | |
| | +- dim_merchants | |
| | +- fact_transactions | |
| +-----------------------+ |
+--------------------------------------------------------------+
A note on the stack choice: I started this on Azure SQL Edge (ARM64-native) because it was the cleaner option for M1, but Azure SQL Edge was unstable in practice and Microsoft has been winding it down. Switched to standard SQL Server 2022 running under --platform linux/amd64 emulation. T-SQL is identical, the emulation overhead is small on this workload, and I get the real product instead of a stripped-down edge variant.
dim_customers 500 rows dim_merchants 1,000 rows
+- customer_id PK +- merchant_id PK
+- full_name +- merchant_name
+- email +- category (20 types)
+- country_code +- country_code
+- risk_profile LOW | MEDIUM | HIGH
+- is_active
dim_accounts 987 rows
+- account_id PK
+- customer_id FK -> dim_customers
+- account_type CHECKING | SAVINGS | INVESTMENT | CREDIT
+- balance
fact_transactions 1,000,000 rows <- this is where the work happens
+- transaction_id PK
+- account_id FK + merchant_id FK
+- transaction_dt / settlement_dt
+- amount (lognormal distribution, max 1M, median ~150)
+- transaction_type DEBIT | CREDIT | TRANSFER | REFUND
+- status COMPLETED | PENDING | FAILED | REVERSED
+- channel ONLINE | POS | ATM | WIRE
+- fraud_flag ~0.5% of rows, intentionally skewed amounts
Screenshots of the actual plans (taken from VS Code with the mssql extension) are in docs/. The before/after pattern repeats across the four queries: clustered index scan with wide arrows over the fact table on the left, narrow seeks with no key lookups on the right.
docs/q1_plan_before.jpgdocs/q2_plan_before.jpgdocs/q3_plan_before.jpgdocs/q4_plan_before.jpg
The arrow width is the visual that tells you what's happening — it represents cardinality estimates, and after the indexes go in those estimates collapse from "1 million rows" to "a few hundred at most".
financial-sql-optimization/
├── docker-compose.yml SQL Server 2022 service definition
├── .env.example copy to .env, set SA_PASSWORD
├── .gitignore excludes .env, .DS_Store, .venv
├── requirements.txt Python dependencies (pinned)
├── CLAUDE.md engineering notes / project context
├── CONNECTION.md how to connect (sqlcmd, VS Code, DBeaver)
├── NEXT_STEPS.md setup-complete walkthrough
├── README.md this file
├── sql/
│ ├── 01_create_database.sql schema + constraints + FKs
│ ├── 02_populate_data.sql bulk load + audit log
│ ├── 03_before_optimization.sql baseline queries (Q1-Q4)
│ ├── 04_optimization.sql indexes, stats, rewrites
│ └── 05_data_integrity.sql quality + anomaly detection
├── python/
│ └── generate_data.py 1M-row synthetic data generator
└── docs/
├── baseline_before.txt actual sqlcmd output (pre-optim)
├── baseline_after.txt actual sqlcmd output (post-optim)
├── Ejecution_optimization.txt output of the optimisation step
├── compare_benchmarks.py helper to diff before/after
└── q{1..4}_plan_before.jpg execution plan screenshots
| Tool | Version | Why |
|---|---|---|
| SQL Server 2022 | 16.x (linux/amd64) | the real product, not the edge variant |
| Docker Desktop | 4.x | local, reproducible, isolated |
| Python | 3.11 | data generation |
| Pandas / NumPy | latest | vectorised generation |
| Faker | 24 | realistic names, emails, categories |
| VS Code + mssql ext. | latest | replaces Azure Data Studio (deprecated 2026-02-28) |
A couple of things I learnt along the way that aren't obvious from the script:
The --platform linux/amd64 flag is what makes SQL Server 2022 actually start on an M1 — without it the container exits silently. Once it's there, performance is fine for 1M rows but I wouldn't run a 100M-row benchmark on emulation.
UPDATE STATISTICS WITH FULLSCAN is the single most underused tool I've seen in production. Default sampling is fast, but on skewed financial distributions it produces estimates that are off by an order of magnitude, and that's how you end up with hash joins where you wanted nested loops. Running it once explicitly after a big load is cheap insurance.
The compare-benchmarks helper in docs/ has a small bug — it matches every elapsed time regex including parse/compile times, which inflates the "before" numbers. The numbers in this README come from grepping SQL Server Execution Times directly. I'll fix the script in a follow-up commit.
Part of a broader portfolio focused on SQL, Python, and production-grade data work in the financial sector.