Skip to content

ludinecg/financial-sql-optimization

Repository files navigation

financial-sql-optimization

SQL Server 2022 Python Docker Pandas ARM64 Status

Query optimization on a 1M-row financial dataset. The interesting story here isn't the wall-clock times — at this size everything already runs in under two seconds — it's what happens to logical reads. Three of the four queries drop their I/O by 75–100%.


I put this together because I wanted a clean, end-to-end example of what proper SQL tuning looks like outside a job description bullet point. After years of doing this kind of work at a bank, the part that always gets glossed over is the "before" state — most repos jump straight to the optimised version. Here I keep both side by side: the same queries run against the same data, once without any indexing strategy and once with covering indexes, filtered indexes, statistics rebuilt, and a couple of rewrites where the original SQL was blocking the optimiser from doing its job.

Everything runs locally in Docker, on an M1 Mac, with SQL Server 2022 (amd64 emulation). No cloud bill, no proprietary data, reproducible from a single seed.


Results

Measured on M1 MacBook Pro with 8 GB allocated to Docker Desktop. Each query was warmed up once and then timed with SET STATISTICS IO, TIME ON.

# Query Pattern Time (before → after) Logical reads (before → after)
Q1 Monthly merchant revenue GROUP BY + window function over 1M rows 1,785 ms → 1,647 ms 12,281 → 6,833 (-44%)
Q2 Fraud exposure by customer 3-table join + GROUP BY + HAVING 36 ms → 17 ms (-53%) 12,281 → 3,041 (-75%)
Q3 Account activity ratio Aggregation across dim_accounts 53 ms → 21 ms (-60%) 12,281 → 2,978 (-75%)
Q4 High-value pending transactions Low-cardinality filter on 1M rows 21 ms → 0 ms 12,281 → 3 (-99.97%)

Two things worth calling out honestly:

  • Q4 is the headline. The filtered index turns a full clustered scan into a 3-page seek. From the optimiser's point of view this query no longer touches the fact table at scale.
  • Q1 only improves ~8% in wall time even though logical reads drop 44%. That tells you the bottleneck wasn't I/O — it's the window function CPU work. Dropping the reads still matters under load (less contention, less memory pressure), but I wanted to leave this in the README rather than fake a 95% number, because real tuning work looks like this: some wins are I/O wins, some wins are CPU wins, and not every query has a magic index.

What I actually changed

Not going to list every line — the optimisation script is in sql/04_optimization.sql if you want the full set. The moves that mattered:

  • Covering indexes with INCLUDE — stops SQL Server from doing key lookups to fetch columns that weren't in the index key. Q2 and Q3 benefit the most here, because the dim joins are now a seek + included columns instead of a seek + lookup roundtrip.
  • Filtered index on status = 'PENDING' — Q4 was scanning a million rows to find the handful that are pending. A filtered index turns it into an instant seek. This is the one I'd point to in an interview.
  • Killed FORMAT() in GROUP BYFORMAT(date, 'yyyy-MM') is a function on a column, which makes the predicate non-SARGable and prevents any index from being used. Replaced with YEAR() + MONTH() and the plan flipped from scan to seek.
  • UPDATE STATISTICS ... WITH FULLSCAN — financial data is skewed (most transactions cluster around a handful of merchants and a few months). Default sampling underestimates this badly. A full scan gave the optimiser better row estimates, which fixed a bad join order on Q2.
  • Minor CTE push-down — moved a couple of WHERE clauses into CTEs so the optimiser reduces cardinality before joining the dimensions. Doesn't always help, but cheap to try.

If you flip through the before and after execution plans in docs/, the difference is visually obvious — fat green arrows over the fact table become thin arrows, and key lookups disappear.


How to run

Prerequisites

  • Docker Desktop with ~8 GB RAM allocated
  • Python 3.11+
  • sqlcmd from mssql-tools18, or VS Code with the mssql extension

1 — Bring up SQL Server 2022 in Docker

The container is defined in docker-compose.yml. Credentials live in .env (the file is gitignored — copy .env.example and set your own SA_PASSWORD).

docker-compose up -d

# wait ~30s for the engine to finish initialising, then verify
docker exec financial-sql-optimization-sql-server-1 \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$(grep SA_PASSWORD .env | cut -d= -f2)" -C \
  -Q "SELECT @@VERSION"

2 — Generate the synthetic dataset

pip install -r requirements.txt
python python/generate_data.py --rows 1000000 --out /tmp/financial_data

About 30–40 seconds on M1. The generator is seeded (SEED=42) so anyone who clones the repo gets the same data and the same baselines.

Then copy the CSVs into the container so BULK INSERT can read them:

docker exec financial-sql-optimization-sql-server-1 mkdir -p /tmp/financial_data
docker cp /tmp/financial_data/. financial-sql-optimization-sql-server-1:/tmp/financial_data/

3 — Build the schema and load the data

SA_PASSWORD=$(grep SA_PASSWORD .env | cut -d= -f2)

cat sql/01_create_database.sql | docker exec -i financial-sql-optimization-sql-server-1 \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C

cat sql/02_populate_data.sql  | docker exec -i financial-sql-optimization-sql-server-1 \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C

4 — Run the baseline (before optimisation)

cat sql/03_before_optimization.sql | docker exec -i financial-sql-optimization-sql-server-1 \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C \
  | tee docs/baseline_before.txt

STATISTICS IO, TIME is enabled inside the script, so the output shows logical reads and elapsed milliseconds for each of Q1–Q4. The docs/baseline_before.txt file in this repo is the actual output from my run.

5 — Apply the optimisations

cat sql/04_optimization.sql | docker exec -i financial-sql-optimization-sql-server-1 \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C

6 — Re-run the baseline to measure the change

cat sql/03_before_optimization.sql | docker exec -i financial-sql-optimization-sql-server-1 \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C \
  | tee docs/baseline_after.txt

Compare docs/baseline_before.txt and docs/baseline_after.txt. The summary in this README was extracted straight from those files.

7 — Data quality and integrity checks

cat sql/05_data_integrity.sql | docker exec -i financial-sql-optimization-sql-server-1 \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C

What it checks: referential integrity (no orphan FKs), type compliance, the expected ~0.5% fraud rate, and outlier detection on transaction amounts and settlement delays.

A snapshot of the integrity report output is in docs/data_integrity.png.


Architecture

+--------------------------------------------------------------+
|  MacBook Pro M1 (ARM64)                                      |
|                                                              |
|   +------------------+   CSVs    +-----------------------+   |
|   | Python 3.11      | --------> | Docker (amd64 emul.)  |   |
|   | Faker + Pandas   |           | SQL Server 2022       |   |
|   | generate_data.py |           |                       |   |
|   +------------------+           | FinancialDB           |   |
|                                  |  +- dim_customers     |   |
|                                  |  +- dim_accounts      |   |
|                                  |  +- dim_merchants     |   |
|                                  |  +- fact_transactions |   |
|                                  +-----------------------+   |
+--------------------------------------------------------------+

A note on the stack choice: I started this on Azure SQL Edge (ARM64-native) because it was the cleaner option for M1, but Azure SQL Edge was unstable in practice and Microsoft has been winding it down. Switched to standard SQL Server 2022 running under --platform linux/amd64 emulation. T-SQL is identical, the emulation overhead is small on this workload, and I get the real product instead of a stripped-down edge variant.


Dataset

dim_customers   500 rows               dim_merchants  1,000 rows
+- customer_id PK                      +- merchant_id PK
+- full_name                           +- merchant_name
+- email                               +- category (20 types)
+- country_code                        +- country_code
+- risk_profile  LOW | MEDIUM | HIGH
+- is_active

dim_accounts  987 rows
+- account_id PK
+- customer_id FK -> dim_customers
+- account_type  CHECKING | SAVINGS | INVESTMENT | CREDIT
+- balance

fact_transactions  1,000,000 rows    <- this is where the work happens
+- transaction_id PK
+- account_id FK + merchant_id FK
+- transaction_dt / settlement_dt
+- amount (lognormal distribution, max 1M, median ~150)
+- transaction_type  DEBIT | CREDIT | TRANSFER | REFUND
+- status  COMPLETED | PENDING | FAILED | REVERSED
+- channel  ONLINE | POS | ATM | WIRE
+- fraud_flag  ~0.5% of rows, intentionally skewed amounts

Execution plans

Screenshots of the actual plans (taken from VS Code with the mssql extension) are in docs/. The before/after pattern repeats across the four queries: clustered index scan with wide arrows over the fact table on the left, narrow seeks with no key lookups on the right.

  • docs/q1_plan_before.jpg
  • docs/q2_plan_before.jpg
  • docs/q3_plan_before.jpg
  • docs/q4_plan_before.jpg

The arrow width is the visual that tells you what's happening — it represents cardinality estimates, and after the indexes go in those estimates collapse from "1 million rows" to "a few hundred at most".


Project structure

financial-sql-optimization/
├── docker-compose.yml               SQL Server 2022 service definition
├── .env.example                     copy to .env, set SA_PASSWORD
├── .gitignore                       excludes .env, .DS_Store, .venv
├── requirements.txt                 Python dependencies (pinned)
├── CLAUDE.md                        engineering notes / project context
├── CONNECTION.md                    how to connect (sqlcmd, VS Code, DBeaver)
├── NEXT_STEPS.md                    setup-complete walkthrough
├── README.md                        this file
├── sql/
│   ├── 01_create_database.sql       schema + constraints + FKs
│   ├── 02_populate_data.sql         bulk load + audit log
│   ├── 03_before_optimization.sql   baseline queries (Q1-Q4)
│   ├── 04_optimization.sql          indexes, stats, rewrites
│   └── 05_data_integrity.sql        quality + anomaly detection
├── python/
│   └── generate_data.py             1M-row synthetic data generator
└── docs/
    ├── baseline_before.txt          actual sqlcmd output (pre-optim)
    ├── baseline_after.txt           actual sqlcmd output (post-optim)
    ├── Ejecution_optimization.txt   output of the optimisation step
    ├── compare_benchmarks.py        helper to diff before/after
    └── q{1..4}_plan_before.jpg      execution plan screenshots

Stack

Tool Version Why
SQL Server 2022 16.x (linux/amd64) the real product, not the edge variant
Docker Desktop 4.x local, reproducible, isolated
Python 3.11 data generation
Pandas / NumPy latest vectorised generation
Faker 24 realistic names, emails, categories
VS Code + mssql ext. latest replaces Azure Data Studio (deprecated 2026-02-28)

Notes

A couple of things I learnt along the way that aren't obvious from the script:

The --platform linux/amd64 flag is what makes SQL Server 2022 actually start on an M1 — without it the container exits silently. Once it's there, performance is fine for 1M rows but I wouldn't run a 100M-row benchmark on emulation.

UPDATE STATISTICS WITH FULLSCAN is the single most underused tool I've seen in production. Default sampling is fast, but on skewed financial distributions it produces estimates that are off by an order of magnitude, and that's how you end up with hash joins where you wanted nested loops. Running it once explicitly after a big load is cheap insurance.

The compare-benchmarks helper in docs/ has a small bug — it matches every elapsed time regex including parse/compile times, which inflates the "before" numbers. The numbers in this README come from grepping SQL Server Execution Times directly. I'll fix the script in a follow-up commit.


Part of a broader portfolio focused on SQL, Python, and production-grade data work in the financial sector.

About

End-to-end SQL Server 2022 tuning on a 1M-row financial dataset: indexes, query rewrites, data integrity. Logical reads cut 75-99% on three of four benchmark queries.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors