Merged
37 changes: 37 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
name: CI

on:
  push:
    branches:
      - "*"
  pull_request:
    branches:
      - "*"

jobs:
  lint-and-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -r requirements.txt
          python -m pip install ruff

      - name: Syntax check (compileall)
        run: python scripts/check_syntax.py

      - name: Ruff lint
        run: ruff check .

      - name: Run tests
        run: pytest -q
1 change: 1 addition & 0 deletions .gitignore
@@ -24,6 +24,7 @@ venv/
data/
outputs/
logs/
reports/

# OS / editor files
.DS_Store
169 changes: 139 additions & 30 deletions README.md
@@ -151,34 +151,112 @@ QLoRA/LoRA configuration, and troubleshooting (OOM, sequence length, etc.).

---

## Evaluation

The repository includes a lightweight but robust evaluation pipeline for
text-to-SQL:

- **Internal evaluation** on the processed `b-mc2/sql-create-context` val set.
- **Secondary external validation** on the Spider dev split.

See [`docs/evaluation.md`](./docs/evaluation.md) for full details. Below are the
most common commands.

### Internal Evaluation (b-mc2/sql-create-context val)

Mock mode (no model required, exercises metrics/reporting):

```bash
python scripts/evaluate_internal.py --mock \
--val_path data/processed/val.jsonl \
--out_dir reports/
```

With a trained QLoRA adapter (GPU recommended):

```bash
python scripts/evaluate_internal.py \
--val_path data/processed/val.jsonl \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--max_examples 200 \
--out_dir reports/
```

Outputs:

- `reports/eval_internal.json` – metrics, config, and sample predictions.
- `reports/eval_internal.md` – human-readable summary.
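The JSON report can also be consumed programmatically. Below is a minimal sketch, assuming the report stores its metrics under a top-level `metrics` key; the actual schema is whatever `scripts/evaluate_internal.py` writes, so treat the key names as assumptions:

```python
import json
from pathlib import Path


def summarize_report(path: str) -> dict:
    """Return the metrics dict from an eval report.

    Assumes a top-level 'metrics' key (an assumption here, not a
    documented contract of the evaluation scripts).
    """
    report = json.loads(Path(path).read_text())
    return report.get("metrics", {})


# Usage (after running the evaluation):
#   for name, value in sorted(summarize_report("reports/eval_internal.json").items()):
#       print(f"{name}: {value}")
```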

### External Validation (Spider dev)

Mock mode (offline fixtures only, no internet required):

```bash
python scripts/evaluate_spider_external.py --mock \
--out_dir reports/
```

Full Spider dev evaluation with a trained model (requires internet + HF Datasets):

```bash
python scripts/evaluate_spider_external.py \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--spider_source xlangai/spider \
--schema_source richardr1126/spider-schema \
--spider_split validation \
--max_examples 200 \
--out_dir reports/
```

Outputs:

- `reports/eval_spider.json` – metrics, config, and sample predictions.
- `reports/eval_spider.md` – human-readable summary, including notes on
differences from official Spider evaluation.

---

## External Validation (Spider dev)

After training on `b-mc2/sql-create-context`, we run a secondary evaluation
harness on the **Spider** dev set (e.g., `xlangai/spider`) to measure
generalization to harder, multi-table, cross-domain text-to-SQL tasks.

Spider evaluation uses a **lightweight EM-style** metric suite:

- Exact Match and No-values Exact Match on normalized SQL.
- SQL parse success using `sqlglot`.
- Schema adherence checks against serialized schemas from
`richardr1126/spider-schema` (licensed under **CC BY-SA 4.0**).

Spider and its schema helper are used **only for evaluation**, not for
training.
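To make the EM-style metrics concrete, here is a dependency-free sketch of normalized and "no-values" exact match. The repository's actual implementation (`src/text2sql/eval/normalize.py`) parses SQL with `sqlglot`; the regex-based literal masking below is an illustration only:

```python
import re


def normalize_sql(sql: str, strip_values: bool = False) -> str:
    """Lowercase, collapse whitespace, drop a trailing ';'; optionally mask literals."""
    s = sql.strip().rstrip(";")
    if strip_values:
        s = re.sub(r"'[^']*'", "'value'", s)    # mask string literals
        s = re.sub(r"\b\d+(\.\d+)?\b", "0", s)  # mask numeric literals
    return re.sub(r"\s+", " ", s).lower()


def exact_match(pred: str, gold: str, strip_values: bool = False) -> bool:
    """Exact Match on normalized SQL; strip_values=True gives No-values EM."""
    return normalize_sql(pred, strip_values) == normalize_sql(gold, strip_values)
```

For example, `exact_match("SELECT name FROM t WHERE id = 5", "select name from t where id = 7", strip_values=True)` is `True`: under no-values matching, only the query shape has to agree, not the literal values.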

For details, see [`docs/external_validation.md`](./docs/external_validation.md).

---

## Local Development & QA

For a quick local quality check before pushing changes, you can run:

```bash
# 1) Syntax validation across src/, scripts/, and app/
python scripts/check_syntax.py

# 2) Linting (requires ruff to be installed, e.g. `pip install ruff`)
ruff check .

# 3) Test suite (offline-friendly)
pytest -q
```

These commands are also wired into the CI workflow (`.github/workflows/ci.yml`).
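The source of `scripts/check_syntax.py` is not shown in this diff; a plausible minimal sketch of such a checker, built on the standard library's `compileall` (the directory names `src`, `scripts`, and `app` come from the comment above, everything else is an assumption):

```python
"""Sketch of a syntax checker: byte-compile sources and report any SyntaxError."""
import compileall
from pathlib import Path


def main() -> int:
    """Return 0 if every .py file under the target dirs compiles, 1 otherwise."""
    ok = True
    for target in ("src", "scripts", "app"):
        if Path(target).is_dir():
            # quiet=1 suppresses per-file listings but still prints errors;
            # compile_dir returns a falsy value if any file fails to compile.
            ok &= bool(compileall.compile_dir(target, quiet=1))
    return 0 if ok else 1
```

In the real script this would presumably be wired up as `sys.exit(main())` so CI fails on the first syntax error.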

## Demo (placeholder)

> The Streamlit UI will be documented here when implemented.
@@ -200,24 +278,55 @@ Current high-level layout:

```text
.
├── app/ # Streamlit app (to be implemented)
├── docs/ # Documentation, design notes, evaluation reports
│ ├── dataset.md
│ ├── training.md
│ ├── evaluation.md
│ └── external_validation.md
├── notebooks/ # Jupyter/Colab notebooks for experimentation
├── scripts/ # CLI scripts (dataset, training, evaluation)
│ ├── build_dataset.py
│ ├── smoke_load_dataset.py
│ ├── train_qlora.py
│ ├── evaluate_internal.py
│ └── evaluate_spider_external.py
├── src/
│ └── text2sql/ # Core Python package
│ ├── __init__.py
│ ├── data_prep.py
│ ├── infer.py
│ ├── training/
│ │ ├── __init__.py
│ │ ├── config.py
│ │ └── formatting.py
│ └── eval/
│ ├── __init__.py
│ ├── normalize.py
│ ├── schema.py
│ ├── metrics.py
│ └── spider.py
├── tests/
│ ├── fixtures/
│ │ ├── sql_create_context_sample.jsonl
│ │ ├── eval_internal_sample.jsonl
│ │ ├── spider_sample.jsonl
│ │ └── spider_schema_sample.jsonl
│ ├── test_repo_smoke.py
│ ├── test_build_dataset_offline.py
│ ├── test_data_prep.py
│ ├── test_prompt_formatting.py
│ ├── test_normalize_sql.py
│ ├── test_schema_adherence.py
│ ├── test_metrics_aggregate.py
│ └── test_prompt_building_spider.py
├── .env.example # Example environment file
├── .gitignore
├── context.md # Persistent project context & decisions
├── LICENSE
├── README.md
└── requirements.txt
```

As the project progresses, this structure will be refined and additional modules,
scripts, and documentation will be added.
4 changes: 3 additions & 1 deletion context.md
@@ -119,6 +119,7 @@ This repo will contain:
- **2026-01-10** – Decided to create our own deterministic validation split (default 8% of the data, seed=42) from the single `train` split shipped with `b-mc2/sql-create-context`, to enable reproducible model selection and early-stopping.
- **2026-01-10** – Selected **`mistralai/Mistral-7B-Instruct-v0.1`** as the base model for fine-tuning, using **QLoRA (4-bit) + LoRA adapters** implemented via **Unsloth + bitsandbytes** for efficient training on a single GPU.
- **2026-01-10** – Planned a **secondary external validation** step on **Spider dev** (e.g., `xlangai/spider`) after primary training on `b-mc2/sql-create-context`, to measure cross-domain, multi-table generalization.
- **2026-01-10** – Implemented a dedicated evaluation pipeline (internal + Spider dev) using normalized SQL metrics, schema adherence checks, and lightweight external validation based on `xlangai/spider` and `richardr1126/spider-schema` (Spider used only for evaluation, not training).

---

@@ -132,4 +133,5 @@ This repo will contain:
- Added basic pytest smoke test to verify that the `text2sql` package imports successfully.
- **2026-01-10** – Updated dataset plan and smoke loader to use the parquet-backed **`b-mc2/sql-create-context`** dataset (compatible with `datasets>=4`) instead of the script-based `Salesforce/wikisql`, and documented this decision in the project context.
- **2026-01-10** – Added a dataset preprocessing pipeline (`scripts/build_dataset.py`) that converts `b-mc2/sql-create-context` into Alpaca-style instruction-tuning JSONL files under `data/processed/` (train/val splits), along with reusable formatting utilities in `text2sql.data_prep`.
- **2026-01-10** – Added QLoRA training scaffolding: a detailed Colab-friendly notebook (`notebooks/finetune_mistral7b_qlora_text2sql.ipynb`), a reproducible training script (`scripts/train_qlora.py`), training utilities under `src/text2sql/training/`, and documentation for training (`docs/training.md`) plus planned external validation on Spider dev (`docs/external_validation.md`).
- **2026-01-10** – Task 4: Added an evaluation pipeline with internal metrics on `b-mc2/sql-create-context` (Exact Match, No-values EM, SQL parse success, schema adherence) and a secondary external validation harness on Spider dev using `xlangai/spider` and `richardr1126/spider-schema`, along with reports under `reports/` and supporting documentation (`docs/evaluation.md`).