Merged
37 changes: 37 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
name: CI

on:
  push:
    branches:
      - "*"
  pull_request:
    branches:
      - "*"

jobs:
  lint-and-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -r requirements.txt
          python -m pip install ruff

      - name: Syntax check (compileall)
        run: python scripts/check_syntax.py

      - name: Ruff lint
        run: ruff check .

      - name: Run tests
        run: pytest -q
1 change: 1 addition & 0 deletions .gitignore
@@ -24,6 +24,7 @@ venv/
data/
outputs/
logs/
reports/

# OS / editor files
.DS_Store
169 changes: 139 additions & 30 deletions README.md
@@ -151,34 +151,112 @@ QLoRA/LoRA configuration, and troubleshooting (OOM, sequence length, etc.).

---

## Evaluation

The repository includes a lightweight but robust evaluation pipeline for
text-to-SQL:

- **Internal evaluation** on the processed `b-mc2/sql-create-context` val set.
- **Secondary external validation** on the Spider dev split.

See [`docs/evaluation.md`](./docs/evaluation.md) for full details. Below are the
most common commands.

### Internal Evaluation (b-mc2/sql-create-context val)

Mock mode (no model required, exercises metrics/reporting):

```bash
python scripts/evaluate_internal.py --mock \
--val_path data/processed/val.jsonl \
--out_dir reports/
```

With a trained QLoRA adapter (GPU recommended):

```bash
python scripts/evaluate_internal.py \
--val_path data/processed/val.jsonl \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--max_examples 200 \
--out_dir reports/
```

Outputs:

- `reports/eval_internal.json` – metrics, config, and sample predictions.
- `reports/eval_internal.md` – human-readable summary.
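The JSON report can also be consumed programmatically. Below is a minimal sketch, assuming the report stores its metrics under a top-level `metrics` key; the actual schema is whatever `scripts/evaluate_internal.py` writes, so treat the key names as assumptions:

```python
import json
from pathlib import Path


def summarize_report(path: str) -> dict:
    """Return the metrics dict from an eval report.

    Assumes a top-level 'metrics' key (an assumption here, not a
    documented contract of the evaluation scripts).
    """
    report = json.loads(Path(path).read_text())
    return report.get("metrics", {})


# Usage (after running the evaluation):
#   for name, value in sorted(summarize_report("reports/eval_internal.json").items()):
#       print(f"{name}: {value}")
```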

### External Validation (Spider dev)

Mock mode (offline fixtures only, no internet required):

```bash
python scripts/evaluate_spider_external.py --mock \
--out_dir reports/
```

Full Spider dev evaluation with a trained model (requires internet + HF Datasets):

```bash
python scripts/evaluate_spider_external.py \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--spider_source xlangai/spider \
--schema_source richardr1126/spider-schema \
--spider_split validation \
--max_examples 200 \
--out_dir reports/
```

Outputs:

- `reports/eval_spider.json` – metrics, config, and sample predictions.
- `reports/eval_spider.md` – human-readable summary, including notes on
differences from official Spider evaluation.

---

## External Validation (Spider dev)

After training on `b-mc2/sql-create-context`, we run a secondary evaluation
harness on the **Spider** dev set (e.g., `xlangai/spider`) to measure
generalization to harder, multi-table, cross-domain text-to-SQL tasks.

Spider evaluation uses a **lightweight EM-style** metric suite:

- Exact Match and No-values Exact Match on normalized SQL.
- SQL parse success using `sqlglot`.
- Schema adherence checks against serialized schemas from
`richardr1126/spider-schema` (licensed under **CC BY-SA 4.0**).

Spider and its schema helper are used **only for evaluation**, not for
training.
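To make the EM-style metrics concrete, here is a dependency-free sketch of normalized and "no-values" exact match. The repository's actual implementation (`src/text2sql/eval/normalize.py`) parses SQL with `sqlglot`; the regex-based literal masking below is an illustration only:

```python
import re


def normalize_sql(sql: str, strip_values: bool = False) -> str:
    """Lowercase, collapse whitespace, drop a trailing ';'; optionally mask literals."""
    s = sql.strip().rstrip(";")
    if strip_values:
        s = re.sub(r"'[^']*'", "'value'", s)    # mask string literals
        s = re.sub(r"\b\d+(\.\d+)?\b", "0", s)  # mask numeric literals
    return re.sub(r"\s+", " ", s).lower()


def exact_match(pred: str, gold: str, strip_values: bool = False) -> bool:
    """Exact Match on normalized SQL; strip_values=True gives No-values EM."""
    return normalize_sql(pred, strip_values) == normalize_sql(gold, strip_values)
```

For example, `exact_match("SELECT name FROM t WHERE id = 5", "select name from t where id = 7", strip_values=True)` is `True`: under no-values matching, only the query shape has to agree, not the literal values.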

For details, see [`docs/external_validation.md`](./docs/external_validation.md).

---

## Local Development & QA

For a quick local quality check before pushing changes, you can run:

```bash
# 1) Syntax validation across src/, scripts/, and app/
python scripts/check_syntax.py

# 2) Linting (requires ruff to be installed, e.g. `pip install ruff`)
ruff check .

# 3) Test suite (offline-friendly)
pytest -q
```

These commands are also wired into the CI workflow (`.github/workflows/ci.yml`).
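The source of `scripts/check_syntax.py` is not shown in this diff; a plausible minimal sketch of such a checker, built on the standard library's `compileall` (the directory names `src`, `scripts`, and `app` come from the comment above, everything else is an assumption):

```python
"""Sketch of a syntax checker: byte-compile sources and report any SyntaxError."""
import compileall
from pathlib import Path


def main() -> int:
    """Return 0 if every .py file under the target dirs compiles, 1 otherwise."""
    ok = True
    for target in ("src", "scripts", "app"):
        if Path(target).is_dir():
            # quiet=1 suppresses per-file listings but still prints errors;
            # compile_dir returns a falsy value if any file fails to compile.
            ok &= bool(compileall.compile_dir(target, quiet=1))
    return 0 if ok else 1
```

In the real script this would presumably be wired up as `sys.exit(main())` so CI fails on the first syntax error.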

## Demo (placeholder)

> The Streamlit UI will be documented here when implemented.
@@ -200,24 +278,55 @@ Current high-level layout:

```text
.
├── app/ # Streamlit app (to be implemented)
├── docs/ # Documentation, design notes, evaluation reports
│ ├── dataset.md
│ ├── training.md
│ ├── evaluation.md
│ └── external_validation.md
├── notebooks/ # Jupyter/Colab notebooks for experimentation
├── scripts/ # CLI scripts (dataset, training, evaluation)
│ ├── build_dataset.py
│ ├── smoke_load_dataset.py
│ ├── train_qlora.py
│ ├── evaluate_internal.py
│ └── evaluate_spider_external.py
├── src/
│ └── text2sql/ # Core Python package
│ ├── __init__.py
│ ├── data_prep.py
│ ├── infer.py
│ ├── training/
│ │ ├── __init__.py
│ │ ├── config.py
│ │ └── formatting.py
│ └── eval/
│ ├── __init__.py
│ ├── normalize.py
│ ├── schema.py
│ ├── metrics.py
│ └── spider.py
├── tests/
│ ├── fixtures/
│ │ ├── sql_create_context_sample.jsonl
│ │ ├── eval_internal_sample.jsonl
│ │ ├── spider_sample.jsonl
│ │ └── spider_schema_sample.jsonl
│ ├── test_repo_smoke.py
│ ├── test_build_dataset_offline.py
│ ├── test_data_prep.py
│ ├── test_prompt_formatting.py
│ ├── test_normalize_sql.py
│ ├── test_schema_adherence.py
│ ├── test_metrics_aggregate.py
│ └── test_prompt_building_spider.py
├── .env.example # Example environment file
├── .gitignore
├── context.md # Persistent project context & decisions
├── LICENSE
├── README.md
└── requirements.txt
```

As the project progresses, this structure will be refined and additional modules,
scripts, and documentation will be added.
4 changes: 3 additions & 1 deletion context.md
@@ -119,6 +119,7 @@ This repo will contain:
- **2026-01-10** – Decided to create our own deterministic validation split (default 8% of the data, seed=42) from the single `train` split shipped with `b-mc2/sql-create-context`, to enable reproducible model selection and early-stopping.
- **2026-01-10** – Selected **`mistralai/Mistral-7B-Instruct-v0.1`** as the base model for fine-tuning, using **QLoRA (4-bit) + LoRA adapters** implemented via **Unsloth + bitsandbytes** for efficient training on a single GPU.
- **2026-01-10** – Planned a **secondary external validation** step on **Spider dev** (e.g., `xlangai/spider`) after primary training on `b-mc2/sql-create-context`, to measure cross-domain, multi-table generalization.
- **2026-01-10** – Implemented a dedicated evaluation pipeline (internal + Spider dev) using normalized SQL metrics, schema adherence checks, and lightweight external validation based on `xlangai/spider` and `richardr1126/spider-schema` (Spider used only for evaluation, not training).

---

@@ -132,4 +133,5 @@ This repo will contain:
- Added basic pytest smoke test to verify that the `text2sql` package imports successfully.
- **2026-01-10** – Updated dataset plan and smoke loader to use the parquet-backed **`b-mc2/sql-create-context`** dataset (compatible with `datasets>=4`) instead of the script-based `Salesforce/wikisql`, and documented this decision in the project context.
- **2026-01-10** – Added a dataset preprocessing pipeline (`scripts/build_dataset.py`) that converts `b-mc2/sql-create-context` into Alpaca-style instruction-tuning JSONL files under `data/processed/` (train/val splits), along with reusable formatting utilities in `text2sql.data_prep`.
- **2026-01-10** – Added QLoRA training scaffolding: a detailed Colab-friendly notebook (`notebooks/finetune_mistral7b_qlora_text2sql.ipynb`), a reproducible training script (`scripts/train_qlora.py`), training utilities under `src/text2sql/training/`, and documentation for training (`docs/training.md`) plus planned external validation on Spider dev (`docs/external_validation.md`).
- **2026-01-10** – Task 4: Added an evaluation pipeline with internal metrics on `b-mc2/sql-create-context` (Exact Match, No-values EM, SQL parse success, schema adherence) and a secondary external validation harness on Spider dev using `xlangai/spider` and `richardr1126/spider-schema`, along with reports under `reports/` and supporting documentation (`docs/evaluation.md`).