
Add evaluation pipeline + Spider external validation#4

Merged
brej-29 merged 5 commits into main from
cosine/feat/eval-pipeline-spider-1
Jan 10, 2026

Conversation


@brej-29 brej-29 commented Jan 10, 2026

This PR introduces a full evaluation pipeline and a new Spider-based external validation step, including lightweight metrics, offline tests, and reports.

What’s included

  • Evaluation scripts

    • scripts/evaluate_internal.py
      • Internal evaluation on data/processed/val.jsonl with EM, No-values EM, parse success, and schema adherence.
      • Supports mock mode (--mock) for offline testing using gold SQL as predictions.
      • Outputs JSON and Markdown reports under reports/ (eval_internal.json and eval_internal.md) with 10 example previews.
    • scripts/evaluate_spider_external.py
      • Secondary external validation on Spider dev (mock mode via local fixtures, or real mode via Hugging Face datasets).
      • Builds prompts using Spider-specific schema contexts and the same internal evaluation metrics (EM, No-values EM, parse success, schema adherence).
      • Outputs reports under reports/ (eval_spider.json and eval_spider.md) with 10 example previews.
  • Evaluation library (src/text2sql/eval)

    • normalize.py
      • normalize_sql(sql) and normalize_sql_no_values(sql) with placeholders for strings/numbers.
    • schema.py
      • parse_create_table_context(context) -> {tables, columns_by_table}
      • referenced_identifiers(sql) -> {tables, columns}
      • schema_adherence(sql, context) -> bool
    • metrics.py
      • exact_match(pred, gold) and aggregate_metrics(predictions, golds, contexts=None, compute_schema_adherence=False)
    • __init__.py exports for ease of import
    • spider.py
      • build_spider_prompt(schema_context, question)
      • build_schema_map(records) -> {db_id: schema_context}
  • Inference wrapper (src/text2sql/infer.py)

    • load_model_for_inference(base_model, adapter_dir=None, device="auto")
    • generate_sql(prompt, max_new_tokens, temperature, top_p)
    • Supports LoRA adapters (PEFT) and merged base models; handles CPU/GPU fallback with logging.
  • Spider evaluation helpers (src/text2sql/eval/spider.py)

    • build_spider_prompt and schema_map utilities used by external evaluation
  • Tests and fixtures

    • tests/fixtures/eval_internal_sample.jsonl
    • tests/fixtures/spider_sample.jsonl
    • tests/fixtures/spider_schema_sample.jsonl
    • tests for normalization, schema parsing, metrics aggregation, and Spider prompt building (pytest-based tests)
  • Documentation

    • docs/evaluation.md draft explaining internal and Spider external evaluation, metrics, and limitations
    • docs/external_validation.md updated to reflect the Task 4 implementation and licensing notes (CC BY-SA 4.0 for Spider schemas)
    • README.md and context.md updated to include Evaluation commands and a decision log about the Spider external validation
  • Outputs and structure

    • reports/ directory added to store evaluation artifacts (markdown + json reports)
    • Fixture-based tests designed to run without internet access, satisfying the no-internet requirement
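To make the metric definitions above concrete, here is a minimal sketch of how `normalize_sql`, `normalize_sql_no_values`, and `exact_match` from `src/text2sql/eval` might fit together. The function names come from this PR; the implementations are illustrative placeholders, not the actual code.

```python
import re

def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    sql = sql.strip().rstrip(";").strip()
    return re.sub(r"\s+", " ", sql).lower()

def normalize_sql_no_values(sql: str) -> str:
    """Normalize, then replace string/number literals with placeholders."""
    sql = normalize_sql(sql)
    sql = re.sub(r"'[^']*'", "'<str>'", sql)          # string literals
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "<num>", sql)  # numeric literals
    return sql

def exact_match(pred: str, gold: str) -> bool:
    """EM after normalization; No-values EM would compare
    normalize_sql_no_values outputs instead."""
    return normalize_sql(pred) == normalize_sql(gold)
```

In mock mode, gold SQL is fed back as the prediction, so both EM variants should report 1.0, which is a cheap sanity check on the pipeline itself.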

How to run (local, mock and real modes)

  • Internal evaluation (mock, offline):
    python scripts/evaluate_internal.py --mock --val_path data/processed/val.jsonl --out_dir reports/

  • Internal evaluation (real model, with adapters):
    python scripts/evaluate_internal.py \
      --val_path data/processed/val.jsonl \
      --base_model mistralai/Mistral-7B-Instruct-v0.1 \
      --adapter_dir /path/to/outputs/adapters \
      --device auto \
      --max_examples 200 \
      --temperature 0.0 \
      --top_p 0.9 \
      --max_new_tokens 256 \
      --out_dir reports/

  • Spider external evaluation (mock offline):
    python scripts/evaluate_spider_external.py --mock --out_dir reports/

  • Spider external evaluation (real, with adapters):
    python scripts/evaluate_spider_external.py \
      --base_model mistralai/Mistral-7B-Instruct-v0.1 \
      --adapter_dir /path/to/outputs/adapters \
      --device auto \
      --spider_source xlangai/spider \
      --schema_source richardr1126/spider-schema \
      --spider_split validation \
      --max_examples 200 \
      --out_dir reports/
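The Spider path builds a prompt per example from the db_id's schema context. As a rough sketch of `build_spider_prompt` and `build_schema_map` (the names are from the PR; the prompt template and record field names below are assumptions):

```python
def build_spider_prompt(schema_context: str, question: str) -> str:
    """Compose an inference prompt from a schema context and a question.
    The section headers here are illustrative, not the PR's actual template."""
    return (
        "### Schema:\n" + schema_context + "\n\n"
        "### Question:\n" + question + "\n\n"
        "### SQL:\n"
    )

def build_schema_map(records):
    """Map db_id -> serialized schema context, assuming each record
    carries 'db_id' and 'schema_context' keys (hypothetical field names)."""
    return {r["db_id"]: r["schema_context"] for r in records}
```

Ending the prompt at "### SQL:" lets the model's continuation be taken directly as the predicted query before normalization.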

Notes

  • Tests are offline-friendly and rely on fixtures; CI should pass without internet.
  • Real evaluation assumes access to the model and adapters and will produce both JSON metrics and human-readable Markdown reports for inclusion in portfolios.
  • The Spider evaluation uses lightweight schema-adherence checks and normalized EM metrics as a practical external validation, rather than a full Spider benchmark reproduction.
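A rough sketch of what such a lightweight schema-adherence check can look like follows. The signatures match the `schema.py` listing above, but the parsing below is illustrative and far looser than a real SQL parser (it will mishandle quoted identifiers, nested parentheses such as `DECIMAL(10,2)`, and aliases):

```python
import re

# SQL keywords/functions to ignore when collecting identifiers (deliberately incomplete).
_KEYWORDS = {
    "select", "from", "where", "and", "or", "not", "join", "on", "as",
    "group", "by", "order", "having", "limit", "distinct", "count",
    "avg", "sum", "min", "max", "asc", "desc",
}

def parse_create_table_context(context: str) -> dict:
    """Pull table names and first-token column names out of CREATE TABLE text."""
    tables, columns_by_table = [], {}
    for name, body in re.findall(
        r"CREATE\s+TABLE\s+(\w+)\s*\((.*?)\)", context, flags=re.I | re.S
    ):
        cols = [part.strip().split()[0].lower()
                for part in body.split(",") if part.strip()]
        tables.append(name.lower())
        columns_by_table[name.lower()] = cols
    return {"tables": tables, "columns_by_table": columns_by_table}

def schema_adherence(sql: str, context: str) -> bool:
    """True if every non-keyword identifier in the SQL exists in the schema."""
    schema = parse_create_table_context(context)
    known = set(schema["tables"])
    for cols in schema["columns_by_table"].values():
        known.update(cols)
    idents = {tok.lower() for tok in re.findall(r"[A-Za-z_]\w*", sql)}
    return (idents - _KEYWORDS) <= known
```

A check like this catches hallucinated tables and columns cheaply, which is the failure mode that matters most for out-of-distribution schemas; it says nothing about execution correctness.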

Deliverable

  • A new branch with a new PR titled: "Add evaluation pipeline + Spider external validation" implementing the end-to-end evaluation workflow and tests as described above.

This pull request was co-created with Cosine Genie

Original Task: analytics-copilot-text2sql/zv2rmffkdrch
Author: Brejesh Balakrishnan

brej-29 and others added 5 commits January 10, 2026 14:50
…ties for Text-to-SQL

Co-authored-by: Cosine <agent@cosine.sh>
… examples with schemas; update tests to use load_spider_schema_map

Co-authored-by: Cosine <agent@cosine.sh>
…-dtype) across evaluators; wire through load_model_for_inference; include tests and docs updates

Co-authored-by: Cosine <agent@cosine.sh>
…ct methods in eval configs

Co-authored-by: Cosine <agent@cosine.sh>
…sses

Co-authored-by: Cosine <agent@cosine.sh>
@brej-29 brej-29 merged commit 534418d into main Jan 10, 2026
1 check failed
