
Add evaluation pipeline + Spider external validation#4

Merged
brej-29 merged 5 commits into main from
cosine/feat/eval-pipeline-spider-1
Jan 10, 2026

Conversation


@brej-29 brej-29 commented Jan 10, 2026

This PR introduces a full evaluation pipeline and a new Spider-based external validation step, including lightweight metrics, offline tests, and reports.

What’s included

  • Evaluation scripts

    • scripts/evaluate_internal.py
      • Internal evaluation on data/processed/val.jsonl with EM, No-values EM, parse success, and schema adherence.
      • Supports mock mode (--mock) for offline testing using gold SQL as predictions.
      • Outputs JSON and Markdown reports under reports/ (eval_internal.json and eval_internal.md) with 10 example previews.
    • scripts/evaluate_spider_external.py
      • Secondary external validation on Spider dev (mock mode via local fixtures, or real mode via Hugging Face datasets).
      • Builds prompts using Spider-specific schema contexts and the same internal evaluation metrics (EM, No-values EM, parse success, schema adherence).
      • Outputs reports under reports/ (eval_spider.json and eval_spider.md) with 10 example previews.
  • Evaluation library (src/text2sql/eval)

    • normalize.py
      • normalize_sql(sql) and normalize_sql_no_values(sql) with placeholders for strings/numbers.
    • schema.py
      • parse_create_table_context(context) -> {tables, columns_by_table}
      • referenced_identifiers(sql) -> {tables, columns}
      • schema_adherence(sql, context) -> bool
    • metrics.py
      • exact_match(pred, gold) and aggregate_metrics(predictions, golds, contexts=None, compute_schema_adherence=False)
    • __init__.py exports for ease of import
    • spider.py
      • build_spider_prompt(schema_context, question)
      • build_schema_map(records) -> {db_id: schema_context}
  • Inference wrapper (src/text2sql/infer.py)

    • load_model_for_inference(base_model, adapter_dir=None, device="auto")
    • generate_sql(prompt, max_new_tokens, temperature, top_p)
    • Supports LoRA adapters (PEFT) and merged base models; handles CPU/GPU fallback with logging.
  • Spider evaluation helpers (src/text2sql/eval/spider.py)

    • build_spider_prompt and schema_map utilities used by external evaluation
  • Tests and fixtures

    • tests/fixtures/eval_internal_sample.jsonl
    • tests/fixtures/spider_sample.jsonl
    • tests/fixtures/spider_schema_sample.jsonl
    • tests for normalization, schema parsing, metrics aggregation, and Spider prompt building (pytest-based tests)
  • Documentation

    • docs/evaluation.md draft explaining internal and Spider external evaluation, metrics, and limitations
    • docs/external_validation.md updated to reflect the Task 4 implementation and licensing notes (CC BY-SA 4.0 for Spider schemas)
    • README.md and context.md updated to include Evaluation commands and a decision log about the Spider external validation
  • Outputs and structure

    • reports/ directory added to store evaluation artifacts (markdown + json reports)
    • Fixture-based tests designed to run without internet access, satisfying the no-internet requirement
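To make the metric definitions above concrete, here is a minimal sketch of how `normalize_sql`, `normalize_sql_no_values`, and `exact_match` from `src/text2sql/eval` might fit together. The function names come from this PR; the implementations are illustrative placeholders, not the actual code.

```python
import re

def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    sql = sql.strip().rstrip(";").strip()
    return re.sub(r"\s+", " ", sql).lower()

def normalize_sql_no_values(sql: str) -> str:
    """Normalize, then replace string/number literals with placeholders."""
    sql = normalize_sql(sql)
    sql = re.sub(r"'[^']*'", "'<str>'", sql)          # string literals
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "<num>", sql)  # numeric literals
    return sql

def exact_match(pred: str, gold: str) -> bool:
    """EM after normalization; No-values EM would compare
    normalize_sql_no_values outputs instead."""
    return normalize_sql(pred) == normalize_sql(gold)
```

In mock mode, gold SQL is fed back as the prediction, so both EM variants should report 1.0, which is a cheap sanity check on the pipeline itself.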

How to run (local, mock and real modes)

  • Internal evaluation (mock, offline):
    python scripts/evaluate_internal.py --mock --val_path data/processed/val.jsonl --out_dir reports/

  • Internal evaluation (real model, with adapters):
    python scripts/evaluate_internal.py \
      --val_path data/processed/val.jsonl \
      --base_model mistralai/Mistral-7B-Instruct-v0.1 \
      --adapter_dir /path/to/outputs/adapters \
      --device auto \
      --max_examples 200 \
      --temperature 0.0 \
      --top_p 0.9 \
      --max_new_tokens 256 \
      --out_dir reports/

  • Spider external evaluation (mock offline):
    python scripts/evaluate_spider_external.py --mock --out_dir reports/

  • Spider external evaluation (real, with adapters):
    python scripts/evaluate_spider_external.py \
      --base_model mistralai/Mistral-7B-Instruct-v0.1 \
      --adapter_dir /path/to/outputs/adapters \
      --device auto \
      --spider_source xlangai/spider \
      --schema_source richardr1126/spider-schema \
      --spider_split validation \
      --max_examples 200 \
      --out_dir reports/
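The Spider path builds a prompt per example from the db_id's schema context. As a rough sketch of `build_spider_prompt` and `build_schema_map` (the names are from the PR; the prompt template and record field names below are assumptions):

```python
def build_spider_prompt(schema_context: str, question: str) -> str:
    """Compose an inference prompt from a schema context and a question.
    The section headers here are illustrative, not the PR's actual template."""
    return (
        "### Schema:\n" + schema_context + "\n\n"
        "### Question:\n" + question + "\n\n"
        "### SQL:\n"
    )

def build_schema_map(records):
    """Map db_id -> serialized schema context, assuming each record
    carries 'db_id' and 'schema_context' keys (hypothetical field names)."""
    return {r["db_id"]: r["schema_context"] for r in records}
```

Ending the prompt at "### SQL:" lets the model's continuation be taken directly as the predicted query before normalization.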

Notes

  • Tests are offline-friendly and rely on fixtures; CI should pass without internet.
  • Real evaluation assumes access to the model and adapters and will produce both JSON metrics and human-readable Markdown reports for inclusion in portfolios.
  • The Spider evaluation uses lightweight schema-adherence checks and normalized EM metrics as a practical external validation, rather than a full Spider benchmark reproduction.
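A rough sketch of what such a lightweight schema-adherence check can look like follows. The signatures match the `schema.py` listing above, but the parsing below is illustrative and far looser than a real SQL parser (it will mishandle quoted identifiers, nested parentheses such as `DECIMAL(10,2)`, and aliases):

```python
import re

# SQL keywords/functions to ignore when collecting identifiers (deliberately incomplete).
_KEYWORDS = {
    "select", "from", "where", "and", "or", "not", "join", "on", "as",
    "group", "by", "order", "having", "limit", "distinct", "count",
    "avg", "sum", "min", "max", "asc", "desc",
}

def parse_create_table_context(context: str) -> dict:
    """Pull table names and first-token column names out of CREATE TABLE text."""
    tables, columns_by_table = [], {}
    for name, body in re.findall(
        r"CREATE\s+TABLE\s+(\w+)\s*\((.*?)\)", context, flags=re.I | re.S
    ):
        cols = [part.strip().split()[0].lower()
                for part in body.split(",") if part.strip()]
        tables.append(name.lower())
        columns_by_table[name.lower()] = cols
    return {"tables": tables, "columns_by_table": columns_by_table}

def schema_adherence(sql: str, context: str) -> bool:
    """True if every non-keyword identifier in the SQL exists in the schema."""
    schema = parse_create_table_context(context)
    known = set(schema["tables"])
    for cols in schema["columns_by_table"].values():
        known.update(cols)
    idents = {tok.lower() for tok in re.findall(r"[A-Za-z_]\w*", sql)}
    return (idents - _KEYWORDS) <= known
```

A check like this catches hallucinated tables and columns cheaply, which is the failure mode that matters most for out-of-distribution schemas; it says nothing about execution correctness.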

Deliverable

  • A new branch with a new PR titled: "Add evaluation pipeline + Spider external validation" implementing the end-to-end evaluation workflow and tests as described above.

This pull request was co-created with Cosine Genie

Original Task: analytics-copilot-text2sql/zv2rmffkdrch
Author: Brejesh Balakrishnan

brej-29 and others added 5 commits January 10, 2026 14:50
…ties for Text-to-SQL

Co-authored-by: Cosine <agent@cosine.sh>
… examples with schemas; update tests to use load_spider_schema_map

Co-authored-by: Cosine <agent@cosine.sh>
…-dtype) across evaluators; wire through load_model_for_inference; include tests and docs updates

Co-authored-by: Cosine <agent@cosine.sh>
…ct methods in eval configs

Co-authored-by: Cosine <agent@cosine.sh>
…sses

Co-authored-by: Cosine <agent@cosine.sh>
@brej-29 brej-29 merged commit 534418d into main Jan 10, 2026
1 check failed
