Skip to content

Latest commit

 

History

History
208 lines (168 loc) · 7.77 KB

File metadata and controls

208 lines (168 loc) · 7.77 KB

Benchmarks

This directory contains benchmark datasets and configurations for evaluating text-to-SQL systems. Benchmarks are organized into two categories:

  • Full Benchmarks (benchmarks.json) - Full-scale benchmarks for comprehensive evaluation
  • Test Benchmarks (test-benchmarks.json) - Smaller subsets for quick validation and CI/CD

Production Benchmarks

Production benchmarks are defined in data/benchmarks.json and include:

BIRD Mini-Dev (SQLite)

BIRD Mini-Dev (PostgreSQL)

  • ID: bird_mini_dev_postgres
  • Description: BIRD-SQL Mini-Dev in PostgreSQL
  • Size: 500 questions
  • Database: PostgreSQL
  • Source: https://github.com/bird-bench/mini_dev
  • Setup Required: PostgreSQL database and POSTGRES_CONNECTION_STRING environment variable

Beaver

  • ID: beaver
  • Description: Beaver benchmark for business intelligence queries
  • Size: 209 questions
  • Database: MySQL
  • Source: https://peterbaile.github.io/beaver/
  • Setup Required: MySQL database and MYSQL_CONNECTION_STRING environment variable

Archer (English Dev)

Spider Dev

Spider Realistic

Test Benchmarks

Test benchmarks are defined in data/test-benchmarks.json and provide smaller subsets for:

  • Quick validation during development
  • Continuous Integration / Continuous Deployment (CI/CD) pipelines
  • Debugging and testing new features
  • Rapid iteration without full benchmark overhead

BIRD Mini-Dev SQLite Test (50)

  • ID: bird_mini_dev_sqlite_test_50
  • Description: 50-question test sample from BIRD Mini-Dev (SQLite)
  • Size: 50 questions
  • Database: SQLite
  • Data: benchmarks/test_benchmarks/bird_mini_dev_sqlite_test_50.json
  • Results: benchmarks/test_benchmarks/results/

BIRD Mini-Dev PostgreSQL Test (50)

  • ID: bird_mini_dev_postgres_test_50
  • Description: 50-question test sample from BIRD Mini-Dev (PostgreSQL)
  • Size: 50 questions
  • Database: PostgreSQL
  • Data: benchmarks/test_benchmarks/bird_mini_dev_postgres_test_50.json
  • Results: benchmarks/test_benchmarks/results/

Spider Dev Test (50)

  • ID: spider_dev_test_50
  • Description: 50-question test sample from Spider Dev
  • Size: 50 questions
  • Database: SQLite
  • Data: benchmarks/test_benchmarks/spider_dev_test_50.json
  • Results: benchmarks/test_benchmarks/results/

Beaver Test (10)

  • ID: beaver_test_10
  • Description: 10-question test sample from Beaver
  • Size: 10 questions
  • Database: MySQL
  • Data: benchmarks/test_benchmarks/beaver_test_10.json
  • Results: benchmarks/test_benchmarks/results/

Archer English Dev Test (10)

  • ID: archer_en_dev_test_10
  • Description: 10-question test sample from Archer English Dev
  • Size: 10 questions
  • Database: SQLite
  • Data: benchmarks/test_benchmarks/archer_en_dev_test_10.json
  • Results: benchmarks/test_benchmarks/results/

BIRD SQLite Test Benchmark (3)

  • ID: bird_sqlite_test_benchmark
  • Description: Minimal 3-question benchmark for code validation
  • Size: 3 questions
  • Database: SQLite (self-contained in test_benchmarks/db/)
  • Data: benchmarks/test_benchmarks/bird_sqlite_test_benchmark.json
  • Results: benchmarks/test_benchmarks/results/
  • Note: This is the smallest benchmark, ideal for quick smoke tests

Benchmark Configuration Format

Both benchmarks.json and test-benchmarks.json follow the same structure:

{
    "benchmark_id": {
        "name": "benchmark_id",
        "description": "Human-readable description with source URL",
        "data": "path/to/questions.json",
        "schema": "path/to/schema.json",
        "predictions": "path/to/predictions.json",
        "db_engine": {
            "db_type": "sqlite|postgres|mysql",
            "db_folder": "path/to/databases",  // For SQLite
            "schema_name": "public",           // For PostgreSQL
            "connection_string_env_var": "ENV_VAR_NAME"  // For PostgreSQL/MySQL
        }
    }
}

Database Setup

Most benchmarks require downloading databases or setting up database connections. See dbs/README.md for detailed setup instructions for:

  • BIRD Mini-Dev databases
  • Spider databases
  • Archer databases
  • PostgreSQL and MySQL connection setup

Usage Examples

Running a Production Benchmark

python scripts/run_experiment.py bird_mini_dev_sqlite

Running a Test Benchmark

python scripts/run_experiment.py bird_sqlite_test_benchmark

Running All Test Benchmarks

python scripts/run_all_benchmarks.py --test

Model Configuration

The run_all_benchmarks.py script supports separate model configurations for standard and agentic baselines:

  • Standard Baseline Models (--model_names): Used for fast, single-shot SQL generation

    • Default: 4 models (llama-3-3-70b, granite-4-h-small, llama-4-maverick-17b, gpt-oss-120b)
    • Fast execution, suitable for testing multiple models
  • Agentic Baseline Models (--agentic_models): Used for multi-step agentic SQL generation

    • Default: 1 model (gpt-oss-120b)
    • Slower execution due to multiple LLM calls per question
    • Optimized for efficiency by default

Example: Running all baselines with default configuration

# Standard baseline: 4 models, Agentic baselines: 1 model (10 total pipeline runs)
python scripts/run_all_benchmarks.py bird_sqlite_test_benchmark --test --run_all_baselines

Example: Custom model configuration

# Override agentic models to test multiple models
python scripts/run_all_benchmarks.py bird_sqlite_test_benchmark --test --run_all_baselines \
  --agentic_models "wxai:meta-llama/llama-3-3-70b-instruct" "wxai:ibm/granite-4-h-small"

# Override standard baseline models
python scripts/run_all_benchmarks.py bird_sqlite_test_benchmark --test --run_all_baselines \
  --model_names "wxai:openai/gpt-oss-120b"

Performance Note: When using --run_all_baselines, the script runs:

  • 1 standard baseline × N models (from --model_names)
  • 6 agentic variants × M models (from --agentic_models)

With defaults (N=4, M=1), this results in 10 pipeline runs instead of 28 (7 pipelines × 4 models), significantly reducing runtime while maintaining comprehensive coverage.

Results

Evaluation results are stored in:

  • Production benchmarks: data/results/ with dashboard at data/results/README.md
  • Test benchmarks: data/benchmarks/test_benchmarks/results/ with dashboard at data/benchmarks/test_benchmarks/results/README.md

Adding New Benchmarks

See CONTRIBUTING.md for guidelines on adding new benchmarks to the toolkit.