Skip to content

Latest commit

 

History

History
96 lines (66 loc) · 5.15 KB

File metadata and controls

96 lines (66 loc) · 5.15 KB

SQL Prediction Evaluation

This directory contains run_evaluation.py, a script for evaluating the results of SQL query predictions against ground truth queries and their result sets. It computes a variety of metrics, including execution accuracy, subset accuracy, and SQL equivalence, and outputs detailed and summary reports.

Requirements

Before running the script, install the required dependencies:

pip install -e .

or follow the instructions on the main README to setup your environment and do:

pip install -e .

Note on LLM Judge mode (--use_llm_judge) If you enable --use_llm_judge, you must have your LLM provider configured (e.g., API key and endpoint) as described in the project's main README. LLM evaluation may incur API costs and can be non-deterministic unless you fix parameters (e.g., temperature).

Usage

python scripts/evaluation/run_evaluation.py <input_file.json> [--output_file OUTPUT] [--summary_file SUMMARY] [--csv_summary_file CSV_SUMMARY] [--use_llm_judge] [--llm_judge_config_path]
  • <input_file.json>: Path to the input JSON file containing records with ground truth and predicted SQL queries and their result DataFrames.
  • --output_file: (Optional) Path to write the evaluated JSON output. Defaults to <input_file>_eval.json.
  • --summary_file: (Optional) Path to write the evaluation summary in JSON format. Defaults to <input_file>_eval_summary.json.
  • --csv_summary_file: (Optional) Path to write the evaluation summary in CSV format. Defaults to <input_file>_eval_summary.csv.
  • --use_llm_judge: (Optional) Enables LLM-as-judge metrics in addition to the standard metrics.
  • --llm_judge_config_path: (Optional) Enables LLM-as-judge metrics in addition to the standard metrics, and sets path to config yaml file for LLM-as-judge inference. See an example config file that is used as default: here.

Input Format

The input JSON should be a list of records. Each record must contain:

  • "sql": The ground truth SQL query.

  • "gt_df": The ground truth result DataFrame, encoded as a JSON string.

  • "predictions": A dictionary mapping model names to their predictions, each containing:

    • "predicted_sql": The predicted SQL query.
    • "predicted_df": The predicted result DataFrame, encoded as a JSON string.

For a sample input, see data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions.json.

Output

  • The script adds evaluation metrics to each prediction in the output JSON.
  • It also produces summary statistics (mean and standard deviation) for each metric, per model, in both JSON and CSV formats.
  • A summary is printed to the console.

For sample outputs, see:

Metrics

  • execution_accuracy: Exact match between predicted and ground truth result sets.

  • non_empty_execution_accuracy: Match when both result sets are non-empty.

  • subset_non_empty_execution_accuracy: Match even with missing or additional columns when both sets are non-empty.

  • bird_execution_accuracy: Set-based match as in the BIRD benchmark.

  • is_sqlglot_parsable: Whether the predicted SQL is parsable by SQLGlot.

  • is_sqlparse_parsable: Whether the predicted SQL is parsable by sqlparse.

  • sqlglot_equivalence: SQLGlot-based equivalence.

  • sqlglot_optimized_equivalence: Optimized SQLGlot equivalence.

  • sqlparse_equivalence: sqlparse-based equivalence.

  • sql_exact_match: Exact string match of SQLs.

  • sql_syntactic_equivalence: Any of the above SQL equivalence metrics is true.

  • llm_score (only with --use_llm_judge): LLM-as-judge score indicating whether the prediction is correct per a rubric.

    • Depending on configuration, the per-prediction rationale (llm_explanation) may also be recorded.

Example

Run without LLM:

python scripts/evaluation/run_evaluation.py data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions.json

Run with LLM-as-judge:

python scripts/evaluation/run_evaluation.py data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions.json --use_llm_judge

This will produce:

  • data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval.json
  • data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval_summary.json
  • data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval_summary.csv