This directory contains run_evaluation.py, a script for evaluating the results of SQL query predictions against ground truth queries and their result sets. It computes a variety of metrics, including execution accuracy, subset accuracy, and SQL equivalence, and outputs detailed and summary reports.
Before running the script, install the required dependencies:
pip install -e .or follow the instructions on the main README to setup your environment and do:
pip install -e .Note on LLM Judge mode (
--use_llm_judge) If you enable--use_llm_judge, you must have your LLM provider configured (e.g., API key and endpoint) as described in the project's main README. LLM evaluation may incur API costs and can be non-deterministic unless you fix parameters (e.g., temperature).
python scripts/evaluation/run_evaluation.py <input_file.json> [--output_file OUTPUT] [--summary_file SUMMARY] [--csv_summary_file CSV_SUMMARY] [--use_llm_judge] [--llm_judge_config_path]<input_file.json>: Path to the input JSON file containing records with ground truth and predicted SQL queries and their result DataFrames.--output_file: (Optional) Path to write the evaluated JSON output. Defaults to<input_file>_eval.json.--summary_file: (Optional) Path to write the evaluation summary in JSON format. Defaults to<input_file>_eval_summary.json.--csv_summary_file: (Optional) Path to write the evaluation summary in CSV format. Defaults to<input_file>_eval_summary.csv.--use_llm_judge: (Optional) Enables LLM-as-judge metrics in addition to the standard metrics.--llm_judge_config_path: (Optional) Enables LLM-as-judge metrics in addition to the standard metrics, and sets path to config yaml file for LLM-as-judge inference. See an example config file that is used as default: here.
The input JSON should be a list of records. Each record must contain:
-
"sql": The ground truth SQL query. -
"gt_df": The ground truth result DataFrame, encoded as a JSON string. -
"predictions": A dictionary mapping model names to their predictions, each containing:"predicted_sql": The predicted SQL query."predicted_df": The predicted result DataFrame, encoded as a JSON string.
For a sample input, see data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions.json.
- The script adds evaluation metrics to each prediction in the output JSON.
- It also produces summary statistics (mean and standard deviation) for each metric, per model, in both JSON and CSV formats.
- A summary is printed to the console.
For sample outputs, see:
data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval.jsondata/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval_summary.jsondata/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval_summary.csv
-
execution_accuracy: Exact match between predicted and ground truth result sets.
-
non_empty_execution_accuracy: Match when both result sets are non-empty.
-
subset_non_empty_execution_accuracy: Match even with missing or additional columns when both sets are non-empty.
-
bird_execution_accuracy: Set-based match as in the BIRD benchmark.
-
is_sqlglot_parsable: Whether the predicted SQL is parsable by SQLGlot.
-
is_sqlparse_parsable: Whether the predicted SQL is parsable by sqlparse.
-
sqlglot_equivalence: SQLGlot-based equivalence.
-
sqlglot_optimized_equivalence: Optimized SQLGlot equivalence.
-
sqlparse_equivalence: sqlparse-based equivalence.
-
sql_exact_match: Exact string match of SQLs.
-
sql_syntactic_equivalence: Any of the above SQL equivalence metrics is true.
-
llm_score (only with
--use_llm_judge): LLM-as-judge score indicating whether the prediction is correct per a rubric.- Depending on configuration, the per-prediction rationale (
llm_explanation) may also be recorded.
- Depending on configuration, the per-prediction rationale (
Run without LLM:
python scripts/evaluation/run_evaluation.py data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions.jsonRun with LLM-as-judge:
python scripts/evaluation/run_evaluation.py data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions.json --use_llm_judgeThis will produce:
data/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval.jsondata/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval_summary.jsondata/benchmarks/test_benchmarks/results/bird_sqlite_test_benchmark-predictions_eval_summary.csv