Commit 46d6db2

v1.0.0 release changes
1 parent ee2aea4 commit 46d6db2

11 files changed

Lines changed: 1813 additions & 33 deletions

CHANGELOG.md

Lines changed: 9 additions & 23 deletions
@@ -5,28 +5,14 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unreleased]
+## [1.0.0] - 2026-03-11
 
 ### Added
-- Initial release of Text2SQL Evaluation Toolkit
-- Modular evaluation framework for text-to-SQL systems
-- Support for multiple benchmarks (BIRD-SQL, Spider, Beaver, Archer)
-- Execution-based evaluation metrics
-- LLM-as-judge evaluation support
-- Multiple ground truth support
-- Error analysis and visualization tools
-- SQL profiling capabilities
-- Agentic pipeline for iterative SQL generation
-- Comprehensive documentation and examples
-
-### Changed
-
-### Deprecated
-
-### Removed
-
-### Fixed
-
-### Security
-
-[unreleased]: https://github.com/IBM/text2sql-eval-toolkit/compare/v0.1.0...HEAD
+- Pip-installable `text2sql-eval-toolkit` library with packaged benchmark metadata.
+- Curated top-level Python API for evaluation (`evaluate_prediction`, `evaluate_predictions`, `run_evaluation`).
+- Execution orchestration helper (`run_execution`) and benchmark discovery utilities (`get_available_benchmarks`, `get_benchmarks_info`, `get_benchmark_info`).
+- Public inference pipelines (`LLMSQLGenerationPipeline`, `AgenticSQLGenerationPipeline`) for reproducing baseline and agentic experiments.
+- Re-exported low-level SQL comparison and parsing helpers (`compare_result_dfs`, `sql_exact_match`, etc.) from `unitxt.text2sql_utils`.
+- Library-focused README examples showing record-level, file-level, and benchmark-level usage.
+
+[1.0.0]: https://github.com/IBM/text2sql-eval-toolkit/releases/tag/v1.0.0

README.md

Lines changed: 90 additions & 2 deletions
@@ -64,7 +64,18 @@ brew install uv
 
 ## Installation
 
-### Option 1: Using UV (Recommended - Fast! ⚡)
+### From PyPI
+
+```bash
+pip install text2sql-eval-toolkit
+
+# Optional: install database-specific extras
+pip install "text2sql-eval-toolkit[mysql,presto,db2]"
+```
+
+### From source (recommended for development)
+
+Using UV:
 
 ```bash
 # Clone the repository
@@ -82,7 +93,7 @@ uv pip install -e .
 uv pip install -e ".[mysql,presto,db2]"
 ```
 
-### Option 2: Using pip/conda (Traditional)
+Using pip/conda:
 
 ```bash
 # Create conda environment (or use venv)
@@ -107,6 +118,83 @@ The toolkit comes with pre-defined public benchmarks including BIRD-SQL, Spider,
 
 ## Usage
 
+### Using the library API
+
+After installation you can import the toolkit and use both low-level and high-level evaluation APIs:
+
+**Evaluate a single prediction record in memory**
+
+```python
+from text2sql_eval_toolkit import evaluate_prediction, parse_dataframe, get_gt_sqls
+
+# record comes from a benchmark JSON entry, prediction from your model
+record = {
+    "id": "q1",
+    "sql": "SELECT * FROM customers",
+    "gt_df": some_serialized_dataframe,  # JSON in pandas orient='split' format
+}
+prediction = {
+    "predicted_sql": "SELECT * FROM customers",
+    "predicted_df": some_serialized_dataframe,  # same format
+}
+
+result = evaluate_prediction(record, prediction)
+print(result["subset_non_empty_execution_accuracy"])
+```
+
+**Evaluate a predictions JSON file**
+
+```python
+from text2sql_eval_toolkit import evaluate_predictions
+
+data, summary_df = evaluate_predictions(
+    input_file="data/results/my-benchmark-predictions.json",
+)
+print(summary_df.head())
+```
+
+**Run evaluation for a known benchmark ID**
+
+```python
+from text2sql_eval_toolkit import get_available_benchmarks, run_evaluation
+
+print(get_available_benchmarks())  # uses packaged benchmark metadata
+data, summary_df = run_evaluation("bird_mini_dev_sqlite")
+print(summary_df[["subset_non_empty_execution_accuracy_avg"]])
+```
+
+**Run SQL execution for a benchmark before evaluation**
+
+```python
+from text2sql_eval_toolkit import run_execution
+
+# Requires appropriate DB connection env vars (e.g., POSTGRES_CONNECTION_STRING)
+run_execution("bird_mini_dev_postgres")
+```
+
+**Use the inference pipelines**
+
+```python
+from text2sql_eval_toolkit import LLMSQLGenerationPipeline, AgenticSQLGenerationPipeline
+
+pipeline = LLMSQLGenerationPipeline()
+pipeline.run_pipeline(
+    benchmark_id="bird_mini_dev_sqlite",
+    model_name="wxai:ibm/granite-34b-code-instruct",
+    model_parameters={"max_new_tokens": 512},
+)
+
+agentic = AgenticSQLGenerationPipeline()
+agentic.run_pipeline(
+    benchmark_id="bird_mini_dev_sqlite",
+    model_name="wxai:ibm/granite-34b-code-instruct",
+    model_parameters={"max_new_tokens": 512},
+    max_attempts=3,
+)
+```
+
+See the docstrings of the exported functions/classes in `text2sql_eval_toolkit.__init__` for the full list of public APIs.
+
 ### Running Experiments
 
 **Single Benchmark:**
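
Review note on the README hunk: the record-level example describes `gt_df` and `predicted_df` as "JSON in pandas orient='split' format" but leaves `some_serialized_dataframe` undefined. As a rough illustration of that layout only, the sketch below builds such a payload with the standard library. The `to_split_json` helper and the sample rows are hypothetical and not part of the toolkit; with pandas installed, `DataFrame.to_json(orient="split")` would normally produce the same three keys.

```python
import json


def to_split_json(columns, rows):
    """Serialize rows into the columns/index/data layout of orient='split'.

    Hypothetical helper for illustration; it is not a toolkit function.
    """
    payload = {
        "columns": list(columns),
        "index": list(range(len(rows))),  # default RangeIndex positions
        "data": [list(row) for row in rows],
    }
    return json.dumps(payload)


# Made-up result set standing in for the output of an executed query.
serialized = to_split_json(["id", "name"], [(1, "Ada"), (2, "Grace")])
decoded = json.loads(serialized)
print(decoded["columns"])
print(decoded["data"])
```

Whether `evaluate_prediction` expects the value as a JSON string or as an already-parsed object is not visible in this diff, so treat the string form here as an assumption.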

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -68,3 +68,6 @@ package-dir = { "" = "src" }
 
 [tool.setuptools.packages.find]
 where = ["src"]
+
+[tool.setuptools.package-data]
+"text2sql_eval_toolkit" = ["data/*.json"]

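Review note on the `package-data` entry: shipping `data/*.json` inside the wheel means the metadata can be read through `importlib.resources` rather than a path relative to the repository. The sketch below shows one way that lookup might work, assuming Python 3.9+. It builds a throwaway package named `demo_pkg` (hypothetical) with a `data/benchmarks.json` file (the file name is also an assumption, since the diff does not show it) and does not reflect how the toolkit itself loads the file.

```python
import importlib
import json
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Create a temporary package that mimics a "data/*.json" package-data layout.
# "demo_pkg" and "benchmarks.json" are hypothetical names used for illustration.
root = Path(tempfile.mkdtemp())
data_dir = root / "demo_pkg" / "data"
data_dir.mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("")
(data_dir / "benchmarks.json").write_text(
    json.dumps({"spider_dev": {"db_type": "sqlite"}})
)

# Make the temporary package importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()


def load_packaged_json(package, filename):
    """Read a JSON file from the package's data/ directory."""
    resource = resources.files(package) / "data" / filename
    return json.loads(resource.read_text(encoding="utf-8"))


benchmarks = load_packaged_json("demo_pkg", "benchmarks.json")
print(sorted(benchmarks))
```

Reading through `resources.files` keeps the lookup independent of the current working directory, which is the usual reason for declaring package data in the first place.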
src/text2sql_eval_toolkit/__init__.py

Lines changed: 102 additions & 0 deletions
@@ -2,4 +2,106 @@
 # Copyright IBM Corp. 2025 - 2026
 # SPDX-License-Identifier: Apache-2.0
 #
+"""
+Public API for the text2sql-eval-toolkit library.
+
+This package exposes multiple levels of functionality:
+
+- Low-level, record-based evaluation (`evaluate_prediction`)
+- File-based evaluation over prediction JSON files (`evaluate_predictions`)
+- Benchmark-based orchestration that discovers files from benchmark metadata (`run_evaluation`, `run_execution`)
+- Inference pipelines for generating SQL (`LLMSQLGenerationPipeline`, `AgenticSQLGenerationPipeline`)
+- Utilities for discovering and inspecting available benchmarks (`get_available_benchmarks`, etc.)
+"""
+
+from .evaluation.evaluation_tools import (
+    evaluate_prediction,
+    async_evaluate_predictions,
+    evaluate_predictions,
+    compute_summary,
+    summary_to_df_csv,
+    print_summary,
+    run_evaluation,
+)
+from .evaluation.llm_as_judge import (
+    load_llm_judge_config,
+    evaluate_sql_prediction_with_llm,
+)
+from .evaluation import (
+    compare_result_dfs,
+    compare_dfs_bird_eval_logic,
+    is_sqlglot_parsable,
+    is_sqlparse_parsable,
+    sqlglot_parsed_queries_equivalent,
+    sqlglot_optimized_equivalence,
+    sqlparse_queries_equivalent,
+    sql_exact_match,
+)
+from .execution.execution_tools import run_execution
+from .inference.baseline_llm_pipeline import (
+    LLMSQLGenerationPipelineSimple,
+    LLMSQLGenerationPipeline,
+)
+from .inference.agentic_pipeline import AgenticSQLGenerationPipeline
+from .utils import (
+    get_available_benchmarks,
+    get_benchmarks_info,
+    get_benchmark_info,
+    run_with_timeout,
+    run_with_timeout_async,
+    parse_dataframe,
+    truncate_dataframe,
+    get_question_id,
+    get_utterance,
+    get_gt_sqls,
+    get_question,
+    get_default_eval_filename,
+    add_summary_json_suffix,
+    add_summary_csv_suffix,
+)
+
+__all__ = [
+    # Evaluation APIs
+    "evaluate_prediction",
+    "async_evaluate_predictions",
+    "evaluate_predictions",
+    "compute_summary",
+    "summary_to_df_csv",
+    "print_summary",
+    "run_evaluation",
+    # LLM-as-judge helpers
+    "load_llm_judge_config",
+    "evaluate_sql_prediction_with_llm",
+    # Low-level SQL equivalence / parsing helpers (from unitxt.text2sql_utils)
+    "compare_result_dfs",
+    "compare_dfs_bird_eval_logic",
+    "is_sqlglot_parsable",
+    "is_sqlparse_parsable",
+    "sqlglot_parsed_queries_equivalent",
+    "sqlglot_optimized_equivalence",
+    "sqlparse_queries_equivalent",
+    "sql_exact_match",
+    # Execution
+    "run_execution",
+    # Inference pipelines
+    "LLMSQLGenerationPipelineSimple",
+    "LLMSQLGenerationPipeline",
+    "AgenticSQLGenerationPipeline",
+    # Benchmark utilities
+    "get_available_benchmarks",
+    "get_benchmarks_info",
+    "get_benchmark_info",
+    # Misc utilities (advanced usage)
+    "run_with_timeout",
+    "run_with_timeout_async",
+    "parse_dataframe",
+    "truncate_dataframe",
+    "get_question_id",
+    "get_utterance",
+    "get_gt_sqls",
+    "get_question",
+    "get_default_eval_filename",
+    "add_summary_json_suffix",
+    "add_summary_csv_suffix",
+]
 
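
Review note on `__all__`: with this many re-exports, the import list and `__all__` can drift apart over time. A minimal sketch of a consistency check follows. It runs against a stand-in module built with `types.ModuleType`, so the module name `demo` and its contents are hypothetical; a real test would import `text2sql_eval_toolkit` instead.

```python
import types


def missing_exports(module):
    """Return names listed in __all__ that the module does not actually define."""
    return [
        name
        for name in getattr(module, "__all__", [])
        if not hasattr(module, name)
    ]


# Hypothetical stand-in module: one name is defined, one is only advertised.
demo = types.ModuleType("demo")
demo.run_execution = lambda benchmark_id: None
demo.__all__ = ["run_execution", "run_evaluation"]

print(missing_exports(demo))
```

An empty list from such a check would indicate that every advertised name resolves, which also keeps `from package import *` from raising `AttributeError`.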

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+#
+# Package data for text2sql_eval_toolkit.
+#
+
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+{
+  "bird_mini_dev_sqlite": {
+    "name": "bird_mini_dev_sqlite",
+    "description": "BIRD-SQL Mini-Dev in SQLite https://github.com/bird-bench/mini_dev",
+    "data": "benchmarks/bird_mini_dev_sqlite.json",
+    "schema": "benchmarks/bird_mini_dev_sqlite-schema.json",
+    "predictions": "results/bird_mini_dev_sqlite-predictions.json",
+    "db_engine": {
+      "db_type": "sqlite",
+      "db_folder": "benchmarks/dbs/bird/dev_databases"
+    }
+  },
+  "bird_mini_dev_postgres": {
+    "name": "bird_mini_dev_postgres",
+    "description": "BIRD-SQL Mini-Dev in PostgreSQL https://github.com/bird-bench/mini_dev",
+    "data": "benchmarks/bird_mini_dev_postgres.json",
+    "schema": "benchmarks/bird_mini_dev_postgres-schema.json",
+    "predictions": "results/bird_mini_dev_postgres-predictions.json",
+    "db_engine": {
+      "db_type": "postgres",
+      "schema_name": "public",
+      "connection_string_env_var": "POSTGRES_CONNECTION_STRING"
+    }
+  },
+  "beaver": {
+    "name": "beaver",
+    "description": "Beaver benchmark https://peterbaile.github.io/beaver/",
+    "data": "benchmarks/beaver.json",
+    "schema": "benchmarks/beaver-schema.json",
+    "predictions": "results/beaver-predictions.json",
+    "db_engine": {
+      "db_type": "mysql",
+      "connection_string_env_var": "MYSQL_CONNECTION_STRING"
+    }
+  },
+  "archer_en_dev": {
+    "name": "archer_en_dev",
+    "description": "Archer English Dev Set https://sig4kg.github.io/archer-bench/",
+    "data": "benchmarks/archer_en_dev.json",
+    "schema": "benchmarks/archer-schema.json",
+    "predictions": "results/archer_en_dev-predictions.json",
+    "db_engine": {
+      "db_type": "sqlite",
+      "db_folder": "benchmarks/dbs/archer/database"
+    }
+  },
+  "spider_dev": {
+    "name": "spider_dev",
+    "description": "Spider Dev Set - Full 1,034 questions https://yale-lily.github.io/spider",
+    "data": "benchmarks/spider-dev-converted.json",
+    "schema": "benchmarks/spider-dev-schema.json",
+    "predictions": "results/spider_dev-predictions.json",
+    "db_engine": {
+      "db_type": "sqlite",
+      "db_folder": "benchmarks/dbs/spider/database"
+    }
+  },
+  "spider_realistic": {
+    "name": "spider_realistic",
+    "description": "Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL https://zenodo.org/records/5205322",
+    "data": "benchmarks/spider-realistic.json",
+    "schema": "benchmarks/spider-dev-schema.json",
+    "predictions": "results/spider_realistic-predictions.json",
+    "db_engine": {
+      "db_type": "sqlite",
+      "db_folder": "benchmarks/dbs/spider/database"
+    }
+  }
+}
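
Review note on the benchmark metadata: the `db_engine` blocks come in two shapes. SQLite entries carry a `db_folder`, while the Postgres and MySQL entries name an environment variable via `connection_string_env_var`. The sketch below shows one possible way to interpret the two shapes. The `describe_connection` function and its return format are hypothetical, and the toolkit's own resolution logic is not part of this diff.

```python
import json
import os

# Two db_engine blocks copied from the metadata above; other keys are omitted.
METADATA = json.loads("""
{
  "bird_mini_dev_sqlite": {
    "db_engine": {
      "db_type": "sqlite",
      "db_folder": "benchmarks/dbs/bird/dev_databases"
    }
  },
  "bird_mini_dev_postgres": {
    "db_engine": {
      "db_type": "postgres",
      "schema_name": "public",
      "connection_string_env_var": "POSTGRES_CONNECTION_STRING"
    }
  }
}
""")


def describe_connection(benchmark_id, metadata, env=os.environ):
    """Hypothetical helper: report where a benchmark's database lives."""
    engine = metadata[benchmark_id]["db_engine"]
    env_var = engine.get("connection_string_env_var")
    if env_var is None:
        # File-based engines (SQLite here) point at a local folder.
        return {"db_type": engine["db_type"], "location": engine["db_folder"]}
    if env_var not in env:
        raise KeyError(f"Set {env_var} before running {benchmark_id}")
    # Server-based engines take their connection string from the environment.
    return {"db_type": engine["db_type"], "location": env[env_var]}


print(describe_connection("bird_mini_dev_sqlite", METADATA))
```

Passing `env` explicitly, as in `describe_connection("bird_mini_dev_postgres", METADATA, env={"POSTGRES_CONNECTION_STRING": "postgresql://example"})`, keeps the function testable without touching the real environment.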
