diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 9 additions & 23 deletions
diff --git a/README.md b/README.md
Lines changed: 90 additions & 2 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 3 additions & 0 deletions
diff --git a/src/text2sql_eval_toolkit/__init__.py b/src/text2sql_eval_toolkit/__init__.py
Lines changed: 102 additions & 0 deletions
diff --git a/src/text2sql_eval_toolkit/data/__init__.py b/src/text2sql_eval_toolkit/data/__init__.py
Lines changed: 4 additions & 0 deletions
diff --git a/src/text2sql_eval_toolkit/data/benchmarks.json b/src/text2sql_eval_toolkit/data/benchmarks.json
Lines changed: 69 additions & 0 deletions
@@ -5,28 +5,14 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unreleased]
+## [1.0.0] - 2026-03-11
 
 ### Added
-- Initial release of Text2SQL Evaluation Toolkit
-- Modular evaluation framework for text-to-SQL systems
-- Support for multiple benchmarks (BIRD-SQL, Spider, Beaver, Archer)
-- Execution-based evaluation metrics
-- LLM-as-judge evaluation support
-- Multiple ground truth support
-- Error analysis and visualization tools
-- SQL profiling capabilities
-- Agentic pipeline for iterative SQL generation
-- Comprehensive documentation and examples
-
-### Changed
-
-### Deprecated
-
-### Removed
-
-### Fixed
-
-### Security
-
-[unreleased]: https://github.com/IBM/text2sql-eval-toolkit/compare/v0.1.0...HEAD
+- Pip-installable `text2sql-eval-toolkit` library with packaged benchmark metadata.
+- Curated top-level Python API for evaluation (`evaluate_prediction`, `evaluate_predictions`, `run_evaluation`).
+- Execution orchestration helper (`run_execution`) and benchmark discovery utilities (`get_available_benchmarks`, `get_benchmarks_info`, `get_benchmark_info`).
+- Public inference pipelines (`LLMSQLGenerationPipeline`, `AgenticSQLGenerationPipeline`) for reproducing baseline and agentic experiments.
+- Re-exported low-level SQL comparison and parsing helpers (`compare_result_dfs`, `sql_exact_match`, etc.) from `unitxt.text2sql_utils`.
+- Library-focused README examples showing record-level, file-level, and benchmark-level usage.
+
+[1.0.0]: https://github.com/IBM/text2sql-eval-toolkit/releases/tag/v1.0.0
@@ -64,7 +64,18 @@ brew install uv
 
 ## Installation
 
-### Option 1: Using UV (Recommended - Fast! ⚡)
+### From PyPI
+
+```bash
+pip install text2sql-eval-toolkit
+
+# Optional: install database-specific extras
+pip install "text2sql-eval-toolkit[mysql,presto,db2]"
+```
+
+### From source (recommended for development)
+
+Using UV:
 
 ```bash
 # Clone the repository
@@ -82,7 +93,7 @@ uv pip install -e .
 uv pip install -e ".[mysql,presto,db2]"
 ```
 
-### Option 2: Using pip/conda (Traditional)
+Using pip/conda:
 
 ```bash
 # Create conda environment (or use venv)
@@ -107,6 +118,83 @@ The toolkit comes with pre-defined public benchmarks including BIRD-SQL, Spider,
 
 ## Usage
 
+### Using the library API
+
+After installation, import the toolkit to use either the low-level (record and file) or the high-level (benchmark) evaluation APIs:
+
+**Evaluate a single prediction record in memory**
+
+```python
+from text2sql_eval_toolkit import evaluate_prediction
+
+# record comes from a benchmark JSON entry, prediction from your model
+record = {
+    "id": "q1",
+    "sql": "SELECT * FROM customers",
+    "gt_df": some_serialized_dataframe,  # JSON in pandas orient='split' format
+}
+prediction = {
+    "predicted_sql": "SELECT * FROM customers",
+    "predicted_df": some_serialized_dataframe,  # same format
+}
+
+result = evaluate_prediction(record, prediction)
+print(result["subset_non_empty_execution_accuracy"])
+```
+
+**Evaluate a predictions JSON file**
+
+```python
+from text2sql_eval_toolkit import evaluate_predictions
+
+data, summary_df = evaluate_predictions(
+    input_file="data/results/my-benchmark-predictions.json",
+)
+print(summary_df.head())
+```
+
+**Run evaluation for a known benchmark ID**
+
+```python
+from text2sql_eval_toolkit import get_available_benchmarks, run_evaluation
+
+print(get_available_benchmarks())  # uses packaged benchmark metadata
+data, summary_df = run_evaluation("bird_mini_dev_sqlite")
+print(summary_df[["subset_non_empty_execution_accuracy_avg"]])
+```
+
+**Run SQL execution for a benchmark before evaluation**
+
+```python
+from text2sql_eval_toolkit import run_execution
+
+# Requires appropriate DB connection env vars (e.g., POSTGRES_CONNECTION_STRING)
+run_execution("bird_mini_dev_postgres")
+```
+
+**Use the inference pipelines**
+
+```python
+from text2sql_eval_toolkit import LLMSQLGenerationPipeline, AgenticSQLGenerationPipeline
+
+pipeline = LLMSQLGenerationPipeline()
+pipeline.run_pipeline(
+    benchmark_id="bird_mini_dev_sqlite",
+    model_name="wxai:ibm/granite-34b-code-instruct",
+    model_parameters={"max_new_tokens": 512},
+)
+
+agentic = AgenticSQLGenerationPipeline()
+agentic.run_pipeline(
+    benchmark_id="bird_mini_dev_sqlite",
+    model_name="wxai:ibm/granite-34b-code-instruct",
+    model_parameters={"max_new_tokens": 512},
+    max_attempts=3,
+)
+```
+
+For the full list of public APIs, see `__all__` in `text2sql_eval_toolkit/__init__.py`; the docstrings of the exported functions and classes describe their usage.
+
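The record-level README example describes `gt_df` and `predicted_df` as JSON in pandas `orient='split'` format but leaves `some_serialized_dataframe` undefined. As a rough illustration of that layout, the sketch below builds such a payload with the standard library only. The helper names `to_split_json` and `from_split_json` are hypothetical and not part of the toolkit, and whether the toolkit expects a JSON string or an already-parsed dict is not shown in this diff.

```python
import json


def to_split_json(columns, rows):
    """Serialize tabular rows into the layout pandas uses for orient='split'."""
    payload = {
        "columns": list(columns),              # column labels
        "index": list(range(len(rows))),       # default 0..n-1 row index
        "data": [list(row) for row in rows],   # one inner list per row
    }
    return json.dumps(payload)


def from_split_json(text):
    """Recover column labels and row tuples from a split-oriented payload."""
    payload = json.loads(text)
    return payload["columns"], [tuple(row) for row in payload["data"]]


# Hypothetical result set for "SELECT * FROM customers"
serialized = to_split_json(["id", "name"], [(1, "Ada"), (2, "Grace")])
columns, rows = from_split_json(serialized)
print(serialized)
```

With pandas available, `DataFrame.to_json(orient="split")` produces the same three keys, so a payload like `serialized` could stand in for the placeholder in the example.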
 ### Running Experiments
 
 **Single Benchmark:**
 
@@ -68,3 +68,6 @@ package-dir = { "" = "src" }
 
 [tool.setuptools.packages.find]
 where = ["src"]
+
+[tool.setuptools.package-data]
+"text2sql_eval_toolkit" = ["data/*.json"]
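The `package-data` entry above ships `data/*.json` inside the installed package. One common way to read such a file at runtime is `importlib.resources`; the sketch below assumes that approach, which this diff does not confirm the toolkit uses. `load_packaged_json` and the throwaway `demo_benchmark_pkg` package are hypothetical names used only so the snippet runs without the toolkit installed.

```python
import json
import sys
import tempfile
from importlib import resources
from pathlib import Path


def load_packaged_json(package, relative_path):
    """Read a JSON resource shipped inside an importable package."""
    resource = resources.files(package)
    for part in relative_path.split("/"):
        resource = resource / part
    return json.loads(resource.read_text(encoding="utf-8"))


# Demo: build a throwaway package on disk that mimics the data/ layout.
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_benchmark_pkg"  # hypothetical package name
(package_dir / "data").mkdir(parents=True)
(package_dir / "__init__.py").write_text("", encoding="utf-8")
(package_dir / "data" / "benchmarks.json").write_text(
    json.dumps({"spider_dev": {"db_engine": {"db_type": "sqlite"}}}),
    encoding="utf-8",
)
sys.path.insert(0, str(root))  # make the demo package importable

benchmarks = load_packaged_json("demo_benchmark_pkg", "data/benchmarks.json")
print(sorted(benchmarks))
```

With the toolkit installed, the analogous call would presumably be `load_packaged_json("text2sql_eval_toolkit", "data/benchmarks.json")`, since that path matches the `data/*.json` glob.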
@@ -2,4 +2,106 @@
 # Copyright IBM Corp. 2025 - 2026
 # SPDX-License-Identifier: Apache-2.0
 #
+"""
+Public API for the text2sql-eval-toolkit library.
+
+This package exposes multiple levels of functionality:
+
+- Low-level, record-based evaluation (`evaluate_prediction`)
+- File-based evaluation over prediction JSON files (`evaluate_predictions`)
+- Benchmark-based orchestration that discovers files from benchmark metadata (`run_evaluation`, `run_execution`)
+- Inference pipelines for generating SQL (`LLMSQLGenerationPipeline`, `AgenticSQLGenerationPipeline`)
+- Utilities for discovering and inspecting available benchmarks (`get_available_benchmarks`, etc.)
+"""
+
+from .evaluation.evaluation_tools import (
+    evaluate_prediction,
+    async_evaluate_predictions,
+    evaluate_predictions,
+    compute_summary,
+    summary_to_df_csv,
+    print_summary,
+    run_evaluation,
+)
+from .evaluation.llm_as_judge import (
+    load_llm_judge_config,
+    evaluate_sql_prediction_with_llm,
+)
+from .evaluation import (
+    compare_result_dfs,
+    compare_dfs_bird_eval_logic,
+    is_sqlglot_parsable,
+    is_sqlparse_parsable,
+    sqlglot_parsed_queries_equivalent,
+    sqlglot_optimized_equivalence,
+    sqlparse_queries_equivalent,
+    sql_exact_match,
+)
+from .execution.execution_tools import run_execution
+from .inference.baseline_llm_pipeline import (
+    LLMSQLGenerationPipelineSimple,
+    LLMSQLGenerationPipeline,
+)
+from .inference.agentic_pipeline import AgenticSQLGenerationPipeline
+from .utils import (
+    get_available_benchmarks,
+    get_benchmarks_info,
+    get_benchmark_info,
+    run_with_timeout,
+    run_with_timeout_async,
+    parse_dataframe,
+    truncate_dataframe,
+    get_question_id,
+    get_utterance,
+    get_gt_sqls,
+    get_question,
+    get_default_eval_filename,
+    add_summary_json_suffix,
+    add_summary_csv_suffix,
+)
+
+__all__ = [
+    # Evaluation APIs
+    "evaluate_prediction",
+    "async_evaluate_predictions",
+    "evaluate_predictions",
+    "compute_summary",
+    "summary_to_df_csv",
+    "print_summary",
+    "run_evaluation",
+    # LLM-as-judge helpers
+    "load_llm_judge_config",
+    "evaluate_sql_prediction_with_llm",
+    # Low-level SQL equivalence / parsing helpers (from unitxt.text2sql_utils)
+    "compare_result_dfs",
+    "compare_dfs_bird_eval_logic",
+    "is_sqlglot_parsable",
+    "is_sqlparse_parsable",
+    "sqlglot_parsed_queries_equivalent",
+    "sqlglot_optimized_equivalence",
+    "sqlparse_queries_equivalent",
+    "sql_exact_match",
+    # Execution
+    "run_execution",
+    # Inference pipelines
+    "LLMSQLGenerationPipelineSimple",
+    "LLMSQLGenerationPipeline",
+    "AgenticSQLGenerationPipeline",
+    # Benchmark utilities
+    "get_available_benchmarks",
+    "get_benchmarks_info",
+    "get_benchmark_info",
+    # Misc utilities (advanced usage)
+    "run_with_timeout",
+    "run_with_timeout_async",
+    "parse_dataframe",
+    "truncate_dataframe",
+    "get_question_id",
+    "get_utterance",
+    "get_gt_sqls",
+    "get_question",
+    "get_default_eval_filename",
+    "add_summary_json_suffix",
+    "add_summary_csv_suffix",
+]
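Because `__all__` is maintained by hand alongside the import block, the two lists can drift apart. A small check along these lines could catch a name that is exported but never imported. This is a sketch only: `missing_exports` is a hypothetical helper, and the stand-in module below is built in memory rather than importing the real package.

```python
import types


def missing_exports(module):
    """Return names listed in __all__ that the module does not actually define."""
    exported = getattr(module, "__all__", [])
    return sorted(name for name in exported if not hasattr(module, name))


# Stand-in module: one export is defined, the other is listed but absent.
demo = types.ModuleType("demo")
demo.evaluate_prediction = lambda record, prediction: {}
demo.__all__ = ["evaluate_prediction", "run_evaluation"]

missing = missing_exports(demo)
print(missing)
```

In a test suite, the same function could be called on the imported `text2sql_eval_toolkit` module and asserted to return an empty list.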
 
@@ -0,0 +1,4 @@
+#
+# Package data for text2sql_eval_toolkit.
+#
+
@@ -0,0 +1,69 @@
+{
+    "bird_mini_dev_sqlite": {
+        "name": "bird_mini_dev_sqlite",
+        "description": "BIRD-SQL Mini-Dev in SQLite https://github.com/bird-bench/mini_dev",
+        "data": "benchmarks/bird_mini_dev_sqlite.json",
+        "schema": "benchmarks/bird_mini_dev_sqlite-schema.json",
+        "predictions": "results/bird_mini_dev_sqlite-predictions.json",
+        "db_engine": {
+            "db_type": "sqlite",
+            "db_folder": "benchmarks/dbs/bird/dev_databases"
+        }
+    },
+    "bird_mini_dev_postgres": {
+        "name": "bird_mini_dev_postgres",
+        "description": "BIRD-SQL Mini-Dev in PostgreSQL https://github.com/bird-bench/mini_dev",
+        "data": "benchmarks/bird_mini_dev_postgres.json",
+        "schema": "benchmarks/bird_mini_dev_postgres-schema.json",
+        "predictions": "results/bird_mini_dev_postgres-predictions.json",
+        "db_engine": {
+            "db_type": "postgres",
+            "schema_name": "public",
+            "connection_string_env_var": "POSTGRES_CONNECTION_STRING"
+        }
+    },
+    "beaver": {
+        "name": "beaver",
+        "description": "Beaver benchmark https://peterbaile.github.io/beaver/",
+        "data": "benchmarks/beaver.json",
+        "schema": "benchmarks/beaver-schema.json",
+        "predictions": "results/beaver-predictions.json",
+        "db_engine": {
+            "db_type": "mysql",
+            "connection_string_env_var": "MYSQL_CONNECTION_STRING"
+        }
+    },
+    "archer_en_dev": {
+        "name": "archer_en_dev",
+        "description": "Archer English Dev Set https://sig4kg.github.io/archer-bench/",
+        "data": "benchmarks/archer_en_dev.json",
+        "schema": "benchmarks/archer-schema.json",
+        "predictions": "results/archer_en_dev-predictions.json",
+        "db_engine": {
+            "db_type": "sqlite",
+            "db_folder": "benchmarks/dbs/archer/database"
+        }
+    },
+    "spider_dev": {
+        "name": "spider_dev",
+        "description": "Spider Dev Set - Full 1,034 questions https://yale-lily.github.io/spider",
+        "data": "benchmarks/spider-dev-converted.json",
+        "schema": "benchmarks/spider-dev-schema.json",
+        "predictions": "results/spider_dev-predictions.json",
+        "db_engine": {
+            "db_type": "sqlite",
+            "db_folder": "benchmarks/dbs/spider/database"
+        }
+    },
+    "spider_realistic": {
+        "name": "spider_realistic",
+        "description": "Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL https://zenodo.org/records/5205322",
+        "data": "benchmarks/spider-realistic.json",
+        "schema": "benchmarks/spider-dev-schema.json",
+        "predictions": "results/spider_realistic-predictions.json",
+        "db_engine": {
+            "db_type": "sqlite",
+            "db_folder": "benchmarks/dbs/spider/database"
+        }
+    }
+}
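In the metadata above, the SQLite benchmarks point at a local `db_folder`, while the PostgreSQL and MySQL entries name a `connection_string_env_var`. A pre-flight check like the following could report which benchmarks cannot run in the current environment. It is a sketch under assumptions: `unconfigured_benchmarks` is a hypothetical helper, the dictionary is an abridged copy of three entries from `benchmarks.json`, and the connection string is a made-up placeholder.

```python
import os

# Abridged copy of three entries from benchmarks.json (only db_engine kept).
BENCHMARKS = {
    "bird_mini_dev_sqlite": {
        "db_engine": {
            "db_type": "sqlite",
            "db_folder": "benchmarks/dbs/bird/dev_databases",
        }
    },
    "bird_mini_dev_postgres": {
        "db_engine": {
            "db_type": "postgres",
            "schema_name": "public",
            "connection_string_env_var": "POSTGRES_CONNECTION_STRING",
        }
    },
    "beaver": {
        "db_engine": {
            "db_type": "mysql",
            "connection_string_env_var": "MYSQL_CONNECTION_STRING",
        }
    },
}


def unconfigured_benchmarks(benchmarks, environ=None):
    """Map benchmark IDs to the connection env var they need but that is unset."""
    environ = os.environ if environ is None else environ
    missing = {}
    for benchmark_id, info in benchmarks.items():
        env_var = info.get("db_engine", {}).get("connection_string_env_var")
        if env_var and not environ.get(env_var):
            missing[benchmark_id] = env_var
    return missing


# Simulated environment in which only the MySQL connection string is set.
fake_env = {"MYSQL_CONNECTION_STRING": "mysql://user@localhost/beaver"}
result = unconfigured_benchmarks(BENCHMARKS, environ=fake_env)
print(result)
```

Passing `environ` explicitly keeps the check easy to test; omitting it falls back to the real process environment.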