@@ -64,7 +64,18 @@ brew install uv
6464
6565## Installation
6666
67- ### Option 1: Using UV (Recommended - Fast! ⚡)
67+ ### From PyPI
68+
69+ ```bash
70+ pip install text2sql-eval-toolkit
71+
72+ # Optional: install database-specific extras
73+ pip install "text2sql-eval-toolkit[mysql,presto,db2]"
74+ ```
75+
76+ ### From source (recommended for development)
77+
78+ Using UV:
6879
6980```bash
7081# Clone the repository
@@ -82,7 +93,7 @@ uv pip install -e .
8293uv pip install -e ".[mysql,presto,db2]"
8394```
8495
85- ### Option 2: Using pip/conda (Traditional)
96+ Using pip/conda:
8697
8798```bash
8899# Create conda environment (or use venv)
@@ -107,6 +118,83 @@ The toolkit comes with pre-defined public benchmarks including BIRD-SQL, Spider,
107118
108119## Usage
109120
121+ ### Using the library API
122+
123+ After installation, you can import the toolkit and use both the low-level and the high-level evaluation APIs:
124+
125+ **Evaluate a single prediction record in memory**
126+
127+ ```python
128+ from text2sql_eval_toolkit import evaluate_prediction, parse_dataframe, get_gt_sqls
129+
130+ # record comes from a benchmark JSON entry, prediction from your model
131+ record = {
132+     "id": "q1",
133+     "sql": "SELECT * FROM customers",
134+     "gt_df": some_serialized_dataframe,  # JSON in pandas orient='split' format
135+ }
136+ prediction = {
137+     "predicted_sql": "SELECT * FROM customers",
138+     "predicted_df": some_serialized_dataframe,  # same format
139+ }
140+
141+ result = evaluate_prediction(record, prediction)
142+ print(result["subset_non_empty_execution_accuracy"])
143+ ```
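
The example above leaves `some_serialized_dataframe` undefined. As a minimal sketch, one way to produce JSON in pandas `orient='split'` format is `DataFrame.to_json`. The sample table and variable names below are illustrative only, and whether the toolkit expects the JSON string or the parsed dictionary is an assumption to check against its docstrings.

```python
import io
import json

import pandas as pd

# Illustrative result table; in practice this would come from executing the SQL.
df = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})

# The 'split' layout stores columns, index, and row data as separate keys.
serialized = df.to_json(orient="split")
payload = json.loads(serialized)
print(sorted(payload.keys()))

# Round-trip back into a DataFrame to confirm the layout is lossless here.
restored = pd.read_json(io.StringIO(serialized), orient="split")
print(restored)
```

Either `serialized` (a string) or `payload` (a dictionary) could then stand in for `some_serialized_dataframe`, depending on what the toolkit accepts.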
144+
145+ **Evaluate a predictions JSON file**
146+
147+ ```python
148+ from text2sql_eval_toolkit import evaluate_predictions
149+
150+ data, summary_df = evaluate_predictions(
151+     input_file="data/results/my-benchmark-predictions.json",
152+ )
153+ print(summary_df.head())
154+ ```
155+
156+ **Run evaluation for a known benchmark ID**
157+
158+ ```python
159+ from text2sql_eval_toolkit import get_available_benchmarks, run_evaluation
160+
161+ print(get_available_benchmarks())  # uses packaged benchmark metadata
162+ data, summary_df = run_evaluation("bird_mini_dev_sqlite")
163+ print(summary_df[["subset_non_empty_execution_accuracy_avg"]])
164+ ```
165+
166+ **Run SQL execution for a benchmark before evaluation**
167+
168+ ```python
169+ from text2sql_eval_toolkit import run_execution
170+
171+ # Requires appropriate DB connection env vars (e.g., POSTGRES_CONNECTION_STRING)
172+ run_execution("bird_mini_dev_postgres")
173+ ```
174+
175+ **Use the inference pipelines**
176+
177+ ```python
178+ from text2sql_eval_toolkit import LLMSQLGenerationPipeline, AgenticSQLGenerationPipeline
179+
180+ pipeline = LLMSQLGenerationPipeline()
181+ pipeline.run_pipeline(
182+     benchmark_id="bird_mini_dev_sqlite",
183+     model_name="wxai:ibm/granite-34b-code-instruct",
184+     model_parameters={"max_new_tokens": 512},
185+ )
186+
187+ agentic = AgenticSQLGenerationPipeline()
188+ agentic.run_pipeline(
189+     benchmark_id="bird_mini_dev_sqlite",
190+     model_name="wxai:ibm/granite-34b-code-instruct",
191+     model_parameters={"max_new_tokens": 512},
192+     max_attempts=3,
193+ )
194+ ```
195+
196+ See the docstrings of the exported functions and classes in `text2sql_eval_toolkit.__init__` for the full list of public APIs.
197+
110198### Running Experiments
111199
112200**Single Benchmark:**