Merged
33 changes: 12 additions & 21 deletions README.md
@@ -42,31 +42,27 @@ The package doesn't have the dataset, it is stored on our [HuggingFace page](htt

## Latest News 📣

* [2025/12] Evaluation class converted to a function; see the [new `evaluate(...)` function](./llmsql/evaluation/evaluate.py#evaluate)
* [2026/03] Added support for API inference, for now only for OpenAI-compatible APIs; see the [`inference_api()` function](./llmsql/inference/inference_api.py#inference_api)

* A new version of the page is live at [`https://llmsql.github.io/llmsql-benchmark/`](https://llmsql.github.io/llmsql-benchmark/)
* [2026/03] The page now contains the first version of the [leaderboard](https://llmsql.github.io/llmsql-benchmark/#:~:text=%F0%9F%93%8A%20Leaderboard%20%E2%80%94%20Execution%20Accuracy%20%28EX)!

* vLLM inference now supports chat templates, see [`inference_vllm(...)`](./llmsql/inference/inference_vllm.py#inference_vllm).
* Transformers inference now supports custom chat templates via the `chat_template` argument, see [`inference_transformers(...)`](./llmsql/inference/inference_transformers.py#inference_transformers)
* [2026/02] The new LLMSQL 2.0 version is out now! See the [dataset](https://huggingface.co/datasets/llmsql-bench/llmsql-2.0). The support is already added with the `version` parameter to each `inference` function.

* More stable and deterministic inference with the [`inference_vllm(...)`](./llmsql/inference/inference_vllm.py#inference_vllm) function, achieved by setting [certain environment variables](./llmsql/inference/inference_vllm.py)
* [2025/12] Evaluation class converted to a function; see the [new `evaluate(...)` function](./llmsql/evaluation/evaluate.py#evaluate)

* `padding_side` argument added to the [`inference_transformers(...)`](./llmsql/inference/inference_transformers.py#inference_transformers) function, defaulting to `left`.


## Usage Recommendations

Modern LLMs are already strong at **producing SQL queries without finetuning**.
Modern LLMs are already strong at producing SQL queries without finetuning.
We therefore recommend that most users:

1. **Run inference** directly on the full benchmark:
model_or_model_name_or_path="Qwen/Qwen2.5-1.5B-Instruct",
output_file="path_to_your_outputs.jsonl",
- Use [`llmsql.inference_transformers`](./llmsql/inference/inference_transformers.py) (the function for transformers inference) for generation of SQL predictions with your model. If you want to do vllm based inference, use [`llmsql.inference_vllm`](./llmsql/inference/inference_vllm.py). Works both with HF model id, e.g. `Qwen/Qwen2.5-1.5B-Instruct` and model instance passed directly, e.g. `inference_transformers(model_or_model_name_or_path=model, ...)`
- Use [`llmsql.inference_transformers`](./llmsql/inference/inference_transformers.py) (the function for transformers inference) to generate SQL predictions with your model. If you want vLLM-based inference, use [`llmsql.inference_vllm`](./llmsql/inference/inference_vllm.py). Works both with an HF model id, e.g. `Qwen/Qwen2.5-1.5B-Instruct`, and a model instance passed directly, e.g. `inference_transformers(model_or_model_name_or_path=model, ...)`. API inference is also supported, see [`inference_api()`](./llmsql/inference/inference_api.py#inference_api)
- Evaluate results against the benchmark with the [`llmsql.evaluate`](./llmsql/evaluation/evaluator.py) function.

2. **Optional finetuning**:
- For research or domain adaptation, we provide finetuning version for HF models. Use [Finetune Ready](https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready) dataset from HuggingFace.
- For research or domain adaptation, we provide finetuning version for HF models. Use [Finetune Ready](https://huggingface.co/collections/llmsql-bench/fine-tune-ready-versions-of-the-llmsql-benchmark) datasets from HuggingFace.
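The inference functions write predictions as JSON Lines, one record per question — the same `{"question_id": ..., "completion": ...}` format shown in `examples/test_output_api.jsonl`. A minimal, self-contained sketch of writing and reading that format (the file name and records here are illustrative, not produced by the library):

```python
import json
from pathlib import Path

# Illustrative records in the benchmark's prediction format (hypothetical data).
records = [
    {"question_id": 1, "completion": 'SELECT "Nationality" FROM "Table" WHERE "Player" = \'Terrence Ross\';'},
    {"question_id": 5, "completion": 'SELECT "Circuit" FROM "Table" WHERE "Round" = \'Assen\';'},
]

path = Path("preds_demo.jsonl")
# One JSON object per line, as the inference functions produce.
path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")

# Read the file back keyed by question_id, e.g. for spot-checking before evaluation.
preds = {}
with path.open(encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        preds[row["question_id"]] = row["completion"]

print(sorted(preds))  # [1, 5]
```

This round-trip is handy for sanity-checking predictions before handing the file to the evaluator.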

> [!Tip]
> You can find additional manuals in the README files of each folder ([Inference Readme](./llmsql/inference/README.md), [Evaluation Readme](./llmsql/evaluation/README.md))
@@ -80,7 +76,7 @@ We therefore recommend that most users:
```

llmsql/
├── evaluation/ # Scripts for downloading DB + evaluating predictions
├── evaluation/ # Scripts for evaluation
└── inference/ # Generate SQL queries with your LLM
```

@@ -159,10 +155,12 @@ print(report)
```


For more examples, check the [examples folder](./examples/).

## Prompt Template

The prompt defines explicit constraints on the generated output.
The model is instructed to output only a valid SQL `SELECT` query, to use a fixed table name (`"Table"`) **(which will be replaced with the actual table name during evaluation)**, to quote all table and column names, and to restrict generation to the specified SQL functions, condition operators, and keywords.
The prompt defines explicit constraints on the generated output.
The model is instructed to output only a valid SQL `SELECT` query, to use a fixed table name (`"Table"`) **(which will be replaced with the actual table name during evaluation)**, to quote all table and column names, and to restrict generation to the specified SQL functions, condition operators, and keywords.
The full prompt specification is provided in the prompt template.

Below is an example of the **5-shot prompt template** used during inference.
@@ -224,13 +222,6 @@ Implementations of 0-shot, 1-shot, and 5-shot prompt templates are available her
👉 [link-to-file](./llmsql/prompts/prompts.py)
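As described above, generated queries target the fixed placeholder name `"Table"`, which is swapped for the real table name during evaluation. A standalone sketch of that substitution step — the function name and the example table name are hypothetical, and the benchmark's evaluator implements its own version:

```python
def substitute_table(completion: str, actual_table: str) -> str:
    """Replace the fixed '"Table"' placeholder with the real, quoted table name.

    Illustrative only: assumes the model followed the prompt's constraints
    (a single SELECT query with all identifiers double-quoted).
    """
    if not completion.lstrip().upper().startswith("SELECT"):
        raise ValueError("expected a single SQL SELECT query")
    return completion.replace('"Table"', f'"{actual_table}"')

sql = 'SELECT "Nationality" FROM "Table" WHERE "Player" = \'Terrence Ross\';'
print(substitute_table(sql, "table_1_10015132_16"))
# SELECT "Nationality" FROM "table_1_10015132_16" WHERE "Player" = 'Terrence Ross';
```

Because identifiers are always double-quoted, a plain string replacement of `"Table"` is unambiguous.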



## Suggested Workflow

* **Primary**: Run inference on all questions with vllm or transformers → Evaluate with `evaluate()`.
* **Secondary (optional)**: Fine-tune on `train/val` → Test on `test_questions.jsonl`. You can find the datasets here [HF Finetune Ready](https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready).


## Contributing

Check out our [open issues](https://github.com/LLMSQL/llmsql-benchmark/issues), fork this repo and feel free to submit pull requests!
6 changes: 3 additions & 3 deletions docs/_templates/index.html
@@ -113,15 +113,15 @@ <h3>1️⃣ Installation</h3>
<h3>2️⃣ Inference from CLI</h3>

<p><strong>vLLM Backend (Recommended)</strong></p>
<pre><code>llmsql inference --method vllm \
<pre><code>llmsql inference vllm \
--model-name Qwen/Qwen2.5-1.5B-Instruct \
--output-file outputs/preds.jsonl \
--batch-size 8 \
--num_fewshots 5 \
--temperature 0.0</code></pre>

<p><strong>Transformers Backend</strong></p>
<pre><code>llmsql inference --method transformers \
<pre><code>llmsql inference transformers \
--model-or-model-name-or-path Qwen/Qwen2.5-1.5B-Instruct \
--output-file outputs/preds.jsonl \
--batch-size 8 \
@@ -163,7 +163,7 @@ <h2 id="citation">📄 Citation</h2>
<pre><code>@inproceedings{llmsql_bench,
title={LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL},
author={Pihulski, Dzmitry and Charchut, Karol and Novogrodskaia, Viktoria and Koco{'n}, Jan},
booktitle={2025 IEEE ICувцDMW},
booktitle={2025 IEEE International Conference on Data Mining Workshops (ICDMW)},
year={2025},
organization={IEEE}
}
6 changes: 6 additions & 0 deletions docs/docs/inference.rst
@@ -14,6 +14,12 @@ Inference API Reference

---

.. automodule:: llmsql.inference.inference_api
:members:
:undoc-members:

---

.. raw:: html

<div style="text-align:center; margin-top:2rem; color:#666;">
35 changes: 35 additions & 0 deletions docs/docs/usage.rst
@@ -77,6 +77,41 @@ Using vllm backend.
print(report)


Using an OpenAI-compatible API.

.. code-block:: python

from llmsql import inference_api
from dotenv import load_dotenv
import os
load_dotenv()

# Run inference (will take some time)
results = inference_api(
model_name="gpt-5-mini",
base_url="https://api.openai.com/v1/",
api_key=os.environ["OPENAI_API_KEY"],
api_kwargs={
"response_format": {
"type": "text"
},
"verbosity": "medium",
"reasoning_effort": "medium",
"store": False
},
requests_per_minute=100,
output_file="test_output_api.jsonl",
limit=50,
num_fewshots=5,
seed=42,
version="2.0"
)

# Evaluate the results
evaluator = LLMSQLEvaluator()
report = evaluator.evaluate(outputs_path="test_output_api.jsonl")
print(report)
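The `requests_per_minute` argument throttles the client-side request rate. A minimal sketch of what such a limiter implies — roughly one request every `60 / requests_per_minute` seconds. This is an illustration of the idea, not `inference_api`'s actual implementation:

```python
import time

class MinuteRateLimiter:
    """Spacing-based limiter: allows at most `requests_per_minute` calls
    per minute by enforcing a minimum interval between consecutive calls."""

    def __init__(self, requests_per_minute: int):
        self.min_interval = 60.0 / requests_per_minute
        self._last = float("-inf")  # first call never waits

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

# With requests_per_minute=6000 the minimum spacing is 10 ms,
# so three calls take at least ~20 ms in total.
limiter = MinuteRateLimiter(6000)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # in real use, the API request would follow each wait
elapsed = time.monotonic() - start
```

Spacing requests evenly like this keeps bursts under provider rate limits without needing a full token-bucket implementation.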

---

.. raw:: html
132 changes: 132 additions & 0 deletions examples/inference_api.ipynb
@@ -0,0 +1,132 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "5409b21a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from llmsql import inference_api\n",
"from dotenv import load_dotenv\n",
"import os\n",
"load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "581e9c25",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-03-04 08:10:34,504 [INFO] llmsql-bench: Removing existing path: llmsql_workdir/questions.jsonl\n",
"2026-03-04 08:10:34,506 [INFO] llmsql-bench: Downloading questions.jsonl from Hugging Face Hub...\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a71443d8f32840838ba484eadf26d9d0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"questions.jsonl: 0%| | 0.00/18.3M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-03-04 08:10:35,608 [INFO] llmsql-bench: Downloaded questions.jsonl to: llmsql_workdir/questions.jsonl\n",
"2026-03-04 08:10:35,608 [INFO] llmsql-bench: Removing existing path: llmsql_workdir/tables.jsonl\n",
"2026-03-04 08:10:35,611 [INFO] llmsql-bench: Downloading tables.jsonl from Hugging Face Hub...\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "62ec9ecc8d8b48f7a835019688ee1894",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tables.jsonl: 0%| | 0.00/45.3M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-03-04 08:10:36,293 [INFO] llmsql-bench: Downloaded tables.jsonl to: llmsql_workdir/tables.jsonl\n",
"Generating: 100%|██████████| 50/50 [00:33<00:00, 1.48it/s]\n",
"2026-03-04 08:11:11,394 [INFO] llmsql-bench: Generation completed. 50 results saved to test_output_api.jsonl\n"
]
}
],
"source": [
"results = inference_api(\n",
" model_name=\"gpt-5-mini\",\n",
" base_url=\"https://api.openai.com/v1/\",\n",
" api_key=os.environ[\"OPENAI_API_KEY\"],\n",
" api_kwargs={\n",
" \"response_format\": {\n",
" \"type\": \"text\"\n",
" },\n",
" \"verbosity\": \"medium\",\n",
" \"reasoning_effort\": \"medium\",\n",
" \"store\": False\n",
" },\n",
" requests_per_minute=100,\n",
" output_file=\"test_output_api.jsonl\",\n",
" limit=50,\n",
" num_fewshots = 5,\n",
" seed=42,\n",
" version=\"2.0\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llmsql-benchmark-3.11 (3.11.13)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
50 changes: 50 additions & 0 deletions examples/test_output_api.jsonl
@@ -0,0 +1,50 @@
{"question_id": 15, "completion": "SELECT \"Launched\" FROM \"Table\" WHERE \"Laid down\" = \"September 1, 1964\";"}
{"question_id": 48, "completion": "SELECT \"Tone (Latvian notation: /~/ - level, /^/ - broken)\" FROM \"Table\" WHERE \"Translation\" = \"Gen.Sing. plague\";"}
{"question_id": 4, "completion": "SELECT COUNT(\"School/Club Team\") FROM \"Table\" WHERE \"Player\" = \"Jalen Rose\";"}
{"question_id": 21, "completion": "SELECT \"US Air Force equivalent\" FROM \"Table\" WHERE \"Commonwealth equivalent\" = \"Pilot Officer\";"}
{"question_id": 31, "completion": "SELECT COUNT(\"Total W–L\") FROM \"Table\" WHERE \"Doubles W–L\" = \"11–11\";"}
{"question_id": 42, "completion": "SELECT \"BTs retail price (regulated)\" FROM \"Table\" WHERE \"Tariff code\" = \"ff0 PRS\";"}
{"question_id": 7, "completion": "SELECT \"Date\" FROM \"Table\" WHERE \"Circuit\" = \"Misano\";"}
{"question_id": 28, "completion": "SELECT \"Rank Each wrestlers total number of days as champion are ranked highest to lowest; wrestlers with the same number mean that they are tied for that certain rank.\" FROM \"Table\" WHERE \"Wrestler\" = \"Go Shiozaki\";"}
{"question_id": 36, "completion": "SELECT \"Frequency\" FROM \"Table\" WHERE \"Market/Rank\" = \"Burlington - Plattsburgh , Vermont - New York /143\";"}
{"question_id": 23, "completion": "SELECT \"Rank in Spanish\" FROM \"Table\" WHERE \"Rank in English\" = \"Major\";"}
{"question_id": 11, "completion": "SELECT COUNT(DISTINCT \"Nationality\") FROM \"Table\" WHERE \"NHL team\" = \"New Jersey Devils\";"}
{"question_id": 47, "completion": "SELECT \"BTs retail price (regulated)\" FROM \"Table\" WHERE \"Tariff code\" = \"g10\";"}
{"question_id": 12, "completion": "SELECT \"Pick\" FROM \"Table\" WHERE \"Player\" = \"Dorain Anneck\";"}
{"question_id": 16, "completion": "SELECT \"#\" FROM \"Table\" WHERE \"Commissioned\" = \"December 18, 1965\";"}
{"question_id": 27, "completion": "SELECT \"Combined days\" FROM \"Table\" WHERE \"Wrestler\" = \"Go Shiozaki\";"}
{"question_id": 32, "completion": "SELECT COUNT(\"Singles W–L\") FROM \"Table\" WHERE \"Doubles W–L\" = \"11–14\";"}
{"question_id": 22, "completion": "SELECT \"Commonwealth equivalent\" FROM \"Table\" WHERE \"US Air Force equivalent\" = \"Major General\";"}
{"question_id": 43, "completion": "SELECT \"Approx premium\" FROM \"Table\" WHERE \"Tariff code\" = \"g9\";"}
{"question_id": 49, "completion": "SELECT MIN(\"Radius (R ☉ )\") FROM \"Table\";"}
{"question_id": 34, "completion": "SELECT MAX(\"Ties played\") FROM \"Table\" WHERE \"Player\" = \"Josip Palada Category:Articles with hCards\";"}
{"question_id": 39, "completion": "SELECT \"Format\" FROM \"Table\" WHERE \"Branding\" = \"1290 WKBK W281AU 104.1\";"}
{"question_id": 6, "completion": "SELECT \"No\" FROM \"Table\" WHERE \"Race winner\" = \"Kevin Curtain\";"}
{"question_id": 18, "completion": "SELECT \"Laid down\" FROM \"Table\" WHERE \"Commissioned\" = \"October 29, 1965\";"}
{"question_id": 13, "completion": "SELECT \"Nationality\" FROM \"Table\" WHERE \"NHL team\" = \"Vancouver Canucks\";"}
{"question_id": 38, "completion": "SELECT \"Branding\" FROM \"Table\" WHERE \"Calls\" = \"WRKO\";"}
{"question_id": 2, "completion": "SELECT \"School/Club Team\" FROM \"Table\" WHERE \"Years in Toronto\" = \"1995-96\";"}
{"question_id": 29, "completion": "SELECT \"Province\" FROM \"Table\" WHERE \"Electorate\" = \"Grey and Bell\";"}
{"question_id": 44, "completion": "SELECT COUNT(\"Tariff code\") FROM \"Table\" WHERE \"BTs retail price (regulated)\" = \"2p/min or inclusive\";"}
{"question_id": 26, "completion": "SELECT \"Rank Each wrestlers total number of days as champion are ranked highest to lowest; wrestlers with the same number mean that they are tied for that certain rank.\" FROM \"Table\" WHERE \"Wrestler\" = \"Bryan Danielson\";"}
{"question_id": 33, "completion": "SELECT \"Total W–L\" FROM \"Table\" WHERE \"Player\" = \"Boro Jovanović Category:Articles with hCards\";"}
{"question_id": 37, "completion": "SELECT \"Branding\" FROM \"Table\" WHERE \"Group owner\" = \"Qantam of Cape Cod, LLC\";"}
{"question_id": 8, "completion": "SELECT COUNT(DISTINCT \"Position\") FROM \"Table\" WHERE \"College/junior/club team\" = \"Sherbrooke Faucons (QMJHL)\";"}
{"question_id": 3, "completion": "SELECT \"School/Club Team\" FROM \"Table\" WHERE \"Years in Toronto\" = \"2003-06\";"}
{"question_id": 24, "completion": "SELECT \"Wrestler\" FROM \"Table\" WHERE \"# of reigns\" = 2;"}
{"question_id": 14, "completion": "SELECT \"Pick\" FROM \"Table\" WHERE \"College/junior/club team\" = \"Springfield Olympics (NEJHL)\";"}
{"question_id": 45, "completion": "SELECT COUNT(\"Tariff code\") FROM \"Table\" WHERE \"BTs retail price (regulated)\" = \"2.553p/min\";"}
{"question_id": 30, "completion": "SELECT \"Province\" FROM \"Table\" WHERE \"Electorate\" = \"Bay of Islands\";"}
{"question_id": 25, "completion": "SELECT MIN(\"# of reigns\") FROM \"Table\";"}
{"question_id": 19, "completion": "SELECT \"Commonwealth equivalent\" FROM \"Table\" WHERE \"Rank in Spanish\" = \"Coronel\";"}
{"question_id": 40, "completion": "SELECT \"Market/Rank\" FROM \"Table\" WHERE \"Calls\" = \"WCRN\";"}
{"question_id": 35, "completion": "SELECT SUM(\"Ties played\") FROM \"Table\" WHERE \"Total W–L\" = \"38–24\";"}
{"question_id": 50, "completion": "SELECT \"Spectral type\" FROM \"Table\" WHERE \"Star (Pismis24-#)\" = \"1SW\";"}
{"question_id": 20, "completion": "SELECT \"Rank in Spanish\" FROM \"Table\" WHERE \"Rank in English\" = \"Group Captain\";"}
{"question_id": 1, "completion": "SELECT \"Nationality\" FROM \"Table\" WHERE \"Player\" = \"Terrence Ross\";"}
{"question_id": 5, "completion": "SELECT \"Circuit\" FROM \"Table\" WHERE \"Round\" = \"Assen\";"}
{"question_id": 10, "completion": "SELECT COUNT(DISTINCT \"College/junior/club team\") FROM \"Table\" WHERE \"NHL team\" = \"Washington Capitals\";"}
{"question_id": 46, "completion": "SELECT \"Prefixes\" FROM \"Table\" WHERE \"Scheme\" = \"Pence per minute, fixed at all times\" AND \"Approx premium\" = \"3p/min\";"}
{"question_id": 17, "completion": "SELECT \"#\" FROM \"Table\" WHERE \"Commissioned\" = \"September 30, 1967\";"}
{"question_id": 9, "completion": "SELECT \"Nationality\" FROM \"Table\" WHERE \"College/junior/club team\" = \"Thunder Bay Flyers (USHL)\";"}
{"question_id": 41, "completion": "SELECT \"Frequency\" FROM \"Table\" WHERE \"Calls\" = \"WEGP\";"}
7 changes: 6 additions & 1 deletion llmsql/__init__.py
@@ -26,7 +26,12 @@ def __getattr__(name: str):  # type: ignore
from .inference.inference_transformers import inference_transformers

return inference_transformers
elif name == "inference_api":
from .inference.inference_api import inference_api

return inference_api

raise AttributeError(f"module {__name__} has no attribute {name!r}")


__all__ = ["evaluate", "inference_vllm", "inference_transformers"]
__all__ = ["evaluate", "inference_vllm", "inference_transformers", "inference_api"]
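The `__getattr__` hook in this diff is the PEP 562 module-level lazy-import pattern: heavy backends (transformers, vllm, the API client) are imported only when the corresponding name is first accessed, so a bare `import llmsql` stays cheap. A standalone sketch of the mechanism, using a toy attribute in place of a real submodule import:

```python
import types

def make_lazy_module(name: str) -> types.ModuleType:
    """Build a module whose missing attributes are resolved on first access,
    mirroring the PEP 562 __getattr__ hook used in llmsql/__init__.py."""
    mod = types.ModuleType(name)

    def __getattr__(attr: str):
        if attr == "heavy_value":
            # Stand-in for a deferred `from .inference.inference_api import ...`.
            return 42
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # Python consults the module's __getattr__ only after normal lookup fails.
    mod.__getattr__ = __getattr__
    return mod

demo = make_lazy_module("demo")
print(demo.heavy_value)  # 42
```

Keeping `inference_api` in `__all__` while resolving it lazily means tab completion and `from llmsql import inference_api` both work without paying the import cost up front.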