OpenSQL

Source code for the paper "OpenSQL: Boosting Text-to-SQL with Schema-Aware Learning and Reasoning-Enhanced SQL Generation".

Overview

This repository contains the following code:

  1. preprocessing scripts (./preprocessing)
  2. training scripts (./training)
  3. inference/evaluation scripts (./evaluation)
  4. data (data.zip)

Environment Setup

Since we use vLLM for inference, we recommend separating the environments for training and inference.

  • For Training
conda create -n train_env python=3.12
conda activate train_env
pip install torch==2.7.1 deepspeed==0.17.2 accelerate==1.9.0 trl==0.19.0 transformers==4.54.1 datasets==3.6.0 loguru tqdm modelscope
conda install ninja
MAX_JOBS=64 pip install flash-attn --no-build-isolation
  • For Evaluation
conda create -n eval_env python=3.12
conda activate eval_env

pip install vllm==0.9.1 faiss-gpu-cu12==1.11.0 sentence-transformers==5.0.0 transformers==4.52.4 tqdm loguru func_timeout

Project Structure

.
├── data.zip                      # Compressed datasets.
├── evaluation
│   ├── generation.py             # Generates SQL candidates.
│   ├── global_schema_linking.py  # Performs global schema linking.
│   ├── local_schema_linking.py   # Performs local schema linking.
│   ├── run_Dr_Spider.sh          # Evaluation script for Dr.Spider.
│   ├── select.py                 # Selects the best SQL output.
│   ├── spiders_evalute.py        # Official evaluation script for Spider.
│   ├── start_pipeline.sh         # Runs the end-to-end evaluation pipeline.
│   └── utils
│       ├── embed_values.py       # Utility for value-based retrieval.
│       └── ir_to_schema.py       # Decodes IR to a linearized schema.
├── preprocessing
│   ├── IR                        # Directory for pre-generated IR files.
│   │   ├── BIRD_dev_ir.json      # IR for the BIRD dev set.
│   │   ├── BIRD_train_ir.json    # IR for the BIRD train set.
│   │   ├── ...                   # Other dataset-specific IR files.
│   │   └── Spider_test_ir.json   # IR for the Spider test set.
│   ├── schema_and_IR
│   │   └── to_ir.py              # Converts database schema to IR format.
│   ├── schema_linking_dpo
│   │   └── make_dpo_data.py      # Generates preference data for DPO.
│   ├── schema_linking_sft
│   │   └── make_schema_linking_label.py # Creates labels for schema linking SFT.
│   └── value_index
│       └── embed_values.py       # Builds FAISS index for database values.
├── README.md                     # The project's README file.
└── training
    ├── DPO.json                  # Config/data for DPO.
    ├── generation
    │   ├── launch_generation.sh  # Launches SFT for the generation model.
    │   └── sft_with_control.py   # SFT script for the SQL generator.
    ├── schema_linking
    │   ├── DPO
    │   │   ├── launch_global_dpo.sh # Launches DPO for global schema linker.
    │   │   └── schema_dpo.py      # DPO training script for schema linking.
    │   ├── Local
    │   │   ├── launch_local.sh    # Launches training for local schema linker.
    │   │   └── schema_local.py    # Training script for local schema linker.
    │   └── SFT
    │       ├── launch_global_sft.sh # Launches SFT for global schema linker.
    │       └── schema_sft.py      # SFT training script for schema linking.
    ├── selector
    │   ├── launch_select.sh      # Launches training for the selector model.
    │   └── train_selector.py     # Training script for the selector model.
    └── SFT.json                  # Config/data for SFT.

User Guide

1. Preprocessing

We provide a series of preprocessing scripts, covering schema processing and data augmentation.

Intermediate Representation (IR). We use an IR to generate linearized schema representations. Each IR contains table descriptions, column descriptions, and foreign-key relationships.

The IR files in our experiments are in ./preprocessing/IR.

We use two files to generate and decode IR:

  • preprocessing/schema_and_IR/to_ir.py: given a database, produces an IR file. (This IR can then be decoded to produce sub-schema representations.)
  • evaluation/utils/ir_to_schema.py: given an IR and a Python dictionary, decodes the IR into a linearized schema representation.
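The decoding step can be sketched as follows. Note that the JSON key names (`tables`, `columns`, `foreign_keys`, `from`, `to`) and the function name are illustrative assumptions, not the repository's actual format; see evaluation/utils/ir_to_schema.py for the real implementation.

```python
# Sketch of decoding an IR dict into a linearized schema string.
# Key names below are assumptions for illustration only.

def ir_to_linearized_schema(ir, keep_columns=None):
    """Render an IR as 'table(col1, col2, ...)' lines plus foreign keys.

    keep_columns optionally restricts the output to a sub-schema,
    mirroring how a decoded IR can produce sub-schema representations.
    """
    lines = []
    for table in ir["tables"]:
        name = table["name"]
        cols = [c["name"] for c in table["columns"]]
        if keep_columns is not None:
            # Keep only the columns selected for this table's sub-schema.
            cols = [c for c in cols if c in keep_columns.get(name, [])]
            if not cols:
                continue
        lines.append(f"{name}({', '.join(cols)})")
    for fk in ir.get("foreign_keys", []):
        lines.append(f"-- {fk['from']} references {fk['to']}")
    return "\n".join(lines)

ir = {
    "tables": [
        {"name": "singer", "columns": [{"name": "singer_id"}, {"name": "name"}]},
        {"name": "concert", "columns": [{"name": "concert_id"}, {"name": "singer_id"}]},
    ],
    "foreign_keys": [{"from": "concert.singer_id", "to": "singer.singer_id"}],
}
print(ir_to_linearized_schema(ir))
```

Passing a `keep_columns` dictionary restricts the output to a sub-schema, e.g. `ir_to_linearized_schema(ir, {"singer": ["name"]})`.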

Value Index. We use preprocessing/value_index/embed_values.py to embed database values with FAISS and retrieve similar values at run time.

The ColumnVectorIndex class contains the code for extracting similar values.

The embed_values_in_db function generates the vector indexes of a database.
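The retrieval flow can be illustrated with the sketch below. Both pieces are stand-ins, not the repository's implementation: a toy character-trigram hash vector replaces the real sentence-encoder embeddings, and a brute-force numpy search replaces the FAISS index.

```python
import numpy as np

# Toy stand-in embedding: hash character trigrams into a fixed-size
# bag-of-ngrams vector, then L2-normalize. The real pipeline uses a
# sentence encoder; this only illustrates the retrieval flow.
def embed(text, dim=256):
    vec = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class ColumnVectorIndex:
    """Brute-force analogue of a FAISS-backed per-column value index."""

    def __init__(self, values):
        self.values = list(values)
        # One embedding row per stored database value.
        self.matrix = np.stack([embed(v) for v in self.values])

    def search(self, query, k=3):
        # Cosine similarity (vectors are normalized), highest first.
        sims = self.matrix @ embed(query)
        order = np.argsort(-sims)[:k]
        return [self.values[i] for i in order]

index = ColumnVectorIndex(["United States", "United Kingdom", "France", "Germany"])
print(index.search("united state", k=2))
```

At run time, a literal mentioned in the question (possibly misspelled, as in "united state" above) is embedded the same way and matched against the stored column values.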

2. Training

The training scripts are contained in ./training.

Specifically, the schema_linking, generation, and selector directories contain the training scripts for all our models. Each script includes detailed comments to guide you in modifying the environment variables. After configuring the variables and file paths, you can launch training with command-line parameters, e.g. launch_global_sft.sh 1e-5 3.

3. Evaluation Pipeline

We formulate OpenSQL as a sequence of steps to generate SQL:

  • global schema linking
  • local schema linking
  • SQL candidate generation
  • pairwise selection to choose the final output

./evaluation/start_pipeline.sh is used to start the evaluation pipeline.

TODO

  • release training data, training scripts, and preprocessing scripts
  • release the fine-tuned models
