OpenSQL

Source code for the paper "OpenSQL: Boosting Text-to-SQL with Schema-Aware Learning and Reasoning-Enhanced SQL Generation".

Overview

This repository contains the following code:

  1. preprocessing scripts (./preprocessing)
  2. training scripts (./training)
  3. inference/evaluation scripts (./evaluation)
  4. data (data.zip)

Environment Setup

Since we use vLLM for inference, we recommend separating the environments for training and inference.

  • For Training
conda create -n train_env python=3.12
conda activate train_env
pip install torch==2.7.1 deepspeed==0.17.2 accelerate==1.9.0 trl==0.19.0 transformers==4.54.1 datasets==3.6.0 loguru tqdm modelscope
conda install ninja
MAX_JOBS=64 pip install flash-attn --no-build-isolation
  • For Evaluation
conda create -n eval_env python=3.12
conda activate eval_env

pip install vllm==0.9.1 faiss-gpu-cu12==1.11.0 sentence-transformers==5.0.0 transformers==4.52.4 tqdm loguru func_timeout

Project Structure

.
├── data.zip                      # Compressed datasets.
├── evaluation
│   ├── generation.py             # Generates SQL candidates.
│   ├── global_schema_linking.py  # Performs global schema linking.
│   ├── local_schema_linking.py   # Performs local schema linking.
│   ├── run_Dr_Spider.sh          # Evaluation script for Dr.Spider.
│   ├── select.py                 # Selects the best SQL output.
│   ├── spiders_evalute.py        # Official evaluation script for Spider.
│   ├── start_pipeline.sh         # Runs the end-to-end evaluation pipeline.
│   └── utils
│       ├── embed_values.py       # Utility for value-based retrieval.
│       └── ir_to_schema.py       # Decodes IR to a linearized schema.
├── preprocessing
│   ├── IR                        # Directory for pre-generated IR files.
│   │   ├── BIRD_dev_ir.json      # IR for the BIRD dev set.
│   │   ├── BIRD_train_ir.json    # IR for the BIRD train set.
│   │   ├── ...                   # Other dataset-specific IR files.
│   │   └── Spider_test_ir.json   # IR for the Spider test set.
│   ├── schema_and_IR
│   │   └── to_ir.py              # Converts database schema to IR format.
│   ├── schema_linking_dpo
│   │   └── make_dpo_data.py      # Generates preference data for DPO.
│   ├── schema_linking_sft
│   │   └── make_schema_linking_label.py # Creates labels for schema linking SFT.
│   └── value_index
│       └── embed_values.py       # Builds FAISS index for database values.
├── README.md                     # The project's README file.
└── training
    ├── DPO.json                  # Config/data for DPO.
    ├── generation
    │   ├── launch_generation.sh  # Launches SFT for the generation model.
    │   └── sft_with_control.py   # SFT script for the SQL generator.
    ├── schema_linking
    │   ├── DPO
    │   │   ├── launch_global_dpo.sh # Launches DPO for global schema linker.
    │   │   └── schema_dpo.py      # DPO training script for schema linking.
    │   ├── Local
    │   │   ├── launch_local.sh    # Launches training for local schema linker.
    │   │   └── schema_local.py    # Training script for local schema linker.
    │   └── SFT
    │       ├── launch_global_sft.sh # Launches SFT for global schema linker.
    │       └── schema_sft.py      # SFT training script for schema linking.
    ├── selector
    │   ├── launch_select.sh      # Launches training for the selector model.
    │   └── train_selector.py     # Training script for the selector model.
    └── SFT.json                  # Config/data for SFT.

User Guide

1. Preprocessing

We provide a series of preprocessing scripts, covering schema processing and data augmentation.

Intermediate Representation (IR). We use an IR to generate linearized schema representations. Each IR contains table descriptions, column descriptions, and foreign-key relationships.

The IR files in our experiments are in ./preprocessing/IR.

We use two files to generate and decode IR:

  • preprocessing/schema_and_IR/to_ir.py: given a database, produces an IR file. (This IR can then be decoded to produce sub-schema representations.)
  • evaluation/utils/ir_to_schema.py: given an IR and a Python dictionary, decodes the IR into a linearized schema representation.
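The decoding step can be sketched as follows. Note that the JSON key names (`tables`, `columns`, `foreign_keys`, `from`, `to`) and the function name are illustrative assumptions, not the repository's actual format; see evaluation/utils/ir_to_schema.py for the real implementation.

```python
# Sketch of decoding an IR dict into a linearized schema string.
# Key names below are assumptions for illustration only.

def ir_to_linearized_schema(ir, keep_columns=None):
    """Render an IR as 'table(col1, col2, ...)' lines plus foreign keys.

    keep_columns optionally restricts the output to a sub-schema,
    mirroring how a decoded IR can produce sub-schema representations.
    """
    lines = []
    for table in ir["tables"]:
        name = table["name"]
        cols = [c["name"] for c in table["columns"]]
        if keep_columns is not None:
            # Keep only the columns selected for this table's sub-schema.
            cols = [c for c in cols if c in keep_columns.get(name, [])]
            if not cols:
                continue
        lines.append(f"{name}({', '.join(cols)})")
    for fk in ir.get("foreign_keys", []):
        lines.append(f"-- {fk['from']} references {fk['to']}")
    return "\n".join(lines)

ir = {
    "tables": [
        {"name": "singer", "columns": [{"name": "singer_id"}, {"name": "name"}]},
        {"name": "concert", "columns": [{"name": "concert_id"}, {"name": "singer_id"}]},
    ],
    "foreign_keys": [{"from": "concert.singer_id", "to": "singer.singer_id"}],
}
print(ir_to_linearized_schema(ir))
```

Passing a `keep_columns` dictionary restricts the output to a sub-schema, e.g. `ir_to_linearized_schema(ir, {"singer": ["name"]})`.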

Value Index. We use preprocessing/value_index/embed_values.py to embed database values with FAISS and retrieve similar values at run time.

The ColumnVectorIndex class contains the code for extracting similar values.

The embed_values_in_db function generates the vector indexes of a database.
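The retrieval flow can be illustrated with the sketch below. Both pieces are stand-ins, not the repository's implementation: a toy character-trigram hash vector replaces the real sentence-encoder embeddings, and a brute-force numpy search replaces the FAISS index.

```python
import numpy as np

# Toy stand-in embedding: hash character trigrams into a fixed-size
# bag-of-ngrams vector, then L2-normalize. The real pipeline uses a
# sentence encoder; this only illustrates the retrieval flow.
def embed(text, dim=256):
    vec = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class ColumnVectorIndex:
    """Brute-force analogue of a FAISS-backed per-column value index."""

    def __init__(self, values):
        self.values = list(values)
        # One embedding row per stored database value.
        self.matrix = np.stack([embed(v) for v in self.values])

    def search(self, query, k=3):
        # Cosine similarity (vectors are normalized), highest first.
        sims = self.matrix @ embed(query)
        order = np.argsort(-sims)[:k]
        return [self.values[i] for i in order]

index = ColumnVectorIndex(["United States", "United Kingdom", "France", "Germany"])
print(index.search("united state", k=2))
```

At run time, a literal mentioned in the question (possibly misspelled, as in "united state" above) is embedded the same way and matched against the stored column values.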

2. Training

The training scripts are contained in ./training.

Specifically, the schema_linking, generation, and selector directories contain the training scripts for all our models. Each script includes detailed comments to guide you in modifying the environment variables. After configuring the variables and file paths, you can launch training with command-line parameters, e.g. launch_global_sft.sh 1e-5 3.

3. Evaluation Pipeline

We formulate OpenSQL as a sequence of steps to generate SQL:

  • global schema linking
  • local schema linking
  • SQL candidate generation
  • pairwise selection to choose the final output

./evaluation/start_pipeline.sh is used to start the evaluation pipeline.

TODO

  • release training data, training scripts, and preprocessing scripts
  • release the fine-tuned models
