Text2SQL-Flow is a comprehensive SQL-aware data augmentation framework that systematically generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from limited seed data. This framework addresses the critical challenges in Text-to-SQL tasks where model performance is constrained by the scarcity, limited diversity, and structural simplicity of existing datasets.
Based on this framework, we introduce SQLFLOW, a high-quality dataset comprising 75,386 annotated examples. The dataset is derived from widely used benchmarks including Spider, BIRD, and EHRSQL, substantially extending their complexity frontiers.
Paper Link: https://arxiv.org/abs/2511.10192v3
GitHub Link: https://github.com/TechNomad-ds/Text2SQL-Flow
SQLFLOW is constructed through a pipeline that enhances both distribution-level and query-level diversity.
The dataset includes:
- Composition: 75,386 entries in total, including 30,502 from Spider-train, 33,163 from BIRD-train, and 11,721 from EHRSQL-train.
- Rich Annotations: Each sample consists of <Augmented SQL, Question, Schema, Prompt, CoT reasoning trace> (see the sketch after this list).
- Diverse Augmentation: Covers six dimensions, including Data Value Transformations, Query Structure Modifications, and Complexity Enhancements.
- Quality Control: All samples are verified via SQL executability and NL-SQL correspondence filters.
- Dual Application: Text2SQL-Flow supports both fine-tuning and prompt-based settings with masked alignment retrieval.
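For concreteness, a single SQLFLOW record can be pictured as follows. This is an illustrative sketch only; the field names and values below are hypothetical, not the dataset's exact schema:

```python
# A hypothetical SQLFLOW record (field names are illustrative, not the
# dataset's exact column names).
sample = {
    "augmented_sql": "SELECT name FROM singer WHERE age > 30 ORDER BY age DESC",
    "question": "List the names of singers older than 30, from oldest to youngest.",
    "schema": "CREATE TABLE singer (singer_id INT, name TEXT, age INT, country TEXT);",
    "prompt": "<schema plus the task instruction shown to the model>",
    "cot": (
        "The question filters singers by age > 30 and sorts from oldest to "
        "youngest, so we SELECT name with WHERE age > 30 and ORDER BY age DESC."
    ),
}
```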
We evaluate models fine-tuned on SQLFLOW across mainstream benchmarks. Under the same data scale, models trained on SQLFLOW consistently outperform those trained on other synthetic datasets like SynSQL.
Set up the environment and install DataFlow:

```bash
conda create -n text2sql_flow python=3.10
conda activate text2sql_flow
git clone https://github.com/OpenDCAI/DataFlow
cd DataFlow
pip install -e .
cd ..
mkdir run_dataflow
cd run_dataflow
dataflow init
```

Please set your API key via environment variables:

```bash
export DF_API_KEY="sk-xxxxx"
```

In the pipeline code, configure the language model and embedding model used for data construction. For example:

```python
self.llm_serving = APILLMServing_request(
api_url="https://api.openai.com/v1/chat/completions", # Can be replaced with a custom endpoint
model_name="gpt-4o", # Project data was constructed using gpt-4o
max_workers=100
)
self.embedding_serving = APILLMServing_request(
api_url="https://api.openai.com/v1/embeddings",
model_name="text-embedding-ada-002", # Can be replaced with other embedding models
max_workers=100
)
```
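As the comments above note, both the API URL and the model name can be swapped. A minimal sketch, assuming you host a self-hosted OpenAI-compatible server instead (all URLs, ports, and model names below are placeholders, not project defaults):

```python
# Hypothetical alternative: point both services at a self-hosted
# OpenAI-compatible endpoint instead of api.openai.com.
self.llm_serving = APILLMServing_request(
    api_url="http://localhost:8000/v1/chat/completions",  # placeholder URL
    model_name="Qwen2.5-72B-Instruct",                    # placeholder model
    max_workers=32,
)
self.embedding_serving = APILLMServing_request(
    api_url="http://localhost:8001/v1/embeddings",        # placeholder URL
    model_name="bge-m3",                                  # placeholder model
    max_workers=32,
)
```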
Run the complete Text2SQL-Flow augmentation pipeline:

```bash
cd data_augmentation
bash run_batch.sh
```

This will:
- Generate Augmented SQL from seed data
- Create natural language questions
- Generate chain-of-thought reasoning traces
- Apply SQL executability and NL–SQL correspondence filters (a minimal sketch of the executability check follows this list)
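The executability filter can be pictured as follows. This is a minimal sketch, not the repository's implementation: it assumes each sample carries a SQL string and a path to a SQLite database, and it keeps only queries that run without error (the separate NL–SQL correspondence filter is not shown):

```python
import sqlite3
from contextlib import closing

def is_executable(sql: str, db_path: str) -> bool:
    """Return True if `sql` executes without error on the SQLite DB at `db_path`."""
    try:
        with closing(sqlite3.connect(db_path)) as conn:
            conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False

# Hypothetical usage: `samples` would be a list of dicts holding the
# augmented SQL and the path to its source database.
# kept = [s for s in samples if is_executable(s["augmented_sql"], s["db_path"])]
```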
Fine-tune open-source LLMs on the generated SQLFlow dataset and evaluate them on mainstream Text-to-SQL benchmarks:

```bash
cd llm_train_and_evalute
# 1. Configure paths and variables in train_llm.sh
# 2. Run training
bash train_llm.sh
# Edit MODELS and paths in eval_bench.sh, then run:
bash eval_bench.sh
```

Train and use the masked Question-SQL alignment retrieval strategy:

```bash
# Ensure the DAIL-SQL and SWIFT environments are set up.
# See /retrieval_model_train_and_evaluate/README.md for details.
cd retrieval_model_train_and_evaluate
# 1. Build Training Data
cd training_data_process
bash get_training_data.sh
# 2. Train Embedding Model with SWIFT SFT
cd ../model_training
bash run_swift.sh
# 3. Retrieve Few-shot Examples & Build Prompts
cd ../retrieval_stategy
bash rungenerate_retrieval_prompt_swift.sh
```
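The intuition behind masked alignment retrieval can be sketched as follows. This is an illustrative simplification, not the repository's code: it assumes literal values in the question are masked to placeholders before embedding, and that few-shot examples are retrieved by cosine similarity (`embed` stands in for the embedding model trained above and is assumed to return unit-norm vectors):

```python
import re
import numpy as np

def mask_literals(question: str) -> str:
    """Replace quoted strings and numbers with placeholders so retrieval
    keys on question structure rather than on specific values."""
    masked = re.sub(r"'[^']*'|\"[^\"]*\"", "<VAL>", question)
    masked = re.sub(r"\b\d+(?:\.\d+)?\b", "<NUM>", masked)
    return masked

def retrieve_few_shot(query, pool, embed, k=5):
    """Return the k examples from `pool` whose masked questions are closest
    to the masked query (dot product of unit-norm vectors = cosine)."""
    q_vec = embed(mask_literals(query))
    scored = sorted(
        pool,
        key=lambda ex: float(np.dot(q_vec, embed(mask_literals(ex["question"])))),
        reverse=True,
    )
    return scored[:k]
```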
Model and dataset downloads:

| Model / Dataset | Description | Download (Hugging Face) |
|---|---|---|
| SQLFlow | Text-to-SQL dataset | 🤗 debugger123/SQLFlow |
| SQLFlow-Retrieval-0.6B | Embedding model used for few-shot example retrieval | 🤗 xccr/SQLFlow-Retrieval-0.6B |
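As a quick-start sketch, assuming the dataset is hosted under the hub id shown in the table above, SQLFlow can be loaded with the `datasets` library:

```python
# pip install datasets
from datasets import load_dataset

# Hub id taken from the table above; split names are not assumed here.
ds = load_dataset("debugger123/SQLFlow")
print(ds)
```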
If you find Text2SQL-Flow helpful, please consider citing it. Thank you! :)
@article{cai2026text2sqlflow,
title={TEXT2SQL-FLOW: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL},
author={Cai, Qifeng and Liang, Hao and Xu, Chang and Xie, Tao and Zhang, Wentao and Cui, Bin},
journal={arXiv preprint arXiv:2511.10192v3},
year={2026}
}

