Execution-Guided Text-to-SQL Post-Training

This project implements an execution-guided Text-to-SQL post-training and verification system on Spider. It uses Qwen2.5-Coder-3B-Instruct as the base model, mines preference data from Spider train_spider.json, and evaluates on the full Spider dev split of 1034 examples without using dev examples for training or DPO pair construction.

The core idea is to model Text-to-SQL as a verifier-backed training loop rather than a single prompt:

Generate SQL from question + retrieved schema.
Execute the SQL inside a read-only SQLite sandbox.
Feed SQLite errors or result-mismatch feedback back to the model.
Mine preference pairs from successful repairs.
Train with LoRA SFT or DPO.

The main experiments focus on LoRA SFT/DPO, pair-quality filtering, Column-Minimal Repair, and clean-split error analysis. The pipeline also exports rollout records that can be reused for future RL-style training, but GRPO training is not part of the current reported results.

TL;DR

Base model: Qwen2.5-Coder-3B-Instruct.
Training data: Spider train_spider.json, with TRAIN_LIMIT=1000 for preference mining.
Evaluation data: full Spider dev split, 1034 examples, clean split.
Best result: SFT+DPO + Column-Minimal Repair improves Final EX from 68.90% to 73.55% over Base + CM Repair.
Main contribution: execution-guided rollout -> preference mining -> LoRA SFT/DPO -> verifier-aware error analysis.

Pipeline Overview

flowchart TD
    A["Question + Retrieved Schema"] --> B["LLM SQL Generation"]
    B --> C["SQLite Sandbox Execution"]
    C --> D["Error / Result Feedback"]
    D --> E["Column-Minimal Repair Rollout"]
    E --> F["DPO Pair Mining / Rollout Export"]
    F --> G["LoRA SFT / DPO"]
    G --> H["Evaluation + Error Analysis"]

Core Contributions

Execution-Guided Self-Repair: turns one-shot SQL generation into a verifier-backed interaction task: generate SQL, execute it, read feedback, then repair.
Column-Minimal Repair Feedback: diagnoses invalid columns, alias ownership, candidate owner tables, available schema columns, and asks the model to remove unnecessary joins when a minimal SQL is enough.
Preference Mining for DPO: mines same-prompt first-turn pairs, same-feedback repair pairs, gold fallback pairs, and synthetic hard negatives from execution-verified rollouts.
DPO Pair Quality Dashboard: reports reward margin, SQL edit distance, duplicate rate, pair-type distribution, chosen executable rate, and hard-negative ratio before/after filtering.
Verifier-Aware Error Routing: separates executor-visible errors, guarded errors, and external-verifier semantic errors so online repair and offline verified-failure repair are not mixed together.
Error Taxonomy Analysis: shows the main bottleneck is not schema retrieval recall under top-k=6, but table-column ownership, join reasoning, and result grounding.

What Is Implemented

Spider/WikiSQL-style JSON or JSONL ingestion.
SQLite read-only execution sandbox with timeout.
Baseline one-shot mode via --max_turns 1.
Multi-turn self-repair rollout: SQL -> execute -> feedback -> revised SQL.
Repair feedback detail switch:
- --feedback_detail basic: legacy SQLite error/result-status feedback.
- --feedback_detail column: column-aware diagnostics with invalid identifiers, alias/table ownership hints, selected-table columns, and row-count/sample mismatch details.
- --feedback_detail minimal: column-aware feedback plus a minimal-SQL repair constraint that asks the model to remove unnecessary joins instead of only patching aliases.
Schema retrieval modes: retrieved token-overlap baseline, bm25, and bm25_fk with foreign-key expansion.
Execution-guided DPO pair mining:
- chosen: repaired SQL that executes to the gold result, or gold SQL when enabled.
- rejected: first-turn wrong SQL.
- If a repair turn samples both correct and failed repairs under the same question + schema + previous SQL + execution feedback prompt, the pipeline prefers a cleaner repair_correct_vs_repair_failed pair.
- Gold fallback is used only to create a chosen completion for failed training pairs; it is not counted as a model prediction in evaluation metrics.
Preference-pair filtering and balancing:
- per-input and per-pair-type caps to prevent synthetic pairs from dominating.
- optional reward-margin filtering with --min_reward_margin.
- optional SQL token edit-distance filtering with --max_sql_edit_distance_ratio.
Rule reward logging:
- 1.0 if correct in one turn.
- 0.7 - 0.1 * (turn - 2) if self-corrected.
- 0.2 if executable but result is wrong.
- -0.2 for syntax/schema errors before the final turn.
- -0.5 if still wrong after max turns.
Metrics:
- first-turn example accuracy
- final example execution accuracy
- first-turn candidate execution accuracy
- candidate execution accuracy
- executable rate
- schema valid rate
- repair success rate

When --num_samples > 1, example-level metrics answer "did this question get at least one correct SQL?", while candidate-level metrics answer "what fraction of generated SQL candidates were correct?".

Optional Local Smoke Test

For fast local verification, the repository includes a tiny SQLite dataset and a mock generator. This path is only a smoke test for the data flow; the reported results below use Qwen2.5-Coder-3B-Instruct on Spider.

python3 scripts/run_demo_pipeline.py

Or run the same steps manually.

Create a tiny Spider-style demo database:

python3 scripts/make_demo_dataset.py

Run a no-download mock rollout. The mock generator intentionally emits a bad first SQL and then repairs to the gold SQL, so the execution-feedback and DPO-construction code path is exercised:

python3 text2sql_trajectory_builder.py \
  --dataset_path data/demo/dev.jsonl \
  --db_root data/demo/database \
  --output_dir outputs/demo_mock \
  --generator mock \
  --max_turns 3

Inspect:

cat outputs/demo_mock/summary.json
head -n 1 outputs/demo_mock/trajectories.jsonl
cat outputs/demo_mock/dpo_pairs.json

Reproducible Experiment Scripts

The experiments/ directory contains shell entry points for a compact, reproducible delivery package. By default, they assume:

SPIDER_ROOT=data/spider_data
MODEL_PATH=models/Qwen2.5-Coder-3B-Instruct
OUTPUT_ROOT=outputs
TRAIN_LIMIT=1000

Override these with environment variables when needed:

SPIDER_ROOT=/path/to/spider \
MODEL_PATH=models/Qwen2.5-Coder-3B-Instruct \
TRAIN_LIMIT=1000 \
bash experiments/03_mine_dpo_pairs.sh

Recommended order:

bash experiments/00_make_demo.sh
bash experiments/01_eval_base_oneshot.sh
bash experiments/02_eval_base_repair.sh
bash experiments/03_mine_dpo_pairs.sh
bash experiments/04_train_dpo.sh
bash experiments/05_eval_dpo_repair.sh
bash experiments/06_compare.sh

Small artifacts can be copied into outputs_key/, while large trajectories, adapters, and raw rollouts should stay in outputs/ or external storage.

Baseline Evaluation

For a real local model:

python3 text2sql_trajectory_builder.py \
  --dataset_path /path/to/spider/dev.json \
  --db_root /path/to/spider/database \
  --output_dir outputs/baseline \
  --generator hf \
  --model_path /path/to/model \
  --max_turns 1 \
  --schema_mode retrieved \
  --top_k_tables 6

Use summary.json for execution accuracy, repair success, and executable rate.

Schema retrieval ablations:

python3 text2sql_trajectory_builder.py \
  --dataset_path "$SPIDER_ROOT/dev.json" \
  --db_root "$SPIDER_ROOT/database" \
  --output_dir outputs/spider_200_base_oneshot_bm25_fk \
  --generator hf \
  --model_path models/Qwen2.5-Coder-3B-Instruct \
  --limit 200 \
  --max_turns 1 \
  --num_samples 1 \
  --temperature 0 \
  --schema_mode bm25_fk \
  --top_k_tables 6 \
  --fk_hops 1

Recommended Local Base Models

Use a text or code causal language model, not a vision-language checkpoint. Good practical choices:

Best quality: Qwen2.5-Coder-7B-Instruct
Best short-iteration choice: Qwen2.5-Coder-3B-Instruct
General baseline: Qwen2.5-3B/7B-Instruct
Use 1.5B only for debugging or very small GPUs

Download with the Hugging Face mirror:

export HF_ENDPOINT=https://hf-mirror.com

python3 scripts/download_hf_model.py \
  --repo_id Qwen/Qwen2.5-Coder-3B-Instruct \
  --local_dir models/Qwen2.5-Coder-3B-Instruct

python3 scripts/download_hf_model.py \
  --repo_id Qwen/Qwen2.5-Coder-7B-Instruct \
  --local_dir models/Qwen2.5-Coder-7B-Instruct

python3 scripts/download_hf_model.py \
  --repo_id Qwen/Qwen2.5-3B-Instruct \
  --local_dir models/Qwen2.5-3B-Instruct

Pass the local Hugging Face model directory after download:

python3 text2sql_trajectory_builder.py \
  --dataset_path data/demo/dev.jsonl \
  --db_root data/demo/database \
  --output_dir outputs/local_model_demo \
  --generator hf \
  --model_path models/Qwen2.5-Coder-3B-Instruct \
  --max_turns 3

Self-Repair Rollouts And DPO Mining

python3 text2sql_trajectory_builder.py \
  --dataset_path "$SPIDER_ROOT/train_spider.json" \
  --db_root "$SPIDER_ROOT/database" \
  --output_dir outputs/repair_rollouts \
  --generator hf \
  --model_path /path/to/model \
  --max_turns 3 \
  --feedback_mode result_status \
  --feedback_detail basic \
  --num_samples 1 \
  --use_gold_when_failed

Important outputs:

trajectories.jsonl: per-example SQL attempts, execution status, errors, rewards.
dpo_pairs.json: prompt/chosen/rejected triples.
grpo_rollouts.jsonl: prompt, completions, and scalar rewards exported for future RL-style extensions.

When --num_samples > 1, repair turns can produce same-prompt repair-policy DPO pairs:

prompt   = question + schema + previous_sql + execution_feedback
chosen   = repair candidate that executes to the gold result
rejected = repair candidate from the same repair prompt that still fails
pair_type = repair_correct_vs_repair_failed

If no same-prompt repair pair exists, the builder falls back to the older natural pairs such as self_repair_success_vs_failed_attempt, first_turn_correct_sample_vs_failed_sample, or gold_vs_failed_attempt.

Column-aware repair feedback keeps the same rollout path but adds schema-grounded repair hints:

python3 text2sql_trajectory_builder.py \
  --dataset_path "$SPIDER_ROOT/train_spider.json" \
  --db_root "$SPIDER_ROOT/database" \
  --output_dir outputs/repair_rollouts_column \
  --generator hf \
  --model_path /path/to/model \
  --max_turns 3 \
  --feedback_mode result_status \
  --feedback_detail column \
  --num_samples 1 \
  --use_gold_when_failed

LoRA SFT Or DPO

Install training dependencies:

pip install -r requirements.txt

LoRA SFT from successful/gold SQL:

python3 training/train_lora_sft.py \
  --model_name_or_path /path/to/model \
  --trajectories outputs/repair_rollouts/trajectories.jsonl \
  --dpo_pairs outputs/repair_rollouts/dpo_pairs.json \
  --output_dir outputs/lora_sft

LoRA DPO from execution-guided preference pairs:

python3 training/train_lora_dpo.py \
  --model_name_or_path /path/to/model \
  --dpo_pairs outputs/repair_rollouts/dpo_pairs.json \
  --output_dir outputs/lora_dpo

Add --bf16 on supported GPUs.

Evaluate a trained LoRA adapter with the same rollout script:

python3 text2sql_trajectory_builder.py \
  --dataset_path /path/to/spider/dev.json \
  --db_root /path/to/spider/database \
  --output_dir outputs/eval_dpo_adapter \
  --generator hf \
  --model_path models/Qwen2.5-Coder-3B-Instruct \
  --adapter_path outputs/dpo_qwen25_coder_3b_spider200/adapter \
  --limit 200 \
  --max_turns 1 \
  --num_samples 1 \
  --temperature 0 \
  --schema_mode retrieved \
  --top_k_tables 6 \
  --feedback_mode result_status

Dataset Format

The bundled demo dataset is enough for smoke testing and does not require downloading anything:

python3 scripts/make_demo_dataset.py

For meaningful experiments, download a real Text-to-SQL benchmark locally. Spider is the best fit for this pipeline because it already includes multiple SQLite databases and gold SQL. WikiSQL can also be used, but you need to provide or convert table files into SQLite databases so each example resolves to a db_path or db_id.

Each example can be Spider-style:

{
  "db_id": "company",
  "question": "How many employees work in Sales?",
  "query": "SELECT COUNT(*) FROM employees ..."
}

Supported aliases:

question: question, utterance, nl
gold SQL: query, sql, gold_sql
database: db_path, database_path, sqlite_path, or db_id plus --db_root

For Spider layout, the script searches:

{db_root}/{db_id}/{db_id}.sqlite
{db_root}/{db_id}.sqlite

Experiment Results

Column-Minimal Repair (CM Repair) means column-aware execution feedback plus the minimal-SQL repair constraint implemented by --feedback_detail minimal.

Main evaluation uses Qwen2.5-Coder-3B-Instruct with greedy decoding. Preference data is mined from Spider train_spider.json; Spider dev full 1034 examples are used only for final clean-split evaluation.

--use_gold_when_failed affects DPO pair construction only: when all model attempts fail, gold SQL can be used as the chosen side of a preference pair. It does not change first-turn EX or final EX, which are computed only from model-generated SQL candidates.

Clean-Split Main Results

Run	First-turn EX	Final EX	Repair SR	Executable
Base-OneShot	57.56%	57.56%	/	83.43%
Base + CM Repair	57.56%	68.90%	26.71%	81.20%
SFT-OneShot	61.92%	61.92%	/	82.54%
SFT + CM Repair	61.92%	71.41%	24.94%	82.54%
DPO-OneShot	62.11%	62.11%	/	82.02%
DPO + CM Repair	62.11%	71.80%	25.58%	82.02%
SFT+DPO-OneShot	67.25%	67.25%	/	84.45%
SFT+DPO + CM Repair	67.25%	73.55%	19.23%	84.45%

SFT and DPO both improve first-turn SQL generation over the base model, while CM Repair adds verified-failure correction on top. SFT+DPO gives the best clean-split result, suggesting that supervised SQL imitation and execution-guided preference optimization are complementary in this setting. DPO rows use the final synthetic-ratio-controlled Synth20 preference set with 701 filtered pairs: 557 natural rollout pairs and 144 synthetic hard negatives.

Executable is the percentage of generated SQL candidates that pass SQL parsing, schema checks, and SQLite execution without runtime errors. It measures execution stability, not whether the returned result matches the gold answer.

Verified-Failure Setting

This project does not assume that an online system magically knows whether every SQL answer is semantically correct. Instead, it studies the verified-failure setting: after a verifier catches a failed SQL, the model uses execution feedback to repair it, and the verified failure trajectories are converted into preference data.

In deployment, the verifier can be a SQL executor for syntax/schema/runtime errors, plus business rules, tests, user feedback, result sanity checks, or an LLM judge for semantic failures. In Spider, gold execution results provide a controlled verifier for measuring repair and mining execution-guided preference pairs.

pred SQL -> verifier/executor -> failure feedback -> repair SQL
                                  -> chosen/rejected preference pair

DPO Data Engine

Preference data is mined from model-generated failures rather than only from static gold SQL. Natural rollout pairs are combined with synthetic hard negatives, then cleaned with reward-margin and SQL edit-distance filters.

Pair Type	Purpose
`gold_vs_failed_attempt`	Use gold SQL as chosen when rollout fails after max turns.
`first_turn_correct_sample_vs_failed_sample`	Compare correct and failed first-turn candidates under the same prompt.
`self_repair_success_vs_failed_attempt`	Learn from verified repairs that fixed an initial failed SQL.
`repair_correct_vs_repair_failed`	Compare correct and failed repair candidates under the same feedback prompt.
`synthetic_wrong_column` / `synthetic_missing_join` / `synthetic_wrong_aggregation`	Add targeted hard negatives for column grounding, join reasoning, and aggregation errors.

Filtering and balancing are used to avoid noisy preference pairs and synthetic-data distribution shift:

--min_reward_margin: removes weak preferences.
--max_sql_edit_distance_ratio: keeps chosen/rejected SQL structurally related.
--input_fraction_limits and --pair_type_max_fractions: prevent synthetic pairs or one pair type from dominating.

Final DPO data quality:

Metric	Value
filtered pairs	701
natural rollout pairs	557
synthetic hard negatives	144
synthetic fraction	20.54%
gold fallback pairs	140
avg reward margin	0.9986
avg SQL edit-distance ratio	0.4152
duplicate rate	0.00%
chosen executable rate	100.00%
executable hard-negative rate	71.04%

Error-Type Repair Breakdown

The aligned Synth20 runs show that DPO reduces first-turn errors from 438 to 391 and increases final correct examples from 711 to 741. SFT+DPO further reduces first-turn errors to 338 and increases final correct examples to 759 on Spider dev full.

Error Type	Base First-turn Errors	DPO First-turn Errors	SFT+DPO First-turn Errors	Base Repaired	DPO Repaired	SFT+DPO Repaired
`wrong_result`	223	105	99	55	17	6
`wrong_row_count`	0	92	90	0	30	24
`no_such_column`	156	142	100	51	42	29
`empty_result`	44	39	32	9	8	2
`no_such_table`	9	9	9	1	2	2
`other_execution_error`	6	4	6	1	1	2

Verifier-Aware Routing

Verifier-aware routing separates executor-visible errors from guarded and external-verifier semantic failures. The aligned Synth20 runs show that SFT+DPO reduces both offline semantic failures and online-repairable schema/execution failures.

Deployment Scope	Base-CM	DPO-CM	SFT+DPO-CM
online repair	171	155	117
guarded repair	44	39	32
offline or external-verifier repair	223	197	189

Synthetic Ratio Ablation

Synthetic hard negatives are useful but affect the first-turn / repair trade-off. Reducing the synthetic fraction from 29.95% to 20.54% preserves most first-turn gains while improving repair success and final execution accuracy.

Run	First-turn EX	Final EX	Repair SR	Executable
Base + CM Repair	57.56%	68.90%	26.71%	81.20%
DPO1000-Synth30	63.08%	71.12%	21.78%	82.32%
DPO1000-Synth20	62.11%	71.80%	25.58%	82.02%

SFT / DPO / SFT+DPO Method Ablation

The final clean-split ablation shows that SFT and DPO are complementary in the current data setting. SFT improves the base SQL generation ability, while DPO further optimizes preference ranking over execution-verified SQL candidates.

Run	First-turn EX	Final EX	Repair SR	Executable
Base + CM Repair	57.56%	68.90%	26.71%	81.20%
SFT + CM Repair	61.92%	71.41%	24.94%	82.54%
DPO + CM Repair	62.11%	71.80%	25.58%	82.02%
SFT+DPO + CM Repair	67.25%	73.55%	19.23%	84.45%

Repair Feedback Ablation

Basic Repair uses SQLite error/result-status feedback. CM Repair adds column-aware diagnostics, alias-table binding, candidate column owners, available columns, and the minimal-SQL repair constraint. On Spider dev full, CM Repair improves both the base model and the post-trained SFT+DPO model.

Run	First-turn EX	Final EX	Repair SR	Executable
Base + Basic Repair	57.56%	61.43%	9.13%	75.89%
Base + CM Repair	57.56%	68.90%	26.71%	81.20%
SFT+DPO + Basic Repair	67.25%	71.12%	11.83%	81.37%
SFT+DPO + CM Repair	67.25%	73.55%	19.23%	84.45%

Feedback And DPO Ablation

The following Spider dev 200 results were used during development to compare feedback variants.

Run	First-turn EX	Final EX	Repair SR	Executable
Base-Repair	49.50%	53.00%	6.93%	58.73%
BalancedDPO-Basic	51.00%	56.00%	10.20%	58.55%
BalancedDPO-Column	51.00%	55.50%	9.18%	62.27%
BalancedDPO + CM Repair	51.00%	58.50%	15.31%	64.32%

Early SFT / DPO / SFT+DPO Ablation

Run	First-turn EX	Final EX	Repair SR	Executable
Base-Repair	49.50%	53.00%	6.93%	58.73%
BalancedDPO + CM Repair	51.00%	58.50%	15.31%	64.32%
SFT + CM Repair	45.00%	56.00%	20.00%	59.41%
SFT+DPO + CM Repair	46.50%	56.00%	17.76%	61.90%

This early Spider 200 ablation was used for rapid iteration. In the final train1000 -> dev full clean-split setting, the cleaner SFT data and completion-only training no longer show the same first-turn regression; SFT+DPO achieves the best first-turn and final execution accuracy.

Key Findings

Schema retrieval was not the main limiter on the Spider 200 slice: retrieved, bm25, and bm25_fk all reached 100% gold-table recall with the same selected table sets. Error analysis instead showed that no_such_column failures mostly came from wrong column owners, alias/table-column mapping errors, and extra joins that changed alias bindings.

Balanced DPO improved overall SQL stability versus heavy synthetic DPO, but did not fully solve column grounding. Column-Minimal Repair feedback partially addressed this: it tells the model which identifier is invalid, which table an alias maps to, which tables actually own the column, what columns are available in the selected schema, and asks the model to prefer a minimal SQL instead of only patching aliases. On Spider 200, this reduced no_such_column candidates from 146 to 130, improved repair success rate from 10.2% to 15.3%, and improved final execution accuracy from 56.0% to 58.5%. The remaining bottleneck is still robust column grounding under joins and aliases.

DPO Scaling Caveat

Scaling DPO data changes the preference distribution and may require retuning the natural/gold/synthetic mixture. In the train2000 scaling check, blindly increasing preference data shifted the model toward repair-oriented behavior and caused first-turn regression, suggesting that preference data quality, pair-type balance, and filtering are more important than raw pair count.

Final Reproduction Commands

For the final train1000 -> Spider dev full setup, prefer the reproducible shell entry points in experiments/ instead of the older ad-hoc commands used during development.

bash experiments/00_make_demo.sh
bash experiments/01_eval_base_oneshot.sh
bash experiments/02_eval_base_repair.sh
bash experiments/03_mine_dpo_pairs.sh
bash experiments/04_train_dpo.sh
bash experiments/05_eval_dpo_repair.sh
bash experiments/06_compare.sh

The final DPO data is built from Spider train_spider.json with TRAIN_LIMIT=1000, then merged with synthetic hard negatives using Synth20 filtering:

python3 scripts/merge_dpo_pairs.py \
  --inputs \
    outputs/train1000_rollout_qwen25_coder_3b/dpo_pairs.json \
    outputs/train1000_synthetic_errors/dpo_pairs.json \
  --output outputs/train1000_filtered_synth20_dpo_pairs.json \
  --input_fraction_limits 1=0.20 \
  --pair_type_max_fractions gold_vs_failed_attempt=0.20 \
  --min_reward_margin 0.3 \
  --max_sql_edit_distance_ratio 1.0 \
  --shuffle

Regenerate final comparison and attribution reports after aligned evaluations:

python3 analysis/compare_runs.py \
  outputs/devfull_base_repair_minimal/summary.json \
  outputs/devfull_sft_repair_minimal/summary.json \
  outputs/devfull_dpo_synth20_repair_cm_final/summary.json \
  outputs/devfull_sft_dpo_synth20_repair_cm_final/summary.json \
  --labels Base-CM SFT-CM DPO-CM SFT+DPO-CM \
  --output outputs/reports/devfull_final_method_ablation_aligned.md

python3 analysis/repair_breakdown.py \
  --runs \
    Base-CM=outputs/devfull_base_repair_minimal/trajectories.jsonl \
    DPO-CM=outputs/devfull_dpo_synth20_repair_cm_final/trajectories.jsonl \
    SFT+DPO-CM=outputs/devfull_sft_dpo_synth20_repair_cm_final/trajectories.jsonl \
  --output_json outputs/reports/devfull_synth20_repair_breakdown.json \
  --output_md outputs/reports/devfull_synth20_repair_breakdown.md

python3 analysis/repair_routing_report.py \
  --runs \
    Base-CM=outputs/devfull_base_repair_minimal/trajectories.jsonl \
    DPO-CM=outputs/devfull_dpo_synth20_repair_cm_final/trajectories.jsonl \
    SFT+DPO-CM=outputs/devfull_sft_dpo_synth20_repair_cm_final/trajectories.jsonl \
  --output_json outputs/reports/devfull_synth20_repair_routing.json \
  --output_md outputs/reports/devfull_synth20_repair_routing.md

For pair quality diagnostics:

python3 analysis/pair_quality_report.py \
  --datasets \
    Raw=outputs/train1000_raw_dpo_pairs.json \
    Synth20=outputs/train1000_filtered_synth20_dpo_pairs.json \
  --output_json outputs/reports/train1000_pair_quality.json \
  --output_md outputs/reports/train1000_pair_quality.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Execution-Guided Text-to-SQL Post-Training

TL;DR

Pipeline Overview

Core Contributions

What Is Implemented

Optional Local Smoke Test

Reproducible Experiment Scripts

Baseline Evaluation

Recommended Local Base Models

Self-Repair Rollouts And DPO Mining

LoRA SFT Or DPO

Dataset Format

Experiment Results

Clean-Split Main Results

Verified-Failure Setting

DPO Data Engine

Error-Type Repair Breakdown

Verifier-Aware Routing

Synthetic Ratio Ablation

SFT / DPO / SFT+DPO Method Ablation

Repair Feedback Ablation

Feedback And DPO Ablation

Early SFT / DPO / SFT+DPO Ablation

Key Findings

DPO Scaling Caveat

Final Reproduction Commands

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
analysis		analysis
experiments		experiments
outputs_key		outputs_key
scripts		scripts
training		training
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
text2sql_trajectory_builder.py		text2sql_trajectory_builder.py

Folders and files

Latest commit

History

Repository files navigation

Execution-Guided Text-to-SQL Post-Training

TL;DR

Pipeline Overview

Core Contributions

What Is Implemented

Optional Local Smoke Test

Reproducible Experiment Scripts

Baseline Evaluation

Recommended Local Base Models

Self-Repair Rollouts And DPO Mining

LoRA SFT Or DPO

Dataset Format

Experiment Results

Clean-Split Main Results

Verified-Failure Setting

DPO Data Engine

Error-Type Repair Breakdown

Verifier-Aware Routing

Synthetic Ratio Ablation

SFT / DPO / SFT+DPO Method Ablation

Repair Feedback Ablation

Feedback And DPO Ablation

Early SFT / DPO / SFT+DPO Ablation

Key Findings

DPO Scaling Caveat

Final Reproduction Commands

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages