NL2SQL L20 Technical Report

Abstract

This report summarizes a single-GPU NL2SQL benchmark built around Qwen2.5-Coder-7B-Instruct, LoRA fine-tuning, and inference-time candidate selection. The project studies how far a Spider-trained 7B model can be pushed without additional pretraining. The best Spider dev result is 82.20% official execution accuracy from a retrospective n=4 value-aware voting subset. The best fresh BIRD Mini-Dev result is 47.80% execution accuracy from value-aware voting with 20 candidates at temperature 0.9. A seven-pool retrospective BIRD union reaches 48.40% execution accuracy at an n=80 candidate budget.

Core Research Question

The project asks:

Given a single NVIDIA L20, a 7B open model, and Spider supervision, how much NL2SQL quality can be recovered through schema-rich prompting and inference-time candidate selection rather than more pretraining?

This framing is intentionally resource-constrained. The goal is not to claim public leaderboard SOTA, but to produce a reproducible engineering benchmark for studying candidate selection, execution feedback, and out-of-domain transfer.

Method

Base Model and Training

Base model: Qwen2.5-Coder-7B-Instruct
Fine-tuning: LoRA adapters
Hardware target: one NVIDIA L20
Training data: Spider-style supervised text-to-SQL data
Evaluation targets: Spider dev and BIRD Mini-Dev

The training variants differ mainly in input representation:

Adapter family	Input representation
Direct	question plus database schema
Schema-aware	linked schema, foreign-key hints, and evidence
Rich-context	schema hints, M-Schema style text, value hints, and evidence
Candidate repair	repair examples derived from generated candidate pools

Inference Strategies

The benchmark compares direct decoding with several candidate-selection strategies:

Strategy	Description
Direct	one SQL candidate from one prompt family
MCR	multi-candidate reranking from rich-context prompts
VAV	value-aware voting over multiple prompt families and samples
EGS	execution-guided schema reranking
Candidate repair	learned adapter repairs generated candidates
Union cost curve	retrospective selection over saved candidate pools at different budgets

Evaluation

The local evaluator reports:

EM: strict normalized exact match;
EX: SQLite execution accuracy;
Err: execution-error rate;
Hall: schema hallucination rate.

For Spider, selected runs are also exported to the upstream official evaluator. For BIRD, execution accuracy is the primary metric because equivalent SQL can differ strongly in surface form, making strict EM especially brittle in out-of-domain transfer.

Experiment Summary

Spider Dev

Run	Candidates	Local EM	Local EX	Official EM	Official EX
Direct	1	48.94%	75.73%	-	-
Schema-aware	1	48.65%	76.40%	-	-
Rich-context	1	50.48%	78.72%	-	-
MCR	8	51.06%	80.37%	-	-
VAV full pool	30	52.03%	82.11%	78.50%	81.90%
VAV cost curve	4	50.39%	82.59%	78.10%	82.20%
EGS	32	51.06%	81.33%	78.40%	82.00%
Candidate repair	repaired	50.10%	81.14%	-	-

The strongest official Spider result is the n=4 VAV retrospective subset. It is notable because it beats the n=30 full-pool official EX while using a smaller selected budget, suggesting that candidate quality and selection are more important than raw candidate count after a certain point.

BIRD Mini-Dev

Run	Candidates / temp	EM	EX	Err	Hall
Direct	1	0.80%	21.60%	20.00%	12.40%
Schema-aware	1	1.20%	37.40%	15.00%	8.40%
Rich-context	1	0.60%	37.20%	16.60%	10.80%
MCR	8	0.60%	39.20%	6.80%	4.80%
Candidate repair baseline	repaired	1.20%	42.20%	4.40%	2.40%
EGS	n=16	0.80%	41.40%	2.00%	1.20%
VAV	n=16	1.00%	46.40%	1.80%	1.60%
VAV	n=20 default	0.60%	46.80%	2.00%	1.60%
VAV	n=20, temp 0.9	0.80%	47.80%	1.60%	1.40%
VAV	n=20, temp 0.85	1.00%	47.20%	2.00%	2.20%
VAV	n=20, temp 0.92	1.00%	45.00%	2.00%	2.00%
VAV repeat	n=20, temp 0.9b	0.80%	45.60%	2.20%	1.60%
VAV repeat	n=20, temp 0.9c	1.20%	45.20%	2.40%	2.60%
Value-grounded VAV	n=20, temp 0.9	1.00%	45.20%	2.60%	2.40%

The strongest fresh BIRD result is VAV n=20 at temperature 0.9, with 47.80% EX. The low EM should not be read as a failure of execution quality. It reflects strict string matching under a cross-domain transfer setting where SQL forms can diverge while still producing the correct answer.

Retrospective BIRD Union

The best retrospective BIRD result uses a seven-pool union:

Budget	EX
n=20	46.40%
n=40	46.60%
n=60	48.20%
n=80	48.40%
n=100	48.20%
n=120	47.60%
n=140	48.00%

This result is best treated as verifier/reranker development evidence. It shows that the saved pools contain complementary correct candidates, but that a better selector is needed to convert those candidates into a fresh deployable improvement.

Figures

Ablations and Findings

Schema and Context

Rich-context prompting improves Spider single-candidate EX from 75.73% to 78.72%. For BIRD, schema-aware prompting improves EX from 21.60% to 37.40%, showing that schema and evidence formatting matter strongly for out-of-domain transfer.

Candidate Selection

MCR and VAV are the main performance levers. On Spider, MCR n=8 reaches 80.37% local EX and VAV reaches the best official result. On BIRD, VAV n=20 at temperature 0.9 reaches 47.80% EX, well above direct and MCR baselines.

Temperature and Diversity

Temperature sweeps show a narrow optimum around 0.9 for the fresh BIRD run. Additional independent 0.9 pools do not beat the best fresh pool, but they add useful candidates for retrospective union analysis.

Candidate Repair

Candidate repair is a negative BIRD ablation in this snapshot. It lowers some error and hallucination rates but reduces EX relative to VAV. This suggests that the repair adapter is overfitting to Spider-style repairs or changing semantically correct candidates.

Value-Grounded Prompts

The BIRD-specific value-grounded prompt set does not improve the best fresh BIRD result. It remains useful as a diagnostic because it tests whether more explicit evidence and join-path prompting can recover failures without changing the model.

Limitations

This is not a public leaderboard submission.
BIRD Mini-Dev is an evaluation proxy, not the full hidden BIRD benchmark.
Retrospective union results use saved candidate pools and should not be compared to fresh single-run generation.
BIRD EM is low because the model is Spider-supervised and because strict SQL string matching is brittle under equivalent-query variation.
The project does not include additional pretraining, larger models, proprietary inference, or external production-grade retrievers.

Reproducibility

The repository includes saved predictions, result JSON files, official Spider evaluator stdout for key runs, and unit tests. A no-GPU sanity check is:

python3 -m pytest tests/

Expected result for this snapshot:

25 passed

The main README contains full artifact links and commands for benchmark-suite runs, cost-curve generation, and Spider official export.

Conclusion

The experiments support a practical conclusion: for resource-constrained NL2SQL systems, well-instrumented inference-time candidate selection can be as important as the adapter itself. The strongest next research direction is not more repeated sampling by itself, but a better verifier or reranker trained against the saved candidate pools.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NL2SQL L20 Technical Report

Abstract

Core Research Question

Method

Base Model and Training

Inference Strategies

Evaluation

Experiment Summary

Spider Dev

BIRD Mini-Dev

Retrospective BIRD Union

Figures

Ablations and Findings

Schema and Context

Candidate Selection

Temperature and Diversity

Candidate Repair

Value-Grounded Prompts

Limitations

Reproducibility

Conclusion

FilesExpand file tree

technical_report.md

Latest commit

History

technical_report.md

File metadata and controls

NL2SQL L20 Technical Report

Abstract

Core Research Question

Method

Base Model and Training

Inference Strategies

Evaluation

Experiment Summary

Spider Dev

BIRD Mini-Dev

Retrospective BIRD Union

Figures

Ablations and Findings

Schema and Context

Candidate Selection

Temperature and Diversity

Candidate Repair

Value-Grounded Prompts

Limitations

Reproducibility

Conclusion