thiborose/text-to-sql-llm-finetuning

Important

This project is the result of a coding challenge completed over a weekend. It is not production-ready, and several assumptions were made to complete it within the limited timeframe.

Translating Natural Language to SQL

Project Overview

This repository contains a solution to a weekend coding challenge that focuses on developing a system to translate natural language questions into SQL queries. The goal is to enable non-technical users to query databases using everyday language instead of writing SQL code directly.

The approach is to fine-tune a general-purpose language model (LLM) for the SQL generation task. The model is trained on pairs of natural language questions and their corresponding SQL queries, with the aim of generating SQL that is both syntactically correct and semantically accurate for a given user input. The project demonstrates how domain-specific fine-tuning can adapt a general language model to a specialized task, even with relatively limited computational resources.

Test it out here: http://143.110.221.84 (note: the deployed instance is currently turned off).

Fine-Tuning the LLM

I decided to fine-tune HuggingFaceTB/SmolLM2-135M because:

  • It is relatively small, with 135M parameters and a total size of less than 1GB. This gives me more flexibility to choose the training environment, speeds up training, and allows inference on a CPU.
  • It was trained on general language tasks and performs poorly on SQL generation, so the impact of fine-tuning will be clearer.

For the fine-tuning task, I made the following assumptions:

  • The only task of interest is SQL generation; therefore, it does not matter if the LLM loses other capabilities (e.g., general language understanding) during fine-tuning.
  • I will not include contextual information in training, such as the existing SQL environment (tables, columns), or explanations of the output SQL query. Although this could improve robustness, I kept it simple due to limited time.
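
As a rough illustration of the second assumption, a training example can be reduced to the question and its SQL alone. The sketch below is hypothetical: the template, the `format_example` helper, and the sample row are illustrative assumptions, not the format used in the notebook. The `sql_context` and `sql_explanation` keys stand in for the contextual fields that are deliberately left out.

```python
# Hypothetical template; the notebook may format examples differently.
TEMPLATE = "Question: {question}\nSQL: {sql}"


def format_example(row: dict) -> str:
    """Build one training string from the question and its SQL only."""
    # Context-style fields (schema, explanation) are intentionally never read.
    return TEMPLATE.format(
        question=row["sql_prompt"].strip(),
        sql=row["sql"].strip(),
    )


# Illustrative row written for this sketch, not taken from the dataset.
row = {
    "sql_prompt": "How many orders were placed in 2023?",
    "sql": "SELECT COUNT(*) FROM orders WHERE order_year = 2023;",
    "sql_context": "CREATE TABLE orders (id INT, order_year INT);",
    "sql_explanation": "Counts the rows of orders for 2023.",
}

text = format_example(row)
print(text)
```

Keeping the schema out of the input keeps sequences short, at the cost of the model having to guess table and column names at inference time.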

Fine-Tuning Process

  • Used a Google Colab notebook with a T4 GPU runtime. A copy of the notebook is available at ./1_training.ipynb.
  • Used the training split of the dataset gretelai/synthetic_text_to_sql, specifically the sql and sql_prompt columns.
  • Used the trl library.
  • Training parameters:
    • max_steps: 3000
    • per_device_train_batch_size: 2
    • gradient_accumulation_steps: 8
    • learning_rate: 2e-5
    • save_steps: 500
    • eval_steps: 500
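
To make the training budget implied by these parameters concrete, the following sketch works out the effective batch size, the number of example presentations, and the checkpoint steps. The parameter values are copied from the list above; `num_devices = 1` is an assumption based on the single T4 runtime, and the variable names are only illustrative.

```python
# Values copied from the parameter list above.
training_args = {
    "max_steps": 3000,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-5,
    "save_steps": 500,
    "eval_steps": 500,
}
num_devices = 1  # assumption: a single T4 GPU in the Colab runtime

# Examples contributing to each optimizer update.
effective_batch_size = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
    * num_devices
)

# Example presentations over the whole run (repeats count if the data is cycled).
examples_seen = effective_batch_size * training_args["max_steps"]

# Steps at which a checkpoint is written.
checkpoint_steps = list(
    range(
        training_args["save_steps"],
        training_args["max_steps"] + 1,
        training_args["save_steps"],
    )
)

print(effective_batch_size, examples_seen, checkpoint_steps)
```

Under these assumptions each update averages gradients over 16 examples, and the run presents 48,000 examples in total across 6 checkpoints.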

I published the fine-tuned model on Hugging Face at thiborose/SmolLM2-FT-SQL.

Evaluation of the Fine-Tuned LLM

The evaluation process is implemented in 2_evaluation.ipynb and uses two main metrics:

1. Syntactic Correctness

This metric checks if the SQL generated by the model is syntactically valid:

  • For each generated SQL query, the sqlglot library attempts to parse it.
  • If parsing succeeds, the query is considered syntactically correct; otherwise, it is flagged incorrect.
  • The notebook counts the number of syntactically correct queries and shows examples of queries with syntax errors and their error messages.
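
The check described above can be sketched as follows. This is a minimal illustration, not the notebook's code: `check_syntax` and `summarize_syntax` are hypothetical helper names, and the default dialect of `sqlglot.parse_one` is assumed to be acceptable for the generated queries.

```python
from typing import Callable, Iterable


def check_syntax(sql: str) -> tuple[bool, str]:
    """Return (is_valid, error_message) for one query using sqlglot."""
    # Imported lazily so summarize_syntax can also run with another checker.
    import sqlglot
    from sqlglot.errors import SqlglotError

    try:
        sqlglot.parse_one(sql)
    except SqlglotError as exc:  # base class of sqlglot's parse and token errors
        return False, str(exc)
    return True, ""


def summarize_syntax(
    queries: Iterable[str],
    checker: Callable[[str], tuple[bool, str]] = check_syntax,
) -> dict:
    """Count valid queries and collect the failing ones with their messages."""
    valid = 0
    errors = []
    for sql in queries:
        ok, message = checker(sql)
        if ok:
            valid += 1
        else:
            errors.append((sql, message))
    return {"total": valid + len(errors), "valid": valid, "errors": errors}
```

Calling `summarize_syntax(generated_queries)` on a list of model outputs then gives the count of parseable queries plus the examples to inspect. A parse-only check says nothing about whether the referenced tables or columns exist.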

2. LLM-Led Evaluation (Semantic Correctness)

This metric uses an LLM to evaluate if the generated SQL query answers the intended natural language question:

  • For each (prompt, generated SQL) pair, a prompt is constructed for an external LLM (a GPT-4 instance via Azure OpenAI).
  • The prompt instructs the evaluator LLM to return a strict JSON object indicating whether the SQL is correct; if not, it specifies the error type (incomplete, irrelevant, or logic_error) and a brief explanation.
  • Evaluation is done programmatically on each test set example. Results are collected and saved for analysis.
  • The notebook parses and aggregates the results, visualizes error type distribution, and shows qualitative examples per error category.

Using a strong LLM as an automatic judge is especially useful because multiple SQL queries can be correct for the same prompt. This evaluation is more robust than simple string matching against reference SQL.
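
The judging step can be sketched without the API call itself. The example below builds a judge prompt and validates the reply as strict JSON. The prompt wording, the field names (`correct`, `error_type`, `explanation`), and both helper names are assumptions made for illustration; only the three error types come from the description above.

```python
import json

# Error categories named in the evaluation description.
ALLOWED_ERROR_TYPES = {"incomplete", "irrelevant", "logic_error"}


def build_judge_prompt(question: str, generated_sql: str) -> str:
    """Assemble a hypothetical prompt for the evaluator LLM."""
    return (
        "You are reviewing a SQL query written for a natural language question.\n"
        f"Question: {question}\n"
        f"SQL: {generated_sql}\n"
        "Reply with a JSON object only, for example "
        '{"correct": false, "error_type": "logic_error", "explanation": "..."}. '
        "error_type must be one of incomplete, irrelevant, logic_error, "
        "or null when the query is correct."
    )


def parse_verdict(raw_reply: str) -> dict:
    """Validate the judge's reply; raise ValueError if the shape is unexpected."""
    try:
        verdict = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(verdict, dict) or not isinstance(verdict.get("correct"), bool):
        raise ValueError("judge reply lacks a boolean 'correct' field")
    if not verdict["correct"] and verdict.get("error_type") not in ALLOWED_ERROR_TYPES:
        raise ValueError(f"unexpected error_type: {verdict.get('error_type')!r}")
    return verdict
```

Rejecting malformed replies up front keeps the later aggregation simple, since every stored verdict is known to carry a boolean and, when incorrect, one of the three error types.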

Notes on Evaluation Data

  • The evaluation set is from the test split of the gretelai/synthetic_text_to_sql dataset.
  • Due to time constraints, only 2,074 samples were evaluated.
  • Results are saved in 2.2_parsed_output.json.

Evaluation Results Summary

  • Total samples evaluated: 2,074
  • Correct SQL queries (according to the evaluating LLM): 686 (33.1%)
  • Incorrect SQL queries: 1,388
  • Error types distribution among incorrect queries:
    • logic_error: 1,142
    • incomplete: 227
    • irrelevant: 19
  • Syntactically correct queries (per sqlglot): 2,021 / 2,074
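
The rates behind these counts can be recomputed directly. The sketch below uses only the figures listed above; the variable names are illustrative.

```python
# Counts copied from the summary above.
total = 2074
correct = 686
error_counts = {"logic_error": 1142, "incomplete": 227, "irrelevant": 19}
syntactically_valid = 2021

incorrect = sum(error_counts.values())
# Sanity check: correct and incorrect queries must add up to the total.
assert correct + incorrect == total

semantic_rate = correct / total
syntactic_rate = syntactically_valid / total
# Share of each error type among the incorrect queries only.
error_share = {name: count / incorrect for name, count in error_counts.items()}

print(f"semantic: {semantic_rate:.1%}, syntactic: {syntactic_rate:.1%}")
for name, share in error_share.items():
    print(f"{name}: {share:.1%} of incorrect queries")
```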

Comments on the Results

The model achieves a syntactic correctness rate of about 97.4% (2,021 of 2,074), which indicates it reliably produces valid SQL syntax. However, semantic correctness is much lower at 33.1%, revealing that while the SQL queries are well-formed, they often fail to translate the natural language questions accurately. The dominant error type is logic_error, suggesting the model struggles primarily with the logical construction of queries rather than with syntax or missing parts.

These results suggest that fine-tuning improved the model's ability to generate syntactically valid SQL, but that significant work remains to improve the semantic correctness of the generated queries. Considering the small model size and limited training time, they are a reasonable starting point for further improvements.

Unfortunately, no baseline evaluation has been formally conducted due to time constraints. However, preliminary qualitative testing of the base model (HuggingFaceTB/SmolLM2-135M) revealed that it was unable to generate any coherent SQL queries. In many cases, it failed to produce syntactically valid output and frequently generated irrelevant or incoherent text. While this does not replace a proper quantitative baseline, it strongly suggests that the base model lacks the capabilities required for the SQL generation task and that the fine-tuning has had a meaningful impact.

Deployment

A simple Streamlit app is provided in app.py for interactive testing of the fine-tuned model. The app loads the fine-tuned model using Hugging Face Transformers and provides a text area for users to input natural language questions. The generated SQL is formatted and displayed for easy inspection. The app is dockerized for easy deployment and was deployed at http://143.110.221.84 (the instance is currently turned off).
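
The inference path behind such an app can be sketched as below. This is not the code in app.py: the helper names, the choice to pass the question to the model without a prompt template, and `max_new_tokens=128` are assumptions, and the Streamlit layer is omitted. Only the model name comes from this README.

```python
from typing import Callable

MODEL_ID = "thiborose/SmolLM2-FT-SQL"  # fine-tuned model named in this README


def load_generator() -> Callable[[str], str]:
    """Load the model on CPU and return a question -> raw text function."""
    # Imported lazily: transformers and a backend such as PyTorch must be installed.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=MODEL_ID, device=-1)

    def generate(question: str) -> str:
        # max_new_tokens=128 is an arbitrary cap chosen for this sketch.
        outputs = pipe(question, max_new_tokens=128, return_full_text=False)
        return outputs[0]["generated_text"]

    return generate


def question_to_sql(question: str, generate: Callable[[str], str]) -> str:
    """Normalize the question, call the generator, and trim the output."""
    cleaned = " ".join(question.split())
    if not cleaned:
        raise ValueError("question is empty")
    return generate(cleaned).strip()
```

A caller would run `generate = load_generator()` once at startup and then `question_to_sql(user_input, generate)` per request, so the model is loaded only once. If the model was trained with a specific prompt format, the same format would need to be applied before generation.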

About

Weekend project: a fine-tuned small language model (135M parameters) that translates natural language questions into SQL queries. Includes evaluation metrics and a Streamlit demo app.
