> [!IMPORTANT]
> This project is the result of a coding challenge completed over a weekend. It is not production-ready, and several assumptions were made to complete it within the limited timeframe.
This repository contains a solution to a weekend coding challenge that focuses on developing a system to translate natural language questions into SQL queries. The goal is to enable non-technical users to query databases using everyday language instead of writing SQL code directly.
The approach involves fine-tuning a general-purpose language model (LLM) specifically for the SQL generation task. By training the model on pairs of natural language questions and their corresponding SQL queries, it learns to generate syntactically correct and semantically accurate SQL based on user input. The project demonstrates how domain-specific fine-tuning can adapt a general language model to specialized tasks, even with relatively limited computational resources.
Test it out here: http://143.110.221.84 (note: the deployed instance is currently turned off).
I decided to fine-tune `HuggingFaceTB/SmolLM2-135M` because:
- It is relatively small, with 135M parameters and a total size of less than 1GB. This gives me more flexibility to choose the training environment, speeds up training, and allows inference on a CPU.
- It was trained on general language tasks and performs poorly on SQL generation, so the impact of fine-tuning will be clearer.
For the fine-tuning tasks, I made the following assumptions:
- The only task of interest is SQL generation; therefore, it does not matter if the LLM loses other capabilities (e.g., general language understanding) during fine-tuning.
- I will not include contextual information in training, such as the existing SQL environment (tables, columns), or explanations of the output SQL query. Although this could improve robustness, I kept it simple due to limited time.
- Used a Google Colab notebook with a T4 GPU runtime. A copy of the notebook is available at `./1_training.ipynb`.
- Used the training split of the `gretelai/synthetic_text_to_sql` dataset, specifically the `sql` and `sql_prompt` columns.
- Used the `trl` library.
- Training parameters:
  - `max_steps`: 3000
  - `per_device_train_batch_size`: 2
  - `gradient_accumulation_steps`: 8
  - `learning_rate`: 2e-5
  - `save_steps`: 500
  - `eval_steps`: 500
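Each training example pairs a `sql_prompt` (the natural-language question) with its target `sql`. A minimal sketch of the formatting step that turns a dataset row into a single training string is shown below; the exact instruction/response template used in `./1_training.ipynb` is not reproduced here, so the template is an assumption:

```python
def format_example(row: dict) -> str:
    """Turn one dataset row into a single training string.

    The "### Question / ### SQL" template is a hypothetical example;
    the actual template used in 1_training.ipynb may differ.
    """
    return (
        "### Question:\n"
        f"{row['sql_prompt']}\n"
        "### SQL:\n"
        f"{row['sql']}"
    )

# Example row mimicking the gretelai/synthetic_text_to_sql columns
row = {
    "sql_prompt": "List all customers from Spain.",
    "sql": "SELECT * FROM customers WHERE country = 'Spain';",
}
text = format_example(row)
```

A function like this is typically passed to `trl`'s SFT trainer as the formatting function applied to every row of the training split.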
I published the fine-tuned model on Hugging Face at `thiborose/SmolLM2-FT-SQL`.
The evaluation process is implemented in `2_evaluation.ipynb` and uses two main metrics:
The first metric, syntactic correctness, checks whether the SQL generated by the model is syntactically valid:
- For each generated SQL query, the `sqlglot` library attempts to parse it.
- If parsing succeeds, the query is considered syntactically correct; otherwise, it is flagged as incorrect.
- The notebook counts the number of syntactically correct queries and shows examples of queries with syntax errors and their error messages.
The second metric, semantic correctness, uses an LLM to evaluate whether the generated SQL query answers the intended natural-language question:
- For each (prompt, generated SQL) pair, a prompt is constructed for an external LLM (a GPT-4 instance via Azure OpenAI).
- The prompt instructs the evaluator LLM to return a strict JSON object indicating whether the SQL is correct; if not, it specifies the error type (`incomplete`, `irrelevant`, or `logic_error`) and a brief explanation.
- Evaluation is done programmatically on each test set example; results are collected and saved for analysis.
- The notebook parses and aggregates the results, visualizes error type distribution, and shows qualitative examples per error category.
Using a strong LLM as an automatic judge is especially useful because multiple SQL queries can be correct for the same prompt. This evaluation is more robust than simple string matching against reference SQL.
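A sketch of how the judge prompt and the strict-JSON response handling might look (the exact prompt wording in `2_evaluation.ipynb` is not shown here, so this template is an assumption, and the Azure OpenAI call is replaced by a simulated reply):

```python
import json

ERROR_TYPES = {"incomplete", "irrelevant", "logic_error"}

def build_judge_prompt(question: str, generated_sql: str) -> str:
    """Hypothetical evaluator prompt; the notebook's wording may differ."""
    return (
        "You are evaluating whether a SQL query answers a question.\n"
        f"Question: {question}\n"
        f"SQL: {generated_sql}\n"
        "Respond with strict JSON: "
        '{"correct": <bool>, "error_type": <"incomplete"|"irrelevant"|"logic_error"|null>, '
        '"explanation": <string>}'
    )

def parse_judgement(raw: str) -> dict:
    """Parse the judge's JSON reply and validate the error taxonomy."""
    result = json.loads(raw)
    if not result["correct"] and result["error_type"] not in ERROR_TYPES:
        raise ValueError(f"unexpected error_type: {result['error_type']}")
    return result

# Simulated judge reply; a real run would send build_judge_prompt(...)
# to the Azure OpenAI API and parse the model's response instead.
reply = '{"correct": false, "error_type": "logic_error", "explanation": "Wrong join key."}'
verdict = parse_judgement(reply)
```

Constraining the judge to a fixed JSON schema is what makes the per-example results easy to aggregate and visualize later in the notebook.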
- The evaluation set is the `test` split of the `gretelai/synthetic_text_to_sql` dataset.
- Due to time constraints, only 2,074 samples were evaluated.
- Results are saved in `2.2_parsed_output.json`.
- Total samples evaluated: 2,074
- Correct SQL queries (according to the evaluating LLM): 686 (33.1%)
- Incorrect SQL queries: 1,388
- Error-type distribution among incorrect queries:
  - `logic_error`: 1,142
  - `incomplete`: 227
  - `irrelevant`: 19
- Syntactically correct queries (per `sqlglot`): 2,021 / 2,074
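As a quick sanity check, the headline rates can be recomputed from the reported counts:

```python
total = 2074
semantically_correct = 686          # per the evaluating LLM
syntactically_correct = 2021        # per sqlglot
errors = {"logic_error": 1142, "incomplete": 227, "irrelevant": 19}

# Correct + incorrect examples should account for every sample.
assert semantically_correct + sum(errors.values()) == total

semantic_rate = semantically_correct / total    # ~0.331
syntactic_rate = syntactically_correct / total  # ~0.974
```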
The model achieves a syntactic correctness rate of about 97.4% (2,021 / 2,074), which indicates it reliably produces valid SQL syntax. However, semantic correctness is much lower at 33.1%: while the generated queries are well-formed, they often fail to translate the natural-language questions accurately. The dominant error type is `logic_error`, suggesting the model struggles primarily with the logical construction of queries rather than with syntax or missing clauses.
These results highlight that fine-tuning improved the model's ability to generate syntactically valid SQL but that significant work remains to improve the semantic understanding and correctness of the generated queries. Considering the small model size and limited training time, these results are an expected starting point for further improvements.
Unfortunately, no baseline evaluation has been formally conducted due to time constraints. However, preliminary qualitative testing of the base model (HuggingFaceTB/SmolLM2-135M) revealed that it was unable to generate any coherent SQL queries. In many cases, it failed to produce syntactically valid output and frequently generated irrelevant or incoherent text. While this does not replace a proper quantitative baseline, it strongly suggests that the base model lacks the capabilities required for the SQL generation task and that the fine-tuning has had a meaningful impact.
A simple Streamlit app is provided in `app.py` for interactive testing of the fine-tuned model. The app loads the fine-tuned model with Hugging Face Transformers and provides a text area where users can enter natural-language questions. The generated SQL is formatted and displayed for easy inspection. The app is dockerized for easy deployment and is deployed at http://143.110.221.84.