Fine-tuning Qwen2.5-Coder-7B-Instruct for text-to-SQL generation using the SynSQL-2.5M dataset with distributed training via DeepSpeed and HuggingFace Accelerate.

    uv sync
    cp .env.example .env  # Configure environment variables

Copy `.env.example` to `.env` and configure:


| Variable | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS credentials for S3 checkpoint uploads |
| `AWS_SECRET_ACCESS_KEY` | AWS credentials for S3 checkpoint uploads |
| `WANDB_API_KEY` | Weights & Biases API key for experiment tracking |

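
Since a missing credential tends to surface only when the first checkpoint upload or logging call fails, it can help to check the variables up front. The following is a minimal sketch, not part of the repository: the helper `missing_env_vars` is hypothetical, and it assumes the values from `.env` have already been loaded into the process environment (how this project loads `.env` is not shown here).

```python
import os

# Variable names come from the table above; the helper itself is hypothetical.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "WANDB_API_KEY")


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` before launching a run returns an empty list when everything is configured, and otherwise names the variables that still need values.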
Preprocess the SynSQL-2.5M dataset into memory-mapped Arrow files for efficient distributed loading:

    python src/prepare_dataset.py --output-dir ./data/processed

Options:

- `--model-name`: Tokenizer model (default: `Qwen/Qwen2.5-Coder-7B-Instruct`)
- `--max-length`: Maximum sequence length (default: `4096`)
- `--num-proc`: Number of parallel workers (default: `8`)
- `--val-ratio`: Validation split ratio (default: `0.01`)
- `--test-ratio`: Test split ratio (default: `0.01`)

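
To make the effect of `--val-ratio` and `--test-ratio` concrete, here is a rough sketch of how split sizes could be derived from the two ratios. It assumes both ratios are fractions of the full dataset and that the remainder goes to training; the actual logic in `src/prepare_dataset.py` is not shown in this README and may differ. The function `split_sizes` is hypothetical.

```python
def split_sizes(num_examples, val_ratio=0.01, test_ratio=0.01):
    """Return (train, val, test) example counts for the given ratios.

    Assumption: ratios are fractions of the whole dataset, rounded down,
    and every example not in val or test is used for training.
    """
    if val_ratio < 0 or test_ratio < 0 or val_ratio + test_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    n_val = int(num_examples * val_ratio)
    n_test = int(num_examples * test_ratio)
    n_train = num_examples - n_val - n_test
    return n_train, n_val, n_test
```

With the defaults, a hypothetical 1,000-example dataset would yield 980 training, 10 validation, and 10 test examples.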
Launch distributed training with Accelerate:

    accelerate launch --config_file config/accelerate.yml src/train.py --config config/train.yaml

Resume from checkpoint:

    accelerate launch --config_file config/accelerate.yml src/train.py --config config/train.yaml --resume ./checkpoints/checkpoint-500

Initialize from checkpoint (fresh optimizer state):

    accelerate launch --config_file config/accelerate.yml src/train.py --config config/train.yaml --init-from ./checkpoints/checkpoint-500

Evaluate the model on the test set (computes loss, perplexity, and exact match accuracy):


    python src/evaluate.py --config config/train.yaml

Evaluate a specific checkpoint:

    python src/evaluate.py --config config/train.yaml --checkpoint checkpoint-500

Options:

- `--adapter-path`: Path to LoRA adapter (default: `checkpoints/final`)
- `--max-samples`: Limit number of test samples
- `--num-examples`: Number of sample outputs to display (default: `5`)

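
For readers unfamiliar with the reported metrics, the sketch below shows how perplexity and exact match accuracy are commonly computed. It is an illustration under stated assumptions, not the code in `src/evaluate.py`: perplexity is taken as the exponential of the mean per-token cross-entropy loss, and `normalize_sql` is a hypothetical normalization (whitespace collapsing and trailing-semicolon removal) that the real script may or may not apply.

```python
import math


def perplexity(mean_loss):
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_loss)


def normalize_sql(sql):
    """Hypothetical normalization: trim, drop a trailing ';', collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        normalize_sql(p) == normalize_sql(r)
        for p, r in zip(predictions, references)
    )
    return matches / len(references)
```

Exact match is a strict metric: two queries that return identical results but differ in aliasing or clause order count as a miss, so it tends to understate semantic correctness.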
Merge the LoRA adapter into the base model for deployment:

    python src/merge_weights.py --config config/train.yaml --adapter ./checkpoints/checkpoint-1000 --output ./merged_model

Options:

- `--dtype`: Output precision, one of `bf16`, `fp16`, or `fp32` (default: `bf16`)
- `--force`: Overwrite output directory if it exists
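
As background on what merging means, the sketch below illustrates the standard LoRA merge arithmetic on a single weight matrix: the low-rank update `B @ A`, scaled by `alpha / rank`, is added to the frozen base weight so the adapter is no longer needed at inference time. This is a toy example using plain Python lists, not the implementation in `src/merge_weights.py`; the names `matmul` and `merge_lora` are hypothetical, and a real merge would go through a library such as PEFT (for example its `merge_and_unload` method) rather than hand-written loops.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is d_out x d_in, lora_a is rank x d_in, lora_b is d_out x rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 example with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]        # rank x d_in
b = [[0.5], [1.0]]      # d_out x rank
merged = merge_lora(base, a, b, alpha=2, rank=1)
```

Because the merged matrix has the same shape as the base weight, the resulting model loads and runs like an ordinary checkpoint, which is why merging is a common final step before deployment.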