This repository implements a three-phase experimental study on language model fine-tuning for Python code generation, comparing Full Fine-Tuning (FFT), Supervised Fine-Tuning with Q-LoRA, and Direct Preference Optimization (DPO) for behavioral alignment.
- Overview
- Project Structure
- Experimental Phases
- Installation
- Usage
- Results and Analysis
- Technical Specifications
- References
This project explores different approaches to fine-tuning language models for code generation:
- Baseline Approach (FFT): Full parameter fine-tuning of an encoder-based model (XLM-RoBERTa-large)
- Parameter-Efficient Training (Q-LoRA SFT): Low-rank adaptation of a proper decoder model (Qwen2-1.5B-Instruct)
- Behavioral Alignment (DPO): Preference-based optimization for safe and helpful code generation
- Capability Acquisition: Train models to generate functional Python code
- Efficiency Comparison: Analyze full fine-tuning vs. parameter-efficient methods
- Behavioral Control: Implement preference optimization for aligned model behavior
- Architecture Analysis: Compare encoder-based vs. decoder-based models for generation tasks
Code-Generation-and-Guarding/
├── FFT_Phase.ipynb # Part I: Full Fine-Tuning (RoBERTa)
├── SFT_Phase.ipynb # Part II: Q-LoRA SFT (Qwen)
├── DPO_Phase.ipynb # Part III: Direct Preference Optimization
├── README.md # This file
└── wandb/ # Experiment tracking logs
├── config.yaml
├── requirements.txt
└── models/
└── fft-roberta-final/
Objective: Establish a baseline by fine-tuning all parameters of an encoder model adapted for generation.
| Parameter | Value |
|---|---|
| Model | FacebookAI/xlm-roberta-large (0.56B params) |
| Architecture | Encoder + Causal LM head |
| Dataset | flytech/python-codes-25k (full) |
| Training Type | Full parameter updates (~560M params) |
| Batch Size | 4 |
| Learning Rate | 5e-5 |
| Epochs | 2 |
| Optimizer | AdamW |
| Precision | BF16 |
- ✅ All model parameters are updated during training
- ✅ Memory-intensive approach (requires gradient checkpointing)
- ✅ Computes loss over full instruction + code sequence
⚠️ Architectural limitation: Encoder model with bidirectional attention (not ideal for generation)
- Moderate code generation capability
- Limited by encoder architecture (bidirectional attention patterns)
- Serves as baseline for comparison with proper decoder models
Objective: Implement parameter-efficient fine-tuning using LoRA with 4-bit quantization on a proper decoder model.
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen2-1.5B-Instruct |
| Architecture | Decoder-only (proper causal LM) |
| Dataset | flytech/python-codes-25k (3,000 samples) |
| Quantization | 4-bit NF4 (via BitsAndBytes) |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| Target Modules | All linear layers |
| Trainable Params | ~0.5% of total (thanks to LoRA) |
| Batch Size | 1 (gradient accumulation: 4) |
| Learning Rate | 2e-4 |
| Epochs | 1 |
| Optimizer | paged_adamw_8bit |
- ✅ Parameter-Efficient: Only LoRA adapters trained (~0.5% of parameters)
- ✅ Memory-Efficient: 4-bit quantization reduces memory footprint by 75%
- ✅ Proper Architecture: Decoder-only model designed for generation
- ✅ Instruction Masking: Loss computed only on code completions (not prompts)
- ✅ System Prompt: "You are a helpful Python coding assistant"
- Better Architecture: Decoder-only vs. adapted encoder
- Efficiency: ~280x fewer trainable parameters
- Memory: 4-bit quantization enables larger models on same hardware
- Quality: Superior generation due to causal attention
Objective: Align the SFT model with human preferences using preference pairs to ensure safe and helpful behavior.
| Parameter | Value |
|---|---|
| Base Model | SFT checkpoint from Phase II |
| Dataset | jondurbin/truthy-dpo-v0.1 |
| Beta Values | 0.1, 0.5, 0.8, 1.0 |
| Batch Size | 1 (gradient accumulation: 4) |
| Learning Rate | 1e-5 |
| Epochs | 3 |
| Max Prompt Length | 512 |
| Max Total Length | 1024 |
The beta (β) parameter controls the strength of preference optimization:
- β = 0.1: Weak preference signal (closer to SFT baseline)
- β = 0.5: Moderate alignment
- β = 0.8: Strong alignment
- β = 1.0: Maximum preference enforcement
- Policy Model: Trainable LoRA adapters on top of SFT checkpoint
- Reference Model: Frozen SFT checkpoint (for computing KL divergence)
- Preference Pairs: Each sample contains:
- Prompt
- Chosen response (preferred)
- Rejected response (dispreferred)
- ✅ Dual-Adapter Setup: SFT adapters + DPO adapters (stacked)
- ✅ Behavioral Alignment: Model learns to prefer helpful/safe responses
- ✅ KL Constraint: Beta controls divergence from reference model
- ✅ No Reward Model: Direct optimization without separate reward model
- Python 3.8+
- CUDA-capable GPU (recommended: 16GB+ VRAM)
- Google Colab (if running notebooks as-is)
pip install torch transformers datasets
pip install peft trl bitsandbytes accelerate
pip install wandb- Clone the repository:
git clone https://github.com/yourusername/Code-Generation-and-Guarding.git
cd Code-Generation-and-Guarding- Install dependencies:
pip install -r wandb/requirements.txt- Login to Weights & Biases (for experiment tracking):
wandb login- For Google Colab users:
- Mount Google Drive for checkpoint persistence
- Ensure GPU runtime is enabled
# Open FFT_Phase.ipynb in Jupyter/Colab
# Run all cells sequentially
# Model will be saved to ./models/fft-roberta-final/Key steps:
- Load XLM-RoBERTa-large with causal LM head
- Tokenize full dataset (instruction + code)
- Train with AdamW optimizer (2 epochs)
- Generate test outputs
# Open SFT_Phase.ipynb
# Update OUTPUT_DIR to your Google Drive path
# Run all cells
# Model saved to qwen2-1.5b-peft-skill/Key steps:
- Load Qwen2-1.5B with 4-bit quantization
- Apply LoRA adapters (r=16, alpha=32)
- Mask system prompt in loss computation
- Train on 3,000 samples
- Test with 10 coding prompts
# Open DPO_Phase.ipynb
# Set SFT_CHECKPOINT_PATH to your SFT model
# Run cells for each beta value (0.1, 0.5, 0.8, 1.0)
# Models saved to dpo_qwen/, dpo_qwen05/, etc.Key steps:
- Load SFT checkpoint as base
- Create policy model (trainable) and reference model (frozen)
- Train with DPO loss for each beta
- Generate outputs for all beta values
- Compare behavioral differences
All experiments logged to Weights & Biases for comprehensive analysis:
- LSFT (SFT Training Loss): Tracks learning in Phase I & II
- LDPO (DPO Loss): Preference optimization loss in Phase III
- Learning rate schedules, gradient norms, training speed
| Phase | Model | Trainable Params | Memory | Generation Quality |
|---|---|---|---|---|
| FFT | RoBERTa-large | 560M (100%) | High | Limited (encoder issue) |
| SFT | Qwen2-1.5B | 2M (0.5%) | Low (4-bit) | High (proper decoder) |
| DPO | Qwen2-1.5B | 2M (0.5%) | Low (4-bit) | High + Aligned |
- Architecture Matters: Decoder-only models significantly outperform adapted encoders for generation
- Efficiency Gains: Q-LoRA achieves better results with 280x fewer trainable parameters
- Beta Impact: Higher beta values enforce stronger alignment but may reduce creativity
- Memory Efficiency: 4-bit quantization enables training of larger models on limited hardware
- Size: 25,000 instruction-code pairs
- Language: Python
- Format:
{"instruction": str, "output": str} - Usage:
- Phase I: Full dataset
- Phase II: 3,000 samples (subset)
- Type: Preference pairs dataset
- Format:
{"prompt": str, "chosen": str, "rejected": str} - Purpose: DPO alignment training
- Usage: Phase III
| Hyperparameter | Phase I (FFT) | Phase II (SFT) | Phase III (DPO) | Justification |
|---|---|---|---|---|
| Learning Rate | 5e-5 | 2e-4 | 1e-5 | Lower for full params, higher for adapters |
| Batch Size | 4 | 1 | 1 | Memory constraints with quantization |
| Epochs | 2 | 1 | 3 | Fewer epochs for adapter training |
| Gradient Accumulation | 1 | 4 | 4 | Effective batch size = 4 |
| Warmup Ratio | 10% | 3% | 10% | Stabilize early training |
- FFT (Phase I): ~20GB VRAM (with gradient checkpointing)
- SFT (Phase II): ~8GB VRAM (4-bit quantization + LoRA)
- DPO (Phase III): ~12GB VRAM (policy + reference models)