Code Generation and Guarding: A Comprehensive Fine-Tuning Study

This repository implements a three-phase experimental study on language model fine-tuning for Python code generation, comparing Full Fine-Tuning (FFT), Supervised Fine-Tuning with Q-LoRA, and Direct Preference Optimization (DPO) for behavioral alignment.

🎯 Overview

This project explores different approaches to fine-tuning language models for code generation:

Baseline Approach (FFT): Full parameter fine-tuning of an encoder-based model (XLM-RoBERTa-large)
Parameter-Efficient Training (Q-LoRA SFT): Low-rank adaptation of a proper decoder model (Qwen2-1.5B-Instruct)
Behavioral Alignment (DPO): Preference-based optimization for safe and helpful code generation

Key Objectives

Capability Acquisition: Train models to generate functional Python code
Efficiency Comparison: Analyze full fine-tuning vs. parameter-efficient methods
Behavioral Control: Implement preference optimization for aligned model behavior
Architecture Analysis: Compare encoder-based vs. decoder-based models for generation tasks

📁 Project Structure

Code-Generation-and-Guarding/
├── FFT_Phase.ipynb           # Part I: Full Fine-Tuning (RoBERTa)
├── SFT_Phase.ipynb           # Part II: Q-LoRA SFT (Qwen)
├── DPO_Phase.ipynb           # Part III: Direct Preference Optimization
├── README.md                 # This file
└── wandb/                    # Experiment tracking logs
    ├── config.yaml
    ├── requirements.txt
    └── models/
        └── fft-roberta-final/

🔬 Experimental Phases

Phase I: Full Fine-Tuning (FFT)

Objective: Establish a baseline by fine-tuning all parameters of an encoder model adapted for generation.

Configuration

Parameter	Value
Model	FacebookAI/xlm-roberta-large (0.56B params)
Architecture	Encoder + Causal LM head
Dataset	flytech/python-codes-25k (full)
Training Type	Full parameter updates (~560M params)
Batch Size	4
Learning Rate	5e-5
Epochs	2
Optimizer	AdamW
Precision	BF16

Key Characteristics

✅ All model parameters are updated during training
✅ Memory-intensive approach (requires gradient checkpointing)
✅ Computes loss over full instruction + code sequence
⚠️ Architectural limitation: Encoder model with bidirectional attention (not ideal for generation)

Expected Outcomes

Moderate code generation capability
Limited by encoder architecture (bidirectional attention patterns)
Serves as baseline for comparison with proper decoder models

Phase II: Q-LoRA Supervised Fine-Tuning

Objective: Implement parameter-efficient fine-tuning using LoRA with 4-bit quantization on a proper decoder model.

Configuration

Parameter	Value
Model	Qwen/Qwen2-1.5B-Instruct
Architecture	Decoder-only (proper causal LM)
Dataset	flytech/python-codes-25k (3,000 samples)
Quantization	4-bit NF4 (via BitsAndBytes)
LoRA Rank (r)	16
LoRA Alpha	32
Target Modules	All linear layers
Trainable Params	~0.5% of total (thanks to LoRA)
Batch Size	1 (gradient accumulation: 4)
Learning Rate	2e-4
Epochs	1
Optimizer	paged_adamw_8bit

Key Characteristics

✅ Parameter-Efficient: Only LoRA adapters trained (~0.5% of parameters)
✅ Memory-Efficient: 4-bit quantization reduces memory footprint by 75%
✅ Proper Architecture: Decoder-only model designed for generation
✅ Instruction Masking: Loss computed only on code completions (not prompts)
✅ System Prompt: "You are a helpful Python coding assistant"

Advantages over FFT

Better Architecture: Decoder-only vs. adapted encoder
Efficiency: ~280x fewer trainable parameters
Memory: 4-bit quantization enables larger models on same hardware
Quality: Superior generation due to causal attention

Phase III: Direct Preference Optimization

Objective: Align the SFT model with human preferences using preference pairs to ensure safe and helpful behavior.

Configuration

Parameter	Value
Base Model	SFT checkpoint from Phase II
Dataset	jondurbin/truthy-dpo-v0.1
Beta Values	0.1, 0.5, 0.8, 1.0
Batch Size	1 (gradient accumulation: 4)
Learning Rate	1e-5
Epochs	3
Max Prompt Length	512
Max Total Length	1024

Beta Parameter Experiments

The beta (β) parameter controls the strength of preference optimization:

β = 0.1: Weak preference signal (closer to SFT baseline)
β = 0.5: Moderate alignment
β = 0.8: Strong alignment
β = 1.0: Maximum preference enforcement

DPO Training Process

Policy Model: Trainable LoRA adapters on top of SFT checkpoint
Reference Model: Frozen SFT checkpoint (for computing KL divergence)
Preference Pairs: Each sample contains:
- Prompt
- Chosen response (preferred)
- Rejected response (dispreferred)

Key Characteristics

✅ Dual-Adapter Setup: SFT adapters + DPO adapters (stacked)
✅ Behavioral Alignment: Model learns to prefer helpful/safe responses
✅ KL Constraint: Beta controls divergence from reference model
✅ No Reward Model: Direct optimization without separate reward model

🚀 Installation

Prerequisites

Python 3.8+
CUDA-capable GPU (recommended: 16GB+ VRAM)
Google Colab (if running notebooks as-is)

Required Libraries

pip install torch transformers datasets
pip install peft trl bitsandbytes accelerate
pip install wandb

Environment Setup

Clone the repository:

git clone https://github.com/yourusername/Code-Generation-and-Guarding.git
cd Code-Generation-and-Guarding

Install dependencies:

pip install -r wandb/requirements.txt

Login to Weights & Biases (for experiment tracking):

wandb login

For Google Colab users:
- Mount Google Drive for checkpoint persistence
- Ensure GPU runtime is enabled

💻 Usage

Phase I: Full Fine-Tuning

# Open FFT_Phase.ipynb in Jupyter/Colab
# Run all cells sequentially
# Model will be saved to ./models/fft-roberta-final/

Key steps:

Load XLM-RoBERTa-large with causal LM head
Tokenize full dataset (instruction + code)
Train with AdamW optimizer (2 epochs)
Generate test outputs

Phase II: Q-LoRA SFT

# Open SFT_Phase.ipynb
# Update OUTPUT_DIR to your Google Drive path
# Run all cells
# Model saved to qwen2-1.5b-peft-skill/

Key steps:

Load Qwen2-1.5B with 4-bit quantization
Apply LoRA adapters (r=16, alpha=32)
Mask system prompt in loss computation
Train on 3,000 samples
Test with 10 coding prompts

Phase III: DPO Alignment

# Open DPO_Phase.ipynb
# Set SFT_CHECKPOINT_PATH to your SFT model
# Run cells for each beta value (0.1, 0.5, 0.8, 1.0)
# Models saved to dpo_qwen/, dpo_qwen05/, etc.

Key steps:

Load SFT checkpoint as base
Create policy model (trainable) and reference model (frozen)
Train with DPO loss for each beta
Generate outputs for all beta values
Compare behavioral differences

📊 Results and Analysis

Training Metrics

All experiments logged to Weights & Biases for comprehensive analysis:

LSFT (SFT Training Loss): Tracks learning in Phase I & II
LDPO (DPO Loss): Preference optimization loss in Phase III
Learning rate schedules, gradient norms, training speed

Performance Comparison

Phase	Model	Trainable Params	Memory	Generation Quality
FFT	RoBERTa-large	560M (100%)	High	Limited (encoder issue)
SFT	Qwen2-1.5B	2M (0.5%)	Low (4-bit)	High (proper decoder)
DPO	Qwen2-1.5B	2M (0.5%)	Low (4-bit)	High + Aligned

Key Findings

Architecture Matters: Decoder-only models significantly outperform adapted encoders for generation
Efficiency Gains: Q-LoRA achieves better results with 280x fewer trainable parameters
Beta Impact: Higher beta values enforce stronger alignment but may reduce creativity
Memory Efficiency: 4-bit quantization enables training of larger models on limited hardware

⚙️ Technical Specifications

Dataset Information

flytech/python-codes-25k

Size: 25,000 instruction-code pairs
Language: Python
Format: {"instruction": str, "output": str}
Usage:
- Phase I: Full dataset
- Phase II: 3,000 samples (subset)

jondurbin/truthy-dpo-v0.1

Type: Preference pairs dataset
Format: {"prompt": str, "chosen": str, "rejected": str}
Purpose: DPO alignment training
Usage: Phase III

Hyperparameter Rationale

Hyperparameter	Phase I (FFT)	Phase II (SFT)	Phase III (DPO)	Justification
Learning Rate	5e-5	2e-4	1e-5	Lower for full params, higher for adapters
Batch Size	4	1	1	Memory constraints with quantization
Epochs	2	1	3	Fewer epochs for adapter training
Gradient Accumulation	1	4	4	Effective batch size = 4
Warmup Ratio	10%	3%	10%	Stabilize early training

Memory Requirements

FFT (Phase I): ~20GB VRAM (with gradient checkpointing)
SFT (Phase II): ~8GB VRAM (4-bit quantization + LoRA)
DPO (Phase III): ~12GB VRAM (policy + reference models)

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.gitignore		.gitignore
DPO_Phase.ipynb		DPO_Phase.ipynb
FFT_Phase.ipynb		FFT_Phase.ipynb
README.md		README.md
SFT_Phase.ipynb		SFT_Phase.ipynb

Folders and files

Latest commit

History

Repository files navigation

Code Generation and Guarding: A Comprehensive Fine-Tuning Study

📋 Table of Contents

🎯 Overview

Key Objectives

📁 Project Structure

🔬 Experimental Phases

Phase I: Full Fine-Tuning (FFT)

Configuration

Key Characteristics

Expected Outcomes

Phase II: Q-LoRA Supervised Fine-Tuning

Configuration

Key Characteristics

Advantages over FFT

Phase III: Direct Preference Optimization

Configuration

Beta Parameter Experiments

DPO Training Process

Key Characteristics

🚀 Installation

Prerequisites

Required Libraries

Environment Setup

💻 Usage

Phase I: Full Fine-Tuning

Phase II: Q-LoRA SFT

Phase III: DPO Alignment

📊 Results and Analysis

Training Metrics

Performance Comparison

Key Findings

⚙️ Technical Specifications

Dataset Information

flytech/python-codes-25k

jondurbin/truthy-dpo-v0.1

Hyperparameter Rationale

Memory Requirements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages