Skip to content

ranimeshehata/Code-Generation-and-Guarding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Code Generation and Guarding: A Comprehensive Fine-Tuning Study

This repository implements a three-phase experimental study on language model fine-tuning for Python code generation, comparing Full Fine-Tuning (FFT), Supervised Fine-Tuning with Q-LoRA, and Direct Preference Optimization (DPO) for behavioral alignment.

📋 Table of Contents

🎯 Overview

This project explores different approaches to fine-tuning language models for code generation:

  1. Baseline Approach (FFT): Full parameter fine-tuning of an encoder-based model (XLM-RoBERTa-large)
  2. Parameter-Efficient Training (Q-LoRA SFT): Low-rank adaptation of a proper decoder model (Qwen2-1.5B-Instruct)
  3. Behavioral Alignment (DPO): Preference-based optimization for safe and helpful code generation

Key Objectives

  • Capability Acquisition: Train models to generate functional Python code
  • Efficiency Comparison: Analyze full fine-tuning vs. parameter-efficient methods
  • Behavioral Control: Implement preference optimization for aligned model behavior
  • Architecture Analysis: Compare encoder-based vs. decoder-based models for generation tasks

📁 Project Structure

Code-Generation-and-Guarding/
├── FFT_Phase.ipynb           # Part I: Full Fine-Tuning (RoBERTa)
├── SFT_Phase.ipynb           # Part II: Q-LoRA SFT (Qwen)
├── DPO_Phase.ipynb           # Part III: Direct Preference Optimization
├── README.md                 # This file
└── wandb/                    # Experiment tracking logs
    ├── config.yaml
    ├── requirements.txt
    └── models/
        └── fft-roberta-final/

🔬 Experimental Phases

Phase I: Full Fine-Tuning (FFT)

Objective: Establish a baseline by fine-tuning all parameters of an encoder model adapted for generation.

Configuration

Parameter Value
Model FacebookAI/xlm-roberta-large (0.56B params)
Architecture Encoder + Causal LM head
Dataset flytech/python-codes-25k (full)
Training Type Full parameter updates (~560M params)
Batch Size 4
Learning Rate 5e-5
Epochs 2
Optimizer AdamW
Precision BF16

Key Characteristics

  • ✅ All model parameters are updated during training
  • ✅ Memory-intensive approach (requires gradient checkpointing)
  • ✅ Computes loss over full instruction + code sequence
  • ⚠️ Architectural limitation: Encoder model with bidirectional attention (not ideal for generation)

Expected Outcomes

  • Moderate code generation capability
  • Limited by encoder architecture (bidirectional attention patterns)
  • Serves as baseline for comparison with proper decoder models

Phase II: Q-LoRA Supervised Fine-Tuning

Objective: Implement parameter-efficient fine-tuning using LoRA with 4-bit quantization on a proper decoder model.

Configuration

Parameter Value
Model Qwen/Qwen2-1.5B-Instruct
Architecture Decoder-only (proper causal LM)
Dataset flytech/python-codes-25k (3,000 samples)
Quantization 4-bit NF4 (via BitsAndBytes)
LoRA Rank (r) 16
LoRA Alpha 32
Target Modules All linear layers
Trainable Params ~0.5% of total (thanks to LoRA)
Batch Size 1 (gradient accumulation: 4)
Learning Rate 2e-4
Epochs 1
Optimizer paged_adamw_8bit

Key Characteristics

  • Parameter-Efficient: Only LoRA adapters trained (~0.5% of parameters)
  • Memory-Efficient: 4-bit quantization reduces memory footprint by 75%
  • Proper Architecture: Decoder-only model designed for generation
  • Instruction Masking: Loss computed only on code completions (not prompts)
  • System Prompt: "You are a helpful Python coding assistant"

Advantages over FFT

  1. Better Architecture: Decoder-only vs. adapted encoder
  2. Efficiency: ~280x fewer trainable parameters
  3. Memory: 4-bit quantization enables larger models on same hardware
  4. Quality: Superior generation due to causal attention

Phase III: Direct Preference Optimization

Objective: Align the SFT model with human preferences using preference pairs to ensure safe and helpful behavior.

Configuration

Parameter Value
Base Model SFT checkpoint from Phase II
Dataset jondurbin/truthy-dpo-v0.1
Beta Values 0.1, 0.5, 0.8, 1.0
Batch Size 1 (gradient accumulation: 4)
Learning Rate 1e-5
Epochs 3
Max Prompt Length 512
Max Total Length 1024

Beta Parameter Experiments

The beta (β) parameter controls the strength of preference optimization:

  • β = 0.1: Weak preference signal (closer to SFT baseline)
  • β = 0.5: Moderate alignment
  • β = 0.8: Strong alignment
  • β = 1.0: Maximum preference enforcement

DPO Training Process

  1. Policy Model: Trainable LoRA adapters on top of SFT checkpoint
  2. Reference Model: Frozen SFT checkpoint (for computing KL divergence)
  3. Preference Pairs: Each sample contains:
    • Prompt
    • Chosen response (preferred)
    • Rejected response (dispreferred)

Key Characteristics

  • Dual-Adapter Setup: SFT adapters + DPO adapters (stacked)
  • Behavioral Alignment: Model learns to prefer helpful/safe responses
  • KL Constraint: Beta controls divergence from reference model
  • No Reward Model: Direct optimization without separate reward model

🚀 Installation

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended: 16GB+ VRAM)
  • Google Colab (if running notebooks as-is)

Required Libraries

pip install torch transformers datasets
pip install peft trl bitsandbytes accelerate
pip install wandb

Environment Setup

  1. Clone the repository:
git clone https://github.com/yourusername/Code-Generation-and-Guarding.git
cd Code-Generation-and-Guarding
  1. Install dependencies:
pip install -r wandb/requirements.txt
  1. Login to Weights & Biases (for experiment tracking):
wandb login
  1. For Google Colab users:
    • Mount Google Drive for checkpoint persistence
    • Ensure GPU runtime is enabled

💻 Usage

Phase I: Full Fine-Tuning

# Open FFT_Phase.ipynb in Jupyter/Colab
# Run all cells sequentially
# Model will be saved to ./models/fft-roberta-final/

Key steps:

  1. Load XLM-RoBERTa-large with causal LM head
  2. Tokenize full dataset (instruction + code)
  3. Train with AdamW optimizer (2 epochs)
  4. Generate test outputs

Phase II: Q-LoRA SFT

# Open SFT_Phase.ipynb
# Update OUTPUT_DIR to your Google Drive path
# Run all cells
# Model saved to qwen2-1.5b-peft-skill/

Key steps:

  1. Load Qwen2-1.5B with 4-bit quantization
  2. Apply LoRA adapters (r=16, alpha=32)
  3. Mask system prompt in loss computation
  4. Train on 3,000 samples
  5. Test with 10 coding prompts

Phase III: DPO Alignment

# Open DPO_Phase.ipynb
# Set SFT_CHECKPOINT_PATH to your SFT model
# Run cells for each beta value (0.1, 0.5, 0.8, 1.0)
# Models saved to dpo_qwen/, dpo_qwen05/, etc.

Key steps:

  1. Load SFT checkpoint as base
  2. Create policy model (trainable) and reference model (frozen)
  3. Train with DPO loss for each beta
  4. Generate outputs for all beta values
  5. Compare behavioral differences

📊 Results and Analysis

Training Metrics

All experiments logged to Weights & Biases for comprehensive analysis:

  • LSFT (SFT Training Loss): Tracks learning in Phase I & II
  • LDPO (DPO Loss): Preference optimization loss in Phase III
  • Learning rate schedules, gradient norms, training speed

Performance Comparison

Phase Model Trainable Params Memory Generation Quality
FFT RoBERTa-large 560M (100%) High Limited (encoder issue)
SFT Qwen2-1.5B 2M (0.5%) Low (4-bit) High (proper decoder)
DPO Qwen2-1.5B 2M (0.5%) Low (4-bit) High + Aligned

Key Findings

  1. Architecture Matters: Decoder-only models significantly outperform adapted encoders for generation
  2. Efficiency Gains: Q-LoRA achieves better results with 280x fewer trainable parameters
  3. Beta Impact: Higher beta values enforce stronger alignment but may reduce creativity
  4. Memory Efficiency: 4-bit quantization enables training of larger models on limited hardware

⚙️ Technical Specifications

Dataset Information

flytech/python-codes-25k

  • Size: 25,000 instruction-code pairs
  • Language: Python
  • Format: {"instruction": str, "output": str}
  • Usage:
    • Phase I: Full dataset
    • Phase II: 3,000 samples (subset)

jondurbin/truthy-dpo-v0.1

  • Type: Preference pairs dataset
  • Format: {"prompt": str, "chosen": str, "rejected": str}
  • Purpose: DPO alignment training
  • Usage: Phase III

Hyperparameter Rationale

Hyperparameter Phase I (FFT) Phase II (SFT) Phase III (DPO) Justification
Learning Rate 5e-5 2e-4 1e-5 Lower for full params, higher for adapters
Batch Size 4 1 1 Memory constraints with quantization
Epochs 2 1 3 Fewer epochs for adapter training
Gradient Accumulation 1 4 4 Effective batch size = 4
Warmup Ratio 10% 3% 10% Stabilize early training

Memory Requirements

  • FFT (Phase I): ~20GB VRAM (with gradient checkpointing)
  • SFT (Phase II): ~8GB VRAM (4-bit quantization + LoRA)
  • DPO (Phase III): ~12GB VRAM (policy + reference models)

About

This repository implements a three-phase experimental study on language model fine-tuning for Python code generation, comparing Full Fine-Tuning (FFT), Supervised Fine-Tuning with Q-LoRA, and Direct Preference Optimization (DPO) for behavioral alignment.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors