Knowledge Distillation: EfficientNet-B0 Baseline vs Distilled

A comparative analysis of EfficientNet-B0 trained with standard cross-entropy loss versus EfficientNet-B0 trained with knowledge distillation from an EfficientNet-B2 teacher.

Project Structure

final_DL/
├── config.py              # Configuration and hyperparameters
├── data_loader.py         # Dataset loading and preprocessing
├── models.py              # Model definitions and distillation loss
├── train_baseline.py      # Baseline training script
├── train_distillation.py  # Knowledge distillation training script
├── evaluate.py            # Evaluation and comparison utilities
├── utils.py               # Helper functions
├── requirements.txt       # Python dependencies
├── data/                  # Downloaded datasets (auto-created)
├── checkpoints/           # Saved model weights (auto-created)
└── results/               # Training logs and plots (auto-created)

Setup

1. Install Dependencies

pip install -r requirements.txt

2. Verify GPU (Optional)

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only'}")

Usage

Step 1: Train Baseline Model

python train_baseline.py

This trains EfficientNet-B0 with standard cross-entropy loss.
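
Under the hood this is a standard supervised loop. A minimal sketch, assuming torchvision's efficientnet_b0 with an ImageNet-pretrained backbone and a 200-class head for TinyImageNet; the names below are illustrative, not the script's actual API:

import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# EfficientNet-B0 student with its classifier resized to 200 TinyImageNet classes
student = models.efficientnet_b0(weights="IMAGENET1K_V1")
student.classifier[1] = nn.Linear(student.classifier[1].in_features, 200)
student = student.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

def train_one_epoch(train_loader):
    # train_loader would come from data_loader.py
    student.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(student(images), labels)
        loss.backward()
        optimizer.step()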

Step 2: Train Distilled Model

python train_distillation.py

This trains EfficientNet-B0 using knowledge distillation from EfficientNet-B2.
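
Conceptually, the only change from the baseline loop is a frozen teacher providing soft targets. A sketch continuing the baseline example above (reusing student, device, and optimizer); the combined distillation_loss is spelled out in the Knowledge Distillation section below:

import torch
import torch.nn as nn
from torchvision import models

# EfficientNet-B2 teacher: resized head, frozen weights, eval mode
teacher = models.efficientnet_b2(weights="IMAGENET1K_V1")
teacher.classifier[1] = nn.Linear(teacher.classifier[1].in_features, 200)
teacher = teacher.to(device).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

def distill_one_epoch(train_loader):
    student.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            teacher_logits = teacher(images)      # soft targets, no gradient needed
        student_logits = student(images)
        loss = distillation_loss(student_logits, teacher_logits, labels)  # see below
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()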

Step 3: Compare Results

python evaluate.py

This generates:

  • Performance comparison plots
  • Confusion matrices
  • Detailed comparison report
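
A minimal version of the accuracy comparison, assuming both runs save a state_dict under checkpoints/; the checkpoint filenames here are hypothetical, and baseline, distilled, and test_loader stand in for the models and data loader the script builds:

import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Hypothetical checkpoint names; both are EfficientNet-B0 instances
baseline.load_state_dict(torch.load("checkpoints/baseline_b0.pth", map_location="cpu"))
distilled.load_state_dict(torch.load("checkpoints/distilled_b0.pth", map_location="cpu"))
print(f"Baseline  top-1: {top1_accuracy(baseline, test_loader):.4f}")
print(f"Distilled top-1: {top1_accuracy(distilled, test_loader):.4f}")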

Configuration

Edit config.py to customize:

Parameter | Default | Description
DATASET_NAME | "TinyImageNet" | Dataset to use (place under data/tiny-imagenet-200)
TEACHER_MODEL | "efficientnet_b2" | Teacher model architecture
STUDENT_MODEL | "efficientnet_b0" | Student model architecture
TEMPERATURE | 4.0 | Distillation temperature
ALPHA | 0.7 | Weight for soft targets
BATCH_SIZE | 32 | Batch size (RTX 5070 headroom; lower if OOM)
NUM_EPOCHS | 50 | Training epochs
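
Based on the parameter names above, config.py is a flat module of constants, roughly like this (defaults mirror the table; the actual file may carry more options):

# config.py -- sketch mirroring the defaults in the table above
DATASET_NAME  = "TinyImageNet"     # expects the data under data/tiny-imagenet-200
TEACHER_MODEL = "efficientnet_b2"
STUDENT_MODEL = "efficientnet_b0"
TEMPERATURE   = 4.0                # distillation temperature T
ALPHA         = 0.7                # weight on the soft-target term
BATCH_SIZE    = 32                 # drop toward 8 if you hit OOM
NUM_EPOCHS    = 50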

Model Sizes

Model | Parameters | Size (MB) | GPU Memory (batch=32 @ 64×64)
EfficientNet-B0 (Student) | 5.3M | ~20 MB | ~2-3 GB
EfficientNet-B2 (Teacher) | 9.1M | ~35 MB | ~4-5 GB
Combined (Training) | - | - | ~6-8 GB (fits RTX 5070)
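
The parameter counts are easy to sanity-check; the sizes assume float32 weights (4 bytes per parameter) and the stock 1000-class ImageNet head, so a 200-class head shaves off a little:

import torch.nn as nn
from torchvision import models

def summarize(name: str, model: nn.Module) -> None:
    n_params = sum(p.numel() for p in model.parameters())
    size_mb = n_params * 4 / 1024 ** 2   # float32 weights: 4 bytes each
    print(f"{name}: {n_params / 1e6:.1f}M parameters, ~{size_mb:.0f} MB")

summarize("EfficientNet-B0 (student)", models.efficientnet_b0())
summarize("EfficientNet-B2 (teacher)", models.efficientnet_b2())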

Knowledge Distillation

The distillation loss combines:

  • Soft loss: KL divergence between teacher and student soft predictions
  • Hard loss: Cross-entropy with ground truth labels

$$\mathcal{L} = \alpha \cdot T^2 \cdot \text{KL}\left(\sigma(z_t/T) \,\|\, \sigma(z_s/T)\right) + (1-\alpha) \cdot \text{CE}(z_s, y)$$

Where:

  • $z_s$, $z_t$ = student and teacher logits
  • $T$ = temperature (higher = softer probabilities)
  • $\alpha$ = weight for soft targets
  • $\sigma$ = softmax function
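
In PyTorch this maps almost line-for-line onto the formula. A sketch of the combined loss under these definitions (models.py may structure it differently); F.kl_div with student log-probabilities as the input and teacher probabilities as the target yields exactly the KL term above:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft term: KL between temperature-softened teacher and student distributions,
    # scaled by T^2 so its gradient magnitude stays comparable as T changes
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard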

Expected Results

After training on TinyImageNet (64×64):

  • Distilled model should outperform baseline by a few percentage points (exact delta varies)
  • Both models share the same inference cost (identical architecture)
  • Distilled model benefits from teacher soft targets across 200 classes

Troubleshooting

Out of Memory (OOM)

  • Reduce BATCH_SIZE in config.py (try 8)
  • Current config is optimized for 6GB VRAM with B2 teacher

Slow Training

  • Reduce NUM_EPOCHS
  • Use a smaller dataset (e.g., Flowers102)
  • Ensure CUDA is being used

References

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
