A comparative analysis of EfficientNet-B0 trained with standard cross-entropy loss versus knowledge distillation from EfficientNet-B2.
```
final_DL/
├── config.py               # Configuration and hyperparameters
├── data_loader.py          # Dataset loading and preprocessing
├── models.py               # Model definitions and distillation loss
├── train_baseline.py       # Baseline training script
├── train_distillation.py   # Knowledge distillation training script
├── evaluate.py             # Evaluation and comparison utilities
├── utils.py                # Helper functions
├── requirements.txt        # Python dependencies
├── data/                   # Downloaded datasets (auto-created)
├── checkpoints/            # Saved model weights (auto-created)
└── results/                # Training logs and plots (auto-created)
```
Install the dependencies:

```bash
pip install -r requirements.txt
```

Verify that PyTorch can see the GPU:

```python
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only'}")
```

Train the baseline:

```bash
python train_baseline.py
```

This trains EfficientNet-B0 with standard cross-entropy loss.
Train with knowledge distillation:

```bash
python train_distillation.py
```

This trains EfficientNet-B0 using knowledge distillation from EfficientNet-B2.
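During distillation the teacher only supplies targets, so it is kept frozen and in eval mode while the student trains. A rough sketch of that setup, assuming torchvision models and an illustrative checkpoint path (the repo's `models.py` / `train_distillation.py` may organize this differently):

```python
import torch
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative teacher setup: frozen, eval mode, no gradients tracked.
teacher = models.efficientnet_b2(weights=None)
teacher.classifier[1] = torch.nn.Linear(teacher.classifier[1].in_features, 200)
# In practice the teacher weights would be loaded from a trained checkpoint, e.g.:
# teacher.load_state_dict(torch.load("checkpoints/teacher_b2.pth", map_location=device))
teacher = teacher.to(device).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Inside the training loop, teacher logits are computed without grad:
# with torch.no_grad():
#     teacher_logits = teacher(images)
```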
Evaluate and compare both models:

```bash
python evaluate.py
```

This generates:
- Performance comparison plots
- Confusion matrices
- Detailed comparison report
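Conceptually, the comparison reduces to collecting predictions from each checkpoint and summarizing them. A minimal sketch with scikit-learn and matplotlib (the arrays and output paths here are placeholders, not the script's actual names):

```python
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix

os.makedirs("results", exist_ok=True)

# Placeholder arrays: ground-truth labels and predictions from each model.
y_true = np.random.randint(0, 200, size=1000)
y_pred_baseline = np.random.randint(0, 200, size=1000)
y_pred_distilled = np.random.randint(0, 200, size=1000)

acc_baseline = accuracy_score(y_true, y_pred_baseline)
acc_distilled = accuracy_score(y_true, y_pred_distilled)

# Bar chart comparing top-1 accuracy (illustrative output path).
plt.bar(["Baseline (CE)", "Distilled (KD)"], [acc_baseline, acc_distilled])
plt.ylabel("Top-1 accuracy")
plt.savefig("results/comparison.png")

# 200x200 confusion matrix of the distilled model on TinyImageNet.
cm = confusion_matrix(y_true, y_pred_distilled)
np.save("results/confusion_distilled.npy", cm)
```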
Edit `config.py` to customize the parameters below (a sketch of the expected file follows the table):
| Parameter | Default | Description |
|---|---|---|
| `DATASET_NAME` | `"TinyImageNet"` | Dataset to use (place under `data/tiny-imagenet-200`) |
| `TEACHER_MODEL` | `"efficientnet_b2"` | Teacher model architecture |
| `STUDENT_MODEL` | `"efficientnet_b0"` | Student model architecture |
| `TEMPERATURE` | `4.0` | Distillation temperature |
| `ALPHA` | `0.7` | Weight for soft targets |
| `BATCH_SIZE` | `32` | Batch size (RTX 5070 headroom; lower if OOM) |
| `NUM_EPOCHS` | `50` | Training epochs |
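These map to plain module-level constants. A minimal sketch of what `config.py` might contain under that assumption (the actual file may define additional settings):

```python
# config.py -- illustrative sketch; the actual file may define additional settings
DATASET_NAME = "TinyImageNet"      # expects data under data/tiny-imagenet-200
TEACHER_MODEL = "efficientnet_b2"
STUDENT_MODEL = "efficientnet_b0"

TEMPERATURE = 4.0                  # distillation temperature T
ALPHA = 0.7                        # weight on the soft (KD) term of the loss

BATCH_SIZE = 32                    # lower (e.g. 8) if you run out of memory
NUM_EPOCHS = 50
```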
| Model | Parameters | Size (MB) | GPU Memory (batch=32 @ 64×64) |
|---|---|---|---|
| EfficientNet-B0 (Student) | 5.3M | ~20 MB | ~2-3 GB |
| EfficientNet-B2 (Teacher) | 9.1M | ~35 MB | ~4-5 GB |
| Combined (Training) | - | - | ~6-8 GB (fits RTX 5070) |
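The parameter counts can be reproduced approximately from torchvision's reference implementations (assuming the models come from torchvision; with timm the counts are essentially the same):

```python
import torch
from torchvision import models

def count_params(model: torch.nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

b0 = models.efficientnet_b0(weights=None)
b2 = models.efficientnet_b2(weights=None)
print(f"EfficientNet-B0: {count_params(b0):.1f}M parameters")  # ~5.3M
print(f"EfficientNet-B2: {count_params(b2):.1f}M parameters")  # ~9.1M
```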
The distillation loss combines:
- Soft loss: KL divergence between teacher and student soft predictions
- Hard loss: Cross-entropy with ground truth labels
The combined loss is:

$$
\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\!\left(\sigma(z_t / T) \,\big\|\, \sigma(z_s / T)\right) + (1 - \alpha) \, \mathrm{CE}\!\left(y, \sigma(z_s)\right)
$$

where:
- $z_s$, $z_t$ = student and teacher logits
- $T$ = temperature (higher = softer probabilities)
- $\alpha$ = weight for the soft targets
- $\sigma$ = softmax function
- $y$ = ground-truth labels
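In code this loss is only a few lines of PyTorch. A hedged sketch (the function signature is illustrative; `models.py` may structure it differently):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    # Soft loss: KL divergence between temperature-softened distributions.
    # kl_div expects log-probabilities as input and probabilities as target;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1.0 - alpha) * hard
```

In the training loop, `teacher_logits` would be computed under `torch.no_grad()` so that only the student receives gradients.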
After training on TinyImageNet (64×64):
- The distilled model should outperform the baseline by a few percentage points (the exact delta varies by run)
- Both models have identical inference cost (same architecture)
- The distilled model benefits from the teacher's soft targets across all 200 classes
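Since both checkpoints are the same B0 architecture, a latency check should confirm identical inference cost. A rough benchmark sketch (not the repository's evaluation code; batch size and iteration counts are arbitrary):

```python
import time
import torch
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.efficientnet_b0(weights=None).to(device).eval()
x = torch.randn(32, 3, 64, 64, device=device)  # TinyImageNet-sized batch

with torch.no_grad():
    for _ in range(5):          # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
print(f"Mean forward pass: {(time.perf_counter() - start) / 20 * 1000:.1f} ms")
```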
If you run out of GPU memory:
- Reduce `BATCH_SIZE` in `config.py` (try 8)
- The current config is optimized for 6 GB VRAM with the B2 teacher loaded

If training is too slow:
- Reduce `NUM_EPOCHS`
- Use a smaller dataset (Flowers102)
- Ensure CUDA is actually being used (see the check below)
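To verify CUDA is in use during training, it can help to check where the model parameters and the current batch live. A small diagnostic sketch (`model` and `batch` are placeholders for whatever the training script names them):

```python
import torch

def assert_on_cuda(model: torch.nn.Module, batch: torch.Tensor) -> None:
    """Raise if the model or the current batch is not on a CUDA device."""
    param_device = next(model.parameters()).device
    assert param_device.type == "cuda", f"Model is on {param_device}, not CUDA"
    assert batch.is_cuda, f"Batch is on {batch.device}, not CUDA"
    print(f"OK: model and batch are on {param_device}")
```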