Student: Martynas Prascevicius (001263199)
Duration: 5 minutes
Hello, my name is Martynas Prascevicius.
This is my project on optimizing DistilBERT for sentiment analysis.
I focused on ONE model and tested different settings to find the best ones.
DistilBERT inherits the fine-tuning settings recommended in the BERT paper: a learning rate of 2e-5 to 5e-5, batch size 16, and 3 epochs.
But are these REALLY the best settings for sentiment analysis?
My research question is: What are the best settings for DistilBERT when analyzing movie reviews?
I ran 11 experiments in 4 phases. Each phase tests ONE setting while keeping everything else the same.
Phase 1: Baseline with recommended settings - 90.59% accuracy on IMDB reviews.
Phase 2: Tested 4 learning rates - 1e-5, 2e-5, 3e-5, and 5e-5.
Phase 3: Tested 3 batch sizes - 8, 16, and 32.
Phase 4: Tested 3, 4, and 5 epochs.
I used the IMDB dataset with 25,000 reviews for training and 25,000 for testing; the baseline setup is sketched below. Total compute time was about 20 hours on a Mac mini M4.
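For reference, here is a minimal sketch of the Phase 1 baseline run, assuming the Hugging Face `transformers` and `datasets` libraries; the checkpoint name, output directory, and metric code are illustrative assumptions, not my exact script.

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# IMDB ships as 25,000 train / 25,000 test reviews
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate long reviews to the model's maximum input length
    return tokenizer(batch["text"], truncation=True)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # positive / negative

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="phase1-baseline",
    learning_rate=2e-5,              # BERT-paper recommendation
    per_device_train_batch_size=16,  # baseline batch size
    num_train_epochs=3,              # baseline duration
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer,   # pads variable-length batches
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())            # baseline result: 90.59% accuracy
```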
My first finding is about learning rate.
The rate 2e-5 got the BEST performance at 90.62% accuracy. The slower rate 1e-5 reached only 90.45%, and the faster rates did worse still: 5e-5 reached only 90.21%.
This validates the BERT paper recommendations: the baseline rate 2e-5 is optimal. Slower rates do not make enough progress within three epochs, while faster rates take steps that are too large and overshoot good solutions.
This confirms that the original BERT guidance carries over to sentiment analysis. The sweep itself is sketched below.
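Phase 2 repeats the baseline run with only the learning rate changed. A sketch, where `run_experiment()` is a hypothetical helper wrapping the Trainer setup above and returning test accuracy:

```python
# Phase 2 sketch: vary ONE setting (learning rate), hold the rest at baseline.
# run_experiment() is a hypothetical wrapper around the Trainer setup above.
results = {}
for lr in (1e-5, 2e-5, 3e-5, 5e-5):
    results[lr] = run_experiment(learning_rate=lr, batch_size=16, epochs=3)

best_lr = max(results, key=results.get)  # observed best: 2e-5 at 90.62%
```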
My second finding is about batch size.
Batch size 16 got the BEST accuracy at 90.60%. Batch 8 got 90.30%. Batch 32 got the WORST at 90.12%.
Larger batches appear to fit the training data too tightly, which hurts performance on new reviews, while medium batches learn more general patterns. Batch 16 is 0.48 percentage points better than batch 32. One mechanical difference is the number of gradient updates per epoch, shown below.
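On the 25,000-review training set, the update counts differ by 4x between the extremes:

```python
# Gradient updates per epoch on the 25,000-review IMDB training set
train_size = 25_000
for batch_size in (8, 16, 32):
    print(batch_size, train_size // batch_size)
# 8 -> 3125 updates, 16 -> 1562, 32 -> 781
```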
My third finding is about training duration.
At 3 epochs: 90.59% test, 96.73% training, a 6.1-point gap (the baseline).
At 4 epochs: 90.64% test, 98.11% training, the gap grows to 7.5 points.
At 5 epochs: 90.84% test, 98.95% training, an 8.1-point gap.
Test accuracy keeps improving even as the model memorizes the training data, but the growing gap suggests limited robustness. For production, 3 to 4 epochs is the safer choice.
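The gaps quoted above come straight from the reported accuracies:

```python
# Train-test gap per epoch count, from the accuracies reported above (percent)
runs = {3: (96.73, 90.59), 4: (98.11, 90.64), 5: (98.95, 90.84)}
for epochs, (train_acc, test_acc) in runs.items():
    print(f"{epochs} epochs: gap = {train_acc - test_acc:.2f} points")
# 3 epochs: 6.14, 4 epochs: 7.47, 5 epochs: 8.11
```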
Across all 11 experiments, clear patterns emerge.
Highest accuracy: 90.84% with 5 epochs, but with signs of overfitting. Best learning rate: 2e-5 at 90.62%. Worst result: batch size 32 at 90.12%. Baseline: 90.59%.
The optimal configuration: learning rate 2e-5, batch size 16, and 3 to 4 epochs.
This reaches 90.6 to 90.8% accuracy, or 120 to 180 fewer errors on 25,000 reviews compared to the worst configuration (batch size 32 at 90.12%), as the arithmetic below shows.
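The error counts follow directly from the accuracy differences:

```python
# 120 to 180 fewer errors: accuracy deltas versus the worst run (batch 32)
n_reviews = 25_000
worst_acc = 0.9012                    # batch size 32
for tuned_acc in (0.9060, 0.9084):    # tuned range from above
    print(round((tuned_acc - worst_acc) * n_reviews))  # 120, then 180
```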
Three key contributions.
First, the BERT-recommended learning rate of 2e-5 is optimal; the original guidance is correct.
Second, larger batch sizes hurt performance: batch 16 is 0.48 percentage points better than batch 32.
Third, overfitting is more nuanced than expected: test accuracy improves even while the model memorizes, but the 8.1-point train-test gap at 5 epochs suggests limited robustness.
In conclusion, I ran 11 experiments testing different settings.
The BERT-recommended settings work well, and careful tuning gives a 0.25-percentage-point gain over the baseline.
Thank you.
Total duration: approximately 4 minutes 10 seconds
Recommended pace: natural, clear speaking without rushing
| Slide | Topic | Time | Cumulative |
|---|---|---|---|
| 1 | Title | 15s | 0:15 |
| 2 | Problem | 25s | 0:40 |
| 3 | Methodology | 40s | 1:20 |
| 4 | Learning Rate | 40s | 2:00 |
| 5 | Batch Size | 30s | 2:30 |
| 6 | Training Duration | 30s | 3:00 |
| 7 | Results Summary | 30s | 3:30 |
| 8 | Key Findings | 25s | 3:55 |
| 9 | Conclusion | 15s | 4:10 |
Total speaking time: 4 minutes 10 seconds
With natural pauses: ~4 minutes 30 seconds
Target: under 5 minutes ✓