A complete implementation of a Transformer-based text classification model built from scratch using PyTorch. This project demonstrates how to implement the Transformer architecture without relying on pre-built libraries like Hugging Face Transformers.
- Custom Transformer Architecture: Complete implementation from scratch including:
  - Multi-head self-attention mechanism
  - Positional encoding using sine/cosine functions
  - Feed-forward networks with residual connections
  - Layer normalization
  - Configurable number of layers, heads, and dimensions
- Text Classification: Supports sentiment analysis and topic classification
- Baseline Comparisons: Includes LSTM and Bag-of-Words baselines for performance comparison
- Comprehensive Evaluation:
  - Training/validation curves
  - Confusion matrices
  - Classification reports
  - Attention weight visualization
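
As an illustration of the multi-head self-attention feature listed above, here is a minimal sketch of how such a layer could be written in PyTorch. The class name `MultiHeadSelfAttention`, the `pad_mask` argument, and the choice to return the attention weights (handy for visualization) are assumptions made for this example, not necessarily how the project's code is organized.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention split across several heads (illustrative)."""

    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        if d_model % n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, pad_mask=None):
        # x: (batch, seq_len, d_model)
        # pad_mask (hypothetical argument): (batch, seq_len), True at padding positions
        b, t, _ = x.shape

        def split(h):
            # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
            return h.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if pad_mask is not None:
            # Block attention to padded key positions
            scores = scores.masked_fill(pad_mask[:, None, None, :], float("-inf"))
        weights = scores.softmax(dim=-1)  # (batch, n_heads, seq_len, seq_len)
        out = (weights @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), weights


attn = MultiHeadSelfAttention(d_model=128, n_heads=8)
out, weights = attn(torch.randn(2, 10, 128))
print(out.shape, weights.shape)
```

The returned `weights` tensor has one `seq_len x seq_len` map per head, which is the kind of data an attention heatmap would plot.
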
EVALUATION METRICS SUMMARY
==================================================
Overall Accuracy: 0.8756
Per-Class Metrics:
------------------------------
Negative:
Precision: 0.8842
Recall: 0.8667
F1-Score: 0.8754
Positive:
Precision: 0.8674
Recall: 0.8845
F1-Score: 0.8759
MODEL COMPARISON
==================================================
Transformer Accuracy: 0.8756
Bag-of-Words Accuracy: 0.8234
LSTM Accuracy: 0.8456
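
For reference, the accuracy, precision, recall, and F1 figures in a summary like the one above follow the standard definitions. The sketch below shows one way to compute them from lists of true and predicted labels. The function name `per_class_metrics` and the toy labels are made up for illustration, and the example does not reproduce the numbers reported above.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return overall accuracy and per-class precision/recall/F1."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, metrics


# Toy data: 0 = Negative, 1 = Positive
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
accuracy, metrics = per_class_metrics(y_true, y_pred, labels=[0, 1])
print(f"Overall Accuracy: {accuracy:.4f}")
for label, name in [(0, "Negative"), (1, "Positive")]:
    m = metrics[label]
    print(f"{name}: P={m['precision']:.4f} R={m['recall']:.4f} F1={m['f1']:.4f}")
```
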
The Transformer model includes:
- Token Embedding: Converts token IDs to dense vectors
- Positional Encoding: Adds position information using sine/cosine functions
- Multi-Head Attention: Multiple attention heads for capturing different relationships
- Feed-Forward Networks: Position-wise fully connected layers
- Residual Connections: Skip connections for better gradient flow
- Layer Normalization: Normalization after each sub-layer
- Classification Head: Global average pooling + linear layer
- Default Configuration: 128 dimensions, 8 heads, 6 layers
- Trainable Parameters: ~1.2M parameters (configurable)
- Memory Efficient: Small enough to train on a CPU when no GPU is available
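
To make the component list and default configuration above concrete, here is a rough sketch of how the pieces could fit together. It is not the project's code. For brevity it uses PyTorch's built-in `nn.TransformerEncoderLayer` as a stand-in for the hand-written attention and feed-forward layers. The feed-forward width `d_ff=512`, `max_len`, `dropout`, and the vocabulary size are assumptions, and the mean pooling here does not mask padding tokens.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sine/cosine position signal added to the token embeddings."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe)  # stored with the model, not trained

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]


class TransformerClassifier(nn.Module):
    """Embedding -> positional encoding -> encoder stack -> mean pool -> linear."""

    def __init__(self, vocab_size, num_classes, d_model=128, n_heads=8,
                 n_layers=6, d_ff=512, max_len=512, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = SinusoidalPositionalEncoding(d_model, max_len)
        # Stand-in for the from-scratch layers; post-norm is the default here
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer IDs
        x = self.pos(self.embed(token_ids))
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # global average pooling over tokens


model = TransformerClassifier(vocab_size=10_000, num_classes=2)  # vocab size is hypothetical
logits = model(torch.randint(0, 10_000, (4, 32)))
encoder_params = sum(p.numel() for p in model.encoder.parameters())
print(logits.shape, encoder_params)
```

Under these assumptions the six encoder layers alone account for roughly 1.19M parameters, which is in line with the ~1.2M figure. The embedding table adds `vocab_size * d_model` on top of that, so the total depends on the vocabulary.
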
The model achieves competitive performance on standard benchmarks:
- Sentiment Analysis: ~87-90% accuracy
- Topic Classification: ~85-88% accuracy
Performance varies based on:
- Model size (dimensions, layers, heads)
- Training epochs
- Learning rate
- Sequence length
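
One way to keep these factors in a single place when comparing runs is a small config object. The sketch below is only a suggestion: `ExperimentConfig` is a hypothetical name, the model-size defaults mirror the default configuration stated earlier, and the values for epochs, learning rate, and sequence length are placeholders rather than the project's tuned settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    # Model size (defaults from the stated default configuration)
    d_model: int = 128
    n_heads: int = 8
    n_layers: int = 6
    # Training settings: placeholder values, not tuned defaults
    epochs: int = 10
    learning_rate: float = 1e-4
    max_seq_len: int = 256

    def __post_init__(self):
        # Each attention head gets d_model / n_heads dimensions
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")


base = ExperimentConfig()
# A smaller variant for a quick CPU run; replace() re-runs the validation
small = replace(base, d_model=64, n_heads=4, n_layers=2)
print(base)
print(small)
```
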