This repository contains implementations of increasingly sophisticated neural language models, progressing from simple statistical models to transformer architectures. It is a learning-focused project that documents the journey from basic character-level language models to more complex deep learning approaches, largely inspired by Andrej Karpathy's educational content on neural networks and language models.
- 1-bigram_language_model.ipynb
  - Simplest language model, using bigrams (2-character sequences)
  - Covers both a statistical approach (counting bigrams) and a neural network approach
  - Demonstrates character encoding, probability distributions, and sampling
  - Foundation for understanding sequence modeling
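The counting variant can be sketched in a few lines of PyTorch (toy three-name corpus here; the notebook trains on a much larger names dataset):

```python
import torch

# Toy three-name corpus; the notebook uses a much larger names dataset.
words = ["emma", "olivia", "ava"]
chars = sorted(set("".join(words)))
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi["."] = 0                      # '.' marks both start and end of a word
itos = {i: c for c, i in stoi.items()}
V = len(stoi)

# Count bigram occurrences into a V x V matrix.
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    seq = ["."] + list(w) + ["."]
    for c1, c2 in zip(seq, seq[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Row-normalize (with +1 smoothing) into next-character probabilities.
P = (N + 1).float()
P /= P.sum(dim=1, keepdim=True)

# Sample one name, starting from the '.' token.
g = torch.Generator().manual_seed(42)
out, ix = [], 0
for _ in range(20):                # cap length for this sketch
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))
```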
- 2-trigram_language_model.ipynb
  - Extends the bigram model to trigrams (3-character sequences)
  - Takes 2 input characters to predict the next character
  - Increases context and model expressiveness
  - Shows how to build 2D count matrices for higher-order n-grams
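The 2D trick can be sketched as follows: the two-character context (c1, c2) becomes a single row index c1*V + c2, so the counts fit in a (V*V, V) matrix instead of a 3D tensor (toy corpus; padding with two start tokens is one common choice):

```python
import torch

# Toy corpus; two '.' start tokens pad the context (one common choice).
words = ["emma", "olivia", "ava"]
chars = sorted(set("".join(words)))
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi["."] = 0
V = len(stoi)

# Flatten the (c1, c2) context into a single row index c1*V + c2,
# giving a 2D (V*V, V) count matrix instead of a 3D tensor.
N = torch.zeros((V * V, V), dtype=torch.int32)
for w in words:
    seq = [".", "."] + list(w) + ["."]
    for c1, c2, c3 in zip(seq, seq[1:], seq[2:]):
        N[stoi[c1] * V + stoi[c2], stoi[c3]] += 1

# Same row-normalization (with smoothing) as in the bigram model.
P = (N + 1).float()
P /= P.sum(dim=1, keepdim=True)
```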
- 3-mlp_language_model.ipynb
  - Multi-Layer Perceptron approach to language modeling
  - Uses context windows (typically block_size=3)
  - Introduces embedding layers for character encoding
  - More sophisticated than n-gram counting methods
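A minimal forward pass in this style, with illustrative sizes (embedding dim 10, hidden 64) and random dummy data:

```python
import torch
import torch.nn.functional as F

V, block_size, emb_dim, hidden = 27, 3, 10, 64   # illustrative sizes
g = torch.Generator().manual_seed(1)

# Parameters: an embedding table plus a two-layer MLP.
C  = torch.randn((V, emb_dim), generator=g)
W1 = torch.randn((block_size * emb_dim, hidden), generator=g)
b1 = torch.randn(hidden, generator=g)
W2 = torch.randn((hidden, V), generator=g)
b2 = torch.randn(V, generator=g)

# Dummy batch of 8 contexts and targets.
X = torch.randint(0, V, (8, block_size), generator=g)
Y = torch.randint(0, V, (8,), generator=g)

emb = C[X]                                   # (8, 3, 10): embed each character
h = torch.tanh(emb.view(8, -1) @ W1 + b1)    # concatenate context, hidden layer
logits = h @ W2 + b2                         # (8, 27): next-character scores
loss = F.cross_entropy(logits, Y)
```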
- 4-mlp_improvement.ipynb
  - Enhanced MLP with better architecture and training
  - Implements proper train/dev/test split
  - Uses Xavier uniform initialization for better convergence
  - Demonstrates hyperparameter tuning and learning rate scheduling
  - Generates coherent name sequences
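A sketch of the data split and a simple step-decay learning-rate schedule (the fractions, step counts, and rates here are illustrative):

```python
import torch

# Hypothetical dataset of N context/target pairs.
N = 1000
X = torch.randint(0, 27, (N, 3))
Y = torch.randint(0, 27, (N,))

# Shuffle, then take an 80/10/10 train/dev/test split.
g = torch.Generator().manual_seed(2)
perm = torch.randperm(N, generator=g)
n1, n2 = int(0.8 * N), int(0.9 * N)
Xtr,  Ytr  = X[perm[:n1]],   Y[perm[:n1]]
Xdev, Ydev = X[perm[n1:n2]], Y[perm[n1:n2]]
Xte,  Yte  = X[perm[n2:]],   Y[perm[n2:]]

# Step-decay schedule: 0.1 for the first half of training, then 0.01.
max_steps = 200
lrs = [0.1 if step < max_steps // 2 else 0.01 for step in range(max_steps)]
```

Tuning happens against the dev split; the test split is touched only once, at the end.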
- 5-activations_gradients_batchnorm.ipynb
  - Deep dive into activation functions and gradient flow
  - Explores batch normalization effects
  - Analyzes internal layer dynamics during training
  - Addresses vanishing/exploding gradient problems
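The core of batch normalization can be written by hand in a few lines (this mirrors what `nn.BatchNorm1d` does at train time, minus the running statistics):

```python
import torch

# Batch of hidden-layer pre-activations: (batch, hidden).
g = torch.Generator().manual_seed(3)
hpreact = torch.randn(32, 64, generator=g)

bngain = torch.ones(1, 64)    # learnable scale
bnbias = torch.zeros(1, 64)   # learnable shift

# Normalize each hidden unit to zero mean / unit variance over the batch.
mean = hpreact.mean(0, keepdim=True)
var = hpreact.var(0, keepdim=True)
hnorm = bngain * (hpreact - mean) / torch.sqrt(var + 1e-5) + bnbias
```

Keeping pre-activations in this range prevents tanh units from saturating, which is exactly the gradient-flow problem the notebook investigates.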
- 6-backprop_ninja.ipynb
  - Manual backpropagation implementation
  - Detailed breakdown of gradient computations
  - Understanding automatic differentiation mechanics
  - Advanced debugging techniques for neural networks
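The signature example from this kind of exercise is the cross-entropy gradient, which collapses to softmax(logits) minus the one-hot targets; a sketch checked against autograd:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(4)
logits = torch.randn(8, 27, generator=g, requires_grad=True)
Y = torch.randint(0, 27, (8,), generator=g)

# Autograd reference.
loss = F.cross_entropy(logits, Y)
loss.backward()

# Manual gradient: softmax minus one-hot targets, averaged over the batch.
dlogits = F.softmax(logits.detach(), dim=1)
dlogits[range(8), Y] -= 1
dlogits /= 8
```

Comparing `dlogits` against `logits.grad` is the debugging pattern used throughout: derive the gradient by hand, then verify it matches autograd element-wise.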
- 7-wavenet.ipynb
  - WaveNet-style dilated convolutions for sequence modeling
  - Exponentially expanding receptive fields
  - Improved context understanding
  - More sophisticated architecture for better performance
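The WaveNet-flavoured idea can be sketched by fusing consecutive pairs of positions level by level, doubling the receptive field at each layer (sizes here are illustrative):

```python
import torch

g = torch.Generator().manual_seed(5)
B, T, C = 4, 8, 10            # batch, context length, embedding dim
x = torch.randn(B, T, C, generator=g)

# One weight matrix per level; each consumes a pair of adjacent positions.
W1 = torch.randn(2 * C, 16, generator=g) * 0.1
W2 = torch.randn(2 * 16, 16, generator=g) * 0.1
W3 = torch.randn(2 * 16, 16, generator=g) * 0.1

def fuse_pairs(x, W):
    """Merge each pair of consecutive time steps into one feature vector."""
    B, T, C = x.shape
    return torch.tanh(x.view(B, T // 2, 2 * C) @ W)

h = fuse_pairs(x, W1)         # (4, 4, 16): receptive field 2
h = fuse_pairs(h, W2)         # (4, 2, 16): receptive field 4
h = fuse_pairs(h, W3)         # (4, 1, 16): receptive field 8
```

Three levels cover 2^3 = 8 context characters, versus crushing all of them in a single layer as the flat MLP does.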
- gpt_dev.ipynb: Transformer model implementation with multi-head self-attention
- v2.py: Full transformer decoder implementation
  - Multi-head self-attention
  - Feed-forward networks
  - Positional embeddings
  - Configurable layers and heads
  - Training loop with evaluation metrics
  - Text generation capabilities
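A single causal self-attention head, the building block that the multi-head layer repeats, can be sketched like this (sizes illustrative):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(6)
B, T, C, head_size = 2, 8, 32, 16     # illustrative sizes
x = torch.randn(B, T, C, generator=g)

key   = torch.randn(C, head_size, generator=g) * 0.1
query = torch.randn(C, head_size, generator=g) * 0.1
value = torch.randn(C, head_size, generator=g) * 0.1

k, q, v = x @ key, x @ query, x @ value            # (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # scaled dot-product scores
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))    # causal mask: no peeking ahead
wei = F.softmax(wei, dim=-1)                       # attention weights
out = wei @ v                                      # (B, T, head_size)
```

Multi-head attention runs several such heads in parallel and concatenates their outputs.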
- Character-level tokenization
- One-hot encoding
- Probability distributions and sampling
- Loss functions (negative log-likelihood, cross-entropy)
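The concepts above fit in a few lines: character-level tokenization, one-hot encoding, and the equivalence of mean negative log-likelihood with cross-entropy:

```python
import torch
import torch.nn.functional as F

# Character-level tokenization of a toy string.
text = "hello"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
ids = torch.tensor([stoi[c] for c in text])

# One-hot encoding of the token ids.
onehot = F.one_hot(ids, num_classes=len(chars)).float()

# Mean negative log-likelihood equals cross-entropy.
g = torch.Generator().manual_seed(7)
logits = torch.randn(len(text), len(chars), generator=g)
logp = F.log_softmax(logits, dim=1)
nll = -logp[range(len(text)), ids].mean()
ce = F.cross_entropy(logits, ids)
```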
- N-gram Models: Statistical baseline using counting
- Feedforward Networks: MLPs with embeddings and hidden layers
- Convolutional Approaches: Dilated convolutions (WaveNet)
- Transformers: Self-attention mechanisms and positional encoding
The notebooks demonstrate progression in model capability:
- Bigram: Simple character patterns, coherent but limited
- Trigram: More context, better structure
- MLP: Neural approach, learns non-linear patterns
- WaveNet: Dilated receptive fields, richer representations
- Transformer: Self-attention, long-range dependencies, strongest results among these models
- All models use character-level tokenization
- Device detection (MPS for Apple Silicon, CPU fallback)
- Generator seeds for reproducibility
- Evaluation metrics tracked during training
- Sampling/generation functions included in each implementation
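A sketch of those shared conveniences (the seed value here is illustrative):

```python
import torch

# Device pick: MPS on Apple Silicon, CPU fallback.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# Seeded generator so sampling is reproducible run to run.
g = torch.Generator().manual_seed(2147483647)    # seed value is illustrative
sample = torch.randint(0, 27, (4,), generator=g)
```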
This work is educational and builds upon concepts from:
- Andrej Karpathy's lecture series on neural networks
- The Transformer paper ("Attention Is All You Need", Vaswani et al., 2017) and subsequent work
- Classical language modeling techniques
Last Updated: February 2026