A PyTorch implementation of a GPT-style language model built from scratch.
This project implements a GPT (Generative Pre-trained Transformer) model from the ground up, featuring:
- Multi-head self-attention mechanism
- Transformer blocks with pre-layer normalization
- Custom GELU activation and LayerNorm
- Complete training pipeline
- Text generation capabilities
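The core building block is a transformer block with pre-layer normalization: LayerNorm is applied *before* the attention and feed-forward sub-layers, with a residual connection around each. A minimal sketch of the idea in plain PyTorch, using `nn.LayerNorm`, `nn.GELU`, and `nn.MultiheadAttention` for brevity (the project implements these components from scratch; the class and argument names here are illustrative, not the project's actual API):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative pre-LN transformer block (not the project's exact classes)."""

    def __init__(self, emb_dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(
            emb_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),  # standard 4x expansion
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize before each sub-layer, then add the residual.
        h = self.norm1(x)
        # Causal mask: each position may attend only to itself and earlier tokens.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + self.drop(attn_out)
        x = x + self.drop(self.ff(self.norm2(x)))
        return x
```

Pre-LN (as opposed to the original post-LN placement) tends to stabilize training of deep transformer stacks, which is why GPT-2-style models use it.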
```
building_llm_from_scratch/
├── src/
│   ├── model/
│   │   ├── attention.py          # Multi-head self-attention
│   │   ├── layers.py             # GELU, LayerNorm, FeedForward
│   │   ├── transformer_block.py  # Transformer block
│   │   └── gpt_model.py          # Complete GPT model
│   ├── data/
│   │   └── dataset.py            # Dataset and DataLoader
│   ├── config.py                 # Model configurations
│   ├── utils.py                  # Text generation utilities
│   └── visualization.py          # Training plots
├── data/                         # Training data directory
├── main.py                       # Main training script
├── train.py                      # Training loop
├── requirements.txt              # Dependencies
└── README.md                     # Documentation
```
- Clean Architecture: Production-ready code with comprehensive docstrings and type hints
- GPT-124M Configuration: Implements a GPT model with ~124M parameters
- Custom Implementation: Built from scratch including attention, layer norm, and GELU
- Training Pipeline: Complete training loop with evaluation and sample generation
- Text Generation: Autoregressive text generation with greedy decoding
- Visualization: Training and validation loss plotting
- Modular Design: Well-organized codebase for easy extension and modification
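Greedy decoding, mentioned above, generates text autoregressively by always picking the highest-probability next token and feeding it back into the model. A minimal sketch of the loop (the `model` here is any callable mapping token ids to logits; the project's actual generation helper lives in `src/utils.py` and its signature may differ):

```python
import torch

@torch.no_grad()
def greedy_generate(model, idx: torch.Tensor, max_new_tokens: int,
                    context_length: int) -> torch.Tensor:
    """Append max_new_tokens tokens to idx, always taking the most likely one.

    model: callable mapping (batch, seq) token ids -> (batch, seq, vocab) logits.
    """
    for _ in range(max_new_tokens):
        # Crop the running sequence to the model's maximum context length.
        idx_cond = idx[:, -context_length:]
        logits = model(idx_cond)           # (batch, seq, vocab)
        logits = logits[:, -1, :]          # keep logits for the last position
        next_id = torch.argmax(logits, dim=-1, keepdim=True)  # greedy pick
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```

Greedy decoding is deterministic; swapping `argmax` for sampling from a temperature-scaled softmax would yield more varied output.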
```bash
# Install dependencies using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt
```

- Prepare your data: Place your training text file in the `data/` directory
- Configure training: Edit hyperparameters in `main.py` if needed
- Run training:

```bash
python main.py
```

Default configuration (GPT-124M):
- Vocabulary size: 50,257 (GPT-2 tokenizer)
- Context length: 128 tokens
- Embedding dimension: 768
- Number of attention heads: 12
- Number of transformer layers: 12
- Dropout: 0.1
- QKV bias: True
You can modify these settings in src/config.py or create new configurations.
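The defaults above map naturally onto a plain configuration dictionary. A sketch of what `GPT_CONFIG_124M` plausibly contains, with the values listed above (the exact key names are assumptions; check `src/config.py` for the real ones):

```python
# Assumed shape of GPT_CONFIG_124M; key names are illustrative.
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 tokenizer vocabulary
    "context_length": 128,   # maximum sequence length in tokens
    "emb_dim": 768,          # embedding dimension
    "n_heads": 12,           # attention heads
    "n_layers": 12,          # transformer blocks
    "drop_rate": 0.1,        # dropout probability
    "qkv_bias": True,        # bias terms in the QKV projections
}

# A smaller variant for quick experiments could reuse the same keys:
GPT_CONFIG_SMALL = {**GPT_CONFIG_124M, "emb_dim": 256, "n_layers": 6, "n_heads": 8}
```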
```python
from src.model.gpt_model import GPTModel
from src.config import GPT_CONFIG_124M
from src.utils import generate_and_print
import tiktoken
import torch

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.to(device)
model.eval()  # disable dropout for generation

# Generate text
tokenizer = tiktoken.get_encoding("r50k_base")
text = generate_and_print(
    model=model,
    tokenizer=tokenizer,
    device=device,
    prompt="Once upon a time",
    max_new_tokens=100,
)
print(text)
```

- Python 3.8+
- PyTorch 2.0+
- tiktoken
- matplotlib