This repository is my journal for hands-on exploration of building and training a large language model from scratch. Before moving on to fine-tuning and playing with pre-built LLMs, I wanted to understand the under-the-hood workings of LLMs. Here's what I learned along the way.
- Built a basic tokenizer to convert raw text into model-comprehensible token IDs.
- Converted token IDs to word embeddings that capture semantic relationships between tokens.
- Created a dataset for LLM pretraining from a small custom text corpus.
- Implemented the self-attention mechanism.
- Progressively built masked (causal) attention so each token attends only to itself and preceding tokens.
- Created multi-head attention to let the model learn a variety of relationships in parallel (nouns, verbs, subjects, objects, punctuation).
- Implemented layer normalization, shortcut (residual) connections, and a feed-forward neural network.
- Combined these into the core transformer building blocks.
- Architected a GPT-2-style language model from the transformer blocks.
- Pretrained the LLM on the prepared dataset to predict the next word given an input sequence.
- Implemented temperature scaling and top-k sampling for controlled text generation.
- Loaded and used OpenAI's pre-trained GPT-2 weights.
- Modified the LLM architecture to make it suitable for a classification task.
- Fine-tuned the model for practical spam classification.
- Implemented instruction fine-tuning to make the model follow specific commands.
- Explored parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation).
- Experimented with different LoRA ranks (8, 16, 32, 128) to find an optimal balance between adapter size and performance.
- Compared performance with and without considering instruction tokens in training loss.
- Implemented preference alignment to train the LLM to align its responses with user preferences such as tone, structure, detail, and relevance.
- Used the Direct Preference Optimization (DPO) approach on top of the instruction-tuned LLM.
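To make the tokenization step concrete, here is a minimal sketch of a word-level tokenizer like the one described above. The regex, the `<|unk|>` token, and all names are illustrative, not the repo's actual implementation:

```python
import re

class SimpleTokenizer:
    """Toy word-level tokenizer: builds a vocabulary from a text,
    then maps text <-> token IDs. Unknown words map to <|unk|>."""

    def __init__(self, text):
        # Split on punctuation and whitespace, keeping punctuation as tokens.
        tokens = re.split(r'([,.:;?_!"()\']|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        vocab = sorted(set(tokens)) + ["<|unk|>"]
        self.str_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\']|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        unk = self.str_to_id["<|unk|>"]
        return [self.str_to_id.get(t, unk) for t in tokens]

    def decode(self, ids):
        return " ".join(self.id_to_str[i] for i in ids)
```

A real GPT-2-style model uses byte-pair encoding rather than whole words, but the ID-mapping idea is the same.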
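The masked (causal) attention step can be sketched in a few lines of NumPy. This is a single-head version under illustrative names; the mask sets attention scores for future positions to negative infinity so they receive zero weight after softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head masked self-attention: each position attends
    only to itself and earlier positions."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot-product scores
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ V, weights
```

Multi-head attention simply runs several such heads with independent projection matrices and concatenates their outputs.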
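Temperature scaling and top-k sampling combine naturally into one decoding function. A minimal sketch, assuming the model has already produced a logit per vocabulary token (function and parameter names are my own):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Scale logits by temperature, optionally keep only the top-k,
    then sample from the resulting distribution."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if top_k is not None:
        kth = np.sort(logits)[-top_k]                  # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)
    if temperature == 0:                               # greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature                      # <1 sharpens, >1 flattens
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Lower temperatures make generation more deterministic; top-k prevents the tail of unlikely tokens from ever being sampled.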
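The LoRA idea from the fine-tuning bullets can be sketched as a linear layer whose pre-trained weight stays frozen while a low-rank update `A @ B` is trained. This is a simplified NumPy illustration (initialization scheme and names are assumptions, not the repo's code):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update, so the
    effective weight is W + (alpha / r) * A @ B."""

    def __init__(self, W, r=8, alpha=16, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        d_in, d_out = W.shape
        self.W = W                                     # frozen pre-trained weight
        self.A = rng.normal(0, 0.01, size=(d_in, r))   # trainable, small init
        self.B = np.zeros((r, d_out))                  # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W + self.scale * (x @ self.A @ self.B)
```

Because `B` starts at zero, the layer initially behaves exactly like the frozen original, and only `A` and `B` (far fewer parameters than `W` for small ranks) need gradients. Sweeping `r` over values like 8, 16, 32, and 128 trades adapter capacity against parameter count.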
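For the preference-alignment step, the DPO objective compares how much the policy prefers the chosen response over the rejected one, relative to a frozen reference model. A sketch of the per-example loss, given summed log-probabilities of each response (argument names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin)).
    Minimizing it pushes the policy to prefer the chosen response
    more strongly than the reference model does."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -np.log(sigmoid(beta * (policy_margin - ref_margin)))
```

When the policy's preference margin matches the reference's, the loss sits at `log 2`; it falls below that only as the policy learns to favor the chosen responses more.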
"What I cannot create,
I do not understand"
- Richard P. Feynman
Having just an abstract idea of a concept isn't enough to solve a problem fundamentally. With this aim in mind, and freely following my curiosity about the inner workings of LLMs, I started this journey. Along the way, I have tried to piece together every component - from basic tokenization to the intricate attention mechanisms that let a model autoregressively generate coherent text - by building each one from zero.