Hands-on sandbox exercises for learning Large Language Models (LLMs), organized by the chapters of Jurafsky & Martin's Speech and Language Processing (3rd ed.) — the draft is free and openly published online. Each exercise pairs a short theory recap with a runnable coding task, oriented toward political-science text (speeches, parliamentary debates, reviews, news, social media).
This repo is the practice companion to the LLM concept notes in Notion: every concept page there has a matching exercise here.
Folders are numbered to match the J&M chapters (3rd ed., latest draft). Each chapter folder contains one subfolder per exercise, and each exercise has:
- `README.md`: the task spec (objective, dataset, tasks, success criteria, reference)
- `*_sandbox.py`: a starter script with `TODO`s to fill in
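To make the layout concrete, here is a minimal sketch of the shape a starter stub might take. It is not copied from the repo: the file name `tokens_sandbox.py`, both function names, and the sample sentence are hypothetical, chosen only to show one provided baseline next to one `TODO`.

```python
"""tokens_sandbox.py: hypothetical example of a starter stub (not from the repo)."""


def whitespace_tokenize(text: str) -> list[str]:
    """Baseline of the kind a stub might provide: split on runs of whitespace."""
    return text.split()


def regex_tokenize(text: str) -> list[str]:
    """TODO: separate punctuation from words (for example with re.findall)."""
    # Left unimplemented on purpose; filling this in would be the exercise.
    raise NotImplementedError("TODO: implement regex_tokenize")


if __name__ == "__main__":
    # Illustrative sentence in the style of a parliamentary speech.
    speech = "Mr. Speaker, I rise today to oppose the bill."
    print(whitespace_tokenize(speech))
```

Raising `NotImplementedError` in the unfinished function makes it obvious at run time which parts of a stub are still open.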
| Folder | J&M Chapter | Exercises |
|---|---|---|
| `02_tokenization/` | Ch. 2 | Tokens · Training Data & Corpora |
| `03_ngram_language_models/` | Ch. 3 | N-gram Language Models · Perplexity & LM Evaluation |
| `04_naive_bayes_sentiment/` | Ch. 4 | VADER vs. Star Ratings · Classification Metrics |
| `05_logistic_regression/` | Ch. 5 | Logistic Regression & Text Classification · Bag of Words & TF-IDF |
| `06_vector_semantics/` | Ch. 6 | Embeddings · How Embeddings Get Trained |
| `07_neural_networks/` | Ch. 7 | Neural Networks |
| `08_rnns_lstms/` | Ch. 8 | RNNs & LSTMs |
| `09_transformers/` | Ch. 9 | Transformers · Attention |
| `10_large_language_models/` | Ch. 10 | Pretraining vs. Fine-tuning · Decoding & Sampling · Context Window · Scaling Laws |
| `11_masked_language_models/` | Ch. 11 | Masked Language Models (BERT) |
| `12_alignment_prompting/` | Ch. 12 | Prompting & In-Context Learning · RLHF & Alignment · Bias & Fairness · Agents & Tool Use |
| `13_machine_translation/` | Ch. 13 | Machine Translation & BLEU |
| `14_ir_qa_rag/` | Ch. 14 | Information Retrieval · RAG Pipeline · Hallucination & Validation |
| `capstone_text_annotation/` | Capstone | LLMs for Text Annotation & Classification · Inter-rater Reliability |

Note on numbering: this repo follows the latest J&M 3rd-ed draft, where Ch. 8 = RNNs & LSTMs and Ch. 9 = Transformers. Some Notion course reading pages still use an older numbering (e.g. Transformers as Ch. 8); those should be reconciled to match this map. See issue #1.
Requires Python 3.11 (see `.python-version`). Dependencies are pinned in `requirements.txt` for reproducibility.

    python -m venv .venv
    source .venv/bin/activate   # Windows: .venv\Scripts\activate
    pip install -r requirements.txt

Datasets go in `data/` (git-ignored except its README). A tiny runnable sample lives in `data/sample/` so the early exercises work out of the box before you supply real data.
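A quick way to confirm the sample data is in place before starting an exercise is to list what sits under `data/sample/`. The sketch below assumes only the directory layout described above; the helper name `list_sample_files` is hypothetical, and it makes no assumption about file names or formats.

```python
from pathlib import Path


def list_sample_files(data_dir: str = "data") -> list[Path]:
    """Return sorted file paths under <data_dir>/sample, or [] if the folder is missing."""
    sample_dir = Path(data_dir) / "sample"
    if not sample_dir.is_dir():
        return []
    # rglob("*") walks subfolders too; keep regular files only.
    return sorted(p for p in sample_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    files = list_sample_files()
    if not files:
        print("No sample files found under data/sample/")
    for path in files:
        print(path)
```

Run it from the repository root so the relative `data` path resolves correctly.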
- Read the concept note in Notion.
- Open the exercise `README.md` here.
- Fill in the `TODO`s in the `*_sandbox.py` stub.
- Commit your solution next to the stub (e.g. `tokens_solution.py`).
- Daniel Jurafsky & James H. Martin, Speech and Language Processing (3rd ed. draft) — free online: https://web.stanford.edu/~jurafsky/slp3/
MIT — see LICENSE.