Demonstrates, at very small scale, how a language model is trained.
This repository is part of a series of toy training repositories plus a companion client repository:
- Training repositories produce pretrained artifacts (vocabulary, weights, metadata).
- A web app loads the artifacts and provides an interactive prompt.
This repository includes:
- a small, declared text corpus
- a tokenizer and vocabulary builder
- a simple next-token prediction model
- a repeatable training loop
- committed, inspectable artifacts for downstream use
This is:
- an intentionally inspectable training pipeline
- a next-token predictor trained on an explicit corpus
This is not:
- a production system
- a full Transformer implementation
- a chat interface
- a claim of semantic understanding
This repository produces and commits pretrained artifacts under artifacts/.
Training logs and evidence are written under outputs/ (for example, outputs/train_log.csv).
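As a rough illustration of what writing a per-epoch training log can look like, the sketch below uses Python's standard `csv` module. The column names (`epoch`, `avg_loss`), the helper `write_train_log`, and the row values are hypothetical placeholders, not the actual schema or results of outputs/train_log.csv. It writes to a temporary directory so that running it does not touch the real outputs/ folder.

```python
import csv
import tempfile
from pathlib import Path


def write_train_log(path: Path, rows: list[dict[str, float]]) -> None:
    """Write one CSV row per epoch, creating the parent folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", "avg_loss"])
        writer.writeheader()
        writer.writerows(rows)


# Placeholder values for illustration only; these are not real training results.
rows = [
    {"epoch": 1, "avg_loss": 2.5},
    {"epoch": 2, "avg_loss": 1.75},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "outputs" / "train_log.csv"
    write_train_log(log_path, rows)
    lines = log_path.read_text(encoding="utf-8").splitlines()

print(lines[0])  # prints "epoch,avg_loss"
```

A plain CSV keeps the log easy to open in a spreadsheet or diff in version control, which fits the goal of inspectable training evidence.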
See SETUP.md for full setup and workflow instructions.
Run the full training script:
uv run python src/toy_gpt_train/d_train.py

Run individually:
- a/b/c are demos (can be run alone if desired)
- d_train produces artifacts
- e_infer consumes artifacts
uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py

The commands below are used in the workflow guide above. They are provided here for convenience.
Follow the guide for the full instructions.
After you get a copy of this repo in your own GitHub account,
open a machine terminal in your Repos folder:
# Replace username with YOUR GitHub username.
git clone https://github.com/username/train-300-context-2
cd train-300-context-2
code .

uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade
uvx pre-commit install
git add -A
uvx pre-commit run --all-files
# run Python
uv run ruff format .
uv run ruff check . --fix
uv run zensical build
git add -A
git commit -m "update"
git push -u origin main

The primary corpus used for training is declared in SE_MANIFEST.toml.
This repository commits pretrained artifacts so the client can run without retraining.
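To show why committed artifacts let a client skip training, here is a minimal save-and-load round trip using JSON. The file names (`vocab.json`, `weights.json`, `meta.json`), the helper functions, and the zero-valued weights are assumptions made for this sketch; the actual file names and formats under artifacts/ may differ.

```python
import json
import tempfile
from pathlib import Path


def save_artifacts(directory: Path, vocab: dict[str, int], weights: list[list[float]]) -> None:
    """Write vocabulary, weights, and metadata as separate JSON files."""
    directory.mkdir(parents=True, exist_ok=True)
    meta = {"vocab_size": len(vocab)}
    payloads = {"vocab.json": vocab, "weights.json": weights, "meta.json": meta}
    for name, payload in payloads.items():
        (directory / name).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_artifacts(directory: Path) -> tuple[dict[str, int], list[list[float]], dict[str, int]]:
    """Read the three files back and check that the metadata matches the vocabulary."""
    vocab = json.loads((directory / "vocab.json").read_text(encoding="utf-8"))
    weights = json.loads((directory / "weights.json").read_text(encoding="utf-8"))
    meta = json.loads((directory / "meta.json").read_text(encoding="utf-8"))
    if meta["vocab_size"] != len(vocab):
        raise ValueError("metadata does not match vocabulary size")
    return vocab, weights, meta


# Placeholder vocabulary and all-zero weights, for illustration only.
vocab = {"cat": 0, "the": 1}
weights = [[0.0, 0.0], [0.0, 0.0]]

with tempfile.TemporaryDirectory() as tmp:
    artifact_dir = Path(tmp) / "artifacts"
    save_artifacts(artifact_dir, vocab, weights)
    loaded_vocab, loaded_weights, loaded_meta = load_artifacts(artifact_dir)

print(loaded_meta["vocab_size"])  # prints 2
```

The consistency check on load is a cheap way for a downstream consumer to notice a mismatched or partially updated set of files before running inference.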
- ANNOTATIONS.md - REQ/WHY/OBS annotations used
- SE_MANIFEST.toml - project intent, scope, and role