
Toy-GPT: train-300-context-2


Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

  • Training repositories produce pretrained artifacts (vocabulary, weights, metadata).
  • A web app loads the artifacts and provides an interactive prompt.

Contents

  • a small, declared text corpus
  • a tokenizer and vocabulary builder
  • a simple next-token prediction model
  • a repeatable training loop
  • committed, inspectable artifacts for downstream use
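The pieces above (tokenizer, vocabulary, next-token predictor, training loop) can be sketched at toy scale in pure Python. This is an illustrative bigram counter, not the repository's actual model; the corpus and all names here are made up for the example.

```python
# Hypothetical sketch: whitespace tokenizer, vocabulary builder,
# and a bigram next-token predictor "trained" by counting which
# token follows which in a tiny corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran"
tokens = corpus.split()  # toy whitespace tokenizer

# vocabulary: token -> integer id, in first-seen order
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

# "training loop": count observed (current, next) transitions
counts = defaultdict(Counter)
for cur, nxt in zip(tokens, tokens[1:]):
    counts[cur][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent token seen after `token`."""
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat"
```

A real model replaces the transition counts with learned weights, but the produce-a-vocabulary, fit-on-a-declared-corpus, predict-the-next-token shape is the same.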

Scope

This is:

  • an intentionally inspectable training pipeline
  • a next-token predictor trained on an explicit corpus

This is not:

  • a production system
  • a full Transformer implementation
  • a chat interface
  • a claim of semantic understanding

Outputs

This repository produces and commits pretrained artifacts under artifacts/.

Training logs and evidence are written under outputs/ (for example, outputs/train_log.csv).
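A log like outputs/train_log.csv can be inspected with the standard-library csv module. The column names below (step, loss) are assumed for illustration, not the repository's actual schema.

```python
# Hypothetical sketch: reading a training log in the shape of
# outputs/train_log.csv. Column names are assumed.
import csv
import io

# stand-in for open("outputs/train_log.csv")
log = io.StringIO("step,loss\n0,2.31\n100,1.07\n200,0.64\n")

rows = list(csv.DictReader(log))
losses = [float(r["loss"]) for r in rows]
print(f"final loss after {rows[-1]['step']} steps: {losses[-1]}")
```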

Quick start

See SETUP.md for full setup and workflow instructions.

Run the full training script:

uv run python src/toy_gpt_train/d_train.py

Run individually:

  • a/b/c are demos (can be run alone if desired)
  • d_train produces artifacts
  • e_infer consumes artifacts

uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
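The produce/consume split between d_train and e_infer can be sketched as: training writes artifacts to disk, and inference loads them back without retraining. The file name and JSON format below are assumptions for illustration, not the repository's actual artifact layout.

```python
# Hypothetical sketch of the artifact handoff between a training
# script (producer) and an inference script (consumer).
import json
import os
import tempfile

artifacts = {"vocab": {"the": 0, "cat": 1}, "context": 2}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "artifacts.json")
    with open(path, "w") as f:   # d_train side: produce
        json.dump(artifacts, f)
    with open(path) as f:        # e_infer side: consume
        loaded = json.load(f)

print(loaded["vocab"]["cat"])  # -> 1
```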

Command Reference

The commands below are used in the workflow guide (SETUP.md) and are collected here for convenience; follow the guide for full instructions.


In a machine terminal (open in your Repos folder)

After you get a copy of this repo in your own GitHub account, open a machine terminal in your Repos folder:

# Replace username with YOUR GitHub username.
git clone https://github.com/username/train-300-context-2

cd train-300-context-2
code .

In a VS Code terminal

uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade

uvx pre-commit install
git add -A
uvx pre-commit run --all-files

# format, lint, and build docs

uv run ruff format .
uv run ruff check . --fix
uv run zensical build

git add -A
git commit -m "update"
git push -u origin main

Provenance and Purpose

The primary corpus used for training is declared in SE_MANIFEST.toml.

This repository commits pretrained artifacts so the client can run without retraining.

Annotations

ANNOTATIONS.md - the REQ/WHY/OBS annotations used in this repository

Citation

CITATION.cff

License

MIT

SE Manifest

SE_MANIFEST.toml - project intent, scope, and role