Toy-GPT: train-300-context-2

Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

Training repositories produce pretrained artifacts (vocabulary, weights, metadata).
A web app loads the artifacts and provides an interactive prompt.

Scope

This is:

an intentionally inspectable training pipeline
a next-token predictor trained on an explicit corpus

This is not:

a production system
a full Transformer implementation
a chat interface
a claim of semantic understanding
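
To make "next-token predictor trained on an explicit corpus" concrete, here is a minimal sketch. It is not the repository's model: it uses frequency counts instead of learned weights. The two-token context is an assumption read from the repository name. The function names and the sample corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_counts(tokens, context_size=2):
    """Count which token follows each run of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i : i + context_size])
        next_token = tokens[i + context_size]
        counts[context][next_token] += 1
    return counts


def predict_next(counts, context):
    """Return the most frequent follower of `context`, or None if unseen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, not the corpus declared in SE_MANIFEST.toml.
corpus = "the cat sat on the mat the cat sat on the rug".split()
counts = train_counts(corpus)
print(predict_next(counts, ["the", "cat"]))  # sat
print(predict_next(counts, ["cat", "sat"]))  # on
```

A trained model replaces the count table with parameters fitted by gradient descent, but the input/output contract is the same: a short context goes in, a guess at the next token comes out.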

Outputs

This repository produces and commits pretrained artifacts under artifacts/.

Training logs and evidence are written under outputs/ (for example, outputs/train_log.csv).
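
One way to inspect the training log after a run is sketched below. The column names in outputs/train_log.csv are not documented here, so the sketch only reports whatever header the file contains. The helper name `summarize_log` is hypothetical.

```python
import csv
from pathlib import Path


def summarize_log(path):
    """Return (column names, number of data rows) for a CSV log file."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), len(rows)


# Path taken from this README; the file exists only after training has run.
log_path = Path("outputs/train_log.csv")
if log_path.exists():
    columns, n_rows = summarize_log(log_path)
    print("columns:", columns)
    print("rows:", n_rows)
```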

Quick start

See SETUP.md for full setup and workflow instructions.

Run the full training script:

uv run python src/toy_gpt_train/d_train.py

Run individually:

a_tokenizer, b_vocab, and c_model are demos (each can be run on its own)
d_train produces the artifacts
e_infer consumes the artifacts

uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
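
As a rough illustration of what the tokenizer and vocabulary stages do conceptually, the sketch below splits text into tokens and assigns each distinct token an integer id. The whitespace tokenization, the lowercasing, and the function names are assumptions made for illustration. The actual scripts may work differently.

```python
def tokenize(text):
    """Split text into lowercase, whitespace-delimited tokens (assumed scheme)."""
    return text.lower().split()


def build_vocab(tokens):
    """Map each distinct token to an integer id, sorted for reproducibility."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}


tokens = tokenize("The cat sat on the mat")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(vocab)  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(ids)    # [4, 0, 3, 2, 4, 1]
```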

Command Reference

The commands below come from the workflow guide in SETUP.md and are collected here for convenience.

Follow the guide for the full instructions.

In a machine terminal (open in your `Repos` folder)

After you get a copy of this repo in your own GitHub account, open a machine terminal in your Repos folder:

# Replace username with YOUR GitHub username.
git clone https://github.com/username/train-300-context-2

cd train-300-context-2
code .

In a VS Code terminal

uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade

uvx pre-commit install
git add -A
uvx pre-commit run --all-files

# run Python

uv run ruff format .
uv run ruff check . --fix
uv run zensical build

git add -A
git commit -m "update"
git push -u origin main

Provenance and Purpose

The primary corpus used for training is declared in SE_MANIFEST.toml.

This repository commits pretrained artifacts so the client can run without retraining.

Annotations

ANNOTATIONS.md - REQ/WHY/OBS annotations used

Citation

CITATION.cff

License

MIT

SE Manifest

SE_MANIFEST.toml - project intent, scope, and role

