LaughTuned

Fine-tuning Mistral-7B-Instruct for comedy writing on Guardian news articles, using DPO and KTO preference optimization implemented from scratch in PyTorch.

The writeup (report/main.pdf) covers the full method, training dynamics, and a failure analysis of how KTO's bounded loss masks policy drift.

Key findings

Context length matters as much as the algorithm. Under identical hyperparameters, short-context fine-tuning lifted both DPO and KTO above the base model (62% cross-judge win rate), while long-context fine-tuning (today's article plus a model-generated historical backstory) left both worse than base (47%).
DPO and KTO swap places by context. DPO beats KTO head-to-head on short context (59–41); KTO beats DPO on long context (60–40). The choice of algorithm is not independent of the choice of context length.
KTO's validation loss can lie. Both KTO runs reached lower validation loss than either DPO run, yet the cross-judge ranked them no better. The policy-vs-reference log-ratio collapsed to ~−400 (short) and ~−600 (long) while the bounded loss never registered the drift. The fix: monitor the log-ratio, not just the loss.
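
To make the third finding concrete, here is a minimal pure-Python sketch of why a bounded loss can hide a collapsing log-ratio. It assumes the standard per-example KTO form, 1 − σ(β·(r − z0)) for desirable samples and 1 − σ(β·(z0 − r)) for undesirable ones, where r is the policy-vs-reference log-ratio. The names `kto_example_loss` and `drift_alert`, the values β = 0.1 and z0 = 0, and the alert floor of −50 are illustrative assumptions, not values taken from models/kto.py or config.py.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_example_loss(log_ratio: float, desirable: bool,
                     beta: float = 0.1, z0: float = 0.0) -> float:
    """Per-example KTO-style loss; always lies in [0, 1].

    beta and z0 are placeholder values (assumptions), not the repo's settings.
    """
    margin = beta * (log_ratio - z0)
    return 1.0 - sigmoid(margin if desirable else -margin)


def drift_alert(log_ratios: list[float], floor: float = -50.0) -> tuple[float, bool]:
    """Return the mean log-ratio and whether it has fallen below `floor`.

    `floor` is a hypothetical threshold chosen only for illustration.
    """
    mean = sum(log_ratios) / len(log_ratios)
    return mean, mean < floor


# The loss barely moves between a log-ratio of -200 and -600,
# while the log-ratio itself makes the drift obvious.
for r in (-200.0, -400.0, -600.0):
    print(r, kto_example_loss(r, desirable=False), kto_example_loss(r, desirable=True))
print(drift_alert([-380.0, -420.0]))
```

Once the sigmoid saturates, the undesirable-sample loss sits near 0 and the desirable-sample loss near 1 regardless of how much further the log-ratio falls, so logging the mean log-ratio next to the loss is the cheaper signal to watch.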

Repo layout

.
├── config.py                  # Hyperparameters and paths
├── data/
│   ├── fetch_articles.py      # Guardian API ingestion + historical backstories
│   ├── build_prompts.py       # Comedy-prompt construction (short/long variants)
│   ├── generate_pairs.py      # Base-Mistral paired sampling
│   ├── judge.py               # Claude rubric judging (DPO + KTO labels)
│   └── prepare_datasets.py    # Tokenization, splits, DataLoaders
├── models/
│   ├── load_model.py          # QLoRA model loading
│   ├── log_probs.py           # compute_log_probs primitive
│   ├── dpo.py                 # DPO loss (from scratch)
│   ├── kto.py                 # KTO loss (from scratch)
│   ├── ref_log_probs.py       # Cached reference log-probs
│   └── train.py               # Shared training loop
├── eval/
│   ├── generate_eval.py       # Held-out generation per variant
│   ├── judge_eval.py          # Rank-3 cross-judge
│   └── metrics.py             # Perplexity + BERTScore
├── utils/                     # Drive, logging, checkpointing
├── demo.ipynb                 # Colab-runnable pipeline (executable)
├── demo_colab_output.ipynb    # The above with outputs from the actual run
├── report/                    # CVPR-format writeup (LaTeX + PDF)
└── requirements.txt
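
For orientation on what models/log_probs.py and models/dpo.py feed into each other, the sketch below shows the standard DPO objective on summed sequence log-probabilities for a single preference pair. It is written in plain Python on scalars; the repo's version presumably operates on batched PyTorch tensors. The function name `dpo_loss`, its argument names, and β = 0.1 are assumptions for illustration, not the repo's actual signature or hyperparameters.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair of sequence log-probs.

    Each argument is the summed token log-prob of a completion under the
    policy or the frozen reference model. beta is a placeholder value.
    """
    chosen_log_ratio = policy_chosen - ref_chosen
    rejected_log_ratio = policy_rejected - ref_rejected
    return neg_log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy prefers the chosen completion more than the reference does: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Unlike the KTO form, this loss is unbounded above, so raw validation-loss values from the two algorithms are not directly comparable.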

Running on Colab

1. Open demo.ipynb in Colab. (A100 strongly recommended; end-to-end is ~3 hours on A100, longer on T4.)
2. The first cell clones this repo into /content/laughtuned, installs requirements.txt, mounts Google Drive, and creates the artifact tree under /content/drive/MyDrive/LaughTuned/.
3. Add three Colab secrets (left sidebar → 🔑 Secrets): GUARDIAN_API_KEY, GUARDIAN_API_KEY_2 (optional second key for a higher Guardian rate limit), and ANTHROPIC_API_KEY.
4. Run cells top-to-bottom. Every long-running stage (Guardian ingestion, paired generation, judging, ref-logprob caching, and the four training runs) is idempotent and resume-friendly: re-running a cell after an interruption picks up from disk.
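
The resume behavior described above can be sketched as an append-only JSONL log keyed by item ID: on re-run, already-recorded IDs are skipped. This is an illustration of the general pattern only. The helper name `run_resumable`, the `"id"` and `"result"` fields, and the file layout are assumptions, not the repo's actual implementation, and the sketch assumes every record was written as a complete line.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_resumable(items: Iterable[dict], out_path: str,
                  process: Callable[[dict], object]) -> int:
    """Process items, appending one JSON line per finished item.

    Items whose "id" already appears in out_path are skipped, so calling
    this again after an interruption only does the remaining work.
    Returns the number of items processed in this call.
    """
    path = Path(out_path)
    done = set()
    if path.exists():
        with path.open() as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["id"])

    processed = 0
    with path.open("a") as f:
        for item in items:
            if item["id"] in done:
                continue
            record = {"id": item["id"], "result": process(item)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # persist each record before starting the next item
            processed += 1
    return processed
```

Writing and flushing one record at a time means an interruption loses at most the item in flight, which matters most for the stages that pay per API call.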

Cost estimate for a full run from scratch:

| Stage | Time on A100 | API cost |
| --- | --- | --- |
| Article ingestion (502 articles + backstories) | ~30 min | ~$3 |
| Pair generation (470 prompts × 2 contexts × 2 samples) | ~45 min | none |
| Rubric judging (940 pairs) | ~30 min | ~$9 |
| Training (4 variants × ~500 steps) | ~2 h | none |
| Eval generation + cross-judge | ~15 min | ~$2 |

Where the outputs live

Code is in git. Data, checkpoints, metrics, and figures are not in git (too large, often binary) — they live on Google Drive under drive_root:

/content/drive/MyDrive/LaughTuned/
├── data/{articles,prompts,generations,preferences,splits,eval}/
├── ref_log_probs/                 # Cached reference model log-probs
├── checkpoints/<experiment_name>/ # LoRA adapter snapshots
├── metrics/<experiment_name>/     # JSONL + TensorBoard event files
└── figures/                       # PNG + PDF outputs for the report

Cells in the demo notebook print the exact paths they write to. The training cell streams losses + diagnostics to TensorBoard live (the launcher cell above it sets up %tensorboard).
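
As a rough guide to how the tree above maps to paths for one run, the sketch below builds the directories with pathlib. The directory names are taken from the listing; the helper names `artifact_dirs` and `ensure_dirs` and the experiment name in the usage line are hypothetical, and config.py may organize this differently.

```python
from pathlib import Path

# Subdirectories of data/ as shown in the listing above.
DATA_SUBDIRS = ("articles", "prompts", "generations", "preferences", "splits", "eval")


def artifact_dirs(drive_root: str, experiment_name: str) -> dict[str, Path]:
    """Map logical artifact names to directories under drive_root."""
    root = Path(drive_root)
    dirs = {f"data/{name}": root / "data" / name for name in DATA_SUBDIRS}
    dirs["ref_log_probs"] = root / "ref_log_probs"
    dirs["checkpoints"] = root / "checkpoints" / experiment_name
    dirs["metrics"] = root / "metrics" / experiment_name
    dirs["figures"] = root / "figures"
    return dirs


def ensure_dirs(dirs: dict[str, Path]) -> None:
    """Create every directory; safe to call repeatedly."""
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)


# "example_run" is a placeholder experiment name, not one used in the repo.
dirs = artifact_dirs("/content/drive/MyDrive/LaughTuned", "example_run")
print(dirs["checkpoints"])
```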

Re-creating report figures

After a training run, the figures used in the writeup can be rebuilt locally with:

python report/figures/build_training_curves.py
python report/figures/build_winrates_figure.py
python report/figures/build_agreement_figure.py
python report/figures/build_kto_logratio_figure.py
python report/figures/build_pipeline_figure.py

Each script reads from the appropriate JSONL under <drive_root>/metrics/ or <drive_root>/data/eval/ and writes PNG + PDF to report/figures/. Paths are picked up from config.py.

Type checking

pyright

Run from the repo root. Configured in pyrightconfig.json.
