Status: canonical
PSION-12/#368pretrain-stage contract, written 2026-03-22 after landing the first Psion compact decoder family.
This document freezes the first explicit pretrain stage for the Psion
learned-model lane.
It keeps pretraining as a named stage with its own receipts instead of overloading SFT or relying on notebook-local state.
crates/psionic-train/src/psion_pretrain_stage.rsowns the typed pretrain stage config and run receipt.crates/psionic-train/examples/psion_pretrain_stage_fixtures.rsregenerates the canonical config and receipt fixtures.fixtures/psion/pretrain/psion_pretrain_stage_config_v1.jsonis the canonical declared pretrain-stage config fixture.fixtures/psion/pretrain/psion_pretrain_stage_receipt_v1.jsonis the canonical pretrain-stage receipt fixture.
The stable schema versions are:
psion.pretrain_stage_config.v1psion.pretrain_stage_receipt.v1
The pretrain stage now binds:
- one explicit
pretrainstage kind in the stage program - one next-token objective config tied to model, tokenizer, and dataset identity
- source-family-aware reporting across train, validation, and held-out splits
- replay receipts tied back to the tokenized corpus replay contract
- checkpoint-lineage receipts tied back to the promoted stage checkpoint and model descriptor
That gives the first Psion pretrain lane one repo-owned stage surface instead of smuggling broad-model training through existing SFT types.
psionic-train now validates that:
- the declared stage is really
pretrain - objective config uses the model descriptor's tokenizer binding and context length
- sampling policy, dataset identity, and model descriptor all stay aligned
- source-family reporting covers every split-family pair represented by the tokenized corpus
- replay receipts preserve the tokenized corpus replay contract
- checkpoint lineage stays tied to one promoted checkpoint family and one model descriptor digest
- the generic stage program will not complete a
pretrainstage until at least one validated pretrain receipt has been recorded