Feature and its Use Cases
Problem
Currently, the tokenizer training pipeline introduced in PR #17 trains the tokenizer and saves its configuration, but it does not tokenize the actual Wikipedia dataset.
This leaves a verification gap between the verified preprocessing pipeline and the model training stage.
Because the tokenized dataset is not verified, the following risks exist:
- The dataset could be tokenized using a different tokenizer than claimed
- Tokenized outputs could be modified before training
- There is no cryptographic linkage between preprocessing and training
Proposed Solution
Implement a deterministic dataset tokenization pipeline that:
- Uses the trained tokenizer to tokenize the cleaned Wikipedia dataset
- Saves the tokenized dataset as a deterministic artifact
- Computes a Merkle root over tokenized chunks
- Adds the tokenized dataset hash to the verification manifest
This ensures the tokenized dataset is cryptographically tied to the tokenizer configuration and preprocessing outputs.
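
A minimal sketch of the steps above, to make the proposal concrete. Everything named here is an assumption for illustration: `toy_encode` is a hypothetical stand-in for the trained tokenizer's encode call, the chunk size is arbitrary, and the Merkle construction (SHA-256, domain-separated leaves and nodes, odd nodes promoted unchanged) is one reasonable choice rather than the scheme the existing pipeline uses.

```python
import hashlib
import struct
from typing import Callable, Iterable, List


def toy_encode(text: str) -> List[int]:
    # Hypothetical stand-in for the trained tokenizer: one id per character.
    return [ord(ch) for ch in text]


def serialize_chunk(token_ids: List[int]) -> bytes:
    # Fixed-width little-endian uint32 keeps the bytes platform independent.
    return struct.pack(f"<{len(token_ids)}I", *token_ids)


def tokenize_dataset(
    lines: Iterable[str],
    encode: Callable[[str], List[int]],
    chunk_size: int = 4,
) -> List[bytes]:
    # Tokenize in input order and cut the id stream into fixed-size chunks,
    # so the same input always yields the same sequence of chunk bytes.
    chunks: List[bytes] = []
    buffer: List[int] = []
    for line in lines:
        buffer.extend(encode(line))
        while len(buffer) >= chunk_size:
            chunks.append(serialize_chunk(buffer[:chunk_size]))
            buffer = buffer[chunk_size:]
    if buffer:
        chunks.append(serialize_chunk(buffer))  # final partial chunk
    return chunks


def merkle_root(leaves: List[bytes]) -> str:
    # Leaves and inner nodes get different prefixes so a leaf can never be
    # mistaken for an inner node.
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
    return level[0].hex()


if __name__ == "__main__":
    chunks = tokenize_dataset(["hello world", "merkle"], toy_encode)
    print(merkle_root(chunks))
```

The resulting hex digest is the value that would go into the verification manifest. Determinism here depends on a fixed input order, a fixed chunk size, and a fixed byte encoding, so those would need to be recorded alongside the root.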
Builds On
Outcome
This extends the verification pipeline with a verifiable tokenized dataset layer, creating the following chain:
Raw Dataset → Processed Dataset → Tokenizer Config → Tokenized Dataset → Model Training
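
One way the chain above could be recorded so that each stage commits to the ones before it is sketched below. The manifest layout, the stage names, and the `build_manifest` / `verify_manifest` helpers are all hypothetical and not taken from the existing verification manifest. Model training is left out because it consumes the chain rather than contributing an artifact hash in this sketch.

```python
import hashlib
import json
from typing import Dict, List

# Hypothetical stage names mirroring the chain in this issue.
STAGES = ["raw_dataset", "processed_dataset", "tokenizer_config", "tokenized_dataset"]


def build_manifest(artifact_hashes: Dict[str, str]) -> List[dict]:
    # Each entry stores its artifact hash plus the link of the previous entry,
    # so a change in any earlier stage changes every later link.
    entries: List[dict] = []
    parent_link = ""
    for stage in STAGES:
        entry = {
            "stage": stage,
            "artifact_hash": artifact_hashes[stage],
            "parent_link": parent_link,
        }
        # sort_keys gives a stable serialization for hashing.
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        parent_link = hashlib.sha256(payload).hexdigest()
        entries.append(dict(entry, link=parent_link))
    return entries


def verify_manifest(entries: List[dict], artifact_hashes: Dict[str, str]) -> bool:
    # Recompute the chain from freshly computed artifact hashes and compare.
    return entries == build_manifest(artifact_hashes)


if __name__ == "__main__":
    # Placeholder digests for illustration only.
    hashes = {stage: hashlib.sha256(stage.encode()).hexdigest() for stage in STAGES}
    manifest = build_manifest(hashes)
    print(verify_manifest(manifest, hashes))
```

With this layout, a training job would only need to check the final `link` against a trusted value to detect a swapped tokenizer or a modified tokenized dataset.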
Happy to implement this if the approach looks good.