[FEATURE]: Implement Deterministic Dataset Encoding Pipeline and Verifiable LLaMA Model (Model Architecture) #55

@Shubhamx404

Description

Implement a deterministic dataset encoding pipeline and a minimal implementation of a LLaMA architecture.

  • The dataset has already been processed into wiki_clean.txt and a tokenizer has been trained. We need a memory-efficient script that encodes the entire dataset into a binary format for training.

  • Implement a minimal PyTorch LLaMA-style architecture to ensure deterministic behavior and full control over initialization.

  • Read the dataset in chunks to avoid high memory usage.

  • Use the trained tokenizer (BPE/SentencePiece) to convert text into token IDs.

  • Stream token IDs into a binary dataset file (.bin, uint16 or similar).

  • Compute a SHA256 hash of the resulting file.
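The encoding steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `encode_to_bin` and its `encode` parameter are hypothetical names, the "chunk" is taken to be one line of text, and token IDs are written as little-endian uint16 so the output bytes do not depend on the host machine. The hash is updated while streaming, so the output file is never read back into memory.

```python
import hashlib
import struct
from typing import Callable, List


def encode_to_bin(
    text_path: str,
    bin_path: str,
    encode: Callable[[str], List[int]],  # hypothetical: any text -> token-ID function
) -> str:
    """Stream text_path into a uint16 token file and return its SHA-256 hex digest."""
    sha = hashlib.sha256()
    # newline="" disables newline translation, so the tokenizer sees the
    # same characters on every platform.
    with open(text_path, "r", encoding="utf-8", newline="") as src, \
            open(bin_path, "wb") as dst:
        for line in src:  # one line at a time keeps memory usage bounded
            ids = encode(line)
            # "<H" is little-endian uint16; struct.pack raises struct.error
            # if any ID falls outside 0..65535.
            buf = struct.pack(f"<{len(ids)}H", *ids)
            dst.write(buf)
            sha.update(buf)
    return sha.hexdigest()
```

Any deterministic callable that maps a string to a list of integers can be passed as `encode`, for example the `encode` method of a loaded SentencePiece processor. Two caveats follow from the assumptions: uint16 only fits vocabularies of at most 65,536 tokens, and tokenizing line by line may give different IDs than tokenizing the whole file at once if the tokenizer merges across line breaks.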

Verification Criteria

  • Running the dataset encoding pipeline twice should produce identical binary files and SHA256 hashes.

  • Initializing the model twice with the same seed must produce exactly the same initial parameter hashes.
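The model-side check could look something like the sketch below. It is illustrative only: `RMSNorm`, `TinyBlock`, `build`, and `param_hash` are hypothetical names, and `TinyBlock` is a small stand-in (RMSNorm plus a SwiGLU feed-forward) rather than a full LLaMA model. It assumes CPU initialization with default float32 parameters and that NumPy is installed alongside PyTorch.

```python
import hashlib

import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer norm, as used in LLaMA-style models."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class TinyBlock(nn.Module):
    """Hypothetical stand-in: pre-norm SwiGLU feed-forward with a residual."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.w2(F.silu(self.w1(h)) * self.w3(h))


def build(seed: int) -> nn.Module:
    # Seed the global RNG immediately before construction so that every
    # weight draw happens in the same order from the same state.
    torch.manual_seed(seed)
    return TinyBlock(dim=16, hidden=32)


def param_hash(model: nn.Module) -> str:
    """SHA-256 over state_dict entries (names and raw bytes) in sorted name order."""
    sha = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        sha.update(name.encode("utf-8"))
        sha.update(tensor.detach().cpu().contiguous().numpy().tobytes())
    return sha.hexdigest()
```

With this in place, the criterion reduces to checking that `param_hash(build(seed))` returns the same string on two separate calls with the same seed. Hashing in sorted name order keeps the digest independent of the order in which modules were registered.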

Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates
