This repository is a "tidy" version of the Super Tiny Language Models (STLM) repository, developed during my research in the Singapore International Pre-Graduate Award (SIPGA) program.
For a comprehensive understanding of STLM foundations and related papers, please visit the original repository.
The main goal of this fork is to leverage the STLM architecture to study fact injection: how new, never-seen-before facts can be introduced into a Language Model's knowledge base, and how the model behaves when prompted with this new information.
Main differences:
- Specialized Focus: Tailored specifically for injection-based research.
- Refactored Codebase: Removal of unused files and refactoring for clarity.
- Better Tooling: Enhanced integration with Weights & Biases (wandb) and improved runtime logging.
- Model Evaluation: Enhanced scripts for testing and comparing different STLMs and open-weights models from Hugging Face.
- Documentation: Updated instructions for reproducing injection experiments.
Note: This repo is fully backward compatible. You can disable injection-specific features via config files to perform standard STLM training.
Requires: Python 3.11+ and uv.
# Clone the repository
git clone https://github.com/GustavoZiel/Tidy-STLMs.git
cd Tidy-STLMs
# Synchronize (Creates virtual env and installs dependencies)
uv sync
# Activate the virtual environment
source .venv/bin/activate

Configuration files are located in the `configs/` directory. You can modify these YAML files to set hyperparameters, dataset paths, injection settings, and other options.
Every configuration file follows a modular structure, allowing for granular control over the experiment.
The main sections are:
- `model`: Defines the model architecture. This includes the number of layers, attention heads, hidden dimensions, and specific mechanisms such as the feed-forward network type (e.g., `swiglu`), normalization (e.g., `rms_norm`), and positional encodings (`rope`).
- `trainer`: Controls the training loop dynamics. Key parameters include `batch_size`, `max_epochs` or `max_iters`, `gradient_accumulation_steps`, and various intervals for logging, evaluation, and checkpointing.
- `inject`: Configures the injection research experiments. This block toggles `perform_injection`, defines the `inject_strategy` (e.g., `uniform`), and points to the source files for the injected facts (`inject_data`) and their corresponding validation prompts.
- `prompt`: A list of static input/output pairs used for runtime generation checks. The trainer periodically runs these prompts to visually monitor the model's ability to recall specific facts or complete sentences during training.
- `optimizer` & `lr_scheduler`: Specify the optimization algorithm and the learning rate schedule (e.g., cosine decay), including warmup periods and weight decay settings.
- `general`: Handles system-level settings such as the compute device (`cuda` or `cpu`), directory paths for outputs, and integration with Weights & Biases (wandb) for experiment tracking.
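Putting the sections together, a full configuration has roughly the following shape. All concrete values and several key names below are illustrative guesses, not copied from the repo; consult `configs/full_configs/simple_en_wiki.yaml` for the authoritative layout:

```yaml
# Illustrative sketch of a full experiment config (values are placeholders).
model:
  n_layers: 6
  n_heads: 8
  hidden_dim: 512
  ffn_type: swiglu
  norm: rms_norm
  pos_encoding: rope

trainer:
  batch_size: 32
  max_epochs: 1
  gradient_accumulation_steps: 4

inject:
  perform_injection: true
  inject_strategy: uniform
  inject_data: test/injected_data.txt

prompt:
  - input: "The capital of France is"
    output: "Paris"

optimizer:
  weight_decay: 0.1

lr_scheduler:
  type: cosine
  warmup_steps: 100

general:
  device: cuda
```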
The default configuration is set up as a minimal sanity check. It runs for only 10 steps to ensure the environment is correctly configured before launching a full training run.
uv run src/train.py

This starts a run using the config specified in the `configs/train.yaml` file, which in turn references the `configs/full_configs/simple_en_wiki.yaml` configuration. You can change the base config file in `configs/train.yaml` to switch training configs.
What to expect:
- Initialization: You will see initialization logs in the console.
- Preparing Data: The Simple English Wikipedia dataset will be downloaded and preprocessed.
- Training Loop: A progress bar will appear. Ensure the loss value is decreasing.
- Duration: This run should take no longer than 10 minutes.
Next Steps:
To run a real experiment, edit the parameters in `configs/full_configs/simple_en_wiki.yaml`, or change the base config path in `configs/train.yaml` to point at another full configuration. You will likely want to increase `max_iters` and `max_epochs`, and tune the `batch_size` or learning rate to fit your hardware and goals.
To train with the GPT-2 Small architecture (124M parameters) on the WikiText-103 dataset, simply change the base configuration file in `configs/train.yaml` to point to `configs/full_configs/gpt2_small.yaml`.
No other modifications are needed unless you want custom hyperparameters.
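For instance, the edit to `configs/train.yaml` might look like the following (the key name here is illustrative; keep whatever key the file already uses):

```yaml
# configs/train.yaml
# Point the base config at the GPT-2 Small setup (key name illustrative).
full_config: full_configs/gpt2_small.yaml
```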
To resume training from a saved checkpoint, include the path to the checkpoint file at the top of your model configuration file; see `configs/full_configs/simple_en_wiki_resume_checkpoint.yaml` for an example.
Once the path is added, running the training command will automatically continue from the latest saved state.
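As a sketch, the top of a resumable config might look like this (the key name and path are assumptions on my part; `configs/full_configs/simple_en_wiki_resume_checkpoint.yaml` shows the actual format):

```yaml
# Resume from a previously saved checkpoint (key name and path illustrative).
checkpoint_path: outputs/checkpoints/ckpt_latest.pt
```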
To run fact-injection experiments, enable the `perform_injection` flag in the trainer settings. This activates the components responsible for:
- Data Mixing: Combining your primary training dataset with a stream of injected facts.
- Evaluation: Periodically probing the model with targeted prompts to measure how well the injected facts were learned and retained.
You can fully customize the injection process, including frequency, mixing strategy, and data source, by adjusting the corresponding fields in the config file.
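As an illustration, a uniform mixing strategy can be sketched as inserting each injected fact at a uniformly random position in the training stream. This is a simplified sketch under my own assumptions, not the repository's actual implementation, and `mix_uniform` is a hypothetical helper name:

```python
import random

def mix_uniform(train_examples, injected_facts, seed=0):
    """Insert each injected fact at a uniformly random position.

    Hypothetical sketch of a 'uniform' injection strategy; the real
    trainer's mixing logic may differ.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    mixed = list(train_examples)
    for fact in injected_facts:
        # randrange(len + 1) allows insertion at either end as well.
        mixed.insert(rng.randrange(len(mixed) + 1), fact)
    return mixed
```

Because the facts are placed independently and uniformly, they end up spread across the stream on average rather than clustered at the start or end, while the relative order of the original training examples is preserved.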
A reference configuration is available in `configs/full_configs/inject_data_simple_en_wiki.yaml`.
The repository includes utilities for generating fake data for injection experiments. You can find these utilities in the `scripts/` directory.
To generate fake data for injection, you can use the `generate_injections.py` script. This script reads a configuration file (e.g., `inject_config.json`) that defines the templates and test cases for the injected facts, and generates the data files needed for training.
Example Usage:
uv run scripts/generate_injections.py --save-path data/inject/test/ --inject-config test/inject_config.json --num_injections 1 --seed 1 --no-shuffle

Arguments:
- `--save-path`: Directory to save the generated files.
- `--inject-config`: Path to the JSON config file containing the injection templates (relative to `data/inject/`).
- `--num_injections`: Number of injections to generate for each fact template.
- `--seed`: Random seed for reproducibility.
- `--no-shuffle`: Disable shuffling of the generated data.
The configuration file (e.g., data/inject/test/inject_config.json) should contain the templates for the facts you want to inject, along with the corresponding test cases (prompts and completions) to evaluate the model's knowledge of these facts.
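Conceptually, the file pairs fact templates with evaluation prompts and completions. The sketch below is purely illustrative; the field names are assumptions, and the actual schema is defined by `generate_injections.py`:

```json
{
  "templates": [
    "{person} was born in the city of {city}."
  ],
  "test_cases": [
    {
      "prompt": "{person} was born in the city of",
      "completion": " {city}"
    }
  ]
}
```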
Important: After generating the data, remember to update your training configuration file (e.g., `configs/full_configs/inject_data_simple_en_wiki.yaml`) to point to the newly generated files in the `inject` section (e.g., `inject_data: test/injected_data.txt`).