The Language Model ToolKit
A Modular Toolkit for Efficient Language Model Pretraining and Adaptation
Streamlining continual pretraining of foundation language models through scalable pipelines and reproducible configurations
```bash
# Install with Poetry
make install-poetry
# Or just use the default install target (defaults to Poetry)
make install
```

```bash
# Install with pip in a virtual environment
make install-pip
# Or specify pip as the installation method
make INSTALL_METHOD=pip install
```

```bash
# Create and activate a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install -e .          # Regular installation
pip install -e ".[dev]"   # With development dependencies
```

```bash
# Build development image (with Poetry)
make build

# Run development container
make container
```

The framework uses Poetry to manage semantic versioning, ensuring consistent version numbers across all components and systematic tracking of changes.
You can view the current version of the framework using:
```bash
# Show the current version
make version-show

# Display detailed version info including dependencies
python -m src.main --version
```

When making changes to the codebase, follow this workflow to manage versions properly:
- Make your changes to the codebase.
- Decide on the version increment based on semantic versioning principles:
  - `major`: Breaking changes to the API
  - `minor`: New features, backward compatible
  - `patch`: Bug fixes, backward compatible
- Bump the version:

```bash
# For a patch update (e.g., 0.1.0 → 0.1.1)
make version-bump

# For a minor update (e.g., 0.1.0 → 0.2.0)
make version-bump VERSION_BUMP=minor

# For a major update (e.g., 0.1.0 → 1.0.0)
make version-bump VERSION_BUMP=major
```

- Update the changelog to document your changes:

```bash
make changelog
```

When editing the changelog, add your changes under the [Unreleased] section using the appropriate categories (Added, Changed, Fixed, etc.).
- Finalize the release:

```bash
make version-release
```

This command will:

- Update the changelog format
- Create a git tag for the version
- Commit the changes
- Push to the repository:

```bash
git push && git push --tags
```

The framework provides utilities for checking version information in your code:
```python
from src.utils.version import get_version, display_version_info

# Get current version string
version = get_version()

# Display detailed version information
display_version_info()
```

The project follows Semantic Versioning:
- MAJOR version (0.x.x → 1.0.0): Incompatible API changes
- MINOR version (0.1.x → 0.2.0): New functionality in a backward-compatible manner
- PATCH version (0.1.0 → 0.1.1): Backward-compatible bug fixes
During the development phase (before 1.0.0), minor version bumps may include breaking changes.
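The bump rules above can be sketched as a small helper (illustrative only, not part of the toolkit's API; the real bump is performed by Poetry via `make version-bump`):

```python
def bump_version(version: str, part: str = "patch") -> str:
    """Increment a MAJOR.MINOR.PATCH version string per SemVer."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"      # reset minor and patch
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # reset patch
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {part}")

print(bump_version("0.1.0"))           # 0.1.1
print(bump_version("0.1.0", "minor"))  # 0.2.0
print(bump_version("0.1.0", "major"))  # 1.0.0
```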
- Efficiently preprocess and tokenize large text corpora using YAML-driven configs.
- Supports fast and slow tokenizers, parallelism, and memory-optimized workflows.
- See `TOKENIZATION_INDEX.md` and `TOKENIZATION_PERFORMANCE.md`.
- Further train language models on new data using scalable, resumable pipelines.
- Distributed training, curriculum support, and robust checkpointing.
- See `CLM_TRAINING.md`.
- Upload trained models, tokenizers, or datasets to the Hugging Face Hub.
- YAML-driven, supports safe serialization and authentication best practices.
- See `PUBLISH.md`.
- Configuration System - Type-safe YAML schemas with Pydantic validation
- Task Orchestration - Modular task execution via CLI/config mapping
- Environment Management - Dockerized training stack with Makefile control
- Resumable Workflows - Atomic checkpoints with full state serialization
- Distributed Strategies - FSDP/DeepSpeed integration profiles
- Curriculum Learning - Phase-wise data mixing (`config/curricula`)
- Multi-format Corpus - Unified processor for JSONL/Parquet/TXT
- Streaming Tokenization - Memory-efficient HF datasets integration
- Data Health Checks - Statistical validation pre-training
- Training Telemetry - Gradient/activation histograms in `utils/monitoring.py`
- Early Warning System - NaN/overflow detection
- Optimizer Diagnostics - Learning rate/parameter scale tracking
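To illustrate the streaming idea behind the memory-efficient tokenization above, here is a minimal generator-based sketch. The actual pipeline uses Hugging Face datasets and real tokenizers; the whitespace split below is a stand-in:

```python
from typing import Iterable, Iterator, List

def stream_tokenize(lines: Iterable[str], chunk_size: int = 8) -> Iterator[List[str]]:
    """Tokenize a text stream lazily, yielding fixed-size token chunks
    without ever materializing the whole corpus in memory."""
    buffer: List[str] = []
    for line in lines:
        buffer.extend(line.split())  # stand-in for a real tokenizer
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:  # flush the final partial chunk
        yield buffer

corpus = ["the quick brown fox", "jumps over the lazy dog"]
chunks = list(stream_tokenize(corpus, chunk_size=4))
print(chunks[0])  # ['the', 'quick', 'brown', 'fox']
```

Because `stream_tokenize` is a generator, it can be fed directly from a file handle or a streaming dataset without loading everything upfront.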
All major tasks are YAML-driven. See the /docs folder for detailed per-task guides and example configs.
```bash
make build
make validate CONFIG=config/pretraining.yaml
python src/main.py --config tutorials/configs/tokenization_tutorial.yaml
python src/main.py --config tutorials/configs/clm_training_tutorial.yaml
python src/main.py --config tutorials/configs/publish_tutorial.yaml
```
Note: For publish tasks, authenticate with Hugging Face via `huggingface-cli login` or set the `HUGGINGFACE_HUB_TOKEN` environment variable.
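Since every task is YAML-driven, a minimal tokenization config might look like the following. The field names mirror the `TokenizationConfig` schema shown later in this README, but the values are illustrative; consult `tutorials/configs/` for authoritative examples:

```yaml
# Hypothetical example; see tutorials/configs/ for real configs
task_name: tokenize
output_dir: outputs/tokenization
dataset_path: data/corpus.jsonl
tokenizer_name: gpt2
chunk_size: 2048
validation_ratio: 0.1
```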
For running on SLURM clusters, we provide a comprehensive set of configurable job scripts in the slurm/ directory:
```bash
# Quick submission with defaults
cd slurm
sbatch p1-dgx.slurm

# Use the helper script for easy configuration
./submit_job.sh -c config/experiments/test_continual.yaml -k your_wandb_key -g 4

# Custom resource allocation
./submit_job.sh -g 8 -m 400G -t 72:00:00 -o large_experiment
```

Key Features:
- Fully Configurable - All paths, resources, and settings via environment variables
- Docker Integration - Containerized execution with automatic user mapping
- WandB Integration - Automatic experiment tracking and logging
- Debug Support - Comprehensive error reporting and troubleshooting
- Organized Structure - Clean separation of job scripts and configurations
See `slurm/README.md` for detailed documentation and examples.
Structured YAML Schemas:
```python
# src/config/base.py
from pathlib import Path

from pydantic import BaseModel, Field


class TaskConfig(BaseModel):
    task_name: str = Field(..., description="Name of task to execute")
    output_dir: Path = Field(..., description="Base output directory")


class TokenizationConfig(TaskConfig):
    dataset_path: str
    tokenizer_name: str
    chunk_size: int = 2048
    validation_ratio: float = 0.1
```
Validation flow:
CLI Command → Load YAML → Validate Schema → Build Config Object → Execute Task
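That flow can be sketched end to end with nothing but the standard library. This is a simplified stand-in for the real Pydantic-backed loader: validation is a manual field check, and the parsed YAML is replaced by a plain dict to keep the sketch dependency-free:

```python
from dataclasses import dataclass

@dataclass
class TaskConfig:
    task_name: str
    output_dir: str

def load_config(raw: dict) -> TaskConfig:
    """Validate a parsed YAML mapping and build a typed config object."""
    for field in ("task_name", "output_dir"):
        if field not in raw:
            raise ValueError(f"missing required field: {field}")
    return TaskConfig(task_name=raw["task_name"], output_dir=raw["output_dir"])

def execute(config: TaskConfig) -> str:
    """Dispatch to the task named in the config (tasks stubbed out here)."""
    registry = {"tokenize": lambda c: f"tokenizing into {c.output_dir}"}
    return registry[config.task_name](config)

cfg = load_config({"task_name": "tokenize", "output_dir": "outputs/tok"})
print(execute(cfg))  # tokenizing into outputs/tok
```

The real framework gets the same guarantees from Pydantic: a config object either validates fully or the task never starts.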
```
project/
├── config/                # YAML configuration templates
│   ├── curricula/         # Data mixing schedules
│   ├── tokenization.yaml  # Tokenizer params
│   └── pretraining.yaml   # Model/training params
├── docker/                # Container definitions
│   └── Dockerfile         # CUDA+PyTorch base image
├── Makefile               # Project orchestration
├── requirements/          # Pinned dependencies
└── src/
    ├── config/            # Pydantic schema definitions
    ├── tasks/             # Task implementations
    └── utils/             # Monitoring/checkpointing
```
Dockerfile Highlights:
```dockerfile
FROM nvcr.io/nvidia/pytorch:23.10-py3
COPY requirements.txt .
RUN pip install -r requirements.txt
ENTRYPOINT ["make"]
```
Makefile Targets:
```makefile
validate:  # Config schema check
	python -m src.main --validate $(CONFIG)

tokenize:  # Process datasets
	python -m src.main --task tokenize $(CONFIG)

train:     # Launch training
	torchrun --nproc_per_node=$(GPUS) src/main.py --task train $(CONFIG)
```
- Configuration Hashing - MD5 checksum of all config files
- Environment Snapshot - `pip freeze` in training logs
- Deterministic Seeds - Full random state preservation
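Configuration hashing of this kind can be sketched with the standard library alone. This is a simplified stand-in for the toolkit's implementation, not the actual code:

```python
import hashlib
import tempfile
from pathlib import Path

def hash_configs(paths) -> str:
    """MD5 checksum over a set of config files; paths are sorted so
    the order they are listed in does not change the hash."""
    digest = hashlib.md5()
    for path in sorted(str(p) for p in paths):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()

# Demo on throwaway files: the same files always hash the same,
# regardless of listing order.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "a.yaml"); a.write_text("task_name: tokenize\n")
    b = Path(tmp, "b.yaml"); b.write_text("chunk_size: 2048\n")
    assert hash_configs([a, b]) == hash_configs([b, a])
    print(hash_configs([a, b]))
```

Storing this hash alongside the `pip freeze` output in the training log pins down exactly which configuration and environment produced a run.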
The project includes comprehensive documentation built with Sphinx, using a custom theme.
We provide a convenient script to build the documentation:
```bash
# Build documentation
./build_docs.sh

# Build and open in browser
./build_docs.sh --open

# More options
./build_docs.sh --help
```

Alternatively, you can use tox:

```bash
tox -e docs
```

For more details on documentation features, customization, and contributing guidelines, please refer to the Documentation README.
- Implement new config schemas in `src/config/`
- Add corresponding task modules in `src/tasks/`
- Include validation tests:

```python
import pytest
from pydantic import ValidationError

from src.config.base import TokenizationConfig

def test_tokenization_config():
    # Missing required fields must fail validation
    with pytest.raises(ValidationError):
        TokenizationConfig(tokenizer_name="invalid/model")
```