Skip to content

gplsi/LMTK

Repository files navigation

LMTK

The Language Model ToolKit

A Modular Toolkit for Efficient Language Model Pretraining and Adaptation
Streamlining continual pretraining of foundation language models through scalable pipelines and reproducible configurations

πŸš€ Installation

Option 1: Using Poetry (Recommended)

# Install with Poetry
make install-poetry

# Or just use the default install target (defaults to Poetry)
make install

Option 2: Using pip

# Install with pip in a virtual environment
make install-pip

# Or specify pip as the installation method
make INSTALL_METHOD=pip install

Option 3: Direct pip install

# Create and activate a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install -e .  # Regular installation
pip install -e .[dev]  # With development dependencies

Docker Installation

# Build development image (with Poetry)
make build

# Run development container
make container

πŸ“‹ Version Control System

The framework implements a comprehensive version control system using Poetry to manage semantic versioning. This ensures consistent versioning across all components and helps track changes systematically.

Version Information

You can view the current version of the framework using:

# Show the current version
make version-show

# Display detailed version info including dependencies
python -m src.main --version

Version Management Workflow

When making changes to the codebase, follow this workflow to manage versions properly:

  1. Make your changes to the codebase

  2. Decide on the version increment based on semantic versioning principles:

    • major: Breaking changes to the API
    • minor: New features, backward compatible
    • patch: Bug fixes, backward compatible
  3. Bump the version:

# For a patch update (e.g., 0.1.0 β†’ 0.1.1)
make version-bump

# For a minor update (e.g., 0.1.0 β†’ 0.2.0)
make version-bump VERSION_BUMP=minor

# For a major update (e.g., 0.1.0 β†’ 1.0.0)
make version-bump VERSION_BUMP=major
  1. Update the changelog to document your changes:
make changelog

When editing the changelog, add your changes under the [Unreleased] section using appropriate categories (Added, Changed, Fixed, etc.).

  1. Finalize the release:
make version-release

This command will:

  • Update the changelog format
  • Create a git tag for the version
  • Commit the changes
  1. Push to the repository:
git push && git push --tags

Checking Version in Code

The framework provides utilities for checking version information in your code:

from src.utils.version import get_version, display_version_info

# Get current version string
version = get_version()

# Display detailed version information
display_version_info()

Versioning Strategy

The project follows Semantic Versioning:

  • MAJOR version (0.x.x β†’ 1.0.0): Incompatible API changes
  • MINOR version (0.1.x β†’ 0.2.0): New functionality in a backward-compatible manner
  • PATCH version (0.1.0 β†’ 0.1.1): Backward-compatible bug fixes

During development phase (before 1.0.0), minor version bumps may include breaking changes.

🌟 Major Tasks

🧩 Tokenization

πŸ” Training

  • Further train language models on new data using scalable, resumable pipelines.
  • Distributed training, curriculum support, and robust checkpointing.
  • See CLM_TRAINING.md.

πŸš€ Publish

  • Upload trained models, tokenizers, or datasets to the Hugging Face Hub.
  • YAML-driven, supports safe serialization and authentication best practices.
  • See PUBLISH.md.

πŸ“š Documentation by Task


πŸ”§ Core Infrastructure

  • Configuration System - Type-safe YAML schemas with Pydantic validation
  • Task Orchestration - Modular task execution via CLI/config mapping
  • Environment Management - Dockerized training stack with Makefile control

πŸ› οΈ Training Capabilities

  • Resumable Workflows - Atomic checkpoints with full state serialization
  • Distributed Strategies - FSDP/DeepSpeed integration profiles
  • Curriculum Learning - Phase-wise data mixing (config/curricula)

πŸ“Š Data Processing

  • Multi-format Corpus - Unified processor for JSONL/Parquet/TXT
  • Streaming Tokenization - Memory-efficient HF datasets integration
  • Data Health Checks - Statistical validation pre-training

πŸ”¬ Model Monitoring

  • Training Telemetry - Gradient/activation histograms in utils/monitoring.py
  • Early Warning System - NaN/overflow detection
  • Optimizer Diagnostics - Learning rate/parameter scale tracking

πŸš€ Quick Start

All major tasks are YAML-driven. See the /docs folder for detailed per-task guides and example configs.

Build environment

make build

Validate configuration

make validate CONFIG=config/pretraining.yaml

Run tokenization task (YAML-driven)

python src/main.py --config tutorials/configs/tokenization_tutorial.yaml

Launch CLM training (continual pretraining)

python src/main.py --config tutorials/configs/clm_training_tutorial.yaml

Publish a trained model to Hugging Face Hub

python src/main.py --config tutorials/configs/publish_tutorial.yaml

Note: For publish tasks, authenticate with Hugging Face via huggingface-cli login or set the HUGGINGFACE_HUB_TOKEN environment variable.

πŸ–₯️ SLURM Cluster Execution

For running on SLURM clusters, we provide a comprehensive set of configurable job scripts in the slurm/ directory:

# Quick submission with defaults
cd slurm
sbatch p1-dgx.slurm

# Use the helper script for easy configuration
./submit_job.sh -c config/experiments/test_continual.yaml -k your_wandb_key -g 4

# Custom resource allocation
./submit_job.sh -g 8 -m 400G -t 72:00:00 -o large_experiment

Key Features:

  • πŸ”§ Fully Configurable - All paths, resources, and settings via environment variables
  • 🐳 Docker Integration - Containerized execution with automatic user mapping
  • πŸ“Š WandB Integration - Automatic experiment tracking and logging
  • πŸ” Debug Support - Comprehensive error reporting and troubleshooting
  • πŸ“ Organized Structure - Clean separation of job scripts and configurations

See slurm/README.md for detailed documentation and examples.

βš™οΈ Configuration System

Structured YAML Schemas:

# src/config/base.py
class TaskConfig(BaseModel):
    task_name: str = Field(..., description="Name of task to execute")
    output_dir: Path = Field(..., description="Base output directory")
    
class TokenizationConfig(TaskConfig):
    dataset_path: str
    tokenizer_name: str
    chunk_size: int = 2048
    validation_ratio: float = 0.1

Validation flow:

CLI Command β†’ Load YAML β†’ Validate Schema β†’ Build Config Object β†’ Execute Task

πŸ“‚ Project Structure

project/
β”œβ”€β”€ config/                 # YAML configuration templates
β”‚   β”œβ”€β”€ curricula/         # Data mixing schedules
β”‚   β”œβ”€β”€ tokenization.yaml  # Tokenizer params
β”‚   └── pretraining.yaml   # Model/training params
β”œβ”€β”€ docker/                # Container definitions
β”‚   └── Dockerfile         # CUDA+PyTorch base image
β”œβ”€β”€ Makefile               # Project orchestration
β”œβ”€β”€ requirements/          # Pinned dependencies
└── src/
    β”œβ”€β”€ config/            # Pydantic schema definitions
    β”œβ”€β”€ tasks/             # Task implementations
    └── utils/             # Monitoring/checkpointing

πŸ‹ Docker & Makefile

Dockerfile Highlights:

FROM nvcr.io/nvidia/pytorch:23.10-py3
COPY requirements.txt . 
RUN pip install -r requirements.txt
ENTRYPOINT ["make"]

Makefile Targets:

validate:  # Config schema check
    python -m src.main --validate $(CONFIG)

tokenize:  # Process datasets
    python -m src.main --task tokenize $(CONFIG)

train:     # Launch training
    torchrun --nproc_per_node=$(GPUS) src/main.py --task train $(CONFIG)

πŸ“ Reproducibility Features

  1. Configuration Hashing - MD5 checksum of all config files
  2. Environment Snapshot - pip freeze in training logs
  3. Deterministic Seeds - Full random state preservation

πŸ“š Documentation

The project includes comprehensive documentation built with Sphinx featuring a beautiful custom design.

Building Documentation

We provide a convenient script to build the documentation:

# Build documentation
./build_docs.sh

# Build and open in browser
./build_docs.sh --open

# More options
./build_docs.sh --help

Alternatively, you can use tox:

# Using tox
tox -e docs

For more details on documentation features, customization, and contributing guidelines, please refer to the Documentation README.

🀝 Contributing

  1. Implement new config schemas in src/config/
  2. Add corresponding task modules in src/tasks/
  3. Include validation tests:
def test_tokenization_config():
    with pytest.raises(ValidationError):
        TokenizationConfig(tokenizer_name="invalid/model")

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors