
StackWise Baselines Module

A comprehensive benchmarking framework for encoder and decoder model families, with Hydra configuration management and experiment tracking.

🎯 Overview

This module provides:

  • Reproducible baselines for BERT, GPT, and LLaMA families
  • Depth-as-time training comparison with classical approaches
  • Comprehensive evaluation on GLUE, language modeling, and reasoning tasks
  • Hydra-powered configuration management for complex experiments
  • Automated experiment tracking and result analysis

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • uv package manager (recommended)
  • CUDA-capable GPU (optional, for training)

Installation

Option 1: Using uv (Recommended)

# Navigate to the project root
cd /path/to/stack-wise

# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate

# Install the main package with advanced dependencies
uv pip install -e .[advanced]

# Install the baselines module
cd baselines
uv pip install -e .

Option 2: Using pip

# Navigate to the project root
cd /path/to/stack-wise

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the main package with advanced dependencies
pip install -e .[advanced]

# Install the baselines module
cd baselines
pip install -e .

Basic Usage

Using uv (Recommended)

# Activate virtual environment
source .venv/bin/activate

# Train BERT-base on GLUE (reproduction)
uv run python scripts/train.py --config-name=bert_base_glue

# Train GPT-2-small with depth-as-time
uv run python scripts/train.py model=decoder/gpt2_family/small training=depth_time

# Run ablation study
uv run python scripts/benchmark.py --config-name=bert_depth_time_vs_classical

# Evaluate trained model
uv run python scripts/evaluate.py --config-name=bert_base_glue model_path=./checkpoints/model.pt

Using pip

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Train BERT-base on GLUE (reproduction)
python scripts/train.py --config-name=bert_base_glue

# Train GPT-2-small with depth-as-time
python scripts/train.py model=decoder/gpt2_family/small training=depth_time

# Run ablation study
python scripts/benchmark.py --config-name=bert_depth_time_vs_classical

# Evaluate trained model
python scripts/evaluate.py --config-name=bert_base_glue model_path=./checkpoints/model.pt

πŸ“ Directory Structure

baselines/
├── configs/                    # Hydra configuration files
│   ├── config.yaml            # Main configuration
│   ├── model/                 # Model configurations
│   │   ├── encoder/           # BERT, ModernBERT families
│   │   └── decoder/           # GPT-2, LLaMA families
│   ├── training/              # Training regime configs
│   ├── benchmarks/            # Benchmark task configs
│   ├── datasets/              # Dataset-specific configs
│   └── experiments/           # Complete experiment configs
├── src/                       # Source code
│   ├── evaluation/            # Evaluation harness
│   ├── benchmarks/            # Benchmark implementations
│   └── utils/                 # Utility functions
├── scripts/                   # Executable scripts
│   ├── train.py              # Training script
│   ├── evaluate.py           # Evaluation script
│   └── benchmark.py          # Benchmark runner
└── experiments/              # Experiment outputs

πŸ”§ Configuration

Model Configurations

BERT Family:

  • tiny.yaml - TinyBERT (14M params)
  • base.yaml - BERT-base (110M params)
  • large.yaml - BERT-large (340M params)

GPT-2 Family:

  • small.yaml - GPT-2-small (124M params)
  • medium.yaml - GPT-2-medium (355M params)
  • large.yaml - GPT-2-large (774M params)

Training Regimes

Classical Training:

strategy: "end_to_end"
end_to_end_scope: "rackwise"
progressive:
  enabled: false

Depth-as-Time Training:

strategy: "progressive"
end_to_end_scope: "stackwise"
progressive:
  enabled: true
  time_interpretation: "depth"
  building_mode: "prepend"

Benchmark Tasks

NLU Tasks (GLUE):

  • CoLA, SST-2, MRPC, STS-B, QQP
  • MNLI, QNLI, RTE, WNLI

NLG Tasks:

  • WikiText-103, PTB (perplexity)
  • LAMBADA, HellaSwag, PIQA (reasoning)
  • CNN/DailyMail (summarization)

πŸ§ͺ Experiments

Reproduction Experiments

Reproduce existing model performance:

# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# BERT-base GLUE reproduction
uv run python scripts/train.py --config-name=bert_base_glue_reproduction

# GPT-2-small language modeling
uv run python scripts/train.py --config-name=gpt2_small_wikitext

Ablation Studies

Compare training regimes:

# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# Depth-as-time vs classical
uv run python scripts/benchmark.py --config-name=bert_depth_time_vs_classical

# Multi-run parameter sweep
uv run python scripts/train.py --config-name=bert_base_glue --multirun training.lr=1e-5,2e-5,5e-5

Scaling Studies

Analyze scaling laws:

# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# Model size scaling
uv run python scripts/benchmark.py --config-name=scaling_study --multirun model_variant=tiny,small,base,large

# Compute equalization
uv run python scripts/train.py --config-name=compute_equalized_scaling

πŸ“Š Evaluation

Metrics

NLU Metrics:

  • Accuracy, F1-score
  • Matthews correlation
  • Pearson/Spearman correlation

NLG Metrics:

  • Perplexity
  • BLEU, ROUGE scores
  • Task-specific accuracy
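
As a rough illustration of how these headline numbers can be computed with standard libraries (the harness's own implementations live in src/evaluation/ and may differ):

# Hedged sketch: metric computation with sklearn/scipy. Not the framework's
# actual code; the function names here are illustrative.
import math
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from scipy.stats import pearsonr, spearmanr

def nlu_metrics(y_true, y_pred):
    """Classification metrics for binary GLUE-style tasks."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "matthews": matthews_corrcoef(y_true, y_pred),  # CoLA's metric
    }

def regression_metrics(y_true, y_score):
    """Correlations for STS-B-style regression tasks."""
    return {
        "pearson": pearsonr(y_true, y_score)[0],
        "spearman": spearmanr(y_true, y_score)[0],
    }

def perplexity(total_nll, n_tokens):
    """Perplexity: exponential of the mean per-token negative log-likelihood."""
    return math.exp(total_nll / n_tokens)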

Results Organization

experiments/
├── {experiment_name}/
│   ├── config.yaml              # Final configuration
│   ├── checkpoints/             # Model checkpoints
│   ├── logs/                    # Training logs
│   ├── outputs/                 # Evaluation results
│   │   ├── metrics.json
│   │   ├── evaluation_report.md
│   │   └── comparison_plots.png
│   └── metadata/                # Run metadata
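
As a minimal sketch, results can be aggregated straight from this layout (assuming the paths above; the keys inside metrics.json vary by task):

# Hedged sketch: collect every run's metrics.json into a list of rows.
import json
from pathlib import Path

def collect_metrics(root="experiments"):
    rows = []
    for metrics_file in Path(root).glob("*/outputs/metrics.json"):
        with open(metrics_file) as f:
            metrics = json.load(f)
        # parts[-3] is the {experiment_name} directory in the layout above
        rows.append({"experiment": metrics_file.parts[-3], **metrics})
    return rows

for row in collect_metrics():
    print(row)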

πŸ”¬ Advanced Usage

Custom Configurations

Create custom experiment configs:

# configs/experiments/custom/my_experiment.yaml
defaults:
  - model: encoder/bert_family/base
  - training: depth_time
  - benchmark: nlu/glue

# Custom overrides
model:
  d_model: 512
  n_heads: 8

training:
  batch_size: 32
  learning_rate: 1e-4

Multi-Run Experiments

Run parameter sweeps:

# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# Learning rate sweep
uv run python scripts/train.py --config-name=bert_base_glue --multirun training.lr=1e-5,2e-5,5e-5

# Model size sweep
uv run python scripts/train.py --config-name=scaling_study --multirun model_variant=tiny,small,base,large

# Training regime comparison
uv run python scripts/train.py --config-name=ablation_study --multirun training_regime=classical,depth_time,hybrid

Custom Benchmarks

Add new benchmark tasks:

# configs/benchmarks/custom/my_benchmark.yaml
name: "my_benchmark"
tasks:
  - name: "my_task"
    metric: "accuracy"
    target_score: 85.0

πŸ“ˆ Analysis & Reporting

Automated Reports

The framework generates comprehensive reports:

  • Reproduction Reports - Compare with baseline models
  • Ablation Analysis - Statistical significance testing
  • Scaling Analysis - Power law fitting and visualization
  • Performance Plots - Automated visualization generation
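
For the scaling analysis, the power-law fit reduces to a linear fit in log-log space; a minimal sketch with placeholder numbers (not measured results, and not the framework's actual fitting code):

# Hedged sketch: fit loss ~ a * N^b in log-log space, the standard
# scaling-law fit. The sizes and losses below are placeholders only.
import numpy as np

sizes = np.array([14e6, 110e6, 340e6])   # parameter counts (e.g. tiny/base/large)
losses = np.array([3.1, 2.6, 2.4])       # placeholder eval losses
b, log_a = np.polyfit(np.log(sizes), np.log(losses), 1)
print(f"loss ~ {np.exp(log_a):.2f} * N^{b:.3f}")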

Statistical Analysis

  • Confidence intervals
  • Multiple comparison correction
  • Effect size analysis
  • Power analysis
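
A minimal sketch of the first two items, assuming per-seed scores and using only numpy/scipy (the framework's own analysis utilities may differ):

# Hedged sketch: t-based confidence interval and Holm-Bonferroni correction.
import numpy as np
from scipy import stats

def confidence_interval(scores, level=0.95):
    """t-based CI for the mean of a small sample of per-seed scores."""
    scores = np.asarray(scores, dtype=float)
    half = stats.sem(scores) * stats.t.ppf((1 + level) / 2, df=len(scores) - 1)
    return scores.mean() - half, scores.mean() + half

def holm_bonferroni(p_values, alpha=0.05):
    """Step-down multiple-comparison correction; returns rejection decisions."""
    p_values = np.asarray(p_values, dtype=float)
    order = np.argsort(p_values)
    reject = np.zeros(len(p_values), dtype=bool)
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (len(p_values) - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject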

πŸ› οΈ Development

Adding New Models

  1. Create model config in configs/model/
  2. Implement the model architecture (a minimal skeleton is sketched below)
  3. Add to training scripts
  4. Create reproduction configs
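
A minimal skeleton for step 2, assuming a plain PyTorch module (the class name and defaults are hypothetical; mirror the existing encoder/decoder families):

# Hedged sketch: a config-driven encoder skeleton. Inputs are assumed to be
# already-embedded (batch, seq, d_model) tensors.
import torch.nn as nn

class MyEncoder(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        return self.blocks(x)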

Adding New Benchmarks

  1. Create benchmark config in configs/benchmarks/
  2. Implement the task loader in src/evaluation/ (sketched below)
  3. Add metrics computation
  4. Create evaluation configs
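
A minimal sketch of steps 2-3 (the function names are hypothetical and the dataset is just an example; follow the signatures of the existing loaders in src/evaluation/):

# Hedged sketch: a task loader plus the metric it reports.
from datasets import load_dataset  # HuggingFace datasets
from sklearn.metrics import accuracy_score

def load_my_task(split="validation"):
    """Load the benchmark split; swap in your own dataset."""
    return load_dataset("glue", "sst2", split=split)

def my_task_metric(y_true, y_pred):
    """Backs the `metric: "accuracy"` field in the benchmark config."""
    return {"accuracy": accuracy_score(y_true, y_pred)}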

Adding New Training Regimes

  1. Create training config in configs/training/
  2. Implement the trainer class (sketched below)
  3. Add to training script
  4. Create ablation configs
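
A minimal sketch of step 2, keyed off the training config (all hooks here, such as add_block and grow_every, are hypothetical; align them with whatever scripts/train.py actually dispatches to):

# Hedged sketch: a trainer skeleton for a new regime.
class ProgressiveTrainer:
    """Grows model depth during training (depth-as-time)."""

    def __init__(self, model, cfg):
        self.model = model
        self.cfg = cfg  # e.g. cfg["progressive"]["building_mode"] == "prepend"

    def should_grow(self, step):
        # Assumption: grow on a fixed step schedule; could also be loss-driven.
        return step > 0 and step % self.cfg["grow_every"] == 0

    def train(self, dataloader, optimizer):
        for step, batch in enumerate(dataloader):
            if self.should_grow(step):
                self.model.add_block(self.cfg["building_mode"])  # hypothetical hook
            loss = self.model(**batch)  # assumed to return a scalar loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()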

πŸ“š Examples

See the examples/ directory for:

  • Basic training examples
  • Configuration examples
  • Evaluation examples
  • Analysis examples

πŸ”§ Troubleshooting

Virtual Environment Issues

Problem: uv: command not found

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# or
pip install uv

Problem: Virtual environment not activated

# Check if virtual environment is active
echo $VIRTUAL_ENV

# Activate virtual environment
source .venv/bin/activate  # for uv
# or
source venv/bin/activate   # for pip

Problem: Package not found after installation

# Reinstall in development mode
uv pip install -e .[advanced]
# or
pip install -e .[advanced]

Problem: CUDA/GPU not detected

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Install CUDA-enabled PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Configuration Issues

Problem: Hydra configuration not found

# Run from the baselines directory
cd baselines
uv run python scripts/train.py --config-name=bert_base_glue

Problem: Model checkpoint not found

# Check checkpoint path
ls -la checkpoints/
# Update model_path in evaluation script

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

πŸ“„ License

MIT License - see LICENSE file for details.

πŸ™ Acknowledgments

  • HuggingFace Transformers
  • Hydra configuration framework
  • GLUE and SuperGLUE benchmarks
  • EleutherAI evaluation harness