A comprehensive benchmarking framework for encoder and decoder model families with Hydra configuration management and experiment tracking.
This module provides:
- Reproducible baselines for BERT, GPT, and LLaMA families
- Depth-as-time training comparison with classical approaches
- Comprehensive evaluation on GLUE, language modeling, and reasoning tasks
- Hydra-powered configuration management for complex experiments
- Automated experiment tracking and result analysis
Requirements:
- Python 3.8+
- uv package manager (recommended)
- CUDA-capable GPU (optional, for training)
Installation with uv (recommended):

```bash
# Navigate to the project root
cd /path/to/stack-wise

# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate

# Install the main package with advanced dependencies
uv pip install -e ".[advanced]"

# Install the baselines module
cd baselines
uv pip install -e .
```

Installation with pip:

```bash
# Navigate to the project root
cd /path/to/stack-wise

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the main package with advanced dependencies
pip install -e ".[advanced]"

# Install the baselines module
cd baselines
pip install -e .
```
Quick start with uv:

```bash
# Activate virtual environment
source .venv/bin/activate

# Train BERT-base on GLUE (reproduction)
uv run python scripts/train.py --config-name=bert_base_glue

# Train GPT-2-small with depth-as-time
uv run python scripts/train.py model=decoder/gpt2_family/small training=depth_time

# Run ablation study
uv run python scripts/benchmark.py --config-name=bert_depth_time_vs_classical

# Evaluate trained model
uv run python scripts/evaluate.py --config-name=bert_base_glue model_path=./checkpoints/model.pt
```

Quick start with pip:

```bash
# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Train BERT-base on GLUE (reproduction)
python scripts/train.py --config-name=bert_base_glue

# Train GPT-2-small with depth-as-time
python scripts/train.py model=decoder/gpt2_family/small training=depth_time

# Run ablation study
python scripts/benchmark.py --config-name=bert_depth_time_vs_classical

# Evaluate trained model
python scripts/evaluate.py --config-name=bert_base_glue model_path=./checkpoints/model.pt
```
Project structure:

```
baselines/
├── configs/              # Hydra configuration files
│   ├── config.yaml       # Main configuration
│   ├── model/            # Model configurations
│   │   ├── encoder/      # BERT, ModernBERT families
│   │   └── decoder/      # GPT-2, LLaMA families
│   ├── training/         # Training regime configs
│   ├── benchmarks/       # Benchmark task configs
│   ├── datasets/         # Dataset-specific configs
│   └── experiments/      # Complete experiment configs
├── src/                  # Source code
│   ├── evaluation/       # Evaluation harness
│   ├── benchmarks/       # Benchmark implementations
│   └── utils/            # Utility functions
├── scripts/              # Executable scripts
│   ├── train.py          # Training script
│   ├── evaluate.py       # Evaluation script
│   └── benchmark.py      # Benchmark runner
└── experiments/          # Experiment outputs
```
BERT Family:
- `tiny.yaml` - TinyBERT (14M params)
- `base.yaml` - BERT-base (110M params)
- `large.yaml` - BERT-large (340M params)
GPT-2 Family:
- `small.yaml` - GPT-2-small (124M params)
- `medium.yaml` - GPT-2-medium (355M params)
- `large.yaml` - GPT-2-large (774M params)
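These YAML files are ordinary Hydra/OmegaConf configs and can be inspected directly; for example (the file path is inferred from the project tree above and may differ):

```python
from omegaconf import OmegaConf

# Load a single model config for inspection (path assumed from the tree above)
cfg = OmegaConf.load("configs/model/encoder/bert_family/base.yaml")
print(OmegaConf.to_yaml(cfg))
```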
Classical Training:

```yaml
strategy: "end_to_end"
end_to_end_scope: "rackwise"
progressive:
  enabled: false
```

Depth-as-Time Training:

```yaml
strategy: "progressive"
end_to_end_scope: "stackwise"
progressive:
  enabled: true
  time_interpretation: "depth"
  building_mode: "prepend"
```
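To make the `progressive` settings concrete, here is a conceptual sketch of what prepend-style, depth-as-time building could look like; every name below is illustrative, not the framework's actual API:

```python
import torch.nn as nn

def build_depth_as_time(make_block, n_blocks, train_stage):
    """Conceptual sketch: grow a model block by block, treating depth as a
    time axis. Function names and semantics are assumptions for illustration."""
    blocks = []
    for step in range(n_blocks):
        # "prepend" building mode: each new block is inserted at the input side,
        # so earlier-trained blocks end up deeper in the final stack.
        blocks.insert(0, make_block())
        model = nn.Sequential(*blocks)
        train_stage(model, step)  # train the current partial stack
    return model
```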
NLU Tasks (GLUE):
- CoLA, SST-2, MRPC, STS-B, QQP
- MNLI, QNLI, RTE, WNLI

NLG Tasks:
- WikiText-103, PTB (perplexity)
- LAMBADA, HellaSwag, PIQA (reasoning)
- CNN/DailyMail (summarization)
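All of these tasks are available on the HuggingFace hub, which makes a quick data sanity check easy; the identifiers below are the standard hub names, an assumption about how this repo actually loads its data:

```python
from datasets import load_dataset

# GLUE single-task split (e.g., SST-2 for sentiment classification)
sst2 = load_dataset("glue", "sst2", split="validation")
print(sst2[0])  # {'sentence': ..., 'label': ..., 'idx': ...}

# WikiText-103 for perplexity evaluation
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
print(wikitext[0]["text"][:80])
```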
Reproduce existing model performance:

```bash
# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# BERT-base GLUE reproduction
uv run python scripts/train.py --config-name=bert_base_glue_reproduction

# GPT-2-small language modeling
uv run python scripts/train.py --config-name=gpt2_small_wikitext
```

Compare training regimes:

```bash
# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# Depth-as-time vs classical
uv run python scripts/benchmark.py --config-name=bert_depth_time_vs_classical

# Multi-run parameter sweep
uv run python scripts/train.py --config-name=bert_base_glue --multirun training.lr=1e-5,2e-5,5e-5
```

Analyze scaling laws:

```bash
# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# Model size scaling
uv run python scripts/benchmark.py --config-name=scaling_study --multirun model_variant=tiny,small,base,large

# Compute equalization
uv run python scripts/train.py --config-name=compute_equalized_scaling
```
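Results from such a sweep can be sanity-checked against a power law offline; a minimal sketch with numpy (the parameter counts and losses below are placeholders, not measured results):

```python
import numpy as np

# Hypothetical (param_count, eval_loss) pairs -- placeholders, not real results
n = np.array([14e6, 124e6, 355e6, 774e6])
loss = np.array([4.1, 3.4, 3.1, 2.9])

# Fit loss ~ a * N^slope by linear regression in log-log space
slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
print(f"loss ~ {np.exp(intercept):.2f} * N^{slope:.3f}")
```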
NLU Metrics:
- Accuracy, F1-score
- Matthews correlation
- Pearson/Spearman correlation

NLG Metrics:
- Perplexity
- BLEU, ROUGE scores
- Task-specific accuracy
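Most of these metrics come straight from standard libraries; a small sketch with placeholder predictions:

```python
import math
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = [1, 0, 1, 1, 0]  # placeholder labels
y_pred = [1, 0, 0, 1, 0]  # placeholder predictions
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("Matthews corr:", matthews_corrcoef(y_true, y_pred))

# STS-B style regression scores (placeholders)
gold = [0.0, 1.5, 3.2, 4.8, 2.1]
pred = [0.3, 1.2, 3.5, 4.5, 2.4]
print("Pearson r:", pearsonr(gold, pred)[0])

# Perplexity from an average per-token negative log-likelihood (in nats)
mean_nll = 3.21  # placeholder value
print("perplexity:", math.exp(mean_nll))
```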
Output structure:

```
experiments/
└── {experiment_name}/
    ├── config.yaml             # Final configuration
    ├── checkpoints/            # Model checkpoints
    ├── logs/                   # Training logs
    ├── outputs/                # Evaluation results
    │   ├── metrics.json
    │   ├── evaluation_report.md
    │   └── comparison_plots.png
    └── metadata/               # Run metadata
```
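Because every run writes a `metrics.json`, results can be collected across experiments; a small sketch, assuming that file holds a flat metric-to-value mapping (the actual schema may differ):

```python
import json
from pathlib import Path

rows = []
for metrics_file in Path("experiments").glob("*/outputs/metrics.json"):
    with open(metrics_file) as f:
        metrics = json.load(f)  # assumed: flat {metric_name: value} dict
    rows.append((metrics_file.parent.parent.name, metrics))

for name, metrics in sorted(rows):
    print(name, metrics)
```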
Create custom experiment configs:

```yaml
# configs/experiments/custom/my_experiment.yaml
defaults:
  - model: encoder/bert_family/base
  - training: depth_time
  - benchmark: nlu/glue

# Custom overrides
model:
  d_model: 512
  n_heads: 8

training:
  batch_size: 32
  learning_rate: 1e-4
```
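For reference, a training script consumes such a config through Hydra roughly as follows; this is a minimal sketch, not the actual `scripts/train.py`, and the `cfg.model.d_model` key layout is assumed from the example above:

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="../configs", config_name="config")
def main(cfg: DictConfig) -> None:
    # Hydra has already composed the defaults list and CLI overrides into cfg
    print(OmegaConf.to_yaml(cfg))
    print("d_model:", cfg.model.d_model)  # assumed key layout

if __name__ == "__main__":
    main()
```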
Run parameter sweeps:

```bash
# Activate virtual environment
source .venv/bin/activate  # or source venv/bin/activate

# Learning rate sweep
uv run python scripts/train.py --config-name=bert_base_glue --multirun training.lr=1e-5,2e-5,5e-5

# Model size sweep
uv run python scripts/train.py --config-name=scaling_study --multirun model_variant=tiny,small,base,large

# Training regime comparison
uv run python scripts/train.py --config-name=ablation_study --multirun training_regime=classical,depth_time,hybrid
```
Add new benchmark tasks:

```yaml
# configs/benchmarks/custom/my_benchmark.yaml
name: "my_benchmark"
tasks:
  - name: "my_task"
    metric: "accuracy"
    target_score: 85.0
```
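The corresponding task loader in `src/evaluation/` might have roughly this shape; a hypothetical skeleton, since the module's real loader interface isn't shown here:

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """Mirrors one entry under `tasks:` in the benchmark config above."""
    name: str
    metric: str
    target_score: float

def load_my_task(spec: TaskSpec):
    """Hypothetical loader: return (examples, metric_fn) for one task."""
    examples = [{"text": "placeholder", "label": 0}]  # stand-in data
    def metric_fn(preds, labels) -> float:
        correct = sum(p == l for p, l in zip(preds, labels))
        return 100.0 * correct / len(labels)
    return examples, metric_fn
```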
The framework generates comprehensive reports:
- Reproduction Reports - Compare with baseline models
- Ablation Analysis - Statistical significance testing
- Scaling Analysis - Power law fitting and visualization
- Performance Plots - Automated visualization generation

Statistical analysis includes:
- Confidence intervals
- Multiple comparison correction
- Effect size analysis
- Power analysis
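Confidence intervals and multiple-comparison correction of the kind listed above can be reproduced with standard tools; a sketch with placeholder numbers, not the framework's built-in pipeline:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
scores = rng.normal(0.85, 0.02, size=20)  # placeholder per-seed accuracies

# 95% bootstrap confidence interval for the mean score
boot = [rng.choice(scores, size=scores.size, replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean={scores.mean():.4f}, 95% CI=({lo:.4f}, {hi:.4f})")

# Holm correction across several ablation p-values (placeholders)
p_values = [0.01, 0.04, 0.20]
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(reject, p_adj)
```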
To add a new model:
1. Create model config in `configs/model/`
2. Implement model architecture
3. Add to training scripts
4. Create reproduction configs

To add a new benchmark:
1. Create benchmark config in `configs/benchmarks/`
2. Implement task loader in `src/evaluation/`
3. Add metrics computation
4. Create evaluation configs

To add a new training regime:
1. Create training config in `configs/training/`
2. Implement trainer class (a hypothetical skeleton is sketched below)
3. Add to training script
4. Create ablation configs
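A hypothetical skeleton for step 2, to show the rough shape of a trainer class; the constructor arguments and the HF-style `.loss` output are assumptions, not the framework's actual interface:

```python
import torch

class DepthTimeTrainer:
    """Hypothetical trainer skeleton -- not the framework's actual class."""

    def __init__(self, model, cfg):
        self.model = model
        self.optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["learning_rate"])

    def training_step(self, batch):
        self.optimizer.zero_grad()
        loss = self.model(**batch).loss  # assumes a HF-style output with .loss
        loss.backward()
        self.optimizer.step()
        return loss.item()
```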
See the examples/ directory for:
- Basic training examples
- Configuration examples
- Evaluation examples
- Analysis examples
Problem: `uv: command not found`

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# or
pip install uv
```

Problem: Virtual environment not activated

```bash
# Check if a virtual environment is active
echo $VIRTUAL_ENV

# Activate the virtual environment
source .venv/bin/activate  # for uv
# or
source venv/bin/activate   # for pip
```

Problem: Package not found after installation

```bash
# Reinstall in development mode
uv pip install -e ".[advanced]"
# or
pip install -e ".[advanced]"
```

Problem: CUDA/GPU not detected

```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Install CUDA-enabled PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Problem: Hydra configuration not found

```bash
# Run from the baselines directory
cd baselines
uv run python scripts/train.py --config-name=bert_base_glue
```

Problem: Model checkpoint not found

```bash
# Check the checkpoint path
ls -la checkpoints/
# Update model_path in the evaluation script
```
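If the file exists but evaluation still fails, a quick load check in Python can rule out a corrupted or mismatched checkpoint (assumes a standard PyTorch checkpoint file):

```python
import torch

ckpt = torch.load("checkpoints/model.pt", map_location="cpu")
# Inspect what was saved: a raw state_dict, or a wrapper dict with extra keys
keys = list(ckpt.keys()) if isinstance(ckpt, dict) else []
print(type(ckpt), keys[:5])
```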
Contributing:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License: MIT - see the LICENSE file for details.
Acknowledgments:
- HuggingFace Transformers
- Hydra configuration framework
- GLUE and SuperGLUE benchmarks
- EleutherAI evaluation harness