This guide explains how to configure the Stack-Wise training system for different use cases.
A basic configuration file looks like this:

```yaml
# config.yaml
model:
  vocab_size: 50257
  d_model: 768
  n_heads: 12
  # Attention configuration
  attention_type: "mha"            # mha | gqa | mla | kernel
  attention_mode: "bidirectional"  # bidirectional | causal

# Architecture configuration
architecture:
  n_stacks: 2
  blocks_per_stack: 4
  d_model: 768
  d_ff: 3072
  n_heads: 12

training:
  batch_size: 8
  max_steps: 1000
  # Progressive training configuration
  progressive:
    enabled: true
    trunk_strategy: "frozen"       # "frozen" or "qlora"
    new_stack_precision: "full"    # "full", "half", "bfloat16", "nvfp4"
```
To adapt the trunk with QLoRA instead of freezing it, set `trunk_strategy: "qlora"` and configure the adapter parameters:

```yaml
training:
  progressive:
    enabled: true
    trunk_strategy: "qlora"        # "frozen" or "qlora"
    new_stack_precision: "full"    # "full", "half", "bfloat16", "nvfp4"

  # Stack LoRA parameters (added to each stack)
  qlora_enabled: true
  qlora_strategy: "progressive"    # "simplified", "progressive", "variable"
  qlora_rank: 16
  qlora_alpha: 32

  # Progressive QLoRA parameters (added to the trunk when new stacks are added)
  progressive_qlora: true
  progressive_qlora_rank: 8
  progressive_qlora_alpha: 16

  # Progressive patterns (for qlora_strategy: "progressive")
  qlora_rank_pattern: "increasing"   # "constant", "increasing", "decreasing", "linear"
  qlora_alpha_pattern: "constant"    # "constant", "increasing", "decreasing", "linear"

  # Variable QLoRA configurations (for qlora_strategy: "variable")
  qlora_configs:
    0: {rank: 8, alpha: 16}      # Stack 0: small QLoRA
    1: {rank: 16, alpha: 32}     # Stack 1: medium QLoRA
    2: {rank: 32, alpha: 64}     # Stack 2: large QLoRA
    3: {rank: 64, alpha: 128}    # Stack 3: very large QLoRA
```
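One way the per-stack rank and alpha could be resolved from these options, given the `training` section as a dict. The helper and its reading of the "increasing" pattern are assumptions for illustration, not the system's documented behavior:

```python
def resolve_qlora_params(stack_idx: int, training_cfg: dict) -> tuple[int, int]:
    """Illustrative resolution of (rank, alpha) for one stack."""
    strategy = training_cfg.get("qlora_strategy", "simplified")
    base_rank = training_cfg["qlora_rank"]
    base_alpha = training_cfg["qlora_alpha"]

    if strategy == "variable":
        # Explicit per-stack settings win.
        per_stack = training_cfg["qlora_configs"][stack_idx]
        return per_stack["rank"], per_stack["alpha"]

    if strategy == "progressive":
        # Assumed interpretation: "increasing" doubles the value per stack,
        # "constant" keeps the base value. Other patterns are omitted here.
        def scale(value, pattern):
            return value * (2 ** stack_idx) if pattern == "increasing" else value
        return (scale(base_rank, training_cfg["qlora_rank_pattern"]),
                scale(base_alpha, training_cfg["qlora_alpha_pattern"]))

    # "simplified": same rank/alpha for every stack.
    return base_rank, base_alpha
```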
Supported precision modes:

```yaml
precision_options:
  - "full"      # torch.float32
  - "half"      # torch.float16
  - "bfloat16"  # torch.bfloat16
  - "nvfp4"     # NVIDIA FP4 precision
  - "qlora"     # QLoRA training (not a precision mode in itself)
```
mode: "layerwise"
block_size: 1 # Each layer is a separate block
epochs_per_stack: 5training:
mode: "blockwise"
block_size: 4 # 4 layers per block
total_blocks: 3training:
mode: "fused"
fusion_mode: "frozen" # frozen | trainable
total_blocks: 3
block_size: 4training:
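The relationship between the modes can be read off `block_size`: layer-wise is simply `block_size: 1`. A small illustrative helper (not part of the project) that groups layer indices into blocks:

```python
def group_layers_into_blocks(n_layers: int, block_size: int) -> list[list[int]]:
    """Partition layer indices into consecutive blocks (illustrative)."""
    return [list(range(start, min(start + block_size, n_layers)))
            for start in range(0, n_layers, block_size)]

# blockwise: total_blocks=3, block_size=4 -> 12 layers in 3 blocks
print(group_layers_into_blocks(12, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
# layerwise: block_size=1 -> every layer is its own block
print(group_layers_into_blocks(3, 1))   # [[0], [1], [2]]
```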
QLoRA and quantization settings:

```yaml
training:
  qlora_enabled: true
  qlora_lr: 1.0e-5
  current_block_lr: 5.0e-4
  quantization_enabled: true
  quantization_type: "fp16"   # fp4 | fp8 | fp16 | fp32
```
Time-step masking with explicit per-step mask fractions:

```yaml
training:
  time_step_masking: true
  num_time_steps: 12
  time_step_mask_fractions: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.99]
```
Alternatively, let the mask fraction follow a schedule between a minimum and a maximum:

```yaml
training:
  min_mask_fraction: 0.1
  max_mask_fraction: 0.99
  mask_schedule_type: "linear"   # linear | exponential | cosine
```
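How such a schedule might map a time step to a mask fraction. The exact interpolation formulas below (linear, exponential, cosine between the minimum and maximum) are assumptions for illustration, not the project's definitions:

```python
import math

def mask_fraction(step: int, num_steps: int, lo: float, hi: float,
                  schedule: str = "linear") -> float:
    """Illustrative mask-fraction schedule interpolating between lo and hi."""
    t = step / max(num_steps - 1, 1)            # progress in [0, 1]
    if schedule == "linear":
        w = t
    elif schedule == "exponential":
        w = (math.exp(t) - 1.0) / (math.e - 1.0)
    elif schedule == "cosine":
        w = 0.5 * (1.0 - math.cos(math.pi * t))
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return lo + (hi - lo) * w

print([round(mask_fraction(s, 12, 0.1, 0.99), 2) for s in range(12)])
```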
A causal model with GPT-2-small dimensions:

```yaml
model:
  vocab_size: 50257
  d_model: 768
  n_heads: 12
  d_ff: 3072
  attention_mode: "causal"
  tie_embeddings: true
```
A larger causal model:

```yaml
model:
  vocab_size: 50257
  d_model: 1024
  n_heads: 16
  d_ff: 4096
  attention_mode: "causal"
  tie_embeddings: true
```
A bidirectional model using grouped-query attention (GQA) and RoPE position embeddings:

```yaml
model:
  vocab_size: 32000
  d_model: 1024
  n_heads: 16
  n_kv_heads: 8                    # For GQA
  d_ff: 4096
  attention_type: "gqa"            # mha | gqa | mla | kernel
  attention_mode: "bidirectional"  # bidirectional | causal
  use_rope: true
  tie_embeddings: true
```
Data loading settings:

```yaml
data:
  dataset_path: "./data/corpus.json"
  use_dummy_data: false
  num_samples: 10000
  tokenizer_path: "gpt2"
  max_length: 512
  num_workers: 4
  shuffle: true
```
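A sketch of how these fields might be consumed with Hugging Face `transformers` and a PyTorch `DataLoader`. The `CorpusDataset` class and the assumption that the JSON file holds a list of strings are illustrative, not the project's actual data pipeline:

```python
import json
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

class CorpusDataset(Dataset):
    """Illustrative JSON-backed dataset of raw text strings."""
    def __init__(self, path, tokenizer, max_length):
        with open(path) as f:
            self.texts = json.load(f)            # assumed: a list of strings
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True,
                             max_length=self.max_length,
                             padding="max_length", return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token
dataset = CorpusDataset("./data/corpus.json", tokenizer, max_length=512)
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
```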
Pretrained tokenizer and embedding settings:

```yaml
model:
  tokenizer_embedding:
    family: "gpt2"                   # gpt2 | llama-3-8b | mistral-7b
    embedding_option: "embed_tokens"
    freeze_embeddings: false
    adapter_hidden_dim: null
```
Learning rates and weight decay:

```yaml
training:
  # Base learning rates
  qlora_lr: 1.0e-5          # QLoRA adapter learning rate
  current_block_lr: 5.0e-4  # Current block learning rate
  weight_decay: 0.01
```
Batch size, sequence length, and run length:

```yaml
training:
  batch_size: 8
  seq_len: 512
  max_steps: 1000
  log_interval: 10
  save_interval: 100
```
run_id: "my_experiment" # Unique identifier for this run
checkpoint_dir: "./checkpoints"
save_fused_checkpoints: true
mixed_precision: truemodel:
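One plausible layout for where checkpoints land given `run_id` and `checkpoint_dir`; the directory structure and file naming below are purely hypothetical, not what Stack-Wise necessarily writes:

```python
from pathlib import Path

def checkpoint_path(checkpoint_dir: str, run_id: str, step: int) -> Path:
    """Hypothetical layout: <checkpoint_dir>/<run_id>/step_<N>.pt"""
    path = Path(checkpoint_dir) / run_id / f"step_{step}.pt"
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

print(checkpoint_path("./checkpoints", "my_experiment", 100))
# ./checkpoints/my_experiment/step_100.pt
```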
A reduced configuration for quick smoke tests and debugging:

```yaml
model:
  d_model: 256
  n_heads: 4

training:
  mode: "fused"
  batch_size: 4
  max_steps: 100
```
A complete example that ties these pieces together:

```yaml
model:
  vocab_size: 50257
  d_model: 768
  n_heads: 12
  attention_mode: "causal"
  tie_embeddings: true

training:
  mode: "fused"
  fusion_mode: "frozen"
  total_blocks: 3
  block_size: 4
  qlora_enabled: true
  quantization_type: "fp16"
  time_step_masking: true
  batch_size: 8
  max_steps: 1000
  run_id: "gpt2_fusion_v1"
```

The configuration system includes validation to catch common errors (a minimal sketch of these checks follows the list below):
- Required fields: All essential parameters must be specified
- Value ranges: Learning rates, dimensions, etc. are validated
- Dependencies: Related parameters are checked for consistency
- Type checking: Parameter types are validated
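A minimal sketch of the kind of checks described above; the specific rules, field names, and error messages are assumptions for illustration, not the project's actual validator:

```python
def validate_config(cfg: dict) -> list[str]:
    """Illustrative config validation; returns a list of error messages."""
    errors = []

    # Required fields
    for section, key in [("model", "d_model"), ("model", "n_heads"),
                         ("training", "batch_size"), ("training", "max_steps")]:
        if key not in cfg.get(section, {}):
            errors.append(f"missing required field: {section}.{key}")

    # Value ranges and type checking
    lr = cfg.get("training", {}).get("current_block_lr", 5.0e-4)
    if not isinstance(lr, float) or not (0.0 < lr < 1.0):
        errors.append(f"current_block_lr out of range: {lr}")

    # Dependencies between related parameters
    model = cfg.get("model", {})
    if model.get("attention_type") == "gqa" and "n_kv_heads" not in model:
        errors.append("attention_type 'gqa' requires n_kv_heads")
    if {"d_model", "n_heads"} <= model.keys() and model["d_model"] % model["n_heads"] != 0:
        errors.append("d_model must be divisible by n_heads")

    return errors
```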
Common issues and how to address them:
- Import errors: Ensure the virtual environment is activated
- Memory issues: Reduce the batch size or enable quantization
- Training instability: Check learning rates and mask fractions
- Poor convergence: Verify data quality and model architecture
For verbose debug output:

```yaml
training:
  debug: true
  log_level: "DEBUG"
```
Logging, checkpointing, and validation intervals:

```yaml
training:
  log_interval: 10
  save_interval: 100
  validation_interval: 50
```