This guide walks you through installing caramba, running your first experiment, and understanding the core concepts that make the platform work.
- Python 3.10+ – caramba uses modern Python features
- PyTorch 2.0+ – For model building and training
- 8GB+ RAM – For loading models and datasets
- HuggingFace account – For gated models like Llama (requires huggingface-cli login)
- Xcode Command Line Tools – For Metal kernel compilation on macOS
- CUDA + Triton – For GPU acceleration on NVIDIA hardware
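If you want to sanity-check the environment before going further, a short Python snippet covers the version and accelerator checks (standard sys/PyTorch calls, nothing caramba-specific):

import sys
import torch

# Python 3.10+ and PyTorch 2.0+ are required
print("python:", sys.version.split()[0])
print("torch:", torch.__version__)

# Report which accelerator this machine can use
if torch.cuda.is_available():
    print("accelerator: cuda")
elif torch.backends.mps.is_available():
    print("accelerator: mps")
else:
    print("accelerator: cpu (no GPU acceleration)")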
# Clone the repository
git clone https://github.com/theapemachine/caramba.git
cd caramba
# Install dependencies
pip install -r requirements.txt

If you want AI-assisted paper drafting and review:
pip install -e ".[agents]"
# Or install individual components
pip install deeplake docling transformers # Knowledge store
pip install crawl4ai # Web crawling

# Should print the execution plan without running
python3 -m caramba config/presets/standard_transformer.yml --dry-run

Let's run a simple transformer training experiment to verify everything works.
caramba uses pre-tokenized .npy files for efficient data loading. For testing, you can create a small dummy dataset:
import numpy as np
# Create 1M random tokens (replace with real tokenized data for actual experiments)
tokens = np.random.randint(0, 50257, size=1_000_000, dtype=np.int32)
np.save("test_data.npy", tokens)
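As a quick sanity check (plain NumPy, nothing caramba-specific), confirm the file round-trips:

loaded = np.load("test_data.npy")
print(loaded.shape, loaded.dtype)  # (1000000,) int32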
For real experiments, use the FineWeb preparation script:

python3 prepare_fineweb.py --tokens 100M --output fineweb_100m.npy

Create my_experiment.yml:
version: 2
name: my_first_experiment
notes: Learning how caramba works

# Default settings applied to all targets
defaults:
  data:
    tokenizer: tiktoken
    val_frac: 0.1
  logging:
    instrument: rich
    wandb: false
  runtime:
    save_every: 100

# Variables for easy modification
vars:
  d_model: 256
  n_heads: 4
  n_layers: 4
  d_ff: 1024
  vocab_size: 50257
  block_size: 128

# Experiment targets
targets:
  - type: experiment
    name: train
    description: Train a small transformer from scratch
    backend: torch
    task: task.language_modeling

    # Data configuration
    data:
      ref: dataset.tokens
      config:
        path: test_data.npy
        block_size: ${block_size}

    # Model configuration
    system:
      ref: system.language_model
      config:
        model:
          type: TransformerModel
          embedder:
            type: token
            vocab_size: ${vocab_size}
            d_model: ${d_model}
          topology:
            type: StackedTopology
            layers:
              # Repeated transformer blocks
              - type: NestedTopology
                repeat: ${n_layers}
                layers:
                  # Attention with residual
                  - type: ResidualTopology
                    layers:
                      - type: RMSNormLayer
                        d_model: ${d_model}
                      - type: AttentionLayer
                        d_model: ${d_model}
                        n_heads: ${n_heads}
                        mode: standard
                  # FFN with residual
                  - type: ResidualTopology
                    layers:
                      - type: RMSNormLayer
                        d_model: ${d_model}
                      - type: SwiGLULayer
                        d_model: ${d_model}
                        d_ff: ${d_ff}
              # Final normalization
              - type: RMSNormLayer
                d_model: ${d_model}
              # Output projection
              - type: LinearLayer
                d_in: ${d_model}
                d_out: ${vocab_size}

    objective: objective.next_token_ce
    trainer: trainer.standard

    # Training runs
    runs:
      - id: train_small
        mode: train
        exp: my_first_run
        seed: 42
        steps: 500
        train:
          phase: standard
          batch_size: 8
          block_size: ${block_size}
          lr: 0.001
          device: mps # or 'cuda' or 'cpu'
          dtype: float32

Before running, verify the manifest is valid:
python3 -m caramba my_experiment.yml --dry-run

This shows the execution plan without running anything:
┌──────────────────────────────────────────────────────────┐
│ Execution Plan                                            │
├──────────────────────────────────────────────────────────┤
│ Target: train                                             │
│ Runs:                                                      │
│   - train_small (500 steps, device=mps, dtype=float32)    │
│ Benchmarks: []                                             │
└──────────────────────────────────────────────────────────┘
python3 -m caramba my_experiment.yml

You'll see:
╭─ Training Phase: standard ─╮
│ Step 100/500 loss=5.234    │
│ Step 200/500 loss=4.102    │
│ Step 300/500 loss=3.567    │
│ Step 400/500 loss=3.221    │
│ Step 500/500 loss=2.987    │
╰────────────────────────────╯
✓ Training complete
A manifest is a YAML file that declaratively defines your experiment. Here's the structure:
version: 2              # Manifest schema version (always 2)
name: experiment_name   # Used for artifact directories
notes: "Description"    # Human-readable notes

vars:                   # Reusable variables
  d_model: 512

defaults:               # Settings applied to all targets
  data: { ... }
  logging: { ... }
  runtime: { ... }

targets:                # Runnable units (experiments or processes)
  - type: experiment
    name: train
    ...

entrypoints:            # Optional named entry points
  default: "train"

Use ${variable} to reference values from the vars section:
vars:
  d_model: 512
  n_heads: 8

targets:
  - type: experiment
    system:
      config:
        model:
          topology:
            layers:
              - type: AttentionLayer
                d_model: ${d_model}   # Becomes 512
                n_heads: ${n_heads}   # Becomes 8
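Conceptually, interpolation is plain substitution over the parsed manifest. A minimal sketch of the idea in Python (illustrative only, not caramba's actual resolver):

import re

def resolve(value, variables):
    """Replace ${name} references with values from `variables`, recursing into dicts/lists."""
    if isinstance(value, str):
        m = re.fullmatch(r"\$\{(\w+)\}", value)
        if m:
            return variables[m.group(1)]  # whole-string reference keeps its original type
        return re.sub(r"\$\{(\w+)\}", lambda m: str(variables[m.group(1)]), value)
    if isinstance(value, dict):
        return {k: resolve(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, variables) for v in value]
    return value

print(resolve({"d_model": "${d_model}"}, {"d_model": 512}))  # {'d_model': 512}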
Models are defined as trees of topologies containing layers:

topology:
  type: StackedTopology          # Root: sequential execution
  layers:
    - type: NestedTopology       # Repeat this block N times
      repeat: 6
      layers:
        - type: ResidualTopology # x + f(x)
          layers:
            - type: RMSNormLayer
            - type: AttentionLayer

A target is a runnable unit. There are two types:
| Type | Purpose |
|---|---|
| experiment | ML training/evaluation with runs and benchmarks |
| process | Agent workflow (paper writing, review, etc.) |
Each experiment target contains one or more runs:
runs:
  - id: blockwise
    mode: train
    steps: 500
    train:
      phase: blockwise
      lr: 0.0001
  - id: finetune
    mode: train
    steps: 2000
    train:
      phase: global
      lr: 0.00005

Runs execute sequentially within a target.
Topologies define structure (how things connect):

- StackedTopology – A then B then C
- ResidualTopology – x + f(x) (sketched below)
- ParallelTopology – [A(x), B(x)] stacked

Layers define computation (what happens):

- AttentionLayer – Multi-head attention
- SwiGLULayer – Feed-forward network
- RMSNormLayer – Normalization
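To make the structure/computation split concrete, here is a rough PyTorch sketch of what a residual topology wrapping a norm and an attention layer amounts to. This illustrates the pattern only; the classes below are stand-ins, not caramba's RMSNormLayer or AttentionLayer:

import torch
from torch import nn

class Residual(nn.Module):
    """Topology: run the wrapped layers in sequence, then add the input back (x + f(x))."""
    def __init__(self, *layers: nn.Module):
        super().__init__()
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

class SelfAttention(nn.Module):
    """Layer: the computation inside the topology (stand-in for AttentionLayer)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return out

d_model, n_heads = 256, 4
block = Residual(nn.LayerNorm(d_model),  # stand-in for RMSNormLayer
                 SelfAttention(d_model, n_heads))
x = torch.randn(2, 128, d_model)         # (batch, sequence, d_model)
print(block(x).shape)                    # torch.Size([2, 128, 256])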
Attach verification to runs to check model behavior:
runs:
  - id: train
    verify:
      type: compare
      batches: 5
      attention:
        max_mean_l1: 0.05

Verification types:
- compare – Check L1 distance between teacher/student (see the sketch below)
- fidelity – Check NLL/perplexity ratios
- eval – Run behavioral test cases
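The compare check boils down to the mean absolute difference between matched teacher and student activations, tested against the threshold. A rough sketch of that arithmetic (illustrative, not caramba's verifier):

import torch

def max_mean_l1(teacher_outs, student_outs, threshold=0.05):
    """Return (passed, worst): passed is True if every pair's mean |difference| is under the threshold."""
    worst = max((t - s).abs().mean().item() for t, s in zip(teacher_outs, student_outs))
    return worst <= threshold, worst

teacher = [torch.randn(2, 128, 256) for _ in range(5)]      # e.g. attention outputs over 5 batches
student = [t + 0.01 * torch.randn_like(t) for t in teacher]
ok, worst = max_mean_l1(teacher, student)
print(ok, round(worst, 4))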
Measure and compare models after training:
benchmarks:
  - id: perplexity
    config:
      type: perplexity
      num_batches: 100
      models: [teacher, student]

Generates CSV, PNG, and LaTeX artifacts.
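For reference, perplexity is the exponential of the mean next-token negative log-likelihood, which is easy to check by hand (plain PyTorch, not caramba code):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 128, 50257)           # (batch, sequence, vocab) model outputs
targets = torch.randint(0, 50257, (4, 128))   # next-token labels

nll = F.cross_entropy(logits.reshape(-1, 50257), targets.reshape(-1))
print(torch.exp(nll))                         # perplexity = exp(mean NLL)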
Now that you understand the basics:
- Manifest Reference – Complete YAML schema and options
- Layer Reference – All layer types with configurations
- Topology Guide – Building complex architectures
- Training Guide – Standard, upcycle, and orchestrated modes
# Train a Mixture of Experts model
python3 -m caramba config/presets/moe_transformer.yml --dry-run
# Upcycle Llama to DBA (requires HF login)
huggingface-cli login
python3 -m caramba config/presets/llama32_1b_dba.yml --target quick
# Run with full benchmarks
python3 -m caramba config/presets/llama32_1b_dba.yml --target paper