This guide covers how to train and fine-tune the top-layer matting module of LayerD using the Crello dataset.
LayerD's matting module is based on BiRefNet, which can be fine-tuned on custom datasets. We provide:
- Dataset preparation tools for the Crello dataset
- Training scripts with Hydra configuration
- Support for single-GPU and multi-GPU training
- Integration with torch.distributed and Hugging Face Accelerate
Install LayerD with training dependencies:
pip install "git+https://github.com/CyberAgentAILab/LayerD.git#egg=layerd[train]"Or for development:
```bash
git clone https://github.com/CyberAgentAILab/LayerD.git
cd LayerD
uv sync --all-extras --all-groups
```

System requirements:

- Minimum: NVIDIA GPU with 16GB VRAM
- Recommended: NVIDIA A100 40GB or similar
- Multi-GPU: 2-4 GPUs for faster training
- Storage: ~20GB for Crello dataset
- Additional: ~50-100GB for generated training dataset
- Internet: Stable connection for initial dataset download
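Before committing to a long run, it can help to confirm how much VRAM is actually visible to PyTorch. A minimal check (assuming CUDA is installed):

```python
import torch

# Report the total VRAM of each visible GPU; training needs roughly 16GB
# at minimum, and ~35GB for bf16 with the default configuration.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
```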
The Crello dataset is available on HuggingFace and can be automatically downloaded and converted:
```bash
uv run python ./tools/generate_crello_matting.py \
    --output-dir /path/to/dataset \
    --inpainting \
    --save-layers
```

- `--output-dir` (required): Output directory for the prepared dataset
- `--inpainting`: Apply inpainting to create intermediate composite images
- `--save-layers`: Save individual layers (useful for evaluation)
After preparation, the dataset has the following structure:
```
/path/to/dataset/
├── train/
│   ├── im/         # Input images (full composite or intermediate composite)
│   ├── gt/         # Ground-truth alpha mattes (top-layer)
│   ├── composite/  # Full composite images (for evaluation)
│   └── layers/     # Ground-truth RGBA layers (for evaluation)
├── validation/
│   ├── im/
│   ├── gt/
│   ├── composite/
│   └── layers/
└── test/
    ├── im/
    ├── gt/
    ├── composite/
    └── layers/
```
Directories explained:
- `im/`: Input images for training (what the model sees)
- `gt/`: Ground-truth alpha mattes (what the model predicts)
- `composite/`: Full composite images (not used in training; for evaluation)
- `layers/`: Individual layers (not used in training; for evaluation)
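Because training reads matched image/matte pairs from `im/` and `gt/`, it is worth sanity-checking that the filenames line up after preparation. A small sketch (the split path is a placeholder):

```python
from pathlib import Path

# Compare filename stems between im/ and gt/ for one split; any mismatch
# points to an incomplete or failed dataset preparation run.
split = Path("/path/to/dataset/train")
im_stems = {p.stem for p in (split / "im").iterdir()}
gt_stems = {p.stem for p in (split / "gt").iterdir()}
print("images missing a matte:", sorted(im_stems - gt_stems))
print("mattes missing an image:", sorted(gt_stems - im_stems))
```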
First run downloads the Crello dataset (~20GB) from HuggingFace:
- Download time: Depends on connection speed (30 min - 2 hours)
- Processing time: ~2-4 hours on a modern CPU
LayerD uses Hydra for configuration management. The default configuration is in src/layerd/configs/train.yaml.
Key configuration parameters:
```yaml
# Data
data_root: ???      # Must be specified at runtime
batch_size: 4
num_workers: 4

# Model
model_name: "birefnet"
backbone: "pvt_v2_b2"

# Training
epochs: 100
learning_rate: 1e-4
weight_decay: 1e-4

# Checkpoints
ckpt: null          # Load checkpoint and resume optimizer/scheduler state
resume_from: null   # Load model weights only (flexible checkpoint loading)

# Optimization
optimizer: "adamw"
scheduler: "cosine"

# Distributed training
dist: false
use_accelerate: false
mixed_precision: "no"

# Output
out_dir: ???        # Must be specified at runtime
save_interval: 10
```

These parameters must be specified at runtime:
- `data_root`: Path to the prepared dataset directory
- `out_dir`: Path to save training outputs (checkpoints, logs)
Basic training on a single GPU:
```bash
uv run python ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=/path/to/output \
    device=cuda
```

For distributed training across multiple GPUs using torch.distributed:
```bash
CUDA_VISIBLE_DEVICES=0,1 uv run torchrun --standalone --nproc_per_node 2 \
    ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=/path/to/output \
    dist=true
```

Parameters:
- `CUDA_VISIBLE_DEVICES`: Specify which GPUs to use
- `--nproc_per_node`: Number of processes (should match the number of GPUs)
- `dist=true`: Enable distributed training mode
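Under the hood, `torchrun` sets rank-related environment variables for each process it spawns, and `dist=true` tells the training script to initialize a process group and wrap the model for gradient synchronization. Roughly, in standard PyTorch terms (a generic sketch, not LayerD's exact code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every spawned process.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# A stand-in model; the training script would build BiRefNet here instead.
model = torch.nn.Linear(8, 1).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradients sync across GPUs
```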
For distributed training with mixed precision using Accelerate:
```bash
CUDA_VISIBLE_DEVICES=0,1 uv run torchrun --standalone --nproc_per_node 2 \
    ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=/path/to/output \
    use_accelerate=true \
    mixed_precision=bf16
```

Mixed precision options:
- `no`: Full precision (float32)
- `fp16`: Half precision (float16)
- `bf16`: BFloat16 (recommended for A100 GPUs)
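With `use_accelerate=true`, precision handling is delegated to Hugging Face Accelerate rather than managed by hand. In outline (a generic Accelerate sketch, not LayerD's actual training loop):

```python
import torch
from accelerate import Accelerator

# mixed_precision mirrors the config value: "no", "fp16", or "bf16".
accelerator = Accelerator(mixed_precision="bf16")

model = torch.nn.Linear(8, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# Forward passes run under autocast; backward goes through the accelerator,
# which also applies loss scaling automatically when fp16 is selected.
loss = model(torch.randn(4, 8, device=accelerator.device)).mean()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
```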
Override any configuration parameter from the command line:
```bash
uv run python ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=/path/to/output \
    batch_size=8 \
    learning_rate=5e-5 \
    epochs=50 \
    device=cuda
```

LayerD provides two options for loading checkpoints during training:
The `ckpt` parameter loads a complete checkpoint including model weights, optimizer state, and scheduler state. Use this to resume interrupted training:
```bash
uv run python ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=/path/to/output \
    ckpt=/path/to/checkpoint.pth \
    device=cuda
```

This continues training from the exact state where it was interrupted, including the learning rate schedule and optimizer momentum.
The `resume_from` parameter loads only the model weights, without optimizer or scheduler state. Use this for:
- Fine-tuning from a pre-trained model
- Starting fresh training with initialized weights
- Transfer learning scenarios
```bash
uv run python ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=/path/to/output \
    resume_from=/path/to/weights.pth \
    device=cuda
```

Key differences:
| Parameter | Loads Model Weights | Loads Optimizer State | Loads Scheduler State | Use Case |
|---|---|---|---|---|
| `ckpt` | ✓ | ✓ | ✓ | Resume interrupted training |
| `resume_from` | ✓ | ✗ | ✗ | Fine-tuning, transfer learning |
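The distinction maps onto standard PyTorch checkpoint handling. A hedged sketch of the two load paths (the state-dict key names are assumptions, not LayerD's confirmed checkpoint format):

```python
import torch

model = torch.nn.Linear(8, 1)  # stand-in for the matting network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

checkpoint = torch.load("/path/to/checkpoint.pth", map_location="cpu")

# ckpt-style resume: restore everything, so training continues exactly
# where it stopped (assumed keys: "model", "optimizer", "scheduler").
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])

# resume_from-style load: model weights only; optimizer and scheduler
# start fresh, as you would want for fine-tuning or transfer learning.
model.load_state_dict(checkpoint["model"])
```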
Quick test (single GPU, small dataset):
```bash
uv run python ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=./outputs/test_run \
    epochs=10 \
    device=cuda
```

Production training (4 GPUs, full dataset, mixed precision):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run torchrun --standalone --nproc_per_node 4 \
    ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=./outputs/production \
    use_accelerate=true \
    mixed_precision=bf16 \
    batch_size=4 \
    epochs=100
```

With the default configuration (train.yaml):

- 4x A100 40GB with `mixed_precision=bf16`: ~40 hours
- 2x A100 40GB with `mixed_precision=bf16`: ~80 hours
- 1x A100 40GB with `mixed_precision=bf16`: ~160 hours
Training time scales approximately inversely with the number of GPUs: doubling the GPU count roughly halves the wall-clock time.
- GPU memory:
  - Single GPU (bf16): ~35GB
  - Single GPU (fp32): ~40GB+ (may not fit on smaller GPUs)
- Disk space:
  - Checkpoints: ~1GB per checkpoint
  - Logs and metrics: ~100MB
  - Total: ~10-15GB for a full training run
Training saves outputs to the specified `out_dir`:
```
out_dir/
├── checkpoints/
│   ├── epoch_10.pth
│   ├── epoch_20.pth
│   └── best.pth
├── logs/
│   └── train.log
└── config.yaml   # Saved configuration
```
Training logs include:
- Loss values (per batch and per epoch)
- Learning rate schedule
- Training/validation metrics
- Timing information
View logs in real-time:
```bash
tail -f out_dir/logs/train.log
```

If you want to use TensorBoard for visualization, modify the training script to add TensorBoard logging, as sketched below.
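A minimal sketch of what that modification could look like, using PyTorch's built-in `SummaryWriter` (the metric names are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

# Write scalars under the run's logs directory so TensorBoard can find them:
#   tensorboard --logdir /path/to/output/logs
writer = SummaryWriter(log_dir="/path/to/output/logs")

for step in range(100):        # stand-in for the training loop
    loss = 1.0 / (step + 1)    # placeholder metric
    writer.add_scalar("train/loss", loss, step)
writer.close()
```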
After training, use your custom weights for inference:
```bash
uv run python ./tools/infer.py \
    --input "data/*.png" \
    --output-dir outputs/ \
    --matting-weight-path /path/to/output/checkpoints/best.pth \
    --device cuda
```

Or in Python:
```python
from layerd import LayerD

# Load the base model architecture; custom weights are applied manually
# after initialization (see the sketch below).
layerd = LayerD(
    matting_hf_card="cyberagent/layerd-birefnet",
)
# Note: a direct custom-weight loading API may need implementation.
```
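Until a dedicated API exists, one workable approach is to load your trained state dict into the matting network by hand. A hedged sketch (the `layerd.matting.model` attribute path and the `"model"` key are assumptions; inspect the `LayerD` object to locate the actual module):

```python
import torch
from layerd import LayerD

layerd = LayerD(matting_hf_card="cyberagent/layerd-birefnet")

# Assumed attribute path and checkpoint key, not a confirmed API; fall back
# to the raw state dict if the checkpoint has no "model" entry.
state = torch.load("/path/to/output/checkpoints/best.pth", map_location="cpu")
layerd.matting.model.load_state_dict(state.get("model", state))
```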
Evaluate your trained model on the test set:

```bash
# First, run inference on the test set
uv run python ./tools/infer.py \
    --input /path/to/dataset/test/composite/ \
    --output-dir /path/to/predictions/ \
    --matting-weight-path /path/to/output/checkpoints/best.pth \
    --device cuda

# Then evaluate the predictions
uv run python ./tools/evaluate.py \
    --pred-dir /path/to/predictions/ \
    --gt-dir /path/to/dataset/test/layers/ \
    --output-dir /path/to/eval_results/ \
    --max-edits 5
```

See evaluation.md for more details.
Problem: CUDA out of memory during training
Solutions:
- Reduce the batch size: `batch_size=2` or `batch_size=1`
- Use mixed precision: `mixed_precision=bf16`
- Use gradient accumulation (requires modifying the training script; see the sketch below)
Problem: Training is very slow
Solutions:
- Use multiple GPUs with `dist=true` or `use_accelerate=true`
- Use mixed precision: `mixed_precision=bf16`
- Increase `num_workers` for data loading: `num_workers=8`
Problem: NaN loss during training
Solutions:
- Reduce the learning rate: `learning_rate=5e-5`
- Check the dataset for corrupted images
- Use mixed precision with bf16 instead of fp16
Problem: Dataset download fails
Solutions:
- Check internet connection
- Ensure sufficient disk space (~20GB)
- Try again (downloads can be resumed)
For more troubleshooting help, see troubleshooting.md.
The default configuration uses pre-trained BiRefNet weights. Fine-tuning from these weights is recommended for:
- Faster convergence
- Better performance on small datasets
- Domain-specific customization
Key hyperparameters to adjust:
- Learning rate: Start with `1e-4`; reduce it if training is unstable
- Batch size: Larger is better, if memory allows
- Epochs: Monitor validation loss to determine the optimal stopping point
- Weight decay: Controls regularization (default: `1e-4`)
The training pipeline includes data augmentation. You can modify augmentation in the dataset implementation:
- Horizontal flipping
- Random cropping
- Color jittering
- Scaling
See src/layerd/matting/birefnet/dataset.py for details.
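One subtlety worth keeping in mind if you modify these: geometric augmentations (flips, crops, scaling) must be applied with identical parameters to the input image and its alpha matte, while photometric ones (color jittering) should touch only the image. A generic torchvision sketch, as an illustration rather than the actual dataset code:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def paired_augment(image, matte):
    # Geometric ops share the same parameters for image and matte.
    if random.random() < 0.5:
        image, matte = TF.hflip(image), TF.hflip(matte)
    i, j, h, w = T.RandomCrop.get_params(image, output_size=(512, 512))
    image, matte = TF.crop(image, i, j, h, w), TF.crop(matte, i, j, h, w)
    # Photometric ops apply to the image only; the matte must stay intact.
    image = TF.adjust_brightness(image, 0.8 + 0.4 * random.random())
    return image, matte
```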
To train on a custom dataset:
1. Prepare your dataset in the same structure as Crello:

   ```
   dataset/
   ├── train/
   │   ├── im/   # Input composite images
   │   └── gt/   # Ground-truth alpha mattes
   └── validation/
       ├── im/
       └── gt/
   ```

2. Ensure images are paired (same filename in `im/` and `gt/`).

3. Run training with your dataset path:

   ```bash
   uv run python ./tools/train.py \
       config_path=./src/layerd/configs/train.yaml \
       data_root=/path/to/custom_dataset \
       out_dir=/path/to/output \
       device=cuda
   ```
For multi-node distributed training, modify the torchrun command:
```bash
# On each node, run:
torchrun \
    --nnodes=2 \
    --nproc_per_node=4 \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    ./tools/train.py \
    config_path=./src/layerd/configs/train.yaml \
    data_root=/path/to/dataset \
    out_dir=/path/to/output \
    dist=true
```

We thank the authors of BiRefNet for releasing their code, which we used as the basis for our matting backbone.
- Installation Guide - Setup for training
- Evaluation Guide - Evaluate trained models
- Architecture - Understanding the model
- Development Guide - Development workflows