
24.5. Batch Processing Support

FerrisMind edited this page Sep 10, 2025 · 1 revision


Table of Contents

  1. Introduction
  2. Batch Processing Implementation
  3. GenerateContext for Concurrent Requests
  4. Memory Management Strategies
  5. Batch Inference Across Model Architectures
  6. Performance Benchmarks
  7. Resource Allocation Considerations
  8. Best Practices for Batch Size Optimization

Introduction

This document provides comprehensive documentation on batch processing support within the Oxide-Lab repository. It covers the system's ability to efficiently handle multiple inference requests using shared context and model state, with a focus on the Batcher implementation, memory management, performance characteristics, and optimization strategies for different hardware configurations and quantization levels.

Section sources

  • batcher.rs

Batch Processing Implementation

The batch processing system is implemented through the Batcher struct in the candle-datasets crate, which enables efficient handling of multiple tensor inputs by grouping them into batches for simultaneous processing. This approach significantly improves throughput by leveraging parallel computation capabilities of modern GPUs and CPUs.

The Batcher struct provides a flexible interface for creating batches from various iterator types, supporting both single tensors and tensor pairs (commonly used for input-output pairs in training). The implementation uses type specialization through wrapper structs (Iter1, Iter2, IterResult1, IterResult2) to handle different iterator signatures while maintaining a consistent batching interface.

pub struct Batcher<I> {
    inner: I,
    batch_size: usize,
    return_last_incomplete_batch: bool,
}

Key features of the batcher implementation include:

  • Configurable batch size via the batch_size() method
  • Option to return incomplete final batches via return_last_incomplete_batch()
  • Support for both regular and result-wrapped iterators
  • Efficient memory allocation with pre-sized vectors

The core batching logic is implemented in the Iterator trait for different wrapper types. For example, the implementation for Batcher<Iter1<I>> collects individual tensors into a batch and stacks them along dimension 0:

impl<I: Iterator<Item = Tensor>> Iterator for Batcher<Iter1<I>> {
    type Item = Result<Tensor>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut items = Vec::with_capacity(self.batch_size);
        for _i in 0..self.batch_size {
            match self.inner.inner.next() {
                Some(item) => items.push(item),
                None => {
                    if self.return_last_incomplete_batch && !items.is_empty() {
                        break;
                    }
                    return None;
                }
            }
        }
        Some(Tensor::stack(&items, 0))
    }
}

This implementation efficiently handles the common case where a complete batch is available, while also providing graceful handling of the final batch which may be smaller than the configured batch size when the total number of items is not evenly divisible by the batch size.
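The same batching pattern can be sketched without the candle dependency. In the sketch below the `Tensor::stack` call is replaced by collecting items into a `Vec`, which is enough to illustrate how the final, possibly incomplete batch is handled; the `batches` function is illustrative, not part of the Oxide-Lab API:

```rust
/// Minimal batching sketch: groups an iterator's items into fixed-size
/// batches, optionally emitting a smaller final batch.
fn batches<I: Iterator<Item = u32>>(
    mut inner: I,
    batch_size: usize,
    return_last_incomplete_batch: bool,
) -> Vec<Vec<u32>> {
    let mut out = Vec::new();
    loop {
        let mut items = Vec::with_capacity(batch_size);
        for _ in 0..batch_size {
            match inner.next() {
                Some(item) => items.push(item),
                None => break,
            }
        }
        let full = items.len() == batch_size;
        if full || (return_last_incomplete_batch && !items.is_empty()) {
            out.push(items);
            if !full {
                return out;
            }
        } else {
            return out;
        }
    }
}

fn main() {
    // 7 items with batch_size 3: two full batches plus one incomplete batch
    let kept = batches(0..7u32, 3, true);
    assert_eq!(kept, vec![vec![0, 1, 2], vec![3, 4, 5], vec![6]]);
    // With return_last_incomplete_batch = false the trailing item is dropped
    let dropped = batches(0..7u32, 3, false);
    assert_eq!(dropped, vec![vec![0, 1, 2], vec![3, 4, 5]]);
}
```

Running this with seven inputs and a batch size of three yields two full batches plus a final singleton batch when `return_last_incomplete_batch` is set, mirroring the `Batcher` behavior described above.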

Section sources

  • batcher.rs

GenerateContext for Concurrent Requests

The system manages concurrent generation requests through the ContextSlice struct, which handles the management of token sequences and context windows for text generation tasks. This component plays a crucial role in efficiently managing shared context across multiple generation requests.

pub struct ContextSlice {
    pub encoded_len: usize,
    pub base_context_len: usize,
    pub effective_context_tokens: Vec<u32>,
}

The ContextSlice is initialized with a full sequence of context tokens and a length limit, creating an efficient representation that includes only the relevant portion of the context based on the specified limit:

impl ContextSlice {
    pub fn new(full_context_tokens: Vec<u32>, limit: usize) -> Self {
        let encoded_len = full_context_tokens.len();
        let effective_context_tokens = if encoded_len > limit && limit > 0 {
            let start = encoded_len - limit;
            full_context_tokens[start..].to_vec()
        } else {
            full_context_tokens.clone()
        };
        let base_context_len = effective_context_tokens.len();
        Self { encoded_len, base_context_len, effective_context_tokens }
    }
}

This implementation ensures that only the most relevant portion of the context (typically the most recent tokens) is retained when the context exceeds the specified limit, which is essential for maintaining performance while respecting memory constraints. The structure tracks both the original encoded length and the actual length of the effective context, providing important metadata for downstream processing.

While the current implementation focuses on managing individual context slices, it provides the foundation for handling concurrent generation requests by ensuring that each request has access to an appropriately sized context window without unnecessary memory overhead.
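Because ContextSlice depends only on the standard library, its truncation behavior can be exercised directly. The example below reproduces the struct from above and checks both the over-limit and under-limit cases:

```rust
pub struct ContextSlice {
    pub encoded_len: usize,
    pub base_context_len: usize,
    pub effective_context_tokens: Vec<u32>,
}

impl ContextSlice {
    pub fn new(full_context_tokens: Vec<u32>, limit: usize) -> Self {
        let encoded_len = full_context_tokens.len();
        let effective_context_tokens = if encoded_len > limit && limit > 0 {
            // Keep only the most recent `limit` tokens
            let start = encoded_len - limit;
            full_context_tokens[start..].to_vec()
        } else {
            full_context_tokens.clone()
        };
        let base_context_len = effective_context_tokens.len();
        Self { encoded_len, base_context_len, effective_context_tokens }
    }
}

fn main() {
    // Over the limit: only the last 3 tokens survive
    let truncated = ContextSlice::new(vec![10, 11, 12, 13, 14], 3);
    assert_eq!(truncated.encoded_len, 5);
    assert_eq!(truncated.base_context_len, 3);
    assert_eq!(truncated.effective_context_tokens, vec![12, 13, 14]);

    // Under the limit: the context is kept whole
    let whole = ContextSlice::new(vec![10, 11], 3);
    assert_eq!(whole.base_context_len, 2);
}
```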

Section sources

  • ctx.rs

Memory Management Strategies

The system employs several sophisticated memory management strategies to efficiently handle multiple prompts simultaneously while minimizing memory overhead and maximizing performance.

Batcher Memory Optimization

The Batcher implementation uses pre-allocated vectors with capacity matching the batch size to avoid repeated memory allocations during batch construction:

let mut items = Vec::with_capacity(self.batch_size);

This strategy reduces memory fragmentation and allocation overhead, particularly important when processing large numbers of small batches. The batcher also handles error conditions gracefully by collecting errors from individual items and returning the first encountered error, preventing partial batch processing from consuming additional memory.

Device-Level Memory Management

At the device level, the system implements a sophisticated memory allocation strategy through the Device enum and associated backend implementations. The metal backend, for example, includes a buffer pooling mechanism that reuses previously allocated buffers when possible:

/// Simple allocator struct.
/// The buffers are stored in size buckets since ML tends to use similar shapes over and over.
/// We store the buffers in [`Arc`] because it's much faster than Obj-c internal ref counting
/// (could be linked to FFI communication overhead).
///
/// Whenever a buffer has a strong_count==1, we can reuse it, it means it was dropped in the
/// graph calculation, and only we the allocator kept a reference to it, therefore it's free
/// to be reused.
pub(crate) buffers: Arc<RwLock<BufferMap>>,

This buffer pooling approach significantly reduces the overhead of frequent memory allocations and deallocations by reusing buffers of similar sizes. The system maintains buffers in size-based buckets, recognizing that machine learning workloads typically involve tensors with similar dimensions.
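A drastically simplified version of this size-bucketed pool can be written with a `HashMap` keyed on buffer size; the `strong_count == 1` check mirrors the reuse rule described in the comment above. The `Vec<u8>` payload here is a plain stand-in for the Metal buffer type, and the `BufferPool` name is illustrative:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Size-bucketed buffer pool sketch: buffers with no outstanding users
/// (strong_count == 1) are handed out again instead of reallocating.
struct BufferPool {
    buckets: HashMap<usize, Vec<Arc<Vec<u8>>>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { buckets: HashMap::new() }
    }

    fn allocate(&mut self, size: usize) -> Arc<Vec<u8>> {
        let bucket = self.buckets.entry(size).or_default();
        // Reuse any buffer that only the pool still references
        if let Some(buf) = bucket.iter().find(|b| Arc::strong_count(b) == 1) {
            return Arc::clone(buf);
        }
        // Otherwise allocate a fresh buffer and remember it for later reuse
        let buf = Arc::new(vec![0u8; size]);
        bucket.push(Arc::clone(&buf));
        buf
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let a = pool.allocate(1024);
    // `a` is still alive, so a second request gets a distinct buffer
    let b = pool.allocate(1024);
    assert!(!Arc::ptr_eq(&a, &b));
    drop(a);
    drop(b);
    // Both buffers were dropped by callers; the next request reuses one
    let _c = pool.allocate(1024);
    assert_eq!(pool.buckets[&1024].len(), 2);
}
```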

Zero Initialization and Uninitialized Allocation

The system provides both zero-initialized and uninitialized memory allocation paths, allowing for optimization based on usage patterns:

fn zeros_impl(&self, _shape: &Shape, _dtype: DType) -> Result<Self::Storage>;
unsafe fn alloc_uninit(&self, _shape: &Shape, _dtype: DType) -> Result<Self::Storage>;

The zero-initialization path is used when tensors need to be initialized to zero values, while the uninitialized allocation path (marked as unsafe) allows for more efficient memory allocation when the tensor will be immediately overwritten with computed values.
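In safe Rust the same trade-off appears as the difference between allocating zero-filled storage and reserving capacity that is written exactly once. A std-only sketch of the two paths (not the candle backend code):

```rust
/// Zero-initialized path: every element is written twice
/// (once with zero, once with the computed value).
fn fill_zeroed(n: usize, f: impl Fn(usize) -> f32) -> Vec<f32> {
    let mut v = vec![0.0f32; n];
    for (i, slot) in v.iter_mut().enumerate() {
        *slot = f(i);
    }
    v
}

/// Capacity-reserved path: storage is written exactly once, which is
/// the safe analogue of alloc_uninit followed by an immediate overwrite.
fn fill_once(n: usize, f: impl Fn(usize) -> f32) -> Vec<f32> {
    let mut v = Vec::with_capacity(n);
    v.extend((0..n).map(f));
    v
}

fn main() {
    let a = fill_zeroed(4, |i| i as f32 * 2.0);
    let b = fill_once(4, |i| i as f32 * 2.0);
    // Both paths produce identical contents; only the write pattern differs
    assert_eq!(a, b);
    assert_eq!(b, vec![0.0, 2.0, 4.0, 6.0]);
}
```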

Cross-Device Memory Management

The system supports multiple device types (CPU, CUDA, Metal) with consistent memory management interfaces, enabling portable code that can adapt to available hardware resources:

pub enum Device {
    Cpu,
    Cuda(crate::CudaDevice),
    Metal(crate::MetalDevice),
}

Each device type implements the same memory allocation and management interface, allowing the batching system to operate efficiently regardless of the underlying hardware.

Section sources

  • batcher.rs
  • device.rs

Batch Inference Across Model Architectures

The system supports batch inference across various model architectures through a combination of tensor operations and model-specific implementations. The quantized Llama model provides a representative example of how batch processing is implemented for transformer-based architectures.

Quantized Model Implementation

The quantized Llama implementation demonstrates how batch processing is supported in memory-constrained environments:

struct QMatMul {
    inner: candle::quantized::QMatMul,
    span: tracing::Span,
}

The QMatMul wrapper provides matrix multiplication operations for quantized tensors, which is a fundamental operation in transformer models. This implementation maintains the same interface as regular matrix multiplication while operating on quantized data, enabling batch processing with reduced memory footprint.

Multi-Layer Processing

Transformer models process batches through multiple layers, with each layer maintaining its own state but operating on the entire batch simultaneously:

impl LayerWeights {
    fn forward_attn(
        &mut self,
        x: &Tensor,
        mask: Option<&Tensor>,
        index_pos: usize,
    ) -> Result<Tensor> {
        let (b_sz, seq_len, n_embd) = x.dims3()?;
        let q = self.attention_wq.forward(x)?;
        let k = self.attention_wk.forward(x)?;
        let v = self.attention_wv.forward(x)?;
        
        // Reshape and transpose for attention computation
        let q = q.reshape((b_sz, seq_len, self.n_head, self.head_dim))?.transpose(1, 2)?;
        let k = k.reshape((b_sz, seq_len, self.n_kv_head, self.head_dim))?.transpose(1, 2)?;
        // ... (rotary embeddings, KV caching, and the attention computation follow)
    }
}

This implementation shows how a single batch of inputs (with dimensions batch_size × sequence_length × embedding_dimension) is processed through the attention mechanism, with queries, keys, and values computed for all sequences in the batch simultaneously.
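The shape bookkeeping in those reshape and transpose calls can be checked with plain arithmetic. The helper below mirrors the reshape-then-transpose step at the level of dimension tuples only; it is illustrative, not candle's Tensor API:

```rust
/// Mirrors `q.reshape((b, s, h, d))?.transpose(1, 2)?` at the level of
/// shapes: (batch, seq, heads * head_dim) -> (batch, heads, seq, head_dim).
fn attention_shape(
    b_sz: usize,
    seq_len: usize,
    n_head: usize,
    head_dim: usize,
) -> [usize; 4] {
    // reshape: (b, s, h*d) -> (b, s, h, d)
    let reshaped = [b_sz, seq_len, n_head, head_dim];
    // transpose(1, 2): swap the sequence and head axes -> (b, h, s, d)
    [reshaped[0], reshaped[2], reshaped[1], reshaped[3]]
}

fn main() {
    // A batch of 4 sequences, 128 tokens each, 32 heads of width 64
    let shape = attention_shape(4, 128, 32, 64);
    assert_eq!(shape, [4, 32, 128, 64]);
    // The element count is preserved across reshape and transpose
    assert_eq!(shape.iter().product::<usize>(), 4 * 128 * 32 * 64);
}
```

Placing the head axis before the sequence axis lets the subsequent attention matmuls treat each head of each batch element as an independent matrix multiplication.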

Mixture of Experts (MoE) Support

The system also supports advanced architectures like Mixture of Experts, which presents unique challenges for batch processing:

enum MlpOrMoe {
    Mlp(Mlp),
    MoE {
        n_expert_used: usize,
        feed_forward_gate_inp: QMatMul,
        experts: Vec<Mlp>,
    },
}

In MoE architectures, different sequences in a batch may be routed to different experts, requiring careful management of the routing weights and expert selection:

// Extract top-k experts for each sequence
let mut top_x = vec![vec![]; experts.len()];
let mut selected_rws = vec![vec![]; experts.len()];
for (row_idx, rw) in routing_weights.iter().enumerate() {
    let mut dst = (0..rw.len() as u32).collect::<Vec<u32>>();
    dst.sort_by(|&i, &j| rw[j as usize].total_cmp(&rw[i as usize]));
    let mut sum_routing_weights = 0f32;
    for &expert_idx in dst.iter().take(*n_expert_used) {
        let expert_idx = expert_idx as usize;
        let routing_weight = rw[expert_idx];
        sum_routing_weights += routing_weight;
        top_x[expert_idx].push(row_idx as u32);
    }
    // Normalize routing weights
    for &expert_idx in dst.iter().take(*n_expert_used) {
        let expert_idx = expert_idx as usize;
        let routing_weight = rw[expert_idx];
        selected_rws[expert_idx].push(routing_weight / sum_routing_weights)
    }
}

This sophisticated routing mechanism allows the model to dynamically allocate computational resources based on the input, with different sequences in the same batch potentially using different subsets of the model's parameters.
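The routing step can be exercised on a toy batch. The function below applies the same sort-by-weight, take-top-k, normalize sequence to plain f32 rows; the `route_top_k` name and data are illustrative:

```rust
/// For each row of routing weights, pick the top-k experts and
/// return (expert index, normalized weight) pairs per row.
fn route_top_k(routing_weights: &[Vec<f32>], k: usize) -> Vec<Vec<(usize, f32)>> {
    routing_weights
        .iter()
        .map(|rw| {
            // Sort expert indices by descending routing weight
            let mut dst: Vec<usize> = (0..rw.len()).collect();
            dst.sort_by(|&i, &j| rw[j].total_cmp(&rw[i]));
            let top: Vec<usize> = dst.into_iter().take(k).collect();
            // Renormalize so the selected weights sum to 1
            let sum: f32 = top.iter().map(|&i| rw[i]).sum();
            top.iter().map(|&i| (i, rw[i] / sum)).collect()
        })
        .collect()
}

fn main() {
    // Two sequences routed over four experts, keeping the top 2 per sequence
    let weights = vec![
        vec![1.0, 3.0, 1.0, 1.0],
        vec![2.0, 2.0, 1.0, 1.0],
    ];
    let routed = route_top_k(&weights, 2);
    // Row 0 selects experts 1 and 0; weights 3/4 and 1/4 after normalization
    assert_eq!(routed[0], vec![(1, 0.75), (0, 0.25)]);
    // Row 1 splits evenly between experts 0 and 1
    assert_eq!(routed[1], vec![(0, 0.5), (1, 0.5)]);
}
```

Note that the two rows end up on different expert subsets, which is exactly the property that makes MoE batching harder than dense batching: each expert must gather only the rows routed to it.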

Section sources

  • quantized_llama.rs

Performance Benchmarks

The system includes comprehensive performance benchmarks that measure the efficiency of batch processing operations across different hardware configurations.

Matrix Multiplication Benchmark

The core computational operation in most deep learning models is matrix multiplication, which is benchmarked to evaluate performance characteristics:

fn run_bench(c: &mut Criterion, device: &Device) {
    let b = 1;
    let m = 1;
    let n = 2048;
    let k = 2048;

    let dtype = DType::F32;
    let lhs = Tensor::zeros((b, m, k), dtype, device).unwrap();
    let rhs = Tensor::zeros((b, n, k), dtype, device).unwrap();

    let flops = b * m * n * k;

    let mut group = c.benchmark_group(device.bench_name("matmul"));
    group.throughput(Throughput::Bytes(flops as u64));
    group.bench_function("iter", move |b| {
        b.iter_custom(|iters| {
            let start = Instant::now();
            for _i in 0..iters {
                run(black_box(&lhs), black_box(&rhs));
            }
            device.sync().unwrap();
            start.elapsed()
        })
    });
    group.finish();
}

This benchmark measures the performance of matrix multiplication operations, reporting throughput as the number of floating-point operations processed per second (the FLOP count is passed to Criterion through its Throughput::Bytes type, so the reported "bytes" figures are really FLOPs). The benchmark is designed to run on multiple devices (CPU, CUDA, Metal) to provide comparative performance metrics across different hardware platforms.

Key Performance Metrics

The benchmarks capture several important performance characteristics:

  • Throughput: Measured in floating-point operations per second (FLOPS)
  • Latency: Time to complete individual operations
  • Memory bandwidth utilization: How effectively the system utilizes available memory bandwidth
  • Device synchronization overhead: Time spent synchronizing operations across devices

The benchmarking framework uses Criterion, a statistical benchmarking library, to provide reliable and reproducible performance measurements. This allows for meaningful comparisons between different configurations and optimization strategies.
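The iterate-then-sync timing pattern used above can be reproduced with std::time alone. The sketch below times a naive dot-product kernel and derives a FLOP/s figure the way the Criterion harness does, minus the statistical machinery; the kernel and sizes are illustrative:

```rust
use std::time::Instant;

/// Naive kernel: dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let n = 1 << 16;
    let a = vec![1.0f32; n];
    let b = vec![2.0f32; n];

    // Time a fixed number of iterations, mirroring iter_custom above;
    // black_box prevents the compiler from optimizing the work away.
    let iters = 100;
    let start = Instant::now();
    let mut acc = 0.0f32;
    for _ in 0..iters {
        acc += dot(std::hint::black_box(&a), std::hint::black_box(&b));
    }
    let elapsed = start.elapsed();

    // One multiply and one add per element -> 2 * n FLOPs per iteration
    let flops = 2.0 * n as f64 * iters as f64 / elapsed.as_secs_f64();
    println!("{:.2} GFLOP/s (acc = {})", flops / 1e9, acc);
    assert_eq!(dot(&a, &b), 2.0 * n as f32);
}
```

On a GPU backend the equivalent of the `device.sync()` call in the Criterion benchmark is essential: without it, the timer stops before the queued kernels have actually finished.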

Section sources

  • matmul.rs

Resource Allocation Considerations

The system's resource allocation strategy is designed to efficiently utilize available CPU and GPU resources while adapting to different hardware configurations and workload requirements.

Device Abstraction and Selection

The Device enum provides a unified interface for different hardware backends, enabling portable code that can adapt to available resources:

pub enum Device {
    Cpu,
    Cuda(crate::CudaDevice),
    Metal(crate::MetalDevice),
}

This abstraction allows the system to automatically select the most appropriate device based on availability and performance characteristics. For example, the cuda_if_available() method attempts to create a CUDA device but falls back to CPU if CUDA is not available:

pub fn cuda_if_available(ordinal: usize) -> Result<Self> {
    if crate::utils::cuda_is_available() {
        Self::new_cuda(ordinal)
    } else {
        Ok(Self::Cpu)
    }
}

Memory Type Support

Different devices support different memory types and precision formats, which affects resource allocation decisions:

pub fn supports_bf16(&self) -> bool {
    match self {
        Self::Cuda(_) | Self::Metal(_) => true,
        Self::Cpu => false,
    }
}

pub fn bf16_default_to_f32(&self) -> DType {
    if self.supports_bf16() {
        DType::BF16
    } else {
        DType::F32
    }
}

This capability detection allows the system to optimize memory usage by selecting appropriate data types based on device capabilities. For example, BF16 (Brain Floating Point 16) format can reduce memory usage by 50% compared to F32 while maintaining reasonable numerical precision, but is only supported on certain GPU architectures.
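The 50% figure is straightforward to verify with a back-of-the-envelope calculation; the 7B parameter count below is an illustrative choice, not a reference to a specific Oxide-Lab model:

```rust
/// Bytes needed to hold `params` parameters at `bytes_per_param` width.
fn model_bytes(params: u64, bytes_per_param: u64) -> u64 {
    params * bytes_per_param
}

fn main() {
    let params = 7_000_000_000u64; // illustrative 7B-parameter model
    let f32_bytes = model_bytes(params, 4); // F32: 4 bytes per parameter
    let bf16_bytes = model_bytes(params, 2); // BF16: 2 bytes per parameter

    // BF16 halves the weight storage relative to F32
    assert_eq!(bf16_bytes * 2, f32_bytes);
    println!(
        "F32: {:.0} GB, BF16: {:.0} GB",
        f32_bytes as f64 / 1e9,
        bf16_bytes as f64 / 1e9
    );
}
```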

Dynamic Resource Management

The system implements dynamic resource management through various strategies:

  • Buffer pooling: Reusing allocated memory buffers to reduce allocation overhead
  • Lazy initialization: Delaying resource allocation until needed
  • Automatic fallback: Gracefully degrading to CPU when GPU resources are unavailable
  • Memory pressure awareness: Adapting batch sizes based on available memory

These strategies ensure that the system can operate efficiently across a wide range of hardware configurations, from high-end GPUs with abundant memory to CPU-only systems with limited resources.
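Lazy initialization, for example, maps naturally onto std::sync::OnceLock: the resource is created on first use and shared afterwards. This is a std-only sketch of the pattern, not the Oxide-Lab implementation, and `Resource` is a hypothetical stand-in for an expensive device object:

```rust
use std::sync::OnceLock;

/// Stand-in for an expensive device resource (e.g. a compiled kernel).
struct Resource {
    id: u32,
}

static RESOURCE: OnceLock<Resource> = OnceLock::new();

/// Returns the shared resource, creating it only on the first call.
fn resource() -> &'static Resource {
    RESOURCE.get_or_init(|| {
        // Expensive setup would happen here, exactly once
        Resource { id: 42 }
    })
}

fn main() {
    let first = resource();
    let second = resource();
    // Both calls observe the same lazily created instance
    assert!(std::ptr::eq(first, second));
    assert_eq!(first.id, 42);
}
```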

Section sources

  • device.rs

Best Practices for Batch Size Optimization

Optimizing batch size is crucial for achieving optimal performance while respecting resource constraints. The following best practices are derived from the system's implementation and performance characteristics.

GPU Resource Optimization

For GPU-based inference, batch size should be optimized based on available VRAM and computational capabilities:

  1. Start with moderate batch sizes: Begin with batch sizes of 8-16 and adjust based on performance and memory usage
  2. Monitor memory utilization: Ensure batch processing does not exceed available VRAM
  3. Consider tensor core utilization: For NVIDIA GPUs, align batch sizes with tensor core dimensions (typically multiples of 8 or 16)

// Example: Configuring batch size based on available resources
let batcher = Batcher::new1(iterator)
    .batch_size(optimal_batch_size)
    .return_last_incomplete_batch(true);

Quantization Level Considerations

The choice of quantization level significantly impacts optimal batch size:

| Quantization | Memory per Parameter | Recommended Batch Size Multiplier |
| ------------ | -------------------- | --------------------------------- |
| FP32         | 4 bytes              | 1× (baseline)                     |
| FP16/BF16    | 2 bytes              | ~2×                               |
| INT8         | 1 byte               | ~4×                               |
| INT4         | 0.5 bytes            | ~8×                               |

Higher quantization levels allow for larger batch sizes due to reduced memory footprint, potentially improving throughput despite slightly reduced numerical precision.
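When the per-item footprint is dominated by quantized weights, the admissible batch size for a fixed memory budget scales inversely with the parameter width. A small sketch of that arithmetic, with an illustrative budget and parameter count:

```rust
/// Largest batch size whose weights-dominated footprint fits the budget.
/// Assumes, for illustration, that per-item memory scales linearly with
/// the quantized parameter width.
fn max_batch(budget_bytes: u64, params: u64, bits_per_param: u64) -> u64 {
    let bytes_per_item = params * bits_per_param / 8;
    budget_bytes / bytes_per_item
}

fn main() {
    let budget = 64_000_000_000u64; // illustrative 64 GB memory budget
    let params = 1_000_000_000u64; // illustrative 1B parameters

    let fp32 = max_batch(budget, params, 32);
    let int8 = max_batch(budget, params, 8);
    let int4 = max_batch(budget, params, 4);

    // Quartering / eighthing the width scales the batch size accordingly
    assert_eq!(int8, fp32 * 4);
    assert_eq!(int4, fp32 * 8);
    println!("fp32: {fp32}, int8: {int8}, int4: {int4}");
}
```

In practice the KV cache and activations also consume memory per batch item, so the real multipliers are somewhat smaller than this weights-only model suggests.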

CPU vs GPU Trade-offs

When operating on CPU, different considerations apply:

  1. Memory bandwidth limitations: CPUs typically have lower memory bandwidth than GPUs, making smaller batch sizes more efficient
  2. Cache utilization: Smaller batches may fit better in CPU caches, reducing memory access latency
  3. Parallelization: Utilize multiple CPU cores through data parallelism rather than large batch sizes

Adaptive Batch Sizing

Implement adaptive batch sizing based on runtime conditions:

// Monitor system resources and adjust batch size dynamically.
// Note: `available_memory()` and `bytes_per_parameter()` are illustrative
// helpers; candle's Device does not expose a memory query directly.
fn determine_optimal_batch_size(device: &Device, model_size: usize) -> usize {
    let available_memory = device.available_memory();
    let memory_per_item = model_size * bytes_per_parameter();

    // Leave some memory for overhead and other processes
    let max_batch_size = (available_memory as f64 * 0.8 / memory_per_item as f64) as usize;

    // Cap at a reasonable maximum to avoid excessive latency
    max_batch_size.clamp(1, 64)
}

Performance Monitoring

Implement comprehensive performance monitoring to guide batch size optimization:

  1. Throughput: Measure tokens generated per second
  2. Latency: Track time from input to first token and time between tokens
  3. Memory usage: Monitor VRAM/CPU memory consumption
  4. Device utilization: Track GPU/CPU utilization rates

By systematically evaluating these metrics across different batch sizes and hardware configurations, users can identify the optimal configuration for their specific use case and available resources.
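The first two metrics can be computed from a list of per-token completion times. A small std-only helper for that bookkeeping (illustrative, not part of an existing monitoring stack):

```rust
use std::time::Duration;

/// Tokens per second over the whole generation, plus the mean
/// inter-token latency, from per-token completion offsets.
fn generation_metrics(token_times: &[Duration]) -> (f64, Duration) {
    let total = *token_times.last().expect("at least one token");
    let throughput = token_times.len() as f64 / total.as_secs_f64();
    // Mean gap between consecutive tokens, in whole milliseconds
    let gaps_ms: u128 = token_times
        .windows(2)
        .map(|w| (w[1] - w[0]).as_millis())
        .sum();
    let mean_gap =
        Duration::from_millis((gaps_ms / (token_times.len() as u128 - 1)) as u64);
    (throughput, mean_gap)
}

fn main() {
    // 4 tokens completed at 50 ms intervals
    let times: Vec<Duration> =
        (1..=4u64).map(|i| Duration::from_millis(i * 50)).collect();
    let (tps, gap) = generation_metrics(&times);
    assert!((tps - 20.0).abs() < 1e-9);
    assert_eq!(gap, Duration::from_millis(50));
    println!("{tps} tok/s, {gap:?} between tokens");
}
```

Time-to-first-token would be tracked separately, since the prompt-processing phase typically dominates it and responds differently to batch size changes than the per-token decode loop does.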

Section sources

  • batcher.rs
  • device.rs
  • quantized_llama.rs

Referenced Files in This Document

  • batcher.rs
  • ctx.rs
  • quantized_llama.rs
  • matmul.rs
  • device.rs
