Analysis Date: 2026-02-04
Total Source Files: ~130 Python files
Total Lines of Code: ~40,000 (production) + ~40,000 (tests)
Project Status: 60-70% complete. Core infrastructure is solid; integration is incomplete.
Contents:

- Executive Summary
- What is LayerZero
- Architecture Overview
- Component Analysis
- Implementation Status
- Critical Issues & Bugs
- Missing Functionality
- Test Coverage
- Performance Characteristics
- How to Use for Inference Engine
- Recommendations
## Executive Summary

What works:

- Selection Engine Pipeline - Fully implemented with filter → score → select → cache flow
- MVCC Sharded Cache - Production-ready with 256 shards, O(1) invalidation, thundering herd prevention
- Policy Engine - Complete YAML-based rule system with locks, allows, denies, boosts
- Registry Systems - Thread-safe kernel and backend registries with health tracking
- KV Cache - Paged attention support with block allocation
- Telemetry - Metrics, circuit breakers, health monitoring, Prometheus/OpenTelemetry export
- Test Suite - 2,427 tests across unit, integration, property-based, fuzz, and stress testing
What is incomplete:

- Backend Dispatch - Selection engine works, but actual kernel dispatch is NOT implemented (everything falls back to torch SDPA)
- Integration Layer - Modules exist in isolation, no orchestration layer connects them
- Speculative Decoding - Algorithms correct but not wired to inference pipeline
- Quantization - Enums defined but validation/execution missing
- Distributed - Simulated tests only, no real torch.distributed testing
LayerZero is architecturally sound but functionally incomplete. The core selection infrastructure is production-ready, but the critical "last mile" - actually dispatching to optimized kernels - is missing. With 2-4 weeks of focused work on integration, this could be a production-ready kernel orchestration system.
## What is LayerZero

LayerZero is a kernel selection and orchestration framework for PyTorch inference that solves the problem of kernel fragmentation across ML inference libraries.
Modern ML inference relies on optimized kernels from multiple specialized libraries:
- FlashAttention (v2/v3) - IO-aware attention
- FlashInfer - Flexible paged KV-cache attention
- xFormers - Memory-efficient attention variants
- Liger Kernel - Fused Triton kernels for norms/MLPs
- oneDNN/ZenDNN - CPU acceleration
- Triton - Custom GPU kernels
Every serving framework independently implements kernel selection logic, causing bugs and inconsistencies.
```python
# Old way (duplicated everywhere)
if has_flash_attn and sm >= 80:
    output = flash_attn_func(q, k, v, causal=True)
elif has_xformers:
    output = xformers.ops.memory_efficient_attention(q, k, v)
else:
    output = F.scaled_dot_product_attention(q, k, v)

# New way (unified)
import layerzero as lz
output = lz.attention(q, k, v, causal=True)
# LayerZero handles selection, adaptation, and fallback automatically
```

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│ USER CODE / FRAMEWORKS │
│ (vLLM, SGLang, HuggingFace, Custom) │
└─────────────────────────────────────────────────────────────────┘
│ lz.attention(q, k, v)
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYERZERO API │
│ lz.attention() │ lz.rms_norm() │ lz.layer_norm() │ lz.configure()
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SELECTION ENGINE │
│ Policy Check → Filter (HW/Dtype) → Score (Priority) → Cache │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ BACKEND LOADER / PLUGIN MANAGER │
│ (Dynamic import, entry_points, capabilities descriptors) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ KERNEL REGISTRY │
│ flash_attn │ flashinfer │ xformers │ liger │ torch_sdpa │
└─────────────────────────────────────────────────────────────────┘
Selection flow:

```
Input: Attention tensors (q, k, v)
↓
SelectionContext.from_tensors(...)
↓
SelectionEngine.select(context)
├─ Check policy locks
├─ Query KernelRegistry by operation
├─ Apply policy allow/deny rules
├─ Filter by KernelSpec.check(ctx)
├─ Score with priority + policy boosts
├─ Select highest-scoring kernel
└─ Cache result (MVCC)
↓
ExecutionPlan (kernel_id, transforms)
↓
Dispatch to kernel implementation ← NOT IMPLEMENTED
↓
Output tensors
```
## Component Analysis

| Component | Status | Quality |
|---|---|---|
| SelectionEngine | ✅ Complete | Excellent |
| MVCC Cache | ✅ Complete | Excellent |
| KernelRegistry | ✅ Complete | Good |
| BackendRegistry | ✅ Complete | Good |
| Policy Engine | ✅ Complete | Good |
| KV Cache | ✅ Complete | Good |
| Telemetry | ✅ Complete | Good |
| Graph Validator | ✅ Complete | Good |
| Multi-Op Planner | ✅ Complete | Good |
| PerfDB | ✅ Complete | Good |
| Distributed | ⚠️ Partial | Fair |
| Backend Dispatch | ❌ Missing | N/A |
### Selection Engine

Status: ✅ COMPLETE - 97% test coverage
Files:
- `engine.py` - main SelectionEngine class (359 lines)
- `mvcc_cache.py` - sharded MVCC cache (470 lines)
- `cache.py` - simple LRU cache (203 lines)
- `filter.py` - constraint-based filtering (68 lines)
- `scorer.py` - priority/performance scoring (81 lines)
- `memory_aware.py` - memory headroom checks (411 lines)
How It Works:
```python
class SelectionEngine:
    def select(self, ctx: SelectionContext) -> ExecutionPlan:
        # 1. Check policy locks (force kernel if a lock applies)
        if locked_kernel := self._rule_engine.get_locked_kernel(ctx):
            return ExecutionPlan(kernel_id=locked_kernel, ...)
        # 2. Check cache
        cache_key = ctx.cache_key()
        if cached := self._cache.get(cache_key, self._policy.hash):
            return cached
        # 3. Get candidates from registry
        candidates = self._registry.get_by_operation(ctx.operation)
        # 4. Apply allow/deny rules
        candidates = self._rule_engine.filter_kernels(candidates, ctx)
        # 5. Filter by compatibility
        valid, filtered_out = self._filter_phase.filter(candidates, ctx)
        # 6. Score candidates
        scores = self._scoring_phase.score(valid, ctx)
        # 7. Select the highest score
        selected = max(scores, key=scores.get)
        # 8. Cache and return
        plan = ExecutionPlan(kernel_id=selected, ...)
        self._cache.put(cache_key, self._policy.hash, plan)
        return plan
```

Strengths:
- Thread-safe with RLock
- O(1) cache lookup after warmup
- Policy-driven with YAML configuration
- Comprehensive filtering (platform, SM, dtype, shape, features)
Issues Found:
- `_get_candidate_kernels()` in the planner is a stub - returns an empty list
- Memory-aware filtering is implemented but not integrated into the pipeline
- No batch optimization - `select_batch()` just loops
### MVCC Sharded Cache

Status: ✅ COMPLETE - 93% test coverage
Architecture:
```
MVCCShardedCache (256 shards by default)
├── CacheShard 0 (version=0, entries={}, lru_order=OrderedDict)
├── CacheShard 1
├── ...
└── CacheShard 255
```
Key Features:
- 256 shards for minimal lock contention
- MVCC versioning - O(1) policy invalidation via version bump
- Deduplication - prevents thundering herd via `threading.Event` (sketched below)
- LRU eviction - per-shard with configurable max entries
- TTL expiration - default 1 hour
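A minimal sketch of how versioned entries and `threading.Event` deduplication combine (illustrative names only, not the real `MVCCShardedCache` API; sharding, LRU, and TTL are omitted). It also shows the wait timeout that the issues below note is missing:

```python
import threading

class MVCCCacheSketch:
    """Illustrative sketch: version-stamped entries plus in-flight dedup."""

    def __init__(self):
        self._entries = {}    # key -> (policy_version, value)
        self._inflight = {}   # key -> Event held by the computing thread
        self._lock = threading.Lock()

    def get_or_compute(self, key, policy_version, compute, timeout=5.0):
        while True:
            with self._lock:
                hit = self._entries.get(key)
                if hit is not None and hit[0] == policy_version:
                    return hit[1]          # entries from older policies miss,
                                           # so a version bump invalidates in O(1)
                event = self._inflight.get(key)
                if event is None:          # nobody computing: claim the slot
                    event = threading.Event()
                    self._inflight[key] = event
                    owner = True
                else:
                    owner = False
            if owner:
                try:
                    value = compute()
                    with self._lock:
                        self._entries[key] = (policy_version, value)
                    return value
                finally:
                    with self._lock:
                        self._inflight.pop(key, None)
                    event.set()            # wake waiters even if compute failed
            else:
                # Thundering-herd prevention: wait for the owner, then retry.
                # The timeout avoids the hang flagged in "Issues Found".
                event.wait(timeout)
```

Because failed computations simply propagate and waiters loop back to retry, this shape also avoids the recursion risk noted below.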
Performance:
- Target: 10K+ QPS
- Latency: <50µs cache lookup
- Memory: Bounded to 256 × 100 = 25.6K entries max
Issues Found:
- Potential infinite recursion in `get_or_compute()` if compute fails repeatedly
- No timeout on `Event.wait()` - could hang if the computing thread crashes
- MD5 used for sharding (not cryptographically secure, but fine for distribution)
### Policy Engine

Status: ✅ COMPLETE
Files:
- `policy.py` - policy container (120 lines)
- `rule.py` - rules and conditions (226 lines)
- `engine.py` - rule evaluation (177 lines)
- `loader.py` - YAML/env loading (302 lines)
Policy Structure:
version: "1.0"
locks:
- operation: "attention.causal"
kernel: "flash_attn.v3.causal"
when:
head_dim: ">=64"
allow:
- backend: "flash_attn"
- backend: "xformers"
deny:
- kernel: "torch.sdpa"
boosts:
- kernel: "flash_attn.*"
priority_add: 20Condition Operators:
- Comparison: `==`, `!=`, `>`, `>=`, `<`, `<=`
- Set: `in`, `not_in`
- Pattern: `match` (glob via fnmatch)
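A hedged sketch of how such condition strings could be evaluated (the names here are illustrative, not the actual `rule.py` internals):

```python
import fnmatch
import operator

# Comparison tokens must be tried longest-first so ">=" wins over ">".
_OPS = [(">=", operator.ge), ("<=", operator.le), ("==", operator.eq),
        ("!=", operator.ne), (">", operator.gt), ("<", operator.lt)]

def evaluate_condition(spec: str, value) -> bool:
    """Evaluate one condition string such as '>=64' against a context value."""
    for token, op in _OPS:
        if spec.startswith(token):
            # Coerce the operand to the context value's type before comparing.
            return op(value, type(value)(spec[len(token):].strip()))
    # No comparison prefix: treat the spec as a glob pattern (fnmatch).
    return fnmatch.fnmatch(str(value), spec)

# The lock above applies only when head_dim >= 64:
assert evaluate_condition(">=64", 128) is True
# Boost patterns use glob matching:
assert evaluate_condition("flash_attn.*", "flash_attn.v3.causal") is True
```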
Issues Found:
- `Loader.merge()` has a key typo (`deny` vs `denies`)
- No type validation on condition values
- No logging/audit trail for rule matches
### Registry Systems

Status: ✅ COMPLETE
KernelRegistry:
- Thread-safe (RLock)
- Index by: operation, source, platform
- Methods: `register()`, `register_many()`, `get()`, `get_by_operation()`, `filter()`
- Atomic batch registration with rollback on error
BackendRegistry:
- Circuit breaker pattern for health tracking
- States: HEALTHY → DEGRADED → UNHEALTHY → COOLDOWN → HEALTHY
- Health tracking: failure_count, cooldown_until, total_requests
Issues Found:
- `get_health()` mutates state during a read - should be separated
- `get_available_backends()` has nested lock acquisition
- `filter()` doesn't use the indexes - O(n) scan
### KV Cache

Status: ✅ COMPLETE with gaps
Strategies:
- `CONTIGUOUS` - pre-allocated contiguous memory
- `PAGED` - block-based allocation (vLLM-style)
- `CHUNKED` - fixed-size chunks (not implemented)
- `VIRTUAL` - OS-level paging (not implemented)
PagedKVCache Features:
- Block allocation/deallocation
- Sequence tracking with auto-extending
- FlashInfer-compatible block tables
- Fragmentation ratio calculation
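A minimal sketch of the paged strategy (class and method names are hypothetical, not the real `PagedKVCache` API), showing block tables plus the double-free check that the issues below say is missing:

```python
class PagedAllocatorSketch:
    """Illustrative paged-KV allocator: each sequence owns a list of block ids."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                     # seq_id -> [block_id, ...]

    def allocate(self, seq_id: int, seq_len: int) -> list[int]:
        needed = -(-seq_len // self.block_size)    # ceil division
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks")
        table = [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[seq_id] = table
        return table                               # FlashInfer-style block table

    def free(self, seq_id: int) -> None:
        table = self.block_tables.pop(seq_id, None)
        if table is None:                          # guards against double-free
            raise KeyError(f"double free or unknown sequence {seq_id}")
        self.free_blocks.extend(table)

alloc = PagedAllocatorSketch(num_blocks=1024, block_size=16)
alloc.allocate(seq_id=0, seq_len=100)   # 100 tokens -> 7 blocks of 16
```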
Issues Found:
- BNHD layout conversion NOT implemented - raises NotImplementedError
- `get_seq_dim(BNHD)` returns -1 (invalid)
- `defragment()` only sorts the free list; it doesn't move data
- No seq_len validation in `allocate()`
- Double-free not detected
### Telemetry

Status: ✅ COMPLETE
Metrics Tracked:
- Total selections
- Per-kernel usage counts
- Cache hit/miss rates
- Selection latency (p50, p95, p99)
- Backend health states
Exporters:
- Prometheus text format
- OpenTelemetry OTLP JSON
Circuit Breaker:
- CLOSED → OPEN after N failures
- OPEN → HALF_OPEN after cooldown
- HALF_OPEN → CLOSED on success
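These transitions form a small state machine; here is a sketch with assumed thresholds (the real defaults may differ):

```python
import enum
import time

class State(enum.Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing: reject requests immediately
    HALF_OPEN = "half_open"  # probing: allow a trial request

class CircuitBreakerSketch:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.state = State.CLOSED
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.opened_at = 0.0

    def allow(self) -> bool:
        if (self.state is State.OPEN
                and time.monotonic() - self.opened_at >= self.cooldown_s):
            self.state = State.HALF_OPEN   # cooldown elapsed: probe once
        return self.state is not State.OPEN

    def record(self, success: bool) -> None:
        if success:
            self.state, self.failures = State.CLOSED, 0
        else:
            self.failures += 1
            if (self.state is State.HALF_OPEN
                    or self.failures >= self.failure_threshold):
                self.state, self.opened_at = State.OPEN, time.monotonic()
```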
Issues Found:
- Memory unbounded growth in metrics (latencies list never pruned)
- Percentile calculation inaccurate for small samples
- `explain.py` uses a hardcoded kernel registry - not integrated with the real system
### Process Isolation

Status: ⚠️ PARTIAL
Purpose: Process isolation for ABI-incompatible backends
Components:
- Subprocess spawning with JSON IPC
- Worker process with signal handling
- ABI conflict detection
Issues Found:
- CRITICAL: Threading deadlock risk - unprotected stdin/stdout access
- CRITICAL: ABI isolation strategy inverted - isolates majority instead of minority
- ThreadPoolExecutor never shut down (resource leak)
- No timeout on operations - could hang
- SharedMemoryChannel defined but never used
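The deadlock risk could be addressed by serializing each request/response pair under a single lock. A simplified sketch (not the existing `ipc.py` API; class and method names are illustrative):

```python
import json
import subprocess
import threading

class JsonIpcSketch:
    """Illustrative fix: one lock guards each full request/response exchange."""

    def __init__(self, cmd: list[str]):
        self._proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
        self._lock = threading.Lock()   # protects stdin AND stdout together

    def call(self, request: dict, timeout: float = 30.0) -> dict:
        with self._lock:                # no interleaved writes or reads
            self._proc.stdin.write(json.dumps(request) + "\n")
            self._proc.stdin.flush()
            # A real implementation would enforce `timeout` here (e.g. via
            # select/poll on stdout) rather than blocking indefinitely.
            line = self._proc.stdout.readline()
        if not line:
            raise RuntimeError("worker exited without responding")
        return json.loads(line)
```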
### Speculative Decoding

Status: ⚠️ PARTIAL - algorithms implemented but not wired to the inference pipeline
Implemented:
- Rejection sampling (mathematically verified)
- Greedy verification
- Kernel pair selection (draft/target)
- Tree attention config (partial)
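For reference, the standard acceptance rule that rejection-sampling verification implements; a self-contained sketch, not LayerZero's actual function signature:

```python
import torch

def verify_draft_token(p_target: torch.Tensor, p_draft: torch.Tensor,
                       token: int) -> tuple[bool, int]:
    """One rejection-sampling step over vocab-sized probability vectors.

    Accept the draft token with probability min(1, p_target/p_draft);
    otherwise resample from the normalized residual max(0, p_target - p_draft).
    This preserves the target distribution exactly.
    """
    if torch.rand(()) < torch.clamp(p_target[token] / p_draft[token], max=1.0):
        return True, token
    residual = torch.clamp(p_target - p_draft, min=0.0)
    resampled = torch.multinomial(residual / residual.sum(), 1).item()
    return False, resampled
```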
Issues Found:
- CRITICAL: No integration with inference pipeline
- Kernel IDs not validated against registry
- Model validation too weak (only vocab_size)
- Speedup estimation formula oversimplified
### MLP, Sampling & Position Encoding

Status: ✅ COMPLETE
MLP:
- SwiGLU, GeGLU, ReGLU activations
- Fused MLP operation
- Memory-efficient chunked MLP
Sampling:
- Top-k sampling
- Top-p (nucleus) sampling
- Combined top-k + top-p
- Temperature scaling
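As a reference for the top-p path (and why it costs more than top-k, see the performance section), a minimal nucleus-sampling sketch, not the library's exact implementation:

```python
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.9,
                 temperature: float = 1.0) -> int:
    """Nucleus sampling over a 1-D logits vector: sample from the smallest
    token set whose cumulative probability reaches p."""
    probs = torch.softmax(logits / temperature, dim=-1)
    # The full descending sort is the reason top-p is slower than top-k.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding mass already reaches p (the
    # highest-probability token is therefore always kept).
    sorted_probs[cumulative - sorted_probs >= p] = 0.0
    choice = torch.multinomial(sorted_probs / sorted_probs.sum(), 1)
    return sorted_idx[choice].item()
```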
Position Encoding:
- ALiBi with caching
- Causal bias generation
- Slope computation for non-power-of-2 heads
Issues Found:
- ALiBi cache not thread-safe (global dict)
- CUDA generator parameter silently ignored
- RoPE mentioned but not implemented
### Graph Validator & Multi-Op Planner

Status: ✅ COMPLETE
Graph Safety:
- CUDA graph safety whitelist (17 safe, 19 unsafe operations)
- Warmup protocol (cuBLAS, cuDNN initialization)
- Memory tracking during capture
- Dummy capture validation
Multi-Op Planner:
- Joint optimization (exponential - O(K^N))
- Greedy optimization (linear - O(K*N))
- Transform cost calculation (layout, dtype)
- Plan caching with LRU/FIFO
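The greedy strategy's O(K·N) cost comes from choosing each op's kernel locally, charging only the transform from the previous op's choice, where joint search would explore all K^N combinations. A schematic sketch (all names hypothetical; `transform_cost` must accept `None` for the first op):

```python
def greedy_plan(ops, candidates, kernel_score, transform_cost):
    """Greedy multi-op planning sketch: O(K*N) instead of O(K^N).

    ops            : ordered list of operations to plan
    candidates[op] : kernels able to run `op`
    kernel_score   : kernel -> standalone score
    transform_cost : (prev_kernel_or_None, kernel) -> layout/dtype cost
    """
    plan, prev = [], None
    for op in ops:
        best = max(
            candidates[op],
            key=lambda k: kernel_score(k) - transform_cost(prev, k),
        )
        plan.append(best)
        prev = best
    return plan
```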
Issues Found:
- CRITICAL: `_get_candidate_kernels()` is a stub - returns an empty list
- Joint planner scales exponentially - only viable for <8 ops
- Dtype comparison bug in greedy planner
### PerfDB

Status: ✅ COMPLETE
Features:
- SQLite with WAL mode
- Thread-local connections
- Bucketing for seq_len and batch_size
- Confidence scoring based on sample count and variance
- Invalidation by driver/CUDA version
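The bucketing feature groups nearby shapes so their timings share one record. One plausible scheme (an assumption; the actual bucket boundaries aren't documented here) rounds up to powers of two:

```python
def seq_bucket(seq_len: int) -> int:
    """Round a sequence length up to the next power of two so timing
    records for nearby shapes land in the same PerfDB row (sketch)."""
    bucket = 1
    while bucket < seq_len:
        bucket *= 2
    return bucket

assert seq_bucket(100) == 128
assert seq_bucket(4096) == 4096
```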
Schema:
```sql
CREATE TABLE perf_records (
    kernel_id, operation, device_id, sm_version, dtype,
    head_dim, seq_bucket, batch_bucket,
    median_us, p95_us, samples, variance_us, warmup_ms,
    kernel_version, cuda_version, driver_version, ...
    UNIQUE(kernel_id, operation, device_id, sm_version, dtype,
           head_dim, seq_bucket, batch_bucket)
);
```

Issues Found:
- No index on the `valid` flag - queries scan the full table
- SM version stored as a string, parsed on every query
- No query cache
### Distributed

Status: ⚠️ PARTIAL - simulated tests only
Implemented:
- Version consistency checking
- Selection hash synchronization
- TP invariance filtering
Issues Found:
- All tests use SimulatedProcessGroup - no real torch.distributed
- No timeout on broadcast operations
- No error recovery mechanism
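A sketch of what a real (non-simulated) consistency check could look like, assuming an initialized `torch.distributed` process group; `check_selection_consistency` is a hypothetical helper, and timeouts would be enforced via the process-group configuration:

```python
import torch.distributed as dist

def check_selection_consistency(local_hash: str) -> None:
    """Broadcast rank 0's selection hash and fail fast on divergence."""
    payload = [local_hash if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    if payload[0] != local_hash:
        raise RuntimeError(
            f"rank {dist.get_rank()} selection hash {local_hash!r} "
            f"differs from rank 0 ({payload[0]!r})")
```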
### Public API Layer

Status: ⚠️ PARTIAL - all APIs defined, dispatch missing
Implemented APIs:
- `attention()`, `paged_attention()`, `rope()`, `rms_norm()`, `layer_norm()`
- `sample_topk()`, `sample_topp()`, `tokenize()`, `detokenize()`
- `configure()`, `lock()`, `unlock()`, `get_config()`
- `select()`, `explain()`, `which()`, `list_kernels()`, `validate()`
- `doctor()`, `readiness_check()`, `compile()`, `tune()`
CRITICAL ISSUE: All operations fall back to torch SDPA regardless of selection!
```python
# In operations.py
def _dispatch_attention(q, k, v, ...):
    # TODO: Actually use selected kernel
    return F.scaled_dot_product_attention(q, k, v, ...)  # Always SDPA!
```

### Framework Integrations

Status: ⚠️ PARTIAL - wrappers exist but never invoke LayerZero
Implemented:
- HuggingFace Transformers integration
- Diffusers integration
- Model patching mechanism
- Tokenization pipeline
CRITICAL ISSUE: LayerZeroAttentionWrapper.forward() never calls LayerZero!
```python
def forward(self, *args, **kwargs):
    # Always calls original, never LayerZero!
    return self._original(*args, **kwargs)
```

## Implementation Status

Complete and verified:

- ✅ Task 8: Selection Engine Pipeline
- ✅ Task 9: MVCC Sharded Cache
Code exists, integration unverified:

| Task | Status | Priority |
|---|---|---|
| Task 1: Core Enums | Code exists | HIGH |
| Task 2: Data Models | Code exists | HIGH |
| Task 6: Registries | Code exists | HIGH |
| Task 7: Policy Engine | Code exists | HIGH |
| Task 10: PerfDB | Code exists | MEDIUM |
Backend adapters:

| Task | Status | Priority |
|---|---|---|
| Task 11: Torch SDPA | Adapter exists | HIGH |
| Task 12: FlashAttention | Adapter exists | CRITICAL |
| Task 13: FlashInfer | Adapter exists | HIGH |
| Task 14: xFormers | Adapter exists | HIGH |
| Task 15: Liger | Adapter exists | HIGH |
| Task 16: Triton | Partial | HIGH |
| Task 17: CPU Backends | Partial | MEDIUM |
Not implemented:

- Backend Dispatch - selection works, dispatch doesn't
- Integration Layer - No orchestration connecting modules
- Real Distributed Tests - Only simulated
- Production Validation - No end-to-end production tests
## Critical Issues & Bugs

| # | Issue | Location | Impact |
|---|---|---|---|
| 1 | Backend dispatch not implemented | api/operations.py | All kernels fall back to SDPA |
| 2 | LayerZeroAttentionWrapper doesn't use LayerZero | integrations/model_patching.py | Patching has no effect |
| 3 | _get_candidate_kernels() is a stub | planner/multi_op.py | Planning always uses dummy kernels |
| 4 | Subprocess threading deadlock | isolation/subprocess_backend.py | Potential hang |
| 5 | ABI isolation strategy inverted | isolation/abi_detector.py | Isolates the wrong backends |
| # | Issue | Location | Impact |
|---|---|---|---|
| 6 | BNHD layout conversion missing | kv_cache/layouts.py | Paged attention broken |
| 7 | get_health() mutates state | registry/backend_registry.py | Race conditions |
| 8 | Infinite recursion in cache | selection/mvcc_cache.py | Potential crash |
| 9 | No timeout on Event.wait() | selection/mvcc_cache.py | Potential hang |
| 10 | Dtype comparison bug | planner/multi_op.py | Wrong transform costs |
| # | Issue | Location | Impact |
|---|---|---|---|
| 11 | ALiBi cache not thread-safe | posenc/alibi.py | Race conditions |
| 12 | Memory unbounded in metrics | telemetry/metrics.py | Memory leak |
| 13 | No index on valid flag | perfdb/database.py | Slow queries |
| 14 | Speculative decoding not integrated | speculative/ | Feature non-functional |
| 15 | Exponential joint planner | planner/multi_op.py | Timeout for >8 ops |
## Missing Functionality

Critical path:

1. Backend Dispatch Implementation
   - Map kernel_id to the actual kernel function
   - Handle layout/dtype conversions
   - Execute with proper error handling
2. Integration Orchestrator
   - Connect selection → dispatch → execution
   - Handle KV cache management
   - Coordinate speculative decoding
3. Real Distributed Testing
   - Test with actual torch.distributed
   - Multi-GPU validation
   - Rank consistency verification
Other gaps:

- RoPE position encoding
- CHUNKED and VIRTUAL KV cache strategies
- Quantization validation (INT4, INT8, FP8)
- Defragmentation with data movement
- Large model testing (13B+)
- AutoTune mechanism
- ONNX/TorchScript export
- HPU/XPU support
- Performance regression detection
- Distributed profiling
## Test Coverage

- Total Tests: 2,427
- Test Files: 130
- Lines of Test Code: 40,011
- Benchmark Files: 9 (2,594 lines)
| Component | Tests | Coverage |
|---|---|---|
| Selection Engine | 101 | 97% |
| MVCC Cache | 34 | 93% |
| Policy Engine | 40+ | 90%+ |
| Registries | 25+ | 85%+ |
| KV Cache | 30+ | 85%+ |
| Backends | 600+ | 80%+ |
| Telemetry | 30+ | 80%+ |
| API | None found | 0% |
| Integrations | 20+ | Limited |
- Unit Tests: 115 files (comprehensive)
- Integration Tests: 8 files (69 marked tests)
- Correctness Tests: 43 marked tests
- Property-Based: Hypothesis support (optional)
- Fuzz Tests: Fallback implementation
- Stress Tests: 34 marked tests
- Benchmarks: 9 experiment files
Not covered:

- API operations (0% coverage)
- Real distributed tests (simulated only)
- Large model tests (13B+)
- Long context tests (32K+)
- OOM recovery tests
- Multi-GPU tests
## Performance Characteristics

Selection cache:

- Cache Hit Latency: <50µs
- Cache Miss Latency: <1ms
- Throughput Target: 10K+ QPS
- Memory per Shard: ~100 entries max
Operation benchmarks:

- Large batch (32×2048×4096): <60s CPU, <10s GPU
- Memory bandwidth limited on GPU
Sampling:

- Top-k: 100 samples @ 128K vocab → <5s CPU, <1s GPU
- Top-p: 6x slower than top-k (full sort required)
ALiBi bias generation:

- seq_len=4096, 32 heads: <1s generation
- Memory: num_heads × seq_len² × 4 bytes
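To make the quadratic term concrete: at 32 heads and seq_len = 4096, a full bias tensor takes 32 × 4096² × 4 B = 2 GiB, which is why the ALiBi implementation caches biases rather than regenerating them.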
Benchmark suites:

- `full_system_benchmark.py` - comprehensive system test
- `exp_01_selection_overhead.py` - <10µs/selection target
- `exp_02_import_isolation.py` - backend import timing
- `exp_03_jit_cold_start.py` - JIT warmup costs
- `exp_04_correctness_variance.py` - precision analysis
## How to Use for Inference Engine

LayerZero cannot currently be used for production inference because:
- Backend dispatch is not implemented
- All operations fall back to torch SDPA
- Integration layer is missing
Required fixes, in order:

1. Implement backend dispatch:

```python
# In api/operations.py
def _dispatch_attention(kernel_id, q, k, v, ...):
    if kernel_id.startswith("flash_attn"):
        return _call_flash_attention(q, k, v, ...)
    elif kernel_id.startswith("flashinfer"):
        return _call_flashinfer(q, k, v, ...)
    elif kernel_id.startswith("xformers"):
        return _call_xformers(q, k, v, ...)
    else:
        return F.scaled_dot_product_attention(q, k, v, ...)
```

2. Make the integration wrapper call LayerZero:

```python
# In integrations/model_patching.py
def forward(self, *args, **kwargs):
    # Actually call LayerZero
    return torch.ops.layerzero.attention(...)
```

3. Create an inference orchestrator:

```python
class InferenceEngine:
    def generate(self, input_ids, ...):
        # 1. Run draft model
        # 2. Run target verification
        # 3. Accept/reject tokens
        # 4. Manage KV cache
```

4. Make distributed support real:
   - Replace SimulatedProcessGroup with real torch.distributed
   - Add timeout handling
   - Test multi-GPU scenarios
For the Bud Waav inference engine:
1. Use LayerZero for Kernel Selection
   - Import and configure LayerZero
   - Let it handle backend selection
2. Integrate with Gateway
   - Selection happens at request routing
   - Cache warmup during initialization
3. KV Cache Management
   - Use LayerZero's PagedKVCache for vLLM-style serving
   - Integrate with request scheduling
4. Monitoring
   - Export LayerZero metrics to Prometheus
   - Track selection decisions and latencies
Example usage once dispatch is implemented:

```python
import layerzero as lz

# Configure at startup
lz.configure(
    default_backend="flash_attn",
    cache_size=10000,
    strict_mode=False,
)

# Lock specific kernels for production
lz.lock("attention.causal", "flash_attn.v3.causal")

# Check system health
report = lz.doctor()
if not report.healthy:
    logger.warning(f"LayerZero issues: {report.summary}")

# In the inference loop
class WhisperInference:
    def forward(self, audio_features):
        # LayerZero automatically selects the best kernel
        attention_output = lz.attention(
            query, key, value,
            is_causal=True,
        )
        return attention_output
```

## Recommendations
1. Implement Backend Dispatch
   - Create a kernel dispatch map in `api/operations.py`
   - Add backend-specific call wrappers
   - Wire the selection result to dispatch
2. Fix Integration Wrapper
   - Make `LayerZeroAttentionWrapper` actually call LayerZero
   - Test with HuggingFace models
3. Add API Tests
   - Currently 0% coverage
   - Add unit tests for all public APIs
4. Fix Critical Bugs
   - BNHD layout conversion
   - ABI isolation strategy
   - Subprocess threading issues
5. Wire Speculative Decoding
   - Create an inference orchestrator
   - Connect to the selection engine
   - Test with real models
6. Add Distributed Tests
   - Replace `SimulatedProcessGroup`
   - Test multi-rank consistency
7. Complete Backend Adapters
   - Validate all adapters work end-to-end
   - Add missing constraint checks
8. Performance Optimization
   - Profile and optimize hot paths
   - Add PerfDB integration
9. Production Hardening
   - Add timeouts everywhere
   - Improve error messages
   - Add graceful degradation
10. Advanced Features
    - AutoTune mechanism
    - Quantization support
    - Multi-GPU optimization
11. Documentation
    - API documentation
    - Architecture guide
    - Integration tutorials
Source tree:

```
src/layerzero/
├── __init__.py
├── device.py # GPU detection, SM versions
├── enums.py # Core enumerations
├── api/ # Public API
│ ├── operations.py # lz.attention(), etc.
│ ├── config.py # lz.configure(), locks
│ ├── inspection.py # lz.explain(), lz.which()
│ └── system.py # lz.doctor(), readiness
├── core/ # Validation utilities
├── models/ # Data models
│ ├── device_spec.py
│ ├── kernel_spec.py
│ ├── selection_context.py
│ ├── execution_plan.py
│ ├── operation_spec.py
│ └── backend_spec.py
├── selection/ # Selection engine
│ ├── engine.py
│ ├── mvcc_cache.py
│ ├── filter.py
│ ├── scorer.py
│ └── memory_aware.py
├── registry/ # Kernel/backend registries
│ ├── kernel_registry.py
│ └── backend_registry.py
├── policy/ # Policy engine
│ ├── policy.py
│ ├── rule.py
│ ├── engine.py
│ └── loader.py
├── kv_cache/ # KV cache management
│ ├── strategy.py
│ ├── layouts.py
│ └── paged.py
├── telemetry/ # Metrics and monitoring
│ ├── metrics.py
│ ├── explain.py
│ ├── selection_report.py
│ └── exporters/
├── health/ # Health tracking
│ ├── circuit_breaker.py
│ └── backend_health.py
├── graphs/ # CUDA graph safety
│ ├── validator.py
│ ├── whitelist.py
│ ├── warmup.py
│ └── memory_tracker.py
├── planner/ # Multi-op planning
│ ├── multi_op.py
│ └── plan_cache.py
├── perfdb/ # Performance database
│ ├── database.py
│ ├── record.py
│ └── schema.py
├── distributed/ # Distributed support
│ ├── consistency.py
│ └── tp_invariance.py
├── isolation/ # Process isolation
│ ├── worker.py
│ ├── subprocess_backend.py
│ ├── ipc.py
│ └── abi_detector.py
├── speculative/ # Speculative decoding
│ ├── coordination.py
│ └── verification.py
├── mlp/ # MLP operations
│ ├── fused.py
│ └── linear.py
├── sampling/ # Token sampling
│ ├── topk.py
│ ├── topp.py
│ ├── combined.py
│ └── temperature.py
├── posenc/ # Position encoding
│ └── alibi.py
└── integrations/ # Framework integrations
├── transformers.py
├── diffusers.py
├── model_patching.py
└── tokenization_pipeline.py
```
Report Generated: 2026-02-04
Analysis Depth: Line-by-line review of all source files
Agents Used: 15 parallel exploration agents