Skip to content

Latest commit

 

History

History
290 lines (233 loc) · 16.5 KB

File metadata and controls

290 lines (233 loc) · 16.5 KB

ARLE Codebase Map

2026-05-08 EOD+18 战略 source of truth:projects/2026-05-07-arle-master-strategy.md §0.1 主战场 3 axis(user 2026-05-08 directive):Agent workload(W3/W4) + 量化全套 + 投机解码(Medusa/EAGLE/DFlash)。 非主战场 deprecated:Piecewise prefill graph(Phase 0 KILL 8b4a03b)+ canonical 4-shape 单点优化(6 KILL 全在错的 workload)。

量化全套 plan:plans/M_quant-fp8-w4-magnitude-path.md。 cuBLASLt FP8 smoke 实测 1.88×(< 2× KILL),cutlass FP8 direct mma smoke 待验(#28)。

今日 41+ commits incremental audit:architecture-snapshot-2026-05-07-eod.md。 本 doc 仍是结构性 truth 来源,战略和今日变化看上面 pointer。

2026-05-10 later update: active Qwen3.5 Medusa/spec work is gated by plans/M_medusa-phase1b-qwen35-v2-snapshot-ring-redesign.md. Older Medusa-ready / A+B notes are historical for Qwen3.6 until recurrent-state accepted-length rollback is licensed for Qwen3.5.

Updated 2026-05-06 after the DSV4 runtime substrate scaffold + nano autograd training landed (2026-05-05). DSV4 is the #1 next-model priority and Qwen 3.6 the #2 — see ROADMAP.md §Next-Model Priority Order. Earlier landings still in scope: F0–F4 multi-GPU scaffold, Phase 2 spec-decode plumbing, and crates/deepseek-spec/ DS0 scaffold.

This document is the canonical workspace-topology truth: where files live, what each crate owns, and where to start reading. For ownership boundaries and crate-admission governance see architecture.md; support status by surface lives in support-matrix.md.

1. Workspace at a glance

The repository has four practical layers:

  • workspace root package: thin binary wrapper in src/main.rs that calls infer_cli::run().
  • infer/: the runtime-heavy crate. It owns the HTTP server, scheduler, backends, model/runtime modules, and the unified server_engine::InferenceEngine contract used by the HTTP server and agent CLI alike.
  • crates/: reusable control-plane/helper crates around the runtime.
  • docs/: architecture, plans, research, and implementation notes (single source of truth; the historical infer/docs/ parallel tree was retired during the 2026-04-25 truth-surface cleanup).

Current workspace members (ownership and boundaries are listed in architecture.md §Package Boundaries):

  • workspace root package
  • infer
  • crates/cuda-kernels
  • crates/mlx-sys
  • crates/agent
  • crates/chat
  • crates/cli
  • crates/tools
  • crates/qwen3-spec, crates/qwen35-spec, crates/deepseek-spec
  • crates/autograd
  • crates/train
  • crates/kv-native-sys

2. Main execution paths

Agent CLI path

src/main.rs
  -> infer_cli::run()
  -> infer::hf_hub::resolve_model_source() + infer::server_engine::LoadedInferenceEngine::load()
  -> infer_agent::AgentSession (uses `dyn InferenceEngine`)
  -> infer_tools builtin tools + infer_chat protocol
  -> LoadedInferenceEngine dispatches to CUDA / Metal / CPU backend

Key files:

  • src/main.rs: arle binary entrypoint from the root package
  • crates/cli/src/lib.rs: CLI startup and backend selection
  • crates/cli/src/repl.rs: REPL loop, slash commands, terminal UX
  • infer/src/server_engine.rs: unified InferenceEngine trait, CompletionRequest/CompletionOutput/TokenUsage/CompletionStreamDelta types, and LoadedInferenceEngine backend dispatch enum
  • infer/src/hf_hub.rs: local model discovery + resolve_model_source
  • crates/agent/src/lib.rs: session state, prompt assembly, turn loop
  • crates/tools/src/lib.rs: builtin tools and shared tool hooks
  • crates/chat/src/lib.rs: OpenAiChatMessage / OpenAiToolDefinition wire format + re-exports of the internal ChatMessage / ToolCall / ToolDefinition protocol types from crate::protocol

CUDA serving path

infer/src/main.rs
  -> backend/cuda/bootstrap.rs
  -> http_server.rs
  -> server_engine.rs
  -> scheduler/cuda/*
  -> model.rs + model/*
  -> ops.rs + ops/*
  -> crates/cuda-kernels kernels / TileLang / CUDA graph path

Key files:

  • infer/src/main.rs: CUDA server binary
  • infer/src/backend/cuda/bootstrap.rs: model loading, runtime config, scheduler bring-up
  • infer/src/http_server.rs and infer/src/http_server/openai_v1.rs: HTTP API
  • infer/src/server_engine.rs: synchronous/streaming generation façade
  • infer/src/scheduler/cuda/: production CUDA scheduler implementation

Serial backend runtime path

cpu_serve / metal_serve
  -> backend/runtime.rs
  -> CpuBackend or MetalBackend
  -> request streaming through StopChunkProcessor

Key files:

  • infer/src/backend/runtime.rs: serial runtime handle for non-CUDA backends
  • infer/src/backend/cpu.rs: development CPU backend
  • infer/src/backend/metal.rs: Apple Silicon backend via mlx-sys
  • infer/src/bin/cpu_serve.rs
  • infer/src/bin/metal_serve.rs

Current train control-plane path (post-OPD-pivot, 2026-05-18)

crates/train/src/commands/train_opd.rs   (substrate landing next milestone)
  -> train::server::bind_and_serve_on_thread()
  -> std TcpListener control plane on /v1/train/{status,events,stop,save}
  -> train::control::TrainingController + ControllerSink
  -> SharedSink background worker
  -> local JSONL/stdout + optional MLflow / OTLP / W&B export
  -> autograd + Trainer<O, C, S> + teacher (in `infer`) + student LoRA

Scratch pretrain, SFT, GRPO, and multi-turn RL surfaces were retired in commit bd94c09 (see docs/projects/2026-05-18-opd-only-pivot.md). Their dispatch sources, supporting modules, and tests have been deleted from crates/train. The autograd + Trainer + checkpoint codec

  • tokenizer + LoRA + /v1/train/* control plane remain as OPD substrate. infer continues to expose the optional /v1/train/* proxy when --train-control-url is configured.

Key files (surviving the pivot):

  • crates/train/src/commands/env.rs, test.rs, estimate-memory.rs: diagnostic surfaces preserved
  • crates/train/src/server.rs: minimal HTTP control plane for /v1/train/status|events|stop|save
  • crates/train/src/control.rs: shared controller / status state plus recent event ring buffer
  • crates/train/src/metrics.rs: shared async observability sink, lifecycle/artifact events, MLflow / OTLP / W&B export adapters
  • crates/train/src/trainer.rs: Trainer<O, C, S> skeleton — kept; OPD will provide its own step_fn
  • crates/train/src/{checkpoint,cli_args,grad_accum,grad_clip,loss,lora,tokenizer,causal_lm,qwen35,qwen35_checkpoint,model_family}.rs: substrate kept for OPD

3. infer/ crate map

Runtime entry, serving, and wiring

  • infer/src/server_engine.rs: unified InferenceEngine trait, CompletionRequest/CompletionOutput/TokenUsage/CompletionStreamDelta types, CUDA generation loop, and the LoadedInferenceEngine enum that dispatches to Qwen35/Qwen35Moe (CUDA), BackendInferenceEngine<MetalBackend> (Metal), or BackendInferenceEngine<CpuBackend> (CPU)
  • infer/src/backend/cuda/bootstrap.rs: builds CUDA engines and schedulers
  • infer/src/backend/runtime.rs: serial backend runtime for CPU/Metal
  • infer/src/http_server.rs: axum wiring for serving
  • infer/src/request_handle.rs: generic request submission interface
  • infer/src/logging.rs: default logging init
  • infer/src/metrics.rs: metrics export surface
  • infer/src/hf_hub.rs: local model discovery / HuggingFace integration
  • infer/src/model_registry.rs: model architecture detection

Scheduling and lifecycle control

  • infer/src/scheduler/batch.rs: pure CPU accounting scheduler with lifecycle events
  • infer/src/scheduler/types.rs: request types, handles, config, queue admission
  • infer/src/scheduler/policy.rs: admission/chunking/eviction policy traits and defaults
  • infer/src/scheduler/forward_batch.rs: F0.7 type-only ForwardBatch + IntermediateTensors PP-proxy slot — present from F0 ahead of pipeline-parallel forward wiring
  • infer/src/scheduler/cuda/: production CUDA scheduler
  • infer/src/scheduler/cuda/spec_path.rs: per-step SpecPath dispatch that gates the speculative decode verifier micro-batch path through the CUDA execution loop
  • infer/src/backend/metal/scheduler.rs: Metal scheduling/accounting layer

Distributed (single-node multi-GPU F0–F4 scaffold)

  • infer/src/distributed.rs + infer/src/distributed/{parallel_state,group_coordinator,pipeline_state,expert_state,nccl,init_method}.rs: F0.1–F0.4 multi-GPU foundation — SGLang-style world / TP / PP / EP / attention-TP/DP/CP / MoE-TP/EP/DP group metadata, a GroupCoordinator collective surface (single-rank no-op; wraps the NCCL smoke group for f32 all-reduce, all-gather, broadcast under --features cuda,nccl), TCP rendezvous (TcpStore / EnvBootstrap), F3 pipeline-parallel scaffold (pipeline_state.rs), and F4 expert-parallel scaffold (expert_state.rs). Real production NCCL collectives in forward are not yet wired; TP>1 production load fails fast until they are.

Shared runtime contracts that Route A folded back in

  • infer/src/types.rs: request/session identifiers and shared scheduler enums
  • infer/src/events.rs: engine event schema and sink trait
  • infer/src/scheduler/policy.rs: admission/chunking/eviction policy traits
  • infer/src/server_engine.rs: unified InferenceEngine trait — the old agent_engine.rs duplicate facade was deleted and its responsibilities collapsed into server_engine.rs so HTTP and agent CLI share one contract

For the Route-A folding rationale see architecture.md §Route-A Note.

Memory, KV, caching, and batching support

  • infer/src/block_manager.rs: KV block accounting for the batch scheduler
  • crates/cuda-kernels/src/paged_kv.rs: token-level KV pool for CUDA paged attention (page-aware, BF16 page_size=16)
  • infer/src/prefix_cache.rs: radix-tree prefix cache for CUDA/runtime reuse; tier-aware RadixNode metadata (hit_count, tier_location, session_id, fingerprint, soft_pin_until, byte_len) + lookup_or_stage classification contract
  • infer/src/kv_tier.rs + infer/src/kv_tier/{backend,chunk,io,lookup,readmission,coordinator,host_pool,transport,tier,id,policy}.rs: tiered KV cache module (T0 GPU → T1 host pinned → T2 NVMe → T3 remote); local path now combines radix metadata, direct GPU prefix attachment + decode-time COW in paged_kv, HostPinnedPool (kv-native-sys arena) for T1 demotion, ReadmissionPlan + WaitingFetch + promote_fetched_prefix for staged readmission, Coordinator-driven fetch/store queues, transport/disk.rs for node-local T2, transport/shared_fs.rs for a minimal cluster-shared backend, and ServerMetrics queue/backpressure gauges for the live fetch/store path. NIXL remains stub-only.
  • infer/src/memory_planner.rs: memory planning helpers
  • crates/cuda-kernels/src/graph_pool.rs: CUDA graph capture/reuse support
  • crates/cuda-kernels/src/tilelang.rs: paged-KV metadata staging for TileLang
  • infer/src/backend/metal/kv_pool.rs
  • infer/src/backend/metal/prefix_cache.rs
  • infer/src/backend/metal/gdr.rs
  • infer/src/backend/metal/request_state.rs: resumable Metal request state layer for Qwen3.5 (prefill in chunks, one-step decode, deterministic cleanup); M0.2a landed locally 2026-04-15

Models, kernels, and numerics

  • infer/src/model.rs: ModelForward, GenerationState, decode-context abstractions
  • infer/src/model/qwen3.rs
  • infer/src/model/qwen35.rs
  • infer/src/model/layer_communicator.rs: F0.8 model-level communicator skeleton with post_attn_all_reduce / post_mlp_all_reduce / DP-attention-gather hooks; single-rank no-op, production multi-rank guarded until real collectives ship
  • supporting files under infer/src/model/
  • infer/src/ops.rs and infer/src/ops/*
  • crates/cuda-kernels/src/tensor.rs: CUDA tensor/device abstractions (DeviceContext, DeviceVec, DeviceMatrix, HiddenStates, RawDevicePtr)
  • infer/src/weight_loader.rs: weight loading
  • infer/src/gguf.rs: GGUF parsing
  • infer/src/quant.rs: quantization metadata + dispatch
  • infer/src/speculative.rs: speculative decoding framework — SpecConfig, DraftMode, TokenProposal, Verifier, persistent per-request draft state, K-token proposals, greedy verifier accounting, bonus-token commit, and live spec counters (Phase 2 plumbing landed; throughput regression tracked in docs/experience/errors/2026-05-01-phase2-real-spec-regression.md; Qwen3.5 Medusa/spec verification additionally requires recurrent accepted-length rollback)
  • infer/src/speculative/cuda.rs: CUDA-side speculative decode integration — draft/verifier state plumbing for the external-draft path
  • infer/src/tensor_parallel.rs: CPU-side TP rank/shard math (used as a library by the tp and distributed modules; not the runtime collective surface)
  • infer/src/tp.rs + infer/src/tp/load_context.rs: TpLoadContext row/column/head shard helpers that drive shard-aware safetensors loading
  • infer/src/tokenizer.rs: tokenizer wrapper

Backends and binaries

  • infer/src/backend.rs: backend traits and shared generate result types
  • infer/src/backend/cpu.rs
  • infer/src/backend/metal.rs
  • runtime/benchmark binaries in infer/src/bin/

4. Extracted crate map

These crates remain independent after Route A:

  • crates/agent: agent session state, tool recovery, turn loop
  • crates/chat: shared protocol parsing/formatting and OpenAI chat types
  • crates/cli: CLI entry, arg parsing, REPL UX
  • crates/tools: builtin tools, sandbox/tool execution, shared tool hooks
  • crates/cuda-kernels: CUDA kernel layer extracted from infer in commit a4e12f5 (2026-04-15). Owns csrc/{attention,gemm,kv,quant,misc}/, tools/tilelang/, Rust FFI, paged_kv, tilelang, graph_pool, tensor, kv_quant, kv_turboquant
  • crates/mlx-sys: MLX C++ bridge for the Metal backend, including vendored MLX qmv kernels used by Qwen3.5 GGUF affine/tiled quant decode
  • crates/kv-native-sys: local persistence layer used by infer/src/kv_tier/transport/disk.rs for local file and content-addressed block object operations; also exports substrate APIs for WAL append/replay, mmap descriptors, and shared-memory descriptors
  • crates/qwen35-spec: shared train↔infer Qwen3.5 config + canonical tensor-name contract + Shard annotations consumed by the F1 sharded loader path
  • crates/deepseek-spec: DeepSeek support is now V4-only for infer/models/dsv4-mini-1B-init. The crate owns DeepSeekV4Config, V4 tensor-name builders, shard annotations, attention operator summaries, and MoE route helpers. infer/src/model/deepseek/* remains the CUDA model scaffold; infer/src/model/deepseek/reference.rs is the CPU-only Rust reference smoke path used by cpu_serve. (The arle train pretrain-dsv4 bootstrap was retired in the 2026-05-18 OPD-only pivot along with the rest of the non-OPD training surfaces.) CUDA V4 hybrid attention + MoE + MTP kernels remain the active runtime blockers. DS4 is the #1 next-model priority (ROADMAP §Next-Model Priority Order)

Current dependency direction:

workspace root package
  -> cli
     -> infer
     -> agent
     -> chat
     -> tools

agent
  -> infer
  -> chat

infer
  -> chat
  -> cuda-kernels  (one-way; never the reverse)
  -> mlx-sys (feature = "metal")

5. Tests and validation map

Rust integration tests in infer/tests/

  • scheduler/runtime: e2e.rs, e2e_qwen35.rs, greedy_consistency.rs
  • GGUF/quantization/kernel regressions: q4k_kernel_correctness.rs, ground_truth_q4k.rs, smoke_*
  • golden/test-data tooling: regen_test_data.rs, gen_test_data_35.rs

Bench and helper entrypoints

  • scripts/bench_guidellm.sh: canonical throughput / latency sweep wrapper
  • scripts/bench_throughput.py: legacy helper for narrower synthetic/sharegpt runs; not canonical throughput / latency truth
  • scripts/bench_agent_trace.py: agent-style trace replay
  • infer/src/bin/metal_bench.rs: Metal micro/macro benchmark entrypoint

6. Where to start reading

  • Backend loading / model discovery: start at infer/src/hf_hub.rs for resolve_model_source, then infer/src/server_engine.rs for LoadedInferenceEngine::load and the InferenceEngine trait, then infer/src/backend/cuda/bootstrap.rs for the CUDA bring-up
  • CUDA serving path: infer/src/main.rsinfer/src/http_server.rsinfer/src/scheduler/cuda/
  • Agent CLI path: src/main.rscrates/cli/src/lib.rsinfer/src/server_engine.rscrates/agent/src/lib.rs