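2026-05-08 EOD+18 strategy source of truth:
projects/2026-05-07-arle-master-strategy.md §0.1. The three primary focus axes (user directive, 2026-05-08): Agent workload (W3/W4) + the full quantization stack + speculative decoding (Medusa/EAGLE/DFlash). Deprecated as non-primary: Piecewise prefill graph (Phase 0 KILL 8b4a03b) + canonical 4-shape single-point optimization (6 KILLs, all on the wrong workload). Full quantization stack plan:
plans/M_quant-fp8-w4-magnitude-path.md. The cuBLASLt FP8 smoke test measured 1.88× (below the 2× bar, hence KILL); the cutlass FP8 direct mma smoke test is pending verification (#28). Incremental audit of today's 41+ commits:
architecture-snapshot-2026-05-07-eod.md. This doc remains the structural source of truth; for strategy and today's changes, follow the pointers above.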
2026-05-10 later update: active Qwen3.5 Medusa/spec work is gated by
plans/M_medusa-phase1b-qwen35-v2-snapshot-ring-redesign.md. Older Medusa-ready / A+B notes are historical for Qwen3.6 until recurrent-state accepted-length rollback is licensed for Qwen3.5.
Updated 2026-05-06 after the DSV4 runtime substrate scaffold + nano autograd
training landed (2026-05-05). DSV4 is the #1 next-model priority and Qwen 3.6
the #2 — see
ROADMAP.md §Next-Model Priority Order.
Earlier landings still in scope: F0–F4 multi-GPU scaffold, Phase 2 spec-decode
plumbing, and crates/deepseek-spec/ DS0 scaffold.
This document is the canonical workspace-topology truth: where files live, what each crate owns, and where to start reading. For ownership boundaries and crate-admission governance see architecture.md; support status by surface lives in support-matrix.md.
The repository has four practical layers:
- workspace root package: thin binary wrapper in `src/main.rs` that calls `infer_cli::run()`.
- `infer/`: the runtime-heavy crate. It owns the HTTP server, scheduler, backends, model/runtime modules, and the unified `server_engine::InferenceEngine` contract used by the HTTP server and agent CLI alike.
- `crates/`: reusable control-plane/helper crates around the runtime.
- `docs/`: architecture, plans, research, and implementation notes (single source of truth; the historical `infer/docs/` parallel tree was retired during the 2026-04-25 truth-surface cleanup).
Current workspace members (ownership and boundaries are listed in architecture.md §Package Boundaries):
- workspace root package
- `infer`
- `crates/cuda-kernels`
- `crates/mlx-sys`
- `crates/agent`
- `crates/chat`
- `crates/cli`
- `crates/tools`
- `crates/qwen3-spec`, `crates/qwen35-spec`, `crates/deepseek-spec`
- `crates/autograd`
- `crates/train`
- `crates/kv-native-sys`
src/main.rs
-> infer_cli::run()
-> infer::hf_hub::resolve_model_source() + infer::server_engine::LoadedInferenceEngine::load()
-> infer_agent::AgentSession (uses `dyn InferenceEngine`)
-> infer_tools builtin tools + infer_chat protocol
-> LoadedInferenceEngine dispatches to CUDA / Metal / CPU backend
Key files:
- `src/main.rs`: `arle` binary entrypoint from the root package
- `crates/cli/src/lib.rs`: CLI startup and backend selection
- `crates/cli/src/repl.rs`: REPL loop, slash commands, terminal UX
- `infer/src/server_engine.rs`: unified `InferenceEngine` trait, `CompletionRequest` / `CompletionOutput` / `TokenUsage` / `CompletionStreamDelta` types, and `LoadedInferenceEngine` backend dispatch enum
- `infer/src/hf_hub.rs`: local model discovery + `resolve_model_source`
- `crates/agent/src/lib.rs`: session state, prompt assembly, turn loop
- `crates/tools/src/lib.rs`: builtin tools and shared tool hooks
- `crates/chat/src/lib.rs`: `OpenAiChatMessage` / `OpenAiToolDefinition` wire format + re-exports of the internal `ChatMessage` / `ToolCall` / `ToolDefinition` protocol types from `crate::protocol`
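The flow above ends with one enum forwarding calls to whichever backend was loaded, while callers such as the agent session only see a trait object. A minimal sketch of that enum-dispatch pattern follows. Every name ending in `Sketch`, the `complete` signature, and `run_turn` are hypothetical simplifications; the real `InferenceEngine` trait in `infer/src/server_engine.rs` is not reproduced in this doc and may look different.

```rust
/// Hypothetical, simplified stand-in for the unified engine contract.
trait EngineSketch {
    fn backend_name(&self) -> &'static str;
    fn complete(&mut self, prompt: &str) -> String;
}

struct CpuEngineSketch;
struct MetalEngineSketch;

impl EngineSketch for CpuEngineSketch {
    fn backend_name(&self) -> &'static str {
        "cpu"
    }
    fn complete(&mut self, prompt: &str) -> String {
        format!("[cpu] {}", prompt)
    }
}

impl EngineSketch for MetalEngineSketch {
    fn backend_name(&self) -> &'static str {
        "metal"
    }
    fn complete(&mut self, prompt: &str) -> String {
        format!("[metal] {}", prompt)
    }
}

/// Owns exactly one concrete backend and forwards every call to it.
enum LoadedEngineSketch {
    Cpu(CpuEngineSketch),
    Metal(MetalEngineSketch),
}

impl EngineSketch for LoadedEngineSketch {
    fn backend_name(&self) -> &'static str {
        match self {
            LoadedEngineSketch::Cpu(e) => e.backend_name(),
            LoadedEngineSketch::Metal(e) => e.backend_name(),
        }
    }
    fn complete(&mut self, prompt: &str) -> String {
        match self {
            LoadedEngineSketch::Cpu(e) => e.complete(prompt),
            LoadedEngineSketch::Metal(e) => e.complete(prompt),
        }
    }
}

/// A caller (for example an agent turn loop) depends only on the trait object.
fn run_turn(engine: &mut dyn EngineSketch, prompt: &str) -> String {
    engine.complete(prompt)
}
```

With this shape, adding a backend means adding one enum variant and one match arm per method; callers that hold `&mut dyn EngineSketch` do not change.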
infer/src/main.rs
-> backend/cuda/bootstrap.rs
-> http_server.rs
-> server_engine.rs
-> scheduler/cuda/*
-> model.rs + model/*
-> ops.rs + ops/*
-> crates/cuda-kernels kernels / TileLang / CUDA graph path
Key files:
- `infer/src/main.rs`: CUDA server binary
- `infer/src/backend/cuda/bootstrap.rs`: model loading, runtime config, scheduler bring-up
- `infer/src/http_server.rs` and `infer/src/http_server/openai_v1.rs`: HTTP API
- `infer/src/server_engine.rs`: synchronous/streaming generation façade
- `infer/src/scheduler/cuda/`: production CUDA scheduler implementation
cpu_serve / metal_serve
-> backend/runtime.rs
-> CpuBackend or MetalBackend
-> request streaming through StopChunkProcessor
Key files:
- `infer/src/backend/runtime.rs`: serial runtime handle for non-CUDA backends
- `infer/src/backend/cpu.rs`: development CPU backend
- `infer/src/backend/metal.rs`: Apple Silicon backend via `mlx-sys`
- `infer/src/bin/cpu_serve.rs`
- `infer/src/bin/metal_serve.rs`
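The flow above streams requests through `StopChunkProcessor`. This doc does not describe its internals, so the following is only a sketch of the usual technique for stop sequences in a streamed response: withhold any tail of the output that could still grow into a stop sequence, and cut the stream when one completes. `StopChunkSketch` and its methods are hypothetical names, not the real API.

```rust
/// Hypothetical sketch of stop-sequence handling for streamed text.
struct StopChunkSketch {
    stops: Vec<String>,
    pending: String, // text withheld because it might begin a stop sequence
    finished: bool,
}

impl StopChunkSketch {
    fn new(stops: &[&str]) -> Self {
        Self {
            // Empty stop strings would match everywhere, so they are dropped.
            stops: stops.iter().filter(|s| !s.is_empty()).map(|s| s.to_string()).collect(),
            pending: String::new(),
            finished: false,
        }
    }

    /// Feed one decoded chunk; returns the text that is safe to send now.
    fn push(&mut self, chunk: &str) -> String {
        if self.finished {
            return String::new();
        }
        self.pending.push_str(chunk);

        // A complete stop sequence ends the stream; emit only what precedes the earliest one.
        let earliest = self.stops.iter().filter_map(|s| self.pending.find(s.as_str())).min();
        if let Some(pos) = earliest {
            self.finished = true;
            let out = self.pending[..pos].to_string();
            self.pending.clear();
            return out;
        }

        // Otherwise withhold the longest tail that is a proper prefix of a stop sequence.
        let mut hold = 0;
        for stop in &self.stops {
            for (idx, _) in stop.char_indices().skip(1) {
                if self.pending.ends_with(&stop[..idx]) {
                    hold = hold.max(idx);
                }
            }
        }
        let cut = self.pending.len() - hold;
        let out = self.pending[..cut].to_string();
        self.pending.drain(..cut);
        out
    }

    /// Flush withheld text when generation ends without hitting a stop sequence.
    fn finish(&mut self) -> String {
        self.finished = true;
        std::mem::take(&mut self.pending)
    }
}
```

The withheld tail is what keeps a client from ever seeing a partial stop marker such as `<` that later turns out to be the start of `</s>`.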
crates/train/src/commands/train_opd.rs (substrate landing next milestone)
-> train::server::bind_and_serve_on_thread()
-> std TcpListener control plane on /v1/train/{status,events,stop,save}
-> train::control::TrainingController + ControllerSink
-> SharedSink background worker
-> local JSONL/stdout + optional MLflow / OTLP / W&B export
-> autograd + Trainer<O, C, S> + teacher (in `infer`) + student LoRA
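The `SharedSink background worker` step in the flow above suggests that the training loop hands events to a worker thread rather than writing them inline. A minimal sketch of that hand-off using a standard-library channel follows. `SinkSketch` and its methods are hypothetical; the real sink in `crates/train/src/metrics.rs` also handles JSONL/stdout output and the optional exporters, which are omitted here.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical sketch: events are sent over a channel and handled off-thread.
struct SinkSketch {
    tx: Sender<String>,
    worker: JoinHandle<Vec<String>>,
}

impl SinkSketch {
    fn spawn() -> Self {
        let (tx, rx) = mpsc::channel::<String>();
        let worker = thread::spawn(move || {
            let mut written = Vec::new();
            // The loop ends once every sender has been dropped.
            for event in rx {
                // A real sink would write a JSONL line or call an exporter here.
                written.push(event);
            }
            written
        });
        Self { tx, worker }
    }

    /// Non-blocking from the caller's point of view: only enqueues the event.
    fn emit(&self, event: &str) {
        // A send error only means the worker has already exited; ignored in this sketch.
        let _ = self.tx.send(event.to_string());
    }

    /// Close the channel, wait for the worker, and return what it processed.
    fn shutdown(self) -> Vec<String> {
        let SinkSketch { tx, worker } = self;
        drop(tx);
        worker.join().unwrap_or_default()
    }
}
```

Keeping the slow I/O on the worker side means a stalled exporter delays event delivery but not the training step that emitted the event.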
Scratch pretrain, SFT, GRPO, and multi-turn RL surfaces were retired
in commit bd94c09 (see docs/projects/2026-05-18-opd-only-pivot.md).
Their dispatch sources, supporting modules, and tests have been
deleted from crates/train. The autograd + Trainer + checkpoint codec + tokenizer + LoRA + `/v1/train/*` control plane remain as OPD substrate. `infer` continues to expose the optional `/v1/train/*` proxy when `--train-control-url` is configured.
Key files (surviving the pivot):
- `crates/train/src/commands/env.rs`, `test.rs`, `estimate-memory.rs`: diagnostic surfaces preserved
- `crates/train/src/server.rs`: minimal HTTP control plane for `/v1/train/status|events|stop|save`
- `crates/train/src/control.rs`: shared controller / status state plus recent event ring buffer
- `crates/train/src/metrics.rs`: shared async observability sink, lifecycle/artifact events, MLflow / OTLP / W&B export adapters
- `crates/train/src/trainer.rs`: `Trainer<O, C, S>` skeleton, kept; OPD will provide its own `step_fn`
- `crates/train/src/{checkpoint,cli_args,grad_accum,grad_clip,loss,lora,tokenizer,causal_lm,qwen35,qwen35_checkpoint,model_family}.rs`: substrate kept for OPD
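Two pieces of the control plane listed above are small enough to sketch: mapping the four `/v1/train/*` paths to actions, and a bounded buffer that keeps only the most recent events. All type and function names below are hypothetical, and HTTP methods are deliberately left out because this doc does not state them.

```rust
use std::collections::VecDeque;

/// Hypothetical route enum for the four control-plane paths.
#[derive(Debug, PartialEq)]
enum ControlRouteSketch {
    Status,
    Events,
    Stop,
    Save,
}

/// Map a request path to a route; unknown paths yield `None`.
fn parse_route(path: &str) -> Option<ControlRouteSketch> {
    match path.strip_prefix("/v1/train/")? {
        "status" => Some(ControlRouteSketch::Status),
        "events" => Some(ControlRouteSketch::Events),
        "stop" => Some(ControlRouteSketch::Stop),
        "save" => Some(ControlRouteSketch::Save),
        _ => None,
    }
}

/// Hypothetical bounded buffer that retains only the newest `capacity` events.
struct RecentEventsSketch {
    capacity: usize,
    events: VecDeque<String>,
}

impl RecentEventsSketch {
    fn new(capacity: usize) -> Self {
        let capacity = capacity.max(1); // a zero-sized buffer would drop everything
        Self { capacity, events: VecDeque::with_capacity(capacity) }
    }

    /// Append an event, evicting the oldest one when the buffer is full.
    fn push(&mut self, event: &str) {
        if self.events.len() == self.capacity {
            self.events.pop_front();
        }
        self.events.push_back(event.to_string());
    }

    /// Oldest-to-newest copy, suitable for answering an events query.
    fn snapshot(&self) -> Vec<String> {
        self.events.iter().cloned().collect()
    }
}
```

A fixed capacity keeps memory flat over a long training run, at the cost of losing events older than the window.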
- `infer/src/server_engine.rs`: unified `InferenceEngine` trait, `CompletionRequest` / `CompletionOutput` / `TokenUsage` / `CompletionStreamDelta` types, CUDA generation loop, and the `LoadedInferenceEngine` enum that dispatches to Qwen35/Qwen35Moe (CUDA), `BackendInferenceEngine<MetalBackend>` (Metal), or `BackendInferenceEngine<CpuBackend>` (CPU)
- `infer/src/backend/cuda/bootstrap.rs`: builds CUDA engines and schedulers
- `infer/src/backend/runtime.rs`: serial backend runtime for CPU/Metal
- `infer/src/http_server.rs`: axum wiring for serving
- `infer/src/request_handle.rs`: generic request submission interface
- `infer/src/logging.rs`: default logging init
- `infer/src/metrics.rs`: metrics export surface
- `infer/src/hf_hub.rs`: local model discovery / HuggingFace integration
- `infer/src/model_registry.rs`: model architecture detection
- `infer/src/scheduler/batch.rs`: pure CPU accounting scheduler with lifecycle events
- `infer/src/scheduler/types.rs`: request types, handles, config, queue admission
- `infer/src/scheduler/policy.rs`: admission/chunking/eviction policy traits and defaults
- `infer/src/scheduler/forward_batch.rs`: F0.7 type-only `ForwardBatch` + `IntermediateTensors` PP-proxy slot, present from F0 ahead of pipeline-parallel forward wiring
- `infer/src/scheduler/cuda/`: production CUDA scheduler
- `infer/src/scheduler/cuda/spec_path.rs`: per-step `SpecPath` dispatch that gates the speculative decode verifier micro-batch path through the CUDA execution loop
- `infer/src/backend/metal/scheduler.rs`: Metal scheduling/accounting layer
- `infer/src/distributed.rs` + `infer/src/distributed/{parallel_state,group_coordinator,pipeline_state,expert_state,nccl,init_method}.rs`: F0.1–F0.4 multi-GPU foundation. It covers SGLang-style world / TP / PP / EP / attention-TP/DP/CP / MoE-TP/EP/DP group metadata, a `GroupCoordinator` collective surface (single-rank no-op; wraps the NCCL smoke group for f32 all-reduce, all-gather, broadcast under `--features cuda,nccl`), TCP rendezvous (`TcpStore` / `EnvBootstrap`), the F3 pipeline-parallel scaffold (`pipeline_state.rs`), and the F4 expert-parallel scaffold (`expert_state.rs`). Real production NCCL collectives in forward are not yet wired; TP>1 production load fails fast until they are.
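The "single-rank no-op, fail fast otherwise" behaviour described above can be sketched as follows. This is an illustration under assumptions: `GroupCoordinatorSketch`, `CollectiveErrorSketch`, and `all_reduce_sum` are hypothetical names, and the NCCL-backed smoke path is left out entirely.

```rust
/// Hypothetical error returned when a real collective would be required.
#[derive(Debug, PartialEq)]
enum CollectiveErrorSketch {
    NotWired { world_size: usize },
}

/// Hypothetical stand-in for a process-group handle.
struct GroupCoordinatorSketch {
    rank: usize,
    world_size: usize,
}

impl GroupCoordinatorSketch {
    fn new(rank: usize, world_size: usize) -> Self {
        assert!(world_size >= 1 && rank < world_size, "rank must lie in 0..world_size");
        Self { rank, world_size }
    }

    fn is_first_rank(&self) -> bool {
        self.rank == 0
    }

    /// Sum-reduce `buf` across the group, in place.
    fn all_reduce_sum(&self, buf: &mut [f32]) -> Result<(), CollectiveErrorSketch> {
        if self.world_size == 1 {
            // A sum over one rank is the identity, so the buffer is left untouched.
            let _ = buf;
            return Ok(());
        }
        // Multi-rank groups are refused instead of silently returning partial sums.
        Err(CollectiveErrorSketch::NotWired { world_size: self.world_size })
    }
}
```

Returning an error for `world_size > 1` is the safer default here: an unreduced tensor would still have the right shape and would corrupt outputs without any visible failure.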
- `infer/src/types.rs`: request/session identifiers and shared scheduler enums
- `infer/src/events.rs`: engine event schema and sink trait
- `infer/src/scheduler/policy.rs`: admission/chunking/eviction policy traits
- `infer/src/server_engine.rs`: unified `InferenceEngine` trait; the old `agent_engine.rs` duplicate facade was deleted and its responsibilities collapsed into `server_engine.rs` so HTTP and agent CLI share one contract
For the Route-A folding rationale see architecture.md §Route-A Note.
- `infer/src/block_manager.rs`: KV block accounting for the batch scheduler
- `crates/cuda-kernels/src/paged_kv.rs`: token-level KV pool for CUDA paged attention (page-aware, BF16 `page_size=16`)
- `infer/src/prefix_cache.rs`: radix-tree prefix cache for CUDA/runtime reuse; tier-aware `RadixNode` metadata (`hit_count`, `tier_location`, `session_id`, `fingerprint`, `soft_pin_until`, `byte_len`) + `lookup_or_stage` classification contract
- `infer/src/kv_tier.rs` + `infer/src/kv_tier/{backend,chunk,io,lookup,readmission,coordinator,host_pool,transport,tier,id,policy}.rs`: tiered KV cache module (T0 GPU → T1 host pinned → T2 NVMe → T3 remote); local path now combines radix metadata, direct GPU prefix attachment + decode-time COW in `paged_kv`, `HostPinnedPool` (kv-native-sys arena) for T1 demotion, `ReadmissionPlan` + `WaitingFetch` + `promote_fetched_prefix` for staged readmission, `Coordinator`-driven fetch/store queues, `transport/disk.rs` for node-local T2, `transport/shared_fs.rs` for a minimal cluster-shared backend, and `ServerMetrics` queue/backpressure gauges for the live fetch/store path. NIXL remains stub-only.
- `infer/src/memory_planner.rs`: memory planning helpers
- `crates/cuda-kernels/src/graph_pool.rs`: CUDA graph capture/reuse support
- `crates/cuda-kernels/src/tilelang.rs`: paged-KV metadata staging for TileLang
- `infer/src/backend/metal/kv_pool.rs`
- `infer/src/backend/metal/prefix_cache.rs`
- `infer/src/backend/metal/gdr.rs`
- `infer/src/backend/metal/request_state.rs`: resumable Metal request state layer for Qwen3.5 (prefill in chunks, one-step decode, deterministic cleanup); M0.2a landed locally 2026-04-15
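The tier chain (T0 GPU → T1 host pinned → T2 NVMe → T3 remote), the 16-token page size, and the idea of classifying a prefix lookup by where the data currently lives can be sketched in a few lines. The names `TierSketch`, `LookupSketch`, `classify`, and `pages_for_tokens` are hypothetical; the real `lookup_or_stage` contract carries more state than shown here.

```rust
/// Hypothetical tier enum mirroring the T0..T3 chain named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TierSketch {
    T0Gpu,
    T1HostPinned,
    T2Nvme,
    T3Remote,
}

impl TierSketch {
    /// Next colder tier, or `None` when already at the coldest one.
    fn demote(self) -> Option<TierSketch> {
        match self {
            TierSketch::T0Gpu => Some(TierSketch::T1HostPinned),
            TierSketch::T1HostPinned => Some(TierSketch::T2Nvme),
            TierSketch::T2Nvme => Some(TierSketch::T3Remote),
            TierSketch::T3Remote => None,
        }
    }
}

/// Hypothetical outcome of a prefix lookup.
#[derive(Debug, PartialEq, Eq)]
enum LookupSketch {
    ReadyOnGpu,             // prefix can be attached directly
    NeedsFetch(TierSketch), // prefix exists but must be staged back to the GPU first
    Miss,                   // prefix must be recomputed
}

/// Classify a lookup from the tier where the matched prefix lives, if any.
fn classify(location: Option<TierSketch>) -> LookupSketch {
    match location {
        Some(TierSketch::T0Gpu) => LookupSketch::ReadyOnGpu,
        Some(tier) => LookupSketch::NeedsFetch(tier),
        None => LookupSketch::Miss,
    }
}

/// Page size stated above for the BF16 paged KV pool.
const PAGE_SIZE: usize = 16;

/// Number of pages needed to hold `tokens` tokens, rounding up.
fn pages_for_tokens(tokens: usize) -> usize {
    (tokens + PAGE_SIZE - 1) / PAGE_SIZE
}
```

Separating "found but cold" from "miss" is what lets a scheduler park a request until its fetch completes instead of recomputing the prefix.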
- `infer/src/model.rs`: `ModelForward`, `GenerationState`, decode-context abstractions
- `infer/src/model/qwen3.rs`
- `infer/src/model/qwen35.rs`
- `infer/src/model/layer_communicator.rs`: F0.8 model-level communicator skeleton with `post_attn_all_reduce` / `post_mlp_all_reduce` / DP-attention-gather hooks; single-rank no-op, production multi-rank guarded until real collectives ship
- supporting files under `infer/src/model/`

- `infer/src/ops.rs` and `infer/src/ops/*`
- `crates/cuda-kernels/src/tensor.rs`: CUDA tensor/device abstractions (`DeviceContext`, `DeviceVec`, `DeviceMatrix`, `HiddenStates`, `RawDevicePtr`)
- `infer/src/weight_loader.rs`: weight loading
- `infer/src/gguf.rs`: GGUF parsing
- `infer/src/quant.rs`: quantization metadata + dispatch
- `infer/src/speculative.rs`: speculative decoding framework: `SpecConfig`, `DraftMode`, `TokenProposal`, `Verifier`, persistent per-request draft state, K-token proposals, greedy verifier accounting, bonus-token commit, and live spec counters (Phase 2 plumbing landed; throughput regression tracked in `docs/experience/errors/2026-05-01-phase2-real-spec-regression.md`; Qwen3.5 Medusa/spec verification additionally requires recurrent accepted-length rollback)
- `infer/src/speculative/cuda.rs`: CUDA-side speculative decode integration, with draft/verifier state plumbing for the external-draft path
- `infer/src/tensor_parallel.rs`: CPU-side TP rank/shard math (used as a library by the `tp` and `distributed` modules; not the runtime collective surface)
- `infer/src/tp.rs` + `infer/src/tp/load_context.rs`: `TpLoadContext` row/column/head shard helpers that drive shard-aware safetensors loading
- `infer/src/tokenizer.rs`: tokenizer wrapper
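The K-token proposal, greedy verification, and bonus-token commit mentioned for `infer/src/speculative.rs` follow a well-known pattern, sketched below in its textbook form. `greedy_verify` is a hypothetical function, not the crate's `Verifier`; it assumes the target model has already produced its greedy token at each of the K+1 positions.

```rust
/// Greedy speculative verification for one step.
///
/// `draft` holds the K proposed tokens. `target` holds the target model's
/// greedy token at each of the K+1 positions (one per draft token, plus one
/// more that follows the last draft token).
///
/// Returns the number of accepted draft tokens and the tokens to commit.
fn greedy_verify(draft: &[u32], target: &[u32]) -> (usize, Vec<u32>) {
    assert_eq!(target.len(), draft.len() + 1, "target must cover K+1 positions");

    // Accept the longest prefix on which draft and target agree.
    let accepted = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();

    let mut committed = draft[..accepted].to_vec();
    // On a mismatch this is the target's correction; when every draft token
    // was accepted it is the extra "bonus" token.
    committed.push(target[accepted]);
    (accepted, committed)
}
```

Each step therefore commits between 1 and K+1 tokens. For architectures with recurrent state, committing fewer tokens than were drafted also means rolling that state back to the accepted length, which is the rollback requirement noted above for Qwen3.5.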
- `infer/src/backend.rs`: backend traits and shared generate result types
- `infer/src/backend/cpu.rs`
- `infer/src/backend/metal.rs`
- runtime/benchmark binaries in `infer/src/bin/`
These crates remain independent after Route A:
- `crates/agent`: agent session state, tool recovery, turn loop
- `crates/chat`: shared protocol parsing/formatting and OpenAI chat types
- `crates/cli`: CLI entry, arg parsing, REPL UX
- `crates/tools`: builtin tools, sandbox/tool execution, shared tool hooks
- `crates/cuda-kernels`: CUDA kernel layer extracted from `infer` in commit `a4e12f5` (2026-04-15). Owns `csrc/{attention,gemm,kv,quant,misc}/`, `tools/tilelang/`, Rust FFI, `paged_kv`, `tilelang`, `graph_pool`, `tensor`, `kv_quant`, `kv_turboquant`
- `crates/mlx-sys`: MLX C++ bridge for the Metal backend, including vendored MLX qmv kernels used by Qwen3.5 GGUF affine/tiled quant decode
- `crates/kv-native-sys`: local persistence layer used by `infer/src/kv_tier/transport/disk.rs` for local file and content-addressed block object operations; also exports substrate APIs for WAL append/replay, mmap descriptors, and shared-memory descriptors
- `crates/qwen35-spec`: shared train↔infer Qwen3.5 config + canonical tensor-name contract + `Shard` annotations consumed by the F1 sharded loader path
- `crates/deepseek-spec`: DeepSeek support is now V4-only for `infer/models/dsv4-mini-1B-init`. The crate owns `DeepSeekV4Config`, V4 tensor-name builders, shard annotations, attention operator summaries, and MoE route helpers. `infer/src/model/deepseek/*` remains the CUDA model scaffold; `infer/src/model/deepseek/reference.rs` is the CPU-only Rust reference smoke path used by `cpu_serve`. (The `arle train pretrain-dsv4` bootstrap was retired in the 2026-05-18 OPD-only pivot along with the rest of the non-OPD training surfaces.) CUDA V4 hybrid attention + MoE + MTP kernels remain the active runtime blockers. DSV4 is the #1 next-model priority (ROADMAP §Next-Model Priority Order)
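The shard annotations carried by the spec crates ultimately tell a loader which slice of each tensor a given rank should read. A minimal sketch of that slicing arithmetic follows, assuming an even split and a two-dimensional weight. `ShardAxisSketch`, `shard_range`, and `shard_matrix` are hypothetical, and which tensor dimension the real `Shard` annotations call "row" or "column" is not stated in this doc.

```rust
use std::ops::Range;

/// Hypothetical axis choice: `Row` slices the first dimension, `Column` the second.
enum ShardAxisSketch {
    Row,
    Column,
}

/// Index range owned by `rank` when `dim` is split evenly across `world_size` ranks.
/// Returns `None` for an invalid rank or a dimension that does not divide evenly.
fn shard_range(dim: usize, rank: usize, world_size: usize) -> Option<Range<usize>> {
    if world_size == 0 || rank >= world_size || dim % world_size != 0 {
        return None;
    }
    let chunk = dim / world_size;
    Some(rank * chunk..(rank + 1) * chunk)
}

/// (row range, column range) of a `rows x cols` weight that `rank` should load.
fn shard_matrix(
    rows: usize,
    cols: usize,
    axis: ShardAxisSketch,
    rank: usize,
    world_size: usize,
) -> Option<(Range<usize>, Range<usize>)> {
    match axis {
        ShardAxisSketch::Row => Some((shard_range(rows, rank, world_size)?, 0..cols)),
        ShardAxisSketch::Column => Some((0..rows, shard_range(cols, rank, world_size)?)),
    }
}
```

Rejecting uneven splits keeps every rank's shard the same shape; a loader that needs padding or ragged shards would have to extend this.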
Current dependency direction:
workspace root package
-> cli
-> infer
-> agent
-> chat
-> tools
agent
-> infer
-> chat
infer
-> chat
-> cuda-kernels (one-way; never the reverse)
-> mlx-sys (feature = "metal")
- scheduler/runtime: `e2e.rs`, `e2e_qwen35.rs`, `greedy_consistency.rs`
- GGUF/quantization/kernel regressions: `q4k_kernel_correctness.rs`, `ground_truth_q4k.rs`, `smoke_*`
- golden/test-data tooling: `regen_test_data.rs`, `gen_test_data_35.rs`
- `scripts/bench_guidellm.sh`: canonical throughput / latency sweep wrapper
- `scripts/bench_throughput.py`: legacy helper for narrower synthetic/sharegpt runs; not canonical throughput / latency truth
- `scripts/bench_agent_trace.py`: agent-style trace replay
- `infer/src/bin/metal_bench.rs`: Metal micro/macro benchmark entrypoint
- Backend loading / model discovery: start at `infer/src/hf_hub.rs` for `resolve_model_source`, then `infer/src/server_engine.rs` for `LoadedInferenceEngine::load` and the `InferenceEngine` trait, then `infer/src/backend/cuda/bootstrap.rs` for the CUDA bring-up
- CUDA serving path: `infer/src/main.rs` → `infer/src/http_server.rs` → `infer/src/scheduler/cuda/`
- Agent CLI path: `src/main.rs` → `crates/cli/src/lib.rs` → `infer/src/server_engine.rs` → `crates/agent/src/lib.rs`