cklxx/arle

ARLE
Pure-Rust runtime for serving, local agents, On-Policy Distillation, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.

Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog


Quick Start

# Apple Silicon — Homebrew
brew install cklxx/tap/arle

# Apple Silicon or Linux x86_64 — one-line installer
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

# Linux + NVIDIA — Docker, no compile
docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3.5-4B:/model:ro \
  ghcr.io/cklxx/arle:latest serve --backend cuda --model-path /model

# From source (any backend)
cargo build --release --features cuda --bin arle     # Linux + NVIDIA
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle  # Apple Silicon

Full install matrix + uninstall: docs/install.md.

Serve:

arle serve --backend cuda  --model-path /path/to/Qwen3.5-4B --port 8000
arle serve --backend metal --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000

Talk to it (OpenAI-compatible):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)
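
If the `openai` package is not available, the same call can be sketched with the standard library alone. This is a minimal sketch, not part of ARLE: `build_chat_request` and `extract_reply` are hypothetical helper names, and the `/chat/completions` path is assumed from the OpenAI-compatible contract that the client snippet above relies on.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST for an OpenAI-style chat completion."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        # Assumed path: the standard OpenAI chat-completions route under /v1.
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull out the same field the client snippet reads: choices[0].message.content."""
    return response_json["choices"][0]["message"]["content"]


req = build_chat_request("http://localhost:8000/v1", "qwen3.5-4b", "Hello from ARLE")

# To send it against a running server (not executed here):
# with urllib.request.urlopen(req) as resp:
#     print(extract_reply(json.load(resp)))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before a server is up.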

Local agent / self-check:

arle                              # interactive REPL with python/shell tools
arle run --prompt "Summarize this repo" --model-path /path/to/Qwen3.5-4B
arle --doctor --json              # CI-friendly self-check

More copy-paste: examples/.


Status at a glance

| Backend | Platform | Status | Headline |
| --- | --- | --- | --- |
| CUDA | Linux + NVIDIA | Stable | Continuous batching, paged KV, radix-backed reuse, TileLang BF16 attention, CUDA Graph decode. L4 / Qwen3.5-4B BF16 + FP8 KV: 197 tok/s @ c=16 / 4k-in. |
| Metal | Apple Silicon | Beta | Scheduler-backed serving, chunked prefill, replay prefix reuse. Qwen3.6 35B-A3B 4-bit MLX: 85.6 tok/s decode / 385 ms TTFT on M4 Pro 48GB. |
| Metal DFlash | Apple Silicon | Beta (default-on) | Speculative decode for Qwen3.5. Qwen3.5-4B-4bit bit-identical, c=1..8. |
| OPD train (CUDA) | Linux + NVIDIA | Beta | 2.04× faster than HuggingFace TRL GKDTrainer at matched Qwen3-0.6B setup. LoRA-only: 0.140 s/step at 3.9 GB peak, which fits 4 GB consumer cards. Cross-runtime large-teacher path validated end-to-end (Qwen3.5-4B → 0.8B LoRA). See Latest Updates. |
| CPU | Portable | Dev-only | Smoke tests; not a perf target. |

Models: Qwen3.5 family (0.8B / 4B / 30B-A3B / 35B) on CUDA + Metal. Next-model queue: DeepSeek V4 (#1), Qwen 3.6 (#2); see ROADMAP.md.

Authoritative tier matrix: docs/support-matrix.md · docs/stability-policy.md.


Why ARLE

In agent and RL workloads every turn pays a prefill tax: the system prompt, history, and tool results are re-processed on every turn. ARLE treats this as the core problem in both serving and training:

  • Multi-turn KV reuse. Slot-sticky reuse + radix-backed tiered KV (T0 GPU → T1 host → T2 disk → T3 cluster) keep prior-turn KV hot.
  • Paged KV pool. page_size=16 with direct GPU page attach + tail-page CoW for shared prefixes — predictable accounting, cheap prefix sharing.
  • Shared runtime authority. infer, arle, and the OPD training loop share one Rust runtime + model code path — the OPD teacher is the production-serving runtime, not a separate stack.
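
To make the prefill tax concrete, here is a rough sketch of the token accounting with and without prefix reuse. It is an illustration only, not ARLE's scheduler: `shared_prefix_len` and `prefill_tokens` are hypothetical names, and the model assumes each turn's prompt is the full history plus that turn's new tokens.

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading token ids that two sequences have in common."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def prefill_tokens(new_tokens_per_turn: list[int], reuse_prefix: bool) -> int:
    """Total tokens that go through prefill over a whole conversation.

    Without reuse every turn re-processes the entire history; with reuse
    only the tokens added since the previous turn are processed.
    """
    total = 0
    history = 0
    for new in new_tokens_per_turn:
        prompt_len = history + new
        total += new if reuse_prefix else prompt_len
        history = prompt_len
    return total


# A 1000-token system prompt followed by three 200-token turns.
turns = [1000, 200, 200, 200]
print(prefill_tokens(turns, reuse_prefix=False))  # 5200
print(prefill_tokens(turns, reuse_prefix=True))   # 1600
print(shared_prefix_len([1, 2, 3, 4], [1, 2, 9]))  # 2
```

In this toy conversation reuse cuts prefill work by roughly 3×, and the gap widens with every extra turn because the no-reuse cost grows quadratically in conversation length.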

Architecture deep-dive: docs/architecture.md · docs/codebase-map.md.
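
The paged KV pool and tail-page copy-on-write described above can be sketched as reference-counted bookkeeping. This is a simplified model under stated assumptions, not ARLE's Rust implementation: `PagePool` and `Sequence` are hypothetical names, only page ids are tracked, and a real copy-on-write would also copy the tail page's KV data.

```python
PAGE_SIZE = 16  # page_size quoted in the README


class PagePool:
    """Fixed pool of page ids with reference counts."""

    def __init__(self, num_pages: int) -> None:
        self.free = list(range(num_pages))
        self.refcount: dict[int, int] = {}

    def alloc(self) -> int:
        page = self.free.pop()  # raises IndexError when the pool is exhausted
        self.refcount[page] = 1
        return page

    def share(self, page: int) -> int:
        self.refcount[page] += 1
        return page

    def release(self, page: int) -> None:
        self.refcount[page] -= 1
        if self.refcount[page] == 0:
            del self.refcount[page]
            self.free.append(page)


class Sequence:
    """One request's view of the pool: an ordered list of page ids."""

    def __init__(self, pool: PagePool) -> None:
        self.pool = pool
        self.pages: list[int] = []
        self.length = 0

    def fork(self) -> "Sequence":
        """Share every page with a child sequence; nothing is copied yet."""
        child = Sequence(self.pool)
        child.pages = [self.pool.share(p) for p in self.pages]
        child.length = self.length
        return child

    def append_token(self) -> None:
        if self.length % PAGE_SIZE == 0:
            # No pages yet, or the tail page is full: take a fresh page.
            self.pages.append(self.pool.alloc())
        elif self.pool.refcount[self.pages[-1]] > 1:
            # Tail page is shared and partially filled: copy-on-write.
            self.pool.release(self.pages[-1])
            self.pages[-1] = self.pool.alloc()
        self.length += 1


pool = PagePool(num_pages=8)
parent = Sequence(pool)
for _ in range(20):          # 20 tokens -> 2 pages (one full, one partial)
    parent.append_token()
child = parent.fork()        # shares both pages, allocates nothing
child.append_token()         # triggers copy-on-write on the tail page only
```

Full pages stay shared after the fork, so the cost of sharing a prefix is bounded by one page per forked sequence.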


Entry surfaces

arle is the single binary:

| Command | What it does |
| --- | --- |
| arle (no args) | Interactive agent REPL with python and shell tools. |
| arle run --prompt "…" | One-shot agent prompt. Pass --no-tools to disable tools. |
| arle serve --backend … | OpenAI-compatible HTTP server. |
| arle train opd | On-Policy Distillation: teacher in infer, student in train. CUDA path. See the usage manual. |
| arle --doctor [--json] | Backend / hardware / model-resolution self-check. |

Operators wanting only the serving binary can use infer directly — same HTTP contract, without agent / train surfaces.


Latest Updates

2026-05-22 — ARLE OPD pipeline closed end-to-end, GDR prefill bug fixed. 4B BF16 teacher → 0.8B-Base LoRA r=16 student, train→save→load→eval loop in one cycle. ARLE serve cross-validated against HF transformers reference: same Qwen3.5-4B, MMLU 5-shot n=171, 77.33 % vs 78.18 % (Δ +0.85 pp, statistically equivalent) — engines agree.
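
As a rough sanity check on the "statistically equivalent" reading, the gap can be compared with the sampling error of a single accuracy estimate at n=171. This is a back-of-the-envelope sketch, not the project's evaluation script; it assumes the first figure is ARLE and the second the HF reference, and it uses an independent-sample binomial approximation even though both engines answered the same questions.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


n = 171                              # MMLU subset size quoted above
arle_acc, hf_acc = 0.7733, 0.7818    # assumed assignment of the two figures

delta_pp = (hf_acc - arle_acc) * 100
se_pp = binomial_se(arle_acc, n) * 100

print(f"gap = {delta_pp:.2f} pp, one standard error = {se_pp:.2f} pp")
```

The 0.85 pp gap is well under one standard error (about 3.2 pp), which is consistent with the two engines agreeing at this sample size.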

Figure: ARLE OPD distill trajectory (U-curve valley → recovery) and ARLE serve vs HF transformers cross-validation.

  • Bug fixed: arle serve corrupted output on prompts ≥ 33 tokens (Qwen3.5 hybrid GDR chunkwise prefill divergence). After the fix, MMLU recovered 0/171 invalid → 116/150 = 77.3 % (a374108).
  • Pipeline closed: OPD train saves LoRA adapter (PEFT format) → INFER_LORA_PATH loads in CUDA serve → scripts/arle_capability_eval.py produces before/after MMLU table.
  • Distill trajectory + lr sweep (2k steps each): OPD U-curve is fundamental, not just lr-driven.
    • lr=2e-5: base 51.4 % → step 1000 deep valley 47.9 % (−3.5 pp) → step 2000 recovering 50.0 % (+2.1 pp).
    • lr=1e-5: base 51.4 % → step 1000 shallow valley 50.6 % (−0.8 pp) → step 2000 regressed 48.5 % (−2.1 pp from step 1000).
    • Lower lr does not guarantee a shallower valley or a faster recovery: at lr=1e-5 the dip is smaller at step 1000, but accuracy is still falling at step 2000. Neither lr crosses base in 2k steps, so the fix is a longer horizon (10k+ steps) or GKD λ-mixing, not just lr tuning.
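
The per-segment deltas in the sweep above can be re-derived directly from the reported accuracies. The snippet only restates the README's numbers; `sweep` and `segment_deltas` are illustrative names.

```python
# MMLU accuracy (%) at each checkpoint, copied from the bullets above.
sweep = {
    "2e-5": {0: 51.4, 1000: 47.9, 2000: 50.0},
    "1e-5": {0: 51.4, 1000: 50.6, 2000: 48.5},
}


def segment_deltas(points: dict[int, float]) -> list[float]:
    """Change in percentage points between consecutive checkpoints."""
    steps = sorted(points)
    return [round(points[b] - points[a], 1) for a, b in zip(steps, steps[1:])]


for lr, points in sweep.items():
    crossed_base = points[2000] >= points[0]
    print(lr, segment_deltas(points), "crosses base:", crossed_base)
```

Both runs end below the 51.4 % base, and the lr=1e-5 run is the only one still declining in its second segment.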

Evidence: serve fix · pipeline close · U-curve diagnosis · lr sweep KILL · cross-validation · cycle wrap


2026-05-21 — ARLE OPD CUDA: faster + smaller vs HuggingFace TRL. Same Qwen3-0.6B teacher/student, 32 prompts, rollout_len=8, lr=1e-7, 500 steps, AdamW, RTX 4070 Ti SUPER.

Figure: ARLE OPD CUDA vs HuggingFace TRL (speed, memory, held-out KL).

| | TRL GKDTrainer | ARLE full-finetune | ARLE LoRA r=16 |
| --- | --- | --- | --- |
| step time (s) | 0.408 | 0.164 (2.49×) | 0.140 (2.91×) |
| peak GPU memory (GB) | 12.6 | 15.4 | 3.93 (fits 4 GB cards) |
| held-out KL (500 steps) | -5.5 % | -18.5 % | -36.4 % |
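
The speedup factors in the table follow from the step times, and the same arithmetic gives the memory ratios. This snippet only re-derives figures from the reported measurements; the variable names are illustrative.

```python
step_time_s = {
    "TRL GKDTrainer": 0.408,
    "ARLE full-finetune": 0.164,
    "ARLE LoRA r=16": 0.140,
}
peak_mem_gb = {
    "TRL GKDTrainer": 12.6,
    "ARLE full-finetune": 15.4,
    "ARLE LoRA r=16": 3.93,
}

baseline = "TRL GKDTrainer"

# Speedup: baseline step time divided by each configuration's step time.
speedup = {k: round(step_time_s[baseline] / v, 2) for k, v in step_time_s.items()}

# Memory ratio relative to the baseline (above 1.0 means more memory than TRL).
mem_ratio = {k: round(v / peak_mem_gb[baseline], 2) for k, v in peak_mem_gb.items()}

print(speedup)
print(mem_ratio)
```

Note that only the LoRA configuration is smaller than TRL in memory: full-finetune is faster but uses about 1.22× the baseline's peak memory.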

Cross-runtime large-teacher path validated. Qwen3.5-4B BF16 teacher in infer → Qwen3.5-0.8B-Base LoRA r=16 student in train via the InferTeacher device-logits bridge. 200-step real-text run: 5.66 s/step, 14.8 GiB peak, monotonic KL decrease (held-out -2.05%). Cross-runtime overhead was measured at 1.5% of step time, so calling the teacher through infer adds little cost per step.
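
For readers unfamiliar with the held-out KL metric, the following sketch shows one way a mean per-token KL between teacher and student can be computed from raw logits. It is an assumption-laden illustration, not ARLE's training loss: the README does not state the KL direction or any mixing, so the forward direction KL(teacher || student) used here is a guess, and all function names are hypothetical.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(p_logits: list[float], q_logits: list[float]) -> float:
    """KL(P || Q) in nats, where P and Q are softmax(p_logits), softmax(q_logits)."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_token_kl(teacher: list[list[float]], student: list[list[float]]) -> float:
    """Average KL(teacher || student) over token positions (assumed direction)."""
    per_token = [kl_divergence(t, s) for t, s in zip(teacher, student)]
    return sum(per_token) / len(per_token)


# Two token positions over a toy 3-word vocabulary.
teacher_logits = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student_logits = [[1.5, 0.7, -0.5], [0.0, 1.0, 0.0]]
print(mean_token_kl(teacher_logits, student_logits))
```

The second position contributes zero because the two distributions match there, so the mean is driven entirely by the first position.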

End-to-end convergence verified: held-out exact-overlap 50% → 82.8% in 5000 steps (lr=1e-7).

Evidence: docs/projects/2026-05-21-opd-cuda-cycle-wrap.md · usage manual · TRL head-to-head · 4B→0.8B cross-runtime bench.

Full history: CHANGELOG.md.


Documentation map


License

MIT

About

Rust-native inference runtime for Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
