Releases: AIdevsmartdata/chimere
v0.1.0 — M1 Multi-slot Native (M=4)
v0.1.0 — M1 Multi-slot Native (2026-04-25)
M1 Multi-slot (2026-04-24)
Ships the scaffolding, FFI primitives and documentation required to
serve N concurrent /v1/chat/completions requests through a single
chimere-server process. The legacy Mutex<AppStateModel> path is
byte-for-byte unchanged — the new serving path activates only when
CHIMERE_MULTISLOT is explicitly set to >= 2.
See chimere-server/ARCHITECTURE.md §"M1 Multi-slot + Continuous
Batching (Apr 2026)" for the end-to-end design, per-seq / per-slot
data ownership, batch-construction invariants, and the MTP gating
policy. See ~/Bureau/plan-M1-multislot-2026-04-24.md for the 7-8
day roadmap this work implements.
Added — scaffolding (J1)
- `slot_scheduler.rs` (`chimere-server/src/slot_scheduler.rs`,
  ~960 lines across all M1 commits). Contains:
  - `SchedulerConfig` + `from_env()`: reads `CHIMERE_MULTISLOT`,
    clamps to `[1, 8]`, and sets `enabled = num_slots >= 2`.
  - `Scheduler`: owns the admission mpsc channel
    (`ADMISSION_QUEUE_CAP = 64` default, overridable via
    `CHIMERE_ADMISSION_QUEUE`), a `SlotPool` behind a `std::Mutex`,
    and a `shutdown` atomic.
  - `SlotPool` + `Slot` + `SlotState` finite-state machine
    (`Free | Prefilling { chunks_done } | Generating | Draining`).
  - `BatchBuilder`: pure-Rust accumulator matching the `LlamaBatch`
    layout (`toks`, `pos`, `n_seq_id`, `seq_ids`, `logits`,
    `slot_emit_indices`).
  - `ScheduledRequest` / `ScheduledRequestMeta`: cheap-clone work
    envelope plus a `Box<dyn FnOnce + Send>` closure so the
    scheduler stays decoupled from `chimere_model::*`.
  - 5 unit tests: pool bookkeeping, batch layout, config default,
    scheduler-new is a no-op at N=1, `admission_tx` is cheap-clone.
- `AppState.scheduler: Option<Arc<Scheduler>>` field in
  `server.rs:329`, plus `AppState::multislot_active()` helper.
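
The slot state machine listed above can be pictured with a minimal sketch. The four state names come from these notes; the `SlotEvent` type and the specific transitions are assumptions made for illustration and may not match `slot_scheduler.rs`.

```rust
/// Slot lifecycle states, named as in the release notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SlotState {
    Free,
    Prefilling { chunks_done: usize },
    Generating,
    Draining,
}

/// Hypothetical events; the real scheduler's triggers are not documented here.
#[derive(Debug, Clone, Copy)]
enum SlotEvent {
    Admit,
    ChunkDone { last: bool },
    Finished,
    Released,
}

impl SlotState {
    /// Returns the next state, or None when the event is invalid in this state.
    fn on(self, ev: SlotEvent) -> Option<SlotState> {
        match (self, ev) {
            (SlotState::Free, SlotEvent::Admit) => Some(SlotState::Prefilling { chunks_done: 0 }),
            (SlotState::Prefilling { chunks_done }, SlotEvent::ChunkDone { last: false }) => {
                Some(SlotState::Prefilling { chunks_done: chunks_done + 1 })
            }
            (SlotState::Prefilling { .. }, SlotEvent::ChunkDone { last: true }) => {
                Some(SlotState::Generating)
            }
            (SlotState::Generating, SlotEvent::Finished) => Some(SlotState::Draining),
            (SlotState::Draining, SlotEvent::Released) => Some(SlotState::Free),
            _ => None,
        }
    }
}

fn main() {
    let mut state = SlotState::Free;
    let events = [
        SlotEvent::Admit,
        SlotEvent::ChunkDone { last: false },
        SlotEvent::ChunkDone { last: true },
        SlotEvent::Finished,
        SlotEvent::Released,
    ];
    for ev in events {
        state = state.on(ev).expect("invalid transition");
        println!("{ev:?} -> {state:?}");
    }
}
```

Returning `Option` rather than panicking keeps invalid transitions visible to the caller, which is one reasonable choice for a scheduler that must not abort on a misbehaving request.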
Added — admission queue + dispatcher (J2)
- `Scheduler::admission_tx()` cheap-clone sender and
  `Scheduler::spawn_workers()` OS-thread dispatcher
  (`chimere-sched-dispatch`) that drains the admission channel and
  runs each `ScheduledRequest.run` closure. Per-request queue-wait
  ms is logged to stderr on dispatch.
- `bin/chimere-server.rs:312-342` builds the scheduler iff
  `SchedulerConfig::is_active()`, spawns the dispatcher, and
  detaches the JoinHandle (process-lifetime worker).
- `bin/j2_smoke.rs`: 2 concurrent fake-inference closures,
  asserts the dispatcher accepts both and interleaves their outputs.
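
A rough sketch of the admission-queue pattern described above, using only the standard library. The struct layout, the `spawn_dispatcher` helper and the use of `std::sync::mpsc::sync_channel` are assumptions for illustration; the notes only state that a bounded mpsc channel feeds a named OS thread that runs each request's closure and logs the queue wait.

```rust
use std::sync::mpsc::{channel, sync_channel, SyncSender};
use std::thread;
use std::time::Instant;

/// Illustrative work envelope: an enqueue timestamp plus a boxed closure.
struct ScheduledRequest {
    enqueued_at: Instant,
    run: Box<dyn FnOnce() + Send>,
}

/// Hypothetical helper: spawns a named dispatcher thread draining a bounded channel.
/// The thread returns how many requests it served once all senders are dropped.
fn spawn_dispatcher(cap: usize) -> (SyncSender<ScheduledRequest>, thread::JoinHandle<usize>) {
    let (tx, rx) = sync_channel::<ScheduledRequest>(cap);
    let handle = thread::Builder::new()
        .name("chimere-sched-dispatch".to_string())
        .spawn(move || {
            let mut served = 0usize;
            // recv() returns Err once every sender has been dropped.
            while let Ok(req) = rx.recv() {
                eprintln!("queue wait: {} ms", req.enqueued_at.elapsed().as_millis());
                (req.run)();
                served += 1;
            }
            served
        })
        .expect("failed to spawn dispatcher thread");
    (tx, handle)
}

fn main() {
    let (tx, handle) = spawn_dispatcher(64);
    let (out_tx, out_rx) = channel::<u32>();
    for id in 0..2u32 {
        let out = out_tx.clone();
        let req = ScheduledRequest {
            enqueued_at: Instant::now(),
            run: Box::new(move || out.send(id).unwrap()),
        };
        tx.send(req).unwrap();
    }
    // Dropping the senders lets the dispatcher loop and the output iterator end.
    drop(tx);
    drop(out_tx);
    let served = handle.join().unwrap();
    let outputs: Vec<u32> = out_rx.iter().collect();
    println!("served {served} requests, outputs {outputs:?}");
}
```

This sketch joins the thread so the example terminates; the server described above instead detaches the handle and keeps the worker for the lifetime of the process.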
Added — multi-seq FFI decoder driver (J3)
- `LlamaForward::forward_multi_seq(&mut self, entries) -> Result<Vec<(i32, Vec<f32>)>, String>`
  in `chimere-server/src/llama_backend.rs`. Composes N seq_ids into a
  single `llama_batch`, calls `llama_decode` once, returns per-entry
  logits for entries with `request_logits = true`. libllama routes
  K/V writes to per-seq pages (transformer) or per-seq SSM states
  (Mamba / Nemotron-H / qwen3next GDN).
- `MultiSeqEntry { token, pos, seq_id, request_logits }`: input
  shape for the above.
- `LlamaForward::kv_cache_seq_rm_for(seq_id) -> bool`:
  frees KV pages owned by a finished sequence. The legacy seq_id=0
  hard-code at line 1015 is kept for the single-slot path.
- `LlamaForward::vocab_size() -> usize`: public accessor so
  callers can slice the `Vec<f32>` of logits returned by
  `forward_multi_seq`.
- `bin/j3_smoke.rs`: loads a model with `n_seq_max = 2`,
  prefills 2 distinct prompts on seq_id 0 and 1 in one multi-seq
  batch, then 10 generate steps alternated via multi-seq batches,
  asserts the two token streams diverge.
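
To make the input and output shapes above concrete, here is a small sketch of one generate step across two sequences. The `MultiSeqEntry` field names come from these notes; the field types, the `generate_step_entries` and `argmax` helpers, and the reading of the returned `i32` as a seq_id are assumptions. No model is loaded: the decoder output is replaced by hand-written logits.

```rust
/// Field names follow the release notes; the field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct MultiSeqEntry {
    token: i32,
    pos: i32,
    seq_id: i32,
    request_logits: bool,
}

/// One generate step: a single token per active sequence, logits requested for all.
/// Each input tuple is (seq_id, last_sampled_token, next_position).
fn generate_step_entries(active: &[(i32, i32, i32)]) -> Vec<MultiSeqEntry> {
    active
        .iter()
        .map(|&(seq_id, token, pos)| MultiSeqEntry { token, pos, seq_id, request_logits: true })
        .collect()
}

/// Greedy pick: index of the largest logit, or None for an empty slice.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if best.map_or(true, |(_, b)| v > b) {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i)
}

fn main() {
    let entries = generate_step_entries(&[(0, 11, 5), (1, 22, 9)]);
    println!("{entries:?}");

    // Stand-in for the Ok(...) value of forward_multi_seq with a 3-token vocabulary.
    // The i32 is assumed here to identify the sequence.
    let fake_output: Vec<(i32, Vec<f32>)> =
        vec![(0, vec![0.1, 2.5, -1.0]), (1, vec![3.0, 0.2, 0.4])];
    for (seq_id, logits) in &fake_output {
        println!("seq {seq_id}: greedy token {:?}", argmax(logits));
    }
}
```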
Added — chunked prefill + mixed-seq generate (J4)
- `bin/j4_smoke.rs`: chunked prefill of one seq interleaved with
  concurrent generate of another seq. Uses `forward_multi_seq` as the
  scheduler will once the HTTP dispatcher rewrite lands. The smoke
  proves seq-1's token stream is bit-for-bit identical whether it
  runs in isolation or interleaved with a 512-token prefill of
  seq-0, and documents the ik_llama qwen3next constraint "no
  repeated seq_id within one `llama_decode` batch".
- `llama_grammar_apply` FFI shim flipped from `abort` to no-op
  (`chimere_sampler.cpp`): the original `abort()` crashed every
  fresh build on top of the 2026-04-24 libcommon rebuild.
- `CHIMERE_SKIP_SAMPLER_INIT` environment variable wired for the
  smokes (default `0`, honours `0` / `false`). Lets targeted repros
  bypass the C++ sampler init path.
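
Chunked prefill as exercised by the J4 smoke can be sketched as splitting a prompt into fixed-size pieces, each tagged with its starting position, with a generate step for the other sequence scheduled between pieces. The `prefill_chunks` helper and the chunk size of 4 are hypothetical; the notes do not state the chunk size the smoke uses.

```rust
/// Splits a prompt into chunks of at most `chunk_size` tokens.
/// Each element is (start_position, tokens_in_chunk).
fn prefill_chunks(prompt: &[i32], chunk_size: usize) -> Vec<(usize, &[i32])> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    prompt
        .chunks(chunk_size)
        .enumerate()
        .map(|(i, chunk)| (i * chunk_size, chunk))
        .collect()
}

fn main() {
    let prompt: Vec<i32> = (0..10).collect();
    for (start, chunk) in prefill_chunks(&prompt, 4) {
        // seq 0 advances its prefill by one chunk...
        println!("prefill seq 0: positions {start}..{}", start + chunk.len());
        // ...and seq 1 gets one generate step before the next chunk.
        println!("generate seq 1: one token");
    }
}
```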
Added — per-slot sampler + per-slot engram (J5)
- `SamplerHandle` (owned, `Send`, `!Sync`, `!Clone`) in
  `slot_scheduler.rs`. Wraps a `*mut c_void` returned by
  `chimere_sampler_alloc_with_dry`; `Drop` calls
  `chimere_sampler_free_handle`. One handle per active slot →
  logit_bias maps, DRY histories and repetition counters are
  per-slot, no cross-slot leakage by construction.
- `EngramHandle { lookup: Arc<MultiEngramLookup>, alpha: f32 }`:
  cheap-clone. Engram tables (mmap'd `.engr` files) stay global;
  only the alpha is per-slot.
- `Slot::apply_engram_bias_to_sampler()`: implements the
  production formula `alpha * ln(prob + 1e-10)` identically to
  `mtp_scheduler.rs`, so the multi-slot path is numerically
  equivalent to the single-slot path on the same prompt.
- `SlotPool::alloc_samplers_with_dry()`: allocates N
  independent samplers at scheduler boot; rolls back to "no sampler"
  if any slot's alloc fails.
- `SlotPool::attach_engram{,_per_slot}()`: attaches a single
  shared lookup Arc (homogeneous deployment) or one lookup per slot
  (multi-tenant, useful for tests and kine / cyber / research domain
  split).
- FFI helpers on `chimere_sampler.cpp`:
  `chimere_sampler_alloc_with_dry`,
  `chimere_sampler_set_engram_bias_handle`,
  `chimere_sampler_set_logit_bias_handle`,
  `chimere_sampler_clear_engram_bias_handle`,
  `chimere_sampler_reset_handle`,
  `chimere_sampler_free_handle`.
- `LlamaForward::sample_slot_with_logprobs(sampler, idx)`:
  per-slot sample from a given batch index (used by the
  scheduler's per-slot sample step).
- `bin/j5_smoke.rs`: two slots with DIFFERENT in-memory engram
  tables, asserts target_a never appears in slot 1's top-5 and
  target_b never appears in slot 0's top-5 (no cross-slot
  logit_bias leakage).
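
The engram bias formula quoted above, `alpha * ln(prob + 1e-10)`, is simple enough to sketch directly. The function names and the `(token, prob)` input shape are assumptions for illustration; only the formula itself comes from these notes.

```rust
/// Bias added to a token's logit: alpha * ln(prob + 1e-10).
/// The epsilon keeps ln() finite when the engram probability is zero.
fn engram_bias(alpha: f32, prob: f32) -> f32 {
    alpha * (prob + 1e-10).ln()
}

/// Turns hypothetical (token, prob) engram predictions into (token, bias) pairs
/// that a per-slot sampler could consume as logit-bias entries.
fn bias_entries(alpha: f32, predictions: &[(i32, f32)]) -> Vec<(i32, f32)> {
    predictions
        .iter()
        .map(|&(token, prob)| (token, engram_bias(alpha, prob)))
        .collect()
}

fn main() {
    // Two slots with different alphas see different biases for the same table.
    let predictions = [(42, 0.9_f32), (7, 0.05_f32)];
    println!("slot 0: {:?}", bias_entries(0.5, &predictions));
    println!("slot 1: {:?}", bias_entries(1.0, &predictions));
}
```

Because `ln` of a probability is never positive, the bias only penalizes: the less likely the engram table considers a token, the larger the penalty, scaled by the slot's alpha.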
Fixed — libcommon ABI drift (J5 unblock)
Previously the chimere FFI chimere_sampler.cpp depended on
common_sampler_init from libcommon.a. On 2026-04-24, ik_llama's
sampling.cpp was rebuilt with upstream's autoparser refactor
(rbudget field + reasoning_budget_* params) while the
sampling.h checked out on the chimere branch did not carry those
changes. The linker resolved init anyway, init wrote past the end of
the smaller new common_sampler() allocation, and a C++ exception
propagated through extern "C" into Rust — aborting with
fatal runtime error: Rust cannot catch foreign exceptions.
Fix: chimere_sampler.cpp no longer calls anything from libcommon.
A minimal in-file sampler is built entirely on libllama.so's stable
llama_sample_* C API (repetition → top-k → top-p → min-p →
temperature → llama_sample_token, greedy when temp ≤ 0).
libcommon.a is no longer linked (ffi/build.rs). The
llama_grammar_apply shim is kept as a defensive no-op.
Full write-up: ~/Bureau/chimere-sampler-unblock-2026-04-24.md.
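
For orientation, the tail of the sampler chain described above (top-k, then temperature, then a draw, with a greedy shortcut when temp ≤ 0) can be sketched in pure Rust. This is a simplified stand-in rather than the code in `chimere_sampler.cpp`: it omits the repetition, top-p and min-p stages, does not call any `llama_sample_*` function, and takes the random draw `u` in `[0, 1)` as a parameter so the example needs no RNG crate.

```rust
/// Simplified chain: top-k -> temperature softmax -> draw; greedy when temp <= 0.
/// `u` is a caller-supplied uniform sample in [0, 1).
fn sample(logits: &[f32], top_k: usize, temp: f32, u: f32) -> Option<usize> {
    if logits.is_empty() {
        return None;
    }
    // Candidates sorted by descending logit (stable sort keeps index order on ties).
    let mut cand: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    cand.sort_by(|a, b| b.1.total_cmp(&a.1));
    if temp <= 0.0 {
        return Some(cand[0].0); // greedy
    }
    cand.truncate(top_k.max(1)); // top-k
    let max = cand[0].1;
    // Temperature-scaled softmax weights, shifted by the max for stability.
    let weights: Vec<f32> = cand.iter().map(|&(_, l)| ((l - max) / temp).exp()).collect();
    let total: f32 = weights.iter().sum();
    let mut acc = 0.0_f32;
    for (&(idx, _), &w) in cand.iter().zip(&weights) {
        acc += w / total;
        if u < acc {
            return Some(idx);
        }
    }
    // Rounding can leave acc slightly below 1.0; fall back to the last candidate.
    Some(cand[cand.len() - 1].0)
}

fn main() {
    let logits = [0.1_f32, 3.0, 1.0];
    println!("greedy: {:?}", sample(&logits, 40, 0.0, 0.5));
    println!("top-1:  {:?}", sample(&logits, 1, 1.0, 0.99));
    println!("temp=1: {:?}", sample(&logits, 40, 1.0, 0.99));
}
```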
Added — bench harness + docs (J8)
- `bin/bench_m1.rs`: async reqwest load generator. Reads
  `BENCH_URL` (default `:8082`, hard-refuse on `:8081` so production
  is never targeted by accident), `BENCH_CONC`, `BENCH_N`,
  `BENCH_MAX_TOKENS`, `BENCH_PASS_LABEL`, `BENCH_BASELINE_TPS`.
  Emits aggregate tok/s, p50/p95/p99 latency, VRAM delta via
  nvidia-smi, and a coarse isolation assertion (no two distinct
  prompts returning byte-identical bodies). Exit 5 = ratio below
  target; exit 6 = isolation broken.
- `scripts/bench_m1.sh`: one-shot sweep wrapper. Launches three
  fresh `chimere-server` processes on `:8082` with
  `CHIMERE_MULTISLOT = 1 / 2 / 4` and runs `bench-m1` against each.
  Targets (from §7 of the plan): `≥ 1.7×` baseline at 2 slots,
  `≥ 3.0×` at 4 slots. Supports `BENCH_SKIP_SERVER=1` for benching a
  pre-started process while the HTTP dispatcher rewrite is pending.
- `reqwest` dep (optional, gated on the `server` feature,
  `default-features = false` + `rustls-tls` + `json`, so no OpenSSL
  headers are needed).
- `ARCHITECTURE.md`: new section "M1 Multi-slot + Continuous
  Batching (Apr 2026)" (~330 lines) with ASCII data flow,
  per-seq/per-slot ownership table, feature-flag semantics, module
  responsibility index, slot state machine, batch-construction
  invariants, MTP gating policy, and J1-J8 status table.
Feature flag
New environment variable CHIMERE_MULTISLOT: unset / 1
selects the legacy path (production default, behaviour unchanged),
>= 2 activates the admission queue + scheduler.
Values >= 9 are clamped to 8 with a warning.
CHIMERE_ADMISSION_QUEUE overrides the 64-slot admission mpsc
bound.
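
The flag semantics above can be sketched as a small parsing function. The struct fields mirror `num_slots` and `enabled` as mentioned for `SchedulerConfig`; the `config_from` helper and the fallback to 1 for unparsable values are assumptions, since these notes only specify the unset, 1, >= 2 and >= 9 cases.

```rust
#[derive(Debug, PartialEq)]
struct SchedulerConfig {
    num_slots: usize,
    enabled: bool,
}

/// Hypothetical parser for the raw CHIMERE_MULTISLOT value.
/// Unset or unparsable input falls back to 1 (legacy path); values are clamped to [1, 8].
fn config_from(raw: Option<&str>) -> SchedulerConfig {
    let parsed = raw
        .and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(1);
    if parsed > 8 {
        eprintln!("warning: CHIMERE_MULTISLOT={parsed} clamped to 8");
    }
    let num_slots = parsed.clamp(1, 8);
    SchedulerConfig { num_slots, enabled: num_slots >= 2 }
}

fn main() {
    let raw = std::env::var("CHIMERE_MULTISLOT").ok();
    let cfg = config_from(raw.as_deref());
    println!("{cfg:?}");
}
```

Separating the parsing from the environment read keeps the clamp logic testable without mutating process-wide state.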
Known status (J8)
| Layer | Status |
|---|---|
| Scheduler types + admission | OK |
| `forward_multi_seq` FFI | OK |
| chunked prefill + concurrent gen | OK |
| per-slot sampler + engram isolation | OK |
| HTTP dispatcher → `forward_multi_seq` | WIP |
| stop / cancel / disconnect cleanup | PENDING |
| stress (4c, 8 backlog, 1000-req leak) | PENDING |
| bench harness | OK |
The HTTP handler still routes through the legacy
state.model.blocking_lock() closure; wiring it through the
scheduler's step loop is the remaining multi-day task. Until that
lands, CHIMERE_MULTISLOT >= 2 is a no-op for request throughput
(the primitives and per-slot isolation are already proven by the
smoke binaries).
Full CHANGELOG: see ...