Releases · AIdevsmartdata/chimere

v0.1.0 — M1 Multi-slot Native (2026-04-25)

M1 Multi-slot (2026-04-24)

Ships the scaffolding, FFI primitives and documentation required to
serve N concurrent /v1/chat/completions requests through a single
chimere-server process. The legacy Mutex<AppStateModel> path is
byte-for-byte unchanged — the new serving path activates only when
CHIMERE_MULTISLOT is explicitly set to >= 2.

See chimere-server/ARCHITECTURE.md §"M1 Multi-slot + Continuous
Batching (Apr 2026)" for the end-to-end design, per-seq / per-slot
data ownership, batch-construction invariants, and the MTP gating
policy. See ~/Bureau/plan-M1-multislot-2026-04-24.md for the 7-8
day roadmap this work implements.

Added — scaffolding (J1)

slot_scheduler.rs (chimere-server/src/slot_scheduler.rs,
~960 lines across all M1 commits). Contains:
- SchedulerConfig + from_env() — reads CHIMERE_MULTISLOT,
  clamps to [1, 8], and sets enabled = num_slots >= 2.
- Scheduler — owns the admission mpsc channel
  (ADMISSION_QUEUE_CAP = 64 default, overridable via
  CHIMERE_ADMISSION_QUEUE), a SlotPool behind a std::Mutex,
  and a shutdown atomic.
- SlotPool + Slot + SlotState finite-state machine
  (Free | Prefilling { chunks_done } | Generating | Draining).
- BatchBuilder — pure-Rust accumulator matching the LlamaBatch
  layout (toks, pos, n_seq_id, seq_ids, logits,
  slot_emit_indices).
- ScheduledRequest / ScheduledRequestMeta — cheap-clone work
  envelope plus a Box<dyn FnOnce + Send> closure so the
  scheduler stays decoupled from chimere_model::*.
- 5 unit tests: pool bookkeeping, batch layout, config default,
  scheduler-new is a no-op at N=1, admission_tx is cheap-clone.
AppState.scheduler: Option<Arc<Scheduler>> field in
server.rs:329, plus AppState::multislot_active() helper.
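
As a rough illustration of the BatchBuilder accumulator listed above, the sketch below collects rows into parallel vectors named after the fields in these notes. The element types, the push signature, and the reading of slot_emit_indices as (slot, batch row) pairs are assumptions made for this example, not the contents of slot_scheduler.rs.

```rust
/// Hypothetical accumulator. Field names follow the release notes;
/// element types and method signatures are assumptions of this sketch.
#[derive(Debug, Default)]
pub struct BatchBuilder {
    pub toks: Vec<i32>,
    pub pos: Vec<i32>,
    pub n_seq_id: Vec<i32>,
    pub seq_ids: Vec<Vec<i32>>,
    pub logits: Vec<i8>,
    /// Assumed meaning: (slot index, batch row) for each row that asked for logits.
    pub slot_emit_indices: Vec<(usize, usize)>,
}

impl BatchBuilder {
    /// Append one token owned by `slot` / `seq_id` and return its batch row.
    pub fn push(&mut self, slot: usize, token: i32, pos: i32, seq_id: i32, want_logits: bool) -> usize {
        let row = self.toks.len();
        self.toks.push(token);
        self.pos.push(pos);
        self.n_seq_id.push(1); // in this sketch each row belongs to exactly one sequence
        self.seq_ids.push(vec![seq_id]);
        self.logits.push(want_logits as i8);
        if want_logits {
            self.slot_emit_indices.push((slot, row));
        }
        row
    }

    /// Number of rows accumulated so far.
    pub fn len(&self) -> usize {
        self.toks.len()
    }

    pub fn is_empty(&self) -> bool {
        self.toks.is_empty()
    }

    /// Reset the builder so it can be reused for the next decode step.
    pub fn clear(&mut self) {
        *self = Self::default();
    }
}
```

Keeping the builder free of FFI types means the layout can be unit-tested without loading a model, which matches the "pure-Rust" wording above.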

Added — admission queue + dispatcher (J2)

Scheduler::admission_tx() cheap-clone sender and
Scheduler::spawn_workers() OS-thread dispatcher
(chimere-sched-dispatch) that drains the admission channel and
runs each ScheduledRequest.run closure. Per-request queue-wait
ms is logged to stderr on dispatch.
bin/chimere-server.rs:312-342 builds the scheduler iff
SchedulerConfig::is_active(), spawns the dispatcher, and
detaches the JoinHandle (process-lifetime worker).
bin/j2_smoke.rs — 2 concurrent fake-inference closures,
asserts the dispatcher accepts both and interleaves their outputs.
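
The admission path described above (a bounded channel drained by one named OS thread that runs each request's closure) can be sketched with the standard library alone. This is a minimal illustration: the use of std::sync::mpsc::sync_channel, the enqueued_at field, and the returned served-count are assumptions of the sketch, and the real dispatcher detaches its JoinHandle rather than returning it.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;
use std::time::Instant;

/// Boxed work item; the closure hides all model types from the scheduler.
pub type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical envelope: only `run` is named in the notes,
/// `enqueued_at` is added here to measure queue wait.
pub struct ScheduledRequest {
    pub enqueued_at: Instant,
    pub run: Job,
}

/// Spawn a dispatcher thread that drains a bounded admission queue.
/// Returns the sender plus a handle yielding the number of requests served.
pub fn spawn_dispatcher(cap: usize) -> (SyncSender<ScheduledRequest>, thread::JoinHandle<usize>) {
    let (tx, rx) = sync_channel::<ScheduledRequest>(cap);
    let handle = thread::Builder::new()
        .name("chimere-sched-dispatch".to_string())
        .spawn(move || {
            let mut served = 0usize;
            // The loop ends once every sender has been dropped.
            for req in rx {
                eprintln!("queue wait: {} ms", req.enqueued_at.elapsed().as_millis());
                (req.run)();
                served += 1;
            }
            served
        })
        .expect("failed to spawn dispatcher thread");
    (tx, handle)
}
```

A caller clones the sender per HTTP handler, sends a ScheduledRequest, and drops the last sender to let the thread exit. With a bounded channel, send blocks once the queue holds cap pending requests, which gives simple backpressure.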

Added — multi-seq FFI decoder driver (J3)

LlamaForward::forward_multi_seq(&mut self, entries) -> Result<Vec<(i32, Vec<f32>)>, String> —
chimere-server/src/llama_backend.rs. Composes N seq_ids into a
single llama_batch, calls llama_decode once, returns per-entry
logits for entries with request_logits = true. libllama routes
K/V writes to per-seq pages (transformer) or per-seq SSM states
(Mamba / Nemotron-H / qwen3next GDN).
MultiSeqEntry { token, pos, seq_id, request_logits } — input
shape for the above.
LlamaForward::kv_cache_seq_rm_for(seq_id) -> bool —
frees KV pages owned by a finished sequence. The legacy seq_id=0
hard-code at line 1015 is kept for the single-slot path.
LlamaForward::vocab_size() -> usize — public accessor so
callers can slice the Vec<f32> of logits returned by
forward_multi_seq.
bin/j3_smoke.rs — loads a model with n_seq_max = 2,
prefills 2 distinct prompts on seq_id 0 and 1 in one multi-seq
batch, then 10 generate steps alternated via multi-seq batches,
asserts the two token streams diverge.
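
To make the calling pattern concrete, the sketch below shows how a caller might consume the (seq_id, logits) pairs returned by forward_multi_seq and build the next decode step. MultiSeqEntry mirrors the fields listed above; greedy_picks and next_step are hypothetical helpers written for this example and involve no FFI.

```rust
use std::collections::HashMap;

/// Input shape as listed in the release notes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MultiSeqEntry {
    pub token: i32,
    pub pos: i32,
    pub seq_id: i32,
    pub request_logits: bool,
}

/// Hypothetical helper: greedy (argmax) pick per sequence.
/// Rows shorter than `vocab_size` are skipped as malformed.
pub fn greedy_picks(outputs: &[(i32, Vec<f32>)], vocab_size: usize) -> Vec<(i32, i32)> {
    outputs
        .iter()
        .filter_map(|(seq_id, logits)| {
            let row = logits.get(..vocab_size)?;
            let best = row
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))?
                .0;
            Some((*seq_id, best as i32))
        })
        .collect()
}

/// Hypothetical helper: one entry per sequence for the next step,
/// each requesting logits, advancing that sequence's position counter.
pub fn next_step(picks: &[(i32, i32)], next_pos: &mut HashMap<i32, i32>) -> Vec<MultiSeqEntry> {
    picks
        .iter()
        .map(|&(seq_id, token)| {
            let pos = next_pos.entry(seq_id).or_insert(0);
            let entry = MultiSeqEntry { token, pos: *pos, seq_id, request_logits: true };
            *pos += 1;
            entry
        })
        .collect()
}
```

Feeding the result of next_step back into forward_multi_seq would give the alternating generate loop that the j3 smoke describes. Positions are tracked per seq_id because each sequence advances independently.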

Added — chunked prefill + mixed-seq generate (J4)

bin/j4_smoke.rs — chunked prefill of one seq interleaved with
concurrent generate of another seq. Uses forward_multi_seq as the
scheduler will once the HTTP dispatcher rewrite lands. The smoke
proves seq-1's token stream is bit-for-bit identical whether it
runs in isolation or interleaved with a 512-token prefill of
seq-0, and documents the ik_llama qwen3next constraint "no
repeated seq_id within one llama_decode batch".
llama_grammar_apply FFI shim flipped from abort to no-op
(chimere_sampler.cpp) — the original abort() crashed every
fresh build on top of the 2026-04-24 libcommon rebuild.
CHIMERE_SKIP_SAMPLER_INIT environment variable wired for the
smokes (default 0, honours 0/false). Lets targeted repros
bypass the C++ sampler init path.

Added — per-slot sampler + per-slot engram (J5)

SamplerHandle (owned, Send, !Sync, !Clone) —
slot_scheduler.rs. Wraps a *mut c_void returned by
chimere_sampler_alloc_with_dry; Drop calls
chimere_sampler_free_handle. One handle per active slot →
logit_bias maps, DRY histories and repetition counters are
per-slot, no cross-slot leakage by construction.
EngramHandle { lookup: Arc<MultiEngramLookup>, alpha: f32 } —
cheap-clone. Engram tables (mmap'd .engr files) stay global;
only the alpha is per-slot.
Slot::apply_engram_bias_to_sampler() — implements the
production formula alpha * ln(prob + 1e-10) identically to
mtp_scheduler.rs, so the multi-slot path is numerically
equivalent to the single-slot path on the same prompt.
SlotPool::alloc_samplers_with_dry() — allocates N
independent samplers at scheduler boot; rolls back to "no sampler"
if any slot's alloc fails.
SlotPool::attach_engram{,_per_slot}() — attaches a single
shared lookup Arc (homogeneous deployment) or one lookup per slot
(multi-tenant, useful for tests and kine / cyber / research domain
split).
FFI helpers on chimere_sampler.cpp:
chimere_sampler_alloc_with_dry,
chimere_sampler_set_engram_bias_handle,
chimere_sampler_set_logit_bias_handle,
chimere_sampler_clear_engram_bias_handle,
chimere_sampler_reset_handle,
chimere_sampler_free_handle.
LlamaForward::sample_slot_with_logprobs(sampler, idx) —
per-slot sample from a given batch index (used by the
scheduler's per-slot sample step).
bin/j5_smoke.rs — two slots with DIFFERENT in-memory engram
tables, asserts target_a never appears in slot 1's top-5 and
target_b never appears in slot 0's top-5 (no cross-slot
logit_bias leakage).
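
The bias formula quoted above, alpha * ln(prob + 1e-10), and the per-slot ownership of the resulting logit_bias map can be illustrated as follows. SlotBias and its methods are hypothetical names for this sketch; only the formula itself comes from these notes.

```rust
use std::collections::HashMap;

/// alpha * ln(prob + 1e-10), as quoted in the release notes.
/// The epsilon keeps the logarithm finite when prob is 0.
pub fn engram_bias(alpha: f32, prob: f32) -> f32 {
    alpha * (prob + 1e-10).ln()
}

/// Hypothetical per-slot state: every slot owns its alpha and its bias map,
/// so writes for one slot cannot reach another slot's sampler.
pub struct SlotBias {
    pub alpha: f32,
    pub logit_bias: HashMap<u32, f32>,
}

impl SlotBias {
    pub fn new(alpha: f32) -> Self {
        Self { alpha, logit_bias: HashMap::new() }
    }

    /// Replace this slot's biases with ones derived from
    /// `(token id, probability)` predictions of an engram lookup.
    pub fn apply(&mut self, predictions: &[(u32, f32)]) {
        self.logit_bias.clear();
        for &(token, prob) in predictions {
            self.logit_bias.insert(token, engram_bias(self.alpha, prob));
        }
    }
}
```

Because ln(prob) is at most zero for prob in [0, 1], the bias under a positive alpha is a penalty that shrinks toward zero as the predicted probability approaches 1.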

Fixed — libcommon ABI drift (J5 unblock)

Previously the chimere FFI chimere_sampler.cpp depended on
common_sampler_init from libcommon.a. On 2026-04-24, ik_llama's
sampling.cpp was rebuilt with upstream's autoparser refactor
(rbudget field + reasoning_budget_* params) while the
sampling.h checked out on the chimere branch did not carry those
changes. The linker still resolved common_sampler_init, but the caller
and the library now disagreed on the struct size: init wrote past the
end of the smaller new common_sampler() allocation, and a C++ exception
then propagated through extern "C" into Rust, aborting with
"fatal runtime error: Rust cannot catch foreign exceptions".

Fix: chimere_sampler.cpp no longer calls anything from libcommon.
A minimal in-file sampler is built entirely on libllama.so's stable
llama_sample_* C API (repetition → top-k → top-p → min-p →
temperature → llama_sample_token, greedy when temp ≤ 0).
libcommon.a is no longer linked (ffi/build.rs). The
llama_grammar_apply shim is kept as a defensive no-op.
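
The filter order named above (top-k, top-p, min-p, temperature, then a draw, with greedy selection when temp ≤ 0) can be sketched in plain Rust. This is an illustration of the ordering only: it does not call the llama_sample_* API, it omits the repetition penalty, and SamplerParams, sample, and the caller-supplied uniform number u are assumptions of the sketch.

```rust
#[derive(Debug, Clone, Copy)]
pub struct SamplerParams {
    pub top_k: usize,
    pub top_p: f32,
    pub min_p: f32,
    pub temp: f32,
}

fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (*l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Pick a token id from non-empty `logits`; `u` is a uniform number in [0, 1).
pub fn sample(logits: &[f32], p: SamplerParams, u: f32) -> usize {
    assert!(!logits.is_empty(), "logits must be non-empty");
    // (token id, logit), best first.
    let mut cands: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    cands.sort_by(|a, b| b.1.total_cmp(&a.1));
    if p.temp <= 0.0 {
        return cands[0].0; // greedy
    }
    // top-k: keep the k highest logits (at least one).
    cands.truncate(p.top_k.max(1));
    // top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    let probs = softmax(&cands.iter().map(|c| c.1).collect::<Vec<f32>>());
    let mut cum = 0.0f32;
    let mut keep = cands.len();
    for (i, pr) in probs.iter().enumerate() {
        cum += *pr;
        if cum >= p.top_p {
            keep = i + 1;
            break;
        }
    }
    cands.truncate(keep);
    // min-p: drop candidates below min_p times the best probability.
    let floor = p.min_p.min(1.0) * probs[0];
    let kept: Vec<(usize, f32)> = cands
        .iter()
        .zip(&probs)
        .filter(|(_, pr)| **pr >= floor)
        .map(|(c, _)| *c)
        .collect();
    // temperature, then inverse-CDF draw using `u`.
    let scaled: Vec<f32> = kept.iter().map(|c| c.1 / p.temp).collect();
    let final_probs = softmax(&scaled);
    let mut cum = 0.0f32;
    for (c, pr) in kept.iter().zip(&final_probs) {
        cum += *pr;
        if u < cum {
            return c.0;
        }
    }
    kept.last().expect("best candidate always survives").0
}
```

Passing u in from the caller keeps the function deterministic and easy to test; a real sampler would draw it from its own RNG.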

Full write-up: ~/Bureau/chimere-sampler-unblock-2026-04-24.md.

Added — bench harness + docs (J8)

bin/bench_m1.rs — async reqwest load generator. Reads
BENCH_URL (default :8082, hard-refuse on :8081 so production
is never targeted by accident), BENCH_CONC, BENCH_N,
BENCH_MAX_TOKENS, BENCH_PASS_LABEL, BENCH_BASELINE_TPS.
Emits aggregate tok/s, p50/p95/p99 latency, VRAM delta via
nvidia-smi, and a coarse isolation assertion (no two distinct
prompts returning byte-identical bodies). Exit 5 = ratio below
target; exit 6 = isolation broken.
scripts/bench_m1.sh — one-shot sweep wrapper. Launches three
fresh chimere-server processes on :8082 with
CHIMERE_MULTISLOT = 1 / 2 / 4 and runs bench-m1 against each.
Targets (from §7 of the plan): ≥ 1.7× baseline at 2 slots,
≥ 3.0× at 4 slots. Supports BENCH_SKIP_SERVER=1 for benching a
pre-started process while the HTTP dispatcher rewrite is pending.
reqwest dep (optional, gated on the server feature,
default-features = false + rustls-tls + json — no OpenSSL
headers needed).
ARCHITECTURE.md — new section "M1 Multi-slot + Continuous
Batching (Apr 2026)" (~330 lines) with ASCII data flow,
per-seq/per-slot ownership table, feature-flag semantics, module
responsibility index, slot state machine, batch-construction
invariants, MTP gating policy, and J1-J8 status table.
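
Two of the checks the bench harness reports, latency percentiles and the coarse isolation assertion, can be sketched with the standard library. The function names and the nearest-rank percentile method are assumptions of this example; the ratio targets are the ones quoted above.

```rust
use std::collections::HashMap;

/// Nearest-rank percentile over latencies sorted ascending (milliseconds).
/// Returns None for empty input or a percentile outside 1..=100.
pub fn percentile(sorted_ms: &[u64], pct: usize) -> Option<u64> {
    if sorted_ms.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    // Integer ceil(pct / 100 * n), so the rank is always in 1..=n.
    let rank = (pct * sorted_ms.len() + 99) / 100;
    Some(sorted_ms[rank - 1])
}

/// True when no two distinct prompts produced byte-identical response bodies.
pub fn isolation_ok(responses: &[(&str, &[u8])]) -> bool {
    let mut seen: HashMap<&[u8], &str> = HashMap::new();
    for &(prompt, body) in responses {
        match seen.get(body) {
            Some(&first) if first != prompt => return false,
            _ => {
                seen.insert(body, prompt);
            }
        }
    }
    true
}

/// Throughput targets relative to the single-slot baseline, as quoted above.
pub fn target_ratio(slots: usize) -> Option<f64> {
    match slots {
        2 => Some(1.7),
        4 => Some(3.0),
        _ => None,
    }
}
```

The isolation check is deliberately coarse: identical bodies for the same prompt are allowed, and only a collision across different prompts counts as a failure.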

Feature flag

New environment variable CHIMERE_MULTISLOT: unset / 1
selects the legacy path (production default, behaviour unchanged),
>= 2 activates the admission queue + scheduler.
Values >= 9 are clamped to 8 with a warning.
CHIMERE_ADMISSION_QUEUE overrides the default capacity of the
admission mpsc channel (64 queued requests).
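
The flag semantics above (unset or 1 selects the legacy path, values of 2 or more enable the scheduler, values above 8 are clamped with a warning) can be summarised in a few lines. This is a sketch: from_raw, MAX_SLOTS, and the choice to treat unparsable input as 1 are assumptions, and the real SchedulerConfig may carry more fields.

```rust
const MAX_SLOTS: usize = 8;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchedulerConfig {
    pub num_slots: usize,
    pub enabled: bool,
}

impl SchedulerConfig {
    /// Interpret a raw CHIMERE_MULTISLOT value.
    /// Assumption of this sketch: missing or unparsable input means 1.
    pub fn from_raw(raw: Option<&str>) -> Self {
        let requested = raw
            .and_then(|s| s.trim().parse::<usize>().ok())
            .unwrap_or(1);
        let num_slots = requested.clamp(1, MAX_SLOTS);
        if requested > MAX_SLOTS {
            eprintln!("CHIMERE_MULTISLOT={} clamped to {}", requested, MAX_SLOTS);
        }
        Self { num_slots, enabled: num_slots >= 2 }
    }

    /// Read the flag from the process environment.
    pub fn from_env() -> Self {
        Self::from_raw(std::env::var("CHIMERE_MULTISLOT").ok().as_deref())
    }
}
```

Splitting the parsing from the environment read keeps the clamp logic testable without mutating process-wide state.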

Known status (J8)

Layer                                     Status
Scheduler types + admission               OK
forward_multi_seq FFI                     OK
chunked prefill + concurrent gen          OK
per-slot sampler + engram isolation       OK
HTTP dispatcher → forward_multi_seq       WIP
stop / cancel / disconnect cleanup        PENDING
stress (4c, 8 backlog, 1000-req leak)     PENDING
bench harness                             OK

The HTTP handler still routes through the legacy
state.model.blocking_lock() closure; wiring it through the
scheduler's step loop is the remaining multi-day task. Until that
lands, CHIMERE_MULTISLOT >= 2 is a no-op for request throughput
(the primitives and per-slot isolation are already proven by the

Full CHANGELOG: see ...
