Releases: AIdevsmartdata/chimere

v0.1.0 — M1 Multi-slot Native (M=4)

25 Apr 17:46

v0.1.0 — M1 Multi-slot Native (2026-04-25)

M1 Multi-slot (2026-04-24)

Ships the scaffolding, FFI primitives and documentation required to
serve N concurrent /v1/chat/completions requests through a single
chimere-server process. The legacy Mutex<AppStateModel> path is
byte-for-byte unchanged — the new serving path activates only when
CHIMERE_MULTISLOT is explicitly set to >= 2.

See chimere-server/ARCHITECTURE.md §"M1 Multi-slot + Continuous
Batching (Apr 2026)" for the end-to-end design, per-seq / per-slot
data ownership, batch-construction invariants, and the MTP gating
policy. See ~/Bureau/plan-M1-multislot-2026-04-24.md for the 7-8
day roadmap this work implements.

Added — scaffolding (J1)

  • slot_scheduler.rs (chimere-server/src/slot_scheduler.rs,
    ~960 lines across all M1 commits). Contains:
    • SchedulerConfig + from_env() — reads CHIMERE_MULTISLOT,
      clamps to [1, 8], and sets enabled = num_slots >= 2.
    • Scheduler — owns the admission mpsc channel
      (ADMISSION_QUEUE_CAP = 64 default, overridable via
      CHIMERE_ADMISSION_QUEUE), a SlotPool behind a std::Mutex,
      and a shutdown atomic.
    • SlotPool + Slot + SlotState finite-state machine
      (Free | Prefilling { chunks_done } | Generating | Draining).
    • BatchBuilder — pure-Rust accumulator matching the LlamaBatch
      layout (toks, pos, n_seq_id, seq_ids, logits,
      slot_emit_indices).
    • ScheduledRequest / ScheduledRequestMeta — cheap-clone work
      envelope plus a Box<dyn FnOnce + Send> closure so the
      scheduler stays decoupled from chimere_model::*.
    • 5 unit tests: pool bookkeeping, batch layout, config default,
      scheduler-new is a no-op at N=1, admission_tx is cheap-clone.
  • AppState.scheduler: Option<Arc<Scheduler>> field in
    server.rs:329, plus AppState::multislot_active() helper.
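
The BatchBuilder bullet above lists the parallel arrays it fills. As a rough illustration, a minimal sketch of such an accumulator is shown below. The field names come from the list above, but the element types, the push signature, and the (slot, row) shape of slot_emit_indices are assumptions made for this example; the real type in slot_scheduler.rs may differ.

```rust
/// Hypothetical accumulator mirroring the field list in the release notes.
/// Types and method signatures are assumptions, not the shipped API.
#[derive(Debug, Default)]
pub struct BatchBuilder {
    pub toks: Vec<i32>,
    pub pos: Vec<i32>,
    pub n_seq_id: Vec<i32>,
    pub seq_ids: Vec<Vec<i32>>,
    pub logits: Vec<i8>,
    /// (slot index, batch row) for every row that requested logits.
    pub slot_emit_indices: Vec<(usize, usize)>,
}

impl BatchBuilder {
    /// Appends one token row owned by `slot` and returns its row index.
    pub fn push(
        &mut self,
        slot: usize,
        token: i32,
        pos: i32,
        seq_id: i32,
        want_logits: bool,
    ) -> usize {
        let row = self.toks.len();
        self.toks.push(token);
        self.pos.push(pos);
        // Each row belongs to exactly one sequence in this sketch.
        self.n_seq_id.push(1);
        self.seq_ids.push(vec![seq_id]);
        self.logits.push(want_logits as i8);
        if want_logits {
            self.slot_emit_indices.push((slot, row));
        }
        row
    }

    /// Number of token rows accumulated so far.
    pub fn len(&self) -> usize {
        self.toks.len()
    }

    /// True when no rows have been pushed.
    pub fn is_empty(&self) -> bool {
        self.toks.is_empty()
    }
}
```

Keeping the builder free of FFI types means the layout can be unit-tested without loading a model, which fits the "pure-Rust accumulator" description.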

Added — admission queue + dispatcher (J2)

  • Scheduler::admission_tx() cheap-clone sender and
    Scheduler::spawn_workers() OS-thread dispatcher
    (chimere-sched-dispatch) that drains the admission channel and
    runs each ScheduledRequest.run closure. Per-request queue-wait
    ms is logged to stderr on dispatch.
  • bin/chimere-server.rs:312-342 builds the scheduler iff
    SchedulerConfig::is_active(), spawns the dispatcher, and
    detaches the JoinHandle (process-lifetime worker).
  • bin/j2_smoke.rs — 2 concurrent fake-inference closures,
    asserts the dispatcher accepts both and interleaves their outputs.

Added — multi-seq FFI decoder driver (J3)

  • LlamaForward::forward_multi_seq(&mut self, entries) -> Result<Vec<(i32, Vec<f32>)>, String>
    in chimere-server/src/llama_backend.rs. Composes N seq_ids into a
    single llama_batch, calls llama_decode once, returns per-entry
    logits for entries with request_logits = true. libllama routes
    K/V writes to per-seq pages (transformer) or per-seq SSM states
    (Mamba / Nemotron-H / qwen3next GDN).
  • MultiSeqEntry { token, pos, seq_id, request_logits } — input
    shape for the above.
  • LlamaForward::kv_cache_seq_rm_for(seq_id) -> bool:
    frees the KV pages owned by a finished sequence. The legacy seq_id=0
    hard-code at line 1015 is kept for the single-slot path.
  • LlamaForward::vocab_size() -> usize — public accessor so
    callers can slice the Vec<f32> of logits returned by
    forward_multi_seq.
  • bin/j3_smoke.rs — loads a model with n_seq_max = 2,
    prefills 2 distinct prompts on seq_id 0 and 1 in one multi-seq
    batch, then 10 generate steps alternated via multi-seq batches,
    asserts the two token streams diverge.
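
To make the return shape of forward_multi_seq concrete, here is a small sketch of how a caller might consume a Vec<(i32, Vec<f32>)> result with a greedy pick per sequence. It assumes each returned Vec<f32> holds exactly vocab_size logits for one entry, which the notes above do not state outright; greedy_per_seq and argmax are hypothetical helpers, not part of LlamaForward.

```rust
/// Index of the largest logit in one row, or None for an empty row.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Greedy token choice for every (seq_id, logits) pair.
/// Assumption: each logits row is exactly `vocab_size` long.
pub fn greedy_per_seq(
    rows: &[(i32, Vec<f32>)],
    vocab_size: usize,
) -> Result<Vec<(i32, usize)>, String> {
    rows.iter()
        .map(|(seq_id, logits)| {
            if logits.len() != vocab_size {
                return Err(format!(
                    "seq {}: expected {} logits, got {}",
                    seq_id,
                    vocab_size,
                    logits.len()
                ));
            }
            let tok = argmax(logits).ok_or_else(|| format!("seq {}: empty logits", seq_id))?;
            Ok((*seq_id, tok))
        })
        .collect()
}
```

The length check is where the public vocab_size() accessor would come in: it lets the caller validate or slice the logits before sampling.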

Added — chunked prefill + mixed-seq generate (J4)

  • bin/j4_smoke.rs — chunked prefill of one seq interleaved with
    concurrent generate of another seq. Uses forward_multi_seq as the
    scheduler will once the HTTP dispatcher rewrite lands. The smoke
    proves seq-1's token stream is bit-for-bit identical whether it
    runs in isolation or interleaved with a 512-token prefill of
    seq-0, and documents the ik_llama qwen3next constraint "no
    repeated seq_id within one llama_decode batch".
  • llama_grammar_apply FFI shim flipped from abort to no-op
    (chimere_sampler.cpp) — the original abort() crashed every
    fresh build on top of the 2026-04-24 libcommon rebuild.
  • CHIMERE_SKIP_SAMPLER_INIT environment variable wired for the
    smokes (default 0, honours 0/false). Lets targeted repros
    bypass the C++ sampler init path.
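
The qwen3next constraint quoted above ("no repeated seq_id within one llama_decode batch") lends itself to a cheap pre-flight check. The sketch below reads the constraint literally, as at most one row per seq_id in a batch; that reading, and the helper name first_repeated_seq_id, are assumptions for illustration only.

```rust
use std::collections::HashSet;

/// Returns the first seq_id that appears more than once in the batch,
/// or None when every seq_id is unique.
pub fn first_repeated_seq_id(seq_ids: &[i32]) -> Option<i32> {
    let mut seen = HashSet::new();
    // HashSet::insert returns false when the value was already present.
    seq_ids.iter().copied().find(|id| !seen.insert(*id))
}
```

A scheduler could run a check like this before handing the batch to the decoder and split the batch when it returns Some.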

Added — per-slot sampler + per-slot engram (J5)

  • SamplerHandle (owned, Send, !Sync, !Clone) —
    slot_scheduler.rs. Wraps a *mut c_void returned by
    chimere_sampler_alloc_with_dry; Drop calls
    chimere_sampler_free_handle. One handle per active slot →
    logit_bias maps, DRY histories and repetition counters are
    per-slot, no cross-slot leakage by construction.
  • EngramHandle { lookup: Arc<MultiEngramLookup>, alpha: f32 }:
    cheap-clone. Engram tables (mmap'd .engr files) stay global;
    only the alpha is per-slot.
  • Slot::apply_engram_bias_to_sampler() — implements the
    production formula alpha * ln(prob + 1e-10) identically to
    mtp_scheduler.rs, so the multi-slot path is numerically
    equivalent to the single-slot path on the same prompt.
  • SlotPool::alloc_samplers_with_dry() — allocates N
    independent samplers at scheduler boot; rolls back to "no sampler"
    if any slot's alloc fails.
  • SlotPool::attach_engram{,_per_slot}() — attaches a single
    shared lookup Arc (homogeneous deployment) or one lookup per slot
    (multi-tenant, useful for tests and kine / cyber / research domain
    split).
  • FFI helpers on chimere_sampler.cpp:
    chimere_sampler_alloc_with_dry,
    chimere_sampler_set_engram_bias_handle,
    chimere_sampler_set_logit_bias_handle,
    chimere_sampler_clear_engram_bias_handle,
    chimere_sampler_reset_handle,
    chimere_sampler_free_handle.
  • LlamaForward::sample_slot_with_logprobs(sampler, idx):
    per-slot sample from a given batch index (used by the
    scheduler's per-slot sample step).
  • bin/j5_smoke.rs — two slots with DIFFERENT in-memory engram
    tables, asserts target_a never appears in slot 1's top-5 and
    target_b never appears in slot 0's top-5 (no cross-slot
    logit_bias leakage).
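
The engram bias formula quoted above, alpha * ln(prob + 1e-10), is small enough to show directly. The sketch below is only the arithmetic: the helper names and the (token, prob) input shape are assumptions, and the real Slot::apply_engram_bias_to_sampler() pushes the result into the per-slot sampler over FFI.

```rust
/// Logit bias for one predicted token: alpha * ln(prob + 1e-10).
/// The epsilon keeps ln() finite when prob is 0.
pub fn engram_bias(alpha: f32, prob: f32) -> f32 {
    alpha * (prob + 1e-10).ln()
}

/// Maps (token, probability) predictions to (token, bias) pairs using
/// the slot's own alpha. The input shape is a hypothetical choice.
pub fn engram_biases(alpha: f32, preds: &[(i32, f32)]) -> Vec<(i32, f32)> {
    preds
        .iter()
        .map(|&(tok, prob)| (tok, engram_bias(alpha, prob)))
        .collect()
}
```

Because the bias is a log-probability scaled by alpha, a token predicted with probability 1 gets a bias of about 0, and less likely tokens get increasingly negative values. A slot with alpha = 0 is left unbiased.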

Fixed — libcommon ABI drift (J5 unblock)

Previously the chimere FFI chimere_sampler.cpp depended on
common_sampler_init from libcommon.a. On 2026-04-24, ik_llama's
sampling.cpp was rebuilt with upstream's autoparser refactor
(rbudget field + reasoning_budget_* params) while the
sampling.h checked out on the chimere branch did not carry those
changes. The linker resolved init anyway, init wrote past the end of
the smaller new common_sampler() allocation, and a C++ exception
propagated through extern "C" into Rust — aborting with
fatal runtime error: Rust cannot catch foreign exceptions.

Fix: chimere_sampler.cpp no longer calls anything from libcommon.
A minimal in-file sampler is built entirely on libllama.so's stable
llama_sample_* C API (repetition → top-k → top-p → min-p →
temperature → llama_sample_token, greedy when temp ≤ 0).
libcommon.a is no longer linked (ffi/build.rs). The
llama_grammar_apply shim is kept as a defensive no-op.
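
For readers unfamiliar with the filter order named above, the following is a pure-Rust analogue of the top-k → top-p → min-p → temperature part of the chain, with the greedy shortcut when temp ≤ 0. It is not the code in chimere_sampler.cpp and does not call the llama_sample_* API: the repetition penalty and the final random draw are omitted, and the Params struct and candidates function are hypothetical.

```rust
/// Sampling knobs for this sketch; names are illustrative.
#[derive(Clone, Copy, Debug)]
pub struct Params {
    pub top_k: usize,
    pub top_p: f32,
    pub min_p: f32,
    pub temp: f32,
}

/// Softmax over candidates that are non-empty and sorted descending.
fn softmax(c: &[(usize, f32)]) -> Vec<f32> {
    let max = c[0].1;
    let exps: Vec<f32> = c.iter().map(|&(_, l)| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Returns the surviving (token, probability) candidates, best first.
pub fn candidates(logits: &[f32], p: Params) -> Vec<(usize, f32)> {
    let mut c: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    if c.is_empty() {
        return c;
    }
    c.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Greedy: keep only the best token when temperature is non-positive.
    if p.temp <= 0.0 {
        return vec![(c[0].0, 1.0)];
    }
    // top-k: keep the k highest logits (0 disables the filter here).
    if p.top_k > 0 {
        c.truncate(p.top_k);
    }
    // top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cum = 0.0f32;
    let mut keep = c.len();
    for (i, pr) in softmax(&c).into_iter().enumerate() {
        cum += pr;
        if cum >= p.top_p {
            keep = i + 1;
            break;
        }
    }
    c.truncate(keep);
    // min-p: drop tokens below min_p times the best token's probability.
    let probs = softmax(&c);
    let floor = probs[0] * p.min_p;
    let keep = probs.iter().take_while(|&&pr| pr >= floor).count().max(1);
    c.truncate(keep);
    // temperature: rescale logits, then renormalise.
    for e in c.iter_mut() {
        e.1 /= p.temp;
    }
    let probs = softmax(&c);
    c.iter().zip(probs).map(|(&(tok, _), pr)| (tok, pr)).collect()
}
```

A real sampler would finish by drawing one token from the returned distribution; that step is left out so the sketch needs no random-number crate.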

Full write-up: ~/Bureau/chimere-sampler-unblock-2026-04-24.md.

Added — bench harness + docs (J8)

  • bin/bench_m1.rs — async reqwest load generator. Reads
    BENCH_URL (default :8082, hard-refuse on :8081 so production
    is never targeted by accident), BENCH_CONC, BENCH_N,
    BENCH_MAX_TOKENS, BENCH_PASS_LABEL, BENCH_BASELINE_TPS.
    Emits aggregate tok/s, p50/p95/p99 latency, VRAM delta via
    nvidia-smi, and a coarse isolation assertion (no two distinct
    prompts returning byte-identical bodies). Exit 5 = ratio below
    target; exit 6 = isolation broken.
  • scripts/bench_m1.sh — one-shot sweep wrapper. Launches three
    fresh chimere-server processes on :8082 with
    CHIMERE_MULTISLOT = 1 / 2 / 4 and runs bench-m1 against each.
    Targets (from §7 of the plan): ≥ 1.7× baseline at 2 slots,
    ≥ 3.0× at 4 slots. Supports BENCH_SKIP_SERVER=1 for benching a
    pre-started process while the HTTP dispatcher rewrite is pending.
  • reqwest dep (optional, gated on the server feature,
    default-features = false + rustls-tls + json — no OpenSSL
    headers needed).
  • ARCHITECTURE.md — new section "M1 Multi-slot + Continuous
    Batching (Apr 2026)" (~330 lines) with ASCII data flow,
    per-seq/per-slot ownership table, feature-flag semantics, module
    responsibility index, slot state machine, batch-construction
    invariants, MTP gating policy, and J1-J8 status table.
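
Two of the bench outputs described above, the latency percentiles and the coarse isolation assertion, can be sketched as plain functions. The nearest-rank percentile method and both helper names are assumptions; the notes do not say how bench_m1.rs computes them.

```rust
use std::collections::HashMap;

/// Nearest-rank percentile over latencies that are already sorted
/// ascending. `q` is a whole-number percentile such as 50, 95 or 99.
pub fn percentile(sorted_ms: &[u64], q: usize) -> Option<u64> {
    if sorted_ms.is_empty() {
        return None;
    }
    let n = sorted_ms.len();
    // Integer ceil(q * n / 100), clamped into the valid 1..=n rank range.
    let rank = ((q * n + 99) / 100).clamp(1, n);
    Some(sorted_ms[rank - 1])
}

/// Returns true when no two distinct prompts produced byte-identical
/// response bodies. Input is a list of (prompt, body) pairs.
pub fn isolation_ok(pairs: &[(&str, &[u8])]) -> bool {
    let mut first_prompt: HashMap<&[u8], &str> = HashMap::new();
    for &(prompt, body) in pairs {
        match first_prompt.get(body) {
            // Same body seen earlier under a different prompt: leak suspected.
            Some(&p) if p != prompt => return false,
            _ => {
                first_prompt.insert(body, prompt);
            }
        }
    }
    true
}
```

The isolation check is deliberately coarse: identical bodies for the same prompt are allowed, so deterministic decoding does not trip it.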

Feature flag

New environment variable CHIMERE_MULTISLOT: unset / 1
selects the legacy path (production default, behaviour unchanged),
>= 2 activates the admission queue + scheduler.
Values >= 9 are clamped to 8 with a warning.
CHIMERE_ADMISSION_QUEUE overrides the default 64-entry bound of the
admission mpsc channel.
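
The flag semantics above can be summarised in a short sketch. The struct fields, the from_values helper, and the handling of unparsable or zero values (fall back to the defaults) are assumptions for illustration; only the variable names, the [1, 8] clamp, the enabled = num_slots >= 2 rule, and the default of 64 come from these notes.

```rust
/// Default admission queue bound, as stated in the release notes.
pub const ADMISSION_QUEUE_CAP: usize = 64;

/// Hypothetical shape of the parsed configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchedulerConfig {
    pub num_slots: usize,
    pub admission_queue_cap: usize,
    pub enabled: bool,
}

impl SchedulerConfig {
    /// Pure parsing step, kept separate from the environment so it can
    /// be tested without mutating process state.
    pub fn from_values(multislot: Option<&str>, queue: Option<&str>) -> Self {
        // Assumption: unset or unparsable values select the legacy path.
        let requested = multislot
            .and_then(|s| s.trim().parse::<usize>().ok())
            .unwrap_or(1);
        if requested > 8 {
            eprintln!("warning: CHIMERE_MULTISLOT={requested} clamped to 8");
        }
        let num_slots = requested.clamp(1, 8);
        // Assumption: a zero or unparsable queue bound falls back to 64.
        let admission_queue_cap = queue
            .and_then(|s| s.trim().parse::<usize>().ok())
            .filter(|&n| n > 0)
            .unwrap_or(ADMISSION_QUEUE_CAP);
        Self {
            num_slots,
            admission_queue_cap,
            enabled: num_slots >= 2,
        }
    }

    /// Reads CHIMERE_MULTISLOT and CHIMERE_ADMISSION_QUEUE.
    pub fn from_env() -> Self {
        let m = std::env::var("CHIMERE_MULTISLOT").ok();
        let q = std::env::var("CHIMERE_ADMISSION_QUEUE").ok();
        Self::from_values(m.as_deref(), q.as_deref())
    }
}
```

Under these assumptions, CHIMERE_MULTISLOT=4 yields four slots with the scheduler enabled, while an unset variable or a value of 1 leaves enabled false and the legacy path in place.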

Known status (J8)

Layer                                    Status
Scheduler types + admission              OK
forward_multi_seq FFI                    OK
chunked prefill + concurrent gen         OK
per-slot sampler + engram isolation      OK
HTTP dispatcher → forward_multi_seq      WIP
stop / cancel / disconnect cleanup       PENDING
stress (4c, 8 backlog, 1000-req leak)    PENDING
bench harness                            OK

The HTTP handler still routes through the legacy
state.model.blocking_lock() closure; wiring it through the
scheduler's step loop is the remaining multi-day task. Until that
lands, CHIMERE_MULTISLOT >= 2 is a no-op for request throughput
(the primitives and per-slot isolation are already proven by the


Full CHANGELOG: see ...
