Multimodal-native worker + prefix-reuse — collapse voice turn from 15s to 3s on a single laptop #918

@joelteply

airc-queue card

Coordinates work via the AIRC queue substrate (airc#562). Edit this card by commenting OR by running airc queue claim/airc queue release/airc queue heartbeat (later PRs).

{
  "kind": "airc-queue-card-v1",
  "id": "#918",
  "owner": "claude-tab-2",
  "status": "claimed",
  "evidence": "Adopted existing GitHub issue into airc queue.",
  "next_action": "Triage, claim, or close this adopted backlog card."
}

Close this issue when the work is done (status=merged/abandoned).

Original issue body

Pre-adoption body

Stop throwing computation away. Most of the prompt is identical from one request to the next, and Qwen3.5 takes audio and images directly, which removes the STT/TTS stages entirely. Together these collapse end-to-end voice latency from minutes to ~2-3 seconds per turn while running 14 personas in parallel on a single M-series Mac.

Full design: docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md

The thesis

We are not GPU-bound. We are waste-bound. Three composable wins:

  1. Stable-first RAG ordering → llama-server prefix KV cache reuse → ~70× prompt-eval speedup (today: 14k tokens reprocessed per turn → target: ~200, the volatile suffix only)
  2. Multimodal content parts → delete STT/TTS sandwich for Qwen3.5 → 1 model invocation per voice turn instead of 3 (no Whisper, no Kokoro, no ORT-deadlock chain)
  3. Voice LoRA per persona → identity, not signal → the "Maya replied" experience that differentiates from Claude Code / OpenClaw / Aider

Phases (one PR, phased commits)

Each phase is independently mergeable but compounds with the prior. Suggested branch: feature/prefix-reuse-and-multimodal.

Phase 1 — Stable-first RAG ordering (TS only, no dependencies)

  • RAGComposer.assemble returns sections explicitly tagged INVARIANT / SEMI_STABLE / VOLATILE
  • Final concatenation always orders the three regions identically; sorts deterministically within each
  • Every RAG source declares its tier
  • ChatRAGBuilder emits stable-byte-prefix prompts
  • Acceptance: SHA-256 of prompt[:invariant_len] is identical across consecutive turns of the same persona
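
A minimal sketch of the ordering rule in Phase 1, assuming a hypothetical RagSection shape (the issue names the three tiers but not the section type). It sorts by tier, then by a stable source id, so the invariant region comes out the same regardless of the order in which sources respond. Note that invariantLen here counts UTF-16 code units; a byte-level check such as the SHA-256 acceptance test would encode the prefix to UTF-8 first.

```typescript
// Hypothetical types: the tier names come from the issue, the shape does not.
type Tier = "INVARIANT" | "SEMI_STABLE" | "VOLATILE";

interface RagSection {
  sourceId: string; // stable identifier of the RAG source, used as the sort key
  tier: Tier;
  text: string;
}

const TIER_ORDER: Record<Tier, number> = { INVARIANT: 0, SEMI_STABLE: 1, VOLATILE: 2 };

// Plain code-unit comparison, so ordering never depends on the host locale.
function compareIds(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Order by tier first, then by sourceId within each tier.
function assemblePrompt(sections: RagSection[]): { prompt: string; invariantLen: number } {
  const sorted = [...sections].sort(
    (a, b) => TIER_ORDER[a.tier] - TIER_ORDER[b.tier] || compareIds(a.sourceId, b.sourceId),
  );
  const join = (tier: Tier): string =>
    sorted.filter((s) => s.tier === tier).map((s) => s.text + "\n").join("");
  const invariant = join("INVARIANT");
  const prompt = invariant + join("SEMI_STABLE") + join("VOLATILE");
  return { prompt, invariantLen: invariant.length };
}

// Two turns: sources arrive in a different order and the history has grown.
const turn1 = assemblePrompt([
  { sourceId: "history", tier: "VOLATILE", text: "user: hi" },
  { sourceId: "persona", tier: "INVARIANT", text: "You are Maya." },
  { sourceId: "tools", tier: "INVARIANT", text: "Tools: none" },
]);
const turn2 = assemblePrompt([
  { sourceId: "tools", tier: "INVARIANT", text: "Tools: none" },
  { sourceId: "history", tier: "VOLATILE", text: "user: hi\nuser: how are you?" },
  { sourceId: "persona", tier: "INVARIANT", text: "You are Maya." },
]);
const prefixStable =
  turn1.prompt.slice(0, turn1.invariantLen) === turn2.prompt.slice(0, turn2.invariantLen);
```

Comparing the prefix strings directly stands in for the hash comparison in the acceptance criterion; hashing both prefixes and comparing digests would be an equivalent check.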

Phase 2 — Per-persona DMR slot pinning (TS + small Rust)

  • AIProviderRustClient.generateText accepts slot_hint: u32 derived from persona_id
  • DMR adapter passes slot_id in the OpenAI request
  • Acceptance: persona Maya's requests consistently land on the same llama-server slot across turns; DMR logs show prompt processing for ≤200 tokens after first turn warm-up
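
One way to derive a slot hint from persona_id, sketched under assumptions: the function names and the slotCount parameter are hypothetical, and the hash is an FNV-1a-style mix over UTF-16 code units, chosen only because it is deterministic across processes. How the hint is carried in the request is left to the DMR adapter and is not modeled here.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units. Deterministic across
// processes and restarts, which is what slot pinning needs.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// slotCount is an assumption: the number of parallel slots the server exposes.
function slotHintFor(personaId: string, slotCount: number): number {
  if (!Number.isInteger(slotCount) || slotCount <= 0) {
    throw new RangeError("slotCount must be a positive integer");
  }
  return fnv1a32(personaId) % slotCount;
}

// The same persona id always maps to the same slot index.
const mayaSlotA = slotHintFor("maya", 4);
const mayaSlotB = slotHintFor("maya", 4);
```

A hash modulo the slot count can map two personas to the same slot, in which case they evict each other's cached prefix. If that matters, an explicit persona-to-slot assignment table is the alternative design.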

Phase 3 — RAGComposition cache (TS only, depends on Phase 1)

  • Memoize RAGComposer.assemble keyed by (persona_id, room_id, recipe_id, history_tail_msg_ids)
  • TTL 5 min, invalidated by event subscriptions on the keyed inputs
  • Acceptance: cache hit rate >80% on consecutive turns of the same conversation
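
The memoization in Phase 3 could look roughly like the sketch below. The class and method names are hypothetical, invalidation is shown for one keyed input only (the room), and the clock is injectable so that TTL behavior can be exercised without waiting five minutes.

```typescript
interface CompositionKey {
  personaId: string;
  roomId: string;
  recipeId: string;
  historyTailMsgIds: string[];
}

interface CacheEntry<T> {
  value: T;
  expiresAt: number;
  roomId: string;
}

// JSON of a fixed-order tuple gives an unambiguous string key.
function keyOf(k: CompositionKey): string {
  return JSON.stringify([k.personaId, k.roomId, k.recipeId, k.historyTailMsgIds]);
}

class CompositionCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;
  hits = 0;
  misses = 0;

  constructor(ttlMs: number = 5 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  getOrCompute(key: CompositionKey, compute: () => T): T {
    const id = keyOf(key);
    const entry = this.entries.get(id);
    if (entry !== undefined && entry.expiresAt > this.now()) {
      this.hits++;
      return entry.value;
    }
    this.misses++;
    const value = compute();
    this.entries.set(id, { value, expiresAt: this.now() + this.ttlMs, roomId: key.roomId });
    return value;
  }

  // Intended to be called from an event subscription when a room's inputs change.
  invalidateRoom(roomId: string): void {
    this.entries.forEach((entry, id) => {
      if (entry.roomId === roomId) this.entries.delete(id);
    });
  }
}

// Usage with a fake clock: the second call is served from the cache.
let clock = 0;
let assembles = 0;
const cache = new CompositionCache<string>(5 * 60 * 1000, () => clock);
const key: CompositionKey = {
  personaId: "maya",
  roomId: "room-1",
  recipeId: "chat",
  historyTailMsgIds: ["m1", "m2"],
};
const assemble = (): string => {
  assembles++;
  return "assembled prompt";
};
cache.getOrCompute(key, assemble);
cache.getOrCompute(key, assemble);
```

Because the key includes the history tail message ids, a new message produces a new key, so hits in this sketch come from repeated assembly against an unchanged tail.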

Phase 4 — Multimodal content parts (depends on #917 ModelMetadata)

  • LLMAdapter request adds audio_chunks: AudioInput[] and image_inputs: ImageInput[]
  • DMR adapter forwards as OpenAI multimodal content parts
  • MediaArtifactSource checks ModelMetadata.capabilities.supports_audio / supports_vision: if true → attach raw, else → STT/vision-description bridge (fallback path)
  • voice/start pipeline rewires to send audio chunks directly, no Whisper invocation for Qwen3.5 personas
  • Acceptance: voice turn for a Qwen3.5 persona logs zero Whisper invocations and zero Kokoro invocations
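
The capability check in Phase 4 can be sketched as a pure function that either attaches raw media or falls back to a bridge. The content-part shapes follow the OpenAI chat completions format (text, image_url, input_audio); AudioInput, ImageInput, Capabilities, and Bridges are hypothetical shapes, since the issue names these concepts without defining them. Real STT and vision bridges would be asynchronous; they are synchronous here to keep the sketch short.

```typescript
// Content parts in the OpenAI chat completions format.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }
  | { type: "input_audio"; input_audio: { data: string; format: "wav" | "mp3" } };

// Hypothetical shapes, for illustration only.
interface AudioInput { base64: string; format: "wav" | "mp3" }
interface ImageInput { dataUrl: string }
interface Capabilities { supports_audio: boolean; supports_vision: boolean }

// Stand-ins for the fallback path (STT and vision description).
interface Bridges {
  transcribe(audio: AudioInput): string;
  describe(image: ImageInput): string;
}

function buildUserContent(
  text: string,
  audio: AudioInput[],
  images: ImageInput[],
  caps: Capabilities,
  bridges: Bridges,
): ContentPart[] {
  const parts: ContentPart[] = [];
  if (text.length > 0) parts.push({ type: "text", text });
  for (const a of audio) {
    if (caps.supports_audio) {
      // Native path: raw audio goes to the model, no STT call.
      parts.push({ type: "input_audio", input_audio: { data: a.base64, format: a.format } });
    } else {
      parts.push({ type: "text", text: bridges.transcribe(a) });
    }
  }
  for (const img of images) {
    if (caps.supports_vision) {
      parts.push({ type: "image_url", image_url: { url: img.dataUrl } });
    } else {
      parts.push({ type: "text", text: bridges.describe(img) });
    }
  }
  return parts;
}

// Usage: count bridge calls to mirror the "zero Whisper invocations" criterion.
let sttCalls = 0;
const bridges: Bridges = {
  transcribe: () => { sttCalls++; return "[transcript]"; },
  describe: () => "[image description]",
};
const clip: AudioInput = { base64: "UklGRg==", format: "wav" }; // placeholder bytes, not a real clip
const nativeParts = buildUserContent("", [clip], [], { supports_audio: true, supports_vision: true }, bridges);
const sttCallsAfterNative = sttCalls;
const fallbackParts = buildUserContent("", [clip], [], { supports_audio: false, supports_vision: false }, bridges);
```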

Phase 5 — Voice LoRA layer (depends on Phase 4 + existing genome paging)

  • Persona entity gains voiceAdapterId: AdapterId
  • Genome registry treats voice LoRAs as an adapter category alongside skill LoRAs
  • LoRA pages in before the voice turn's first audio chunk
  • Acceptance: a persona's audio output is recognizably distinct from another persona's, and the voice persists across sessions

Phase 6 — Voice LoRA marketplace (follow-up PR, not blocking)

  • HuggingFace publishing with continuum:voice-lora tag
  • Browse / preview / pull commands in CLI
  • Attribution + license preserved

Acceptance for the whole PR

A persona named Maya, voice LoRA loaded, on M5, in a LiveKit room with 6 personas active, processing a voice turn:

  • Prompt sent to DMR has byte-identical prefix to her last turn
  • DMR slot logs show prompt processing progress for ≤200 tokens, not 14k
  • No Whisper invocation logged for this turn
  • No Kokoro invocation logged for this turn
  • Audio output published to LiveKit within 3s of audio input arrival
  • Audio output is recognizably Maya's voice (LoRA loaded, perceptible character)
  • gpu/stats shows resident memory <8 GB across 6 active personas (vs 20+ GB / swap state today)
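
The measurable items in this checklist could be evaluated per turn with something like the sketch below. TurnReport and its field names are hypothetical, not an existing API, and the sample values are made up for illustration rather than measured. The "recognizably Maya's voice" item is a perceptual check and is deliberately left out.

```typescript
// Hypothetical per-turn report gathered from DMR logs, voice pipeline logs, and gpu/stats.
interface TurnReport {
  prefixMatchesPreviousTurn: boolean;
  promptTokensProcessed: number;
  whisperInvocations: number;
  kokoroInvocations: number;
  audioInAtMs: number;
  audioOutAtMs: number;
  residentMemoryGb: number;
}

// Returns one message per failed criterion; an empty array means the turn passes.
function acceptanceFailures(r: TurnReport): string[] {
  const failures: string[] = [];
  if (!r.prefixMatchesPreviousTurn) failures.push("prompt prefix differs from last turn");
  if (r.promptTokensProcessed > 200) {
    failures.push(`prompt processing covered ${r.promptTokensProcessed} tokens (limit 200)`);
  }
  if (r.whisperInvocations !== 0) failures.push("Whisper was invoked");
  if (r.kokoroInvocations !== 0) failures.push("Kokoro was invoked");
  const latencyMs = r.audioOutAtMs - r.audioInAtMs;
  if (latencyMs > 3000) failures.push(`audio out after ${latencyMs} ms (limit 3000)`);
  if (r.residentMemoryGb >= 8) failures.push(`resident memory ${r.residentMemoryGb} GB (limit <8)`);
  return failures;
}

// Illustrative values only.
const sampleTurn: TurnReport = {
  prefixMatchesPreviousTurn: true,
  promptTokensProcessed: 180,
  whisperInvocations: 0,
  kokoroInvocations: 0,
  audioInAtMs: 1000,
  audioOutAtMs: 3400,
  residentMemoryGb: 7.2,
};
const failures = acceptanceFailures(sampleTurn);
```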

Why this is the PR

  • Everything else we shipped today (Candle eager-load fix, RAG budget cap, embedding throttle) was triage. This is the architecture that turns triage into a system.
  • The MacBook Air case (8 GB RAM, no GPU toggles) becomes plausible because we stop allocating a full-model-context KV cache for every slot.
  • The differentiation from Claude Code / OpenClaw / Aider — voice + face + memory + parallel personas — becomes real, not just claimed.

Dependencies
