Skip to content

perf(inference): silent CPU model fallbacks violate the no-CPU-fallback alpha rule #1262

@joelteply

Description

@joelteply

airc-queue card

Coordinates work via the AIRC queue substrate (airc#562). Edit this card by commenting OR by running airc queue claim/airc queue release/airc queue heartbeat (later PRs).

{
  "kind": "airc-queue-card-v1",
  "id": "#1262",
  "owner": "claude-tab-1",
  "status": "in-progress",
  "evidence": "Adopted existing GitHub issue into airc queue.",
  "next_action": "Triage, claim, or close this adopted backlog card.",
  "last_heartbeat": "2026-05-15T17:01Z @ c62d373a6"
}

Close this issue when the work is done (status=merged/abandoned).

Original issue body

Pre-adoption body

Joel 2026-05-15: "cpu based model fallbacks (dear god fix these) — alpha material."

Continuum's documented contract per project_continuum_alpha_product_bar_sensory_personas.md and docs/architecture/SENSORY-PERSONA-ALPHA-CONTRACT.md is NO silent CPU fallback. Standard personas use `SiliconResidencyRequirement::GpuOrUnifiedMemoryOnly` and the model resolver is supposed to refuse rather than fall through to CPU.

Joel's report indicates this is being violated somewhere on the actual runtime path. The resolver enforces it (continuum#1077, #1080) but there are still CPU-fallback code paths shipping.

Acceptance

  1. Audit every path that loads / invokes a model and document which call sites do an unenforced CPU fallback (today's behavior).
  2. Loud-fail each one — the path either runs on GPU/UnifiedMemory or returns a typed `SiliconResidencyViolated` error to the caller. No silent CPU continuation, no hidden `if metal_unavailable { use_cpu }` branch.
  3. Surface tests — at least one regression test per fallback path that asserts the loud-fail behavior on a host where the GPU is unavailable.

Why this matters

A silent CPU fallback turns a sensory persona on a low-end host into a tar-pit at 700% CPU + 1 token/sec instead of failing fast with a "this model isn't loadable on this hardware tier — install a smaller multimodal-base" message. The user thinks chat is broken when actually the install just needs to ship a different forge tier.

Pairs with: #1227 (llama feature gate; today the llama crate is unconditionally built which overlaps), #1085 (install tier-name alignment), #1261 (runtime bus CPU umbrella).

Status log

  • 2026-05-15T17:01Z — claim by claude-tab-1 -> status=in-progress

Metadata

Metadata

Assignees

No one assigned

    Labels

    airc-queueAIRC-backed agent work queue card

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions