
ModelMetadata refactor: declarative struct, no Option<>, adapter queries its own source #917

@joelteply

Description


airc-queue card

Coordinates work via the AIRC queue substrate (airc#562). Edit this card by commenting, or by running airc queue claim, airc queue release, or airc queue heartbeat (available in later PRs).

{
  "kind": "airc-queue-card-v1",
  "id": "#917",
  "owner": "claude-tab-2",
  "status": "claimed",
  "evidence": "Adopted existing GitHub issue into airc queue.",
  "next_action": "Triage, claim, or close this adopted backlog card."
}

Close this issue when the work is done (status=merged/abandoned).

Original issue body

Pre-adoption body

Background

Joel called out a pattern in our inference layer (2026-04-17 chat session):

"context window limits are defined BY the model as are features such as audio or vision. this is probably why those attempts also failed"
"if you need a var require it / pass the entire struct around / has all the info you need / or grab it"
"make it declarative"

Today we have parallel model-info plumbing:

  • system/shared/ModelContextWindows.ts — TS lookup tables (getContextWindow, getInferenceSpeed, isSlowLocalModel, getLatencyAwareTokenLimit)
  • workers/continuum-core/src/ai/types.rs::ModelInfo — Rust struct with Option<> on max_output_tokens and cost_per_1k_tokens
  • 21 hardcoded ModelInfo {…} constructions across openai_adapter.rs, candle_adapter.rs, anthropic_adapter.rs, embedding.rs — each adapter maintains a static catalog instead of querying its source
  • system/core/src/models/mod.rs — yet another parallel Option<u32> max_output_tokens definition

Symptoms this caused (visible on M5 PR #914 verification today):

  • ChatRAGBuilder computed totalBudget = floor(contextWindow × 0.75). For Qwen3.5-4b's 262k window that is 196k tokens, while RAG actually filled only ~14k per request. Meanwhile llama-server allocated the full 262k KV cache per persona slot: com.docker.llama-server sat at 20.87 GB resident on M5, 44 GB total against 32 GB physical, so the machine swapped.
  • Vision/audio attempts have failed silently when the hardcoded TS table claims a model supports a capability the actual model doesn't.
  • getInferenceSpeed is a TS const, so it fundamentally can't reflect what's measured at runtime.
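The budget arithmetic in the first symptom can be checked with a small std-only Rust sketch. It assumes "262k" means 262,144 tokens and reuses the 0.75 fraction and the ~14k fill reported above; total_budget and budget_fill_ratio are hypothetical names that do not exist in the codebase.

```rust
/// Fraction of the context window the builder currently hands to RAG
/// (the "× 0.75" in the symptom above).
const RAG_BUDGET_FRACTION: f64 = 0.75;

/// totalBudget = floor(contextWindow × 0.75), as computed today.
fn total_budget(context_window: u32) -> u32 {
    (f64::from(context_window) * RAG_BUDGET_FRACTION).floor() as u32
}

/// Fraction of that budget RAG actually filled on a request.
fn budget_fill_ratio(filled_tokens: u32, context_window: u32) -> f64 {
    f64::from(filled_tokens) / f64::from(total_budget(context_window))
}
```

With a 262,144-token window the budget is 196,608 tokens, and a ~14k fill uses roughly 7% of it; per the symptom above, llama-server still sized each persona slot's KV cache for the full window.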

Scope

Single coherent refactor, ~25 files, its own branch. Not to be sprinkled into other PRs.

1. ModelMetadata (replaces ModelInfo), all fields required

use serde::{Deserialize, Serialize};
use ts_rs::TS;

// ModelCapability and CostPer1kTokens are assumed to live alongside this struct in ai/types.rs.
#[derive(Debug, Clone, Serialize, Deserialize, TS)]
#[ts(export, export_to = "../../../shared/generated/ai/ModelMetadata.ts")]
#[serde(rename_all = "camelCase")]
pub struct ModelMetadata {
    pub id: String,
    pub name: String,
    pub provider: String,
    pub capabilities: Vec<ModelCapability>,
    pub context_window: u32,
    pub max_output_tokens: u32,
    pub cost_per_1k_tokens: CostPer1kTokens,  // local = {0,0}
    pub tokens_per_second: f32,
    pub supports_streaming: bool,
    pub supports_tools: bool,
}

No Option<>. A local model's {0,0} cost is still a declaration, not an absence.
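A minimal std-only sketch of why {0,0} works better than Option<>: cost arithmetic is identical for local and hosted models, with no None branch anywhere. The ModelCapability variants, the input/output field names on CostPer1kTokens, local_example, estimate_cost, and all numeric values are hypothetical placeholders; the serde/ts-rs derives are omitted so the sketch compiles on its own.

```rust
/// Hypothetical capability variants for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelCapability {
    Text,
    Vision,
    Audio,
}

/// Hypothetical field names; the issue only fixes the type name.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CostPer1kTokens {
    input: f64,
    output: f64,
}

impl CostPer1kTokens {
    /// Local models declare zero cost rather than omitting it.
    const LOCAL: CostPer1kTokens = CostPer1kTokens { input: 0.0, output: 0.0 };

    fn is_free(&self) -> bool {
        self.input == 0.0 && self.output == 0.0
    }
}

/// Same fields as the issue's ModelMetadata, minus the serde/ts-rs derives.
#[derive(Debug, Clone)]
struct ModelMetadata {
    id: String,
    name: String,
    provider: String,
    capabilities: Vec<ModelCapability>,
    context_window: u32,
    max_output_tokens: u32,
    cost_per_1k_tokens: CostPer1kTokens,
    tokens_per_second: f32,
    supports_streaming: bool,
    supports_tools: bool,
}

/// Placeholder values only; a real adapter would read these from its source.
fn local_example() -> ModelMetadata {
    ModelMetadata {
        id: "example-local-model".to_string(),
        name: "Example Local Model".to_string(),
        provider: "local".to_string(),
        capabilities: vec![ModelCapability::Text],
        context_window: 32_768,
        max_output_tokens: 4_096,
        cost_per_1k_tokens: CostPer1kTokens::LOCAL,
        tokens_per_second: 40.0,
        supports_streaming: true,
        supports_tools: false,
    }
}

/// Same arithmetic for every model: no match on Some/None.
fn estimate_cost(model: &ModelMetadata, input_tokens: u32, output_tokens: u32) -> f64 {
    let c = model.cost_per_1k_tokens;
    (input_tokens as f64 / 1000.0) * c.input + (output_tokens as f64 / 1000.0) * c.output
}
```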

2. Adapters query their source, not hardcoded vec![ModelInfo {…}]

  • DMR: GET http://localhost:12434/engines/v1/models returns the live catalog. docker model inspect <id> exposes GGUF metadata for fields the catalog doesn't.
  • OpenAI / Anthropic / DeepSeek / etc.: query their /v1/models endpoint and cache the result in the adapter's initialize().
  • Candle: GGUF metadata directly from the loaded file.

Delete the 21 hardcoded literals.
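One way to keep Option<> out of the struct is to resolve missing data once, at the source boundary, and fail loudly there instead of passing absence downstream. The sketch below assumes a model's GGUF metadata has already been read into a string map (real GGUF values are typed) and follows the GGUF key convention general.architecture, general.name, and <arch>.context_length. CatalogError, require, and from_gguf_kv are hypothetical names, and ModelMetadata is trimmed to four fields.

```rust
use std::collections::HashMap;

/// Hypothetical catalog error: a source that cannot supply a required field
/// is rejected here instead of producing an Option downstream.
#[derive(Debug, PartialEq)]
enum CatalogError {
    MissingField { model: String, field: String },
    BadValue { model: String, field: String, value: String },
}

/// Trimmed-down ModelMetadata for this sketch: every field is required.
#[derive(Debug, PartialEq)]
struct ModelMetadata {
    id: String,
    name: String,
    provider: String,
    context_window: u32,
}

/// Look up a required key, turning absence into a named error.
fn require<'a>(
    kv: &'a HashMap<String, String>,
    model: &str,
    key: &str,
) -> Result<&'a str, CatalogError> {
    kv.get(key)
        .map(String::as_str)
        .ok_or_else(|| CatalogError::MissingField {
            model: model.to_string(),
            field: key.to_string(),
        })
}

/// Build metadata from GGUF-style key/value pairs (values pre-stringified).
fn from_gguf_kv(
    id: &str,
    provider: &str,
    kv: &HashMap<String, String>,
) -> Result<ModelMetadata, CatalogError> {
    let arch = require(kv, id, "general.architecture")?;
    let name = require(kv, id, "general.name")?;
    // GGUF stores the context length under an architecture-prefixed key.
    let ctx_key = format!("{arch}.context_length");
    let raw_ctx = require(kv, id, &ctx_key)?;
    let context_window = raw_ctx.parse::<u32>().map_err(|_| CatalogError::BadValue {
        model: id.to_string(),
        field: ctx_key.clone(),
        value: raw_ctx.to_string(),
    })?;
    Ok(ModelMetadata {
        id: id.to_string(),
        name: name.to_string(),
        provider: provider.to_string(),
        context_window,
    })
}
```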

3. AIProviderAdapter::model_metadata(model_id) returns the full struct

fn model_metadata(&self, model_id: &str) -> Option<ModelMetadata>;  // None ONLY when not in adapter's live catalog
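A sketch of how model_metadata() can stay a pure lookup into a catalog cached at initialize(). The initialize() signature, CachedAdapter, and the closure standing in for a provider's /v1/models call are all hypothetical, and ModelMetadata is trimmed for brevity.

```rust
use std::collections::HashMap;

/// Trimmed-down ModelMetadata for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct ModelMetadata {
    id: String,
    context_window: u32,
    max_output_tokens: u32,
}

trait AIProviderAdapter {
    /// Hypothetical: fetch the provider's live catalog once and cache it.
    fn initialize(&mut self) -> Result<(), String>;
    /// None ONLY when the model is not in the adapter's live catalog.
    fn model_metadata(&self, model_id: &str) -> Option<ModelMetadata>;
}

/// Hypothetical adapter; fetch_catalog stands in for GET /v1/models
/// (or the DMR / GGUF reads listed in step 2).
struct CachedAdapter<F>
where
    F: Fn() -> Result<Vec<ModelMetadata>, String>,
{
    fetch_catalog: F,
    catalog: HashMap<String, ModelMetadata>,
}

impl<F> CachedAdapter<F>
where
    F: Fn() -> Result<Vec<ModelMetadata>, String>,
{
    fn new(fetch_catalog: F) -> Self {
        Self { fetch_catalog, catalog: HashMap::new() }
    }
}

impl<F> AIProviderAdapter for CachedAdapter<F>
where
    F: Fn() -> Result<Vec<ModelMetadata>, String>,
{
    fn initialize(&mut self) -> Result<(), String> {
        // Replace the cache wholesale so models the source dropped disappear too.
        let models = (self.fetch_catalog)()?;
        self.catalog = models.into_iter().map(|m| (m.id.clone(), m)).collect();
        Ok(())
    }

    fn model_metadata(&self, model_id: &str) -> Option<ModelMetadata> {
        // Pure lookup: no fallback table, no partially filled struct.
        self.catalog.get(model_id).cloned()
    }
}
```

Because the lookup never fills gaps, None carries exactly one meaning: the provider did not list the model.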

4. Thread ModelMetadata through the chain

  • PersonaResponseGenerator receives ModelMetadata at request entry.
  • ChatRAGBuilder.buildContext(model: ModelMetadata, …) reads model.context_window, model.tokens_per_second, model.capabilities directly.
  • Vision attachments and tool injection are gated by model.capabilities and model.supports_tools.
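A sketch of a consumer that reads everything it needs straight from ModelMetadata instead of calling lookup helpers. The budgeting rule here (cap the reply at tokens_per_second × a caller-supplied latency budget and max_output_tokens, leave the rest of context_window to the prompt), RequestPlan, plan_request, and the capability variants are hypothetical illustrations, not ChatRAGBuilder's actual logic; the sketch is in Rust only to match the rest of this issue.

```rust
/// Hypothetical capability variants for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelCapability {
    Text,
    Vision,
    Audio,
}

/// Only the fields this consumer reads.
struct ModelMetadata {
    context_window: u32,
    max_output_tokens: u32,
    tokens_per_second: f32,
    capabilities: Vec<ModelCapability>,
    supports_tools: bool,
}

#[derive(Debug, PartialEq, Eq)]
struct RequestPlan {
    prompt_tokens: u32,
    output_tokens: u32,
    attach_images: bool,
    inject_tools: bool,
}

fn plan_request(
    model: &ModelMetadata,
    latency_budget_secs: f32,
    has_images: bool,
    wants_tools: bool,
) -> RequestPlan {
    // Hypothetical latency rule: tokens the model can emit within the budget.
    // An f32 -> u32 `as` cast saturates (negative or NaN becomes 0).
    let latency_cap = (model.tokens_per_second * latency_budget_secs) as u32;
    let output_tokens = model.max_output_tokens.min(latency_cap);
    // The prompt gets whatever the declared window leaves over.
    let prompt_tokens = model.context_window.saturating_sub(output_tokens);
    RequestPlan {
        prompt_tokens,
        output_tokens,
        // Gated by declared capabilities, never by a side table.
        attach_images: has_images && model.capabilities.contains(&ModelCapability::Vision),
        inject_tools: wants_tools && model.supports_tools,
    }
}
```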

5. Delete the lookup-helper layer

  • system/shared/ModelContextWindows.ts — fully deletable.
  • system/core/src/models/mod.rs — collapse into ai/types.rs.

Acceptance

  • grep -r "Option<u32>" workers/continuum-core/src/ai/ returns zero hits.
  • grep -rn "ModelInfo {" workers/continuum-core/src/ only matches ai/types.rs (the definition itself).
  • system/shared/ModelContextWindows.ts deleted.
  • ChatRAGBuilder and PersonaResponseGenerator take ModelMetadata; never reconstruct it from loose strings.
  • Live test on M5: persona chat sends prompts that respect model.context_window AND the latency budget derived from model.tokens_per_second. KV cache pressure drops from 20+ GB to single-GB range.

Why separate

Touching 21+ adapter sites + consumer chain + IPC export + TS plumbing has to land atomically. Half of it sprinkled into other PRs leaves the codebase worse than it started.

Metadata


Assignees

No one assigned

    Labels

    airc-queue (AIRC-backed agent work queue card)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
