feat(resources): #1239 Phase 3 — typed ResourceError::DiskCapacity refusal at production hot paths #1300

@joelteply

Follow-on to #1239 Phase 1 (PR #1297) + Phase 2 (broker singleton bootstrap).

Phase 1 surfaced disk-tier pressure data; Phase 2 bootstraps the broker and alerts when usage crosses the threshold. This card adds the typed runtime refusal so hot-path operations don't barrel into ENOSPC before the broker has a chance to alert.

Acceptance

  1. New typed error variant ResourceError::DiskCapacity { tier: String, used_bytes: u64, capacity_bytes: u64 } in whichever crate owns the central error types.
  2. Audit + plumb the refusal at each production hot path that can fail with no-space:
    • Model pull (hf_hub::api::sync::Api::repo().get(...)) before downloading GGUF/safetensors.
    • Container start (docker compose up, wherever it's invoked from Rust).
    • Image build (docker build, if invoked from continuum-core).
    • GGUF artifact resolve (crate::model_registry::artifacts::resolve_gguf_for_model).
  3. Refusal logic: each site queries the broker (or directly calls DockerTierPool::snapshot_stats()), compares projected post-op usage against capacity, and refuses with ResourceError::DiskCapacity if the operation would push usage past the threshold (95% by default, configurable).
  4. Tests with a mocked DockerTierPool returning controllable capacity/usage — assert refusal at threshold, success below threshold.
  5. TS-side surfaces the typed error to the user (chat message: "Can't pull qwen3-coder: Docker disk would hit 96%, prune first") instead of an opaque "operation failed."
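
A minimal sketch of items 1, 3, and 4, for discussion only. The variant fields come from item 1; everything else is an assumption: the `TierStats` struct, the `TierStatsSource` trait (a hypothetical seam standing in for whatever `DockerTierPool::snapshot_stats()` actually returns), the `check_disk_capacity` helper, and `MockTier`. The sketch also assumes `used_bytes` in the error carries the projected post-op usage, and it uses a strict comparison to match the "push past" wording in item 3. Switch to `>=` if refusal exactly at the threshold is intended.

```rust
use std::fmt;

/// Typed refusal from acceptance item 1.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceError {
    DiskCapacity {
        tier: String,
        used_bytes: u64,
        capacity_bytes: u64,
    },
}

impl fmt::Display for ResourceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ResourceError::DiskCapacity { tier, used_bytes, capacity_bytes } => write!(
                f,
                "disk tier '{tier}' would reach {used_bytes} of {capacity_bytes} bytes"
            ),
        }
    }
}

impl std::error::Error for ResourceError {}

/// Hypothetical snapshot shape; the real return type of snapshot_stats() may differ.
#[derive(Debug, Clone, Copy)]
pub struct TierStats {
    pub used_bytes: u64,
    pub capacity_bytes: u64,
}

/// Hypothetical seam so hot paths can be tested against a mock pool.
pub trait TierStatsSource {
    fn tier(&self) -> &str;
    fn snapshot_stats(&self) -> TierStats;
}

/// Refuse if `op_bytes` more usage would push the tier past `threshold_pct` percent.
pub fn check_disk_capacity(
    source: &dyn TierStatsSource,
    op_bytes: u64,
    threshold_pct: u8,
) -> Result<(), ResourceError> {
    let stats = source.snapshot_stats();
    let projected = stats.used_bytes.saturating_add(op_bytes);
    // Compare projected/capacity > pct/100 by cross-multiplying in u128,
    // which avoids both integer overflow and floating-point rounding.
    let over = (projected as u128) * 100
        > (stats.capacity_bytes as u128) * (threshold_pct as u128);
    if over {
        return Err(ResourceError::DiskCapacity {
            tier: source.tier().to_string(),
            used_bytes: projected, // assumption: report projected, not current, usage
            capacity_bytes: stats.capacity_bytes,
        });
    }
    Ok(())
}

/// Mock pool with controllable capacity/usage, as item 4 asks for.
pub struct MockTier {
    pub stats: TierStats,
}

impl TierStatsSource for MockTier {
    fn tier(&self) -> &str {
        "docker"
    }
    fn snapshot_stats(&self) -> TierStats {
        self.stats
    }
}
```

A hot path would call `check_disk_capacity(&pool, expected_download_bytes, 95)?` before starting the pull or build. Knowing the expected size up front is its own problem and is not addressed here.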

Hardest piece

Touching production hot paths is the riskiest part of this 3-phase series. Recommend landing Phase 2 first so the alert sink is observable; Phase 3 then has empirical data on which hot paths actually hit the threshold most often, informing the refusal-vs-warning policy.

Why this matters

The 2026-05-14 incident (Docker.raw silently grew to fill the whole disk) is the exact failure mode this refusal prevents. Phase 1 = observability. Phase 2 = alerting. Phase 3 = refusal. All three are needed for the substrate to actually act on disk pressure rather than just surface it.

Lane: alpha flywheel #1272 lane 4 (substrate) or wherever #1239 sat.
