Follow-on to #1239 Phase 1 (PR #1297) + Phase 2 (broker singleton bootstrap).
Phase 1 surfaced disk-tier pressure data; Phase 2 bootstraps the broker and alerts when usage crosses the threshold. This card adds the typed runtime refusal so hot-path operations don't barrel into ENOSPC before the broker has a chance to alert.
Acceptance
- New typed error variant ResourceError::DiskCapacity { tier: String, used_bytes: u64, capacity_bytes: u64 } in whichever crate owns the central error types.
- Audit + plumb the refusal at each production hot path that can fail with no-space:
  - Model pull (hf_hub::api::sync::Api::repo().get(...)) before downloading GGUF/safetensors.
  - Container start (docker compose up, wherever it's invoked from Rust).
  - Image build (docker build, if invoked from continuum-core).
  - GGUF artifact resolve (crate::model_registry::artifacts::resolve_gguf_for_model).
- Refusal logic: each site queries the broker (or directly calls DockerTierPool::snapshot_stats()), compares projected post-op usage against capacity, and refuses with ResourceError::DiskCapacity if it would push past 95% (configurable threshold).
- Tests with a mocked DockerTierPool returning controllable capacity/usage: assert refusal at threshold, success below threshold.
- TS-side surfaces the typed error to the user (chat message: "Can't pull qwen3-coder: Docker disk would hit 96%, prune first") instead of an opaque "operation failed."
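A minimal sketch of the error variant and the refusal check described in the acceptance list, under stated assumptions. Only ResourceError::DiskCapacity and its three fields come from this card. TierStats, TierStatsSource, FixedStats, and check_disk_capacity are hypothetical names introduced for illustration; the real return type of DockerTierPool::snapshot_stats() and the broker API are not specified here. The sketch also assumes that "push past" means strictly greater than the threshold, and that used_bytes in the error carries the projected post-op usage rather than the current usage.

```rust
use std::fmt;

/// Typed refusal; variant and field names follow the acceptance criteria.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceError {
    DiskCapacity {
        tier: String,
        used_bytes: u64,
        capacity_bytes: u64,
    },
}

impl fmt::Display for ResourceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ResourceError::DiskCapacity { tier, used_bytes, capacity_bytes } => write!(
                f,
                "{tier} disk capacity refused: {used_bytes} of {capacity_bytes} bytes"
            ),
        }
    }
}

impl std::error::Error for ResourceError {}

/// Hypothetical snapshot shape (assumption, not the real DockerTierPool type).
#[derive(Debug, Clone)]
pub struct TierStats {
    pub tier: String,
    pub used_bytes: u64,
    pub capacity_bytes: u64,
}

/// Hypothetical seam so a test double can stand in for DockerTierPool.
pub trait TierStatsSource {
    fn snapshot_stats(&self) -> TierStats;
}

/// Test double returning fixed, controllable capacity/usage.
pub struct FixedStats(pub TierStats);

impl TierStatsSource for FixedStats {
    fn snapshot_stats(&self) -> TierStats {
        self.0.clone()
    }
}

/// Refuses when projected post-op usage would exceed `threshold_percent`
/// of capacity (strictly greater than; landing exactly on it is allowed).
pub fn check_disk_capacity(
    source: &dyn TierStatsSource,
    projected_extra_bytes: u64,
    threshold_percent: u64,
) -> Result<(), ResourceError> {
    let stats = source.snapshot_stats();
    let projected = stats.used_bytes.saturating_add(projected_extra_bytes);
    // Compare in u128 so the multiplications cannot overflow.
    let lhs = u128::from(projected) * 100;
    let rhs = u128::from(stats.capacity_bytes) * u128::from(threshold_percent);
    if lhs > rhs {
        // Assumption: report the projected usage, not the current usage.
        return Err(ResourceError::DiskCapacity {
            tier: stats.tier,
            used_bytes: projected,
            capacity_bytes: stats.capacity_bytes,
        });
    }
    Ok(())
}
```

Each hot path would call something like check_disk_capacity with its estimated download or build size before starting the operation. Whether the boundary should refuse at exactly 95% or only above it is a policy choice this sketch does not settle.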
Hardest piece
Touching production hot paths is the riskiest part of this 3-phase series. Recommend landing Phase 2 first so the alert sink is observable; Phase 3 then has empirical data on which hot paths actually hit the threshold most often, informing the refusal-vs-warning policy.
Why this matters
The 2026-05-14 incident (Docker.raw silently grew to fill the whole disk) is the exact failure mode this refusal prevents. Phase 1 = observability. Phase 2 = alerting. Phase 3 = refusal. All three are needed for the substrate to actually act on disk pressure rather than just surface it.
Lane: alpha flywheel #1272 lane 4 (substrate) or wherever #1239 sat.