
Model-level circuit breaker for media generation #898

@nickna

Description

Problem

When a provider's model becomes unhealthy (stuck predictions, repeated timeouts, infrastructure issues), every subsequent request to that model hangs for up to 10 minutes before timing out. This wastes time and API credits and delivers a poor user experience — especially when the failure is systemic and predictable after the first few attempts.

Recent debugging of a stuck Replicate image generation request revealed that:

  • A prediction stayed in "processing" state indefinitely on Replicate's side
  • The system polled for 300+ attempts with no timeout enforcement
  • No error was recorded, no feedback was given to the user
  • Every subsequent request to the same model would have hit the same issue

We've since improved error classification, timeout handling, and provider error tracking (see recent commits). But we're still missing the ability to fail fast when a model is known to be unhealthy.

Proposal: Model-level circuit breaker

Why model-level?

  • Provider-level is too broad. flux-1.1-pro being stuck on Replicate shouldn't block minimax/video-01 which also runs on Replicate.
  • Key-level is too narrow. The existing auto-disable system handles bad credentials. A circuit breaker is about operational health — the model infrastructure is misbehaving regardless of which key is used.
  • Model mapping is the natural unit. Requests route through model alias → model mapping → provider + model ID. This is the granularity at which failures are correlated.

Circuit breaker states

Closed (healthy)
  │
  ├─ failure threshold exceeded
  ▼
Open (failing fast)
  │
  ├─ cooldown elapsed
  ▼
Half-Open (probing)
  │
  ├─ probe succeeds → Closed
  └─ probe fails → Open (reset cooldown)

Proposed behavior

| State | Behavior |
| --- | --- |
| Closed | Requests proceed normally. Failures are counted. |
| Open | Requests fail immediately with HTTP 503: "Model temporarily unavailable due to repeated failures. Retry after {cooldown}s." No polling, no API calls, no wasted credits. |
| Half-Open | One probe request is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens with a reset cooldown. |
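The state transitions above can be sketched as a small state machine. This is a language-agnostic illustration in Python (the actual implementation would live in the C# codebase); the class and parameter names are illustrative, not proposed APIs:

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"

class ModelCircuitBreaker:
    """In-memory sketch of the three-state circuit described above."""

    def __init__(self, failure_threshold=3, window_seconds=300,
                 cooldown_seconds=30, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window_seconds
        self.cooldown = cooldown_seconds
        self._now = now                     # injectable clock for testing
        self.state = CircuitState.CLOSED
        self._failures = []                 # timestamps in the sliding window
        self._opened_at = None

    def allow_request(self):
        if self.state is CircuitState.OPEN:
            if self._now() - self._opened_at >= self.cooldown:
                # Cooldown elapsed: let one probe through.
                self.state = CircuitState.HALF_OPEN
                return True
            return False                    # fail fast (would map to HTTP 503)
        return True

    def record_success(self):
        # A successful probe (or any success) closes the circuit.
        self.state = CircuitState.CLOSED
        self._failures.clear()
        self._opened_at = None

    def record_failure(self):
        now = self._now()
        if self.state is CircuitState.HALF_OPEN:
            self._open(now)                 # probe failed: reopen, reset cooldown
            return
        # Drop failures outside the sliding window, then count this one.
        self._failures = [t for t in self._failures if now - t < self.window]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = CircuitState.OPEN
        self._opened_at = now
```

The injectable clock keeps the sketch deterministic under test; in production the timestamps would come from the distributed cache described below rather than process-local state.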

Suggested defaults (configurable)

| Parameter | Image Generation | Video Generation |
| --- | --- | --- |
| Failure threshold | 3 failures in 5 minutes | 2 failures in 10 minutes |
| Open duration (cooldown) | 30 seconds | 60 seconds |
| Half-open max probes | 1 | 1 |
| Success threshold to close | 1 | 1 |
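The defaults table maps naturally onto a per-media-type options record. A sketch (Python stands in for the C# options class; all names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CircuitBreakerOptions:
    """Hypothetical options record mirroring the defaults table above."""
    failure_threshold: int
    failure_window_seconds: int
    cooldown_seconds: int
    half_open_max_probes: int = 1
    success_threshold_to_close: int = 1

# Suggested defaults, keyed by media type; each value is configurable.
DEFAULTS = {
    "image": CircuitBreakerOptions(failure_threshold=3,
                                   failure_window_seconds=300,
                                   cooldown_seconds=30),
    "video": CircuitBreakerOptions(failure_threshold=2,
                                   failure_window_seconds=600,
                                   cooldown_seconds=60),
}
```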

Failure criteria

A "failure" for circuit breaker purposes includes:

  • RequestTimeoutException (polling timeout exceeded)
  • ServiceUnavailableException (provider 503)
  • LLMCommunicationException with 5xx status codes
  • Network errors (HttpRequestException)

Not counted as failures (these are request-specific, not model health issues):

  • InvalidRequestException (bad prompt, content policy)
  • RateLimitExceededException (handled by backoff/retry)
  • InvalidApiKey / InsufficientBalance (handled by key auto-disable)
  • ModelNotFoundException (configuration issue)
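The two lists above reduce to a single classification predicate. A sketch, using exception type names as strings since this is language-agnostic (the real check would match on the C# exception hierarchy):

```python
from typing import Optional

# Exception names taken from the lists above.
CIRCUIT_FAILURES = {
    "RequestTimeoutException",      # polling timeout exceeded
    "ServiceUnavailableException",  # provider 503
    "HttpRequestException",         # network errors
}

def counts_as_circuit_failure(exc_type: str,
                              status_code: Optional[int] = None) -> bool:
    """Return True only for failures that indicate model ill-health."""
    if exc_type == "LLMCommunicationException":
        # Only 5xx responses count; 4xx are request-specific.
        return status_code is not None and 500 <= status_code <= 599
    return exc_type in CIRCUIT_FAILURES
```

Keeping this predicate in one place makes it easy to audit which errors trip the circuit and which are delegated to backoff/retry or key auto-disable.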

Implementation notes

State storage

Use IDistributedCache (Redis) so circuit state is shared across scaled instances. Follows existing cache key conventions:

circuit:model:{mappingId}:state     → "closed" | "open" | "half-open"
circuit:model:{mappingId}:failures  → failure count with sliding window
circuit:model:{mappingId}:opened_at → timestamp when circuit opened
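A sketch of the key builder and state reads against that scheme, with a dict standing in for IDistributedCache/Redis (the store class and method names are illustrative):

```python
import time

def circuit_key(mapping_id, suffix):
    """Build a cache key following the scheme above."""
    return f"circuit:model:{mapping_id}:{suffix}"

class InMemoryCircuitStore:
    """Dict-backed stand-in for the distributed cache, for illustration."""

    def __init__(self):
        self._data = {}

    def get_state(self, mapping_id):
        # A missing key means the circuit has never tripped: closed.
        return self._data.get(circuit_key(mapping_id, "state"), "closed")

    def open_circuit(self, mapping_id, now=None):
        now = time.time() if now is None else now
        self._data[circuit_key(mapping_id, "state")] = "open"
        self._data[circuit_key(mapping_id, "opened_at")] = now
```

Treating an absent key as "closed" keeps the hot path cheap: healthy models never need a cache write, and the entries for an open circuit can carry a TTL equal to the cooldown.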

Integration point

The natural check point is during model mapping lookup — every request already goes through GetMappingByModelAliasAsync. The circuit breaker check can be added to CachedModelProviderMappingService or as a wrapper that checks circuit state before returning the mapping.

Alternatively, it could live in the controller layer (ImagesController, VideosController) after mapping resolution but before client creation — this keeps the mapping service clean and makes the circuit breaker explicit.
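The wrapper option can be sketched as a decorator around the mapping lookup. Python stands in for the C# service here; `get_mapping` is a stand-in for GetMappingByModelAliasAsync, and the error type would map to HTTP 503 via ExceptionToResponseMapper:

```python
class ModelUnavailableError(Exception):
    """Raised when the circuit is open; would surface as HTTP 503."""

class CircuitBreakingMappingService:
    """Wraps the mapping lookup and fails fast when the circuit is open."""

    def __init__(self, inner_lookup, is_circuit_open):
        self.inner_lookup = inner_lookup          # alias -> mapping dict
        self.is_circuit_open = is_circuit_open    # mapping id -> bool

    def get_mapping(self, model_alias):
        mapping = self.inner_lookup(model_alias)
        if self.is_circuit_open(mapping["id"]):
            raise ModelUnavailableError(
                f"Model {model_alias} temporarily unavailable "
                "due to repeated failures.")
        return mapping
```

The controller-layer alternative has the same shape, just invoked after mapping resolution instead of inside it; the trade-off is explicitness versus keeping the check in one shared place.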

Existing infrastructure to leverage

  • RedisCircuitBreaker — already exists for cache operations, could inform the pattern
  • CacheKeys — established key naming conventions
  • ProviderErrorTrackingService — already tracks errors per provider/key; circuit breaker could consume these events
  • OperationTimeoutProvider — already has per-operation timeout configuration; circuit breaker config could follow the same pattern
  • ExceptionToResponseMapper — already maps ServiceUnavailableException to HTTP 503

Related: reduce MaxPollingDuration for image generation

The current 10-minute MaxPollingDuration in ReplicateClient applies to all prediction types. Image generation should have a much shorter timeout (60 seconds is reasonable — most image models complete in 10-30 seconds). Video generation legitimately takes longer. This could be:

  • A parameter passed to PollPredictionUntilCompletedAsync
  • Derived from OperationTimeoutProvider configuration
  • Set per media type in the orchestrator/controller
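Whichever option is chosen, the shape is a per-media-type lookup with the current 10-minute value as the fallback. A sketch (the durations here are the suggestions from the paragraph above, not existing configuration):

```python
from datetime import timedelta

# Illustrative per-media-type polling caps; real values would come from
# OperationTimeoutProvider-style configuration.
MAX_POLLING = {
    "image": timedelta(seconds=60),   # most image models finish in 10-30s
    "video": timedelta(minutes=10),   # video legitimately takes longer
}

def max_polling_duration(media_type):
    """Resolve the polling cap, defaulting to the current 10 minutes."""
    return MAX_POLLING.get(media_type, timedelta(minutes=10))
```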

Open questions

  1. Should the circuit breaker emit events? Publishing a ModelCircuitOpened / ModelCircuitClosed event via MassTransit would allow the WebAdmin to show real-time model health status and could trigger notifications.
  2. Admin override? Should there be an Admin API endpoint to manually close/open a circuit (e.g., after a known provider outage is resolved)?
  3. Metrics? Should circuit state changes be recorded as Prometheus metrics for Grafana dashboards?
