## Problem
When a provider's model becomes unhealthy (stuck predictions, repeated timeouts, infrastructure issues), every subsequent request to that model hangs for up to 10 minutes before timing out. This wastes time and API credits and degrades the user experience — especially when the failure is systemic and predictable after the first few attempts.
Recent debugging of a stuck Replicate image generation request revealed that:
- A prediction stayed in "processing" state indefinitely on Replicate's side
- The system polled 300+ times with no timeout enforcement
- No error was recorded and no feedback was given to the user
- Every subsequent request to the same model would have hit the same issue
We've since improved error classification, timeout handling, and provider error tracking (see recent commits). But we're still missing the ability to fail fast when a model is known to be unhealthy.
## Proposal: Model-level circuit breaker

### Why model-level?
- Provider-level is too broad. `flux-1.1-pro` being stuck on Replicate shouldn't block `minimax/video-01`, which also runs on Replicate.
- Key-level is too narrow. The existing auto-disable system handles bad credentials. A circuit breaker is about operational health — the model infrastructure is misbehaving regardless of which key is used.
- Model mapping is the natural unit. Requests route through model alias → model mapping → provider + model ID. This is the granularity at which failures are correlated.
### Circuit breaker states

```
Closed (healthy)
  │
  ├─ failure threshold exceeded
  ▼
Open (failing fast)
  │
  ├─ cooldown elapsed
  ▼
Half-Open (probing)
  │
  ├─ probe succeeds → Closed
  └─ probe fails → Open (reset cooldown)
```
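As a sketch, the states and transitions above map to a small amount of C# (type and method names here are illustrative, not from the codebase):

```csharp
// Hypothetical sketch of the three states and their transitions.
public enum CircuitState
{
    Closed,   // healthy: requests proceed, failures are counted
    Open,     // failing fast: requests are rejected until the cooldown elapses
    HalfOpen  // probing: one request is let through to test recovery
}

public static class CircuitTransitions
{
    public static CircuitState OnFailure(CircuitState state, int recentFailures, int threshold) =>
        state switch
        {
            CircuitState.Closed when recentFailures >= threshold => CircuitState.Open,
            CircuitState.HalfOpen => CircuitState.Open, // probe failed: reopen, reset cooldown
            _ => state
        };

    public static CircuitState OnSuccess(CircuitState state) =>
        state == CircuitState.HalfOpen ? CircuitState.Closed : state;

    public static CircuitState OnCooldownElapsed(CircuitState state) =>
        state == CircuitState.Open ? CircuitState.HalfOpen : state;
}
```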
### Proposed behavior

| State | Behavior |
|-------|----------|
| Closed | Requests proceed normally. Failures are counted. |
| Open | Requests fail immediately with HTTP 503: "Model temporarily unavailable due to repeated failures. Retry after {cooldown}s." No polling, no API calls, no wasted credits. |
| Half-Open | One probe request is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens with a reset cooldown. |
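The Open-state behavior could be a guard invoked after mapping resolution, along these lines (the helper and its parameters are hypothetical; the 503 mapping via `ExceptionToResponseMapper` is the existing behavior noted under "Existing infrastructure to leverage"):

```csharp
// Hypothetical guard, invoked after mapping resolution and before client creation.
// ServiceUnavailableException is mapped to HTTP 503 by the existing pipeline.
private static void ThrowIfCircuitOpen(CircuitState state, DateTimeOffset openedAt, TimeSpan cooldown)
{
    if (state != CircuitState.Open)
        return;

    var retryAfter = openedAt + cooldown - DateTimeOffset.UtcNow;
    throw new ServiceUnavailableException(
        "Model temporarily unavailable due to repeated failures. " +
        $"Retry after {Math.Max(0, (int)retryAfter.TotalSeconds)}s.");
}
```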
### Suggested defaults (configurable)

| Parameter | Image Generation | Video Generation |
|-----------|------------------|------------------|
| Failure threshold | 3 failures in 5 minutes | 2 failures in 10 minutes |
| Open duration (cooldown) | 30 seconds | 60 seconds |
| Half-open max probes | 1 | 1 |
| Success threshold to close | 1 | 1 |
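These defaults could be bound from configuration into an options class, one instance per media type (a sketch with assumed property names; the values shown are the image-generation column):

```csharp
// Hypothetical options type; one instance would be bound per media type.
// Defaults below are the image-generation column of the table above.
public sealed class ModelCircuitBreakerOptions
{
    public int FailureThreshold { get; set; } = 3;                          // failures within the window
    public TimeSpan FailureWindow { get; set; } = TimeSpan.FromMinutes(5);  // sliding window
    public TimeSpan OpenDuration { get; set; } = TimeSpan.FromSeconds(30);  // cooldown
    public int HalfOpenMaxProbes { get; set; } = 1;
    public int SuccessThresholdToClose { get; set; } = 1;
}
```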
### Failure criteria

A "failure" for circuit breaker purposes includes:

- `RequestTimeoutException` (polling timeout exceeded)
- `ServiceUnavailableException` (provider 503)
- `LLMCommunicationException` with 5xx status codes
- Network errors (`HttpRequestException`)
Not counted as failures (these are request-specific, not model health issues):

- `InvalidRequestException` (bad prompt, content policy)
- `RateLimitExceededException` (handled by backoff/retry)
- `InvalidApiKey` / `InsufficientBalance` (handled by key auto-disable)
- `ModelNotFoundException` (configuration issue)
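Taken together, the classification could be a single predicate (a sketch; it assumes `LLMCommunicationException` exposes a numeric `StatusCode`):

```csharp
// Hypothetical predicate deciding whether an exception should trip the circuit.
// Assumes LLMCommunicationException exposes an int? StatusCode.
private static bool CountsAsCircuitFailure(Exception ex) => ex switch
{
    RequestTimeoutException => true,                                   // polling timeout exceeded
    ServiceUnavailableException => true,                               // provider 503
    LLMCommunicationException llm when llm.StatusCode >= 500 => true,  // provider 5xx
    HttpRequestException => true,                                      // network-level failure
    _ => false  // request-specific errors (bad prompts, rate limits, key issues) don't count
};
```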
## Implementation notes

### State storage

Use `IDistributedCache` (Redis) so circuit state is shared across scaled instances, following existing cache key conventions:

- `circuit:model:{mappingId}:state` → `"closed"` | `"open"` | `"half-open"`
- `circuit:model:{mappingId}:failures` → failure count with sliding window
- `circuit:model:{mappingId}:opened_at` → timestamp when the circuit opened
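A minimal sketch of a state store over `IDistributedCache` using the keys above (the class name and the expiry-based half-open trick are assumptions; production code would also need atomic failure counting):

```csharp
using Microsoft.Extensions.Caching.Distributed;

// Hypothetical state store. Letting the "state" key expire after the cooldown
// means an absent key can be read as permission to probe (half-open).
public sealed class DistributedCircuitStateStore(IDistributedCache cache)
{
    public async Task<string?> GetStateAsync(string mappingId, CancellationToken ct) =>
        await cache.GetStringAsync($"circuit:model:{mappingId}:state", ct);

    public async Task OpenAsync(string mappingId, TimeSpan cooldown, CancellationToken ct)
    {
        var options = new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = cooldown
        };
        await cache.SetStringAsync($"circuit:model:{mappingId}:state", "open", options, ct);
        await cache.SetStringAsync($"circuit:model:{mappingId}:opened_at",
            DateTimeOffset.UtcNow.ToString("O"), options, ct);
    }
}
```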
### Integration point

The natural check point is during model mapping lookup — every request already goes through `GetMappingByModelAliasAsync`. The circuit breaker check can be added to `CachedModelProviderMappingService` or as a wrapper that checks circuit state before returning the mapping.

Alternatively, it could live in the controller layer (`ImagesController`, `VideosController`) after mapping resolution but before client creation — this keeps the mapping service clean and makes the circuit breaker explicit.
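The wrapper option might look like this sketch (the interface and mapping type names are placeholders, not the actual `CachedModelProviderMappingService` types):

```csharp
// Hypothetical decorator over the existing mapping service; interface and
// model names are illustrative, not the actual codebase types.
public sealed class CircuitAwareMappingService(
    IModelProviderMappingService inner,
    DistributedCircuitStateStore circuits) : IModelProviderMappingService
{
    public async Task<ModelProviderMapping?> GetMappingByModelAliasAsync(
        string modelAlias, CancellationToken ct)
    {
        var mapping = await inner.GetMappingByModelAliasAsync(modelAlias, ct);

        if (mapping is not null &&
            await circuits.GetStateAsync(mapping.Id, ct) == "open")
        {
            // Fail fast: no provider call, no polling, no wasted credits.
            throw new ServiceUnavailableException(
                "Model temporarily unavailable due to repeated failures.");
        }

        return mapping;
    }
}
```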
### Existing infrastructure to leverage

- `RedisCircuitBreaker` — already exists for cache operations; could inform the pattern
- `CacheKeys` — established key naming conventions
- `ProviderErrorTrackingService` — already tracks errors per provider/key; the circuit breaker could consume these events
- `OperationTimeoutProvider` — already has per-operation timeout configuration; circuit breaker config could follow the same pattern
- `ExceptionToResponseMapper` — already maps `ServiceUnavailableException` to HTTP 503
## Related: reduce `MaxPollingDuration` for image generation

The current 10-minute `MaxPollingDuration` in `ReplicateClient` applies to all prediction types. Image generation should have a much shorter timeout (60 seconds is reasonable — most image models complete in 10-30 seconds), while video generation legitimately takes longer. The shorter timeout could be:

- A parameter passed to `PollPredictionUntilCompletedAsync`
- Derived from `OperationTimeoutProvider` configuration
- Set per media type in the orchestrator/controller
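For example, the first option could look like this (a sketch; the `MediaType` enum and the extra timeout parameter on `PollPredictionUntilCompletedAsync` are the proposed additions, not existing code):

```csharp
// Hypothetical: the orchestrator picks the polling budget by media type
// instead of relying on the single 10-minute MaxPollingDuration.
var maxPollingDuration = mediaType switch
{
    MediaType.Image => TimeSpan.FromSeconds(60),  // most image models finish in 10-30s
    MediaType.Video => TimeSpan.FromMinutes(10),  // video legitimately takes longer
    _ => TimeSpan.FromMinutes(10)
};

var prediction = await replicateClient.PollPredictionUntilCompletedAsync(
    predictionId, maxPollingDuration, cancellationToken);
```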
## Open questions

- Should the circuit breaker emit events? Publishing `ModelCircuitOpened` / `ModelCircuitClosed` events via MassTransit would allow the WebAdmin to show real-time model health status and could trigger notifications.
- Admin override? Should there be an Admin API endpoint to manually close/open a circuit (e.g., after a known provider outage is resolved)?
- Metrics? Should circuit state changes be recorded as Prometheus metrics for Grafana dashboards?
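If the events question is answered yes, the contracts could be as small as the following (hypothetical records; MassTransit can publish any serializable message type):

```csharp
// Hypothetical MassTransit event contracts for circuit state changes.
public sealed record ModelCircuitOpened(
    string MappingId,
    string ModelAlias,
    string Reason,
    DateTimeOffset OpenedAt);

public sealed record ModelCircuitClosed(
    string MappingId,
    string ModelAlias,
    DateTimeOffset ClosedAt);
```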