add client-side circuit breaker interceptor for gRPC clients #550

Open
k4leung4 wants to merge 4 commits into chainguard-dev:main from k4leung4:grpc-circuit-breaker

@k4leung4 k4leung4 commented Mar 16, 2026

Summary

Add a new pkg/interceptors/circuitbreaker package that provides gRPC
client interceptors backed by sony/gobreaker. When a downstream service
returns too many consecutive errors, the circuit opens and calls fail fast
with codes.Unavailable instead of adding load to the failing service.

Motivation

Addresses CUS-239 / CUS-172 / chainguard-dev/internal-dev#18988.

During INC-44, all services continued retrying the overwhelmed load balancer,
amplifying the cascade. A circuit breaker would have stopped the cascade
within seconds by failing fast once the downstream was detected as unhealthy.

Package API

// Get default settings tuned for Cloud Run service-to-service calls.
settings := circuitbreaker.DefaultSettings("iam-datastore")
cb := gobreaker.NewCircuitBreaker[any](settings)

// Add to gRPC dial options (opt-in per client).
opts = append(opts,
    grpc.WithUnaryInterceptor(circuitbreaker.UnaryClientInterceptor(cb)),
    grpc.WithStreamInterceptor(circuitbreaker.StreamClientInterceptor(cb)),
)

Default settings

| Parameter                      | Value | Rationale                           |
|--------------------------------|-------|-------------------------------------|
| ConsecutiveFailures to trip    | 5     | Tolerates brief transient errors    |
| Half-open timeout              | 15s   | Gives downstream time to recover    |
| Max probe requests (half-open) | 10    | Tests recovery with limited traffic |
| Failure count reset interval   | 30s   | Prevents slow leak from tripping    |

Error classification

Client-side errors (the server didn't fail) are classified as successes
and do not count toward tripping the breaker:

  • InvalidArgument, NotFound, AlreadyExists, PermissionDenied,
    Unauthenticated, FailedPrecondition, OutOfRange, Unimplemented, Canceled

Canceled is classified as a success because cancellation is typically
client-initiated (explicit context cancellation or user abort), not a server failure.

Server-side errors that count as failures:

  • Internal, Unavailable, DeadlineExceeded, ResourceExhausted,
    Unknown, Aborted, DataLoss
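
The two lists above amount to a single predicate over status codes. The sketch below shows one way to express it, assuming a local `Code` string type as a stand-in for `google.golang.org/grpc/codes.Code` so that it runs without external modules. The names `clientSide`, `serverSide`, and `isSuccessful` are illustrative and are not taken from the PR's source.

```go
package main

import "fmt"

// Code is a local stand-in for google.golang.org/grpc/codes.Code so the
// sketch compiles without external modules. Only the names mirror gRPC.
type Code string

// clientSide holds the codes the description classifies as successes.
var clientSide = map[Code]bool{
	"InvalidArgument": true, "NotFound": true, "AlreadyExists": true,
	"PermissionDenied": true, "Unauthenticated": true, "FailedPrecondition": true,
	"OutOfRange": true, "Unimplemented": true, "Canceled": true,
}

// serverSide holds the codes the description counts as failures.
var serverSide = []Code{
	"Internal", "Unavailable", "DeadlineExceeded", "ResourceExhausted",
	"Unknown", "Aborted", "DataLoss",
}

// isSuccessful reports whether an outcome should leave the breaker's
// failure count untouched. Treating OK as a success is an assumption here.
func isSuccessful(c Code) bool {
	return c == "OK" || clientSide[c]
}

func main() {
	for _, c := range []Code{"NotFound", "Canceled", "Unavailable", "DeadlineExceeded"} {
		fmt.Printf("%-16s success=%v\n", c, isSuccessful(c))
	}
}
```

Keeping the success set explicit means any code missing from it, including one added to gRPC later, defaults to counting as a failure.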

Observability

State transitions (closed → open → half-open → closed) are logged via
clog.InfoContextf with context.Background() (circuit-level events,
not request-scoped).

Stream interceptor limitation

The stream interceptor only tracks stream establishment errors.
Errors on Send/Recv after the stream is established are not tracked,
so a downstream that accepts connections but fails on every message will
not trip the breaker. This is an inherent limitation of the gRPC stream
interceptor model.

Tests

  • TestCircuitBreaker_TripsAfterConsecutiveFailures — circuit opens after
    N server errors, subsequent calls fail fast with Unavailable
  • TestCircuitBreaker_ClientErrorsDoNotTrip — NotFound etc don't trip
  • TestCircuitBreaker_RecoversThroughHalfOpen — open → half-open → probe
    succeeds → closed
  • TestStreamClientInterceptor_TripsAndFailsFast — stream call fails
    fast when circuit is open
  • TestDefaultSettings_IsSuccessful — table-driven test of error
    classification across 16 gRPC codes

Follow-up in mono

After bumping go-grpc-kit, enable the circuit breaker per-client in
api-impl/cmd/backend/main.go's configureClients(). Start with the
IAM datastore client (highest traffic, most critical) and expand.

Test plan

  • go test ./pkg/interceptors/circuitbreaker/ -v — 5/5 pass
  • CI passes
  • Deploy to staging, verify state transition logs in Cloud Logging

🤖 Generated with Claude Code

Signed-off-by: Kenny Leung <kleung@chainguard.dev>
@k4leung4 k4leung4 force-pushed the grpc-circuit-breaker branch from 0e69e20 to 70befdc Compare March 16, 2026 19:45
fix
Signed-off-by: Kenny Leung <kleung@chainguard.dev>
@k4leung4 k4leung4 marked this pull request as ready for review March 16, 2026 19:59
@k4leung4 k4leung4 requested review from cmdpdx and tcnghia March 16, 2026 19:59
Kenny Leung and others added 2 commits March 18, 2026 19:58
- Fix ReadyToTrip: >= 5 (not > 5) to match documented "5 consecutive
  failures" behavior
- Add codes.Canceled to success list — cancellation is typically
  client-initiated (context timeout), not a server failure
- Add OnStateChange logging via clog for production observability
- Add doc comment on StreamClientInterceptor noting that only stream
  establishment is tracked (not Send/Recv errors)
- Use defensive type assertion in stream interceptor
- Add TestStreamClientInterceptor_TripsAndFailsFast
- Add Canceled, Aborted, DataLoss to IsSuccessful test table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use clog.InfoContextf with context.Background() instead of clog.Infof
so the log entry goes through the structured logger configured at
process startup. OnStateChange fires outside any request context, so
context.Background() is the correct scope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@eslerm eslerm left a comment

🤖: Solid circuit breaker implementation — no correctness bugs, good test coverage with real in-process server. Two comment nits.

}
// Treat client-side errors as successes (the server didn't fail).
// Canceled is included because cancellation is typically initiated
// by the client (context timeout or user abort), not a server failure.

🤖: Comment says "context timeout" but codes.Canceled is explicit context cancellation (e.g., calling the cancel function returned by context.WithCancel), not a timeout. Timeouts produce codes.DeadlineExceeded, which is already handled separately above. Consider: "cancelled by the client (explicit cancellation), not a server failure."


settings := DefaultSettings("test")
settings.ReadyToTrip = func(counts gobreaker.Counts) bool {
	return counts.ConsecutiveFailures > 2 // Trip after 2 for faster test

🤖: Comment says "Trip after 2" but the condition is > 2, which trips after 3 consecutive failures. Should be "Trip after 3" or change to > 1.
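
The off-by-one can be confirmed with a small stand-alone check. The helper `failuresToTrip` below is hypothetical; it assumes the predicate is evaluated after each failure with the running consecutive-failure count, starting at 1.

```go
package main

import "fmt"

// failuresToTrip returns how many consecutive failures occur before
// readyToTrip first returns true, assuming it is checked after each failure.
func failuresToTrip(readyToTrip func(consecutiveFailures uint32) bool) uint32 {
	for n := uint32(1); ; n++ {
		if readyToTrip(n) {
			return n
		}
	}
}

func main() {
	fmt.Println(failuresToTrip(func(n uint32) bool { return n > 2 }))  // 3
	fmt.Println(failuresToTrip(func(n uint32) bool { return n > 1 }))  // 2
	fmt.Println(failuresToTrip(func(n uint32) bool { return n >= 2 })) // 2
}
```

Either `> 1` or `>= 2` matches the "Trip after 2" comment; `>= 2` reads the same way as the `>= 5` fix applied to the default settings.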
