-
Notifications
You must be signed in to change notification settings - Fork 233
Fix hub crash during Cloud Logging metadata outage #270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| # Cloud Logging Circuit Breaker (Issue #70) | ||
|
|
||
| **Date:** 2026-05-31 | ||
| **Author:** dev-issue-70 | ||
| **Issue:** #70 — Hub crashes when Cloud Logging retries exhaust resources during metadata outage | ||
|
|
||
| ## Summary | ||
|
|
||
| Added a circuit breaker pattern to the Cloud Logging integration so the hub | ||
| remains operational when the GCP metadata service (or Cloud Logging API) is | ||
| unavailable. | ||
|
|
||
| ## Changes | ||
|
|
||
| ### New: `pkg/util/logging/resilient_cloud_handler.go` | ||
|
|
||
| - `ResilientCloudHandler` wraps `CloudHandler` with a three-state circuit breaker | ||
| (closed → open → half-open → closed). | ||
| - Background goroutine runs periodic flush health checks against the Cloud | ||
| Logging buffer. | ||
| - When consecutive failures exceed the threshold (default: 3), the circuit | ||
| opens and `Handle()` silently drops entries from the cloud path. Local | ||
| logging via the `multiHandler` continues unaffected. | ||
| - After `OpenDuration` (default: 60s), the circuit transitions to half-open | ||
| and probes with a timeout-guarded flush. On success the circuit closes and | ||
| Cloud Logging resumes automatically. | ||
| - Circuit breaker state is shared across derived handlers (`WithAttrs` / | ||
| `WithGroup`) via a `*circuitBreaker` pointer. | ||
|
|
||
| ### Modified: `pkg/util/logging/cloud_handler.go` | ||
|
|
||
| - Added `BufferedByteLimit` (default 8 MiB) to the `gcplog.Logger` to cap | ||
| the internal write buffer and prevent unbounded memory growth. | ||
| - Added `ClientTimeout` (default 15s) to `gcplog.NewClient` creation context | ||
| so startup doesn't hang when metadata is unreachable. | ||
|
|
||
| ### Modified: `cmd/server_foreground.go` | ||
|
|
||
| - `initServerLogging` wraps the `CloudHandler` with `ResilientCloudHandler`. | ||
| - Updated type assertions for `Client()` access from `*CloudHandler` to | ||
| `*ResilientCloudHandler`. | ||
|
|
||
| ## Design Decisions | ||
|
|
||
| - **Circuit breaker over retry limiter**: A circuit breaker provides cleaner | ||
| behavior than just capping retries — it stops all Cloud Logging traffic | ||
| during outages rather than letting each log entry independently discover | ||
| the backend is down. | ||
| - **Flush-based health detection**: Since `gcplog.Logger.Log()` is async | ||
| and doesn't return errors, we use periodic `Flush()` calls with timeouts | ||
| to detect backend failures. | ||
| - **Shared state via pointer**: The `circuitBreaker` struct is heap-allocated | ||
| and shared by pointer, avoiding `go vet` complaints about copying | ||
| `atomic.Int32` in `WithAttrs`/`WithGroup`. | ||
|
|
||
| ## Testing | ||
|
|
||
| - 17 unit tests covering: config defaults, state transitions, Handle behavior | ||
| in each circuit state, failure/success tracking, WithAttrs/WithGroup | ||
| state sharing, concurrent access safety (race detector). | ||
| - All existing logging tests continue to pass. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -449,7 +449,10 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl | |
| cleanups = append(cleanups, logCleanup) | ||
| } | ||
|
|
||
| // Initialize direct Cloud Logging | ||
| // Initialize direct Cloud Logging with circuit breaker protection. | ||
| // If Cloud Logging becomes unavailable (e.g. during a metadata | ||
| // service outage), the circuit breaker opens and the hub falls back to | ||
| // local-only logging automatically. | ||
| var cloudHandler slog.Handler | ||
| if logging.IsCloudLoggingEnabled() { | ||
| logLevel := logging.ResolveLogLevel(enableDebug) | ||
|
|
@@ -460,9 +463,14 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl | |
| if cloudErr != nil { | ||
| log.Printf("Warning: failed to initialize Cloud Logging: %v", cloudErr) | ||
| } else { | ||
| cloudHandler = ch | ||
| // Wrap with resilient handler for circuit breaker protection. | ||
| resilientHandler, resilientCleanup := logging.NewResilientCloudHandler( | ||
| ch, logging.ResilientCloudHandlerConfig{}, | ||
| ) | ||
| cloudHandler = resilientHandler | ||
| cleanups = append(cleanups, cloudLogCleanup) | ||
| log.Printf("Cloud Logging enabled (logId=%s, project=%s)", logging.FormatLogID(), logging.FormatProjectID()) | ||
| cleanups = append(cleanups, resilientCleanup) | ||
| log.Printf("Cloud Logging enabled with circuit breaker (logId=%s, project=%s)", logging.FormatLogID(), logging.FormatProjectID()) | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -476,8 +484,9 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl | |
| Foreground: serverStartForeground, | ||
| Level: logging.ResolveLogLevel(enableDebug), | ||
| } | ||
| if ch, ok := cloudHandler.(*logging.CloudHandler); ok && ch != nil { | ||
| if ch, ok := cloudHandler.(*logging.ResilientCloudHandler); ok && ch != nil { | ||
| reqLogCfg.CloudClient = ch.Client() | ||
| reqLogCfg.CircuitOpen = ch.CircuitOpen | ||
| reqLogCfg.ProjectID = logging.FormatProjectID() | ||
| } | ||
|
Comment on lines
+487
to
491
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The request logger and message logger are initialized with the raw During a Cloud Logging outage, writes to these loggers will bypass the circuit breaker and attempt to write directly to Cloud Logging, which could still lead to resource exhaustion, hangs, or crashes. Consider updating |
||
| requestLogger, reqLogCleanup, reqErr := logging.NewRequestLogger(reqLogCfg) | ||
|
|
@@ -495,8 +504,9 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl | |
| UseGCP: useGCP, | ||
| Level: logging.ResolveLogLevel(enableDebug), | ||
| } | ||
| if ch, ok := cloudHandler.(*logging.CloudHandler); ok && ch != nil { | ||
| if ch, ok := cloudHandler.(*logging.ResilientCloudHandler); ok && ch != nil { | ||
| msgLogCfg.CloudClient = ch.Client() | ||
| msgLogCfg.CircuitOpen = ch.CircuitOpen | ||
| } | ||
| messageLogger, msgLogCleanup, msgErr := logging.NewMessageLogger(msgLogCfg) | ||
| if msgErr != nil { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
resilientCleanup(which stops the background health check goroutine) is appended tocleanupsbeforecloudLogCleanup(which closes the underlying Cloud Logging client).Because cleanups are executed in reverse order of how they are deferred in the main loop (
for _, cleanup := range logCleanups { defer cleanup() }),cloudLogCleanupwill execute beforeresilientCleanup. This means the underlying client/logger will be closed while the background health check goroutine is still running and potentially attempting to callFlush(), which can lead to panics or errors on shutdown.To fix this, append
cloudLogCleanuptocleanupsbeforeresilientCleanupso thatresilientCleanupis executed first.