Fix hub crash during Cloud Logging metadata outage by ptone · Pull Request #270 · GoogleCloudPlatform/scion

ptone · 2026-05-31T16:36:38Z

Summary

Fixes ptone#70 — Hub crashes when Cloud Logging retries exhaust resources during metadata outage.

Circuit breaker: Added ResilientCloudHandler that wraps CloudHandler with a three-state circuit breaker (closed → open → half-open). After 3 consecutive flush failures, the circuit opens and Cloud Logging entries are silently dropped. Local logging continues unaffected via the multiHandler. The circuit automatically probes for recovery and resumes Cloud Logging when the service returns.
Bounded buffer: Added BufferedByteLimit (8 MiB default) to the Cloud Logging client to prevent unbounded memory growth during transient failures.
Client creation timeout: Added a 15-second context timeout to gcplog.NewClient so the hub doesn't hang at startup when the metadata service is unreachable.

Acceptance Criteria

Hub remains operational when GCP metadata service is down
Cloud Logging retries are bounded (circuit breaker opens after 3 consecutive failures)
Hub falls back to local logging when Cloud Logging is unavailable
Warning is logged when operating in degraded (local-only) mode
Cloud Logging resumes automatically when the service recovers

Test plan

17 unit tests covering config defaults, circuit state transitions, Handle behavior in each state, failure/success tracking, WithAttrs/WithGroup state sharing, concurrent access (race detector)
All existing logging package tests pass
go vet clean
Race detector clean

gemini-code-assist

Code Review

This pull request introduces a circuit breaker pattern (ResilientCloudHandler) to protect the hub from crashing or hanging during GCP metadata or Cloud Logging outages, alongside adding bounded buffers and client timeouts. The review feedback highlights several critical issues: a cleanup ordering bug that could cause panics on shutdown, a potential goroutine leak in flushWithTimeout during prolonged outages, and a deadlock risk from holding mutexes while calling external logging functions. Additionally, the request and message loggers currently bypass the circuit breaker, and the core background health-checking and recovery logic remains untested.

gemini-code-assist · 2026-05-31T16:38:33Z

+			resilientHandler, resilientCleanup := logging.NewResilientCloudHandler(
+				ch, logging.ResilientCloudHandlerConfig{},
+			)
+			cloudHandler = resilientHandler
+			cleanups = append(cleanups, resilientCleanup)
 			cleanups = append(cleanups, cloudLogCleanup)


The resilientCleanup (which stops the background health check goroutine) is appended to cleanups before cloudLogCleanup (which closes the underlying Cloud Logging client).

Because cleanups are executed in reverse order of how they are deferred in the main loop (for _, cleanup := range logCleanups { defer cleanup() }), cloudLogCleanup will execute before resilientCleanup. This means the underlying client/logger will be closed while the background health check goroutine is still running and potentially attempting to call Flush(), which can lead to panics or errors on shutdown.

To fix this, append cloudLogCleanup to cleanups before resilientCleanup so that resilientCleanup is executed first.

Suggested change

resilientHandler, resilientCleanup := logging.NewResilientCloudHandler(

ch, logging.ResilientCloudHandlerConfig{},

)

cloudHandler = resilientHandler

cleanups = append(cleanups, resilientCleanup)

cleanups = append(cleanups, cloudLogCleanup)

resilientHandler, resilientCleanup := logging.NewResilientCloudHandler(

ch, logging.ResilientCloudHandlerConfig{},

)

cloudHandler = resilientHandler

cleanups = append(cleanups, cloudLogCleanup)

cleanups = append(cleanups, resilientCleanup)

gemini-code-assist · 2026-05-31T16:38:33Z

+	if ch, ok := cloudHandler.(*logging.ResilientCloudHandler); ok && ch != nil {
 		reqLogCfg.CloudClient = ch.Client()
 		reqLogCfg.ProjectID = logging.FormatProjectID()
 	}


The request logger and message logger are initialized with the raw *gcplog.Client (ch.Client()) instead of being wrapped by the ResilientCloudHandler.

During a Cloud Logging outage, writes to these loggers will bypass the circuit breaker and attempt to write directly to Cloud Logging, which could still lead to resource exhaustion, hangs, or crashes. Consider updating RequestLoggerConfig and MessageLoggerConfig to accept a wrapped handler or integrating the circuit breaker state at the client level so that all log streams benefit from the protection.

gemini-code-assist · 2026-05-31T16:38:33Z

+type ResilientCloudHandler struct {
+	inner  *CloudHandler
+	logger *gcplog.Logger
+	config ResilientCloudHandlerConfig
+	cb     *circuitBreaker
+
+	done chan struct{}
+	wg   sync.WaitGroup
+}


Add a flushInFlight atomic boolean to the ResilientCloudHandler struct to prevent concurrent flush operations and avoid goroutine leaks during prolonged outages.

Suggested change

type ResilientCloudHandler struct {

inner *CloudHandler

logger *gcplog.Logger

config ResilientCloudHandlerConfig

cb *circuitBreaker

done chan struct{}

wg sync.WaitGroup

}

type ResilientCloudHandler struct {

inner *CloudHandler

logger *gcplog.Logger

config ResilientCloudHandlerConfig

cb *circuitBreaker

flushInFlight atomic.Bool

done chan struct{}

wg sync.WaitGroup

}

gemini-code-assist · 2026-05-31T16:38:33Z

+func (h *ResilientCloudHandler) flushWithTimeout() error {
+	ctx, cancel := context.WithTimeout(context.Background(), h.config.ProbeTimeout)
+	defer cancel()
+
+	// Logger.Flush() doesn't accept a context, so we race it against our
+	// timeout. This prevents Flush from hanging indefinitely when the
+	// metadata service is unreachable.
+	errCh := make(chan error, 1)
+	go func() {
+		errCh <- h.logger.Flush()
+	}()
+
+	select {
+	case err := <-errCh:
+		return err
+	case <-ctx.Done():
+		return fmt.Errorf("cloud logging flush timed out after %v", h.config.ProbeTimeout)
+	}
+}


In flushWithTimeout, a new goroutine is spawned to call h.logger.Flush(). If the flush operation hangs indefinitely (which is the primary reason for the timeout), the spawned goroutine will be leaked. Since healthCheckLoop periodically calls runHealthCheck (and thus flushWithTimeout), this can lead to an unbounded accumulation of leaked goroutines during a prolonged outage.

To prevent this, use the flushInFlight atomic boolean to ensure that only one flush operation is active at any given time. If a previous flush is still running, subsequent health checks can immediately return an error without spawning a new goroutine.

func (h *ResilientCloudHandler) flushWithTimeout() error { if !h.flushInFlight.CompareAndSwap(false, true) { return fmt.Errorf("a previous flush is still in progress") } ctx, cancel := context.WithTimeout(context.Background(), h.config.ProbeTimeout) defer cancel() errCh := make(chan error, 1) go func() { defer h.flushInFlight.Store(false) errCh <- h.logger.Flush() }() select { case err := <-errCh: return err case <-ctx.Done(): return fmt.Errorf("cloud logging flush timed out after %v", h.config.ProbeTimeout) } }

gemini-code-assist · 2026-05-31T16:38:33Z

+func (h *ResilientCloudHandler) recordFailure(err error) {
+	h.cb.mu.Lock()
+	defer h.cb.mu.Unlock()
+
+	h.cb.failures++
+	if h.cb.failures >= h.config.MaxFailures && circuitState(h.cb.state.Load()) == circuitClosed {
+		h.cb.state.Store(int32(circuitOpen))
+		h.cb.lastStateChange = time.Now()
+		// Log to stderr since cloud logging is the one failing.
+		fmt.Fprintf(os.Stderr,
+			"WARNING: Cloud Logging circuit breaker opened after %d consecutive failures (last error: %v). Falling back to local-only logging.\n",
+			h.cb.failures, err)
+		// Also log via slog so the local handler captures it.
+		slog.Warn("Cloud Logging unavailable, falling back to local-only logging",
+			"consecutive_failures", h.cb.failures,
+			"error", err.Error(),
+		)
+	}
+}


In recordFailure, the mutex h.cb.mu is held while calling external logging functions like slog.Warn and fmt.Fprintf. Holding locks across external calls is a known anti-pattern that can lead to unexpected deadlocks or performance bottlenecks if the logging handlers block or call back into the circuit breaker.

It is safer to perform the state updates under the lock, release the lock, and then perform the logging.

func (h *ResilientCloudHandler) recordFailure(err error) { h.cb.mu.Lock() h.cb.failures++ opened := false if h.cb.failures >= h.config.MaxFailures && circuitState(h.cb.state.Load()) == circuitClosed { h.cb.state.Store(int32(circuitOpen)) h.cb.lastStateChange = time.Now() opened = true } failures := h.cb.failures h.cb.mu.Unlock() if opened { // Log to stderr since cloud logging is the one failing. fmt.Fprintf(os.Stderr, "WARNING: Cloud Logging circuit breaker opened after %d consecutive failures (last error: %v). Falling back to local-only logging.\n", failures, err) // Also log via slog so the local handler captures it. slog.Warn("Cloud Logging unavailable, falling back to local-only logging", "consecutive_failures", failures, "error", err.Error(), ) } }

gemini-code-assist · 2026-05-31T16:38:33Z

+// transitionTo atomically transitions the circuit to the given state and
+// logs the transition.
+func (h *ResilientCloudHandler) transitionTo(newState circuitState) {
+	h.cb.mu.Lock()
+	defer h.cb.mu.Unlock()
+
+	oldState := circuitState(h.cb.state.Load())
+	if oldState == newState {
+		return
+	}
+
+	h.cb.state.Store(int32(newState))
+	h.cb.lastStateChange = time.Now()
+
+	switch newState {
+	case circuitClosed:
+		h.cb.failures = 0
+		slog.Info("Cloud Logging circuit breaker closed: Cloud Logging resumed")
+	case circuitOpen:
+		slog.Warn("Cloud Logging circuit breaker opened: falling back to local-only logging")
+	case circuitHalfOpen:
+		slog.Info("Cloud Logging circuit breaker half-open: probing Cloud Logging")
+	}


In transitionTo, the mutex h.cb.mu is held while calling external logging functions like slog.Info and slog.Warn. Holding locks across external calls is a known anti-pattern that can lead to unexpected deadlocks or performance bottlenecks if the logging handlers block or call back into the circuit breaker.

It is safer to perform the state updates under the lock, release the lock, and then perform the logging.

func (h *ResilientCloudHandler) transitionTo(newState circuitState) { h.cb.mu.Lock() oldState := circuitState(h.cb.state.Load()) if oldState == newState { h.cb.mu.Unlock() return } h.cb.state.Store(int32(newState)) h.cb.lastStateChange = time.Now() if newState == circuitClosed { h.cb.failures = 0 } h.cb.mu.Unlock() switch newState { case circuitClosed: slog.Info("Cloud Logging circuit breaker closed: Cloud Logging resumed") case circuitOpen: slog.Warn("Cloud Logging circuit breaker opened: falling back to local-only logging") case circuitHalfOpen: slog.Info("Cloud Logging circuit breaker half-open: probing Cloud Logging") } }

gemini-code-assist · 2026-05-31T16:38:33Z

+func newTestResilientHandler(t *testing.T, cfg ResilientCloudHandlerConfig) *ResilientCloudHandler {
+	t.Helper()
+	// Create a minimal CloudHandler without a real client/logger.
+	inner := &CloudHandler{
+		level:     slog.LevelInfo,
+		component: "test",
+		hostname:  "test-host",
+	}
+	cfg.applyDefaults()
+	cb := &circuitBreaker{}
+	cb.state.Store(int32(circuitClosed))
+	h := &ResilientCloudHandler{
+		inner:  inner,
+		config: cfg,
+		cb:     cb,
+		done:   make(chan struct{}),
+	}
+	// Don't start the health check loop — tests drive it explicitly.
+	return h
+}


The test helper newTestResilientHandler explicitly avoids starting the background health check loop, and none of the unit tests invoke runHealthCheck, healthCheckLoop, or flushWithTimeout. As a result, the core background health checking, timeout-guarded flushing, and automatic recovery/probing logic is completely untested.

Consider adding unit tests that mock the gcplog.Logger or use a test double to verify that runHealthCheck correctly transitions the circuit state on failures and successes, and that timeouts are handled properly.

- Fix cleanup ordering: append cloudLogCleanup before resilientCleanup so health check goroutine stops before client closes on shutdown - Add circuitGatedHandler to gate request/message loggers through the circuit breaker, preventing bypass during Cloud Logging outages - Add flushInFlight atomic.Bool to prevent goroutine leaks from concurrent flush operations during prolonged outages - Move slog/fmt logging outside mutex in recordFailure and transitionTo to prevent potential deadlocks - Add tests for runHealthCheck, flushWithTimeout, flushInFlight guard, and circuitGatedHandler - Fix unused _linkedTelegramId in profile-telegram.ts (CI type check)

Add circuit breaker pattern to Cloud Logging integration: - ResilientCloudHandler wraps CloudHandler with three-state circuit breaker (closed/open/half-open) and background health checks - flushInFlight atomic prevents goroutine leaks from concurrent flushes - circuitGatedHandler gates request/message loggers through the circuit breaker so all log streams benefit from protection - CloudHandler gets BufferedByteLimit (8MiB) and ClientTimeout (15s) - Cleanup ordering ensures health check goroutine stops before client - Mutex-protected logging moved outside locks to prevent deadlocks

gemini-code-assist Bot reviewed May 31, 2026

View reviewed changes

ptone mentioned this pull request May 31, 2026

Fix hub crash during Cloud Logging metadata outage ptone/scion#84

Closed

9 tasks

scion-gteam Bot force-pushed the scion/dev-issue-70 branch from 0c86336 to fb75b99 Compare June 1, 2026 00:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix hub crash during Cloud Logging metadata outage#270

Fix hub crash during Cloud Logging metadata outage#270
ptone wants to merge 1 commit into
GoogleCloudPlatform:mainfrom
ptone:scion/dev-issue-70

ptone commented May 31, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Uh oh!

gemini-code-assist Bot May 31, 2026

Uh oh!

gemini-code-assist Bot May 31, 2026

Uh oh!

gemini-code-assist Bot May 31, 2026

Uh oh!

gemini-code-assist Bot May 31, 2026

Uh oh!

gemini-code-assist Bot May 31, 2026

Uh oh!

gemini-code-assist Bot May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ptone commented May 31, 2026

Summary

Acceptance Criteria

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant