Skip to content

Fix hub crash during Cloud Logging metadata outage#84

Closed
scion-gteam[bot] wants to merge 2 commits into
mainfrom
scion/dev-issue-70
Closed

Fix hub crash during Cloud Logging metadata outage#84
scion-gteam[bot] wants to merge 2 commits into
mainfrom
scion/dev-issue-70

Conversation

@scion-gteam
Copy link
Copy Markdown

@scion-gteam scion-gteam Bot commented May 31, 2026

Summary

Fixes #70 — Hub crashes when Cloud Logging retries exhaust resources during metadata outage.

  • Circuit breaker: Added ResilientCloudHandler that wraps CloudHandler with a three-state circuit breaker (closed → open → half-open). After 3 consecutive flush failures, the circuit opens and Cloud Logging entries are silently dropped. Local logging continues unaffected via the multiHandler. The circuit automatically probes for recovery and resumes Cloud Logging when the service returns.
  • Bounded buffer: Added BufferedByteLimit (8 MiB default) to the Cloud Logging client to prevent unbounded memory growth during transient failures.
  • Client creation timeout: Added a 15-second context timeout to gcplog.NewClient so the hub doesn't hang at startup when the metadata service is unreachable.

Acceptance Criteria

  • Hub remains operational when GCP metadata service is down
  • Cloud Logging retries are bounded (circuit breaker opens after 3 consecutive failures)
  • Hub falls back to local logging when Cloud Logging is unavailable
  • Warning is logged when operating in degraded (local-only) mode
  • Cloud Logging resumes automatically when the service recovers

Test plan

  • 17 unit tests covering config defaults, circuit state transitions, Handle behavior in each state, failure/success tracking, WithAttrs/WithGroup state sharing, concurrent access (race detector)
  • All existing logging package tests pass
  • go vet clean
  • Race detector clean

ptone added 2 commits May 31, 2026 03:14
…etadata outages

When the GCP metadata service is unavailable, Cloud Logging retries could
exhaust resources (goroutines, connections, memory) and crash the hub.

This change adds three layers of protection:

1. ResilientCloudHandler: A circuit breaker wrapper around CloudHandler that
   monitors Cloud Logging health via periodic flush checks. After consecutive
   failures exceed the threshold (default: 3), the circuit opens and log
   entries are silently dropped from the Cloud Logging path. Local logging
   continues unaffected via the multiHandler. The circuit automatically
   probes for recovery and resumes Cloud Logging when the service returns.

2. BufferedByteLimit: Caps the Cloud Logging client's internal buffer at
   8 MiB to prevent unbounded memory growth during transient failures.

3. Client creation timeout: Adds a 15-second timeout to Cloud Logging
   client initialization so the hub doesn't hang at startup when the
   metadata service is unreachable.

Fixes #70
@ptone
Copy link
Copy Markdown
Owner

ptone commented May 31, 2026

/gemini

@ptone ptone closed this May 31, 2026
@ptone
Copy link
Copy Markdown
Owner

ptone commented May 31, 2026

This pull request has been recreated on the target repository as GoogleCloudPlatform#270.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Hub crashes when Cloud Logging retries exhaust resources during metadata outage

1 participant