Merged
61 changes: 61 additions & 0 deletions .design/project-log/2026-05-31-cloud-logging-circuit-breaker.md
@@ -0,0 +1,61 @@
# Cloud Logging Circuit Breaker (Issue #70)

**Date:** 2026-05-31
**Author:** dev-issue-70
**Issue:** #70 — Hub crashes when Cloud Logging retries exhaust resources during metadata outage

## Summary

Added a circuit breaker pattern to the Cloud Logging integration so the hub
remains operational when the GCP metadata service (or Cloud Logging API) is
unavailable.

## Changes

### New: `pkg/util/logging/resilient_cloud_handler.go`

- `ResilientCloudHandler` wraps `CloudHandler` with a three-state circuit breaker
(closed → open → half-open → closed).
- Background goroutine runs periodic flush health checks against the Cloud
Logging buffer.
- When consecutive failures exceed the threshold (default: 3), the circuit
opens and `Handle()` silently drops entries from the cloud path. Local
logging via the `multiHandler` continues unaffected.
- After `OpenDuration` (default: 60s), the circuit transitions to half-open
and probes with a timeout-guarded flush. On success the circuit closes and
Cloud Logging resumes automatically.
- Circuit breaker state is shared across derived handlers (`WithAttrs` /
`WithGroup`) via a `*circuitBreaker` pointer.
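
The state machine and pointer sharing described above can be sketched roughly as follows. This is a minimal illustration, not the code from `resilient_cloud_handler.go`: the names `breaker`, `gated`, `tick`, and `scenario` are hypothetical, the sketch opens the circuit when the failure count reaches the threshold (the exact comparison in the real code is not shown here), and it drops records in the half-open state, which is also an assumption.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log/slog"
	"sync/atomic"
	"time"
)

// Circuit states (illustrative values).
const (
	stateClosed int32 = iota
	stateOpen
	stateHalfOpen
)

// breaker is a hypothetical stand-in for the shared circuit breaker state.
type breaker struct {
	state     atomic.Int32
	failures  atomic.Int32
	openedAt  atomic.Int64 // UnixNano when the circuit last opened
	threshold int32
	openFor   time.Duration
}

// recordFailure opens the circuit when the failure count reaches the
// threshold, or immediately if a half-open probe fails.
func (b *breaker) recordFailure(now time.Time) {
	n := b.failures.Add(1)
	if n >= b.threshold || b.state.Load() == stateHalfOpen {
		b.openedAt.Store(now.UnixNano())
		b.state.Store(stateOpen)
	}
}

// recordSuccess resets the failure count and closes the circuit.
func (b *breaker) recordSuccess() {
	b.failures.Store(0)
	b.state.Store(stateClosed)
}

// tick moves an open circuit to half-open once openFor has elapsed.
func (b *breaker) tick(now time.Time) {
	if b.state.Load() == stateOpen && now.UnixNano()-b.openedAt.Load() >= int64(b.openFor) {
		b.state.CompareAndSwap(stateOpen, stateHalfOpen)
	}
}

// gated wraps an slog.Handler and drops records unless the circuit is closed.
type gated struct {
	inner slog.Handler
	cb    *breaker // pointer, so derived handlers share one state
}

func (g *gated) Enabled(ctx context.Context, l slog.Level) bool {
	return g.inner.Enabled(ctx, l)
}

func (g *gated) Handle(ctx context.Context, r slog.Record) error {
	if g.cb.state.Load() != stateClosed {
		return nil // silently drop on the cloud path
	}
	return g.inner.Handle(ctx, r)
}

func (g *gated) WithAttrs(a []slog.Attr) slog.Handler {
	return &gated{inner: g.inner.WithAttrs(a), cb: g.cb}
}

func (g *gated) WithGroup(name string) slog.Handler {
	return &gated{inner: g.inner.WithGroup(name), cb: g.cb}
}

// scenario walks closed -> open -> half-open -> closed and reports the
// state after each step, plus whether a derived handler shares the breaker.
func scenario() (states []int32, shared bool) {
	start := time.Unix(0, 0)
	b := &breaker{threshold: 3, openFor: 60 * time.Second}
	h := &gated{inner: slog.NewTextHandler(io.Discard, nil), cb: b}
	derived := h.WithAttrs([]slog.Attr{slog.String("k", "v")}).(*gated)

	for i := 0; i < 3; i++ {
		b.recordFailure(start)
	}
	states = append(states, b.state.Load()) // open
	b.tick(start.Add(61 * time.Second))
	states = append(states, b.state.Load()) // half-open
	b.recordSuccess()
	states = append(states, b.state.Load()) // closed
	return states, derived.cb == b
}

func main() {
	states, shared := scenario()
	fmt.Println(states, shared) // prints "[1 2 0] true"
}
```

Because `WithAttrs` and `WithGroup` copy only the pointer, a state change made through any handler is visible to every handler derived from it.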

### Modified: `pkg/util/logging/cloud_handler.go`

- Added `BufferedByteLimit` (default 8 MiB) to the `gcplog.Logger` to cap
the internal write buffer and prevent unbounded memory growth.
- Added `ClientTimeout` (default 15s), applied as a deadline on the context
passed to `gcplog.NewClient`, so startup doesn't hang when metadata is unreachable.
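
The "zero means default" handling and the bounded creation call can be illustrated with the standard library alone. In this sketch, `createWithTimeout`, `unreachable`, `bufferLimit`, and `timesOut` are hypothetical helpers; `unreachable` stands in for a constructor that blocks while the metadata service is down, and the real code calls `gcplog.NewClient` instead.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Defaults mirroring the values named above.
const (
	defaultBufferedByteLimit = 8 << 20 // 8 MiB
	defaultClientTimeout     = 15 * time.Second
)

// bufferLimit treats zero or negative values as "use the default".
func bufferLimit(configured int) int {
	if configured <= 0 {
		return defaultBufferedByteLimit
	}
	return configured
}

// createWithTimeout runs create under a deadline so that an unreachable
// dependency cannot block startup indefinitely.
func createWithTimeout(parent context.Context, timeout time.Duration, create func(context.Context) error) error {
	if timeout <= 0 {
		timeout = defaultClientTimeout
	}
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return create(ctx)
}

// unreachable simulates a constructor that blocks until its context ends.
func unreachable(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// timesOut reports whether the simulated hang is cut off by the deadline.
func timesOut(timeoutMillis int) bool {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	err := createWithTimeout(context.Background(), timeout, unreachable)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(bufferLimit(0), timesOut(50)) // prints "8388608 true"
}
```

The deadline only helps if the wrapped call honors context cancellation, which the sketch assumes.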

### Modified: `cmd/server_foreground.go`

- `initServerLogging` wraps the `CloudHandler` with `ResilientCloudHandler`.
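- Updated type assertions for `Client()` access from `*CloudHandler` to
`*ResilientCloudHandler`.
- Passes `ch.CircuitOpen` to the request and message logger configs so their
cloud handlers are gated by the same circuit breaker.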

## Design Decisions

- **Circuit breaker over retry limiter**: Capping retries still lets each log
entry independently discover that the backend is down. A circuit breaker
stops all Cloud Logging traffic for the duration of the outage.
- **Flush-based health detection**: Since `gcplog.Logger.Log()` is async
and doesn't return errors, we use periodic `Flush()` calls with timeouts
to detect backend failures.
- **Shared state via pointer**: The `circuitBreaker` struct is heap-allocated
and shared by pointer, avoiding `go vet` complaints about copying
`atomic.Int32` in `WithAttrs`/`WithGroup`.
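
A timeout-guarded flush probe of the kind described under "Flush-based health detection" might look like the sketch below. It is an assumption about the general shape, not the actual health check: `flushWithTimeout`, `probe`, and `errFlushTimeout` are hypothetical names, and `flush` stands in for `gcplog.Logger.Flush`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errFlushTimeout is a hypothetical sentinel for a probe that ran out of time.
var errFlushTimeout = errors.New("flush timed out")

// flushWithTimeout runs flush in a goroutine and gives up after timeout.
// The channel is buffered so a late flush can still send its result and
// exit instead of blocking forever.
func flushWithTimeout(flush func() error, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- flush() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errFlushTimeout
	}
}

// probe simulates a flush that takes flushMillis and is allowed timeoutMillis.
func probe(flushMillis, timeoutMillis int) error {
	flush := func() error {
		time.Sleep(time.Duration(flushMillis) * time.Millisecond)
		return nil
	}
	return flushWithTimeout(flush, time.Duration(timeoutMillis)*time.Millisecond)
}

func main() {
	fmt.Println(probe(0, 1000)) // healthy backend
	fmt.Println(probe(500, 20)) // stalled backend
}
```

Note that the timeout abandons the wait, not the flush itself: the goroutine keeps running until the underlying call returns, so a probe loop built this way should avoid starting a new probe while an old one is still outstanding.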

## Testing

- 17 unit tests covering: config defaults, state transitions, Handle behavior
in each circuit state, failure/success tracking, WithAttrs/WithGroup
state sharing, concurrent access safety (race detector).
- All existing logging tests continue to pass.
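
As a rough idea of what a concurrent-access check can look like, the sketch below hammers an atomic counter from several goroutines and verifies the total. The `counter` type and `hammer` helper are hypothetical and are not taken from the test suite; running it with the race detector enabled (`go run -race`) is what would surface unsynchronized access.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter is a hypothetical stand-in for breaker state updated concurrently.
type counter struct {
	failures atomic.Int32
}

// hammer increments the counter from several goroutines and returns the total.
func hammer(workers, perWorker int) int32 {
	var c counter
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.failures.Add(1)
			}
		}()
	}
	wg.Wait()
	return c.failures.Load()
}

func main() {
	fmt.Println(hammer(8, 1000)) // prints "8000"
}
```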
20 changes: 15 additions & 5 deletions cmd/server_foreground.go
@@ -449,7 +449,10 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl
cleanups = append(cleanups, logCleanup)
}

- // Initialize direct Cloud Logging
+ // Initialize direct Cloud Logging with circuit breaker protection.
+ // If Cloud Logging becomes unavailable (e.g. during a metadata
+ // service outage), the circuit breaker opens and the hub falls back to
+ // local-only logging automatically.
var cloudHandler slog.Handler
if logging.IsCloudLoggingEnabled() {
logLevel := logging.ResolveLogLevel(enableDebug)
@@ -460,9 +463,14 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl
if cloudErr != nil {
log.Printf("Warning: failed to initialize Cloud Logging: %v", cloudErr)
} else {
- cloudHandler = ch
+ // Wrap with resilient handler for circuit breaker protection.
+ resilientHandler, resilientCleanup := logging.NewResilientCloudHandler(
+ ch, logging.ResilientCloudHandlerConfig{},
+ )
+ cloudHandler = resilientHandler
cleanups = append(cleanups, cloudLogCleanup)
Comment on lines +467 to 471

high

`resilientCleanup` (which stops the background health check goroutine) is appended to `cleanups` before `cloudLogCleanup` (which closes the underlying Cloud Logging client).

The main loop defers each cleanup in slice order (`for _, cleanup := range logCleanups { defer cleanup() }`), and deferred calls run last-in, first-out, so `cloudLogCleanup` executes before `resilientCleanup`. The underlying client/logger is therefore closed while the background health check goroutine is still running and may still call `Flush()`, which can lead to panics or errors on shutdown.

To fix this, append `cloudLogCleanup` to `cleanups` before `resilientCleanup` so that `resilientCleanup` runs first.

Suggested change
- resilientHandler, resilientCleanup := logging.NewResilientCloudHandler(
- ch, logging.ResilientCloudHandlerConfig{},
- )
- cloudHandler = resilientHandler
- cleanups = append(cleanups, resilientCleanup)
- cleanups = append(cleanups, cloudLogCleanup)
+ resilientHandler, resilientCleanup := logging.NewResilientCloudHandler(
+ ch, logging.ResilientCloudHandlerConfig{},
+ )
+ cloudHandler = resilientHandler
+ cleanups = append(cleanups, cloudLogCleanup)
+ cleanups = append(cleanups, resilientCleanup)
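
The ordering argument in this comment can be checked with a small standalone program. The sketch assumes the cleanups are run by a defer-in-a-loop like the one quoted in the comment; `runCleanups` and `order` are hypothetical helpers, and the cleanup names are just labels.

```go
package main

import "fmt"

// runCleanups defers each cleanup in slice order. Deferred calls run
// last-in, first-out, so the last appended cleanup runs first.
func runCleanups(cleanups []func()) {
	for _, cleanup := range cleanups {
		defer cleanup()
	}
}

// order appends one labeled cleanup per name, runs them, and returns the
// labels in the order the cleanups actually executed.
func order(names ...string) []string {
	var ran []string
	var cleanups []func()
	for _, name := range names {
		name := name // per-iteration copy for Go versions before 1.22
		cleanups = append(cleanups, func() { ran = append(ran, name) })
	}
	runCleanups(cleanups)
	return ran
}

func main() {
	// Appending resilientCleanup first closes the client first.
	fmt.Println(order("resilientCleanup", "cloudLogCleanup")) // [cloudLogCleanup resilientCleanup]
	// Appending cloudLogCleanup first stops the health check first.
	fmt.Println(order("cloudLogCleanup", "resilientCleanup")) // [resilientCleanup cloudLogCleanup]
}
```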

- log.Printf("Cloud Logging enabled (logId=%s, project=%s)", logging.FormatLogID(), logging.FormatProjectID())
+ cleanups = append(cleanups, resilientCleanup)
+ log.Printf("Cloud Logging enabled with circuit breaker (logId=%s, project=%s)", logging.FormatLogID(), logging.FormatProjectID())
}
}

@@ -476,8 +484,9 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl
Foreground: serverStartForeground,
Level: logging.ResolveLogLevel(enableDebug),
}
- if ch, ok := cloudHandler.(*logging.CloudHandler); ok && ch != nil {
+ if ch, ok := cloudHandler.(*logging.ResilientCloudHandler); ok && ch != nil {
reqLogCfg.CloudClient = ch.Client()
+ reqLogCfg.CircuitOpen = ch.CircuitOpen
reqLogCfg.ProjectID = logging.FormatProjectID()
}
Comment on lines +487 to 491

high

The request logger and message logger are initialized with the raw *gcplog.Client (ch.Client()) instead of being wrapped by the ResilientCloudHandler.

During a Cloud Logging outage, writes to these loggers will bypass the circuit breaker and attempt to write directly to Cloud Logging, which could still lead to resource exhaustion, hangs, or crashes. Consider updating RequestLoggerConfig and MessageLoggerConfig to accept a wrapped handler or integrating the circuit breaker state at the client level so that all log streams benefit from the protection.
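
One way to gate a separate log stream on shared breaker state is a thin `slog.Handler` wrapper driven by a `func() bool` callback, which is the shape suggested by the `CircuitOpen` field and the `circuitGatedHandler` literal in the `request_log.go` and `message_log.go` hunks. The definition of `circuitGatedHandler` is not part of this diff, so the sketch below is an assumption: `gatedHandler`, `countingHandler`, and `delivered` are hypothetical names, and the real type may behave differently.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
)

// countingHandler is a test double that counts the records it receives.
type countingHandler struct{ n *int }

func (c countingHandler) Enabled(context.Context, slog.Level) bool { return true }
func (c countingHandler) Handle(context.Context, slog.Record) error {
	*c.n++
	return nil
}
func (c countingHandler) WithAttrs([]slog.Attr) slog.Handler { return c }
func (c countingHandler) WithGroup(string) slog.Handler      { return c }

// gatedHandler drops records while circuitOpen reports true.
type gatedHandler struct {
	inner       slog.Handler
	circuitOpen func() bool
}

func (g *gatedHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return !g.circuitOpen() && g.inner.Enabled(ctx, l)
}

func (g *gatedHandler) Handle(ctx context.Context, r slog.Record) error {
	if g.circuitOpen() {
		return nil // drop instead of touching the backend
	}
	return g.inner.Handle(ctx, r)
}

func (g *gatedHandler) WithAttrs(a []slog.Attr) slog.Handler {
	return &gatedHandler{inner: g.inner.WithAttrs(a), circuitOpen: g.circuitOpen}
}

func (g *gatedHandler) WithGroup(name string) slog.Handler {
	return &gatedHandler{inner: g.inner.WithGroup(name), circuitOpen: g.circuitOpen}
}

// delivered logs four records, two of them while the circuit is open, and
// returns how many reached the inner handler.
func delivered() int {
	count := 0
	open := false
	logger := slog.New(&gatedHandler{
		inner:       countingHandler{n: &count},
		circuitOpen: func() bool { return open },
	})

	logger.Info("before outage")
	open = true
	logger.Info("during outage 1")
	logger.Info("during outage 2")
	open = false
	logger.Info("after recovery")
	return count
}

func main() {
	fmt.Println(delivered()) // prints "2"
}
```

A callback keeps the request and message loggers decoupled from the breaker type itself: they only need to know whether to write, not how the state is tracked.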

requestLogger, reqLogCleanup, reqErr := logging.NewRequestLogger(reqLogCfg)
@@ -495,8 +504,9 @@ func initServerLogging(cmd *cobra.Command) (cleanups []func(), requestLogger *sl
UseGCP: useGCP,
Level: logging.ResolveLogLevel(enableDebug),
}
- if ch, ok := cloudHandler.(*logging.CloudHandler); ok && ch != nil {
+ if ch, ok := cloudHandler.(*logging.ResilientCloudHandler); ok && ch != nil {
msgLogCfg.CloudClient = ch.Client()
+ msgLogCfg.CircuitOpen = ch.CircuitOpen
}
messageLogger, msgLogCleanup, msgErr := logging.NewMessageLogger(msgLogCfg)
if msgErr != nil {
36 changes: 34 additions & 2 deletions pkg/util/logging/cloud_handler.go
@@ -37,6 +37,16 @@ const (
EnvGoogleCloudProject = "GOOGLE_CLOUD_PROJECT"
)

+ // Default buffer limits for Cloud Logging.
+ const (
+ // DefaultBufferedByteLimit is the maximum bytes the Cloud Logging
+ // client will buffer before dropping entries (8 MiB).
+ DefaultBufferedByteLimit = 8 << 20
+ // DefaultClientTimeout is the timeout for creating a Cloud Logging
+ // client (covers initial connection and credential fetch).
+ DefaultClientTimeout = 15 * time.Second
+ )
+
// CloudLoggingConfig holds configuration for direct Cloud Logging.
type CloudLoggingConfig struct {
// ProjectID is the GCP project ID.
@@ -45,6 +55,13 @@ type CloudLoggingConfig struct {
LogID string
// Component is the server component name (e.g., "scion-hub").
Component string
+ // BufferedByteLimit is the maximum bytes the Cloud Logging client
+ // will buffer. Prevents unbounded memory growth when Cloud Logging
+ // is temporarily unavailable. Default: 8 MiB.
+ BufferedByteLimit int
+ // ClientTimeout is the timeout for creating the Cloud Logging client.
+ // Default: 15s.
+ ClientTimeout time.Duration
}

// CloudHandler is a slog.Handler that sends log entries directly to
@@ -76,12 +93,27 @@ func NewCloudHandler(ctx context.Context, config CloudLoggingConfig, level slog.
logID = resolveLogID()
}

- client, err := gcplog.NewClient(ctx, projectID)
+ // Apply a timeout to client creation so we don't hang indefinitely
+ // when the GCP metadata service is unreachable.
+ clientTimeout := config.ClientTimeout
+ if clientTimeout <= 0 {
+ clientTimeout = DefaultClientTimeout
+ }
+ clientCtx, clientCancel := context.WithTimeout(ctx, clientTimeout)
+ defer clientCancel()
+
+ client, err := gcplog.NewClient(clientCtx, projectID)
if err != nil {
return nil, nil, fmt.Errorf("creating Cloud Logging client: %w", err)
}

- logger := client.Logger(logID)
+ // Apply a bounded buffer to prevent unbounded memory growth when
+ // Cloud Logging is temporarily unavailable.
+ bufLimit := config.BufferedByteLimit
+ if bufLimit <= 0 {
+ bufLimit = DefaultBufferedByteLimit
+ }
+ logger := client.Logger(logID, gcplog.BufferedByteLimit(bufLimit))

hostname, _ := os.Hostname()

7 changes: 6 additions & 1 deletion pkg/util/logging/message_log.go
@@ -38,6 +38,7 @@ const (
// MessageLoggerConfig configures the dedicated message logger.
type MessageLoggerConfig struct {
CloudClient *gcplog.Client // Shared GCP client (nil if not enabled)
+ CircuitOpen func() bool // Returns true when circuit breaker is open (nil = never open)
Component string // "scion-server", "scion-hub", "scion-broker"
UseGCP bool // Format output as GCP-compatible JSON
Level slog.Level
@@ -56,7 +57,11 @@ func NewMessageLogger(cfg MessageLoggerConfig) (*slog.Logger, func(), error) {
// Cloud handler with dedicated log ID and message-aware label promotion
if cfg.CloudClient != nil {
ch := newMessageCloudHandler(cfg.CloudClient, MessageLogID, cfg.Component, cfg.Level)
- handlers = append(handlers, ch)
+ var cloudHandler slog.Handler = ch
+ if cfg.CircuitOpen != nil {
+ cloudHandler = &circuitGatedHandler{inner: ch, circuitOpen: cfg.CircuitOpen}
+ }
+ handlers = append(handlers, cloudHandler)
cleanups = append(cleanups, func() {
ch.logger.Flush()
})
21 changes: 13 additions & 8 deletions pkg/util/logging/request_log.go
@@ -159,13 +159,14 @@ func SetRequestBrokerID(ctx context.Context, brokerID string) {

// RequestLoggerConfig configures the dedicated request logger.
type RequestLoggerConfig struct {
- FilePath string // From SCION_SERVER_REQUEST_LOG_PATH
- CloudClient *gcplog.Client // Shared GCP client (nil if not enabled)
- ProjectID string // For trace URL formatting
- Component string // "scion-server", "scion-hub", "scion-broker"
- UseGCP bool // Format output as GCP-compatible JSON
- Foreground bool // If true, suppress stdout output
- Level slog.Level
+ FilePath string // From SCION_SERVER_REQUEST_LOG_PATH
+ CloudClient *gcplog.Client // Shared GCP client (nil if not enabled)
+ CircuitOpen func() bool // Returns true when circuit breaker is open (nil = never open)
+ ProjectID string // For trace URL formatting
+ Component string // "scion-server", "scion-hub", "scion-broker"
+ UseGCP bool // Format output as GCP-compatible JSON
+ Foreground bool // If true, suppress stdout output
+ Level slog.Level
}

// NewRequestLogger creates a dedicated request logger with the configured outputs.
@@ -192,7 +193,11 @@ func NewRequestLogger(cfg RequestLoggerConfig) (*slog.Logger, func(), error) {
// Cloud handler
if cfg.CloudClient != nil {
ch := NewCloudHandlerFromClient(cfg.CloudClient, RequestLogID, cfg.Component, cfg.Level)
- handlers = append(handlers, ch)
+ var cloudHandler slog.Handler = ch
+ if cfg.CircuitOpen != nil {
+ cloudHandler = &circuitGatedHandler{inner: ch, circuitOpen: cfg.CircuitOpen}
+ }
+ handlers = append(handlers, cloudHandler)
cleanups = append(cleanups, func() {
ch.logger.Flush()
})