feat: add unified metrics caching with configurable backends#33

Closed
appleboy wants to merge 1 commit into main from cache

Conversation


@appleboy appleboy commented Feb 17, 2026

  • Add comprehensive configuration options and documentation for gauge metrics updates and metrics cache, including support for memory, redis, and redis-aside cache types
  • Introduce the rueidis and rueidisaside libraries for specialized Redis caching strategies
  • Implement cache interfaces and concrete cache backends (memory, redis, redis-aside), including test coverage for all behaviors
  • Add environment variable support and validation for metrics cache type and TTL
  • Create a metrics cache wrapper that provides unified cache-aside support for database count queries (tokens, device codes)
  • Switch gauge metrics updates to use the cache wrapper for count queries, reducing database load in production and enabling cache-backed metrics
  • Set the default gauge update interval to 5 minutes for improved database performance; update gauge job logic and interval management accordingly
  • Add tests for metrics cache validation, cache interface, and cache-assisted metrics updating
  • Extend the SQLite store with count methods for device codes and pending codes
  • Ensure all cache resources are cleaned up gracefully during shutdown
  • Correct minor doc/formatting issues in audit feature description

Copilot AI review requested due to automatic review settings February 17, 2026 08:22
if err != nil {
	m.RecordDatabaseQueryError("count_refresh_tokens")
	gaugeErrorLogger.logIfNeeded("count_refresh_tokens", err)
} else {

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types (High)

Incorrect conversion of a signed 64-bit integer from strconv.ParseInt to a lower bit size type int without an upper bound check. (Reported at three call sites.)

Copilot Autofix

AI 2 days ago

In general, to fix this class of problems you should avoid converting from a wider integer type (here int64) to a narrower type (here potentially 32‑bit int) without enforcing bounds that guarantee no overflow. The cleanest approach is to keep the value in the wide type all the way through, or to change the API to accept the wider type, rather than down‑casting.

In this codebase, the best minimal fix is to avoid the int(...) casts when recording metrics and pass the int64 counts directly to the metrics recorder, assuming its interface accepts or can be changed to accept int64. However, the snippet shows only the call sites, not the MetricsRecorder interface. Since we are constrained to edit only the provided snippets and must not change other files, we cannot safely change the metrics interface signature. Therefore, the next best fix that preserves existing functionality is to explicitly clamp the int64 values into the valid range of int before casting. On platforms where int is 64‑bit, this clamp will be a no‑op; on 32‑bit platforms, it prevents overflow by capping large counts at math.MaxInt (and similarly handling negative values if they ever occurred).

Concretely, in main.go we will import math and replace each int(...) cast on values that originated as int64 (activeAccessTokens, activeRefreshTokens, totalDeviceCodes, pendingDeviceCodes) with calls to a small helper safeIntFromInt64 that clamps to the int range. This helper will live in main.go below the imports so we don’t modify any other packages. We will then use safeIntFromInt64 in both updateGaugeMetrics and updateGaugeMetricsWithCache when calling m.SetActiveTokensCount and m.SetActiveDeviceCodesCount. This preserves behavior on 64‑bit systems and removes overflow risk on 32‑bit systems, addressing all three CodeQL variants because all of them trace back to these int(...) conversions.

Suggested changeset 1
main.go

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/main.go b/main.go
--- a/main.go
+++ b/main.go
@@ -33,6 +33,7 @@
 	"fmt"
 	"io/fs"
 	"log"
+	"math"
 	"net/http"
 	"os"
 	"time"
@@ -63,6 +64,17 @@
 	_ "github.com/appleboy/authgate/api" // swagger docs
 )
 
+// safeIntFromInt64 safely converts an int64 to int, clamping to the valid int range.
+func safeIntFromInt64(v int64) int {
+	if v > int64(math.MaxInt) {
+		return math.MaxInt
+	}
+	if v < int64(math.MinInt) {
+		return math.MinInt
+	}
+	return int(v)
+}
+
 //go:embed internal/templates/*
 var templatesFS embed.FS
 
@@ -888,7 +900,7 @@
 		m.RecordDatabaseQueryError("count_access_tokens")
 		gaugeErrorLogger.logIfNeeded("count_access_tokens", err)
 	} else {
-		m.SetActiveTokensCount("access", int(activeAccessTokens))
+		m.SetActiveTokensCount("access", safeIntFromInt64(activeAccessTokens))
 	}
 
 	// Update active refresh tokens count
@@ -897,7 +909,7 @@
 		m.RecordDatabaseQueryError("count_refresh_tokens")
 		gaugeErrorLogger.logIfNeeded("count_refresh_tokens", err)
 	} else {
-		m.SetActiveTokensCount("refresh", int(activeRefreshTokens))
+		m.SetActiveTokensCount("refresh", safeIntFromInt64(activeRefreshTokens))
 	}
 
 	// Update active device codes count
@@ -906,7 +918,7 @@
 		m.RecordDatabaseQueryError("count_device_codes")
 		gaugeErrorLogger.logIfNeeded("count_device_codes", err)
 	} else {
-		m.SetActiveDeviceCodesCount(int(totalDeviceCodes), int(pendingDeviceCodes))
+		m.SetActiveDeviceCodesCount(safeIntFromInt64(totalDeviceCodes), safeIntFromInt64(pendingDeviceCodes))
 	}
 }
 
@@ -925,7 +937,7 @@
 		m.RecordDatabaseQueryError("count_access_tokens")
 		gaugeErrorLogger.logIfNeeded("count_access_tokens", err)
 	} else {
-		m.SetActiveTokensCount("access", int(activeAccessTokens))
+		m.SetActiveTokensCount("access", safeIntFromInt64(activeAccessTokens))
 	}
 
 	// Update active refresh tokens count
@@ -934,7 +946,7 @@
 		m.RecordDatabaseQueryError("count_refresh_tokens")
 		gaugeErrorLogger.logIfNeeded("count_refresh_tokens", err)
 	} else {
-		m.SetActiveTokensCount("refresh", int(activeRefreshTokens))
+		m.SetActiveTokensCount("refresh", safeIntFromInt64(activeRefreshTokens))
 	}
 
 	// Update active device codes count
@@ -952,5 +964,5 @@
 		pendingDeviceCodes = 0
 	}
 
-	m.SetActiveDeviceCodesCount(int(totalDeviceCodes), int(pendingDeviceCodes))
+	m.SetActiveDeviceCodesCount(safeIntFromInt64(totalDeviceCodes), safeIntFromInt64(pendingDeviceCodes))
 }
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
main.go Outdated

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types (High)

Incorrect conversion of a signed 64-bit integer from strconv.ParseInt to a lower bit size type int without an upper bound check. (Reported at three call sites.)

Copilot Autofix

AI 2 days ago

In general, to fix this type of issue you should avoid narrowing conversions from a larger integer type (here int64) to a smaller one (here potentially 32‑bit int) without bounds checking, or you should keep using the wider type end‑to‑end. Since the MetricsRecorder interface uses int parameters, the simplest safe change is to clamp the int64 values into the valid int range before converting, using math.MinInt/math.MaxInt (which reflect the platform’s int size).

Concretely, the problematic conversions are in main.go in updateGaugeMetricsWithCache:

  • int(activeAccessTokens) on line 928
  • int(activeRefreshTokens) on line 937
  • int(totalDeviceCodes) and int(pendingDeviceCodes) on line 955

We should:

  1. Import the math package in main.go (no new external dependency).

  2. Introduce a small helper function in main.go that takes an int64 and returns an int by clamping to the platform’s int range, e.g.:

    func clampInt64ToInt(v int64) int {
        if v > int64(math.MaxInt) {
            return math.MaxInt
        }
        if v < int64(math.MinInt) {
            return math.MinInt
        }
        return int(v)
    }
  3. Replace all direct int(...) casts on the metric counts in updateGaugeMetricsWithCache with calls to this helper:

    • m.SetActiveTokensCount("access", clampInt64ToInt(activeAccessTokens))
    • m.SetActiveTokensCount("refresh", clampInt64ToInt(activeRefreshTokens))
    • m.SetActiveDeviceCodesCount(clampInt64ToInt(totalDeviceCodes), clampInt64ToInt(pendingDeviceCodes))

This preserves existing functionality for values within the valid int range and prevents overflow for out‑of‑range values by saturating at the bounds.

No changes are needed in internal/cache/rueidis.go or internal/cache/rueidis_aside.go, because they already operate purely on int64. The only narrowing occurs in main.go.


Suggested changeset 1
main.go

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/main.go b/main.go
--- a/main.go
+++ b/main.go
@@ -33,6 +33,7 @@
 	"fmt"
 	"io/fs"
 	"log"
+	"math"
 	"net/http"
 	"os"
 	"time"
@@ -913,6 +914,16 @@
 // updateGaugeMetricsWithCache updates gauge metrics using a cache-backed store.
 // This reduces database load in multi-instance deployments by caching query results.
 // The cache TTL should match the update interval to ensure consistent behavior.
+func clampInt64ToInt(v int64) int {
+	if v > int64(math.MaxInt) {
+		return math.MaxInt
+	}
+	if v < int64(math.MinInt) {
+		return math.MinInt
+	}
+	return int(v)
+}
+
 func updateGaugeMetricsWithCache(
 	ctx context.Context,
 	cacheWrapper *metrics.MetricsCacheWrapper,
@@ -925,7 +936,7 @@
 		m.RecordDatabaseQueryError("count_access_tokens")
 		gaugeErrorLogger.logIfNeeded("count_access_tokens", err)
 	} else {
-		m.SetActiveTokensCount("access", int(activeAccessTokens))
+		m.SetActiveTokensCount("access", clampInt64ToInt(activeAccessTokens))
 	}
 
 	// Update active refresh tokens count
@@ -934,7 +945,7 @@
 		m.RecordDatabaseQueryError("count_refresh_tokens")
 		gaugeErrorLogger.logIfNeeded("count_refresh_tokens", err)
 	} else {
-		m.SetActiveTokensCount("refresh", int(activeRefreshTokens))
+		m.SetActiveTokensCount("refresh", clampInt64ToInt(activeRefreshTokens))
 	}
 
 	// Update active device codes count
@@ -952,5 +963,5 @@
 		pendingDeviceCodes = 0
 	}
 
-	m.SetActiveDeviceCodesCount(int(totalDeviceCodes), int(pendingDeviceCodes))
+	m.SetActiveDeviceCodesCount(clampInt64ToInt(totalDeviceCodes), clampInt64ToInt(pendingDeviceCodes))
 }
EOF

Copilot AI left a comment


Pull request overview

Adds configurable caching to reduce database load from Prometheus gauge metric updates, supporting single- and multi-replica deployments with memory/Redis/rueidisaside backends and configurable update intervals.

Changes:

  • Introduces a generic cache interface plus memory, Redis (rueidis), and Redis-aside (rueidisaside) implementations.
  • Adds a cache-backed metrics wrapper and integrates it into the gauge update loop with configurable interval/enablement.
  • Extends configuration, tests, and documentation for the new metrics cache and gauge update behavior.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| main.go | Initializes the selected metrics cache backend, uses cache-backed gauge updates, and closes the cache on shutdown |
| internal/metrics/cache.go | Adds a read-through caching wrapper for DB-backed gauge metrics queries |
| internal/metrics/cache_test.go | Tests cache wrapper behavior (hit/miss/expiry/partial hits and aside support) |
| internal/cache/interface.go | Defines the cache interface used by metrics caching |
| internal/cache/errors.go | Defines cache error types (miss/unavailable/invalid) |
| internal/cache/memory.go | Adds an in-memory cache backend implementation |
| internal/cache/rueidis.go | Adds a Redis cache backend using rueidis |
| internal/cache/rueidis_aside.go | Adds a Redis-aside backend using rueidisaside with client-side caching |
| internal/config/config.go | Adds metrics cache + gauge update configuration and validation |
| internal/config/config_test.go | Adds validation tests for metrics cache configuration |
| .env.example | Documents new metrics gauge update + cache environment variables |
| CLAUDE.md | Documents multi-replica metrics guidance and new env vars |
| go.mod / go.sum | Adds rueidis + rueidisaside dependencies (and updated sums) |


# Gauge Metrics Update Configuration
# METRICS_GAUGE_UPDATE_ENABLED=true # Enable gauge metric updates (default: true)
# # In multi-replica deployments, enable on only ONE instance or use max()/avg() aggregation
# METRICS_GAUGE_UPDATE_INTERVAL=5m # Update interval for gauge metrics (default: 5m, reduced from 30s)

Copilot AI Feb 17, 2026


The comment says the 5m default interval is "reduced from 30s", but this change increases the interval. Please update the wording to avoid misleading operators.

Suggested change
# METRICS_GAUGE_UPDATE_INTERVAL=5m # Update interval for gauge metrics (default: 5m, reduced from 30s)
# METRICS_GAUGE_UPDATE_INTERVAL=5m # Update interval for gauge metrics (default: 5m, increased from 30s)

Copilot uses AI. Check for mistakes.
| **METRICS_ENABLED** | false | Enable Prometheus metrics endpoint |
| METRICS_TOKEN | (empty) | Bearer token for /metrics endpoint (empty = no auth) |
| METRICS_GAUGE_UPDATE_ENABLED | true | Enable gauge updates (disable on all but one replica) |
| METRICS_GAUGE_UPDATE_INTERVAL | 5m | Gauge update interval (default: 5m, reduced from 30s) |

Copilot AI Feb 17, 2026


The environment variable table describes METRICS_GAUGE_UPDATE_INTERVAL as "reduced from 30s" even though the default moved from 30s to 5m (an increase). Please correct this wording for accuracy.

Suggested change
| METRICS_GAUGE_UPDATE_INTERVAL | 5m | Gauge update interval (default: 5m, reduced from 30s) |
| METRICS_GAUGE_UPDATE_INTERVAL | 5m | Gauge update interval (default: 5m, increased from 30s) |


Copilot AI Feb 17, 2026


The wrapper keys already include a "metrics:" prefix (e.g., "metrics:tokens:..."). Since main.go also configures Redis caches with keyPrefix "metrics:", Redis-backed caches will end up storing keys like "metrics:metrics:tokens:...". Consider either removing the hardcoded "metrics:" prefix from wrapper keys or passing an empty keyPrefix to the cache backends so key naming is consistent and not duplicated.

Suggested change
key := fmt.Sprintf("metrics:tokens:%s", category)
key := fmt.Sprintf("tokens:%s", category)

Comment on lines +149 to +172
case config.MetricsCacheTypeRedisAside:
	metricsCache, err = cache.NewRueidisAsideCache(
		cfg.RedisAddr,
		cfg.RedisPassword,
		cfg.RedisDB,
		"metrics:",
		cfg.MetricsCacheClientTTL,
	)
	if err != nil {
		log.Fatalf("Failed to initialize redis-aside metrics cache: %v", err)
	}
	log.Printf(
		"Metrics cache: redis-aside (addr=%s, db=%d, client_ttl=%s)",
		cfg.RedisAddr,
		cfg.RedisDB,
		cfg.MetricsCacheClientTTL,
	)
case config.MetricsCacheTypeRedis:
	metricsCache, err = cache.NewRueidisCache(
		cfg.RedisAddr,
		cfg.RedisPassword,
		cfg.RedisDB,
		"metrics:",
	)

Copilot AI Feb 17, 2026


Both Redis cache constructors are passed keyPrefix "metrics:", but the MetricsCacheWrapper already prefixes all keys with "metrics:". This results in duplicated Redis keys ("metrics:metrics:..."). Consider passing an empty prefix here (and letting the wrapper own the namespace) or changing the wrapper to use un-prefixed keys and keep the prefix solely in the cache backend.

Comment on lines +152 to +158
name: "redis cache type without redis address (allowed)",
config: &Config{
RateLimitStore: RateLimitStoreMemory,
MetricsCacheType: MetricsCacheTypeRedis,
RedisAddr: "",
},
expectError: false,

Copilot AI Feb 17, 2026


This test case asserts that METRICS_CACHE_TYPE="redis" is valid when RedisAddr is empty, but runServer will still try to initialize a rueidis client and likely fail/fatal. If RedisAddr is required for redis metrics caching, update this test to expect a validation error and ensure the error message is clear.

Suggested change
name: "redis cache type without redis address (allowed)",
config: &Config{
RateLimitStore: RateLimitStoreMemory,
MetricsCacheType: MetricsCacheTypeRedis,
RedisAddr: "",
},
expectError: false,
name: "redis cache type without redis address",
config: &Config{
RateLimitStore: RateLimitStoreMemory,
MetricsCacheType: MetricsCacheTypeRedis,
RedisAddr: "",
},
expectError: true,
errorMsg: `METRICS_CACHE_TYPE="redis" requires REDIS_ADDR`,

Comment on lines +36 to +40
DisableCache: false, // Enable client-side caching
// Client-side cache configuration
CacheSizeEachConn: 128 * 1024 * 1024, // 128MB per connection
},
})

Copilot AI Feb 17, 2026


CacheSizeEachConn is hard-coded to 128MB per connection. This can materially increase memory usage (especially with multiple connections) and may not be appropriate for all deployments. Consider making this configurable and/or documenting the expected memory impact, or choosing a more conservative default.

MetricsCacheTypeRedisAside,
)
}


Copilot AI Feb 17, 2026


METRICS_GAUGE_UPDATE_INTERVAL is used to create a time.Ticker in main.go; if it's set to 0 or a negative duration, time.NewTicker will panic. Consider adding validation (e.g., require >0 when METRICS_ENABLED && METRICS_GAUGE_UPDATE_ENABLED) to prevent a configuration that can crash the process.

Suggested change
// Validate metrics gauge update interval to prevent time.NewTicker panics
if c.MetricsEnabled && c.MetricsGaugeUpdateEnabled && c.MetricsGaugeUpdateInterval <= 0 {
	return fmt.Errorf(
		"invalid METRICS_GAUGE_UPDATE_INTERVAL value: %s (must be > 0 when METRICS_ENABLED and METRICS_GAUGE_UPDATE_ENABLED are true)",
		c.MetricsGaugeUpdateInterval,
	)
}


Signed-off-by: appleboy <appleboy.tw@gmail.com>
@appleboy changed the title from "feat: add configurable caching for Prometheus metrics updates" to "feat: add unified metrics caching with configurable backends" on Feb 17, 2026
@appleboy appleboy closed this Feb 17, 2026