
refactor(monitoring): Implement event-driven metrics for share counters#195

Closed
gimballock wants to merge 7 commits into stratum-mining:main from fossatmara:event_based_metrics

Conversation


@gimballock gimballock commented Jan 21, 2026

Note: This commit builds off of PR 193

Convert shares_accepted metrics from snapshot-based Gauges to event-driven Counters that increment at the point of share acceptance, enabling correct Prometheus rate() calculations and real-time alerting.

This commit demonstrates how to instrument a call site with a Counter metric, and monitoring/README.md now documents the process for adding additional metrics of any type.

Changes

New Architecture

  • Created EventMetrics struct with CounterVec for real-time share tracking
  • Metrics are incremented in message handlers when shares are validated

Implementation Details

stratum-apps/src/monitoring/event_metrics.rs (new)

  • EventMetrics struct with server/client shares_accepted CounterVec
  • Helper methods: inc_server_shares_accepted(), inc_client_shares_accepted()
  • Registers with same Prometheus registry as snapshot-based metrics
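The event-driven counter pattern above can be sketched in a few lines. The real implementation wraps `prometheus::CounterVec`; this std-only stand-in uses `AtomicU64` fields purely to illustrate the increment-at-the-call-site semantics, so the struct bodies here are assumptions, not the actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Simplified stand-in for the EventMetrics struct described above.
// The actual implementation registers prometheus::CounterVec instances;
// AtomicU64 is used here to keep the sketch dependency-free.
#[derive(Default)]
pub struct EventMetrics {
    server_shares_accepted: AtomicU64,
    client_shares_accepted: AtomicU64,
}

impl EventMetrics {
    pub fn inc_server_shares_accepted(&self) {
        // Counter semantics: the value is monotonic, it only increases.
        self.server_shares_accepted.fetch_add(1, Ordering::Relaxed);
    }
    pub fn inc_client_shares_accepted(&self) {
        self.client_shares_accepted.fetch_add(1, Ordering::Relaxed);
    }
    pub fn server_shares_accepted(&self) -> u64 {
        self.server_shares_accepted.load(Ordering::Relaxed)
    }
}

fn main() {
    let metrics = Arc::new(EventMetrics::default());
    metrics.inc_server_shares_accepted();
    metrics.inc_server_shares_accepted();
    println!("{}", metrics.server_shares_accepted()); // prints 2
}
```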

stratum-apps/src/monitoring/http_server.rs

  • Added event_metrics field to MonitoringServer
  • Created EventMetrics during initialization
  • Added event_metrics() getter for passing to business logic
  • Removed counter increment logic from Prometheus scrape handler

stratum-apps/src/monitoring/snapshot_metrics.rs

  • Changed shares_accepted_total from GaugeVec to CounterVec
  • Updated metric registration to use CounterVec::new()

pool-apps/pool/src/lib/channel_manager/mod.rs

  • Added event_metrics: Option<Arc<EventMetrics>> field
  • Added with_event_metrics() builder method
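A minimal sketch of the builder wiring described above, with the field and method names taken from this PR description and everything else (the unit `EventMetrics`, the `new()` constructor) assumed for illustration:

```rust
use std::sync::Arc;

// Stand-in for the real metrics struct.
#[derive(Default)]
struct EventMetrics;

#[derive(Default)]
struct ChannelManager {
    // Optional so the manager also works with monitoring disabled.
    event_metrics: Option<Arc<EventMetrics>>,
}

impl ChannelManager {
    fn new() -> Self {
        Self::default()
    }

    // Builder-style method: consumes self, sets the metrics handle,
    // and returns the updated manager for chaining.
    fn with_event_metrics(mut self, metrics: Arc<EventMetrics>) -> Self {
        self.event_metrics = Some(metrics);
        self
    }
}

fn main() {
    let manager = ChannelManager::new()
        .with_event_metrics(Arc::new(EventMetrics::default()));
    println!("{}", manager.event_metrics.is_some()); // prints true
}
```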

pool-apps/pool/src/lib/mod.rs

  • Pass event_metrics from MonitoringServer to ChannelManager during init

pool-apps/pool/src/lib/channel_manager/mining_message_handler.rs

  • Increment counter in handle_submit_shares_standard() on share acceptance
  • Increment counter in handle_submit_shares_extended() on share acceptance

Benefits

  • Correct Prometheus counter semantics (monotonic, handles resets)
  • Proper rate() calculations without negative values on channel close
  • Immediate metric availability (no 60s snapshot delay)
  • Sub-second updates suitable for real-time alerting
  • Needed for latency percentile tracking and critical error / revenue impacting alerting
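With a monotonic counter, a Prometheus query such as the following yields a correct per-second acceptance rate; the metric name matches this PR, but the label name and value are assumptions for illustration:

```promql
rate(shares_accepted_total{channel_type="standard"}[5m])
```

With the previous GaugeVec, a channel close could decrease the value and make rate() produce misleading negative samples.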

Eric Price added 7 commits January 21, 2026 09:18
Fixes a critical DoS vulnerability where rapid monitoring API requests
could degrade mining performance by causing lock contention with business
logic operations (share validation, job distribution).

## Problem

The monitoring system's trait methods (get_clients(), get_server(), etc.)
directly acquire locks on business-critical data structures like
ChannelManagerData. An attacker spamming monitoring endpoints (/metrics,
/api/v1/*) could cause these locks to be held frequently, blocking share
validation and reducing mining throughput.

## Solution

Implement a snapshot cache that decouples monitoring reads from business
logic locks:

1. **SnapshotCache**: Periodically refreshes monitoring data (default: 5s)
   by calling trait methods once per interval, storing results in an
   RwLock-protected snapshot.

2. **CachedMonitoring**: Wrapper implementing monitoring traits that reads
   from the cached snapshot instead of acquiring business logic locks.

3. **Automatic refresh**: MonitoringServer spawns a background task that
   refreshes the cache at the configured interval.

API requests now read from the cache (fast RwLock read) rather than
acquiring business logic locks, eliminating the DoS vector.
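The cache mechanism described above can be sketched as follows. Names mirror this commit message (`SnapshotCache`, `MonitoringSnapshot`), but the snapshot field and the method bodies are simplified assumptions; the real refresh calls the monitoring trait methods rather than taking a value:

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Illustrative snapshot of monitoring data.
#[derive(Clone, Default)]
struct MonitoringSnapshot {
    connected_clients: usize, // assumed field for the sketch
}

struct SnapshotCache {
    // Monitoring reads hit this cheap RwLock, never business-logic locks.
    snapshot: RwLock<(MonitoringSnapshot, Instant)>,
    refresh_interval: Duration,
}

impl SnapshotCache {
    fn new(refresh_interval: Duration) -> Self {
        Self {
            snapshot: RwLock::new((MonitoringSnapshot::default(), Instant::now())),
            refresh_interval,
        }
    }

    // Called by the background task once per interval; this is the only
    // path that would touch business-logic locks in the real code.
    fn refresh(&self, fresh: MonitoringSnapshot) {
        *self.snapshot.write().unwrap() = (fresh, Instant::now());
    }

    // API handlers read the cached copy: a fast RwLock read.
    fn read(&self) -> MonitoringSnapshot {
        self.snapshot.read().unwrap().0.clone()
    }
}

fn main() {
    let cache = SnapshotCache::new(Duration::from_secs(5));
    cache.refresh(MonitoringSnapshot { connected_clients: 3 });
    println!("{}", cache.read().connected_clients); // prints 3
}
```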

## Changes

### New Files
- stratum-apps/src/monitoring/snapshot_cache.rs: Core cache implementation
- stratum-apps/src/monitoring/tests.rs: Unit tests verifying fix

### Modified Files
- stratum-apps/src/monitoring/http_server.rs:
  - Updated MonitoringServer::new() to require refresh_interval parameter
  - Automatically creates SnapshotCache and spawns refresh task
  - Removed separate new_with_cache() constructor (cache is now default)

- stratum-apps/src/monitoring/mod.rs:
  - Export SnapshotCache, CachedMonitoring, MonitoringSnapshot types

- All config examples (26 files):
  - Added `monitoring_cache_refresh_secs = 5` parameter

## Performance Impact

Test results show the cache serving 250,000+ requests in 100ms with only
2 lock acquisitions (during refresh); without the cache, lock contention
limited the same window to 4 requests. Business logic throughput is
unaffected by monitoring load.

## Breaking Changes

MonitoringServer::new() now requires a `refresh_interval: Duration` parameter.
Update calls to:

```rust
MonitoringServer::new(
    bind_address,
    server_monitoring,
    clients_monitoring,
    Duration::from_secs(5),  // NEW: cache refresh interval
)
```

Add `monitoring_cache_refresh_secs` configuration parameter to all apps
(jd-client, translator, pool) with a default value of 60 seconds. This
allows users to configure how frequently the monitoring snapshot cache
is refreshed.

Also simplify `unwrap_or_else` calls to `unwrap_or` in snapshot_cache.rs
since the closures don't capture any variables.
Remove unused return value from `SnapshotCache::refresh()` and eliminate
unnecessary clone when updating the cache. The refresh method now updates
the cache in-place without returning the snapshot, since the return value
was not being used by callers.
Add `refresh_if_stale()` method to SnapshotCache that allows API handlers
to trigger cache refresh when data is stale beyond a freshness threshold.
The threshold defaults to `refresh_interval / 2`, ensuring API requests
see reasonably fresh data without excessive lock acquisitions.

Changes:
- Add `freshness_threshold` field to SnapshotCache (defaults to refresh_interval / 2)
- Add `is_stale()` method to check staleness against
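The staleness logic described in this commit can be sketched as below. The method names and the `refresh_interval / 2` default come from the commit message; the struct layout and the stubbed refresh body are assumptions (the real refresh re-reads the monitoring traits):

```rust
use std::time::{Duration, Instant};

struct SnapshotCache {
    last_refresh: Instant,
    freshness_threshold: Duration,
}

impl SnapshotCache {
    fn new(refresh_interval: Duration) -> Self {
        Self {
            last_refresh: Instant::now(),
            // Default threshold: half the background refresh interval.
            freshness_threshold: refresh_interval / 2,
        }
    }

    fn is_stale(&self) -> bool {
        self.last_refresh.elapsed() >= self.freshness_threshold
    }

    // Returns true if a refresh was actually performed.
    fn refresh_if_stale(&mut self) -> bool {
        if self.is_stale() {
            self.last_refresh = Instant::now(); // stub: real code rebuilds the snapshot
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cache = SnapshotCache::new(Duration::from_millis(10));
    // Immediately after creation the data is fresh: no refresh.
    println!("{}", cache.refresh_if_stale()); // prints false
    std::thread::sleep(Duration::from_millis(11));
    // Past the 5 ms threshold, the call triggers a refresh.
    println!("{}", cache.refresh_if_stale()); // prints true
}
```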
Remove unit tests for snapshot cache as they are no longer needed.
The tests verified that the cache eliminates lock contention between
monitoring API requests and business logic operations, but this
functionality is now sufficiently validated through integration testing.
Add unit test verifying that `refresh_if_stale()` only refreshes the cache
when data exceeds the freshness threshold (refresh_interval / 2). The test
confirms that immediate calls don't trigger refresh, but calls after the
threshold has passed do trigger a refresh.
…apshotMetrics

Convert shares_accepted metrics from snapshot-based Gauges to event-driven
Counters that increment at the point of share acceptance, enabling correct
Prometheus rate() calculations and real-time alerting.

Rename PrometheusMetrics to SnapshotMetrics to clarify the architectural
distinction between snapshot-based (gauges) and event-based (counters/histograms)
collection mechanisms.

## Changes

### New Architecture
- Created EventMetrics struct with CounterVec for real-time share tracking
- Metrics are incremented in message handlers when shares are validated
- Renamed PrometheusMetrics → SnapshotMetrics to contrast with EventMetrics

### Implementation Details

**stratum-apps/src/monitoring/event_metrics.rs** (new)
- EventMetrics struct with server/client shares_accepted CounterVec
- Helper methods: inc_server_shares_accepted(), inc_client_shares_accepted()
- Registers with same Prometheus registry as snapshot-based metrics

**stratum-apps/src/monitoring/snapshot_metrics.rs** (renamed from prometheus_metrics.rs)
- Renamed PrometheusMetrics struct to SnapshotMetrics
- Changed shares_accepted_total from GaugeVec to CounterVec
- Updated module documentation to emphasize snapshot-based collection

**stratum-apps/src/monitoring/http_server.rs**
- Updated imports to use snapshot_metrics module
- Added event_metrics field to MonitoringServer
- Created EventMetrics during initialization
- Added event_metrics() getter for passing to business logic
- Removed counter increment logic from Prometheus scrape handler

**stratum-apps/src/monitoring/README.md**
- Added comprehensive guide for adding new metrics
- Decision tree for choosing Gauge vs Counter vs Histogram
- Step-by-step implementation patterns for each metric type
- Reorganized to put API documentation before implementation details

**pool-apps/pool/src/lib/channel_manager/mod.rs**
- Added event_metrics: Option<Arc<EventMetrics>> field
- Added with_event_metrics() builder method

**pool-apps/pool/src/lib/mod.rs**
- Pass event_metrics from MonitoringServer to ChannelManager during init

**pool-apps/pool/src/lib/channel_manager/mining_message_handler.rs**
- Increment counter in handle_submit_shares_standard() on share acceptance
- Increment counter in handle_submit_shares_extended() on share acceptance
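The increment at the acceptance point can be sketched as follows. The handler and metrics types are simplified stand-ins; only the guarded `if let Some(..)` increment pattern reflects the change described above, and the handler signature is an assumption:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Default)]
struct EventMetrics {
    server_shares_accepted: AtomicU64, // stand-in for a CounterVec
}

struct MiningMessageHandler {
    event_metrics: Option<Arc<EventMetrics>>,
}

impl MiningMessageHandler {
    // Simplified: the real handler validates a SubmitSharesStandard message.
    fn handle_submit_shares_standard(&self, share_is_valid: bool) {
        if share_is_valid {
            // Increment exactly where the share is accepted, so the counter
            // reflects events rather than periodic snapshots.
            if let Some(m) = &self.event_metrics {
                m.server_shares_accepted.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}

fn main() {
    let metrics = Arc::new(EventMetrics::default());
    let handler = MiningMessageHandler { event_metrics: Some(metrics.clone()) };
    handler.handle_submit_shares_standard(true);
    handler.handle_submit_shares_standard(false); // rejected: no increment
    println!("{}", metrics.server_shares_accepted.load(Ordering::Relaxed)); // prints 1
}
```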

## Benefits
- Correct Prometheus counter semantics (monotonic, handles resets)
- Proper rate() calculations without negative values on channel close
- Immediate metric availability (no snapshot delay)
- Sub-second updates suitable for real-time alerting
- Clear naming: SnapshotMetrics vs EventMetrics clarifies collection mechanism
- Comprehensive documentation for adding future metrics (latency, errors, saturation)
@gimballock
Author

This change needs more private development before it is ready for consideration.

@gimballock gimballock closed this Jan 26, 2026
