
refactor(monitoring): Implement event-driven metrics for share counters#195

Closed
gimballock wants to merge 7 commits into stratum-mining:main from fossatmara:event_based_metrics

Conversation


@gimballock gimballock commented Jan 21, 2026

Note: This commit builds off of PR 193

Convert shares_accepted metrics from snapshot-based Gauges to event-driven Counters that increment at the point of share acceptance, enabling correct Prometheus rate() calculations and real-time alerting.

This commit demonstrates how to instrument a call site with a Counter metric, and monitoring/README.md now documents the process for adding additional metrics of any type.

Changes

New Architecture

  • Created EventMetrics struct with CounterVec for real-time share tracking
  • Metrics are incremented in message handlers when shares are validated

Implementation Details

stratum-apps/src/monitoring/event_metrics.rs (new)

  • EventMetrics struct with server/client shares_accepted CounterVec
  • Helper methods: inc_server_shares_accepted(), inc_client_shares_accepted()
  • Registers with same Prometheus registry as snapshot-based metrics
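The event-driven counter pattern above can be sketched in a few lines. The real implementation wraps `prometheus::CounterVec`; this std-only stand-in uses `AtomicU64` fields purely to illustrate the increment-at-the-call-site semantics, so the struct bodies here are assumptions, not the actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Simplified stand-in for the EventMetrics struct described above.
// The actual implementation registers prometheus::CounterVec instances;
// AtomicU64 is used here to keep the sketch dependency-free.
#[derive(Default)]
pub struct EventMetrics {
    server_shares_accepted: AtomicU64,
    client_shares_accepted: AtomicU64,
}

impl EventMetrics {
    pub fn inc_server_shares_accepted(&self) {
        // Counter semantics: the value is monotonic, it only increases.
        self.server_shares_accepted.fetch_add(1, Ordering::Relaxed);
    }
    pub fn inc_client_shares_accepted(&self) {
        self.client_shares_accepted.fetch_add(1, Ordering::Relaxed);
    }
    pub fn server_shares_accepted(&self) -> u64 {
        self.server_shares_accepted.load(Ordering::Relaxed)
    }
}

fn main() {
    let metrics = Arc::new(EventMetrics::default());
    metrics.inc_server_shares_accepted();
    metrics.inc_server_shares_accepted();
    println!("{}", metrics.server_shares_accepted()); // prints 2
}
```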

stratum-apps/src/monitoring/http_server.rs

  • Added event_metrics field to MonitoringServer
  • Created EventMetrics during initialization
  • Added event_metrics() getter for passing to business logic
  • Removed counter increment logic from Prometheus scrape handler

stratum-apps/src/monitoring/snapshot_metrics.rs

  • Changed shares_accepted_total from GaugeVec to CounterVec
  • Updated metric registration to use CounterVec::new()

pool-apps/pool/src/lib/channel_manager/mod.rs

  • Added event_metrics: Option<Arc<EventMetrics>> field
  • Added with_event_metrics() builder method
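A minimal sketch of the builder wiring described above, with the field and method names taken from this PR description and everything else (the unit `EventMetrics`, the `new()` constructor) assumed for illustration:

```rust
use std::sync::Arc;

// Stand-in for the real metrics struct.
#[derive(Default)]
struct EventMetrics;

#[derive(Default)]
struct ChannelManager {
    // Optional so the manager also works with monitoring disabled.
    event_metrics: Option<Arc<EventMetrics>>,
}

impl ChannelManager {
    fn new() -> Self {
        Self::default()
    }

    // Builder-style method: consumes self, sets the metrics handle,
    // and returns the updated manager for chaining.
    fn with_event_metrics(mut self, metrics: Arc<EventMetrics>) -> Self {
        self.event_metrics = Some(metrics);
        self
    }
}

fn main() {
    let manager = ChannelManager::new()
        .with_event_metrics(Arc::new(EventMetrics::default()));
    println!("{}", manager.event_metrics.is_some()); // prints true
}
```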

pool-apps/pool/src/lib/mod.rs

  • Pass event_metrics from MonitoringServer to ChannelManager during init

pool-apps/pool/src/lib/channel_manager/mining_message_handler.rs

  • Increment counter in handle_submit_shares_standard() on share acceptance
  • Increment counter in handle_submit_shares_extended() on share acceptance

Benefits

  • Correct Prometheus counter semantics (monotonic, handles resets)
  • Proper rate() calculations without negative values on channel close
  • Immediate metric availability (no 60s snapshot delay)
  • Sub-second updates suitable for real-time alerting
  • Needed for latency percentile tracking and critical error / revenue impacting alerting
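With a monotonic counter, a Prometheus query such as the following yields a correct per-second acceptance rate; the metric name matches this PR, but the label name and value are assumptions for illustration:

```promql
rate(shares_accepted_total{channel_type="standard"}[5m])
```

With the previous GaugeVec, a channel close could decrease the value and make rate() produce misleading negative samples.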

Eric Price added 7 commits January 21, 2026 09:18
Fixes a critical DoS vulnerability where rapid monitoring API requests
could degrade mining performance by causing lock contention with business
logic operations (share validation, job distribution).

## Problem

The monitoring system's trait methods (get_clients(), get_server(), etc.)
directly acquire locks on business-critical data structures like
ChannelManagerData. An attacker spamming monitoring endpoints (/metrics,
/api/v1/*) could cause these locks to be held frequently, blocking share
validation and reducing mining throughput.

## Solution

Implement a snapshot cache that decouples monitoring reads from business
logic locks:

1. **SnapshotCache**: Periodically refreshes monitoring data (default: 5s)
   by calling trait methods once per interval, storing results in an
   RwLock-protected snapshot.

2. **CachedMonitoring**: Wrapper implementing monitoring traits that reads
   from the cached snapshot instead of acquiring business logic locks.

3. **Automatic refresh**: MonitoringServer spawns a background task that
   refreshes the cache at the configured interval.

API requests now read from the cache (fast RwLock read) rather than
acquiring business logic locks, eliminating the DoS vector.
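The cache mechanism described above can be sketched as follows. Names mirror this commit message (`SnapshotCache`, `MonitoringSnapshot`), but the snapshot field and the method bodies are simplified assumptions; the real refresh calls the monitoring trait methods rather than taking a value:

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Illustrative snapshot of monitoring data.
#[derive(Clone, Default)]
struct MonitoringSnapshot {
    connected_clients: usize, // assumed field for the sketch
}

struct SnapshotCache {
    // Monitoring reads hit this cheap RwLock, never business-logic locks.
    snapshot: RwLock<(MonitoringSnapshot, Instant)>,
    refresh_interval: Duration,
}

impl SnapshotCache {
    fn new(refresh_interval: Duration) -> Self {
        Self {
            snapshot: RwLock::new((MonitoringSnapshot::default(), Instant::now())),
            refresh_interval,
        }
    }

    // Called by the background task once per interval; this is the only
    // path that would touch business-logic locks in the real code.
    fn refresh(&self, fresh: MonitoringSnapshot) {
        *self.snapshot.write().unwrap() = (fresh, Instant::now());
    }

    // API handlers read the cached copy: a fast RwLock read.
    fn read(&self) -> MonitoringSnapshot {
        self.snapshot.read().unwrap().0.clone()
    }
}

fn main() {
    let cache = SnapshotCache::new(Duration::from_secs(5));
    cache.refresh(MonitoringSnapshot { connected_clients: 3 });
    println!("{}", cache.read().connected_clients); // prints 3
}
```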

## Changes

### New Files
- stratum-apps/src/monitoring/snapshot_cache.rs: Core cache implementation
- stratum-apps/src/monitoring/tests.rs: Unit tests verifying fix

### Modified Files
- stratum-apps/src/monitoring/http_server.rs:
  - Updated MonitoringServer::new() to require refresh_interval parameter
  - Automatically creates SnapshotCache and spawns refresh task
  - Removed separate new_with_cache() constructor (cache is now default)

- stratum-apps/src/monitoring/mod.rs:
  - Export SnapshotCache, CachedMonitoring, MonitoringSnapshot types

- All config examples (26 files):
  - Added `monitoring_cache_refresh_secs = 5` parameter

## Performance Impact

Test results show the cache serving 250,000+ requests in 100ms with only
2 lock acquisitions (during refresh); without the cache, lock contention
limited the same window to 4 requests. Business logic throughput is
unaffected by monitoring load.

## Breaking Changes

MonitoringServer::new() now requires a `refresh_interval: Duration` parameter.
Update calls to:

```rust
MonitoringServer::new(
    bind_address,
    server_monitoring,
    clients_monitoring,
    Duration::from_secs(5),  // NEW: cache refresh interval
)
```

Add `monitoring_cache_refresh_secs` configuration parameter to all apps
(jd-client, translator, pool) with a default value of 60 seconds. This
allows users to configure how frequently the monitoring snapshot cache
is refreshed.

Also simplify `unwrap_or_else` calls to `unwrap_or` in snapshot_cache.rs
since the closures don't capture any variables.
Remove unused return value from `SnapshotCache::refresh()` and eliminate
unnecessary clone when updating the cache. The refresh method now updates
the cache in-place without returning the snapshot, since the return value
was not being used by callers.
Add `refresh_if_stale()` method to SnapshotCache that allows API handlers
to trigger cache refresh when data is stale beyond a freshness threshold.
The threshold defaults to `refresh_interval / 2`, ensuring API requests
see reasonably fresh data without excessive lock acquisitions.

Changes:
- Add `freshness_threshold` field to SnapshotCache (defaults to refresh_interval / 2)
- Add `is_stale()` method to check staleness against
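The staleness logic described in this commit can be sketched as below. The method names and the `refresh_interval / 2` default come from the commit message; the struct layout and the stubbed refresh body are assumptions (the real refresh re-reads the monitoring traits):

```rust
use std::time::{Duration, Instant};

struct SnapshotCache {
    last_refresh: Instant,
    freshness_threshold: Duration,
}

impl SnapshotCache {
    fn new(refresh_interval: Duration) -> Self {
        Self {
            last_refresh: Instant::now(),
            // Default threshold: half the background refresh interval.
            freshness_threshold: refresh_interval / 2,
        }
    }

    fn is_stale(&self) -> bool {
        self.last_refresh.elapsed() >= self.freshness_threshold
    }

    // Returns true if a refresh was actually performed.
    fn refresh_if_stale(&mut self) -> bool {
        if self.is_stale() {
            self.last_refresh = Instant::now(); // stub: real code rebuilds the snapshot
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cache = SnapshotCache::new(Duration::from_millis(10));
    // Immediately after creation the data is fresh: no refresh.
    println!("{}", cache.refresh_if_stale()); // prints false
    std::thread::sleep(Duration::from_millis(11));
    // Past the 5 ms threshold, the call triggers a refresh.
    println!("{}", cache.refresh_if_stale()); // prints true
}
```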
Remove unit tests for snapshot cache as they are no longer needed.
The tests verified that the cache eliminates lock contention between
monitoring API requests and business logic operations, but this
functionality is now sufficiently validated through integration testing.
Add unit test verifying that `refresh_if_stale()` only refreshes the cache
when data exceeds the freshness threshold (refresh_interval / 2). The test
confirms that immediate calls don't trigger refresh, but calls after the
threshold has passed do trigger a refresh.
…apshotMetrics

Convert shares_accepted metrics from snapshot-based Gauges to event-driven
Counters that increment at the point of share acceptance, enabling correct
Prometheus rate() calculations and real-time alerting.

Rename PrometheusMetrics to SnapshotMetrics to clarify the architectural
distinction between snapshot-based (gauges) and event-based (counters/histograms)
collection mechanisms.

## Changes

### New Architecture
- Created EventMetrics struct with CounterVec for real-time share tracking
- Metrics are incremented in message handlers when shares are validated
- Renamed PrometheusMetrics → SnapshotMetrics to contrast with EventMetrics

### Implementation Details

**stratum-apps/src/monitoring/event_metrics.rs** (new)
- EventMetrics struct with server/client shares_accepted CounterVec
- Helper methods: inc_server_shares_accepted(), inc_client_shares_accepted()
- Registers with same Prometheus registry as snapshot-based metrics

**stratum-apps/src/monitoring/snapshot_metrics.rs** (renamed from prometheus_metrics.rs)
- Renamed PrometheusMetrics struct to SnapshotMetrics
- Changed shares_accepted_total from GaugeVec to CounterVec
- Updated module documentation to emphasize snapshot-based collection

**stratum-apps/src/monitoring/http_server.rs**
- Updated imports to use snapshot_metrics module
- Added event_metrics field to MonitoringServer
- Created EventMetrics during initialization
- Added event_metrics() getter for passing to business logic
- Removed counter increment logic from Prometheus scrape handler

**stratum-apps/src/monitoring/README.md**
- Added comprehensive guide for adding new metrics
- Decision tree for choosing Gauge vs Counter vs Histogram
- Step-by-step implementation patterns for each metric type
- Reorganized to put API documentation before implementation details

**pool-apps/pool/src/lib/channel_manager/mod.rs**
- Added event_metrics: Option<Arc<EventMetrics>> field
- Added with_event_metrics() builder method

**pool-apps/pool/src/lib/mod.rs**
- Pass event_metrics from MonitoringServer to ChannelManager during init

**pool-apps/pool/src/lib/channel_manager/mining_message_handler.rs**
- Increment counter in handle_submit_shares_standard() on share acceptance
- Increment counter in handle_submit_shares_extended() on share acceptance
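The increment at the acceptance point can be sketched as follows. The handler and metrics types are simplified stand-ins; only the guarded `if let Some(..)` increment pattern reflects the change described above, and the handler signature is an assumption:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Default)]
struct EventMetrics {
    server_shares_accepted: AtomicU64, // stand-in for a CounterVec
}

struct MiningMessageHandler {
    event_metrics: Option<Arc<EventMetrics>>,
}

impl MiningMessageHandler {
    // Simplified: the real handler validates a SubmitSharesStandard message.
    fn handle_submit_shares_standard(&self, share_is_valid: bool) {
        if share_is_valid {
            // Increment exactly where the share is accepted, so the counter
            // reflects events rather than periodic snapshots.
            if let Some(m) = &self.event_metrics {
                m.server_shares_accepted.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}

fn main() {
    let metrics = Arc::new(EventMetrics::default());
    let handler = MiningMessageHandler { event_metrics: Some(metrics.clone()) };
    handler.handle_submit_shares_standard(true);
    handler.handle_submit_shares_standard(false); // rejected: no increment
    println!("{}", metrics.server_shares_accepted.load(Ordering::Relaxed)); // prints 1
}
```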

## Benefits
- Correct Prometheus counter semantics (monotonic, handles resets)
- Proper rate() calculations without negative values on channel close
- Immediate metric availability (no snapshot delay)
- Sub-second updates suitable for real-time alerting
- Clear naming: SnapshotMetrics vs EventMetrics clarifies collection mechanism
- Comprehensive documentation for adding future metrics (latency, errors, saturation)
@gimballock
Author

This change needs more private development before it is ready for consideration.

@gimballock gimballock closed this Jan 26, 2026
