refactor(monitoring): Implement event-driven metrics for share counters #195

Closed

gimballock wants to merge 7 commits into stratum-mining:main from
added 7 commits on January 21, 2026 at 09:18
Fixes a critical DoS vulnerability where rapid monitoring API requests
could degrade mining performance by causing lock contention with business
logic operations (share validation, job distribution).
## Problem
The monitoring system's trait methods (get_clients(), get_server(), etc.)
directly acquire locks on business-critical data structures like
ChannelManagerData. An attacker spamming monitoring endpoints (/metrics,
/api/v1/*) could cause these locks to be held frequently, blocking share
validation and reducing mining throughput.
## Solution
Implement a snapshot cache that decouples monitoring reads from business
logic locks:
1. **SnapshotCache**: Periodically refreshes monitoring data (default: 5s)
by calling trait methods once per interval, storing results in an
RwLock-protected snapshot.
2. **CachedMonitoring**: Wrapper implementing monitoring traits that reads
from the cached snapshot instead of acquiring business logic locks.
3. **Automatic refresh**: MonitoringServer spawns a background task that
refreshes the cache at the configured interval.
API requests now read from the cache (fast RwLock read) rather than
acquiring business logic locks, eliminating the DoS vector.
## Changes
### New Files
- stratum-apps/src/monitoring/snapshot_cache.rs: Core cache implementation
- stratum-apps/src/monitoring/tests.rs: Unit tests verifying the fix
### Modified Files
- stratum-apps/src/monitoring/http_server.rs:
- Updated MonitoringServer::new() to require refresh_interval parameter
- Automatically creates SnapshotCache and spawns refresh task
- Removed separate new_with_cache() constructor (cache is now default)
- stratum-apps/src/monitoring/mod.rs:
- Export SnapshotCache, CachedMonitoring, MonitoringSnapshot types
- All config examples (26 files):
- Added `monitoring_cache_refresh_secs = 5` parameter
## Performance Impact
Test results show the cache processes 250,000+ requests in 100ms with only
2 lock acquisitions (during refresh), compared to 4 requests without the
cache due to lock contention. Business logic throughput is unaffected by
monitoring load.
## Breaking Changes
MonitoringServer::new() now requires a `refresh_interval: Duration` parameter.
Update calls to:
```rust
MonitoringServer::new(
    bind_address,
    server_monitoring,
    clients_monitoring,
    Duration::from_secs(5), // NEW: cache refresh interval
)
```
Add `monitoring_cache_refresh_secs` configuration parameter to all apps (jd-client, translator, pool) with a default value of 60 seconds. This allows users to configure how frequently the monitoring snapshot cache is refreshed. Also simplify `unwrap_or_else` calls to `unwrap_or` in snapshot_cache.rs since the closures don't capture any variables.
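For reference, the new parameter would appear in an app's TOML configuration roughly as below; the parameter name and default come from the commit, but the surrounding file layout is illustrative.

```toml
# Hypothetical config excerpt: only the parameter name and the
# 60-second default are taken from the PR.
monitoring_cache_refresh_secs = 60
```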
Remove unused return value from `SnapshotCache::refresh()` and eliminate unnecessary clone when updating the cache. The refresh method now updates the cache in-place without returning the snapshot, since the return value was not being used by callers.
Add `refresh_if_stale()` method to SnapshotCache that allows API handlers to trigger a cache refresh when data is stale beyond a freshness threshold. The threshold defaults to `refresh_interval / 2`, ensuring API requests see reasonably fresh data without excessive lock acquisitions.

Changes:
- Add `freshness_threshold` field to SnapshotCache (defaults to refresh_interval / 2)
- Add `is_stale()` method to check staleness against
Remove unit tests for snapshot cache as they are no longer needed. The tests verified that the cache eliminates lock contention between monitoring API requests and business logic operations, but this functionality is now sufficiently validated through integration testing.
Add unit test verifying that `refresh_if_stale()` only refreshes the cache when data exceeds the freshness threshold (refresh_interval / 2). The test confirms that immediate calls don't trigger refresh, but calls after the threshold has passed do trigger a refresh.
…apshotMetrics

Convert shares_accepted metrics from snapshot-based Gauges to event-driven Counters that increment at the point of share acceptance, enabling correct Prometheus rate() calculations and real-time alerting.

Rename PrometheusMetrics to SnapshotMetrics to clarify the architectural distinction between snapshot-based (gauges) and event-based (counters/histograms) collection mechanisms.

## Changes

### New Architecture

- Created EventMetrics struct with CounterVec for real-time share tracking
- Metrics are incremented in message handlers when shares are validated
- Renamed PrometheusMetrics → SnapshotMetrics to contrast with EventMetrics

### Implementation Details

**stratum-apps/src/monitoring/event_metrics.rs** (new)
- EventMetrics struct with server/client shares_accepted CounterVec
- Helper methods: inc_server_shares_accepted(), inc_client_shares_accepted()
- Registers with the same Prometheus registry as snapshot-based metrics

**stratum-apps/src/monitoring/snapshot_metrics.rs** (renamed from prometheus_metrics.rs)
- Renamed the PrometheusMetrics struct to SnapshotMetrics
- Changed shares_accepted_total from GaugeVec to CounterVec
- Updated module documentation to emphasize snapshot-based collection

**stratum-apps/src/monitoring/http_server.rs**
- Updated imports to use the snapshot_metrics module
- Added event_metrics field to MonitoringServer
- Created EventMetrics during initialization
- Added event_metrics() getter for passing to business logic
- Removed counter increment logic from the Prometheus scrape handler

**stratum-apps/src/monitoring/README.md**
- Added a comprehensive guide for adding new metrics
- Decision tree for choosing Gauge vs Counter vs Histogram
- Step-by-step implementation patterns for each metric type
- Reorganized to put API documentation before implementation details

**pool-apps/pool/src/lib/channel_manager/mod.rs**
- Added event_metrics: Option<Arc<EventMetrics>> field
- Added with_event_metrics() builder method

**pool-apps/pool/src/lib/mod.rs**
- Pass event_metrics from MonitoringServer to ChannelManager during init

**pool-apps/pool/src/lib/channel_manager/mining_message_handler.rs**
- Increment counter in handle_submit_shares_standard() on share acceptance
- Increment counter in handle_submit_shares_extended() on share acceptance

## Benefits

- Correct Prometheus counter semantics (monotonic, handles resets)
- Proper rate() calculations without negative values on channel close
- Immediate metric availability (no snapshot delay)
- Sub-second updates suitable for real-time alerting
- Clear naming: SnapshotMetrics vs EventMetrics clarifies collection mechanism
- Comprehensive documentation for adding future metrics (latency, errors, saturation)
89c7497 to 00fdbcd
Author

This change needs more private development before it is ready for consideration.
Note: This commit builds off of PR 193
Convert shares_accepted metrics from snapshot-based Gauges to event-driven Counters that increment at the point of share acceptance, enabling correct Prometheus rate() calculations and real-time alerting.
This commit demonstrates how to instrument a call site with a Counter metric, and monitoring/README.md explains the process of adding additional metrics of any type.
## Changes

### New Architecture

### Implementation Details

- stratum-apps/src/monitoring/event_metrics.rs (new)
- stratum-apps/src/monitoring/http_server.rs
- stratum-apps/src/monitoring/snapshot_metrics.rs
- pool-apps/pool/src/lib/channel_manager/mod.rs
- pool-apps/pool/src/lib/mod.rs
- pool-apps/pool/src/lib/channel_manager/mining_message_handler.rs

## Benefits