fix: fork detection + /health hardening (forked node now returns 503)#768
Conversation
Addresses the silent-healthy-forked-node class demonstrated by testnet
val2 at h=6,132,038. Previously a node on a divergent branch kept
reporting healthy to Docker and serving stale RPC. There was no way
for the operator to distinguish a forked node from a healthy one
without manually comparing head hashes.
Root cause: /health returned {status:ok} unconditionally. Block import
failures with 'invalid previous hash' incremented APPLY_ERR but made
no state change. No rollback API exists in storage layer.
Changes:
- sentrix-network/src/node.rs: add NodeEvent::SyncForkDetected{height,
local_head_hash} — emitted by libp2p paths when 'invalid previous
hash' is seen.
- sentrix-network/src/libp2p_node.rs: detect SentrixError::InvalidBlock
containing 'invalid previous hash' in all three block-apply paths
(batch-sync, gossip, direct-apply), emit SyncForkDetected with
local head hash for diagnostics.
- sentrix-rpc/src/routes/ops.rs: add FORK_DETECTED / FORK_DETECTED_HEIGHT
/ FORK_DETECTED_AT_UNIX / FORK_LOCAL_HEAD atomics. Harden /health to
return HTTP 503 when forked or stale (uptime-gated). Response body
includes fork_at_height, local_head_at_detection, and recovery
instructions. Docker healthcheck now fails on a forked node.
- bin/sentrix/src/main.rs: handle NodeEvent::SyncForkDetected — set
fork atomics on first detection. Handle NodeEvent::NewBlock — clear
fork flag if set (sync recovered).
Limitation: no automatic fork recovery. Storage layer has no
rollback/rewind primitive. The fail-closed approach is correct:
mark unhealthy, log operator instructions, expose state via health.
Tests:
- test_invalid_previous_hash_error_string_is_stable: pins the exact
error substring that fork detection matches against.
- test_health_503_when_fork_detected: health returns 503 + structured
body when FORK_DETECTED is set.
- test_health_ok_when_no_fork: health returns 200 when clean.
- test_health_recovers_when_fork_cleared: health recovers to 200 when
flag is cleared (NewBlock path).
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Warning Review limit reached
More reviews will be available in 21 minutes and 50 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThe PR implements fork detection for blockchain nodes by monitoring "invalid previous hash" errors during block application across three network paths (gossipsub, request/response, and sync batches). When a fork is detected, a Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/sentrix-rpc/src/routes/ops.rs`:
- Around line 548-635: Add a new async test that acquires TEST_LOCK and calls
reset_fork_state(), then sets START_TIME to a time far enough in the past (so
the node appears stale) while ensuring FORK_DETECTED is false, mutate the
in-memory blockchain state inside SharedState (Arc<RwLock<Blockchain>>) to have
a last block timestamp older than the stale threshold (either by setting the
chain head / last_block_ts or inserting a block with an old timestamp via the
Blockchain API), call health(State(state)).await.into_response(), and assert the
response status is 503 and the JSON contains "status":"stale" and
"fork_detected":false; reference the health function, START_TIME, TEST_LOCK,
reset_fork_state, and the SharedState/Blockchain mutation to locate where to add
this test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 05686ef1-13ce-4a79-ad76-e0960da62b08
📒 Files selected for processing (6)
README.mdbin/sentrix/src/main.rscrates/sentrix-network/src/libp2p_node.rscrates/sentrix-network/src/node.rscrates/sentrix-primitives/src/block.rscrates/sentrix-rpc/src/routes/ops.rs
| #[cfg(test)] | ||
| mod tests { | ||
| use super::*; | ||
| use axum::{body::to_bytes, http::StatusCode}; | ||
| use sentrix_core::blockchain::Blockchain; | ||
| use std::sync::{Arc, Mutex, atomic::Ordering}; | ||
| use tokio::sync::RwLock; | ||
|
|
||
| // Tests that touch process-level atomics must run sequentially to | ||
| // avoid races between parallel test threads. | ||
| static TEST_LOCK: Mutex<()> = Mutex::new(()); | ||
|
|
||
| fn make_state() -> SharedState { | ||
| Arc::new(RwLock::new(Blockchain::new( | ||
| "0x0000000000000000000000000000000000000001".into(), | ||
| ))) | ||
| } | ||
|
|
||
| fn reset_fork_state() { | ||
| FORK_DETECTED.store(false, Ordering::SeqCst); | ||
| FORK_DETECTED_HEIGHT.store(0, Ordering::SeqCst); | ||
| FORK_DETECTED_AT_UNIX.store(0, Ordering::SeqCst); | ||
| if let Ok(mut g) = FORK_LOCAL_HEAD.lock() { | ||
| g.clear(); | ||
| } | ||
| } | ||
|
|
||
| /// Health returns 200 + `status: ok` when no fork is detected and chain | ||
| /// is fresh. (The genesis block has no blocks so last_block_ts=0 and | ||
| /// the stale guard `last_block_ts > 0` prevents a false stale alarm.) | ||
| #[tokio::test] | ||
| async fn test_health_ok_when_no_fork() { | ||
| let _guard = TEST_LOCK.lock().unwrap(); | ||
| reset_fork_state(); | ||
|
|
||
| let state = make_state(); | ||
| let resp = health(State(state)).await.into_response(); | ||
| assert_eq!(resp.status(), StatusCode::OK); | ||
| let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap(); | ||
| let json: serde_json::Value = serde_json::from_slice(&body).unwrap(); | ||
| assert_eq!(json["status"], "ok"); | ||
| assert_eq!(json["fork_detected"], false); | ||
| } | ||
|
|
||
| /// Health returns 503 + `status: fork_detected` when FORK_DETECTED is set. | ||
| /// This is what the Docker healthcheck sees when the node is on a | ||
| /// divergent branch. | ||
| #[tokio::test] | ||
| async fn test_health_503_when_fork_detected() { | ||
| let _guard = TEST_LOCK.lock().unwrap(); | ||
| reset_fork_state(); | ||
|
|
||
| FORK_DETECTED.store(true, Ordering::SeqCst); | ||
| FORK_DETECTED_HEIGHT.store(6_132_038, Ordering::SeqCst); | ||
| FORK_DETECTED_AT_UNIX.store(1_748_000_000, Ordering::SeqCst); | ||
| if let Ok(mut g) = FORK_LOCAL_HEAD.lock() { | ||
| *g = "deadbeef01234567".to_string(); | ||
| } | ||
|
|
||
| let state = make_state(); | ||
| let resp = health(State(state)).await.into_response(); | ||
| assert_eq!(resp.status(), StatusCode::SERVICE_UNAVAILABLE); | ||
| let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap(); | ||
| let json: serde_json::Value = serde_json::from_slice(&body).unwrap(); | ||
| assert_eq!(json["status"], "fork_detected"); | ||
| assert_eq!(json["fork_detected"], true); | ||
| assert_eq!(json["fork_at_height"], 6_132_038u64); | ||
| assert_eq!(json["fork_local_head_at_detection"], "deadbeef01234567"); | ||
|
|
||
| reset_fork_state(); | ||
| } | ||
|
|
||
| /// Clearing FORK_DETECTED (as the NewBlock handler does) switches health | ||
| /// back to 200. This simulates a transient fork that resolved after sync. | ||
| #[tokio::test] | ||
| async fn test_health_recovers_when_fork_cleared() { | ||
| let _guard = TEST_LOCK.lock().unwrap(); | ||
| reset_fork_state(); | ||
|
|
||
| // Simulate: fork detected, then NewBlock clears the flag. | ||
| FORK_DETECTED.store(true, Ordering::SeqCst); | ||
| FORK_DETECTED.store(false, Ordering::SeqCst); | ||
|
|
||
| let state = make_state(); | ||
| let resp = health(State(state)).await.into_response(); | ||
| assert_eq!(resp.status(), StatusCode::OK); | ||
| } | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider adding a test for the stale condition.
The tests cover fork detection and recovery but don't validate the stale path (HTTP 503 with "status": "stale"). This path is harder to test because it requires manipulating START_TIME and block timestamps, but it's a distinct unhealthy condition that should have coverage.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/sentrix-rpc/src/routes/ops.rs` around lines 548 - 635, Add a new async
test that acquires TEST_LOCK and calls reset_fork_state(), then sets START_TIME
to a time far enough in the past (so the node appears stale) while ensuring
FORK_DETECTED is false, mutate the in-memory blockchain state inside SharedState
(Arc<RwLock<Blockchain>>) to have a last block timestamp older than the stale
threshold (either by setting the chain head / last_block_ts or inserting a block
with an old timestamp via the Blockchain API), call
health(State(state)).await.into_response(), and assert the response status is
503 and the JSON contains "status":"stale" and "fork_detected":false; reference
the health function, START_TIME, TEST_LOCK, reset_fork_state, and the
SharedState/Blockchain mutation to locate where to add this test.
- Use Ordering::Release on FORK_DETECTED.swap() and store HEIGHT/AT_UNIX
BEFORE the swap (so Acquire readers see updated metadata). Acquire on
reads in health endpoint. Pairs correctly on non-TSO architectures.
- Clear FORK_DETECTED only if block.index >= FORK_DETECTED_HEIGHT —
prevents a block below the fork point from prematurely clearing the
unhealthy flag.
- FORK_LOCAL_HEAD: recover poisoned Mutex inner value with
unwrap_or_else(|e| e.into_inner().clone()) + warn log.
- Fix {:?} → {} for PeerId in gossip error/fork log paths.
- Test serialization: switch from std::sync::Mutex (held across .await)
to tokio::sync::Mutex via OnceLock helper.
Root Cause
Testnet val2 forked at h=6,132,038 but reported healthy to Docker because
/healthreturned{"status":"ok"}unconditionally. The sync errorlibp2p sync: block 6132039 failed: Invalid block: invalid previous hashwas logged andAPPLY_ERRincremented, but no health state changed. The node kept attempting sync on every interval, kept failing, and kept serving as healthy.No rollback/rewind API exists in the storage layer (
truncate_chain,rollback_to_height— nothing). Automatic fork recovery is not implemented.What Changed
sentrix-network/src/node.rsNodeEvent::SyncForkDetected { height, local_head_hash }— emitted when a block import fails with "invalid previous hash".sentrix-network/src/libp2p_node.rsSentrixError::InvalidBlockcontaining"invalid previous hash". On match: read local head hash, log ERROR attarget: "sentrix::fork", emitSyncForkDetected.sentrix-rpc/src/routes/ops.rsFORK_DETECTED,FORK_DETECTED_HEIGHT,FORK_DETECTED_AT_UNIX,FORK_LOCAL_HEAD./healthnow returns HTTP 503 when forked or stale (stale check is uptime-gated to avoid false alarms during boot). Response body includesfork_at_height,fork_local_head_at_detection, andrecoveryinstructions for the operator.bin/sentrix/src/main.rsNodeEvent::SyncForkDetectedhandler sets fork atomics on first detection; logs ERROR with operator recovery instruction.NodeEvent::NewBlockhandler clearsFORK_DETECTEDif set (fork resolved via sync recovery).Limitations
rollback_to(height)orrewindAPI. The correct action remains: stop the node, copychain.dbfrom a healthy validator, restart.SyncForkDetectedevent may clear if the canonical network somehow catches up to the local fork branch — but that scenario implies local state was actually canonical. The flag clears on any successful block apply, which is conservative.Tests
test_invalid_previous_hash_error_string_is_stable— pins the exact"invalid previous hash"substring that fork detection matches on. Guards against silent breakage from error message changes.test_health_503_when_fork_detected— verifies HTTP 503 + structured body with correct fields.test_health_ok_when_no_fork— verifies HTTP 200 on clean state.test_health_recovers_when_fork_cleared— verifies recovery path.cargo check --workspace -D warnings: clean.cargo test --workspace: all pass.Summary by CodeRabbit
Release Notes
Documentation
New Features
/healthendpoint to return 503 SERVICE_UNAVAILABLE with diagnostic details when a fork is detected or node data is stale, including recovery guidance.Tests