fix: fork detection + /health hardening (forked node now returns 503) by satyakwok · Pull Request #768 · sentrix-labs/sentrix

satyakwok · 2026-06-02T14:39:08Z

Root Cause

Testnet val2 forked at h=6,132,038 but reported healthy to Docker because /health returned {"status":"ok"} unconditionally. The sync error "libp2p sync: block 6132039 failed: Invalid block: invalid previous hash" was logged and APPLY_ERR was incremented, but no health state changed. The node kept attempting sync on every interval, kept failing, and kept reporting healthy.

No rollback/rewind API exists in the storage layer (truncate_chain, rollback_to_height — nothing). Automatic fork recovery is not implemented.

What Changed

sentrix-network/src/node.rs

New NodeEvent::SyncForkDetected { height, local_head_hash } — emitted when a block import fails with "invalid previous hash".

sentrix-network/src/libp2p_node.rs

All three block-apply error paths (batch-sync GetBlocks, gossip, direct NewBlock) now pattern-match on SentrixError::InvalidBlock containing "invalid previous hash". On match: read local head hash, log ERROR at target: "sentrix::fork", emit SyncForkDetected.
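
The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the shapes of `SentrixError` and `NodeEvent`, the field types, the `Timeout` variant, and the helper name `fork_event_for` are assumptions made for the example. Only the variant names `InvalidBlock` and `SyncForkDetected` and the matched substring come from the description.

```rust
/// Substring the fork detector matches on (taken from the PR description).
const FORK_MARKER: &str = "invalid previous hash";

/// Hypothetical stand-in for the real error type. Only the variant name
/// `InvalidBlock` comes from the PR description; `Timeout` is invented here.
#[derive(Debug)]
pub enum SentrixError {
    InvalidBlock(String),
    Timeout,
}

/// Hypothetical stand-in for the network event enum; field types are assumed.
#[derive(Debug, PartialEq)]
pub enum NodeEvent {
    SyncForkDetected { height: u64, local_head_hash: String },
}

/// Returns a `SyncForkDetected` event when a block-apply error looks like a
/// fork, i.e. an `InvalidBlock` whose message contains the marker substring.
/// Any other error yields `None` and is handled as an ordinary apply failure.
pub fn fork_event_for(
    err: &SentrixError,
    height: u64,
    local_head_hash: &str,
) -> Option<NodeEvent> {
    match err {
        SentrixError::InvalidBlock(msg) if msg.contains(FORK_MARKER) => {
            Some(NodeEvent::SyncForkDetected {
                height,
                local_head_hash: local_head_hash.to_string(),
            })
        }
        _ => None,
    }
}
```

Because the check is a substring match on an error message, it only stays correct while that message is unchanged, which is why the PR pins the string in a test.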

sentrix-rpc/src/routes/ops.rs

New atomics: FORK_DETECTED, FORK_DETECTED_HEIGHT, FORK_DETECTED_AT_UNIX, FORK_LOCAL_HEAD.
/health now returns HTTP 503 when forked or stale (stale check is uptime-gated to avoid false alarms during boot). Response body includes fork_at_height, fork_local_head_at_detection, and recovery instructions for the operator.
Docker healthcheck now fails on a forked node.
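
As a rough sketch of the decision the hardened endpoint makes, the following standard-library-only example maps the fork state to a status code and body fields. The static names, the status strings, and the body field names follow the PR description and its tests. The `HealthReport` struct, the `health_report` function, the `is_stale` input, and the choice to check the fork flag before staleness are assumptions made for illustration; the real handler is an axum route returning JSON.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Mutex;

// Process-wide fork state. Names follow the PR description; types are assumed.
static FORK_DETECTED: AtomicBool = AtomicBool::new(false);
static FORK_DETECTED_HEIGHT: AtomicU64 = AtomicU64::new(0);
static FORK_LOCAL_HEAD: Mutex<String> = Mutex::new(String::new());

/// Hypothetical, framework-free view of the /health response.
#[derive(Debug, PartialEq)]
pub struct HealthReport {
    pub http_status: u16,
    pub status: &'static str,
    pub fork_at_height: Option<u64>,
    pub fork_local_head_at_detection: Option<String>,
}

/// Builds the report. `is_stale` is computed elsewhere (uptime-gated).
pub fn health_report(is_stale: bool) -> HealthReport {
    if FORK_DETECTED.load(Ordering::SeqCst) {
        // Fall back to an empty string if the mutex is poisoned.
        let head = FORK_LOCAL_HEAD.lock().map(|g| g.clone()).unwrap_or_default();
        return HealthReport {
            http_status: 503,
            status: "fork_detected",
            fork_at_height: Some(FORK_DETECTED_HEIGHT.load(Ordering::SeqCst)),
            fork_local_head_at_detection: Some(head),
        };
    }
    if is_stale {
        return HealthReport {
            http_status: 503,
            status: "stale",
            fork_at_height: None,
            fork_local_head_at_detection: None,
        };
    }
    HealthReport {
        http_status: 200,
        status: "ok",
        fork_at_height: None,
        fork_local_head_at_detection: None,
    }
}
```

Any non-2xx status is enough for an HTTP-based container healthcheck to mark the node unhealthy, so the body fields exist mainly for the operator reading the response.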

bin/sentrix/src/main.rs

NodeEvent::SyncForkDetected handler sets fork atomics on first detection; logs ERROR with operator recovery instruction.
NodeEvent::NewBlock handler clears FORK_DETECTED if set (fork resolved via sync recovery).
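
The two handlers can be illustrated with a small sketch. It assumes a hypothetical `ForkFlags` struct in place of the process-wide statics, and the method names are invented for the example. It shows one way to get "set on first detection" semantics, using `swap` to learn whether the flag was already set. `SeqCst` mirrors the ordering used in the PR's tests.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical container for the fork flag and the height it was seen at.
#[derive(Default)]
pub struct ForkFlags {
    detected: AtomicBool,
    height: AtomicU64,
}

impl ForkFlags {
    /// Handler for `SyncForkDetected`. Returns true only on the first
    /// detection, so the caller can log the operator instruction once.
    pub fn on_sync_fork_detected(&self, height: u64) -> bool {
        // `swap` returns the previous value: false means we are first.
        let first = !self.detected.swap(true, Ordering::SeqCst);
        if first {
            self.height.store(height, Ordering::SeqCst);
        }
        first
    }

    /// Handler for `NewBlock`. Clears the flag and returns true if it was set.
    pub fn on_new_block(&self) -> bool {
        self.detected.swap(false, Ordering::SeqCst)
    }

    pub fn is_forked(&self) -> bool {
        self.detected.load(Ordering::SeqCst)
    }

    pub fn fork_height(&self) -> u64 {
        self.height.load(Ordering::SeqCst)
    }
}
```

Repeated fork events keep the height of the first detection, which is usually the value an operator wants when deciding where the chains diverged.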

Limitations

No automatic recovery. The storage layer has no rollback_to(height) or rewind API. The correct action remains: stop the node, copy chain.db from a healthy validator, restart.
The FORK_DETECTED flag may clear if the canonical network somehow catches up to the local fork branch, but that scenario implies local state was actually canonical. The flag clears on any successful block apply, which is conservative.

Tests

test_invalid_previous_hash_error_string_is_stable — pins the exact "invalid previous hash" substring that fork detection matches on. Guards against silent breakage from error message changes.
test_health_503_when_fork_detected — verifies HTTP 503 + structured body with correct fields.
test_health_ok_when_no_fork — verifies HTTP 200 on clean state.
test_health_recovers_when_fork_cleared — verifies recovery path.

cargo check --workspace -D warnings: clean. cargo test --workspace: all pass.

Summary by CodeRabbit

Release Notes

Documentation
- Updated Telegram community link in README.
New Features
- Added fork detection: nodes now detect and report when they diverge from the canonical blockchain.
- Enhanced /health endpoint to return 503 SERVICE_UNAVAILABLE with diagnostic details when a fork is detected or node data is stale, including recovery guidance.
Tests
- Added test suite validating fork detection and health endpoint behavior.

Addresses the silent-healthy-forked-node class demonstrated by testnet val2 at h=6,132,038. Previously a node on a divergent branch kept reporting healthy to Docker and serving stale RPC. There was no way for the operator to distinguish a forked node from a healthy one without manually comparing head hashes.

Root cause: /health returned {status:ok} unconditionally. Block import failures with 'invalid previous hash' incremented APPLY_ERR but made no state change. No rollback API exists in the storage layer.

Changes:
- sentrix-network/src/node.rs: add NodeEvent::SyncForkDetected{height, local_head_hash}, emitted by libp2p paths when 'invalid previous hash' is seen.
- sentrix-network/src/libp2p_node.rs: detect SentrixError::InvalidBlock containing 'invalid previous hash' in all three block-apply paths (batch-sync, gossip, direct-apply), emit SyncForkDetected with local head hash for diagnostics.
- sentrix-rpc/src/routes/ops.rs: add FORK_DETECTED / FORK_DETECTED_HEIGHT / FORK_DETECTED_AT_UNIX / FORK_LOCAL_HEAD atomics. Harden /health to return HTTP 503 when forked or stale (uptime-gated). Response body includes fork_at_height, fork_local_head_at_detection, and recovery instructions. Docker healthcheck now fails on a forked node.
- bin/sentrix/src/main.rs: handle NodeEvent::SyncForkDetected by setting fork atomics on first detection. Handle NodeEvent::NewBlock by clearing the fork flag if set (sync recovered).

Limitation: no automatic fork recovery. The storage layer has no rollback/rewind primitive. The fail-closed approach is correct: mark unhealthy, log operator instructions, expose state via health.

Tests:
- test_invalid_previous_hash_error_string_is_stable: pins the exact error substring that fork detection matches against.
- test_health_503_when_fork_detected: health returns 503 + structured body when FORK_DETECTED is set.
- test_health_ok_when_no_fork: health returns 200 when clean.
- test_health_recovers_when_fork_cleared: health recovers to 200 when flag is cleared (NewBlock path).

codecov · 2026-06-02T14:44:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

coderabbitai · 2026-06-02T14:45:02Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a3f192f3-ed6d-40f6-a2a6-3eb6bab41375

📥 Commits

Reviewing files that changed from the base of the PR and between 1810d80 and 68f69cd.

📒 Files selected for processing (7)

bin/sentrix/src/main.rs
crates/sentrix-core/src/blockchain.rs
crates/sentrix-network/src/libp2p_node.rs
crates/sentrix-rpc/src/routes/ops.rs
crates/sentrix-trie/src/address.rs
crates/sentrix-wallet/src/keystore.rs
crates/sentrix-wallet/src/wallet.rs

📝 Walkthrough

The PR implements fork detection for blockchain nodes by monitoring "invalid previous hash" errors during block application across three network paths (gossipsub, request/response, and sync batches). When a fork is detected, a SyncForkDetected event is emitted, which triggers handlers to update global atomic state. The hardened /health endpoint checks these state flags and returns 503 for fork-detected or stale conditions, with automatic recovery when a valid block is successfully applied. Tests validate the error message stability and health endpoint behavior across state transitions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description is comprehensive and well-structured, covering root cause, changes across all modified files, limitations, and testing. However, it deviates from the required template by not using the repository's standardized PR description format with sections like Summary, Scope, Checks, and Deploy impact.	Restructure the description to follow the repository template: include Summary (1-3 sentences), Scope checkboxes, Checks checklist, Linked issue, and Deploy impact sections. Move detailed technical information to after the template sections.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely summarizes the main changes: fork detection implementation and /health endpoint hardening to return 503 for forked nodes.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sentrix-rpc/src/routes/ops.rs`:
- Around line 548-635: Add a new async test that acquires TEST_LOCK and calls
reset_fork_state(), then sets START_TIME to a time far enough in the past (so
the node appears stale) while ensuring FORK_DETECTED is false, mutate the
in-memory blockchain state inside SharedState (Arc<RwLock<Blockchain>>) to have
a last block timestamp older than the stale threshold (either by setting the
chain head / last_block_ts or inserting a block with an old timestamp via the
Blockchain API), call health(State(state)).await.into_response(), and assert the
response status is 503 and the JSON contains "status":"stale" and
"fork_detected":false; reference the health function, START_TIME, TEST_LOCK,
reset_fork_state, and the SharedState/Blockchain mutation to locate where to add
this test.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 05686ef1-13ce-4a79-ad76-e0960da62b08

📥 Commits

Reviewing files that changed from the base of the PR and between 53dece8 and 1810d80.

📒 Files selected for processing (6)

README.md
bin/sentrix/src/main.rs
crates/sentrix-network/src/libp2p_node.rs
crates/sentrix-network/src/node.rs
crates/sentrix-primitives/src/block.rs
crates/sentrix-rpc/src/routes/ops.rs

coderabbitai · 2026-06-02T14:45:06Z

+#[cfg(test)]
+mod tests {
+    use super::*;
+    use axum::{body::to_bytes, http::StatusCode};
+    use sentrix_core::blockchain::Blockchain;
+    use std::sync::{Arc, Mutex, atomic::Ordering};
+    use tokio::sync::RwLock;
+
+    // Tests that touch process-level atomics must run sequentially to
+    // avoid races between parallel test threads.
+    static TEST_LOCK: Mutex<()> = Mutex::new(());
+
+    fn make_state() -> SharedState {
+        Arc::new(RwLock::new(Blockchain::new(
+            "0x0000000000000000000000000000000000000001".into(),
+        )))
+    }
+
+    fn reset_fork_state() {
+        FORK_DETECTED.store(false, Ordering::SeqCst);
+        FORK_DETECTED_HEIGHT.store(0, Ordering::SeqCst);
+        FORK_DETECTED_AT_UNIX.store(0, Ordering::SeqCst);
+        if let Ok(mut g) = FORK_LOCAL_HEAD.lock() {
+            g.clear();
+        }
+    }
+
+    /// Health returns 200 + `status: ok` when no fork is detected and chain
+    /// is fresh. (The genesis block has no blocks so last_block_ts=0 and
+    /// the stale guard `last_block_ts > 0` prevents a false stale alarm.)
+    #[tokio::test]
+    async fn test_health_ok_when_no_fork() {
+        let _guard = TEST_LOCK.lock().unwrap();
+        reset_fork_state();
+
+        let state = make_state();
+        let resp = health(State(state)).await.into_response();
+        assert_eq!(resp.status(), StatusCode::OK);
+        let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap();
+        let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
+        assert_eq!(json["status"], "ok");
+        assert_eq!(json["fork_detected"], false);
+    }
+
+    /// Health returns 503 + `status: fork_detected` when FORK_DETECTED is set.
+    /// This is what the Docker healthcheck sees when the node is on a
+    /// divergent branch.
+    #[tokio::test]
+    async fn test_health_503_when_fork_detected() {
+        let _guard = TEST_LOCK.lock().unwrap();
+        reset_fork_state();
+
+        FORK_DETECTED.store(true, Ordering::SeqCst);
+        FORK_DETECTED_HEIGHT.store(6_132_038, Ordering::SeqCst);
+        FORK_DETECTED_AT_UNIX.store(1_748_000_000, Ordering::SeqCst);
+        if let Ok(mut g) = FORK_LOCAL_HEAD.lock() {
+            *g = "deadbeef01234567".to_string();
+        }
+
+        let state = make_state();
+        let resp = health(State(state)).await.into_response();
+        assert_eq!(resp.status(), StatusCode::SERVICE_UNAVAILABLE);
+        let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap();
+        let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
+        assert_eq!(json["status"], "fork_detected");
+        assert_eq!(json["fork_detected"], true);
+        assert_eq!(json["fork_at_height"], 6_132_038u64);
+        assert_eq!(json["fork_local_head_at_detection"], "deadbeef01234567");
+
+        reset_fork_state();
+    }
+
+    /// Clearing FORK_DETECTED (as the NewBlock handler does) switches health
+    /// back to 200. This simulates a transient fork that resolved after sync.
+    #[tokio::test]
+    async fn test_health_recovers_when_fork_cleared() {
+        let _guard = TEST_LOCK.lock().unwrap();
+        reset_fork_state();
+
+        // Simulate: fork detected, then NewBlock clears the flag.
+        FORK_DETECTED.store(true, Ordering::SeqCst);
+        FORK_DETECTED.store(false, Ordering::SeqCst);
+
+        let state = make_state();
+        let resp = health(State(state)).await.into_response();
+        assert_eq!(resp.status(), StatusCode::OK);
+    }
+}


🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider adding a test for the stale condition.

The tests cover fork detection and recovery but don't validate the stale path (HTTP 503 with "status": "stale"). This path is harder to test because it requires manipulating START_TIME and block timestamps, but it's a distinct unhealthy condition that should have coverage.
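
One way the stale path could be made easier to cover, offered only as a sketch, is to factor the staleness decision into a pure function that takes the clock and thresholds as arguments, so a test does not need to touch START_TIME or build old blocks. The function name, parameter names, and threshold values below are hypothetical. The uptime gate and the `last_block_ts > 0` guard follow the PR description and the comment in the test above.

```rust
/// Hypothetical pure form of the stale check.
///
/// A node counts as stale only when all of the following hold:
/// - it has been up for at least `min_uptime_secs` (avoids boot-time alarms),
/// - it has at least one block (`last_block_ts > 0`),
/// - its newest block is older than `stale_after_secs`.
pub fn is_stale(
    now_unix: u64,
    start_unix: u64,
    last_block_ts: u64,
    min_uptime_secs: u64,
    stale_after_secs: u64,
) -> bool {
    // saturating_sub guards against clock skew producing an underflow.
    let uptime = now_unix.saturating_sub(start_unix);
    let block_age = now_unix.saturating_sub(last_block_ts);
    uptime >= min_uptime_secs && last_block_ts > 0 && block_age > stale_after_secs
}
```

With the decision isolated like this, the handler test would only need to confirm that a `true` result maps to HTTP 503 with "status": "stale".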

- Use Ordering::Release on FORK_DETECTED.swap() and store HEIGHT/AT_UNIX BEFORE the swap (so Acquire readers see updated metadata). Acquire on reads in health endpoint. Pairs correctly on non-TSO architectures.
- Clear FORK_DETECTED only if block.index >= FORK_DETECTED_HEIGHT; this prevents a block below the fork point from prematurely clearing the unhealthy flag.
- FORK_LOCAL_HEAD: recover poisoned Mutex inner value with unwrap_or_else(|e| e.into_inner().clone()) + warn log.
- Fix {:?} → {} for PeerId in gossip error/fork log paths.
- Test serialization: switch from std::sync::Mutex (held across .await) to tokio::sync::Mutex via OnceLock helper.
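
The ordering, height-gated clear, and poison-recovery points can be sketched together. This is an illustration under assumptions, not the follow-up commit's code: `ForkState` and its methods are hypothetical, and it assumes a single event-loop task calls `record_fork`, so two writers never race on the metadata.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical holder for the fork flag and its metadata.
#[derive(Default)]
pub struct ForkState {
    detected: AtomicBool,
    height: AtomicU64,
    at_unix: AtomicU64,
    local_head: Mutex<String>,
}

impl ForkState {
    /// Records a fork. Metadata is written first, then the flag is published
    /// with Release, so a reader that sees the flag via Acquire also sees
    /// the metadata. Returns true only on the first detection.
    pub fn record_fork(&self, height: u64, at_unix: u64, head: &str) -> bool {
        if self.detected.load(Ordering::Acquire) {
            return false; // keep the metadata of the first detection
        }
        self.height.store(height, Ordering::Relaxed);
        self.at_unix.store(at_unix, Ordering::Relaxed);
        // A poisoned mutex still holds a usable String; recover it.
        *self.local_head.lock().unwrap_or_else(|e| e.into_inner()) = head.to_string();
        !self.detected.swap(true, Ordering::Release)
    }

    /// Clears the flag only for a block at or above the recorded fork
    /// height. Returns true if the flag was cleared.
    pub fn on_block_applied(&self, block_index: u64) -> bool {
        if self.detected.load(Ordering::Acquire)
            && block_index >= self.height.load(Ordering::Relaxed)
        {
            self.detected.store(false, Ordering::Release);
            return true;
        }
        false
    }

    /// Reader side, as the health endpoint would use it:
    /// (height, detected_at_unix, local_head) when forked, else None.
    pub fn snapshot(&self) -> Option<(u64, u64, String)> {
        if !self.detected.load(Ordering::Acquire) {
            return None;
        }
        let head = self.local_head.lock().unwrap_or_else(|e| e.into_inner()).clone();
        Some((
            self.height.load(Ordering::Relaxed),
            self.at_unix.load(Ordering::Relaxed),
            head,
        ))
    }
}
```

The Relaxed metadata accesses are safe to pair with the Release/Acquire flag only because the metadata is written before the flag is set and read after the flag is observed.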

satyakwok added 2 commits June 2, 2026 14:23

docs: update Telegram community chat link

0c57d75

github-actions Bot enabled auto-merge (squash) June 2, 2026 14:39

coderabbitai Bot reviewed Jun 2, 2026

satyakwok added 2 commits June 2, 2026 17:15

chore: cargo fmt cleanup

68f69cd

satyakwok self-assigned this Jun 2, 2026

github-actions Bot merged commit d8ac0b1 into main Jun 2, 2026
23 checks passed

github-actions[bot] merged 4 commits into main from fix/fork-detection-health-endpoint
