fix: fork detection + /health hardening (forked node now returns 503)#768

Merged
github-actions[bot] merged 4 commits into main from fix/fork-detection-health-endpoint on Jun 2, 2026

satyakwok (Collaborator) commented Jun 2, 2026

Root Cause

Testnet val2 forked at h=6,132,038 but reported healthy to Docker because /health returned {"status":"ok"} unconditionally. The sync error "libp2p sync: block 6132039 failed: Invalid block: invalid previous hash" was logged and APPLY_ERR was incremented, but no health state changed. The node kept attempting sync on every interval, kept failing, and kept reporting itself as healthy.

The storage layer exposes no rollback or rewind API (nothing like truncate_chain or rollback_to_height), so automatic fork recovery is not implemented.

What Changed

sentrix-network/src/node.rs

  • New NodeEvent::SyncForkDetected { height, local_head_hash } — emitted when a block import fails with "invalid previous hash".

sentrix-network/src/libp2p_node.rs

  • All three block-apply error paths (batch-sync GetBlocks, gossip, direct NewBlock) now pattern-match on SentrixError::InvalidBlock containing "invalid previous hash". On match: read local head hash, log ERROR at target: "sentrix::fork", emit SyncForkDetected.
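
The matching step described above can be sketched as follows. This is a minimal illustration, not the code from libp2p_node.rs: the enum shapes, the `classify_apply_error` helper, the `FORK_MARKER` constant, and the choice of the local height as the event's `height` are all assumptions made for the example. Only the `SentrixError::InvalidBlock` and `NodeEvent::SyncForkDetected` names and the matched substring come from the PR.

```rust
/// Hypothetical stand-in for the crate's error type; only the variant
/// named in the PR is modeled, plus a catch-all for contrast.
#[derive(Debug)]
enum SentrixError {
    InvalidBlock(String),
    Other(String),
}

/// Hypothetical stand-in for the event added in node.rs.
#[derive(Debug, PartialEq)]
enum NodeEvent {
    SyncForkDetected { height: u64, local_head_hash: String },
}

/// Substring the detection keys on (taken from the PR text).
const FORK_MARKER: &str = "invalid previous hash";

/// Returns a fork event only when the apply error is an InvalidBlock
/// whose message contains the fork marker; every other error is ignored.
fn classify_apply_error(
    err: &SentrixError,
    local_height: u64,
    local_head_hash: &str,
) -> Option<NodeEvent> {
    match err {
        SentrixError::InvalidBlock(msg) if msg.contains(FORK_MARKER) => {
            Some(NodeEvent::SyncForkDetected {
                height: local_height,
                local_head_hash: local_head_hash.to_string(),
            })
        }
        _ => None,
    }
}
```

Routing all three apply paths through one helper like this would keep the substring match in a single place, which is also what makes a pinning test on the error string worthwhile.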

sentrix-rpc/src/routes/ops.rs

  • New atomics: FORK_DETECTED, FORK_DETECTED_HEIGHT, FORK_DETECTED_AT_UNIX, FORK_LOCAL_HEAD.
  • /health now returns HTTP 503 when forked or stale (stale check is uptime-gated to avoid false alarms during boot). Response body includes fork_at_height, fork_local_head_at_detection, and recovery instructions for the operator.
  • Docker healthcheck now fails on a forked node.
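
A rough sketch of the decision the hardened /health handler makes, written without axum so it stands alone. The `HealthInputs` struct, the `health_decision` function, and both thresholds (`BOOT_GRACE_SECS`, `STALE_AFTER_SECS`) are assumptions for illustration; the PR does not state the actual values. The status strings and the `last_block_ts > 0` guard follow the PR's tests.

```rust
/// Inputs the health check is assumed to read; field names are illustrative.
#[derive(Clone, Copy)]
struct HealthInputs {
    fork_detected: bool,
    uptime_secs: u64,
    last_block_ts: u64,
    now_unix: u64,
}

// Both thresholds are assumed values, not taken from the PR.
const BOOT_GRACE_SECS: u64 = 120;
const STALE_AFTER_SECS: u64 = 60;

/// Returns (HTTP status code, value of the "status" field).
fn health_decision(i: &HealthInputs) -> (u16, &'static str) {
    // A detected fork always wins: fail closed.
    if i.fork_detected {
        return (503, "fork_detected");
    }
    // The stale check is gated on uptime so a booting node is not flagged,
    // and on last_block_ts > 0 so an empty chain is not flagged either.
    let past_boot = i.uptime_secs >= BOOT_GRACE_SECS;
    let has_blocks = i.last_block_ts > 0;
    let age = i.now_unix.saturating_sub(i.last_block_ts);
    if past_boot && has_blocks && age > STALE_AFTER_SECS {
        return (503, "stale");
    }
    (200, "ok")
}
```

Keeping the decision as a pure function of its inputs would also make the stale path testable without manipulating process-level start time.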

bin/sentrix/src/main.rs

  • NodeEvent::SyncForkDetected handler sets fork atomics on first detection; logs ERROR with operator recovery instruction.
  • NodeEvent::NewBlock handler clears FORK_DETECTED if set (fork resolved via sync recovery).
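
The two handlers above could look roughly like this. It is a sketch under assumptions: the trimmed-down `NodeEvent` enum, the `NewBlock { index }` payload, and the `handle_event` function with its return labels are hypothetical, and SeqCst is used only for simplicity. The static names follow the PR.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Flag names follow the PR; the types and the handler are assumptions.
static FORK_DETECTED: AtomicBool = AtomicBool::new(false);
static FORK_DETECTED_HEIGHT: AtomicU64 = AtomicU64::new(0);

/// Hypothetical, trimmed-down event enum (the real NodeEvent has more variants).
enum NodeEvent {
    SyncForkDetected { height: u64, local_head_hash: String },
    NewBlock { index: u64 },
}

/// Applies one event to the fork flags and returns a short label for logging.
fn handle_event(event: NodeEvent) -> &'static str {
    match event {
        NodeEvent::SyncForkDetected { height, local_head_hash: _ } => {
            // swap returns the previous value, so only the first
            // detection records the height; repeats leave it untouched.
            if !FORK_DETECTED.swap(true, Ordering::SeqCst) {
                FORK_DETECTED_HEIGHT.store(height, Ordering::SeqCst);
                "fork recorded"
            } else {
                "fork already recorded"
            }
        }
        NodeEvent::NewBlock { index: _ } => {
            // A successfully applied block clears the flag if it was set.
            if FORK_DETECTED.swap(false, Ordering::SeqCst) {
                "fork cleared"
            } else {
                "no change"
            }
        }
    }
}
```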

Limitations

  • No automatic recovery. The storage layer has no rollback_to(height) or rewind API. The correct action remains: stop the node, copy chain.db from a healthy validator, restart.
  • The FORK_DETECTED flag set by SyncForkDetected may clear if the canonical network somehow catches up to the local fork branch, but that scenario implies the local state was actually canonical. The flag clears on any successful block apply, which is conservative.

Tests

  • test_invalid_previous_hash_error_string_is_stable — pins the exact "invalid previous hash" substring that fork detection matches on. Guards against silent breakage from error message changes.
  • test_health_503_when_fork_detected — verifies HTTP 503 + structured body with correct fields.
  • test_health_ok_when_no_fork — verifies HTTP 200 on clean state.
  • test_health_recovers_when_fork_cleared — verifies recovery path.
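
To illustrate what pinning the error string means, here is a small sketch. The `validate_previous_hash` helper and its message suffix are hypothetical stand-ins for the real block validation; only the "invalid previous hash" substring is taken from the PR.

```rust
/// Substring that fork detection matches on (from the PR).
const FORK_MARKER: &str = "invalid previous hash";

/// Hypothetical validation helper standing in for the real block check.
/// The "(expected ..., got ...)" suffix is an assumption for illustration.
fn validate_previous_hash(expected: &str, actual: &str) -> Result<(), String> {
    if expected == actual {
        Ok(())
    } else {
        Err(format!(
            "Invalid block: invalid previous hash (expected {expected}, got {actual})"
        ))
    }
}

/// The pinning check: produce the error through the real code path and
/// assert on the substring, so rewording the message breaks this check
/// instead of silently disabling fork detection.
fn error_string_is_stable() -> bool {
    match validate_previous_hash("aa", "bb") {
        Err(msg) => msg.contains(FORK_MARKER),
        Ok(()) => false,
    }
}
```

A typed error variant would remove the need for string matching altogether; the pinning test is the lighter-weight guard when the error type is left as is.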

cargo check --workspace -D warnings: clean. cargo test --workspace: all pass.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Updated Telegram community link in README.
  • New Features

    • Added fork detection: nodes now detect and report when they diverge from the canonical blockchain.
    • Enhanced /health endpoint to return 503 SERVICE_UNAVAILABLE with diagnostic details when a fork is detected or node data is stale, including recovery guidance.
  • Tests

    • Added test suite validating fork detection and health endpoint behavior.

satyakwok added 2 commits June 2, 2026 14:23
Addresses the silent-healthy-forked-node class demonstrated by testnet
val2 at h=6,132,038. Previously a node on a divergent branch kept
reporting healthy to Docker and serving stale RPC. There was no way
for the operator to distinguish a forked node from a healthy one
without manually comparing head hashes.

Root cause: /health returned {status:ok} unconditionally. Block import
failures with 'invalid previous hash' incremented APPLY_ERR but made
no state change. No rollback API exists in storage layer.

Changes:
- sentrix-network/src/node.rs: add NodeEvent::SyncForkDetected{height,
  local_head_hash} — emitted by libp2p paths when 'invalid previous
  hash' is seen.
- sentrix-network/src/libp2p_node.rs: detect SentrixError::InvalidBlock
  containing 'invalid previous hash' in all three block-apply paths
  (batch-sync, gossip, direct-apply), emit SyncForkDetected with
  local head hash for diagnostics.
- sentrix-rpc/src/routes/ops.rs: add FORK_DETECTED / FORK_DETECTED_HEIGHT
  / FORK_DETECTED_AT_UNIX / FORK_LOCAL_HEAD atomics. Harden /health to
  return HTTP 503 when forked or stale (uptime-gated). Response body
  includes fork_at_height, local_head_at_detection, and recovery
  instructions. Docker healthcheck now fails on a forked node.
- bin/sentrix/src/main.rs: handle NodeEvent::SyncForkDetected — set
  fork atomics on first detection. Handle NodeEvent::NewBlock — clear
  fork flag if set (sync recovered).

Limitation: no automatic fork recovery. Storage layer has no
rollback/rewind primitive. The fail-closed approach is correct:
mark unhealthy, log operator instructions, expose state via health.

Tests:
- test_invalid_previous_hash_error_string_is_stable: pins the exact
  error substring that fork detection matches against.
- test_health_503_when_fork_detected: health returns 503 + structured
  body when FORK_DETECTED is set.
- test_health_ok_when_no_fork: health returns 200 when clean.
- test_health_recovers_when_fork_cleared: health recovers to 200 when
  flag is cleared (NewBlock path).
@github-actions github-actions Bot enabled auto-merge (squash) June 2, 2026 14:39
codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

coderabbitai Bot commented Jun 2, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a3f192f3-ed6d-40f6-a2a6-3eb6bab41375

📥 Commits

Reviewing files that changed from the base of the PR and between 1810d80 and 68f69cd.

📒 Files selected for processing (7)
  • bin/sentrix/src/main.rs
  • crates/sentrix-core/src/blockchain.rs
  • crates/sentrix-network/src/libp2p_node.rs
  • crates/sentrix-rpc/src/routes/ops.rs
  • crates/sentrix-trie/src/address.rs
  • crates/sentrix-wallet/src/keystore.rs
  • crates/sentrix-wallet/src/wallet.rs
📝 Walkthrough

Walkthrough

The PR implements fork detection for blockchain nodes by monitoring "invalid previous hash" errors during block application across three network paths (gossipsub, request/response, and sync batches). When a fork is detected, a SyncForkDetected event is emitted, and its handler updates global atomic state. The hardened /health endpoint checks these flags and returns 503 for fork-detected or stale conditions, recovering automatically when a valid block is successfully applied. Tests validate the error message stability and the health endpoint's behavior across state transitions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description is comprehensive and well-structured, covering root cause, changes across all modified files, limitations, and testing. However, it deviates from the required template by not using the repository's standardized PR description format with sections like Summary, Scope, Checks, and Deploy impact. | Restructure the description to follow the repository template: include Summary (1-3 sentences), Scope checkboxes, Checks checklist, Linked issue, and Deploy impact sections. Move detailed technical information to after the template sections. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main changes: fork detection implementation and /health endpoint hardening to return 503 for forked nodes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sentrix-rpc/src/routes/ops.rs`:
- Around line 548-635: Add a new async test that acquires TEST_LOCK and calls
reset_fork_state(), then sets START_TIME to a time far enough in the past (so
the node appears stale) while ensuring FORK_DETECTED is false, mutate the
in-memory blockchain state inside SharedState (Arc<RwLock<Blockchain>>) to have
a last block timestamp older than the stale threshold (either by setting the
chain head / last_block_ts or inserting a block with an old timestamp via the
Blockchain API), call health(State(state)).await.into_response(), and assert the
response status is 503 and the JSON contains "status":"stale" and
"fork_detected":false; reference the health function, START_TIME, TEST_LOCK,
reset_fork_state, and the SharedState/Blockchain mutation to locate where to add
this test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 05686ef1-13ce-4a79-ad76-e0960da62b08

📥 Commits

Reviewing files that changed from the base of the PR and between 53dece8 and 1810d80.

📒 Files selected for processing (6)
  • README.md
  • bin/sentrix/src/main.rs
  • crates/sentrix-network/src/libp2p_node.rs
  • crates/sentrix-network/src/node.rs
  • crates/sentrix-primitives/src/block.rs
  • crates/sentrix-rpc/src/routes/ops.rs

Comment on lines +548 to +635
#[cfg(test)]
mod tests {
    use super::*;
    use axum::{body::to_bytes, http::StatusCode};
    use sentrix_core::blockchain::Blockchain;
    use std::sync::{Arc, Mutex, atomic::Ordering};
    use tokio::sync::RwLock;

    // Tests that touch process-level atomics must run sequentially to
    // avoid races between parallel test threads.
    static TEST_LOCK: Mutex<()> = Mutex::new(());

    fn make_state() -> SharedState {
        Arc::new(RwLock::new(Blockchain::new(
            "0x0000000000000000000000000000000000000001".into(),
        )))
    }

    fn reset_fork_state() {
        FORK_DETECTED.store(false, Ordering::SeqCst);
        FORK_DETECTED_HEIGHT.store(0, Ordering::SeqCst);
        FORK_DETECTED_AT_UNIX.store(0, Ordering::SeqCst);
        if let Ok(mut g) = FORK_LOCAL_HEAD.lock() {
            g.clear();
        }
    }

    /// Health returns 200 + `status: ok` when no fork is detected and chain
    /// is fresh. (The genesis block has no blocks so last_block_ts=0 and
    /// the stale guard `last_block_ts > 0` prevents a false stale alarm.)
    #[tokio::test]
    async fn test_health_ok_when_no_fork() {
        let _guard = TEST_LOCK.lock().unwrap();
        reset_fork_state();

        let state = make_state();
        let resp = health(State(state)).await.into_response();
        assert_eq!(resp.status(), StatusCode::OK);
        let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap();
        let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
        assert_eq!(json["status"], "ok");
        assert_eq!(json["fork_detected"], false);
    }

    /// Health returns 503 + `status: fork_detected` when FORK_DETECTED is set.
    /// This is what the Docker healthcheck sees when the node is on a
    /// divergent branch.
    #[tokio::test]
    async fn test_health_503_when_fork_detected() {
        let _guard = TEST_LOCK.lock().unwrap();
        reset_fork_state();

        FORK_DETECTED.store(true, Ordering::SeqCst);
        FORK_DETECTED_HEIGHT.store(6_132_038, Ordering::SeqCst);
        FORK_DETECTED_AT_UNIX.store(1_748_000_000, Ordering::SeqCst);
        if let Ok(mut g) = FORK_LOCAL_HEAD.lock() {
            *g = "deadbeef01234567".to_string();
        }

        let state = make_state();
        let resp = health(State(state)).await.into_response();
        assert_eq!(resp.status(), StatusCode::SERVICE_UNAVAILABLE);
        let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap();
        let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
        assert_eq!(json["status"], "fork_detected");
        assert_eq!(json["fork_detected"], true);
        assert_eq!(json["fork_at_height"], 6_132_038u64);
        assert_eq!(json["fork_local_head_at_detection"], "deadbeef01234567");

        reset_fork_state();
    }

    /// Clearing FORK_DETECTED (as the NewBlock handler does) switches health
    /// back to 200. This simulates a transient fork that resolved after sync.
    #[tokio::test]
    async fn test_health_recovers_when_fork_cleared() {
        let _guard = TEST_LOCK.lock().unwrap();
        reset_fork_state();

        // Simulate: fork detected, then NewBlock clears the flag.
        FORK_DETECTED.store(true, Ordering::SeqCst);
        FORK_DETECTED.store(false, Ordering::SeqCst);

        let state = make_state();
        let resp = health(State(state)).await.into_response();
        assert_eq!(resp.status(), StatusCode::OK);
    }
}

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider adding a test for the stale condition.

The tests cover fork detection and recovery but don't validate the stale path (HTTP 503 with "status": "stale"). This path is harder to test because it requires manipulating START_TIME and block timestamps, but it's a distinct unhealthy condition that should have coverage.

satyakwok added 2 commits June 2, 2026 17:15
- Use Ordering::Release on FORK_DETECTED.swap() and store HEIGHT/AT_UNIX
  BEFORE the swap (so Acquire readers see updated metadata). Acquire on
  reads in health endpoint. Pairs correctly on non-TSO architectures.
- Clear FORK_DETECTED only if block.index >= FORK_DETECTED_HEIGHT —
  prevents a block below the fork point from prematurely clearing the
  unhealthy flag.
- FORK_LOCAL_HEAD: recover poisoned Mutex inner value with
  unwrap_or_else(|e| e.into_inner().clone()) + warn log.
- Fix {:?} → {} for PeerId in gossip error/fork log paths.
- Test serialization: switch from std::sync::Mutex (held across .await)
  to tokio::sync::Mutex via OnceLock helper.
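
The ordering, height-guard, and poisoned-lock points in this commit message can be sketched together. This is an illustration under assumptions, not the merged code: the function names (`publish_fork`, `read_fork`, `clear_if_at_or_above`) are hypothetical and `FORK_LOCAL_HEAD` is assumed to be a `Mutex<String>`. The static names and the Release/Acquire pairing follow the commit message.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Mutex;

static FORK_DETECTED: AtomicBool = AtomicBool::new(false);
static FORK_DETECTED_HEIGHT: AtomicU64 = AtomicU64::new(0);
// Assumed type; the PR only names the static.
static FORK_LOCAL_HEAD: Mutex<String> = Mutex::new(String::new());

/// Writer side: store the metadata first, then publish the flag with
/// Release. Returns true if this call was the first detection.
fn publish_fork(height: u64, local_head: &str) -> bool {
    FORK_DETECTED_HEIGHT.store(height, Ordering::Relaxed);
    // Recover the inner value if a previous holder panicked (poisoned lock).
    *FORK_LOCAL_HEAD.lock().unwrap_or_else(|e| e.into_inner()) = local_head.to_string();
    !FORK_DETECTED.swap(true, Ordering::Release)
}

/// Reader side: an Acquire load that observes `true` also observes the
/// metadata written before the Release swap.
fn read_fork() -> Option<(u64, String)> {
    if !FORK_DETECTED.load(Ordering::Acquire) {
        return None;
    }
    let height = FORK_DETECTED_HEIGHT.load(Ordering::Relaxed);
    let head = FORK_LOCAL_HEAD.lock().unwrap_or_else(|e| e.into_inner()).clone();
    Some((height, head))
}

/// Clears the flag only for blocks at or above the recorded fork height,
/// so a block below the fork point cannot mark the node healthy again.
fn clear_if_at_or_above(block_index: u64) -> bool {
    if block_index >= FORK_DETECTED_HEIGHT.load(Ordering::Acquire) {
        FORK_DETECTED.swap(false, Ordering::Release)
    } else {
        false
    }
}
```

In this sketch a repeated detection overwrites the stored height; whether the merged handler guards against that is not stated in the commit message.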
@satyakwok satyakwok self-assigned this Jun 2, 2026
@github-actions github-actions Bot merged commit d8ac0b1 into main Jun 2, 2026
23 checks passed