refactor: replace timestamp-based node liveness with event-index staleness#1451

Closed
AlexCheema wants to merge 3 commits into main from alexcheema/event-index-liveness

Conversation

@AlexCheema
Contributor

Motivation

The node liveness system uses last_seen: Mapping[NodeId, datetime] in the event-sourced State and checks now - last_seen > 30s to time out nodes. This has two fundamental problems:

  1. Timestamps don't belong in event-sourced state. datetime.now() is a side effect — replaying the same events produces different state depending on when they're replayed. The master was even overwriting the worker's timestamp with its own clock at indexing time (event.when = str(datetime.now(...))), adding another source of non-determinism.

  2. The timeout is finicky. The 30-second wall-clock threshold fires whenever any info-gathering task is slow — not when the node is actually unreachable. A node can be perfectly connected via libp2p, actively participating in topology, but get removed because system_profiler took 31 seconds to respond. This causes false-positive node removals in production.

Changes

State (state.py):

  • Replace last_seen: Mapping[NodeId, datetime] with last_event_index_by_node: Mapping[NodeId, int] — stores the global event index at which each node last produced an event. Fully deterministic: derived from the event stream itself.

Events (events.py):

  • Remove when: str field from NodeGatheredInfo — no longer needed.
  • Rename NodeTimedOut → NodeDisconnected — reflects the actual semantics (the node stopped producing events, not "we checked the clock").

Apply (apply.py):

  • Top-level apply() now records last_event_index_by_node[node_id] = event.idx whenever a NodeGatheredInfo is applied (sketched below). This keeps index tracking as a concern of the indexed event layer, not the individual event handler.
  • apply_node_gathered_info() no longer touches liveness state.
  • apply_node_timed_out() → apply_node_disconnected(), with last_event_index_by_node cleanup.
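
A minimal sketch of that top-level hook, assuming a pydantic-style model_copy update and a separate per-event apply_event() dispatcher (both assumptions; only the NodeGatheredInfo branch and the last_event_applied_idx field come from this PR):

    def apply(state: State, event: IndexedEvent) -> State:
        new_state = apply_event(state, event.event)  # per-event handlers, unchanged by this hook
        update: dict[str, object] = {"last_event_applied_idx": event.idx}
        if isinstance(event.event, NodeGatheredInfo):
            # Record the global event index at which this node was last heard from.
            update["last_event_index_by_node"] = {
                **new_state.last_event_index_by_node,
                event.event.node_id: event.idx,
            }
        return new_state.model_copy(update=update)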

Master (master/main.py):

  • Remove all datetime/timedelta/timezone imports and the event.when overwrite in _event_processor.
  • Replace the timestamp check in _plan() with cycle-based staleness detection: each plan cycle (10s), the master snapshots last_event_index_by_node and compares to the previous snapshot. If a node's index hasn't advanced for 3 consecutive cycles, it's disconnected (see the sketch after this list).
  • Staleness tracking (_last_checked_indices, _stale_cycles) is local master state — ephemeral, not event-sourced, reset on re-election.
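
A sketch of the cycle-based check from the first bullet above; the Master class members, the _emit() helper, and the state access are written as assumptions, while the threshold, snapshot ordering, and counter names follow the description in this PR:

    STALE_CYCLE_THRESHOLD = 3  # 3 cycles * 10 s plan interval ~= 30 s without events

    async def _plan(self) -> None:
        # _last_checked_indices and _stale_cycles are plain instance dicts:
        # ephemeral, never event-sourced, and empty again after re-election.
        while True:
            current_indices = dict(self.state.last_event_index_by_node)
            for node_id in list(current_indices):
                if current_indices[node_id] == self._last_checked_indices.get(node_id, -1):
                    self._stale_cycles[node_id] = self._stale_cycles.get(node_id, 0) + 1
                else:
                    self._stale_cycles[node_id] = 0
                if self._stale_cycles[node_id] >= STALE_CYCLE_THRESHOLD:
                    await self._emit(NodeDisconnected(node_id=node_id))  # assumed emit helper
                    del self._stale_cycles[node_id]
            # Saved only after the comparison, so each cycle is measured against
            # the previous cycle's snapshot.
            self._last_checked_indices = current_indices
            await anyio.sleep(10)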

Worker (worker/main.py):

  • Remove when field from NodeGatheredInfo construction. Remove unused datetime import.

Why It Works

The old system asked: "Has this node's timestamp been updated within 30 wall-clock seconds?" — conflating "slow info gathering" with "node is dead."

The new system asks: "Has this node produced ANY event since the last time I checked?" — which is the actual question we care about. A node running macmon at 1s intervals, memory polling, network monitoring, etc. produces many events per second. If none of those events arrive for 3 consecutive 10-second plan cycles, the node is genuinely unreachable — not just slow at one particular task.

The event index is deterministic and monotonically increasing. It's derived from the event stream itself, making it a natural fit for event-sourced state. No wall clocks, no timezone issues, no race between worker timestamps and master timestamps.

Test Plan

Manual Testing

Automated Testing

  • test_master updated: removed when field from NodeGatheredInfo fixture — passes.
  • basedpyright: 0 errors, 0 warnings.
  • ruff check: all passed.
  • nix fmt: 0 files changed.

🤖 Generated with Claude Code

AlexCheema and others added 2 commits February 11, 2026 13:35
…eness

Remove all datetime timestamps from the event-sourced state. Instead of
tracking `last_seen: Mapping[NodeId, datetime]` and checking wall-clock
deltas, track `last_event_index_by_node: Mapping[NodeId, int]` — the
global event index at which each node was last heard from.

The master's planning loop now compares snapshots of each node's last
event index across consecutive cycles. If a node produces no new events
for 3 consecutive plan cycles (~30s), it is disconnected.

This eliminates false-positive node removals caused by slow info
gathering tasks, since the timeout is now purely based on whether
events are flowing — not on wall-clock timing of individual tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Evanev7
Member

Evanev7 commented Feb 12, 2026

Just throwing out that Claude's analysis is straight up wrong here, so please make sure this change is what you want before making it. "I haven't seen an event in 30s" is the outcome either way; this obfuscates that, imo. Also, the master timestamp was an intentional inclusion for debugging that is being removed here.

@AlexCheema
Contributor Author

The motivation for this change is that consumer device timestamps should not be trusted. Event-index staleness gives us a reliable, monotonic measure of liveness that doesn't depend on clock synchronization across nodes.

@AlexCheema
Contributor Author

To clarify — we're fine keeping the master timestamp for our own debugging purposes, but it shouldn't be used for liveness detection logic. The core liveness check should rely on event-index staleness, not timestamps from consumer devices that can drift or be out of sync.

@AlexCheema

Contributor Author

Code Review: PR #1451 — Replace Timestamp-Based Node Liveness with Event-Index Staleness

Overall Assessment

Clean, focused refactor that eliminates a real problem (clock skew between nodes) by replacing wall-clock timestamps with event-index-based staleness detection. The change is well-scoped and the before/after is easy to follow.

Strengths

  1. Eliminates clock-skew vulnerability: The old last_seen: Mapping[NodeId, datetime] depended on master-assigned timestamps. If the master's clock drifted or if timestamp comparison had timezone issues, nodes could be falsely timed out or kept alive. Event indices are monotonically increasing and master-local — no clock synchronization needed.

  2. NodeDisconnected is a better name than NodeTimedOut: The rename better describes the semantics — the master is disconnecting a node it considers stale, not necessarily "timing out" in a traditional sense.

  3. Clean removal of when field from NodeGatheredInfo: This field was always awkward — it was "a manually cast datetime overridden by the master" (per the old comment). Removing it simplifies the event model.

  4. Test updates are correct: test_master.py no longer needs to pass when=str(datetime.now(...)) in NodeGatheredInfo, and test_apply_node_timed_out.py correctly uses the new NodeDisconnected event and last_event_index_by_node.

Issues

  1. last_event_index_by_node is only updated for NodeGatheredInfo events (apply.py ~line 107):

    if isinstance(event.event, NodeGatheredInfo):
        update["last_event_index_by_node"] = {
            **new_state.last_event_index_by_node,
            event.event.node_id: event.idx,
        }

    Other node-originated events (e.g., RunnerStatusUpdated, TaskStatusUpdated, TaskAcknowledged) do NOT update this index. If a node stops sending NodeGatheredInfo but continues sending other events (e.g., runner status updates during inference), it would appear stale and get disconnected. Verify that NodeGatheredInfo is sent regularly by all active nodes regardless of what else they're doing. If it's sent by the info_gatherer on a fixed interval, this is fine.

  2. Stale detection state is not event-sourced: _last_checked_indices and _stale_cycles are instance variables on the master, NOT part of State. If the master process crashes and restarts, these reset to empty dicts. It then takes 3 full plan cycles (~30s at 10s/cycle) before any stale node is detected. This matches the old 30-second timeout behavior, so it's not a regression, but worth documenting.

  3. Edge case: node with index 0: When a brand-new node sends its first NodeGatheredInfo, its last_event_index_by_node value will be some positive index. _last_checked_indices won't have an entry yet, so last_checked defaults to -1. Since current_indices[node_id] != last_checked, the stale count won't increment. Correct behavior.

  4. Edge case: rapidly rejoining node: If a node disconnects (removed from last_event_index_by_node via apply_node_disconnected) and immediately rejoins with a new peer ID, the old stale data in _stale_cycles for the old peer ID is harmless (it just won't match any current last_event_index_by_node key and will be ignored). But _stale_cycles will accumulate dead entries over time. Consider periodically pruning keys that are no longer in state.last_event_index_by_node.
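
    A sketch of that pruning, assumed to run once per plan cycle; the attribute names mirror the PR, but the pruning itself is only this review's suggestion:

        # Drop bookkeeping for peers that are no longer tracked in state.
        live_nodes = set(self.state.last_event_index_by_node)
        self._stale_cycles = {n: c for n, c in self._stale_cycles.items() if n in live_nodes}
        self._last_checked_indices = {
            n: i for n, i in self._last_checked_indices.items() if n in live_nodes
        }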

Minor

  1. The _plan loop still does await anyio.sleep(10) — 10-second granularity for stale detection is reasonable but worth documenting the detection latency: 3 stale cycles * 10s = minimum 30 seconds to detect a dead node.

Verdict

Approve. This is a clean improvement that removes a real problem (clock dependency) with a simpler, deterministic approach. The only concern worth addressing is issue 1 above (verify NodeGatheredInfo frequency) — if it's sent on a fixed timer by the info_gatherer, this is a non-issue.

🤖 Generated with Claude Code

@AlexCheema
Contributor Author

Code Review: PR #1451 — refactor: replace timestamp-based node liveness with event-index staleness

Author: AlexCheema
Status: OPEN (auto-merge disabled)
Changes: +41 / -37 across 6 files

Overview

Replaces wall-clock timestamp-based node liveness (last_seen: datetime, 30s
timeout) with event-index staleness detection. Instead of checking "has this
node's timestamp been updated within 30 seconds?", it checks "has this node
produced any new events in the last 3 plan cycles (~30s)?".

Files changed:

  • src/exo/shared/types/state.py — last_seen → last_event_index_by_node
  • src/exo/shared/types/events.py — NodeTimedOut → NodeDisconnected, remove when field
  • src/exo/shared/apply.py — update apply logic for new state shape
  • src/exo/master/main.py — new staleness detection in _plan(), remove datetime usage
  • src/exo/worker/main.py — remove when field from NodeGatheredInfo construction
  • src/exo/master/tests/test_master.py — remove when field from fixture

Motivation

SOUND. Two real problems with the old approach:

  1. datetime.now() in event-sourced state breaks deterministic replay
  2. 30s wall-clock timeout fires on slow info-gathering (e.g. system_profiler
    taking 31s), not actual node unreachability — causing false-positive removals

The new approach is deterministic (event indices are derived from the event
stream) and more robust (checks event production, not clock deltas).

Correctness — State Changes

PASS.

state.py: Replacing last_seen: Mapping[NodeId, datetime] with
last_event_index_by_node: Mapping[NodeId, int] is clean. Integer indices are
deterministic and monotonically increasing.

events.py: Removing the when: str field from NodeGatheredInfo eliminates
the non-deterministic timestamp from the event stream. The rename
NodeTimedOut → NodeDisconnected better reflects semantics.

Correctness — Apply Logic

PASS.

In apply() (top-level), the index tracking is done at the IndexedEvent layer:

if isinstance(event.event, NodeGatheredInfo):
    update["last_event_index_by_node"] = {
        **new_state.last_event_index_by_node,
        event.event.node_id: event.idx,
    }

This is correct — the global event index is available on the IndexedEvent
wrapper, and the update is applied atomically with last_event_applied_idx.

apply_node_disconnected() properly cleans up last_event_index_by_node,
topology, downloads, node_identities, node_memory, node_disk, instances,
and runners for the disconnected node. This matches the old
apply_node_timed_out() behavior.

Correctness — Staleness Detection

PASS with notes.

The new logic in _plan():

  1. Snapshot current_indices from state.last_event_index_by_node
  2. Compare each node's index to _last_checked_indices (default -1)
  3. If unchanged → increment _stale_cycles counter
  4. If changed → reset counter
  5. If stale for >= 3 cycles → emit NodeDisconnected

This is correct:

  • New nodes: index > -1, so not stale on first check ✓
  • list() copies prevent mutation-during-iteration issues ✓
  • del _stale_cycles[node_id] after disconnect prevents double-fire ✓
  • _last_checked_indices updated AFTER the comparison loop ✓

Race Condition Analysis

NONE. _plan() runs in a single async task. _last_checked_indices and
_stale_cycles are only accessed within _plan(). No concurrent mutation.

Master Re-election

SAFE. On re-election, a new Master instance is created with fresh empty dicts:

    _last_checked_indices: dict[NodeId, int] = {}
    _stale_cycles: dict[NodeId, int] = {}

First cycle: all nodes compare against -1 (default), so none are marked stale.
Staleness tracking resumes normally from cycle 2. Effective grace period of
~10s after re-election before staleness counting begins.

Edge Case — Only NodeGatheredInfo Updates the Index

NOTE. The apply() function only updates last_event_index_by_node for
NodeGatheredInfo events. Other node-originated events (RunnerStatusUpdated,
NodeDownloadProgress, etc.) do NOT update the index.

This means a node that is actively running inference (producing
RunnerStatusUpdated events) but whose InfoGatherer has stalled would still be
disconnected.

In practice this is LOW RISK because:

  • macOS nodes: macmon runs at ~1s intervals, thunderbolt at ~5s
  • Linux nodes: psutil memory at ~1s, disk at ~30s
  • A healthy InfoGatherer produces events every 1-2 seconds

However, if InfoGatherer crashes while the runner continues working, the node
would be falsely disconnected after ~30s. Consider whether other event types
with a node_id should also update the index — though NodeGatheredInfo is
arguably the right "heartbeat" signal.

Timing Comparison

Old: Exact 30-second wall-clock check (but subject to clock skew)
New: 3 × 10-second plan cycles ≈ 30 seconds (but subject to plan loop drift)

The effective timeout is 20-40 seconds depending on when the node stops
producing events relative to the plan cycle. This is comparable to the old
behavior and arguably more predictable since it doesn't depend on clock sync.

Dashboard Compatibility

PASS. No references to last_seen or lastSeen in the dashboard source code.
The dashboard does not consume this state field directly.

Backwards Compatibility

BREAKING for any external consumers that:

  • Parse the when field from NodeGatheredInfo events
  • Listen for NodeTimedOut events (now NodeDisconnected)
  • Read state.last_seen (now state.last_event_index_by_node)

These are internal types, so the risk is low unless there are external
integrations consuming the event stream or state endpoint directly.

Test Coverage

ADEQUATE. The test_master.py fixture is updated to remove the when field.
No new tests for the staleness detection logic itself (_plan loop behavior).
The staleness logic is simple enough that the lack of a unit test is acceptable,
though a test verifying the 3-cycle threshold would add confidence.

Nits

None.

Verdict

LGTM. Well-motivated refactor that eliminates non-deterministic timestamps from
event-sourced state. The staleness detection logic is correct with no race
conditions. The only notable design choice is that only NodeGatheredInfo events
update the index (not all node-originated events), which is a reasonable
"heartbeat" approach but could theoretically cause false disconnects if
InfoGatherer fails while runners are healthy. Low risk in practice.

@AlexCheema
Contributor Author

Code Review -- PR #1451: refactor: replace timestamp-based node liveness with event-index staleness

CI status: All 8 checks PASSED (typecheck, Build and check on aarch64-darwin, aarch64-linux, x86_64-linux)

Merge status: CONFLICTING -- this PR has merge conflicts with main and needs a rebase.

Overview

Replaces wall-clock datetime-based node liveness (last_seen: Mapping[NodeId, datetime], 30s timeout) with event-index staleness detection (last_event_index_by_node: Mapping[NodeId, int], 3 consecutive plan cycles with no new events). The motivation is sound: datetime.now() is a side effect that breaks deterministic event replay, and the 30s wall-clock timeout conflates "slow info gathering" with "node actually unreachable."

Changes span 6 files: state.py, events.py, apply.py, master/main.py, worker/main.py, test_master.py.

Critical Issues

1. Merge conflict with #1493 -- node_identities cleanup regression

PR #1493 ("don't time out node identities", merged today) intentionally removed node_identities cleanup from apply_node_timed_out so that nodes rejoining the cluster retain their identity data. However, this PR's apply_node_disconnected still includes node_identities cleanup:

node_identities = {
    key: value
    for key, value in state.node_identities.items()
    if key != event.node_id
}

When rebasing, the node_identities cleanup must be removed to preserve the behavior from #1493. GitHub reports mergeStateStatus: DIRTY, confirming the conflict.

2. _master_time_stamp assignment silently removed

The PR removes this line from _event_processor:

event._master_time_stamp = datetime.now(tz=timezone.utc)  # pyright: ignore[reportPrivateUsage]

The _master_time_stamp field on BaseEvent still exists in events.py (untouched by this PR), but the only place it was ever set is now deleted. As Evanev7 noted in the PR comments, this was an intentional debugging aid. Alex agreed it should be kept for debugging but not used for liveness.

Recommended fix: Restore the _master_time_stamp assignment (keeping the datetime import in master/main.py for this one line). It's a private debug field with zero impact on liveness logic. Alternatively, if the team decides it's no longer needed, remove the field definition from BaseEvent too for consistency.

Significant Issues

3. Only NodeGatheredInfo events update last_event_index_by_node

In apply():

if isinstance(event.event, NodeGatheredInfo):
    update["last_event_index_by_node"] = {
        **new_state.last_event_index_by_node,
        event.event.node_id: event.idx,
    }

Other node-originated events (RunnerStatusUpdated, NodeDownloadProgress, TaskStatusUpdated) do NOT update the index. If a node's InfoGatherer crashes while its runner continues working, the node would be falsely disconnected after ~30s.

In practice this is low risk because the InfoGatherer produces events from multiple independent sources:

  • macOS: macmon at 1s, system_profiler at 5s, thunderbolt bridge at 10s, disk at 30s
  • Linux: psutil memory at 1s, network interfaces at 10s, disk at 30s

ALL of these would have to fail simultaneously to miss the heartbeat window. But this is a design decision worth making explicitly -- a comment noting that NodeGatheredInfo serves as the heartbeat signal would help future readers.

Minor Issues

4. Detection latency is variable (20-40s, not "~30s")

The effective detection window depends on when the node stops producing events relative to the plan cycle boundary. The PR description and comments say "~30s" -- more precisely it's 20-40 seconds, comparable to the old behavior.

5. No unit test for the staleness detection logic

The _plan() method's staleness tracking (_last_checked_indices, _stale_cycles, the 3-cycle threshold) has no dedicated test. The logic is simple, but a test verifying the 3-cycle threshold would add confidence for the core liveness mechanism.
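
One way such a test could look, assuming the staleness bookkeeping were factored into a small pure helper (the helper name and signature below are hypothetical, not part of this PR):

def find_stale_nodes(
    current: dict[str, int],
    last_checked: dict[str, int],
    stale_cycles: dict[str, int],
    threshold: int = 3,
) -> list[str]:
    """Return node ids whose event index has not advanced for `threshold` checks."""
    stale: list[str] = []
    for node_id, idx in current.items():
        if idx == last_checked.get(node_id, -1):
            stale_cycles[node_id] = stale_cycles.get(node_id, 0) + 1
        else:
            stale_cycles[node_id] = 0
        if stale_cycles[node_id] >= threshold:
            stale.append(node_id)
    return stale

def test_node_disconnects_after_three_stale_cycles() -> None:
    stale_cycles: dict[str, int] = {}
    # Cycle 0: first sighting at index 7 -- differs from the -1 default, so not stale.
    assert find_stale_nodes({"n1": 7}, {}, stale_cycles) == []
    last_checked = {"n1": 7}
    # Cycles 1-2: no new events, but still below the 3-cycle threshold.
    assert find_stale_nodes({"n1": 7}, last_checked, stale_cycles) == []
    assert find_stale_nodes({"n1": 7}, last_checked, stale_cycles) == []
    # Cycle 3: third consecutive unchanged snapshot -> disconnect.
    assert find_stale_nodes({"n1": 7}, last_checked, stale_cycles) == ["n1"]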

Correctness Analysis

The staleness detection logic in _plan() is correct:

  • New nodes: _last_checked_indices.get(node_id, -1) returns -1, which won't match any real index, so new nodes are never falsely marked stale on first check.
  • list(current_indices) prevents mutation-during-iteration.
  • del _stale_cycles[node_id] after disconnect prevents double-fire.
  • _last_checked_indices = current_indices is set AFTER the comparison loop, ensuring correct before/after comparison.
  • No race conditions: _plan() runs in a single async task; staleness dicts are only accessed there.
  • Master re-election: fresh Master instance gets empty dicts, giving a natural ~10s grace period before staleness counting begins.

The apply_node_disconnected cleanup of topology, downloads, memory, disk, system, network, thunderbolt, and rdma_ctl is complete and correct (modulo the node_identities issue in #1 above).

What's Good

  • Eliminates non-determinism from event-sourced state. The datetime.now() calls in apply_node_gathered_info and the when field overwrite in _event_processor were genuine event-sourcing violations. Event indices are fully deterministic.
  • NodeDisconnected is a better name than NodeTimedOut -- reflects actual semantics.
  • Clean removal of the when: str field from NodeGatheredInfo. The old field was awkward (manually cast datetime, overwritten by master, had a TODO comment).
  • Staleness tracking correctly scoped as ephemeral master state, not event-sourced. Resets on re-election, which is the right behavior.
  • Dashboard unaffected. No references to last_seen/lastSeen in the frontend code.
  • Roughly net-neutral diff (+41/-37) despite adding new functionality.

Verdict

Sound design that eliminates a real problem. The implementation is correct, but the PR needs work before merge:

  1. Rebase required -- merge conflicts with main
  2. Drop node_identities cleanup from apply_node_disconnected to match #1493 ("don't time out node identities")
  3. Restore _master_time_stamp assignment or remove the field from BaseEvent entirely

After addressing those three items, this is ready.

Review only -- not a merge approval.

@Evanev7 closed this on Feb 22, 2026.