Skip to content

Fix Y2038 worker heartbeat overflow by migrating timestamps to i64#8788

Merged
reiabreu merged 6 commits into
apache:masterfrom
dpol1:fix/7897
Jun 14, 2026
Merged

Fix Y2038 worker heartbeat overflow by migrating timestamps to i64#8788
reiabreu merged 6 commits into
apache:masterfrom
dpol1:fix/7897

Conversation

@dpol1

@dpol1 dpol1 commented Jun 11, 2026

Copy link
Copy Markdown
Member

What is the purpose of the change

Fixes the Y2038 heartbeat overflow issue. Worker heartbeat timestamps (time_secs) were carried as i32 seconds, which would overflow on 2038-01-19. This change migrates the timestamps to 64-bit values (i64) to prevent false timeouts and cluster failures post-2038. It also establishes the expected bounce-upgrade semantics for legacy heartbeats.

How was the change tested

  • Added Y2038HeartbeatTest to ensure that 64-bit timestamps round-trip correctly through Thrift serialization.

  • Verified that HeartbeatCache successfully parses and handles post-2038 heartbeats without flagging them as timed out.

  • Tested schema backward-compatibility to ensure legacy i32 payloads fail validation gracefully during the bounce-upgrade window.

  • Fixes [STORM-4116] Heartbeats mechanism is affected by Y2038 bug #7897

dpol1 added 6 commits June 11, 2026 09:46
Five regression tests for STORM apache#7897:
- CWH time_secs i64 round-trip survives post-2038 epochs
- HeartbeatCache accepts Long TIME_SECS beats without false timeout
- documents int currentTimeSecs() overflow (negative guard)
- legacy i32 LSWorkerHeartbeat blob fails required-field validation
  under the i64 schema (wire-compat semantics)
- post-2038 executor launch window not misclassified as dead
Time.currentTimeSecs() narrows epoch seconds to int, which overflows on
2038-01-19T03:14:07Z and is the root cause of STORM apache#7897 (workers
falsely timed out post-2038). Add currentTimeSecsLong() and
deltaSecsLong(long) for absolute timestamps. The int variants stay,
deprecated, for short relative durations (uptime/UI).
Promote time_secs from i32 to i64 in ClusterWorkerHeartbeat,
SupervisorWorkerHeartbeat and LSWorkerHeartbeat; uptime_secs stays i32
(relative duration). Regenerate the Java and Python thrift bindings for
the three structs.

Widen the callers that the type change forces at compile time:
ExecutorBeat.timeSecs, ClusterUtils.convertExecutorBeats and the
latest-heartbeat comparison in PaceMakerStateStorage.get_worker_hb
(an absolute comparison that broke across the 2038 rollover).

Wire compat: thrift tags i32/i64 differently, so blobs written by the
old schema fail required-field validation under the new one. Heartbeats
self-heal on re-report; a full-cluster bounce upgrade is required.
Switch every heartbeat writer to Time.currentTimeSecsLong():
- Worker.doHeartBeat (LSWorkerHeartbeat, on-disk local state)
- ClientStatsUtil.mkZkWorkerHb (ZK beat map, TIME_SECS now a Long)
- ClientStatsUtil.thriftifyZkWorkerHb (keep full long, no intValue narrowing)
- StatsUtil.thriftifyRpcWorkerHb (SupervisorWorkerHeartbeat)
- SupervisorHeartbeat (SupervisorInfo.time_secs, already i64 on the wire
  but previously fed a wrapped int)
HeartbeatCache (the Nimbus worker-liveness site) tracks receipt time
and reported time as Long and computes timeouts with deltaSecsLong;
beat map values are read through Number so Integer beats from legacy
producers still work. The executor launch-window check no longer
truncates assignment start times through intValue().

Supervisor and logviewer consumers move off the int clock as well:
Slot heartbeat-age checks use deltaSecsLong, and the logviewer
alive-worker scan (WorkerLogs/LogCleaner) carries epoch seconds as
long end to end.
Explain the STORM-7897 time_secs i32->i64 promotion, why uptime_secs
stays i32, the deprecation of the int-based Time methods, and the
full-cluster bounce upgrade requirement (legacy heartbeat blobs fail
required-field validation and self-heal on re-report).
@dpol1 dpol1 changed the title Fix/7897 Fix Y2038 worker heartbeat overflow by migrating timestamps to i64 Jun 11, 2026
@rzo1 rzo1 requested review from reiabreu and rzo1 June 11, 2026 13:56
@rzo1 rzo1 added this to the 3.0.0 milestone Jun 11, 2026
@rzo1 rzo1 added bug java Pull requests that update Java code labels Jun 11, 2026

@rzo1 rzo1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. LGTM.

@reiabreu

Copy link
Copy Markdown
Contributor

@dpol1 thanks for the submission.

What do you think about offering the possibility of heartbeat producers still sending 32 bits and Nimbus would ajust?
This could make the upgrade process smoother.

struct ClusterWorkerHeartbeat {
    1: optional i32 time_secs        // ← demoted to optional, kept for legacy readers
    5: required i64 time_secs_long   // ← new field at a new field ID
    ...
}

Then in the reader code:

long getTimeSecs() {
    if (isSet(TIME_SECS_LONG)) return time_secs_long;  // new writer
    return (long) time_secs;                            // legacy fallback, safe pre-2038
}

@dpol1

dpol1 commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

@reiabreu yeah, optional-i32 + a new i64 field is the compat path I weighed before going in-place.

Two notes on the snippet: the new field needs to be optional, not required, or a legacy producer that only sets the old one fails Thrift's required-field check on the new reader and the fallback never runs. Also the read fallback only covers old worker → new Nimbus. A new worker writing only time_secs_long still won't deserialize on an old Nimbus, so a real rolling upgrade also needs writers setting both fields for a release, then dropping the i32 later.

I went in-place to avoid carrying two fields plus a cleanup release, and because these heartbeat structs don't have a rolling-upgrade contract today. A full restart avoids the mixed-version window completely — Nimbus times out on receipt time, so heartbeats recover within an interval once everything's upgraded.
If we want heartbeats to survive a rolling upgrade I'll switch to optional + dual-write and track the i32 removal as a follow-up. Otherwise I'd keep this and just call out the restart in the docs.

@reiabreu

Copy link
Copy Markdown
Contributor

Thanks @dpol1 , it makes sense. Next version will be major where disruptions are expected. We just have to make this clear in the release notes.
Happy to merge this.

@reiabreu reiabreu merged commit 6f8b94d into apache:master Jun 14, 2026
12 checks passed
@dpol1 dpol1 deleted the fix/7897 branch June 14, 2026 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug java Pull requests that update Java code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[STORM-4116] Heartbeats mechanism is affected by Y2038 bug

3 participants