Fix Y2038 worker heartbeat overflow by migrating timestamps to i64#8788
Conversation
Five regression tests for STORM apache#7897: - CWH time_secs i64 round-trip survives post-2038 epochs - HeartbeatCache accepts Long TIME_SECS beats without false timeout - documents int currentTimeSecs() overflow (negative guard) - legacy i32 LSWorkerHeartbeat blob fails required-field validation under the i64 schema (wire-compat semantics) - post-2038 executor launch window not misclassified as dead
Time.currentTimeSecs() narrows epoch seconds to int, which overflows on 2038-01-19T03:14:07Z and is the root cause of STORM apache#7897 (workers falsely timed out post-2038). Add currentTimeSecsLong() and deltaSecsLong(long) for absolute timestamps. The int variants stay, deprecated, for short relative durations (uptime/UI).
Promote time_secs from i32 to i64 in ClusterWorkerHeartbeat, SupervisorWorkerHeartbeat and LSWorkerHeartbeat; uptime_secs stays i32 (relative duration). Regenerate the Java and Python thrift bindings for the three structs. Widen the callers that the type change forces at compile time: ExecutorBeat.timeSecs, ClusterUtils.convertExecutorBeats and the latest-heartbeat comparison in PaceMakerStateStorage.get_worker_hb (an absolute comparison that broke across the 2038 rollover). Wire compat: thrift tags i32/i64 differently, so blobs written by the old schema fail required-field validation under the new one. Heartbeats self-heal on re-report; a full-cluster bounce upgrade is required.
Switch every heartbeat writer to Time.currentTimeSecsLong(): - Worker.doHeartBeat (LSWorkerHeartbeat, on-disk local state) - ClientStatsUtil.mkZkWorkerHb (ZK beat map, TIME_SECS now a Long) - ClientStatsUtil.thriftifyZkWorkerHb (keep full long, no intValue narrowing) - StatsUtil.thriftifyRpcWorkerHb (SupervisorWorkerHeartbeat) - SupervisorHeartbeat (SupervisorInfo.time_secs, already i64 on the wire but previously fed a wrapped int)
HeartbeatCache (the Nimbus worker-liveness site) tracks receipt time and reported time as Long and computes timeouts with deltaSecsLong; beat map values are read through Number so Integer beats from legacy producers still work. The executor launch-window check no longer truncates assignment start times through intValue(). Supervisor and logviewer consumers move off the int clock as well: Slot heartbeat-age checks use deltaSecsLong, and the logviewer alive-worker scan (WorkerLogs/LogCleaner) carries epoch seconds as long end to end.
Explain the STORM-7897 time_secs i32->i64 promotion, why uptime_secs stays i32, the deprecation of the int-based Time methods, and the full-cluster bounce upgrade requirement (legacy heartbeat blobs fail required-field validation and self-heal on re-report).
|
@dpol1 thanks for the submission. What do you think about offering the possibility of heartbeat producers still sending 32 bits and Nimbus would ajust? Then in the reader code: |
|
@reiabreu yeah, optional-i32 + a new i64 field is the compat path I weighed before going in-place. Two notes on the snippet: the new field needs to be optional, not required, or a legacy producer that only sets the old one fails Thrift's required-field check on the new reader and the fallback never runs. Also the read fallback only covers old worker → new Nimbus. A new worker writing only I went in-place to avoid carrying two fields plus a cleanup release, and because these heartbeat structs don't have a rolling-upgrade contract today. A full restart avoids the mixed-version window completely — Nimbus times out on receipt time, so heartbeats recover within an interval once everything's upgraded. |
|
Thanks @dpol1 , it makes sense. Next version will be major where disruptions are expected. We just have to make this clear in the release notes. |
What is the purpose of the change
Fixes the Y2038 heartbeat overflow issue. Worker heartbeat timestamps (
time_secs) were carried asi32seconds, which would overflow on 2038-01-19. This change migrates the timestamps to 64-bit values (i64) to prevent false timeouts and cluster failures post-2038. It also establishes the expected bounce-upgrade semantics for legacy heartbeats.How was the change tested
Added
Y2038HeartbeatTestto ensure that 64-bit timestamps round-trip correctly through Thrift serialization.Verified that
HeartbeatCachesuccessfully parses and handles post-2038 heartbeats without flagging them as timed out.Tested schema backward-compatibility to ensure legacy
i32payloads fail validation gracefully during the bounce-upgrade window.Fixes [STORM-4116] Heartbeats mechanism is affected by Y2038 bug #7897