feat: add sigterm handling for graceful termination #1324
linxiulei wants to merge 2 commits into apple:main from
Conversation
Checkpoint saving can be extremely fast with emergency checkpointing. I'm afraid not saving checkpoints may be worse, especially since we have a custom method deployed to delete lingering pods. For example, for a 70B model, I'm seeing in-memory checkpoint saving take about 7 seconds.
This doesn't stop checkpoint saving on other nodes without hardware faults. The intent of this PR is to make the faulty nodes terminate and shut down as soon as possible, while the other nodes do the checkpoint saving, so that the overall recovery time is reduced. To illustrate:

Before this PR, node 0 (without fault): running -> JobSet restart -> checkpoint saving -> pod terminate -> job creation -> pod start

After this PR, node 1 (faulty): running -> JobSet restart -> pod terminate -> node shutdown for repair -> job creation -> pod start

Since node 1 likely has the longest recovery time before the whole workload is back running, skipping its checkpoint saving step reduces the overall recovery time. Besides that, if there is a hardware fault such as a link error, checkpoint saving may fail anyway.
This pull request has been automatically marked as stale because it has been inactive for 60 days. It will be closed in 7 days if no further activity occurs. If you would like to continue working on this, please remove the
This pull request was closed because it has been inactive for more than 7 days since being marked as stale. Please feel free to reopen it if you would like to continue. |
When there are system errors on a host (e.g., hardware errors), the system is given only a limited timeout to shut down, so checkpoint saving is likely unable to complete; it is also unnecessary when other slices without system errors can save the full checkpoint.
This change skips checkpoint saving on such hosts to reduce termination latency, which improves the overall recovery time.