Feature/checkpointing wall clock #1914
CodersAcademy006 wants to merge 2 commits into google-deepmind:main
Conversation
Force-pushed from f209d57 to b58a6fe
Converted to draft to invite early feedback on the checkpointing hook location
```python
"""run_loop for iterating over the simulation step function."""

import time
from typing import TYPE_CHECKING
...
    step_fn: step_function.SimulationStepFn,
    log_timestep_info: bool = False,
    progress_bar: bool = True,
    torax_config: 'model_config.ToraxConfig | None' = None,
```
`torax_config` does not need to be Optional. No problem changing the internal API to always pass it in.
```python
  # Import here to avoid circular dependency
  from torax._src.output_tools import output as output_module
```
Only import at the top of the module. If you have a circular dependency, consider another way to solve it.
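For reference, when a cycle exists only because of type annotations, the standard fix is the `typing.TYPE_CHECKING` guard already used for the `torax_config` annotation in this diff. A minimal sketch (the `model_config` module path here is an assumption, and `output_module` is a genuine runtime dependency, so it instead needs restructuring, e.g. moving shared code into a third module):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
  # Resolved only by static type checkers, never executed at runtime,
  # so it cannot create a circular import. This only helps for imports
  # that are used purely in type annotations.
  from torax._src.config import model_config


def run_loop(torax_config: 'model_config.ToraxConfig') -> None:
  # The annotation is a string, so model_config is never needed at runtime.
  ...
```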
```python
    post_processing_history.append(post_processed_outputs)

    # Periodic checkpointing
    if torax_config is not None and torax_config.checkpointing.enabled:
```
Cleaner if `torax_config` is always there, so you would just need to query the checkpointing object.
You should also consider the case where `torax_config.checkpointing` may be None, depending on how you handle a user config dict with no `checkpointing` key when building a ToraxConfig.
```python
    state_history.append(current_state)
    post_processing_history.append(post_processed_outputs)

    # Periodic checkpointing
```
Abstract all this new logic away into a private helper function which returns `should_checkpoint`. Do not inline it all here.
```python
from typing_extensions import Self
...


class CheckpointConfig(torax_pydantic.BaseModelFrozen):
```
It's better to have this class in its own module in torax_pydantic. Follow the same pattern as `file_restart`.
```python
  every_n_seconds: float | None = None
  path: str | None = None

  @pydantic.model_validator(mode='after')
```
Add a simple test for this validator, in `torax_pydantic/tests`.
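A self-contained sketch of the cases such a test should cover. To keep the example runnable without torax or pydantic installed, `_validate` mirrors the validator's logic in plain Python; the real test would instantiate `CheckpointConfig` and use `pytest.raises`:

```python
def _validate(enabled, every_n_steps=None, every_n_seconds=None, path=None):
  """Plain-Python stand-in for CheckpointConfig's model validator."""
  if not enabled:
    return
  if every_n_steps is None and every_n_seconds is None:
    raise ValueError('checkpointing requires every_n_steps or every_n_seconds')
  if every_n_steps is not None and every_n_steps <= 0:
    raise ValueError('checkpointing.every_n_steps must be positive')
  if every_n_seconds is not None and every_n_seconds <= 0:
    raise ValueError('checkpointing.every_n_seconds must be positive')
  if not path:
    raise ValueError('checkpointing.path must be set when checkpointing is enabled')


def test_enabled_requires_an_interval():
  try:
    _validate(enabled=True, path='/tmp/ckpt')
  except ValueError as e:
    assert 'every_n_steps or every_n_seconds' in str(e)
  else:
    raise AssertionError('expected ValueError')


def test_valid_config_passes():
  # A step interval alone satisfies the validator; no error expected.
  _validate(enabled=True, every_n_steps=10, path='/tmp/ckpt')
```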
```python
    ):
      raise ValueError(
          'checkpointing requires every_n_steps or every_n_seconds'
      )
```
You can use Pydantic native types for this and avoid the manual validation: `pydantic.PositiveInt`.
Also, `every_n_seconds` can be a `pydantic.PositiveFloat`.
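A sketch of the suggested field types. This uses a plain `pydantic.BaseModel` stand-in (the real class derives from `torax_pydantic.BaseModelFrozen`) and assumes pydantic v2:

```python
from typing import Optional

import pydantic


class CheckpointConfig(pydantic.BaseModel):
  enabled: bool = False
  # PositiveInt / PositiveFloat reject zero and negative values at
  # validation time, so the explicit "> 0" checks can be dropped.
  every_n_steps: Optional[pydantic.PositiveInt] = None
  every_n_seconds: Optional[pydantic.PositiveFloat] = None
  path: Optional[str] = None

  @pydantic.model_validator(mode='after')
  def _check_interval(self):
    # Only the cross-field requirement still needs custom logic.
    if (
        self.enabled
        and self.every_n_steps is None
        and self.every_n_seconds is None
    ):
      raise ValueError(
          'checkpointing requires every_n_steps or every_n_seconds'
      )
    return self
```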
```python
      raise ValueError(
          'checkpointing.every_n_steps must be positive'
      )
    if self.every_n_seconds is not None and self.every_n_seconds <= 0:
      raise ValueError(
          'checkpointing.every_n_seconds must be positive'
      )
    if not self.path:
      raise ValueError(
          'checkpointing.path must be set when checkpointing is enabled'
      )
```
To make it easier on the user, this should default to the standard output path. Since this crosses different Pydantic objects, this validation should be done at the torax_config level.
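A hedged sketch of what the top-level defaulting could look like. The field name `output_dir` and both model shapes are assumptions (the real ToraxConfig has many more fields), and this again uses plain pydantic v2 models as stand-ins:

```python
from typing import Optional

import pydantic


class CheckpointConfig(pydantic.BaseModel):
  enabled: bool = False
  path: Optional[str] = None


class ToraxConfig(pydantic.BaseModel):
  # Stand-in for wherever the standard output path lives in the real config.
  output_dir: str = '/tmp/torax_output'
  checkpointing: CheckpointConfig = CheckpointConfig()

  @pydantic.model_validator(mode='after')
  def _default_checkpoint_path(self):
    # Cross-object default: fill in checkpointing.path from the run's
    # output directory when the user did not set it explicitly.
    if self.checkpointing.enabled and self.checkpointing.path is None:
      self.checkpointing = self.checkpointing.model_copy(
          update={'path': self.output_dir}
      )
    return self
```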
|
Hey @jcitrin, apologies if this is the wrong place to do this (I'm not a contributor, just an interested observer), but I have an extremely strong suspicion that the account that opened this PR is pretty much exclusively copy-pasting from an LLM with little/no actual understanding of the underlying changes. I don't think I've seen a single comment from them that looks hand-written, and they've spammed several repos - particularly tensorflow - with PRs in the last several months. Here are a few examples: tensorflow/tensorflow#105371 I'm not sure if this makes a difference to you, but I just wanted to call it out in case it saves maintainer time and energy avoiding some slop.
|
Hey @eganwall. Thanks for reaching out! I appreciate your concerns. We are aware of the use of genAI by external contributors, and have some guidance on this in our contribution docs (which could be expanded): https://torax.readthedocs.io/en/latest/contributing.html#contributing-tips In general the policy is to accommodate external PRs unless the quality of the original PR or the iterative process makes it clear that the time investment in reviewing exceeds the time in doing the work ourselves. This is judged on a case-by-case basis. We also label many of the more complex issues as "domain expertise required" to gatekeep, and tend to avoid accepting external PRs from non-known-collaborators on those.
|
Apologies, but closing for now due to lack of activity, low prioritization, and lack of review capacity. We can consider reopening this later.
This PR extends the existing opt-in checkpointing infrastructure to support
wall-clock–based triggers in addition to solver-step–based triggers.
Checkpoints are written when a configurable wall-clock interval has elapsed,
using the existing NetCDF output and restart-compatible format. The feature is
fully optional, backward compatible, and does not modify solver or physics
behavior.