Feature/checkpointing wall clock #1914

Closed
CodersAcademy006 wants to merge 2 commits into google-deepmind:main from CodersAcademy006:feature/checkpointing-wall-clock

Conversation

@CodersAcademy006

This PR extends the existing opt-in checkpointing infrastructure to support
wall-clock–based triggers in addition to solver-step–based triggers.

Checkpoints are written when a configurable wall-clock interval has elapsed,
using the existing NetCDF output and restart-compatible format. The feature is
fully optional, backward compatible, and does not modify solver or physics
behavior.
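The wall-clock trigger described above can be sketched as a small stand-alone class. This is an illustrative example only, not the PR's actual code; all names here are hypothetical:

```python
import time


# Illustrative sketch of a wall-clock checkpoint trigger; names are
# hypothetical and not the PR's actual implementation.
class WallClockTrigger:
  """Fires once each time `interval_seconds` of wall-clock time elapses."""

  def __init__(self, interval_seconds: float):
    self._interval = interval_seconds
    # time.monotonic() is unaffected by system clock adjustments, unlike
    # time.time(), so it is the safer base for elapsed-time triggers.
    self._last = time.monotonic()

  def should_fire(self) -> bool:
    now = time.monotonic()
    if now - self._last >= self._interval:
      self._last = now  # Reset so the next interval starts from now.
      return True
    return False
```

A run loop would call `should_fire()` once per solver step and write a checkpoint whenever it returns True.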

@CodersAcademy006 force-pushed the feature/checkpointing-wall-clock branch from f209d57 to b58a6fe on January 18, 2026 at 09:11
@CodersAcademy006
Author

CodersAcademy006 commented Jan 18, 2026

Converted to draft to invite early feedback on the checkpointing hook location
and overall approach before finalizing. Happy to iterate based on maintainer
guidance.
#1894

"""run_loop for iterating over the simulation step function."""

import time
from typing import TYPE_CHECKING
Collaborator


No need for this

step_fn: step_function.SimulationStepFn,
log_timestep_info: bool = False,
progress_bar: bool = True,
torax_config: 'model_config.ToraxConfig | None' = None,
Collaborator


torax_config does not need to be optionally None. No problem changing the internal API to always pass it in.

Comment on lines +173 to +174
# Import here to avoid circular dependency
from torax._src.output_tools import output as output_module
Collaborator


only import at the top. If you have a circular dependency then consider another way to solve it
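One common way to remove a cycle like this, when the module really is needed at runtime, is dependency injection: the run loop accepts the checkpoint writer as a callable instead of importing the output module at all. A minimal generic sketch, with all names hypothetical and unrelated to torax's actual API:

```python
from typing import Callable


# Sketch: the caller injects the checkpoint writer, so the run loop
# never imports the output module and the import cycle disappears.
def run_steps(
    n_steps: int,
    write_checkpoint: Callable[[int], None],
    every_n_steps: int,
) -> None:
  """Runs `n_steps` steps, checkpointing every `every_n_steps` steps."""
  for step in range(1, n_steps + 1):
    # ... advance the simulation state here ...
    if step % every_n_steps == 0:
      write_checkpoint(step)
```

The run loop then depends only on the callable's signature; the output module can import the run loop (or neither imports the other), which breaks the cycle.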

post_processing_history.append(post_processed_outputs)

# Periodic checkpointing
if torax_config is not None and torax_config.checkpointing.enabled:
Collaborator


Cleaner if torax_config is always there so would just need to query the checkpointing object.

Should also consider the case that torax_config.checkpointing may be None, depending on how you handle the case of having no checkpointing key in the user config dict that is used to build a ToraxConfig.

state_history.append(current_state)
post_processing_history.append(post_processed_outputs)

# Periodic checkpointing
Collaborator


abstract all this new logic away into a private helper function that returns should_checkpoint. Do not inline it all here

from typing_extensions import Self


class CheckpointConfig(torax_pydantic.BaseModelFrozen):
Collaborator


it's better to have this class in its own module in torax_pydantic. Follow the same pattern as file_restart

every_n_seconds: float | None = None
path: str | None = None

@pydantic.model_validator(mode='after')
Collaborator


add a simple test for this validator, in torax_pydantic/tests

):
raise ValueError(
'checkpointing requires every_n_steps or every_n_seconds'
)
Collaborator


can use Pydantic native types for this and avoid the hand-written validation: pydantic.PositiveInt

also every_n_seconds can be a pydantic.PositiveFloat

raise ValueError(
'checkpointing.every_n_steps must be positive'
)
if self.every_n_seconds is not None and self.every_n_seconds <= 0:
Collaborator


see comment above

)
if not self.path:
raise ValueError(
'checkpointing.path must be set when checkpointing is enabled'
Collaborator


to make it easier on the user, this should default to the standard output path. Since this crosses different Pydantic objects, this validation should be done on the torax_config level

@eganwall

Hey @jcitrin, apologies if this is the wrong place to raise this (I'm not a contributor, just an interested observer), but I strongly suspect the account that opened this PR is almost exclusively copy-pasting from an LLM with little or no actual understanding of the underlying changes. I don't think I've seen a single comment from them that looks hand-written, and they've spammed several repos, particularly tensorflow, with PRs over the last several months. Here are a few examples:

tensorflow/tensorflow#105371
tensorflow/tensorflow#105372
tensorflow/tensorflow#105370

I'm not sure if this makes a difference to you, but I just wanted to call it out in case it saves maintainer time and energy avoiding some slop.

@jcitrin
Collaborator

jcitrin commented Feb 22, 2026

Hey @eganwall . Thanks for reaching out! I appreciate your concerns. We are aware of the use of genAI by external contributors, and have some guidance on this in our contribution docs (which could be expanded): https://torax.readthedocs.io/en/latest/contributing.html#contributing-tips

In general, the policy is to accommodate external PRs unless the quality of the original PR, or of the iterative review process, makes it clear that the time invested in reviewing exceeds the time it would take to do the work ourselves. This is judged on a case-by-case basis.

We also label many of the more complex issues as "domain expertise required" to gatekeep, and tend to avoid accepting external PRs from non-known-collaborators on those.

@jcitrin
Collaborator

jcitrin commented Mar 24, 2026

Apologies but closing for now due to lack of activity, low prioritization, and lack of review capacity. We can consider reopening this later.

@jcitrin jcitrin closed this Mar 24, 2026