Feature/checkpointing wall clock #1914
CodersAcademy006 wants to merge 2 commits into google-deepmind:main
Conversation
Force-pushed from f209d57 to b58a6fe
Converted to draft to invite early feedback on the checkpointing hook location
```python
"""run_loop for iterating over the simulation step function."""

import time
from typing import TYPE_CHECKING
...
    step_fn: step_function.SimulationStepFn,
    log_timestep_info: bool = False,
    progress_bar: bool = True,
    torax_config: 'model_config.ToraxConfig | None' = None,
```
`torax_config` does not need to be Optional. No problem changing the internal API to always pass it in.
```python
  # Import here to avoid circular dependency
  from torax._src.output_tools import output as output_module
```
Only import at the top of the module. If you have a circular dependency, consider another way to solve it.
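For reference, when a cycle exists only because of type annotations, the standard fix is the `typing.TYPE_CHECKING` guard already used for the `torax_config` annotation in this diff. A minimal sketch (the `model_config` module path here is an assumption, and `output_module` is a genuine runtime dependency, so it instead needs restructuring, e.g. moving shared code into a third module):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
  # Resolved only by static type checkers, never executed at runtime,
  # so it cannot create a circular import. This only helps for imports
  # that are used purely in type annotations.
  from torax._src.config import model_config


def run_loop(torax_config: 'model_config.ToraxConfig') -> None:
  # The annotation is a string, so model_config is never needed at runtime.
  ...
```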
```python
    post_processing_history.append(post_processed_outputs)

    # Periodic checkpointing
    if torax_config is not None and torax_config.checkpointing.enabled:
```
Cleaner if `torax_config` is always there, so you would just need to query the checkpointing object.
You should also consider the case where `torax_config.checkpointing` may be None, depending on how you handle a user config dict with no `checkpointing` key when building a ToraxConfig.
```python
    state_history.append(current_state)
    post_processing_history.append(post_processed_outputs)

    # Periodic checkpointing
```
Abstract all this new logic away into a private helper function which returns `should_checkpoint`. Do not inline it all here.
```python
from typing_extensions import Self
...


class CheckpointConfig(torax_pydantic.BaseModelFrozen):
```
It's better to have this class in its own module in torax_pydantic. Follow the same pattern as `file_restart`.
```python
  every_n_seconds: float | None = None
  path: str | None = None

  @pydantic.model_validator(mode='after')
```
Add a simple test for this validator, in `torax_pydantic/tests`.
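A self-contained sketch of the cases such a test should cover. To keep the example runnable without torax or pydantic installed, `_validate` mirrors the validator's logic in plain Python; the real test would instantiate `CheckpointConfig` and use `pytest.raises`:

```python
def _validate(enabled, every_n_steps=None, every_n_seconds=None, path=None):
  """Plain-Python stand-in for CheckpointConfig's model validator."""
  if not enabled:
    return
  if every_n_steps is None and every_n_seconds is None:
    raise ValueError('checkpointing requires every_n_steps or every_n_seconds')
  if every_n_steps is not None and every_n_steps <= 0:
    raise ValueError('checkpointing.every_n_steps must be positive')
  if every_n_seconds is not None and every_n_seconds <= 0:
    raise ValueError('checkpointing.every_n_seconds must be positive')
  if not path:
    raise ValueError('checkpointing.path must be set when checkpointing is enabled')


def test_enabled_requires_an_interval():
  try:
    _validate(enabled=True, path='/tmp/ckpt')
  except ValueError as e:
    assert 'every_n_steps or every_n_seconds' in str(e)
  else:
    raise AssertionError('expected ValueError')


def test_valid_config_passes():
  # A step interval alone satisfies the validator; no error expected.
  _validate(enabled=True, every_n_steps=10, path='/tmp/ckpt')
```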
```python
    ):
      raise ValueError(
          'checkpointing requires every_n_steps or every_n_seconds'
      )
```
You can use Pydantic native types for this and avoid the manual validation: `pydantic.PositiveInt`.
Also, `every_n_seconds` can be a `pydantic.PositiveFloat`.
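A sketch of the suggested field types. This uses a plain `pydantic.BaseModel` stand-in (the real class derives from `torax_pydantic.BaseModelFrozen`) and assumes pydantic v2:

```python
from typing import Optional

import pydantic


class CheckpointConfig(pydantic.BaseModel):
  enabled: bool = False
  # PositiveInt / PositiveFloat reject zero and negative values at
  # validation time, so the explicit "> 0" checks can be dropped.
  every_n_steps: Optional[pydantic.PositiveInt] = None
  every_n_seconds: Optional[pydantic.PositiveFloat] = None
  path: Optional[str] = None

  @pydantic.model_validator(mode='after')
  def _check_interval(self):
    # Only the cross-field requirement still needs custom logic.
    if (
        self.enabled
        and self.every_n_steps is None
        and self.every_n_seconds is None
    ):
      raise ValueError(
          'checkpointing requires every_n_steps or every_n_seconds'
      )
    return self
```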
```python
      raise ValueError(
          'checkpointing.every_n_steps must be positive'
      )
    if self.every_n_seconds is not None and self.every_n_seconds <= 0:
      raise ValueError(
          'checkpointing.every_n_seconds must be positive'
      )
    if not self.path:
      raise ValueError(
          'checkpointing.path must be set when checkpointing is enabled'
      )
```
To make it easier on the user, this should default to the standard output path. Since this crosses different Pydantic objects, this validation should be done at the torax_config level.
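A hedged sketch of what the top-level defaulting could look like. The field name `output_dir` and both model shapes are assumptions (the real ToraxConfig has many more fields), and this again uses plain pydantic v2 models as stand-ins:

```python
from typing import Optional

import pydantic


class CheckpointConfig(pydantic.BaseModel):
  enabled: bool = False
  path: Optional[str] = None


class ToraxConfig(pydantic.BaseModel):
  # Stand-in for wherever the standard output path lives in the real config.
  output_dir: str = '/tmp/torax_output'
  checkpointing: CheckpointConfig = CheckpointConfig()

  @pydantic.model_validator(mode='after')
  def _default_checkpoint_path(self):
    # Cross-object default: fill in checkpointing.path from the run's
    # output directory when the user did not set it explicitly.
    if self.checkpointing.enabled and self.checkpointing.path is None:
      self.checkpointing = self.checkpointing.model_copy(
          update={'path': self.output_dir}
      )
    return self
```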
|
Hey @jcitrin, apologies if this is the wrong place to do this (I'm not a contributor, just an interested observer), but I have an extremely strong suspicion that the account that opened this PR is pretty much exclusively copy-pasting from an LLM with little/no actual understanding of the underlying changes. I don't think I've seen a single comment from them that looks hand-written, and they've spammed several repos - particularly tensorflow - with PRs in the last several months. Here are a few examples: tensorflow/tensorflow#105371 I'm not sure if this makes a difference to you, but I just wanted to call it out in case it saves maintainer time and energy avoiding some slop.
|
Hey @eganwall. Thanks for reaching out! I appreciate your concerns. We are aware of the use of genAI by external contributors, and have some guidance on this in our contribution docs (which could be expanded): https://torax.readthedocs.io/en/latest/contributing.html#contributing-tips In general the policy is to accommodate external PRs unless the quality of the original PR or the iterative process makes it clear that the time investment in reviewing exceeds the time in doing the work ourselves. This is judged on a case-by-case basis. We also label many of the more complex issues as "domain expertise required" to gatekeep, and tend to avoid accepting external PRs from non-known-collaborators on those.
|
Apologies, but closing for now due to lack of activity, low prioritization, and lack of review capacity. We can consider reopening this later.
This PR extends the existing opt-in checkpointing infrastructure to support
wall-clock–based triggers in addition to solver-step–based triggers.
Checkpoints are written when a configurable wall-clock interval has elapsed,
using the existing NetCDF output and restart-compatible format. The feature is
fully optional, backward compatible, and does not modify solver or physics
behavior.