Background
PR #357 added warn_outlier_threshold (default 10.0) to PI07PaligemmaLowLevelConfig, which makes the training forward run _warn_state_action_outliers every step. On the common no-outlier path that still does two small max reductions plus one bool(torch.cat(...).any()), which forces a 1-byte device→host sync per step.
During review we estimated the per-step cost on 8×A100 analytically at typically <0.1% of step wall-clock (sub-ms/step), with a ~0.25% pathological ceiling. The reasoning was that the sync sits at the start of forward, before the heavy backbone, and that update_policy already forces a per-step D2H via gather_for_metrics(...).item() (src/opentau/scripts/train.py:123-125) on top of the DDP gradient all-reduce. This estimate was never measured on real hardware.
Task
Add a profile_step.py-based micro-benchmark that measures the actual per-step wall-clock delta of the outlier check, default-on vs disabled, for pi07_paligemma_low_level:
Run src/opentau/scripts/profile_step.py on a GPU box with --policy.warn_outlier_threshold=10.0 (default-on) and --policy.warn_outlier_threshold=0 (disabled, early-returns before any sync).
Compare the per-step forward/total wall-clock breakdown between the two.
Confirm the overhead is in the noise (or quantify it if not), and confirm the threshold <= 0 path is truly zero-overhead (no D2H).
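The comparison in the steps above could be reduced to a single number along these lines. This is a minimal sketch, not part of OpenTau: `summarize_overhead` is a hypothetical helper, the output format of profile_step.py is not assumed, and the per-step timings (in milliseconds) are assumed to have been copied out of the two runs by hand. The sample values are illustrative, not measurements.

```python
import statistics


def summarize_overhead(on_ms, off_ms, material_fraction=0.005):
    """Compare per-step wall-clock samples (ms) from the default-on and disabled runs.

    material_fraction mirrors the 0.5% figure used in the acceptance criteria.
    """
    median_on = statistics.median(on_ms)
    median_off = statistics.median(off_ms)
    delta_ms = median_on - median_off
    # Overhead relative to the disabled run's median step time.
    relative = delta_ms / median_off
    # Crude noise floor: the larger sample standard deviation of the two runs.
    noise_ms = max(statistics.stdev(on_ms), statistics.stdev(off_ms))
    return {
        "median_on_ms": median_on,
        "median_off_ms": median_off,
        "delta_ms": delta_ms,
        "relative_overhead": relative,
        "within_noise": abs(delta_ms) <= noise_ms,
        "material": relative > material_fraction,
    }


# Illustrative numbers only (hypothetical, not measured on hardware).
threshold_on = [500.4, 501.0, 499.8, 500.6, 500.2]   # warn_outlier_threshold=10.0
threshold_off = [500.0, 500.5, 499.5, 500.3, 499.9]  # warn_outlier_threshold=0
print(summarize_overhead(threshold_on, threshold_off))
```

Medians are used rather than means so that a few slow steps (warm-up, data-loader stalls) do not dominate the delta. A real run would want many more samples than shown here.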
Acceptance
A measured before/after number posted here (and ideally a short note in the config docstring or PR thread).
If the measured cost is material (say >0.5% of step time), follow up with mitigation (e.g. step-sampling the check, or flipping the default to opt-in).
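If mitigation turns out to be needed, the step-sampling option could look roughly like the following. This is a sketch under stated assumptions: `SampledCheck` is a hypothetical wrapper rather than existing OpenTau code, and the wrapped callable stands in for the real outlier check. Running the check once every N steps amortizes its device→host sync over N steps instead of paying it on every step.

```python
from typing import Callable


class SampledCheck:
    """Run an expensive check only on every `every_n`-th training step."""

    def __init__(self, check: Callable[[], None], every_n: int) -> None:
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self._check = check
        self._every_n = every_n
        self.calls = 0  # number of times the wrapped check actually ran

    def maybe_run(self, step: int) -> bool:
        """Run the check if `step` is a multiple of `every_n`; return whether it ran."""
        if step % self._every_n != 0:
            return False
        self._check()
        self.calls += 1
        return True


# Stand-in for the real outlier check (hypothetical placeholder).
sampled = SampledCheck(check=lambda: None, every_n=10)
for step in range(100):
    sampled.maybe_run(step)
print(sampled.calls)
```

The trade-off is detection latency: an outlier batch that lands between sampled steps is not reported, so `every_n` would need to be small enough that persistent outliers are still caught quickly.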
References
Implementation: _warn_state_action_outliers in src/opentau/policies/pi07_paligemma/low_level/modeling_pi07_low_level.py
Config field: warn_outlier_threshold in src/opentau/policies/pi07_paligemma/low_level/configuration_pi07_low_level.py
Profiling script: src/opentau/scripts/profile_step.py