Measure per-step cost of pi07_paligemma outlier check on 8×A100 (profile_step.py) #360

@shuheng-liu

Background

PR #357 added warn_outlier_threshold (default 10.0) to PI07PaligemmaLowLevelConfig, which makes the training forward pass run _warn_state_action_outliers on every step. Even on the common no-outlier path, the check still performs two small max reductions plus one bool(torch.cat(...).any()), which forces a 1-byte device→host sync per step.

During review we estimated the per-step cost on 8×A100 analytically: typically <0.1% of step wall-clock (sub-ms/step), with a pathological ceiling of ~0.25%. The reasoning was that the sync sits at the start of the forward pass, before the heavy backbone, and that update_policy already forces a per-step D2H via gather_for_metrics(...).item() (src/opentau/scripts/train.py:123-125) on top of the DDP gradient all-reduce. The estimate was never measured on real hardware.

Task

Add a profile_step.py-based micro-benchmark that measures the actual per-step wall-clock delta of the outlier check, default-on vs disabled, for pi07_paligemma_low_level:

  • Run src/opentau/scripts/profile_step.py on a GPU box with --policy.warn_outlier_threshold=10.0 (default-on) and --policy.warn_outlier_threshold=0 (disabled, early-returns before any sync).
  • Compare the per-step forward/total wall-clock breakdown between the two.
  • Confirm the overhead is in the noise (or quantify it if not), and confirm the threshold <= 0 path is truly zero-overhead (no D2H).
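One way to turn the two runs into a single before/after number is sketched below. It assumes the per-step total wall-clock samples (in milliseconds) from each run have already been collected into plain lists; how profile_step.py reports them is not specified here, and compare_runs and its field names are hypothetical. Using the standard deviation of the disabled run as the noise floor is also just one reasonable choice.

```python
from statistics import median, stdev


def compare_runs(enabled_ms, disabled_ms):
    """Compare per-step wall-clock samples from the two runs.

    enabled_ms:  samples with warn_outlier_threshold=10.0 (check on)
    disabled_ms: samples with warn_outlier_threshold=0 (check off)
    Both are assumed to be lists of at least two floats, in milliseconds.
    """
    baseline = median(disabled_ms)
    delta = median(enabled_ms) - baseline
    # Spread of the baseline run, used as a rough noise floor.
    noise = stdev(disabled_ms)
    return {
        "delta_ms": delta,
        "overhead_pct": 100.0 * delta / baseline,
        "noise_ms": noise,
        "within_noise": abs(delta) <= noise,
    }
```

Medians are used rather than means so that a few slow steps (warm-up, checkpointing) do not dominate the comparison. Discarding the first few steps of each run before calling the helper would reduce that effect further.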

Acceptance

  • A measured before/after number posted here (and ideally a short note in the config docstring or PR thread).
  • If the measured cost is material (say >0.5% of step time), follow up with mitigation (e.g. step-sampling the check, or flipping the default to opt-in).
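If mitigation turns out to be needed, step-sampling could look roughly like the sketch below. This is not the existing implementation: should_run_check, amortized_cost_ms, and the sample_every parameter are hypothetical names introduced only for illustration, and no such config field exists in PR #357.

```python
def should_run_check(step: int, sample_every: int) -> bool:
    """Return True on the steps where the outlier check would run.

    sample_every is a hypothetical setting: 1 checks every step,
    N checks every N-th step, and values <= 0 disable the check.
    """
    if sample_every <= 0:
        return False
    return step % sample_every == 0


def amortized_cost_ms(check_cost_ms: float, sample_every: int) -> float:
    """Average per-step cost when the check runs once every sample_every steps."""
    return check_cost_ms / sample_every
```

The trade-off is that outliers on skipped steps go unreported, so sampling suits a warning that flags persistent data problems better than one meant to catch a single bad batch.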

References

  • Implementation: _warn_state_action_outliers in src/opentau/policies/pi07_paligemma/low_level/modeling_pi07_low_level.py
  • Config field: warn_outlier_threshold in src/opentau/policies/pi07_paligemma/low_level/configuration_pi07_low_level.py
  • Profiler: src/opentau/scripts/profile_step.py
  • Origin: review of PR #357 (feat: dataset provenance fields + pi07_paligemma outlier warning), item 1

Labels

optimization, test
