ddp bench #6

Open
fab2s wants to merge 31 commits into main from ddp-bench
fab2s (Owner) commented Apr 9, 2026

No description provided.

fab2s added 30 commits April 9, 2026 22:36
- Coordinator: SyncAck must not inflate steps_since_avg and must satisfy nccl_ack.
- Worker: scheduler output reaches the optimizer, lr_scale multiplies it, and the
  scheduler step argument advances per batch.
- Graph: set_scheduler drives the optimizer LR through step(), lr_scale multiplies
  it, and no scheduler leaves the LR untouched.

Adds GpuWorker::current_lr() accessor for test introspection.
Asserts that the same MultiStepLR produces identical optimizer LR
trajectories across all three training paths: manual (reference), GpuWorker
(builder), and Graph::step() (sync). Covers both unscaled and lr_scale != 1.0.
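
The trajectory assertion described above can be sketched as follows. This is a minimal illustration, not the project's test: `MultiStepLr`, `ToyOptimizer`, `manual_trajectory`, and `driven_trajectory` are hypothetical stand-ins, and the milestone and gamma values are made up. It only shows the shape of the check: compute the scaled LR by hand, drive an optimizer through a loop, and require both sequences to match with and without `lr_scale`.

```rust
/// Hypothetical stand-in for a MultiStepLR schedule: the LR is multiplied by
/// `gamma` once for every milestone the step count has reached.
struct MultiStepLr {
    base_lr: f64,
    milestones: Vec<usize>,
    gamma: f64,
}

impl MultiStepLr {
    fn lr_at(&self, step: usize) -> f64 {
        let passed = self.milestones.iter().filter(|&&m| step >= m).count();
        self.base_lr * self.gamma.powi(passed as i32)
    }
}

/// Hypothetical optimizer that only stores the LR it was given.
struct ToyOptimizer {
    lr: f64,
}

/// Reference path: compute the scaled LR by hand for every batch.
fn manual_trajectory(s: &MultiStepLr, lr_scale: f64, steps: usize) -> Vec<f64> {
    (0..steps).map(|i| s.lr_at(i) * lr_scale).collect()
}

/// Driven path: a loop pushes the scheduler output into the optimizer each
/// batch, and the LR is read back from the optimizer afterwards.
fn driven_trajectory(s: &MultiStepLr, lr_scale: f64, steps: usize) -> Vec<f64> {
    let mut opt = ToyOptimizer { lr: s.base_lr };
    let mut seen = Vec::with_capacity(steps);
    for step in 0..steps {
        opt.lr = s.lr_at(step) * lr_scale;
        seen.push(opt.lr);
    }
    seen
}

fn main() {
    // Illustrative values only.
    let sched = MultiStepLr { base_lr: 0.1, milestones: vec![2, 4], gamma: 0.5 };
    // Cover both the unscaled case and lr_scale != 1.0.
    for lr_scale in [1.0, 0.25] {
        let a = manual_trajectory(&sched, lr_scale, 6);
        let b = driven_trajectory(&sched, lr_scale, 6);
        assert_eq!(a, b);
        println!("lr_scale={lr_scale}: {a:?}");
    }
}
```

Comparing whole trajectories rather than a single final value catches a path that applies the scheduler one step late or forgets `lr_scale` after a milestone.
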
Replace the symmetric 25%/75% drift check with a one-sided before/after pattern,
matching test_backward_frees_grad_fn_chain. live_tensor_count is a global
atomic shared with concurrent tests, so symmetric thresholds flake when other
tests' tensors are created or freed during the measurement window. A real
leak only manifests as monotonic growth, so tolerate shrinkage and assert
only that growth stays below 200 (scaled from the 100 used in the simpler test).

RSS threshold loosened from 30MB to 100MB: glibc allocator behavior under
CI memory pressure can hold pages well past last use.
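
A one-sided before/after check of the kind described above might look like the sketch below. The `LIVE` counter and `growth_within` helper are hypothetical names for illustration, not the project's code; the simulated workload and the "concurrent test" noise are assumptions chosen to show why shrinkage must pass.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical process-wide counter, standing in for a live-tensor count
/// that other concurrently running tests also modify.
static LIVE: AtomicI64 = AtomicI64::new(0);

/// One-sided check: any shrinkage passes (negative delta), and only growth
/// of `max_growth` or more is treated as a leak.
fn growth_within(before: i64, after: i64, max_growth: i64) -> bool {
    after - before < max_growth
}

fn main() {
    let before = LIVE.load(Ordering::SeqCst);

    // Simulated workload: every "tensor" created is also freed.
    for _ in 0..1_000 {
        LIVE.fetch_add(1, Ordering::SeqCst);
        LIVE.fetch_sub(1, Ordering::SeqCst);
    }

    // Simulated noise: another test frees 50 of its own tensors during the
    // measurement window. A symmetric threshold could flag this as drift.
    LIVE.fetch_sub(50, Ordering::SeqCst);

    let after = LIVE.load(Ordering::SeqCst);
    assert!(growth_within(before, after, 200));
}
```

The trade-off is that noise from other tests can still mask a small leak, which is why the growth bound is kept finite rather than dropped entirely.
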
- api_ref.rs: backtick-escape <tag> and <repo-name> placeholders that rustdoc
  was parsing as unclosed HTML tags
- coordinator.rs: qualify ConvergenceGuard link as super::ConvergenceGuard
- datasets/mod.rs: qualify BatchDataSet/DataLoader links as super::*
- pooling.rs: drop redundant explicit link target on adaptive_avg_pool2d
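
The first two kinds of fix in the list above can be illustrated with a small doc-comment sketch. The module layout and the `Coordinator` type here are hypothetical and only mirror the names mentioned in the commit message; the point is the doc-comment syntax, not the real code.

```rust
/// Hypothetical stand-in for the guard type that the docs link to.
pub struct ConvergenceGuard;

pub mod coordinator {
    /// Hypothetical coordinator type.
    ///
    /// The intra-doc link is qualified as [`super::ConvergenceGuard`] because
    /// the target lives in the parent module, not in this one.
    ///
    /// Placeholders such as `<tag>` and `<repo-name>` are wrapped in backticks
    /// so rustdoc renders them as code instead of parsing them as unclosed
    /// HTML tags.
    pub struct Coordinator;
}

fn main() {
    // Nothing to run: the interesting part is how `cargo doc` treats the
    // doc comments above.
    let _ = coordinator::Coordinator;
    let _ = ConvergenceGuard;
}
```
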
@fab2s fab2s mentioned this pull request Apr 13, 2026