[DRAFT] Modify CosWithWarmupAndLinearDecay tail-handoff and missing super().__post_init__()#676
Draft
AkshitaB wants to merge 4 commits into
Draft
[DRAFT] Modify CosWithWarmupAndLinearDecay tail-handoff and missing super().__post_init__()#676AkshitaB wants to merge 4 commits into
AkshitaB wants to merge 4 commits into
Conversation
Three changes: 1. Adds the missing super().__post_init__() call. Without it, the deprecated 'warmup_steps' alias never gets migrated on the subclass, and the parent's warmup-fraction validations are skipped. 2. Hard-codes the linear-decay-tail starting LR to initial_lr * alpha_f (rather than computing it via super().get_lr(initial_lr, t_max-decay, t_max)). Combined with (3), this guarantees the tail starts from the value the cosine *fully completes at*, regardless of any floating-point drift at the boundary. 3. Passes t_max - decay to super().get_lr(...) for the main cosine region, so the cosine completes (reaches alpha_f * peak) by step t_max - decay rather than running to t_max. The linear tail then carries from alpha_f * peak down to decay_min_lr over the remaining 'decay' steps. Net effect: 'warmup -> cosine fully completing to alpha_f*peak by t_max - decay -> linear tail to decay_min_lr over decay steps'. Before, the cosine ran across the full t_max and the linear tail started from wherever the cosine happened to be at t_max - decay — i.e. the tail attached mid-flight, not at the cosine's terminal value. Originally introduced in cb9d4a3 ("add upcycle convert and train code") on akshitab/olmoe-dev-v2; only the scheduler.py portion is carried over here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the leftover '# final_cosine_lr = super().get_lr(...)' and '# return super().get_lr(...)' reference comments and a trailing- whitespace line that came in with the rewrite, and expands the inline comment to make the t_max-decay handoff semantics explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_cos_with_warmup_and_linear_decay: verifies the three-phase schedule end-to-end. The cosine region matches a bare CosWithWarmup evaluated against t_max = (full_t_max - decay). The tail starts at alpha_f * peak at exactly t_max - decay, anneals linearly to decay_min_lr by t_max. - test_cos_with_warmup_and_linear_decay_migrates_deprecated_warmup_steps: confirms the new super().__post_init__() call propagates the deprecated 'warmup_steps' alias migration into the subclass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
TODO: unclear if we actually want this behavior.
Rewrites the tail-handoff semantics of
CosWithWarmupAndLinearDecayand fixes a missingsuper().__post_init__()call. Original work by @Tianhua-Tao; this PR ports just the scheduler portion ontomain(preserving authorship/date) and adds tests + a CHANGELOG entry.What's changed
1. Tail starts at the cosine's terminal value, not its mid-flight value
Before: The cosine ran across the full
t_max. At stept_max - decay, the linear tail attached at whatever value the cosine happened to produce there (call thatcos(t_max - decay, t_max)— a value somewhere between peak andalpha_f * peak, depending on horizon andalpha_f). The tail then anneals from that mid-flight value todecay_min_lr.After: The cosine completes (reaches
alpha_f * peak) by stept_max - decay. Achieved by passingt_max - decaytosuper().get_lr(...)for the cosine region, and hard-coding the tail's starting value toinitial_lr * alpha_f. The linear tail then anneals fromalpha_f * peakdown todecay_min_lrover the lastdecaysteps.Net effect: the class now matches the schedule its name promises — "warmup → cosine completing to
alpha_f * peakbyt_max - decay→ linear tail todecay_min_lroverdecaysteps."2. Missing
super().__post_init__()callCosWithWarmupAndLinearDecay.__post_init__validated its own fields but never invokedsuper().__post_init__(). This meant the parent's deprecated-alias migration (warmup_steps→warmup) was skipped on the subclass, and the parent'swarmup_fractionrange check ran only via duplicated logic in the child. Nowsuper().__post_init__()is called at the end.Tests
src/test/optim/scheduler_test.pygains:test_cos_with_warmup_and_linear_decay— verifies the three-phase schedule. The cosine region matches a bareCosWithWarmupevaluated againstt_max = (full_t_max - decay). The tail starts atalpha_f * peakat exactlyt_max - decay, anneals linearly todecay_min_lrbyt_max.test_cos_with_warmup_and_linear_decay_migrates_deprecated_warmup_steps— confirms thesuper().__post_init__()call now propagates thewarmup_steps→warmupmigration and emits the deprecation warning.34 tests pass;
make checksclean (isort, black, ruff, mypy across all 381 source files).Commits
0dd3f4c5CosWithWarmupAndLinearDecaytail semantics (the core change)f824210d# ...reference comments and trailing whitespaceb7ab9a438834192cCompatibility note
Existing callers of
CosWithWarmupAndLinearDecaywill see different LR values in the tail region. If you have runs in flight that depend on the old "tail attaches mid-cosine" behavior, this is a behavior change — but the class name already implies the new semantics, and the old behavior was buggy in the sense that the tail start was implicit/non-obvious.Test plan
pytest src/test/optim/scheduler_test.py— 34 passedmake checks(isort + black + ruff + mypy) — cleanmaindepend on the old tail-attachment behavior ofCosWithWarmupAndLinearDecay🤖 Generated with Claude Code