Changes from all commits
5822 commits
f559059
Ko3n1g/ci/test iteration time (#2067)
ko3n1g Oct 31, 2025
818e072
ci(hotfix): Remove performance for ckpt-resume
ko3n1g Oct 31, 2025
f248fcb
Allow inference test throughput to vary by 10% (#2070)
mathemakitten Oct 31, 2025
e715d2f
ci(hotfix): Inference test pipeline
ko3n1g Oct 31, 2025
aad8761
chore: Fix autoformatter (#2073)
ko3n1g Oct 31, 2025
e3ae351
ci(hotfix): Remove iteration-time from t5
ko3n1g Oct 31, 2025
87cbe76
ci(hotfix): disable inference test
ko3n1g Nov 1, 2025
d0d00b3
ci(hotfix): Disable inference test
ko3n1g Nov 2, 2025
88e3a8a
ci(hotfix): Bypass approvalbot in merge-queue (#2082)
ko3n1g Nov 2, 2025
53305bc
ci(hotfix): Enable merge-group for approval bot
ko3n1g Nov 2, 2025
7c16ca0
chore: Update local tooling (#2066)
ko3n1g Nov 2, 2025
dc7a0ca
Add extra RL files (#2077)
tdene Nov 2, 2025
5cfad7b
Prevent summary jobs from running in forks (#2083)
tdene Nov 2, 2025
ba21b69
ci: Fix test scope (#2091)
ko3n1g Nov 2, 2025
7ca2890
ci(hotfix): Remove publish workflows
ko3n1g Nov 3, 2025
a652e2c
Refactor the attention metadata into separate classes (#2001)
kanz-nv Nov 3, 2025
65cd27c
Guard against incorrectly using MoE prefill graphs (#2030)
tdene Nov 3, 2025
d3f1af4
Revert "Refactor the attention metadata into separate classes (#2001)"
ko3n1g Nov 3, 2025
5671e3a
Run mr-slim tests in lightweight-mode (#2106)
chtruong814 Nov 3, 2025
7487c53
Inference | Lazy compile UVM allocator. (#1977)
lmcafee-nvidia Nov 3, 2025
1307f87
chore: Reenable trustees (#2108)
ko3n1g Nov 3, 2025
282b74c
Revert "Inference | Lazy compile UVM allocator. (#1977)"
ko3n1g Nov 3, 2025
2cab46f
ci(fix): Changeset of copyright checker (#2110)
ko3n1g Nov 3, 2025
d4194b7
Ko3n1g/chore/update release settings (#2097)
ko3n1g Nov 3, 2025
5dee638
Remove unnecessary check on rotary_pos_cos (#2003)
santhnm2 Nov 4, 2025
aecce9e
(Reverted) Inference | Lazy compile UVM allocator. (#2125)
lmcafee-nvidia Nov 4, 2025
1586563
Refactor Attention Metadata to Separate Classes (#2112)
kanz-nv Nov 4, 2025
46e066b
Refactor model_provider to model_builder format for ModelOpt examples…
AAnoosheh Nov 5, 2025
26b2eb5
wandb Inference stats logging (#2026)
wdykas Nov 5, 2025
9be6d47
Make `PipelineParallelLayout` always return str from ` __repr__` (#2055)
ananthsub Nov 5, 2025
a32ff75
Add flash_attn_3 as first option for FA3 import (#2010)
santhnm2 Nov 5, 2025
f119a06
Add debugging hint for case when cudagraphs are created but no matchi…
mathemakitten Nov 5, 2025
eb48e81
ci: LTS container (#2133)
ko3n1g Nov 5, 2025
75f87c2
Revert "ci: LTS container (#2133)"
ko3n1g Nov 5, 2025
08c3771
Fix param init (#2033)
cuichenx Nov 5, 2025
f150f42
Hotfix to unit tests on hopper FA3 (#2143)
tdene Nov 5, 2025
10146c6
Add BytesIO to safe_globals (#2074)
tdene Nov 6, 2025
f167a85
add deprecation warning for legacy tokenizer system (#2145)
dimapihtar Nov 6, 2025
23a1dca
replay: ci: Bump LTS container (#2157)
ko3n1g Nov 6, 2025
0abff08
Hotfix to unit tests on hopper FA3 (bis) (#2179)
tdene Nov 7, 2025
0981e3c
Fix has_modelopt_state() for native Torch checkpoint format (#2160)
AAnoosheh Nov 7, 2025
c63b921
chore: Remove codeowners (#2175)
ko3n1g Nov 7, 2025
9aa14ed
Fix FP8 inference with sequence parallelism (#2009)
santhnm2 Nov 7, 2025
0f8fb9b
Replace ModelOpt generation server (#2147)
AAnoosheh Nov 7, 2025
e07c4a4
Add hybrid model support for dynamic inference engine (#1907)
santhnm2 Nov 7, 2025
82e846d
Async task and event loop safety in Megatron Core (#2025)
tdene Nov 10, 2025
c193bf5
Rename skip_prompt_log_probs (#2181)
tdene Nov 10, 2025
d6979d6
Dynamic inference context | UVM only. (#1983)
lmcafee-nvidia Nov 10, 2025
a59223d
Update copy-pr-bot.yaml [skip ci]
ko3n1g Nov 10, 2025
7055186
Revert "Dynamic inference context | UVM only. (#1983)"
ko3n1g Nov 10, 2025
75f7d50
ci: Run `auto-update-copy-pr-bot` only on forks (#2191)
ko3n1g Nov 10, 2025
2fef6bb
Inference throughput tests: refactor goldens to be in list format (#2…
mathemakitten Nov 10, 2025
1f6cde8
Enable TE custom quantization recipe (#2005)
negvet Nov 11, 2025
0acf6c2
Add MoE parameters to ModelOpt pruning example + conf fixes (#2205)
kevalmorabia97 Nov 11, 2025
49061f1
Add repr to pg collection class (#2089)
yashaswikarnati Nov 11, 2025
265af20
Move `data_samplers.py` from `legacy` to `training.datasets` & add `D…
asolergi-nv Nov 11, 2025
d82a6d8
Fix Megatron-FSDP checkpoint save failure (#2138)
shjwudp Nov 12, 2025
bcf2a59
Fix moe CODEOWNERS. (#2200)
jaredcasper Nov 12, 2025
08360ec
chore: Update LICENSE (#2219)
ko3n1g Nov 12, 2025
45b40bb
remove `megatron.training` dependency from `megatron.core` for FSDP c…
ananthsub Nov 12, 2025
909c746
Revert "remove `megatron.training` dependency from `megatron.core` fo…
ko3n1g Nov 12, 2025
7db8ae4
Tensorize dynamic inference mixed sampling (#2105)
tdene Nov 12, 2025
ac9221d
Revert "Tensorize dynamic inference mixed sampling (#2105)"
ko3n1g Nov 12, 2025
989d13e
Add unit test for inference DP coordinator (#2187)
tdene Nov 12, 2025
bb5a0fd
Inference linear layer (#1908)
sidsingh-nvidia Nov 12, 2025
34932c7
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Nov 13, 2025
b958982
ci(hotfix): Auto-update copy-pr-bot
github-actions[bot] Nov 13, 2025
dbc4a4f
chore: Prefer Nvidia email addresses for reminder bot (#2221)
ko3n1g Nov 13, 2025
aa4ec99
[Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter (…
shjwudp Nov 13, 2025
9d91916
Remove qwen symlink to fix for case-insensitive FS (#2235)
kevalmorabia97 Nov 13, 2025
7d3f4a0
Optimizer refactor: clean up public `get_megatron_optimizer` interfac…
deepakn94 Nov 13, 2025
9fd43fa
Fix CI for PR#1983 (#2245)
lmcafee-nvidia Nov 13, 2025
70f85eb
Enable kv cache in training for eagle (#1895)
yeyu-nvidia Nov 13, 2025
b7ef391
Fix aux-loss logging for hybrid models (#2197)
deepakn94 Nov 13, 2025
610a75e
Update flops calculation (for throughput) for hybrid MoEs (#2198)
deepakn94 Nov 13, 2025
2751749
Add MoE layer type to hybrid models (#2196)
deepakn94 Nov 13, 2025
9be7c7b
Tensorize dynamic inference mixed sampling (bis) (#2231)
tdene Nov 14, 2025
c4f83f0
Revert "Add MoE layer type to hybrid models (#2196)"
ko3n1g Nov 14, 2025
41eecc4
ci(hotfix): Checkout repo before install check
ko3n1g Nov 14, 2025
c4ba666
chore: Fix codeowners (#2264)
ko3n1g Nov 15, 2025
4696d42
Allow loading checkpoint from iteration 0 (#2199)
ananthsub Nov 17, 2025
a2d8519
ci: Skip install test in merge queue (#2281)
chtruong814 Nov 17, 2025
9a1c0d0
Add MoE layer type to hybrid models (#2259)
deepakn94 Nov 18, 2025
3df2009
Add the Hybrid-EP backend to the Flex Dispatcher (#2176)
Autumn1998 Nov 18, 2025
e8b9df1
[MAIN][NVFP4] Support NVFP4 MOE with Proper Padding (#1985)
zhongbozhu Nov 18, 2025
a755887
Update ModelOpt example readmes and advanced usage (#2273)
kevalmorabia97 Nov 18, 2025
dcd3b39
Fix UVM compatibility with CUDA 13. (#2243)
lmcafee-nvidia Nov 18, 2025
5e3fa28
ci: Add flaky marker to LTS tests (#2290)
ko3n1g Nov 18, 2025
29eed5d
Dynamic engine suspend/resume via prefill. (#1982)
lmcafee-nvidia Nov 18, 2025
3b83c3f
Revert "Dynamic engine suspend/resume via prefill. (#1982)"
ko3n1g Nov 18, 2025
19d0422
fix: Pass the timeout argument for the EP group (#2268)
yanring Nov 19, 2025
efdc681
JIT for MoE router and preprocess (#1919)
yaox12 Nov 19, 2025
00884a8
Hotfix to CI, until the fix gets reviewed (#2298)
tdene Nov 19, 2025
f885d9c
Add functional test for DP coordinator throughput (#2189)
tdene Nov 19, 2025
70db86a
Add asyncio Queue like in Python 3.13 (#2224)
tdene Nov 19, 2025
744505e
Fixes for PR#1982 (#2303)
lmcafee-nvidia Nov 19, 2025
314a378
Fix PP KV cache allocation and enable multi-node PP inference (#2182)
santhnm2 Nov 19, 2025
21968ea
Revert active-buffer-size-gb arg name. (#2257)
lmcafee-nvidia Nov 19, 2025
712dff8
feat: check: api backwards compatibility (#2251)
pablo-garay Nov 19, 2025
6c8cdd5
Add MambaInferenceStateConfig dataclass (#2265)
santhnm2 Nov 19, 2025
dc473f9
Fix typo in inference example (#2311)
santhnm2 Nov 20, 2025
7dec856
feat: initialization of API backward compatibility verification (#2310)
pablo-garay Nov 20, 2025
e4b7259
Fix Mamba TP and remove confusing legacy initialization (#2202)
jaredcasper Nov 20, 2025
8463257
Refactor KD to use ModelOpt plugins file (#2305)
AAnoosheh Nov 20, 2025
9ce2482
mcore trigger mbridge
pablo-garay Nov 20, 2025
c2b1c7c
mcore trigger mbridge
pablo-garay Nov 20, 2025
a813740
mcore trigger mbridge
pablo-garay Nov 20, 2025
7e18da2
Revert "Refactor KD to use ModelOpt plugins file (#2305)"
ko3n1g Nov 20, 2025
8e830a1
Fix dynamic context syntax and remove redundant tensors (#2336)
kanz-nv Nov 20, 2025
475d7fa
Improve asyncio exception handling (#2300)
tdene Nov 20, 2025
5ab6392
ci: Upload to testpypi only on main (#2342)
ko3n1g Nov 21, 2025
0634924
implement graph config (#2203)
kanz-nv Nov 21, 2025
ddc55cd
Revert "implement graph config (#2203)"
ko3n1g Nov 21, 2025
f7fb5ec
feat: required check adjustment (#2350)
pablo-garay Nov 21, 2025
e772e06
synthesize, optimize
pablo-garay Nov 21, 2025
2cc0736
synthesize, optimize
pablo-garay Nov 21, 2025
f426230
Change default baseline commit for api compat check
pablo-garay Nov 21, 2025
f07cb14
fix: load iteration 0 for release checkpoints (#2351)
ananthsub Nov 21, 2025
81a87e2
Break apart dynamic inference step into 2 methods (#2192)
tdene Nov 21, 2025
c90160d
Bugfix for Mamba with Chunked-Prefill (#2293)
sidsingh-nvidia Nov 21, 2025
c9d2c8f
Explicitly zero out padding token activations for dynamic inference (…
santhnm2 Nov 21, 2025
63d4e7d
Refactor KD to use ModelOpt plugins file (v2) (#2355)
AAnoosheh Nov 21, 2025
29a810e
Prevent unnecessarily overwriting the default Hugging Face chat templ…
santhnm2 Nov 21, 2025
7994405
add FIM dataset support (#2291)
dimapihtar Nov 21, 2025
e35495d
Update DEFAULT_BASELINE in workflow configuration
pablo-garay Nov 22, 2025
233b5b0
Revert "Explicitly zero out padding token activations for dynamic inf…
chtruong814 Nov 22, 2025
90c8536
Clean up DP coord code & unit test (#2277)
tdene Nov 22, 2025
8daf046
[4/4] Merge Megatron-RL into LM (#2002)
tdene Nov 22, 2025
53bbf7a
Update coordinator control logic to be compatible with RL (#2227)
tdene Nov 22, 2025
8954e04
ci: Update backwards compat check baseline to 53bbf7a (#2361)
chtruong814 Nov 22, 2025
d313c6d
Account for test regression caused by prints (#2354)
tdene Nov 22, 2025
14464d1
Remove dependency on `megatron.training` within `megatron.core` (#2274)
ananthsub Nov 22, 2025
9873958
Fixes for gpt-oss (#2038)
cuichenx Nov 22, 2025
26b2e72
update
pablo-garay Nov 24, 2025
326ec8c
[HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher (#2286)
Autumn1998 Nov 24, 2025
17cd106
ci: Remove nemo-ci environment (#2364)
chtruong814 Nov 24, 2025
278e058
ci: Pass COMMUNITY_PROJECT_ID to community bot (#2366)
chtruong814 Nov 24, 2025
d61029f
ci: Remove environment from community-bot (#2376)
chtruong814 Nov 24, 2025
9269dda
monitoring & results in mcore
pablo-garay Nov 24, 2025
77b65ed
Add mbridge_ref input to select MBridge branch
pablo-garay Nov 24, 2025
aa7a564
Fix: Use correct repo NVIDIA-NeMo/Megatron-Bridge and add mbridge_ref…
pablo-garay Nov 24, 2025
7f70e22
gha action
pablo-garay Nov 24, 2025
c28b84e
ci: Bump commit for api check to d61029f (#2386)
chtruong814 Nov 24, 2025
ab1e26e
tidy / synthesize / enhance
pablo-garay Nov 24, 2025
56e8810
Merge branch 'main' of https://github.com/NVIDIA/Megatron-LM
pablo-garay Nov 24, 2025
bc242d9
Revert: trigger_mbridge_tests.yml file change (#2389)
pablo-garay Nov 25, 2025
49eef58
build: Upgrade deps (#2289)
ko3n1g Nov 25, 2025
2a51d86
Change KV cache init to empty to speedup graph recording and first pr…
kanz-nv Nov 25, 2025
4c7d3d6
Handle UVM compile lock issues (#2299)
tdene Nov 25, 2025
14b791b
Remove experimental tags for fused kernels. (#2233)
Victarry Nov 25, 2025
ffb8c35
Reduce Overhead in Timers (#2210)
yaox12 Nov 25, 2025
60df5c2
Revert "build: Upgrade deps (#2289)"
ko3n1g Nov 25, 2025
ba9caf4
Fix the entropy sign. (#2374)
yobibyte Nov 25, 2025
77a2d8b
Remove RL use of mock dataloader and kill RL inference interface on e…
jon-barker Nov 25, 2025
6f65536
Fix block_bag for RL (#2399)
kanz-nv Nov 25, 2025
13efcb8
adding action for checking whether PR author is nvidia employee or no…
theothermike Nov 25, 2025
898d633
Added top n log probs (#2262)
shanmugamr1992 Nov 25, 2025
3f91727
Fix logging when no IS is enabled. (#2375)
yobibyte Nov 26, 2025
6fc13a9
fix: exit failure when PR author is external contributor removed (#2410)
theothermike Nov 26, 2025
ebb2e91
Various small fixes for Megatron-FSDP. (#2346)
cspades Nov 26, 2025
f5531b0
Add grpo loop functional test (#2403)
jon-barker Nov 26, 2025
cb8f94e
Revert "Add grpo loop functional test (#2403)"
ko3n1g Nov 26, 2025
5153663
YARN position embedding clear forward method lru cache in init functi…
guyueh1 Nov 27, 2025
0819f3c
Graph Config Implementation (#2380)
kanz-nv Nov 27, 2025
b96d876
fix: adding k8s taints for ephermeral jobs (#2420)
theothermike Nov 27, 2025
9f15fed
ci: Enable functional tests (#2419)
ko3n1g Nov 27, 2025
40ef044
Reapply "build: Upgrade deps (NVIDIA#2289)" (#2408)
ko3n1g Nov 27, 2025
b21bbad
fix: use a script to do node tainting in the cicd workflow (#2421)
theothermike Nov 27, 2025
65ce253
ci: Mark gpt_dynamic_inference_tp1_pp1_583m_cuda_graphs_fp8_logitsmat…
ko3n1g Nov 28, 2025
6646d1a
ci: Disable `gpt_static_inference_cuda_graphs_pad_tp4_pp1_ep4_16B_log…
ko3n1g Nov 28, 2025
66c07b0
Fix rl training with data reuse. (#2428)
yobibyte Nov 28, 2025
8cde93d
Reapply - Add grpo loop functional test (#2411)
jon-barker Nov 28, 2025
6cc29a2
Revert "Reapply - Add grpo loop functional test (#2411)"
ko3n1g Nov 28, 2025
a62e237
chore: Add copyright to run_simple_mcore_train_loop.py (#2441)
chtruong814 Dec 1, 2025
66407fa
Retry inference test on different device if throughput slower than ex…
mathemakitten Dec 1, 2025
6d2a123
feat: mcore trigger mbridge (#2340)
pablo-garay Dec 1, 2025
7f4df2c
Reapply "Reapply - Add grpo loop functional test (#2411)"
ko3n1g Dec 1, 2025
848bff1
Remove redundant reduce in aux_loss logging (#2095)
BestJuly Dec 2, 2025
e2bd0db
Update DEFAULT_BASELINE in workflow configuration
pablo-garay Dec 2, 2025
9927a85
Add support for fake distributed process groups. (#2280)
Victarry Dec 2, 2025
0150d73
[Fix] Pass metadata to sharded_state_dict in load_modelopt_checkpoint…
kevalmorabia97 Dec 2, 2025
a6764e0
chore: Update codeowners for post-training (#2462)
ko3n1g Dec 2, 2025
77bc0f5
fix: Add merge_group support with pre-flight pattern (#2463)
pablo-garay Dec 2, 2025
3cacd5b
Add assertion for mxfp8 params without dp overlap (#2271)
kunlunl Dec 2, 2025
409f954
Add missing checkpoint arguments for MoE models (#2465)
santhnm2 Dec 2, 2025
40a4674
Clean log probs (#2404)
shanmugamr1992 Dec 2, 2025
08fdf5b
ci: Bump copyright workflow (#2473)
ko3n1g Dec 3, 2025
209bd6c
Fix `ImportError` and `NameError` in `examples/run_simple_mcore_train…
marksverdhei Dec 2, 2025
e8749f8
fix: Revert "Clean log probs (#2404)" (#2475)
chtruong814 Dec 3, 2025
2e6b2bc
Make grpo CI test use read-only data (#2472)
jon-barker Dec 3, 2025
54c33cb
Update golden values to allow new PRs to be merged (#2478)
tdene Dec 3, 2025
2d3459a
Clean log probs copy (#2477)
shanmugamr1992 Dec 3, 2025
c0b5c2c
Fix default.yaml for HFDatasetAgent use in countdown (#2487)
jon-barker Dec 3, 2025
e847643
Attention mask as PackedSeqParams (#2461)
jalbericiola Dec 3, 2025
299034c
fp8 param cuda graph support main (#2088)
kunlunl Dec 4, 2025
f9d02e9
docs: Add changelog for 0.15 (#2499)
ko3n1g Dec 4, 2025
50deedf
feat: improve external contributor single use ephemeral nodes (#2503)
theothermike Dec 4, 2025
f534416
Fix sequence parallel. (#2444)
yobibyte Dec 4, 2025
7e22b9c
update API check baseline (#2505)
pablo-garay Dec 4, 2025
c7ed7c6
Associate default rl cuda graphs attributes with args (#2453)
yobibyte Dec 4, 2025
b27818c
No using tokenizer in request record. (#2382)
lmcafee-nvidia Dec 4, 2025
abba836
make default --inference-dynamic-batching-cuda-graph-max-tokens value…
jon-barker Dec 4, 2025
8950d1a
Adjust the default CG size for functional test (#2544)
tdene Dec 4, 2025
c46a8ca
feat: API compat: ignore AttributeChangedValueBreakage (not a signatu…
pablo-garay Dec 4, 2025
f32dfec
feat: add decorator: experimental_api (#2539)
pablo-garay Dec 4, 2025
e79d9a8
ci: Add release workflows (#2507)
ko3n1g Dec 5, 2025
c3e1d2d
Fixing PG routing for inference & training separation (#2485)
wdykas Dec 5, 2025
b3a814e
ci: Fix release workflow (#2553)
ko3n1g Dec 5, 2025
286c806
fix: Duplicate artifact names (#2556)
ko3n1g Dec 5, 2025
fcc1aaf
ci: Avoid naming collision (#2558)
ko3n1g Dec 5, 2025
5883064
ci: Fixing naming collision (#2559)
ko3n1g Dec 5, 2025
dcee1c5
fix: publish release wheel and github release version number (#2561)
ko3n1g Dec 5, 2025
e152473
Revert "Fixing PG routing for inference & training separation (#2485)"
ko3n1g Dec 5, 2025
ecb948e
Fix MoE capacity handling (#2214)
DaizeDong Dec 5, 2025
04d202a
Avoid calling set_save_original_input with FP8 delayed scaling (#1860)
dalgarak Dec 5, 2025
b9d3736
build: Bump TE to 2.10 (#2496)
ko3n1g Dec 5, 2025
d2e7060
Add per-module TE quant config. (#2359)
kwyss-nvidia Dec 5, 2025
2c06b04
add more tokenizer arguments (#2377)
dimapihtar Dec 5, 2025
bcf07a2
Make check_large_grads non-fatal (#2307)
kwyss-nvidia Dec 5, 2025
416687f
fix for sequence packing plus sequence parallel: padding the sequence…
jalbericiola Dec 5, 2025
32ebde7
Revert "Make check_large_grads non-fatal (#2307)"
ko3n1g Dec 5, 2025
972d9b6
Torch symmetric - new latency optimized NVLS communication kernels fo…
sidsingh-nvidia Dec 5, 2025
8c4df6b
[Main] Support MTP packed-seq in main branch (#2173)
BestJuly Dec 7, 2025
8a5f379
Various quality-of-life improvements in training loop (#2580)
deepakn94 Dec 7, 2025
f7dfb99
Support TP greater than num_kv_heads by supporting QKV activation sub…
deepakn94 Dec 7, 2025
dfc3913
Fix FA3 import (#2577)
santhnm2 Dec 7, 2025
5232820
Fix runaway Etpt in straggler detector by resetting FLOPs accumulator…
cms42 Dec 7, 2025
4cf809c
Rename TensorRT Model Optimizer to Model Optimizer (#2373)
AAnoosheh Dec 7, 2025
e2199af
Reapply "Make check_large_grads non-fatal (#2307)"
ko3n1g Dec 7, 2025
f6e0d42
Fix aux loss scale when CP is enabled. (#2237)
Victarry Dec 7, 2025
01aad93
Save memory using main_param for moe in param_l2_norm (#2249)
BestJuly Dec 7, 2025
b51db3e
Changes to support latent MoEs (#2296)
deepakn94 Dec 8, 2025
03b6d31
update API compat check baseline to b51db3e (#2588)
pablo-garay Dec 8, 2025
f4957d1
Fix invalid argument failing tests on main (#2589)
tdene Dec 8, 2025
2bc35c5
Add openmathinstruct config. (#2586)
yobibyte Dec 8, 2025
e2cf81c
Move model configs to github. (#2587)
yobibyte Dec 8, 2025
8d18afd
fix: Assign tokenizer to Encoder.tokenizer in legacy mode (#2498)
iuyo5678 Dec 9, 2025
c21bf6e
Delete redundant import in yaml_arguments.py (#2139)
wplf Dec 9, 2025
dfb78dc
Fix world size mismatch causing distributed init deadlock (Issue #245…
CodersAcademy006 Dec 9, 2025
4a4f23a
Improve performance of request_metadata logic (#2378)
tdene Dec 9, 2025
7b11553
Fix broken Table of Contents links in README.md (#1954)
JungHoyoun Dec 9, 2025
9fba363
Add minor log update (#2080)
gautham-kollu Dec 9, 2025
bd32927
Fix link to NeMo performance summary documentation (#2190)
janbernloehr Dec 9, 2025
ef12f16
Prep for refit (#2590)
wdykas Dec 9, 2025
a2aafe3
feat: API compat: ignore ParameterMovedBreakage for __init__ methods …
pablo-garay Dec 9, 2025
5c54bb6
Revert "Prep for refit (#2590)"
ko3n1g Dec 9, 2025
be4baad
Fix NameError in pretrain_retro.py (add import_module), remove unused…
vignesh1507 Dec 10, 2025
d2b500f
Use the latest Hybrid-EP (#2479)
Autumn1998 Dec 10, 2025
dd54609
QK logits clipping (non-split version) (#1929)
BoxiangW Dec 10, 2025
93da800
update checkpointing documentation (#2606)
dimapihtar Dec 10, 2025
5 changes: 0 additions & 5 deletions .coveragerc

This file was deleted.

4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
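The `.flake8` file added above is plain INI that flake8 reads from its `[flake8]` section. As a minimal sketch (stdlib only, not how flake8 itself loads config), the same content can be parsed with `configparser` to show the structure tools see:

```python
import configparser

# Exact content of the .flake8 file added in this diff.
FLAKE8_CONTENT = """\
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""

parser = configparser.ConfigParser()
parser.read_string(FLAKE8_CONTENT)
cfg = parser["flake8"]

# Line-length limit applied repo-wide.
max_len = int(cfg["max-line-length"])
# Codes silenced everywhere (e.g. E501 line-too-long, F401 unused import).
ignored = cfg["extend-ignore"].split(",")

print(max_len, ignored)
```

The `per-file-ignores` entry additionally silences F401 only in `__init__.py` files, a common pattern for packages that re-export names.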
50 changes: 50 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,50 @@
megatron/core/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/models/gpt/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/gpt

megatron/core/models/multimodal/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/multi-modal

megatron/core/models/mamba/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba
megatron/core/ssm/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba

megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/distributed/fsdp/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

megatron/core/transformer/fsdp_dtensor_checkpoint.py @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

megatron/core/dist_checkpointing/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-checkpointing

megatron/core/optimizer/distrib_optimizer/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-optimizer

megatron/core/inference/modelopt_support @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/quantization-and-inference

megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/pipeline_parallel/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/pipeline-parallelism

megatron/core/transformer/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/mixture-of-experts-adlr @NVIDIA/mixture-of-experts-devtech

megatron/core/inference/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/inference

megatron/core/parallel_state.py @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/post_training/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/post-training

megatron/post_training/ @NVIDIA/post-training

.gitlab/ @NVIDIA/ci
.github/ @NVIDIA/ci
.gitlab-ci.yml @NVIDIA/ci
docker/ @NVIDIA/ci
tests/functional_tests/python_test_utils/ @NVIDIA/ci
tests/functional_tests/shell_test_utils/ @NVIDIA/ci
tests/test_utils/recipes/ @NVIDIA/ci
tests/unit_tests/run_ci_test.sh @NVIDIA/ci

megatron/rl/ @NVIDIA/reinforcement-learning
examples/rl/ @NVIDIA/reinforcement-learning
test/unit_tests/test_rl_utils.py @NVIDIA/reinforcement-learning
train_rl.py @NVIDIA/reinforcement-learning
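CODEOWNERS resolution is last-match-wins: the final pattern that matches a changed path determines the requested reviewers. A simplified sketch of that rule (prefix matching plus `fnmatch` only — not GitHub's full glob semantics), using lines taken verbatim from the file above:

```python
import fnmatch

# A few entries copied from the CODEOWNERS file added in this diff.
CODEOWNERS = """\
megatron/core/ @NVIDIA/core-adlr @NVIDIA/core-nemo
megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/mixture-of-experts-adlr @NVIDIA/mixture-of-experts-devtech
.github/ @NVIDIA/ci
"""

def owners_for(path: str) -> list[str]:
    """Return the owning teams for `path`; the LAST matching pattern wins."""
    result: list[str] = []
    for line in CODEOWNERS.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        pattern, *teams = line.split()
        # Directory patterns own everything beneath them; fnmatch covers
        # exact-file and wildcard patterns.
        if path.startswith(pattern) or fnmatch.fnmatch(path, pattern):
            result = teams
    return result

print(owners_for("megatron/core/transformer/moe/router.py"))
```

For the MoE path above, both the broad `megatron/core/` entry and the more specific `moe/` entry match, so the later, more specific entry's four teams are the ones requested.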
28 changes: 28 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,28 @@
---
name: Bug report
about: Create a report to help us improve the repository or project
title: ""
labels: bug
assignees: ''

---

**Describe the bug**

A clear and concise description of what the bug is.

**Steps/Code to reproduce bug**

Please list *minimal* steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.


**Expected behavior**

A clear and concise description of what you expected to happen.


**Additional context**

Add any other context about the problem here.
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,2 @@
blank_issues_enabled: false

20 changes: 20 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ""
labels: enhancement
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see?

**New performance**
What speed or accuracy do you see after the update?

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.