Megatron has left-over memory after refit completed #1739

@yuki-97

Description

Ideally, the allocated memory should be exactly 0 after refit completes, but with the Megatron backend a small amount remains allocated.

>       assert current_allocated == 0.0, "Memory should be 0 after refit completed"
E       AssertionError: Memory should be 0 after refit completed
E       assert 0.00244140625 == 0.0
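
The assertion output does not state a unit. Assuming, purely for illustration, that `current_allocated` is reported in GiB, the leftover value works out to exactly 2.5 MiB, which would suggest a small retained buffer rather than a model-sized leak. A minimal sketch of that conversion follows; the helper name and the GiB unit are assumptions, not taken from the test:

```python
# Assumption: the value printed by the assertion is in GiB.
# The issue itself does not state the unit.
GIB = 1024 ** 3
MIB = 1024 ** 2


def leftover_in_bytes(value_gib: float) -> int:
    """Convert a GiB reading to whole bytes (hypothetical helper)."""
    return round(value_gib * GIB)


# Value copied from the failing assertion above.
leftover = leftover_in_bytes(0.00244140625)
print(leftover, "bytes =", leftover / MIB, "MiB")
```

If the unit is in fact something else, only the constant `GIB` needs to change.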

Repro:
Add "megatron" to the train_backend parametrization of the unit test below and run it.

# megatron still holds little memory after refit, so we only test dtensor now
@pytest.mark.parametrize("train_backend", ["dtensor_v1", "dtensor_v2"])
def test_vllm_weight_update_memory(cluster, tokenizer, train_backend):
    ...

Labels: bug
