Ideally the memory should be exactly equal to 0 after refit completed, but megatron will have some left.
> assert current_allocated == 0.0, "Memory should be 0 after refit completed"
E AssertionError: Memory should be 0 after refit completed
E assert 0.00244140625 == 0.0
Repro:
Set train_backend == megatron in the below unit test and run it.
|
# megatron still holds little memory after refit, so we only test dtensor now |
|
@pytest.mark.parametrize("train_backend", ["dtensor_v1", "dtensor_v2"]) |
|
def test_vllm_weight_update_memory(cluster, tokenizer, train_backend): |
.
Ideally the memory should be exactly equal to 0 after refit completed, but megatron will have some left.
Repro:
Set
train_backend == megatronin the below unit test and run it.RL/tests/unit/models/generation/test_vllm_generation.py
Lines 1667 to 1669 in ba46741