Distributed checkpointing is very buggy for me when I'm just using DDP. If it's not a big change (I noticed there is an option `DTensor=False`, which makes me hopeful), where would one start?