For the model I am training, I rely on a custom Sampler that returns batches of variable size. My task is translation, where, following Attention Is All You Need (2017), I create batches based on the total token count per batch. Given the variable-length input, this results in batches with varying numbers of examples (an example here being one source/target translation pair).
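To illustrate the kind of batching I mean, here is a minimal sketch rather than my actual sampler. The class name `TokenBatchSampler` and the `lengths` and `max_tokens` parameters are made up for this example, and it only counts unpadded tokens on one side of each pair:

```python
from typing import Iterator, List, Sequence


class TokenBatchSampler:
    """Yields lists of example indices whose summed token count stays within max_tokens.

    Hypothetical sketch: `lengths` holds one token count per example.
    """

    def __init__(self, lengths: Sequence[int], max_tokens: int) -> None:
        self.lengths = lengths
        self.max_tokens = max_tokens

    def __iter__(self) -> Iterator[List[int]]:
        batch: List[int] = []
        tokens = 0
        # Visit examples from shortest to longest so similar lengths share a batch.
        for idx in sorted(range(len(self.lengths)), key=lambda i: self.lengths[i]):
            n = self.lengths[idx]
            # Close the current batch if adding this example would exceed the budget.
            if batch and tokens + n > self.max_tokens:
                yield batch
                batch, tokens = [], 0
            batch.append(idx)
            tokens += n
        if batch:
            yield batch


lengths = [5, 40, 7, 38, 6, 12]
batches = list(TokenBatchSampler(lengths, max_tokens=50))
print([len(b) for b in batches])  # [4, 1, 1]
```

An object like this can be passed to a PyTorch DataLoader through its `batch_sampler` argument, and the number of examples per batch differs from step to step, which is exactly the situation I am asking about.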
For regular DDP-based training this worked fine: I simply created a distributed version of the sampler that splits each variable-size batch into sub-batches based on the GPU rank. For DeepSpeed, however, I am forced to provide either train_micro_batch_size_per_gpu or train_batch_size, both of which, as far as I understand, are based on the number of examples in the batch.
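The per-rank split in the DDP setup can be sketched roughly as follows. This is a simplified illustration, not my exact code: the helper name `sub_batch_for_rank` is hypothetical, and a strided slice is just one possible way to divide a batch.

```python
from typing import List


def sub_batch_for_rank(batch: List[int], rank: int, world_size: int) -> List[int]:
    """Return the slice of one global batch that a given rank should process."""
    # Strided slice: rank r takes items r, r + world_size, r + 2 * world_size, ...
    return batch[rank::world_size]


batch = [0, 4, 2, 5, 3]  # example indices of one token-based batch
parts = [sub_batch_for_rank(batch, r, world_size=2) for r in range(2)]
print(parts)  # [[0, 2, 3], [4, 5]]
```

Note that the sub-batches differ in size between ranks as well as between steps, and a batch with fewer examples than ranks would leave some ranks empty, so no single per-GPU example count describes them.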
As the number of examples varies from batch to batch, and I just want to configure gradient accumulation based on batch count rather than batch size, I'm not sure how to achieve this with DeepSpeed's configuration.
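To make the concern concrete: as far as I can tell from the DeepSpeed configuration docs, train_batch_size has to equal train_micro_batch_size_per_gpu times gradient_accumulation_steps times the number of GPUs, so each of these is tied to an example count. A small sketch of that relation, with made-up numbers:

```python
world_size = 4  # number of GPUs; illustrative value

ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # examples per GPU per forward/backward pass
    "gradient_accumulation_steps": 2,     # micro-batches accumulated per optimizer step
}

# The effective batch size is the product of the two settings and the GPU count.
ds_config["train_batch_size"] = (
    ds_config["train_micro_batch_size_per_gpu"]
    * ds_config["gradient_accumulation_steps"]
    * world_size
)
print(ds_config["train_batch_size"])  # 64
```

With token-based batches there is no fixed value I could put in for the per-GPU example count, while gradient_accumulation_steps on its own is the only part I actually want to control.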
Am I misunderstanding the impact of the configuration variables, missing some other configuration, or is this not possible to achieve at the moment?