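The current allreduce implementation for the CPU algorithm uses AllGather as a workaround. This is inefficient: every rank has to receive the full buffer from every other rank and reduce locally, so per-rank traffic and memory grow with the number of ranks. We need an efficient allreduce implementation in legate, or a way to build something more efficient in legateboost.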
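To make the cost difference concrete, here is a minimal single-process NumPy sketch that simulates both approaches over a list of per-rank buffers. It does not use any legate or legateboost API. The function names and the "elements received per rank" accounting are hypothetical and only for illustration, and the reduce-scatter plus allgather scheme is one common alternative, not necessarily what legate would implement.

```python
import numpy as np


def allreduce_via_allgather(buffers):
    """Simulate allreduce where every rank gathers all buffers and sums locally."""
    p = len(buffers)
    n = buffers[0].size
    gathered = np.stack(buffers)      # what each rank holds after the allgather
    result = gathered.sum(axis=0)     # local reduction, repeated on every rank
    received_per_rank = (p - 1) * n   # full buffer from each of the other ranks
    return [result.copy() for _ in range(p)], received_per_rank


def allreduce_via_reduce_scatter(buffers):
    """Simulate allreduce as reduce-scatter followed by allgather of the chunks."""
    p = len(buffers)
    n = buffers[0].size
    # Every rank splits its buffer into p chunks using the same boundaries.
    chunks = [np.array_split(b, p) for b in buffers]
    # Reduce-scatter: rank r ends up owning the fully reduced chunk r.
    owned = [np.sum([chunks[src][r] for src in range(p)], axis=0) for r in range(p)]
    # Allgather: every rank collects the reduced chunks from all owners.
    result = np.concatenate(owned)
    # Rank r receives (p - 1) copies of chunk r, then the other reduced chunks.
    received_per_rank = max(
        (p - 1) * owned[r].size + (n - owned[r].size) for r in range(p)
    )
    return [result.copy() for _ in range(p)], received_per_rank


if __name__ == "__main__":
    ranks, size = 4, 8
    data = [np.arange(size, dtype=np.float64) * (r + 1) for r in range(ranks)]
    out_a, cost_a = allreduce_via_allgather(data)
    out_b, cost_b = allreduce_via_reduce_scatter(data)
    print("results match:", np.array_equal(out_a[0], out_b[0]))
    print("elements received per rank:", cost_a, "vs", cost_b)
```

In this model the allgather approach receives `(p - 1) * n` elements per rank, while the reduce-scatter variant receives roughly `2 * (p - 1) * n / p`, which stays close to `2 * n` as the number of ranks grows.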