The library crashes when performing a collective operation such as gather on very large objects, even though objects of this size should still be manageable on supercomputers. The issue is not easy to reproduce because it requires a large amount of available memory.
The following program illustrates this:
#include <iostream>
#include <vector>
#include <boost/serialization/vector.hpp>
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/collectives.hpp>

// Wrapping the vector in a struct forces the library to serialize it
// rather than treat it as a primitive MPI type.
struct huge {
    std::vector<unsigned char> data;
    huge() : data(2ull << 30ull, 0) { }  // 2^31 bytes

    template <class Archive>
    void serialize(Archive& ar, const unsigned int version)
    {
        ar & data;
    }
};

int main()
{
    boost::mpi::environment env;
    boost::mpi::communicator world;
    huge a{};
    std::cout << world.rank() << " huge created " << std::endl;
    world.barrier();
    if (world.rank() == 0)
    {
        // Root collects one object from every rank.
        std::vector<huge> all;
        boost::mpi::gather(world, a, all, 0);
    }
    else
    {
        boost::mpi::gather(world, a, 0);
    }
    return 0;
}
The program creates an object holding 2 GiB of data (2ull << 30 bytes). The struct huge is defined so as to force the library to use a non-primitive MPI type. When run with only 2 tasks, it crashes with:
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_range_insert
On the JUWELS supercomputer, which has Boost 1.69, the error is:
mpi: /gpfs/software/juwels/stages/2019a/software/Boost/1.69.0-gpsmpi-2019a-Python-2.7.16/include/boost/mpi/allocator.hpp:142: T* boost::mpi::allocator<T>::allocate(boost::mpi::allocator<T>::size_type, boost::mpi::allocator<void>::const_pointer) [with T = char; boost::mpi::allocator<T>::pointer = char*; boost::mpi::allocator<T>::size_type = long unsigned int; boost::mpi::allocator<void>::const_pointer = const void*]: Assertion `_check_result == 0' failed.
The program appears to crash around this line in gather.hpp
The same crash occurs even when running with 1 task.
Reducing the size of huge as follows
struct huge {
    std::vector<unsigned char> data;
    huge() : data(2ull << 29ull, 0) { }  // 2^30 bytes
....
makes the program crash again; this time the crash appears to happen at this line of gather.hpp
packed_iarchive::buffer_type recv_buffer(is_root ? std::accumulate(oasizes.begin(), oasizes.end(), 0) : 0);
My impression is that sizes are sometimes stored as int when they should be size_t. For example, in the line above, oasizes is a std::vector of int: even if each single-object size fits into an int, the total size of the gathered buffer can exceed 2^31. In addition, the initial value 0 passed to std::accumulate is an int, so the sum itself is computed in int.
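To make the suspected overflow concrete, here is a minimal sketch that is independent of Boost.MPI. The helper names total_packed_size and fits_in_int are hypothetical and not part of the library. The sketch relies only on the fact that std::accumulate keeps its running total in the type of the initial value, so passing std::size_t{0} instead of 0 sums the int sizes in a wider type. It is not a proposed patch: other parts of the library and of MPI itself also use int counts.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical helper: sum per-rank packed sizes (stored as int, like
// oasizes) in std::size_t. The type of the initial value passed to
// std::accumulate is the type of the running total, so std::size_t{0}
// avoids summing in int.
inline std::size_t total_packed_size(const std::vector<int>& sizes)
{
    return std::accumulate(sizes.begin(), sizes.end(), std::size_t{0},
        [](std::size_t acc, int s) {
            assert(s >= 0);  // a packed size is never negative
            return acc + static_cast<std::size_t>(s);
        });
}

// Hypothetical helper: true if the total can be represented as an int.
inline bool fits_in_int(std::size_t total)
{
    return total <= static_cast<std::size_t>(std::numeric_limits<int>::max());
}
```

For two ranks that each contribute 2^30 bytes, total_packed_size returns 2^31 = 2147483648, and fits_in_int reports that this value cannot be represented as an int, even though each individual size can.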