In our MPI implementation, node A must call comm.send and node B must call comm.recv for a message to travel from A to B. In contrast, gRPC requires action only from the receiving node: no explicit send call is needed, because each node runs a gRPC server on a parallel thread that answers incoming requests. This makes the gRPC approach much more scalable, since sender and receiver never have to synchronize.
Therefore, we need to write an efficient MPI-based server which, like gRPC, runs on a parallel thread and serves models and tensors when a receiving node requests them.
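The server-thread pattern described above can be sketched as follows. This is a minimal, hypothetical sketch: `MockComm`, `TensorServer`, and the method names are illustrative stand-ins, and an in-process queue replaces real MPI transport so the example runs without an MPI launcher. In actual mpi4py code, the polling loop would use `comm.Iprobe`, `comm.recv`, and `comm.send` instead.

```python
import threading
import queue
import time

class MockComm:
    """Stands in for an MPI communicator so the pattern runs without
    an MPI launcher. A real implementation would use mpi4py's
    comm.Iprobe / comm.recv / comm.send here (assumption, not our
    actual transport code)."""
    def __init__(self):
        self.requests = queue.Queue()
        self.replies = queue.Queue()

    def iprobe(self):            # roughly comm.Iprobe()
        return not self.requests.empty()

    def recv_request(self):      # roughly comm.recv(source=ANY_SOURCE)
        return self.requests.get()

    def send_reply(self, obj):   # roughly comm.send(obj, dest=requester)
        self.replies.put(obj)

class TensorServer:
    """Serves named tensors from a background thread, so the owning
    node's main thread never has to synchronize with requesters --
    the gRPC-style behavior described above."""
    def __init__(self, comm, store):
        self.comm = comm
        self.store = store
        self.running = True
        self.thread = threading.Thread(target=self._serve, daemon=True)
        self.thread.start()

    def _serve(self):
        while self.running:
            if self.comm.iprobe():      # poll for a request
                name = self.comm.recv_request()
                self.comm.send_reply(self.store.get(name))
            else:
                time.sleep(0.001)       # avoid busy-spinning

    def stop(self):
        self.running = False
        self.thread.join()

# Requester side: only a request-and-receive is needed; the serving
# node's main thread issues no matching send.
comm = MockComm()
server = TensorServer(comm, {"weights": [1.0, 2.0, 3.0]})
comm.requests.put("weights")
result = comm.replies.get(timeout=1)
server.stop()
print(result)  # → [1.0, 2.0, 3.0]
```

The key design point is that the serving loop lives on its own thread, so the main computation on the serving node proceeds regardless of when (or whether) requests arrive.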
If that requires too much effort, we should also consider dropping MPI entirely.