Select the ibv device that has an active port_state. #456
If the deviceList contains multiple ibv devices, we want to select the device whose port has an active port_state, instead of selecting the first device in the deviceList by default. If we choose the first device without checking, the IB runtime is likely to initialize successfully, but obscure errors are then reported at the ibv_post_send stage. At that point it is difficult to tell that the cause of the error is that we chose the wrong ibv device.
  _(reg_mr, IbvLib::mr*, (IbvLib::pd*, void*, size_t, int)) \
- _(wc_status_str, const char*, (IbvLib::wc_status))
+ _(wc_status_str, const char*, (IbvLib::wc_status)) \
+ _(port_state_str, const char*, (IbvLib::port_state))
Nit: could you keep this list sorted alphabetically?
// device of the port whose port_state is active, instead of just selecting
// the first device in the deviceList by default.
for (int i = 0; i < deviceList.size(); i++) {
IbvContext tp_ctx_;
Our naming convention is camelCase. Also, a trailing underscore means that a name is a private class member, whereas this is just a local variable. Could you just name this ctx?
}
}

TP_THROW_ASSERT_IF(found == false) << "Unable to find available ibv device";
If we can't find any usable device, we shouldn't treat it as an error (and crash the program); instead, we should just disable the ibv transport. The logic to do so lives here:
tensorpipe/tensorpipe/transport/ibv/context_impl.cc
Lines 58 to 62 in bb1473a
Could you move your code to that file? You will probably need to change the constructor of the Reactor class so that it takes an IbvContext object instead of an IbvDeviceList.
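To illustrate the suggestion above, here is a minimal sketch of a selection helper that reports "nothing usable" to its caller instead of asserting. FakeDevice, PortState, findActiveDevice and isViable are hypothetical stand-ins made up for this example, not TensorPipe or libibverbs names; the real code would presumably open each device with createIbvContext and call query_port on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only. The real code would work on
// an IbvDeviceList and query each port through IbvLib instead.
enum PortState { PORT_DOWN = 1, PORT_ACTIVE = 4 };

struct FakeDevice {
  PortState portState;
};

// Returns a pointer to the first device whose port is active, or nullptr if
// there is none. It never throws, so the caller decides what "none" means.
const FakeDevice* findActiveDevice(const std::vector<FakeDevice>& devices) {
  for (std::size_t i = 0; i < devices.size(); ++i) {
    if (devices[i].portState == PORT_ACTIVE) {
      return &devices[i];
    }
  }
  return nullptr;
}

// Hypothetical viability check: with no active device the transport is
// simply reported as not viable, rather than crashing the program.
bool isViable(const std::vector<FakeDevice>& devices) {
  return findActiveDevice(devices) != nullptr;
}

// Small usage example.
void exampleUsage() {
  std::vector<FakeDevice> devices = {{PORT_DOWN}, {PORT_ACTIVE}};
  assert(findActiveDevice(devices) == &devices[1]);
  assert(isViable(devices));
}
```

Returning a null result keeps the "no usable device" case an ordinary outcome, which is what lets the context disable the transport gracefully.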
std::memset(&portAttr, 0, sizeof(portAttr));
tp_ctx_ = createIbvContext(getIbvLib(), deviceList[i]);
TP_CHECK_IBV_INT(ibvLib.query_port(tp_ctx_.get(), kPortNum, &portAttr));
if (portAttr.state == IbvLib::port_state::PORT_ACTIVE) {
port_state is just an enum, not an enum class, hence its values should be accessed just as IbvLib::PORT_ACTIVE.
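As a small illustration of the point above, the sketch below uses a hypothetical FakeIbvLib struct (not the real IbvLib) with an unscoped enum nested inside it; the enumerator values are only illustrative. Since C++11 the qualified spelling port_state::PORT_ACTIVE also compiles for an unscoped enum, so preferring the shorter form is a matter of consistency rather than correctness.

```cpp
#include <cassert>

// Hypothetical stand-in for IbvLib, for illustration only.
struct FakeIbvLib {
  // Unscoped enum: its enumerators are also visible directly in the
  // enclosing class scope, i.e. as FakeIbvLib::PORT_ACTIVE.
  enum port_state {
    PORT_DOWN = 1,
    PORT_ACTIVE = 4,
  };
};

// Both spellings name the same enumerator.
static_assert(
    FakeIbvLib::PORT_ACTIVE == FakeIbvLib::port_state::PORT_ACTIVE,
    "unscoped enumerators are reachable through the enclosing class");

// The shorter spelling, as requested in the review.
bool isActive(FakeIbvLib::port_state state) {
  return state == FakeIbvLib::PORT_ACTIVE;
}

// Small usage example.
void exampleUsage() {
  assert(isActive(FakeIbvLib::PORT_ACTIVE));
  assert(!isActive(FakeIbvLib::PORT_DOWN));
}
```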
This PR fixes #455.