Description
@DwyaneShi proposed this earlier in an offline discussion; I am creating an issue to track it. In environments where RDMA is not consistently available, the system currently fails to operate if RDMA initialization fails. One possible improvement is to fall back to TCP when that happens.
Open Questions & Discussion Points
- Should the client exit or fall back to TCP when RDMA is unavailable?
- If fallback is allowed, how do we inform the user or log the performance degradation clearly?
- Are there use cases where TCP fallback is preferred for availability over strict RDMA-only operation?
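
To make the first two questions concrete, here is a minimal sketch of how an opt-in policy could look. Everything in it is hypothetical and for discussion only: `RdmaFailurePolicy`, `RdmaUnavailableError`, `select_transport`, and the `init_rdma` callable are illustrative names, not existing APIs in this project.

```python
import logging
from enum import Enum

logger = logging.getLogger("transport")


class RdmaFailurePolicy(Enum):
    """Hypothetical user-facing setting: what to do when RDMA init fails."""
    EXIT = "exit"                  # strict RDMA-only operation
    FALLBACK_TCP = "fallback_tcp"  # prefer availability over performance


class RdmaUnavailableError(RuntimeError):
    """Hypothetical error raised when RDMA initialization fails."""


def select_transport(init_rdma, policy):
    """Try RDMA first; return "rdma" or "tcp" depending on the policy.

    `init_rdma` is a placeholder callable standing in for the real
    RDMA setup code; it is assumed to raise RdmaUnavailableError on failure.
    """
    try:
        init_rdma()
        return "rdma"
    except RdmaUnavailableError as exc:
        if policy is RdmaFailurePolicy.EXIT:
            # Strict mode: surface the failure to the caller unchanged.
            raise
        # Fallback mode: make the degradation visible in the logs.
        logger.warning(
            "RDMA initialization failed (%s); falling back to TCP. "
            "Expect reduced throughput and higher latency.",
            exc,
        )
        return "tcp"
```

With this shape, the current behavior maps to `EXIT`, and fallback becomes an explicit choice that always leaves a warning in the logs.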
Tradeoffs
- TCP fallback ensures availability but significantly degrades performance.
- RDMA-only mode is optimal for performance but less fault-tolerant.
I am opening this issue for discussion only. It is a low-priority item, and we can discuss it later.