Sidecar supports MarkDoneAll/UnlinkAll and retries on failed requests #198

yizhuoliang wants to merge 1 commit into master
Conversation
Summary of Changes

Hello @yizhuoliang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the sidecar service by introducing batch operations for marking data as done and unlinking shared memory, streamlining resource management. A key improvement is the integration of a gRPC retry mechanism with exponential backoff across various client and server interactions, making the system more resilient to transient communication failures. These changes aim to improve both the efficiency and reliability of the sidecar's data handling.
Code Review
This pull request introduces MarkDoneAll and UnlinkAll RPCs to the sidecar service, which is a great addition for cleaning up all chunks of a request at once. It also adds retry logic with exponential backoff for gRPC requests, which improves the resilience of the system.
My main feedback is about the implementation of the retry logic:
- Code Duplication: The retry logic is duplicated in many places across multiple files for both synchronous and asynchronous calls. This makes the code difficult to maintain and prone to inconsistencies. I've left a detailed comment with a suggestion to refactor this into a reusable helper function.
- Bug in Synchronous Retries: Several of the new synchronous retry loops are missing a call to time.sleep(). This will cause them to busy-wait and spin on the CPU during transient failures, which should be fixed. I've marked these as critical.
Once these points are addressed, the PR will be in great shape.
```python
logger.warning(
    "Register retry %d for sidecar rank %d due to %s",
    attempt + 1,
    self.sidecar_rank,
    code.name,
)
continue
```
The retry logic for this synchronous gRPC call is missing a sleep with exponential backoff. Without it, the loop will spin without waiting, causing high CPU usage on transient failures. You should add time.sleep(backoff_delay) and also log the backoff duration.
Suggested change:

```python
backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
logger.warning(
    "Register retry %d for sidecar rank %d due to %s, waiting %.3fs",
    attempt + 1,
    self.sidecar_rank,
    code.name,
    backoff_delay,
)
time.sleep(backoff_delay)
continue
```
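As a side note on what the formula above produces: with hypothetical values for the retry constants (the real ones are defined elsewhere in this repo and may differ), the delay schedule grows geometrically, and no sleep happens after the final attempt because the loop re-raises instead of retrying:

```python
# Hypothetical values for illustration only; the actual constants
# live elsewhere in the codebase and may differ.
GRPC_RETRY_INITIAL_BACKOFF_SECONDS = 0.1
GRPC_RETRY_BACKOFF_MULTIPLIER = 2
GRPC_RETRY_MAX_ATTEMPTS = 4

# Delay applied after each failed attempt; the last attempt gets no
# delay since a final failure is re-raised rather than retried.
delays = [
    GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER ** attempt)
    for attempt in range(GRPC_RETRY_MAX_ATTEMPTS - 1)
]
print(delays)  # [0.1, 0.2, 0.4]
```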
```python
logger.warning(
    "Send retry %d for shard %d chunk %d in req %s due to %s",
    attempt + 1,
    self.shard_rank,
    chunk_id,
    id,
    code.name,
)
continue
```
The retry logic for this synchronous gRPC call is missing a sleep with exponential backoff. Without it, the loop will spin without waiting, causing high CPU usage on transient failures. You should add time.sleep(backoff_delay) and also log the backoff duration.
Suggested change:

```python
backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
logger.warning(
    "Send retry %d for shard %d chunk %d in req %s due to %s, waiting %.3fs",
    attempt + 1,
    self.shard_rank,
    chunk_id,
    id,
    code.name,
    backoff_delay,
)
time.sleep(backoff_delay)
continue
```
```python
logger.warning(
    "CloseStream retry %d for stream %s due to %s",
    attempt + 1,
    id,
    code.name,
)
continue
```
The retry logic for this synchronous gRPC call is missing a sleep with exponential backoff. Without it, the loop will spin without waiting, causing high CPU usage on transient failures. You should add time.sleep(backoff_delay) and also log the backoff duration.
Suggested change:

```python
backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
logger.warning(
    "CloseStream retry %d for stream %s due to %s, waiting %.3fs",
    attempt + 1,
    id,
    code.name,
    backoff_delay,
)
time.sleep(backoff_delay)
continue
```
```python
res = None
for attempt in range(GRPC_RETRY_MAX_ATTEMPTS):
    try:
        res = await stub.Unlink(unlink_req)
        break
    except grpc.RpcError as e:
        code = e.code()
        if (
            code in (grpc.StatusCode.CANCELLED, grpc.StatusCode.UNAVAILABLE)
            and attempt < GRPC_RETRY_MAX_ATTEMPTS - 1
        ):
            backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
            logger.warning(
                "Unlink retry %d for req %s chunk %d in rank %d due to %s, waiting %.3fs",
                attempt + 1,
                mark_done_req.id,
                mark_done_req.chunk_id,
                chunk_state.intra_node_rank,
                code.name,
                backoff_delay,
            )
            await asyncio.sleep(backoff_delay)
            continue
        raise
if res is None:
    await context.abort(
        grpc.StatusCode.INTERNAL,
        f"Failed to unlink for id {mark_done_req.id}: no response received",
    )
```
This retry logic with exponential backoff is duplicated in many places across the codebase (e.g., mark_done_all in this file, and in sender.py, server.py, api.py). This makes the code harder to maintain and prone to errors (like the missing time.sleep in some sync versions).
Consider extracting this logic into a reusable helper function for both asynchronous and synchronous gRPC calls. This would centralize the retry mechanism, improve readability, and ensure consistency.
For example, you could create an async helper like this:
```python
async def grpc_retry_async(stub_call, request, log_message_prefix: str):
    response = None
    for attempt in range(GRPC_RETRY_MAX_ATTEMPTS):
        try:
            response = await stub_call(request)
            break
        except grpc.RpcError as e:
            code = e.code()
            if (
                code in (grpc.StatusCode.CANCELLED, grpc.StatusCode.UNAVAILABLE)
                and attempt < GRPC_RETRY_MAX_ATTEMPTS - 1
            ):
                backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
                logger.warning(
                    f"{log_message_prefix} retry %d due to %s, waiting %.3fs",
                    attempt + 1,
                    code.name,
                    backoff_delay,
                )
                await asyncio.sleep(backoff_delay)
                continue
            raise
    return response
```

And then use it like this:

```python
log_prefix = f"Unlink for req {mark_done_req.id} chunk {mark_done_req.chunk_id} in rank {chunk_state.intra_node_rank}"
res = await grpc_retry_async(stub.Unlink, unlink_req, log_prefix)
if res is None:
    await context.abort(
        grpc.StatusCode.INTERNAL,
        f"Failed to unlink for id {mark_done_req.id}: no response received",
    )
```

A similar helper grpc_retry_sync could be created for synchronous calls.
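To make that concrete, here is a minimal sketch of what such a grpc_retry_sync could look like. The constants and the structure mirror the async helper above; note that the retryable predicate and the sleep function are made injectable here purely so the loop can be exercised without a live gRPC channel (in the real code you would catch grpc.RpcError and check e.code() against CANCELLED/UNAVAILABLE directly, and use time.sleep):

```python
import logging
import time

logger = logging.getLogger(__name__)

# Hypothetical values; the real constants are defined elsewhere in the repo.
GRPC_RETRY_MAX_ATTEMPTS = 4
GRPC_RETRY_INITIAL_BACKOFF_SECONDS = 0.1
GRPC_RETRY_BACKOFF_MULTIPLIER = 2


def grpc_retry_sync(stub_call, request, log_message_prefix: str,
                    retryable=lambda e: True, sleep=time.sleep):
    """Synchronous counterpart of grpc_retry_async.

    `retryable` decides whether an exception warrants a retry; in the
    real code this would be `e.code() in (CANCELLED, UNAVAILABLE)` on a
    grpc.RpcError. `sleep` defaults to time.sleep and is injectable so
    the backoff behavior can be tested without actually waiting.
    """
    response = None
    for attempt in range(GRPC_RETRY_MAX_ATTEMPTS):
        try:
            response = stub_call(request)
            break
        except Exception as e:
            if retryable(e) and attempt < GRPC_RETRY_MAX_ATTEMPTS - 1:
                backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (
                    GRPC_RETRY_BACKOFF_MULTIPLIER ** attempt
                )
                logger.warning(
                    "%s retry %d due to %s, waiting %.3fs",
                    log_message_prefix, attempt + 1, e, backoff_delay,
                )
                sleep(backoff_delay)  # the step missing from the sync loops flagged above
                continue
            raise
    return response
```

This keeps the backoff computation and the attempt bookkeeping in one place, so fixing the missing-sleep bug (or adding jitter later) only needs to happen once.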
```python
res = None
for attempt in range(GRPC_RETRY_MAX_ATTEMPTS):
    try:
        res = await stub.UnlinkAll(unlink_all_req)
        break
    except grpc.RpcError as e:
        code = e.code()
        if (
            code in (grpc.StatusCode.CANCELLED, grpc.StatusCode.UNAVAILABLE)
            and attempt < GRPC_RETRY_MAX_ATTEMPTS - 1
        ):
            backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
            logger.warning(
                "UnlinkAll retry %d for req %s in rank %d due to %s, waiting %.3fs",
                attempt + 1,
                mark_done_all_req.id,
                rank,
                code.name,
                backoff_delay,
            )
            await asyncio.sleep(backoff_delay)
            continue
        raise
if res is None:
    await context.abort(
        grpc.StatusCode.INTERNAL,
        f"Failed to unlink all for id {mark_done_all_req.id} in rank {rank}: no response received",
    )
```
```python
res = None
for attempt in range(GRPC_RETRY_MAX_ATTEMPTS):
    try:
        res = await stub.PrepareReceive(req)
        break
    except grpc.RpcError as e:
        code = e.code()
        if (
            code in (grpc.StatusCode.CANCELLED, grpc.StatusCode.UNAVAILABLE)
            and attempt < GRPC_RETRY_MAX_ATTEMPTS - 1
        ):
            backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
            logger.warning(
                "PrepareReceive retry %d for req %s chunk %d to rank %d due to %s, waiting %.3fs",
                attempt + 1,
                request.id,
                request.chunk_id,
                dst_rank,
                code.name,
                backoff_delay,
            )
            await asyncio.sleep(backoff_delay)
            continue
        raise
if res is None or res.status != common_pb2.Status.STATUS_OK:
    logger.error("Failed to prepare receive")
    return sidecar_pb2.SendResponse(status=common_pb2.Status.STATUS_ERROR)
```
```python
response = None
for attempt in range(GRPC_RETRY_MAX_ATTEMPTS):
    try:
        response = await self.aio_stub.Receive(request)
        break
    except grpc.RpcError as e:
        code = e.code()
        if (
            code in (grpc.StatusCode.CANCELLED, grpc.StatusCode.UNAVAILABLE)
            and attempt < GRPC_RETRY_MAX_ATTEMPTS - 1
        ):
            backoff_delay = GRPC_RETRY_INITIAL_BACKOFF_SECONDS * (GRPC_RETRY_BACKOFF_MULTIPLIER**attempt)
            logger.warning(
                "Receive retry %d for chunk %d in req %s due to %s, waiting %.3fs",
                attempt + 1,
                chunk_id,
                id,
                code.name,
                backoff_delay,
            )
            await asyncio.sleep(backoff_delay)
            continue
        raise
if response is None:
    raise RuntimeError(f"Failed to receive data with id {id}: no response received")
```
Please test with
NOTE: only tested examples/mllm.py so far. Need more tests.