
Conversation

@dheeraj-vanamala

Description

This PR fixes issue #4517 where the OTLP gRPC exporter fails to reconnect to the collector after a restart (returning UNAVAILABLE).

Changes:

  • Detected StatusCode.UNAVAILABLE in the export loop.
  • Added logic to close the existing channel and re-initialize it before retrying.
  • Added a regression test test_unavailable_reconnects to verify the reconnection behavior.

Fixes #4517
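A minimal sketch of the reconnect strategy described above. The names here (`StatusCode`, `FlakyStub`, `Exporter`, `_reinit_channel`) are simplified stand-ins for illustration, not the actual `opentelemetry-exporter-otlp-proto-grpc` internals:

```python
from enum import Enum


class StatusCode(Enum):
    OK = 0
    UNAVAILABLE = 14


class FlakyStub:
    """Stand-in for a gRPC stub; returns UNAVAILABLE while unhealthy."""

    def __init__(self, healthy):
        self.healthy = healthy

    def export(self, request):
        return StatusCode.OK if self.healthy else StatusCode.UNAVAILABLE


class Exporter:
    """Closes and re-creates its channel/stub whenever export hits UNAVAILABLE."""

    def __init__(self, stub_factory):
        self._stub_factory = stub_factory
        self._stub = stub_factory()

    def _reinit_channel(self):
        # In the real change this closes the grpc.Channel and re-creates it,
        # which resets the underlying poller state.
        self._stub = self._stub_factory()

    def export(self, request, max_retries=3):
        code = StatusCode.UNAVAILABLE
        for _ in range(max_retries):
            code = self._stub.export(request)
            if code is not StatusCode.UNAVAILABLE:
                break
            self._reinit_channel()
        return code


# Demo: the first stub is stale (collector restarted); the re-created one succeeds.
calls = {"n": 0}


def make_stub():
    calls["n"] += 1
    return FlakyStub(healthy=calls["n"] > 1)


exporter = Exporter(make_stub)
result = exporter.export("spans")
```

In this toy run, the first export attempt fails with UNAVAILABLE, the channel is re-initialized, and the retry succeeds without an application restart.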

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I added a new regression test case test_unavailable_reconnects in exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py.

  • test_unavailable_reconnects: Verifies that the exporter closes and re-initializes the gRPC channel when the server returns StatusCode.UNAVAILABLE.
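A simplified sketch of the shape such a regression test can take, using `unittest.mock`. The `Exporter` model below is a stand-in for the real class under test, not the PR's actual code:

```python
from enum import Enum
from unittest import mock


class StatusCode(Enum):
    OK = 0
    UNAVAILABLE = 14


class Exporter:
    """Simplified model: close and re-create the channel on UNAVAILABLE, then retry once."""

    def __init__(self, channel_factory):
        self._channel_factory = channel_factory
        self._channel = channel_factory()

    def export(self, data):
        code = self._channel.export(data)
        if code is StatusCode.UNAVAILABLE:
            self._channel.close()                    # drop the stale channel
            self._channel = self._channel_factory()  # re-initialize
            code = self._channel.export(data)        # retry on the new channel
        return code


def test_unavailable_reconnects():
    stale = mock.Mock()
    stale.export.return_value = StatusCode.UNAVAILABLE
    fresh = mock.Mock()
    fresh.export.return_value = StatusCode.OK
    factory = mock.Mock(side_effect=[stale, fresh])

    exporter = Exporter(factory)
    assert exporter.export("spans") is StatusCode.OK
    stale.close.assert_called_once()   # old channel was closed
    assert factory.call_count == 2     # a new channel was created


test_unavailable_reconnects()
```

The key assertions are that the stale channel's `close()` is called exactly once and that the factory is invoked a second time, proving re-initialization rather than a blind retry on the dead channel.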

Does This PR Require a Contrib Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@dheeraj-vanamala dheeraj-vanamala requested a review from a team as a code owner November 30, 2025 15:26
@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 30, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from c670f77 to b7620d0 Compare November 30, 2025 16:00
@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from b7620d0 to 436ecc9 Compare November 30, 2025 16:13
@dheeraj-vanamala
Copy link
Author

dheeraj-vanamala commented Nov 30, 2025

I understand this issue is related to the upstream gRPC bug (grpc/grpc#38290).

I've analyzed that issue in depth, and the root cause appears to be a regression in the gRPC 'backup poller' (introduced in grpcio>=1.68.0) which fails to recover connections when the primary EventEngine is disabled (common in Python for fork safety).

While upstream fixes are being explored (e.g., grpc/grpc#38480), the issue has persisted for months, leaving exporters stuck in an UNAVAILABLE state indefinitely after collector restarts.

This PR implements a robust mitigation: detecting the persistent UNAVAILABLE state and forcing a channel re-initialization. This effectively resets the underlying poller state, allowing the exporter to recover immediately without requiring a full application restart. This approach provides stability for users while the complex upstream fix is finalized.
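One way to distinguish a *persistent* UNAVAILABLE state from a transient blip is a consecutive-failure threshold before forcing re-initialization. The sketch below is purely illustrative (`ChannelManager` and the threshold value are assumptions, not the PR's actual implementation):

```python
from enum import Enum


class StatusCode(Enum):
    OK = 0
    UNAVAILABLE = 14


class ChannelManager:
    """Tracks consecutive UNAVAILABLE results and re-creates the channel
    once the failure looks persistent rather than transient."""

    def __init__(self, make_channel, threshold=2):
        self._make_channel = make_channel
        self._threshold = threshold
        self._consecutive_failures = 0
        self.resets = 0
        self.channel = make_channel()

    def record(self, code):
        if code is StatusCode.UNAVAILABLE:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._threshold:
                # Re-creating the channel discards the stuck poller state.
                self.channel = self._make_channel()
                self.resets += 1
                self._consecutive_failures = 0
        else:
            self._consecutive_failures = 0


mgr = ChannelManager(make_channel=object, threshold=2)
mgr.record(StatusCode.UNAVAILABLE)  # first failure: could be transient
mgr.record(StatusCode.UNAVAILABLE)  # second in a row: reset the channel
```

A successful export resets the counter, so ordinary transient errors never trigger the (relatively expensive) channel teardown.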

…mments

- Remove aggressive gRPC keepalive and retry settings to rely on defaults.
- Fix compression precedence logic to correctly handle NoCompression (0).
- Refactor channel initialization to be stateless (remove _channel_reconnection_enabled).
- Update documentation to refer to 'OTLP-compatible receiver'.
@dheeraj-vanamala dheeraj-vanamala changed the title Fix: Reinitialize gRPC channel on UNAVAILABLE error (Fixes #4517) Fix: Reinitialize gRPC channel on UNAVAILABLE error (Fixes #4517) (Fixes #4529) Dec 9, 2025
@xrmx xrmx moved this to Ready for review in @xrmx's Python PR digest Dec 17, 2025

Successfully merging this pull request may close these issues:

  • Transient error StatusCode.UNAVAILABLE