Skip to content

feat(csharp): implement telemetry client pipeline with circuit breaker protection#305

Merged
jadewang-db merged 12 commits intomainfrom
stack/itr5-pr-telemetry-client-impl
Mar 10, 2026
Merged

feat(csharp): implement telemetry client pipeline with circuit breaker protection#305
jadewang-db merged 12 commits intomainfrom
stack/itr5-pr-telemetry-client-impl

Conversation

@jadewang-db
Copy link
Collaborator

@jadewang-db jadewang-db commented Mar 5, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Implements the telemetry client pipeline for the C# ADBC Databricks driver. This includes the full lifecycle from event ingestion through circuit-breaker-protected export to the Databricks telemetry endpoint.

Key components

  • TelemetryClient — Main client coordinating enqueue → flush → export. Uses SemaphoreSlim to gate concurrent flushes, Interlocked counter for O(1) batch-size checks, and a periodic Timer driven by FlushIntervalMs. Graceful shutdown flushes all pending events before disposing.
  • TelemetryClientManager — Singleton factory managing one client per host with atomic reference counting (AddRef/Release via Interlocked). Prevents rate-limiting by sharing clients across concurrent connections to the same host.
  • TelemetryClientHolder — Holds client + thread-safe ref count using Interlocked.Increment/Decrement.
  • CircuitBreakerTelemetryExporter — Wraps the inner exporter with per-host circuit breaker protection (via Polly). Silently drops events when the circuit is open. Never throws — all exceptions including OperationCanceledException are swallowed per the telemetry contract.
  • CircuitBreakerManager — Singleton managing per-host circuit breakers with cleanup on last client release.
  • ITelemetryClient — Interface defining Enqueue, FlushAsync, CloseAsync, and IAsyncDisposable.
  • TelemetrySessionContext / StatementTelemetryContext — Immutable context objects carrying session-level and statement-level telemetry data.

Design decisions

  • Thread safety: CloseAsync uses Interlocked.Exchange for atomic dispose check; TelemetryClientHolder uses Interlocked for ref counting; flush is gated by SemaphoreSlim(1,1).
  • No data loss on close: CloseAsync performs a final FlushAsync before cancelling the token and disposing resources.
  • Telemetry never impacts driver: All exceptions are swallowed; disabled telemetry skips the entire path via _enabled check in Enqueue.
  • Memory leak prevention: CircuitBreakerManager.RemoveCircuitBreaker is called when the last client for a host is released.

Test plan

  • Unit tests for TelemetryClient (enqueue, flush, close, dispose, error handling)
  • Unit tests for TelemetryClientManager (singleton, ref counting, thread safety, multi-host)
  • Unit tests for CircuitBreakerTelemetryExporter (circuit states, cancellation swallowed, error tracking)
  • Unit tests for TelemetrySessionContext and StatementTelemetryContext
  • All 245 telemetry tests pass

🤖 Generated with Claude Code

@jadewang-db jadewang-db force-pushed the stack/itr5-pr-telemetry-client-impl branch from 861c265 to 8165d60 Compare March 5, 2026 18:31
@jadewang-db jadewang-db force-pushed the stack/itr5-pr-telemetry-client-impl branch from 8165d60 to 49ec024 Compare March 5, 2026 22:42
@jadewang-db jadewang-db force-pushed the stack/itr5-pr-telemetry-client-impl branch from 49ec024 to ecccfc6 Compare March 5, 2026 23:38
@jadewang-db jadewang-db force-pushed the stack/itr5-pr-telemetry-client-impl branch from ecccfc6 to c16a7f2 Compare March 5, 2026 23:53
@jadewang-db jadewang-db force-pushed the stack/itr5-pr-telemetry-client-impl branch from c16a7f2 to 7ee3e3f Compare March 6, 2026 00:26
@jadewang-db
Copy link
Collaborator Author

Code Review Summary — PR #305

Reviewed the incremental changes (10 files, +2435/-16 lines) implementing the telemetry client pipeline: TelemetryClient, TelemetryClientManager, CircuitBreakerManager, CircuitBreakerTelemetryExporter, and associated tests.

Critical Issues (2)

# File Issue
1 TelemetryClient.cs:263 Race condition in CloseAsync_disposed check-then-set is not atomic. Two threads can both proceed past the guard. Use Interlocked.Exchange.
2 TelemetryClient.cs:256 CloseAsync doesn't flush pending logs — docstring and interface contract promise graceful flush, but FlushAsync() is never called. Pending logs are silently lost.

High Issues (1)

# File Issue
3 TelemetryClient.cs:167 Fire-and-forget Task.Run for flush — concurrent flushes race on TryDequeue, causing duplicate partial exports. Unobserved exceptions. No backpressure.

Medium Issues (4)

# File Issue
4 TelemetryClient.cs:59 _cts unused — created and disposed but never passed to any operation. Dead code.
5 TelemetryClientHolder.cs:33 Misleading doc — says "Interlocked" but actual access uses lock.
6 CircuitBreakerTelemetryExporter.cs:120 TOCTOU race — pre-check of circuit state is redundant with Polly's BrokenCircuitException.
7 CircuitBreakerManager.cs:46 Memory leak — circuit breakers accumulate forever, never removed.

Low Issues (2)

# File Issue
8 TelemetryClientManager.cs:62 Unused forTesting parameter
9 TelemetryClientManagerTests.cs:146 HttpClient leaks in tests

Positive Observations

  • Clean separation of concerns: exporter → circuit breaker → client → manager hierarchy
  • Good per-host isolation design for circuit breakers and client sharing
  • Circuit breaker tests are thorough with good coverage of state transitions
  • Proper use of ActivitySource for structured tracing
  • Reference counting in TelemetryClientManager is a sound design for shared resources

Recommendations (prioritized)

  1. Fix CloseAsync: Make it async, call FlushAsync() before cancellation, use Interlocked for _disposed
  2. Fix flush concurrency: Add a SemaphoreSlim(1,1) or use a dedicated background flush task
  3. Either use or remove _cts: If keeping it, create linked tokens in FlushAsync
  4. Add RemoveCircuitBreaker to CircuitBreakerManager to prevent unbounded growth
  5. Remove redundant TOCTOU check in CircuitBreakerTelemetryExporter.ExportAsync

This comment was generated with GitHub MCP.

@jadewang-db jadewang-db marked this pull request as ready for review March 6, 2026 01:10
github-merge-queue bot pushed a commit that referenced this pull request Mar 9, 2026
…ontext, TelemetrySessionContext) (#301)

## 🥞 Stacked PR
Use this
[link](https://github.com/adbc-drivers/databricks/pull/301/files) to
review incremental changes.
-
[**stack/itr5-pr-telemetry-models**](#301)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/301/files)]
-
[stack/itr5-pr-telemetry-client-impl](#305)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/305/files/cfac5affae5a0c405d91a9246189ec7b43908ab0..88f5b5587bea7bd5cfb7e9e3877061f015e85359)]
-
[stack/itr5-pr-e2e-tests](#308)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/308/files/88f5b5587bea7bd5cfb7e9e3877061f015e85359..f33b25cd7eaae3925ddb7ea77daf9bc1ba813ad7)]

---------
## What's Changed

Please fill in a description of the changes here.

**This contains breaking changes.** <!-- Remove this line if there are
no breaking changes. -->

Closes #NNN.

---------

Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jadewang-db jadewang-db force-pushed the stack/itr5-pr-telemetry-client-impl branch from 88f5b55 to 7390faf Compare March 10, 2026 00:22
Jade Wang and others added 11 commits March 10, 2026 00:22
- Fix final flush no-op by using _closing flag before _disposed
- Drain all batches in FlushAsync instead of just one
- Fix race condition in TelemetryClientManager with lock-based synchronization
- Replace Debug.WriteLine with Activity.Current?.AddEvent() for tracing
- Change TelemetrySessionContext setters to internal
- Use Interlocked counter for queue size instead of ConcurrentQueue.Count
- Use blocking wait in FlushAsync during close
- Use test-isolated TelemetryClientManager instances with proper cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…metryMetric

These are no longer used. TelemetryClient now uses a direct
enqueue → flush → export pipeline without the Activity-based
aggregation layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…\nTask ID: task-5.1-telemetry-client-implementation
Remove ActivityListener and MetricsAggregator dependencies. TelemetryClient
now directly enqueues TelemetryFrontendLog objects and flushes them via
the circuit breaker-protected exporter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jadewang-db jadewang-db force-pushed the stack/itr5-pr-telemetry-client-impl branch from 7390faf to d4f1460 Compare March 10, 2026 00:22
@jadewang-db jadewang-db enabled auto-merge March 10, 2026 00:23
@jadewang-db jadewang-db added this pull request to the merge queue Mar 10, 2026
github-merge-queue bot pushed a commit that referenced this pull request Mar 10, 2026
…r protection (#305)

## 🥞 Stacked PR
Use this
[link](https://github.com/adbc-drivers/databricks/pull/305/files) to
review incremental changes.
-
[**stack/itr5-pr-telemetry-client-impl**](#305)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/305/files)]
-
[stack/itr5-pr-e2e-tests](#308)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/308/files/d4f1460a559b0d1b18cf3f6869330f7a5c05fd19..488eb22c870625c990252f99d507e27987c9d12c)]

---------
## Summary

Implements the telemetry client pipeline for the C# ADBC Databricks
driver. This includes the full lifecycle from event ingestion through
circuit-breaker-protected export to the Databricks telemetry endpoint.

### Key components

- **`TelemetryClient`** — Main client coordinating enqueue → flush →
export. Uses `SemaphoreSlim` to gate concurrent flushes, `Interlocked`
counter for O(1) batch-size checks, and a periodic `Timer` driven by
`FlushIntervalMs`. Graceful shutdown flushes all pending events before
disposing.
- **`TelemetryClientManager`** — Singleton factory managing one client
per host with atomic reference counting (`AddRef`/`Release` via
`Interlocked`). Prevents rate-limiting by sharing clients across
concurrent connections to the same host.
- **`TelemetryClientHolder`** — Holds client + thread-safe ref count
using `Interlocked.Increment`/`Decrement`.
- **`CircuitBreakerTelemetryExporter`** — Wraps the inner exporter with
per-host circuit breaker protection (via Polly). Silently drops events
when the circuit is open. Never throws — all exceptions including
`OperationCanceledException` are swallowed per the telemetry contract.
- **`CircuitBreakerManager`** — Singleton managing per-host circuit
breakers with cleanup on last client release.
- **`ITelemetryClient`** — Interface defining `Enqueue`, `FlushAsync`,
`CloseAsync`, and `IAsyncDisposable`.
- **`TelemetrySessionContext` / `StatementTelemetryContext`** —
Immutable context objects carrying session-level and statement-level
telemetry data.

### Design decisions

- **Thread safety**: `CloseAsync` uses `Interlocked.Exchange` for atomic
dispose check; `TelemetryClientHolder` uses `Interlocked` for ref
counting; flush is gated by `SemaphoreSlim(1,1)`.
- **No data loss on close**: `CloseAsync` performs a final `FlushAsync`
before cancelling the token and disposing resources.
- **Telemetry never impacts driver**: All exceptions are swallowed;
disabled telemetry skips the entire path via `_enabled` check in
`Enqueue`.
- **Memory leak prevention**:
`CircuitBreakerManager.RemoveCircuitBreaker` is called when the last
client for a host is released.

## Test plan

- [x] Unit tests for `TelemetryClient` (enqueue, flush, close, dispose,
error handling)
- [x] Unit tests for `TelemetryClientManager` (singleton, ref counting,
thread safety, multi-host)
- [x] Unit tests for `CircuitBreakerTelemetryExporter` (circuit states,
cancellation swallowed, error tracking)
- [x] Unit tests for `TelemetrySessionContext` and
`StatementTelemetryContext`
- [x] All 245 telemetry tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jadewang-db jadewang-db removed this pull request from the merge queue due to a manual request Mar 10, 2026
@jadewang-db jadewang-db enabled auto-merge March 10, 2026 00:38
@jadewang-db jadewang-db added this pull request to the merge queue Mar 10, 2026
@jadewang-db jadewang-db added the integration-test Trigger integration tests in internal repo label Mar 10, 2026
@github-actions
Copy link

🚀 Integration tests triggered! View workflow run

Merged via the queue into main with commit 8571516 Mar 10, 2026
26 of 28 checks passed
@jadewang-db jadewang-db deleted the stack/itr5-pr-telemetry-client-impl branch March 10, 2026 00:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integration-test Trigger integration tests in internal repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants