feat(csharp): fix Phase 1 Thrift telemetry gaps with E2E test coverage #345
jadewang-db wants to merge 23 commits into main from stack/fix-telemetry-gaps-design
Conversation
Code Review — PR #345

Executive Summary

This PR fills telemetry gaps in the Thrift code path with comprehensive E2E test coverage. The telemetry field additions are well-structured, but there are concerns around test isolation with the static exporter override.

Severity Summary
Positive Observations
Critical

1. The previous implementation used AsyncLocal for the exporter override; a plain static override lets parallel tests interfere with each other. Suggestion: Restore the AsyncLocal-based ExporterOverride.

High

2. GetObjects/GetTableTypes emit ResultsConsumed telemetry before results are consumed.
3. If …
4. Initial chunk latency uses first-to-complete rather than first-chunk semantics.
5. Tests silently pass when no telemetry is captured. When no events are exported, the assertions never run; use Skip.If instead.
6. All other test files use …
7. Blocking .Result calls on async methods; use await instead.

Medium

8. Despite the name …
9. Missing …
10. ~120 lines of duplicated telemetry boilerplate. GetObjects and GetTableTypes contain nearly identical telemetry code. Extract it into a shared helper like ExecuteWithMetadataTelemetry<T>().
11. Allocates a new Process handle on every call; cache the Process.GetCurrentProcess() result.
12. Unlike …
13. Reflection-based testing is fragile. Property names accessed via reflection will throw at runtime if renamed; use nameof instead.
14. Reader resource leak on assertion failure; dispose readers in finally blocks.
15. C# REST integration tests no longer auto-triggered. SEA/REST dispatch logic was removed; confirm REST testing is covered elsewhere.
16. Six … Tests for iteration, error propagation, and schema error handling were removed with …
17. Stored solely for workspace ID extraction during telemetry init. After init, the reference keeps the entire session response alive; set it to null after use.

Low
Recommendations (prioritized)
Comprehensive gap analysis of telemetry proto field coverage including:
- SEA connections have zero telemetry (highest priority)
- ChunkDetails.SetChunkDetails() defined but never called
- Missing fields: auth_type, WorkspaceId, runtime_vendor, client_app_name
- Composition via TelemetryHelper chosen over abstract base class
- E2E test strategy for all proto fields across both protocols

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…A second) Co-authored-by: Isaac
… ID: task-1.1-e2e-test-infrastructure
…ion\n\nTask ID: task-1.2-system-config-missing-fields
…task-1.5-connection-params-extended
…-1.6-chunk-metrics-aggregation
…sk-1.7-expose-chunk-metrics-reader
… ID: task-1.8-call-set-chunk-details
…track-internal-calls
…Task ID: task-1.11-metadata-operation-telemetry
The demo directory was tracked as a git submodule in the index but had no entry in .gitmodules, causing all CI jobs to fail at checkout. Co-authored-by: Isaac
…remove scratch files
- Remove scratch/design docs not meant for PR (PHASE1_TEST_RESULTS.md, TELEMETRY_TIMING_ISSUE.md, fix-telemetry-gaps-design.md)
- Remove accidentally committed backup file (DatabricksConnection.cs.backup)
- Fix trailing whitespace in test and doc files
- Fix license header in ChunkMetrics.cs (was using "modified" header for new file)

Co-authored-by: Isaac
Co-authored-by: Isaac
… resource cleanup
- Restore AsyncLocal for ExporterOverride to prevent parallel test interference
- Add missing IsInternalCall assertion in InternalCallTests
- Replace silent-pass with Skip.If when no telemetry captured in baseline tests
- Use await instead of .Result for async calls in MetadataOperationTests
- Add TimestampMillis to metadata operation telemetry context
- Cache Process.GetCurrentProcess() call in BuildSystemConfiguration
- Move reader disposal to finally blocks in ChunkMetricsReaderTests

Co-authored-by: Isaac
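The AsyncLocal restoration mentioned in this commit can be sketched as follows. This is an illustrative shape only; the type and member names here (ITelemetryExporter, TelemetryExporterRegistry) are hypothetical, not the driver's actual API.

```csharp
using System.Threading;

// Hypothetical sketch: an AsyncLocal-backed exporter override so parallel
// test cases each see only their own telemetry capture. A plain static
// field would leak one test's override into concurrently running tests.
public interface ITelemetryExporter
{
    void Export(object telemetryEvent);
}

public static class TelemetryExporterRegistry
{
    // AsyncLocal flows with the async execution context, so a value set in
    // one test is invisible to tests running in parallel.
    private static readonly AsyncLocal<ITelemetryExporter?> s_override =
        new AsyncLocal<ITelemetryExporter?>();

    public static ITelemetryExporter? ExporterOverride
    {
        get => s_override.Value;
        set => s_override.Value = value;
    }
}
```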
Force-pushed from ebb8f6c to d504b0a
Range-diff: stack/fix-telemetry-gaps-design (ebb8f6c -> d504b0a)
Reproduce locally:
```csharp
/// Gets or sets the maximum time taken to download any single chunk in milliseconds.
/// Identifies the slowest chunk download, useful for identifying performance outliers.
/// </summary>
public long SlowestChunkLatencyMs { get; set; }
```
Does this make sense? Each chunk can have a different size, so maybe it's better to track the raw speed instead, i.e. chunk size / total time.
Good point — SlowestChunkLatencyMs is less meaningful when chunk sizes vary. I'll add a throughput-based metric in a follow-up. For now, SumChunksDownloadTimeMs / TotalChunksIterated gives average latency, and we track TotalChunksPresent + TotalChunksIterated for volume. Adding per-chunk size tracking would require changes to the download pipeline to capture byte counts, which I'd prefer to scope separately.
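As a rough sketch of the arithmetic in this exchange: SumChunksDownloadTimeMs and TotalChunksIterated follow the PR discussion, while TotalBytesDownloaded and the throughput property are hypothetical, illustrating the reviewer's size-normalized suggestion rather than anything this PR implements.

```csharp
// Illustrative sketch only (not the driver's actual ChunkMetrics class).
public class ChunkMetricsSketch
{
    public long SumChunksDownloadTimeMs { get; set; }
    public long TotalChunksIterated { get; set; }
    public long TotalBytesDownloaded { get; set; } // hypothetical field

    // Average per-chunk latency: what the PR reports today.
    public double AverageChunkLatencyMs =>
        TotalChunksIterated == 0
            ? 0
            : (double)SumChunksDownloadTimeMs / TotalChunksIterated;

    // Size-normalized throughput (bytes per millisecond): robust to varying
    // chunk sizes, unlike SlowestChunkLatencyMs.
    public double ThroughputBytesPerMs =>
        SumChunksDownloadTimeMs == 0
            ? 0
            : (double)TotalBytesDownloaded / SumChunksDownloadTimeMs;
}
```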
csharp/src/DatabricksConnection.cs
Outdated
```csharp
{
    try
    {
        telemetryContext.RecordFirstBatchReady();
```
Same issue here for the metadata query path.
✅ Done — Same fix applied. Both GetObjects and GetTableTypes now use the shared ExecuteWithMetadataTelemetry<T>() helper with no batch/consumption timing.
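A minimal sketch of what such a shared wrapper could look like, assuming illustrative names; the PR's actual ExecuteWithMetadataTelemetry<T>() signature and context type may differ.

```csharp
using System;

// Sketch: the context and helper names are assumptions for illustration.
public sealed class MetadataTelemetryContext
{
    public string Operation { get; }
    public Exception? Error { get; private set; }

    public MetadataTelemetryContext(string operation) => Operation = operation;

    public void RecordError(Exception ex) => Error = ex;

    // Flush the event to the exporter. Deliberately records no batch-ready
    // or results-consumed timing, matching the fix described above.
    public void Emit() { }
}

public static class MetadataTelemetry
{
    // Wraps a metadata operation with telemetry so GetObjects and
    // GetTableTypes stop duplicating boilerplate.
    public static T ExecuteWithMetadataTelemetry<T>(string operationName, Func<T> operation)
    {
        var context = new MetadataTelemetryContext(operationName);
        try
        {
            return operation();
        }
        catch (Exception ex)
        {
            context.RecordError(ex);
            throw;
        }
        finally
        {
            context.Emit();
        }
    }
}
```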
csharp/src/DatabricksConnection.cs
Outdated
```csharp
}

// Strategy 2: Check connection property as fallback
if (workspaceId == 0 && Properties.TryGetValue("adbc.databricks.workspace_id", out string? workspaceIdProp))
```
I don't think we ever supported this param: adbc.databricks.workspace_id
✅ Done — Removed the adbc.databricks.workspace_id property lookup and the config-based extraction. Now using PropertyHelper.ParseOrgIdFromProperties(Properties) which extracts the org ID from the HTTP path query string (e.g., ?o=12345). This is the same org ID used for the x-databricks-org-id header elsewhere in the driver.
Why does this lookup still exist? adbc.databricks.workspace_id
Extracting the org ID from the HTTP path would only work for SPOG URLs, where the org ID is present, not for current legacy URLs.
I think the best option is still to extract the orgId from the OpenSession response HTTP header.
In telemetry reporting there is no need for the workspace ID; we should remove this logic.
✅ Done — Good point about SPOG vs legacy URLs. Replaced PropertyHelper.ParseOrgIdFromProperties() with a new OrgIdCaptureHandler (DelegatingHandler) that captures the x-databricks-org-id header from the first successful HTTP response. This works for both SPOG and legacy URLs since the header is always present in Thrift call responses. Also removed the test that used the unsupported adbc.databricks.workspace_id connection property.
This comment was generated with GitHub MCP.
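A minimal sketch of such a header-capturing DelegatingHandler follows. The header name x-databricks-org-id comes from the discussion above; the class shape and member names are assumptions, not the PR's actual OrgIdCaptureHandler.

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch: captures the org ID from the first successful HTTP response that
// carries the header, which works for both SPOG and legacy URLs.
public sealed class OrgIdCaptureHandler : DelegatingHandler
{
    private long _orgId;

    // 0 until a response carrying the header has been seen.
    public long OrgId => Interlocked.Read(ref _orgId);

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = await base.SendAsync(request, cancellationToken).ConfigureAwait(false);

        // Capture once; later responses are ignored after the first success.
        if (response.IsSuccessStatusCode
            && Interlocked.Read(ref _orgId) == 0
            && response.Headers.TryGetValues("x-databricks-org-id", out var values))
        {
            foreach (var value in values)
            {
                if (long.TryParse(value, out var parsed))
                {
                    Interlocked.CompareExchange(ref _orgId, parsed, 0);
                    break;
                }
            }
        }

        return response;
    }
}
```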
csharp/src/DatabricksConnection.cs
Outdated
```csharp
/// <returns>The batch size value.</returns>
private int GetBatchSize()
{
    const int DefaultBatchSize = 50000; // HiveServer2Connection.BatchSizeDefault
```
Databricks default is 2M:

```csharp
private const long DatabricksBatchSizeDefault = 2000000;
```
✅ Done — Fixed default to 2000000 to match DatabricksStatement.DatabricksBatchSizeDefault.
```csharp
private bool enableComplexDatatypeSupport;
private Dictionary<string, string>? confOverlay;
internal string? StatementId { get; set; }
private QueryResult? _lastQueryResult; // Track last query result for telemetry chunk metrics
```
There are some metadata queries in DatabricksStatement, like GetTablesAsync etc. Do we plan to emit telemetry for those?
The metadata queries in DatabricksStatement (like GetTablesAsync, GetColumnsAsync) go through the Thrift execute path and already pick up telemetry from _pendingTelemetryContext in the statement lifecycle. The GetObjects/GetTableTypes overrides in DatabricksConnection handle the ADBC-level metadata API calls that bypass the statement path. So both paths are covered.
This seems wrong: metadata calls like GetTablesAsync/GetColumnsAsync route to HiveServer2Connection, which directly calls the Thrift metadata methods: https://github.com/adbc-drivers/hiveserver2/blob/825f9b96b810c9fbbcfa452392b1f4f903547568/csharp/src/Hive2/HiveServer2Connection.cs#L1049, so they do not go through the executeQuery path in this file.
please verify this.
You're right — GetTablesAsync/GetColumnsAsync route directly to HiveServer2Connection Thrift metadata methods, not through DatabricksStatement.ExecuteQuery. My earlier reply was incorrect. Those metadata calls don't go through the statement telemetry path. I'll file a follow-up to add telemetry for the HiveServer2-level metadata calls.
This comment was generated with GitHub MCP.
jadewang-db left a comment
Submitting pending review to unblock comment replies.
Range-diff: stack/fix-telemetry-gaps-design (d504b0a -> 6a928e7)
Reproduce locally:
Range-diff: stack/fix-telemetry-gaps-design (6a928e7 -> 490d943)
Reproduce locally:
csharp/src/DatabricksConnection.cs
Outdated
```csharp
private string GetClientAppName(string processName)
{
    // Check connection property first, fall back to process name
    Properties.TryGetValue("adbc.databricks.client_app_name", out string? appName);
```
I'm not seeing adbc.databricks.client_app_name as a valid Databricks parameter?
this method is not needed anymore, it should be removed.
✅ Already removed — GetClientAppName method and adbc.databricks.client_app_name property lookup were removed in the previous round. ClientAppName is now set directly to processName in BuildSystemConfiguration().
This comment was generated with GitHub MCP.
…ID from response header
- Use DatabricksStatement.DatabricksBatchSizeDefault directly (changed to internal)
- Use ConnectTimeoutMilliseconds from base class instead of duplicate local const
- Extract workspace ID from x-databricks-org-id response header via new OrgIdCaptureHandler, replacing HTTP path parsing (works for SPOG + legacy URLs)
- Remove test using unsupported adbc.databricks.workspace_id property

Co-authored-by: Isaac
Range-diff: stack/fix-telemetry-gaps-design (490d943 -> 19ae1c0)
Reproduce locally:
…data commands

Statement-level metadata commands (getcatalogs, gettables, getcolumns, etc.) executed via DatabricksStatement.ExecuteQuery were incorrectly tagged as StatementType.Query/OperationType.ExecuteStatement. This fix correctly emits StatementType.Metadata with the appropriate OperationType (ListCatalogs, ListTables, ListColumns, etc.), aligning with the connection-level GetObjects telemetry. The two paths remain distinguishable via sql_statement_id (populated for the statement path, empty for the GetObjects path).

Co-authored-by: Isaac
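The command-to-operation tagging described in this commit could be sketched as below. The OperationType members follow the commit message, but the command strings and the driver's actual types are assumptions for illustration.

```csharp
// Illustrative mapping only, not the driver's real implementation.
public enum OperationType
{
    ExecuteStatement,
    ListCatalogs,
    ListTables,
    ListColumns,
}

public static class MetadataOperationMapper
{
    // Statement-level metadata commands get StatementType.Metadata plus the
    // matching OperationType; anything else stays on the query path.
    public static OperationType? FromCommand(string command) => command switch
    {
        "getcatalogs" => OperationType.ListCatalogs,
        "gettables" => OperationType.ListTables,
        "getcolumns" => OperationType.ListColumns,
        _ => null, // regular SQL keeps StatementType.Query / ExecuteStatement
    };
}
```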
Range-diff: stack/fix-telemetry-gaps-design (19ae1c0 -> 8da5457)
Reproduce locally: |

🥞 Stacked PR
Use this link to review incremental changes.
Summary
Closes telemetry gaps in the Thrift (HiveServer2) code path by populating missing fields in telemetry events and adding comprehensive E2E test coverage.
Telemetry field fixes
- `runtime_vendor` and `client_app_name` in `DriverSystemConfiguration`
- `auth_type` on the root telemetry log
- `WorkspaceId` in `TelemetrySessionContext`
- `DriverConnectionParameters` with additional fields
- `ChunkMetrics` aggregation in `CloudFetchDownloader`, exposed via the `CloudFetchReader` interface, with `SetChunkDetails()` called in `DatabricksStatement.EmitTelemetry()`
- `retry_count` in `SqlExecutionEvent`
- `is_internal_call` flag for `GetObjects` and `GetTableTypes`

E2E test infrastructure & coverage
- `CapturingTelemetryExporter` test infrastructure for intercepting and asserting on telemetry events

Test plan
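A capturing exporter for test assertions could be as simple as the following sketch; this is illustrative only, and the PR's actual CapturingTelemetryExporter may have a different shape.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch: a thread-safe queue collects every exported event so tests can
// assert on exactly what was emitted during a query.
public sealed class CapturingTelemetryExporter
{
    private readonly ConcurrentQueue<object> _events = new ConcurrentQueue<object>();

    public void Export(object telemetryEvent) => _events.Enqueue(telemetryEvent);

    public IReadOnlyCollection<object> CapturedEvents => _events;
}
```

A test would install the exporter, run the operation under test, and then assert on CapturedEvents.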