Environment details
- OS type and version: macOS / Linux
- Python version: 3.13
- google-cloud-spanner version: 3.63.0 (current main)
Description
Every Spanner operation that goes through trace_call() produces orphan OpenTelemetry metric data points with incomplete resource labels (missing project_id and instance_id). These orphan data points persist for the process lifetime due to cumulative aggregation and are re-exported to Cloud Monitoring every 60 seconds, which rejects them with:
INVALID_ARGUMENT: One or more TimeSeries could not be written:
timeSeries[...]: the set of resource labels is incomplete, missing (instance_id)
Root cause
trace_call() in _opentelemetry_tracing.py wraps every operation with a bare MetricsCapture() (no resource_info). Meanwhile, every caller of trace_call already provides its own MetricsCapture(self._resource_info) with correct labels.
When Python evaluates with trace_call(...) as span, MetricsCapture(self._resource_info):, two separate MetricsTracer instances are created:
- tracer_A (from trace_call's internal MetricsCapture()): has instance_config, location, client_hash, client_uid, and client_name from the factory, but never receives project_id or instance_id
- tracer_B (from the caller's MetricsCapture(resource_info)): has the correct labels and overwrites tracer_A in the context var
On exit, tracer_B records correct metrics first; then tracer_A records metrics with incomplete labels. The SpannerMetricsTracerFactory never has project_id/instance_id in its _client_attributes (those labels are set only per-tracer, via resource_info or the MetricsInterceptor), so tracer_A always starts without them and is never populated later: the MetricsInterceptor only touches the current context-var tracer, which is tracer_B.
With OpenTelemetry's cumulative aggregation, once these orphan aggregation buckets are created, they persist for the process lifetime and are re-exported every 60 seconds.
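The double-capture mechanism above can be modeled with a short, self-contained sketch. All names here (MetricsCapture, trace_call, the label dicts) are simplified stand-ins for illustration, not the library's real implementation: each capture creates a tracer on entry and records its labels on exit, and only the caller's capture carries resource_info.

```python
import contextvars
from contextlib import contextmanager

_current_tracer = contextvars.ContextVar("current_tracer", default=None)
recorded = []  # label sets recorded on __exit__, in exit order


class MetricsCapture:
    """Simplified stand-in: builds a tracer on enter, records it on exit."""

    def __init__(self, resource_info=None):
        self.resource_info = resource_info

    def __enter__(self):
        # Factory-level attributes only; project_id/instance_id arrive
        # solely via resource_info (or the gRPC interceptor).
        labels = {"client_name": "spanner-python"}
        if self.resource_info:
            labels.update(self.resource_info)
        self.tracer = labels
        _current_tracer.set(labels)  # overwrites any earlier tracer
        return self

    def __exit__(self, *exc):
        recorded.append(self.tracer)
        return False


@contextmanager
def trace_call(name):
    # The bug: trace_call wraps the span in a bare MetricsCapture().
    with MetricsCapture():
        yield f"span:{name}"


# Caller side: both context managers run, creating two tracers.
with trace_call("execute_sql") as span, \
        MetricsCapture({"project_id": "p", "instance_id": "i"}):
    pass

# Exit order: the caller's capture (tracer_B, complete labels) exits
# first, then trace_call's bare capture (tracer_A, incomplete labels).
print(recorded)
```

Running this shows two recorded label sets: the first has project_id and instance_id, the second (tracer_A, the orphan) does not, matching the INVALID_ARGUMENT rejection described above.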
History
- Originally, all MetricsCapture() instances were bare, including the one in trace_call. The design relied on MetricsInterceptor to populate labels during gRPC calls.
- A later change introduced the _resource_info property and changed all caller sites from MetricsCapture() to MetricsCapture(self._resource_info) for eager label propagation. However, the bare MetricsCapture() inside trace_call was not removed, making it redundant and harmful.
Impact
- Affects every Spanner operation (~27 code paths) on every invocation
- Creates persistent orphan metric aggregation buckets
- Produces repeated INVALID_ARGUMENT error logs every 60 seconds
- Wastes CPU/network on exporting invalid TimeSeries
- Application functionality is unaffected; valid metrics from the caller's MetricsCapture still work
Steps to reproduce
- Create a spanner.Client() with metrics enabled (default)
- Perform any Spanner operation (e.g., session.create(), snapshot.execute_sql())
- Observe INVALID_ARGUMENT errors logged from the metrics exporter every 60 seconds
Suggested fix
Remove the bare MetricsCapture() from trace_call — it is redundant since every caller already provides its own. See PR googleapis/python-spanner#1522.
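As a rough sketch of the shape of the fix (simplified and with an assumed signature, not the actual module code), trace_call would manage only the OpenTelemetry span, leaving metrics capture entirely to the caller's MetricsCapture(self._resource_info):

```python
from contextlib import contextmanager


@contextmanager
def trace_call(name, extra_attributes=None):
    # Before the fix (conceptually):
    #     with MetricsCapture():   # bare capture -> orphan tracer_A
    #         yield span
    # After the fix: no metrics capture here at all; only one tracer
    # (the caller's, with full resource labels) is ever created.
    yield f"span:{name}"


with trace_call("session.create") as span:
    observed = span
```

With this change, each operation exits exactly one MetricsCapture, so no incomplete label set is ever recorded into a cumulative aggregation bucket.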