chore: Add service.instance.id to OpenTelemetry #8514
Introduce an optional service_instance_id on OpenTelemetrySpec and include it as the service.instance.id attribute when present. Propagate service_instance_id from various servers (agent, appproxy coordinator/worker, manager, storage) using meta.display_name, and set a hostname-based instance id for the web server (webserver-{hostname}). This makes telemetry resources identify individual service instances for easier tracing and debugging.
Pull request overview
This PR adds support for the OpenTelemetry service.instance.id attribute to improve service instance identification in distributed tracing and logging. The change introduces an optional service_instance_id field to the OpenTelemetrySpec dataclass and propagates unique instance identifiers from all Backend.AI services.
Changes:
- Added optional service_instance_id field to OpenTelemetrySpec with conditional inclusion in OpenTelemetry resource attributes
- Updated all service servers to pass instance-specific identifiers: most services use meta.display_name, while the web server uses a hostname-based identifier
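The conditional inclusion described above might look roughly like the following. This is a minimal sketch, not the code in src/ai/backend/logging/otel.py: the class name `OpenTelemetrySpecSketch`, the `service_name` and `service_version` fields, and the method `to_resource_attributes` are assumptions made for illustration, and a plain dict stands in for the OpenTelemetry resource object.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OpenTelemetrySpecSketch:
    # Hypothetical stand-in for OpenTelemetrySpec. Only service_instance_id
    # comes from the PR description; the other fields are assumed.
    service_name: str
    service_version: str
    service_instance_id: Optional[str] = None

    def to_resource_attributes(self) -> Dict[str, str]:
        # Attributes that are always present in this sketch.
        attributes = {
            "service.name": self.service_name,
            "service.version": self.service_version,
        }
        # Emit service.instance.id only when the caller supplied a value,
        # so callers that omit it produce the same attributes as before.
        if self.service_instance_id is not None:
            attributes["service.instance.id"] = self.service_instance_id
        return attributes
```

Keeping the field optional means a server that does not yet pass an identifier still produces a valid attribute set.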
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/ai/backend/logging/otel.py | Added optional service_instance_id field to OpenTelemetrySpec and conditional logic to include it in resource attributes |
| src/ai/backend/agent/server.py | Set service_instance_id to meta.display_name for agent instances |
| src/ai/backend/appproxy/coordinator/server.py | Set service_instance_id to meta.display_name for appproxy coordinator instances |
| src/ai/backend/appproxy/worker/server.py | Set service_instance_id to meta.display_name for appproxy worker instances |
| src/ai/backend/manager/server.py | Set service_instance_id to meta.display_name for manager instances |
| src/ai/backend/storage/server.py | Set service_instance_id to meta.display_name for storage proxy instances |
| src/ai/backend/web/server.py | Set service_instance_id to hostname-based identifier for web server instances |
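For the web server row, a hostname-based identifier in the webserver-{hostname} form could be derived as below. This is a sketch under assumptions: the helper name `web_server_instance_id` is hypothetical, and the PR does not say how the hostname is obtained, so `socket.gethostname()` is used here as one plausible source.

```python
import socket
from typing import Optional


def web_server_instance_id(hostname: Optional[str] = None) -> str:
    # Hypothetical helper: fall back to the local hostname when none is given.
    host = hostname if hostname is not None else socket.gethostname()
    # Matches the webserver-{hostname} pattern named in the PR description.
    return f"webserver-{host}"
```

Unlike the other services, the identifier here comes from the machine rather than from meta.display_name, so two web server processes on the same host would share it.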
Replace the explicit service_id UUID with a service_instance_name and generate a stable service.instance.id using UUID v5. OpenTelemetrySpec now accepts service_instance_name and to_resource creates service.instance.id (v5 using an OTEL namespace UUID) and service.instance.name attributes. Updated callers in agent, coordinator, worker, manager, storage, and web servers to pass service_instance_name and removed ad-hoc uuid4 generation in the web server. This makes service instance IDs deterministic and consistent across restarts.
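The deterministic ID described in this commit can be sketched with the standard library's `uuid.uuid5`. The namespace value below is a placeholder: the commit mentions an OTEL namespace UUID but does not give it, so `OTEL_NAMESPACE_SKETCH` and the function name `instance_attributes` are assumptions for illustration only.

```python
import uuid
from typing import Dict

# Placeholder namespace; the actual namespace UUID used by the PR is not
# stated in the commit message.
OTEL_NAMESPACE_SKETCH = uuid.uuid5(uuid.NAMESPACE_DNS, "otel-namespace.example")


def instance_attributes(service_instance_name: str) -> Dict[str, str]:
    # UUID v5 hashes the namespace and name, so the same name always
    # yields the same ID, including after a process restart.
    instance_id = uuid.uuid5(OTEL_NAMESPACE_SKETCH, service_instance_name)
    return {
        "service.instance.id": str(instance_id),
        "service.instance.name": service_instance_name,
    }
```

Because the ID is a pure function of the name, two processes started with the same instance name cannot be told apart by service.instance.id alone.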
.enhance.md -> 8514.enhance.md Co-authored-by: octodog <mu001@lablup.com>
Thread a service_instance_id UUID into the OpenTelemetry spec and resource attributes, replacing the previous UUIDv5-from-name approach. This lets each process expose a unique, per-restart service.instance.id (web server now generates a uuid4 at startup) for finer-grained per-instance log filtering in Loki/Grafana. Updated changelog and applied the new field across agent, coordinator, worker, manager, storage and web servers.
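A per-restart identifier of the kind this commit describes could be produced as follows. This is a sketch only: `new_process_instance_id` is a hypothetical helper, and where the value is generated and stored in each server is not shown in the commit message.

```python
import uuid


def new_process_instance_id() -> uuid.UUID:
    # uuid4 is random, so each call (for example, once per process start)
    # yields a different value and restarts become distinguishable.
    return uuid.uuid4()


# Generated once at startup and reused for the lifetime of the process.
SERVICE_INSTANCE_ID = new_process_instance_id()
resource_attributes = {"service.instance.id": str(SERVICE_INSTANCE_ID)}
```

The trade-off against a name-derived ID is that the value no longer survives a restart, which is what allows filtering logs for a single run of a service.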