@Yaminyam (Member) commented Feb 2, 2026

Introduce an optional service_instance_id field on OpenTelemetrySpec and emit it as the service.instance.id resource attribute when it is set. The agent, appproxy coordinator and worker, manager, and storage servers pass meta.display_name as their service_instance_id, while the web server uses a hostname-based id (webserver-{hostname}). With this, telemetry resources identify individual service instances, which makes tracing and debugging easier.
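The conditional attribute described above can be sketched as follows. This is a minimal sketch, not the PR's code: it assumes the resource attributes are built as a plain dict (the real implementation presumably passes them to the OpenTelemetry SDK), and the `service_name`/`service_version` fields and the `to_resource_attributes` method name are hypothetical. Only `service_instance_id` and the `service.instance.id` key come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OpenTelemetrySpec:
    # Hypothetical minimal field set; the real spec carries more settings.
    service_name: str
    service_version: str
    # New optional field: None means "no instance id configured".
    service_instance_id: Optional[str] = None

    def to_resource_attributes(self) -> dict[str, str]:
        # Keys follow the OpenTelemetry resource semantic conventions.
        attributes = {
            "service.name": self.service_name,
            "service.version": self.service_version,
        }
        # Only emit service.instance.id when an id was provided, so callers
        # that omit it keep producing the same resource as before.
        if self.service_instance_id is not None:
            attributes["service.instance.id"] = self.service_instance_id
        return attributes
```

Keeping the field optional with a `None` default means existing call sites remain valid and only services that opt in gain the new attribute.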

@github-actions github-actions bot added size:S 10~30 LoC comp:manager Related to Manager component comp:agent Related to Agent component comp:webserver Related to Web Server component comp:storage-proxy Related to Storage proxy component comp:app-proxy Related to App Proxy component labels Feb 2, 2026

Copilot AI left a comment

Pull request overview

This PR adds support for the OpenTelemetry service.instance.id attribute to improve service instance identification in distributed tracing and logging. The change introduces an optional service_instance_id field to the OpenTelemetrySpec dataclass and propagates unique instance identifiers from all Backend.AI services.

Changes:

  • Added optional service_instance_id field to OpenTelemetrySpec with conditional inclusion in OpenTelemetry resource attributes
  • Updated all service servers to pass instance-specific identifiers: most services use meta.display_name, while the web server uses a hostname-based identifier

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/ai/backend/logging/otel.py | Added optional service_instance_id field to OpenTelemetrySpec and conditional logic to include it in resource attributes |
| src/ai/backend/agent/server.py | Set service_instance_id to meta.display_name for agent instances |
| src/ai/backend/appproxy/coordinator/server.py | Set service_instance_id to meta.display_name for appproxy coordinator instances |
| src/ai/backend/appproxy/worker/server.py | Set service_instance_id to meta.display_name for appproxy worker instances |
| src/ai/backend/manager/server.py | Set service_instance_id to meta.display_name for manager instances |
| src/ai/backend/storage/server.py | Set service_instance_id to meta.display_name for storage proxy instances |
| src/ai/backend/web/server.py | Set service_instance_id to a hostname-based identifier for web server instances |
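The per-server wiring summarized in this table can be sketched as a single selection rule. This is an illustrative sketch only: `ServiceMeta` is a hypothetical stand-in for each server's config section that exposes `display_name`, and `service_instance_id_for` is a hypothetical helper; in the PR each server sets the value directly in its own startup code.

```python
import socket
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceMeta:
    # Hypothetical stand-in for the config section exposing display_name.
    display_name: str


def service_instance_id_for(component: str, meta: Optional[ServiceMeta] = None) -> str:
    """Pick a service.instance.id value following the table above (hypothetical helper)."""
    if component == "webserver":
        # The web server uses a hostname-based id instead of meta.display_name.
        return f"webserver-{socket.gethostname()}"
    if meta is None:
        raise ValueError(f"{component} needs meta.display_name")
    # Agent, appproxy coordinator/worker, manager, and storage proxy.
    return meta.display_name
```

Note that `display_name` is only as distinctive as the operator's configuration makes it; two nodes configured with the same name would report the same instance id.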

Replace the explicit service_id UUID with a service_instance_name and derive a stable service.instance.id from it using UUID v5. OpenTelemetrySpec now accepts service_instance_name, and to_resource sets two attributes: service.instance.id (a UUID v5 generated under an OTel namespace UUID) and service.instance.name. Callers in the agent, coordinator, worker, manager, storage, and web servers now pass service_instance_name, and the ad-hoc uuid4 generation in the web server is removed. This makes service instance IDs deterministic and consistent across restarts.
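A minimal sketch of the UUID v5 derivation, assuming the "OTEL namespace UUID" is the namespace suggested by the OpenTelemetry semantic conventions for `service.instance.id` (the PR's actual constant is not shown here). The helper name `instance_attributes_from_name` is hypothetical.

```python
import uuid

# Assumed: the namespace the OpenTelemetry semantic conventions suggest for
# deriving service.instance.id as a UUID v5. The PR's constant may differ.
OTEL_SERVICE_INSTANCE_NAMESPACE = uuid.UUID("4d63009a-8d0f-11ee-aad7-4c796ed8e320")


def instance_attributes_from_name(service_instance_name: str) -> dict[str, str]:
    # UUID v5 hashes (namespace, name) with SHA-1, so the same name always
    # maps to the same id, which is why it survives process restarts.
    instance_id = uuid.uuid5(OTEL_SERVICE_INSTANCE_NAMESPACE, service_instance_name)
    return {
        "service.instance.id": str(instance_id),
        "service.instance.name": service_instance_name,
    }
```

The determinism is the point of this revision: a restarted process with the same name reports the same id, at the cost that two processes sharing a name are indistinguishable.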
@github-actions github-actions bot added size:M 30~100 LoC and removed size:S 10~30 LoC labels Feb 3, 2026
@Yaminyam Yaminyam requested a review from HyeockJinKim February 3, 2026 03:07
Thread a service_instance_id UUID into the OpenTelemetry spec and resource attributes, replacing the previous UUIDv5-from-name approach. Each process now exposes a unique service.instance.id that changes on every restart (the web server generates a uuid4 at startup), enabling finer-grained per-instance log filtering in Loki/Grafana. Also updated the changelog and applied the new field across the agent, coordinator, worker, manager, storage, and web servers.
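A sketch of the per-restart approach, assuming the id is generated once at process startup and passed into the spec. The two-field spec, `to_resource_attributes`, and `startup` are hypothetical simplifications, not the PR's code.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenTelemetrySpec:
    # Hypothetical minimal spec: only the fields needed for this example.
    service_name: str
    service_instance_id: uuid.UUID

    def to_resource_attributes(self) -> dict[str, str]:
        return {
            "service.name": self.service_name,
            "service.instance.id": str(self.service_instance_id),
        }


def startup(service_name: str) -> OpenTelemetrySpec:
    # Generate the id once when the process starts and thread it through.
    # uuid4 is random, so every restart yields a new service.instance.id.
    instance_id = uuid.uuid4()
    return OpenTelemetrySpec(service_name=service_name, service_instance_id=instance_id)
```

Compared with the UUID v5 revision, ids no longer survive restarts; that is the intended trade-off here, since each process lifetime becomes its own filterable stream in Loki/Grafana.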
@Yaminyam Yaminyam requested a review from HyeockJinKim February 3, 2026 07:29

Labels

comp:agent Related to Agent component comp:app-proxy Related to App Proxy component comp:manager Related to Manager component comp:storage-proxy Related to Storage proxy component comp:webserver Related to Web Server component size:M 30~100 LoC