Merged
78 commits
0b932d9
feat: add OTel GenAI instrumentation foundation
zhichli Feb 18, 2026
d5d990c
refactor: reorder OTel type imports for consistency
zhichli Feb 18, 2026
af3235e
refactor: reorder OTel type imports for consistency
zhichli Feb 18, 2026
e3f45aa
feat(otel): wire OTel spans into chat extension — Phase 1 core
zhichli Feb 19, 2026
7954243
feat(otel): add embeddings span, config UI settings, and unit tests
zhichli Feb 19, 2026
cc0c88b
test(otel): add unit tests for messageFormatters, genAiEvents, fileEx…
zhichli Feb 19, 2026
ab08daa
feat(otel): record token usage and time-to-first-token metrics
zhichli Feb 19, 2026
d86d286
docs: finalize sprint plan with completion status
zhichli Feb 19, 2026
20d6cef
style: apply formatter changes to OTel files
zhichli Feb 19, 2026
ebebed3
feat(otel): emit gen_ai.client.inference.operation.details event with…
zhichli Feb 19, 2026
c00de1c
feat(otel): add aggregated token usage to invoke_agent span
zhichli Feb 19, 2026
e03f667
feat(otel): add token usage attributes to chat inference span
zhichli Feb 19, 2026
134c385
style: apply formatter changes
zhichli Feb 19, 2026
d5eee69
fix: correct import paths in otelContrib and add IOTelService to test
zhichli Feb 19, 2026
3585148
feat: add diagnostic span exporter to log first successful export and…
zhichli Feb 19, 2026
58d23d6
feat: add content capture to OTel spans (messages, responses, tool ar…
zhichli Feb 19, 2026
00e3056
fix: register IOTelService in chatLib setupServices for NES test
zhichli Feb 21, 2026
d3b9ebd
fix: register OTel ConfigKey settings in Advanced namespace for confi…
zhichli Feb 21, 2026
f87fbe8
fix: register IOTelService in shared test services (createExtensionUn…
zhichli Feb 22, 2026
cef8196
fix: register IOTelService in platform test services
zhichli Feb 22, 2026
1ca5692
feat(otel): enhance GenAI span attributes per OTel semantic conventions
zhichli Feb 26, 2026
015f422
feat(otel): connect subagent spans to parent trace via context propag…
zhichli Feb 26, 2026
07d092c
fix(otel): fix subagent trace context key to use parentRequestId
zhichli Feb 26, 2026
a31682f
fix(otel): add model attrs to invoke_agent and max_prompt_tokens to B…
zhichli Feb 26, 2026
d72b134
test(otel): add trace context propagation tests for subagent linkage
zhichli Feb 26, 2026
4ca9836
fix(otel): add finish_reasons and ttft to BYOK chat spans, document o…
zhichli Feb 26, 2026
8a203f1
docs(otel): update Gap 4 analysis — wrapper spans have actual token u…
zhichli Feb 26, 2026
45592b4
feat(otel): propagate trace context through BYOK IPC to link wrapper …
zhichli Feb 26, 2026
aa1a6cf
debug(otel): add debug attribute to verify trace context capture in B…
zhichli Feb 26, 2026
f65354b
fix(otel): remove debug attribute, BYOK trace context propagation ver…
zhichli Feb 26, 2026
db8ddbd
refactor(otel): replace byok-provider bridge span with invisible cont…
zhichli Feb 26, 2026
0fdc278
refactor(otel): remove duplicate BYOK consumer-side chat span
zhichli Feb 26, 2026
6509f3f
fix(otel): restore chat span for non-wrapper BYOK providers (Anthropi…
zhichli Feb 26, 2026
6c0a6c7
fix(otel): skip consumer chat span for wrapper-based BYOK providers
zhichli Feb 26, 2026
e734ce6
fix: remove unnecessary 'google' from non-wrapper vendor set
zhichli Feb 26, 2026
ebf81bb
feat(otel): add rich chat span with usage data for Anthropic BYOK pro…
zhichli Feb 26, 2026
f472938
feat(otel): add rich chat span for Gemini BYOK, clean up extChatEndpoint
zhichli Feb 26, 2026
69fa127
feat(otel): enrich Anthropic/Gemini chat spans with full metadata
zhichli Feb 26, 2026
5b6842c
feat(otel): add server.address to CAPI/Azure BYOK chat spans
zhichli Feb 26, 2026
c136189
feat(otel): add max_tokens and output_messages to Anthropic/Gemini ch…
zhichli Feb 26, 2026
8fa5d71
fix(otel): capture tool calls in output_messages for chat spans
zhichli Feb 26, 2026
9d8cea8
fix(otel): capture tool calls in output_messages for Anthropic/Gemini…
zhichli Feb 26, 2026
66136f3
fix(otel): add input_messages and agent_name to Anthropic/Gemini chat…
zhichli Feb 27, 2026
17301a9
fix(otel): fix input_messages serialization for Anthropic/Gemini BYOK
zhichli Feb 27, 2026
defaa3d
docs(otel): add remaining metrics/events work to plan.md
zhichli Feb 27, 2026
bef11b8
feat(otel): add metrics and inference events to Anthropic/Gemini BYOK…
zhichli Feb 27, 2026
739512d
fix(otel): fix LoggerProvider constructor — use 'processors' key (SDK…
zhichli Feb 27, 2026
cc299e5
docs: add agent monitoring guide with OTel usage and Claude/Gemini co…
zhichli Feb 27, 2026
84d12a7
docs: remove Claude/Gemini comparison from monitoring guide
zhichli Feb 27, 2026
7c54ef6
docs: add OTel comparison with Claude Code and Gemini CLI
zhichli Feb 27, 2026
08fff69
docs: reorganize monitoring docs — user guide + dev architecture
zhichli Feb 27, 2026
fd1f904
fix(otel): restore _doFetchViaHttp body and _fetchWithInstrumentation…
zhichli Feb 27, 2026
e9215a3
fix(otel): propagate otelSpan through WebSocket/HTTP routing paths
zhichli Feb 27, 2026
4161319
docs(otel): improve monitoring docs, add collector setup, fix trace c…
zhichli Feb 28, 2026
dcaecf7
docs(otel): merge Backend Considerations and E2E sections to remove r…
zhichli Feb 28, 2026
236e435
docs(otel): remove internal dev debug reference from user-facing guide
zhichli Feb 28, 2026
c6844da
docs(otel): remove Grafana section and Jaeger refs from App Insights …
zhichli Feb 28, 2026
c4bba29
docs(otel): trim Backend section to factual setup guides, remove claims
zhichli Feb 28, 2026
b2afca9
docs(otel): final accuracy audit — fix false claims against code
zhichli Feb 28, 2026
00110cf
docs(otel): remove telemetry.telemetryLevel references — OTel is inde…
zhichli Feb 28, 2026
3118575
feat(otel): wire up session.start event, agent.turn event, and sessio…
zhichli Feb 28, 2026
735685a
chore: untrack .playwright-mcp/ and add to .gitignore
zhichli Feb 28, 2026
4ca67ae
chore: remove otel spec reference files
zhichli Feb 28, 2026
e995e9b
chore(otel): remove OpenTelemetry environment variables from launch c…
zhichli Feb 28, 2026
9539f8c
fix(otel): add 64KB truncation limit for content capture attributes
zhichli Feb 28, 2026
2d6bf2f
refactor(otel): make GenAiMetrics methods static to avoid per-call al…
zhichli Feb 28, 2026
0709614
fix(otel): fix timer leak, cap buffered ops, rate-limit export logs
zhichli Feb 28, 2026
9bbea99
docs(otel): fix Jaeger UI port to match docker-compose (16687)
zhichli Feb 28, 2026
9e80dc7
chore(otel): update sprint plan — mark P0/P1 tasks done
zhichli Feb 28, 2026
478f97c
fix(otel): remove as any casts in BYOK provider content capture
zhichli Feb 28, 2026
59f3521
refactor(otel): extract OTelModelOptions shared interface
zhichli Feb 28, 2026
b514a17
refactor(otel): route OTel logs through ILogService output channel
zhichli Feb 28, 2026
41a0807
fix(otel): remove orphaned OTel ConfigKey definitions
zhichli Feb 28, 2026
e53a976
test(otel): add comprehensive OTel instrumentation tests
zhichli Feb 28, 2026
ffe9f6a
chore: apply formatter import sorting
zhichli Feb 28, 2026
4d630f0
chore: remove outdated sprint plan document
zhichli Feb 28, 2026
61f3bbd
feat(otel): add OTel configuration settings for tracing and logging
zhichli Feb 28, 2026
b68436a
fix(otel): ensure metric reader is flushed and shutdown properly
zhichli Mar 2, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -40,3 +40,6 @@ test/aml/out

# claude
.claude/settings.local.json

# playwright
.playwright-mcp/
506 changes: 506 additions & 0 deletions docs/monitoring/agent_monitoring.md

Large diffs are not rendered by default.

297 changes: 297 additions & 0 deletions docs/monitoring/agent_monitoring_arch.md
@@ -0,0 +1,297 @@
# OTel Instrumentation — Developer Guide

This document describes the architecture, code structure, and conventions for the OpenTelemetry instrumentation in the Copilot Chat extension. It is intended for developers contributing to or maintaining this codebase.

For user-facing configuration and usage, see [agent_monitoring.md](agent_monitoring.md).

---

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                  VS Code Copilot Chat Extension                  │
│                                                                  │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────┐  ┌──────────┐   │
│  │ ChatML      │  │ Tool Calling │  │ Tools    │  │ Prompts  │   │
│  │ Fetcher     │  │ Loop         │  │ Service  │  │          │   │
│  └──────┬──────┘  └──────┬───────┘  └────┬─────┘  └────┬─────┘   │
│         │                │               │             │         │
│         ▼                ▼               ▼             ▼         │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │                    IOTelService (DI)                     │    │
│  │  ┌─────────┐  ┌──────────┐  ┌─────────┐  ┌───────────┐   │    │
│  │  │ Tracer  │  │ Meter    │  │ Logger  │  │ Semantic  │   │    │
│  │  │ (spans) │  │ (metrics)│  │ (events)│  │ Helpers   │   │    │
│  │  └────┬────┘  └────┬─────┘  └────┬────┘  └───────────┘   │    │
│  └───────┼────────────┼─────────────┼───────────────────────┘    │
│          ▼            ▼             ▼                            │
│      ┌─────────────────────────────────────────────┐             │
│      │  OTel SDK (BatchSpanProcessor,              │             │
│      │            BatchLogRecordProcessor,         │             │
│      │            PeriodicExportingMetricReader)   │             │
│      └──────────────────┬──────────────────────────┘             │
│                         ▼                                        │
│      ┌─────────────────────────────────────────────┐             │
│      │  Exporters: OTLP/HTTP | OTLP/gRPC |         │             │
│      │             Console | File (JSON-lines)     │             │
│      └─────────────────────────────────────────────┘             │
└──────────────────────────────────────────────────────────────────┘
```

---

## File Structure

```
src/platform/otel/
├── common/
│   ├── otelService.ts          # IOTelService interface + ISpanHandle
│   ├── otelConfig.ts           # Config resolution (env → settings → defaults)
│   ├── noopOtelService.ts      # Zero-cost no-op implementation
│   ├── genAiAttributes.ts      # GenAI semantic convention attribute keys
│   ├── genAiEvents.ts          # Event emitter helpers
│   ├── genAiMetrics.ts         # GenAiMetrics class (metric recording)
│   ├── messageFormatters.ts    # Message → OTel JSON schema converters
│   ├── index.ts                # Public API barrel export
│   └── test/                   # Unit tests
└── node/
    ├── otelServiceImpl.ts      # NodeOTelService (real SDK implementation)
    ├── fileExporters.ts        # File-based span/log/metric exporters
    └── test/                   # Unit tests

src/extension/otel/
└── vscode-node/
    └── otelContrib.ts          # Lifecycle contribution (shutdown hook)
```

### Instrumentation Points

| File | What Gets Instrumented |
|---|---|
| `src/extension/prompt/node/chatMLFetcher.ts` | `chat` spans — one per LLM API call. Used by standard CAPI endpoints **and** all OpenAI-compatible BYOK providers (Azure, OpenAI, Ollama, OpenRouter, xAI, CustomOAI) via `CopilotLanguageModelWrapper` → `endpoint.makeChatRequest` |
| `src/extension/byok/vscode-node/anthropicProvider.ts` | `chat` spans — BYOK Anthropic requests (native SDK, instrumented directly) |
| `src/extension/byok/vscode-node/geminiNativeProvider.ts` | `chat` spans — BYOK Gemini requests (native SDK, instrumented directly) |
| `src/extension/intents/node/toolCallingLoop.ts` | `invoke_agent` spans — wraps agent orchestration |
| `src/extension/tools/vscode-node/toolsService.ts` | `execute_tool` spans — one per tool invocation |
| `src/extension/extension/vscode-node/services.ts` | Service registration (config → NodeOTelService or NoopOTelService) |

---

## Service Layer

### `IOTelService` Interface

The core abstraction. All consumers depend on this interface, never on the OTel SDK directly. It exposes methods for starting spans, recording metrics, emitting log records, managing trace context propagation, and lifecycle (`flush`/`shutdown`).

### Implementations

| Class | When Used | Characteristics |
|---|---|---|
| `NoopOTelService` | OTel disabled (default) | All methods are empty. Zero cost. |
| `NodeOTelService` | OTel enabled | Full SDK with dynamic imports, buffering, batched processors. |

### Registration

In `services.ts`, the config is resolved from env + settings, then the appropriate implementation is registered:

```typescript
const otelConfig = resolveOTelConfig({ env: process.env, ... });
if (otelConfig.enabled) {
	const { NodeOTelService } = require('.../otelServiceImpl');
	builder.define(IOTelService, new NodeOTelService(otelConfig));
} else {
	builder.define(IOTelService, new NoopOTelService(otelConfig));
}
```

The `require()` (not `import()`) is intentional here — it avoids loading the SDK at all when disabled, while the `NodeOTelService` constructor internally uses `import()` for all OTel packages.

---

## Configuration Resolution

`resolveOTelConfig()` in `otelConfig.ts` implements layered precedence:

1. `COPILOT_OTEL_*` env vars (highest)
2. `OTEL_EXPORTER_OTLP_*` standard env vars
3. VS Code settings (`github.copilot.chat.otel.*`)
4. Defaults (lowest)
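
As a rough illustration of the precedence list above, a resolver for a single value can try each layer in order and fall back to a default. This is a sketch, not the actual `resolveOTelConfig()`: the `EndpointLayers` shape, the `resolveEndpoint` name, and the default value are assumptions made for the example.

```typescript
// Hypothetical input shape: one optional value per configuration layer.
interface EndpointLayers {
	copilotEnv?: string;  // from a COPILOT_OTEL_* env var (highest precedence)
	standardEnv?: string; // from an OTEL_EXPORTER_OTLP_* env var
	setting?: string;     // from a github.copilot.chat.otel.* setting
}

// Assumed default, for illustration only.
const DEFAULT_ENDPOINT = 'http://localhost:4318';

// Returns the value from the highest-precedence layer that is set.
function resolveEndpoint(layers: EndpointLayers): string {
	return layers.copilotEnv ?? layers.standardEnv ?? layers.setting ?? DEFAULT_ENDPOINT;
}
```

With this shape, a value in a standard env var overrides a VS Code setting, and a `COPILOT_OTEL_*` value overrides both.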

Kill switch: if `telemetry.telemetryLevel === 'off'`, the config resolver returns a disabled config. The resolver only sees that value when the call site passes `vscodeTelemetryLevel`, which `services.ts` currently does not do, so the kill switch is not yet active on that path.

Endpoint parsing: gRPC → origin only (`scheme://host:port`). HTTP → full href.
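
The endpoint rule can be sketched with the standard `URL` class. The function name `normalizeEndpoint` and the `OtlpProtocol` type are hypothetical; the real parsing in `otelConfig.ts` may handle more cases (invalid URLs, missing schemes).

```typescript
type OtlpProtocol = 'grpc' | 'http';

// gRPC exporters get only scheme://host:port; HTTP exporters keep the full URL.
function normalizeEndpoint(raw: string, protocol: OtlpProtocol): string {
	const url = new URL(raw); // throws on an unparsable endpoint
	return protocol === 'grpc' ? url.origin : url.href;
}
```

Note that `URL.origin` omits the port when it is the scheme's default (for example 80 for `http`), which is usually harmless for OTLP endpoints on 4317/4318.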

---

## Span Conventions

### Naming

Follow the OTel GenAI conventions:

| Operation | Span Name | Kind |
|---|---|---|
| Agent orchestration | `invoke_agent {agent_name}` | `INTERNAL` |
| LLM API call | `chat {model}` | `CLIENT` |
| Tool execution | `execute_tool {tool_name}` | `INTERNAL` |
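
A small helper can keep span names and kinds consistent with the table above. This is only a sketch: `spanNameFor` and `SPAN_KIND` are hypothetical names, and kinds are shown as plain strings rather than the service's `SpanKind` values.

```typescript
type GenAiOperation = 'invoke_agent' | 'chat' | 'execute_tool';

// Span kind per operation, as listed in the naming table.
const SPAN_KIND: Record<GenAiOperation, 'INTERNAL' | 'CLIENT'> = {
	invoke_agent: 'INTERNAL',
	chat: 'CLIENT',
	execute_tool: 'INTERNAL',
};

// Builds "{operation} {target}", where target is the agent, model, or tool name.
function spanNameFor(operation: GenAiOperation, target: string): string {
	return `${operation} ${target}`;
}
```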

### Attributes

Use the constants from `genAiAttributes.ts`:

```typescript
import { GenAiAttr, GenAiOperationName, CopilotChatAttr, StdAttr } from '../../platform/otel/common/index';

span.setAttributes({
	[GenAiAttr.OPERATION_NAME]: GenAiOperationName.CHAT,
	[GenAiAttr.REQUEST_MODEL]: model,
	[GenAiAttr.USAGE_INPUT_TOKENS]: inputTokens,
	[StdAttr.ERROR_TYPE]: error.constructor.name,
});
```

### Error Handling

On error, set both status and `error.type`:

```typescript
span.setStatus(SpanStatusCode.ERROR, error.message);
span.setAttribute(StdAttr.ERROR_TYPE, error.constructor.name);
```

### Content Capture

Always gate content capture on the service's `config.captureContent` flag:

```typescript
if (this._otelService.config.captureContent) {
	span.setAttribute(GenAiAttr.INPUT_MESSAGES, JSON.stringify(messages));
}
```
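
Captured content can be large, and commit 9539f8c in this PR adds a 64KB truncation limit for content capture attributes. The helper below is one possible shape for such a guard, not the extension's code: `truncateAttribute`, the marker string, and measuring the limit in UTF-16 code units instead of bytes are all assumptions.

```typescript
// Assumed limit, counted in UTF-16 code units rather than bytes.
const MAX_ATTRIBUTE_CHARS = 64 * 1024;
// Hypothetical marker appended to truncated values.
const TRUNCATION_MARKER = '...[truncated]';

function truncateAttribute(value: string, maxChars: number = MAX_ATTRIBUTE_CHARS): string {
	if (value.length <= maxChars) {
		return value;
	}
	// Reserve room for the marker so the result stays within maxChars
	// (for limits longer than the marker itself).
	const keep = Math.max(0, maxChars - TRUNCATION_MARKER.length);
	return value.slice(0, keep) + TRUNCATION_MARKER;
}
```

A call site would wrap the serialized payload, for example `truncateAttribute(JSON.stringify(messages))`, before passing it to `span.setAttribute`.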

---

## Adding Instrumentation to New Code

### Pattern: Wrapping an Operation with a Span

```typescript
class MyService {
	constructor(@IOTelService private readonly _otel: IOTelService) {}

	async doWork(): Promise<Result> {
		return this._otel.startActiveSpan(
			'execute_tool myTool',
			{ kind: SpanKind.INTERNAL, attributes: { [GenAiAttr.TOOL_NAME]: 'myTool' } },
			async (span) => {
				try {
					const result = await this._actualWork();
					span.setStatus(SpanStatusCode.OK);
					return result;
				} catch (err) {
					span.setStatus(SpanStatusCode.ERROR, err instanceof Error ? err.message : String(err));
					span.setAttribute(StdAttr.ERROR_TYPE, err instanceof Error ? err.constructor.name : 'Error');
					throw err;
				}
			},
		);
	}
}
```

### Pattern: Recording Metrics

Use `GenAiMetrics` for standard metric recording:

```typescript
const metrics = new GenAiMetrics(this._otelService);
metrics.recordTokenUsage(1500, 'input', {
	operationName: GenAiOperationName.CHAT,
	providerName: GenAiProviderName.GITHUB,
	requestModel: 'gpt-4o',
});
metrics.recordToolCallCount('readFile', true);
metrics.recordTimeToFirstToken('gpt-4o', 0.45);
```

### Pattern: Emitting Events

```typescript
import { emitToolCallEvent, emitInferenceDetailsEvent } from '../../platform/otel/common/index';

emitToolCallEvent(this._otelService, 'readFile', 50, true);
emitInferenceDetailsEvent(this._otelService, { model: 'gpt-4o' }, { inputTokens: 1500 });
```

### Pattern: Cross-Boundary Trace Propagation

When spawning a subagent, store the current trace context and retrieve it in the child:

```typescript
// Parent: store context before spawning subagent
const traceContext = this._otelService.getActiveTraceContext();
if (traceContext) {
	this._otelService.storeTraceContext(`subagent:${requestId}`, traceContext);
}

// Child: retrieve and use as parent
const parentCtx = this._otelService.getStoredTraceContext(`subagent:${requestId}`);
return this._otelService.startActiveSpan('invoke_agent child', { parentTraceContext: parentCtx }, async (span) => {
	// child spans are now part of the same trace
});
```
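
Behind `storeTraceContext` / `getStoredTraceContext`, the service needs some keyed storage for contexts. A minimal in-memory sketch follows. `TraceContextStore`, the `TraceContext` fields, and the choice to remove an entry once it is read are assumptions, not the actual implementation.

```typescript
// Simplified stand-in for the service's trace context type.
interface TraceContext {
	traceId: string;
	spanId: string;
}

class TraceContextStore {
	private readonly _contexts = new Map<string, TraceContext>();

	store(key: string, context: TraceContext): void {
		this._contexts.set(key, context);
	}

	// Returns the stored context and removes it (single use is an assumption
	// made here so entries do not accumulate across requests).
	take(key: string): TraceContext | undefined {
		const context = this._contexts.get(key);
		this._contexts.delete(key);
		return context;
	}
}
```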

---

## Buffering & Initialization

`NodeOTelService` buffers operations during async SDK initialization. Once init completes, the buffer is drained in order; on failure, it is discarded and all future calls become no-ops. `BufferedSpanHandle` captures span mutations during this window and replays them onto the real span once available.
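
The buffer-and-replay idea can be sketched as follows. This is an illustration of the technique under simplified assumptions, not the real `BufferedSpanHandle`: the `SpanLike` interface, the `attach`/`discard` method names, and the absence of a buffer size cap are all choices made for the example.

```typescript
// Minimal span surface for the sketch; the real ISpanHandle has more methods.
interface SpanLike {
	setAttribute(key: string, value: string | number | boolean): void;
	end(): void;
}

type SpanOp = (span: SpanLike) => void;

class BufferedSpan implements SpanLike {
	private _real: SpanLike | undefined;
	private _failed = false;
	private _pending: SpanOp[] = [];

	setAttribute(key: string, value: string | number | boolean): void {
		this._run(span => span.setAttribute(key, value));
	}

	end(): void {
		this._run(span => span.end());
	}

	// Called when SDK init succeeds: replay buffered mutations in order.
	attach(real: SpanLike): void {
		this._real = real;
		for (const op of this._pending) {
			op(real);
		}
		this._pending = [];
	}

	// Called when SDK init fails: drop the buffer and ignore future calls.
	discard(): void {
		this._failed = true;
		this._pending = [];
	}

	private _run(op: SpanOp): void {
		if (this._failed) {
			return;
		}
		if (this._real) {
			op(this._real);
		} else {
			this._pending.push(op);
		}
	}
}
```

Recording each mutation as a closure keeps the replay order identical to the call order, which matters when the same attribute is set more than once before initialization completes.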

---

## Exporters

Four exporter types are supported: OTLP/HTTP (default), OTLP/gRPC, Console (stdout), and File (JSON-lines). All OTel SDK packages are dynamically imported — none are loaded when OTel is disabled. `DiagnosticSpanExporter` wraps the span exporter to log the first successful export (confirms connectivity).
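
A wrapper that logs the first successful export could look roughly like this. It is a sketch of the decorator idea only: `ExportResultLike` and `ExporterLike` are simplified stand-ins for the SDK's exporter types, and `FirstSuccessLogger` is a hypothetical name, not the actual `DiagnosticSpanExporter`.

```typescript
// Simplified stand-ins for the OTel SDK's exporter types.
interface ExportResultLike {
	code: 0 | 1; // 0 = success, 1 = failure (mirrors the SDK's ExportResultCode)
}

interface ExporterLike<T> {
	export(items: T[], done: (result: ExportResultLike) => void): void;
	shutdown(): Promise<void>;
}

class FirstSuccessLogger<T> implements ExporterLike<T> {
	private readonly _inner: ExporterLike<T>;
	private readonly _log: (message: string) => void;
	private _logged = false;

	constructor(inner: ExporterLike<T>, log: (message: string) => void) {
		this._inner = inner;
		this._log = log;
	}

	export(items: T[], done: (result: ExportResultLike) => void): void {
		this._inner.export(items, result => {
			// Log once, on the first success, then stay silent.
			if (result.code === 0 && !this._logged) {
				this._logged = true;
				this._log(`First successful export (${items.length} spans)`);
			}
			done(result);
		});
	}

	shutdown(): Promise<void> {
		return this._inner.shutdown();
	}
}
```

Because the wrapper always forwards the original result to `done`, it does not change retry or error behavior of the wrapped exporter.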

---

## GenAI Semantic Convention Reference

All attribute names follow [OTel GenAI Semantic Conventions](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/).

Constants are defined in `genAiAttributes.ts`:

- `GenAiAttr.*` — Standard `gen_ai.*` attribute keys
- `CopilotChatAttr.*` — Extension-specific `copilot_chat.*` keys
- `StdAttr.*` — Standard OTel keys (`error.type`, `server.address`, `server.port`)
- `GenAiOperationName.*` — Operation name values (`chat`, `invoke_agent`, `execute_tool`)
- `GenAiProviderName.*` — Provider values (`github`, `openai`, `anthropic`)

Message formatting helpers in `messageFormatters.ts` convert internal message types to the OTel JSON schema:

- `toInputMessages()` — CAPI messages → OTel input format
- `toOutputMessages()` — Model response choices → OTel output format
- `toSystemInstructions()` — System message → OTel system instruction format
- `toToolDefinitions()` — Tool schemas → OTel tool definition format
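
To make the target format concrete, here is a minimal sketch of a text-only conversion into the role-plus-parts message shape used by the GenAI semantic conventions. `SimpleMessage` and `toOTelMessages` are hypothetical; the real `toInputMessages()` works on CAPI message types and also has to handle tool calls and other part types.

```typescript
// Hypothetical internal message shape for the example.
interface SimpleMessage {
	role: 'system' | 'user' | 'assistant';
	content: string;
}

// Message shape following the GenAI conventions: a role plus typed parts.
interface OTelMessage {
	role: string;
	parts: { type: 'text'; content: string }[];
}

function toOTelMessages(messages: SimpleMessage[]): OTelMessage[] {
	return messages.map((m): OTelMessage => ({
		role: m.role,
		parts: [{ type: 'text', content: m.content }],
	}));
}
```

The result is what would be passed through `JSON.stringify` before being set as a span attribute such as `GenAiAttr.INPUT_MESSAGES`.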

---

## Testing

Unit tests live alongside the source:

```
src/platform/otel/common/test/
├── genAiEvents.spec.ts
├── genAiMetrics.spec.ts
├── messageFormatters.spec.ts
├── noopOtelService.spec.ts
└── otelConfig.spec.ts

src/platform/otel/node/test/
├── fileExporters.spec.ts
└── traceContextPropagation.spec.ts
```

Run with: `npm test -- --grep "OTel"`
36 changes: 36 additions & 0 deletions docs/monitoring/docker-compose.yaml
@@ -0,0 +1,36 @@
# Copilot Chat OTel monitoring stack
#
# Starts an OpenTelemetry Collector that accepts OTLP on :4318 (HTTP) and :4317 (gRPC) inside the container
# (published on host ports 4328 and 4327), then forwards traces/metrics/logs to Azure Application Insights and traces to a local Jaeger instance.
#
# Usage:
#   # Set your App Insights connection string:
#   export APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=...;IngestionEndpoint=..."
#
#   # Start the stack:
#   docker compose up -d
#
#   # View traces in Jaeger:
#   open http://localhost:16687
#
#   # Then launch VS Code with:
#   COPILOT_OTEL_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4328 code .

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
    ports:
      - "4327:4317"   # OTLP gRPC (host:4327 → container:4317)
      - "4328:4318"   # OTLP HTTP (host:4328 → container:4318)
    environment:
      - APPLICATIONINSIGHTS_CONNECTION_STRING=${APPLICATIONINSIGHTS_CONNECTION_STRING:-}
    restart: unless-stopped

  jaeger:
    image: jaegertracing/jaeger:latest
    ports:
      - "16687:16686"   # Jaeger UI (host:16687 to avoid conflict)
    restart: unless-stopped
50 changes: 50 additions & 0 deletions docs/monitoring/otel-collector-config.yaml
@@ -0,0 +1,50 @@
# OpenTelemetry Collector configuration for Copilot Chat
# Receives OTLP from Copilot Chat and exports to multiple backends.
#
# Usage:
#   docker compose -f docs/monitoring/docker-compose.yaml up -d
#
# Then set in VS Code or launch.json (4328 is the host port that docker-compose.yaml maps to the collector's :4318):
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4328

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 256

exporters:
  # Azure Application Insights via connection string
  # The connection string is read from the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable
  azuremonitor:
    connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"

  # Debug exporter: prints to collector stdout (useful for troubleshooting)
  debug:
    verbosity: basic

  # Local Jaeger for trace visualization
  otlphttp/jaeger:
    endpoint: http://jaeger:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, otlphttp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, debug]