Commit 8b4c6bf

feat: emit compute OTel spans (provision, restore, snapshot) in run traces
Supervisor emits OTel spans for compute lifecycle events so they appear in the run's trace view with per-stage timing breakdowns.

Spans:
- compute.provision: emitted after gateway create returns, includes gateway/agent/fcrun timing and cache indicators from _timing response
- compute.restore: emitted after gateway restore returns (supervisor-side timing only, gateway restore timing not yet surfaced)
- compute.snapshot: emitted from snapshot callback handler using duration_ms from the agent, trace context from in-memory map (best-effort, does not survive restarts - TRI-7992)

Implementation:
- Hand-rolled OTLP JSON client (otlpTrace.ts) - builds ExportTraceServiceRequest payload and fire-and-forget POSTs to TRIGGER_API_URL/otel
- Trace context (traceparent) from DequeuedMessage links spans to run trace
- Resource attributes (ctx.environment.id, ctx.run.id, etc.) link to the correct run in the trace view
- COMPUTE_TRACE_SPANS_ENABLED env var (default true) to disable in prod
- Span start time offset by -1ms to ensure stable sort order before attempt

Also adds .claude/rules/span-timeline-events.md documenting how the trace view timeline events system works (trigger.dev/ prefix, admin visibility, ClickHouse SPAN_EVENT storage, start_time filter constraint).
1 parent 80b62d4 commit 8b4c6bf

8 files changed (+497, -0 lines)
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# Span Timeline Events

The trace view's right panel shows a timeline of events for the selected span. These are OTel span events rendered by `app/utils/timelineSpanEvents.ts` and the `SpanTimeline` component.

## How They Work

1. **Span events** in OTel are attached to a parent span. In ClickHouse, they're stored as separate rows with `kind: "SPAN_EVENT"` sharing the parent span's `span_id`. The `#mergeRecordsIntoSpanDetail` method reassembles them into the span's `events` array at query time.
2. The timeline only renders events whose `name` starts with `trigger.dev/` - all others are silently filtered out.
3. The **display name** comes from `properties.event` (not the span event name), mapped through `getFriendlyNameForEvent()`.
4. Events are shown on the **span they belong to** - events on one span don't appear in another span's timeline.
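
As a rough illustration of points 2 and 3, the prefix filter and the display-key lookup could look like the sketch below. The `SpanEvent` shape and the helper names `isTimelineEvent` and `displayKey` are hypothetical; they are not taken from `timelineSpanEvents.ts`, which may be structured differently.

```typescript
// Hypothetical event shape; the real webapp types may differ.
interface SpanEvent {
  name: string;
  time: Date;
  properties: Record<string, unknown>;
}

const TIMELINE_PREFIX = "trigger.dev/";

// Only events whose name starts with "trigger.dev/" reach the timeline.
function isTimelineEvent(spanEvent: SpanEvent): boolean {
  return spanEvent.name.startsWith(TIMELINE_PREFIX);
}

// The display key is read from properties.event, not from the event name.
function displayKey(spanEvent: SpanEvent): string | undefined {
  const key = spanEvent.properties["event"];
  return typeof key === "string" ? key : undefined;
}

const events: SpanEvent[] = [
  { name: "trigger.dev/run", time: new Date(1_000), properties: { event: "dequeue" } },
  { name: "exception", time: new Date(2_000), properties: {} }, // filtered out
];

const visible = events.filter(isTimelineEvent).map(displayKey);
```

In this sketch only the first event survives the filter, and its display key would then be passed to something like `getFriendlyNameForEvent()`.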

## ClickHouse Storage Constraint

When events are written to ClickHouse, `spanEventsToTaskEventV1Input()` filters out events whose `start_time` is not greater than the parent span's `startTime`. Events at or before the span start are silently dropped. This means span events must have timestamps strictly after the span's own `startTimeUnixNano`.
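
The constraint can be pictured as a strict greater-than filter. The following is a minimal sketch, not the body of `spanEventsToTaskEventV1Input()`; the `TimedEvent` shape, the `keepStorableEvents` name, and the use of millisecond numbers are assumptions made for illustration.

```typescript
// Hypothetical minimal shape; timestamps are epoch milliseconds here.
interface TimedEvent {
  name: string;
  timeMs: number;
}

// Keep only events strictly after the span start; equal timestamps are dropped.
function keepStorableEvents(spanStartMs: number, candidates: TimedEvent[]): TimedEvent[] {
  return candidates.filter((e) => e.timeMs > spanStartMs);
}

const spanStart = 1_711_200_000_000;
const kept = keepStorableEvents(spanStart, [
  { name: "trigger.dev/run", timeMs: spanStart },      // dropped: equal to span start
  { name: "trigger.dev/run", timeMs: spanStart + 1 },  // kept: strictly after
  { name: "trigger.dev/run", timeMs: spanStart - 10 }, // dropped: before span start
]);
```

The practical consequence is that an event stamped with exactly the span's start time never appears in the timeline, so emitters should add at least a small positive offset.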

## Timeline Rendering (SpanTimeline component)

The `SpanTimeline` component in `app/components/run/RunTimeline.tsx` renders:

1. **Events** (thin 1px line with hollow dots) - all events from `createTimelineSpanEventsFromSpanEvents()`
2. **"Started"** marker (thick cap) - at the span's `startTime`
3. **Duration bar** (thick 7px line) - from "Started" to "Finished"
4. **"Finished"** marker (thick cap) - at `startTime + duration`

The thin line before "Started" only appears when there are events with timestamps between the span start and the first child span. For the Attempt span this works well (Dequeued → Pod scheduled → Launched → etc. all happen before execution starts). Events all get `lineVariant: "light"` (thin) while the execution bar gets `variant: "normal"` (thick).

## Trace View Sort Order

Sibling spans (same parent) are sorted by `start_time ASC` from the ClickHouse query. The `createTreeFromFlatItems` function preserves this order. Event timestamps don't affect sort order - only the span's own `start_time`.
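
The commit message mentions offsetting a compute span's start time by -1ms so that it sorts before the attempt. The sketch below shows why that works under an ascending sort; the `SpanRow` shape and `sortSiblings` are hypothetical, and it assumes the offset is applied relative to a start time that would otherwise tie with the attempt's.

```typescript
// Hypothetical row shape; the real query returns more columns.
interface SpanRow {
  label: string;
  startTimeMs: number;
}

// Ascending sort by start time, mirroring `start_time ASC`.
function sortSiblings(rows: SpanRow[]): SpanRow[] {
  return [...rows].sort((a, b) => a.startTimeMs - b.startTimeMs);
}

const attemptStartMs = 1_711_200_000_000;
const siblings = sortSiblings([
  { label: "Attempt 1", startTimeMs: attemptStartMs },
  // Assumed: the compute span start is shifted 1ms earlier to break the tie.
  { label: "compute.provision", startTimeMs: attemptStartMs - 1 },
]);
```

With the offset, `compute.provision` is ordered first regardless of the order in which the two rows were written.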

## Event Structure

```typescript
// OTel span event format
{
  name: "trigger.dev/run", // Must start with "trigger.dev/" to render
  timeUnixNano: "1711200000000000000",
  attributes: [
    { key: "event", value: { stringValue: "dequeue" } }, // The actual event type
    { key: "duration", value: { intValue: 150 } }, // Optional: duration in ms
  ]
}
```

## Admin-Only Events

`getAdminOnlyForEvent()` controls visibility. Events default to **admin-only** (`true`).

| Event | Admin-only | Friendly name |
|-------|-----------|---------------|
| `dequeue` | No | Dequeued |
| `fork` | No | Launched |
| `import` | No (if no fork event) | Importing task file |
| `create_attempt` | Yes | Attempt created |
| `lazy_payload` | Yes | Lazy attempt initialized |
| `pod_scheduled` | Yes | Pod scheduled |
| (default) | Yes | (raw event name) |
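
The table can be restated as two small lookup helpers. This is a sketch derived only from the table above, not the actual bodies of `getFriendlyNameForEvent()` or `getAdminOnlyForEvent()`; the names `friendlyName` and `isAdminOnly`, the `hasForkEvent` parameter, and the assumption that `import` is admin-only when a fork event exists are all illustrative.

```typescript
// Friendly names from the table; unknown events fall back to the raw name.
const FRIENDLY_NAMES: Record<string, string> = {
  dequeue: "Dequeued",
  fork: "Launched",
  import: "Importing task file",
  create_attempt: "Attempt created",
  lazy_payload: "Lazy attempt initialized",
  pod_scheduled: "Pod scheduled",
};

function friendlyName(eventType: string): string {
  return FRIENDLY_NAMES[eventType] ?? eventType;
}

// Admin-only is the default; only a few events are visible to everyone.
function isAdminOnly(eventType: string, hasForkEvent: boolean): boolean {
  switch (eventType) {
    case "dequeue":
    case "fork":
      return false;
    case "import":
      // Assumed: public only when the span has no fork event.
      return hasForkEvent;
    default:
      return true;
  }
}
```

For example, `friendlyName("pod_scheduled")` yields the table's "Pod scheduled", while an unlisted event type is shown under its raw name and treated as admin-only.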

## Adding New Timeline Events

1. Add OTLP span event with `name: "trigger.dev/<scope>"` and `properties.event: "<type>"`
2. Event timestamp must be strictly after the parent span's `startTimeUnixNano` (ClickHouse drops earlier events)
3. Add friendly name in `getFriendlyNameForEvent()` in `app/utils/timelineSpanEvents.ts`
4. Set admin visibility in `getAdminOnlyForEvent()`
5. Optionally add help text in `getHelpTextForEvent()`
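
Steps 1 and 2 can be sketched as a small builder that refuses timestamps at or before the span start. The helper `buildTimelineEvent`, its option names, and the `OtlpAttribute` type are hypothetical and exist only for this example; the output shape follows the "Event Structure" section.

```typescript
// Hypothetical types; only the attribute kinds used in this document are covered.
type OtlpAttribute =
  | { key: string; value: { stringValue: string } }
  | { key: string; value: { intValue: number } };

interface OtlpSpanEvent {
  name: string;
  timeUnixNano: string;
  attributes: OtlpAttribute[];
}

function buildTimelineEvent(opts: {
  scope: string; // becomes "trigger.dev/<scope>"
  type: string; // becomes the "event" attribute
  spanStartMs: number; // parent span start, whole epoch milliseconds
  eventMs: number; // event time, whole epoch milliseconds
  durationMs?: number;
}): OtlpSpanEvent {
  // Step 2: events at or before the span start would be dropped on write.
  if (opts.eventMs <= opts.spanStartMs) {
    throw new RangeError("event must be strictly after the span start");
  }
  const attributes: OtlpAttribute[] = [{ key: "event", value: { stringValue: opts.type } }];
  if (opts.durationMs !== undefined) {
    attributes.push({ key: "duration", value: { intValue: opts.durationMs } });
  }
  return {
    name: `trigger.dev/${opts.scope}`,
    // Whole milliseconds to nanoseconds: append six zeros to the decimal string.
    timeUnixNano: `${Math.trunc(opts.eventMs)}000000`,
    attributes,
  };
}

const dequeued = buildTimelineEvent({
  scope: "run",
  type: "dequeue",
  spanStartMs: 1_711_200_000_000,
  eventMs: 1_711_200_000_150,
  durationMs: 150,
});
```

Steps 3 to 5 still have to be done in the webapp; an event built this way would otherwise render under its raw type and be admin-only by default.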

## Key Files

- `app/utils/timelineSpanEvents.ts` - filtering, naming, admin logic
- `app/components/run/RunTimeline.tsx` - `SpanTimeline` component (thin line + thick bar rendering)
- `app/presenters/v3/SpanPresenter.server.ts` - loads span data including events
- `app/v3/eventRepository/clickhouseEventRepository.server.ts` - `spanEventsToTaskEventV1Input()` (storage filter), `#mergeRecordsIntoSpanDetail` (reassembly)

apps/supervisor/src/env.ts

Lines changed: 1 addition & 0 deletions
@@ -83,6 +83,7 @@ const Env = z
     COMPUTE_GATEWAY_AUTH_TOKEN: z.string().optional(),
     COMPUTE_GATEWAY_TIMEOUT_MS: z.coerce.number().int().default(30_000),
     COMPUTE_SNAPSHOTS_ENABLED: BoolEnv.default(false),
+    COMPUTE_TRACE_SPANS_ENABLED: BoolEnv.default(true),
     COMPUTE_SNAPSHOT_DELAY_MS: z.coerce.number().int().min(0).max(60_000).default(5_000),

     // Kubernetes settings

apps/supervisor/src/index.ts

Lines changed: 24 additions & 0 deletions
@@ -236,6 +236,11 @@ class ManagedSupervisor {
             runFriendlyId: message.run.friendlyId,
             snapshotFriendlyId: message.snapshot.friendlyId,
             machine: message.run.machine,
+            traceContext: message.run.traceContext,
+            envId: message.environment.id,
+            orgId: message.organization.id,
+            projectId: message.project.id,
+            dequeuedAt: message.dequeuedAt,
           });

           if (didRestore) {
@@ -288,6 +293,24 @@ class ManagedSupervisor {
         return;
       }

+      if (env.COMPUTE_TRACE_SPANS_ENABLED) {
+        const traceparent =
+          message.run.traceContext &&
+          "traceparent" in message.run.traceContext &&
+          typeof message.run.traceContext.traceparent === "string"
+            ? message.run.traceContext.traceparent
+            : undefined;
+
+        if (traceparent) {
+          this.workloadServer.registerRunTraceContext(message.run.friendlyId, {
+            traceparent,
+            envId: message.environment.id,
+            orgId: message.organization.id,
+            projectId: message.project.id,
+          });
+        }
+      }
+
       try {
         if (!message.deployment.friendlyId) {
           // mostly a type guard, deployments always exists for deployed environments
@@ -315,6 +338,7 @@ class ManagedSupervisor {
           snapshotId: message.snapshot.id,
           snapshotFriendlyId: message.snapshot.friendlyId,
           placementTags: message.placementTags,
+          traceContext: message.run.traceContext,
         });

         // Disabled for now
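
The hunks above pass the raw `traceparent` string along, while `buildOtlpTracePayload` takes a separate `traceId` and `parentSpanId`. The code that splits the header is not part of this excerpt, so the following is only a sketch of how a W3C `traceparent` value could be parsed; `parseTraceparent` and `ParsedTraceparent` are hypothetical names, and the sketch accepts only the four-field version `00` form.

```typescript
interface ParsedTraceparent {
  traceId: string; // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  traceFlags: string; // 2 lowercase hex characters
}

// W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<trace-flags>".
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): ParsedTraceparent | undefined {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return undefined;

  const traceId = match[1] ?? "";
  const parentSpanId = match[2] ?? "";
  const traceFlags = match[3] ?? "";

  // All-zero trace or parent IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return undefined;

  return { traceId, parentSpanId, traceFlags };
}

const parsed = parseTraceparent("00-abcd1234abcd1234abcd1234abcd1234-1234567890abcdef-01");
const rejected = parseTraceparent("not-a-traceparent");
```

Returning `undefined` for malformed input fits the best-effort nature of these spans: a caller can simply skip emitting the span rather than fail the dequeue path.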
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
import { describe, it, expect } from "vitest";
import { buildOtlpTracePayload } from "./otlpTrace.js";

describe("buildOtlpTracePayload", () => {
  it("builds valid OTLP JSON with timing attributes", () => {
    const payload = buildOtlpTracePayload({
      traceId: "abcd1234abcd1234abcd1234abcd1234",
      parentSpanId: "1234567890abcdef",
      spanName: "compute.provision",
      startTimeMs: 1000,
      endTimeMs: 1250,
      resourceAttributes: {
        "ctx.environment.id": "env_123",
        "ctx.organization.id": "org_456",
        "ctx.project.id": "proj_789",
        "ctx.run.id": "run_abc",
      },
      spanAttributes: {
        "compute.total_ms": 250,
        "compute.gateway.schedule_ms": 1,
        "compute.cache.image_cached": true,
      },
    });

    expect(payload.resourceSpans).toHaveLength(1);

    const resourceSpan = payload.resourceSpans[0]!;

    // $trigger=true so the webapp accepts it
    const triggerAttr = resourceSpan.resource.attributes.find((a) => a.key === "$trigger");
    expect(triggerAttr).toEqual({ key: "$trigger", value: { boolValue: true } });

    // Resource attributes
    const envAttr = resourceSpan.resource.attributes.find(
      (a) => a.key === "ctx.environment.id"
    );
    expect(envAttr).toEqual({
      key: "ctx.environment.id",
      value: { stringValue: "env_123" },
    });

    // Span basics
    const span = resourceSpan.scopeSpans[0]!.spans[0]!;
    expect(span.name).toBe("compute.provision");
    expect(span.traceId).toBe("abcd1234abcd1234abcd1234abcd1234");
    expect(span.parentSpanId).toBe("1234567890abcdef");

    // Integer attribute
    const totalMs = span.attributes.find((a) => a.key === "compute.total_ms");
    expect(totalMs).toEqual({ key: "compute.total_ms", value: { intValue: 250 } });

    // Boolean attribute
    const cached = span.attributes.find((a) => a.key === "compute.cache.image_cached");
    expect(cached).toEqual({ key: "compute.cache.image_cached", value: { boolValue: true } });
  });

  it("generates a valid 16-char hex span ID", () => {
    const payload = buildOtlpTracePayload({
      traceId: "abcd1234abcd1234abcd1234abcd1234",
      spanName: "test",
      startTimeMs: 1000,
      endTimeMs: 1001,
      resourceAttributes: {},
      spanAttributes: {},
    });

    const span = payload.resourceSpans[0]!.scopeSpans[0]!.spans[0]!;
    expect(span.spanId).toMatch(/^[0-9a-f]{16}$/);
  });

  it("converts timestamps to nanoseconds", () => {
    const payload = buildOtlpTracePayload({
      traceId: "abcd1234abcd1234abcd1234abcd1234",
      spanName: "test",
      startTimeMs: 1000,
      endTimeMs: 1250,
      resourceAttributes: {},
      spanAttributes: {},
    });

    const span = payload.resourceSpans[0]!.scopeSpans[0]!.spans[0]!;
    expect(span.startTimeUnixNano).toBe("1000000000");
    expect(span.endTimeUnixNano).toBe("1250000000");
  });

  it("omits parentSpanId when not provided", () => {
    const payload = buildOtlpTracePayload({
      traceId: "abcd1234abcd1234abcd1234abcd1234",
      spanName: "test",
      startTimeMs: 1000,
      endTimeMs: 1001,
      resourceAttributes: {},
      spanAttributes: {},
    });

    const span = payload.resourceSpans[0]!.scopeSpans[0]!.spans[0]!;
    expect(span.parentSpanId).toBeUndefined();
  });

  it("handles double values for non-integer numbers", () => {
    const payload = buildOtlpTracePayload({
      traceId: "abcd1234abcd1234abcd1234abcd1234",
      spanName: "test",
      startTimeMs: 1000,
      endTimeMs: 1001,
      resourceAttributes: {},
      spanAttributes: { "compute.cpu": 0.25 },
    });

    const span = payload.resourceSpans[0]!.scopeSpans[0]!.spans[0]!;
    const cpu = span.attributes.find((a) => a.key === "compute.cpu");
    expect(cpu).toEqual({ key: "compute.cpu", value: { doubleValue: 0.25 } });
  });
});

apps/supervisor/src/otlpTrace.ts

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
import { randomBytes } from "crypto";
import { SimpleStructuredLogger } from "@trigger.dev/core/v3/utils/structuredLogger";

const logger = new SimpleStructuredLogger("otlp-trace");

export interface OtlpTraceOptions {
  traceId: string;
  parentSpanId?: string;
  spanName: string;
  startTimeMs: number;
  endTimeMs: number;
  resourceAttributes: Record<string, string | number | boolean>;
  spanAttributes: Record<string, string | number | boolean>;
}

/** Build an OTLP JSON ExportTraceServiceRequest payload */
export function buildOtlpTracePayload(opts: OtlpTraceOptions) {
  const spanId = randomBytes(8).toString("hex");

  return {
    resourceSpans: [
      {
        resource: {
          attributes: [
            { key: "$trigger", value: { boolValue: true } },
            ...toOtlpAttributes(opts.resourceAttributes),
          ],
        },
        scopeSpans: [
          {
            scope: { name: "supervisor.compute" },
            spans: [
              {
                traceId: opts.traceId,
                spanId,
                parentSpanId: opts.parentSpanId,
                name: opts.spanName,
                kind: 3, // SPAN_KIND_CLIENT
                startTimeUnixNano: String(opts.startTimeMs * 1_000_000),
                endTimeUnixNano: String(opts.endTimeMs * 1_000_000),
                attributes: toOtlpAttributes(opts.spanAttributes),
                status: { code: 1 }, // STATUS_CODE_OK
              },
            ],
          },
        ],
      },
    ],
  };
}

/** Fire-and-forget: send an OTLP trace payload to the collector */
export function sendOtlpTrace(
  endpoint: string,
  payload: ReturnType<typeof buildOtlpTracePayload>
) {
  fetch(`${endpoint}/v1/traces`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(5_000),
  }).catch((err) => {
    logger.warn("failed to send compute provision span", {
      error: err instanceof Error ? err.message : String(err),
    });
  });
}

function toOtlpAttributes(
  attrs: Record<string, string | number | boolean>
): Array<{ key: string; value: Record<string, unknown> }> {
  return Object.entries(attrs).map(([key, value]) => ({
    key,
    value: toOtlpValue(value),
  }));
}

function toOtlpValue(value: string | number | boolean): Record<string, unknown> {
  if (typeof value === "string") return { stringValue: value };
  if (typeof value === "boolean") return { boolValue: value };
  if (Number.isInteger(value)) return { intValue: value };
  return { doubleValue: value };
}
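
One detail worth noting in `buildOtlpTracePayload`: for present-day epoch timestamps, `startTimeMs * 1_000_000` is around 1.7e18, which is above `Number.MAX_SAFE_INTEGER` (about 9.0e15), so the result depends on floating-point rounding and on how the number is formatted as a string. The sketch below shows one possible alternative for whole-millisecond inputs that stays in string space; `msToUnixNanoString` is a hypothetical helper and not part of this commit.

```typescript
// Convert whole epoch milliseconds to a nanosecond decimal string without
// multiplying past Number.MAX_SAFE_INTEGER. Hypothetical helper for illustration.
function msToUnixNanoString(ms: number): string {
  if (!Number.isInteger(ms) || ms < 0) {
    throw new RangeError("expected a non-negative whole number of milliseconds");
  }
  // 1 ms = 1_000_000 ns, so appending six zeros is exact for integers.
  return ms === 0 ? "0" : `${ms}000000`;
}

const startNano = msToUnixNanoString(1_000);
const epochNano = msToUnixNanoString(1_711_200_000_123);
```

The trade-off is that fractional milliseconds are rejected rather than rounded, so a caller with sub-millisecond timing would need to round first.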
