
feat(llm-obs): port non-org2 MCP trace and eval endpoints#489

Merged
platinummonkey merged 2 commits into DataDog:main from mbldatadog:worktree-port_mcp_functionality on May 14, 2026
Conversation

@mbldatadog (Contributor) commented on May 11, 2026

Summary

Ports the non-org2-gated LLMObs MCP server endpoints to pup as first-class CLI commands, adds evaluator CRUD via the unstable MCP endpoints, and adds a `--summary` flag for span search.

Coverage-gap analysis that motivated this work: comparing 26 MCP tools against existing pup commands post-v0.55.0 identified 15 missing; this PR implements the 8 non-org2-gated trace/span tools plus 7 eval commands.

Orphaned-span root cause writeup: https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516

(Notebook `edit` command split to #495 for notebook-team review.)

New commands

`pup llm-obs spans` — 6 new subcommands

| Command | What it does |
| --- | --- |
| `get-trace --trace-id` | Full span hierarchy tree with depth/error summary |
| `get-details --trace-id --span-ids` | Timing, cost metrics, children IDs; warns on stderr when a span is orphaned from the trace tree |
| `get-content --trace-id --span-id --field` | Raw content fields; `--field` is required (the server returns 400 without it) |
| `find-errors --trace-id` | All error spans with type, message, and parent context |
| `expand --trace-id --span-ids` | Direct children with `has_input`/`has_output` flags |
| `get-agent-loop --trace-id` | Chronological agent execution steps |

`pup llm-obs evals` — 5 new subcommands (2 existing)

| Command | What it does |
| --- | --- |
| `get-evaluator` | Full LLM-judge config via MCP: span filters, sampling, scope, prompt, schema. Use before `create-or-update`; read/write body schemas differ by backend design |
| `get-config` | Prompt template, assessment criteria, output schema |
| `get-aggregate-stats` | Pass/fail rates and score distributions over a time window; `--ml-app` narrows to one app |
| `create-or-update --file` | Full-replace publish (flat body; see schema note below) |
| `delete` | Removes named evaluator |

`pup llm-obs spans search --summary`

Strips `tags`, `llm_info`, and content previews from each span, keeping 11 essential fields (`span_id`, `trace_id`, `apm_trace_id`, `name`, `span_kind`, `ml_app`, `service`, `status`, `duration_ms`, `start_ms`, `parent_id`). Reduces payload ~80% for bulk analysis phases.
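The projection above can be sketched as a whitelist filter over span fields. This is an illustrative sketch, not pup's actual implementation; spans are modeled as simple string maps for brevity:

```rust
use std::collections::HashMap;

// The 11 summary fields named above, kept as a whitelist.
const SUMMARY_FIELDS: [&str; 11] = [
    "span_id", "trace_id", "apm_trace_id", "name", "span_kind",
    "ml_app", "service", "status", "duration_ms", "start_ms", "parent_id",
];

// Keep only whitelisted fields, dropping `tags`, `llm_info`, and previews.
fn summarize(span: &HashMap<String, String>) -> HashMap<String, String> {
    span.iter()
        .filter(|(k, _)| SUMMARY_FIELDS.contains(&k.as_str()))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}

fn main() {
    let mut span = HashMap::new();
    span.insert("span_id".to_string(), "abc123".to_string());
    span.insert("status".to_string(), "ok".to_string());
    span.insert("tags".to_string(), "env:prod".to_string());
    span.insert("llm_info".to_string(), "model:gpt".to_string());
    let slim = summarize(&span);
    assert!(slim.contains_key("span_id") && slim.contains_key("status"));
    assert!(!slim.contains_key("tags") && !slim.contains_key("llm_info"));
    println!("kept {} of {} fields", slim.len(), span.len());
}
```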

Changes

  • `src/commands/llm_obs.rs` — 13 new handler functions, 79 tests (26 new)
  • `src/main.rs` — new enum variants and routing for all new commands

Orphaned-span warning

`get-details` warns on stderr when fewer spans are returned than requested:

```
warning: 1 of 1 requested span(s) not found in trace hierarchy — the span may
exist but be orphaned (no path to a root span). Use 'spans get-content' to
retrieve its content directly.
```

Root cause: `spans search` returns spans by raw `@trace_id` (includes orphaned spans); `get-details` reconstructs the BFS tree and silently drops spans unreachable from any root. The response key path is `resp["spans"]` (raw API) not `resp["data"]["spans"]` (agent-mode formatter envelope) — regression test added.

Schema note for `evals create-or-update`

The `--file` uses a flat body (all fields top-level: `application_name`, `enabled`, `prompt_template`, `output_schema`, etc.), which differs from the nested structure returned by `get-evaluator` (`target.application_name`, `llm_judge_config.prompt_template`, `llm_provider.*`). This is a backend design decision — the read and write APIs use different shapes.
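A minimal sketch of the read-to-write re-shaping a caller has to do, using only two fields for brevity. The struct names and the `to_write_body` helper are hypothetical; the field paths (`target.application_name`, `llm_judge_config.prompt_template`) follow the shapes described above:

```rust
// Hypothetical nested shape as returned by `evals get-evaluator`.
struct Target { application_name: String }
struct JudgeConfig { prompt_template: String }
struct EvaluatorRead {
    target: Target,
    llm_judge_config: JudgeConfig,
}

// Hypothetical flat body as expected by `evals create-or-update --file`.
struct EvaluatorWrite {
    application_name: String,
    prompt_template: String,
}

// Flatten the nested read response into a write body before editing
// and republishing with create-or-update.
fn to_write_body(r: &EvaluatorRead) -> EvaluatorWrite {
    EvaluatorWrite {
        application_name: r.target.application_name.clone(),
        prompt_template: r.llm_judge_config.prompt_template.clone(),
    }
}

fn main() {
    let read = EvaluatorRead {
        target: Target { application_name: "docs_ai".into() },
        llm_judge_config: JudgeConfig {
            prompt_template: "Rate as pass or fail.".into(),
        },
    };
    let write = to_write_body(&read);
    assert_eq!(write.application_name, "docs_ai");
    println!("flat body ready for app {}", write.application_name);
}
```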

Testing

Automated

  • `cargo test` — 79 tests, all passing
  • `cargo clippy -- -D warnings` — clean
  • `cargo fmt --check` — clean

Manual smoke tests (run against org2)

Trace/span IDs age out of the index quickly — substitute fresh IDs from `pup llm-obs spans search --limit 1` as needed.

`evals get-evaluator`

  • `pup llm-obs evals get-evaluator failure-to-answer-verdict` — returns full LLM-judge config (prompt, schema, target, provider)
  • `pup llm-obs evals get-evaluator nonexistent-eval-xyz` — 404 "custom evaluator not found: …" (exit 1)

`evals get-config`

  • `pup llm-obs evals get-config failure-to-answer-verdict` — returns prompt template + output schema
  • `pup llm-obs evals get-config nonexistent-eval-xyz` — 404 (exit 1)

`evals get-aggregate-stats`

  • `pup llm-obs evals get-aggregate-stats failure-to-answer-verdict --ml-app docs_ai --from 24h` — returns pass rate, fail count, top categorical values
  • `pup llm-obs evals get-aggregate-stats failure-to-answer-verdict --from not-a-time` — local parse error before any network call (exit 1)

`evals create-or-update` + `delete` (creates and cleans up a disposable evaluator)
```bash
cat > /tmp/pup_smoke_eval.json << 'BODY'
{
  "application_name": "docs_ai",
  "enabled": false,
  "sampling_percentage": 1,
  "root_spans_only": true,
  "eval_scope": "span",
  "integration_provider": "openai",
  "model_name": "gpt-4.1-mini",
  "parsing_type": "structured_output",
  "prompt_template": [
    {"role": "system", "content": "Rate as pass or fail."},
    {"role": "user", "content": "{{span_output}}"}
  ],
  "assessment_criteria": {"pass_values": ["pass"]},
  "output_schema": {
    "name": "categorical_eval",
    "schema": {
      "additionalProperties": false,
      "properties": {
        "categorical_eval": {
          "anyOf": [
            {"const": "pass", "description": "Satisfactory."},
            {"const": "fail", "description": "Unsatisfactory."}
          ],
          "type": "string"
        },
        "reasoning": {"description": "Brief explanation.", "type": "string"}
      },
      "required": ["categorical_eval", "reasoning"],
      "type": "object"
    },
    "strict": true
  }
}
BODY
```

  • `pup llm-obs evals create-or-update pup-smoke-test-DELETE-ME --file /tmp/pup_smoke_eval.json` — prints confirmation (exit 0)
  • `pup llm-obs evals get-evaluator pup-smoke-test-DELETE-ME` — confirms it was created
  • `pup llm-obs evals delete pup-smoke-test-DELETE-ME` — prints confirmation (exit 0)
  • `echo '{}' | pup llm-obs evals create-or-update pup-smoke-bad --file /dev/stdin` — 400 "target.application_name cannot be empty" (exit 1)
  • `pup llm-obs evals delete nonexistent-eval-xyz` — 404 (exit 1)

`spans search --summary`

  • `pup llm-obs spans search --span-kind llm --root-spans-only --limit 3 --from 15m --summary` — each span has only 11 fields, no `tags`/`llm_info`/previews
  • Same without `--summary` — compare to see what's stripped

`spans get-trace`

  • `pup llm-obs spans get-trace --trace-id --include-tree` — returns span tree with `has_errors`, service list, span kind counts
  • `pup llm-obs spans get-trace --trace-id 0000000000000000` — 404 (exit 1)

`spans get-details`

  • `pup llm-obs spans get-details --trace-id --span-ids <span_id>` — returns metadata including cost metrics
  • With an orphaned span ID: stderr warning fires, spans array empty, exit 0

`spans get-content`

  • `pup llm-obs spans get-content --trace-id --span-id <span_id> --field input` — returns full content
  • Same without `--field` — clap error "required arguments not provided" (exit 2)

`spans find-errors`

  • `pup llm-obs spans find-errors --trace-id ` — returns error spans with type, message, parent context
  • `pup llm-obs spans find-errors --trace-id 0000000000000000` — empty `error_spans` array, exit 0

`spans expand`

  • `pup llm-obs spans expand --trace-id --span-ids <root_span_id>` — returns children with `has_input`/`has_output` flags
  • `pup llm-obs spans expand --trace-id 0000000000000000 --span-ids deadbeef` — empty, exit 0

`spans get-agent-loop`

  • `pup llm-obs spans get-agent-loop --trace-id ` — returns agent span name and iterations
  • On a trace with no agent span — 404 "no agent span (kind=agent) found; get_agent_loop requires an agent span with LLM children" (exit 1)

🤖 Generated with Claude Code

@mbldatadog requested a review from a team as a code owner on May 11, 2026 20:06
@mbldatadog force-pushed the worktree-port_mcp_functionality branch from b693070 to 348d806 on May 11, 2026 23:26
platinummonkey previously approved these changes on May 12, 2026
@mbldatadog (Contributor, Author) commented:

NB: please don't review/merge yet. I'm trying to get one clean, tested PR that has all the pup functionality needed to run a set of skills we have that can currently only run on MCP, in advance of shipping/publicizing those skills.

@mbldatadog force-pushed the worktree-port_mcp_functionality branch 2 times, most recently from 09be958 to ef4eab2 on May 12, 2026 15:41
mbldatadog and others added 2 commits on May 13, 2026 08:11
…ary, notebooks edit

Ports the non-org2-gated LLMObs MCP server endpoints to pup, adds evaluator
CRUD commands, a --summary flag for span search, and append-only notebook edits.

## New commands

### pup llm-obs spans (6 new subcommands)
- get-trace --trace-id      Full span hierarchy tree with depth/error summary
- get-details --span-ids    Timing, cost metrics, children IDs; warns on stderr
                            when a span exists in raw storage but is orphaned
                            from the LLMObs trace tree (see note below)
- get-content --field       Raw content fields (input/output/messages/metadata)
                            --field is required; server returns 400 without it
- find-errors               All error spans with type, message, parent context
- expand --span-ids         Direct children with has_input/has_output flags
- get-agent-loop            Chronological agent execution steps

### pup llm-obs evals (7 subcommands, 5 new)
- list                      Org-wide evaluator list (existing)
- list-by-ml-app            Per-app evaluator list (existing)
- get-evaluator             Full LLM-judge config via MCP (span filters,
                            sampling, scope, prompt, schema) — use before
                            create-or-update; read/write schemas are flat vs
                            nested by backend design
- get-config                Prompt template, assessment criteria, output schema
- get-aggregate-stats       Pass/fail rates and score distributions over a
                            time window; --ml-app narrows to one app
- create-or-update          Full-replace publish (flat body; see --help)
- delete                    Removes named evaluator

### pup llm-obs spans search --summary
Strips tags, llm_info, and content previews from each span, keeping 11
essential fields. Reduces payload ~80% for bulk analysis phases.

### pup notebooks edit <id> --file <cells.json>
Append-only update: fetches current notebook, appends cells from file
(array of cell objects), writes back. Prevents clobbering existing content.

## Orphaned-span warning

get-details warns on stderr when fewer spans come back than were requested:

  warning: 1 of 1 requested span(s) not found in trace hierarchy — the span
  may exist but be orphaned (no path to a root span). Use 'spans get-content'
  to retrieve its content directly.

Root cause: spans search returns by raw @trace_id (includes orphaned spans);
get-details reconstructs the BFS tree and silently drops spans unreachable
from any root. Documented at:
https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The /api/unstable/llm-obs-mcp/v1/eval/config endpoint is being removed
upstream. All callers should use evals get-evaluator instead, which returns
a strict superset of what get-config returned (includes prompt template,
output schema, assessment criteria plus span filters, sampling, and scope).

Removes: evals_get_config function, GetConfig enum variant, routing, and tests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@mbldatadog force-pushed the worktree-port_mcp_functionality branch from e0b3278 to 950c165 on May 13, 2026 12:12
@mbldatadog (Contributor, Author) commented:

Okay @platinummonkey, this is now fully tested on the set of skills I want it to work with, and they behave roughly the same as on the MCP endpoints. Mind merging it now?

@platinummonkey platinummonkey merged commit e111fb9 into DataDog:main May 14, 2026
6 checks passed

Labels: enhancement (New feature or request), product:bits-ai