
Commit 0441966

more fixes
1 parent 3e64efd commit 0441966

File tree

5 files changed: +488 −11 lines

docs/developer_guide/tracing_integration_guide.mdx

Lines changed: 148 additions & 10 deletions
@@ -46,6 +46,152 @@ There are two levels of support we accept for a provider integration. Providers
Provider authors should update their trace logging to include these identifiers in a stable metadata location (e.g., top-level `metadata`, or a well-documented nested field), and document how adapters can reliably read them.
### Traces vs Observations (single completions)

Providers often expose two granularities:

- Traces: top-level units representing a multi-step flow or multi-turn conversation. These may contain nested spans/observations.
- Observations (or generations): individual LLM completions or smaller units within a trace.

Adapter expectations:

- Prefer using the provider's trace-level API to reconstruct a complete conversation. For providers that emit both `output.messages` and `output.choices`, adapters should:
  - Use `output.messages` to preserve the full multi-turn flow and tool role messages.
  - Optionally read `output.choices` to capture the final assistant bubble used by provider UIs.
- If only observation-level data is available, adapters should implement a best-effort stitch that recovers the longest/last conversation for a given parent trace. Document the fallbacks used.
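One possible best-effort stitch, sketched with plain dicts; the `messages` and `started_at` observation fields are illustrative, not a provider API:

```python
from typing import Any, Dict, List


def stitch_longest_last(observations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Best-effort recovery: pick the observation carrying the longest
    message list; break ties by recency (latest start time wins)."""
    if not observations:
        return []
    best = max(
        observations,
        key=lambda o: (len(o.get("messages") or []), o.get("started_at") or ""),
    )
    return best.get("messages") or []
```

Whatever heuristic you choose, document it so integrators know which thread the adapter returns.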
### Provider logging contract (inputs, outputs, metadata, tools)

This section defines a minimal, provider-agnostic contract for what your tracer should log to unlock Basic IO and then EP-ready capabilities.

#### Inputs

- Basic input (minimal):

```json
{
  "inputs": {
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "Say hi in one word."}
    ]
  }
}
```
- Input with EP metadata (EP-ready):

```json
{
  "inputs": {
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "Say hi in one word."}
    ],
    "metadata": {
      "invocation_id": "ivk_demo",
      "experiment_id": "exp_demo",
      "rollout_id": "rll_demo",
      "run_id": "",
      "row_id": "row_demo"
    },
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "calculator_add",
          "description": "Add two integers",
          "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"]
          }
        }
      }
    ]
  }
}
```
Notes for providers:

- Keep metadata values as non-null strings where your API forbids nulls in metadata fields.
- If your SDK requires a flag to persist metadata (e.g., a "store" toggle), document it and ensure your examples enable it.
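A small helper of the kind a provider example might use to satisfy the non-null-string rule; `sanitize_metadata` is an illustrative name, not part of any SDK:

```python
from typing import Any, Dict


def sanitize_metadata(meta: Dict[str, Any]) -> Dict[str, str]:
    """Coerce every metadata value to a string, mapping None to ""
    so backends that forbid null metadata values accept the payload."""
    return {key: ("" if value is None else str(value)) for key, value in meta.items()}
```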
#### Outputs

- Basic output (UI-friendly final bubble):

```json
{
  "output": {
    "choices": [
      {"message": {"role": "assistant", "content": "Hi"}}
    ]
  }
}
```
- Full output (agent-friendly full thread):

```json
{
  "output": {
    "messages": [
      {"role": "assistant", "content": null, "tool_calls": [
        {"id": "call_1", "type": "function", "function": {"name": "calculator_add", "arguments": "{\"a\":2,\"b\":3}"}}
      ]},
      {"role": "tool", "tool_call_id": "call_1", "name": "calculator_add", "content": "{\"a\":2,\"b\":3,\"sum\":5}"},
      {"role": "assistant", "content": "The result is 5."}
    ]
  }
}
```
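To illustrate how an adapter consumes this shape, here is a plain-dict sketch of extracting the final assistant text; `last_assistant_content` is a hypothetical helper, not an EP API:

```python
from typing import Any, Dict, Optional


def last_assistant_content(output: Dict[str, Any]) -> Optional[str]:
    """Walk output.messages backwards for the final assistant text,
    falling back to output.choices for providers that only log the
    final bubble."""
    for msg in reversed(output.get("messages") or []):
        if msg.get("role") == "assistant" and msg.get("content"):
            return msg["content"]
    choices = output.get("choices") or []
    if choices:
        return (choices[0].get("message") or {}).get("content")
    return None
```

Note the walk skips assistant messages whose `content` is null (pure tool-call turns), which is why preferring `output.messages` still yields the right bubble.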
- Output with EP metadata (optional redundancy):

```json
{
  "output": {
    "messages": [/* ... as above ... */],
    "metadata": {
      "invocation_id": "ivk_demo",
      "experiment_id": "exp_demo",
      "rollout_id": "rll_demo",
      "run_id": "",
      "row_id": "row_demo"
    }
  }
}
```
Recommendations:

- Log `output.messages` to preserve multi-turn and tool role messages.
- Log `output.choices` to render a final assistant bubble in UIs that expect it.
#### Tool calling

To fully support tool-augmented agents:

- Assistant tool calls must be logged on the assistant message as an array of OpenAI-style objects under `tool_calls`, each with `id`, `type` (`function`), and `function` (name plus an arguments string).
- Tool results must be logged as a separate message with `role: tool`, a `tool_call_id` matching the assistant's `tool_calls` id, and the tool's output under `content` (stringified JSON is acceptable).
- If you support multiple tool calls in a single assistant turn, ensure all calls and their matching tool messages are included and ordered.
- If you expose a top-level tool schema, include it in `inputs.tools`.
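A sketch of the kind of pairing check an adapter test might run to enforce this contract, assuming OpenAI-style message dicts (`tool_calls_paired` is an illustrative name):

```python
from typing import Any, Dict, List


def tool_calls_paired(messages: List[Dict[str, Any]]) -> bool:
    """Verify every assistant tool_call id has a matching role=tool message,
    and every tool message answers an issued tool_call."""
    issued = [
        call["id"]
        for msg in messages
        if msg.get("role") == "assistant"
        for call in (msg.get("tool_calls") or [])
    ]
    answered = [
        msg.get("tool_call_id")
        for msg in messages
        if msg.get("role") == "tool"
    ]
    return sorted(issued) == sorted(answered)
```

A check like this catches the common provider bug where parallel tool calls are logged but only the first tool result survives.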
#### Adapter mapping (EP-ready)

Adapters should map the provider's metadata into EP models:

- `inputs.metadata.row_id` → `EvaluationRow.input_metadata.row_id`
- `inputs.metadata.invocation_id` → `EvaluationRow.execution_metadata.invocation_id`
- `inputs.metadata.experiment_id` → `EvaluationRow.execution_metadata.experiment_id`
- `inputs.metadata.rollout_id` → `EvaluationRow.execution_metadata.rollout_id`
- `inputs.metadata.run_id` → `EvaluationRow.execution_metadata.run_id`
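As a sketch, the mapping can be expressed with plain dicts standing in for the EP models (the real adapter constructs `EvaluationRow`/`ExecutionMetadata` objects, as the Weave diff below in this commit does):

```python
from typing import Any, Dict


def map_ep_metadata(inputs_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Translate provider-logged EP identifiers into EvaluationRow-shaped fields."""
    return {
        "input_metadata": {"row_id": inputs_metadata.get("row_id")},
        "execution_metadata": {
            key: inputs_metadata.get(key)
            for key in ("invocation_id", "experiment_id", "rollout_id", "run_id")
        },
    }
```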
Additionally, adapters should continue to populate provider-native IDs in `input_metadata.session_data` (for joinability back to the provider's UI), and they should prefer `output.messages` over `output.choices` to reconstruct the longest/last conversation thread.
#### Query & retrieval expectations

Providers should:

- Expose a way to query trace roots by EP identifiers (e.g., `inputs.metadata.invocation_id`, `row_id`) and time windows.
- Return the full conversation (preferably via `output.messages`) or provide a clear way to join observations to reconstruct the longest/last thread.
- Support filtering by IDs or tags so integrators can validate ingestion deterministically in tests.
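Deterministic validation can then be as simple as filtering already-fetched traces by EP identifiers; the field paths follow the contract above, and the function name is illustrative:

```python
from typing import Any, Dict, List, Optional


def filter_traces_by_ep_ids(
    traces: List[Dict[str, Any]],
    invocation_id: Optional[str] = None,
    row_id: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Keep traces whose inputs.metadata matches the requested EP identifiers."""
    matched = []
    for trace in traces:
        meta = (trace.get("inputs") or {}).get("metadata") or {}
        if invocation_id is not None and meta.get("invocation_id") != invocation_id:
            continue
        if row_id is not None and meta.get("row_id") != row_id:
            continue
        matched.append(trace)
    return matched
```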
### 1. Trace ingestion fidelity

* **Conversation reconstruction** – Convert provider-specific trace payloads into the Eval Protocol message schema while keeping system, user, assistant, and tool messages intact. Langfuse, LangSmith, and Braintrust adapters follow this pattern by transforming trace inputs and outputs into `EvaluationRow` instances with preserved session metadata.【F:eval_protocol/adapters/langfuse.py†L60-L161】【F:eval_protocol/adapters/langsmith.py†L28-L188】【F:eval_protocol/adapters/braintrust.py†L48-L127】【F:eval_protocol/adapters/utils.py†L16-L98】
@@ -109,21 +255,13 @@ Document required environment variables, authentication expectations, and typica
* **Outbound feedback** – Maintain score upload helpers so evaluations can annotate project logs directly in Braintrust.【F:eval_protocol/adapters/braintrust.py†L224-L299】
* **Remote validation** – Update the Chinook smoke tests (or similar) when Braintrust introduces new trace shapes, ensuring multi-turn and tool-heavy workflows remain covered.【F:tests/chinook/braintrust/test_braintrust_chinook.py†L37-L86】

### Template for New Providers

1. Mirror the structure of existing adapters—constructor with dependency checks, `get_evaluation_rows`, optional `upload_scores`, and a factory helper.
2. Implement feature-complete ingestion for the provider's highest-value trace shapes before expanding to niche cases.
3. Add unit tests that cover ingestion, metadata, and tool usage. Use the Braintrust and LangSmith tests as templates for asserting conversation fidelity.【F:tests/adapters/test_braintrust_adapter.py†L50-L333】【F:tests/adapters/test_langsmith_adapter.py†L25-L181】
4. Provide at least one opt-in smoke test (skipped when credentials are missing) to catch regressions against the live API.【F:tests/test_adapters_e2e.py†L17-L193】
5. Document setup steps and limitations in `docs/` so contributors understand how to run validations locally.
### Weave

* **Ingestion** – Prefer `output.messages` for complete conversations (including tool role messages) and include `output.choices` for Weave chat UI rendering. Store Weave IDs (trace, project) in `input_metadata.session_data`.
* **Tooling support** – Ensure assistant `tool_calls` and tool role messages are preserved by processing `output.messages`. Include parallel tool calls in future fixtures.
* **EP identifiers (to become EP-ready)** – Add `metadata.invocation_id`, `metadata.experiment_id`, `metadata.rollout_id`, `metadata.run_id`, and `metadata.row_id` into traces at logging time so the adapter can map them to `EvaluationRow.input_metadata.row_id` and `execution_metadata` fields.
* **Outbound scores** – No public feedback API at this time; document as not available.
* **Testing** – Provide unit tests that mock Service API responses, including `output.messages` with tool calls and tool messages; add a credential-gated E2E when feasible. For chat completion capture via the LiteLLM default path, validate Weave's LiteLLM integration (`weave.init(...)` + `litellm.completion(...)`) and ensure EP metadata is present in logged payloads for multi-turn workflows.

## Compatibility and Validation Matrix

@@ -134,7 +272,7 @@ The table below tracks the current validation status for each tracing provider.
| **Langfuse** | ✅ Span-aware extraction populates `EvaluationRow` metadata for downstream joins.【F:eval_protocol/adapters/langfuse.py†L60-L161】 |`upload_scores` / `upload_score` sync evaluation results to Langfuse.【F:eval_protocol/adapters/langfuse.py†L569-L625】 | ✅ Utilizes shared utilities to keep tool schemas and tool messages intact.【F:eval_protocol/adapters/langfuse.py†L60-L93】【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Supported by message parsing but add dedicated regression tests for multi-call traces.【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Handles standard text payloads; add fixtures for multi-modal spans as they become available.【F:eval_protocol/adapters/langfuse.py†L115-L159】 | ✅ Credential-gated E2E tests fetch live traces and validate message integrity.【F:tests/test_adapters_e2e.py†L17-L193】 |
| **LangSmith** | ✅ Converts diverse payload shapes and stores run/trace IDs in session metadata.【F:eval_protocol/adapters/langsmith.py†L130-L205】【F:eval_protocol/adapters/langsmith.py†L172-L189】 | ❌ Implement score upload once LangSmith exposes a feedback API comparable to Langfuse.【F:eval_protocol/adapters/langfuse.py†L569-L625】 | ✅ Normalizes assistant tool calls and preserves tool role messages in reconstructed conversations.【F:eval_protocol/adapters/langsmith.py†L315-L352】【F:tests/adapters/test_langsmith_adapter.py†L83-L129】 | ✅ Unit tests cover multiple tool calls in a single assistant turn.【F:tests/adapters/test_langsmith_adapter.py†L155-L181】 | 🟡 `_extract_messages_from_payload` flattens list-based content; add coverage for richer multi-modal payloads.【F:eval_protocol/adapters/langsmith.py†L289-L406】 | ✅ Comprehensive unit tests mock `list_runs` responses across scenarios.【F:tests/adapters/test_langsmith_adapter.py†L25-L181】 |
| **Braintrust** | ✅ BTQL ingestion captures conversation messages and embeds trace IDs for later score pushback.【F:eval_protocol/adapters/braintrust.py†L129-L221】【F:eval_protocol/adapters/braintrust.py†L75-L83】 | ✅ Score upload helpers annotate project logs through Braintrust's feedback API.【F:eval_protocol/adapters/braintrust.py†L224-L299】 | ✅ Tests ensure assistant tool calls, tool responses, and metadata-provided tool schemas are preserved.【F:tests/adapters/test_braintrust_adapter.py†L65-L253】 | 🟡 Shared utilities support multiple tool calls; add explicit Braintrust fixtures when providers emit them.【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Current conversion focuses on text; extend tests if Braintrust exposes structured multi-modal payloads.【F:eval_protocol/adapters/braintrust.py†L102-L127】 | ✅ Unit tests mock BTQL responses and error paths; Chinook scenario validates real-world usage.【F:tests/adapters/test_braintrust_adapter.py†L50-L333】【F:tests/chinook/braintrust/test_braintrust_chinook.py†L37-L86】 |
| **Weave** | 🟡 Converts `output.messages` and `output.choices`; stores Weave trace/project IDs in session data. Add EP identifiers to become EP-ready. | ❌ No public score pushback API documented. | ✅ Preserves assistant `tool_calls` and tool role messages when `output.messages` provided. | 🟡 Parsing supports multiple calls; add explicit fixtures. | 🟡 Standard text payloads supported; add multi-modal when available. | ✅ Unit tests mock Service API responses; local E2E scripts validate UI/adapter behavior; validate LiteLLM integration for default completion path. |

Legend: ✅ — validated; 🟡 — supported in code but needs additional coverage; ❌ — not yet implemented.

eval_protocol/adapters/weave.py

Lines changed: 22 additions & 1 deletion
@@ -16,7 +16,7 @@
import requests

from eval_protocol.models import EvaluationRow, InputMetadata, Message, ExecutionMetadata
from .base import BaseAdapter
from .utils import extract_messages_from_data

@@ -122,15 +122,36 @@ def convert_trace_to_evaluation_row(trace: Dict[str, Any], include_tool_calls: b
        project_id = str(trace.get("project_id", ""))
        weave_trace_id = str(trace.get("id", ""))

        # EP metadata mapping (if provider logged metadata under inputs or output)
        ep_meta: Dict[str, Any] = {}
        inputs_obj = trace.get("inputs") or {}
        if isinstance(inputs_obj, dict) and isinstance(inputs_obj.get("metadata"), dict):
            ep_meta = inputs_obj.get("metadata") or {}
        elif isinstance(trace.get("output"), dict) and isinstance(trace["output"].get("metadata"), dict):
            ep_meta = trace["output"].get("metadata") or {}

        row_id_val = ep_meta.get("row_id")
        invocation_id_val = ep_meta.get("invocation_id")
        experiment_id_val = ep_meta.get("experiment_id")
        rollout_id_val = ep_meta.get("rollout_id")
        run_id_val = ep_meta.get("run_id")

        return EvaluationRow(
            messages=messages,
            tools=tools,
            input_metadata=InputMetadata(
                row_id=str(row_id_val) if row_id_val else None,
                session_data={
                    "weave_trace_id": weave_trace_id,
                    "weave_project_id": project_id,
                },
            ),
            execution_metadata=ExecutionMetadata(
                invocation_id=str(invocation_id_val) if invocation_id_val else None,
                experiment_id=str(experiment_id_val) if experiment_id_val else None,
                rollout_id=str(rollout_id_val) if rollout_id_val else None,
                run_id=str(run_id_val) if run_id_val else None,
            ),
        )
    except (KeyError, TypeError, ValueError) as e:
        logger.error("Error converting Weave trace %s: %s", trace.get("id", "unknown"), e)
