@@ -46,6 +46,152 @@ There are two levels of support we accept for a provider integration. Providers
46
46
47
47
Provider authors should update their trace logging to include these identifiers in a stable metadata location (e.g., top-level `metadata`, or a well-documented nested field), and document how adapters can reliably read them.
48
48
49
+
### Traces vs Observations (single completions)
50
+
51
+
Providers often expose two granularities:
52
+
- Traces: top-level units representing a multi-step flow or multi-turn conversation. These may contain nested spans/observations.
53
+
- Observations (or generations): individual LLM completions or smaller units within a trace.
54
+
55
+
Adapter expectations:
56
+
- Prefer using the provider’s trace-level API to reconstruct a complete conversation. For providers that emit both `output.messages` and `output.choices`, adapters should:
57
+
- Use `output.messages` to preserve the full multi-turn flow and tool role messages.
58
+
- Optionally read `output.choices` to capture the final assistant bubble used by provider UIs.
59
+
- If only observation-level data is available, adapters should implement a best-effort stitch that recovers the longest/last conversation for a given parent trace. Document the fallbacks used.
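The adapter expectations above can be sketched roughly as follows. This is a minimal illustration, not the behavior of any existing adapter: the function names are hypothetical, the fallback to `choices[0]["message"]` assumes an OpenAI-style choice object, and the observation list is assumed to arrive in chronological order.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def messages_from_output(output: Dict[str, Any]) -> List[Message]:
    """Prefer output.messages; fall back to the first choice's message."""
    messages = output.get("messages")
    if messages:
        return list(messages)
    # Assumption: choices follow the OpenAI shape {"message": {...}}.
    choices = output.get("choices") or []
    if choices and isinstance(choices[0], dict) and "message" in choices[0]:
        return [choices[0]["message"]]
    return []


def stitch_observations(observations: List[Dict[str, Any]]) -> List[Message]:
    """Best-effort fallback: keep the longest conversation, latest wins on ties."""
    best: List[Message] = []
    for observation in observations:  # assumed chronological
        candidate = messages_from_output(observation.get("output") or {})
        if candidate and len(candidate) >= len(best):
            best = candidate
    return best
```

Using `>=` rather than `>` means a later observation replaces an earlier one of the same length, which is one way to read "longest/last"; an adapter that picks a different tie-break should document it.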
{"role": "assistant", "content": "The result is 5."}
143
+
]
144
+
}
145
+
}
146
+
```
147
+
148
+
- Output with EP metadata (optional redundancy):
149
+
150
+
```json
151
+
{
152
+
"output": {
153
+
"messages": [/* ... as above ... */],
154
+
"metadata": {
155
+
"invocation_id": "ivk_demo",
156
+
"experiment_id": "exp_demo",
157
+
"rollout_id": "rll_demo",
158
+
"run_id": "",
159
+
"row_id": "row_demo"
160
+
}
161
+
}
162
+
}
163
+
```
164
+
165
+
Recommendations:
166
+
- Log `output.messages` to preserve multi-turn and tool role messages.
167
+
- Log `output.choices` to render a final assistant bubble in UIs that expect it.
168
+
169
+
#### Tool calling
170
+
171
+
To fully support tool-augmented agents:
172
+
- Assistant tool calls must be logged on the assistant message as an array of OpenAI-style objects under `tool_calls`, each with `id`, `type` (`function`), and `function` (an object with `name` and `arguments`, where `arguments` is a string).
173
+
- Tool results must be logged as a separate message with `role: tool` that includes `tool_call_id` (matching the `id` of the assistant’s tool call) and the tool’s output under `content` (stringified JSON is acceptable).
174
+
- If you support multiple tool calls in a single assistant turn, ensure all calls and matching tool messages are included and ordered.
175
+
- If you expose a top-level tool schema, include it in `inputs.tools`.
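As a rough illustration of the tool-calling rules above, the sketch below checks a logged conversation for well-formed `tool_calls` and matching `role: tool` messages. The helper name `check_tool_calls` and the example conversation are hypothetical and only meant for use in tests; they are not part of any adapter.

```python
import json
from typing import Any, Dict, List


def check_tool_calls(messages: List[Dict[str, Any]]) -> List[str]:
    """Return the problems found; an empty list means the log looks consistent."""
    problems: List[str] = []
    pending: set = set()  # tool call ids still waiting for a tool message
    for index, message in enumerate(messages):
        role = message.get("role")
        if role == "assistant":
            for call in message.get("tool_calls") or []:
                function = call.get("function") or {}
                if not call.get("id") or call.get("type") != "function":
                    problems.append(f"message {index}: tool call needs id and type 'function'")
                if not function.get("name") or not isinstance(function.get("arguments"), str):
                    problems.append(f"message {index}: function needs name and string arguments")
                if call.get("id"):
                    pending.add(call["id"])
        elif role == "tool":
            call_id = message.get("tool_call_id")
            if call_id in pending:
                pending.discard(call_id)
            else:
                problems.append(f"message {index}: tool_call_id {call_id!r} matches no earlier call")
    problems.extend(f"tool call {call_id!r} has no tool message" for call_id in sorted(pending))
    return problems


# Hypothetical conversation with one tool call and its result.
example = [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "add", "arguments": json.dumps({"a": 2, "b": 3})}},
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "5"},
    {"role": "assistant", "content": "The result is 5."},
]
print(check_tool_calls(example))
```

The check only verifies that every call has a later matching tool message; it does not enforce a particular ordering among parallel calls.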
176
+
177
+
#### Adapter mapping (EP-ready)
178
+
179
+
Adapters should map the provider’s metadata into EP models:
Additionally, adapters should continue to populate provider-native IDs in `input_metadata.session_data` (for joinability back to the provider’s UI), and they should prefer `output.messages` over `output.choices` to reconstruct the longest/last conversation thread.
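One possible shape for this mapping is sketched below. The dataclasses are stand-ins defined here for illustration, not the real Eval Protocol models, and `map_metadata` is a hypothetical helper. Placing `row_id` on the input metadata and the remaining identifiers on the execution metadata follows the description in the Weave section of this document; treating an empty string as missing is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

EP_EXECUTION_KEYS = ("invocation_id", "experiment_id", "rollout_id", "run_id")


@dataclass
class InputMetadataSketch:
    """Stand-in for the input metadata on an evaluation row."""
    row_id: Optional[str] = None
    session_data: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ExecutionMetadataSketch:
    """Stand-in for the execution metadata on an evaluation row."""
    invocation_id: Optional[str] = None
    experiment_id: Optional[str] = None
    rollout_id: Optional[str] = None
    run_id: Optional[str] = None


def map_metadata(
    ep_metadata: Dict[str, Any], provider_ids: Dict[str, Any]
) -> Tuple[InputMetadataSketch, ExecutionMetadataSketch]:
    """Split logged EP identifiers across the two stand-in models."""

    def clean(value: Any) -> Optional[str]:
        return value or None  # assumption: "" means "not set"

    input_metadata = InputMetadataSketch(
        row_id=clean(ep_metadata.get("row_id")),
        session_data=dict(provider_ids),  # provider-native IDs kept for joinability
    )
    execution_metadata = ExecutionMetadataSketch(
        **{key: clean(ep_metadata.get(key)) for key in EP_EXECUTION_KEYS}
    )
    return input_metadata, execution_metadata
```

Keeping provider-native IDs in `session_data` separate from the EP identifiers lets a row be joined back to the provider's UI even when no EP metadata was logged.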
187
+
188
+
#### Query & retrieval expectations
189
+
190
+
Providers should:
191
+
- Expose a way to query trace roots by EP identifiers (e.g., `inputs.metadata.invocation_id`, `row_id`) and time windows.
192
+
- Return the full conversation (preferably via `output.messages`) or provide a clear way to join observations to reconstruct the longest/last thread.
193
+
- Support filtering by IDs or tags so integrators can validate ingestion deterministically in tests.
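For deterministic tests, the query behavior described above can be approximated in memory. The sketch below is illustrative only: `find_trace_roots` and the `started_at` field are hypothetical names, the identifiers are read from `inputs.metadata` as in the example above, and the time window is treated as half-open (`start` inclusive, `end` exclusive).

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List, Optional


def find_trace_roots(
    traces: Iterable[Dict[str, Any]],
    *,
    invocation_id: Optional[str] = None,
    row_id: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> List[Dict[str, Any]]:
    """Filter trace roots by EP identifiers and an optional time window."""
    matches: List[Dict[str, Any]] = []
    for trace in traces:
        metadata = (trace.get("inputs") or {}).get("metadata") or {}
        if invocation_id is not None and metadata.get("invocation_id") != invocation_id:
            continue
        if row_id is not None and metadata.get("row_id") != row_id:
            continue
        started_at = trace["started_at"]  # hypothetical timezone-aware timestamp
        if start is not None and started_at < start:
            continue
        if end is not None and started_at >= end:
            continue
        matches.append(trace)
    return matches


# Hypothetical fixtures for a test.
traces = [
    {"inputs": {"metadata": {"invocation_id": "ivk_demo", "row_id": "row_a"}},
     "started_at": datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"inputs": {"metadata": {"invocation_id": "ivk_demo", "row_id": "row_b"}},
     "started_at": datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)},
    {"inputs": {"metadata": {"invocation_id": "ivk_other", "row_id": "row_c"}},
     "started_at": datetime(2025, 1, 1, 13, 0, tzinfo=timezone.utc)},
]
```

Filtering on identifiers first and the window second keeps the test assertions independent of clock time when no window is passed.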
194
+
49
195
### 1. Trace ingestion fidelity
50
196
51
197
* **Conversation reconstruction** – Convert provider-specific trace payloads into the Eval Protocol message schema while keeping system, user, assistant, and tool messages intact. Langfuse, LangSmith, and Braintrust adapters follow this pattern by transforming trace inputs and outputs into `EvaluationRow` instances with preserved session metadata.【F:eval_protocol/adapters/langfuse.py†L60-L161】【F:eval_protocol/adapters/langsmith.py†L28-L188】【F:eval_protocol/adapters/braintrust.py†L48-L127】【F:eval_protocol/adapters/utils.py†L16-L98】
* **Outbound feedback** – Maintain score upload helpers so evaluations can annotate project logs directly in Braintrust.【F:eval_protocol/adapters/braintrust.py†L224-L299】
110
256
* **Remote validation** – Update the Chinook smoke tests (or similar) when Braintrust introduces new trace shapes, ensuring multi-turn and tool-heavy workflows remain covered.【F:tests/chinook/braintrust/test_braintrust_chinook.py†L37-L86】
111
257
112
-
### Template for New Providers
113
-
114
-
1. Mirror the structure of existing adapters—constructor with dependency checks, `get_evaluation_rows`, optional `upload_scores`, and a factory helper.
115
-
2. Implement feature-complete ingestion for the provider's highest-value trace shapes before expanding to niche cases.
116
-
3. Add unit tests that cover ingestion, metadata, and tool usage. Use the Braintrust and LangSmith tests as templates for asserting conversation fidelity.【F:tests/adapters/test_braintrust_adapter.py†L50-L333】【F:tests/adapters/test_langsmith_adapter.py†L25-L181】
117
-
4. Provide at least one opt-in smoke test (skipped when credentials are missing) to catch regressions against the live API.【F:tests/test_adapters_e2e.py†L17-L193】
118
-
5. Document setup steps and limitations in `docs/` so contributors understand how to run validations locally.
119
-
120
258
### Weave
121
259
122
260
* **Ingestion** – Prefer `output.messages` for complete conversations (including tool role messages) and include `output.choices` for Weave chat UI rendering. Store Weave IDs (trace, project) in `input_metadata.session_data`.
123
261
* **Tooling support** – Ensure assistant `tool_calls` and tool role messages are preserved by processing `output.messages`. Include parallel tool calls in future fixtures.
124
262
* **EP identifiers (to become EP-ready)** – Add `metadata.invocation_id`, `metadata.experiment_id`, `metadata.rollout_id`, `metadata.run_id`, and `metadata.row_id` into traces at logging time so the adapter can map them to `EvaluationRow.input_metadata.row_id` and `execution_metadata` fields.
125
263
* **Outbound scores** – No public feedback API at this time; document as not available.
126
-
* **Testing** – Provide unit tests that mock Service API responses, including `output.messages` with tool calls and tool messages; add a credential-gated E2E when feasible.
264
+
* **Testing** – Provide unit tests that mock Service API responses, including `output.messages` with tool calls and tool messages; add a credential-gated E2E when feasible. For chat completions captured through the default LiteLLM path, validate Weave’s LiteLLM integration (`weave.init(...)` + `litellm.completion(...)`) and confirm that EP metadata is present in the logged payloads for multi-turn workflows.
127
265
128
266
## Compatibility and Validation Matrix
129
267
@@ -134,7 +272,7 @@ The table below tracks the current validation status for each tracing provider.
134
272
|**Langfuse**| ✅ Span-aware extraction populates `EvaluationRow` metadata for downstream joins.【F:eval_protocol/adapters/langfuse.py†L60-L161】 | ✅ `upload_scores` / `upload_score` sync evaluation results to Langfuse.【F:eval_protocol/adapters/langfuse.py†L569-L625】 | ✅ Utilizes shared utilities to keep tool schemas and tool messages intact.【F:eval_protocol/adapters/langfuse.py†L60-L93】【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Supported by message parsing but add dedicated regression tests for multi-call traces.【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Handles standard text payloads; add fixtures for multi-modal spans as they become available.【F:eval_protocol/adapters/langfuse.py†L115-L159】 | ✅ Credential-gated E2E tests fetch live traces and validate message integrity.【F:tests/test_adapters_e2e.py†L17-L193】 |
135
273
|**LangSmith**| ✅ Converts diverse payload shapes and stores run/trace IDs in session metadata.【F:eval_protocol/adapters/langsmith.py†L130-L205】【F:eval_protocol/adapters/langsmith.py†L172-L189】 | ❌ Implement score upload once LangSmith exposes a feedback API comparable to Langfuse.【F:eval_protocol/adapters/langfuse.py†L569-L625】 | ✅ Normalizes assistant tool calls and preserves tool role messages in reconstructed conversations.【F:eval_protocol/adapters/langsmith.py†L315-L352】【F:tests/adapters/test_langsmith_adapter.py†L83-L129】 | ✅ Unit tests cover multiple tool calls in a single assistant turn.【F:tests/adapters/test_langsmith_adapter.py†L155-L181】 | 🟡 `_extract_messages_from_payload` flattens list-based content; add coverage for richer multi-modal payloads.【F:eval_protocol/adapters/langsmith.py†L289-L406】 | ✅ Comprehensive unit tests mock `list_runs` responses across scenarios.【F:tests/adapters/test_langsmith_adapter.py†L25-L181】 |
136
274
| **Braintrust** | ✅ BTQL ingestion captures conversation messages and embeds trace IDs for later score pushback.【F:eval_protocol/adapters/braintrust.py†L129-L221】【F:eval_protocol/adapters/braintrust.py†L75-L83】 | ✅ Score upload helpers annotate project logs through Braintrust's feedback API.【F:eval_protocol/adapters/braintrust.py†L224-L299】 | ✅ Tests ensure assistant tool calls, tool responses, and metadata-provided tool schemas are preserved.【F:tests/adapters/test_braintrust_adapter.py†L65-L253】 | 🟡 Shared utilities support multiple tool calls; add explicit Braintrust fixtures when providers emit them.【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Current conversion focuses on text; extend tests if Braintrust exposes structured multi-modal payloads.【F:eval_protocol/adapters/braintrust.py†L102-L127】 | ✅ Unit tests mock BTQL responses and error paths; Chinook scenario validates real-world usage.【F:tests/adapters/test_braintrust_adapter.py†L50-L333】【F:tests/chinook/braintrust/test_braintrust_chinook.py†L37-L86】 |
137
-
|**Weave**| 🟡 Converts `output.messages` and `output.choices`; stores Weave trace/project IDs in session data. Add EP identifiers to become EP-ready. | ❌ No public score pushback API documented. | ✅ Preserves assistant `tool_calls` and tool role messages when `output.messages` provided. | 🟡 Parsing supports multiple calls; add explicit fixtures. | 🟡 Standard text payloads supported; add multi-modal when available. | ✅ Unit tests mock Service API responses; local E2E scripts validate UI/adapter behavior. |
275
+
|**Weave**| 🟡 Converts `output.messages` and `output.choices`; stores Weave trace/project IDs in session data. Add EP identifiers to become EP-ready. | ❌ No public score pushback API documented. | ✅ Preserves assistant `tool_calls` and tool role messages when `output.messages` is provided. | 🟡 Parsing supports multiple calls; add explicit fixtures. | 🟡 Standard text payloads supported; add multi-modal when available. | ✅ Unit tests mock Service API responses; local E2E scripts validate UI/adapter behavior; validate the LiteLLM integration for the default completion path. |
138
276
139
277
Legend: ✅ — validated; 🟡 — supported in code but needs additional coverage; ❌ — not yet implemented.