
Commit 0441966

more fixes
1 parent 3e64efd commit 0441966

File tree

5 files changed: +488 −11 lines

docs/developer_guide/tracing_integration_guide.mdx

Lines changed: 148 additions & 10 deletions
@@ -46,6 +46,152 @@ There are two levels of support we accept for a provider integration. Providers
Provider authors should update their trace logging to include these identifiers in a stable metadata location (e.g., top-level `metadata`, or a well-documented nested field), and document how adapters can reliably read them.
### Traces vs Observations (single completions)

Providers often expose two granularities:

- Traces: top-level units representing a multi-step flow or multi-turn conversation. These may contain nested spans/observations.
- Observations (or generations): individual LLM completions or smaller units within a trace.

Adapter expectations:

- Prefer using the provider's trace-level API to reconstruct a complete conversation. For providers that emit both `output.messages` and `output.choices`, adapters should:
  - Use `output.messages` to preserve the full multi-turn flow and tool role messages.
  - Optionally read `output.choices` to capture the final assistant bubble used by provider UIs.
- If only observation-level data is available, adapters should implement a best-effort stitch that recovers the longest/last conversation for a given parent trace. Document the fallbacks used.
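One possible best-effort stitch, sketched with plain dicts; the `messages` and `started_at` observation fields are illustrative, not a provider API:

```python
from typing import Any, Dict, List


def stitch_longest_last(observations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Best-effort recovery: pick the observation carrying the longest
    message list; break ties by recency (latest start time wins)."""
    if not observations:
        return []
    best = max(
        observations,
        key=lambda o: (len(o.get("messages") or []), o.get("started_at") or ""),
    )
    return best.get("messages") or []
```

Whatever heuristic you choose, document it so integrators know which thread the adapter returns.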
### Provider logging contract (inputs, outputs, metadata, tools)

This section defines a minimal, provider-agnostic contract for what your tracer should log to unlock Basic IO and then EP-ready capabilities.

#### Inputs

- Basic input (minimal):

```json
{
  "inputs": {
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "Say hi in one word."}
    ]
  }
}
```
- Input with EP metadata (EP-ready):

```json
{
  "inputs": {
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "Say hi in one word."}
    ],
    "metadata": {
      "invocation_id": "ivk_demo",
      "experiment_id": "exp_demo",
      "rollout_id": "rll_demo",
      "run_id": "",
      "row_id": "row_demo"
    },
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "calculator_add",
          "description": "Add two integers",
          "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"]
          }
        }
      }
    ]
  }
}
```
Notes for providers:

- Keep metadata values as non-null strings where your API forbids nulls in metadata fields.
- If your SDK requires a flag to persist metadata (e.g., a "store" toggle), document it and ensure your examples enable it.
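A small helper of the kind a provider example might use to satisfy the non-null-string rule; `sanitize_metadata` is an illustrative name, not part of any SDK:

```python
from typing import Any, Dict


def sanitize_metadata(meta: Dict[str, Any]) -> Dict[str, str]:
    """Coerce every metadata value to a string, mapping None to ""
    so backends that forbid null metadata values accept the payload."""
    return {key: ("" if value is None else str(value)) for key, value in meta.items()}
```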
#### Outputs

- Basic output (UI-friendly final bubble):

```json
{
  "output": {
    "choices": [
      {"message": {"role": "assistant", "content": "Hi"}}
    ]
  }
}
```
- Full output (agent-friendly full thread):

```json
{
  "output": {
    "messages": [
      {"role": "assistant", "content": null, "tool_calls": [
        {"id": "call_1", "type": "function", "function": {"name": "calculator_add", "arguments": "{\"a\":2,\"b\":3}"}}
      ]},
      {"role": "tool", "tool_call_id": "call_1", "name": "calculator_add", "content": "{\"a\":2,\"b\":3,\"sum\":5}"},
      {"role": "assistant", "content": "The result is 5."}
    ]
  }
}
```
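To illustrate how an adapter consumes this shape, here is a plain-dict sketch of extracting the final assistant text; `last_assistant_content` is a hypothetical helper, not an EP API:

```python
from typing import Any, Dict, Optional


def last_assistant_content(output: Dict[str, Any]) -> Optional[str]:
    """Walk output.messages backwards for the final assistant text,
    falling back to output.choices for providers that only log the
    final bubble."""
    for msg in reversed(output.get("messages") or []):
        if msg.get("role") == "assistant" and msg.get("content"):
            return msg["content"]
    choices = output.get("choices") or []
    if choices:
        return (choices[0].get("message") or {}).get("content")
    return None
```

Note the walk skips assistant messages whose `content` is null (pure tool-call turns), which is why preferring `output.messages` still yields the right bubble.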
- Output with EP metadata (optional redundancy):

```json
{
  "output": {
    "messages": [/* ... as above ... */],
    "metadata": {
      "invocation_id": "ivk_demo",
      "experiment_id": "exp_demo",
      "rollout_id": "rll_demo",
      "run_id": "",
      "row_id": "row_demo"
    }
  }
}
```
Recommendations:

- Log `output.messages` to preserve multi-turn and tool role messages.
- Log `output.choices` to render a final assistant bubble in UIs that expect it.
#### Tool calling

To fully support tool-augmented agents:

- Assistant tool calls must be logged on the assistant message as an array of OpenAI-style objects under `tool_calls`, each with `id`, `type` (`function`), and `function` (name plus an arguments string).
- Tool results must be logged as a separate message with `role: tool`, a `tool_call_id` matching the assistant's `tool_calls` id, and the tool's output under `content` (stringified JSON is acceptable).
- If you support multiple tool calls in a single assistant turn, ensure all calls and their matching tool messages are included and ordered.
- If you expose a top-level tool schema, include it in `inputs.tools`.
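A sketch of the kind of pairing check an adapter test might run to enforce this contract, assuming OpenAI-style message dicts (`tool_calls_paired` is an illustrative name):

```python
from typing import Any, Dict, List


def tool_calls_paired(messages: List[Dict[str, Any]]) -> bool:
    """Verify every assistant tool_call id has a matching role=tool message,
    and every tool message answers an issued tool_call."""
    issued = [
        call["id"]
        for msg in messages
        if msg.get("role") == "assistant"
        for call in (msg.get("tool_calls") or [])
    ]
    answered = [
        msg.get("tool_call_id")
        for msg in messages
        if msg.get("role") == "tool"
    ]
    return sorted(issued) == sorted(answered)
```

A check like this catches the common provider bug where parallel tool calls are logged but only the first tool result survives.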
#### Adapter mapping (EP-ready)

Adapters should map the provider's metadata into EP models:

- `inputs.metadata.row_id` → `EvaluationRow.input_metadata.row_id`
- `inputs.metadata.invocation_id` → `EvaluationRow.execution_metadata.invocation_id`
- `inputs.metadata.experiment_id` → `EvaluationRow.execution_metadata.experiment_id`
- `inputs.metadata.rollout_id` → `EvaluationRow.execution_metadata.rollout_id`
- `inputs.metadata.run_id` → `EvaluationRow.execution_metadata.run_id`
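As a sketch, the mapping can be expressed with plain dicts standing in for the EP models (the real adapter constructs `EvaluationRow`/`ExecutionMetadata` objects, as the Weave diff below in this commit does):

```python
from typing import Any, Dict


def map_ep_metadata(inputs_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Translate provider-logged EP identifiers into EvaluationRow-shaped fields."""
    return {
        "input_metadata": {"row_id": inputs_metadata.get("row_id")},
        "execution_metadata": {
            key: inputs_metadata.get(key)
            for key in ("invocation_id", "experiment_id", "rollout_id", "run_id")
        },
    }
```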
Additionally, adapters should continue to populate provider-native IDs in `input_metadata.session_data` (for joinability back to the provider's UI), and they should prefer `output.messages` over `output.choices` to reconstruct the longest/last conversation thread.
#### Query & retrieval expectations

Providers should:

- Expose a way to query trace roots by EP identifiers (e.g., `inputs.metadata.invocation_id`, `row_id`) and time windows.
- Return the full conversation (preferably via `output.messages`) or provide a clear way to join observations to reconstruct the longest/last thread.
- Support filtering by IDs or tags so integrators can validate ingestion deterministically in tests.
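Deterministic validation can then be as simple as filtering already-fetched traces by EP identifiers; the field paths follow the contract above, and the function name is illustrative:

```python
from typing import Any, Dict, List, Optional


def filter_traces_by_ep_ids(
    traces: List[Dict[str, Any]],
    invocation_id: Optional[str] = None,
    row_id: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Keep traces whose inputs.metadata matches the requested EP identifiers."""
    matched = []
    for trace in traces:
        meta = (trace.get("inputs") or {}).get("metadata") or {}
        if invocation_id is not None and meta.get("invocation_id") != invocation_id:
            continue
        if row_id is not None and meta.get("row_id") != row_id:
            continue
        matched.append(trace)
    return matched
```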
### 1. Trace ingestion fidelity

* **Conversation reconstruction** – Convert provider-specific trace payloads into the Eval Protocol message schema while keeping system, user, assistant, and tool messages intact. Langfuse, LangSmith, and Braintrust adapters follow this pattern by transforming trace inputs and outputs into `EvaluationRow` instances with preserved session metadata.【F:eval_protocol/adapters/langfuse.py†L60-L161】【F:eval_protocol/adapters/langsmith.py†L28-L188】【F:eval_protocol/adapters/braintrust.py†L48-L127】【F:eval_protocol/adapters/utils.py†L16-L98】
@@ -109,21 +255,13 @@ Document required environment variables, authentication expectations, and typica
* **Outbound feedback** – Maintain score upload helpers so evaluations can annotate project logs directly in Braintrust.【F:eval_protocol/adapters/braintrust.py†L224-L299】
* **Remote validation** – Update the Chinook smoke tests (or similar) when Braintrust introduces new trace shapes, ensuring multi-turn and tool-heavy workflows remain covered.【F:tests/chinook/braintrust/test_braintrust_chinook.py†L37-L86】

### Template for New Providers

1. Mirror the structure of existing adapters—constructor with dependency checks, `get_evaluation_rows`, optional `upload_scores`, and a factory helper.
2. Implement feature-complete ingestion for the provider's highest-value trace shapes before expanding to niche cases.
3. Add unit tests that cover ingestion, metadata, and tool usage. Use the Braintrust and LangSmith tests as templates for asserting conversation fidelity.【F:tests/adapters/test_braintrust_adapter.py†L50-L333】【F:tests/adapters/test_langsmith_adapter.py†L25-L181】
4. Provide at least one opt-in smoke test (skipped when credentials are missing) to catch regressions against the live API.【F:tests/test_adapters_e2e.py†L17-L193】
5. Document setup steps and limitations in `docs/` so contributors understand how to run validations locally.
### Weave

* **Ingestion** – Prefer `output.messages` for complete conversations (including tool role messages) and include `output.choices` for Weave chat UI rendering. Store Weave IDs (trace, project) in `input_metadata.session_data`.
* **Tooling support** – Ensure assistant `tool_calls` and tool role messages are preserved by processing `output.messages`. Include parallel tool calls in future fixtures.
* **EP identifiers (to become EP-ready)** – Add `metadata.invocation_id`, `metadata.experiment_id`, `metadata.rollout_id`, `metadata.run_id`, and `metadata.row_id` into traces at logging time so the adapter can map them to `EvaluationRow.input_metadata.row_id` and `execution_metadata` fields.
* **Outbound scores** – No public feedback API at this time; document as not available.
* **Testing** – Provide unit tests that mock Service API responses, including `output.messages` with tool calls and tool messages; add a credential-gated E2E when feasible. For chat completion capture via the LiteLLM default path, validate Weave's LiteLLM integration (`weave.init(...)` + `litellm.completion(...)`) and ensure EP metadata is present in logged payloads for multi-turn workflows.

## Compatibility and Validation Matrix

@@ -134,7 +272,7 @@ The table below tracks the current validation status for each tracing provider.
| **Langfuse** | ✅ Span-aware extraction populates `EvaluationRow` metadata for downstream joins.【F:eval_protocol/adapters/langfuse.py†L60-L161】 |`upload_scores` / `upload_score` sync evaluation results to Langfuse.【F:eval_protocol/adapters/langfuse.py†L569-L625】 | ✅ Utilizes shared utilities to keep tool schemas and tool messages intact.【F:eval_protocol/adapters/langfuse.py†L60-L93】【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Supported by message parsing but add dedicated regression tests for multi-call traces.【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Handles standard text payloads; add fixtures for multi-modal spans as they become available.【F:eval_protocol/adapters/langfuse.py†L115-L159】 | ✅ Credential-gated E2E tests fetch live traces and validate message integrity.【F:tests/test_adapters_e2e.py†L17-L193】 |
| **LangSmith** | ✅ Converts diverse payload shapes and stores run/trace IDs in session metadata.【F:eval_protocol/adapters/langsmith.py†L130-L205】【F:eval_protocol/adapters/langsmith.py†L172-L189】 | ❌ Implement score upload once LangSmith exposes a feedback API comparable to Langfuse.【F:eval_protocol/adapters/langfuse.py†L569-L625】 | ✅ Normalizes assistant tool calls and preserves tool role messages in reconstructed conversations.【F:eval_protocol/adapters/langsmith.py†L315-L352】【F:tests/adapters/test_langsmith_adapter.py†L83-L129】 | ✅ Unit tests cover multiple tool calls in a single assistant turn.【F:tests/adapters/test_langsmith_adapter.py†L155-L181】 | 🟡 `_extract_messages_from_payload` flattens list-based content; add coverage for richer multi-modal payloads.【F:eval_protocol/adapters/langsmith.py†L289-L406】 | ✅ Comprehensive unit tests mock `list_runs` responses across scenarios.【F:tests/adapters/test_langsmith_adapter.py†L25-L181】 |
| **Braintrust** | ✅ BTQL ingestion captures conversation messages and embeds trace IDs for later score pushback.【F:eval_protocol/adapters/braintrust.py†L129-L221】【F:eval_protocol/adapters/braintrust.py†L75-L83】 | ✅ Score upload helpers annotate project logs through Braintrust's feedback API.【F:eval_protocol/adapters/braintrust.py†L224-L299】 | ✅ Tests ensure assistant tool calls, tool responses, and metadata-provided tool schemas are preserved.【F:tests/adapters/test_braintrust_adapter.py†L65-L253】 | 🟡 Shared utilities support multiple tool calls; add explicit Braintrust fixtures when providers emit them.【F:eval_protocol/adapters/utils.py†L16-L98】 | 🟡 Current conversion focuses on text; extend tests if Braintrust exposes structured multi-modal payloads.【F:eval_protocol/adapters/braintrust.py†L102-L127】 | ✅ Unit tests mock BTQL responses and error paths; Chinook scenario validates real-world usage.【F:tests/adapters/test_braintrust_adapter.py†L50-L333】【F:tests/chinook/braintrust/test_braintrust_chinook.py†L37-L86】 |
| **Weave** | 🟡 Converts `output.messages` and `output.choices`; stores Weave trace/project IDs in session data. Add EP identifiers to become EP-ready. | ❌ No public score pushback API documented. | ✅ Preserves assistant `tool_calls` and tool role messages when `output.messages` provided. | 🟡 Parsing supports multiple calls; add explicit fixtures. | 🟡 Standard text payloads supported; add multi-modal when available. | ✅ Unit tests mock Service API responses; local E2E scripts validate UI/adapter behavior; validate LiteLLM integration for default completion path. |

Legend: ✅ — validated; 🟡 — supported in code but needs additional coverage; ❌ — not yet implemented.

eval_protocol/adapters/weave.py

Lines changed: 22 additions & 1 deletion
@@ -16,7 +16,7 @@
import requests

from eval_protocol.models import EvaluationRow, InputMetadata, Message, ExecutionMetadata
from .base import BaseAdapter
from .utils import extract_messages_from_data

@@ -122,15 +122,36 @@ def convert_trace_to_evaluation_row(trace: Dict[str, Any], include_tool_calls: b
        project_id = str(trace.get("project_id", ""))
        weave_trace_id = str(trace.get("id", ""))

        # EP metadata mapping (if provider logged metadata under inputs or output)
        ep_meta: Dict[str, Any] = {}
        inputs_obj = trace.get("inputs") or {}
        if isinstance(inputs_obj, dict) and isinstance(inputs_obj.get("metadata"), dict):
            ep_meta = inputs_obj.get("metadata") or {}
        elif isinstance(trace.get("output"), dict) and isinstance(trace["output"].get("metadata"), dict):
            ep_meta = trace["output"].get("metadata") or {}

        row_id_val = ep_meta.get("row_id")
        invocation_id_val = ep_meta.get("invocation_id")
        experiment_id_val = ep_meta.get("experiment_id")
        rollout_id_val = ep_meta.get("rollout_id")
        run_id_val = ep_meta.get("run_id")

        return EvaluationRow(
            messages=messages,
            tools=tools,
            input_metadata=InputMetadata(
                row_id=str(row_id_val) if row_id_val else None,
                session_data={
                    "weave_trace_id": weave_trace_id,
                    "weave_project_id": project_id,
                },
            ),
            execution_metadata=ExecutionMetadata(
                invocation_id=str(invocation_id_val) if invocation_id_val else None,
                experiment_id=str(experiment_id_val) if experiment_id_val else None,
                rollout_id=str(rollout_id_val) if rollout_id_val else None,
                run_id=str(run_id_val) if run_id_val else None,
            ),
        )
    except (KeyError, TypeError, ValueError) as e:
        logger.error("Error converting Weave trace %s: %s", trace.get("id", "unknown"), e)
