OpenTelemetry semantic conventions and instrumentation for agent provenance, derivation lineage, and acceptance criteria evaluation.
OpenTelemetry's gen_ai.* conventions tell you what an agent did -- which model it called, how many tokens it used, which tools it invoked. They do not tell you why you should trust the output. As autonomous agents make economic decisions, delegate to sub-agents, and produce outputs that drive real-world actions, three questions become critical:
- Provenance -- Where did this output come from? Can the claim be verified?
- Derivation -- In a multi-agent system, which agents contributed to this output, and how?
- Acceptance -- Did this output meet the task specification, and how was that determined?
None of these are covered by the current OTel specification. This repo defines the conventions, provides framework-specific adapters, and includes an OTEP draft for upstream submission.
These gaps are most acute in Microsoft's AI stack. See docs/gap-analysis.md for the full analysis.
| Gap | Problem | Affected Platforms | Convention Fill |
|---|---|---|---|
| RAG retrieval influence | Azure AI Search returns N results, but there's no way to capture which results actually influenced the output vs. which were retrieved and ignored | Azure AI Search, Semantic Kernel | agent.output.source.influence, agent.output.grounding.* |
| Multi-agent derivation | When Agent A hands a message to Agent B, there's no metadata linking B's output to A's contribution -- the span tree shows structure, not causation | AutoGen, multi-agent orchestration | agent.derivation.input_spans, agent.derivation.strategy, agent.derivation.weight |
| Acceptance criteria evaluation | An agent produces output and the task is done -- no structured check of "did this meet the spec?" with strategy selection and confidence gating | AutoGen, Semantic Kernel, all frameworks | agent.task.acceptance_criteria.* |
Google's position is different: Vertex AI / ADK already emits structured grounding metadata (groundingChunks, groundingSupports, coverage scores). A ~180-line adapter translates this mechanically into OTel attributes. Google users get provenance "for free."
Microsoft's gaps are structural: They require new instrumentation at the Semantic Kernel plugin layer, the AutoGen runtime layer, and an entirely new acceptance evaluation component. Whoever provides that instrumentation defines how Microsoft's stack reports provenance.
The conventions follow a progressive trust model. Each tier builds on the previous:
| Tier | Name | What It Provides | Infrastructure Required |
|---|---|---|---|
| 1 | Self-Declared | Agent reports its own sources, model, and confidence | None -- any agent can emit these |
| 2 | Signed / Hash-Verified | Content hash, digital signature, attestation record on a verification backend | Signing keys + attestation service |
| 3 | Verifiable Callback | Live endpoint that confirms provenance claims in real-time | Verification backend (e.g. A2A-SE exchange) |
When an observability dashboard shows agent.provenance.attestation.uri = exchange.a2a-settlement.org/attestation/abc123, that's a live link to verification infrastructure. The convention defines the vocabulary; the verification backend provides the guarantee.
| Attribute | Type | Tier | Description |
|---|---|---|---|
agent.id |
string | 1 | Unique agent identifier |
agent.identity.tier |
int | 1 | KYA verification tier (1/2/3) |
agent.identity.registry |
string | 1 | Identity registry |
agent.output.provenance.tier |
int | 1 | Attestation tier of this output (1/2/3) |
agent.output.source.type |
enum | 1 | model_generation, retrieval, tool_call, agent_delegation, hybrid |
agent.output.source.uri |
string[] | 1 | Source URIs that influenced the output |
agent.output.source.influence |
enum | 1 | attended, cited, ignored -- how sources influenced output |
agent.output.confidence |
double | 1 | Calibrated confidence (0.0-1.0) |
agent.output.model.name |
string | 1 | Model that produced the output |
agent.output.model.version |
string | 1 | Model version/checkpoint |
agent.output.grounding.coverage |
double | 1 | Fraction of output grounded by sources |
agent.output.grounding.source_count |
int | 1 | Number of distinct grounding sources |
agent.output.grounding.domain_count |
int | 1 | Number of distinct source domains |
agent.output.hash.algorithm |
enum | 2 | sha256, sha3-256, sha384, sha512 |
agent.output.hash.value |
string | 2 | Hex-encoded content hash |
agent.output.signature.method |
enum | 2 | hmac, ed25519, x509 |
agent.output.signature.value |
string | 2 | Base64-encoded signature |
agent.output.signature.key_id |
string | 2 | Signing key reference |
agent.provenance.attestation.uri |
string | 2 | Attestation record URI |
agent.provenance.attestation.timestamp |
string | 2 | RFC 3339 timestamp |
agent.provenance.callback.uri |
string | 3 | Live verification endpoint |
agent.provenance.callback.method |
enum | 3 | GET or POST |
agent.provenance.callback.status |
enum | 3 | verified, expired, revoked |
agent.provenance.chain.parent_span_id |
string | 1+ | Parent agent's output span ID |
agent.provenance.chain.depth |
int | 1+ | Number of agent hops from the root task |
agent.provenance.chain.root_task_id |
string | 1+ | Original task/bounty ID |
These capture the multi-agent causal dependency graph -- which agent outputs derived from which other outputs. This is the layer absent from both Google and Microsoft today.
| Attribute | Type | Description |
|---|---|---|
agent.derivation.input_spans |
string[] | Span IDs of upstream outputs this agent consumed |
agent.derivation.input_agents |
string[] | Agent IDs that contributed input |
agent.derivation.strategy |
enum | synthesis, delegation, pipeline, consensus, review |
agent.derivation.weight |
double[] | Relative contribution weight per input (sums to 1.0) |
These capture the structured evaluation loop: did the output meet the task specification?
| Attribute | Type | Description |
|---|---|---|
agent.task.id |
string | Task or bounty identifier |
agent.task.acceptance_criteria |
string | Hash or URI of the acceptance specification |
agent.task.acceptance_criteria.met |
boolean | Whether the output met criteria |
agent.task.acceptance_criteria.score |
double | Evaluation score (0.0-1.0) |
agent.task.acceptance_criteria.strategy |
enum | llm, hash, human, hybrid, automated |
agent.task.acceptance_criteria.evaluator |
string | Who/what performed the evaluation |
agent.task.acceptance_criteria.factors |
string[] | Evaluation factors considered |
pip install -e .
# With framework-specific extras:
pip install -e ".[google]" # Google ADK / Vertex AI grounding adapter
pip install -e ".[azure]" # Azure AI Search adapter
pip install -e ".[autogen]" # AutoGen derivation tracking
pip install -e ".[settlement]" # A2A-SE exchange/mediator/shim instrumentation
pip install -e ".[dev]" # pytest, ruff, mypy
pip install -e ".[all]" # Everythingfrom otel_agent_provenance.spans import ProvenanceSpan
with ProvenanceSpan(
agent_id="my-agent",
tier=1,
source_type="retrieval",
source_uris=["https://example.com/data"],
model_name="gpt-4o",
confidence=0.85,
) as span:
result = do_work()
span.set_grounding(coverage=0.82, source_count=3, domain_count=2)with ProvenanceSpan(
agent_id="my-agent",
tier=2,
source_type="retrieval",
hash_algorithm="sha256",
hash_value="9f86d081884c7d659a2feaa0c55ad015...",
signature_method="ed25519",
attestation_uri="https://exchange.a2a-settlement.org/attestation/abc123",
attestation_timestamp="2026-03-11T14:30:00Z",
) as span:
result = do_verified_work()from otel_agent_provenance.spans import DerivationSpan
with DerivationSpan(
input_agents=["researcher", "analyst"],
strategy="synthesis",
weights=[0.4, 0.6],
):
combined_output = synthesize(research, analysis)from otel_agent_provenance.spans import AcceptanceSpan
with AcceptanceSpan(
task_id="task-001",
strategy="hybrid",
evaluator="mediator-001",
factors=["completeness", "provenance", "grounding"],
) as span:
result = evaluate(output, criteria)
span.set_result(met=True, score=0.87)from otel_agent_provenance.spans import ProvenanceSpan, DerivationSpan
# On any existing OTel span:
ProvenanceSpan.enrich(span, agent_id="my-agent", tier=1, source_type="retrieval")
DerivationSpan.enrich(span, input_agents=["agent-a"], strategy="delegation")The three-stage RAG pipeline tracer instruments what was previously invisible: the retrieval-to-generation influence chain.
from otel_agent_provenance.adapters.azure_rag import RagStageTracer
tracer = RagStageTracer(agent_id="my-rag-agent")
# Stage 1: Retrieval -- which documents came back from search
with tracer.retrieval(query="How does X work?") as stage:
results = search_client.search(query)
stage.record_results(results) # records URIs, scores, domain count
# Stage 2: Selection -- which results entered the model's context window
with tracer.selection() as stage:
stage.record_context_window(attended=top_3, ignored=rest)
# Stage 3: Generation -- what the model produced and which sources it cited
with tracer.generation(model_name="gpt-4o") as stage:
output = llm.generate(prompt)
stage.record_output(output, cited_uris=["..."], coverage=0.75)Tracks causal contribution across AutoGen agent conversations, not just structural parent-child spans.
from otel_agent_provenance.adapters.autogen_derivation import AutoGenDerivationTracker
tracker = AutoGenDerivationTracker(root_task_id="task-001")
# Agent A produces research output
tracker.record_output("agent-a", span_id_a)
# Agent B consumes A's output and produces its own
with tracer.start_as_current_span("invoke_agent agent-b") as span:
tracker.record_derivation(
agent_id="agent-b",
span=span,
input_agent_ids=["agent-a"],
strategy="synthesis",
weights=[1.0],
)The resulting trace shows: Agent B's output derived from Agent A's research (chain depth 1, root task "task-001").
from otel_agent_provenance.adapters.google_grounding import from_adk_provenance, enrich_span_from_grounding
# From adk-a2a-settlement's build_grounded_provenance() output:
attrs = from_adk_provenance(provenance_dict, agent_id="my-agent")
enrich_span_from_grounding(span, attrs)from otel_agent_provenance.adapters.semantic_kernel_rag import ProvenanceFilter
# Auto-instruments all RAG plugin invocations
kernel.add_filter(ProvenanceFilter(agent_id="my-sk-agent"))For projects using the A2A Settlement Exchange, dedicated instrumentors map the exchange's existing provenance schema, mediator evaluation, and security shim operations to OTel spans:
from otel_agent_provenance.instruments.settlement_exchange import ExchangeInstrumentor
from otel_agent_provenance.instruments.settlement_mediator import MediatorInstrumentor
from otel_agent_provenance.instruments.settlement_shim import ShimInstrumentor
# Auto-instrument FastAPI apps
ExchangeInstrumentor(agent_id="exchange-node-1").instrument(exchange_app)
ShimInstrumentor(agent_id="shim-node-1").instrument(shim_app)
# Or use the mediator span builder
with MediatorInstrumentor.mediation_span(escrow_id="esc-123") as span:
verdict = mediator.mediate(escrow_id)
MediatorInstrumentor.record_verdict(span, verdict)Four runnable examples demonstrate every convention group with in-memory OTel exporters (no backend needed):
python examples/autogen_provenance.py # 3-agent derivation chain with depth tracking
python examples/azure_rag_provenance.py # 5 retrieved -> 3 attended -> 2 cited
python examples/adk_grounding.py # Google grounding -> OTel in ~10 lines
python examples/settlement_e2e.py # Full escrow lifecycle (7 spans, all 3 convention groups)otel-agent-provenance/
model/ # YAML semantic convention definitions (OTel format)
agent-provenance.yaml # Tier 1/2/3 provenance attributes
agent-derivation.yaml # Multi-agent derivation lineage
agent-acceptance.yaml # Task acceptance criteria evaluation
docs/
agent-provenance.md # Convention specification
agent-derivation.md # Derivation lineage spec
agent-acceptance.md # Acceptance criteria spec
gap-analysis.md # Microsoft / Google / OTel gap analysis
otep/
OTEP-agent-provenance.md # OpenTelemetry Enhancement Proposal draft
src/otel_agent_provenance/
conventions.py # Attribute key constants (52 keys + enum classes)
spans.py # Span builders: ProvenanceSpan, DerivationSpan, AcceptanceSpan
adapters/
google_grounding.py # Google Vertex AI / ADK grounding -> OTel
azure_rag.py # Azure AI Search 3-stage RAG pipeline -> OTel
autogen_derivation.py # AutoGen multi-agent derivation chain -> OTel
semantic_kernel_rag.py # Semantic Kernel plugin filter -> OTel
instruments/
settlement_exchange.py # A2A-SE exchange escrow lifecycle instrumentation
settlement_mediator.py # AI Mediator 7-factor evaluation -> acceptance spans
settlement_shim.py # Economic Air Gap proxy instrumentation
tests/ # 31 tests
examples/ # 4 runnable examples
pip install -e ".[dev]"
pytest -vThese conventions sit above the existing gen_ai.* namespace and are designed to interoperate:
| This Repo | Existing OTel | Relationship |
|---|---|---|
agent.id |
gen_ai.agent.id |
May be the same value; provenance agent.id is required |
agent.output.model.name |
gen_ai.request.model |
gen_ai.* is per-call; agent.output.* is per-output |
agent.task.acceptance_criteria.score |
gen_ai.evaluation.score.value |
gen_ai.* is a single number; acceptance adds strategy, factors, evaluator |
agent.output.source.uri |
gen_ai.data_source.id |
gen_ai.* identifies the data source; agent.* identifies specific sources within it |
agent.provenance.chain.parent_span_id |
gen_ai.agent.parent_id |
gen_ai.* is structural topology; agent.* is causal derivation |
We use agent.* rather than gen_ai.provenance.* because these conventions apply to any autonomous software agent, not only generative AI systems.
| Repository | Role |
|---|---|
| a2a-settlement | Settlement Exchange — escrow, provenance verification, KYA identity |
| a2a-settlement-auth | OAuth + spending limits + SecretVault |
| a2a-settlement-mediator | AI Mediator — 7-factor dispute resolution (reference for acceptance criteria) |
| a2a-settlement-dashboard | Human oversight dashboard |
| a2a-settlement-mcp | MCP server for settlement operations |
| settlebridge-ai | SettleBridge Gateway — trust/policy enforcement |
| mcp-trust-gateway | MCP trust layer above OAuth |
| a2a-federation-rfc | Federation protocol specification |
| adk-a2a-settlement | Google ADK integration (source for grounding adapter) |
| langgraph-a2a-settlement | LangGraph integration |
| crewai-a2a-settlement | CrewAI integration |
| litellm-a2a-settlement | LiteLLM integration |
Apache 2.0