feat: add EvaluationClient with run() for on-demand session evaluation #300
Open
Manual Integration Test Script

Save the script below to a file and run it. This test invokes 20 turns to trigger batching (>10 trace IDs), waits 180s for CloudWatch ingestion, then runs EvaluationClient.run().

"""Temporary real test for EvaluationClient.run() batching (delete after testing)."""
import json
import logging
import time
import uuid
import boto3
from bedrock_agentcore.evaluation import EvaluationClient
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.WARNING)
logging.getLogger("boto3").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
AGENT_ARN = "arn:aws:bedrock-agentcore:us-west-2:363376058968:runtime/HealthcareAgent_HealthCareAgent-Pv2decFQqQ"
AGENT_ID = "HealthcareAgent_HealthCareAgent-Pv2decFQqQ"
REGION = "us-west-2"


def invoke_agent(session_id: str, prompt: str) -> str:
    """Invoke the agent runtime once and return the concatenated response text."""
    dp_client = boto3.client("bedrock-agentcore", region_name=REGION)
    payload = json.dumps({"prompt": prompt}).encode()
    response = dp_client.invoke_agent_runtime(
        agentRuntimeArn=AGENT_ARN, runtimeSessionId=session_id, payload=payload,
    )
    raw_output = response["response"].read().decode("utf-8")
    # Keep only the payload of lines prefixed with "data: ".
    text_parts = []
    for line in raw_output.splitlines():
        if line.startswith("data: "):
            chunk = line[len("data: "):]
            # JSON-quoted chunks are decoded to plain strings.
            if chunk.startswith('"') and chunk.endswith('"'):
                chunk = json.loads(chunk)
            text_parts.append(chunk)
    # Fall back to the raw body when no "data: " lines were found.
    return "".join(text_parts) if text_parts else raw_output


TURNS = [
    "What are the symptoms of the flu?",
    "How is the flu treated?",
    "When should I see a doctor for the flu?",
    "What causes high blood pressure?",
    "What are the symptoms of diabetes?",
    "How is type 2 diabetes diagnosed?",
    "What are common treatments for asthma?",
    "What causes migraines?",
    "How can I prevent heart disease?",
    "What are the side effects of ibuprofen?",
    "What is the difference between a cold and the flu?",
    "How does pneumonia spread?",
    "What vaccines do adults need?",
    "What are the early signs of arthritis?",
    "How is strep throat diagnosed?",
    "What causes kidney stones?",
    "How can I lower my cholesterol naturally?",
    "What are the symptoms of anemia?",
    "How is a urinary tract infection treated?",
    "What are the warning signs of a stroke?",
]


def main():
    session_id = f"test-batch-{uuid.uuid4()}"
    print(f"Session ID: {session_id}")
    print(f"Turns: {len(TURNS)}")
    # Send every turn in the same session so it accumulates more than 10 trace IDs.
    for i, prompt in enumerate(TURNS, start=1):
        print(f"\n Turn {i}/{len(TURNS)}: {prompt}")
        response = invoke_agent(session_id, prompt)
        print(f" Response: {response[:150]}...")
    # Give CloudWatch time to ingest the spans before querying them.
    print("\n--- Waiting 180s for spans to land in CloudWatch ---")
    time.sleep(180)
    print(f"\n{'='*60}")
    print("Running EvaluationClient.run()")
    print(f"{'='*60}")
    client = EvaluationClient(region_name=REGION)
    results = client.run(
        evaluator_ids=["Builtin.Helpfulness"],
        session_id=session_id,
        agent_id=AGENT_ID,
    )
    print(f"\n--- Results ({len(results)} total) ---")
    for r in results:
        print(json.dumps(r, indent=4, default=str))


if __name__ == "__main__":
    main()

Expected output
EvaluationClient collects spans from CloudWatch and calls the evaluate API with level-aware batching (SESSION/TRACE/TOOL_CALL). Accepts evaluator_ids, session_id, and agent_id or log_group_name. Auto-derives log group from agent_id, caches evaluator level lookups, and batches evaluate requests at max 10 target IDs per request.
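The batching and caching described above could look roughly like the sketch below. It is not the PR's implementation: `chunked`, `LevelCache`, and the `lookup` callable are hypothetical names introduced for illustration, and only the limit of 10 target IDs per request and the level names come from the description.

```python
from typing import Callable, Dict, Iterator, List, Sequence

MAX_TARGETS_PER_REQUEST = 10  # limit stated in the PR description


def chunked(ids: Sequence[str], size: int = MAX_TARGETS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


class LevelCache:
    """Memoize evaluator-level lookups so each evaluator is resolved only once."""

    def __init__(self, lookup: Callable[[str], str]) -> None:
        # Hypothetical callable: returns "SESSION", "TRACE", or "TOOL_CALL".
        self._lookup = lookup
        self._levels: Dict[str, str] = {}

    def level_for(self, evaluator_id: str) -> str:
        if evaluator_id not in self._levels:
            self._levels[evaluator_id] = self._lookup(evaluator_id)
        return self._levels[evaluator_id]
```

With 20 trace IDs, as produced by the manual test script, `chunked` yields two batches of 10, so a TRACE-level evaluator would need two evaluate requests under these assumptions.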
Summary
- `EvaluationClient` with `run()` method that collects spans from CloudWatch and calls the evaluate API with level-aware batching (SESSION/TRACE/TOOL_CALL)
- `_agent_span_collector` package with `CloudWatchAgentSpanCollector` for span collection with retry/polling
- `query_string` and `end_time` parameters added to `CloudWatchSpanHelper` to support collector delegation

Details
- `run()` accepts `evaluator_ids`, `session_id`, and `agent_id` or `log_group_name`
- Log group auto-derived from `agent_id`: `/aws/bedrock-agentcore/runtimes/{agent_id}-DEFAULT`
- Span query filter: `attributes.session.id` + `ispresent(scope.name)`

Test plan
- `python -m pytest tests/bedrock_agentcore/evaluation/test_client.py -v` (35 tests)
- `python -m pytest tests/bedrock_agentcore/evaluation/ -v` (111 tests)
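
To make the log group pattern and the span filter from the details concrete, here is a minimal sketch. The helper names `default_log_group` and `session_span_query` are hypothetical, and the exact query string the collector sends is not shown in this PR description, so the query shape below is an assumption built from the two filter fragments listed above.

```python
def default_log_group(agent_id: str) -> str:
    """Build the runtime log group name from the pattern listed in the details."""
    return f"/aws/bedrock-agentcore/runtimes/{agent_id}-DEFAULT"


def session_span_query(session_id: str) -> str:
    """Compose a CloudWatch Logs Insights query for one session (assumed shape)."""
    # Refuse quotes rather than guess at an escaping rule for the query language.
    if "'" in session_id:
        raise ValueError("session_id must not contain single quotes")
    return (
        "fields @timestamp, @message"
        f" | filter attributes.session.id = '{session_id}'"
        " and ispresent(scope.name)"
    )
```

For the agent used in the manual test, `default_log_group("HealthcareAgent_HealthCareAgent-Pv2decFQqQ")` gives `/aws/bedrock-agentcore/runtimes/HealthcareAgent_HealthCareAgent-Pv2decFQqQ-DEFAULT`. Passing `log_group_name` to `run()` would bypass this derivation.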