feat: add Prometheus metrics, MLflow tracing, and MCP Gateway support #17
Merged
kellyaa merged 13 commits on May 21, 2026
- Add --use-mcp-gateway argument parsing to evaluate-benchmark.sh
- Update help text and usage examples to document the new flag
- Modify deploy-and-evaluate.sh to pass --use-mcp-gateway to evaluate-benchmark.sh when USE_MCP_GATEWAY is true
- Ensures consistent MCP gateway routing across deployment and evaluation phases

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add std() function to calculate standard deviation
- Display std alongside averages for key metrics:
  - LLM calls/session
  - Tool calls/session
  - LLM call latency
  - Time breakdown percentages
  - Token counts (input, output, total)
- Provides better insight into metric variability across sessions

Signed-off-by: Yoav Katz <katz@il.ibm.com>
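A minimal sketch of how a standard deviation could be reported next to an average. The commit does not say whether std() uses the population or the sample formula, so the population form here is an assumption, and `mean`, `format_metric`, and the sample data are hypothetical names and values used only for illustration.

```python
import math
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean; 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0


def std(values: Sequence[float]) -> float:
    """Population standard deviation (assumed); 0.0 for fewer than two values."""
    if len(values) < 2:
        return 0.0
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def format_metric(label: str, values: Sequence[float]) -> str:
    """Render a metric as 'label: average ± std'."""
    return f"{label}: {mean(values):.2f} ± {std(values):.2f}"


# Hypothetical per-session LLM call counts
llm_calls_per_session = [2, 4, 4, 4, 5, 5, 7, 9]
print(format_metric("LLM calls/session", llm_calls_per_session))
# prints "LLM calls/session: 5.00 ± 2.00"
```

Returning 0.0 for a single session keeps the report printable when only one session was analyzed, at the cost of hiding that no spread could be measured.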
- List all MCPServerRegistrations before deletion
- Delete all registrations in namespace (not just specific one)
- Add detailed logging for cleanup operations
- Handle case when no registrations exist gracefully

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Modified A2AProxyClient.send_prompt() to accept session_id parameter
- Pass session_id as request_metadata to client.send_message()
- Updated runner.py to pass session_id to A2A client
- Removed prompt instructions to pass session_id as string

This aligns with the reference implementation in test_a2a_agent.py where session_id is passed via A2A request metadata.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
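The change above can be sketched roughly as follows. This is an illustration only, not the harness code: `FakeA2AClient` is a hypothetical stand-in for the real A2A client, and the `"session_id"` metadata key and the exact `send_message()` signature are assumptions. Only the names `A2AProxyClient`, `send_prompt`, `send_message`, `request_metadata`, and `session_id` come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FakeA2AClient:
    """Hypothetical stand-in for the real A2A client; it records what it is sent."""

    sent: List[Dict[str, Any]] = field(default_factory=list)

    def send_message(
        self, text: str, request_metadata: Optional[Dict[str, Any]] = None
    ) -> str:
        self.sent.append({"text": text, "request_metadata": request_metadata or {}})
        return f"echo: {text}"


class A2AProxyClient:
    """Sketch of a proxy that forwards prompts with a session ID attached."""

    def __init__(self, client: FakeA2AClient) -> None:
        self._client = client

    def send_prompt(self, prompt: str, session_id: str) -> str:
        # The session ID travels in request metadata, so the prompt text
        # itself no longer needs instructions about passing it along.
        return self._client.send_message(
            prompt, request_metadata={"session_id": session_id}
        )


client = FakeA2AClient()
proxy = A2AProxyClient(client)
proxy.send_prompt("Book a flight to Boston", session_id="session-42")
print(client.sent[0]["request_metadata"])
```

Keeping the ID out of the prompt means the model cannot drop or rewrite it when relaying the request.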
- Updated toolPrefix in deploy-benchmark.sh to use exgentic_${BENCHMARK_NAME}_
- Updated EXGENTIC_MCP_TOOL_PREFIX in evaluate-benchmark.sh to include benchmark name
This allows multiple benchmarks to run concurrently with unique tool prefixes,
preventing tool name collisions in the MCP gateway.
Signed-off-by: Yoav Katz <katz@il.ibm.com>
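To illustrate why a per-benchmark prefix prevents collisions, here is a small sketch. The prefix pattern `exgentic_<benchmark>_` follows the commit message; the registry dictionary, the `register` helper, the tool name `search`, and the benchmark name `other_bench` are hypothetical and exist only for this example.

```python
from typing import Dict, List


def tool_prefix(benchmark_name: str) -> str:
    """Build the per-benchmark prefix described in the commit."""
    return f"exgentic_{benchmark_name}_"


def register(registry: Dict[str, str], benchmark_name: str, tools: List[str]) -> None:
    """Add prefixed tool names to a shared registry, rejecting duplicates."""
    for tool in tools:
        full_name = tool_prefix(benchmark_name) + tool
        if full_name in registry:
            raise ValueError(f"tool name collision: {full_name}")
        registry[full_name] = benchmark_name


# Two benchmarks that both expose a tool called "search" (hypothetical)
gateway_registry: Dict[str, str] = {}
register(gateway_registry, "tau2", ["search"])
register(gateway_registry, "other_bench", ["search"])
print(sorted(gateway_registry))
# prints "['exgentic_other_bench_search', 'exgentic_tau2_search']"
```

With a single shared prefix, the second `register` call would raise, which is the collision the commit is designed to avoid.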
- Added experiment_name field to ExgenticConfig (default: 'default')
- Updated OTELInstrumentation.session_span() to accept and set metadata.experiment_name attribute
- Updated Runner to pass experiment_name from config to session spans
- Added --experiment flag to evaluate-benchmark.sh and deploy-and-evaluate.sh
- Added --experiment and --compare flags to analyze-run.sh for filtering traces
- Updated analyze_traces.py to extract, display, and group by experiment_name
- Fixed analyze_traces.py to properly handle command-line arguments with argparse
- Added print_experiment_comparison() function for side-by-side experiment comparison
- Added average LLM and tool call latency metrics to all comparative reports
- Track LLM calls after first observation separately (llm_count_after_obs)
- Exclude LLM calls before first observation from average LLM calls/session metric

This enables:
1. Tagging runs with experiment names via EXPERIMENT_NAME env var or --experiment flag
2. Filtering Phoenix traces by experiment name in analyze-run.sh
3. Comparing two experiments side-by-side with --compare flag
4. Grouping analysis results by experiment for better organization
5. Side-by-side comparison report showing metrics across experiments
6. Average latency metrics for LLM and tool calls in all reports
7. More accurate LLM call counts excluding initial observation calls

Signed-off-by: Yoav Katz <katz@il.ibm.com>
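The grouping and the "after first observation" count could look roughly like the sketch below. The session layout (a dict with an `events` list of type labels) and the labels `"llm"`, `"tool"`, and `"observation"` are assumptions made for illustration; the actual span format read by analyze_traces.py is not shown in this PR text. The name `llm_count_after_obs` and the `'default'` experiment name are taken from the commit message.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def llm_count_after_obs(events: List[str]) -> int:
    """Count LLM calls that occur after the first observation event."""
    seen_observation = False
    count = 0
    for kind in events:
        if kind == "observation":
            seen_observation = True
        elif kind == "llm" and seen_observation:
            count += 1
    return count


def average_llm_calls_by_experiment(sessions: List[Dict[str, Any]]) -> Dict[str, float]:
    """Group sessions by experiment_name and average the post-observation LLM calls."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for session in sessions:
        # Sessions without a tag fall back to the 'default' experiment
        name = session.get("experiment_name", "default")
        grouped[name].append(llm_count_after_obs(session["events"]))
    return {name: mean(counts) for name, counts in grouped.items()}


# Hypothetical sessions from two tagged experiments and one untagged run
sessions = [
    {"experiment_name": "baseline", "events": ["llm", "observation", "llm", "tool", "llm"]},
    {"experiment_name": "baseline", "events": ["observation", "llm", "tool", "llm", "llm", "llm"]},
    {"experiment_name": "shortlist", "events": ["llm", "llm", "observation", "llm"]},
    {"events": ["observation"]},
]
for name, avg in average_llm_calls_by_experiment(sessions).items():
    print(name, avg)
```

In the first baseline session, the LLM call before the observation is ignored, so that session contributes 2 rather than 3 to the average.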
Reduce minimum duration from 60s to 1s to allow more granular metrics queries for shorter time windows.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Add EXGENTIC_SET_AGENT_ENABLE_TOOL_SHORTLISTING=true environment variable when deploying tool_calling agent to optimize tool selection.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Replace Phoenix GraphQL-based trace collection with MLflow REST API. The OTEL collector now forwards spans to MLflow instead of Phoenix. Add download_mlflow_traces.py for fetching and transforming traces.

Harden deploy-benchmark.sh and deploy-agent.sh against silent failures when the Kagenti API is unreachable: add --max-time to curl calls, prevent set -e from silently killing the script, and verify port-forward liveness by checking the API responds rather than just the port being open.

Signed-off-by: Yoav Katz <yoav.katz@ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
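The liveness idea, checking that the API actually answers rather than that a port accepts connections, can be sketched in Python as below. The scripts in this PR are shell and use curl with --max-time; this is only an analogous illustration, and the function name `api_is_live` and the example URL are hypothetical.

```python
import http.client
import urllib.request


def api_is_live(url: str, timeout: float = 5.0) -> bool:
    """Return True only if `url` answers with a 2xx status within `timeout` seconds.

    An open port that never returns an HTTP response counts as not live,
    which mirrors the intent of bounding each curl call with --max-time.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError are subclasses of OSError).
        return False


# Hypothetical port-forwarded endpoint; expected to fail fast when nothing is listening
if not api_is_live("http://127.0.0.1:9/", timeout=1.0):
    print("API not reachable; aborting instead of continuing silently")
```

Returning a boolean and letting the caller decide to abort keeps the failure explicit, which is the same goal as preventing set -e from ending the script without a message.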
…workload-harness into feature/prometheus-integration
…tivity

- Replace all Phoenix references with MLflow in READMEs
- Update analyze-run.sh default URL to http://mlflow.localtest.me:8080
- Make evaluate-benchmark.sh connectivity checks fail immediately instead of printing soft warnings
- Add Exgentic harness and Observability section to root README

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
kellyaa approved these changes on May 21, 2026
Summary
Test plan
- deploy-and-evaluate.sh --benchmark tau2 --agent tool_calling end-to-end
- analyze-run.sh -f
- --use-mcp-gateway flag to confirm gateway routing
- --mlflow is enabled

🤖 Generated with Claude Code