feat: add Prometheus metrics, MLflow tracing, and MCP Gateway support by yoavkatz · Pull Request #17 · kagenti/workload-harness

yoavkatz · 2026-05-14T09:22:39Z

Summary

- Prometheus infrastructure metrics: Collect pod-level CPU, memory, and network metrics during benchmark runs and include them in trace analysis
- MLflow tracing: Migrate from Phoenix/GraphQL to the MLflow REST API for trace collection, using an OTEL collector to forward spans with proper auth
- MCP Gateway support: Route MCP traffic through the Kagenti MCP Gateway with HTTPRoute and MCPServerRegistration resources
- Deploy script hardening: Prevent silent failures when the Kagenti API is unreachable by adding timeouts, connection checks, and clear error messages
- Runner improvements: Add experiment_name grouping, session_id as A2A metadata, benchmark-prefixed MCP tools, and tool shortlisting for the tool_calling agent
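
For readers unfamiliar with how pod-level metrics are typically pulled for a benchmark window, the following is a minimal sketch, not the code from this PR. It builds a request URL for the Prometheus HTTP API `query_range` endpoint. The helper names, the choice of the cAdvisor metric `container_cpu_usage_seconds_total`, the label selectors, and the default step and rate window are all assumptions made for illustration.

```python
from urllib.parse import urlencode


def pod_cpu_query(namespace: str, pod_regex: str, window: str = "1m") -> str:
    """PromQL for per-pod CPU usage rate (metric and labels are assumptions)."""
    selector = f'{{namespace="{namespace}", pod=~"{pod_regex}"}}'
    return f"sum by (pod) (rate(container_cpu_usage_seconds_total{selector}[{window}]))"


def build_range_query_url(base_url: str, promql: str, start: float, end: float,
                          step_seconds: int = 15) -> str:
    """Build a Prometheus /api/v1/query_range URL covering [start, end]."""
    params = urlencode({
        "query": promql,
        "start": start,               # Unix timestamp of the benchmark start
        "end": end,                   # Unix timestamp of the benchmark end
        "step": f"{step_seconds}s",   # resolution of the returned series
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical namespace, pod pattern, and Prometheus address.
print(build_range_query_url(
    "http://localhost:9090",
    pod_cpu_query("team1", "tool-calling-.*"),
    start=1_700_000_000,
    end=1_700_000_060,
))
```

The URL can then be fetched with any HTTP client; the response carries one time series per pod.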

Test plan

- Run deploy-and-evaluate.sh --benchmark tau2 --agent tool_calling end-to-end
- Verify MLflow traces are collected via analyze-run.sh -f
- Test with --use-mcp-gateway flag to confirm gateway routing
- Confirm deploy scripts fail gracefully when port-forward is down (clear error message, no silent exit)
- Verify Prometheus metrics appear in session results when --mlflow is enabled

🤖 Generated with Claude Code

- Add --use-mcp-gateway argument parsing to evaluate-benchmark.sh
- Update help text and usage examples to document the new flag
- Modify deploy-and-evaluate.sh to pass --use-mcp-gateway to evaluate-benchmark.sh when USE_MCP_GATEWAY is true
- Ensures consistent MCP gateway routing across deployment and evaluation phases

Signed-off-by: Yoav Katz <katz@il.ibm.com>

- Add std() function to calculate standard deviation
- Display std alongside averages for key metrics:
  - LLM calls/session
  - Tool calls/session
  - LLM call latency
  - Time breakdown percentages
  - Token counts (input, output, total)
- Provides better insight into metric variability across sessions

Signed-off-by: Yoav Katz <katz@il.ibm.com>
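
As an illustration of reporting std alongside an average, here is a minimal sketch. It is not the code from analyze_traces.py: the `mean` and `format_mean_std` helpers, the choice of the population form of the standard deviation, and the sample values are assumptions.

```python
import math
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean; 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0


def std(values: Sequence[float]) -> float:
    """Population standard deviation (assumption: the PR may use the sample form)."""
    if len(values) < 2:
        return 0.0
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def format_mean_std(label: str, values: Sequence[float]) -> str:
    """Render a metric as 'label: mean ± std' with two decimals."""
    return f"{label}: {mean(values):.2f} ± {std(values):.2f}"


# Hypothetical per-session LLM call counts.
llm_calls_per_session = [4, 6, 5, 9]
print(format_mean_std("LLM calls/session", llm_calls_per_session))
```

A large std relative to the mean signals that sessions differ widely, which a bare average would hide.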

- List all MCPServerRegistrations before deletion
- Delete all registrations in namespace (not just specific one)
- Add detailed logging for cleanup operations
- Handle case when no registrations exist gracefully

Signed-off-by: Yoav Katz <katz@il.ibm.com>

- Modified A2AProxyClient.send_prompt() to accept session_id parameter
- Pass session_id as request_metadata to client.send_message()
- Updated runner.py to pass session_id to A2A client
- Removed prompt instructions to pass session_id as string

This aligns with the reference implementation in test_a2a_agent.py where session_id is passed via A2A request metadata.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
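
The shape of this change can be sketched as follows. This is an illustration under stated assumptions, not the actual A2AProxyClient: `RecordingClient` is a hypothetical stand-in for the real A2A client, the call is shown synchronously, and the `{"session_id": ...}` layout of the metadata is assumed. Only the `send_prompt()`, `send_message()`, and `request_metadata` names come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RecordingClient:
    """Hypothetical stand-in for the real A2A client; it only records calls."""
    calls: list = field(default_factory=list)

    def send_message(self, message: str,
                     request_metadata: Optional[dict] = None) -> str:
        self.calls.append({"message": message,
                           "request_metadata": request_metadata})
        return "ok"


class A2AProxyClientSketch:
    """Illustrative wrapper; the real class lives in the harness."""

    def __init__(self, client: RecordingClient) -> None:
        self._client = client

    def send_prompt(self, prompt: str, session_id: Optional[str] = None) -> str:
        # The session id travels as request metadata, not inside the prompt text.
        metadata = {"session_id": session_id} if session_id else None
        return self._client.send_message(prompt, request_metadata=metadata)


client = RecordingClient()
A2AProxyClientSketch(client).send_prompt("Book a flight", session_id="session-1")
print(client.calls[0]["request_metadata"])
```

Keeping the id out of the prompt means the agent does not need instructions on how to echo it back.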

- Updated toolPrefix in deploy-benchmark.sh to use exgentic_${BENCHMARK_NAME}_
- Updated EXGENTIC_MCP_TOOL_PREFIX in evaluate-benchmark.sh to include benchmark name

This allows multiple benchmarks to run concurrently with unique tool prefixes, preventing tool name collisions in the MCP gateway.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
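
To see why a benchmark-specific prefix avoids collisions, consider this small sketch. The prefix pattern `exgentic_<benchmark>_` comes from the commit above; the helper names, the tool name `get_user`, and the second benchmark name `other` are hypothetical.

```python
def tool_prefix(benchmark_name: str) -> str:
    """Prefix following the exgentic_${BENCHMARK_NAME}_ pattern."""
    return f"exgentic_{benchmark_name}_"


def gateway_tool_name(benchmark_name: str, tool_name: str) -> str:
    """Name under which a benchmark's tool would appear behind the gateway."""
    return tool_prefix(benchmark_name) + tool_name


# Two benchmarks exposing a tool with the same (hypothetical) name.
names = {gateway_tool_name(b, "get_user") for b in ("tau2", "other")}
print(sorted(names))
```

With a shared prefix both entries would collapse into a single name; with the benchmark name included they stay distinct.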

- Added experiment_name field to ExgenticConfig (default: 'default')
- Updated OTELInstrumentation.session_span() to accept and set metadata.experiment_name attribute
- Updated Runner to pass experiment_name from config to session spans
- Added --experiment flag to evaluate-benchmark.sh and deploy-and-evaluate.sh
- Added --experiment and --compare flags to analyze-run.sh for filtering traces
- Updated analyze_traces.py to extract, display, and group by experiment_name
- Fixed analyze_traces.py to properly handle command-line arguments with argparse
- Added print_experiment_comparison() function for side-by-side experiment comparison
- Added average LLM and tool call latency metrics to all comparative reports
- Track LLM calls after first observation separately (llm_count_after_obs)
- Exclude LLM calls before first observation from average LLM calls/session metric

This enables:
1. Tagging runs with experiment names via EXPERIMENT_NAME env var or --experiment flag
2. Filtering Phoenix traces by experiment name in analyze-run.sh
3. Comparing two experiments side-by-side with --compare flag
4. Grouping analysis results by experiment for better organization
5. Side-by-side comparison report showing metrics across experiments
6. Average latency metrics for LLM and tool calls in all reports
7. More accurate LLM call counts excluding initial observation calls

Signed-off-by: Yoav Katz <katz@il.ibm.com>
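
The grouping and comparison described above could look roughly like this. It is a sketch under assumptions, not the logic in analyze_traces.py: the session dictionaries, the `llm_calls` field, and both function names are hypothetical. The `metadata.experiment_name` attribute and the `default` fallback are taken from the commit message.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional

EXPERIMENT_ATTR = "metadata.experiment_name"


def group_by_experiment(sessions: list) -> dict:
    """Bucket sessions by experiment name, falling back to 'default'."""
    groups = defaultdict(list)
    for session in sessions:
        name = session.get("attributes", {}).get(EXPERIMENT_ATTR, "default")
        groups[name].append(session)
    return dict(groups)


def compare_experiments(sessions: list, metric: str,
                        first: str, second: str) -> dict:
    """Average one metric for two experiments; None when an experiment is absent."""
    groups = group_by_experiment(sessions)
    result: dict[str, Optional[float]] = {}
    for name in (first, second):
        values = [s[metric] for s in groups.get(name, [])]
        result[name] = mean(values) if values else None
    return result


# Hypothetical session records; the last one carries no experiment attribute.
sessions = [
    {"attributes": {EXPERIMENT_ATTR: "baseline"}, "llm_calls": 4},
    {"attributes": {EXPERIMENT_ATTR: "baseline"}, "llm_calls": 6},
    {"attributes": {EXPERIMENT_ATTR: "gateway"}, "llm_calls": 8},
    {"attributes": {}, "llm_calls": 3},
]
print(compare_experiments(sessions, "llm_calls", "baseline", "gateway"))
```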

Reduce minimum duration from 60s to 1s to allow more granular metrics queries for shorter time windows. Signed-off-by: Yoav Katz <katz@il.ibm.com>

Add EXGENTIC_SET_AGENT_ENABLE_TOOL_SHORTLISTING=true environment variable when deploying tool_calling agent to optimize tool selection. Signed-off-by: Yoav Katz <katz@il.ibm.com>

Replace Phoenix GraphQL-based trace collection with MLflow REST API. The OTEL collector now forwards spans to MLflow instead of Phoenix. Add download_mlflow_traces.py for fetching and transforming traces.

Harden deploy-benchmark.sh and deploy-agent.sh against silent failures when the Kagenti API is unreachable: add --max-time to curl calls, prevent set -e from silently killing the script, and verify port-forward liveness by checking the API responds rather than just the port being open.

Signed-off-by: Yoav Katz <yoav.katz@ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
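
The difference between "the port is open" and "the API responds" can be sketched as follows. The deploy scripts in this PR use curl from shell; this Python version is only an illustration, and the function names, the timeouts, and the rule that any HTTP status below 500 counts as alive are assumptions.

```python
import http.client
import socket
import urllib.error
import urllib.request


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Weak check: succeeds as soon as a TCP connection is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def api_is_alive(url: str, timeout: float = 5.0) -> bool:
    """Stronger check: require an HTTP response within the timeout."""
    # Port-forwards are local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered; treat client errors (4xx) as "reachable".
        return err.code < 500
    except (OSError, http.client.HTTPException):
        # Connection refused, reset, or timed out waiting for a response.
        return False
```

A stale port-forward can keep accepting TCP connections while nothing answers behind it, which is the case where `port_is_open` returns True and `api_is_alive` returns False.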

…tivity

- Replace all Phoenix references with MLflow in READMEs
- Update analyze-run.sh default URL to http://mlflow.localtest.me:8080
- Make evaluate-benchmark.sh connectivity checks fail immediately instead of printing soft warnings
- Add Exgentic harness and Observability section to root README

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>

Signed-off-by: Yoav Katz <katz@il.ibm.com>

yoavkatz added 10 commits May 3, 2026 09:36

fix: reduce Prometheus metrics minimum duration to 1s

c407482

feat: enable short listing for tool_calling agent

2a581a3

Merge branch 'feature/prometheus-integration' of github.com:yoavkatz/…

d7a4cf9

…workload-harness into feature/prometheus-integration

rubambiza added this to Kagenti Issue Prioritization May 14, 2026

github-project-automation Bot moved this to Backlog in Kagenti Issue Prioritization May 14, 2026

clawgenti mentioned this pull request May 18, 2026

Weekly Report 2026-05-18 kagenti/kagenti#1608

Open

yoavkatz added 2 commits May 20, 2026 12:17

Removed unneeded text from readme.

82fbaf5

Signed-off-by: Yoav Katz <katz@il.ibm.com>

Removed prompt which is not needed - and specified in original task.

3ff465e

Signed-off-by: Yoav Katz <katz@il.ibm.com>

kellyaa approved these changes May 21, 2026

kellyaa merged commit d2ece48 into kagenti:main May 21, 2026
1 check passed

github-project-automation Bot moved this from New /:ToDo to Done in Kagenti Issue Prioritization May 21, 2026

kellyaa merged 13 commits into kagenti:main from yoavkatz:feature/mlflow-tracing-prometheus-metrics-mcp-gateway

3 participants
