
feat: add Prometheus metrics, MLflow tracing, and MCP Gateway support #17

Merged
kellyaa merged 13 commits into kagenti:main from yoavkatz:feature/mlflow-tracing-prometheus-metrics-mcp-gateway
May 21, 2026

@yoavkatz
Contributor

Summary

  • Prometheus infrastructure metrics: Collect pod-level CPU, memory, and network metrics during benchmark runs and include them in trace analysis
  • MLflow tracing: Migrate from Phoenix/GraphQL to MLflow REST API for trace collection, using an OTEL collector to forward spans with proper auth
  • MCP Gateway support: Route MCP traffic through the Kagenti MCP Gateway with HTTPRoute and MCPServerRegistration resources
  • Deploy script hardening: Prevent silent failures when the Kagenti API is unreachable by adding timeouts, connection checks, and clear error messages
  • Runner improvements: Add experiment_name grouping, pass session_id as A2A request metadata, use benchmark-prefixed MCP tool names, and enable tool shortlisting for the tool_calling agent

Test plan

  • Run deploy-and-evaluate.sh --benchmark tau2 --agent tool_calling end-to-end
  • Verify MLflow traces are collected via analyze-run.sh -f
  • Test with --use-mcp-gateway flag to confirm gateway routing
  • Confirm deploy scripts fail gracefully when port-forward is down (clear error message, no silent exit)
  • Verify Prometheus metrics appear in session results when --mlflow is enabled

🤖 Generated with Claude Code

yoavkatz added 10 commits May 3, 2026 09:36
- Add --use-mcp-gateway argument parsing to evaluate-benchmark.sh
- Update help text and usage examples to document the new flag
- Modify deploy-and-evaluate.sh to pass --use-mcp-gateway to evaluate-benchmark.sh when USE_MCP_GATEWAY is true
- Ensures consistent MCP gateway routing across deployment and evaluation phases

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add std() function to calculate standard deviation
- Display std alongside averages for key metrics:
  - LLM calls/session
  - Tool calls/session
  - LLM call latency
  - Time breakdown percentages
  - Token counts (input, output, total)
- Provides better insight into metric variability across sessions

Signed-off-by: Yoav Katz <katz@il.ibm.com>
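
For illustration, the std() change described in the commit above could look roughly like the sketch below. It assumes a sample standard deviation (n - 1 denominator) and a hypothetical `format_metric` helper; the commit does not say which variant or output format analyze_traces.py actually uses, and the session counts are made-up values.

```python
import statistics


def std(values):
    """Sample standard deviation; 0.0 when fewer than two values exist.

    Assumption: the real std() may use the population formula instead.
    """
    return statistics.stdev(values) if len(values) > 1 else 0.0


def format_metric(label, values):
    """Hypothetical helper: render a mean with its std next to it."""
    mean = statistics.mean(values) if values else 0.0
    return f"{label}: {mean:.2f} (std {std(values):.2f})"


# Made-up per-session LLM call counts, for illustration only.
llm_calls_per_session = [4, 6, 5, 9]
print(format_metric("LLM calls/session", llm_calls_per_session))
# prints "LLM calls/session: 6.00 (std 2.16)"
```

Reporting the spread next to the mean makes it easier to tell whether a difference between two runs is larger than the session-to-session variation.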
- List all MCPServerRegistrations before deletion
- Delete all registrations in namespace (not just specific one)
- Add detailed logging for cleanup operations
- Handle case when no registrations exist gracefully

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Modified A2AProxyClient.send_prompt() to accept session_id parameter
- Pass session_id as request_metadata to client.send_message()
- Updated runner.py to pass session_id to A2A client
- Removed prompt instructions to pass session_id as string

This aligns with the reference implementation in test_a2a_agent.py
where session_id is passed via A2A request metadata.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
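
A minimal sketch of the pattern in the commit above, where session_id travels as request metadata rather than inside the prompt text. `RecordingClient` and `A2AProxyClientSketch` are hypothetical stand-ins, not the real A2A client or the repository's A2AProxyClient, and the `"session_id"` metadata key is an assumption.

```python
class RecordingClient:
    """Hypothetical stand-in for the A2A client; it only records calls."""

    def __init__(self):
        self.calls = []

    def send_message(self, message, request_metadata=None):
        self.calls.append((message, request_metadata))
        return {"echo": message}


class A2AProxyClientSketch:
    """Illustrative wrapper; the real A2AProxyClient is not shown in this PR page."""

    def __init__(self, client):
        self._client = client

    def send_prompt(self, prompt, session_id=None):
        # Assumption: the session id is carried under a "session_id" key.
        metadata = {"session_id": session_id} if session_id is not None else None
        return self._client.send_message(prompt, request_metadata=metadata)


client = RecordingClient()
proxy = A2AProxyClientSketch(client)
proxy.send_prompt("Cancel my order", session_id="session-123")
print(client.calls)
```

Keeping the identifier in metadata leaves the prompt untouched, so the agent no longer has to be instructed to echo the session id back as a string.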
- Updated toolPrefix in deploy-benchmark.sh to use exgentic_${BENCHMARK_NAME}_
- Updated EXGENTIC_MCP_TOOL_PREFIX in evaluate-benchmark.sh to include benchmark name

This allows multiple benchmarks to run concurrently with unique tool prefixes,
preventing tool name collisions in the MCP gateway.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
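
The collision-avoidance idea in the commit above can be sketched as follows. The prefix format `exgentic_<benchmark>_` comes from the commit message; the helper names, the tool names, and the second benchmark name are hypothetical.

```python
def tool_prefix(benchmark_name: str) -> str:
    """Build the per-benchmark prefix in the exgentic_<benchmark>_ form."""
    return f"exgentic_{benchmark_name}_"


def prefixed_tools(benchmark_name, tool_names):
    """Hypothetical helper: apply the benchmark prefix to each tool name."""
    prefix = tool_prefix(benchmark_name)
    return [prefix + name for name in tool_names]


# Two benchmarks exposing a tool with the same base name (made-up tools).
tau2_tools = prefixed_tools("tau2", ["get_order", "cancel_order"])
other_tools = prefixed_tools("otherbench", ["get_order"])

# No name appears in both sets, so both can be registered side by side.
print(set(tau2_tools) & set(other_tools))  # prints "set()"
```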
- Added experiment_name field to ExgenticConfig (default: 'default')
- Updated OTELInstrumentation.session_span() to accept and set metadata.experiment_name attribute
- Updated Runner to pass experiment_name from config to session spans
- Added --experiment flag to evaluate-benchmark.sh and deploy-and-evaluate.sh
- Added --experiment and --compare flags to analyze-run.sh for filtering traces
- Updated analyze_traces.py to extract, display, and group by experiment_name
- Fixed analyze_traces.py to properly handle command-line arguments with argparse
- Added print_experiment_comparison() function for side-by-side experiment comparison
- Added average LLM and tool call latency metrics to all comparative reports
- Track LLM calls after first observation separately (llm_count_after_obs)
- Exclude LLM calls before first observation from average LLM calls/session metric

This enables:
1. Tagging runs with experiment names via EXPERIMENT_NAME env var or --experiment flag
2. Filtering Phoenix traces by experiment name in analyze-run.sh
3. Comparing two experiments side-by-side with --compare flag
4. Grouping analysis results by experiment for better organization
5. Side-by-side comparison report showing metrics across experiments
6. Average latency metrics for LLM and tool calls in all reports
7. More accurate LLM call counts excluding initial observation calls

Signed-off-by: Yoav Katz <katz@il.ibm.com>
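
As a rough illustration of items 4 and 7 in the commit above, the sketch below groups sessions by experiment_name and counts only the LLM calls that follow the first observation. It assumes each session is a time-ordered list of span kinds containing an "observation" marker; the real trace schema used by analyze_traces.py is not shown here, and the sample sessions and experiment names are made up.

```python
from collections import defaultdict
from statistics import mean


def llm_count_after_obs(span_kinds):
    """Count "llm" spans that occur after the first "observation" span."""
    seen_observation = False
    count = 0
    for kind in span_kinds:
        if kind == "observation":
            seen_observation = True
        elif kind == "llm" and seen_observation:
            count += 1
    return count


def avg_llm_calls_by_experiment(sessions):
    """Average the post-observation LLM call count per experiment name."""
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session["experiment_name"]].append(
            llm_count_after_obs(session["spans"])
        )
    return {name: mean(counts) for name, counts in grouped.items()}


# Made-up sessions: span kinds in time order.
sessions = [
    {"experiment_name": "baseline",
     "spans": ["llm", "observation", "llm", "tool", "llm"]},
    {"experiment_name": "baseline",
     "spans": ["observation", "llm"]},
    {"experiment_name": "shortlisting",
     "spans": ["llm", "llm", "observation", "tool", "llm", "llm", "llm"]},
]
print(avg_llm_calls_by_experiment(sessions))
```

Excluding the calls made before the first observation keeps setup-phase calls from inflating the per-session average.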
Reduce minimum duration from 60s to 1s to allow more granular
metrics queries for shorter time windows.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Add EXGENTIC_SET_AGENT_ENABLE_TOOL_SHORTLISTING=true environment
variable when deploying tool_calling agent to optimize tool selection.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Replace Phoenix GraphQL-based trace collection with MLflow REST API.
The OTEL collector now forwards spans to MLflow instead of Phoenix.
Add download_mlflow_traces.py for fetching and transforming traces.

Harden deploy-benchmark.sh and deploy-agent.sh against silent failures
when the Kagenti API is unreachable: add --max-time to curl calls,
prevent set -e from silently killing the script, and verify port-forward
liveness by checking the API responds rather than just the port being
open.

Signed-off-by: Yoav Katz <yoav.katz@ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
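
The deploy scripts are shell and use curl with --max-time; the same liveness idea from the commit above is sketched here in Python for illustration only. The function name and default timeout are assumptions. The point is that an open port is not enough: the check succeeds only if an HTTP response arrives within the time limit.

```python
import http.client
import urllib.error
import urllib.request


def api_is_alive(url: str, timeout: float = 5.0) -> bool:
    """Return True only if `url` answers an HTTP request within `timeout` seconds.

    A port-forward whose port accepts connections but whose backend never
    replies will time out here and be reported as not alive.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP error statuses,
        # malformed responses, and invalid URLs.
        return False
```

A caller would typically stop with a clear error message when `api_is_alive(...)` returns False, rather than continuing and failing later in a less obvious place.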
…workload-harness into feature/prometheus-integration
…tivity

- Replace all Phoenix references with MLflow in READMEs
- Update analyze-run.sh default URL to http://mlflow.localtest.me:8080
- Make evaluate-benchmark.sh connectivity checks fail immediately
  instead of printing soft warnings
- Add Exgentic harness and Observability section to root README

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
yoavkatz added 2 commits May 20, 2026 12:17
Signed-off-by: Yoav Katz <katz@il.ibm.com>
@kellyaa kellyaa merged commit d2ece48 into kagenti:main May 21, 2026
1 check passed
@github-project-automation github-project-automation Bot moved this from New /:ToDo to Done in Kagenti Issue Prioritization May 21, 2026


3 participants