Add MCP Gateway support #1

Open
kellyaa wants to merge 56 commits into yoavkatz:feature/exgentic-a2a-runner from kellyaa:mcp_gateway

Conversation

@kellyaa

@kellyaa kellyaa commented Apr 24, 2026

No description provided.

yoavkatz added 30 commits March 17, 2026 19:21
Implement complete test harness for Exgentic benchmarks following the flow
described in kagenti/kagenti#963

Key features:
- MCP client using official Python SDK with streamable HTTP transport
- Sequential session processing with full lifecycle management
- A2A protocol integration for agent communication
- OpenTelemetry instrumentation for metrics and tracing
- Comprehensive configuration and documentation

Components:
- mcp_client.py: MCP protocol client for Exgentic server
- exgentic_adapter.py: High-level adapter for session management
- runner.py: Main orchestration with telemetry
- config.py: Configuration management
- prompt.py: Prompt builder with session_id injection
- otel.py: OpenTelemetry setup
- a2a_client.py: A2A protocol client (from appworld_a2a_runner)

Testing:
- Successfully connects to Exgentic MCP server (tau2 benchmark)
- Verified session creation with 114 available tasks
- Proper error handling and logging configuration

Documentation:
- README.md: Complete usage guide
- QUICKSTART.md: Quick start for Kagenti cluster
- Architecture and implementation docs

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Changes:
- Add list_tasks() method to MCPClient to fetch all available task IDs
- Add get_task_ids() method to ExgenticAdapter
- Update iterate_sessions() to accept task_ids list and respect max_tasks
- Update create_session() to accept optional task_id parameter
- Update runner to fetch task IDs first, then iterate over them
- Remove debug exit(99) statement
- Improve logging to show progress (task X/Y)

This ensures we know the total number of tasks upfront and can properly
limit processing with max_tasks configuration.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Remove all '# Made with ...' comments from Python files for cleaner code.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
The agent card may advertise an internal URL (e.g., 0.0.0.0:8000) that is
not accessible from outside the pod. This change ensures we always use the
configured A2A_BASE_URL (e.g., localhost:8080 via port-forward) instead of
the URL from the agent card.

This fixes the 404 error when connecting to agents behind port-forwards or
proxies.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
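
The URL override described in the commit above could look roughly like the sketch below. It assumes the agent card is a plain dict with a `url` field; `resolve_agent_url` is a hypothetical helper name, not the runner's actual function.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_agent_url(agent_card: dict, configured_base_url: str) -> str:
    """Return the URL the client should call for this agent.

    The configured base URL always wins. The URL advertised in the agent card
    (for example http://0.0.0.0:8000) may only be reachable inside the pod.
    """
    advertised = agent_card.get("url")
    if advertised and advertised.rstrip("/") != configured_base_url.rstrip("/"):
        # Log the mismatch so it is visible why the card URL was not used.
        logger.info(
            "Ignoring agent card URL %s; using configured %s",
            advertised,
            configured_base_url,
        )
    return configured_base_url
```

With a port-forward on localhost:8080, `resolve_agent_url({"url": "http://0.0.0.0:8000"}, "http://localhost:8080")` returns the localhost URL.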
- Fix syntax errors in run-with-port-forward.sh:
  * Add missing comment symbol on line 36
  * Fix unclosed quote on line 40
  * Replace parentheses in echo statements to avoid syntax errors
  * Update service names to match actual cluster services

- Configure A2A endpoint to use root path (/) instead of /v1/chat

- Enable OTEL trace collection to local Jaeger instance (localhost:4317)

- Enhance OTEL instrumentation:
  * Add full prompt text to span attributes (prompt.text)
  * Add full response text to span attributes (response.text)
  * Improve visibility of inputs/outputs in Jaeger traces

- Improve prompt instructions:
  * Add explicit instruction to call submit MCP tool when asked

- Enhance logging:
  * Add evaluation result details to session evaluation logs

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add AGENT_SERVICE and BENCHMARK_SERVICE to example.env
- Update run-with-port-forward.sh to read service names from .env
- Use default values if environment variables are not set
- Improves configurability and makes it easier to switch between different deployments

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add MAX_PARALLEL_SESSIONS configuration parameter (default: 1)
- Implement ThreadPoolExecutor for concurrent session execution
- Add thread-safe result collection with mutex lock
- Display max parallel sessions in run summary
- Maintain backward compatibility with sequential processing (max_parallel_sessions=1)
- Support abort_on_failure in parallel mode by canceling remaining futures

Benefits:
- Significantly improves throughput for I/O-bound workloads
- Allows users to configure parallelism based on their needs
- Maintains all existing functionality and error handling

Signed-off-by: Yoav Katz <katz@il.ibm.com>
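
A minimal sketch of the parallel execution pattern from the commit above: a thread pool, a lock around result collection, and cancellation of pending futures on failure. `run_sessions` and `run_one` are hypothetical names; the real runner's signatures may differ.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def run_sessions(
    task_ids: Iterable[str],
    run_one: Callable[[str], bool],
    max_parallel_sessions: int = 1,
    abort_on_failure: bool = False,
) -> Dict[str, bool]:
    """Run one session per task ID and return success flags keyed by task ID."""
    results: Dict[str, bool] = {}
    lock = threading.Lock()

    def worker(task_id: str) -> bool:
        try:
            ok = run_one(task_id)
        except Exception:
            ok = False  # a raised error counts as a failed session
        with lock:  # results is shared between worker threads
            results[task_id] = ok
        return ok

    with ThreadPoolExecutor(max_workers=max_parallel_sessions) as pool:
        futures = [pool.submit(worker, task_id) for task_id in task_ids]
        for future in as_completed(futures):
            if abort_on_failure and not future.result():
                for pending in futures:
                    # cancel() only affects futures that have not started yet
                    pending.cancel()
                break
    return results
```

With `max_parallel_sessions=1` the pool has a single worker, so sessions run one after another in submission order.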
- Display table of all failed sessions with their error messages at end of run summary
- Truncate long error messages to 50 characters for readability
- Only show table if there are failed sessions
- Helps quickly identify and diagnose session failures

Signed-off-by: Yoav Katz <katz@il.ibm.com>
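
The failed-session table from the commit above might be rendered along these lines. This is a sketch only: `format_failed_sessions`, the column layout, and the `...` truncation marker are assumptions; the 50-character limit comes from the commit message.

```python
from typing import Dict


def format_failed_sessions(failures: Dict[str, str], max_len: int = 50) -> str:
    """Render a two-column table of session IDs and truncated error messages."""
    if not failures:
        return ""  # no table when nothing failed

    def shorten(message: str) -> str:
        if len(message) <= max_len:
            return message
        return message[: max_len - 3] + "..."

    width = max(len("Session"), max(len(sid) for sid in failures))
    lines = [
        f"{'Session':<{width}}  Error",
        f"{'-' * width}  {'-' * max_len}",
    ]
    for sid, message in failures.items():
        lines.append(f"{sid:<{width}}  {shorten(message)}")
    return "\n".join(lines)
```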
- Extract text from artifacts and result first, regardless of state
- Then handle failed/canceled/rejected states with extracted information
- Include extracted output in error messages for better debugging
- Provides complete context when tasks don't complete successfully

Signed-off-by: Yoav Katz <katz@il.ibm.com>
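
The ordering described in the commit above (extract output first, then look at the task state) can be sketched as follows. The dict layout (`artifacts` holding `parts` with a `text` field, and `status.state`) is an assumption about the A2A task response, and `extract_text` and `handle_task` are hypothetical names.

```python
from typing import Optional

FAILURE_STATES = {"failed", "canceled", "rejected"}


def extract_text(task: dict) -> Optional[str]:
    """Join all text parts found in the task's artifacts; None if there are none."""
    texts = [
        part["text"]
        for artifact in task.get("artifacts") or []
        for part in artifact.get("parts") or []
        if part.get("text") is not None
    ]
    return "\n".join(texts) if texts else None


def handle_task(task: dict) -> str:
    """Return the task output, or raise with the partial output attached."""
    text = extract_text(task)  # extracted first, regardless of state
    state = (task.get("status") or {}).get("state")
    if state in FAILURE_STATES:
        raise RuntimeError(f"Task {state}; extracted output: {text!r}")
    return text if text is not None else ""
```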
…cluster

Add three new scripts to automate deployment and configuration of Exgentic
benchmark system on Kagenti Kubernetes cluster:

1. deploy-benchmark.sh: Deploy MCP tools via Kagenti API
   - Syncs local container images to cluster registry
   - Authenticates with Keycloak using password grant flow
   - Deploys tools with proper service configuration
   - Patches imagePullPolicy for local images
   - Waits for deployment readiness

2. deploy-agent.sh: Deploy A2A agents from source
   - Fetches and parses environment variables from GitHub
   - Deploys agents using Shipwright builds
   - Monitors build progress and waits for completion
   - Waits for deployment creation and readiness
   - Tests agent accessibility via A2A protocol
   - Fixes port configuration (8080 -> 8000)

3. configure-agent-environment.sh: Configure agent environment
   - Updates OpenAI API secret via kubectl patch
   - Patches agent deployment with Azure OpenAI settings
   - Accepts benchmark name as parameter
   - Waits for rollout completion

These scripts enable automated deployment and testing of the Exgentic
benchmark system without manual kubectl commands or UI interaction.

Fixes:
- Agent port mismatch (container port 8000 vs service port 8080)
- MCP_URLS environment variable configuration
- Azure OpenAI endpoint and model configuration

Signed-off-by: Yoav Katz <katz@il.ibm.com>
…agenti-ui

Port 8080 was being used by both the A2A agent port-forward and the
kagenti-ui service (via Istio gateway), causing intermittent access
issues to http://kagenti-ui.localtest.me:8080/.

Changes:
- Updated A2A_BASE_URL from localhost:8080 to localhost:8081 in example.env
- Modified run-with-port-forward.sh to forward A2A agent to port 8081
- Updated connectivity test to check port 8081

This allows kagenti-ui to be accessed on port 8080 via Istio gateway
while the A2A agent uses port 8081, eliminating port conflicts.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Changes:
- Made configure-agent-environment.sh executable (chmod +x)
- Fixed tool name in deploy-agent.sh: removed duplicate '-mcp' suffix
  from 'exgentic-mcp-${BENCHMARK_NAME}-mcp' to 'exgentic-mcp-${BENCHMARK_NAME}'
- Fixed tool name in deploy-benchmark.sh: removed duplicate '-mcp' suffix
  from 'exgentic-mcp-${BENCHMARK_NAME}-mcp' to 'exgentic-mcp-${BENCHMARK_NAME}'

This ensures consistent tool naming across deployment scripts and makes
the configuration script directly executable.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
… auth

Changes:
- Updated QUICKSTART.md with comprehensive deployment instructions
  - Added Option 1: Deploy Your Own Benchmark and Agent
  - Added Option 2: Use Existing Services
  - Documented deploy-benchmark.sh and deploy-agent.sh usage
  - Updated configuration section with new port (8081) for A2A agent
  - Added reference documentation for deployment scripts

- Fixed Keycloak authentication error in deployment scripts
  - Added automatic enabling of Direct Access Grants for kagenti client
  - Both deploy-benchmark.sh and deploy-agent.sh now configure Keycloak
  - Added better error messages for authentication failures
  - Renumbered steps after adding Keycloak configuration step

This resolves the 'unauthorized_client' error when running deployment
scripts and provides clear documentation for deploying benchmarks and
agents to the Kagenti cluster.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Changes:

1. Renamed configure-agent-environment.sh to configure-agent-and-benchmark-environment.sh
   - Use 'kubectl set env' instead of JSON patch for cleaner updates
   - Extended script to configure both agent and benchmark deployments
   - Added clear separation between agent and benchmark configuration sections
   - Improved output formatting with dedicated sections for each component
   - Added deployment-specific configuration summaries
   - Agent gets: LLM_API_BASE, OPENAI_API_BASE, LLM_MODEL
   - Benchmark gets: OPENAI_API_BASE, EXGENTIC_SET_BENCHMARK_USER_SIMULATOR_MODEL

2. Enhanced deploy-benchmark.sh
   - Added fetching and parsing of benchmark-specific environment variables
   - Fetches .env.<benchmark> from agent-examples repository
   - Parses environment variables using Kagenti API
   - Includes env vars in tool deployment configuration
   - Added graceful handling when env file is not found
   - Renumbered steps after adding env var fetching step

These improvements ensure:
- Consistent LLM configuration across agent and benchmark
- Better visibility into what's being configured
- Benchmark-specific settings are properly applied from repository
- Clearer output for troubleshooting
- Proper separation of concerns between agent and benchmark configuration

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Changed port-forward cleanup to kill processes by port number instead of
service name. This ensures all existing port-forwards on ports 8000 and
8081 are cleaned up regardless of which benchmark or agent service they
were forwarding to.

Uses lsof to find processes using the ports and kills them, making the
script more robust when switching between different benchmarks/agents.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add resource limits (2Gi memory) to benchmark pod deployments
- Rename close_session to delete_session throughout the stack
- Add validation for delete_session response (supports both 'success' and 'status' fields)
- Conditionally set EXGENTIC_SET_BENCHMARK_USER_SIMULATOR_MODEL only for tau benchmarks
- Create evaluate_benchmark.sh script that accepts benchmark name as parameter
- Set AGENT_SERVICE and BENCHMARK_SERVICE dynamically based on benchmark name

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Move .env loading before service name exports to prevent override
- Set A2A_ENDPOINT_PATH=/ for JSON-RPC protocol (was /v1/chat)
- Fix BENCHMARK_SERVICE to include -mcp suffix
- Set MAX_TASKS=1 in example.env for testing

This fixes the 404 errors when connecting to the A2A agent endpoint.
The agent uses JSON-RPC at the root path, not /v1/chat.

Tested with gsm8k benchmark: 100% success rate (1/1 sessions)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add optional model-name parameter with Azure/gpt-4o as default
- Replace hardcoded model references with MODEL_NAME variable
- Set benchmark pod memory limit to 3Gi (3 GiB)
- Update usage documentation and examples
- Add memory limit to configuration summary output

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Remove static resource limits (CPU and memory) from deployment JSON
- Resource limits are now set dynamically via configure script
- Allows for flexible resource allocation per benchmark

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Update evaluate_benchmark.sh to use port 7770 for MCP server (was 8000)
- Update evaluate_benchmark.sh to use port 7701 for A2A agent (was 8081)
- Update example.env with new port numbers
- Update README.md with deployment instructions and usage examples
- Increase default MAX_TASKS and MAX_PARALLEL_SESSIONS in example.env
- Enable OTEL_EXPORTER_OTLP_ENDPOINT by default in example.env

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Update agent-examples repo URL to yoavkatz/agent-examples
- Update workload-harness repo URL to yoavkatz/workload-harness
- Add git checkout for feature/exgentic-mcp-server branch
- Add git checkout for feature/exgentic-a2a-runner branch
- Ensures users clone from correct repos and use correct feature branches
- Both repositories are publicly accessible

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Uncomment cleanup function in evaluate_benchmark.sh
- Enable trap to cleanup port forwards on exit (EXIT, INT, TERM)
- Update README to reflect automatic cleanup behavior
- Update feature list: parallel session processing (not sequential)
- Add port forwarding details (7770 for MCP, 7701 for agent)
- Clarify configure script parameters and defaults
- Remove limitation about manual port forward cleanup

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add kubectl and Kagenti cluster prerequisites
- Add note about optional Keycloak credentials for deploy scripts
- Remove QUICKSTART.md (information merged into README)
- README now contains all necessary setup and usage information

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Modified otel.py to not use ConsoleSpanExporter when OTEL_EXPORTER_OTLP_ENDPOINT is not set
- Traces are still collected internally but not exported or printed
- Added comprehensive Jaeger setup instructions to README.md
- Updated example.env with clearer OTEL configuration comments

This prevents unwanted console spam when OTEL is not configured while
still allowing full observability when Jaeger or another collector is set up.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Modified otel.py to not use ConsoleSpanExporter when OTEL_EXPORTER_OTLP_ENDPOINT is not set
- Traces are still collected internally but not exported or printed
- Added comprehensive Jaeger setup instructions to README.md
- Reorganized README configuration section: main settings, debug, tracing, advanced
- Updated example.env with clearer OTEL configuration comments
- Modified evaluate_benchmark.sh to automatically set EXGENTIC_MCP_SERVER_URL and A2A_BASE_URL
- Removed run-with-port-forward.sh (functionality integrated into evaluate_benchmark.sh)
- Users no longer need to manually configure MCP and A2A URLs when using evaluate_benchmark.sh

This prevents unwanted console spam when OTEL is not configured while
still allowing full observability when Jaeger or another collector is set up.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Set the asyncio logger to WARNING level to suppress 'Using selector' debug messages
- Redirected kubectl port-forward output to /dev/null to suppress 'Handling connection' messages
- Keeps console output clean and focused on actual runner progress

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add --log-level CLI argument (DEBUG, INFO, WARNING, ERROR)
- Remove --verbose flag in favor of explicit log level control
- Update evaluate_benchmark.sh to pass LOG_LEVEL from environment
- Document LOG_LEVEL configuration in README and example.env
- Default log level remains INFO for balanced output
- Priority: CLI arg > LOG_LEVEL env var > default (INFO)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
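
The precedence stated in the commit above (CLI argument, then the LOG_LEVEL environment variable, then INFO) can be sketched like this. `build_parser` and `resolve_log_level` are hypothetical names used for illustration.

```python
import argparse
import logging
import os
from typing import Mapping, Optional

LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not given" apart from an explicit choice
    parser.add_argument("--log-level", choices=LEVELS, default=None)
    return parser


def resolve_log_level(
    cli_value: Optional[str], env: Optional[Mapping[str, str]] = None
) -> int:
    """Priority: CLI arg > LOG_LEVEL env var > default (INFO)."""
    env = os.environ if env is None else env
    name = (cli_value or env.get("LOG_LEVEL") or "INFO").upper()
    if name not in LEVELS:
        raise ValueError(f"Unsupported log level: {name}")
    return getattr(logging, name)
```

A caller would typically pass the result to `logging.basicConfig(level=...)`.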
yoavkatz and others added 26 commits March 31, 2026 19:01
…lity

- Add prompt text to exgentic_a2a.prompt.build span
- Add prompt, response, and duration to exgentic_a2a.a2a.send_prompt span
- Add evaluation result and duration to exgentic_a2a.mcp.evaluate_session span
- Maintain backward compatibility by keeping attributes on parent span

This improves Jaeger trace analysis by making relevant data visible
on the specific operation spans rather than only on the root span.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add context field to SessionData dataclass as Optional[Dict[str, Any]]
- Update MCP client to extract context from create_session response
- Modify build_prompt to accept and format context in the prompt
- Pass session context through the entire pipeline to the agent

The context dictionary from the MCP server is now included in the
prompt sent to the agent, providing additional information for task
completion.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
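
A sketch of how the optional context from the commit above might flow into the prompt. Only the `context` field and its type come from the commit message; the other `SessionData` fields and the prompt layout are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SessionData:
    session_id: str
    task: str  # assumed field holding the task description
    context: Optional[Dict[str, Any]] = None


def build_prompt(session: SessionData) -> str:
    """Build the agent prompt, appending the context only when one is present."""
    lines = [session.task, f"session_id: {session.session_id}"]
    if session.context:
        lines.append("Context:")
        lines.append(json.dumps(session.context, indent=2, sort_keys=True))
    return "\n".join(lines)
```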
- Add debug logging to log full task response for troubleshooting
- Change text extraction to check for None instead of falsy values to allow empty strings
- Return empty string for completed tasks with no extracted text
- Improve handling of completed tasks without text content

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Add estimated setup time (~15 minutes)
- Clarify Python version requirements (3.13+ not supported)
- Add note about uv automatically using Python 3.12
- Include secret_values.yaml creation step in Kagenti setup
- Improve configuration section with clearer structure
- Emphasize required .env file creation before running evaluations

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Added note that project has been tested only locally with Podman
- Clarifies Docker compatibility has not been validated

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Fix HTTP 422 error by converting underscores to hyphens in agent names for Kubernetes compatibility
- Add git fields (gitUrl, gitPath, gitBranch) to image deployment payload as required by API
- Improve Step 10 agent card test with better error handling and non-blocking behavior
- Split configure-agent-and-benchmark-environment.sh into separate scripts:
  - configure-agent.sh: Configure agent deployments with support for custom agents
  - configure-benchmark.sh: Configure benchmark deployments independently
- Update evaluate_benchmark.sh to accept optional agent name parameter
- Update deploy-agent.sh to support custom agent names with automatic name normalization
- Update README.md with new script usage and examples

Signed-off-by: Bob <bob@example.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
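
The name normalization mentioned in the commit above amounts to replacing underscores with hyphens. The sketch below adds a check against DNS label rules (lowercase alphanumerics and hyphens, at most 63 characters); that check is an assumption on top of what the commit describes, and `normalize_agent_name` is a hypothetical name.

```python
import re

_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def normalize_agent_name(name: str) -> str:
    """Convert underscores to hyphens and reject names that are not valid labels."""
    normalized = name.replace("_", "-")
    if len(normalized) > 63 or not _DNS_LABEL.fullmatch(normalized):
        raise ValueError(f"Not a valid Kubernetes-style name: {normalized!r}")
    return normalized
```

For example, `tool_calling` becomes `tool-calling`.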
- Update configure-agent.sh to require agent name parameter
- Update evaluate_benchmark.sh to require agent name parameter
- Remove generic agent fallback logic from both scripts
- Update README.md to remove generic agent references
- Simplify agent name construction by removing conditional logic

All scripts now consistently require explicit agent names for clarity

Signed-off-by: Bob <bob@example.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
…rict validation

- Fetch agent-specific .env file from agent-examples repository
- Fail with error (not warning) if env file cannot be fetched
- Fail with error if environment variables cannot be parsed
- Refactored to avoid path duplication by defining ENV_FILE_URL once
- Provide clear error messages with expected file paths

This ensures agents have proper environment configuration (OPENAI_API_KEY,
OPENAI_API_BASE, etc.) during deployment and fails fast if configuration
is missing.
- Merged deploy-benchmark.sh and configure-benchmark.sh into single script
- Merged deploy-agent.sh and configure-agent.sh into single script
- Changed to named parameters (--model, --keycloak-user, --keycloak-pass)
- Added --help flag to both scripts
- Updated README.md with new usage examples
- Removed obsolete configure-agent.sh and configure-benchmark.sh scripts

Both deployment scripts now automatically configure environment variables,
set model settings, and wait for rollouts to complete, providing a
streamlined deployment experience.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Move OPENAI_API_BASE and EXGENTIC_SET_BENCHMARK_USER_SIMULATOR_MODEL
  configuration to deployment time (Step 8) instead of post-deployment
- Add environment variables to ENV_VARS array before API call
- Simplify post-deployment steps to only update secret and set memory limit
- Avoid unnecessary deployment rollout by including all config upfront

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Move environment variable configuration to deployment time instead of
post-deployment to avoid unnecessary rollout. Move agent card test to
the end after all configuration is complete.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
The MCP server automatically closes sessions after evaluate_session,
causing delete_session to receive 'client has been closed' or 'No session
found' errors. These are expected conditions, not actual failures.

Changes:
- Catch BaseExceptionGroup from anyio TaskGroup during cleanup
- Treat 'client has been closed' as successful deletion
- Treat 'No session found' as successful deletion
- Prevents spurious failures in benchmark evaluation

This fixes the issue where sessions were marked as failed even though
evaluation succeeded, due to redundant cleanup attempts.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
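
The cleanup handling in the commit above could be sketched as below. The two message strings come from the commit; `delete_session_safely` and `is_benign_cleanup_error` are hypothetical names. Exception groups are detected through their `exceptions` attribute, so the sketch does not depend on `BaseExceptionGroup` being available.

```python
from typing import Callable

BENIGN_MARKERS = ("client has been closed", "no session found")


def is_benign_cleanup_error(exc: BaseException) -> bool:
    """True if the error only says the session is already gone."""
    nested = getattr(exc, "exceptions", None)
    if nested:
        # Exception group: benign only if every member is benign.
        return all(is_benign_cleanup_error(inner) for inner in nested)
    message = str(exc).lower()
    return any(marker in message for marker in BENIGN_MARKERS)


def delete_session_safely(delete_fn: Callable[[str], object], session_id: str) -> bool:
    """Delete a session, treating 'already closed' errors as success."""
    try:
        delete_fn(session_id)
    except BaseException as exc:  # exception groups may not derive from Exception
        if is_benign_cleanup_error(exc):
            return True
        raise
    return True
```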
Agents can accumulate memory over time from handling multiple requests
and storing logs/state. The previous 1Gi limit was insufficient for
long-running workloads, causing OOMKills.

Changes:
- Add Step 11.2 to set memory limit to 3Gi for agent deployments
- Add memory limit to deployment summary printout
- Matches the memory limit already set for benchmark deployments
- Prevents OOMKills during extended benchmark runs

This resolves the issue where agent pods were being killed due to
memory exhaustion after ~4.5 hours of operation.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Implement thread-local storage for both A2A and MCP clients to enable
safe parallel session execution when max_parallel_sessions > 1.

Changes:
- A2A Client: Replace shared requests.Session with thread-local sessions
  using threading.local() to eliminate race conditions
- MCP Client: Implement thread-local async event loops to prevent
  'event loop already running' errors in multi-threaded scenarios
- Add proper cleanup of thread-local resources on shutdown

This fixes thread safety issues while maintaining backward compatibility
with single-threaded execution (max_parallel_sessions=1).

Note: The 10-second session creation delay is a server-side issue in
the MCP server's create_session implementation and is not addressed by
these client-side changes.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
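
A minimal sketch of the thread-local event loop idea from the commit above, using only the standard library. The function names are hypothetical. The A2A side would apply the same pattern to one `requests.Session` per thread, which is omitted here to keep the example dependency-free.

```python
import asyncio
import threading

_local = threading.local()


def get_thread_loop() -> asyncio.AbstractEventLoop:
    """Return this thread's own event loop, creating it on first use."""
    loop = getattr(_local, "loop", None)
    if loop is None or loop.is_closed():
        loop = asyncio.new_event_loop()
        _local.loop = loop
    return loop


def run_in_thread_loop(coro):
    """Run a coroutine to completion on the calling thread's loop."""
    return get_thread_loop().run_until_complete(coro)


def close_thread_loop() -> None:
    """Release the calling thread's loop; call this when the worker shuts down."""
    loop = getattr(_local, "loop", None)
    if loop is not None and not loop.is_closed():
        loop.close()
    _local.loop = None
```

Because each worker thread drives its own loop, no thread ever calls `run_until_complete` on a loop that another thread is already running.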
Update deployment scripts with memory limit settings and configuration
improvements for agent and benchmark deployments.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Change all latency and duration measurements to use seconds instead of
milliseconds for consistency and clarity.

Changes:
- SessionResult: latency_ms -> latency_seconds
- Summary metrics: average_latency_ms -> average_latency_seconds, etc.
- OTEL metrics: All histogram names and units changed from ms to seconds
- OTEL attributes: duration_ms -> duration_seconds
- Log messages: Display timing in seconds (e.g., '10.5s' instead of '10500ms')

This provides more intuitive timing values and aligns with standard
observability practices where base units are preferred.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Add deploy-and-evaluate.sh script that automates the complete workflow:
1. Deploy benchmark using deploy-benchmark.sh
2. Deploy agent using deploy-agent.sh
3. Run evaluation using evaluate_benchmark.sh

Usage:
  ./deploy-and-evaluate.sh --benchmark <name> --agent <name> [OPTIONS]

Options:
  --benchmark NAME  - Benchmark name (required)
  --agent NAME      - Agent name (required)
  --model MODEL     - Model name (optional, default: Azure/gpt-4.1)
  --keycloak-user   - Keycloak username (optional)
  --keycloak-pass   - Keycloak password (optional)

Examples:
  ./deploy-and-evaluate.sh --benchmark tau2 --agent tool_calling
  ./deploy-and-evaluate.sh --benchmark tau2 --agent tool_calling --model Azure/gpt-4o-mini

This simplifies the deployment and evaluation process by combining all
three steps into a single command with proper error handling and
progress indicators.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
The agent card testing refactoring introduced an extra fi statement
that didn't match any opening if block, causing a syntax error.
Removed the extraneous fi on line 626 to fix the issue.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Changed all scripts to use --benchmark and --agent flags instead of positional arguments
- Renamed evaluate_benchmark.sh to evaluate-benchmark.sh for consistency
- Updated deploy-agent.sh to require --benchmark and --agent flags
- Updated deploy-benchmark.sh to require --benchmark flag
- Updated evaluate-benchmark.sh to require --benchmark and --agent flags
- Updated deploy-and-evaluate.sh to pass named parameters to all scripts
- Updated README.md with new usage examples and deploy-and-evaluate.sh documentation
- All scripts now have consistent parameter naming and help messages

This improves script usability and makes the API more explicit and self-documenting.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Always use ThreadPoolExecutor for session processing, regardless of the
max_parallel_sessions value. ThreadPoolExecutor with max_workers=1 provides
the same sequential behavior as the previous special case, but with a
unified code path.

Benefits:
- Simpler code with single execution path
- Consistent behavior across all parallelism levels
- ThreadPoolExecutor handles edge cases and cleanup automatically

Signed-off-by: Yoav Katz <katz@il.ibm.com>
- Change default A2A endpoint path from /v1/chat to /
- Add separate timing metrics for session creation, agent processing, and evaluation
- Refactor to create sessions on-demand by workers instead of pre-creating all sessions
- Improve summary reporting with detailed timing breakdown
- Clarify that latency specifically refers to agent processing time
- Enhance error handling for session creation failures
- Update terminology from 'Sessions Failed' to 'Sessions With Error'

This improves observability, resource efficiency, and provides more accurate
performance metrics for the Exgentic A2A runner.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Add Step 13 to verify MCP server health after deployment:
- Fails if health check port (8009) is already in use
- Sets up temporary port-forward to MCP server
- Fails if port-forward doesn't become ready within 20 seconds
- Tries /health endpoint first, falls back to root endpoint if 404
- Reports health check status with HTTP code and response
- Cleans up port-forward after health check
- Uses sed instead of head -n -1 for macOS/BSD compatibility

This ensures the MCP server is properly deployed and accessible before
completing the deployment process.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
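
The commit above implements the health check in shell; the fallback logic (try /health, fall back to the root endpoint on 404) is sketched here in Python for illustration. `check_health` and `http_fetch` are hypothetical names, and the fetch function is injectable so the logic can be exercised without a running server.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Return (HTTP status, body) for a GET request, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def check_health(
    base_url: str, fetch: Callable[[str], Tuple[int, str]] = http_fetch
) -> Tuple[int, str]:
    """Try /health first; if it returns 404, fall back to the root endpoint."""
    base = base_url.rstrip("/")
    status, body = fetch(base + "/health")
    if status == 404:
        status, body = fetch(base + "/")
    return status, body
```

Against the temporary port-forward, this would be called as `check_health("http://localhost:8009")`.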
- Replace lsof with nc for port availability checks (more portable)
- Enhance MCP health check with proper retry logic and port-forward restart
- Add service name detection for MCP server (-mcp suffix handling)
- Implement proper MCP initialize request for health validation
- Add 60-second timeout with periodic status updates
- Improve error handling and cleanup of port-forward processes

These changes make the deployment script more robust and reliable,
especially when dealing with pod restarts and service endpoint registration.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
This commit resolves OTEL connection issues and improves metrics clarity:

1. **deploy-agent.sh**: Configure OTEL protocol for agent deployment
   - Added OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf' environment variable
   - Explicitly set EXGENTIC_OTEL_ENABLED='true' for clarity
   - Ensures agent uses correct protocol for OTEL collector port 8335

2. **runner.py**: Clarify timing metrics output
   - Updated label to show averages are per-session
   - Improve readability of performance reports

The original OTEL connection errors were caused by a transient network
failure in the cluster. The protocol configuration ensures the agent
correctly communicates with the OTEL collector using HTTP/protobuf.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Kelly Abuelsaad <kna@us.ibm.com>
@yoavkatz yoavkatz force-pushed the feature/exgentic-a2a-runner branch from b1fcf2c to ab60a1a Compare April 26, 2026 08:49

2 participants