feat: add core MCP server, TrainerClient tools and Resources#2

Merged
google-oss-prow[bot] merged 3 commits into
kubeflow:mainfrom
abhijeet-dhumal:feat/trainer-mcp-tools
May 12, 2026

Conversation

@abhijeet-dhumal
Member

@abhijeet-dhumal abhijeet-dhumal commented Apr 9, 2026

fixes #7

Description

Implements the core MCP server framework and all TrainerClient-backed tools per KEP-936.
Claude plugin and skills moved to a separate PR per reviewer feedback.

Core Framework (kubeflow_mcp/core/)

  • Server factory with persona-based tool filtering, and progressive/semantic meta-tool modes
  • Config — YAML config file with env-var fallback chain
  • Security — K8s name validation, training bounds, secret masking, namespace enforcement, heuristic script safety
  • Auth — Bearer token + JWT/OIDC for HTTP transport
  • Policy — Persona access control (readonly, data-scientist, ml-engineer, platform-admin) with allow/deny overrides
  • Resilience — Thread-safe rate limiter and circuit breaker
  • Logging — Structured JSON audit logging with correlation IDs
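As a rough illustration of the Policy item above, persona access control with allow/deny overrides could be sketched as follows. The persona names come from the description; the per-persona tool sets, the `PERSONA_TOOLS` mapping, the `allowed_tools` helper and the rule that deny wins over allow are assumptions made for this sketch, not the PR's actual implementation.

```python
from __future__ import annotations

from typing import Iterable

# Persona names are from the PR description. Which tools each persona
# gets is hypothetical and only serves the illustration.
PERSONA_TOOLS: dict[str, frozenset[str]] = {
    "readonly": frozenset({"get_training_logs"}),
    "data-scientist": frozenset({"get_training_logs", "estimate_resources", "fine_tune"}),
    "ml-engineer": frozenset(
        {"get_training_logs", "estimate_resources", "fine_tune", "pre_flight"}
    ),
    "platform-admin": frozenset(
        {"get_training_logs", "estimate_resources", "fine_tune", "pre_flight"}
    ),
}


def allowed_tools(
    persona: str,
    allow: Iterable[str] = (),
    deny: Iterable[str] = (),
) -> set[str]:
    """Return the tools visible to a persona after applying overrides.

    Assumed precedence: start from the persona baseline, add explicit
    allows, then remove explicit denies (deny always wins).
    """
    if persona not in PERSONA_TOOLS:
        raise ValueError(f"unknown persona: {persona!r}")
    tools = set(PERSONA_TOOLS[persona]) | set(allow)
    return tools - set(deny)
```

For example, `allowed_tools("readonly", allow=["fine_tune"])` would widen the read-only baseline by one tool, while passing the same name in both `allow` and `deny` removes it.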

TrainerClient Tools (23 tools) and 3 MCP resource guides

CLI - serve command with stdio / http / sse transport, persona, auth token, instruction tier, progressive/semantic modes
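The description mentions a YAML config file with an env-var fallback chain alongside the CLI flags. A minimal sketch of such a chain is shown below, assuming the precedence CLI flag, then environment variable, then config file, then default. That order, the `KUBEFLOW_MCP_` prefix and the `resolve_setting` helper are hypothetical; the config file is represented by an already-parsed mapping so the sketch needs no YAML library.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    cli_value: Optional[str],
    file_config: Mapping[str, str],
    default: str,
    env: Optional[Mapping[str, str]] = None,
    env_prefix: str = "KUBEFLOW_MCP_",  # hypothetical prefix
) -> str:
    """Return the first defined value in the assumed fallback chain."""
    env = os.environ if env is None else env

    # 1. An explicit CLI flag wins.
    if cli_value is not None:
        return cli_value

    # 2. Environment variable, e.g. "log-level" -> KUBEFLOW_MCP_LOG_LEVEL.
    env_key = env_prefix + name.upper().replace("-", "_")
    if env_key in env:
        return env[env_key]

    # 3. Value from the parsed config file.
    if name in file_config:
        return file_config[name]

    # 4. Built-in default.
    return default
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.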

Test Suite (131 tests)

  • cli_test.py (25) — CLI flags, config fallback, auth/resilience wiring, SSE transport
  • sdk_contracts_test.py (68) — SDK dataclass fields, enum values, API signatures, K8s client methods
  • test_architecture.py (38) — Tool metadata, persona gating, instruction tiers, resource registration

Checklist

  • Tests pass locally (make test-python) — 131 passing
  • Linting passes (make verify) — 0 errors
  • Documentation updated
  • Commit messages follow conventional format

Related Issues

Implements KEP-936

@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch 2 times, most recently from 7013e56 to 08213ed Compare April 9, 2026 13:44
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch 2 times, most recently from d069607 to f418eb1 Compare April 9, 2026 16:33
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch 2 times, most recently from 9cdc9fd to db75fd1 Compare April 9, 2026 17:17
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review April 13, 2026 13:35
@google-oss-prow google-oss-prow Bot requested a review from astefanutti April 13, 2026 13:35
@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft April 15, 2026 13:36
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch from db75fd1 to 470812d Compare April 17, 2026 18:15
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review April 17, 2026 18:15
@abhijeet-dhumal abhijeet-dhumal changed the title from "Add core MCP server, TrainerClient tools, and unit tests" to "feat: Add core MCP server, TrainerClient tools, and unit tests" Apr 17, 2026
@abhijeet-dhumal abhijeet-dhumal changed the title from "feat: Add core MCP server, TrainerClient tools, and unit tests" to "feat: add core MCP server, TrainerClient tools, and unit tests" Apr 17, 2026
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch from 470812d to 40fce0e Compare April 17, 2026 18:24
@abhijeet-dhumal
Member Author

abhijeet-dhumal commented Apr 18, 2026

Hey @andreyvelich @astefanutti @kramaranya @szaher @Electronic-Waste - the 2nd onboarding phase is ready for review now 🏁
Would appreciate a review when you get a chance 🙏

@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch from 4db9623 to fa2b656 Compare April 18, 2026 22:10
@kramaranya

Thanks @abhijeet-dhumal, is it possible to split the changes into a few PRs (for example MCP server, tools, docs/plugin/benchmark)?

@abhijeet-dhumal
Member Author

> Thanks @abhijeet-dhumal, is it possible to split the changes into a few PRs (for example MCP server, tools, docs/plugin/benchmark)?

Thanks @kramaranya. I completely understand the preference for smaller PRs, and I normally follow that approach.

The context here is that after discussing with @andreyvelich and @thesuperzapper, we agreed on a phased onboarding strategy with a time constraint: land the project as a cohesive unit first, then refine via focused follow-up PRs. I tried to keep commits in this PR structured as logically separate components to make review easier.

Splitting into parallel PRs now is doable, but these components have tight interdependencies (e.g. tools depend on core server, benchmarks test the tools, security hardening touches multiple layers), so it would add significant overhead to manage the PR dependency chain and likely slow down the merge timeline.

@andreyvelich @astefanutti @thesuperzapper Which way do you recommend? Happy to adjust either way.

@andreyvelich
Member

@abhijeet-dhumal Maybe you can separate the Claude plugin and SKILLS into a separate PR? So we can review MCP tools first.

@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch 2 times, most recently from 8eeb74c to 5a937dc Compare April 28, 2026 19:22
…tests

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/trainer-mcp-tools branch from 5a937dc to 5f4a418 Compare April 28, 2026 19:26
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review April 28, 2026 19:32
@google-oss-prow google-oss-prow Bot requested a review from andreyvelich April 28, 2026 19:33
@abhijeet-dhumal abhijeet-dhumal changed the title from "feat: add core MCP server, TrainerClient tools, Prompts and Resources" to "feat: add core MCP server, TrainerClient tools and Resources" Apr 29, 2026
@abhijeet-dhumal
Member Author

> Hi @abhijeet-dhumal, thanks a ton for this.
>
> I've left comments on a couple of issues above; could you please take a look at those? Aside from that, everything looks good to me at this stage of the project. 😊
>
> Disclaimer: Used Claude Opus for PR review.

Thanks @jaiakash, all 4 items are addressed in the latest push. Appreciate the thorough review 🙌
The CircuitBreaker race and urllib3 internals catches were solid finds.
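
The thread does not show the race itself or its fix. For readers unfamiliar with this class of bug, here is a generic sketch of a circuit breaker whose state changes are all made under one lock, so concurrent callers cannot observe a half-updated state. The class shape, the parameter names and the half-open behaviour are assumptions for illustration and may differ from the code in kubeflow_mcp/core.

```python
import threading
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Generic thread-safe circuit breaker (illustrative, not the PR's code)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lock = threading.Lock()
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def allow_request(self) -> bool:
        # The check and the state update happen under the same lock, so two
        # threads cannot both claim the single half-open trial call.
        with self._lock:
            if self._opened_at is None:
                return True
            if self._clock() - self._opened_at >= self._reset_timeout:
                # Half-open: admit one trial and re-arm the timer so other
                # callers stay blocked until the trial reports back.
                self._opened_at = self._clock()
                return True
            return False

    def record_success(self) -> None:
        with self._lock:
            self._failures = 0
            self._opened_at = None

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
```

The injectable `clock` is only there so the timeout can be exercised in tests without sleeping.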

@abhijeet-dhumal
Member Author

abhijeet-dhumal commented Apr 29, 2026

> @abhijeet-dhumal Maybe you can separate the Claude plugin and SKILLS into a separate PR? So we can review MCP tools first.

@andreyvelich @kramaranya @astefanutti Apologies for the size of this PR - I understand it makes review harder.
IMHO splitting this PR further would require stubs or dead code in each PR just to pass CI, without making review easier.
This PR is the testable baseline for this new repo: reviewers can test the MCP project end to end right away, exercising full workflows or even running make inspector against a real working server.

# --clients:    modules: trainer, optimizer (stub), hub (stub)
# --persona:    readonly | data-scientist | ml-engineer | platform-admin
# --mode:       full | progressive | semantic
# --transport:  stdio | http | sse
# --auth-token: bearer token for HTTP auth (dev/staging)
# --log-level:  DEBUG | INFO | WARNING | ERROR
# --log-format: console | json (auto-detected if omitted)
kubeflow-mcp serve \
  --clients trainer \
  --persona ml-engineer \
  --mode full \
  --transport stdio \
  --auth-token SECRET \
  --log-level INFO \
  --log-format console \
  --no-banner

Re: why MCP Resources are included
The 3 resource guides (training-patterns, platform-fixes, troubleshooting) are not documentation — they're part of the tool runtime. Resources ARE packaged in the MCP server - they get bundled with pip install, and are registered on the server at startup via FastMCP's mcp.resource(uri). Any MCP client that connects gets access to them through the standard resources/read protocol method.

MCP Resources are a protocol-level feature that agents read on-demand when tools reference them:

  • pre_flight() and estimate_resources() return trainer://guides/platform-fixes when OpenShift is detected
  • fine_tune() returns trainer://guides/training-patterns when the user needs a LoRA script alternative
  • get_training_logs() returns trainer://guides/troubleshooting when OOM/CUDA errors are detected
  • Server instructions reference all 3 URIs to guide agent behavior

Without these, tool responses would contain issues that agents can't resolve. They're co-located with the tools because they're registered by the same client module and tested together (architecture tests verify all URIs are resolvable).
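
The pattern described above, where a tool response points at a guide URI that the client can then read, can be sketched without any MCP library. In this sketch a plain dict stands in for the server's resource registry (the real server registers resources through FastMCP), and the guide bodies, the `OOM_MARKERS` tuple and the `summarize_logs` and `read_resource` helpers are hypothetical. Only the three URIs are taken from this thread.

```python
from __future__ import annotations

# Stand-in for the server's resource registry. The URIs are the ones named
# in this thread; the contents are placeholders.
GUIDES: dict[str, str] = {
    "trainer://guides/training-patterns": "placeholder: training patterns guide",
    "trainer://guides/platform-fixes": "placeholder: platform fixes guide",
    "trainer://guides/troubleshooting": "placeholder: troubleshooting guide",
}

# Hypothetical substrings treated as signs of an OOM/CUDA failure.
OOM_MARKERS = ("CUDA out of memory", "OOMKilled")


def read_resource(uri: str) -> str:
    """Resolve a guide URI, mimicking what a resources/read call returns."""
    return GUIDES[uri]


def summarize_logs(log_text: str) -> dict[str, str]:
    """Return the logs and, when an OOM marker is found, a guide URI."""
    result = {"logs": log_text}
    if any(marker in log_text for marker in OOM_MARKERS):
        result["guide"] = "trainer://guides/troubleshooting"
    return result


def unresolvable_uris(uris: list[str]) -> list[str]:
    """Architecture-test style check: list referenced URIs with no guide."""
    return [uri for uri in uris if uri not in GUIDES]
```

An agent that receives a `guide` key would follow up with a read of that URI; the `unresolvable_uris` check mirrors the idea of verifying that every URI a tool can return is actually registered.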

Happy to clarify anything further — let me know if you'd like me to adjust the scope.

Comment thread kubeflow_mcp/trainer/api/lifecycle.py Outdated
Comment thread kubeflow_mcp/core/server.py
Comment thread kubeflow_mcp/trainer/api/training.py Outdated
Comment thread kubeflow_mcp/trainer/api/training.py
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal
Member Author

Thanks for the thorough review, @kramaranya! I have resolved the comments raised.
Looking forward to the next round whenever you get a chance. 😄

Comment thread kubeflow_mcp/core/dynamic_tools.py Outdated
Comment thread kubeflow_mcp/core/server.py Outdated
Comment thread kubeflow_mcp/trainer/__init__.py Outdated
Comment thread kubeflow_mcp/core/config.py Outdated
Comment thread kubeflow_mcp/core/policy.py Outdated
…tate to policy

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal
Member Author

Hey @kramaranya, thanks again for the thorough pass. I have addressed several of your threaded points.
The PR is now ready for the next round 😄

Member

@andreyvelich andreyvelich left a comment

Thanks @abhijeet-dhumal!
/lgtm
/approve

/hold @kramaranya in case you have more suggestions

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@kramaranya kramaranya left a comment

Thank you, @abhijeet-dhumal!
LGTM
/unhold

@google-oss-prow google-oss-prow Bot merged commit 48f2a94 into kubeflow:main May 12, 2026
8 checks passed

Development

Successfully merging this pull request may close these issues.

Core MCP server framework and TrainerClient tools

5 participants