
RFC: Agent protocol support for apx apps #142

Open

stuart-gano wants to merge 43 commits into databricks-solutions:main from stuart-gano:rfc/agent-protocol

Conversation


@stuart-gano stuart-gano commented Mar 24, 2026

Summary

This RFC adds the agent protocol addon to APX — everything needed to build, run, and deploy Databricks-native AI agents with one command.

What's in this PR

Agent runtime (core/agent.py)

Agent type hierarchy

Type              Description
LlmAgent / Agent  Tool-calling loop over FMAPI. Alias: Agent = LlmAgent.
SequentialAgent   Runs sub-agents one after another, piping output as input.
ParallelAgent     Runs sub-agents concurrently, merges outputs.
LoopAgent         Repeats a sub-agent until it calls finish_loop() or max_iterations is hit.
RouterAgent       One upfront FMAPI call selects a sub-agent; synthetic transfer tools never enter the dispatch pipeline.
HandoffAgent      Agents pass control to each other via real ASGI transfer_to_* routes.

LlmAgent features

  • before_tool / after_tool hooks — sync or async callables, called around every tool dispatch
  • input_guardrails / output_guardrails — lists of sync/async callables; return None (pass) or str (short-circuit with that message)
  • context_window_tokens — budget cap; when exceeded, the middle of the history is summarized with one extra LLM call
  • custom_outputs — set_custom_output(request, key, value) helper; surfaced in InvocationResponse.custom_outputs and as event: custom_outputs in SSE
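
The guardrail contract above (return None to pass, str to short-circuit with that message, sync or async callables both accepted) can be sketched as a small standalone helper. The function and guardrail names here are illustrative, not the APX API:

```python
import asyncio
import inspect
from typing import Awaitable, Callable, Optional, Union

# A guardrail returns None (pass) or a str (short-circuit with that message).
GuardrailFn = Callable[[str], Union[Optional[str], Awaitable[Optional[str]]]]

async def apply_guardrails(guards: list[GuardrailFn], text: str) -> Optional[str]:
    """Run guardrails in order; return the first short-circuit message, if any."""
    for guard in guards:
        result = guard(text)
        if inspect.isawaitable(result):  # sync and async callables both accepted
            result = await result
        if result is not None:
            return result
    return None

def block_pii(text: str) -> Optional[str]:
    return "Input rejected: contains an SSN." if "SSN" in text else None

async def block_empty(text: str) -> Optional[str]:
    return "Input rejected: empty message." if not text.strip() else None

print(asyncio.run(apply_guardrails([block_pii, block_empty], "My SSN is 123")))
print(asyncio.run(apply_guardrails([block_pii, block_empty], "What is my bill?")))
```

The same first-non-None scan applies to both input_guardrails (before _run_llm_loop) and output_guardrails (after it).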

Composition patterns match Google ADK and OpenAI Agents SDK feature-for-feature, plus Databricks-native additions:

  • On-behalf-of auth wired end-to-end — every tool receives the caller's identity token
  • Zero-config MCP SSE server at /mcp/sse
  • A2A discovery via /.well-known/agent.json (live url + mcpEndpoint)
  • WorkspaceClient / UserWorkspaceClient injectable as typed FastAPI deps in tool functions
  • apx deploy — one command to production
  • app_predict_fn MLflow eval bridge
  • Built-in dev UI at /_apx/agent and /_apx/tools

Dev UI namespace: /_apx/*

/_apx/ is the APX platform tooling namespace (underscore prefix signals "platform layer, not app layer").

/_apx/agent — interactive chat UI

  • Send messages, stream responses
  • Tool call trace panel showing args, results, and timing
  • Inspect registered skills
  • Copy MCP SSE URL for Claude Desktop / Cursor
  • Nav link to /_apx/tools

/_apx/tools — tool inspector and live invocation form

  • Left sidebar: all tools grouped Local / Remote with type badges
  • Schema tab: inputSchema as the LLM sees it (dep-injected params stripped, FMAPI JSON, syntax highlighted) — different from /docs which shows all FastAPI params
  • Invoke tab: form auto-generated from inputSchema, POSTs to /api/tools/<name> or sub-agent /invocations, shows result with timing
  • Nav link back to /_apx/agent

Protocol endpoints

GET  /.well-known/agent.json   A2A discovery (name, skills, mcpEndpoint, url)
POST /invocations               FMAPI tool-calling loop; stream=true for SSE
GET  /health                    Liveness
POST /api/tools/<fn_name>       One route per registered tool
GET  /mcp/sse                   MCP SSE transport
POST /mcp/messages/             MCP SSE return channel
GET  /_apx/agent                Dev chat UI
GET  /_apx/tools                Tool inspector + invocation
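
A client consuming the stream=true response from /invocations only needs standard SSE event:/data: framing. A minimal parser sketch (the event names output_text.delta and custom_outputs follow the description above; the exact data payload shape is an assumption):

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Split an SSE payload into (event, data) pairs; blank lines delimit events."""
    events = []
    event_name, data_lines = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            events.append((event_name, json.loads("\n".join(data_lines))))
            event_name, data_lines = "message", []
    return events

raw = (
    "event: output_text.delta\ndata: {\"delta\": \"Hel\"}\n\n"
    "event: output_text.delta\ndata: {\"delta\": \"lo\"}\n\n"
    "event: custom_outputs\ndata: {\"score\": 0.9}\n\n"
)
chunks = [d["delta"] for e, d in parse_sse(raw) if e == "output_text.delta"]
print("".join(chunks))  # → Hello
```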

Deployment (crates/cli/src/deploy.rs)

  • apx deploy command — builds wheel, packages app, deploys to Databricks Apps, polls until running
  • No DABs YAML, no separate CI step

README

Added "Why APX for agents?" section calling out the seven Databricks-native advantages vs Google ADK and OpenAI Agents SDK.

End-to-end validation: Energy billing agent

Validated the full agent flow against a common customer pattern — an energy billing Q&A agent with Lakebase-backed tools:

  • 5 tools: get_customer_profile, query_ami_readings, get_billing_summary, get_rate_schedule, compare_months
  • All tools use Dependencies.UserClient (OBO auth) → Lakebase Provisioned via generate_database_credential
  • LLM model: databricks-claude-sonnet-4-6 via FMAPI
  • Streaming SSE via /_apx/agent chat UI — confirmed working end-to-end

Bugs found and fixed during validation:

  1. OBO token not forwarded in tool dispatch — _dispatch_tool_call only passed Authorization to internal ASGI calls, dropping X-Forwarded-Access-Token. Tools using Dependencies.UserClient failed with ValueError: OBO token is not provided. Fixed by forwarding OBO headers.
  2. JS syntax error in /_apx/agent — Python \n in an f-string produced a literal newline inside a JS string, breaking the chat UI entirely. Fixed by escaping to \\n.
  3. /_apx/* namespace conflict (prior commit) — /_apx routes collided with app routes; fixed by separating the router merge.
  4. /invocations not proxied (prior commit) — dev proxy only injected OBO tokens for /api/*, not root-level agent routes; fixed by adding api_utils_router.

Test plan

  • Agent(tools=[...]) registers routes, /_apx/agent loads
  • /_apx/agent chat UI: send message → streaming SSE response renders correctly
  • /_apx/agent chat UI: tool calls display in trace panel with args/result/timing
  • /_apx/tools shows correct FMAPI schemas (dep-stripped), Invoke tab POSTs and returns result
  • OBO token forwarded through internal ASGI tool dispatch — Dependencies.UserClient works in tools called via /invocations
  • Dev proxy injects X-Forwarded-Access-Token for both /api/* and root-level routes (/invocations, /.well-known/agent.json)
  • End-to-end: user question → LLM → tool call → Lakebase query → LLM → streamed answer
  • apx init --addon agent scaffolds correctly (existing integration test)
  • LoopAgent loops until finish_loop() is called
  • RouterAgent routes to correct sub-agent without registering transfer tools in MCP
  • HandoffAgent transfers between agents via transfer_to_*
  • Guardrails short-circuit correctly; hooks fire before/after tool dispatch
  • custom_outputs appear in InvocationResponse and SSE stream
  • context_window_tokens triggers summarization at budget
  • apx deploy deploys and polls to RUNNING

This pull request was AI-assisted.

Proposes generating agent protocol endpoints (invocations, A2A
discovery, MCP tools, eval bridge) from existing apx routes via
pyproject.toml configuration. Routes are tools — no new abstractions.

Implemented as a LifespanDependency addon following the same pattern
as SQL and Lakebase addons.

Co-authored-by: Isaac
Adds addons/agent/ following the same pattern as sql and lakebase:
- addon.toml with Dependencies.Agent type alias
- LifespanDependency that reads [tool.apx.agent] from pyproject.toml
- Builds tool registry from app's OpenAPI spec (routes are tools)
- Generates /.well-known/agent.json (A2A discovery card)
- Generates /invocations (agent protocol dispatch to routes)
- Generates /health (liveness probe)

Zero application code changes — configure via pyproject.toml, existing
routes with operation_id automatically become agent tools.

Co-authored-by: Isaac
49 checks covering _inspect_tool_fn, _make_input_model, Agent.build_local_tools,
_build_fmapi_tool_schemas, build_router signature patching, structured output,
protocol models, and A2A card generation.

Runs directly with python3 — no APX wheel build required.
- Remove __signature__ from _ToolFn Protocol; regular Python functions
  satisfy __name__ + __doc__ but not __signature__ as a direct attr
- Change _patch_handler_signature handler param to Any (does dynamic
  attr assignment, not Protocol reads)
- Change _make_route_handler return type to Any (returns async coroutines)
- Fix type: ignore comment from mypy syntax [call-overload] to bare
  type: ignore for ty compatibility on create_model call
- Add rust-embed include-exclude feature + exclude pyc/__pycache__ from
  embedded templates to prevent template-not-found errors at scaffold time
- Add httpx>=0.27.0 as Python dependency in agent addon.toml
- Add get_root_routers() to LifespanDependency base class + factory
  for protocol routes that must live at / not /api/
- Agent get_root_routers(): /.well-known/agent.json, /invocations, /health
- Agent get_routers(): /api/tools/* (api-prefixed tool routes)
- Move addon pyproject.toml.jinja2 to addon root (was in src/base/,
  which mapped to src/{app_slug}/ instead of project root)
- Result: [tool.apx.agent] config correctly written at scaffold time,
  enabling AgentContext lifespan initialization
Builds the project and deploys to Databricks Apps via the Databricks
CLI, then polls until the app reaches RUNNING state.

- apx deploy [APP_PATH] [--skip-build] [--profile P] [--build-path P]
- Reads DATABRICKS_CONFIG_PROFILE from .env if --profile not given
- Polls databricks apps get every 3s (up to 3 min) for RUNNING state
- Reports ERROR/CRASHED states with hint to check logs
- Extracts run_build() from build.rs so deploy can reuse it
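
The poll-until-RUNNING behaviour (every 3 s, up to 3 min, fail fast on ERROR/CRASHED) reduces to a simple loop. Sketched here in Python with a stubbed status getter, since the real implementation lives in deploy.rs:

```python
import time
from typing import Callable

def poll_until_running(get_state: Callable[[], str],
                       interval_s: float = 3.0,
                       timeout_s: float = 180.0,
                       sleep=time.sleep) -> str:
    """Poll the app state until RUNNING, or fail fast on ERROR/CRASHED."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state == "RUNNING":
            return state
        if state in ("ERROR", "CRASHED"):
            raise RuntimeError(f"App entered {state}; check the app logs.")
        sleep(interval_s)
    raise TimeoutError("App did not reach RUNNING within the timeout.")

# Stub: the app becomes RUNNING on the third poll.
states = iter(["DEPLOYING", "STARTING", "RUNNING"])
print(poll_until_running(lambda: next(states), sleep=lambda _: None))  # → RUNNING
```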
Adds a self-contained HTML/JS chat interface at /_agent for
interactively testing agents during local development, inspired
by Google ADK's `adk web` experience:

- Fetches agent name/description/skills from /.well-known/agent.json
  context at render time (no round-trip needed)
- Streams responses via SSE: sends InvocationRequest with stream=true
  and reads output_text.delta events token by token
- Maintains full conversation history client-side and sends it each
  request (stateless agent, stateful client)
- Shows registered skills in a collapsible panel
- Auto-resizing textarea, Enter to send, Shift+Enter for newlines
- Dark theme matching APX style

Also fixes build.rs to skip UI build when the project has no frontend
(pure-API agent projects), guarded by meta.has_ui().
…g capability

- Remove unused JSONResponse import
- Fix skills_json construction: !r produces Python repr (single-quoted
  strings), which is invalid JSON. Switch to json.dumps() so the browser
  can actually parse the skills array without error
- Set A2ACapabilities.streaming = True — /invocations supports stream: true
  via SSE, so the discovery card should advertise it correctly
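
The !r-vs-json.dumps bug is easy to reproduce: Python repr single-quotes strings, which JSON parsers reject, while json.dumps emits valid JSON:

```python
import json

skills = ["billing-qa", "rate-lookup"]

bad = f"const skills = {skills!r};"            # repr: single quotes — invalid JSON
good = f"const skills = {json.dumps(skills)};"  # valid JSON for the browser

print(bad)
print(good)

# A browser's JSON.parse (like Python's json.loads) accepts only the second form.
try:
    json.loads(str(skills))
except json.JSONDecodeError:
    print("repr form is not valid JSON")
```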
… test

1. Auto-discover agent_router.py — _AgentDependency.get_routers() now
   auto-imports {backend_pkg}.agent_router via importlib if _agent_instance
   is None. This removes the need for the addon to overwrite app.py with a
   side-effect import, so existing app.py customisations are preserved.
   The addon's app.py template is deleted.

2. /_agent setup banner — when AgentContext is None (missing pyproject
   config or no Agent() call), the dev UI now shows a clear amber banner
   with setup instructions instead of silently sending to a 404.

3. Rename Dependencies alias Agent → AgentContext — avoids a confusing
   collision between the Agent builder class (used in agent_router.py)
   and the route-parameter dependency type (used in FastAPI handlers).
   Updated doc: `ctx: Dependencies.AgentContext`.

4. Integration test for agent addon — test_init_with_agent_addon verifies
   that `apx init --addons agent` scaffolds agent_router.py, core/agent.py,
   [tool.apx.agent] in pyproject.toml, httpx dep, and no ui/ directory.
Exposes all registered Agent tools as an MCP server over SSE transport,
mounted at /mcp/sse (GET) + /mcp/messages/ (POST).

- _build_mcp_components(ctx, app): builds mcp.server.Server + SseServerTransport
  from the AgentContext tool registry. Tool calls dispatch via ASGI to the
  existing /api/tools/<name> routes so FastAPI dep injection (auth, workspace
  client) applies identically to MCP and REST callers.
- Lifespan wires the MCP server onto app.state; gracefully skips with a warning
  if the mcp package isn't installed.
- /_agent dev UI gains an MCP info bar: shows the SSE URL computed from
  window.location.origin with a one-click copy button.
- addon.toml: adds mcp>=1.0.0 to Python dependencies.

Claude Desktop config:
  {"mcpServers": {"my-agent": {"transport": "sse", "url": "http://localhost:8000/mcp/sse"}}}
…t.json

- Add mcpEndpoint field to AgentCard — populated at request time with
  "{base_url}/mcp/sse" when the MCP server is active, null otherwise.
- Populate card.url from request.base_url — was always "" before, which
  breaks A2A clients that use the card to self-locate the agent.
- Both fields are filled via model_copy() in the route handler so the
  stored ctx.card template stays clean (no request dependency at lifespan).
_dispatch_tool_call posted to /tools/<fn> but the actual routes live at
{api_prefix}/tools/<fn> (e.g. /api/tools/<fn>) because build_router()
returns a router that gets included under api_router which carries the
prefix. The LLM tool-calling loop would 404 on every tool call.

Fix: import api_prefix from ..._metadata (same pattern as _factory.py)
and use f"{api_prefix}/tools/{fn_name}" — matching what the MCP
dispatch already did correctly.
…gent hierarchy

Adds ADK-style agent composition types alongside the existing LlmAgent:

  SequentialAgent([planner, writer])   — chains agents, each sees prior output
  ParallelAgent([legal, finance])      — runs all concurrently, merges results

Key design changes:

- BaseAgent abstract base: run(), stream(), get_tool_routers(), collect_tools(),
  fetch_remote_tools() — any custom orchestration pattern can subclass this.

- LlmAgent replaces Agent (Agent = LlmAgent alias kept for backwards compat).
  __init__ no longer sets _agent_instance — sub-agents in a composite no longer
  accidentally override the root registration.

- _auto_import_agent_router now looks for a module-level `agent` variable of
  type BaseAgent in agent_router.py rather than relying on __init__ side-effects.
  Explicit assignment makes intent clear and supports all agent types:
      agent = SequentialAgent([LlmAgent(tools=[a]), LlmAgent(tools=[b])])

- AgentContext carries the root agent instance (ctx.agent). _handle_invocation
  delegates to ctx.agent.run() / ctx.agent.stream() — no agent-type-specific
  code in the protocol layer.

- _run_llm_loop now takes list[Message] instead of InvocationRequest, making
  it callable from LlmAgent.run() without constructing a fake request body.

- Lifespan uses collect_tools() + fetch_remote_tools() instead of the
  LlmAgent-specific build_local_tools() / fetch_sub_agent_tools().

Usage in agent_router.py:
    planner = LlmAgent(tools=[search, outline])
    writer  = LlmAgent(tools=[draft])
    agent   = SequentialAgent([planner, writer])   # ← registered as root
_run_llm_loop now accepts an optional `tools` parameter. LlmAgent.run()
and LlmAgent.stream() pass self.collect_tools() so each LlmAgent in a
SequentialAgent or ParallelAgent hierarchy only exposes its own tools
to FMAPI, preventing cross-agent tool leakage.
- Async tool support: _make_route_handler now checks iscoroutinefunction
  and awaits the tool fn when it's a coroutine; sync tools unchanged

- MCP tool dispatch path: _build_mcp_components imported api_prefix from
  _metadata so /mcp tool calls hit the correct {api_prefix}/tools/{name}
  route instead of hardcoded /api/tools/{name}

- Instructions / system prompt: AgentConfig gains an optional `instructions`
  field (maps to pyproject.toml [tool.apx.agent]); LlmAgent.__init__ accepts
  an `instructions` kwarg that overrides the config value per-agent.
  _run_llm_loop prepends a system message when instructions are non-empty.
  Useful for per-agent persona in SequentialAgent/ParallelAgent compositions.
- InvocationRequest.input now accepts list[Message] | str; a plain string
  is coerced via .messages() so MLflow eval harness and curl one-liners
  work without wrapping in a list

- app_predict_fn gains an optional token param that adds Authorization:
  Bearer <token> to every request — required for OBO-protected Databricks
  Apps during mlflow.genai.evaluate()

- MLflow tracing: _handle_invocation opens a root CHAIN span per request;
  each FMAPI call opens a child LLM span; each tool dispatch opens a child
  TOOL span. All attributes (model, messages, result) are set on the spans.
  Tracing is no-op when mlflow is not importable so the addon remains
  usable in plain FastAPI dev without a tracking server.
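
The str-to-message coercion described above is a one-branch normalisation; sketched with a plain dataclass standing in for the Pydantic Message model (names illustrative):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Message:
    role: str
    content: str

def coerce_input(value: Union[str, list[Message]]) -> list[Message]:
    """Accept a bare string (curl one-liners, eval harness) or a message list."""
    if isinstance(value, str):
        return [Message(role="user", content=value)]
    return value

print(coerce_input("hello"))
print(coerce_input([Message("user", "hi"), Message("assistant", "hello")]))
```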
- MLflow span leak on exception: all LLM and TOOL spans now use
  try/finally to guarantee span.end() on error paths; root CHAIN span
  in _handle_invocation also wrapped in try/finally

- AgentConfig gains temperature, max_tokens (both optional, None = model
  default), and max_iterations (default 10). All three are documented as
  comments in the scaffolded pyproject.toml.

- LlmAgent.__init__ accepts the same three params to override per-agent
  within a SequentialAgent/ParallelAgent composition.

- _run_llm_loop resolves precedence: constructor arg > AgentConfig >
  model default. Builds fmapi_extra dict only with non-None values so
  missing fields are not sent to FMAPI at all.
…tom_inputs

- Root span set_attribute-after-end: moved set_attribute inside the
  try/finally block so it runs before end() in all paths including errors

- result undefined if _dispatch_tool_call raises: initialise result = ""
  before the try block so messages.append never hits a NameError

- SequentialAgent/ParallelAgent now accept an optional instructions param.
  When set, a system message is prepended to the conversation before any
  sub-agent runs — framing the whole pipeline without overriding each
  LlmAgent's own system prompt.

- custom_inputs wired up: _handle_invocation stashes custom_inputs on
  request.state; _run_llm_loop reads custom_inputs["instructions"] as the
  highest-priority system prompt override (> constructor > AgentConfig).
  custom_inputs also recorded as a span attribute on the root CHAIN span.
  InvocationRequest.instructions_override() helper added for callers.
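
The highest-priority-wins resolution for the system prompt (custom_inputs > constructor > AgentConfig) is a first-non-None scan; an illustrative sketch, not the APX source:

```python
from typing import Optional

def resolve_instructions(custom_inputs: dict,
                         ctor_instructions: Optional[str],
                         config_instructions: Optional[str]) -> Optional[str]:
    """custom_inputs['instructions'] wins, then the constructor, then config."""
    for candidate in (custom_inputs.get("instructions"),
                      ctor_instructions,
                      config_instructions):
        if candidate:
            return candidate
    return None

print(resolve_instructions({"instructions": "be terse"}, "persona", "cfg"))  # → be terse
print(resolve_instructions({}, "persona", "cfg"))                            # → persona
```

The same pattern resolves temperature / max_tokens / max_iterations (constructor arg > AgentConfig > model default), with None-valued fields omitted from the FMAPI payload entirely.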
- _load_agent_config: prefer __file__-relative pyproject.toml search over
  cwd-relative; cwd may be unrelated in deployed Databricks Apps. Falls
  back to cwd walk for interactive/test use.

- FMAPI tools payload: omit `tools` key entirely when tool list is empty
  instead of sending `"tools": []`; some FMAPI backends reject the empty array.

- SSE error events: wrap ctx.agent.stream() in try/except inside the
  generator; on exception yield `event: error\ndata: {...}` and log,
  then close the span in finally. Previously the stream silently stopped
  and the client UI hung indefinitely.

- MCP auth forwarding: mcp_sse handler captures the incoming Authorization
  header onto app.state.mcp_auth_header; _call_tool reads it and forwards
  it when making ASGI requests to tool routes, giving MCP clients the same
  OBO token context as REST callers.

- app_predict_fn docstring: fix import path from `apx.agent` (wrong) to
  `{{app_slug}}.backend.core.agent` (correct rendered module path).

- LlmAgent.stream(): add comment clarifying that streaming is simulated
  (full response then chunked) because FMAPI lacks per-token streaming.
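
Wrapping the stream generator so clients receive an explicit error event instead of a silent stop looks roughly like this (a sketch, not the APX source; span handling omitted):

```python
import json

def error_safe_sse(inner):
    """Yield SSE frames from `inner`; on exception, emit an `error` event and stop."""
    def gen():
        try:
            for chunk in inner:
                yield f"event: output_text.delta\ndata: {json.dumps({'delta': chunk})}\n\n"
        except Exception as exc:
            # Previously the stream just ended here and the client UI hung.
            yield f"event: error\ndata: {json.dumps({'message': str(exc)})}\n\n"
    return gen()

def flaky():
    yield "partial "
    raise RuntimeError("FMAPI call failed")

frames = list(error_safe_sse(flaky()))
print(frames[-1])  # the last frame is an explicit error event
```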
…xt window

- LoopAgent: runs an LlmAgent in a loop until the finish_loop() tool is called or
  max_iterations is reached; finish_loop registered as a real ASGI route so it
  shares the same dispatch path as all other tools

- Tool hooks: before_tool(name, args) and after_tool(name, args, result) on
  LlmAgent; sync and async callables both accepted; fire around every tool
  dispatch in _run_llm_loop

- Guardrails: input_guardrails and output_guardrails on LlmAgent; each is a
  list of callables returning None (pass) or str (short-circuit with that text);
  applied in LlmAgent.run() and stream() before/after _run_llm_loop

- custom_outputs: set_custom_output(request, key, value) helper lets tool
  functions surface structured data alongside text; _handle_invocation
  initialises request.state.custom_outputs and includes it in
  InvocationResponse.custom_outputs; SSE path emits a custom_outputs event

- Context window management: context_window_tokens on LlmAgent; _maybe_trim_context
  estimates token usage (4 chars/token), keeps system messages + last 2 messages
  intact, and summarises the middle with a single LLM call when budget exceeded

- Type aliases: BeforeToolHook, AfterToolHook, InputGuardrailFn, OutputGuardrailFn
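
The budget check described above (≈4 chars/token, keep system messages and the last two turns, summarise the middle with one LLM call) can be sketched with the summariser stubbed out; names are illustrative:

```python
def estimate_tokens(messages: list[dict]) -> int:
    # Rough heuristic from the description: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_trim_context(messages, budget_tokens, summarize):
    """If over budget, replace the middle with a single summary message."""
    if estimate_tokens(messages) <= budget_tokens:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    middle, tail = rest[:-2], rest[-2:]   # keep the last 2 messages intact
    if not middle:
        return messages
    summary = {"role": "system", "content": summarize(middle)}  # one extra LLM call
    return system + [summary] + tail

msgs = [{"role": "system", "content": "be brief"}] + [
    {"role": "user", "content": "x" * 400} for _ in range(5)
]
trimmed = maybe_trim_context(msgs, budget_tokens=100,
                             summarize=lambda ms: f"[summary of {len(ms)} messages]")
print(len(trimmed))  # system + summary + last 2 messages
```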
- RouterAgent: routes to one of several named sub-agents via a single upfront
  FMAPI call with synthetic transfer_to_<name> tools; no routes registered —
  the Python layer intercepts tool_calls directly; falls back to first agent
  when LLM does not call a transfer tool

- HandoffAgent: agents dict + start key; each active agent receives
  transfer_to_<name> tools for every other agent injected into its own tool
  list; transfer routes registered as real ASGI endpoints (same signal
  pattern as LoopAgent.finish_loop) so FastAPI dep injection is preserved;
  handoff_to is set on request.state and checked after each _run_llm_loop
  call; supports up to max_handoffs transfers before stopping

- _TransferBody Pydantic model shared by HandoffAgent transfer handlers

Both types honour LlmAgent hooks (before_tool/after_tool) and
context_window_tokens of the currently active sub-agent.
Adds a second page to the APX dev tooling namespace: /_apx/tools.

Shows every registered tool exactly as the LLM sees it — dep-injected
parameters stripped, FMAPI inputSchema rendered with syntax highlighting.
An Invoke tab auto-generates a form from the schema and POSTs to the tool's
/api/tools/<name> endpoint (or sub-agent /invocations), displaying the
result with timing.

Both /_apx/agent and /_apx/tools now have a nav bar linking between them.
Also wires the route in get_root_routers() and updates the module docstring.
GET /_apx/probe?url=https://api.example.com makes a server-side GET and
returns HTTP status, latency, content-type, server header, redirect count,
and structured error details (ConnectError, Timeout, SSLError).

Because the request runs from the server process, results reflect the
deployed app's actual network path — useful for diagnosing egress
restrictions on Databricks Apps before writing tool code.

Also adds Probe to the nav bar on /_apx/agent and /_apx/tools.
- Add _build_apx_openapi_spec() — generates an OpenAPI 3.1 spec containing
  only tool endpoints with dep-stripped schemas (what the LLM sees, not what
  FastAPI sees with WorkspaceClient etc.)
- Serve the filtered spec at /_apx/openapi.json
- Replace 350-line hand-rolled tools UI with Scalar CDN embed (kepler theme)
  pointing at /_apx/openapi.json — sidebar, schema display, and try-it panel
  are now best-in-class without maintaining bespoke JS/CSS
- Fix APX nav bar overlaying Scalar, using position:fixed + a ResizeObserver
The Rust dev server nested its own control router at /_apx (for
/health, /logs, /stop), which consumed all /_apx/* traffic and
returned 404 for the Python-side APX dev UI routes.

Add /_apx/agent, /_apx/tools, /_apx/probe, /_apx/openapi.json,
and /invocations as direct routes in api_utils_router. Axum gives
specific .route() registrations priority over .nest() prefix
matches, so these paths now reach the Python backend correctly.
Add /health, /.well-known/agent.json, /mcp/sse, /mcp/messages/
to api_utils_router alongside /invocations and the /_apx/* dev
UI routes added in the previous commit.

All routes registered by the APX agent protocol at app root are
now forwarded to the Python backend through the dev proxy.
Each assistant turn now shows a collapsible 'tool calls' row below
the response. For each tool: function name, ok/error badge, latency,
input args, and result (truncated at 800 chars).

Backend: _run_llm_loop writes a tool_trace entry (name, args, result,
ms) to request.state after each dispatch. _sse_generator emits a
'tool.trace' SSE event after the text stream completes.

Frontend: buildTraceEl() renders the trace as a <details> block.
Error results are highlighted red. Multiple errors show a count in
the summary line.
UV_NATIVE_TLS=1 in the project .env was not being seen by the uv
subprocess during preflight because .env is not loaded into the
process environment at that point.

Add resolve_native_tls(app_dir) which checks the shell env first,
then reads the project .env as fallback. Pass the result as the
native_tls flag to Uv::sync() and Uv::tool_run(), which append
--native-tls when true.

This fixes apx dev start on corporate networks with SSL inspection
where uv-dynamic-versioning fetches were returning 503.
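
The env-first, .env-fallback lookup lives in the Rust CLI; the same logic sketched in Python (a hypothetical equivalent, not the shipped code):

```python
import os
import tempfile
from pathlib import Path

def resolve_native_tls(app_dir: str) -> bool:
    """Shell environment wins; fall back to the project's .env file."""
    value = os.environ.get("UV_NATIVE_TLS")
    if value is None:
        env_file = Path(app_dir) / ".env"
        if env_file.exists():
            for line in env_file.read_text().splitlines():
                key, _, val = line.partition("=")
                if key.strip() == "UV_NATIVE_TLS":
                    value = val.strip()
                    break
    return value == "1"

with tempfile.TemporaryDirectory() as d:
    (Path(d) / ".env").write_text("UV_NATIVE_TLS=1\n")
    os.environ.pop("UV_NATIVE_TLS", None)
    print(resolve_native_tls(d))  # → True (read from .env)
```

When the resulting flag is true, --native-tls is appended to the uv invocations, letting uv trust the OS certificate store on networks with SSL inspection.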
_dispatch_tool_call only forwarded Authorization when dispatching tool
calls via ASGI to /api/tools/<fn>. This meant X-Forwarded-Access-Token
(injected by the dev proxy) was lost, causing OBO auth to fail on any
tool that needs Dependencies.UserClient (e.g. Lakebase queries).

Also fixes a JS syntax error in the /_apx/agent chat UI where a Python
\n in an f-string produced a literal newline inside a JS string literal.

Co-authored-by: Isaac
Replace verbose prose with an ASCII architecture diagram and a
feature table. Keeps the code example showing tool definition.

Co-authored-by: Isaac