[codex] Extract root upstream runtime by andreasronge · Pull Request #1028 · andreasronge/ptc_runner

andreasronge · 2026-05-28T14:21:36Z

Summary

This PR implements docs/plans/root-upstream-runtime.md by moving upstream tool execution out of ptc_runner_mcp and into the root ptc_runner library.

Key changes:

Adds PtcRunner.Upstream.* runtime modules for config parsing, credentials, transport-neutral results, catalog rendering, discovery, run contexts, collectors, OpenAPI, MCP stdio, and MCP HTTP transports.
Extends mix ptc.repl with --upstreams-config / PTC_RUNNER_UPSTREAMS, upstream call limits, catalog limits, --catalog-mode, and --catalog-snapshot-mode.
Rewires ptc_runner_mcp through PtcRunnerMcp.RootUpstreamRuntime and deletes the duplicated MCP-owned upstream registry/connection/transport stack.
Preserves MCP-specific response shaping, sessions, tracing, debug tooling, and agentic tools at the MCP boundary.
Adds defense-in-depth redaction for root upstream credentials across MCP trace/debug/session/log/agentic paths and scrubs catalog/discovery output from authenticated upstreams.
Restores agentic side-effect accounting by recording root upstream attempts before dispatch and preserving compact successful result summaries.
Documents the shared root runtime in docs/upstream-runtime.md and refreshes aggregator/MCP docs for frozen vs live catalog behavior.
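
As a rough illustration of the defense-in-depth redaction mentioned in the list above, the Python sketch below scrubs known credential values out of a nested trace or catalog payload. It is not the PR's Elixir implementation: the `scrub` function, the `[REDACTED]` placeholder, and the payload shape are all assumptions made for this example.

```python
from typing import Any, Iterable

REDACTED = "[REDACTED]"  # hypothetical placeholder, not taken from the PR


def scrub(value: Any, secrets: Iterable[str]) -> Any:
    """Return a copy of value with every known secret substring replaced."""
    # Longest first, so a secret that contains another secret is replaced whole.
    ordered = sorted({s for s in secrets if s}, key=len, reverse=True)

    def walk(node: Any) -> Any:
        if isinstance(node, str):
            for secret in ordered:
                node = node.replace(secret, REDACTED)
            return node
        if isinstance(node, dict):
            # Keys are scrubbed too, in case a credential leaks into a key.
            return {walk(k): walk(v) for k, v in node.items()}
        if isinstance(node, (list, tuple)):
            return type(node)(walk(item) for item in node)
        return node  # numbers, booleans, None pass through unchanged

    return walk(value)


# Hypothetical trace payload from an authenticated upstream.
trace = {
    "upstream": "files",
    "headers": {"authorization": "Bearer s3cr3t-token"},
    "log": ["GET /items?key=s3cr3t-token", 200],
}
clean = scrub(trace, ["s3cr3t-token"])
```

Scrubbing by known secret value rather than by field name is one way to catch credentials that end up in unexpected places such as log lines or error messages; whether the root runtime does it this way is not stated here.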

Why

Upstream tools are useful outside the MCP server. The old implementation kept the runtime, config, transports, catalog, and discovery machinery inside mcp_server, which meant local REPL and embedded root callers could not use upstreams without going through an MCP server process.

This extraction makes upstreams a root library feature while keeping MCP server behavior as an integration layer.

Review Notes

Expected large deletion count: this intentionally removes the old PtcRunnerMcp.Upstream.*, AggregatorTools, CatalogBuiltins, UpstreamCalls, and related MCP-local test suites after replacing them with root runtime tests and MCP boundary tests.

Important behavior points to review:

Config transport names are now explicit: openapi, mcp_stdio, mcp_http; old stdio/http names are rejected.
MCP server uses catalog_snapshot_mode: :frozen; root mix ptc.repl defaults to :live and can opt into :frozen.
OpenAPI schemas are still loaded eagerly during runtime startup in both modes.
Catalog/discovery results and MCP trace/debug/session/log paths are scrubbed against root runtime credentials.
Remaining low-level supervision debt is unchanged: MCP client processes are directly linked from the root runtime, not supervised through a per-upstream DynamicSupervisor.
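
To make the transport-name change concrete, here is a minimal Python sketch of the validation rule described above: `openapi`, `mcp_stdio`, and `mcp_http` are accepted, and the old `stdio`/`http` names are rejected. The function name, the error wording, and the legacy-to-new hint mapping are assumptions for illustration and do not mirror the actual `PtcRunner.Upstream.*` code.

```python
ACCEPTED_TRANSPORTS = {"openapi", "mcp_stdio", "mcp_http"}

# Assumed mapping from the rejected legacy names to their likely replacements.
# It is used only to build a friendlier error message in this sketch.
LEGACY_HINTS = {"stdio": "mcp_stdio", "http": "mcp_http"}


def validate_transport(name: str) -> str:
    """Return name if it is an accepted transport, otherwise raise ValueError."""
    if name in ACCEPTED_TRANSPORTS:
        return name
    hint = LEGACY_HINTS.get(name)
    if hint is not None:
        raise ValueError(f"transport {name!r} is no longer accepted; use {hint!r}")
    raise ValueError(f"unknown transport {name!r}")
```

Rejecting the legacy names outright, instead of silently aliasing them, means a stale config fails at load time rather than running against an unintended transport.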

Validation

Local checks passed:

mix precommit
cd mcp_server && mix precommit
pre-commit hook during commit: root + mcp_server format, compile with warnings as errors, Credo, scoped tests
pre-push hook during push:
- root full suite: 5148 tests, 0 failures; Dialyzer passed
- mcp_server full suite: 510 tests, 0 failures; Dialyzer passed
- ptc_viewer full suite: 9 tests, 0 failures

Merge the useful stash wording into the current MCP PTC-Lisp reference while keeping rendered prompt descriptions under their byte budget. Verified with:
- mix test test/ptc_runner/lisp/format_test.exs
- cd mcp_server && mix test test/ptc_runner_mcp/prompt_registry_test.exs test/ptc_runner_mcp/prompt_files_test.exs

Extract upstream tool runtime into the root ptc_runner app and wire ptc_runner_mcp through it. Adds root REPL upstream support, shared transport/config/result handling, redaction bridging, catalog/discovery snapshot modes, and updated docs/tests. Verified with mix precommit and cd mcp_server && mix precommit.

Point real-provider MCP benchmarks at the root upstream runtime and emit explicit mcp_stdio upstream configs after the extraction. Verified with mcp_server mix format --check-formatted, mix compile --warnings-as-errors, and real OpenRouter benchmark runs.

andreasronge · 2026-05-28T14:33:00Z

Real-LLM MCP benchmark run

Ran the real-provider MCP benchmarks on this branch with gemini-flash-lite through OpenRouter (OPENROUTER_API_KEY from .env) and one run per cell.

Commands:

cd mcp_server
mix run --no-start bench/agentic_real_eval.exs \
  --runs=1 \
  --models=gemini-flash-lite \
  --catalog-modes=inline,lazy \
  --json-out=../tmp/agentic_real_eval_root_upstream.json \
  --md-out=../tmp/agentic_real_eval_root_upstream.md \
  --fail-on-skip

mix run --no-start bench/lisp_eval_real_client_eval.exs \
  --runs=1 \
  --models=gemini-flash-lite \
  --profiles=no-upstreams,with-upstreams \
  --catalog-modes=inline,lazy \
  --json-out=../tmp/lisp_eval_real_client_eval_root_upstream.json \
  --md-out=../tmp/lisp_eval_real_client_eval_root_upstream.md \
  --fail-on-skip

mix run --no-start bench/lisp_session_real_client_eval.exs \
  --runs=1 \
  --models=gemini-flash-lite \
  --profiles=no-upstreams,with-upstreams \
  --catalog-modes=inline,lazy \
  --json-out=../tmp/lisp_session_real_client_eval_root_upstream.json \
  --md-out=../tmp/lisp_session_real_client_eval_root_upstream.md \
  --fail-on-skip

Results:

agentic_real_eval: 5/9 cells passed. The four failures were single_read inline, multi_file_reduce inline and lazy, and lazy_catalog_discovery lazy. The first three were blocked by partial_side_effects from unknown filesystem upstream effects; the lazy discovery cell failed because a non-existent list tool was selected.
lisp_eval_real_client_eval: 12/14 cells passed. Failures were context_reduce with upstreams in both inline and lazy modes; the model generated bad context access and exhausted retries.
lisp_session_real_client_eval: 13/14 cells passed. The failed cell was recover_after_error with upstreams/lazy; the final answer was correct (15), but the benchmark expected one errored lisp_session_eval and the model skipped the intentional error path.
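
For a quick overall view, the per-benchmark counts above can be tallied with a few lines of Python. The numbers are copied from the results; the script itself is only an illustrative aid and is not part of the benchmark harness.

```python
# (passed, total) per benchmark, copied from the results above.
results = {
    "agentic_real_eval": (5, 9),
    "lisp_eval_real_client_eval": (12, 14),
    "lisp_session_real_client_eval": (13, 14),
}

passed = sum(p for p, _ in results.values())
total = sum(t for _, t in results.values())

for name, (p, t) in results.items():
    print(f"{name}: {p}/{t} ({p / t:.0%})")
print(f"overall: {passed}/{total} ({passed / total:.0%})")
# overall: 30/37 (81%)
```

With only one run per cell, these rates are a smoke test of the extraction rather than a stable measurement of model quality.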

Notes:

The benchmark harnesses needed a small source update after the root-runtime extraction: they now start PtcRunner.Upstream.Runtime instead of the deleted PtcRunnerMcp.Upstream.Supervisor, and generated upstream configs include explicit transport: "mcp_stdio".
ReqLLM printed the expected "unverified model" warning for openrouter:google/gemini-3.1-flash-lite; calls still completed.
Follow-up verification after the harness fix: pre-commit hook passed for mcp_server; pre-push hook passed root full suite + Dialyzer, MCP full suite + Dialyzer, and ptc_viewer tests.

andreasronge · 2026-05-28T14:40:35Z

Root E2E verification passed on this branch:

mix test --only e2e

Result: 52 tests, 0 failures, 5841 excluded. Runtime: 109.3s. This used the real provider config loaded from the local environment/.env; ReqLLM printed the expected unverified-model warning for openrouter:google/gemini-3.1-flash-lite in the combined compute E2E cases.

andreasronge added 4 commits May 28, 2026 10:19

docs(specs): root upstream runtime

785fe7f

andreasronge marked this pull request as ready for review May 28, 2026 14:41

andreasronge merged commit 8fb9114 into main May 28, 2026
7 checks passed

andreasronge deleted the codex/root-upstream-runtime branch May 28, 2026 14:41
