Skip to content

perf: bypass HTTP self-call from /run to /v1/responses#1439

Open
ananthsub wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ananthsub:inprocess-self-call-perf
Open

perf: bypass HTTP self-call from /run to /v1/responses#1439
ananthsub wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ananthsub:inprocess-self-call-perf

Conversation

@ananthsub
Copy link
Copy Markdown
Contributor

Each agent's /run handler currently makes an HTTP self-call to its own /v1/responses endpoint. This adds extra aiohttp round-trip + FastAPI middleware + pydantic re-validation on every rollout.

This PR converts all production agents that still do this to call the implementation in-process.

To keep backwards compatibility, the refactor extracts the existing responses() body into a private helper. The public responses(request, response, body) method keeps its FastAPI signature.

Microbenchmarks:

  • Model inference latency still dominates any of this framework overhead
  • With a real OpenAI endpoint for 4.1o, the absolute per-rollout saved ~7% rps on simple agent
  • the speedup is more pronounced for agents like proof_refinement_agent which have more HTTP hops as part of the implementation
agent http requests/second inprocess requests/second speedup
simple_agent 2.95 3.17 +7.5%
proof_refinement_agent 1.18 1.34 +13.3%

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ananthsub ananthsub force-pushed the inprocess-self-call-perf branch from 9dc0d1c to ed66904 Compare May 28, 2026 07:22
…sponses

Each agent's `/run` handler previously made an HTTP self-call to its own
`/v1/responses` endpoint, paying for aiohttp round-trip, FastAPI
middleware, and pydantic re-validation on every rollout. This converts
all production agents that still did this to call the implementation
in-process.

The refactor extracts the existing `responses()` body into a private
`_responses(body, cookies) -> (response, set_cookies)` helper. The
public `responses(request, response, body)` method keeps its FastAPI
signature and is now a 4-line adapter that delegates to `_responses`,
so external HTTP callers see no change. `/run` replaces the 7-line
self-call block with two lines that call `_responses` directly.

Agents converted: simple_agent, proof_refinement_agent,
non_executing_simple_agent, speed_bench_agent, cvdp_agent,
finance_agent, browsecomp_agent, hermes_agent, claude_code_agent, and
the langgraph_agent base + 4 subclasses (rewoo, orchestrator,
parallel_thinking, reflection). swe_agents, aviary_agent,
verifiers_agent, and stirrup_agent were already on this pattern.

Microbenchmark shows ~0.7-1.2 ms framework overhead per self-call at
concurrency=1, scaling to a ~70-150x rps multiplier in the
LLM-overhead-isolated case. End-to-end with a mock model: simple_agent
+25% rps / 20% wall-time reduction; proof_refinement_agent (3
self-calls per rollout) up to +2x rps / 50% wall-time reduction at
concurrency=256. With a real OpenAI model in the loop the absolute
per-rollout savings carry over (~7% rps on simple, ~13% on pref) and
the HTTP path also hit FD exhaustion at concurrency=1024 where the
in-process path did not.

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub ananthsub force-pushed the inprocess-self-call-perf branch from ed66904 to 75473cb Compare May 28, 2026 07:36
@ananthsub ananthsub marked this pull request as ready for review May 28, 2026 12:54
@cmunley1
Copy link
Copy Markdown
Contributor

Did you run training or benchmarks to confirm no accuracy or convergence degradation? Seems fine to me otherwise, although the logic around cookies seems to change a bit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants