feat(subagent): distillation telemetry + route force-summary content to output #974
Merged
Conversation
added 5 commits on May 15, 2026 06:41
The bundled libwayland-client.so.0 inside the AppImage was built on Ubuntu 22.04 CI and ABI-mismatches the host Wayland compositor on distros that ship a different libwayland (openSUSE Tumbleweed, Fedora, Arch on some weeks). The mismatch makes WebKitWebProcess::eglGetDisplay() return EGL_BAD_PARAMETER and the process aborts with a short stack before the first paint — exactly the SIGABRT the AppImage user in #892 still hits even with #895's DMABUF disable in place.

WebKitWebProcess is fork+exec'd from the parent and inherits the env, so setting LD_PRELOAD to the host's libwayland in main() redirects just the child's dynamic linker without disturbing the already-loaded parent. Gate it on APPDIR + WAYLAND_DISPLAY so distro packages, debug builds, and X11 sessions are untouched. Skip the bare /usr/lib/libwayland-client.so.0 path — on 64-bit Fedora it can be a 32-bit library and the loader rejects it with an ELF-class warning.

Pattern confirmed by Tolaria, yaak, nym-vpn-client and the wider Tauri AppImage user base (tauri-apps/tauri#11988, gitbutlerapp/gitbutler#5282).
When CacheFirstLoop's storm-breaker or context-guard fires inside a
spawned child loop, it emits an assistant_final event tagged
forcedSummary:true carrying the partial synthesis the model managed
to produce. spawnSubagent was routing that text into errorMessage
and zeroing output, so the parent loop saw a "failed" spawn with no
content even though the child had written a usable partial answer.
Discriminate the two forcedSummary paths on parentSignal.aborted:
the user-abort path (loop.ts ~670) still routes the UX placeholder
to errorMessage; the storm-breaker / context-guard path now
populates final, sets a new forcedSummary flag on SubagentResult,
and success stays false so callers can tell partial from complete.
formatSubagentResult gains a forcedSummary branch that emits
{ success: false, partial: true, output, note } so the parent
agent sees the content with a clear "this is partial" marker.
paused still takes precedence when both are set.
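In sketch form, the discrimination described above might look like the following. This is a hypothetical standalone helper, not the actual spawnSubagent code: field names follow the commit message, and the child loop's `final` text is collapsed into `output` here for brevity.

```typescript
// Sketch of the two forcedSummary routing paths, assuming the shapes
// named in the commit message (SubagentResult, parentSignal.aborted).
interface SubagentResult {
  success: boolean;
  forcedSummary?: boolean;
  output?: string;
  errorMessage?: string;
}

function routeForcedSummary(
  summaryText: string,
  parentAborted: boolean,
): SubagentResult {
  if (parentAborted) {
    // User-abort path (loop.ts ~670): the UX placeholder still goes
    // to errorMessage, as before.
    return { success: false, errorMessage: summaryText };
  }
  // Storm-breaker / context-guard path: the partial synthesis is real
  // content, so it goes to output with the forcedSummary flag set.
  // success stays false so callers can tell partial from complete.
  return { success: false, forcedSummary: true, output: summaryText };
}
```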
Empirical validation against the live API: storm-summary spawns
now ship 286-391 tokens of partial answer in output instead of
stranding it in error. A 12-turn × 3-repeat session probe with
this + the upcoming distillation telemetry wiring shows median
SUB cost moving from +24% (loss) vs FLAT pre-patch to -13% (win)
post-patch.
Reasonix has measured cache hit rate per session since Pillar 1
landed, but the other half of the economic story — how much
parent-log growth `spawn_subagent` avoided — has been invisible.
This adds the per-spawn + per-session telemetry to make it
inspectable, mirroring the cache-rate surface.
New module `src/telemetry/subagent-distillation.ts`:

- `SpawnDistillation` (per-spawn): completionTokens, outputTokens, savingsTokens, compressionRatio, hasOutput, costUsd, paused
- `computeSpawnDistillation`: derives the shape from a SubagentResult-like input (structural, not nominal — avoids a stats ↔ subagent ↔ loop import cycle if SessionStats later wants in)
- `SubagentSessionSummary` (aggregate): spawnCount, usefulSpawnCount, pausedSpawnCount, successRate, total*, aggregateCompressionRatio (completion-weighted, not naive mean), totalCostUsd
- `summarizeSubagentSession`: aggregates per-spawn distillations
- `countSpawnStorms`: counts turns with ≥ threshold spawns
- `SubagentTelemetry`: live collector — `record` is pre-bound for ergonomic use as a callback; `startTurn(n)` groups records into turn buckets so storm detection is meaningful
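The completion-weighted aggregate can be sketched as follows. This is an illustrative standalone helper, not the module's actual export shape: `SpawnDistillation` is trimmed to the two fields the ratio needs.

```typescript
// Illustrative sketch of the completion-weighted aggregate; stands in
// for the corresponding part of summarizeSubagentSession.
interface SpawnDistillation {
  completionTokens: number; // tokens the child model generated in total
  outputTokens: number;     // tokens actually forwarded to the parent
}

function aggregateCompressionRatio(spawns: SpawnDistillation[]): number {
  const totalCompletion = spawns.reduce((s, d) => s + d.completionTokens, 0);
  const totalOutput = spawns.reduce((s, d) => s + d.outputTokens, 0);
  // Completion-weighted: equivalent to weighting each spawn's
  // outputTokens / completionTokens ratio by its completionTokens.
  // Guard against the zero-completion case.
  return totalCompletion === 0 ? 0 : totalOutput / totalCompletion;
}
```

With spawns of 10/20 and 50/1000 tokens, a naive mean of per-spawn ratios would give 0.275, while the completion-weighted aggregate gives 60/1020 ≈ 0.059, so a handful of tiny spawns cannot mask what happened to the bulk of the generated tokens.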
`registerSubagentTool` gains one optional `onSpawnComplete` hook
fired after `spawnSubagent` returns (errors are swallowed so
telemetry can't break the spawn-tool dispatch). Pattern:
```typescript
const telemetry = new SubagentTelemetry();
registerSubagentTool(registry, {
  client,
  onSpawnComplete: telemetry.record,
});
// ... agent runs ...
console.log(telemetry.summary);
console.log(telemetry.stormCount());
```
No behavior change to the parent loop, no SessionStats touch, no
TUI surface yet — wiring is the caller's responsibility for now.
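The error-swallowing behavior around the hook can be sketched like this; the helper name and shape are hypothetical, not quoted from registerSubagentTool's internals.

```typescript
// Hypothetical sketch of the hook isolation described above.
type SpawnHook = (result: unknown) => void;

function fireSpawnHook(hook: SpawnHook | undefined, result: unknown): void {
  if (!hook) return;
  try {
    hook(result);
  } catch {
    // Swallowed on purpose: a throwing telemetry callback must never
    // break the spawn-tool dispatch path.
  }
}
```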
Index re-exports the public surface; deep-import path also works.
Tests: 16 cases on the telemetry module + 2 on the spawn hook,
covering compression edge cases, empty/whitespace output detection,
zero-completion guard, completion-weighted aggregation, storm
counting, collector lifecycle, and onSpawnComplete error isolation.
End-to-end validation against the live DeepSeek API (12-turn ×
3-repeat session with this + the prior forcedSummary routing fix)
showed median SUB session cost moving from +24% vs FLAT pre-patch
to −13% post-patch; useful-spawn rate climbed from 54% to 73%.
Eight standalone probe scripts under scripts/. Each loads .env and
hits the live DeepSeek API; none are wired into the bench harness
or any CI. Purpose: reproduce + extend the measurements behind the
sub-agent distillation work.
- probe-orchestrator-cache.mjs (R1): raw-API cache rate, FLAT vs ORCH messages[] topology
- probe-subagent-distillation.mts (R2): read-heavy spawns, per-spawn compression on 3 representative tasks
- probe-subagent-write-heavy.mts (R3a): negative case — spawns whose deliverable IS the artifact
- probe-session-e2e.mts (R4): 6-turn single-run E2E
- probe-session-e2e-long.mts (R4-long): 12-turn × 3-repeat E2E, pre-patch baseline
- probe-empty-output-diagnosis.mts (R5): diagnoses the 46% empty-output rate at three budget caps
- probe-session-e2e-post-patch.mts (R6): same script as R4-long, runs SubagentTelemetry + the forcedSummary fix end-to-end
- probe-budget-sweep.mts (R7): 5 tasks × 5 budgets sweep to locate the right `maxToolIters` default
All scripts are idempotent reads of ./src — none mutate state.
Costs ~$1.20 of DeepSeek spend to run end-to-end on a fresh clone.
Run individually via `npx tsx scripts/probe-*.mts` /
`node scripts/probe-orchestrator-cache.mjs`.
Two internal docs under docs/rfcs/:
0001-multi-context-orchestrator.md
The opinionated RFC. Frames the design space (cache-economics
vs distillation vs reliability), defines roles, sketches the
orchestrator topology proposal that started this investigation.
0001-multi-context-orchestrator.issue.md
A draft GitHub issue body, evidence-first. Walks through 7
rounds of live-API probes: cache-rate falsification (R1),
read-heavy distillation (R2, ~6% compression), write-heavy
negative case (R3a, ~85%), break-even arithmetic (R3b),
full FLAT-vs-SUB session pre-patch (R4-long, SUB +24% loss,
46% empty-output rate), empty-output diagnosis (R5, three
distinct mechanisms), post-patch validation (R6, median SUB
-13% win), and budget sweep (R7, keep DEFAULT_PAUSE_EVERY=16).
The issue draft closes with a four-issue split for the upstream:
1. Expose distillation + reliability metrics (this branch — landed)
2. Route forcedSummary → output (this branch — landed)
3. Tune default maxToolIters (resolved as no-change)
4. Sub-context topology (conditional unblock)
Total spend across the seven rounds: ~$1.20 of live DeepSeek API.
What

Two paired changes to make `spawn_subagent` measurable end-to-end:

1. `SubagentTelemetry` collector. Wire one pre-bound `onSpawnComplete` callback into `registerSubagentTool` and every spawn auto-populates per-spawn `SpawnDistillation` records plus a session-level `SubagentSessionSummary` (compression ratio, useful-spawn rate, spawn-storm count). No behavior change to the parent loop; no `SessionStats` touch.
2. Route forced-summary content to `output`, not `error`. Discriminate the two `forcedSummary` paths on `parentSignal.aborted`: the user-abort placeholder still routes to `error`, but storm-breaker / context-guard partials populate `final` and set a new `SubagentResult.forcedSummary` flag. `formatSubagentResult` gets a `{ success: false, partial: true, output, note }` branch so the parent agent sees the partial answer with a clear marker.

Numbers
Seven rounds of live-API probes (`scripts/probe-*`, ~$1.20 of DeepSeek spend). Pre vs post on the same 12-turn × 3-repeat read-heavy script: median SUB session cost moved from +24% vs FLAT to −13%, a 37-point swing in the expected cost direction on identical workload. Variance is still real (one of three post-patch runs still cost +14.6%) — exposing the metric is partly so users can see that when it happens.
The 46% empty-output rate from the pre-patch baseline decomposed cleanly in Round 5 into three distinct mechanisms:

- one tied to `resume_session` (separate issue)
- forced-summary content stranded in `error` (fixed here)
- `read_file` truncation (probe artifact, not real-Reasonix)

Full investigation walk-through in docs/rfcs/0001-multi-context-orchestrator.issue.md.

Wiring (caller-side)
Commits
Non-goals
- No `CacheFirstLoop.stats` touch — the `SubagentResultLike` structural type is exported so rolling the collector into `SessionStats` later stays cycle-free, but doing it now would widen this PR.
- No `maxToolIters` bump — Round 7 swept 4/8/16/24/32 and `DEFAULT_PAUSE_EVERY = 16` is the empirical knee. No change.

Test plan
- `npm run lint` clean (biome on `src tests`)
- `npm run typecheck` clean
- 16 cases in `tests/subagent-distillation.test.ts`, 6 in `tests/subagent.test.ts` (2 for `onSpawnComplete`, 4 for the `forcedSummary` formatter branch)
- Live E2E (`scripts/probe-session-e2e-post-patch.mts`): median 12-turn session moved from SUB +24% to SUB −13%
- New `SubagentResultLike` type — that's the door open for `SessionStats` integration; comments welcome on whether the shape is right