feat(subagent): distillation telemetry + route force-summary content to output #974
Merged
Conversation
added 5 commits on May 15, 2026 06:41
The bundled libwayland-client.so.0 inside the AppImage was built on Ubuntu 22.04 CI and ABI-mismatches the host Wayland compositor on distros that ship a different libwayland (openSUSE Tumbleweed, Fedora, Arch on some weeks). The mismatch makes WebKitWebProcess::eglGetDisplay() return EGL_BAD_PARAMETER and the process aborts with a short stack before the first paint — exactly the SIGABRT the AppImage user in #892 still hits even with #895's DMABUF disable in place.

WebKitWebProcess is fork+exec'd from the parent and inherits the env, so setting LD_PRELOAD to the host's libwayland in main() redirects just the child's dynamic linker without disturbing the already-loaded parent. Gate it on APPDIR + WAYLAND_DISPLAY so distro packages, debug builds, and X11 sessions are untouched. Skip the bare /usr/lib/libwayland-client.so.0 path — on 64-bit Fedora it can be a 32-bit library and the loader rejects it with an ELF-class warning.

Pattern confirmed by Tolaria, yaak, nym-vpn-client and the wider Tauri AppImage user base (tauri-apps/tauri#11988, gitbutlerapp/gitbutler#5282).
When CacheFirstLoop's storm-breaker or context-guard fires inside a
spawned child loop, it emits an assistant_final event tagged
forcedSummary:true carrying the partial synthesis the model managed
to produce. spawnSubagent was routing that text into errorMessage
and zeroing output, so the parent loop saw a "failed" spawn with no
content even though the child had written a usable partial answer.
Discriminate the two forcedSummary paths on parentSignal.aborted:
the user-abort path (loop.ts ~670) still routes the UX placeholder
to errorMessage; the storm-breaker / context-guard path now
populates final, sets a new forcedSummary flag on SubagentResult,
and success stays false so callers can tell partial from complete.
formatSubagentResult gains a forcedSummary branch that emits
{ success: false, partial: true, output, note } so the parent
agent sees the content with a clear "this is partial" marker.
paused still takes precedence when both are set.
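In sketch form, the discrimination described above might look like the following. This is a hypothetical standalone helper, not the actual spawnSubagent code: field names follow the commit message, and the child loop's `final` text is collapsed into `output` here for brevity.

```typescript
// Sketch of the two forcedSummary routing paths, assuming the shapes
// named in the commit message (SubagentResult, parentSignal.aborted).
interface SubagentResult {
  success: boolean;
  forcedSummary?: boolean;
  output?: string;
  errorMessage?: string;
}

function routeForcedSummary(
  summaryText: string,
  parentAborted: boolean,
): SubagentResult {
  if (parentAborted) {
    // User-abort path (loop.ts ~670): the UX placeholder still goes
    // to errorMessage, as before.
    return { success: false, errorMessage: summaryText };
  }
  // Storm-breaker / context-guard path: the partial synthesis is real
  // content, so it goes to output with the forcedSummary flag set.
  // success stays false so callers can tell partial from complete.
  return { success: false, forcedSummary: true, output: summaryText };
}
```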
Empirical validation against the live API: storm-summary spawns
now ship 286-391 tokens of partial answer in output instead of
stranding it in error. A 12-turn × 3-repeat session probe with
this + the upcoming distillation telemetry wiring shows median
SUB cost moving from +24% (loss) vs FLAT pre-patch to -13% (win)
post-patch.
Reasonix has measured cache hit rate per session since Pillar 1
landed, but the other half of the economic story — how much
parent-log growth `spawn_subagent` avoided — has been invisible.
This adds the per-spawn + per-session telemetry to make it
inspectable, mirroring the cache-rate surface.
New module `src/telemetry/subagent-distillation.ts`:

- `SpawnDistillation` (per-spawn): completionTokens, outputTokens, savingsTokens, compressionRatio, hasOutput, costUsd, paused
- `computeSpawnDistillation`: derives the shape from a SubagentResult-like input (structural, not nominal — avoids a stats ↔ subagent ↔ loop import cycle if SessionStats later wants in)
- `SubagentSessionSummary` (aggregate): spawnCount, usefulSpawnCount, pausedSpawnCount, successRate, total*, aggregateCompressionRatio (completion-weighted, not naive mean), totalCostUsd
- `summarizeSubagentSession`: aggregates per-spawn distillations
- `countSpawnStorms`: counts turns with ≥ threshold spawns
- `SubagentTelemetry`: live collector — `record` is pre-bound for ergonomic use as a callback; `startTurn(n)` groups records into turn buckets so storm detection is meaningful
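The completion-weighted aggregate can be sketched as follows. This is an illustrative standalone helper, not the module's actual export shape: `SpawnDistillation` is trimmed to the two fields the ratio needs.

```typescript
// Illustrative sketch of the completion-weighted aggregate; stands in
// for the corresponding part of summarizeSubagentSession.
interface SpawnDistillation {
  completionTokens: number; // tokens the child model generated in total
  outputTokens: number;     // tokens actually forwarded to the parent
}

function aggregateCompressionRatio(spawns: SpawnDistillation[]): number {
  const totalCompletion = spawns.reduce((s, d) => s + d.completionTokens, 0);
  const totalOutput = spawns.reduce((s, d) => s + d.outputTokens, 0);
  // Completion-weighted: equivalent to weighting each spawn's
  // outputTokens / completionTokens ratio by its completionTokens.
  // Guard against the zero-completion case.
  return totalCompletion === 0 ? 0 : totalOutput / totalCompletion;
}
```

With spawns of 10/20 and 50/1000 tokens, a naive mean of per-spawn ratios would give 0.275, while the completion-weighted aggregate gives 60/1020 ≈ 0.059, so a handful of tiny spawns cannot mask what happened to the bulk of the generated tokens.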
`registerSubagentTool` gains one optional `onSpawnComplete` hook
fired after `spawnSubagent` returns (errors are swallowed so
telemetry can't break the spawn-tool dispatch). Pattern:
```typescript
const telemetry = new SubagentTelemetry();
registerSubagentTool(registry, {
  client,
  onSpawnComplete: telemetry.record,
});
// ... agent runs ...
console.log(telemetry.summary);
console.log(telemetry.stormCount());
```
No behavior change to the parent loop, no SessionStats touch, no
TUI surface yet — wiring is the caller's responsibility for now.
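The error-swallowing behavior around the hook can be sketched like this; the helper name and shape are hypothetical, not quoted from registerSubagentTool's internals.

```typescript
// Hypothetical sketch of the hook isolation described above.
type SpawnHook = (result: unknown) => void;

function fireSpawnHook(hook: SpawnHook | undefined, result: unknown): void {
  if (!hook) return;
  try {
    hook(result);
  } catch {
    // Swallowed on purpose: a throwing telemetry callback must never
    // break the spawn-tool dispatch path.
  }
}
```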
Index re-exports the public surface; deep-import path also works.
Tests: 16 cases on the telemetry module + 2 on the spawn hook,
covering compression edge cases, empty/whitespace output detection,
zero-completion guard, completion-weighted aggregation, storm
counting, collector lifecycle, and onSpawnComplete error isolation.
End-to-end validation against the live DeepSeek API (12-turn ×
3-repeat session with this + the prior forcedSummary routing fix)
showed median SUB session cost moving from +24% vs FLAT pre-patch
to −13% post-patch; useful-spawn rate climbed from 54% to 73%.
Eight standalone probe scripts under scripts/. Each loads .env and
hits the live DeepSeek API; none are wired into the bench harness
or any CI. Purpose: reproduce + extend the measurements behind the
sub-agent distillation work.
- probe-orchestrator-cache.mjs (R1): raw-API cache rate, FLAT vs ORCH messages[] topology
- probe-subagent-distillation.mts (R2): read-heavy spawns, per-spawn compression on 3 representative tasks
- probe-subagent-write-heavy.mts (R3a): negative case — spawns whose deliverable IS the artifact
- probe-session-e2e.mts (R4): 6-turn single-run E2E
- probe-session-e2e-long.mts (R4-long): 12-turn × 3-repeat E2E, pre-patch baseline
- probe-empty-output-diagnosis.mts (R5): diagnoses the 46% empty-output rate at three budget caps
- probe-session-e2e-post-patch.mts (R6): same script as R4-long, runs SubagentTelemetry + the forcedSummary fix end-to-end
- probe-budget-sweep.mts (R7): 5 tasks × 5 budgets sweep to locate the right `maxToolIters` default
All scripts are idempotent reads of ./src — none mutate state.
Costs ~$1.20 of DeepSeek spend to run end-to-end on a fresh clone.
Run individually via `npx tsx scripts/probe-*.mts` /
`node scripts/probe-orchestrator-cache.mjs`.
Two internal docs under docs/rfcs/:
0001-multi-context-orchestrator.md
The opinionated RFC. Frames the design space (cache-economics
vs distillation vs reliability), defines roles, sketches the
orchestrator topology proposal that started this investigation.
0001-multi-context-orchestrator.issue.md
A draft GitHub issue body, evidence-first. Walks through 7
rounds of live-API probes: cache-rate falsification (R1),
read-heavy distillation (R2, ~6% compression), write-heavy
negative case (R3a, ~85%), break-even arithmetic (R3b),
full FLAT-vs-SUB session pre-patch (R4-long, SUB +24% loss,
46% empty-output rate), empty-output diagnosis (R5, three
distinct mechanisms), post-patch validation (R6, median SUB
-13% win), and budget sweep (R7, keep DEFAULT_PAUSE_EVERY=16).
The issue draft closes with a four-issue split for the upstream:
1. Expose distillation + reliability metrics (this branch — landed)
2. Route forcedSummary → output (this branch — landed)
3. Tune default maxToolIters (resolved as no-change)
4. Sub-context topology (conditional unblock)
Total spend across the seven rounds: ~$1.20 of live DeepSeek API.
What

Two paired changes to make `spawn_subagent` measurable end-to-end:

1. `SubagentTelemetry` collector. Wire one pre-bound `onSpawnComplete` callback into `registerSubagentTool` and every spawn auto-populates per-spawn `SpawnDistillation` records plus a session-level `SubagentSessionSummary` (compression ratio, useful-spawn rate, spawn-storm count). No behavior change to the parent loop; no `SessionStats` touch.
2. Route forced-summary content to `output`, not `error`. Discriminate the two `forcedSummary` paths on `parentSignal.aborted`: the user-abort placeholder still routes to `error`, but storm-breaker / context-guard partials populate `final` and set a new `SubagentResult.forcedSummary` flag. `formatSubagentResult` gets a `{ success: false, partial: true, output, note }` branch so the parent agent sees the partial answer with a clear marker.

Numbers
Seven rounds of live-API probes (`scripts/probe-*`, ~$1.20 of DeepSeek spend). Pre vs post on the same 12-turn × 3-repeat read-heavy script: median SUB session cost moved from +24% vs FLAT to −13%, a 37-point swing in the expected cost direction on identical workload. Variance is still real (one of three post-patch runs still cost +14.6%) — exposing the metric is partly so users can see that when it happens.
The 46% empty-output rate from the pre-patch baseline decomposed cleanly in Round 5 into three distinct mechanisms:

- one tied to `resume_session` (separate issue)
- forced-summary content stranded in `error` (fixed here)
- `read_file` truncation (probe artifact, not real-Reasonix)

Full investigation walk-through in docs/rfcs/0001-multi-context-orchestrator.issue.md.

Wiring (caller-side)
Commits
Non-goals
- No `CacheFirstLoop.stats` touch — the `SubagentResultLike` structural type is exported so rolling the collector into `SessionStats` later stays cycle-free, but doing it now would widen this PR.
- No `maxToolIters` bump — Round 7 swept 4/8/16/24/32 and `DEFAULT_PAUSE_EVERY = 16` is the empirical knee. No change.

Test plan
- `npm run lint` clean (biome on `src tests`)
- `npm run typecheck` clean
- 16 cases in `tests/subagent-distillation.test.ts`, 6 in `tests/subagent.test.ts` (2 for `onSpawnComplete`, 4 for the `forcedSummary` formatter branch)
- Live E2E (`scripts/probe-session-e2e-post-patch.mts`): median 12-turn session moved from SUB +24% to SUB −13%
- New `SubagentResultLike` type — that's the door open for `SessionStats` integration; comments welcome on whether the shape is right