
feat(subagent): distillation telemetry + route force-summary content to output #974

Merged

esengine merged 5 commits into main from feat/subagent-distillation-telemetry on May 16, 2026

Conversation

@esengine
Owner

What

Two paired changes to make spawn_subagent measurable end-to-end:

  1. New SubagentTelemetry collector. Wire one pre-bound onSpawnComplete callback into registerSubagentTool and every spawn auto-populates per-spawn SpawnDistillation records plus a session-level SubagentSessionSummary (compression ratio, useful-spawn rate, spawn-storm count). No behavior change to the parent loop; no SessionStats touch.
  2. Force-summary content now lands in output, not error. Discriminate the two forcedSummary paths on parentSignal.aborted: the user-abort placeholder still routes to error, but storm-breaker / context-guard partials populate final and set a new SubagentResult.forcedSummary flag. formatSubagentResult gets a { success: false, partial: true, output, note } branch so the parent agent sees the partial answer with a clear marker.
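
The formatter branch described in point 2 can be sketched as below. This is a hedged reconstruction from the PR text, not the actual source: only `SubagentResult.forcedSummary` and the `{ success: false, partial: true, output, note }` shape are stated in the PR; the other field names and the helper signature are assumptions.

```typescript
// Sketch of the formatSubagentResult branch described above.
// Field names beyond forcedSummary and the partial-result shape
// are assumptions, not the real Reasonix types.
interface SubagentResult {
  success: boolean;
  output: string;
  errorMessage?: string;
  forcedSummary?: boolean;
  paused?: boolean;
}

function formatSubagentResult(r: SubagentResult): Record<string, unknown> {
  if (r.paused) {
    // paused still takes precedence when both flags are set
    return { success: false, paused: true, output: r.output };
  }
  if (r.forcedSummary) {
    // storm-breaker / context-guard partial: surface the content
    // with a clear "this is partial" marker
    return {
      success: false,
      partial: true,
      output: r.output,
      note: "forced summary: partial answer, not a complete result",
    };
  }
  return r.success
    ? { success: true, output: r.output }
    : { success: false, error: r.errorMessage ?? "" };
}
```

The point of keeping `success: false` on the partial branch is that callers which only check `success` behave exactly as before; only callers that opt into reading `partial` see the recovered content.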

Numbers

Seven rounds of live-API probes (scripts/probe-*, ~$1.20 of DeepSeek spend). Pre vs post on the same 12-turn × 3-repeat read-heavy script:

                                              Pre-patch     Post-patch
Median SUB session cost vs FLAT               +24% (loss)   −13% (win)
Useful-spawn rate (non-empty output)          54%           73%
Force-summary content recovered into output   0%            9.1%
Spawn storms observed (≥3 spawns / turn)      2             2

A 37-point swing in the expected cost direction on an identical workload. Variance is still real (one of three post-patch runs still cost +14.6%); exposing the metric is partly so users can see it when it happens.

The 46% empty-output rate from the pre-patch baseline decomposed cleanly in Round 5:

  • ~⅓ paused (budget exhaustion; recoverable via resume_session — separate issue)
  • ~⅓ force-summary content stranded in error (fixed here)
  • rest: probe-side read_file truncation (a probe artifact, not a real Reasonix issue)

Full investigation walk-through in docs/rfcs/0001-multi-context-orchestrator.issue.md.

Wiring (caller-side)

const telemetry = new SubagentTelemetry();
registerSubagentTool(registry, {
  client,
  onSpawnComplete: telemetry.record,  // pre-bound
});

// ... agent runs; every spawn_subagent call populates telemetry automatically ...
telemetry.summary;       // SubagentSessionSummary
telemetry.stormCount();  // turns with ≥3 spawns

Commits

70fbeb1  fix(subagent): route forceSummary content to output, not error
fcd103c  feat(telemetry): expose sub-agent distillation as first-class metric
fcb83c3  chore(probes): live-API probes for sub-agent distillation rounds 1-7
32a1108  docs(rfcs): RFC-0001 + issue draft for multi-context distillation

Non-goals

  • TUI cell for the new metric — separate UX-shaped PR.
  • Auto-wiring into CacheFirstLoop.stats. The SubagentResultLike structural type is exported so rolling the collector into SessionStats later stays cycle-free, but doing it now would widen this PR.
  • Default maxToolIters bump — Round 7 swept 4/8/16/24/32 and DEFAULT_PAUSE_EVERY = 16 is the empirical knee. No change.
  • Orchestrator-as-default topology — out of scope; the metric exists so that decision can be data-supported when revisited.

Test plan

  • npm run lint clean (biome on src and tests)
  • npm run typecheck clean
  • 22 new tests: 16 in tests/subagent-distillation.test.ts, 6 in tests/subagent.test.ts (2 for onSpawnComplete, 4 for the forcedSummary formatter branch)
  • Public-API snapshot covers the 10 new public names
  • Live-API end-to-end (scripts/probe-session-e2e-post-patch.mts): median 12-turn session moved from SUB +24% to SUB −13%
  • Reviewer sanity-check on the structural SubagentResultLike type — that's the door open for SessionStats integration; comments welcome on whether the shape is right

reasonix added 5 commits May 15, 2026 06:41

The bundled libwayland-client.so.0 inside the AppImage was built on
Ubuntu 22.04 CI and ABI-mismatches the host Wayland compositor on
distros that ship a different libwayland (openSUSE Tumbleweed,
Fedora, Arch on some weeks). The mismatch makes
WebKitWebProcess::eglGetDisplay() return EGL_BAD_PARAMETER and the
process aborts with a short stack before the first paint — exactly
the SIGABRT the AppImage user in #892 still hits even with #895's
DMABUF disable in place.

WebKitWebProcess is fork+exec'd from the parent and inherits the env,
so setting LD_PRELOAD to the host's libwayland in main() redirects
just the child's dynamic linker without disturbing the already-loaded
parent. Gate it on APPDIR + WAYLAND_DISPLAY so distro packages,
debug builds, and X11 sessions are untouched. Skip the bare
/usr/lib/libwayland-client.so.0 path — on 64-bit Fedora it can be
a 32-bit library and the loader rejects it with an ELF-class warning.

Pattern confirmed by Tolaria, yaak, nym-vpn-client and the wider
Tauri AppImage user base (tauri-apps/tauri#11988,
gitbutlerapp/gitbutler#5282).
When CacheFirstLoop's storm-breaker or context-guard fires inside a
spawned child loop, it emits an assistant_final event tagged
forcedSummary:true carrying the partial synthesis the model managed
to produce. spawnSubagent was routing that text into errorMessage
and zeroing output, so the parent loop saw a "failed" spawn with no
content even though the child had written a usable partial answer.

Discriminate the two forcedSummary paths on parentSignal.aborted:
the user-abort path (loop.ts ~670) still routes the UX placeholder
to errorMessage; the storm-breaker / context-guard path now
populates final, sets a new forcedSummary flag on SubagentResult,
and success stays false so callers can tell partial from complete.

formatSubagentResult gains a forcedSummary branch that emits
{ success: false, partial: true, output, note } so the parent
agent sees the content with a clear "this is partial" marker.
paused still takes precedence when both are set.
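
The discrimination this commit describes can be sketched as a single routing decision. This is a reconstruction under stated assumptions: the PR confirms the `forcedSummary` tag on the final event, the `parentSignal.aborted` discriminant, and that `success` stays false on the guard path; the event/result field names and helper are hypothetical.

```typescript
// Sketch of the forcedSummary routing in spawnSubagent as described
// above. FinalEvent/SpawnOutcome shapes are assumptions; only the
// aborted-vs-guard discrimination is taken from the commit message.
interface FinalEvent {
  text: string;
  forcedSummary?: boolean;
}

interface SpawnOutcome {
  success: boolean;
  final: string;
  errorMessage?: string;
  forcedSummary?: boolean;
}

function routeFinalEvent(ev: FinalEvent, parentAborted: boolean): SpawnOutcome {
  if (ev.forcedSummary && parentAborted) {
    // user-abort path: the UX placeholder stays in errorMessage
    // and no content is surfaced as output
    return { success: false, final: "", errorMessage: ev.text };
  }
  if (ev.forcedSummary) {
    // storm-breaker / context-guard path: ship the partial synthesis
    // as output, flag it, and keep success false so callers can tell
    // partial from complete
    return { success: false, final: ev.text, forcedSummary: true };
  }
  return { success: true, final: ev.text };
}
```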

Empirical validation against the live API: storm-summary spawns
now ship 286-391 tokens of partial answer in output instead of
stranding it in error. A 12-turn x 3-repeat session probe with
this + the upcoming distillation telemetry wiring shows median
SUB cost moving from +24% (loss) vs FLAT pre-patch to -13% (win)
post-patch.
Reasonix has measured cache hit rate per session since Pillar 1
landed, but the other half of the economic story — how much
parent-log growth `spawn_subagent` avoided — has been invisible.
This adds the per-spawn + per-session telemetry to make it
inspectable, mirroring the cache-rate surface.

New module `src/telemetry/subagent-distillation.ts`:

  SpawnDistillation        per-spawn: completionTokens, outputTokens,
                           savingsTokens, compressionRatio, hasOutput,
                           costUsd, paused
  computeSpawnDistillation derives the shape from a SubagentResult-
                           like input (structural, not nominal — avoids
                           a stats ↔ subagent ↔ loop import cycle if
                           SessionStats later wants in)
  SubagentSessionSummary   aggregate: spawnCount, usefulSpawnCount,
                           pausedSpawnCount, successRate, total*,
                           aggregateCompressionRatio (completion-weighted,
                           not naive mean), totalCostUsd
  summarizeSubagentSession aggregates per-spawn distillations
  countSpawnStorms         counts turns with ≥ threshold spawns
  SubagentTelemetry        live collector — `record` is pre-bound for
                           ergonomic use as a callback; `startTurn(n)`
                           groups records into turn buckets so storm
                           detection is meaningful
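
The two non-obvious computations in the listing above — the completion-weighted aggregate ratio and storm counting — can be sketched as follows. The `SpawnDistillation` field names come from this commit message; the exact formulas in `src/telemetry/subagent-distillation.ts` may differ, so treat this as an illustration of the weighting, not the shipped code.

```typescript
// Sketch of completion-weighted aggregation and storm counting as
// described above; formulas are assumptions consistent with the
// field names in the commit message.
interface SpawnDistillation {
  completionTokens: number; // tokens the child consumed
  outputTokens: number;     // tokens returned to the parent
  savingsTokens: number;    // parent-log growth avoided
  compressionRatio: number; // per-spawn output/completion
  hasOutput: boolean;       // non-empty, non-whitespace output
  costUsd: number;
  paused: boolean;
}

function aggregateCompressionRatio(spawns: SpawnDistillation[]): number {
  // Completion-weighted, not a naive mean: large spawns dominate, so
  // one tiny well-compressed spawn cannot mask a bloated one.
  const totalCompletion = spawns.reduce((s, d) => s + d.completionTokens, 0);
  if (totalCompletion === 0) return 0; // zero-completion guard
  const totalOutput = spawns.reduce((s, d) => s + d.outputTokens, 0);
  return totalOutput / totalCompletion;
}

function countSpawnStorms(spawnsPerTurn: number[], threshold = 3): number {
  // a "storm" is any turn bucket with >= threshold spawns
  return spawnsPerTurn.filter((n) => n >= threshold).length;
}
```

With two spawns of 1000→100 and 100→100 tokens, the weighted ratio is 200/1100 ≈ 0.18, where a naive mean of per-spawn ratios would report 0.55 — exactly the distortion the completion weighting avoids.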

`registerSubagentTool` gains one optional `onSpawnComplete` hook
fired after `spawnSubagent` returns (errors are swallowed so
telemetry can't break the spawn-tool dispatch). Pattern:

    const telemetry = new SubagentTelemetry();
    registerSubagentTool(registry, {
      client,
      onSpawnComplete: telemetry.record,
    });
    // ... agent runs ...
    console.log(telemetry.summary);
    console.log(telemetry.stormCount());

No behavior change to the parent loop, no SessionStats touch, no
TUI surface yet — wiring is the caller's responsibility for now.
Index re-exports the public surface; deep-import path also works.

Tests: 16 cases on the telemetry module + 2 on the spawn hook,
covering compression edge cases, empty/whitespace output detection,
zero-completion guard, completion-weighted aggregation, storm
counting, collector lifecycle, and onSpawnComplete error isolation.

End-to-end validation against the live DeepSeek API (12-turn ×
3-repeat session with this + the prior forcedSummary routing fix)
showed median SUB session cost moving from +24% vs FLAT pre-patch
to −13% post-patch; useful-spawn rate climbed from 54% to 73%.
Eight standalone probe scripts under scripts/. Each loads .env and
hits the live DeepSeek API; none are wired into the bench harness
or any CI. Purpose: reproduce + extend the measurements behind the
sub-agent distillation work.

  probe-orchestrator-cache.mjs        (R1) raw-API cache rate, FLAT vs ORCH
                                            messages[] topology
  probe-subagent-distillation.mts     (R2) read-heavy spawns, per-spawn
                                            compression on 3 representative
                                            tasks
  probe-subagent-write-heavy.mts      (R3a) negative case — spawns whose
                                            deliverable IS the artifact
  probe-session-e2e.mts               (R4)  6-turn single-run E2E
  probe-session-e2e-long.mts          (R4-long) 12-turn × 3-repeat E2E,
                                            pre-patch baseline
  probe-empty-output-diagnosis.mts    (R5)  diagnoses the 46% empty-output
                                            rate at three budget caps
  probe-session-e2e-post-patch.mts    (R6)  same script as R4-long, runs
                                            SubagentTelemetry + the
                                            forcedSummary fix end-to-end
  probe-budget-sweep.mts              (R7)  5 tasks × 5 budgets sweep to
                                            locate the right `maxToolIters`
                                            default

All scripts are idempotent reads of ./src — none mutate state.
Costs ~$1.20 of DeepSeek spend to run end-to-end on a fresh clone.
Run individually via `npx tsx scripts/probe-*.mts` /
`node scripts/probe-orchestrator-cache.mjs`.
Two internal docs under docs/rfcs/:

  0001-multi-context-orchestrator.md
    The opinionated RFC. Frames the design space (cache-economics
    vs distillation vs reliability), defines roles, sketches the
    orchestrator topology proposal that started this investigation.

  0001-multi-context-orchestrator.issue.md
    A draft GitHub issue body, evidence-first. Walks through 7
    rounds of live-API probes: cache-rate falsification (R1),
    read-heavy distillation (R2, ~6% compression), write-heavy
    negative case (R3a, ~85%), break-even arithmetic (R3b),
    full FLAT-vs-SUB session pre-patch (R4-long, SUB +24% loss,
    46% empty-output rate), empty-output diagnosis (R5, three
    distinct mechanisms), post-patch validation (R6, median SUB
    -13% win), and budget sweep (R7, keep DEFAULT_PAUSE_EVERY=16).

The issue draft closes with a four-issue split for the upstream:

  1 Expose distillation + reliability metrics  (this branch — landed)
  2 Route forcedSummary → output                (this branch — landed)
  3 Tune default maxToolIters                   (resolved as no-change)
  4 Sub-context topology                        (conditional unblock)

Total spend across the seven rounds: ~$1.20 of live DeepSeek API.
@esengine esengine merged commit 3b0435a into main May 16, 2026
5 checks passed
@esengine esengine deleted the feat/subagent-distillation-telemetry branch May 16, 2026 02:27