
feat: trace recursive RLM child runs #95

Merged
namastex888 merged 1 commit into dev from feat/recursive-tree-observability on May 28, 2026


@namastex888 (Collaborator)

Summary

  • emit child_start/child_end JSONL events for rlm_query recursion
  • propagate recursive ancestry and bounded child flags into child rlmx runs
  • add root/child/total usage splits and minimal Langfuse trace spans
  • fail closed with default recursive max-depth, process-group termination, bounded Langfuse flush, and child usage budget accounting
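
The summary items can be pictured with a minimal TypeScript sketch: child_start/child_end records written as JSONL with an ancestry list, a fail-closed depth guard, and a root/child/total usage split. Every type, field name, and the default depth of 3 below are hypothetical; the PR's actual event schema and defaults are not shown on this page.

```typescript
// Hypothetical shapes; the PR's real JSONL schema may use different fields.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cost: number;
}

type ChildEvent =
  | { type: "child_start"; runId: string; ancestry: string[]; depth: number }
  | { type: "child_end"; runId: string; ancestry: string[]; depth: number; ok: boolean; usage: Usage };

// Hypothetical default: recursion is refused past this depth unless configured otherwise.
const DEFAULT_MAX_DEPTH = 3;

// Fail closed: an invalid or too-deep depth throws instead of spawning a child.
function assertDepthAllowed(depth: number, maxDepth: number = DEFAULT_MAX_DEPTH): void {
  if (!Number.isInteger(depth) || depth < 0 || depth > maxDepth) {
    throw new Error(`recursive depth ${depth} exceeds max depth ${maxDepth}`);
  }
}

// One JSON object per line, so the event log can be appended to and streamed.
function toJsonl(event: ChildEvent): string {
  return JSON.stringify(event) + "\n";
}

function addUsage(a: Usage, b: Usage): Usage {
  return {
    inputTokens: a.inputTokens + b.inputTokens,
    outputTokens: a.outputTokens + b.outputTokens,
    cost: a.cost + b.cost,
  };
}

const ZERO_USAGE: Usage = { inputTokens: 0, outputTokens: 0, cost: 0 };

// Root usage is the parent's own LLM calls. Child usage sums child_end records,
// assuming only direct children are passed in and each reports its subtree total.
function splitUsage(root: Usage, events: ChildEvent[]): { root: Usage; child: Usage; total: Usage } {
  let child = ZERO_USAGE;
  for (const e of events) {
    if (e.type === "child_end") child = addUsage(child, e.usage);
  }
  return { root, child, total: addUsage(root, child) };
}
```

Keeping the split as three separate totals lets a budget check compare `total` against the global limit while observability tools can still attribute spend to the root or to its children.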

Proof

  • npm run build
  • npm run check
  • node --test dist/tests/recursive-trace.test.js
  • npm test: 365/365 pass
  • static added-line secret/injection scan: empty
  • independent reviewer: passed after fixes

namastex888 merged commit 3b77ed9 into dev on May 28, 2026
7 checks passed
namastex888 deleted the feat/recursive-tree-observability branch on May 28, 2026 at 08:10
coderabbitai (bot) commented on May 28, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d2f471de-7a0c-423f-a7fb-7641545edbbe


gemini-code-assist (bot) left a review

Code Review

This pull request introduces recursive RLM tree observability by implementing a minimal Langfuse ingestion recorder (LangfuseTraceRecorder) that traces recursive runs and child spans. It also adds parsing of child RLM process results, root/child/total usage splits, propagation of the remaining global budget to recursive child processes, and corresponding unit tests. The review feedback highlights three areas for improvement:

  • a potential memory leak in rlmQuery, because abort event listeners are never removed;
  • a risk of budget overrun when concurrent queries in rlm_query_batched share the same remaining-budget calculation;
  • a data-loss risk in LangfuseTraceRecorder.flush(), where queued events are permanently discarded even if the ingestion request fails.

Comment on src/llm.ts, lines 464 to 468:

```diff
 if (signal) {
-  signal.addEventListener("abort", () => child.kill("SIGTERM"), {
+  signal.addEventListener("abort", terminateChildTree, {
     once: true,
   });
 }
```
Priority: high

The abort listener is registered with signal.addEventListener("abort", terminateChildTree), but it is never removed when the child process exits, whether normally or with an error. Because signal is typically a long-lived AbortSignal tied to the entire rlmLoop run, every rlmQuery call (including recursive ones) leaves another listener attached. This leaks memory and can trigger a MaxListenersExceededWarning in Node.js.

To fix this, register a cleanup listener on the child process's close and error events to remove the abort listener.

Suggested change:

```ts
if (signal) {
  signal.addEventListener("abort", terminateChildTree, {
    once: true,
  });
  // Detach the abort listener once the child is gone so it cannot accumulate.
  const cleanup = () => signal.removeEventListener("abort", terminateChildTree);
  child.on("close", cleanup);
  child.on("error", cleanup);
}
```
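
The effect of the suggested change can be checked without spawning real processes. This sketch uses a plain EventEmitter as a stand-in for the ChildProcess (an assumption made for testability) and Node's `getEventListeners` from `node:events` to count the abort listeners left on one shared signal. `bindAbortToChild` and `simulateQueries` are hypothetical helpers, not functions from the PR.

```typescript
import { EventEmitter, getEventListeners } from "node:events";

// Hypothetical helper mirroring the suggested change: the abort listener lives
// only as long as the child, not as long as the shared signal.
function bindAbortToChild(signal: AbortSignal, child: EventEmitter, onAbort: () => void): void {
  signal.addEventListener("abort", onAbort, { once: true });
  const cleanup = (): void => signal.removeEventListener("abort", onAbort);
  child.once("close", cleanup);
  child.once("error", cleanup);
}

// Simulate several sequential rlm_query children sharing one long-lived signal.
function simulateQueries(signal: AbortSignal, count: number): number {
  for (let i = 0; i < count; i++) {
    const child = new EventEmitter(); // stand-in for a spawned ChildProcess
    bindAbortToChild(signal, child, () => {
      /* would terminate the child process tree */
    });
    child.emit("close", 0); // the child exits normally
  }
  // Number of abort listeners still registered after every child has exited.
  return getEventListeners(signal, "abort").length;
}

const controller = new AbortController();
console.log(simulateQueries(controller.signal, 20)); // 0
```

Without the two `cleanup` registrations, the same loop would leave 20 distinct listeners on the signal, one per simulated query.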

Comment on src/rlm.ts, lines +315 to +316:

```ts
const childUsageBefore = { ...childUsage };
const remainingChildBudget = buildRemainingChildBudget(config, budget);
```
Priority: medium

When rlm_query_batched runs queries concurrently, the remaining budget (maxCost and maxTokens) is calculated once at the start of the batch and passed unchanged to every child process. Each concurrent child therefore believes it can spend the full remaining budget, so a batch of N children can together spend up to N times what is actually left.

Consider allocating a fraction of the remaining budget to each concurrent child process (e.g., dividing the remaining budget by the batch size) to strictly enforce the budget limit.
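
A minimal sketch of the suggested fractional allocation, assuming the remaining budget is an object with optional maxCost and maxTokens fields. The `ChildBudget` interface and `splitBudget` function are hypothetical; the real return type of buildRemainingChildBudget() is not shown in this thread.

```typescript
// Hypothetical shape; buildRemainingChildBudget()'s actual return type may differ.
interface ChildBudget {
  maxCost?: number; // remaining cost allowance
  maxTokens?: number; // remaining token allowance
}

// Give each of `batchSize` concurrent children an equal share, so that even if
// every child spends its full share, the batch as a whole stays within `remaining`.
function splitBudget(remaining: ChildBudget, batchSize: number): ChildBudget {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError(`batchSize must be a positive integer, got ${batchSize}`);
  }
  const share: ChildBudget = {};
  if (remaining.maxCost !== undefined) {
    share.maxCost = remaining.maxCost / batchSize;
  }
  if (remaining.maxTokens !== undefined) {
    // Round down so the token shares never sum to more than the remaining total.
    share.maxTokens = Math.floor(remaining.maxTokens / batchSize);
  }
  return share;
}

console.log(splitBudget({ maxCost: 1, maxTokens: 1000 }, 4)); // { maxCost: 0.25, maxTokens: 250 }
```

A static split is conservative: a child that finishes early strands its unused share. A shared ledger that children reserve from would avoid that, at the cost of cross-process coordination.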

Comment on src/langfuse.ts:

```ts
async flush(): Promise<void> {
  if (!this.enabled || this.queue.length === 0) return;
  const batch = this.queue.splice(0, this.queue.length);
```
Priority: medium

When flush() is called, the queue is immediately cleared via splice. If the subsequent fetchImpl call fails or times out, the spliced events are permanently lost. Consider restoring the events to the queue in the catch block if the ingestion fails, allowing for potential retries.
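
A minimal sketch of restoring a failed batch, assuming a recorder that keeps an in-memory queue and sends batches through an injected function. `RetryingTraceQueue`, `IngestionEvent`, `SendBatch`, and the 1000-event cap are hypothetical; `send` stands in for the recorder's fetchImpl call.

```typescript
// Hypothetical minimal event type; real Langfuse ingestion events carry more fields.
interface IngestionEvent {
  id: string;
  type: string;
  body: unknown;
}

type SendBatch = (batch: IngestionEvent[]) => Promise<void>;

class RetryingTraceQueue {
  private queue: IngestionEvent[] = [];
  private readonly send: SendBatch;
  private readonly maxQueued: number;

  constructor(send: SendBatch, maxQueued = 1000) {
    this.send = send;
    this.maxQueued = maxQueued;
  }

  enqueue(event: IngestionEvent): void {
    this.queue.push(event);
    this.trim();
  }

  get size(): number {
    return this.queue.length;
  }

  async flush(): Promise<void> {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length);
    try {
      await this.send(batch);
    } catch (err) {
      // Put the failed batch back ahead of events enqueued during the request,
      // preserving order so a later flush can retry it.
      this.queue = batch.concat(this.queue);
      this.trim();
      throw err;
    }
  }

  // Bound memory if the endpoint stays down: drop the oldest events first.
  private trim(): void {
    const excess = this.queue.length - this.maxQueued;
    if (excess > 0) this.queue.splice(0, excess);
  }
}
```

Rethrowing leaves the retry-or-give-up decision to the caller, which matters for a bounded shutdown flush. A retry may resend events the server already accepted, so each event should keep a stable id.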

chatgpt-codex-connector (bot) left a review

💡 Codex Review

rlmx/src/rlm.ts, line 676 in e96470a:

```ts
return buildResult(
```

P2: Flush Langfuse traces on timeout exits

When a run times out after any rlm_query has queued trace/span events, this AbortError path returns without the langfuse.flush() call that only exists inside finalize(). The CLI then exits with those queued Langfuse events unsent, so recursive traces disappear exactly for timed-out runs where the trace is most useful; flush the recorder on these error returns or in a shared finally path.
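
A minimal sketch of the "shared finally path" option, assuming the recorder exposes an async flush(). `TraceRecorder`, `runWithTraceFlush`, and the 2000 ms default are hypothetical; the flush is raced against a timer so that an unreachable Langfuse endpoint cannot hang CLI exit.

```typescript
// Hypothetical minimal recorder interface; LangfuseTraceRecorder may differ.
interface TraceRecorder {
  flush(): Promise<void>;
}

// Run the loop and flush traces on every exit path: success, timeout, or other errors.
async function runWithTraceFlush<T>(
  recorder: TraceRecorder,
  run: () => Promise<T>,
  flushTimeoutMs = 2000,
): Promise<T> {
  try {
    return await run();
  } finally {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // Bounded flush: stop waiting after flushTimeoutMs even if flush() never settles.
      await Promise.race([
        recorder.flush(),
        new Promise<void>((resolve) => {
          timer = setTimeout(resolve, flushTimeoutMs);
        }),
      ]);
    } catch {
      // Tracing is best effort; never replace the run's own result or error.
    } finally {
      clearTimeout(timer);
    }
  }
}
```

Because the flush sits in `finally`, an AbortError from a timeout still propagates unchanged after the queued events are sent, and a flush failure is swallowed rather than masking it.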


Comment on src/langfuse.ts, lines +136 to +138:

```ts
if (!res.ok) {
  throw new Error(`Langfuse ingestion failed: ${res.status} ${await res.text().catch(() => "")}`.trim());
}
```
P2: Detect Langfuse multi-status ingestion errors

Langfuse's ingestion endpoint reports per-event validation failures with HTTP 207 Multi-Status, and fetch still marks 207 as ok; with this check, a batch containing rejected trace/span events is treated as a successful flush after it has already been removed from the queue. In environments where Langfuse rejects an event schema or field, recursive tracing is silently lost instead of surfacing the ingestion error, so handle 207/error items explicitly.
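
A minimal sketch of handling 207 explicitly. It assumes the multi-status body is JSON with per-event `successes` and `errors` arrays whose entries carry `id`, `status`, and an optional `message`; check that shape against the Langfuse ingestion API reference before relying on it. `IngestionResult` and `assertIngestionSucceeded` are hypothetical names.

```typescript
// Assumed body shape of a Langfuse batch-ingestion response (per-event results).
interface IngestionResult {
  successes?: Array<{ id: string; status: number }>;
  errors?: Array<{ id: string; status: number; message?: string }>;
}

// Treat any per-event error in a 207 Multi-Status response as a failed flush.
async function assertIngestionSucceeded(res: Response): Promise<void> {
  const text = await res.text().catch(() => "");
  if (!res.ok) {
    throw new Error(`Langfuse ingestion failed: ${res.status} ${text}`.trim());
  }
  // fetch reports every 2xx status as ok, so 207 must be inspected separately.
  if (res.status !== 207) return;

  let result: IngestionResult | null;
  try {
    result = JSON.parse(text) as IngestionResult | null;
  } catch {
    throw new Error("Langfuse ingestion returned 207 with an unparseable body");
  }
  const errors = result?.errors ?? [];
  if (errors.length > 0) {
    const details = errors
      .map((e) => `${e.id} (${e.status}${e.message ? `: ${e.message}` : ""})`)
      .join(", ");
    throw new Error(`Langfuse rejected ${errors.length} event(s): ${details}`);
  }
}
```

Throwing here lets the same failure path used for network errors (for example, requeueing the batch) also cover events that Langfuse rejected individually.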
