feat: trace recursive RLM child runs#95
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces recursive RLM tree observability by implementing a minimal Langfuse ingestion recorder (LangfuseTraceRecorder) to trace recursive runs and child spans. It also adds support for parsing child RLM process results, tracking child usage splits (root vs. child vs. total), passing down remaining global budgets to recursive child processes, and includes corresponding unit tests. The review feedback highlights three key areas for improvement: a potential memory leak in rlmQuery due to unremoved abort event listeners, a potential budget overrun risk when concurrent queries in rlm_query_batched share the same remaining budget calculation, and a data loss risk in LangfuseTraceRecorder.flush() where queued events are permanently discarded even if the ingestion request fails.
| if (signal) { | ||
| signal.addEventListener("abort", () => child.kill("SIGTERM"), { | ||
| signal.addEventListener("abort", terminateChildTree, { | ||
| once: true, | ||
| }); | ||
| } |
There was a problem hiding this comment.
The abort event listener on signal is added using signal.addEventListener('abort', terminateChildTree), but it is never removed when the child process exits normally or with an error. Since signal is typically a long-lived AbortSignal associated with the entire rlmLoop run, calling rlmQuery multiple times recursively will accumulate these event listeners, leading to a memory leak and potentially triggering a MaxListenersExceededWarning in Node.js.
To fix this, register a cleanup listener on the child process's close and error events to remove the abort listener.
| if (signal) { | |
| signal.addEventListener("abort", () => child.kill("SIGTERM"), { | |
| signal.addEventListener("abort", terminateChildTree, { | |
| once: true, | |
| }); | |
| } | |
| if (signal) { | |
| signal.addEventListener("abort", terminateChildTree, { | |
| once: true, | |
| }); | |
| const cleanup = () => signal.removeEventListener("abort", terminateChildTree); | |
| child.on("close", cleanup); | |
| child.on("error", cleanup); | |
| } |
| const childUsageBefore = { ...childUsage }; | ||
| const remainingChildBudget = buildRemainingChildBudget(config, budget); |
There was a problem hiding this comment.
When executing concurrent queries via rlm_query_batched, the remaining budget (maxCost and maxTokens) is calculated once at the start of the batch and passed to all concurrent child processes. If multiple child processes run concurrently, each will believe it has access to the full remaining budget, potentially leading to a significant budget overrun.
Consider allocating a fraction of the remaining budget to each concurrent child process (e.g., dividing the remaining budget by the batch size) to strictly enforce the budget limit.
|
|
||
| async flush(): Promise<void> { | ||
| if (!this.enabled || this.queue.length === 0) return; | ||
| const batch = this.queue.splice(0, this.queue.length); |
There was a problem hiding this comment.
There was a problem hiding this comment.
💡 Codex Review
Line 676 in e96470a
When a run times out after any rlm_query has queued trace/span events, this AbortError path returns without the langfuse.flush() call that only exists inside finalize(). The CLI then exits with those queued Langfuse events unsent, so recursive traces disappear exactly for timed-out runs where the trace is most useful; flush the recorder on these error returns or in a shared finally path.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (!res.ok) { | ||
| throw new Error(`Langfuse ingestion failed: ${res.status} ${await res.text().catch(() => "")}`.trim()); | ||
| } |
There was a problem hiding this comment.
Detect Langfuse multi-status ingestion errors
Langfuse's ingestion endpoint reports per-event validation failures with HTTP 207 Multi-Status, and fetch still marks 207 as ok; with this check, a batch containing rejected trace/span events is treated as a successful flush after it has already been removed from the queue. In environments where Langfuse rejects an event schema or field, recursive tracing is silently lost instead of surfacing the ingestion error, so handle 207/error items explicitly.
Useful? React with 👍 / 👎.
Summary
Proof