fix: prevent agent process leaks with event loop fixes and ESLint rules (#213) #214
Conversation
Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #213
#213) Root cause: Agent processes spawned by tests would remain alive indefinitely when hitting rate limits or API errors. Multiple factors contributed:
1. No process lifetime guard - processes could run forever during retries
2. setTimeout in waitForPending not unref'd - kept the event loop alive
3. SIGINT handlers accumulated via process.on() instead of process.once()
4. stdin error listener not removed in input-queue.js stop()
Changes:
- Add AGENT_PROCESS_LIFETIME_TIMEOUT env var (default: disabled) that force-exits the process after a configurable number of seconds
- Use .unref() on the recursive setTimeout in continuous-mode.js waitForPending
- Use process.once('SIGINT') instead of process.on('SIGINT') to prevent handler accumulation
- Add a safeResolve guard against double resolution of the wait promise
- Fix missing error-listener removal in input-queue.js stop()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
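The waitForPending changes in the commit above can be sketched roughly as follows. This is a minimal sketch, not the actual code from continuous-mode.js; the names hasPending and pollMs are hypothetical stand-ins.

```javascript
// Sketch of the waitForPending fix: the poll timer is unref'd so it cannot
// keep the event loop alive, and safeResolve guards against the wait promise
// being resolved twice. hasPending and pollMs are hypothetical names.
function waitForPending(hasPending, pollMs = 100) {
  return new Promise((resolve) => {
    let settled = false;
    const safeResolve = () => {
      if (settled) return; // guard against double resolution
      settled = true;
      resolve();
    };
    const poll = () => {
      if (!hasPending()) return safeResolve();
      // .unref() lets the process exit even while this timer is scheduled
      setTimeout(poll, pollMs).unref();
    };
    poll();
  });
}
```

The guard matters because both a final poll and an external shutdown path may try to settle the same promise; without it, a second resolve is silently ignored in some code paths but masks bugs in others.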
- Log uptime and error state at process exit for debugging
- Add experiments/test-process-lifetime-timeout.sh to verify the AGENT_PROCESS_LIFETIME_TIMEOUT watchdog behavior
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- docs/case-studies/issue-213/README.md: detailed root cause analysis, timeline, fixes, and recommendations
- docs/case-studies/issue-213/process-snapshot.txt: process tree snapshot
- js/.changeset/fix-process-leaks.md: changeset for patch release
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🤖 Solution Draft Log
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
Now the work session has ended; feel free to review and add any feedback on the solution draft.
✅ Ready to merge
This pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag
An agent can execute a single task for hours, so we cannot set a global timeout on the process. Maybe we can add some ESLint rules (do such rules already exist?) to track dangling promises and the like. Can we also double-check the code and write our own ESLint rule if no such rule exists on npm yet? We need to make sure our system is leak-proof; after this run we still have 11 dangling agent processes, so not all cases were fixed. Also double-check that you actually execute local tests one by one until the limit is reached on the minimax 2.5 free model (the default model; if it is not the default, that should be fixed). Check the original requirements of the issue.
🤖 AI Work Session Started
Starting automated work session at 2026-03-10T15:01:22.546Z
The PR has been converted to draft mode while work is in progress. This comment marks the beginning of an AI work session. Please wait for the session to finish, then provide your feedback.
…les (#213)
- Remove AGENT_PROCESS_LIFETIME_TIMEOUT (agents can run for hours)
- Add .unref() to timers in retry-fetch.ts, session/retry.ts, util/timeout.ts
- Fix Bun.serve() idleTimeout from 0 (infinite) to 255 (default)
- Add eslint-plugin-promise for detecting dangling/floating promises
- Add no-restricted-syntax rules warning against process.on('SIGINT'/'SIGTERM')
- Update case study and changeset
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
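One of the timer call sites named above is the timeout helper in util/timeout.ts. A sketch of what an unref'd timeout wrapper looks like is shown below; this is an illustration under assumptions, not the project's actual implementation.

```javascript
// Sketch of an unref'd timeout wrapper in the spirit of util/timeout.ts:
// rejects if the promise does not settle within ms, clears the timer either
// way, and unref's it so a pending timeout cannot keep the process alive.
function withTimeout(promise, ms, message = "operation timed out") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
    timer.unref(); // do not hold the event loop open for this timer
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

The clearTimeout in finally matters independently of .unref(): without it, every successful race leaves a live timer behind, which is exactly the accumulation pattern this PR is hunting.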
🤖 Work Session Summary
Addressed all feedback from the review comments:
Changes Made
Test Results
Unit tests (all pass): log-lazy, json-standard-unit, model-validation, session-usage, retry-state, retry-fetch, safe-json-serialization, process-name, model-not-supported, model-fallback, mcp-timeout, verbose-http-logging
Integration tests: Tests spawning child processes time out due to
Process leak verification: Confirmed no orphaned processes after killing test agent
CI: ✅ All checks passing
Note on remaining dangling processes
The 11 existing dangling agent processes from the previous session are from before these fixes. The
Work session completed. Ready for review.
🤖 Solution Draft Log
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
Now the work session has ended; feel free to review and add any feedback on the solution draft.
✅ Ready to merge
This pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag
We should have option
🤖 AI Work Session Started
Starting automated work session at 2026-03-10T19:34:24.867Z
The PR has been converted to draft mode while work is in progress. This comment marks the beginning of an AI work session. Please wait for the session to finish, then provide your feedback.
…213)
- Add --retry-on-rate-limits boolean flag (default: true) to yargs and Flag
  Use --no-retry-on-rate-limits to fail fast on HTTP 429 instead of waiting
  This is the correct behavior for integration tests, where waiting hours for rate limits would block CI
- Move all integration tests from tests/ to tests/integration/
  Prevents accidental bulk running of API-heavy tests with bun test
  Integration tests must be run one at a time to respect rate limits
- Update package.json test script to only run unit tests by default:
  bun test -> runs tests/*.test.{js,ts} (unit tests only, no API calls)
  bun run test:integration -> runs tests/integration/basic.test.js (one hi test)
- Add --no-retry-on-rate-limits to all integration test invocations so they fail fast instead of hanging when rate limited
- Add unit tests for RETRY_ON_RATE_LIMITS flag behavior in retry-fetch.test.ts
- Update bunfig.toml and TESTING.md to document the new test structure
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🤖 Work Session Summary
Addressed all feedback from the latest review comment:
Changes Made
Test Results
Unit tests (245 pass, 0 fail):
Lint: ESLint 0 errors, Prettier all formatted
How Integration Tests Work Now

```shell
# Default bun test = fast unit tests only (no API calls)
bun test

# Run the one "hi" integration test
bun run test:integration

# Run any specific integration test one at a time
bun test tests/integration/basic.test.js
```

Work session completed. Ready for review.
🤖 Solution Draft Log
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
Now the work session has ended; feel free to review and add any feedback on the solution draft.
✅ Ready to merge
This pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag
Summary
Fixes #213 — agent processes were leaking and accumulating (52+ simultaneous processes) because timers without .unref() prevented the event loop from exiting, even after all work was complete. Rate limit retry loops (default: 7-day timeout) kept zombie processes alive indefinitely.

Root Cause Analysis
See full case study: docs/case-studies/issue-213/README.md

Multiple contributing factors:
- setTimeout/setInterval without .unref() across multiple files — kept the event loop alive, preventing natural process exit even when all work was done
- Bun.serve() with idleTimeout: 0 (infinite) — HTTP server connections kept the event loop alive even after server.stop()
- process.on('SIGINT') — each continuous mode entry added a new handler without removing old ones
- input-queue.js — stop() forgot to remove the error event listener
Evidence

Process snapshot showed orphaned processes (PPID=1) with uptimes of 58+ minutes, stuck in retry loops or waiting on ep_poll (epoll event loop) with open socket and timerfd handles.

Changes
- js/src/provider/retry-fetch.ts: add .unref() to sleep timer, globalTimeout, and cancellation interval; add RETRY_ON_RATE_LIMITS flag check
- js/src/session/retry.ts: add .unref() to sleep timer
- js/src/util/timeout.ts: add .unref() to withTimeout timer
- js/src/server/server.ts: Bun.serve() idleTimeout from 0 (infinite) to 255 (default)
- js/src/cli/continuous-mode.js: waitForPending setTimeout to use .unref(); use process.once('SIGINT'); add safeResolve guard
- js/src/cli/input-queue.js: stop() removes the error event listener
- js/src/index.js: remove AGENT_PROCESS_LIFETIME_TIMEOUT; add --retry-on-rate-limits yargs flag and middleware handler
- js/src/flag/flag.ts: remove PROCESS_LIFETIME_TIMEOUT config; add RETRY_ON_RATE_LIMITS flag with setRetryOnRateLimits()
- js/eslint.config.js: eslint-plugin-promise + custom no-restricted-syntax rules for leak prevention
- js/package.json: eslint-plugin-promise dev dep; update test script to run only unit tests; add test:integration script
- js/bunfig.toml: unit tests in tests/, integration tests in tests/integration/
- js/tests/retry-fetch.test.ts: RETRY_ON_RATE_LIMITS flag behavior
- js/tests/integration/: --no-retry-on-rate-limits flag
- js/.changeset/fix-process-leaks.md
- docs/case-studies/issue-213/
- TESTING.md
New: --retry-on-rate-limits Flag

Added --retry-on-rate-limits boolean CLI flag (yargs auto-generates --no-retry-on-rate-limits):
- true — agent retries on HTTP 429 (rate limit) as before (7-day timeout)
- false (--no-retry-on-rate-limits) — agent fails fast on 429 instead of waiting. Used in all integration tests to avoid blocking CI when rate limited.

ESLint Rules Added
- promise/catch-or-return (warn): Detects dangling/floating promises not awaited or caught
- promise/no-nesting (warn): Warns about nested promise anti-patterns
- no-restricted-syntax (warn): Warns when process.on('SIGINT') or process.on('SIGTERM') is used — prefer process.once() to prevent handler accumulation
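A flat-config fragment matching rules of this kind might look like the sketch below. The exact selectors and severities in js/eslint.config.js may differ; this is an illustration of the approach, assuming eslint-plugin-promise is installed.

```javascript
// eslint.config.js fragment (sketch): warn on floating promises and on
// process.on('SIGINT'/'SIGTERM'), which accumulates handlers across entries.
import promise from "eslint-plugin-promise";

export default [
  {
    plugins: { promise },
    rules: {
      "promise/catch-or-return": "warn",
      "promise/no-nesting": "warn",
      "no-restricted-syntax": [
        "warn",
        {
          selector:
            "CallExpression[callee.object.name='process'][callee.property.name='on'][arguments.0.value='SIGINT']",
          message: "Use process.once('SIGINT') to avoid handler accumulation.",
        },
        {
          selector:
            "CallExpression[callee.object.name='process'][callee.property.name='on'][arguments.0.value='SIGTERM']",
          message: "Use process.once('SIGTERM') to avoid handler accumulation.",
        },
      ],
    },
  },
];
```

no-restricted-syntax matches AST node selectors, so no custom rule package is needed for the signal-handler check; only the floating-promise detection requires the plugin.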
Test Structure Change

Integration tests now live in js/tests/integration/ and are excluded from the default bun test run:

Why: The default model (minimax-m2.5-free) has strict rate limits. Running multiple integration tests in bulk exhausts the quota. All integration tests now pass --no-retry-on-rate-limits to fail fast instead of blocking.

Test Results
Unit tests (all pass, 245 tests):
Lint:
Key Decision
Removed AGENT_PROCESS_LIFETIME_TIMEOUT: Per reviewer feedback, agents can execute single tasks for hours, so a global process timeout is not appropriate. Instead, we fix the root cause (timers keeping the event loop alive) and add ESLint rules to prevent future regressions.

Related Issues
🤖 Generated with Claude Code
Fixes #213