fix: prevent agent process leaks with event loop fixes and ESLint rules (#213) by konard · Pull Request #214 · link-assistant/agent

konard · 2026-03-09T17:39:05Z

Summary

Fixes #213 — agent processes were leaking and accumulating (52+ simultaneous processes) because timers without .unref() prevented the event loop from exiting, even after all work was complete. Rate limit retry loops (default: 7-day timeout) kept zombie processes alive indefinitely.

Root Cause Analysis

See full case study: docs/case-studies/issue-213/README.md

Multiple contributing factors:

setTimeout/setInterval without .unref() across multiple files — kept event loop alive, preventing natural process exit even when all work was done
Bun.serve() with idleTimeout: 0 (infinite) — HTTP server connections kept event loop alive even after server.stop()
SIGINT handlers accumulated via process.on('SIGINT') — each continuous mode entry added a new handler without removing old ones
Missing error listener cleanup in input-queue.js — stop() forgot to remove the error event listener
No ESLint rules to catch dangling promises or process listener accumulation patterns

Evidence

Process snapshot showed orphaned processes (PPID=1) with uptimes of 58+ minutes, stuck in retry loops or waiting on ep_poll (epoll event loop) with open socket and timerfd handles.

Changes

File	Change
`js/src/provider/retry-fetch.ts`	Add `.unref()` to sleep timer, globalTimeout, and cancellation interval; add `RETRY_ON_RATE_LIMITS` flag check
`js/src/session/retry.ts`	Add `.unref()` to sleep timer
`js/src/util/timeout.ts`	Add `.unref()` to withTimeout timer
`js/src/server/server.ts`	Change `Bun.serve()` idleTimeout from 0 (infinite) to 255 (default)
`js/src/cli/continuous-mode.js`	Fix `waitForPending` setTimeout to use `.unref()`, use `process.once('SIGINT')`, add `safeResolve` guard
`js/src/cli/input-queue.js`	Fix missing error listener removal in `stop()`
`js/src/index.js`	Remove `AGENT_PROCESS_LIFETIME_TIMEOUT`; add `--retry-on-rate-limits` yargs flag and middleware handler
`js/src/flag/flag.ts`	Remove `PROCESS_LIFETIME_TIMEOUT` config; add `RETRY_ON_RATE_LIMITS` flag with `setRetryOnRateLimits()`
`js/eslint.config.js`	Add `eslint-plugin-promise` + custom `no-restricted-syntax` rules for leak prevention
`js/package.json`	Add `eslint-plugin-promise` dev dep; update `test` script to run only unit tests; add `test:integration` script
`js/bunfig.toml`	Document that unit tests are in `tests/`, integration tests are in `tests/integration/`
`js/tests/retry-fetch.test.ts`	Add unit tests for `RETRY_ON_RATE_LIMITS` flag behavior
`js/tests/integration/`	New directory: all integration tests moved here with `--no-retry-on-rate-limits` flag
`js/.changeset/fix-process-leaks.md`	Updated changeset for patch release
`docs/case-studies/issue-213/`	Updated case study with comprehensive root cause analysis
`TESTING.md`	Updated with test structure documentation

New: `--retry-on-rate-limits` Flag

Added --retry-on-rate-limits boolean CLI flag (yargs auto-generates --no-retry-on-rate-limits):

Default: true — agent retries on HTTP 429 (rate limit) as before (7-day timeout)
--no-retry-on-rate-limits — agent fails fast on 429 instead of waiting. Used in all integration tests to avoid blocking CI when rate limited.

# Production usage (default - retries with back-off)
agent --prompt "fix this bug"

# Integration tests - fail fast instead of waiting
bun run src/index.js --no-retry-on-rate-limits --prompt "hi"

ESLint Rules Added

promise/catch-or-return (warn): Detects dangling/floating promises not awaited or caught
promise/no-nesting (warn): Warns about nested promise anti-patterns
no-restricted-syntax (warn): Warns when process.on('SIGINT') or process.on('SIGTERM') is used — prefer process.once() to prevent handler accumulation

Test Structure Change

Integration tests now live in js/tests/integration/ and are excluded from the default bun test run:

# Default: runs only unit tests (no API calls, fast)
bun test

# Run the basic "hi" integration test
bun run test:integration

# Run any integration test one at a time
bun test tests/integration/basic.test.js

Why: The default model (minimax-m2.5-free) has strict rate limits. Running multiple integration tests in bulk exhausts the quota. All integration tests now pass --no-retry-on-rate-limits to fail fast instead of blocking.

Test Results

Unit tests (all pass, 245 tests):

log-lazy, json-standard-unit, model-validation, session-usage, retry-state, retry-fetch, safe-json-serialization, process-name, model-not-supported, model-fallback, mcp-timeout, verbose-http-logging

Lint:

ESLint: 0 errors, 0 warnings with new rules
Prettier: all files formatted

Key Decision

Removed AGENT_PROCESS_LIFETIME_TIMEOUT: Per reviewer feedback, agents can execute single tasks for hours, so a global process timeout is not appropriate. Instead, we fix the root cause (timers keeping event loop alive) and add ESLint rules to prevent future regressions.

Related Issues

oven-sh/bun#13887 — Process not terminated automatically
oven-sh/bun#11892 — Spawned child process not killed

🤖 Generated with Claude Code

Fixes #213

Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #213

#213) Root cause: Agent processes spawned by tests would remain alive indefinitely when hitting rate limits or API errors. Multiple factors contributed: 1. No process lifetime guard - processes could run forever during retries 2. setTimeout in waitForPending not unref'd - kept event loop alive 3. SIGINT handlers accumulated via process.on() instead of process.once() 4. stdin error listener not removed in input-queue.js stop() Changes: - Add AGENT_PROCESS_LIFETIME_TIMEOUT env var (default: disabled) that force-exits the process after a configurable number of seconds - Use .unref() on recursive setTimeout in continuous-mode.js waitForPending - Use process.once('SIGINT') instead of process.on('SIGINT') to prevent handler accumulation - Add safeResolve guard against double-resolution of the wait promise - Fix missing error listener removal in input-queue.js stop() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Log uptime and error state at process exit for debugging - Add experiments/test-process-lifetime-timeout.sh to verify the AGENT_PROCESS_LIFETIME_TIMEOUT watchdog behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- docs/case-studies/issue-213/README.md: detailed root cause analysis, timeline, fixes, and recommendations - docs/case-studies/issue-213/process-snapshot.txt: process tree snapshot - js/.changeset/fix-process-leaks.md: changeset for patch release Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard · 2026-03-09T18:18:40Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $14.174725
Calculated by Anthropic: $10.094975 USD
Difference: $-4.079750 (-28.78%)
📎 Log file uploaded as Gist (5011KB)
🔗 View complete solution draft log

Now working session is ended, feel free to review and add any feedback on the solution draft.

konard · 2026-03-09T18:18:45Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

konard · 2026-03-10T14:58:26Z

Agent can execute single task for hours, so we cannot set global timeout on the process. May be we can add some eslinter rules (may be such exist?) to track all dangling promises and so on? Can we also double check the code and may be write our eslint rule, if no such rule exists yet in npm?

We need to make sure our system is leak proof, and after this run we still have 11 dangling agent processes, so not all cases were fixed.

Also double check you will actually execute local tests one by one until limit will be reached on minimax 2.5 free model (the default model, if it is not default it should be fixed). Check original requirements of the issue.

konard · 2026-03-10T15:01:26Z

🤖 AI Work Session Started

Starting automated work session at 2026-03-10T15:01:22.546Z

The PR has been converted to draft mode while work is in progress.

This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback.

…les (#213) - Remove AGENT_PROCESS_LIFETIME_TIMEOUT (agents can run for hours) - Add .unref() to timers in retry-fetch.ts, session/retry.ts, util/timeout.ts - Fix Bun.serve() idleTimeout from 0 (infinite) to 255 (default) - Add eslint-plugin-promise for detecting dangling/floating promises - Add no-restricted-syntax rules warning against process.on('SIGINT'/'SIGTERM') - Update case study and changeset Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard · 2026-03-10T15:27:19Z

🤖 Work Session Summary

Addressed all feedback from the review comments:

Changes Made

Removed AGENT_PROCESS_LIFETIME_TIMEOUT — per feedback, agents can run for hours so a global timeout is not appropriate. Removed from flag.ts, index.js, and the experiment script.
Fixed root cause: .unref() on all timer-based sleeps/waits — added .unref() to timers in:
- retry-fetch.ts (sleep, globalTimeout, cancellation interval)
- session/retry.ts (sleep timer)
- util/timeout.ts (withTimeout timer)
- server.ts (Bun.serve idleTimeout: 0→255)
Added ESLint rules to prevent future leaks:
- eslint-plugin-promise with catch-or-return and no-nesting rules
- Custom no-restricted-syntax rules warning against process.on('SIGINT'/'SIGTERM') — prefer process.once()

Test Results

Unit tests (all pass): log-lazy, json-standard-unit, model-validation, session-usage, retry-state, retry-fetch, safe-json-serialization, process-name, model-not-supported, model-fallback, mcp-timeout, verbose-http-logging

Integration tests: Tests spawning child processes time out due to minimax-m2.5-free rate limits (retry-after: ~9 hours). This is expected behavior — the API model is heavily rate-limited.

Process leak verification: Confirmed no orphaned processes after killing test agent — .unref() fixes are working.

CI: ✅ All checks passing

Note on remaining dangling processes

The 11 existing dangling agent processes from the previous session are from before these fixes. The .unref() changes ensure that new processes won't linger as zombies when their parent exits or they're killed. The existing zombies need to be manually cleaned up with pkill -f "bun.*src/index.js".

Work session completed. Ready for review.

konard · 2026-03-10T15:28:10Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $12.786305
Calculated by Anthropic: $8.708476 USD
Difference: $-4.077830 (-31.89%)
📎 Log file uploaded as Gist (4739KB)
🔗 View complete solution draft log

Now working session is ended, feel free to review and add any feedback on the solution draft.

konard · 2026-03-10T15:28:16Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

konard · 2026-03-10T19:33:40Z

We should have option --retry-on-rate-limits which should only mean AI completions/chat/responses API rate limits, and yargs will allow us to use --no-retry-on-rate-limits auto-generated option, we should use it for all integration tests. And we should write in docs or in tests or in configuration, that we don't ever run integration tests in bulk, so it will be impossible to use bun run test to run multiple integration test - may be just one with hi should be there, but not all of them.

konard · 2026-03-10T19:34:26Z

🤖 AI Work Session Started

Starting automated work session at 2026-03-10T19:34:24.867Z

The PR has been converted to draft mode while work is in progress.

This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback.

…213) - Add --retry-on-rate-limits boolean flag (default: true) to yargs and Flag Use --no-retry-on-rate-limits to fail fast on HTTP 429 instead of waiting This is the correct behavior for integration tests where waiting hours for rate limits would block CI - Move all integration tests from tests/ to tests/integration/ Prevents accidental bulk running of API-heavy tests with bun test Integration tests must be run one at a time to respect rate limits - Update package.json test script to only run unit tests by default: bun test -> runs tests/*.test.{js,ts} (unit tests only, no API calls) bun run test:integration -> runs tests/integration/basic.test.js (one hi test) - Add --no-retry-on-rate-limits to all integration test invocations so they fail fast instead of hanging when rate limited - Add unit tests for RETRY_ON_RATE_LIMITS flag behavior in retry-fetch.test.ts - Update bunfig.toml and TESTING.md to document the new test structure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard · 2026-03-10T19:54:01Z

🤖 Work Session Summary

Addressed all feedback from the latest review comment:

Changes Made

Added --retry-on-rate-limits flag (yargs boolean, auto-generates --no-retry-on-rate-limits):
- Wired to Flag.RETRY_ON_RATE_LIMITS in flag.ts with setRetryOnRateLimits() setter
- In retry-fetch.ts: when flag is false, HTTP 429 responses return immediately without waiting/retrying
- Default: true (existing behavior unchanged for production use)
- Use --no-retry-on-rate-limits in tests to fail fast instead of waiting hours for rate limits
Moved all integration tests to js/tests/integration/:
- 34 integration tests moved out of tests/ root into tests/integration/ subdirectory
- bun test (via test script in package.json) now runs only unit tests: bun test tests/*.test.js tests/*.test.ts
- Added test:integration script that runs tests/integration/basic.test.js (the hi test)
Added --no-retry-on-rate-limits to all integration test invocations:
- All bun run src/index.js calls in tests/integration/*.test.js now include this flag
- Tests fail fast on rate limit instead of blocking indefinitely
Added unit tests for RETRY_ON_RATE_LIMITS flag in retry-fetch.test.ts:
- Tests that flag=false returns 429 immediately without retrying
- Tests that flag=true (default) still retries normally
- Tests that default value is true

Test Results

Unit tests (245 pass, 0 fail): bun test tests/*.test.js tests/*.test.ts

All previous unit tests plus 3 new tests for RETRY_ON_RATE_LIMITS flag

Lint: ESLint 0 errors, Prettier all formatted

How Integration Tests Work Now

# Default bun test = fast unit tests only (no API calls)
bun test

# Run the one "hi" integration test
bun run test:integration

# Run any specific integration test one at a time
bun test tests/integration/basic.test.js

Work session completed. Ready for review.

konard · 2026-03-10T19:54:44Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $6.213419
Calculated by Anthropic: $5.279143 USD
Difference: $-0.934276 (-15.04%)
📎 Log file uploaded as Gist (3963KB)
🔗 View complete solution draft log

Now working session is ended, feel free to review and add any feedback on the solution draft.

konard · 2026-03-10T19:55:55Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

Initial commit with task details

ddac7f1

Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #213

konard self-assigned this Mar 9, 2026

konard and others added 3 commits March 9, 2026 17:59

konard changed the title ~~[WIP] Do we have any idea why so many agent processes are keep executing after we were working on https://github.com/link-assistant/agent/pull/212?~~ fix: prevent agent process leaks with lifetime watchdog and event loop fixes (#213) Mar 9, 2026

chore: remove gitkeep file

7d27fe8

konard marked this pull request as ready for review March 9, 2026 18:18

konard marked this pull request as draft March 10, 2026 15:01

konard and others added 2 commits March 10, 2026 15:10

chore: add experiment to verify process exit after rate limit (#213)

86b7a34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard changed the title ~~fix: prevent agent process leaks with lifetime watchdog and event loop fixes (#213)~~ fix: prevent agent process leaks with event loop fixes and ESLint rules (#213) Mar 10, 2026

konard marked this pull request as ready for review March 10, 2026 15:26

konard marked this pull request as draft March 10, 2026 19:34

konard marked this pull request as ready for review March 10, 2026 19:53

konard merged commit 83e0523 into main Mar 11, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent agent process leaks with event loop fixes and ESLint rules (#213)#214

fix: prevent agent process leaks with event loop fixes and ESLint rules (#213)#214
konard merged 8 commits intomainfrom
issue-213-e4c047fc16a3

konard commented Mar 9, 2026 •

edited

Loading

Uh oh!

konard commented Mar 9, 2026

Uh oh!

konard commented Mar 9, 2026

Uh oh!

konard commented Mar 10, 2026 •

edited

Loading

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

konard commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause Analysis

Multiple contributing factors:

Evidence

Changes

New: --retry-on-rate-limits Flag

ESLint Rules Added

Test Structure Change

Test Results

Unit tests (all pass, 245 tests):

Lint:

Key Decision

Related Issues

Uh oh!

konard commented Mar 9, 2026

🤖 Solution Draft Log

Uh oh!

konard commented Mar 9, 2026

✅ Ready to merge

Uh oh!

konard commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

🤖 Work Session Summary

Changes Made

Test Results

Note on remaining dangling processes

Uh oh!

konard commented Mar 10, 2026

🤖 Solution Draft Log

Uh oh!

konard commented Mar 10, 2026

✅ Ready to merge

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

Uh oh!

konard commented Mar 10, 2026

🤖 Work Session Summary

Changes Made

Test Results

How Integration Tests Work Now

Uh oh!

konard commented Mar 10, 2026

🤖 Solution Draft Log

Uh oh!

konard commented Mar 10, 2026

✅ Ready to merge

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

konard commented Mar 9, 2026 •

edited

Loading

New: `--retry-on-rate-limits` Flag

konard commented Mar 10, 2026 •

edited

Loading