fix: chronic OOM crash loop — heap limit reached at 4GB (595 restarts) #4032

@OneStepAt4time

Bug: Aegis Server OOM Crash Loop (4GB heap limit)

Environment

  • Aegis version: latest develop
  • Node.js: v22.22.1
  • OS: Linux 6.17.0-23-generic (x64)
  • Restart counter: 595 (systemd auto-restart)
  • OOM frequency: 8 crashes in the last hour

Description

The Aegis server process crashes with FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory at ~4GB. systemd restarts it automatically, but the leak is chronic: the process accumulates memory again and hits the same OOM.

Evidence

May 22 13:58:45 node[1435024]: [1435024:0x3bbc3000]   191663 ms: Scavenge 4066.5 (4091.1) -> 4061.1 (4093.3) MB
May 22 13:58:45 node[1435024]: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
May 22 13:58:47 systemd: aegis.service: Main process exited, code=dumped, status=6/ABRT
May 22 13:58:53 systemd: aegis.service: Scheduled restart job, restart counter is at 595.
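
To put a number on the growth rate, here is a minimal sketch that parses a GC trace line like the one above and estimates the average heap growth. It assumes the leading number (191663 ms) is time since process start and that the heap started near zero, so the result is only a rough average. `parseGcLine` and `averageGrowthMbPerSec` are hypothetical helper names, not part of Aegis.

```typescript
// One parsed V8 GC trace line ("<uptime> ms: <kind> <used> (<total>) -> <used> (<total>) MB").
interface GcSample {
  uptimeMs: number;     // assumed: milliseconds since the process started
  kind: string;         // e.g. "Scavenge" or "Mark-Compact"
  usedBeforeMb: number; // used heap before the collection
  usedAfterMb: number;  // used heap after the collection
}

const NUM = "(\\d+(?:\\.\\d+)?)";
const GC_LINE = new RegExp(
  `(\\d+) ms: (\\S+)\\s+${NUM} \\(${NUM}\\) -> ${NUM} \\(${NUM}\\) MB`
);

function parseGcLine(line: string): GcSample | null {
  const m = GC_LINE.exec(line);
  if (m === null) return null;
  return {
    uptimeMs: Number(m[1]),
    kind: String(m[2]),
    usedBeforeMb: Number(m[3]),
    usedAfterMb: Number(m[5]),
  };
}

// Average growth in MB per second, assuming the heap was close to empty at start.
function averageGrowthMbPerSec(sample: GcSample): number {
  return sample.usedBeforeMb / (sample.uptimeMs / 1000);
}

const line =
  "May 22 13:58:45 node[1435024]: [1435024:0x3bbc3000]   191663 ms: " +
  "Scavenge 4066.5 (4091.1) -> 4061.1 (4093.3) MB";
const sample = parseGcLine(line);
if (sample !== null) {
  console.log(sample.kind, averageGrowthMbPerSec(sample).toFixed(1), "MB/s");
}
```

For the line above this works out to roughly 21 MB/s. Note also that the Scavenge reclaimed only about 5 MB (4066.5 -> 4061.1), so nearly all of the heap was still reachable at that point.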

Impact

  • P0/Critical: Cannot use Aegis for development (sessions die on OOM)
  • Session data lost on each crash
  • Telegram topic mappings grow (105-106 restored each restart)
  • 220 total sessions tracked

Steps to Reproduce

  1. Run Aegis server with normal workload
  2. Observe heap growth over ~3 minutes
  3. Crash at ~4GB heap

Suspected Causes

  • Memory leak in ACP session tracking or state store
  • Growing in-memory structures not being GC'd (session maps, transcript caches)
  • Possible leak in ACP local storage JSON parsing/writing
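
If an unbounded session map or transcript cache turns out to be the culprit, one common fix is to cap its size and evict the least recently used entries. The sketch below is illustrative only: `BoundedSessionMap` is a hypothetical class, and nothing here reflects how Aegis actually stores sessions.

```typescript
// Hypothetical size-capped map with least-recently-used eviction.
// Relies on Map preserving insertion order: the first key is the oldest.
class BoundedSessionMap<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (maxEntries < 1) throw new RangeError("maxEntries must be >= 1");
    this.maxEntries = maxEntries;
  }

  get(id: string): V | undefined {
    const value = this.entries.get(id);
    if (value !== undefined) {
      // Re-insert so this entry becomes the most recently used.
      this.entries.delete(id);
      this.entries.set(id, value);
    }
    return value;
  }

  set(id: string, value: V): void {
    this.entries.delete(id);
    this.entries.set(id, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (first key in iteration order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage: keep at most 2 sessions in memory.
const sessions = new BoundedSessionMap<number>(2);
sessions.set("a", 1);
sessions.set("b", 2);
sessions.get("a");    // touch "a" so "b" becomes the eviction candidate
sessions.set("c", 3); // evicts "b"
```

Evicted sessions would still need to be reloadable from disk, so this only helps if the persisted state is the source of truth.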

Immediate Mitigation

  • Short-term workaround: raise the Node.js heap limit with NODE_OPTIONS=--max-old-space-size=8192. This only delays the crash while the leak persists.
  • Root-cause fix: profile memory allocation to find the leak
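
One way to start profiling is to capture a heap snapshot before the process reaches the limit, then inspect the retained objects in Chrome DevTools. The sketch below uses Node's built-in `v8.getHeapStatistics()` and `v8.writeHeapSnapshot()`. `heapFraction` and `startHeapWatch` are hypothetical names, and the threshold and interval defaults are assumptions to tune. Node.js also has a `--heapsnapshot-near-heap-limit` flag that may be simpler to try first.

```typescript
import * as v8 from "node:v8";

// The two fields of v8.getHeapStatistics() that this sketch needs.
interface HeapUsage {
  used_heap_size: number;  // bytes currently in use
  heap_size_limit: number; // bytes available before V8 aborts with OOM
}

// Fraction of the heap limit currently in use (0 = empty, 1 = at the limit).
function heapFraction(stats: HeapUsage): number {
  return stats.used_heap_size / stats.heap_size_limit;
}

// Poll heap usage and write a single snapshot once the threshold is crossed.
// Caution: writeHeapSnapshot() is synchronous, blocks the event loop, and
// needs a lot of extra memory, so a lower threshold keeps the snapshot
// smaller and more likely to finish.
function startHeapWatch(
  threshold = 0.5,
  intervalMs = 5_000,
  onSnapshot: (file: string) => void = (file) => console.log("heap snapshot:", file)
): ReturnType<typeof setInterval> {
  const timer = setInterval(() => {
    if (heapFraction(v8.getHeapStatistics()) >= threshold) {
      clearInterval(timer); // one snapshot per process run
      onSnapshot(v8.writeHeapSnapshot());
    }
  }, intervalMs);
  return timer;
}
```

Calling `startHeapWatch()` once during server startup would be enough. Comparing two snapshots taken a minute apart usually shows which structures are growing.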
