Describe the bug
After upgrading to Etherpad 2.7.3, our Etherpad instance consistently hit "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory" and crashed after about 15 minutes. After digging through some heap snapshots, I'm reasonably confident that the source of the problem is session cleanup, which begins on startup and was introduced in da9f5ac. I have since set sessionCleanup to false; memory usage is now consistently low and the process has stayed up for longer than 15 minutes since this config change.
To Reproduce
Steps to reproduce the behavior:
- Run etherpad for many years (I think our DB is over a decade old) to accumulate many sessionstorage:.* database records
- Upgrade to 2.7.3 and set sessionCleanup to true (the default)
- Watch memory usage climb until node and etherpad crash due to reaching the heap limit
I realize that "run etherpad for a decade" isn't a great reproducer, but I'm not super familiar with the database layout or how to artificially create session records. In theory it should be possible to create them artificially, and that should reproduce the problem.
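A rough sketch of how synthetic records might be generated for a reproducer. The helper name makeSessionRecords is hypothetical, and the value shape (an express-session style object with a cookie.expires field) is an assumption that has not been checked against Etherpad's actual schema. Only the sessionstorage: key prefix comes from the heap snapshot described in this report. Writing the records to the database is deliberately left out.

```javascript
// Hypothetical helper: builds synthetic session records for load testing.
// ASSUMPTION: keys look like "sessionstorage:<id>" and values resemble
// express-session objects. Not verified against Etherpad's real schema.
const crypto = require('node:crypto');

function* makeSessionRecords(count, expiredAt = new Date(0)) {
  for (let i = 0; i < count; i++) {
    const sid = crypto.randomBytes(16).toString('hex');
    yield {
      key: `sessionstorage:${sid}`,
      value: JSON.stringify({cookie: {path: '/', expires: expiredAt.toISOString()}}),
    };
  }
}

// Consume lazily so the generator never holds every record in memory at once.
let generated = 0;
for (const record of makeSessionRecords(1000)) {
  if (record.key.startsWith('sessionstorage:')) generated++;
}
console.log(`generated ${generated} synthetic records`);
```

A generator is used so that the reproducer itself does not run into the memory problem it is trying to trigger.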
We run etherpad without authentication (not sure if this impacts session handling in the DB). I suspect that we're affected by #5010, and that the new cleanup routine then attempts to load all of these accumulated records into memory at once, at which point we run out of memory.
Expected behavior
Session cleanup should use a reasonable amount of memory, possibly by paging through the session records rather than loading them all at once. The new cleanup routine does const keys = await DB.findKeys('sessionstorage:*', null); so every matching key is fetched in a single call. Maybe we can limit the number of records returned at one time and work through one reasonably sized batch before proceeding to the next?
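A minimal sketch of what batching could look like, under stated assumptions. listKeysAfter is a hypothetical paged lookup, not an existing Etherpad or database-layer API, and it is backed here by an in-memory Map so the example runs on its own. The expired flag is also a simplification of whatever expiry check the real cleanup performs.

```javascript
// Sketch only. `listKeysAfter` is a HYPOTHETICAL paged lookup; a Map stands
// in for the real database so the example is self-contained.

// Returns up to `limit` keys with the given prefix that sort after `lastKey`.
async function listKeysAfter(store, prefix, lastKey, limit) {
  return [...store.keys()]
    .filter((k) => k.startsWith(prefix) && k > lastKey)
    .sort()
    .slice(0, limit);
}

// Walks the keyspace one batch at a time, so the cleanup loop itself only
// ever holds `batchSize` keys instead of the full key list.
async function cleanupInBatches(store, prefix, batchSize) {
  let lastKey = '';
  let removed = 0;
  for (;;) {
    const keys = await listKeysAfter(store, prefix, lastKey, batchSize);
    if (keys.length === 0) break;
    for (const key of keys) {
      if (store.get(key).expired) {
        store.delete(key);
        removed++;
      }
    }
    // Resume after the last key seen, even if it was just deleted.
    lastKey = keys[keys.length - 1];
  }
  return removed;
}

// Demo data: 25 fake sessions, every second one marked as expired.
const demoStore = new Map();
for (let i = 0; i < 25; i++) {
  demoStore.set(`sessionstorage:${String(i).padStart(4, '0')}`, {expired: i % 2 === 0});
}

cleanupInBatches(demoStore, 'sessionstorage:', 10).then((removed) => {
  console.log(`removed ${removed} expired sessions, ${demoStore.size} remain`);
});
```

Resuming from the last key seen, rather than using a numeric offset, keeps the walk stable while records are being deleted underneath it. In a real database the paged lookup would presumably be a range query with a limit; whether the current database layer can express that is an open question.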
Server (please complete the following information):
- Etherpad version: 2.7.3
- OS: Debian Trixie
- Node.js version (node --version): nodejs_version_info{version="v24.15.0",major="24",minor="15",patch="0"}
- pnpm version (pnpm --version): 11.0.6 (note the bug template asks for npm version but we're using pnpm to match the upstream docker container)
- Is the server free of plugins: No, we have ep_headings2 installed
- Are you using any abstraction IE docker? Yes, we build our own images using node:24-trixie-slim as a base but try to follow the approach used by the upstream Dockerfile.
Additional context
Our database is MariaDB 10.11. I'm not sure whether that makes a difference in how query results are held in memory internally.
Also, the way I tracked this down was to run etherpad with NODE_OPTIONS: "--heapsnapshot-signal=SIGUSR2", grab a heap snapshot shortly after startup, and then grab another about 2 minutes later. Viewing these snapshots in Chrome's developer tools, I could see that the following chain of objects was retaining significantly more memory than at startup: _pool in mysql_db_default -> _allConnections in Pool -> _list in denque -> [$INDEX] in Array -> _command in PoolConnection -> _rows in Query -> [$INDEX] in Array -> [$INDEX] in Array -> key in {key}. Those key values appear to be sessionstorage:.* records. Each of them is listed as 0.1kB in size, but there are enough of them that in aggregate they exhaust the heap.
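For anyone trying to confirm the same growth without opening snapshots, a small sketch that reports heap usage against the configured limit may help. It uses Node's built-in process.memoryUsage() and v8.getHeapStatistics(); the helper name heapReport is made up for illustration and is not part of Etherpad.

```javascript
const v8 = require('node:v8');

// Returns current heap usage and the configured heap limit in MiB,
// plus the fraction of the limit currently in use.
function heapReport() {
  const {heapUsed} = process.memoryUsage();
  const {heap_size_limit: limit} = v8.getHeapStatistics();
  const toMiB = (bytes) => Math.round(bytes / 1024 / 1024);
  return {usedMiB: toMiB(heapUsed), limitMiB: toMiB(limit), ratio: heapUsed / limit};
}

const report = heapReport();
console.log(
  `heap: ${report.usedMiB} MiB of ${report.limitMiB} MiB ` +
  `(${(report.ratio * 100).toFixed(1)}%)`
);
```

Logging this on a timer during startup should show whether usage climbs steadily toward the limit while cleanup runs.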
We are currently running with sessionCleanup set to false. I figure that leaves us no worse off than we were before the upgrade, and it doesn't OOM.