Summary
An archive node (--state-pruning archive) serving sustained JSON-RPC traffic shows unbounded growth of anonymous (non-reclaimable) process memory at a steady ~3–4 GB/hour. The growth is real process memory (anon in memory.stat), not page cache, and it is not bounded by --db-cache or --trie-cache-size. Left running, the node climbs to the host/cgroup memory limit and, with swap disabled, is throttled by memory.high (hundreds of millions of throttle events), at which point RPC stops responding and the node effectively hangs until restarted.
Reproduced on two versions (a Nov-2025 build and the current v3.4.2-415), so this is not a recently introduced regression; it appears inherent to the archive node under RPC load.
Environment
- subtensor: v3.4.2-415 (also reproduced on a Nov-2025 v3.2.9-era build). Note: system_version RPC returns 4.0.0-dev-unknown on both, so it's not a useful version discriminator.
- OS: Ubuntu 24.04.4 LTS
- Host: 16 vCPU, 62 GiB RAM, NVMe (RocksDB backend, db/full)
- Chain: finney mainnet, archive node (~3.7 TB DB)
- Run via: systemd unit, with cgroup limits MemoryHigh=48G / MemoryMax=52G and MemorySwapMax=0.
Launch flags
node-subtensor \
--chain <finney-raw-spec> --base-path <data> \
--state-pruning archive --blocks-pruning archive \
--rpc-external --rpc-cors all --rpc-methods unsafe \
--rpc-port 9944 --port 30333 \
--rpc-max-connections 1000 --no-mdns \
--rpc-max-response-size 256 --rpc-max-request-size 256 \
--in-peers 75 --out-peers 25 \
--prometheus-external --prometheus-port 9615 \
--db-cache 8192 --trie-cache-size 4294967296 \
--runtime-cache-size 4 --max-runtime-instances 8 \
--wasm-execution compiled --wasmtime-instantiation-strategy pooling-copy-on-write
The node serves a steady stream of archive state RPCs from an external client (historical state_getStorage, state_call, state_getReadProof, state_queryStorageAt, etc. against old block hashes).
Observed behaviour
Anonymous memory grows roughly linearly under load and never plateaus:
# /sys/fs/cgroup/.../subtensor.service/memory.stat
anon ≈ 24–39 GB ← real, non-reclaimable
file ≈ 0.1–4.5 GB ← page cache (small)
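One way to capture this anon/file split over time is a small sampler that reads the service's cgroup v2 memory.stat. The sketch below is illustrative only: the cgroup path is an assumption (a typical systemd layout, since the real path is elided above), and the function names are hypothetical.

```python
from pathlib import Path

# Assumed path for a systemd service under cgroup v2; adjust to your unit.
MEMORY_STAT = Path("/sys/fs/cgroup/system.slice/subtensor.service/memory.stat")

GIB = 1024 ** 3


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse the flat 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def anon_and_file_gib(text: str) -> tuple[float, float]:
    """Return (anon, file) in GiB; memory.stat reports both in bytes."""
    stats = parse_memory_stat(text)
    return stats["anon"] / GIB, stats["file"] / GIB


def sample(path: Path = MEMORY_STAT) -> tuple[float, float]:
    """Read the file once and return (anon GiB, file GiB)."""
    return anon_and_file_gib(path.read_text())
```

Calling sample() once a minute and logging the pair with a timestamp would give a time series that separates process-held memory from page cache.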
Growth curve after a fresh restart (netdata, RSS):
t+0min ~10 GB (post-restart)
t+1h ~18 GB
t+2h ~23.5 GB steady creep ≈ +3–4 GB/h, NOT decelerating to zero
... climbs linearly
t+~11h ~48 GB hits MemoryHigh
At the ceiling, with MemorySwapMax=0, the kernel cannot reclaim the (anonymous) memory, so it throttles via memory.high:
# memory.events at the ceiling
high 254531610 ← ~254M throttle events
max 0
oom_kill 0
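The counters in memory.events are cumulative, so a single snapshot cannot tell whether throttling is happening right now or happened hours ago. A minimal sketch of comparing two snapshots taken some seconds apart follows; the function names are hypothetical, and reading the file itself is left to the caller.

```python
def parse_events(text: str) -> dict[str, int]:
    """Parse the flat 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            events[key] = int(value)
    return events


def events_delta(before: str, after: str) -> dict[str, int]:
    """Per-counter increase between two snapshots of memory.events."""
    old, new = parse_events(before), parse_events(after)
    return {key: new[key] - old.get(key, 0) for key in new}


def is_throttling(before: str, after: str) -> bool:
    """True if the 'high' counter advanced between the two snapshots."""
    return events_delta(before, after).get("high", 0) > 0
```

A rising high delta alongside flat max and oom_kill counters matches the pattern described here: the cgroup is being held at memory.high without ever being killed.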
The node spins in reclaim, RPC latency explodes, and the system_health RPC eventually times out; the node is effectively hung until restarted. (max and oom_kill stay at 0 because the cgroup is throttled at memory.high and never reaches MemoryMax, so the OOM killer is never invoked.)
Key points:
- The leaked memory is anon, not page cache, so it is genuinely held by the process and cannot be reclaimed.
- It vastly exceeds the configured caches (8 GiB db-cache + 4 GiB trie-cache = 12 GiB, but anon reaches 24–39 GiB and keeps climbing).
- Reducing --db-cache from 8192 to 4096 did NOT reduce the observed footprint and did not stop the climb, which indicates the growth is not in the configured database cache.
- Growth correlates with RPC query load; an idle/lite node does not exhibit it at the same rate.
- A restart drops it back to ~10 GB and the cycle repeats.
Steps to reproduce
- Run an archive node (--state-pruning archive) on finney with RocksDB.
- Subject it to sustained archive-state JSON-RPC queries against historical block hashes (e.g. state_getReadProof / state_call / state_getStorage at old blocks), as a high-traffic archive RPC provider would.
- Watch anon in the service's cgroup memory.stat (or RSS) over several hours.
- Observe a steady ~3–4 GB/h climb with no plateau, until the host/cgroup limit is reached.
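The query step could be approximated with a small client like the one below. This is not the client used for the report, only a sketch under stated assumptions: an HTTP JSON-RPC endpoint on the local --rpc-port, and the well-known ":code" storage key as a placeholder. RPC_URL, PLACEHOLDER_KEY and the function names are hypothetical; substitute the keys and methods that match the real workload.

```python
import json
import urllib.request

# Assumptions for illustration: a local HTTP JSON-RPC endpoint on the
# --rpc-port from the launch flags, and the well-known ':code' storage key
# as a placeholder (its value is the whole runtime blob, so replies are large).
RPC_URL = "http://127.0.0.1:9944"
PLACEHOLDER_KEY = "0x3a636f6465"  # hex of the ASCII string ":code"


def rpc_payload(method: str, params: list, request_id: int) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def call(method: str, params: list, request_id: int = 1, url: str = RPC_URL):
    """POST one request and return its 'result' field (raises on RPC error)."""
    body = json.dumps(rpc_payload(method, params, request_id)).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


def query_historical_state(block_numbers, storage_key: str = PLACEHOLDER_KEY):
    """For each block number, resolve its hash and read storage at that block."""
    for number in block_numbers:
        block_hash = call("chain_getBlockHash", [number])
        yield number, call("state_getStorage", [storage_key, block_hash])
```

Iterating query_historical_state over a range of old block numbers from several worker processes, while sampling anon in memory.stat, would show whether the climb tracks query volume.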
Expected behaviour
Steady-state memory should plateau (bounded by the configured caches + a stable working set) rather than growing unbounded under continuous RPC load.
Impact
On a RAM-constrained host this forces a periodic restart treadmill (every ~6–8 h) to avoid the node hanging itself at the memory ceiling. For archive RPC providers this means recurring downtime and degraded tail latency as the node approaches the limit.
Current workaround
Scheduled restart every ~8 h (before the node reaches the throttle ceiling). This is a band-aid, not a fix.
Questions for maintainers
- Is unbounded anon growth under archive RPC load a known issue?
- Is it related to the trie/state cache not honouring --trie-cache-size under archive queries, the wasmtime pooling allocator, RPC subscription/connection buffers, or something else?
- Is there a flag to bound the per-process memory under archive RPC load that we've missed?
Happy to provide netdata exports, memory.stat snapshots over time, heaptrack/massif profiles, or an RPC query sample if useful.