Title
merge_insert in a loop causes unbounded RSS growth on Linux due to glibc arena fragmentation.
Summary
When calling ds.merge_insert(...).when_matched_update_all().execute(table) repeatedly in a long-lived process (e.g., a Ray driver doing ~1000+ flushes), the process RSS grows monotonically at ~5–7 MB per iteration, eventually reaching 40+ GB and triggering OOM on our production cluster. This looks like a memory leak in lance's Rust layer but is actually caused by glibc's per-thread arena fragmentation.
Environment
- OS: Linux (glibc 2.35+)
- pylance: 7.0.0
- Python: 3.12
- Workload: Long-running single process doing 1000+ sequential merge_insert commits against a 14M-row dataset (1518 fragments)
Root Cause Analysis
pylance links against the system glibc malloc (no custom allocator). The tokio multi-thread runtime spawns N worker threads. Under glibc, each thread that calls malloc gets its own arena (up to 8 × nCPU arenas by default).
Each merge_insert execution triggers DataFusion's hash-join and fragment-write logic on tokio worker threads, which allocates several MB of temporary buffers. After execution, Rust properly free()s all of it — but glibc does not return freed pages from per-thread arenas to the OS. malloc_trim(0) only trims arena 0 (main thread), leaving the other arenas bloated.
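To make the arena arithmetic concrete, here is a minimal sketch of how the cap works out on a given machine. The function names are hypothetical (not part of lance or glibc), and `os.cpu_count()` is only an approximation of the core count glibc uses internally.

```python
import os
import struct


def default_arena_limit(n_cpus: int, pointer_bits: int = 64) -> int:
    """glibc's default cap: 8 arenas per core on 64-bit, 2 per core on 32-bit."""
    per_core = 8 if pointer_bits == 64 else 2
    return per_core * n_cpus


def effective_arena_limit() -> int:
    """Approximate the arena cap for the current environment."""
    env = os.environ.get("MALLOC_ARENA_MAX", "")
    if env.isdigit() and int(env) > 0:
        # An explicit MALLOC_ARENA_MAX overrides the per-core default.
        return int(env)
    bits = struct.calcsize("P") * 8  # pointer width of this interpreter
    return default_arena_limit(os.cpu_count() or 1, bits)
```

On a 16-core 64-bit host with no override this gives 128 possible arenas, each of which can hold on to its own freed pages.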
Evidence (using mallinfo2 to decompose the heap)
MALLOC_ARENA_MAX unset (glibc default, up to 8 × nCPU arenas):
| iter | RssAnon (MB) | in_use (MB) | free_not_returned (MB) |
|------|--------------|-------------|------------------------|
| 1 | 79 | 20 | 39 |
| 10 | 129 | 23 | 144 ← fragmentation! |
| 30 | 218 | 27 | 276 ← fragmentation! |
| 50 | 297 | 29 | 395 ← fragmentation! |
SUMMARY: Actually in use: 29 MB | Free but not returned: 395 MB | Ratio: 13.7x
With MALLOC_ARENA_MAX=2:
| iter | RssAnon (MB) | in_use (MB) | free_not_returned (MB) |
|------|--------------|-------------|------------------------|
| 1 | 75 | 19 | 21 |
| 10 | 74 | 23 | 24 |
| 30 | 80 | 28 | 23 |
| 50 | 83 | 30 | 24 |
SUMMARY: Actually in use: 30 MB | Free but not returned: 24 MB | Ratio: 0.8x
The in_use column is nearly identical in both cases (~30 MB), proving lance's Rust code correctly frees memory. The difference is entirely in how glibc manages the freed pages.
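For anyone who wants to track the same split in their own long-running job, a small helper along these lines could be used. This is a sketch only: both function names are hypothetical, and the inputs are assumed to come from /proc/<pid>/status and mallinfo2 as in the tables above.

```python
def parse_rss_anon_kb(status_text: str) -> int:
    """Return the RssAnon value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("RssAnon:"):
            # Line looks like "RssAnon:      304128 kB".
            return int(line.split()[1])
    raise ValueError("RssAnon not found; is this a Linux /proc status file?")


def fragmentation_ratio(in_use_mb: float, free_not_returned_mb: float) -> float:
    """Heap that is free but still held by the allocator, relative to heap in use."""
    if in_use_mb <= 0:
        raise ValueError("in_use_mb must be positive")
    return free_not_returned_mb / in_use_mb
```

With the final MALLOC_ARENA_MAX=2 row, `fragmentation_ratio(30, 24)` gives 0.8. A ratio that keeps climbing across iterations while in_use stays flat points at allocator retention rather than a leak.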
Reproduction Script
"""Reproduce: run with and without MALLOC_ARENA_MAX=2 on Linux."""
import ctypes, gc, os, uuid
import lance, pyarrow as pa

# Build a small dataset split across several fragments.
uri = "/tmp/lance_arena_repro"
n_rows = 50_000
lance.write_dataset(
    pa.table({
        "uid": pa.array([str(uuid.uuid4()) for _ in range(n_rows)], type=pa.utf8()),
        "value": pa.array(["x"] * n_rows, type=pa.utf8()),
    }),
    uri, mode="overwrite", max_rows_per_file=10_000,
)
ds = lance.dataset(uri)
uids = ds.scanner(columns=["uid"], limit=10_000).to_table().column("uid").to_pylist()
key_field, val_field = ds.schema.field("uid"), ds.schema.field("value")

libc = ctypes.CDLL("libc.so.6")

class Mallinfo2(ctypes.Structure):
    # Mirrors glibc's struct mallinfo2 (glibc 2.33+): ten size_t fields.
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

libc.mallinfo2.restype = Mallinfo2
pid = os.getpid()

for i in range(1, 51):
    # Upsert the same 10k keys on every iteration.
    tbl = pa.table({"uid": pa.array(uids, type=key_field.type),
                    "value": pa.array(["x"] * len(uids), type=val_field.type)})
    ds.merge_insert("uid").when_matched_update_all().execute(tbl)
    del tbl; gc.collect()
    if i % 10 == 0:
        # RssAnon is reported in kB; uordblks/fordblks are in bytes.
        with open(f"/proc/{pid}/status") as f:
            rss = [l for l in f if l.startswith("RssAnon:")][0].split()[1]
        info = libc.mallinfo2()
        print(f"iter={i} RssAnon={int(rss)//1024}MB "
              f"in_use={info.uordblks//1048576}MB "
              f"free_not_returned={info.fordblks//1048576}MB")
Proposed Fix
Consider setting a #[global_allocator] in python/src/lib.rs to use jemalloc (or mimalloc). These allocators return freed pages to the OS more eagerly (for example via madvise(MADV_FREE)) and do not suffer from glibc's per-thread arena fragmentation.
Workaround (no code change)
Users can set the environment variable before launching their process:
export MALLOC_ARENA_MAX=2
This caps glibc at 2 arenas shared across all threads, making malloc_trim effective. The performance impact on IO-bound lance workloads is negligible.
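glibc reads MALLOC_ARENA_MAX when the process starts, so assigning to os.environ inside an already running Python process should not be expected to change the current process. When the job is launched from Python rather than a shell, one option is to pass the variable to the child process. A minimal sketch, with a hypothetical helper name:

```python
import os
import subprocess
import sys


def run_with_arena_cap(argv, arena_max=2):
    """Run argv as a child process with MALLOC_ARENA_MAX set in its environment."""
    # Copy the parent's environment and add the cap for the child only.
    env = dict(os.environ, MALLOC_ARENA_MAX=str(arena_max))
    return subprocess.run(argv, env=env, check=True,
                          capture_output=True, text=True)
```

For example, `run_with_arena_cap([sys.executable, "ingest.py"])` would start a (hypothetical) ingest.py with the cap applied from its first allocation.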
Impact
Any long-running process that does repeated merge_insert / update / add_columns commits (e.g., a streaming ingestion pipeline, a batch upsert job, a Ray driver coordinating filter/apply across thousands of fragments) will see RSS grow without bound until OOM. In our case: 42 GB RSS after ~1400 flushes on a 44 GB machine.