Apparent RSS memory leak in repeated merge_insert caused by glibc per-thread arena fragmentation #7242

@xloya

Title

merge_insert in a loop causes unbounded RSS growth on Linux due to glibc arena fragmentation.

Summary

When calling ds.merge_insert(...).when_matched_update_all().execute(table) repeatedly in a long-lived process (e.g., a Ray driver doing ~1000+ flushes), the process RSS grows monotonically at ~5–7 MB per iteration, eventually reaching 40+ GB and triggering OOM on our production cluster. This looks like a memory leak in lance's Rust layer but is actually caused by glibc's per-thread arena fragmentation.

Environment

  • OS: Linux (glibc 2.35+)
  • pylance: 7.0.0
  • Python: 3.12
  • Workload: Long-running single process doing 1000+ sequential merge_insert commits against a 14M-row dataset (1518 fragments)

Root Cause Analysis

pylance links against the system glibc malloc (no custom allocator). The tokio multi-thread runtime spawns N worker threads. Under glibc, each thread that calls malloc gets its own arena (up to 8 × nCPU arenas by default).
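As a rough illustration of how many arenas this allows, the documented default cap can be computed for the current machine. This is a minimal sketch: the helper name default_arena_cap is hypothetical, and the 2 × cores figure for 32-bit builds comes from the mallopt(3) documentation rather than from this issue.

```python
import os
import struct


def default_arena_cap(n_cpus: int, pointer_bits: int) -> int:
    """Documented glibc default arena limit: 8 x cores on 64-bit, 2 x cores on 32-bit."""
    per_core = 8 if pointer_bits == 64 else 2
    return per_core * n_cpus


# Pointer width of the running interpreter, in bits.
bits = struct.calcsize("P") * 8
cpus = os.cpu_count() or 1
print(f"{cpus} CPUs, {bits}-bit: up to {default_arena_cap(cpus, bits)} arenas by default")
```

On a 16-core 64-bit host this gives a cap of 128 arenas, so a runtime with many worker threads can end up with one arena per thread.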

Each merge_insert execution triggers DataFusion's hash-join and fragment-write logic on tokio worker threads, which allocates several MB of temporary buffers. After execution, Rust properly free()s all of it, but glibc does not return freed pages from per-thread arenas to the OS. malloc_trim(0) appears to trim only arena 0 (main thread), leaving the other arenas bloated.
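One way to check how much resident memory a malloc_trim(0) call actually gives back in a given process is to read RssAnon before and after the call. This is a minimal sketch, assuming Linux with glibc; the helper names are hypothetical, and the function returns None on platforms where /proc or libc.so.6 is unavailable.

```python
import ctypes


def rss_anon_kb(status_text):
    """Extract the RssAnon value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("RssAnon:"):
            return int(line.split()[1])
    return None


def rss_around_trim():
    """Return (rss_before_kb, rss_after_kb) around malloc_trim(0), or None if unsupported."""
    try:
        libc = ctypes.CDLL("libc.so.6")
        malloc_trim = libc.malloc_trim
        with open("/proc/self/status") as f:
            before = rss_anon_kb(f.read())
        malloc_trim.argtypes = [ctypes.c_size_t]
        malloc_trim.restype = ctypes.c_int
        malloc_trim(0)
        with open("/proc/self/status") as f:
            after = rss_anon_kb(f.read())
    except (OSError, AttributeError):
        return None
    return before, after


result = rss_around_trim()
if result is None:
    print("malloc_trim or /proc/self/status not available on this platform")
else:
    print(f"RssAnon before trim: {result[0]} kB, after trim: {result[1]} kB")
```

Calling rss_around_trim() right after a batch of merge_insert commits would show whether trimming recovers the memory held by the worker-thread arenas.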

Evidence (using mallinfo2 to decompose the heap)

MALLOC_ARENA_MAX unset (default cap of 8 × nCPU):

iter   RssAnon (MB)   in_use (MB)   free_not_returned (MB)
   1             79            20                       39
  10            129            23                      144   ← fragmentation!
  30            218            27                      276   ← fragmentation!
  50            297            29                      395   ← fragmentation!

SUMMARY: Actually in use: 29 MB | Free but not returned: 395 MB | Ratio: 13.7x

With MALLOC_ARENA_MAX=2:

iter   RssAnon (MB)   in_use (MB)   free_not_returned (MB)
   1             75            19                       21
  10             74            23                       24
  30             80            28                       23
  50             83            30                       24

SUMMARY: Actually in use: 30 MB | Free but not returned: 24 MB | Ratio: 0.8x

The in_use column is nearly identical in both cases (~30 MB), proving lance's Rust code correctly frees memory. The difference is entirely in how glibc manages the freed pages.
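The two tables can also be compared by average RssAnon growth per iteration. The sketch below only does arithmetic on the RssAnon values reported above; the helper name mb_per_iteration is hypothetical.

```python
def mb_per_iteration(samples):
    """Average RssAnon growth in MB per iteration between the first and last sample.

    samples is a list of (iteration, rss_anon_mb) pairs in increasing iteration order.
    """
    (first_iter, first_rss), (last_iter, last_rss) = samples[0], samples[-1]
    return (last_rss - first_rss) / (last_iter - first_iter)


# (iter, RssAnon MB) pairs copied from the two tables above.
default_arenas = [(1, 79), (10, 129), (30, 218), (50, 297)]
capped_arenas = [(1, 75), (10, 74), (30, 80), (50, 83)]

print(f"default arenas:     {mb_per_iteration(default_arenas):.2f} MB/iter")  # 4.45 MB/iter
print(f"MALLOC_ARENA_MAX=2: {mb_per_iteration(capped_arenas):.2f} MB/iter")   # 0.16 MB/iter
```

For this small repro the default configuration grows by roughly 4.4 MB per iteration, while the capped configuration stays nearly flat.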

Reproduction Script

"""Reproduce: run with and without MALLOC_ARENA_MAX=2 on Linux."""
import ctypes, gc, os, uuid
import lance, pyarrow as pa

uri = "/tmp/lance_arena_repro"
n_rows = 50_000
lance.write_dataset(
    pa.table({
        "uid": pa.array([str(uuid.uuid4()) for _ in range(n_rows)], type=pa.utf8()),
        "value": pa.array(["x"] * n_rows, type=pa.utf8()),
    }),
    uri, mode="overwrite", max_rows_per_file=10_000,
)

ds = lance.dataset(uri)
uids = ds.scanner(columns=["uid"], limit=10_000).to_table().column("uid").to_pylist()
key_field, val_field = ds.schema.field("uid"), ds.schema.field("value")

libc = ctypes.CDLL("libc.so.6")

# Layout of glibc's struct mallinfo2 (glibc 2.33+): ten size_t fields, in this order.
class Mallinfo2(ctypes.Structure):
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]
libc.mallinfo2.restype = Mallinfo2

pid = os.getpid()
for i in range(1, 51):
    tbl = pa.table({"uid": pa.array(uids, type=key_field.type),
                    "value": pa.array(["x"] * len(uids), type=val_field.type)})
    ds.merge_insert("uid").when_matched_update_all().execute(tbl)
    del tbl; gc.collect()
    if i % 10 == 0:
        with open(f"/proc/{pid}/status") as f:
            rss = [l for l in f if l.startswith("RssAnon:")][0].split()[1]
        info = libc.mallinfo2()
        print(f"iter={i}  RssAnon={int(rss)//1024}MB  "
              f"in_use={info.uordblks//1048576}MB  "
              f"free_not_returned={info.fordblks//1048576}MB")

Proposed Fix

One option is to set a #[global_allocator] in python/src/lib.rs to use jemalloc (or mimalloc). These allocators return freed pages to the OS via madvise (e.g., MADV_FREE) and don't suffer from per-thread arena fragmentation.

Workaround (no code change)

Users can set the environment variable before launching their process:

export MALLOC_ARENA_MAX=2

This limits glibc to at most 2 arenas shared across all threads, making malloc_trim effective. The performance impact on IO-bound lance workloads is negligible.
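When the process is started from Python rather than a shell (for example by a launcher script), the same setting can be passed through the child's environment. This is a minimal sketch with hypothetical helper names. As far as we know, glibc reads MALLOC_ARENA_MAX when the process initializes, so changing os.environ inside an already running interpreter is not expected to affect that interpreter.

```python
import os
import subprocess
import sys


def env_with_arena_cap(base_env, arena_max=2):
    """Return a copy of base_env with MALLOC_ARENA_MAX set to arena_max."""
    env = dict(base_env)
    env["MALLOC_ARENA_MAX"] = str(arena_max)
    return env


def run_with_arena_cap(argv, arena_max=2, **kwargs):
    """Run argv as a child process whose environment caps the glibc arena count."""
    return subprocess.run(argv, env=env_with_arena_cap(os.environ, arena_max), **kwargs)


# Demonstration: the child process sees the variable from its very first instruction.
child = run_with_arena_cap(
    [sys.executable, "-c", "import os; print(os.environ['MALLOC_ARENA_MAX'])"],
    capture_output=True,
    text=True,
)
print(child.stdout.strip())  # prints "2"
```

Replacing the demonstration command with the real ingestion script would launch it with the workaround applied.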

Impact

Any long-running process that does repeated merge_insert / update / add_columns commits (e.g., a streaming ingestion pipeline, a batch upsert job, a Ray driver coordinating filter/apply across thousands of fragments) will see RSS grow without bound until OOM. In our case: 42 GB RSS after ~1400 flushes on a 44 GB machine.
