Apparent RSS memory leak in repeated merge_insert caused by glibc per-thread arena fragmentation

## Title

`merge_insert` in a loop causes unbounded RSS growth on Linux due to glibc arena fragmentation.

## Summary

When calling `ds.merge_insert(...).when_matched_update_all().execute(table)` repeatedly in a long-lived process (e.g., a Ray driver doing ~1000+ flushes), the process RSS grows monotonically at ~5–7 MB per iteration, eventually reaching 40+ GB and triggering OOM on our production cluster. This looks like a memory leak in lance's Rust layer but is actually caused by glibc's per-thread arena fragmentation.

## Environment

- **OS:** Linux (glibc 2.35+)
- **pylance:** 7.0.0
- **Python:** 3.12
- **Workload:** Long-running single process doing 1000+ sequential `merge_insert` commits against a 14M-row dataset (1518 fragments)

## Root Cause Analysis

pylance links against the system glibc malloc (no custom allocator). The tokio multi-thread runtime spawns N worker threads. Under glibc, each thread that calls `malloc` gets its own arena (up to 8 × nCPU arenas by default).

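Each `merge_insert` execution triggers DataFusion's hash-join and fragment-write logic on tokio worker threads, which allocates several MB of temporary buffers. After execution, Rust properly `free()`s all of it, but glibc does not automatically return freed pages from per-thread arenas to the OS: it only shrinks an arena when the free space sits at the top of its heap, so free chunks interleaved with live allocations stay resident. (Since glibc 2.8, `malloc_trim(0)` can release such pages from all arenas, but only when it is called explicitly.) The result is many bloated per-thread arenas.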
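To put the default arena cap in perspective, here is a minimal sketch that computes it for the current machine. The helper names `default_arena_cap` and `effective_arena_cap` are hypothetical, and the code only mirrors glibc's documented default (8 arenas per core on 64-bit, 2 per core on 32-bit); it does not query glibc itself.

```python
import os
import struct


def default_arena_cap(n_cpus=None):
    """Approximate glibc's default arena cap: 8 per core (64-bit) or 2 per core (32-bit)."""
    n_cpus = n_cpus or os.cpu_count() or 1
    per_core = 8 if struct.calcsize("P") == 8 else 2  # pointer size as a proxy for word size
    return per_core * n_cpus


def effective_arena_cap(env=None, n_cpus=None):
    """Return MALLOC_ARENA_MAX if it is set to a positive integer, else the default cap."""
    env = os.environ if env is None else env
    raw = env.get("MALLOC_ARENA_MAX", "")
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return default_arena_cap(n_cpus)


if __name__ == "__main__":
    print(f"cores={os.cpu_count()}  arena cap={effective_arena_cap()}")
```

On a machine with many cores the cap is large enough that every tokio worker thread can end up with its own arena.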

### Evidence (using `mallinfo2` to decompose the heap)

**`MALLOC_ARENA_MAX` unset (default, up to 8 × nCPU arenas):**

| iter | RssAnon (MB) | in_use (MB) | free_not_returned (MB) |
|------|--------------|-------------|------------------------|
| 1    | 79           | 20          | 39                     |
| 10   | 129          | 23          | 144  ← fragmentation!  |
| 30   | 218          | 27          | 276  ← fragmentation!  |
| 50   | 297          | 29          | 395  ← fragmentation!  |

> **SUMMARY:** Actually in use: 29 MB | Free but not returned: 395 MB | Ratio: 13.7x

**With `MALLOC_ARENA_MAX=2`:**

| iter | RssAnon (MB) | in_use (MB) | free_not_returned (MB) |
|------|--------------|-------------|------------------------|
| 1    | 75           | 19          | 21                     |
| 10   | 74           | 23          | 24                     |
| 30   | 80           | 28          | 23                     |
| 50   | 83           | 30          | 24                     |

> **SUMMARY:** Actually in use: 30 MB | Free but not returned: 24 MB | Ratio: 0.8x

The `in_use` column is nearly identical in both cases (~30 MB at iteration 50), which indicates that lance's Rust code frees its allocations correctly. The difference lies entirely in how glibc manages the freed pages.
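A complementary check is to call `malloc_trim(0)` explicitly and compare `RssAnon` before and after: if RSS drops sharply, the growth was reclaimable free heap rather than live allocations. This is a minimal sketch, assuming Linux with glibc; the helper names `rss_anon_kb` and `trim_heap` are hypothetical, and both return `None` where `/proc` or `malloc_trim` is unavailable.

```python
import ctypes


def rss_anon_kb(status_path="/proc/self/status"):
    """Return RssAnon in kB from /proc, or None if it cannot be read."""
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("RssAnon:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def trim_heap():
    """Call glibc's malloc_trim(0); returns 1 if memory was released, 0 if not, None if unavailable."""
    try:
        trim = ctypes.CDLL("libc.so.6").malloc_trim
    except (OSError, AttributeError):
        return None
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return trim(0)


if __name__ == "__main__":
    before = rss_anon_kb()
    released = trim_heap()
    after = rss_anon_kb()
    print(f"RssAnon before={before} kB  after={after} kB  malloc_trim returned {released}")
```

Calling these two helpers inside the reproduction loop, right after `gc.collect()`, would show how much of the per-iteration growth is reclaimable.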

## Reproduction Script

```python
"""Reproduce: run with and without MALLOC_ARENA_MAX=2 on Linux."""
import ctypes, gc, os, uuid
import lance, pyarrow as pa

uri = "/tmp/lance_arena_repro"
n_rows = 50_000
lance.write_dataset(
    pa.table({
        "uid": pa.array([str(uuid.uuid4()) for _ in range(n_rows)], type=pa.utf8()),
        "value": pa.array(["x"] * n_rows, type=pa.utf8()),
    }),
    uri, mode="overwrite", max_rows_per_file=10_000,
)

ds = lance.dataset(uri)
uids = ds.scanner(columns=["uid"], limit=10_000).to_table().column("uid").to_pylist()
key_field, val_field = ds.schema.field("uid"), ds.schema.field("value")

libc = ctypes.CDLL("libc.so.6")

class Mallinfo2(ctypes.Structure):
    # Mirrors glibc's `struct mallinfo2`: ten size_t fields, in declaration order.
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]
libc.mallinfo2.argtypes = []
libc.mallinfo2.restype = Mallinfo2

pid = os.getpid()
for i in range(1, 51):
    tbl = pa.table({"uid": pa.array(uids, type=key_field.type),
                    "value": pa.array(["x"] * len(uids), type=val_field.type)})
    ds.merge_insert("uid").when_matched_update_all().execute(tbl)
    del tbl; gc.collect()
    if i % 10 == 0:
        with open(f"/proc/{pid}/status") as f:
            rss = [l for l in f if l.startswith("RssAnon:")][0].split()[1]
        info = libc.mallinfo2()
        print(f"iter={i}  RssAnon={int(rss)//1024}MB  "
              f"in_use={info.uordblks//1048576}MB  "
              f"free_not_returned={info.fordblks//1048576}MB")
```

## Proposed Fix

Consider setting a `#[global_allocator]` in `python/src/lib.rs` to use jemalloc (or mimalloc). Both proactively return freed pages to the OS via `madvise` and are far less prone to per-thread arena fragmentation.

## Workaround (no code change)

Users can set the environment variable before launching their process:

```bash
export MALLOC_ARENA_MAX=2
```

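This caps glibc at 2 arenas shared across all threads, so freed chunks are reused by later allocations instead of accumulating in many per-thread arenas. The performance impact should be negligible for IO-bound lance workloads, although allocation-heavy multi-threaded code may see more allocator lock contention.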
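glibc reads `MALLOC_ARENA_MAX` when the process starts, so assigning to `os.environ` inside an already-running Python process does not change that process's allocator. When the launch environment cannot be edited directly, one option is to start the worker as a child process with the variable set. This is a minimal sketch; `env_with_arena_cap` and `run_with_arena_cap` are hypothetical helper names, not part of lance.

```python
import os
import subprocess
import sys


def env_with_arena_cap(cap=2, base=None):
    """Return a copy of the environment with MALLOC_ARENA_MAX set to `cap`."""
    env = dict(os.environ if base is None else base)
    env["MALLOC_ARENA_MAX"] = str(cap)
    return env


def run_with_arena_cap(argv, cap=2):
    """Run `argv` as a child process whose glibc sees the arena cap from startup."""
    return subprocess.run(
        argv, env=env_with_arena_cap(cap), check=True, capture_output=True, text=True
    )


if __name__ == "__main__":
    # The child simply echoes the variable to confirm it was inherited.
    child = [sys.executable, "-c", "import os; print(os.environ['MALLOC_ARENA_MAX'])"]
    print(run_with_arena_cap(child).stdout.strip())
```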

## Impact

Any long-running process that does repeated `merge_insert` / `update` / `add_columns` commits (e.g., a streaming ingestion pipeline, a batch upsert job, a Ray driver coordinating filter/apply across thousands of fragments) will see RSS grow without bound until OOM. In our case: 42 GB RSS after ~1400 flushes on a 44 GB machine.
