Skip to content

perf(dir-catalog): rewrite manifest mutations with copy-on-write#7176

Merged
jackye1995 merged 3 commits into
lance-format:mainfrom
jackye1995:jack/dir-manifest-cow-clean
Jun 11, 2026
Merged

perf(dir-catalog): rewrite manifest mutations with copy-on-write#7176
jackye1995 merged 3 commits into
lance-format:mainfrom
jackye1995:jack/dir-manifest-cow-clean

Conversation

@jackye1995

@jackye1995 jackye1995 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace __manifest merge-insert/delete maintenance with always copy-on-write rewrites: each
    mutation streams the latest __manifest into a single replacement data file and commits a new
    manifest version with freshly built replacement scalar indices (object_id BTree, object_type
    Bitmap, base_objects LabelList), so reads stay index-accelerated.
  • Commit the rewrite by writing the manifest directly via the storage commit handler — atomic
    put-if-not-exists at version N+1, version hint, and the overwrite transaction embedded inline
    (no _transactions/*.txn file). No CommitBuilder/transaction-rebase machinery, and no new
    lance-core API surface
    (kept entirely in lance-namespace-impls).
  • Native conflict handling: on a storage commit error, read back the latest version and check
    whether the intent is already satisfied (create → object now exists ⇒ fail; delete → object gone
    ⇒ succeed) before retrying. Staged data/index files are deleted only once the commit is proven
    not to have landed, so a lost-ack commit can no longer orphan files a committed manifest
    references (fixes a silent corruption window).
  • Delete table-version ranges by streaming the rewrite, without expanding every version id.
  • Adds a commit benchmark: examples/manifest_bench.rs + benches/manifest_commit_sweep.sh
    (see BENCHMARK.md).

This is the production implementation of the copy-on-write approach prototyped in #6794.

Measured performance

c7i.48xlarge, S3 us-east-1, op write-create-namespace (pure __manifest commit). The catalog
is single-writer-throughput-bound: per-commit cost scales ~O(rows), and throughput does not
increase with concurrency (every commit is a serialized manifest version bump).

Continuous (1 process, 100 commits), ops/s — inline index vs no index:

rows inline index no index
1,000 2.0 3.5
100,000 1.1 2.1
1,000,000 0.34 0.53

Concurrent steady TPS is flat across 10–200 processes (e.g. inline @100k ≈ 1.4–1.5 ops/s at every
concurrency level; @1m ≈ 0.3 ops/s). Conflicts beyond the retry budget surface as errors that grow
with concurrency — the contention ceiling, not data loss (≈0 errors at ≤20 processes). No-index
commits run ~1.5–2× faster (no per-commit index build). This is on par with the #6794 prototype's
~2.3–2.5 ops/s single-process S3 write throughput.

@github-actions github-actions Bot added A-deps Dependency updates A-namespace Namespace impls performance labels Jun 9, 2026
@jackye1995 jackye1995 force-pushed the jack/dir-manifest-cow-clean branch from 2a96c17 to 2f1b29a Compare June 9, 2026 10:00
@github-actions github-actions Bot added A-python Python bindings and removed A-python Python bindings labels Jun 9, 2026
@jackye1995 jackye1995 force-pushed the jack/dir-manifest-cow-clean branch from 2f1b29a to 7aa7936 Compare June 9, 2026 10:11
@github-actions github-actions Bot added the A-python Python bindings label Jun 9, 2026
@jackye1995 jackye1995 force-pushed the jack/dir-manifest-cow-clean branch from 7aa7936 to 06ad0be Compare June 9, 2026 10:35
@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@jackye1995 jackye1995 force-pushed the jack/dir-manifest-cow-clean branch from 06ad0be to d8c90da Compare June 10, 2026 23:23

@LuQQiu LuQQiu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about the performance number with this new approach

@LuQQiu

LuQQiu commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Oh see the performance numbers.... lolll skip reviewing the PR descriptions fully

@LuQQiu

LuQQiu commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Why
Cold read: ~11-15% lower latency

@jackye1995

Copy link
Copy Markdown
Contributor Author

Why Cold read: ~11-15% lower latency

I think it's because in general just fewer files to read

@jackye1995 jackye1995 force-pushed the jack/dir-manifest-cow-clean branch from d8c90da to 9ff6552 Compare June 11, 2026 14:35
@jackye1995

Copy link
Copy Markdown
Contributor Author

Benchmark — copy-on-write __manifest commit throughput

Re-ran the commit benchmark on the native-commit implementation in this PR and verified it matches the #6794 prototype.

Setup. EC2 c7i.48xlarge (192 vCPU / 371 GB), S3 us-east-1. Operation = write-create-namespace (pure __manifest mutation, no table data). Continuous = 1 process, 100 commits. Concurrent = C processes committing for 30 s (steady TPS), each run S3-copy-isolated to start at exactly rows. Harness: examples/manifest_bench.rs + benches/manifest_commit_sweep.sh (this PR).

Continuous — 1 process, 100 commits

rows inline ops/s inline p50 / p99 (ms) no-index ops/s no-index p50 / p99 (ms)
1,000 2.04 442 / 990 3.55 271 / 394
2,000 1.82 502 / 1143 3.92 242 / 388
5,000 1.89 519 / 712 2.35 401 / 683
10,000 1.50 639 / 1034 2.43 408 / 520
20,000 1.51 644 / 1008 1.99 480 / 795
50,000 1.49 666 / 904 2.07 462 / 852
100,000 1.13 811 / 1722 2.06 478 / 637
200,000 0.98 937 / 1994 1.55 602 / 1074
500,000 0.58 1694 / 2587 0.92 1067 / 1344
1,000,000 0.34 2806 / 3816 0.53 1827 / 2877

Per-commit cost scales ~O(rows) (read + full rewrite + index build + write + version hint); no-index is ~1.5–2× faster on writes; 0 errors (no contention).

Concurrent — steady TPS over 30 s, inline index (ops/s)

rows c=10 c=20 c=50 c=100 c=120 c=150 c=200
1,000 2.74 3.06 3.71 3.70 3.30 2.46 2.86
2,000 2.86 2.96 3.47 3.26 3.56 3.44 2.87
5,000 2.06 2.25 2.38 2.22 2.40 2.39 2.21
10,000 1.97 2.04 2.32 2.38 2.54 2.33 2.12
20,000 1.84 2.10 2.23 2.13 2.37 2.25 2.05
50,000 1.76 1.71 1.78 1.79 1.75 1.85 1.78
100,000 1.39 1.38 1.54 1.48 1.52 1.49 1.45
200,000 1.12 1.15 1.31 1.30 1.23 1.33 1.22
500,000 0.63 0.65 0.66 0.66 0.64 0.66 0.65
1,000,000 0.32 0.34 0.36 0.35 0.34 0.33 0.30

Concurrent — steady TPS over 30 s, no index (ops/s)

rows c=10 c=20 c=50 c=100 c=120 c=150 c=200
1,000 4.74 5.21 5.73 5.88 5.92 5.93 5.33
2,000 4.91 5.39 6.25 6.02 5.70 5.87 5.12
5,000 2.80 3.14 3.48 3.25 3.52 3.76 3.50
10,000 2.88 3.02 3.25 3.22 3.49 3.54 3.34
20,000 2.96 2.82 3.17 3.17 3.24 3.29 3.28
50,000 2.38 2.62 2.86 2.74 2.75 2.95 2.81
100,000 2.26 2.32 2.36 2.58 2.38 2.51 2.41
200,000 1.89 1.91 2.02 1.93 1.89 2.11 1.86
500,000 0.86 0.94 0.95 0.87 0.88 0.84 0.82
1,000,000 0.54 0.53 0.53 0.50 0.49 0.45 0.42

Concurrent — errors (commits that exceeded the retry budget in the 30 s window)

variant / rows c=10 c=20 c=50 c=100 c=120 c=150 c=200
inline 1,000 0 3 22 94 157 120 194
inline 2,000 0 3 21 88 139 162 185
inline 5,000 2 4 33 78 141 210 202
inline 10,000 1 3 28 90 139 150 179
inline 20,000 2 3 30 78 152 155 186
inline 50,000 1 7 31 73 100 153 184
inline 100,000 2 6 34 78 104 131 182
inline 200,000 0 4 34 100 95 143 191
inline 500,000 0 5 28 76 96 124 197
inline 1,000,000 0 4 28 78 98 128 176
no-index 1,000 0 2 41 187 237 277 346
no-index 2,000 0 2 45 191 241 276 337
no-index 5,000 1 7 35 150 191 258 348
no-index 10,000 2 9 43 132 201 250 341
no-index 20,000 0 6 46 144 197 239 353
no-index 50,000 2 9 52 148 188 257 352
no-index 100,000 4 9 48 170 191 251 345
no-index 200,000 1 11 60 169 197 277 354
no-index 500,000 2 10 38 82 102 141 213
no-index 1,000,000 3 7 34 80 98 143 176

Findings

  • Single-writer-throughput-bound. Throughput is set by the serial commit rate at a given size and does not scale with concurrency — every commit is an atomic put-if-not-exists at version N+1, so concurrent writers contend and retry rather than parallelize.
  • ~O(rows) per commit. Continuous drops from ~2–3.5 ops/s at 1k to ~0.3–0.5 ops/s at 1M.
  • Inline index costs ~1.5–2× on writes (per-commit BTree/Bitmap/LabelList build); the tradeoff is index-accelerated reads.
  • Conflicts are clean, not corruption. ≈0 errors at ≤20 processes; errors climb with concurrency as commits exceed the retry budget (no-index errors more — faster commits ⇒ higher attempt rate ⇒ more contention).
  • Matches feat(dir-catalog): copy-on-write directory manifest rewrites #6794 (~2.3–2.5 ops/s single-process S3 write at 1k), so the native-commit rework (direct manifest write + inline txn + native conflict handling, no lance-core surface) is on par with the prototype.

Takeaway for high-TPS multi-writer catalogs: this single-version __manifest chain won't sustain it — sharding or a different commit substrate is needed (consistent with the table_version_storage removal in #7222).

The native commit wrote the manifest through the namespace's own ObjectStore,
which for stores like memory:// is a different instance than the one the manifest
dataset reads from, so commits were invisible to reads (stale version -> endless
conflict / not-found). Route the commit and cleanup through dataset.object_store().
@jackye1995 jackye1995 merged commit 7173468 into lance-format:main Jun 11, 2026
31 checks passed
@jackye1995 jackye1995 deleted the jack/dir-manifest-cow-clean branch June 11, 2026 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-deps Dependency updates A-namespace Namespace impls A-python Python bindings performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants