perf(dir-catalog): rewrite manifest mutations with copy-on-write#7176
Conversation
2a96c17 to
2f1b29a
Compare
2f1b29a to
7aa7936
Compare
7aa7936 to
06ad0be
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
06ad0be to
d8c90da
Compare
LuQQiu
left a comment
There was a problem hiding this comment.
Curious about the performance number with this new approach
|
Oh see the performance numbers.... lolll skip reviewing the PR descriptions fully |
|
Why |
I think it's because in general just fewer files to read |
d8c90da to
9ff6552
Compare
Benchmark — copy-on-write
|
| rows | inline ops/s | inline p50 / p99 (ms) | no-index ops/s | no-index p50 / p99 (ms) |
|---|---|---|---|---|
| 1,000 | 2.04 | 442 / 990 | 3.55 | 271 / 394 |
| 2,000 | 1.82 | 502 / 1143 | 3.92 | 242 / 388 |
| 5,000 | 1.89 | 519 / 712 | 2.35 | 401 / 683 |
| 10,000 | 1.50 | 639 / 1034 | 2.43 | 408 / 520 |
| 20,000 | 1.51 | 644 / 1008 | 1.99 | 480 / 795 |
| 50,000 | 1.49 | 666 / 904 | 2.07 | 462 / 852 |
| 100,000 | 1.13 | 811 / 1722 | 2.06 | 478 / 637 |
| 200,000 | 0.98 | 937 / 1994 | 1.55 | 602 / 1074 |
| 500,000 | 0.58 | 1694 / 2587 | 0.92 | 1067 / 1344 |
| 1,000,000 | 0.34 | 2806 / 3816 | 0.53 | 1827 / 2877 |
Per-commit cost scales ~O(rows) (read + full rewrite + index build + write + version hint); no-index is ~1.5–2× faster on writes; 0 errors (no contention).
Concurrent — steady TPS over 30 s, inline index (ops/s)
| rows | c=10 | c=20 | c=50 | c=100 | c=120 | c=150 | c=200 |
|---|---|---|---|---|---|---|---|
| 1,000 | 2.74 | 3.06 | 3.71 | 3.70 | 3.30 | 2.46 | 2.86 |
| 2,000 | 2.86 | 2.96 | 3.47 | 3.26 | 3.56 | 3.44 | 2.87 |
| 5,000 | 2.06 | 2.25 | 2.38 | 2.22 | 2.40 | 2.39 | 2.21 |
| 10,000 | 1.97 | 2.04 | 2.32 | 2.38 | 2.54 | 2.33 | 2.12 |
| 20,000 | 1.84 | 2.10 | 2.23 | 2.13 | 2.37 | 2.25 | 2.05 |
| 50,000 | 1.76 | 1.71 | 1.78 | 1.79 | 1.75 | 1.85 | 1.78 |
| 100,000 | 1.39 | 1.38 | 1.54 | 1.48 | 1.52 | 1.49 | 1.45 |
| 200,000 | 1.12 | 1.15 | 1.31 | 1.30 | 1.23 | 1.33 | 1.22 |
| 500,000 | 0.63 | 0.65 | 0.66 | 0.66 | 0.64 | 0.66 | 0.65 |
| 1,000,000 | 0.32 | 0.34 | 0.36 | 0.35 | 0.34 | 0.33 | 0.30 |
Concurrent — steady TPS over 30 s, no index (ops/s)
| rows | c=10 | c=20 | c=50 | c=100 | c=120 | c=150 | c=200 |
|---|---|---|---|---|---|---|---|
| 1,000 | 4.74 | 5.21 | 5.73 | 5.88 | 5.92 | 5.93 | 5.33 |
| 2,000 | 4.91 | 5.39 | 6.25 | 6.02 | 5.70 | 5.87 | 5.12 |
| 5,000 | 2.80 | 3.14 | 3.48 | 3.25 | 3.52 | 3.76 | 3.50 |
| 10,000 | 2.88 | 3.02 | 3.25 | 3.22 | 3.49 | 3.54 | 3.34 |
| 20,000 | 2.96 | 2.82 | 3.17 | 3.17 | 3.24 | 3.29 | 3.28 |
| 50,000 | 2.38 | 2.62 | 2.86 | 2.74 | 2.75 | 2.95 | 2.81 |
| 100,000 | 2.26 | 2.32 | 2.36 | 2.58 | 2.38 | 2.51 | 2.41 |
| 200,000 | 1.89 | 1.91 | 2.02 | 1.93 | 1.89 | 2.11 | 1.86 |
| 500,000 | 0.86 | 0.94 | 0.95 | 0.87 | 0.88 | 0.84 | 0.82 |
| 1,000,000 | 0.54 | 0.53 | 0.53 | 0.50 | 0.49 | 0.45 | 0.42 |
Concurrent — errors (commits that exceeded the retry budget in the 30 s window)
| variant / rows | c=10 | c=20 | c=50 | c=100 | c=120 | c=150 | c=200 |
|---|---|---|---|---|---|---|---|
| inline 1,000 | 0 | 3 | 22 | 94 | 157 | 120 | 194 |
| inline 2,000 | 0 | 3 | 21 | 88 | 139 | 162 | 185 |
| inline 5,000 | 2 | 4 | 33 | 78 | 141 | 210 | 202 |
| inline 10,000 | 1 | 3 | 28 | 90 | 139 | 150 | 179 |
| inline 20,000 | 2 | 3 | 30 | 78 | 152 | 155 | 186 |
| inline 50,000 | 1 | 7 | 31 | 73 | 100 | 153 | 184 |
| inline 100,000 | 2 | 6 | 34 | 78 | 104 | 131 | 182 |
| inline 200,000 | 0 | 4 | 34 | 100 | 95 | 143 | 191 |
| inline 500,000 | 0 | 5 | 28 | 76 | 96 | 124 | 197 |
| inline 1,000,000 | 0 | 4 | 28 | 78 | 98 | 128 | 176 |
| no-index 1,000 | 0 | 2 | 41 | 187 | 237 | 277 | 346 |
| no-index 2,000 | 0 | 2 | 45 | 191 | 241 | 276 | 337 |
| no-index 5,000 | 1 | 7 | 35 | 150 | 191 | 258 | 348 |
| no-index 10,000 | 2 | 9 | 43 | 132 | 201 | 250 | 341 |
| no-index 20,000 | 0 | 6 | 46 | 144 | 197 | 239 | 353 |
| no-index 50,000 | 2 | 9 | 52 | 148 | 188 | 257 | 352 |
| no-index 100,000 | 4 | 9 | 48 | 170 | 191 | 251 | 345 |
| no-index 200,000 | 1 | 11 | 60 | 169 | 197 | 277 | 354 |
| no-index 500,000 | 2 | 10 | 38 | 82 | 102 | 141 | 213 |
| no-index 1,000,000 | 3 | 7 | 34 | 80 | 98 | 143 | 176 |
Findings
- Single-writer-throughput-bound. Throughput is set by the serial commit rate at a given size and does not scale with concurrency — every commit is an atomic put-if-not-exists at version N+1, so concurrent writers contend and retry rather than parallelize.
- ~O(rows) per commit. Continuous drops from ~2–3.5 ops/s at 1k to ~0.3–0.5 ops/s at 1M.
- Inline index costs ~1.5–2× on writes (per-commit BTree/Bitmap/LabelList build); the tradeoff is index-accelerated reads.
- Conflicts are clean, not corruption. ≈0 errors at ≤20 processes; errors climb with concurrency as commits exceed the retry budget (no-index errors more — faster commits ⇒ higher attempt rate ⇒ more contention).
- Matches feat(dir-catalog): copy-on-write directory manifest rewrites #6794 (~2.3–2.5 ops/s single-process S3 write at 1k), so the native-commit rework (direct manifest write + inline txn + native conflict handling, no lance-core surface) is on par with the prototype.
Takeaway for high-TPS multi-writer catalogs: this single-version __manifest chain won't sustain it — sharding or a different commit substrate is needed (consistent with the table_version_storage removal in #7222).
The native commit wrote the manifest through the namespace's own ObjectStore, which for stores like memory:// is a different instance than the one the manifest dataset reads from, so commits were invisible to reads (stale version -> endless conflict / not-found). Route the commit and cleanup through dataset.object_store().
Summary
__manifestmerge-insert/delete maintenance with always copy-on-write rewrites: eachmutation streams the latest
__manifestinto a single replacement data file and commits a newmanifest version with freshly built replacement scalar indices (
object_idBTree,object_typeBitmap,
base_objectsLabelList), so reads stay index-accelerated.put-if-not-exists at version N+1, version hint, and the overwrite transaction embedded inline
(no
_transactions/*.txnfile). NoCommitBuilder/transaction-rebase machinery, and no newlance-core API surface (kept entirely in
lance-namespace-impls).whether the intent is already satisfied (create → object now exists ⇒ fail; delete → object gone
⇒ succeed) before retrying. Staged data/index files are deleted only once the commit is proven
not to have landed, so a lost-ack commit can no longer orphan files a committed manifest
references (fixes a silent corruption window).
examples/manifest_bench.rs+benches/manifest_commit_sweep.sh(see
BENCHMARK.md).This is the production implementation of the copy-on-write approach prototyped in #6794.
Measured performance
c7i.48xlarge, S3us-east-1, opwrite-create-namespace(pure__manifestcommit). The catalogis single-writer-throughput-bound: per-commit cost scales ~O(rows), and throughput does not
increase with concurrency (every commit is a serialized manifest version bump).
Continuous (1 process, 100 commits), ops/s — inline index vs no index:
Concurrent steady TPS is flat across 10–200 processes (e.g. inline @100k ≈ 1.4–1.5 ops/s at every
concurrency level; @1m ≈ 0.3 ops/s). Conflicts beyond the retry budget surface as errors that grow
with concurrency — the contention ceiling, not data loss (≈0 errors at ≤20 processes). No-index
commits run ~1.5–2× faster (no per-commit index build). This is on par with the #6794 prototype's
~2.3–2.5 ops/s single-process S3 write throughput.