diff --git a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md new file mode 100644 index 000000000000..ad726abdd320 --- /dev/null +++ b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md @@ -0,0 +1,668 @@ +--- +title: Multipart Upload GC Pressure Optimizations +summary: Change Multipart Upload Logic to improve OM GC Pressure +date: 2026-02-19 +jira: HDDS-10611 +status: proposed +author: Abhishek Pal, Rakesh Radhakrishnan +--- + + +# Ozone MPU Optimization - Design Doc + + +## Table of Contents +1. [Motivation](#1-motivation) +2. [Proposal](#2-proposal) + * [Split-table design (V2)](#split-table-design-v2) + * [Comparison: V1 (legacy) vs V2](#comparison-v1-legacy-vs-v2) + * [2.1 Data Layout Changes](#21-data-layout-changes) + * [2.2 MPU Flow Changes](#22-mpu-flow-changes) + * [2.3 Summary and Trade-offs](#23-summary-and-trade-offs) +3. [Upgrades](#3-upgrades) +4. [Industry Patterns](#4-industry-patterns-flattened-keys-in-lsmrocksdb-systems) +--- + +## 1. Motivation +Presently Ozone has several overheads when uploading large files via Multipart upload (MPU). This document presents a detailed design for optimizing the MPU storage layout to reduce these overheads. + +### Problem with the current MPU schema +**Current design:** +* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}` +* Value = full `OmMultipartKeyInfo` with all parts inline. + +**Implications:** +1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611). +
+``` +Side note: This is a common pattern in regular open key writes as well, but the MPU case is more severe due to the growing part list and more frequent updates. +``` +2. RocksDB WAL logs each full write → WAL growth (HDDS-8238). +3. GC pressure grows with the size of the object (HDDS-10611). + +#### a) Deserialization overhead +| Operation | Current | +|:--------------|:--------------------------------------------------------| +| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) | + +#### b) WAL overhead +Assuming one MPU part info object takes ~1.5KB. + +| Scenario | Current WAL | +|:------------|:--------------------------------| +| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB | + +#### c) GC pressure +Current: Large short-lived objects per part commit. + +#### Existing Storage Layout Overview +```protobuf +MultipartKeyInfo { + uploadID : string + creationTime : uint64 + type : ReplicationType + factor : ReplicationFactor (optional) + partKeyInfoList : repeated PartKeyInfo ← grows with each part + objectID : uint64 (optional) + updateID : uint64 (optional) + parentID : uint64 (optional) + ecReplicationConfig : optional +} +``` + +--- + +## 2. Proposal +The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object. + +### Split-table design (V2) +Split MPU metadata into: +* **Metadata table:** Lightweight per-MPU metadata (no part list). +* **Parts table:** One row per part (flat structure). 
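The WAL estimates above can be reproduced with back-of-envelope arithmetic. A minimal sketch, assuming ~1.5 KB per serialized part entry (the class and constant names are illustrative, not Ozone code):

```java
// Back-of-envelope WAL math for a 1,000-part upload, assuming ~1.5 KB
// per serialized part entry. Illustrative only; not an Ozone API.
public class MpuWalEstimate {
  static final long PART_BYTES = 1536; // ~1.5 KB per part entry

  /** V1: committing part N rewrites all N parts -> sum of 1..n entries. */
  static long v1WalBytes(long parts) {
    return parts * (parts + 1) / 2 * PART_BYTES;
  }

  /** V2: committing part N writes a single part row. */
  static long v2WalBytes(long parts) {
    return parts * PART_BYTES;
  }

  public static void main(String[] args) {
    long parts = 1000;
    System.out.printf("V1 WAL ~%d MB, V2 WAL ~%d KB%n",
        v1WalBytes(parts) >> 20, v2WalBytes(parts) >> 10);
    // V1 WAL ~733 MB, V2 WAL ~1500 KB
  }
}
```

The quadratic term in `v1WalBytes` is exactly the `(1+2+...+1000) × 1.5 KB` figure from section 1; the split layout reduces it to a linear `1000 × 1.5 KB`.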
**New MultipartPartInfo Structure:**
```protobuf
message MultipartPartInfo {
  optional string partName = 1;
  optional uint32 partNumber = 2;
  optional string eTag = 3;
  optional KeyLocationList keyLocationList = 4;
  optional uint64 dataSize = 5;
  optional uint64 modificationTime = 6;
  optional uint64 objectID = 7;
  optional uint64 updateID = 8;
  optional FileEncryptionInfoProto fileEncryptionInfo = 9;
  optional FileChecksumProto fileChecksum = 10;
}
```

```
Note: All fields are marked optional because Protobuf guidance is to enforce required-field validation at the application level, and proto3 no longer supports required fields.
```

### Comparison: V1 (legacy) vs V2
| Metric | Current (V1) | Split-Table (V2) |
|:--------------------|:------------------------------|:-------------------------------------------------|
| **Commit part N** | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo |
| **1,000 parts WAL** | ~733 MB | ~1.5 MB (or ~600KB with optimized info) |
| **GC Pressure** | Large short-lived objects | Small metadata + single-part objects |

---

### 2.1 Data Layout Changes

#### 2.1.1 Chosen Approach: Reuse `multipartInfoTable` + add `multipartPartsTable`

Keep `multipartInfoTable` for MPU metadata, and store part rows in `multipartPartsTable`.
**Storage Layout:**
* **`multipartInfoTable` (RocksDB):**
  * V1: Key -> `OmMultipartKeyInfo` { parts inline }
  * V2: Key -> `OmMultipartKeyInfo` { empty list, `schemaVersion: 1` }
* **`multipartPartsTable` (RocksDB):**
  * Key type: `OmMultipartPartKey(uploadId, partNumber)`
  * Value type: `OmMultipartPartInfo`

**`multipartPartsTable` key codec (V2):**
* `OmMultipartPartKey` uses two logical fields:
  * `uploadId` (`String`)
  * `partNumber` (`int32`)
* Persisted key bytes are encoded as:
  * `uploadId(UTF-8 bytes)` + `'/' (0x2f)` + `partNumber(4-byte big-endian int)`
* Prefix scan for all parts in one upload uses:
  * `uploadId(UTF-8 bytes)` + `'/' (0x2f)`

```text
`OmMultipartPartKey.toString()` returns:
  - full key: "<uploadId>/<partNumber>"
  - prefix key: "<uploadId>/" (used only as an in-memory prefix object)

Example:
  OmMultipartPartKey.of("abc123-uuid-456", 2).toString() == "abc123-uuid-456/2"
```

Parts are stored in lexicographic byte order by upload ID and, within an upload, in ascending part number, which matches the ordering the S3 specification requires for ListParts and ListMultipartUploads responses.
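The codec described above can be modeled in isolation to check the ordering property. A standalone sketch (the `encode`/`prefix`/`compare` helpers are illustrative, not the actual OM codec classes):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of the V2 part-key byte layout: uploadId UTF-8 bytes, a '/'
// separator (0x2f), then the part number as a 4-byte big-endian int.
// Illustrative only; the real codec lives in the OM codec registry.
public class PartKeyCodecSketch {

  static byte[] encode(String uploadId, int partNumber) {
    byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(id.length + 1 + Integer.BYTES)
        .put(id).put((byte) '/').putInt(partNumber).array();
  }

  /** Prefix used to range-scan every part row of one upload. */
  static byte[] prefix(String uploadId) {
    byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(id.length + 1).put(id).put((byte) '/').array();
  }

  /** Unsigned lexicographic compare, like RocksDB's default comparator. */
  static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int c = Byte.compareUnsigned(a[i], b[i]);
      if (c != 0) {
        return c;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    byte[] p2 = encode("abc123-uuid-456", 2);
    byte[] p10 = encode("abc123-uuid-456", 10);
    // Fixed-width big-endian int32 keeps byte order == numeric order: 2 < 10.
    assert compare(p2, p10) < 0;
  }
}
```

The fixed-width big-endian suffix is what makes byte order equal numeric part order; a string-encoded part number would sort `"10"` before `"2"` and break S3's required part ordering.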
#### MultipartKeyInfo Structure
```protobuf
message MultipartKeyInfo {
  required string uploadID = 1;
  required uint64 creationTime = 2;
  required hadoop.hdds.ReplicationType type = 3;
  optional hadoop.hdds.ReplicationFactor factor = 4;
  repeated PartKeyInfo partKeyInfoList = 5 [deprecated = true];
  optional uint64 objectID = 6;
  optional uint64 updateID = 7;
  optional uint64 parentID = 8;
  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 9;
  optional uint32 schemaVersion = 10; // default 0
  // pulled up from the part information, as these do not change per part for a given key
  optional string volumeName = 11;
  optional string bucketName = 12;
  optional string keyName = 13;
  optional string ownerName = 14;
  repeated OzoneAclInfo acls = 15;
}
```

##### V1: `OmMultipartKeyInfo` (parts inline)
```
OmMultipartKeyInfo {
  uploadID
  creationTime
  type
  factor
  partKeyInfoList: [ PartKeyInfo, PartKeyInfo, ... ]  <- all parts inline
  objectID
  updateID
  parentID
  schemaVersion: 0 (or absent)
}
```

##### V2: `OmMultipartKeyInfo` (empty list + schemaVersion)
```
OmMultipartKeyInfo {
  uploadID
  creationTime
  type
  factor
  partKeyInfoList: []  <- empty
  objectID
  updateID
  parentID
  schemaVersion: 1
}
```

##### Example (for a 10-part MPU)

`multipartInfoTable`:
```
Key: /vol1/bucket1/mp_file1/abc123-uuid-456

Value:
OmMultipartKeyInfo {
  uploadID: "abc123-uuid-456"
  creationTime: 1738742400000
  type: RATIS
  factor: THREE
  partKeyInfoList: []
  objectID: 1001
  updateID: 12345
  parentID: 0
  schemaVersion: 1
}
```

`multipartPartsTable` (logical keys):
```text
Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=1}
  String form: "abc123-uuid-456/1"
Value: OmMultipartPartInfo{partNumber=1, partName=".../part1", ...}

Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=2}
  String form: "abc123-uuid-456/2"
Value: OmMultipartPartInfo{partNumber=2,
partName=".../part2", ...} +... +Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=10} + String form: "abc123-uuid-456/10" +Value: OmMultipartPartInfo{partNumber=10, partName=".../part10", ...} +``` + +`multipartPartsTable` (encoded key sample): +```text +uploadId = "abc123-uuid-456" +partNumber = 2 + +encodedKey = [61 62 63 31 32 33 2d 75 75 69 64 2d 34 35 36 2f 00 00 00 02] + [--------------uploadId UTF-8---------------][2f][-int32 BE-] +``` + +#### 2.1.2 Alternative Approach: Add `multipartMetadataTable` + `multipartPartsTable` + +Split metadata and introduce two new tables: +* **`multipartMetadataTable`**: lightweight per-MPU metadata (no part list). +* **`multipartPartsTable`**: one row per part (no aggregation). + +```protobuf +message MultipartMetadataInfo { + required string uploadID = 1; + required uint64 creationTime = 2; + required hadoop.hdds.ReplicationType type = 3; + optional hadoop.hdds.ReplicationFactor factor = 4; + optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 5; + optional uint64 objectID = 6; + optional uint64 updateID = 7; + optional uint64 parentID = 8; + optional uint32 schemaVersion = 9; // default 0 +} +``` + +**Storage Layout Overview:** +* **`multipartInfoTable` (RocksDB):** + * V1: `/vol/bucket/key/uploadId` -> `OmMultipartKeyInfo { partKeyInfoList: [...] 
}` +* **`multipartMetadataTable` (RocksDB):** + * V2: `/vol/bucket/key/uploadId` -> `MultipartMetadata { schemaVersion: 1 }` +* **`multipartPartsTable` (RocksDB):** + * Key: `OmMultipartPartKey(uploadId, partNumber)` + * Value: `PartKeyInfo`-equivalent part payload + +```protobuf +message MultipartMetadata { + required string uploadID = 1; + required uint64 creationTime = 2; + required hadoop.hdds.ReplicationType type = 3; + optional hadoop.hdds.ReplicationFactor factor = 4; + optional uint64 objectID = 5; + optional uint64 updateID = 6; + optional uint64 parentID = 7; + optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 8; + optional uint32 schemaVersion = 9; + // NO partKeyInfoList - moved to new table +} +``` + +Example: +``` +Key: /vol1/bucket1/mp_file1/abc123-uuid-456 + +Value: +MultipartMetadata { + uploadID: "abc123-uuid-456" + creationTime: 1738742400000 + type: RATIS + factor: THREE + objectID: 1001 + updateID: 12345 + parentID: 0 + schemaVersion: 1 +} +``` + +### 2.2 MPU Flow Changes + +#### 2.2.1 Chosen Approach Flow Changes + +##### Multipart Upload Initiate + +**Old Flow** +* Create `multipartKey = /{vol}/{bucket}/{key}/{uploadId}`. +* Build `OmMultipartKeyInfo` (schema default/legacy, inline `partKeyInfoList` model). +* Write: + * `openKeyTable[multipartKey] = OmKeyInfo` + * `multipartInfoTable[multipartKey] = OmMultipartKeyInfo` + +Example: +```text +multipartInfoTable[/vol1/b1/fileA/upload-001] -> + OmMultipartKeyInfo{schemaVersion=0, partKeyInfoList=[]} +openKeyTable[/vol1/b1/fileA/upload-001] -> + OmKeyInfo{key=fileA, objectID=9001} +``` + +**New Flow** +* Same keys/tables as old flow, but initiate sets `schemaVersion` explicitly: + * `schemaVersion=1` when `OMLayoutFeature.MPU_PARTS_TABLE_SPLIT` is allowed. + * `schemaVersion=0` otherwise. +* No part row is created at initiate time; part rows are created during commit-part. 
+* FSO response path (`S3InitiateMultipartUploadResponseWithFSO`) still writes parent directory entries, then open-file + multipart-info rows. +* Backward compatibility: write path selection is schema-based and layout-gated (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). + +Example: +```text +multipartInfoTable[/vol1/b1/fileA/upload-001] -> + OmMultipartKeyInfo{schemaVersion=1, partKeyInfoList=[]} +``` + +##### Multipart Upload Commit Part + +**Old Flow** +* Read `multipartInfoTable[multipartKey]`. +* Read current uploaded part blocks from `openKeyTable[getOpenKey(..., clientID)]`. +* Insert in inline map: + * `oldPart = multipartKeyInfo.getPartKeyInfo(partNumber)` + * `multipartKeyInfo.addPartKeyInfo(currentPart)` +* Delete committed one-shot open key for this part. +* Update quota based on overwrite delta. + +Example: +```text +Before: partKeyInfoList=[{part=1,size=64MB},{part=2,size=32MB}] +Commit part 2 size=40MB +After: partKeyInfoList=[{part=1,size=64MB},{part=2,size=40MB}] +``` + +**New Flow** +* Load `multipartKeyInfo` and validate layout gate: + * if split feature is not allowed and `schemaVersion != 0`, fail early. +* Branch by schema: + * `schemaVersion=0`: same old inline behavior. + * `schemaVersion=1`: + * create `multipartPartKey = OmMultipartPartKey(uploadId, partNumber)`, + * write `multipartPartTable[multipartPartKey] = OmMultipartPartInfo{openKey, partName, partNumber, dataSize, modificationTime, objectID, updateID, metadata, keyLocationList, fileEncryptionInfo?, fileChecksum?}` + * keep current part open key in `openKeyTable` (needed later by list/complete/abort), + * if overwriting an existing part row, delete old part open key and adjust quota. +* `multipartInfoTable[multipartKey]` is still updated for metadata/updateID. 
+* Backward compatibility: schema decides row format, and split behavior is blocked before finalization (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). + +Example (`schemaVersion=1`): +```text +multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] -> + OmMultipartPartInfo{partNumber=2, openKey=/vol1/b1/fileA#client-77, dataSize=40MB} +``` + +##### Multipart Upload Complete + +**Old Flow** +* Read `multipartInfoTable[multipartKey]` and validate ordered requested parts. +* Validate each requested part from inline `partKeyInfoMap`. +* Build final key block list from selected parts. +* Write final key: + * `keyTable[/vol1/b1/fileA] = OmKeyInfo{locations=p1+p2+...}` +* Delete MPU state: + * `openKeyTable[multipartOpenKey]` + * `multipartInfoTable[multipartKey]` +* Move unused parts to deleted table. + +Example: +```text +Complete [1,2,3] -> keyTable[/vol1/b1/fileA] written, MPU rows removed +``` + +**New Flow** +* Validate layout gate first (same pattern as commit/abort when `schemaVersion != 0`). +* Load part materialization by schema: + * `schemaVersion=0`: use inline `partKeyInfoMap`. + * `schemaVersion=1`: + * scan `multipartPartTable` with prefix `OmMultipartPartKey.prefix(uploadId)`, + * rebuild `PartKeyInfo` view directly from `OmMultipartPartInfo` (including locations, metadata, objectID/updateID, and optional encryption/checksum fields), + * track part open keys from part metadata for cleanup. +* Perform same user-facing validation (order, existence, eTag/partName, min size). +* Commit final key to `keyTable`. +* Cleanup: + * always delete `multipartInfoTable[multipartKey]` and `openKeyTable[multipartOpenKey]`, + * for split schema also delete all matching `multipartPartTable` rows and their tracked part open keys. 
+* Backward compatibility: completion transparently supports both persisted schemas and enforces layout gating for split rows (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). + +Example (`schemaVersion=1` cleanup): +```text +Delete: + multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=1}] + multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] + ... + openKeyTable[part-open-key-1], openKeyTable[part-open-key-2], ... +``` + +##### Multipart Upload Abort + +**Old Flow** +* Read `multipartInfoTable[multipartKey]`. +* Release bucket used-bytes from inline `partKeyInfoMap`. +* Delete `openKeyTable[multipartOpenKey]` and `multipartInfoTable[multipartKey]`. +* Move part key infos to deleted table. + +Example: +```text +Abort upload-001 -> delete MPU metadata/open rows and tombstone parts +``` + +**New Flow** +* Validate layout gate (reject split behavior before finalization if schema indicates split). +* Branch by schema for quota and cleanup: + * `schemaVersion=0`: same legacy inline part iteration. + * `schemaVersion=1`: + * iterate `multipartPartTable` by prefix (`OmMultipartPartKey.prefix(uploadId)`), + * compute released quota from part rows, + * for each part: move corresponding open key to deleted table, delete part open key, delete part row. +* Delete `multipartInfoTable[multipartKey]` and `multipartOpenKey` entry. +* Backward compatibility: abort handles both old inline and new split rows in the same codepath (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). 
+ +Example (`schemaVersion=1`): +```text +multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=3}] -> deleted +openKeyTable[/vol1/b1/fileA#client-90] -> deleted +deletedTable[objectId:/vol1/b1/fileA/upload-001] -> appended +``` + +##### Multipart Upload List Parts + +**Old Flow** +* API path: + * S3G `MultipartKeyHandler.listParts(...)` + * `OzoneManager.listParts(...)` + * `KeyManagerImpl.listParts(...)` +* Read `multipartInfoTable[multipartKey]`. +* Iterate inline `partKeyInfoMap` in part-number order; apply marker + `maxParts`. +* If no part entries exist yet, read replication from `openKeyTable[multipartKey]`. + +Example: +```text +parts=[1,2,3,4], marker=1, maxParts=2 => return [2,3], nextMarker=3, truncated=true +``` + +**New Flow** +* Same API path and response contract. +* Branch by schema in `KeyManagerImpl.listParts`: + * `schemaVersion=0`: iterate inline `partKeyInfoMap`. + * `schemaVersion=1`: + * scan `multipartPartTable` by `OmMultipartPartKey.prefix(uploadId)`, + * resolve each part's `openKey` from `openKeyTable`, + * build `PartKeyInfo` view on the fly and paginate. +* Replication fallback remains from MPU open key when no part is returned. + +Example (`schemaVersion=1` read materialization): +```text +multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] -> + openKey=/vol1/b1/fileA#c2 +openKeyTable[/vol1/b1/fileA#c2] -> KeyInfo{size=40MB, eTag=...} +listParts response partNumber=2, size=40MB, eTag=... +``` + +#### 2.2.2 Alternative Approach Flow + +##### Multipart Upload Initiate + +**Old Flow** +* Write `openKeyTable` + `multipartInfoTable` (`OmMultipartKeyInfo`), with legacy inline-part model. + +**New Flow** +* Write `openKeyTable` + `multipartMetadataTable` (new metadata object, no part list). +* `multipartInfoTable` is no longer used for new V1 MPU writes in this approach. +* `schemaVersion` (or equivalent layout marker) is stored in `multipartMetadataTable` row. 
+ +Example: +```text +multipartMetadataTable[/vol1/b1/fileA/upload-001] -> + MultipartMetadata{schemaVersion=1, replication=RATIS/THREE, objectID=9001} +openKeyTable[/vol1/b1/fileA/upload-001] -> + OmKeyInfo{key=fileA, locations=[empty]} +``` + +##### Multipart Upload Commit Part + +**Old Flow** +* Read/modify/write one `multipartInfoTable` row by updating inline part list. + +**New Flow** +* Read `multipartMetadataTable` only for MPU metadata + validation context. +* Write one row per part into `multipartPartsTable`: + * key: `OmMultipartPartKey(uploadId, partNumber)`, + * value: `PartKeyInfo` / equivalent flattened part payload. +* Avoid rewriting a large aggregate MPU value for each part. + +Example: +```text +multipartPartsTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] -> + PartKeyInfo{partNumber=2, partName=..., keyInfo={blocks,size,etag}} +``` + +##### Multipart Upload Complete + +**Old Flow** +* Read parts from inline `partKeyInfoList` in `multipartInfoTable`. +* Commit final key, delete MPU row and MPU open row. + +**New Flow** +* Read MPU metadata from `multipartMetadataTable`. +* Scan `multipartPartsTable` prefix to gather candidate parts; validate request list/order/eTags. +* Build final key and commit to `keyTable`. +* Cleanup: + * delete `multipartMetadataTable[multipartKey]`, + * delete all `multipartPartsTable` rows for that upload, + * delete MPU open key and part open keys. + +##### Multipart Upload Abort + +**Old Flow** +* Iterate inline `partKeyInfoList` to release quota and move parts to deleted table. + +**New Flow** +* Read `multipartMetadataTable` for replication + MPU identity. +* Iterate `multipartPartsTable` by prefix for quota release and delete-table movement. +* Delete metadata row + all part rows + corresponding open keys. + +##### Multipart Upload List Parts + +**Old Flow** +* `listParts` reads `multipartInfoTable` and paginates inline `partKeyInfoList`. 
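The marker + `maxParts` pagination contract is identical in both the old and new flows; only the source of part rows differs. A simplified model of that contract (not the actual `KeyManagerImpl` logic; `Page` and `listParts` are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of ListParts pagination: return up to maxParts part
// numbers strictly greater than the marker, plus a truncation flag.
public class ListPartsSketch {

  record Page(List<Integer> parts, int nextMarker, boolean truncated) {}

  static Page listParts(List<Integer> sortedParts, int marker, int maxParts) {
    List<Integer> out = new ArrayList<>();
    boolean truncated = false;
    for (int p : sortedParts) {
      if (p <= marker) {
        continue;                   // skip everything up to the marker
      }
      if (out.size() == maxParts) { // at least one more part exists
        truncated = true;
        break;
      }
      out.add(p);
    }
    int next = out.isEmpty() ? marker : out.get(out.size() - 1);
    return new Page(out, next, truncated);
  }

  public static void main(String[] args) {
    Page page = listParts(List.of(1, 2, 3, 4), 1, 2);
    // Mirrors the earlier example: return [2,3], nextMarker=3, truncated=true
    assert page.parts().equals(List.of(2, 3));
    assert page.nextMarker() == 3 && page.truncated();
  }
}
```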
+ +**New Flow** +* `listParts` reads `multipartMetadataTable` for MPU existence/metadata. +* Materialize part listing from `multipartPartsTable` prefix scan and paginate by part number marker. +* If needed, join with `openKeyTable` for additional per-part runtime fields. + +##### Compatibility and Upgrade Guard +* Same compatibility strategy as section [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating): + * legacy rows continue on old path, + * new writes only when layout feature is finalized, + * reject unsupported split-layout mutations pre-finalize. + +### 2.3 Summary and Trade-offs +* **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag. +* **Approach-2:** Dedicated metadata table, cleanest separation, requires broader refactor. + +#### Pros and Cons + +| | Pros | Cons | +|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Chosen Approach | * Minimal migration risk
* Reuses existing `OmMultipartKeyInfo` message
* Easiest incremental rollout with `schemaVersion` gating, so upgrade finalization does not affect existing key writes.
* Lower implementation impact on request/response paths.
* Still coupled to `OmMultipartKeyInfo`.
* More conditional logic over time. | +| Alternative Approach | * Clean separation of concerns (`multipartMetadataTable` vs `multipartPartsTable`)
* Clearer long-term model and easier mental mapping.
* Avoids overloading legacy value type. | * Requires wider code changes (new codecs/table wiring/request/response updates).
* Higher migration and compatibility test scope.
* More rollout complexity than chosen approach. | + +## 3. Upgrades +Add a new feature in `OMLayoutFeature`: +```java +MPU_PARTS_TABLE_SPLIT(10, "Split multipart table into separate table for parts and key"); +``` + +### 3.1 Backward compatibility and layout gating + +Backward compatibility is handled by combining `schemaVersion` with layout-feature checks. + +- **New MPU initiate writes** + - `schemaVersion` is set to `1` only when `MPU_PARTS_TABLE_SPLIT` is allowed. + - Otherwise initiate writes `schemaVersion=0` and stays on legacy inline part behavior. + +- **Existing MPU rows** + - `schemaVersion=0` rows continue to use legacy inline-part read/write paths. + - `schemaVersion=1` rows use split-table paths (`multipartPartsTable` + tracked open keys). + +- **Pre-finalize protection** + - Mutating split-table operations (commit part / complete / abort) check: + - if feature is not allowed and `schemaVersion != 0`, reject with `NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION`. + - This prevents accidental split-layout writes/updates before finalization. + +- **Read compatibility** + - `listParts` supports both schemas: + - schema 0 -> read inline `partKeyInfoMap`, + - schema 1 -> materialize parts from `multipartPartsTable` + `openKeyTable`. + +In short: old MPU entries keep working unchanged, new entries only use split layout when cluster layout allows it, and write paths are guarded to avoid unsafe transitions. + +## 4. Industry Patterns: Flattened Keys in LSM/RocksDB Systems + +Using flattened keys (for example, `baseKey + sortable suffix`) is a common design in RocksDB-backed systems. +The MPU `multipartPartsTable` layout follows the same principle by using one row per part keyed by +`OmMultipartPartKey(uploadId, partNumber)` with byte-level ordering from codec serialization. + +### 4.1 Why this is common + +RocksDB is optimized for: +- point lookups by key, +- prefix/range scans over lexicographically ordered keys, +- append-like write patterns with small values. 
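This access pattern can be illustrated with a `TreeMap` standing in for RocksDB's ordered keyspace (a sketch with string keys for readability; the real keys are the byte-encoded form from section 2.1):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Ordered-map stand-in for an LSM keyspace: one row per logical
// sub-record, grouped under a shared prefix and fetched by range scan.
public class PrefixScanSketch {

  /** All rows whose key starts with the given prefix (half-open range). */
  static SortedMap<String, String> scan(TreeMap<String, String> db, String prefix) {
    return db.subMap(prefix, prefix + Character.MAX_VALUE);
  }

  public static void main(String[] args) {
    TreeMap<String, String> db = new TreeMap<>();
    // Fixed-width part numbers keep string order numeric here, mirroring
    // the big-endian int32 encoding used by the real part-key codec.
    db.put("upload-001/0001", "part 1");
    db.put("upload-001/0002", "part 2");
    db.put("upload-002/0001", "part of another upload");

    SortedMap<String, String> parts = scan(db, "upload-001/");
    System.out.println(parts.keySet()); // only the upload-001 rows
  }
}
```

Each logical sub-record is an independent small row, and one range scan recovers every row of a logical entity, which is exactly how the MPU part rows are grouped and cleaned up.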
Flattened schemas map naturally to this model:
- each logical sub-record (part/version) becomes an independent KV row,
- updates rewrite only one small row instead of a large aggregate object,
- range scans can fetch all rows for one logical entity via a shared prefix.

### 4.2 MVCC systems as examples

In MVCC-oriented systems such as CockroachDB and TiKV, a common pattern is to:
- encode the user key as a prefix,
- encode the version dimension (timestamp / sequence / commit-ts) in the key suffix,
- use key ordering to make version reads and range scans efficient.

High-level shape:
```text
<userKey>@<version>
```

Example (illustrative):
```text
user:42@1699000001
user:42@1699000005
user:42@1699000010
```

The idea is conceptually similar to MPU part storage:
```text
OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=1}
OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=2}
OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=3}
```

Both designs rely on ordered keys to make grouped reads/writes efficient.

### 4.3 Relevance to MPU optimization

For MPU, flattened part rows provide:
- lower write amplification per part commit (single-row updates),
- lower object allocation pressure in OM (no repeated large list rebuild),
- straightforward cleanup by prefix scan during complete/abort,
- better operational visibility (`one row = one part`).

This is why the split schema is not just an optimization for this code path, but also a storage-layout pattern that aligns well with how LSM/RocksDB systems are typically modeled.