From bf34ede6d5dbe6a1bc62b101458b89a0ffe3e73f Mon Sep 17 00:00:00 2001 From: Abhishek Pal Date: Thu, 19 Feb 2026 22:28:41 +0530 Subject: [PATCH 01/12] HDDS-10611. Design document for MPU GC Optimization --- .../content/design/mpu-gc-optimization.md | 344 ++++++++++++++++++ 1 file changed, 344 insertions(+) create mode 100644 hadoop-hdds/docs/content/design/mpu-gc-optimization.md diff --git a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md new file mode 100644 index 000000000000..f22d218cc255 --- /dev/null +++ b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md @@ -0,0 +1,344 @@ +--- +title: Multipart Upload GC Pressure Optimizations +summary: Change Multipart Upload Logic to improve OM GC Pressure +date: 2026-02-19 +jira: HDDS-10611 +status: implemented +author: Abhishek Pal, Rakesh Radhakrishnan +--- + + +# Ozone MPU Optimization - Design Doc + + +## Table of Contents +1. [Motivation](#1-motivation) +2. [Proposal](#2-proposal) +* [Backward Compatibility](#backward-compatibility) +* [Split-table design (V1)](#split-table-design-v1) +* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1) +* [2.1 Approach-1: Reuse multipartInfoTable with empty part list](#21-approach-1--reuse-multipartinfotable-with-empty-part-list) +* [2.2 Approach-2: Introduce new multipartMetadataTable](#22-approach-2--introduce-new-multipartmetadatatable) +* [2.3 Summary](#23-summary) +* [2.4 Chosen Approach: Approach-1](#24-chosen-approach-approach-1) +3. [Upgrades](#3-upgrades) +4. [Benchmarking and Performance](#4-benchmarking-and-performance) +5. [Open Questions](#5-open-questions) + +--- + +## 1. Motivation +Presently Ozone has several overheads when uploading large files via Multipart upload (MPU). This document presents a detailed design for optimizing the MPU storage layout to reduce these overheads. + +### Problem with the current MPU schema +**Current design:** +* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}` +* Value = full `OmMultipartKeyInfo` with all parts inline. + +**Implications:** +1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611). +2. RocksDB WAL logs each full write → WAL growth (HDDS-8238). +3. GC pressure grows with the size of the object (HDDS-10611). + +#### a) Deserialization overhead +| Operation | Current | +| :--- | :--- | +| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) | + +#### b) WAL overhead +Assuming one MPU part info object takes ~1.5KB. + +| Scenario | Current WAL | +| :--- | :--- | +| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB | + +#### c) GC pressure +Current: Large short-lived objects per part commit. + +#### Existing Storage Layout Overview +```protobuf +MultipartKeyInfo { + uploadID : string + creationTime : uint64 + type : ReplicationType + factor : ReplicationFactor (optional) + partKeyInfoList : repeated PartKeyInfo ← grows with each part + objectID : uint64 (optional) + updateID : uint64 (optional) + parentID : uint64 (optional) + ecReplicationConfig : optional +} +``` + +--- + +## 2. Proposal +The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object. + +### Split-table design (V1) +Split MPU metadata into: +* **Metadata table:** Lightweight per-MPU metadata (no part list). +* **Parts table:** One row per part (flat structure). + +**New MultipartPartInfo Structure:** +```protobuf +message MultipartPartInfo { + required string partName = 1; + required uint32 partNumber = 2; + required string volumeName = 3; + required string bucketName = 4; + required string keyName = 5; + required uint64 dataSize = 6; + required uint64 modificationTime = 7; + repeated KeyLocationList keyLocationList = 8; + repeated hadoop.hdds.KeyValue metadata = 9; + optional FileEncryptionInfoProto fileEncryptionInfo = 10; + optional FileChecksumProto fileChecksum = 11; +} +``` + +### Comparison: V0 (legacy) vs V1 +| Metric | Current (V0) | Split-Table (V1) | +| :--- | :--- | :--- | +| **Commit part N** | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo | +| **1,000 parts WAL** | ~733 MB | ~1.5 MB (or ~600KB with optimized info) | +| **GC Pressure** | Large short-lived objects | Small metadata + single-part objects | + +--- + +### 2.1. Approach-1 : Reuse multipartInfoTable with empty part list +Reuse the existing table but introduce a new `multipartPartsTable`. + +**Storage Layout:** +* **multipartInfoTable (RocksDB):** + * V0: Key → `OmMultipartKeyInfo` { parts inline } + * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 } +* **multipartPartsTable (RocksDB) [V1 only]:** + * `/uploadId/part1` → `PartKeyInfo` + * `/uploadId/part2` → `PartKeyInfo` + * `/uploadId/part3` → `PartKeyInfo` + + +```protobuf +message MultipartKeyInfo { + required string uploadID = 1; + required uint64 creationTime = 2; + required hadoop.hdds.ReplicationType type = 3; + optional hadoop.hdds.ReplicationFactor factor = 4; + repeated PartKeyInfo partKeyInfoList = 5; + optional uint64 objectID = 6; + optional uint64 updateID = 7; + optional uint64 parentID = 8; + optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 9; + optional uint32 schemaVersion = 10; // default 0 +} +``` + +#### V0: OmMultipartKeyInfo (parts inline) +``` +OmMultipartKeyInfo { + uploadID + creationTime + type + factor + partKeyInfoList: [ PartKeyInfo, PartKeyInfo, ... ] ← all parts inline + objectID + updateID + parentID + schemaVersion: 0 (or absent) +} +``` +##### V1: OmMultipartKeyInfo (empty list + schemaVersion) +``` +OmMultipartKeyInfo { + uploadID + creationTime + type + factor + partKeyInfoList: [] ← empty + objectID + updateID + parentID + schemaVersion: 1 +} +``` + +#### Example (for a 10 part MPU) + +--- +#### MultipartInfoTable : +``` +Key: `/vol1/bucket1/mp_file1/abc123-uuid-456` + +Value: +OmMultipartKeyInfo { + uploadID: "abc123-uuid-456" + creationTime: 1738742400000 + type: RATIS + factor: THREE + partKeyInfoList: [] + objectID: 1001 + updateID: 12345 + parentID: 0 + schemaVersion: 1 +} +``` + +#### MultipartPartsTable – 10 rows: + +```text +Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part1 +Value: PartKeyInfo { partName: ".../part1", partNumber: 1, partKeyInfo: KeyInfo{blocks, size,...} } + +Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part2 +Value: PartKeyInfo { partName: ".../part2", partNumber: 2, partKeyInfo: KeyInfo{...} } +... +... +Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part10 +Value: PartKeyInfo { partName: ".../part10", partNumber: 10, partKeyInfo: KeyInfo{...} } +``` + +### 2.2. Approach-2 : Introduce new multipartMetadataTable + +Split metadata and introduce two new tables: +- **multipartMetadataTable**: lightweight per-MPU metadata (no part list). +- **multipartPartsTable**: one row per part (no aggregation). + +Below is the new metadata table info object structure: +```protobuf +message MultipartMetadataInfo { + required string uploadID = 1; + required uint64 creationTime = 2; + required hadoop.hdds.ReplicationType type = 3; + optional hadoop.hdds.ReplicationFactor factor = 4; + optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 5; + optional uint64 objectID = 6; + optional uint64 updateID = 7; + optional uint64 parentID = 8; + optional uint32 schemaVersion = 9; // default 0 +} +``` + +#### Storage Layout Overview + +* **multipartInfoTable (RocksDB):** + * V0: Key → `OmMultipartKeyInfo` { parts inline } + * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 } +* **multipartPartsTable (RocksDB) [V1 only]:** + * `/uploadId/part1` → `PartKeyInfo` + * `/uploadId/part2` → `PartKeyInfo` + * `/uploadId/part3` → `PartKeyInfo` + +* **multipartInfoTable (RocksDB):** + * V0: `/vol/bucket/key/uploadId` → `OmMultipartKeyInfo { partKeyInfoList: [...] }` + + +* **multipartMetadataTable (RocksDB)** + * V1: `/vol/bucket/key/uploadId` → `MultipartMetadata { schemaVersion: 1 }` + + +* **multipartPartsTable (RocksDB) [v1 only]** + * `/vol/bucket/key/uploadId/part1` → `PartKeyInfo` + * `/vol/bucket/key/uploadId/part2` → PartKeyInfo + * `/vol/bucket/key/uploadId/part3` → PartKeyInfo + * `...` + +#### multipartMetadataInfo Table – 1 row +**V1: OmMultipartMetadataInfo (metadata only)** +```text +OmMultipartMetadataInfo { + uploadID + creationTime + type (ReplicationType) + factor (ReplicationFactor) + objectID + updateID + parentID + ecReplicationConfig + schemaVersion: 1 +} +``` + +```protobuf +message MultipartMetadata { + required string uploadID = 1; + required uint64 creationTime = 2; + required hadoop.hdds.ReplicationType type = 3; + optional hadoop.hdds.ReplicationFactor factor = 4; + optional uint64 objectID = 5; + optional uint64 updateID = 6; + optional uint64 parentID = 7; + optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 8; + optional uint32 schemaVersion = 9; + // NO partKeyInfoList - moved to new table +} +``` + +#### Example: + +--- +``` +Key: /vol1/bucket1/mp_file1/abc123-uuid-456 + +Value: +MultipartMetadata { + uploadID: "abc123-uuid-456" + creationTime: 1738742400000 + type: RATIS + factor: THREE + objectID: 1001 + updateID: 12345 + parentID: 0 + schemaVersion: 1 +} +``` + +--- + +### 2.3. Summary +* **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag. +* **Approach-2:** Dedicated table, cleanest separation, requires new lookup logic. + +---- +### 2.4. Chosen Approach: Approach-1 +We have chosen **Approach-1: Reuse multipartInfoTable with empty part list** +as the preferred implementation for MPU optimization (V1). + +This approach is favored because it introduces minimal changes to the existing `OmMultipartKeyInfo` protobuf structure. +
+By simply introducing an optional `schemaVersion` field and ensuring the partKeyInfoList is empty for V1 entries, +we maintain backward compatibility. + +The key advantages are: +* **Minimal Protobuf Change**: Older clients and processes can still read the multipartInfoTable entries without issue, + as the **core structure remains the same**. +* **Compatibility**: Older uploads (V0) **remain fully functional**, and new uploads (V1) can be distinguished by + the schemaVersion. This significantly reduces the risk of breaking existing functionality. +* **Simplicity**: The transition logic between V0 and V1 is straightforward, primarily checking the + `schemaVersion` field upon read. +--- + +## 3. Upgrades +Add a new feature in `OMLayoutFeature`: +```java +MPU_PARTS_TABLE_SPLIT(10, "Split multipart table into separate table for parts and key"); +``` +`schemaVersion` is set to `1` only when the upgrade is finalized. + +For each of the four S3MultipartUpload request types we need to ensure the check that split table layout feature +is allowed and only then we can set the schema as `version:1` +
+This ensures that no new writes are happening if the split table is not supported - +specially in cases where in pre-finalize the client may try to write a new MPU key. From c4416bb99d64c02579468ee3e6a0eb39d6b007b7 Mon Sep 17 00:00:00 2001 From: Abhishek Pal Date: Fri, 20 Feb 2026 19:32:08 +0530 Subject: [PATCH 02/12] Address review comments --- .../content/design/mpu-gc-optimization.md | 44 ++++++++----------- 1 file changed, 18 insertions(+), 26 deletions(-) diff --git a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md index f22d218cc255..810ba5fb8146 100644 --- a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md +++ b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md @@ -53,15 +53,15 @@ Presently Ozone has several overheads when uploading large files via Multipart u 3. GC pressure grows with the size of the object (HDDS-10611). #### a) Deserialization overhead -| Operation | Current | -| :--- | :--- | +| Operation | Current | +|:--------------|:--------------------------------------------------------| | Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) | #### b) WAL overhead Assuming one MPU part info object takes ~1.5KB. -| Scenario | Current WAL | -| :--- | :--- | +| Scenario | Current WAL | +|:------------|:--------------------------------| | 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB | #### c) GC pressure @@ -110,11 +110,11 @@ message MultipartPartInfo { ``` ### Comparison: V0 (legacy) vs V1 -| Metric | Current (V0) | Split-Table (V1) | -| :--- | :--- | :--- | -| **Commit part N** | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo | -| **1,000 parts WAL** | ~733 MB | ~1.5 MB (or ~600KB with optimized info) | -| **GC Pressure** | Large short-lived objects | Small metadata + single-part objects | +| Metric | Current (V0) | Split-Table (V1) | +|:--------------------|:------------------------------|:-------------------------------------------------| +| **Commit part N** | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo | +| **1,000 parts WAL** | ~733 MB | ~1.5 MB (or ~600KB with optimized info) | +| **GC Pressure** | Large short-lived objects | Small metadata + single-part objects | --- @@ -126,9 +126,9 @@ Reuse the existing table but introduce a new `multipartPartsTable`. * V0: Key → `OmMultipartKeyInfo` { parts inline } * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 } * **multipartPartsTable (RocksDB) [V1 only]:** - * `/uploadId/part1` → `PartKeyInfo` - * `/uploadId/part2` → `PartKeyInfo` - * `/uploadId/part3` → `PartKeyInfo` + * `/uploadId/part00001` → `PartKeyInfo` + * `/uploadId/part00002` → `PartKeyInfo` + * `/uploadId/part00003` → `PartKeyInfo` ```protobuf @@ -199,14 +199,14 @@ OmMultipartKeyInfo { #### MultipartPartsTable – 10 rows: ```text -Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part1 +Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part00001 Value: PartKeyInfo { partName: ".../part1", partNumber: 1, partKeyInfo: KeyInfo{blocks, size,...} } -Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part2 +Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part00002 Value: PartKeyInfo { partName: ".../part2", partNumber: 2, partKeyInfo: KeyInfo{...} } ... ... -Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part10 +Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part00010 Value: PartKeyInfo { partName: ".../part10", partNumber: 10, partKeyInfo: KeyInfo{...} } ``` @@ -233,14 +233,6 @@ message MultipartMetadataInfo { #### Storage Layout Overview -* **multipartInfoTable (RocksDB):** - * V0: Key → `OmMultipartKeyInfo` { parts inline } - * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 } -* **multipartPartsTable (RocksDB) [V1 only]:** - * `/uploadId/part1` → `PartKeyInfo` - * `/uploadId/part2` → `PartKeyInfo` - * `/uploadId/part3` → `PartKeyInfo` - * **multipartInfoTable (RocksDB):** * V0: `/vol/bucket/key/uploadId` → `OmMultipartKeyInfo { partKeyInfoList: [...] }` @@ -250,9 +242,9 @@ message MultipartMetadataInfo { * **multipartPartsTable (RocksDB) [v1 only]** - * `/vol/bucket/key/uploadId/part1` → `PartKeyInfo` - * `/vol/bucket/key/uploadId/part2` → PartKeyInfo - * `/vol/bucket/key/uploadId/part3` → PartKeyInfo + * `/vol/bucket/key/uploadId/part00001` → `PartKeyInfo` + * `/vol/bucket/key/uploadId/part00002` → `PartKeyInfo` + * `/vol/bucket/key/uploadId/part00003` → `PartKeyInfo` * `...` #### multipartMetadataInfo Table – 1 row From 28f8ac6afd6f87b435afd2ebcd4e70ffd3199fc7 Mon Sep 17 00:00:00 2001 From: Abhishek Pal Date: Sun, 22 Feb 2026 12:35:24 +0530 Subject: [PATCH 03/12] Update code flow logic details, fix sections, add industry practice section --- .../content/design/mpu-gc-optimization.md | 525 ++++++++++++++---- 1 file changed, 419 insertions(+), 106 deletions(-) diff --git a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md index 810ba5fb8146..9f3db8955c6e 100644 --- a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md +++ b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md @@ -26,17 +26,13 @@ author: Abhishek Pal, Rakesh Radhakrishnan ## Table of Contents 1. [Motivation](#1-motivation) 2. [Proposal](#2-proposal) -* [Backward Compatibility](#backward-compatibility) -* [Split-table design (V1)](#split-table-design-v1) -* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1) -* [2.1 Approach-1: Reuse multipartInfoTable with empty part list](#21-approach-1--reuse-multipartinfotable-with-empty-part-list) -* [2.2 Approach-2: Introduce new multipartMetadataTable](#22-approach-2--introduce-new-multipartmetadatatable) -* [2.3 Summary](#23-summary) -* [2.4 Chosen Approach: Approach-1](#24-chosen-approach-approach-1) + * [Split-table design (V2)](#split-table-design-v2) + * [Comparison: V1 (legacy) vs V2](#comparison-v1-legacy-vs-v2) + * [2.1 Data Layout Changes](#21-data-layout-changes) + * [2.2 MPU Flow Changes](#22-mpu-flow-changes) + * [2.3 Summary and Trade-offs](#23-summary-and-trade-offs) 3. [Upgrades](#3-upgrades) -4. [Benchmarking and Performance](#4-benchmarking-and-performance) -5. [Open Questions](#5-open-questions) - +4. [Industry Patterns](#4-industry-patterns-flattened-keys-in-lsmrocksdb-systems) --- ## 1. Motivation @@ -87,7 +83,7 @@ MultipartKeyInfo { ## 2. Proposal The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object. -### Split-table design (V1) +### Split-table design (V2) Split MPU metadata into: * **Metadata table:** Lightweight per-MPU metadata (no part list). * **Parts table:** One row per part (flat structure). @@ -109,8 +105,8 @@ message MultipartPartInfo { } ``` -### Comparison: V0 (legacy) vs V1 -| Metric | Current (V0) | Split-Table (V1) | +### Comparison: V1 (legacy) vs V2 +| Metric | Current (V1) | Split-Table (V2) | |:--------------------|:------------------------------|:-------------------------------------------------| | **Commit part N** | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo | | **1,000 parts WAL** | ~733 MB | ~1.5 MB (or ~600KB with optimized info) | @@ -118,18 +114,30 @@ message MultipartPartInfo { --- -### 2.1. Approach-1 : Reuse multipartInfoTable with empty part list -Reuse the existing table but introduce a new `multipartPartsTable`. +### 2.1 Data Layout Changes -**Storage Layout:** -* **multipartInfoTable (RocksDB):** - * V0: Key → `OmMultipartKeyInfo` { parts inline } - * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 } -* **multipartPartsTable (RocksDB) [V1 only]:** - * `/uploadId/part00001` → `PartKeyInfo` - * `/uploadId/part00002` → `PartKeyInfo` - * `/uploadId/part00003` → `PartKeyInfo` +#### 2.1.1 Chosen Approach (Implemented): Reuse `multipartInfoTable` + add `multipartPartsTable` +Keep `multipartInfoTable` for MPU metadata, and store part rows in `multipartPartsTable`. + +**Storage Layout:** +* **`multipartInfoTable` (RocksDB):** + * V1: Key -> `OmMultipartKeyInfo` { parts inline } + * V2: Key -> `OmMultipartKeyInfo` { empty list, `schemaVersion: 1` } +* **`multipartPartsTable` (RocksDB):** + * Key type: `OmMultipartPartKey(uploadId, partNumber)` + * Value type: `OmMultipartPartInfo` + +**`multipartPartsTable` key codec (V2):** +* `OmMultipartPartKey` uses two logical fields: + * `uploadId` (`String`) + * `partNumber` (`int32`) +* Persisted key bytes are encoded as: + * `uploadId(UTF-8 bytes)` + `0x00` + `partNumber(4-byte big-endian int)` +* Prefix scan for all parts in one upload uses: + * `uploadId(UTF-8 bytes)` + `0x00` + +This replaces the previous string-padding style examples (like `part00001`) with typed key encoding. ```protobuf message MultipartKeyInfo { @@ -146,28 +154,29 @@ message MultipartKeyInfo { } ``` -#### V0: OmMultipartKeyInfo (parts inline) +##### V1: `OmMultipartKeyInfo` (parts inline) ``` OmMultipartKeyInfo { uploadID creationTime type factor - partKeyInfoList: [ PartKeyInfo, PartKeyInfo, ... ] ← all parts inline + partKeyInfoList: [ PartKeyInfo, PartKeyInfo, ... ] <- all parts inline objectID updateID parentID schemaVersion: 0 (or absent) } ``` -##### V1: OmMultipartKeyInfo (empty list + schemaVersion) + +##### V2: `OmMultipartKeyInfo` (empty list + schemaVersion) ``` OmMultipartKeyInfo { uploadID creationTime type factor - partKeyInfoList: [] ← empty + partKeyInfoList: [] <- empty objectID updateID parentID @@ -175,12 +184,11 @@ OmMultipartKeyInfo { } ``` -#### Example (for a 10 part MPU) +##### Example (for a 10-part MPU) ---- -#### MultipartInfoTable : +`multipartInfoTable`: ``` -Key: `/vol1/bucket1/mp_file1/abc123-uuid-456` +Key: /vol1/bucket1/mp_file1/abc123-uuid-456 Value: OmMultipartKeyInfo { @@ -196,27 +204,35 @@ OmMultipartKeyInfo { } ``` -#### MultipartPartsTable – 10 rows: - +`multipartPartsTable` (logical keys): ```text -Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part00001 -Value: PartKeyInfo { partName: ".../part1", partNumber: 1, partKeyInfo: KeyInfo{blocks, size,...} } +Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=1} +Value: OmMultipartPartInfo{partNumber=1, partName=".../part1", ...} -Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part00002 -Value: PartKeyInfo { partName: ".../part2", partNumber: 2, partKeyInfo: KeyInfo{...} } -... +Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=2} +Value: OmMultipartPartInfo{partNumber=2, partName=".../part2", ...} ... -Key: /vol1/bucket1/mp_file1/abc123-uuid-456/part00010 -Value: PartKeyInfo { partName: ".../part10", partNumber: 10, partKeyInfo: KeyInfo{...} } +Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=10} +Value: OmMultipartPartInfo{partNumber=10, partName=".../part10", ...} ``` -### 2.2. Approach-2 : Introduce new multipartMetadataTable +`multipartPartsTable` (encoded key bytes, illustrative): +```text +uploadId = "abc123-uuid-456" +partNumber = 2 + +encodedKey = [61 62 63 31 32 33 2d 75 75 69 64 2d 34 35 36 00 00 00 00 02] + [--------------uploadId UTF-8------------][00][--int32 BE--] +``` + +#### 2.1.2 Alternative Approach (Proposed): Add `multipartMetadataTable` + `multipartPartsTable` Split metadata and introduce two new tables: -- **multipartMetadataTable**: lightweight per-MPU metadata (no part list). -- **multipartPartsTable**: one row per part (no aggregation). +* **`multipartMetadataTable`**: lightweight per-MPU metadata (no part list). +* **`multipartPartsTable`**: one row per part (no aggregation). + +> Note: This approach is proposed and not implemented in current code. -Below is the new metadata table info object structure: ```protobuf message MultipartMetadataInfo { required string uploadID = 1; @@ -231,37 +247,14 @@ message MultipartMetadataInfo { } ``` -#### Storage Layout Overview - -* **multipartInfoTable (RocksDB):** - * V0: `/vol/bucket/key/uploadId` → `OmMultipartKeyInfo { partKeyInfoList: [...] }` - - -* **multipartMetadataTable (RocksDB)** - * V1: `/vol/bucket/key/uploadId` → `MultipartMetadata { schemaVersion: 1 }` - - -* **multipartPartsTable (RocksDB) [v1 only]** - * `/vol/bucket/key/uploadId/part00001` → `PartKeyInfo` - * `/vol/bucket/key/uploadId/part00002` → `PartKeyInfo` - * `/vol/bucket/key/uploadId/part00003` → `PartKeyInfo` - * `...` - -#### multipartMetadataInfo Table – 1 row -**V1: OmMultipartMetadataInfo (metadata only)** -```text -OmMultipartMetadataInfo { - uploadID - creationTime - type (ReplicationType) - factor (ReplicationFactor) - objectID - updateID - parentID - ecReplicationConfig - schemaVersion: 1 -} -``` +**Storage Layout Overview:** +* **`multipartInfoTable` (RocksDB):** + * V1: `/vol/bucket/key/uploadId` -> `OmMultipartKeyInfo { partKeyInfoList: [...] }` +* **`multipartMetadataTable` (RocksDB):** + * V2: `/vol/bucket/key/uploadId` -> `MultipartMetadata { schemaVersion: 1 }` +* **`multipartPartsTable` (RocksDB):** + * Key: `OmMultipartPartKey(uploadId, partNumber)` + * Value: `PartKeyInfo`-equivalent part payload ```protobuf message MultipartMetadata { @@ -278,9 +271,7 @@ message MultipartMetadata { } ``` -#### Example: - ---- +Example: ``` Key: /vol1/bucket1/mp_file1/abc123-uuid-456 @@ -297,40 +288,362 @@ MultipartMetadata { } ``` ---- +### 2.2 MPU Flow Changes + +#### 2.2.1 Chosen Approach Flow Changes + +##### Multipart Upload Initiate + +**Old Flow** +* Create `multipartKey = /{vol}/{bucket}/{key}/{uploadId}`. +* Build `OmMultipartKeyInfo` (schema default/legacy, inline `partKeyInfoList` model). +* Write: + * `openKeyTable[multipartKey] = OmKeyInfo` + * `multipartInfoTable[multipartKey] = OmMultipartKeyInfo` + +Example: +```text +multipartInfoTable[/vol1/b1/fileA/upload-001] -> + OmMultipartKeyInfo{schemaVersion=0, partKeyInfoList=[]} +openKeyTable[/vol1/b1/fileA/upload-001] -> + OmKeyInfo{key=fileA, objectID=9001} +``` + +**New Flow** +* Same keys/tables as old flow, but initiate sets `schemaVersion` explicitly: + * `schemaVersion=1` when `OMLayoutFeature.MPU_PARTS_TABLE_SPLIT` is allowed. + * `schemaVersion=0` otherwise. +* No part row is created at initiate time; part rows are created during commit-part. +* FSO response path (`S3InitiateMultipartUploadResponseWithFSO`) still writes parent directory entries, then open-file + multipart-info rows. +* Backward compatibility check overview: write path selection is schema-based and layout-gated (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). + +Example: +```text +multipartInfoTable[/vol1/b1/fileA/upload-001] -> + OmMultipartKeyInfo{schemaVersion=1, partKeyInfoList=[]} +``` + +##### Multipart Upload Commit Part + +**Old Flow** +* Read `multipartInfoTable[multipartKey]`. +* Read current uploaded part blocks from `openKeyTable[getOpenKey(..., clientID)]`. +* Upsert in inline map: + * `oldPart = multipartKeyInfo.getPartKeyInfo(partNumber)` + * `multipartKeyInfo.addPartKeyInfo(currentPart)` +* Delete committed one-shot open key for this part. +* Update quota based on overwrite delta. + +Example: +```text +Before: partKeyInfoList=[{part=1,size=64MB},{part=2,size=32MB}] +Commit part 2 size=40MB +After: partKeyInfoList=[{part=1,size=64MB},{part=2,size=40MB}] +``` + +**New Flow** +* Load `multipartKeyInfo` and validate layout gate: + * if split feature is not allowed and `schemaVersion != 0`, fail early. +* Branch by schema: + * `schemaVersion=0`: same old inline behavior. + * `schemaVersion=1`: + * create `multipartPartKey = OmMultipartPartKey(uploadId, partNumber)`, + * write `multipartPartTable[multipartPartKey] = OmMultipartPartInfo{openKey, partName, partNumber, size, metadata, locations}`, + * keep current part open key in `openKeyTable` (needed later by list/complete/abort), + * if overwriting an existing part row, delete old part open key and adjust quota. +* `multipartInfoTable[multipartKey]` is still updated for metadata/updateID. +* Backward compatibility check overview: schema decides row format, and split behavior is blocked before finalization (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). + +Example (`schemaVersion=1`): +```text +multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] -> + OmMultipartPartInfo{partNumber=2, openKey=/vol1/b1/fileA#client-77, size=40MB} +``` + +##### Multipart Upload Complete + +**Old Flow** +* Read `multipartInfoTable[multipartKey]` and validate ordered requested parts. +* Validate each requested part from inline `partKeyInfoMap`. +* Build final key block list from selected parts. +* Write final key: + * `keyTable[/vol1/b1/fileA] = OmKeyInfo{locations=p1+p2+...}` +* Delete MPU state: + * `openKeyTable[multipartOpenKey]` + * `multipartInfoTable[multipartKey]` +* Move unused parts to deleted table. + +Example: +```text +Complete [1,2,3] -> keyTable[/vol1/b1/fileA] written, MPU rows removed +``` + +**New Flow** +* Validate layout gate first (same pattern as commit/abort when `schemaVersion != 0`). +* Load part materialization by schema: + * `schemaVersion=0`: use inline `partKeyInfoMap`. + * `schemaVersion=1`: + * scan `multipartPartTable` with prefix `OmMultipartPartKey.prefix(uploadId)`, + * rebuild `PartKeyInfo` view from `OmMultipartPartInfo`, + * pull block locations from stored part metadata / open keys as needed. +* Perform same user-facing validation (order, existence, eTag/partName, min size). +* Commit final key to `keyTable`. +* Cleanup: + * always delete `multipartInfoTable[multipartKey]` and `openKeyTable[multipartOpenKey]`, + * for split schema also delete all matching `multipartPartTable` rows and their tracked part open keys. +* Backward compatibility check overview: completion transparently supports both persisted schemas and enforces layout gating for split rows (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). + +Example (`schemaVersion=1` cleanup): +```text +Delete: + multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=1}] + multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] + ... + openKeyTable[part-open-key-1], openKeyTable[part-open-key-2], ... +``` + +##### Multipart Upload Abort + +**Old Flow** +* Read `multipartInfoTable[multipartKey]`. +* Release bucket used-bytes from inline `partKeyInfoMap`. +* Delete `openKeyTable[multipartOpenKey]` and `multipartInfoTable[multipartKey]`. +* Move part key infos to deleted table. + +Example: +```text +Abort upload-001 -> delete MPU metadata/open rows and tombstone parts +``` + +**New Flow** +* Validate layout gate (reject split behavior before finalization if schema indicates split). +* Branch by schema for quota and cleanup: + * `schemaVersion=0`: same legacy inline part iteration. + * `schemaVersion=1`: + * iterate `multipartPartTable` by prefix (`OmMultipartPartKey.prefix(uploadId)`), + * compute released quota from part rows, + * for each part: move corresponding open key to deleted table, delete part open key, delete part row. +* Delete `multipartInfoTable[multipartKey]` and `multipartOpenKey` entry. +* Backward compatibility check overview: abort handles both old inline and new split rows in the same codepath (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)). + +Example (`schemaVersion=1`): +```text +multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=3}] -> deleted +openKeyTable[/vol1/b1/fileA#client-90] -> deleted +deletedTable[objectId:/vol1/b1/fileA/upload-001] -> appended +``` + +##### Multipart Upload List Parts + +**Old Flow** +* API path: + * S3G `MultipartKeyHandler.listParts(...)` + * `OzoneManager.listParts(...)` + * `KeyManagerImpl.listParts(...)` +* Read `multipartInfoTable[multipartKey]`. +* Iterate inline `partKeyInfoMap` in part-number order; apply marker + `maxParts`. +* If no part entries exist yet, read replication from `openKeyTable[multipartKey]`. + +Example: +```text +parts=[1,2,3,4], marker=1, maxParts=2 => return [2,3], nextMarker=3, truncated=true +``` + +**New Flow** +* Same API path and response contract. +* Branch by schema in `KeyManagerImpl.listParts`: + * `schemaVersion=0`: iterate inline `partKeyInfoMap`. + * `schemaVersion=1`: + * scan `multipartPartTable` by `OmMultipartPartKey.prefix(uploadId)`, + * resolve each part's `openKey` from `openKeyTable`, + * build `PartKeyInfo` view on the fly and paginate. +* Replication fallback remains from MPU open key when no part is returned. + +Example (`schemaVersion=1` read materialization): +```text +multipartPartTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] -> + openKey=/vol1/b1/fileA#c2 +openKeyTable[/vol1/b1/fileA#c2] -> KeyInfo{size=40MB, eTag=...} +listParts response partNumber=2, size=40MB, eTag=... +``` + +##### Related APIs +* `ListMultipartUploads` remains upload-session listing (not part listing), returning upload IDs and metadata from MPU entries. +* Wire path for list-parts remains unchanged: + * `MultipartKeyHandler` (S3G) -> `OzoneBucket.listParts` -> OM RPC (`OzoneManagerRequestHandler`) -> `OzoneManager.listParts` -> `KeyManagerImpl.listParts`. + +#### 2.2.2 Alternative Approach Flow (Proposed) + +##### Multipart Upload Initiate + +**Old Flow** +* Write `openKeyTable` + `multipartInfoTable` (`OmMultipartKeyInfo`), with legacy inline-part model. + +**New Flow** +* Write `openKeyTable` + `multipartMetadataTable` (new metadata object, no part list). +* `multipartInfoTable` is no longer used for new V1 MPU writes in this approach. +* `schemaVersion` (or equivalent layout marker) is stored in `multipartMetadataTable` row. + +Example: +```text +multipartMetadataTable[/vol1/b1/fileA/upload-001] -> + MultipartMetadata{schemaVersion=1, replication=RATIS/THREE, objectID=9001} +openKeyTable[/vol1/b1/fileA/upload-001] -> + OmKeyInfo{key=fileA, locations=[empty]} +``` + +##### Multipart Upload Commit Part + +**Old Flow** +* Read/modify/write one `multipartInfoTable` row by updating inline part list. + +**New Flow** +* Read `multipartMetadataTable` only for MPU metadata + validation context. +* Write one row per part into `multipartPartsTable`: + * key: `OmMultipartPartKey(uploadId, partNumber)`, + * value: `PartKeyInfo` / equivalent flattened part payload. +* Avoid rewriting a large aggregate MPU value for each part. + +Example: +```text +multipartPartsTable[OmMultipartPartKey{uploadId="upload-001", partNumber=2}] -> + PartKeyInfo{partNumber=2, partName=..., keyInfo={blocks,size,etag}} +``` + +##### Multipart Upload Complete + +**Old Flow** +* Read parts from inline `partKeyInfoList` in `multipartInfoTable`. +* Commit final key, delete MPU row and MPU open row. + +**New Flow** +* Read MPU metadata from `multipartMetadataTable`. +* Scan `multipartPartsTable` prefix to gather candidate parts; validate request list/order/eTags. +* Build final key and commit to `keyTable`. +* Cleanup: + * delete `multipartMetadataTable[multipartKey]`, + * delete all `multipartPartsTable` rows for that upload, + * delete MPU open key and part open keys. + +##### Multipart Upload Abort + +**Old Flow** +* Iterate inline `partKeyInfoList` to release quota and move parts to deleted table. + +**New Flow** +* Read `multipartMetadataTable` for replication + MPU identity. +* Iterate `multipartPartsTable` by prefix for quota release and delete-table movement. +* Delete metadata row + all part rows + corresponding open keys. + +##### Multipart Upload List Parts + +**Old Flow** +* `listParts` reads `multipartInfoTable` and paginates inline `partKeyInfoList`. -### 2.3. Summary +**New Flow** +* `listParts` reads `multipartMetadataTable` for MPU existence/metadata. +* Materialize part listing from `multipartPartsTable` prefix scan and paginate by part number marker. +* If needed, join with `openKeyTable` for additional per-part runtime fields. + +##### Compatibility and Upgrade Guard +* Same compatibility strategy as section [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating): + * legacy rows continue on old path, + * new writes only when layout feature is finalized, + * reject unsupported split-layout mutations pre-finalize. + +### 2.3 Summary and Trade-offs * **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag. -* **Approach-2:** Dedicated table, cleanest separation, requires new lookup logic. - ----- -### 2.4. Chosen Approach: Approach-1 -We have chosen **Approach-1: Reuse multipartInfoTable with empty part list** -as the preferred implementation for MPU optimization (V1). - -This approach is favored because it introduces minimal changes to the existing `OmMultipartKeyInfo` protobuf structure. -
-By simply introducing an optional `schemaVersion` field and ensuring the partKeyInfoList is empty for V1 entries, -we maintain backward compatibility. - -The key advantages are: -* **Minimal Protobuf Change**: Older clients and processes can still read the multipartInfoTable entries without issue, - as the **core structure remains the same**. -* **Compatibility**: Older uploads (V0) **remain fully functional**, and new uploads (V1) can be distinguished by - the schemaVersion. This significantly reduces the risk of breaking existing functionality. -* **Simplicity**: The transition logic between V0 and V1 is straightforward, primarily checking the - `schemaVersion` field upon read. ---- +* **Approach-2:** Dedicated metadata table, cleanest separation, requires broader refactor. + +#### Pros and Cons + +| | Pros | Cons | +|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Chosen Approach | * Minimal migration risk
* Reuses existing `OmMultipartKeyInfo` plumbing
* Easiest incremental rollout with `schemaVersion` gating.
* Lower implementation impact request/response paths. | * Carries complexity for mixed code (same table serving legacy + split metadata modes).
* Still coupled to `OmMultipartKeyInfo`.
* More conditional logic over time. | +| Alternative Approach | * Clean separation of concerns (`multipartMetadataTable` vs `multipartPartsTable`)
* Clearer long-term model and easier mental mapping.
* Avoids overloading legacy value type. | * Requires wider code changes (new codecs/table wiring/request/response updates).
* Higher migration and compatibility test surface.
* More rollout complexity than chosen approach. | ## 3. Upgrades Add a new feature in `OMLayoutFeature`: ```java MPU_PARTS_TABLE_SPLIT(10, "Split multipart table into separate table for parts and key"); ``` -`schemaVersion` is set to `1` only when the upgrade is finalized. -For each of the four S3MultipartUpload request types we need to ensure the check that split table layout feature -is allowed and only then we can set the schema as `version:1` -
-This ensures that no new writes are happening if the split table is not supported - -specially in cases where in pre-finalize the client may try to write a new MPU key. +### 3.1 Backward compatibility and layout gating + +Backward compatibility is handled by combining `schemaVersion` with layout-feature checks. + +- **New MPU initiate writes** + - `schemaVersion` is set to `1` only when `MPU_PARTS_TABLE_SPLIT` is allowed. + - Otherwise initiate writes `schemaVersion=0` and stays on legacy inline part behavior. + +- **Existing MPU rows** + - `schemaVersion=0` rows continue to use legacy inline-part read/write paths. + - `schemaVersion=1` rows use split-table paths (`multipartPartsTable` + tracked open keys). + +- **Pre-finalize protection** + - Mutating split-table operations (commit part / complete / abort) check: + - if feature is not allowed and `schemaVersion != 0`, reject with `NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION`. + - This prevents accidental split-layout writes/updates before finalization. + +- **Read compatibility** + - `listParts` supports both schemas: + - schema 0 -> read inline `partKeyInfoMap`, + - schema 1 -> materialize parts from `multipartPartsTable` + `openKeyTable`. + +In short: old MPU entries keep working unchanged, new entries only use split layout when cluster layout allows it, and write paths are guarded to avoid unsafe transitions. + +## 4. Industry Patterns: Flattened Keys in LSM/RocksDB Systems + +Using flattened keys (for example, `baseKey + sortable suffix`) is a common design in RocksDB-backed systems. +The MPU `multipartPartsTable` layout follows the same principle by using one row per part keyed by +`OmMultipartPartKey(uploadId, partNumber)` with byte-level ordering from codec serialization. + +### 4.1 Why this is common + +RocksDB is optimized for: +- point lookups by key, +- prefix/range scans over lexicographically ordered keys, +- append-like write patterns with small values. + +Flattened schemas map naturally to this model: +- each logical sub-record (part/version) becomes an independent KV row, +- updates rewrite only one small row instead of a large aggregate object, +- range scans can fetch all rows for one logical entity via a shared prefix. + +### 4.2 MVCC systems as examples + +In MVCC-oriented systems such as CockroachDB and TiKV, a common pattern is: +- encode the user key as a prefix, +- encode version dimension (timestamp / sequence / commit-ts) in key suffix, +- use key ordering to make version reads and range scans efficient. + +High-level shape: +```text +/ +``` + +Example (illustrative): +```text +user:42@1699000001 +user:42@1699000005 +user:42@1699000010 +``` + +The idea is conceptually similar to MPU part storage: +```text +OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=1} +OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=2} +OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=3} +``` + +Both designs rely on ordered keys to make grouped reads/writes efficient. + +### 4.3 Relevance to MPU optimization + +For MPU, flattened part rows provide: +- lower write amplification per part commit (single-row updates), +- lower object allocation pressure in OM (no repeated large list rebuild), +- straightforward cleanup by prefix scan during complete/abort, +- better operational visibility (`one row = one part`). + +This is why the split schema is not just an optimization for this code path, but also a storage-layout pattern that aligns well with how LSM/RocksDB systems are typically modeled. From 5a721f772f3f406bfce0eb2045eb3ab1598ea16a Mon Sep 17 00:00:00 2001 From: Abhishek Pal Date: Sun, 22 Feb 2026 19:46:32 +0530 Subject: [PATCH 04/12] Update section headings --- hadoop-hdds/docs/content/design/mpu-gc-optimization.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md index 9f3db8955c6e..cfb373e701b3 100644 --- a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md +++ b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md @@ -116,7 +116,7 @@ message MultipartPartInfo { ### 2.1 Data Layout Changes -#### 2.1.1 Chosen Approach (Implemented): Reuse `multipartInfoTable` + add `multipartPartsTable` +#### 2.1.1 Chosen Approach: Reuse `multipartInfoTable` + add `multipartPartsTable` Keep `multipartInfoTable` for MPU metadata, and store part rows in `multipartPartsTable`. @@ -225,7 +225,7 @@ encodedKey = [61 62 63 31 32 33 2d 75 75 69 64 2d 34 35 36 00 00 00 00 02] [--------------uploadId UTF-8------------][00][--int32 BE--] ``` -#### 2.1.2 Alternative Approach (Proposed): Add `multipartMetadataTable` + `multipartPartsTable` +#### 2.1.2 Alternative Approach: Add `multipartMetadataTable` + `multipartPartsTable` Split metadata and introduce two new tables: * **`multipartMetadataTable`**: lightweight per-MPU metadata (no part list). From c2102ba9779a300ba7786fc6d5764901985a48b1 Mon Sep 17 00:00:00 2001 From: Abhishek Pal Date: Sun, 22 Feb 2026 21:16:35 +0530 Subject: [PATCH 05/12] Correct some statements --- .../content/design/mpu-gc-optimization.md | 37 +++++++------------ 1 file changed, 14 insertions(+), 23 deletions(-) diff --git a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md index cfb373e701b3..dc5f119cd5c4 100644 --- a/hadoop-hdds/docs/content/design/mpu-gc-optimization.md +++ b/hadoop-hdds/docs/content/design/mpu-gc-optimization.md @@ -3,7 +3,7 @@ title: Multipart Upload GC Pressure Optimizations summary: Change Multipart Upload Logic to improve OM GC Pressure date: 2026-02-19 jira: HDDS-10611 -status: implemented +status: proposed author: Abhishek Pal, Rakesh Radhakrishnan ---