HDDS-10611. Design document for MPU GC Optimization #9793
devabhishekpal wants to merge 12 commits into apache:master
Conversation
Thanks @devabhishekpal for the design on this long overdue issue. I am +1 on the overall direction. Left some comments.
Thanks for the exhaustive review and inputs @ivandika3. FYI, I have a sample/PoC patch created if anybody wants to check the changes.
errose28 left a comment
Thanks for the design @devabhishekpal @rakeshadr. Overall LGTM, just a few things we can clarify in the doc.
> * `uploadId(UTF-8 bytes)` + `0x00`
>
> ```protobuf
> message MultipartKeyInfo {
> ```
This should be the same as the current MultipartKeyInfo, but we should label the now unused fields as deprecated instead of removing them.
We are not removing the fields entirely, since that would reintroduce the backward-compatibility problem with older clients; hence we are not marking them as deprecated.
Maybe this is something we can optimize further once we know the old flow/inline storage is no longer in use - perhaps something for a release line after the current one?
In protobuf, deprecated is just a label that the field should not be used going forward, it does not change any functionality so it is safe to put on the fields in the metadata entry that we will not use in the new schema. See deprecated in the proto3 language guide.
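For illustration, a hedged sketch of what the label could look like. The field name and number below are hypothetical, not taken from the actual MultipartKeyInfo definition:

```protobuf
message MultipartKeyInfo {
  // ... fields still used by the new schema stay unchanged ...

  // Hypothetical example: kept on the wire for compatibility with older
  // clients, but flagged so new code does not write to it.
  repeated PartKeyInfo partKeyInfoList = 5 [deprecated = true];
}
```

As noted, `deprecated` is purely a documentation signal in protobuf (it may generate compiler/IDE warnings in some languages) and does not change serialization behavior.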
> * `uploadId(UTF-8 bytes)` + `0x00`
>
> ```text
> Note: The null byte separator ensures that the "uploadId" is properly delimited from the "partNumber" in the byte encoding, allowing for correct lexicographical ordering.
> ```
Are we guaranteed that nothing in the upload ID or part number will encode to this null byte, making it a reliable separator? In container schema v3 on the datanode we use | to avoid this conflict.
Yes, there isn't anything which would be encoded to the NULL byte.
The multipart upload ID is generated by `getMultipartUploadId()`:

```java
public static String getMultipartUploadId() {
  return UUIDv7.randomUUID() + "-" + UniqueId.next();
}
```

This is going to consist entirely of alphanumeric characters and hyphens.
The part number is going to be an incremental integer up to 10,000.
Hence nothing would encode to the null byte, and this separator is safe to use here.
> The part number is going to be an incremental integer till 10,000
This is big endian right? So the leftmost bytes in the serialized output would be null bytes until the number reaches a certain size. This could still work, we would just need to look for the first null byte after the upload ID, and therefore need to be guaranteed that the upload ID will not contain \0.
This still seems somewhat fragile, since the higher levels are dependent on the implicit serialization method that getMultipartUploadId and by extension UniqueId#next abstracts away.
Would it be better to use a fixed length string for the upload ID so we can seek to the end to know where the part starts? Or use a different delimiter character like / or | appended to the string part like we do in container and FSO tables?
I think some of the confusion here is also coming from the mix of string and int types in the key, which is not a practice we use anywhere else in the code to my knowledge. I think FSO should be the reference here, and in that case the entire key is made a string.
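To make the leading-null-byte point concrete, here is a minimal standalone sketch (class and method names are hypothetical, not from the Ozone codebase). Java's `ByteBuffer` writes ints big-endian by default:

```java
import java.nio.ByteBuffer;

public class BigEndianDemo {
  // Serialize a part number as a 4-byte big-endian array.
  public static byte[] encode(int partNumber) {
    return ByteBuffer.allocate(4).putInt(partNumber).array();
  }

  public static void main(String[] args) {
    byte[] b = encode(47);
    // 47 = 0x0000002f, so the first three serialized bytes are null
    System.out.printf("%02x %02x %02x %02x%n", b[0], b[1], b[2], b[3]); // 00 00 00 2f
  }
}
```

Any part number below 2^24 leaves at least one leading null byte in the suffix, which is why the delimiter search must run from the upload-ID side and the upload ID itself must be guaranteed free of `\0`.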
I agree; thinking about the implications of how the unique ID is generated, the dependency on the abstraction behind UniqueId#next might bring in behaviour changes in the future which could potentially break this.
However, moving to a string codec means we again bring in the overhead of padding the number using string utils.
Maybe I can change the separator to a / instead of the null byte, as that is the logical delimiter for a path.
When the key gets byte-encoded, the part would still have leading null bytes until the number reaches the size range, but we could look for the / byte, which is guaranteed not to be used anywhere else.
@errose28 do you mean:
- `partNumber` -> big-endian bytes -> hex/base64 string
- key = `uploadId` + "/" + `encodedPartBytes` from the above transition

And then we UTF-8 encode it?
The other way to ensure `2f` doesn't occur again is to directly encode the long as a string and then convert to bytes:
`uploadID + "/" + String.valueOf(partNumber)` -> byte encode
But this again needs the overhead of zero-padding partNumber on the left, plus the conversion overhead.
@errose28 the method I am thinking of, as far as I understand from your suggestion, is:
- Start with, say:
  - `uploadId = "abc123-uuid-456"`
  - `partNumber = 47`
- Convert 47 to int32 big-endian bytes: `00 00 00 2f`
- Convert those bytes to hex text: `0000002f`
- Build the full key string: `abc123-uuid-456/0000002f`

And the above is what will be stored as the RocksDB key.
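A minimal sketch of that transformation (class and method names are hypothetical):

```java
public class HexKeyDemo {
  // Build the key as uploadId + "/" + zero-padded lowercase hex of the part number.
  static String toKey(String uploadId, int partNumber) {
    return uploadId + "/" + String.format("%08x", partNumber);
  }

  public static void main(String[] args) {
    System.out.println(toKey("abc123-uuid-456", 47)); // abc123-uuid-456/0000002f
  }
}
```

Because the hex suffix is fixed-width, lexicographic order of the suffix matches numeric order of the part number, at the cost of the int -> hex -> string conversion.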
So the above approach adds the conversion overhead. One way I figured out is:
Keep the binary 4-byte suffix, and decode by checking the full-key layout first (`len - 5`), then the prefix layout (`len - 1`).
Say for the below sample input:
- uploadId = `<abcd-1234...>`
- partNumber = 47 (taking 47 because it has `2f` in its byte representation)

Encoding format:
`keyBytes = UTF8(uploadId) + '/' + int32_be(partNumber)`

For 47:
- int32 big-endian bytes = `00 00 00 2f`

So the final tail is:
- separator `/` = `2f`
- part bytes = `00 00 00 2f`
- total tail bytes = `2f 00 00 00 2f`

Readable-ish representation: `<uploadId>/\x00\x00\x00\x2f`

Now we run the following checks:
- Check full-key first: if the byte at `len - 5` is `/`, decode as a full key (used to identify a full row).
- Else, if the byte at `len - 1` is `/`, decode as a prefix (used for prefix scans when iterating).
- Else, invalid.

This also skips the extra hops of converting byte -> hex -> string.
The assumption here is that the key has sufficient length such that `len - 5` doesn't break.
I asked ChatGPT for edge cases on when this can break, and it says that in normal operation this should not break. Below is its explanation:
Why valid keys always have enough length
For full key (binary-tail design):
- format is uploadId + '/' + 4 bytes
- minimum valid size = 1 + 1 + 4 = 6 bytes (even uploadId length 1)
For prefix key:
- format is uploadId + '/'
- minimum valid size = 1 + 1 = 2 bytes
So any legit key generated by toPersistedFormat is long enough.
When it can be too short
Only if decoder sees invalid raw data, for example:
- empty byte array
- truncated/corrupted key bytes in DB/scan tool
- wrong table decoded with wrong codec
- manual/debug insertion of malformed key
So the length check is not for normal runtime correctness; it’s for robustness and clean failure on bad bytes.
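A rough standalone sketch of the encode path and the two length checks described above, assuming the binary 4-byte suffix layout (names are hypothetical, not the actual Ozone codec):

```java
import java.nio.charset.StandardCharsets;

public class MpuKeyCodec {
  private static final byte SEP = (byte) '/';

  enum KeyKind { FULL, PREFIX, INVALID }

  // keyBytes = UTF8(uploadId) + '/' + int32_be(partNumber)
  static byte[] fullKey(String uploadId, int partNumber) {
    byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[id.length + 5];
    System.arraycopy(id, 0, out, 0, id.length);
    out[id.length] = SEP;
    out[id.length + 1] = (byte) (partNumber >>> 24);
    out[id.length + 2] = (byte) (partNumber >>> 16);
    out[id.length + 3] = (byte) (partNumber >>> 8);
    out[id.length + 4] = (byte) partNumber;
    return out;
  }

  // Full key: separator sits exactly 5 bytes from the end (min size 6).
  static boolean isFullKey(byte[] key) {
    return key.length >= 6 && key[key.length - 5] == SEP;
  }

  // Prefix key (uploadId + '/'): separator is the last byte (min size 2).
  static boolean isPrefixKey(byte[] key) {
    return key.length >= 2 && key[key.length - 1] == SEP;
  }

  // Order matters: the full-key check must run before the prefix check,
  // because the last part byte can itself be 0x2f ('/'), e.g. partNumber 47.
  static KeyKind classify(byte[] key) {
    if (isFullKey(key)) {
      return KeyKind.FULL;
    }
    if (isPrefixKey(key)) {
      return KeyKind.PREFIX;
    }
    return KeyKind.INVALID;
  }
}
```

This relies on the upload ID being alphanumeric-plus-hyphens, so a `/` at `len - 5` can only be the separator.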
Encoding the whole key to UTF-8 (similar to FSO) or with FixedLengthStringCodec (ISO-8859-1) will still not be numerically sorted.
In FSO the keys/files are grouped by the parentID, hence the ordering of the files themselves do not matter.
For listing children of one parent, OM computes a parent prefix:
- getOzonePathKey(volumeId, bucketId, parentId, "")
- Then it scans entries under that prefix and merges file/dir iterators with a min-heap, final output is accumulated in a TreeMap (sorted).
Similarly, in the DN schema-v3 block-data tables the key format is effectively `containerPrefix + localBlockIdAsString`.
All entries are grouped by the containerID prefix, but within a container the block suffix is decimal string text ("1", "10", "2"), so sibling block IDs sort in lexicographic text order, not numeric order.
@errose28 in both the cases it seems we are not caring about the ordering in RocksDB - which we need to do here.
For FSO we are doing the sorting on demand instead of in RocksDB.
So maybe we have no workaround for this.
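To illustrate that ordering point with a standalone sketch (not Ozone code): RocksDB's default bytewise comparator sees UTF-8 decimal-string keys in lexicographic order, which Java's `String` sort reproduces:

```java
import java.util.Arrays;

public class LexOrderDemo {
  public static void main(String[] args) {
    String[] blockIds = {"1", "2", "10"};
    // Lexicographic sort, same relative order a bytewise comparator
    // produces over the UTF-8 encodings of these strings.
    Arrays.sort(blockIds);
    System.out.println(Arrays.toString(blockIds)); // [1, 10, 2]
  }
}
```

"10" lands between "1" and "2" because comparison stops at the first differing byte, which is exactly why decimal-string suffixes cannot give numeric iteration order.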
> ```protobuf
> required uint64 modificationTime = 8;
> required uint64 objectID = 9;
> required uint64 updateID = 10;
> repeated hadoop.hdds.KeyValue metadata = 11;
> ```
I think S3 only supports adding metadata to the whole key, not individual parts. Can we check the spec on this?
Hmm, going through the UploadPart API documentation it seems like the request is allowed to have some x-amz-meta* type metadata / headers.
Example:

```text
PUT /Key+?partNumber=PartNumber&uploadId=UploadId HTTP/1.1
Host: Bucket.s3.amazonaws.com
Content-Length: ContentLength
Content-MD5: ContentMD5
x-amz-sdk-checksum-algorithm: ChecksumAlgorithm
x-amz-checksum-crc32: ChecksumCRC32
x-amz-checksum-crc32c: ChecksumCRC32C
x-amz-checksum-crc64nvme: ChecksumCRC64NVME
x-amz-checksum-sha1: ChecksumSHA1
x-amz-checksum-sha256: ChecksumSHA256
x-amz-server-side-encryption-customer-algorithm: SSECustomerAlgorithm
x-amz-server-side-encryption-customer-key: SSECustomerKey
x-amz-server-side-encryption-customer-key-MD5: SSECustomerKeyMD5
x-amz-request-payer: RequestPayer
x-amz-expected-bucket-owner: ExpectedBucketOwner

Body
```

In our case this metadata field is also being used for other things, like tracking the open key entry for a given part in the open key table, and also storing the ETag for the part.
There is no x-amz-meta in the example, nor in the UploadPart doc linked. We should introduce regular fields for distinct items like ETag. Storing it in repeated KeyValue is like storing parts in repeated PartKeyInfo.
Addressed in the latest commit
> * `uploadId(UTF-8 bytes)` + `0x00`
>
> ```text
> Note: The null byte separator ensures that the "uploadId" is properly delimited from the "partNumber" in the byte encoding, allowing for correct lexicographical ordering.
> ```
The / separator partially fixes the problem, but the mixed encoding scheme of UTF-8 and big-endian long serialization still means we may have conflicts. The 2f byte could also occur in the long section of the key. If we just find the first 2f byte this would still work, but I think it would be better to use one uniform encoding scheme for the whole key.
I'm thinking we could still convert the long to a big endian byte array for fixed length, but UTF-8 encode it with the rest of the key. This is similar to the FSO keys, which still UTF-8 encode their parent object IDs.
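A small standalone sketch of why a fixed-length big-endian suffix keeps numeric order under bytewise comparison (names are hypothetical; `Arrays.compareUnsigned(byte[], byte[])` requires Java 9+):

```java
import java.util.Arrays;

public class FixedLenOrderDemo {
  // 4-byte big-endian suffix for the part number.
  static byte[] suffix(int n) {
    return new byte[] {
        (byte) (n >>> 24), (byte) (n >>> 16), (byte) (n >>> 8), (byte) n
    };
  }

  public static void main(String[] args) {
    // Unsigned bytewise comparison, like RocksDB's default comparator.
    // Numeric order is preserved because every suffix has the same length.
    System.out.println(Arrays.compareUnsigned(suffix(2), suffix(10)) < 0);   // true
    System.out.println(Arrays.compareUnsigned(suffix(100), suffix(47)) > 0); // true
  }
}
```

The fixed length is what makes this work: with equal-length suffixes, bytewise order and numeric order coincide, which variable-length decimal strings cannot guarantee.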
@ivandika3 @errose28 could you take another look at the current design doc?
What changes were proposed in this pull request?
HDDS-10611. Design document for MPU GC Optimization
Please describe your PR in detail:
This PR adds the design doc for optimizing the OM GC pressure caused by MPU file handling.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10611
How was this patch tested?
N/A