HDDS-10611. Design document for MPU GC Optimization #9793
devabhishekpal wants to merge 12 commits into apache:master
Conversation
Thanks @devabhishekpal for the design on this long overdue issue. I am +1 on the overall direction. Left some comments.
Thanks for the exhaustive review and inputs @ivandika3. FYI, I have a sample/PoC patch created if anybody wants to check the changes.
errose28 left a comment
Thanks for the design @devabhishekpal @rakeshadr. Overall LGTM, just a few things we can clarify in the doc.
> * `uploadId(UTF-8 bytes)` + `0x00`
>
> ```protobuf
> message MultipartKeyInfo {
> ```
This should be the same as the current MultipartKeyInfo, but we should label the now unused fields as deprecated instead of removing them.
We are not removing the fields entirely, since that would reintroduce the backward-compatibility problem with older clients; hence we are not marking them as deprecated.
Maybe this is something we can optimize further once we know the old flow/inline storage is no longer in use - perhaps something for a release line after the current one?
In protobuf, deprecated is just a label that the field should not be used going forward, it does not change any functionality so it is safe to put on the fields in the metadata entry that we will not use in the new schema. See deprecated in the proto3 language guide.
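For illustration, a hedged sketch of what the label could look like. The field name and number below are hypothetical, not taken from the actual MultipartKeyInfo definition:

```protobuf
message MultipartKeyInfo {
  // ... fields still used by the new schema stay unchanged ...

  // Hypothetical example: kept on the wire for compatibility with older
  // clients, but flagged so new code does not write to it.
  repeated PartKeyInfo partKeyInfoList = 5 [deprecated = true];
}
```

As noted, `deprecated` is purely a documentation signal in protobuf (it may generate compiler/IDE warnings in some languages) and does not change serialization behavior.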
> * `uploadId(UTF-8 bytes)` + `0x00`
>
> ```text
> Note: The null byte separator ensures that the "uploadId" is properly delimited from the "partNumber" in the byte encoding, allowing for correct lexicographical ordering.
> ```
Are we guaranteed that nothing in the upload ID or part number will encode to this null byte, making it a reliable separator? In container schema v3 on the datanode we use | to avoid this conflict.
Yes, there isn't anything which would be encoded to the NULL byte.
The multipart upload ID is generated by `getMultipartUploadId()`:

```java
public static String getMultipartUploadId() {
  return UUIDv7.randomUUID() + "-" + UniqueId.next();
}
```

This is going to consist entirely of alphanumeric characters and hyphens.
The part number is going to be an incremental integer up to 10,000.
Hence nothing would encode to the null byte, and this separator is safe to use here.
> The part number is going to be an incremental integer till 10,000
This is big endian right? So the leftmost bytes in the serialized output would be null bytes until the number reaches a certain size. This could still work, we would just need to look for the first null byte after the upload ID, and therefore need to be guaranteed that the upload ID will not contain \0.
This still seems somewhat fragile, since the higher levels are dependent on the implicit serialization method that getMultipartUploadId and by extension UniqueId#next abstracts away.
Would it be better to use a fixed length string for the upload ID so we can seek to the end to know where the part starts? Or use a different delimiter character like / or | appended to the string part like we do in container and FSO tables?
I think some of the confusion here is also coming from the mix of string and int types in the key, which is not a practice we use anywhere else in the code to my knowledge. I think FSO should be the reference here, and in that case the entire key is made a string.
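To make the leading-null-byte point concrete, here is a minimal standalone sketch (class and method names are hypothetical, not from the Ozone codebase). Java's `ByteBuffer` writes ints big-endian by default:

```java
import java.nio.ByteBuffer;

public class BigEndianDemo {
  // Serialize a part number as a 4-byte big-endian array.
  public static byte[] encode(int partNumber) {
    return ByteBuffer.allocate(4).putInt(partNumber).array();
  }

  public static void main(String[] args) {
    byte[] b = encode(47);
    // 47 = 0x0000002f, so the first three serialized bytes are null
    System.out.printf("%02x %02x %02x %02x%n", b[0], b[1], b[2], b[3]); // 00 00 00 2f
  }
}
```

Any part number below 2^24 leaves at least one leading null byte in the suffix, which is why the delimiter search must run from the upload-ID side and the upload ID itself must be guaranteed free of `\0`.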
I agree; thinking about the implications of how the unique ID is generated, the dependency on the abstraction behind UniqueId#next might bring in behaviour changes in the future which could potentially break this.
However, moving to a string codec means we again bring in the overhead of padding the number using string utils.
Maybe I can change the separator to a / instead of the null byte, as that is the logical delimiter for a path.
When the key gets byte-encoded, the part would still have leading null bytes until the number reaches the size range, but we could look for the / byte, which is guaranteed not to be used anywhere else.
@errose28 do you mean:
- `partNumber` -> big-endian bytes -> hex/base64 string
- key = `uploadId` + "/" + `encodedPartBytes` from the above transition

And then we UTF-8 encode it?
The other way to ensure `2f` doesn't occur again is to directly encode the long as a string and then convert to bytes:
`uploadID + "/" + String.valueOf(partNumber)` -> byte encode
But this again needs the overhead of zero-padding partNumber on the left, plus the conversion overhead.
@errose28 the method I am thinking of, as far as I understand from your suggestion, is:
- Start with, say:
  - `uploadId = "abc123-uuid-456"`
  - `partNumber = 47`
- Convert 47 to int32 big-endian bytes: `00 00 00 2f`
- Convert those bytes to hex text: `0000002f`
- Build the full key string: `abc123-uuid-456/0000002f`

And the above is what will be stored as the RocksDB key.
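A minimal sketch of that transformation (class and method names are hypothetical):

```java
public class HexKeyDemo {
  // Build the key as uploadId + "/" + zero-padded lowercase hex of the part number.
  static String toKey(String uploadId, int partNumber) {
    return uploadId + "/" + String.format("%08x", partNumber);
  }

  public static void main(String[] args) {
    System.out.println(toKey("abc123-uuid-456", 47)); // abc123-uuid-456/0000002f
  }
}
```

Because the hex suffix is fixed-width, lexicographic order of the suffix matches numeric order of the part number, at the cost of the int -> hex -> string conversion.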
So the above approach adds the conversion overhead. One way I figured out is:
Keep the binary 4-byte suffix, and decode by checking the full-key layout first (`len - 5`), then the prefix layout (`len - 1`).
Say for the below sample input:
- uploadId = `<abcd-1234...>`
- partNumber = 47 (taking 47 because it has `2f` in its byte representation)

Encoding format:
`keyBytes = UTF8(uploadId) + '/' + int32_be(partNumber)`

For 47:
- int32 big-endian bytes = `00 00 00 2f`

So the final tail is:
- separator `/` = `2f`
- part bytes = `00 00 00 2f`
- total tail bytes = `2f 00 00 00 2f`

Readable-ish representation: `<uploadId>/\x00\x00\x00\x2f`

Now we run the following checks:
- Check full-key first: if the byte at `len - 5` is `/`, decode as a full key (used to identify a full row).
- Else, if the byte at `len - 1` is `/`, decode as a prefix (used for prefix scans when iterating).
- Else, invalid.

This also skips the extra hops of converting byte -> hex -> string.
The assumption here is that the key has sufficient length such that `len - 5` doesn't break.
I asked ChatGPT for edge cases on when this can break, and it says that in normal operation this should not break. Below is its explanation:
Why valid keys always have enough length
For full key (binary-tail design):
- format is uploadId + '/' + 4 bytes
- minimum valid size = 1 + 1 + 4 = 6 bytes (even uploadId length 1)
For prefix key:
- format is uploadId + '/'
- minimum valid size = 1 + 1 = 2 bytes
So any legit key generated by toPersistedFormat is long enough.
When it can be too short
Only if decoder sees invalid raw data, for example:
- empty byte array
- truncated/corrupted key bytes in DB/scan tool
- wrong table decoded with wrong codec
- manual/debug insertion of malformed key
So the length check is not for normal runtime correctness; it’s for robustness and clean failure on bad bytes.
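A rough standalone sketch of the encode path and the two length checks described above, assuming the binary 4-byte suffix layout (names are hypothetical, not the actual Ozone codec):

```java
import java.nio.charset.StandardCharsets;

public class MpuKeyCodec {
  private static final byte SEP = (byte) '/';

  enum KeyKind { FULL, PREFIX, INVALID }

  // keyBytes = UTF8(uploadId) + '/' + int32_be(partNumber)
  static byte[] fullKey(String uploadId, int partNumber) {
    byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[id.length + 5];
    System.arraycopy(id, 0, out, 0, id.length);
    out[id.length] = SEP;
    out[id.length + 1] = (byte) (partNumber >>> 24);
    out[id.length + 2] = (byte) (partNumber >>> 16);
    out[id.length + 3] = (byte) (partNumber >>> 8);
    out[id.length + 4] = (byte) partNumber;
    return out;
  }

  // Full key: separator sits exactly 5 bytes from the end (min size 6).
  static boolean isFullKey(byte[] key) {
    return key.length >= 6 && key[key.length - 5] == SEP;
  }

  // Prefix key (uploadId + '/'): separator is the last byte (min size 2).
  static boolean isPrefixKey(byte[] key) {
    return key.length >= 2 && key[key.length - 1] == SEP;
  }

  // Order matters: the full-key check must run before the prefix check,
  // because the last part byte can itself be 0x2f ('/'), e.g. partNumber 47.
  static KeyKind classify(byte[] key) {
    if (isFullKey(key)) {
      return KeyKind.FULL;
    }
    if (isPrefixKey(key)) {
      return KeyKind.PREFIX;
    }
    return KeyKind.INVALID;
  }
}
```

This relies on the upload ID being alphanumeric-plus-hyphens, so a `/` at `len - 5` can only be the separator.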
Encoding the whole key to UTF-8 (similar to FSO) or with FixedLengthStringCodec (ISO-8859-1) will still not be numerically sorted.
In FSO the keys/files are grouped by the parentID, hence the ordering of the files themselves do not matter.
For listing children of one parent, OM computes a parent prefix:
- getOzonePathKey(volumeId, bucketId, parentId, "")
- Then it scans entries under that prefix and merges file/dir iterators with a min-heap, final output is accumulated in a TreeMap (sorted).
Similarly, in the DN schema-v3 block-data tables the key format is effectively `containerPrefix + localBlockIdAsString`.
All entries are grouped by the containerID prefix, but within a container the block suffix is decimal string text ("1", "10", "2"), so sibling block IDs sort in lexicographic text order, not numeric order.
@errose28 in both the cases it seems we are not caring about the ordering in RocksDB - which we need to do here.
For FSO we are doing the sorting on demand instead of in RocksDB.
So maybe we have no workaround for this.
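To illustrate that ordering point with a standalone sketch (not Ozone code): RocksDB's default bytewise comparator sees UTF-8 decimal-string keys in lexicographic order, which Java's `String` sort reproduces:

```java
import java.util.Arrays;

public class LexOrderDemo {
  public static void main(String[] args) {
    String[] blockIds = {"1", "2", "10"};
    // Lexicographic sort, same relative order a bytewise comparator
    // produces over the UTF-8 encodings of these strings.
    Arrays.sort(blockIds);
    System.out.println(Arrays.toString(blockIds)); // [1, 10, 2]
  }
}
```

"10" lands between "1" and "2" because comparison stops at the first differing byte, which is exactly why decimal-string suffixes cannot give numeric iteration order.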
> ```protobuf
> required uint64 modificationTime = 8;
> required uint64 objectID = 9;
> required uint64 updateID = 10;
> repeated hadoop.hdds.KeyValue metadata = 11;
> ```
I think S3 only supports adding metadata to the whole key, not individual parts. Can we check the spec on this?
Hmm, going through the UploadPart API documentation it seems like the request is allowed to have some x-amz-meta* type metadata / headers.
Example:

```text
PUT /Key+?partNumber=PartNumber&uploadId=UploadId HTTP/1.1
Host: Bucket.s3.amazonaws.com
Content-Length: ContentLength
Content-MD5: ContentMD5
x-amz-sdk-checksum-algorithm: ChecksumAlgorithm
x-amz-checksum-crc32: ChecksumCRC32
x-amz-checksum-crc32c: ChecksumCRC32C
x-amz-checksum-crc64nvme: ChecksumCRC64NVME
x-amz-checksum-sha1: ChecksumSHA1
x-amz-checksum-sha256: ChecksumSHA256
x-amz-server-side-encryption-customer-algorithm: SSECustomerAlgorithm
x-amz-server-side-encryption-customer-key: SSECustomerKey
x-amz-server-side-encryption-customer-key-MD5: SSECustomerKeyMD5
x-amz-request-payer: RequestPayer
x-amz-expected-bucket-owner: ExpectedBucketOwner

Body
```

In our case this metadata field is also being used for other things, like tracking the open key entry for a given part in the open key table, and also storing the ETag for the part.
There is no x-amz-meta in the example, nor in the UploadPart doc linked. We should introduce regular fields for distinct items like ETag. Storing it in repeated KeyValue is like storing parts in repeated PartKeyInfo.
Addressed in the latest commit
> * `uploadId(UTF-8 bytes)` + `0x00`
>
> ```text
> Note: The null byte separator ensures that the "uploadId" is properly delimited from the "partNumber" in the byte encoding, allowing for correct lexicographical ordering.
> ```
The / separator partially fixes the problem, but the mixed encoding scheme of UTF-8 and big-endian long serialization still means we may have conflicts. The 2f byte could also occur in the long section of the key. If we just find the first 2f byte this would still work, but I think it would be better to use one uniform encoding scheme for the whole key.
I'm thinking we could still convert the long to a big endian byte array for fixed length, but UTF-8 encode it with the rest of the key. This is similar to the FSO keys, which still UTF-8 encode their parent object IDs.
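A small standalone sketch of why a fixed-length big-endian suffix keeps numeric order under bytewise comparison (names are hypothetical; `Arrays.compareUnsigned(byte[], byte[])` requires Java 9+):

```java
import java.util.Arrays;

public class FixedLenOrderDemo {
  // 4-byte big-endian suffix for the part number.
  static byte[] suffix(int n) {
    return new byte[] {
        (byte) (n >>> 24), (byte) (n >>> 16), (byte) (n >>> 8), (byte) n
    };
  }

  public static void main(String[] args) {
    // Unsigned bytewise comparison, like RocksDB's default comparator.
    // Numeric order is preserved because every suffix has the same length.
    System.out.println(Arrays.compareUnsigned(suffix(2), suffix(10)) < 0);   // true
    System.out.println(Arrays.compareUnsigned(suffix(100), suffix(47)) > 0); // true
  }
}
```

The fixed length is what makes this work: with equal-length suffixes, bytewise order and numeric order coincide, which variable-length decimal strings cannot guarantee.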
@ivandika3 @errose28 could you take another look at the current design doc?
What changes were proposed in this pull request?
HDDS-10611. Design document for MPU GC Optimization
Please describe your PR in detail:
This PR adds the design doc for optimizing the OM GC pressure caused by MPU file handling.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10611
How was this patch tested?
N/A