-
Notifications
You must be signed in to change notification settings - Fork 3k
API, Core: Introduce foundational types for V4 manifest support #15049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
These types follow the https://s.apache.org/iceberg-single-file-commit and will be used by subsequent PRs for manifest reading/writing. For now, we are adding these as package-private interfaces in core, and eventually we will move them into api. Changes API: - Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types Core: - TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4) - TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id - ContentInfo - Metadata for externally stored content (deletion vectors) - ManifestStats - Manifest-level statistics (file/row counts, min sequence number)
| DATA(0), | ||
| POSITION_DELETES(1), | ||
| EQUALITY_DELETES(2); | ||
| EQUALITY_DELETES(2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to make this API change so that TrackingInfo can return the contentType. This is backward compatible since this is a new enum value. If we don't want to make any API changes, I will need to change TrackingInfo.contentType() to return an int instead.
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a cursory review, still need to review in more depth when I get a chance but thank you for publishing this @anoopj !
| * | ||
| * <p>Used for applying manifest deletion vectors. | ||
| */ | ||
| Long pos(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a primitive long? as There would always be some non-null ordinal position
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Preference: Would prefer to use the full name position()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to be consistent with the existing ContentFile.pos() API that returns a Long.
| * | ||
| * <p>Can only be set if content_type is DATA. | ||
| */ | ||
| Integer sortOrderId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
primitive int? And then implementations throw if it's not a data file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to keep the nullable Integer API because:
- The field is optional in the V4 schema
ContentFile.sortOrderId()already returns Integer with a default of null
| * Returns the tracking information for this entry. | ||
| * | ||
| * <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if | ||
| * tracking info is inherited. | ||
| */ | ||
| TrackingInfo trackingInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I missed the discussion on the original PR, so looks like TrackingInfo has a manifestLocation and ordinal as well. I think the intent there is so that TrackingInfo has enough information self contained on it's structure so it can stand alone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah just saw the top level comment in TrackingInfo, cool!
| * <p>Must be defined if location is defined. | ||
| */ | ||
| Long fileSizeInBytes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a miss on the proposal but this would be required always no? And as a result primitive long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I changed the proposal (left it as a suggestion). I think we had it optional when we had inline manifest DVs as a separate record, which is not the case anymore.
These types will be used for the v4 adaptive metadata tree and will be used by subsequent PRs for manifest reading/writing.
For now, we are adding these as package-private interfaces in core, and eventually we will move them into api.
Prototype PR: #14533
Changes
API:
Core: