Skip to content

Conversation

@anoopj
Copy link
Contributor

@anoopj anoopj commented Jan 14, 2026

These types will be used for the v4 adaptive metadata tree and will be used by subsequent PRs for manifest reading/writing.

For now, we are adding these as package-private interfaces in core, and eventually we will move them into api.

Prototype PR: #14533

Changes

API:

  • Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types

Core:

  • TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4)
  • TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id
  • ContentInfo - Metadata for externally stored content (deletion vectors)
  • ManifestStats - Manifest-level statistics (file/row counts, min sequence number)

These types follow the https://s.apache.org/iceberg-single-file-commit
and will be used by subsequent PRs for manifest reading/writing.

For now, we are adding these as package-private interfaces in core, and
eventually we will move them into api.

Changes

API:
 - Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types

Core:
  - TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4)
  - TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id
  - ContentInfo - Metadata for externally stored content (deletion vectors)
  - ManifestStats - Manifest-level statistics (file/row counts, min sequence number)
DATA(0),
POSITION_DELETES(1),
EQUALITY_DELETES(2);
EQUALITY_DELETES(2),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make this API change so that TrackingInfo can return the contentType. This is backward compatible since this is a new enum value. If we don't want to make any API changes, I will need to change TrackingInfo.contentType() to return an int instead.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a cursory review, still need to review in more depth when I get a chance but thank you for publishing this @anoopj !

*
* <p>Used for applying manifest deletion vectors.
*/
Long pos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a primitive long? as There would always be some non-null ordinal position

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Preference: Would prefer to use the full name position()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to be consistent with the existing ContentFile.pos() API that returns a Long.

*
* <p>Can only be set if content_type is DATA.
*/
Integer sortOrderId();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

primitive int? And then implementations throw if it's not a data file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep the nullable Integer API because:

  • The field is optional in the V4 schema
  • ContentFile.sortOrderId() already returns Integer with a default of null

Comment on lines +130 to +135
* Returns the tracking information for this entry.
*
* <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if
* tracking info is inherited.
*/
TrackingInfo trackingInfo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I missed the discussion on the original PR, so looks like TrackingInfo has a manifestLocation and ordinal as well. I think the intent there is so that TrackingInfo has enough information self contained on it's structure so it can stand alone?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah just saw the top level comment in TrackingInfo, cool!

Comment on lines +173 to +175
* <p>Must be defined if location is defined.
*/
Long fileSizeInBytes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a miss on the proposal but this would be required always no? And as a result primitive long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I changed the proposal (left it as a suggestion). I think we had it optional when we had inline manifest DVs as a separate record, which is not the case anymore.

@nastra nastra changed the title Introduce foundational types for V4 manifest support API, Core: Introduce foundational types for V4 manifest support Jan 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants