Introduces a new LogicalType: FILE by brkyvz · Pull Request #585 · apache/parquet-format

brkyvz · 2026-06-09T22:04:45Z

Rationale for this change

Introduces a new type called File as a typed FileReference. The design document is here.

The motivation is as follows:

Unstructured data ingestion is getting extremely popular with the advances in Generative AI. 
Today, our only means of dealing with unstructured data is to store it as a binary blob inside Parquet, 
or point to files that exist in some object store with a string. These solutions fail to address these use 
cases, because of scalability, usability, and governance issues.

We would like to introduce a new logical type annotation in Parquet called “File” for storing a struct that 
contains a path reference to a file with additional metadata. This reference may be to a file that exists 
(or is expected to exist) in storage at a given path. We’d like to define the minimum required list of fields
that would allow a client to correctly read the referenced data. Any additional metadata can be optionally 
stored by engines and table formats as necessary adjacent to this type.

What changes are included in this PR?

Introduces the specification for FileType.

Do these changes have PoC implementations?

Yes:

emkornfield · 2026-06-10T17:14:02Z

+#### offset
+
+A byte offset for range reads. If not provided, readers must treat the value as 0.
+If provided and non-zero, readers must seek to this offset and read `size` bytes.


Suggested change

If provided and non-zero, readers must seek to this offset and read `size` bytes.

If provided and non-zero, readers must seek to this offset and read `size` bytes to retrieve the referenced data.
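As an illustration of the `offset` and `size` wording under discussion, here is a minimal Python sketch of how a reader might resolve the referenced bytes. The function name `read_referenced` and the choice to raise `ValueError` on inconsistent input are assumptions made for this example; the draft does not prescribe an API or an error-handling strategy.

```python
import io
from typing import BinaryIO, Optional


def read_referenced(stream: BinaryIO,
                    offset: Optional[int] = None,
                    size: Optional[int] = None) -> bytes:
    """Return the bytes a FILE value refers to within an already opened stream.

    Hypothetical helper: raising on bad input is an assumption of this sketch.
    """
    start = 0 if offset is None else offset  # a missing offset is treated as 0
    if start < 0:
        raise ValueError("offset must be non-negative")
    if size is not None and size < 0:
        raise ValueError("size must be zero or positive")
    if start != 0 and size is None:
        # The draft text says size must accompany a non-zero offset.
        raise ValueError("size is required when offset is non-zero")
    stream.seek(start)
    # A missing size means the length is unknown, so read to the end.
    return stream.read() if size is None else stream.read(size)


data = io.BytesIO(b"headerPAYLOADtrailer")
print(read_referenced(data, offset=6, size=7))  # b'PAYLOAD'
```

An in-memory `io.BytesIO` stands in for the object-store file here; a real reader would more likely issue a ranged request against the storage system.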

emkornfield · 2026-06-10T17:14:26Z

+#### size
+
+The length of the content in bytes. Must be zero or a positive integer if provided.
+A value of 0 indicates an empty file.


Maybe be explicit about the semantics when `size` is not provided.

emkornfield

Looks reasonable to me; a few minor comments for clarification.

danielcweeks · 2026-06-10T17:40:18Z

+
+#### offset
+
+A byte offset for range reads. If not provided, readers must treat the value as 0.


This isn't actually "for range reads". It's to indicate that the referenced content is a slice of the referenced file. The term "range read" implies that it's still the full file but the act of reading it just uses a particular range. I find the terminology a little strange.

danielcweeks · 2026-06-10T17:42:24Z

+Can be used to detect whether the referenced file has been updated. If the reference
+points to a byte range within a file, the eTag applies to the entire file.
+
+Validation rules for readers and writers:


Do we need a separate heading here? It seems this should all be under `#### Validation`, not under etag.

etseidl

Just a few questions I have after reviewing the Rust implementation.

etseidl · 2026-06-10T21:43:01Z

+The annotated group must contain the following fields, identified by name. Field IDs
+may also be used for projection:


After looking at the reference implementation I find this wording a bit confusing. The group "must" contain these fields, except it really only "must" contain path, and "may" contain the other three. I thought the Required column in the following table only referred to whether the field repetition was 'required' or 'optional'.

I think it should be that the struct "must contain the following fields". That avoids the confusing meaning of "required".

etseidl · 2026-06-10T21:51:00Z

+If provided and non-zero, readers must seek to this offset and read `size` bytes to retrieve the referenced data.
+If `offset` is provided, `size` must also be provided.


Again, looking at the implementation in Rust, I don't see this enforced. What if the group definition is

optional group my_file (FILE) {
  required binary path (STRING);
  optional int64 offset;
}

I think we need to specify this at both the field and data levels, i.e. if the group contains an 'offset' field, it must also contain a 'size' field; if the optional 'offset' field is populated, it must be a non-negative integer, and a value for 'size' must also be present.

Edit: I think the MAP description in this file is a good example of what I'm wanting here:

The key field encodes the map's key type. This field must have repetition required and must always be present. It must always be the first field of the repeated key_value group.

The value field encodes the map's value type and repetition. This field can be required, optional, or omitted. It must always be the second field of the repeated key_value group if present. If not present, it can be represented as a map with all null values or as a set of keys.
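The two levels of checking requested in this comment (schema level and data level) can be sketched in Python as follows. The function names, the dictionary representation of a group, and the error strings are all hypothetical; this is not taken from the Rust or any other implementation.

```python
from typing import Any, Dict, List, Mapping


def check_group_fields(fields: Mapping[str, str]) -> List[str]:
    """Schema-level check. `fields` maps field name to its repetition."""
    problems = []
    if "path" not in fields:
        problems.append("group must contain a 'path' field")
    if "offset" in fields and "size" not in fields:
        problems.append("group has 'offset' but no 'size' field")
    return problems


def check_value(row: Mapping[str, Any]) -> List[str]:
    """Data-level check for one FILE value, given as a name-to-value mapping."""
    problems = []
    offset = row.get("offset")
    size = row.get("size")
    if offset is not None:
        if offset < 0:
            problems.append("'offset' must be non-negative")
        if size is None:
            problems.append("'offset' is set but 'size' is null")
    return problems


# The group definition from the comment: 'offset' without 'size'.
schema: Dict[str, str] = {"path": "required", "offset": "optional"}
print(check_group_fields(schema))
```

Returning a list of problems rather than raising leaves the decision about failing, warning, or falling back to the caller.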

wgtmac · 2026-06-11T03:22:17Z

+ * File logical type annotation
+ *
+ * Annotates a group that represents a reference to an external file.
+ * The group must contain the following fields identified by name:


identified by name

Do we need to mention its case sensitivity?

wgtmac · 2026-06-11T03:28:27Z

+The annotated group must contain the following fields, identified by name. Field IDs
+may also be used for projection:


Suggested change

The annotated group must contain the following fields, identified by name. Field IDs

may also be used for projection:

The annotated group must contain the following fields, identified by name (not by order).

Field IDs (if exist) may also be used for projection:

wgtmac · 2026-06-11T03:30:18Z

+
+##### path
+
+An opaque path string to the referenced file (e.g., `s3://bucket/file.jpg`). No special


If this is opaque, should we remove "No special encoding (e.g., URI encoding) is applied"? It seems valid if users want to apply URI encoding.

I think this is saying that implementations must not apply encoding on top of what the user provides. We should clarify the wording

wgtmac · 2026-06-11T03:31:05Z

+##### size
+
+The length of the content in bytes. Must be zero or a positive integer if provided.
+A value of 0 indicates an empty file. If not provided, the length of the referenced


Do we want to advise the reader to ignore negative values here?

IMO we should specify that readers error on invalid values, not silently ignore

We can't require readers to fail, we can only say what it means when the length is -1. Failure is a responsibility of whatever uses this data, and it's even hard to require that component to fail because users don't want their query to fail for one bad row. They expect reasonable fallback behavior.

Also, does 0 indicate an empty file or does it indicate 0-length content? With offset, I think this indicates that the content length is 0, not that the underlying file is empty.
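One way to reconcile the positions in this thread is for the reader to report what a `size` value means and leave the reaction to the consumer. The Python sketch below assumes that approach; `interpret_size` and its three labels are hypothetical names, not part of the proposal.

```python
from typing import Optional, Tuple


def interpret_size(size: Optional[int]) -> Tuple[str, Optional[int]]:
    """Classify a `size` value without deciding whether the query should fail."""
    if size is None:
        # Length unknown: the entirety of the content can be read.
        return ("unknown", None)
    if size < 0:
        # Invalid: the consumer chooses to skip the row, return null, or error.
        return ("invalid", None)
    # Zero is read here as zero-length content, not as an empty underlying file.
    return ("known", size)


for value in (None, -1, 0, 1024):
    print(value, interpret_size(value))
```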

wgtmac · 2026-06-11T03:36:40Z

+* The `path` field is required and must be present. Readers must reject a `FILE`-annotated
+  group that does not contain `path`.
+* If `offset` is present and non-zero, `size` must also be provided.
+* Additional metadata about the file (e.g., content type, modification timestamp) should


Why add this restriction? Do we want to require Parquet reader implementations to validate against this rule? I can foresee that users may use another group type (a.k.a. struct) to wrap a FILE type and a VARIANT type for extra metadata.

I think we should just specify that extra fields must not be added to this struct. @wgtmac wdyt of just saying:

Additional metadata about the file (e.g. content type, modification timestamp) must not be stored inside this struct.

How are we thinking about various engines adding very common fields like content type and modification timestamp? If they are not part of the Parquet file format, I suspect there will be a lot of non-interoperable implementations with those fields. Do we expect this file type to evolve to support those in the future?

e.g. these two definitions won't be interoperable because they wrap the FILE struct differently. It would be easier/less error-prone if data type allowed extra fields or a variant to capture additional attributes so that engines don't need a wrapper on top of the file object.

FileEngine1 (outer struct) {
  file_ref FILE
  content_type STRING
  last_modified TIMESTAMP
}

FileEngine2 (outer struct) {
  fileRef FILE
  mimeType STRING
  lastModified TIMESTAMP
}

divjotarora · 2026-06-11T06:53:42Z

+A value of 0 indicates an empty file. If not provided, the length of the referenced
+content is unknown and the entirety of the content can be read.
+
+##### offset


Should we specify that it must be non-negative?

rdblue · 2026-06-11T20:47:34Z

+The annotated group must contain the following fields, identified by name. Field IDs
+may also be used for projection:
+
+| Field    | Type   | Required |


I think it would be better to use repetition instead of required, and either required or optional in the table.

I would also state that the struct must contain exactly these fields, rather than saying they are all required since "required" has a special meaning here (the Parquet type's repetition).

rdblue · 2026-06-11T20:51:10Z

+storing file inventories, manifests, and unstructured data references (e.g., images
+or audio files stored in object storage).
+
+The annotated group must contain the following fields, identified by name. Field IDs


I think that using "identified by name" conflicts with the next sentence. I would remove both. You don't need to state how these are identified.

If you want to be more clear, then you could add requirements like not renaming the fields. You could also say that the fields can be reordered, which would also remove the ability to project by order (which would be weird anyway).

rdblue · 2026-06-11T20:56:04Z

+
+An eTag value provided by the storage system (e.g., from S3 or Azure Blob Storage).
+Can be used to detect whether the referenced file has been updated. If the reference
+points to a byte range within a file, the eTag applies to the entire file.


What should writers do when there is no etag provided by the system, like in a local FS? What about systems that have checksums but don't call them etags?
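The question matters because change detection only works when both sides have a tag to compare. A minimal Python sketch, assuming the stored `etag` is simply null when the storage system provides none (`has_changed` is a hypothetical helper, and the null convention is an assumption the draft does not confirm):

```python
from typing import Optional


def has_changed(stored_etag: Optional[str],
                current_etag: Optional[str]) -> Optional[bool]:
    """Compare the eTag stored in the FILE value with the one reported now.

    Returns None when either side is missing, since nothing can be concluded.
    """
    if stored_etag is None or current_etag is None:
        return None
    return stored_etag != current_etag


print(has_changed('"abc123"', '"abc123"'))  # False
print(has_changed('"abc123"', '"def456"'))  # True
print(has_changed(None, '"def456"'))        # None
```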

rdblue · 2026-06-11T20:56:34Z

+#### Validation
+
+* The `path` field is required and must be present. Readers must reject a `FILE`-annotated
+  group that does not contain `path`.


So this makes a file completely unreadable? 🤔

rdblue · 2026-06-11T20:57:30Z

+}
+```
+
+*Compatibility*


I wouldn't include this. Seems unnecessary to me.

sfc-gh-sgrafberger · 2026-06-12T09:49:37Z

+| `size`   | INT64  | No       |
+| `offset` | INT64  | No       |
+| `etag`   | STRING | No       |
+


I really think we should include content_type as a field here. I fully understand that we don't want to have all kinds of optional metadata like last_modified as part of the spec, as this optional metadata is only needed for a smaller subset of queries and differs between different engines. However, the content_type is required information for a large fraction of common AI/ML workloads, and there is a good reason why existing FILE datatypes in engines like BigQuery or Snowflake have it as part of the datatype.

Many existing AI and ML-related functions on FILEs only work for certain modalities. E.g., an AI_TRANSCRIBE function only makes sense for audio and video data. Many common ML preprocessing techniques/functions only work for certain modalities, e.g., downscaling the resolution of images, turning image colours into greyscale, cropping and rotating pictures, downsampling the fps of a video, etc. Some AI models only support text data, while others also support modalities like images. Routing AI function calls to compatible models critically depends on modalities, conceptually. If this isn't part of the datatype, a large fraction of FILE use cases would have to repeatedly read the first n bytes of a file to try to infer the modality, resulting in a big, unnecessary performance overhead that could be easily avoided.

The main reason to add a FILE datatype to the spec is to better support new Gen AI workloads with unstructured data. I think not having a content_type field would critically impact the user experience for the intended main use case of FILE.

Can we please reconsider adding content_type?
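To make the overhead argument concrete, the Python sketch below contrasts the two paths: using a stored content type directly, and falling back to sniffing the leading bytes. The `content_type` value is the field proposed in this comment, not part of the current draft, and `sniff_content_type`, `resolve_modality`, and the short signature table are hypothetical illustrations.

```python
from typing import Callable, Optional

# A few well-known file signatures; a real sniffer would need many more.
_MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff_content_type(head: bytes) -> Optional[str]:
    """Guess a media type from the first bytes of a file."""
    for magic, content_type in _MAGIC:
        if head.startswith(magic):
            return content_type
    return None


def resolve_modality(stored_content_type: Optional[str],
                     read_head: Callable[[], bytes]) -> Optional[str]:
    """Return the top-level media type, reading the file only if necessary."""
    content_type = stored_content_type
    if content_type is None:
        content_type = sniff_content_type(read_head())  # extra I/O per file
    return None if content_type is None else content_type.split("/", 1)[0]


print(resolve_modality("audio/wav", lambda: b""))             # audio
print(resolve_modality(None, lambda: b"\xff\xd8\xff\xe0..."))  # image
```

With a stored content type, `read_head` is never called, which is the saving the comment describes.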

Add FILE type definitions

7d966f0

emkornfield reviewed Jun 10, 2026

emkornfield approved these changes Jun 10, 2026

danielcweeks reviewed Jun 10, 2026

Address comments

b6ae0de

RussellSpitzer approved these changes Jun 10, 2026

danielcweeks approved these changes Jun 10, 2026

etseidl reviewed Jun 10, 2026

wgtmac reviewed Jun 11, 2026

divjotarora reviewed Jun 11, 2026

rdblue reviewed Jun 11, 2026


sfc-gh-sgrafberger reviewed Jun 12, 2026

10 participants
