Introduces a new LogicalType: FILE #585

Open
brkyvz wants to merge 2 commits into apache:master from brkyvz:fileType

@brkyvz brkyvz commented Jun 9, 2026

Rationale for this change

Introduces a new type called File as a typed FileReference. The design document is here.

The motivation is as follows:

Unstructured data ingestion is becoming extremely popular with the advances in Generative AI.
Today, our only means of dealing with unstructured data are to store it as a binary blob inside Parquet,
or to point to files that exist in some object store with a string. Neither approach serves these use
cases well, because of scalability, usability, and governance issues.

We would like to introduce a new logical type annotation in Parquet called “File” for storing a struct that
contains a path reference to a file with additional metadata. This reference may be to a file that exists
(or is expected to exist) in storage at a given path. We’d like to define the minimum required list of fields
that would allow a client to correctly read the referenced data. Any additional metadata can optionally be
stored by engines and table formats, as necessary, adjacent to this type.

What changes are included in this PR?

Introduces the specification for FileType.

Do these changes have PoC implementations?

Yes:

Comment thread LogicalTypes.md Outdated
#### offset

A byte offset for range reads. If not provided, readers must treat the value as 0.
If provided and non-zero, readers must seek to this offset and read `size` bytes.

Contributor

Suggested change
If provided and non-zero, readers must seek to this offset and read `size` bytes.
If provided and non-zero, readers must seek to this offset and read `size` bytes to retrieve the referenced data.

Comment thread LogicalTypes.md Outdated
#### size

The length of the content in bytes. Must be zero or a positive integer if provided.
A value of 0 indicates an empty file.

Contributor

Maybe be explicit about the semantics if not provided.

@emkornfield emkornfield left a comment

Contributor

Looks reasonable to me; a few minor comments for clarification.

Comment thread LogicalTypes.md Outdated

#### offset

A byte offset for range reads. If not provided, readers must treat the value as 0.

This isn't actually "for range reads". It's to indicate that the referenced content is a slice of the referenced file. The term "range read" implies that it's still the full file but the act of reading it just uses a particular range. I find the terminology a little strange.

Comment thread LogicalTypes.md Outdated
Can be used to detect whether the referenced file has been updated. If the reference
points to a byte range within a file, the eTag applies to the entire file.

Validation rules for readers and writers:

Do we need a separate heading here? Seems like this should all be under `#### Validation`, not etag.

@etseidl etseidl left a comment

Contributor

Just a few questions I have after reviewing the Rust implementation.

Comment thread LogicalTypes.md
Comment on lines +647 to +648
The annotated group must contain the following fields, identified by name. Field IDs
may also be used for projection:

Contributor

After looking at the reference implementation I find this wording a bit confusing. The group "must" contain these fields, except it really only "must" contain path, and "may" contain the other three. I thought the Required column in the following table only referred to whether the field repetition was 'required' or 'optional'.

Contributor

I think it should be that the struct "must contain the following fields". That avoids the confusing meaning of "required".

Comment thread LogicalTypes.md
Comment on lines +674 to +675
If provided and non-zero, readers must seek to this offset and read `size` bytes to retrieve the referenced data.
If `offset` is provided, `size` must also be provided.

@etseidl etseidl Jun 10, 2026

Contributor

Again, looking at the implementation in Rust, I don't see this enforced. What if the group definition is

optional group my_file (FILE) {
  required binary path (STRING);
  optional int64 offset;
}

I think we need to specify this at both the field and data levels, i.e. if the group contains an 'offset' field, it must also contain a 'size' field; if the optional 'offset' field is populated, it must be a non-negative integer, and a value for 'size' must also be present.

Edit: I think the MAP description in this file is a good example of what I'm wanting here:

  • The key field encodes the map's key type. This field must have repetition required and must always be present. It must always be the first field of the repeated key_value group.
  • The value field encodes the map's value type and repetition. This field can be required, optional, or omitted. It must always be the second field of the repeated key_value group if present. If not present, it can be represented as a map with all null values or as a set of keys.
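
As an illustration of the two levels described in this comment, here is a minimal Python sketch. It is not taken from the Rust or any other implementation; the function names `check_group_schema` and `check_value` are hypothetical, and the rules encoded are only the ones proposed in this comment, not final spec text.

```python
from typing import Any, List, Mapping, Sequence


def check_group_schema(field_names: Sequence[str]) -> List[str]:
    """Schema-level checks on the fields declared in a FILE-annotated group."""
    problems = []
    names = set(field_names)
    if "path" not in names:
        problems.append("group must declare 'path'")
    # Proposed rule: an 'offset' field requires a 'size' field in the group.
    if "offset" in names and "size" not in names:
        problems.append("group declares 'offset' without 'size'")
    return problems


def check_value(value: Mapping[str, Any]) -> List[str]:
    """Data-level checks on one non-null FILE value (a missing key means null)."""
    problems = []
    offset = value.get("offset")
    size = value.get("size")
    if value.get("path") is None:
        problems.append("'path' must be populated")
    if offset is not None:
        # Proposed rule: a populated offset is non-negative and comes with a size.
        if offset < 0:
            problems.append("'offset' must be non-negative")
        if size is None:
            problems.append("'offset' is populated but 'size' is not")
    return problems
```

Under these assumed rules, the example group above (with `path` and `offset` only) is flagged at the schema level, before any row is read.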

* File logical type annotation
*
* Annotates a group that represents a reference to an external file.
* The group must contain the following fields identified by name:

Member

identified by name

Do we need to mention its case sensitivity?

Comment thread LogicalTypes.md
Comment on lines +647 to +648
The annotated group must contain the following fields, identified by name. Field IDs
may also be used for projection:

Member

Suggested change
The annotated group must contain the following fields, identified by name. Field IDs
may also be used for projection:
The annotated group must contain the following fields, identified by name (not by order).
Field IDs (if they exist) may also be used for projection:

Comment thread LogicalTypes.md

##### path

An opaque path string to the referenced file (e.g., `s3://bucket/file.jpg`). No special

Member

If this is opaque, should we remove "No special encoding (e.g., URI encoding) is applied"? It seems valid if users want to apply URI encoding.

I think this is saying that implementations must not apply encoding on top of what the user provides. We should clarify the wording.

Comment thread LogicalTypes.md
##### size

The length of the content in bytes. Must be zero or a positive integer if provided.
A value of 0 indicates an empty file. If not provided, the length of the referenced

Member

Do we want to advise the reader to ignore negative values here?

IMO we should specify that readers error on invalid values, not silently ignore

Contributor

We can't require readers to fail, we can only say what it means when the length is -1. Failure is a responsibility of whatever uses this data, and it's even hard to require that component to fail because users don't want their query to fail for one bad row. They expect reasonable fallback behavior.

Also, does 0 indicate an empty file or does it indicate 0-length content? With offset, I think this indicates that the content length is 0, not that the underlying file is empty.
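
To make the question concrete, here is a minimal Python sketch of how a consumer might turn `offset` and `size` into a byte range under one reading of the draft: a missing `offset` means 0, a missing `size` means "read to the end", and a `size` of 0 means zero-length content rather than an empty underlying file. The function name `resolve_byte_range` is hypothetical, and returning `None` for invalid values (so the caller chooses between erroring and falling back) is an assumption for illustration, not something the proposal specifies.

```python
from typing import Optional, Tuple


def resolve_byte_range(
    offset: Optional[int], size: Optional[int]
) -> Optional[Tuple[int, Optional[int]]]:
    """Return (start, length) for the referenced content, or None if invalid.

    A length of None means the size is unknown and the content runs to the
    end of the file.
    """
    start = 0 if offset is None else offset
    if start < 0 or (size is not None and size < 0):
        # Invalid values: the caller decides whether to fail or fall back.
        return None
    if start != 0 and size is None:
        # Draft validation rule: a non-zero offset requires size.
        return None
    return (start, size)
```

With this reading, `resolve_byte_range(128, 0)` describes zero bytes of content starting at byte 128 of a file that may itself be non-empty.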

Comment thread LogicalTypes.md
* The `path` field is required and must be present. Readers must reject a `FILE`-annotated
group that does not contain `path`.
* If `offset` is present and non-zero, `size` must also be provided.
* Additional metadata about the file (e.g., content type, modification timestamp) should

Member

Why add this restriction? Do we want to require Parquet reader implementations to validate against this rule? I can foresee that users may use another group type (a.k.a. struct) to wrap a FILE type and a VARIANT type for extra metadata.

I think we should just specify that extra fields must not be added to this struct. @wgtmac wdyt of just saying:

Additional metadata about the file (e.g. content type, modification timestamp) must not be stored inside this struct.

@sfc-gh-nsharma sfc-gh-nsharma Jun 12, 2026

How are we thinking about various engines adding very common fields like content type and modification timestamp? If they are not part of the Parquet file format, I suspect there will be a lot of non-interoperable implementations with those fields. Do we expect this file type to evolve to support those in the future?

For example, these two definitions won't be interoperable because they wrap the FILE struct differently. It would be easier and less error-prone if the data type allowed extra fields or a variant to capture additional attributes, so that engines don't need a wrapper on top of the file object.

  FileEngine1 (outer struct) {
    file_ref FILE
    content_type STRING
    last_modified TIMESTAMP
  }

  FileEngine2 (outer struct) {
    fileRef FILE
    mimeType STRING
    lastModified TIMESTAMP
  }

Comment thread LogicalTypes.md
A value of 0 indicates an empty file. If not provided, the length of the referenced
content is unknown and the entirety of the content can be read.

##### offset

Should we specify that it must be non-negative?

Comment thread LogicalTypes.md
The annotated group must contain the following fields, identified by name. Field IDs
may also be used for projection:

| Field | Type | Required |

Contributor

I think it would be better to use repetition instead of required, and either required or optional in the table.

I would also state that the struct must contain exactly these fields, rather than saying they are all required since "required" has a special meaning here (the Parquet type's repetition).

Comment thread LogicalTypes.md
storing file inventories, manifests, and unstructured data references (e.g., images
or audio files stored in object storage).

The annotated group must contain the following fields, identified by name. Field IDs

Contributor

I think that using "identified by name" conflicts with the next sentence. I would remove both. You don't need to state how these are identified.

If you want to be more clear, then you could add requirements like not renaming the fields. You could also say that the fields can be reordered, which would also remove the ability to project by order (which would be weird anyway).

Comment thread LogicalTypes.md

An eTag value provided by the storage system (e.g., from S3 or Azure Blob Storage).
Can be used to detect whether the referenced file has been updated. If the reference
points to a byte range within a file, the eTag applies to the entire file.

Contributor

What should writers do when there is no etag provided by the system, like in a local FS? What about systems that have checksums but don't call them etags?

Comment thread LogicalTypes.md
#### Validation

* The `path` field is required and must be present. Readers must reject a `FILE`-annotated
group that does not contain `path`.

Contributor

So this makes a file completely unreadable? 🤔

Comment thread LogicalTypes.md
}
```

*Compatibility*

Contributor

I wouldn't include this. Seems unnecessary to me.

Comment thread LogicalTypes.md
| `size` | INT64 | No |
| `offset` | INT64 | No |
| `etag` | STRING | No |

@sfc-gh-sgrafberger sfc-gh-sgrafberger Jun 12, 2026

I really think we should include content_type as a field here. I fully understand that we don't want to have all kinds of optional metadata like last_modified as part of the spec, as this optional metadata is only needed for a smaller subset of queries and differs between engines. However, the content_type is required information for a large fraction of common AI/ML workloads, and there is a good reason why existing FILE datatypes in engines like BigQuery or Snowflake have it as part of the datatype.

Many existing AI and ML-related functions on FILEs only work for certain modalities. E.g., an AI_TRANSCRIBE function only makes sense for audio and video data. Many common ML preprocessing techniques/functions only work for certain modalities, e.g., downscaling the resolution of images, converting images to greyscale, cropping and rotating pictures, downsampling the fps of a video, etc. Some AI models only support text data, while others also support modalities like images. Routing AI function calls to compatible models critically depends on modalities, conceptually. If this isn't part of the datatype, a large fraction of FILE use cases would have to repeatedly read the first n bytes of a file to try to infer the modality, resulting in a big, unnecessary performance overhead that could be easily avoided.
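
For readers unfamiliar with what "reading the first n bytes to infer the modality" involves, a rough Python sketch follows. It checks a few well-known file signatures (JPEG, PNG, PDF, and RIFF/WAVE); the function name `sniff_content_type` and the small signature table are assumptions for illustration, and real engines would need a far larger table. The point is that each call requires fetching the head of the file from storage, which is the per-row overhead described above.

```python
from typing import Optional

# A few well-known leading byte signatures; illustrative, not exhaustive.
_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
]


def sniff_content_type(head: bytes) -> Optional[str]:
    """Guess a media type from the first bytes of a file, or None if unknown."""
    for prefix, media_type in _SIGNATURES:
        if head.startswith(prefix):
            return media_type
    # WAV is a RIFF container whose form type at bytes 8..12 is "WAVE".
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "audio/wav"
    return None
```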

The main reason to add a FILE datatype to the spec is to better support new Gen AI workloads with unstructured data. I think not having a content_type field would critically impact the user experience for the intended main use case of FILE.

Can we please reconsider adding content_type?

10 participants