Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
71 changes: 70 additions & 1 deletion LogicalTypes.md
Original file line number Diff line number Diff line change
Expand Up @@ -635,7 +635,76 @@ The type has two type parameters:

The sort order used for `GEOGRAPHY` is undefined. When writing data, no min/max
statistics should be saved for this type and if such non-compliant statistics
are found during reading, they must be ignored.
are found during reading, they must be ignored.

### FILE

`FILE` annotates a group that represents a reference to an external file, along with
the minimum metadata required to read it. It is intended for use cases such as
storing file inventories, manifests, and unstructured data references (e.g., images
or audio files stored in object storage).

The annotated group must contain the following fields, identified by name. Field IDs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that using "identified by name" conflicts with the next sentence. I would remove both. You don't need to state how these are identified.

If you want to be more clear, then you could add requirements like not renaming the fields. You could also say that the fields can be reordered, which would also remove the ability to project by order (which would be weird anyway).

may also be used for projection:
Comment on lines +647 to +648

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at the reference implementation I find this wording a bit confusing. The group "must" contain these fields, except it really only "must" contain path, and "may" contain the other three. I thought the Required column in the following table only referred to whether the field repetition was 'required' or 'optional'.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be that the struct "must contain the following fields". That avoids the confusing meaning of "required".

Comment on lines +647 to +648

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The annotated group must contain the following fields, identified by name. Field IDs
may also be used for projection:
The annotated group must contain the following fields, identified by name (not by order).
Field IDs (if exist) may also be used for projection:


| Field | Type | Required |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to use repetition instead of required, and either required or optional in the table.

I would also state that the struct must contain exactly these fields, rather than saying they are all required since "required" has a special meaning here (the Parquet type's repetition).

|----------|--------|----------|
| `path` | STRING | Yes |
| `size` | INT64 | No |
| `offset` | INT64 | No |
| `etag` | STRING | No |

@sfc-gh-sgrafberger sfc-gh-sgrafberger Jun 12, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think we should include content_type as field here. I fully understand that we don't want to have all kinds of optional metadata like last_modified as part of the spec, as this optional metadata is only needed for a smaller subset of queries and differs between different engines. However, the content_type is required information for a large fraction of common AI/ML workloads, and there is a good reason why existing FILE datatypes in engines like BigQuery or Snowflake have it as part of the datatype.

Many existing AI and ML-related functions on FILEs only work for certain modalities. E.g., an AI_TRANSCRIBE function only makes sense for audio and video data. Many common ML preprocessing techniques/functions only work for certain modalities, e.g., downscaling the resolution of images, turning image colours into grey scale, cropping and rotating pictures, downsampling the fps of a video etc. Some AI models only support text data, while others also support modalities like images. Routing AI function calls to compatible models critically depends on modalities, conceptually. If this isn't part of the datatype, a large fraction of FILE usecases would have to repeatedly read the first n bytes of a file to try to infer the modality, resulting in a big, unnecessary performance overhead that could be easily avoided.

The main reason to add a FILE datatype to the spec is to better support new Gen AI workloads with unstructured data. I think not having a content_type field would critically impact the user experience for the intended main usecase of FILE.

Can we please reconsider adding content_type?

#### Fields

##### path

An opaque path string to the referenced file (e.g., `s3://bucket/file.jpg`). No special

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is opaque, should we remove No special encoding (e.g., URI encoding) is applied? It seems valid if users want to apply URI encoding.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is saying that implementations must not apply encoding on top of what the user provides. We should clarify the wording

encoding (e.g., URI encoding) is applied. This is the only required field.

##### size

The length of the content in bytes. Must be zero or a positive integer if provided.
A value of 0 indicates an empty file. If not provided, the length of the referenced

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to advise the reader to ignore negative value here?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should specify that readers error on invalid values, not silently ignore

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't require readers to fail, we can only say what it means when the length is -1. Failure is a responsibility of whatever uses this data, and it's even hard to require that component to fail because users don't want their query to fail for one bad row. They expect reasonable fallback behavior.

Also, does 0 indicate an empty file or does it indicate 0-length content? With offset, I think this indicates that the content length is 0, not that the underlying file is empty.

content is unknown and the entirety of the content can be read.

##### offset

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we specify that it must be non-negative?


A byte offset indicating the start of a content slice within the referenced file.
If not provided, readers must treat the value as 0.
If provided and non-zero, readers must seek to this offset and read `size` bytes to retrieve the referenced data.
If `offset` is provided, `size` must also be provided.
Comment on lines +674 to +675

@etseidl etseidl Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, looking at the implementation in Rust, I don't see this enforced. What if the group definition is

optional group my_file (FILE) {
  required binary path (STRING);
  optional int64 offset;
}

I think we need to specify this at both the field and data levels, i.e. if the group contains an 'offset' field, it must also contain a 'size' field; if the optional 'offset' field is populated, it must be a non-negative integer, and a value for 'size' must also present.

Edit: I think the MAP description in this file is a good example of what I'm wanting here:

  • The key field encodes the map's key type. This field must have repetition required and must always be present. It must always be the first field of the repeated key_value group.
  • The value field encodes the map's value type and repetition. This field can be required, optional, or omitted. It must always be the second field of the repeated key_value group if present. If not present, it can be represented as a map with all null values or as a set of keys.


##### etag

An eTag value provided by the storage system (e.g., from S3 or Azure Blob Storage).
Can be used to detect whether the referenced file has been updated. If the reference
points to a byte range within a file, the eTag applies to the entire file.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should writers do when there is no etag provided by the system, like in a local FS? What about systems that have checksums but don't call them etags?


#### Validation

* The `path` field is required and must be present. Readers must reject a `FILE`-annotated
group that does not contain `path`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this makes a file completely unreadable? 🤔

* If `offset` is present and non-zero, `size` must also be provided.
* Additional metadata about the file (e.g., content type, modification timestamp) should

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding this restriction? Do we want to enforce Parquet reader implementations to validate against this rule? I can foresee that users may use another group type (a.k.a struct) to wrap a FILE type and a VARIANT type for extra metadata.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just specify that extra fields must not be added to this struct. @wgtmac wdyt of just saying:

Additional metadata about the file (e.g. content type, modification timestamp) must not be stored inside this struct.

@sfc-gh-nsharma sfc-gh-nsharma Jun 12, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we thinking about various engines adding very common fields like content type, modification timestamp? If they are not part of the parquet file format, I suspect there will be lot of non-interoperable implementations with those fields. Do we expect this file type to evolve to support those in the future?

e.g. these two definitions won't be interoperable because they wrap the FILE struct differently. It would be easier/less error-prone if data type allowed extra fields or a variant to capture additional attributes so that engines don't need a wrapper on top of the file object.

  FileEngine1 (outer struct) {
  file_ref FILE
  content_type STRING
  last_modified TIMESTAMP
  }
  
  FileEngine2 (outer struct) {
  fileRef FILE
  mimeType STRING
  lastModified TIMESTAMP
  }

be stored adjacent to this struct by engines or table formats, not inside it.

Statistics may be collected for the individual fields of a `FILE`-annotated group
according to the sort order of each field's logical type.

This is an example `FILE`-annotated group in Parquet:

```
optional group my_file (FILE) {
required binary path (STRING);
optional int64 size;
optional int64 offset;
optional binary etag (STRING);
}
```

*Compatibility*

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't include this. Seems unnecessary to me.


`FILE` has no corresponding `ConvertedType`.

## Nested Types

Expand Down
16 changes: 16 additions & 0 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -468,6 +468,21 @@ struct GeographyType {
2: optional EdgeInterpolationAlgorithm algorithm;
}

/**
* File logical type annotation
*
* Annotates a group that represents a reference to an external file.
* The group must contain the following fields identified by name:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identified by name

Do we need to mention its case sensitivity?

* - path (STRING, required): an opaque string path to the file (e.g. s3://bucket/file.jpg)
* - size (INT64, optional): the length of the content in bytes; must be zero or positive
* - offset (INT64, optional): byte offset for range reads; if provided, size must also be provided
* - etag (STRING, optional): eTag from the storage system for staleness detection
*
* See LogicalTypes.md for details.
*/
struct FileType {
}

/**
* LogicalType annotations to replace ConvertedType.
*
Expand Down Expand Up @@ -501,6 +516,7 @@ union LogicalType {
16: VariantType VARIANT // no compatible ConvertedType
17: GeometryType GEOMETRY // no compatible ConvertedType
18: GeographyType GEOGRAPHY // no compatible ConvertedType
19: FileType FILE // no compatible ConvertedType
}

/**
Expand Down