From 7d966f06629623ebc7c80f63572afbc492d04810 Mon Sep 17 00:00:00 2001
From: Ubuntu
Date: Mon, 8 Jun 2026 18:36:59 +0000
Subject: [PATCH 1/2] Add FILE type definitions

---
 LogicalTypes.md                | 67 +++++++++++++++++++++++++++++++++-
 src/main/thrift/parquet.thrift | 16 ++++++++
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/LogicalTypes.md b/LogicalTypes.md
index 690ae3f5..662bb169 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -635,7 +635,72 @@ The type has two type parameters:
 
 The sort order used for `GEOGRAPHY` is undefined. When writing data, no min/max
 statistics should be saved for this type and if such non-compliant statistics
-are found during reading, they must be ignored.
+are found during reading, they must be ignored.
+
+### FILE
+
+`FILE` annotates a group that represents a reference to an external file, along with
+the minimum metadata required to read it. It is intended for use cases such as
+storing file inventories, manifests, and unstructured data references (e.g., images
+or audio files stored in object storage).
+
+The annotated group must contain the following fields, identified by name. Field IDs
+may also be used for projection:
+
+| Field    | Type   | Required |
+|----------|--------|----------|
+| `path`   | STRING | Yes      |
+| `size`   | INT64  | No       |
+| `offset` | INT64  | No       |
+| `etag`   | STRING | No       |
+
+#### path
+
+An opaque path string to the referenced file (e.g., `s3://bucket/file.jpg`). No special
+encoding (e.g., URI encoding) is applied. This is the only required field.
+
+#### size
+
+The length of the content in bytes. Must be zero or a positive integer if provided.
+A value of 0 indicates an empty file.
+
+#### offset
+
+A byte offset for range reads. If not provided, readers must treat the value as 0.
+If provided and non-zero, readers must seek to this offset and read `size` bytes.
+If `offset` is provided, `size` must also be provided.
+
+#### etag
+
+An eTag value provided by the storage system (e.g., from S3 or Azure Blob Storage).
+Can be used to detect whether the referenced file has been updated. If the reference
+points to a byte range within a file, the eTag applies to the entire file.
+
+Validation rules for readers and writers:
+
+* The `path` field is required and must be present. Readers must reject a `FILE`-annotated
+  group that does not contain `path`.
+* If `offset` is present and non-zero, `size` must also be provided.
+* Additional metadata about the file (e.g., content type, modification timestamp) should
+  be stored adjacent to this struct by engines or table formats, not inside it.
+
+Statistics may be collected for the individual fields of a `FILE`-annotated group
+according to the sort order of each field's logical type.
+
+This is an example `FILE`-annotated group in Parquet:
+
+```
+optional group my_file (FILE) {
+  required binary path (STRING);
+  optional int64 size;
+  optional int64 offset;
+  optional binary etag (STRING);
+}
+```
+
+*Compatibility*
+
+`FILE` has no corresponding `ConvertedType`.
 
 ## Nested Types
 
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index fe259d61..b4f9a1e0 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -468,6 +468,21 @@ struct GeographyType {
   2: optional EdgeInterpolationAlgorithm algorithm;
 }
 
+/**
+ * File logical type annotation
+ *
+ * Annotates a group that represents a reference to an external file.
+ * The group must contain the following fields identified by name:
+ *   - path (STRING, required): an opaque string path to the file (e.g. s3://bucket/file.jpg)
+ *   - size (INT64, optional): the length of the content in bytes; must be zero or positive
+ *   - offset (INT64, optional): byte offset for range reads; if provided, size must also be provided
+ *   - etag (STRING, optional): eTag from the storage system for staleness detection
+ *
+ * See LogicalTypes.md for details.
+ */
+struct FileType {
+}
+
 /**
  * LogicalType annotations to replace ConvertedType.
  *
@@ -501,6 +516,7 @@ union LogicalType {
   16: VariantType VARIANT     // no compatible ConvertedType
   17: GeometryType GEOMETRY   // no compatible ConvertedType
   18: GeographyType GEOGRAPHY // no compatible ConvertedType
+  19: FileType FILE           // no compatible ConvertedType
 }
 
 /**

From b6ae0de760dd13918beef4aca60b7026c0585969 Mon Sep 17 00:00:00 2001
From: Ubuntu
Date: Wed, 10 Jun 2026 19:32:50 +0000
Subject: [PATCH 2/2] Address comments

---
 LogicalTypes.md | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/LogicalTypes.md b/LogicalTypes.md
index 662bb169..c44333c2 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -654,29 +654,33 @@ may also be used for projection:
 | `offset` | INT64  | No       |
 | `etag`   | STRING | No       |
 
-#### path
+#### Fields
+
+##### path
 
 An opaque path string to the referenced file (e.g., `s3://bucket/file.jpg`). No special
 encoding (e.g., URI encoding) is applied. This is the only required field.
 
-#### size
+##### size
 
 The length of the content in bytes. Must be zero or a positive integer if provided.
-A value of 0 indicates an empty file.
+A value of 0 indicates an empty file. If not provided, the length of the referenced
+content is unknown and the entirety of the content can be read.
 
-#### offset
+##### offset
 
-A byte offset for range reads. If not provided, readers must treat the value as 0.
-If provided and non-zero, readers must seek to this offset and read `size` bytes.
+A byte offset indicating the start of a content slice within the referenced file.
+If not provided, readers must treat the value as 0.
+If provided and non-zero, readers must seek to this offset and read `size` bytes to retrieve the referenced data.
 If `offset` is provided, `size` must also be provided.
 
-#### etag
+##### etag
 
 An eTag value provided by the storage system (e.g., from S3 or Azure Blob Storage).
 Can be used to detect whether the referenced file has been updated. If the reference
 points to a byte range within a file, the eTag applies to the entire file.
 
-#### Validation rules for readers and writers:
+#### Validation
 
 * The `path` field is required and must be present. Readers must reject a `FILE`-annotated
   group that does not contain `path`.
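
Review note: the field and validation rules in this series can be exercised with a small reader-side check. The following is a minimal Python sketch, not part of the patch and not an API of any Parquet library. It assumes a `FILE`-annotated group has already been decoded into a plain dict keyed by field name, and the function names `validate_file_ref` and `byte_range` are hypothetical. It follows the validation bullet ("present and non-zero") for the `offset`/`size` dependency; the `offset` field description states the stricter form, so a real implementation may need to pick one.

```python
from typing import Any, Mapping, Optional, Tuple


def validate_file_ref(ref: Mapping[str, Any]) -> None:
    """Raise ValueError if a decoded FILE group breaks the rules in the patch.

    `ref` is a hypothetical dict form of the group, e.g.
    {"path": "s3://bucket/file.jpg", "size": 10, "offset": 0, "etag": "abc"}.
    """
    # `path` is the only required field; readers must reject groups without it.
    if not isinstance(ref.get("path"), str):
        raise ValueError("FILE group must contain a string `path`")

    size = ref.get("size")
    offset = ref.get("offset")

    # `size` must be zero or a positive integer when provided.
    if size is not None and size < 0:
        raise ValueError("`size` must be zero or positive")

    # Validation bullet: a non-zero `offset` requires `size`.
    if offset and size is None:
        raise ValueError("`size` is required when `offset` is non-zero")


def byte_range(ref: Mapping[str, Any]) -> Tuple[int, Optional[int]]:
    """Return (start, length) for the referenced content.

    A missing `offset` is treated as 0. A length of None means the size is
    unknown and the entire content can be read.
    """
    validate_file_ref(ref)
    start = ref.get("offset") or 0
    return start, ref.get("size")
```

With these definitions, `byte_range({"path": "s3://bucket/file.jpg"})` yields a start of 0 and an unknown length, while a group with `offset` 128 and `size` 64 yields the slice starting at byte 128 with length 64.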