-
Notifications
You must be signed in to change notification settings - Fork 492
Introduces a new LogicalType: FILE #585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -635,7 +635,76 @@ The type has two type parameters: | |||||||||
|
|
||||||||||
| The sort order used for `GEOGRAPHY` is undefined. When writing data, no min/max | ||||||||||
| statistics should be saved for this type and if such non-compliant statistics | ||||||||||
| are found during reading, they must be ignored. | ||||||||||
| are found during reading, they must be ignored. | ||||||||||
|
|
||||||||||
| ### FILE | ||||||||||
|
|
||||||||||
| `FILE` annotates a group that represents a reference to an external file, along with | ||||||||||
| the minimum metadata required to read it. It is intended for use cases such as | ||||||||||
| storing file inventories, manifests, and unstructured data references (e.g., images | ||||||||||
| or audio files stored in object storage). | ||||||||||
|
|
||||||||||
| The annotated group must contain the following fields, identified by name. Field IDs | ||||||||||
| may also be used for projection: | ||||||||||
|
Comment on lines
+647
to
+648
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After looking at the reference implementation I find this wording a bit confusing. The group "must" contain these fields, except it really only "must" contain
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it should be that the struct "must contain the following fields". That avoids the confusing meaning of "required".
Comment on lines
+647
to
+648
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
|
|
||||||||||
| | Field | Type | Required | | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would be better to use I would also state that the struct must contain exactly these fields, rather than saying they are all required since "required" has a special meaning here (the Parquet type's repetition). |
||||||||||
| |----------|--------|----------| | ||||||||||
| | `path` | STRING | Yes | | ||||||||||
| | `size` | INT64 | No | | ||||||||||
| | `offset` | INT64 | No | | ||||||||||
| | `etag` | STRING | No | | ||||||||||
|
|
||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I really think we should include Many existing AI and ML-related functions on FILEs only work for certain modalities. E.g., an The main reason to add a FILE datatype to the spec is to better support new Gen AI workloads with unstructured data. I think not having a Can we please reconsider adding |
||||||||||
| #### Fields | ||||||||||
|
|
||||||||||
| ##### path | ||||||||||
|
|
||||||||||
| An opaque path string to the referenced file (e.g., `s3://bucket/file.jpg`). No special | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is opaque, should we remove There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is saying that implementations must not apply encoding on top of what the user provides. We should clarify the wording |
||||||||||
| encoding (e.g., URI encoding) is applied. This is the only required field. | ||||||||||
|
|
||||||||||
| ##### size | ||||||||||
|
|
||||||||||
| The length of the content in bytes. Must be zero or a positive integer if provided. | ||||||||||
| A value of 0 indicates an empty file. If not provided, the length of the referenced | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want to advise the reader to ignore negative value here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMO we should specify that readers error on invalid values, not silently ignore
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can't require readers to fail, we can only say what it means when the length is -1. Failure is a responsibility of whatever uses this data, and it's even hard to require that component to fail because users don't want their query to fail for one bad row. They expect reasonable fallback behavior. Also, does 0 indicate an empty file or does it indicate 0-length content? With |
||||||||||
| content is unknown and the entirety of the content can be read. | ||||||||||
|
|
||||||||||
| ##### offset | ||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we specify that it must be non-negative? |
||||||||||
|
|
||||||||||
| A byte offset indicating the start of a content slice within the referenced file. | ||||||||||
| If not provided, readers must treat the value as 0. | ||||||||||
| If provided and non-zero, readers must seek to this offset and read `size` bytes to retrieve the referenced data. | ||||||||||
| If `offset` is provided, `size` must also be provided. | ||||||||||
|
Comment on lines
+674
to
+675
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, looking at the implementation in Rust, I don't see this enforced. What if the group definition is I think we need to specify this at both the field and data levels, i.e. if the group contains an 'offset' field, it must also contain a 'size' field; if the optional 'offset' field is populated, it must be a non-negative integer, and a value for 'size' must also present. Edit: I think the
|
||||||||||
|
|
||||||||||
| ##### etag | ||||||||||
|
|
||||||||||
| An eTag value provided by the storage system (e.g., from S3 or Azure Blob Storage). | ||||||||||
| Can be used to detect whether the referenced file has been updated. If the reference | ||||||||||
| points to a byte range within a file, the eTag applies to the entire file. | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What should writers do when there is no etag provided by the system, like in a local FS? What about systems that have checksums but don't call them etags? |
||||||||||
|
|
||||||||||
| #### Validation | ||||||||||
|
|
||||||||||
| * The `path` field is required and must be present. Readers must reject a `FILE`-annotated | ||||||||||
| group that does not contain `path`. | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So this makes a file completely unreadable? 🤔 |
||||||||||
| * If `offset` is present and non-zero, `size` must also be provided. | ||||||||||
| * Additional metadata about the file (e.g., content type, modification timestamp) should | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why adding this restriction? Do we want to enforce Parquet reader implementations to validate against this rule? I can foresee that users may use another group type (a.k.a struct) to wrap a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should just specify that extra fields must not be added to this struct. @wgtmac wdyt of just saying:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How are we thinking about various engines adding very common fields like content type, modification timestamp? If they are not part of the parquet file format, I suspect there will be lot of non-interoperable implementations with those fields. Do we expect this file type to evolve to support those in the future? e.g. these two definitions won't be interoperable because they wrap the FILE struct differently. It would be easier/less error-prone if data type allowed extra fields or a variant to capture additional attributes so that engines don't need a wrapper on top of the file object. |
||||||||||
| be stored adjacent to this struct by engines or table formats, not inside it. | ||||||||||
|
|
||||||||||
| Statistics may be collected for the individual fields of a `FILE`-annotated group | ||||||||||
| according to the sort order of each field's logical type. | ||||||||||
|
|
||||||||||
| This is an example `FILE`-annotated group in Parquet: | ||||||||||
|
|
||||||||||
| ``` | ||||||||||
| optional group my_file (FILE) { | ||||||||||
| required binary path (STRING); | ||||||||||
| optional int64 size; | ||||||||||
| optional int64 offset; | ||||||||||
| optional binary etag (STRING); | ||||||||||
| } | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| *Compatibility* | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't include this. Seems unnecessary to me. |
||||||||||
|
|
||||||||||
| `FILE` has no corresponding `ConvertedType`. | ||||||||||
|
|
||||||||||
| ## Nested Types | ||||||||||
|
|
||||||||||
|
|
||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -468,6 +468,21 @@ struct GeographyType { | |
| 2: optional EdgeInterpolationAlgorithm algorithm; | ||
| } | ||
|
|
||
| /** | ||
| * File logical type annotation | ||
| * | ||
| * Annotates a group that represents a reference to an external file. | ||
| * The group must contain the following fields identified by name: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Do we need to mention its case sensitivity? |
||
| * - path (STRING, required): an opaque string path to the file (e.g. s3://bucket/file.jpg) | ||
| * - size (INT64, optional): the length of the content in bytes; must be zero or positive | ||
| * - offset (INT64, optional): byte offset for range reads; if provided, size must also be provided | ||
| * - etag (STRING, optional): eTag from the storage system for staleness detection | ||
| * | ||
| * See LogicalTypes.md for details. | ||
| */ | ||
| struct FileType { | ||
| } | ||
|
|
||
| /** | ||
| * LogicalType annotations to replace ConvertedType. | ||
| * | ||
|
|
@@ -501,6 +516,7 @@ union LogicalType { | |
| 16: VariantType VARIANT // no compatible ConvertedType | ||
| 17: GeometryType GEOMETRY // no compatible ConvertedType | ||
| 18: GeographyType GEOGRAPHY // no compatible ConvertedType | ||
| 19: FileType FILE // no compatible ConvertedType | ||
| } | ||
|
|
||
| /** | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that using "identified by name" conflicts with the next sentence. I would remove both. You don't need to state how these are identified.
If you want to be more clear, then you could add requirements like not renaming the fields. You could also say that the fields can be reordered, which would also remove the ability to project by order (which would be weird anyway).