fix(parquet_derive): support raw identifiers as column names #10113

Open
cbmixx wants to merge 1 commit into apache:main from cbmixx:fix-parquet-derive-raw-identifiers

@cbmixx cbmixx commented Jun 11, 2026


Which issue does this PR close?

Rationale for this change

#[derive(ParquetRecordReader)] and #[derive(ParquetRecordWriter)] could not
handle a Parquet column whose name is a Rust keyword (e.g. type). The only way
to spell such a field in Rust is a raw identifier (r#type), but the derives
stringified the identifier including the r# prefix:

  • The reader's column-index lookup used name_to_index.get(stringify!(#field_names)),
    and stringify!(r#type) yields "r#type", so reading failed with
    ParquetError::General("column name 'r#type' is not found in parquet file!").
  • The writer's Field::parquet_type() used self.ident.to_string(), which keeps
    the r# prefix, so the written schema got a column literally named r#type.
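
For illustration, the reader-side failure can be reproduced with the standard library alone. This is a minimal sketch, not the code the derive generates: `column_index`, `sample_file_columns`, and `stringified_field_name` are hypothetical names, and a plain `HashMap` stands in for `name_to_index`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the reader's column lookup: maps a column
/// name to its position, or returns the error text quoted above.
fn column_index(name_to_index: &HashMap<String, usize>, name: &str) -> Result<usize, String> {
    name_to_index
        .get(name)
        .copied()
        .ok_or_else(|| format!("column name '{}' is not found in parquet file!", name))
}

/// Column positions for an assumed file with columns `type` and `count`.
fn sample_file_columns() -> HashMap<String, usize> {
    HashMap::from([("type".to_string(), 0), ("count".to_string(), 1)])
}

/// The string the pre-fix lookup would use for a field spelled `r#type`:
/// `stringify!` keeps the `r#` prefix.
fn stringified_field_name() -> &'static str {
    stringify!(r#type)
}
```

Calling `column_index(&sample_file_columns(), stringified_field_name())` returns the "not found" error, while looking up `"type"` directly returns index 0.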

This made it impossible to read or write Parquet columns whose names are Rust
keywords, e.g. files produced by other Parquet writers with a column named type.

What changes are included in this PR?

Unraw the identifier (via syn::ext::IdentExt::unraw, already available through
the existing syn dependency) wherever it is used as a column name, while keeping
the raw identifier for field access in the generated code:

  • parquet_derive/src/lib.rs: the reader derive builds a parallel list of unrawed
    field-name strings for the name_to_index lookup and its error message.
  • parquet_derive/src/parquet_field.rs: Field::parquet_type() uses
    self.ident.unraw().to_string() for the schema column name.
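
The idea of keeping two parallel names per field can be sketched without `syn`. Everything here is an assumption for illustration: the string-based `unraw` only approximates `syn::ext::IdentExt::unraw` (which operates on an `Ident`), and the `column_names!` macro and `example_columns` function are hypothetical, not part of `parquet_derive`.

```rust
/// String-based approximation of `syn::ext::IdentExt::unraw`:
/// drops a leading `r#` if present, otherwise returns the input unchanged.
fn unraw(ident: &str) -> &str {
    ident.strip_prefix("r#").unwrap_or(ident)
}

/// Hypothetical macro modelling the derive: for each field it yields
/// (identifier as written, column name with the `r#` prefix removed).
macro_rules! column_names {
    ($($field:ident),* $(,)?) => {
        vec![$((stringify!($field), unraw(stringify!($field)))),*]
    };
}

/// Pairs for an assumed struct with fields `r#type` and `count`.
fn example_columns() -> Vec<(&'static str, &'static str)> {
    column_names!(r#type, count)
}
```

The first element of each pair is what the generated code would still use for field access, and the second is what would be used for the schema and the `name_to_index` lookup.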

Are these changes tested?

Yes. Added a unit test (test_parquet_type_with_raw_identifier) and an
integration round-trip test (test_parquet_derive_raw_identifiers) covering a
struct with a raw-identifier field (r#type) alongside a normal field, asserting
the schema columns are named type/count. I verified both tests fail without
the fix (the writer emits a column named r#type) and pass with it.

Are there any user-facing changes?

Structs with raw-identifier fields now read and write columns named without the
r# prefix. This is a bug fix; there are no public API changes. Code that somehow
relied on the previous r#-prefixed column names would change behavior, but such
names could not be produced by any other Parquet writer.


AI disclosure (per CONTRIBUTING.md): this change was developed with the
assistance of an AI coding tool. I reviewed every line, verified the fix against
the failing/passing tests described above, and own the change.

ParquetRecordReader and ParquetRecordWriter derives stringified struct
field identifiers including the r# prefix, so a field declared as
r#type was looked up (reader) and written to the schema (writer) as a
column literally named "r#type" instead of "type". This made it
impossible to read or write parquet columns whose names are Rust
keywords.

Unraw the identifier wherever it is used as a column name, while
keeping the raw identifier for field access in the generated code.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

parquet_derive: cannot read or write columns whose name is a Rust keyword (raw identifiers like r#type become column "r#type")
