-
Notifications
You must be signed in to change notification settings - Fork 198
[AURON #1780] Fix the ORC table written by Hive to read null uppercase fields #1781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a new configuration option to support case-insensitive field name matching when reading ORC files written by Hive, addressing compatibility issues with uppercase field names.
Key Changes:
- Added
ORC_SCHEMA_ISCASE_SENSITIVEconfiguration option to control case-sensitivity in ORC schema matching - Modified ORC file reading logic to support both case-sensitive and case-insensitive field name matching
- Propagated the configuration through Java configuration classes to the Rust native execution layer
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| spark-extension/src/main/java/org/apache/spark/sql/auron/AuronConf.java | Added ORC_SCHEMA_ISCASE_SENSITIVE configuration with default value true |
| spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java | Added ORC_SCHEMA_ISCASE_SENSITIVE configuration option with default value false and an unrelated import |
| native-engine/datafusion-ext-plans/src/orc_exec.rs | Implemented case-insensitive field matching logic in ORC schema adapter with configuration support |
| native-engine/auron-jni-bridge/src/conf.rs | Added ORC_SCHEMA_ISCASE_SENSITIVE to the JNI bridge configuration definitions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java
Show resolved
Hide resolved
spark-extension/src/main/java/org/apache/spark/sql/auron/AuronConf.java
Outdated
Show resolved
Hide resolved
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for named_column in file_named_columns { | ||
| // Case-insensitive field name matching | ||
| let named_column_name_lower = named_column.name().to_lowercase(); | ||
| if let Some((proj_idx, _)) = self | ||
| .projected_schema | ||
| .fields() | ||
| .iter() | ||
| .enumerate() | ||
| .find(|(_, f)| f.name().to_lowercase() == named_column_name_lower) | ||
| { | ||
| field_mappings[proj_idx] = Some(projection.len()); | ||
| projection.push(named_column.data_type().column_index()); | ||
| } | ||
| } |
Copilot
AI
Dec 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case-insensitive field matching performs string lowercase conversion on every iteration (line 383) and then again for each field comparison (line 389). For files with many columns, this could be inefficient. Consider pre-computing a HashMap of lowercase field names to their indices for O(1) lookup instead of O(n*m) where n is the number of file columns and m is the number of projected fields. Alternatively, at minimum, compute the lowercase versions of projected_schema field names once before the loop to avoid repeated conversions.
| public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = ConfigOptions.key( | ||
| "auron.orc.schema.iscasesensitive") |
Copilot
AI
Dec 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The configuration key uses "iscasesensitive" (all lowercase), which is inconsistent with the naming convention used in similar ORC configurations. The key should use dots to separate words for readability, like "auron.orc.schema.is.case.sensitive" to match the pattern seen in other configurations such as "auron.orc.force.positional.evolution" and "auron.orc.timestamp.use.microsecond".
| .defaultValue(false); | ||
| public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = ConfigOptions.key( | ||
| "auron.orc.schema.iscasesensitive") | ||
| .description("does matching ORC file schema distinguish between uppercase and lowercase ") |
Copilot
AI
Dec 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description has a grammatical error. It should read "does matching ORC file schema distinguish between uppercase and lowercase?" with a question mark at the end, or better yet, be rephrased as a statement: "whether ORC file schema matching distinguishes between uppercase and lowercase" or "enable case-sensitive matching for ORC file schema fields".
| .description("does matching ORC file schema distinguish between uppercase and lowercase ") | |
| .description("whether ORC file schema matching distinguishes between uppercase and lowercase.") |
| define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE); | ||
| define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE); | ||
| define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION); | ||
| define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND); |
Copilot
AI
Dec 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The configuration name "ORC_SCHEMA_ISCASE_SENSITIVE" with a default value of false is semantically confusing. The name suggests that when set to false (the default), case-insensitive matching is disabled, but the actual behavior is the opposite: false enables case-insensitive matching. This creates a double negative that makes the configuration harder to understand. Consider renaming to "ORC_SCHEMA_CASE_INSENSITIVE" (with default true for Hive compatibility) or improving the documentation to clearly state that false means "case-insensitive matching enabled" and true means "case-sensitive matching enabled".
| define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND); | |
| define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND); | |
| // NOTE: The semantics of this flag are intentionally inverted for Hive compatibility: | |
| // - Default: false => case-insensitive schema matching is ENABLED | |
| // - true => case-sensitive schema matching is ENABLED | |
| // The name ORC_SCHEMA_ISCASE_SENSITIVE is historical; do not change it without | |
| // coordinating with the corresponding configuration on the JVM side. |
| define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE); | ||
| define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION); | ||
| define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND); | ||
| define_conf!(BooleanConf, ORC_SCHEMA_ISCASE_SENSITIVE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with Copilot's comment, we should use a better name for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be consistent with Spark's [spark.sql.caseSensitive]
Which issue does this PR close?
Closes #1780
Rationale for this change
Fix the issue of reading null ORC files written by Hive
What changes are included in this PR?
Modify the matching field logic of the ORC file
Are there any user-facing changes?
no
How was this patch tested?
cluster test