Skip to content

Conversation

@dh20
Copy link
Contributor

@dh20 dh20 commented Dec 23, 2025

Which issue does this PR close?

Closes #1780

Rationale for this change

Fix the issue of reading null ORC files written by Hive

What changes are included in this PR?

Modify the matching field logic of the ORC file

Are there any user-facing changes?

no

How was this patch tested?

cluster test

@github-actions github-actions bot added the spark label Dec 23, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new configuration option to support case-insensitive field name matching when reading ORC files written by Hive, addressing compatibility issues with uppercase field names.

Key Changes:

  • Added ORC_SCHEMA_ISCASE_SENSITIVE configuration option to control case-sensitivity in ORC schema matching
  • Modified ORC file reading logic to support both case-sensitive and case-insensitive field name matching
  • Propagated the configuration through Java configuration classes to the Rust native execution layer

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
spark-extension/src/main/java/org/apache/spark/sql/auron/AuronConf.java Added ORC_SCHEMA_ISCASE_SENSITIVE configuration with default value true
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java Added ORC_SCHEMA_ISCASE_SENSITIVE configuration option with default value false and an unrelated import
native-engine/datafusion-ext-plans/src/orc_exec.rs Implemented case-insensitive field matching logic in ORC schema adapter with configuration support
native-engine/auron-jni-bridge/src/conf.rs Added ORC_SCHEMA_ISCASE_SENSITIVE to the JNI bridge configuration definitions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@cxzl25 cxzl25 changed the title Fix the ORC table written by Hive to read null uppercase fields [AURON #1780] Fix the ORC table written by Hive to read null uppercase fields Dec 23, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +381 to +394
for named_column in file_named_columns {
// Case-insensitive field name matching
let named_column_name_lower = named_column.name().to_lowercase();
if let Some((proj_idx, _)) = self
.projected_schema
.fields()
.iter()
.enumerate()
.find(|(_, f)| f.name().to_lowercase() == named_column_name_lower)
{
field_mappings[proj_idx] = Some(projection.len());
projection.push(named_column.data_type().column_index());
}
}
Copy link

Copilot AI Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case-insensitive field matching performs string lowercase conversion on every iteration (line 383) and then again for each field comparison (line 389). For files with many columns, this could be inefficient. Consider pre-computing a HashMap of lowercase field names to their indices for O(1) lookup instead of O(n*m) where n is the number of file columns and m is the number of projected fields. Alternatively, at minimum, compute the lowercase versions of projected_schema field names once before the loop to avoid repeated conversions.

Copilot uses AI. Check for mistakes.
Comment on lines 240 to 241
public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = ConfigOptions.key(
"auron.orc.schema.iscasesensitive")
Copy link

Copilot AI Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration key uses "iscasesensitive" (all lowercase), which is inconsistent with the naming convention used in similar ORC configurations. The key should use dots to separate words for readability, like "auron.orc.schema.is.case.sensitive" to match the pattern seen in other configurations such as "auron.orc.force.positional.evolution" and "auron.orc.timestamp.use.microsecond".

Copilot uses AI. Check for mistakes.
.defaultValue(false);
public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = ConfigOptions.key(
"auron.orc.schema.iscasesensitive")
.description("does matching ORC file schema distinguish between uppercase and lowercase ")
Copy link

Copilot AI Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description has a grammatical error. It should read "does matching ORC file schema distinguish between uppercase and lowercase?" with a question mark at the end, or better yet, be rephrased as a statement: "whether ORC file schema matching distinguishes between uppercase and lowercase" or "enable case-sensitive matching for ORC file schema fields".

Suggested change
.description("does matching ORC file schema distinguish between uppercase and lowercase ")
.description("whether ORC file schema matching distinguishes between uppercase and lowercase.")

Copilot uses AI. Check for mistakes.
define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE);
define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE);
define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION);
define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
Copy link

Copilot AI Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration name "ORC_SCHEMA_ISCASE_SENSITIVE" with a default value of false is semantically confusing. The name suggests that when set to false (the default), case-insensitive matching is disabled, but the actual behavior is the opposite: false enables case-insensitive matching. This creates a double negative that makes the configuration harder to understand. Consider renaming to "ORC_SCHEMA_CASE_INSENSITIVE" (with default true for Hive compatibility) or improving the documentation to clearly state that false means "case-insensitive matching enabled" and true means "case-sensitive matching enabled".

Suggested change
define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
// NOTE: The semantics of this flag are intentionally inverted for Hive compatibility:
// - Default: false => case-insensitive schema matching is ENABLED
// - true => case-sensitive schema matching is ENABLED
// The name ORC_SCHEMA_ISCASE_SENSITIVE is historical; do not change it without
// coordinating with the corresponding configuration on the JVM side.

Copilot uses AI. Check for mistakes.
define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE);
define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION);
define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
define_conf!(BooleanConf, ORC_SCHEMA_ISCASE_SENSITIVE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with Copilot's comment, we should use a better name for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be consistent with Spark's [spark.sql.caseSensitive]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reading the ORC table written by Hive will result in fields being null

2 participants