[AURON #1780] Fix the ORC table written by Hive to read null uppercase fields #1781

dh20 · 2025-12-23T08:27:22Z

Which issue does this PR close?

Closes #1780

Rationale for this change

Fix an issue where uppercase fields in ORC files written by Hive are read as null.

What changes are included in this PR?

Modify the field-matching logic used when reading ORC files.

Are there any user-facing changes?

no

How was this patch tested?

cluster test

Copilot

Pull request overview

This PR introduces a new configuration option to support case-insensitive field name matching when reading ORC files written by Hive, addressing compatibility issues with uppercase field names.

Key Changes:

Added ORC_SCHEMA_ISCASE_SENSITIVE configuration option to control case-sensitivity in ORC schema matching
Modified ORC file reading logic to support both case-sensitive and case-insensitive field name matching
Propagated the configuration through Java configuration classes to the Rust native execution layer
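
To illustrate why the matching mode matters, here is a minimal sketch, not the PR's actual code. It assumes a hypothetical helper `map_fields` that maps each projected (table) field name to the position of the matching file column. A `None` entry stands for a field with no match, which a reader would typically surface as an all-null column.

```rust
/// Map each projected (table) field to the position of the matching file column.
/// `None` means no file column matched; a reader would typically fill such a
/// field with nulls. `map_fields` is a hypothetical name used for illustration.
fn map_fields(
    projected: &[&str],
    file_columns: &[&str],
    case_sensitive: bool,
) -> Vec<Option<usize>> {
    projected
        .iter()
        .map(|wanted| {
            file_columns.iter().position(|have| {
                if case_sensitive {
                    // Exact comparison: "ID" does not match "id".
                    have == wanted
                } else {
                    // Compare lowercase forms: "ID" matches "id".
                    have.to_lowercase() == wanted.to_lowercase()
                }
            })
        })
        .collect()
}
```

With projected fields `["id", "user_name"]` and file columns `["ID", "USER_NAME"]`, case-sensitive matching yields no matches at all, while case-insensitive matching resolves both fields.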

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File	Description
spark-extension/src/main/java/org/apache/spark/sql/auron/AuronConf.java	Added ORC_SCHEMA_ISCASE_SENSITIVE configuration with default value true
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java	Added ORC_SCHEMA_ISCASE_SENSITIVE configuration option with default value false and an unrelated import
native-engine/datafusion-ext-plans/src/orc_exec.rs	Implemented case-insensitive field matching logic in ORC schema adapter with configuration support
native-engine/auron-jni-bridge/src/conf.rs	Added ORC_SCHEMA_ISCASE_SENSITIVE to the JNI bridge configuration definitions

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Copilot · 2025-12-25T08:41:17Z

native-engine/datafusion-ext-plans/src/orc_exec.rs

+            for named_column in file_named_columns {
+                // Case-insensitive field name matching
+                let named_column_name_lower = named_column.name().to_lowercase();
+                if let Some((proj_idx, _)) = self
+                    .projected_schema
+                    .fields()
+                    .iter()
+                    .enumerate()
+                    .find(|(_, f)| f.name().to_lowercase() == named_column_name_lower)
+                {
+                    field_mappings[proj_idx] = Some(projection.len());
+                    projection.push(named_column.data_type().column_index());
+                }
+            }


The case-insensitive field matching performs string lowercase conversion on every iteration (line 383) and then again for each field comparison (line 389). For files with many columns, this could be inefficient. Consider pre-computing a HashMap of lowercase field names to their indices for O(1) lookup instead of O(n*m) where n is the number of file columns and m is the number of projected fields. Alternatively, at minimum, compute the lowercase versions of projected_schema field names once before the loop to avoid repeated conversions.
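
The precomputed-lookup suggestion above could look roughly like the following sketch. It is self-contained and does not use the real ORC reader types: `build_projection`, the `(name, column index)` tuple shape, and the plain string slices are assumptions made for illustration. Each projected field name is lowercased once into a `HashMap`, so each file column needs only a single lookup.

```rust
use std::collections::HashMap;

/// Returns (field_mappings, projection):
/// - field_mappings[i] is the position in `projection` for projected field i, if matched
/// - projection lists the file column indices to read
/// The input shapes are hypothetical stand-ins for the real schema types.
fn build_projection(
    projected_fields: &[&str],
    file_columns: &[(&str, usize)], // (column name, column index in the file)
) -> (Vec<Option<usize>>, Vec<usize>) {
    // Lowercase every projected field name once. `or_insert` keeps the first
    // occurrence on a collision, mirroring the first-match behavior of `find`.
    let mut by_lower: HashMap<String, usize> = HashMap::with_capacity(projected_fields.len());
    for (idx, name) in projected_fields.iter().enumerate() {
        by_lower.entry(name.to_lowercase()).or_insert(idx);
    }

    let mut field_mappings = vec![None; projected_fields.len()];
    let mut projection = Vec::new();
    for (name, column_index) in file_columns {
        // One lowercase conversion and one hash lookup per file column.
        if let Some(&proj_idx) = by_lower.get(&name.to_lowercase()) {
            field_mappings[proj_idx] = Some(projection.len());
            projection.push(*column_index);
        }
    }
    (field_mappings, projection)
}
```

This brings the expected cost down from O(n*m) string comparisons to roughly O(n + m) hash operations, at the price of one small allocation per projected field.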

Copilot · 2025-12-25T08:41:17Z

spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java

+    public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = ConfigOptions.key(
+                    "auron.orc.schema.iscasesensitive")


The configuration key uses "iscasesensitive" (all lowercase), which is inconsistent with the naming convention used in similar ORC configurations. The key should use dots to separate words for readability, like "auron.orc.schema.is.case.sensitive" to match the pattern seen in other configurations such as "auron.orc.force.positional.evolution" and "auron.orc.timestamp.use.microsecond".

Copilot · 2025-12-25T08:41:17Z

spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java

            .defaultValue(false);
+    public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = ConfigOptions.key(
+                    "auron.orc.schema.iscasesensitive")
+            .description("does matching ORC file schema distinguish between uppercase and lowercase ")


The description has a grammatical error. It should read "does matching ORC file schema distinguish between uppercase and lowercase?" with a question mark at the end, or better yet, be rephrased as a statement: "whether ORC file schema matching distinguishes between uppercase and lowercase" or "enable case-sensitive matching for ORC file schema fields".

Suggested change

-            .description("does matching ORC file schema distinguish between uppercase and lowercase ")
+            .description("whether ORC file schema matching distinguishes between uppercase and lowercase.")

Copilot · 2025-12-25T08:41:18Z

native-engine/auron-jni-bridge/src/conf.rs

 define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE);
 define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE);
 define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION);
 define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);


The configuration name "ORC_SCHEMA_ISCASE_SENSITIVE" with a default value of false is semantically confusing. The name suggests that when set to false (the default), case-insensitive matching is disabled, but the actual behavior is the opposite: false enables case-insensitive matching. This creates a double negative that makes the configuration harder to understand. Consider renaming to "ORC_SCHEMA_CASE_INSENSITIVE" (with default true for Hive compatibility) or improving the documentation to clearly state that false means "case-insensitive matching enabled" and true means "case-sensitive matching enabled".

Suggested change

-define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
+define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
+// NOTE: The semantics of this flag are intentionally inverted for Hive compatibility:
+// - Default: false  => case-insensitive schema matching is ENABLED
+// - true            => case-sensitive schema matching is ENABLED
+// The name ORC_SCHEMA_ISCASE_SENSITIVE is historical; do not change it without
+// coordinating with the corresponding configuration on the JVM side.

richox · 2025-12-26T07:16:56Z

native-engine/auron-jni-bridge/src/conf.rs

 define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE);
 define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION);
 define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
+define_conf!(BooleanConf, ORC_SCHEMA_ISCASE_SENSITIVE);


I agree with Copilot's comment; we should use a better name for this.

Be consistent with Spark's spark.sql.caseSensitive.

Fix the ORC table written by Hive to read null uppercase fields

5903e13

github-actions bot added the native label Dec 23, 2025

Fix the ORC table written by Hive to read null uppercase fields

a1fbc9f

github-actions bot added the spark label Dec 23, 2025

duanhao-jk added 4 commits December 23, 2025 17:30

Fix the ORC table written by Hive to read null uppercase fields

b724889

Fix the ORC table written by Hive to read null uppercase fields

7e6a629

Fix the ORC table written by Hive to read null uppercase fields

271fc41

Fix the ORC table written by Hive to read null uppercase fields

440c897

cxzl25 requested a review from Copilot December 23, 2025 13:17

Copilot started reviewing on behalf of cxzl25 December 23, 2025 13:18

Copilot AI reviewed Dec 23, 2025

cxzl25 changed the title ~~Fix the ORC table written by Hive to read null uppercase fields~~ [AURON #1780] Fix the ORC table written by Hive to read null uppercase fields Dec 23, 2025

duanhao-jk added 4 commits December 24, 2025 09:42

Fix the ORC table written by Hive to read null uppercase fields

a32f4f1

Fix the ORC table written by Hive to read null uppercase fields

d6e7a08

Fix the ORC table written by Hive to read null uppercase fields

0675b0b

Fix the ORC table written by Hive to read null uppercase fields

715c12d

cxzl25 requested a review from Copilot December 25, 2025 08:31

Copilot started reviewing on behalf of cxzl25 December 25, 2025 08:31

Copilot AI reviewed Dec 25, 2025

Fix the ORC table written by Hive to read null uppercase fields

53cf322

richox reviewed Dec 26, 2025

Fix the ORC table written by Hive to read null uppercase fields

9d981c5
