From 62da6bfd63b543bc7a89e8f6ffb8e723612a6f6f Mon Sep 17 00:00:00 2001
From: Eunjin Song
Date: Thu, 4 Jun 2026 21:20:55 -0700
Subject: [PATCH 1/2] [CORE] Drop the 15.0.0-gluten Arrow version rename, depend on vanilla Apache Arrow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The custom 15.0.0-gluten artifact coordinate forced every contributor to
run dev/build-arrow.sh before they could build gluten, even though the
Java side of that build no longer carries any load-bearing
modifications:

* The 883-line modify_arrow_dataset_scan_option.patch added CSV /
  Substrait dataset Java classes (CsvFragmentScanOptions, ConvertUtil,
  etc.). Every consumer of those classes inside gluten was deleted by
  #12130 along with the Arrow-CSV / Arrow-Dataset JVM code path. The
  patch is no longer applied to the Arrow Java build here; the file
  itself is kept because get-velox.sh still copies it into Velox's
  CMake Arrow EP for the C++ side.
* support_ibm_power.patch (ppc64le → ppcle_64 in JniLoader) is still
  load-bearing for ppc64le builds, but does not require an artifact
  rename: it only patches the binary-resource lookup inside the
  arrow-c-data JNI jar and is still applied by build-arrow.sh.
* The C++ patches (modify_arrow.patch, cmake-compatibility.patch) are
  unchanged.

After this change, on x86_64 / aarch64 every gluten-arrow Arrow
dependency resolves from Maven Central (arrow-c-data:15.0.0,
arrow-dataset:15.0.0, arrow-vector:15.0.0,
arrow-memory-{core,unsafe,netty}:15.0.0; 18.1.0 for the Spark 4.x
profiles). ppc64le builds still rely on dev/build-arrow.sh to produce
locally-patched 15.0.0 artifacts; the local-m2 install overrides Central
as before.

Note: this PR removes the artifact-rename indirection but does not yet
unbundle Arrow from the gluten-velox bundle. The bundle still ships
unshaded Arrow (per #12226) at the same vanilla coordinates.
Removing the bundled Arrow in favour of Spark's bundled copy is a
separate follow-up driven by the discussion on #12226.
---
 dev/build-arrow.sh   |  3 ---
 gluten-arrow/pom.xml | 10 +++++-----
 pom.xml              |  3 ---
 3 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/dev/build-arrow.sh b/dev/build-arrow.sh
index 54c6faaf331..9ff501376f9 100755
--- a/dev/build-arrow.sh
+++ b/dev/build-arrow.sh
@@ -31,7 +31,6 @@ function prepare_arrow_build() {
  #wget_and_untar https://archive.apache.org/dist/arrow/arrow-${VELOX_ARROW_BUILD_VERSION}/apache-arrow-${VELOX_ARROW_BUILD_VERSION}.tar.gz arrow_ep
  cd arrow_ep
  patch -p1 < $CURRENT_DIR/../ep/build-velox/src/modify_arrow.patch
- patch -p1 < $CURRENT_DIR/../ep/build-velox/src/modify_arrow_dataset_scan_option.patch
  patch -p1 < $CURRENT_DIR/../ep/build-velox/src/cmake-compatibility.patch
  patch -p1 < $CURRENT_DIR/../ep/build-velox/src/support_ibm_power.patch
  popd
@@ -97,8 +96,6 @@ function build_arrow_java() {
  export CMAKE_BUILD_PARALLEL_LEVEL=$NPROC
  pushd $ARROW_PREFIX/java
- # Because arrow-bom module need the -DprocessAllModules
- ${MVN_CMD} versions:set -DnewVersion=15.0.0-gluten -DprocessAllModules
  ${MVN_CMD} clean install -pl bom,maven/module-info-compiler-maven-plugin,vector -am \
  -DskipTests -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip -Dassembly.skipAssembly
diff --git a/gluten-arrow/pom.xml b/gluten-arrow/pom.xml
index 62d9bd243a7..defcfcbf26f 100644
--- a/gluten-arrow/pom.xml
+++ b/gluten-arrow/pom.xml
@@ -88,13 +88,13 @@
  org.apache.arrow
  ${arrow-memory.artifact}
- ${arrow-gluten.version}
+ ${arrow.version}
  runtime
  org.apache.arrow
  arrow-memory-core
- ${arrow-gluten.version}
+ ${arrow.version}
  compile
@@ -110,7 +110,7 @@
  org.apache.arrow
  arrow-vector
- ${arrow-gluten.version}
+ ${arrow.version}
  io.netty
@@ -129,7 +129,7 @@
  org.apache.arrow
  arrow-c-data
- ${arrow-gluten.version}
+ ${arrow.version}
  compile
@@ -145,7 +145,7 @@
  org.apache.arrow
  arrow-dataset
- ${arrow-gluten.version}
+ ${arrow.version}
  compile
diff --git a/pom.xml b/pom.xml
index 72c5e503aa7..1484e55d4b4 100644
--- a/pom.xml
+++ b/pom.xml
@@ -81,7 +81,6 @@
  0.6.3
  0.10.0
  15.0.0
- 15.0.0-gluten
  arrow-memory-unsafe
  2.7.4
  4.9.3
@@ -1285,7 +1284,6 @@
  2.24.3
  3.17.0
  18.1.0
- 18.1.0
  4.9.2
@@ -1365,7 +1363,6 @@
  2.24.3
  3.17.0
  18.1.0
- 18.1.0
  4.9.5

From 62bc6c90221a025c644c7367e1d6abf5e19d1e6d Mon Sep 17 00:00:00 2001
From: Eunjin Song
Date: Thu, 4 Jun 2026 21:27:19 -0700
Subject: [PATCH 2/2] [CORE] Unbundle Arrow memory + vector from gluten-velox-bundle
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mark arrow-memory-{core,unsafe,netty} and arrow-vector as scope=provided
in gluten-arrow/pom.xml. They are bundled in Spark's distribution
($SPARK_HOME/jars/ for Spark 3.x; declared in Spark 4.x's pom), so the
user's classpath already has them at runtime and gluten does not need to
ship its own copy.

Effects:

* The gluten-velox bundle no longer ships ANY org.apache.arrow.memory.*
  or org.apache.arrow.vector.* classes. The class-shadowing problem from
  #12225 goes away by construction: there is no gluten-shipped copy left
  to shadow the user's vanilla Arrow.
* The org.apache.arrow shade-relocation block in package/pom.xml becomes
  redundant and is removed: arrow-memory/vector are no longer in the
  bundle to relocate, and arrow-c-data / arrow-dataset (still bundled)
  were already excluded from relocation because their JNI binds to the
  original class names.
* arrow-c-data and arrow-dataset remain at scope=compile in
  gluten-arrow. Spark does NOT ship those, so gluten still bundles them.
  With the relocation block gone, their public method signatures
  naturally bind to the user's vanilla
  org.apache.arrow.memory.BufferAllocator / arrow-vector types, exactly
  matching what every other Arrow C-Data caller on the classpath
  expects.

Compile-classpath touch-ups:

* backends-velox/pom.xml: re-declare arrow-memory-core and arrow-vector
  at scope=provided.
  The transitive route through gluten-arrow no longer carries them after
  the scope flip, so backends-velox needs its own provided declaration
  to compile.
* gluten-ut/* and backends-clickhouse already declare arrow at provided
  scope locally, so they are unaffected.

Caveats:

* Spark 3.5 and earlier do NOT declare arrow-memory/arrow-vector in
  their Maven POM (they ship them inside the binary distribution only).
  gluten builds against the version pinned in `arrow.version`.
  Maintainers should keep `arrow.version` aligned with the
  lowest-common-denominator Arrow version across supported Spark distros
  (DBR 16.4 ships Arrow 12.0.1 with Spark 3.5; vanilla Spark 3.5.x ships
  15.0.0. The 15.0.0 default here is fine for vanilla Spark 3.5 but may
  need a compat profile for DBR/Cloudera flavors).
* dev/check-arrow-c-shading.sh added in #12226 still passes: the bundle
  still contains org/apache/arrow/c/* classes whose method signatures
  now reference unshaded org.apache.arrow.memory.* /
  org.apache.arrow.vector.* types (which are no longer in the bundle,
  but resolve at runtime from Spark's Arrow).

Builds on #12244 (drop the 15.0.0-gluten Arrow version rename).
Addresses the follow-up direction from #12226 discussion: "remove Arrow
from the bundled Gluten Jar and let users rely on Spark's bundled
Arrow".
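
The "Effects" list above makes two claims about the bundle contents that
can be checked mechanically: no org.apache.arrow.memory.* or
org.apache.arrow.vector.* classes remain, while org/apache/arrow/c/*
classes are still present. The following is a minimal sketch of such a
check, assuming the bundle is an ordinary jar (zip) file. It is not the
project's dev/check-arrow-c-shading.sh, whose contents are not shown in
this series; the names audit_bundle and make_jar are hypothetical and
exist only for illustration.

```python
import io
import zipfile

# Packages the commit message says should no longer be in the bundle.
FORBIDDEN_PREFIXES = (
    "org/apache/arrow/memory/",
    "org/apache/arrow/vector/",
)
# Package the commit message says is still bundled (arrow-c-data).
REQUIRED_PREFIX = "org/apache/arrow/c/"


def audit_bundle(jar):
    """Return (leaked, has_c_data) for a jar path or file-like object.

    leaked     -- sorted .class entries under a forbidden Arrow package
    has_c_data -- True if at least one org/apache/arrow/c/ class exists
    """
    with zipfile.ZipFile(jar) as zf:
        classes = [n for n in zf.namelist() if n.endswith(".class")]
    leaked = sorted(n for n in classes if n.startswith(FORBIDDEN_PREFIXES))
    has_c_data = any(n.startswith(REQUIRED_PREFIX) for n in classes)
    return leaked, has_c_data


def make_jar(entries):
    """Build an in-memory jar with empty entries (illustrative fixture)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in entries:
            zf.writestr(name, b"")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # A bundle shaped the way this patch intends: C-Data only.
    good = make_jar(["org/apache/arrow/c/Data.class"])
    # A bundle that still carries an arrow-memory class.
    bad = make_jar([
        "org/apache/arrow/c/Data.class",
        "org/apache/arrow/memory/BufferAllocator.class",
    ])
    print(audit_bundle(good))
    print(audit_bundle(bad))
```

Against a real build, audit_bundle would be pointed at the packaged
bundle jar path instead of the in-memory fixtures; an empty leaked list
together with has_c_data being true corresponds to the state described
in the commit message.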
---
 backends-velox/pom.xml | 18 ++++++++++++++++++
 gluten-arrow/pom.xml   | 12 ++++++++++--
 package/pom.xml        | 31 +++++++++----------------------
 3 files changed, 37 insertions(+), 24 deletions(-)

diff --git a/backends-velox/pom.xml b/backends-velox/pom.xml
index 4432547a0fe..b8e9f840d48 100644
--- a/backends-velox/pom.xml
+++ b/backends-velox/pom.xml
@@ -79,6 +79,24 @@
  ${project.version}
  compile
+
+
+ org.apache.arrow
+ arrow-memory-core
+ ${arrow.version}
+ provided
+
+ org.apache.arrow
+ arrow-vector
+ ${arrow.version}
+ provided
  com.github.ben-manes.caffeine
  caffeine
diff --git a/gluten-arrow/pom.xml b/gluten-arrow/pom.xml
index defcfcbf26f..47404d00b7a 100644
--- a/gluten-arrow/pom.xml
+++ b/gluten-arrow/pom.xml
@@ -85,17 +85,24 @@
  ${spark.version}
  provided
+
  org.apache.arrow
  ${arrow-memory.artifact}
  ${arrow.version}
- runtime
+ provided
  org.apache.arrow
  arrow-memory-core
  ${arrow.version}
- compile
+ provided
  io.netty
@@ -111,6 +118,7 @@
  org.apache.arrow
  arrow-vector
  ${arrow.version}
+ provided
  io.netty
diff --git a/package/pom.xml b/package/pom.xml
index 709170a50fc..0d6b2d62573 100644
--- a/package/pom.xml
+++ b/package/pom.xml
@@ -118,28 +118,15 @@
  com.google.gson.**
-
- org.apache.arrow
- ${gluten.shade.packageName}.org.apache.arrow
-
-
- org.apache.arrow.c.*
- org.apache.arrow.c.jni.*
- org.apache.arrow.memory.**
- org.apache.arrow.vector.**
- org.apache.arrow.dataset.**
-
-
+
  com.google.flatbuffers
  ${gluten.shade.packageName}.com.google.flatbuffers