[GLUTEN-12225][CORE] Fix arrow.c shading: exclude memory/vector packages so public API stays unshaded by sezruby · Pull Request #12226 · apache/gluten

sezruby · 2026-06-02T17:07:57Z

What changes were proposed in this pull request?

Extend package/pom.xml's org.apache.arrow relocation excludes to also keep org.apache.arrow.memory.** and org.apache.arrow.vector.** unshaded.

The bundled Arrow C-Data classes (org.apache.arrow.c.*) are correctly excluded from relocation because their native JNI binds to the original class names. However, their public API signatures take and return org.apache.arrow.memory.* and org.apache.arrow.vector.* types — which were being relocated. The result: the bundled ArrowArrayStream / ArrowSchema / ArrowArray / Data classes get compiled against the shaded BufferAllocator / VectorSchemaRoot, so any caller passing a vanilla Apache Arrow allocator hits NoSuchMethodError.

This affects any Spark workload that combines gluten with another library using Arrow C-Data (Iceberg's Arrow vector layer, Lance Java's writer, Snowflake JDBC's Arrow result decoder, etc.) when gluten's bundle wins classloader resolution against vanilla Arrow.

How was this patch tested?

Adds dev/check-arrow-c-shading.sh which runs javap on the produced bundle jar and asserts that public method signatures reference unshaded Arrow types. Wired into package/pom.xml's verify phase via exec-maven-plugin so regressions are caught in CI.
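
As a rough illustration of the kind of assertion described here (this is not the contents of dev/check-arrow-c-shading.sh, which are not reproduced in this thread), the core test can be sketched in Java: given the signature lines javap prints for a class, flag every public signature that mentions the relocated Arrow prefix. The class and method names are hypothetical; only the `org.apache.gluten.shaded.` prefix is taken from the PR description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: flags javap signature lines that mention relocated Arrow types. */
final class ShadedSignatureCheck {
    // Relocation prefix named in the PR description.
    static final String SHADED_ARROW = "org.apache.gluten.shaded.org.apache.arrow.";

    /** Returns the public signature lines that reference a shaded Arrow type. */
    static List<String> offendingSignatures(List<String> javapLines) {
        List<String> bad = new ArrayList<>();
        for (String line : javapLines) {
            String sig = line.trim();
            if (sig.startsWith("public ") && sig.contains(SHADED_ARROW)) {
                bad.add(sig);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "public static org.apache.arrow.c.ArrowArrayStream allocateNew("
                + "org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator);",
            "public static org.apache.arrow.c.ArrowSchema allocateNew("
                + "org.apache.arrow.memory.BufferAllocator);");
        // Only the first signature references the shaded allocator type.
        offendingSignatures(lines).forEach(s -> System.out.println("FAIL " + s));
    }
}
```

A real check would feed this the javap output for each `org/apache/arrow/c/*` class in the bundle jar and fail the build when the returned list is non-empty.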

Tested against the upstream gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.6.0.jar:

$ dev/check-arrow-c-shading.sh /path/to/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.6.0.jar
  FAIL org/apache/arrow/c/ArrowArrayStream — public API references gluten-shaded Arrow types:
      public static org.apache.arrow.c.ArrowArrayStream allocateNew(
        org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator);
  FAIL org/apache/arrow/c/ArrowSchema — public API references gluten-shaded Arrow types:
      public static org.apache.arrow.c.ArrowSchema allocateNew(
        org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator);
  FAIL org/apache/arrow/c/ArrowArray — public API references gluten-shaded Arrow types:
      public static org.apache.arrow.c.ArrowArray allocateNew(
        org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator);
  FAIL org/apache/arrow/c/Data — public API references gluten-shaded Arrow types:
      [16 methods touching shaded org.apache.arrow.memory/vector types]

Bundle has 4 Arrow C-Data class(es) with shaded API types.
exit code: 1

After applying the relocation exclude change, a freshly built bundle should pass the same check (script exits 0). The repro from #12225 (3 lines calling ArrowArrayStream.allocateNew(new RootAllocator(...))) goes from NoSuchMethodError to OK.

Closes

#12225

sezruby · 2026-06-02T18:28:02Z

@philo-he, @zhouyuan could you have a look at the PR?

philo-he · 2026-06-03T05:19:11Z

@sezruby, thanks for the PR. This fix makes sense. I recall there's a related issue that occurs at compile time when an external project introduces the Gluten JAR as a dependency: a Scala type mismatch caused by the Maven Shade Plugin not rewriting ScalaSignature annotations. My understanding is this PR also fixes that case (see https://chungmin.hashnode.dev/unraveling-a-scala-type-mismatch-mystery).

One small concern is potential Arrow version conflicts, since these packages are no longer shaded. That said, the memory and vector APIs should be stable across minor versions, so I assume the risk should be low in practice.

cc @zhztheplayer

zhztheplayer · 2026-06-03T15:25:40Z

+                    <exclude>org.apache.arrow.memory.**</exclude>
+                    <exclude>org.apache.arrow.vector.**</exclude>


nit: Should we directly exclude all arrow packages? E.g., org.apache.arrow.*

sezruby · 2026-06-03

The full org.apache.arrow.* exclusion would lose gluten's isolation from the user's Arrow version everywhere, not just on the C-Data boundary. The C-Data classes have to be unshaded because their JNI native lib hardcodes the original class names; arrow.memory.* and arrow.vector.* follow because they appear in arrow.c.* public method signatures. Anything else under org.apache.arrow.* (flight, algorithm, adapter, etc.) is internal to gluten's columnar batch handling and safer to keep shaded so it doesn't conflict with user Arrow. The narrow exclusion is the minimum that makes the public C-Data API self-consistent without giving up isolation elsewhere.
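
The effect of the narrow exclusion can be sketched with a simplified model. This uses plain prefix matching and is not maven-shade-plugin's actual pattern matcher; `RelocationModel` and its members are hypothetical names, and the package list reflects only what this thread states.

```java
import java.util.List;

/** Simplified, hypothetical model of a shade relocation with package excludes. */
final class RelocationModel {
    static final String PATTERN = "org.apache.arrow.";
    static final String SHADED = "org.apache.gluten.shaded.org.apache.arrow.";

    // Packages kept under their original names after this PR.
    static final List<String> EXCLUDES = List.of(
        "org.apache.arrow.c.",       // JNI natives bind to these class names
        "org.apache.arrow.memory.",  // appears in arrow.c public signatures
        "org.apache.arrow.vector."); // appears in arrow.c public signatures

    /** Returns the name a bundled class would carry after relocation. */
    static String relocate(String className) {
        if (!className.startsWith(PATTERN)) {
            return className; // not an Arrow class, untouched
        }
        for (String excluded : EXCLUDES) {
            if (className.startsWith(excluded)) {
                return className; // boundary type, stays unshaded
            }
        }
        return SHADED + className.substring(PATTERN.length());
    }

    public static void main(String[] args) {
        System.out.println(relocate("org.apache.arrow.memory.BufferAllocator"));
        System.out.println(relocate("org.apache.arrow.flight.FlightClient"));
    }
}
```

Under this model the allocator keeps its public name while a Flight class is still moved under the shaded prefix, which is the isolation trade-off argued for above.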

zhztheplayer · 2026-06-03T15:36:48Z

> That said, the memory and vector APIs should be stable across minor versions

This sounds like a real risk. Moving forward, can we completely remove Arrow from the bundled Gluten Jar, and let users rely on Spark's bundled Arrow instead?

I assume we don't have any customized Arrow code now: #12130

sezruby · 2026-06-03T17:35:55Z

> can we completely remove Arrow from the bundled Gluten Jar, and let user rely on Spark's bundled Arrow instead? I assume we don't have any customized Arrow code now: #12130

Worth doing, but a couple of things worth confirming first:

#12130 removed the dead Arrow-CSV / Arrow-Dataset JVM paths. That's progress, but there are still ~34 files under gluten-arrow/ and elsewhere that call org.apache.arrow.* (ColumnarBatches, ArrowColumnVector, ArrowWritableColumnVector, ColumnarBatchSerializer, JNI columnar batch bridges, SparkArrowUtil, etc.). Those code paths are alive on the Velox hot path and the bundled Arrow is what they currently resolve against.

If we drop the bundled Arrow:

arrow.version in gluten's poms would need to align with whatever the lowest-common-denominator target Spark ships — Arrow 12 for Spark 3.5 (DBR 16.4 ships 12.0.1, vanilla Spark 3.5.x ships 15.0.0), Arrow 18 for Spark 4.x. Compile profile per Spark version, similar to how <spark.version> is already keyed.
No 15.0.0-gluten source patches relied on: needs confirmation. The arrow-gluten.version = 15.0.0-gluten in pom.xml suggests there's at least some customization, even if the dead code removed in #12130 was the main consumer. Worth a sweep for any remaining call site that depends on a non-public Arrow API.
Testing matrix — minor: confirm no NoSuchMethodError on the Spark distros gluten claims support for, especially DBR / Cloudera flavors that ship older Arrow.

So I think it's the right long-term direction. As an immediate fix this PR makes the current shading approach internally consistent (which is independently a valid bug fix, since the partial-shading is a latent bug regardless of whether Arrow gets unbundled later). Happy to take either path — let me know if you'd prefer I close this and pursue the Arrow-unbundling work instead, or merge this as the short-term fix and treat unbundling as a follow-up.

> a Scala type mismatch caused by the Maven Shade Plugin not rewriting ScalaSignature annotations

@philo-he Good link. The partial shading also breaks downstream Scala consumers that pull the gluten jar in as a dependency, since their compile-time Arrow type doesn't match what gluten expects across the API boundary. This PR fixes that as a side effect by keeping the boundary types unshaded, so they match the public Apache Arrow types every other Scala consumer compiles against. cc author @clee704

zhztheplayer · 2026-06-03T19:11:58Z

> arrow.version in gluten's poms would need to align with whatever the lowest-common-denominator target Spark ships

Exactly. This is the right approach. Or we just adjust the version according to spark.version to match Spark's Arrow.

> No 15.0.0-gluten source patches relied on: needs confirmation.

Let's trace back through the history to confirm that. cc @jinchengchenghh

zhztheplayer

PR LGTM as tests passed, but please still follow the community updates in case users report any compatibility issues. @sezruby

zhouyuan · 2026-06-03T20:05:32Z

@sezruby thanks for the fix! Could you please also share the Arrow usage in your case - do those libs also shade their Arrow API?

> This affects any Spark workload that combines gluten with another library using Arrow C-Data

sezruby · 2026-06-03T20:37:45Z

> Could you please also share the Arrow usage in your case - do those libs also shade their Arrow API?

Sure — the trigger is Lance Java (org.lance.spark.write.LanceDataWriter), which uses the standard Arrow C-Data interface to ship Spark RecordBatches into a native (Rust) Lance writer. The relevant call is:

try (ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator)) {
  Data.exportArrayStream(allocator, sparkArrowReader, stream);
  Fragment.create(datasetUri, stream, params);
}

allocator here is org.apache.arrow.memory.BufferAllocator (the public Arrow type). Lance Java does not shade Arrow — it depends on org.apache.arrow.* directly. Arrow Java 15.0.2 is pulled in transitively via lance-core.

Three observations from running this on a Spark cluster that ships the gluten-velox bundle on the AppClassLoader:

JVM resolves org.apache.arrow.c.ArrowArrayStream to gluten's bundled copy (gluten on AppClassLoader, our jar on the child MutableURLClassLoader).
Gluten's ArrowArrayStream.allocateNew signature is allocateNew(org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator).
Lance's call passes org.apache.arrow.memory.BufferAllocator → NoSuchMethodError.
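
The three observations come down to method lookup matching on the exact parameter type. A minimal sketch with stand-in types (none of these are Arrow classes, and all names are hypothetical) shows the same shape through reflection, which throws NoSuchMethodException where already-compiled bytecode would instead fail at link time with NoSuchMethodError:

```java
/** Stand-in types only; these are not Arrow classes. */
final class DescriptorMismatchDemo {
    interface VanillaAllocator {} // plays org.apache.arrow.memory.BufferAllocator
    interface ShadedAllocator {}  // plays the relocated BufferAllocator

    /** Plays the bundled ArrowArrayStream whose signature was rewritten by shading. */
    static final class BundledStream {
        public static BundledStream allocateNew(ShadedAllocator allocator) {
            return new BundledStream();
        }
    }

    /** True if BundledStream exposes allocateNew with exactly this parameter type. */
    static boolean hasAllocateNew(Class<?> paramType) {
        try {
            BundledStream.class.getMethod("allocateNew", paramType);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("shaded param:  " + hasAllocateNew(ShadedAllocator.class));
        System.out.println("vanilla param: " + hasAllocateNew(VanillaAllocator.class));
    }
}
```

The two allocator interfaces share a role but not an identity, so a caller compiled against one cannot bind to a method declared with the other.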

Other public-API Arrow C-Data callers on the same classpath would hit the same shape — Iceberg's Arrow vector layer, Snowflake JDBC's Arrow result decoder, etc. — but I haven't tested those directly, just calling out that the issue isn't Lance-specific.

Workarounds tried at the user-jar layer before this PR: userClassPathFirst=true (cascading LinkageError on Arrow Field because Spark internals on AppClassLoader pre-load Arrow before the user-jar takes over); shade-relocating org.apache.arrow.c → org.lance.shaded.arrow.c in the user fat jar (gets past NoSuchMethodError but trips UnsatisfiedLinkError because libarrow_cdata_jni.so has the JNI natives bound to org/apache/arrow/c/jni/JniWrapper). Neither is a clean user-side fix, which is why this needs to be addressed at gluten's shading layer (or via the unbundling path you proposed above).

FelixYBW · 2026-06-04T22:30:13Z

> merge this as the short-term fix and treat unbundling as a follow-up

@sezruby can you follow up on this? We can put the version-specific code in the shim layer. My understanding is that Gluten/Velox can still use its own native Arrow version, since the data exchange between native and Spark goes through the C-Data API. So as long as the C-Data format doesn't change, native and JVM can use different Arrow versions.

We should upstream all the Arrow native patches; that effort has been paused for a while.

FelixYBW · 2026-06-04T22:44:06Z

> Sure, the trigger is Lance Java

It's something like our first version of Parquet writer offloading. We get the data from Velox, send it to Spark's Parquet writer, convert it to Arrow, then call the native Parquet writer. There are a few gotchas there. @JkSelf do you still remember?

@sezruby do you use lance-spark? Do you know how it manages the memory?

Actually, we have a long-term plan to introduce a DataFusion backend as a complement to Velox. Lance may be a good thing to try.

sezruby · 2026-06-05T05:03:12Z

Follow-up status on the unbundling discussion:

#12244 (drop the 15.0.0-gluten artifact rename + dead modify_arrow_dataset_scan_option.patch from the Arrow JVM build): open and CI-green. Lets non-ppc64le contributors build from Maven Central without running dev/build-arrow.sh. Doesn't change runtime/bundling.
#12245 (the actual unbundle — flip arrow-memory-* / arrow-vector to scope=provided, drop the org.apache.arrow shade-relocation block): closed.

cc @zhztheplayer @FelixYBW

CI on #12245 showed spark-test-spark33 and spark-test-spark34 failing. Root cause: bundled Arrow 15 is load-bearing for Spark < 3.5, because:

Spark 3.3.1 ships Arrow 7.0.0
Spark 3.4.4 ships Arrow 11.0.0
Spark 3.5.5 ships Arrow 15.0.0
Spark 4.0 / 4.1 ship Arrow 18.x

Today gluten compiles against Arrow 15 and wins classloader resolution because its bundled copy is on extraClassPath. Strip the bundled copy and on Spark 3.3 / 3.4 only the older Arrow remains at runtime — NoSuchMethodError.
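
The version reasoning can be written down as a small lookup. This is a coarse sketch, assuming that comparing Arrow major versions is a good enough proxy for binary compatibility; `ArrowVersionGap` is a hypothetical name, and the version numbers are the ones listed in this comment.

```java
import java.util.Map;

/** Hypothetical helper: does a Spark line ship an Arrow older than gluten compiles against? */
final class ArrowVersionGap {
    // Arrow major version gluten currently compiles against and bundles.
    static final int COMPILE_ARROW_MAJOR = 15;

    // Arrow major version shipped by each Spark line, per the list in this comment.
    static final Map<String, Integer> SPARK_TO_ARROW_MAJOR = Map.of(
        "3.3", 7,
        "3.4", 11,
        "3.5", 15,
        "4.0", 18,
        "4.1", 18);

    /** True when the runtime Arrow would be older than the compile-time Arrow. */
    static boolean needsBundledArrow(String sparkLine) {
        Integer runtimeMajor = SPARK_TO_ARROW_MAJOR.get(sparkLine);
        if (runtimeMajor == null) {
            throw new IllegalArgumentException("unknown Spark line: " + sparkLine);
        }
        return runtimeMajor < COMPILE_ARROW_MAJOR;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"3.3", "3.4", "3.5", "4.0", "4.1"}) {
            System.out.println("Spark " + line + " needs bundled Arrow: " + needsBundledArrow(line));
        }
    }
}
```

Only the 3.3 and 3.4 lines come out as dependent on the bundled copy, which matches the two failing CI jobs.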

Workarounds I considered:

Per-Spark-profile <arrow.version> (3.3→7.0, 3.4→11.0, 3.5→15.0, 4.x→18.1). Compiles, but means gluten on the Spark 3.3 profile is built against Arrow 7 — exactly the "memory and vector APIs should be stable across minor versions / this sounds a real risk" concern, now spanning an eight-version gap. Too much surface area without per-version testing.
Conditional <scope> per Spark profile. Mechanical but ugly, leaves #12225 latent on 3.3 / 3.4.
Drop Spark 3.3 / 3.4 support. Out of scope.

#12226 already neutralized the immediate NoSuchMethodError from #12225 by un-shading the boundary types, so users on Spark 3.5+ are unblocked today. The full unbundling is a small diff (~3 poms) once gluten drops Spark 3.3 / 3.4 — happy to revisit it then.

sezruby · 2026-06-05T18:31:58Z

> @sezruby do you use lance-spark? Do you know how it manages the memory?

Yes — the lance-spark connector. The relevant entry point is LanceDataWriter:

try (ArrowArrayStream arrowStream =
        ArrowArrayStream.allocateNew(LanceRuntime.allocator())) {
  Data.exportArrayStream(LanceRuntime.allocator(), bufferRef, arrowStream);
  return Fragment.create(writeOptions.getDatasetUri(), arrowStream, params);
}

lance-spark memory model:

One process-wide JVM BufferAllocator — LanceRuntime.allocator() is a lazily-initialized RootAllocator with size from env LANCE_ALLOCATOR_SIZE (default Long.MAX_VALUE). Global singleton, not per-task.
Spark TaskMemoryManager is not involved. Allocations don't go through acquireExecutionMemory(...), so Spark's spill/eviction can't react. Per-batch footprint is bounded by the maxBatchBytes write option (default ~64MB), and try-with-resources releases boundary buffers as soon as Fragment.create(...) returns.
JVM ↔ native handoff is Arrow C-Data only. Vectors backing the VectorSchemaRoot are exported into the ArrowArrayStream struct as raw pointers + release callback; the Rust side borrows, JVM owns + releases via Arrow's standard release contract. No double-ownership.
Lifecycle: root allocator lives for the JVM process; child allocators per batch close on stream close.
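
The first point, a lazily initialized process-wide allocator sized from an environment variable, can be sketched as follows. This is not lance-spark's code: the `Allocator` class is a stand-in for Arrow's RootAllocator, the holder idiom is just one common way to get lazy initialization, and parsing the variable as a plain decimal byte count is an assumption.

```java
/** Hypothetical sketch of a lazily created, process-wide allocator singleton. */
final class RuntimeAllocatorSketch {
    /** Stand-in for an Arrow RootAllocator; it only records its byte limit. */
    static final class Allocator {
        final long limit;

        Allocator(long limit) {
            this.limit = limit;
        }
    }

    /** Parses the configured limit; null or blank falls back to Long.MAX_VALUE. */
    static long parseLimit(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Long.MAX_VALUE;
        }
        return Long.parseLong(raw.trim()); // assumed format: decimal bytes
    }

    // Initialization-on-demand holder: the JVM creates INSTANCE on first access.
    private static final class Holder {
        static final Allocator INSTANCE =
            new Allocator(parseLimit(System.getenv("LANCE_ALLOCATOR_SIZE")));
    }

    /** Returns the same allocator for every caller in the process. */
    static Allocator allocator() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println("limit = " + allocator().limit);
    }
}
```

Because every task shares this one instance, nothing in the allocation path is scoped to a Spark task, which is the contrast drawn with gluten's per-task allocators.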

How that compares to gluten/velox:

| | lance-spark | gluten/velox |
|---|---|---|
| JVM allocator | Process-wide `RootAllocator` singleton | Per-task `BufferAllocator` from `ArrowBufferAllocators.contextInstance()` |
| Spark `MemoryManager` integration | None | Yes: `ArrowReservationListener` ↔ `acquireExecutionMemory(...)` |
| Native allocator | Rust crate's process heap (not coordinated with JVM) | Velox `MemoryPool` hierarchy, JNI-bridged back to a JVM `ReservationListener` |
| Spill | None (OOM = JVM dies) | Native spill to disk, Spark-governed |
| Memory accounting | Container RSS only | Spark UI metrics + Velox pool stats |
| Off-heap visibility | Invisible to Spark | Threaded through Spark's manager |

The difference is intent: lance-spark uses Arrow as a boundary ABI ("write a batch, free it") and never integrates with Spark's memory manager because it doesn't need to — it's a connector, not an execution engine. gluten/velox is a full alternative execution engine that has to play nicely with Spark's spill/OOM machinery, so it threads a ReservationListener through Arrow's BufferAllocator parent chain and across JNI into Velox's MemoryPool.
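
The idea of threading a reservation callback through an allocator can be sketched with stand-in types. None of the names below are gluten, Velox, or Arrow APIs; they are hypothetical and only show the shape of the pattern, where every allocation is approved by a listener that owns the budget.

```java
/** Hypothetical sketch of an allocator that asks a listener before allocating. */
final class ReservationSketch {
    /** Callback standing in for a bridge to an external memory manager. */
    interface Listener {
        boolean reserve(long bytes); // false means the manager refused the request

        void release(long bytes);
    }

    /** Listener with a fixed budget, playing the part of execution memory. */
    static final class BudgetListener implements Listener {
        private long remaining;

        BudgetListener(long budget) {
            this.remaining = budget;
        }

        @Override
        public boolean reserve(long bytes) {
            if (bytes > remaining) {
                return false;
            }
            remaining -= bytes;
            return true;
        }

        @Override
        public void release(long bytes) {
            remaining += bytes;
        }

        long remaining() {
            return remaining;
        }
    }

    /** Allocator stand-in that consults the listener on every allocation. */
    static final class AccountedAllocator {
        private final Listener listener;
        private long used;

        AccountedAllocator(Listener listener) {
            this.listener = listener;
        }

        void allocate(long bytes) {
            if (!listener.reserve(bytes)) {
                // A real engine would attempt to spill before giving up.
                throw new IllegalStateException("reservation denied for " + bytes + " bytes");
            }
            used += bytes;
        }

        void free(long bytes) {
            used -= bytes;
            listener.release(bytes);
        }

        long used() {
            return used;
        }
    }
}
```

A connector-style allocator skips the listener entirely, so the external manager never learns about its allocations and cannot react to them.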

> Actually, we have a long-term plan to introduce a DataFusion backend as a complement to Velox. Lance may be a good thing to try.

For that direction, lance-spark's allocator model is too lightweight to fill a Velox-equivalent role: you'd want the Spark memory-manager plumbing on the DataFusion side (similar to how gluten wires Velox today) before it could serve as a full execution backend. As-is, it works well as a Lance dataset reader/writer connector but isn't an engine. Worth keeping in mind when scoping the DataFusion backend.

github-actions bot added the CORE and BUILD labels Jun 2, 2026

fixup: spotless — execution element order is goals before phase

43a5b6f

philo-he approved these changes Jun 3, 2026

philo-he changed the title ~~[CORE] Fix arrow.c shading: exclude memory/vector packages so public API stays unshaded~~ [GLUTEN-12225][CORE] Fix arrow.c shading: exclude memory/vector packages so public API stays unshaded Jun 3, 2026

zhztheplayer approved these changes Jun 3, 2026

zhouyuan merged commit 5668e14 into apache:main Jun 4, 2026
61 checks passed

This was referenced Jun 5, 2026

[CORE] Drop 15.0.0-gluten Arrow version rename and depend on vanilla Apache Arrow #12244

Merged

[CORE] Unbundle Arrow memory + vector from gluten-velox-bundle (Draft) #12245

Closed

FelixYBW mentioned this pull request Jun 8, 2026

Lance-spark support in Gluten #12263

Open
