[GLUTEN-12225][CORE] Fix arrow.c shading: exclude memory/vector packages so public API stays unshaded#12226
Conversation
…API stays unshaded The bundled Arrow C-Data classes (org.apache.arrow.c.*) are correctly excluded from relocation because their native JNI binds to the original class names. However, their public API signatures take and return org.apache.arrow.memory.* and org.apache.arrow.vector.* types, which were being relocated to org.apache.gluten.shaded.*. The result: bundled ArrowArrayStream/ArrowSchema/ArrowArray/Data classes are compiled against the shaded BufferAllocator/VectorSchemaRoot, so any caller passing a vanilla Apache Arrow allocator gets NoSuchMethodError. Triggered for any Spark workload that combines gluten with another library using Arrow C-Data (Iceberg's Arrow vector layer, Lance Java's writer, Snowflake JDBC's Arrow result decoder, etc.) when gluten's bundle wins classloader resolution against vanilla Arrow. Fix: extend the relocation excludes to also keep org.apache.arrow.memory.** and org.apache.arrow.vector.** unshaded. The bundled C-Data API now matches the public Apache Arrow API. Adds dev/check-arrow-c-shading.sh which runs javap on the produced bundle jar and asserts that public method signatures reference unshaded Arrow types. Wired into package/pom.xml's verify phase via exec-maven-plugin so regressions are caught in CI. Tested against the upstream gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.6.0.jar — script exits 1 with a clear diagnosis on the broken bundle. Closes apache#12225
|
@sezruby, thanks for the PR. This fix makes sense. I recall there's a related issue that occurs at compile time when an external project introduces the Gluten JAR as a dependency: a Scala type mismatch caused by the Maven Shade Plugin not rewriting ScalaSignature annotations. My understanding is this PR also fixes that case (see https://chungmin.hashnode.dev/unraveling-a-scala-type-mismatch-mystery). One small concern is potential Arrow version conflicts, since these packages are no longer shaded. That said, the memory and vector APIs should be stable across minor versions, so I assume the risk should be low in practice. |
| <exclude>org.apache.arrow.memory.**</exclude> | ||
| <exclude>org.apache.arrow.vector.**</exclude> |
There was a problem hiding this comment.
nit: Should we directly exclude all arrow packages? E.g., org.apache.arrow.*
There was a problem hiding this comment.
The full org.apache.arrow.* exclusion would lose gluten's isolation from the user's Arrow version everywhere, not just on the C-Data boundary. The C-Data classes have to be unshaded because their JNI native lib hardcodes the original class names; arrow.memory.* and arrow.vector.* follow because they appear in arrow.c.* public method signatures. Anything else under org.apache.arrow.* (flight, algorithm, adapter, etc.) is internal to gluten's columnar batch handling and safer to keep shaded so it doesn't conflict with user Arrow. The narrow exclusion is the minimum that makes the public C-Data API self-consistent without giving up isolation elsewhere.
This sounds a real risk. Moving forward, can we completely remove Arrow from the bundled Gluten Jar, and let user rely on Spark's bundled Arrow instead? I assume we don't have any customized Arrow code now: #12130 |
Worth doing, but a couple of things worth confirming first: #12130 removed the dead Arrow-CSV / Arrow-Dataset JVM paths. That's progress, but there are still ~34 files under If we drop the bundled Arrow:
So I think it's the right long-term direction. As an immediate fix this PR makes the current shading approach internally consistent (which is independently a valid bug fix, since the partial-shading is a latent bug regardless of whether Arrow gets unbundled later). Happy to take either path — let me know if you'd prefer I close this and pursue the Arrow-unbundling work instead, or merge this as the short-term fix and treat unbundling as a follow-up.
@philo-he Good link, the partial-shading also breaks downstream Scala consumers that pull the gluten jar as a dep, since their compile-time Arrow type doesn't match what gluten expects across the API boundary. This PR fixes that as a side effect by keeping the boundary types unshaded so they match the public Apache Arrow types every other Scala consumer compiles against. cc author @clee704 |
Exactly. This is the right approach. Or we just adjust the version accroding to
Let's track back on the history to confirm that. cc @jinchengchenghh |
There was a problem hiding this comment.
PR LGTM as tests passed, but please still follow the community updates in case user reports any compatibility issues. @sezruby
|
@sezruby thanks for the fix! Could you please also share the arrow usage used in your case - does those libs also shade their arrow API?
|
Sure — the trigger is Lance Java ( try (ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator)) {
Data.exportArrayStream(allocator, sparkArrowReader, stream);
Fragment.create(datasetUri, stream, params);
}
Three observations from running this on a Spark cluster that ships the gluten-velox bundle on the AppClassLoader:
Other public-API Arrow C-Data callers on the same classpath would hit the same shape — Iceberg's Arrow vector layer, Snowflake JDBC's Arrow result decoder, etc. — but I haven't tested those directly, just calling out that the issue isn't Lance-specific. Workarounds tried at the user-jar layer before this PR: |
@sezruby can you follow up on this? We can put the version specific code in shimlayer. My understanding is Gluten/velox can still use its own native arrow version, the data exchange between native and Spark is the C-Data API. So as long as the c-data format doesn't change, we can use different versions of native vs. jvm. We should upstream all the arrow native patches, the effort paused a while. |
It's something like our first version of parquet writer offloading. We get the data from Velox, send to Spark's parquet write, convert it to Arrow, then call the native parquet. There are a few gotchas there. @JkSelf do you still remember? @sezruby do you use lance-spark? Do you know how it manages the memory? Actually we have a long term plan to introduce datafusion backend as complementary to Velox. Lance may be a good try. |
|
Followup status on the unbundling discussion:
CI on #12245 showed
Today gluten compiles against Arrow 15 and wins classloader resolution because its bundled copy is on Workarounds I considered:
#12226 already neutralized the immediate |
Yes — the lance-spark connector. The relevant entry point is try (ArrowArrayStream arrowStream =
ArrowArrayStream.allocateNew(LanceRuntime.allocator())) {
Data.exportArrayStream(LanceRuntime.allocator(), bufferRef, arrowStream);
return Fragment.create(writeOptions.getDatasetUri(), arrowStream, params);
}lance-spark memory model:
How that compares to gluten/velox:
The difference is intent: lance-spark uses Arrow as a boundary ABI ("write a batch, free it") and never integrates with Spark's memory manager because it doesn't need to — it's a connector, not an execution engine. gluten/velox is a full alternative execution engine that has to play nicely with Spark's spill/OOM machinery, so it threads a
For that direction, lance-spark's allocator model is too lightweight to drop in as a Velox-equivalent role — you'd want the Spark memory-manager plumbing on the DataFusion side (similar to how gluten wires Velox today) before it could serve as a full execution backend. As-is it works well as a Lance dataset reader/writer connector but isn't an engine. Worth keeping in mind when scoping the DataFusion backend. |
What changes were proposed in this pull request?
Extend
package/pom.xml'sorg.apache.arrowrelocation excludes to also keeporg.apache.arrow.memory.**andorg.apache.arrow.vector.**unshaded.The bundled Arrow C-Data classes (
org.apache.arrow.c.*) are correctly excluded from relocation because their native JNI binds to the original class names. However, their public API signatures take and returnorg.apache.arrow.memory.*andorg.apache.arrow.vector.*types — which were being relocated. The result: the bundledArrowArrayStream/ArrowSchema/ArrowArray/Dataclasses get compiled against the shadedBufferAllocator/VectorSchemaRoot, so any caller passing a vanilla Apache Arrow allocator hitsNoSuchMethodError.This affects any Spark workload that combines gluten with another library using Arrow C-Data (Iceberg's Arrow vector layer, Lance Java's writer, Snowflake JDBC's Arrow result decoder, etc.) when gluten's bundle wins classloader resolution against vanilla Arrow.
How was this patch tested?
Adds
dev/check-arrow-c-shading.shwhich runsjavapon the produced bundle jar and asserts that public method signatures reference unshaded Arrow types. Wired intopackage/pom.xml'sverifyphase viaexec-maven-pluginso regressions are caught in CI.Tested against the upstream
gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.6.0.jar:After applying the relocation exclude change, a freshly-built bundle should pass the same check (script exits 0). The repro from #12225 (3 lines calling
ArrowArrayStream.allocateNew(new RootAllocator(...))) goes fromNoSuchMethodErrortoOK.Closes
#12225