Conversation

@revans2 (Collaborator) commented Dec 12, 2025

This is step 3 in splitting #13368 into smaller pieces.

Description

This adds basic GPU/CPU bridge functionality, but it is off by default because performance would be poor without the thread pool and optimizer.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Performance is expected to be poor, so the feature is off by default and performance was not tested. I did add some basic tests to verify that the code works.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@greptile-apps greptile-apps bot (Contributor) commented Dec 12, 2025

Greptile Overview

Greptile Summary

This PR implements basic GPU/CPU bridge functionality that enables CPU expression evaluation within GPU execution plans. The feature is disabled by default (spark.rapids.sql.expression.cpuBridge.enabled=false) as noted in the PR description, since full performance optimization will come in future PRs.

Key Changes:

  • New GpuCpuBridgeExpression that transfers data GPU→Host→CPU→Host→GPU for CPU expression evaluation
  • Code generation support via BridgeGenerateUnsafeProjection (~940 lines), with an interpreted fallback
  • Bridge optimizer in RapidsMeta that automatically wraps incompatible expressions (~300 lines of changes)
  • Comprehensive test coverage with GpuCpuBridgeSuite and BridgeUnsafeProjectionSuite (1500+ test lines)
  • New metrics for tracking bridge processing and wait times
  • Proper resource management with ThreadLocal projections and task completion cleanup

Architecture:
The bridge sits between GPU and CPU execution by: evaluating GPU input expressions → copying to host → running CPU expression → copying result back to GPU. The implementation includes deduplication of GPU inputs using semantic equality to minimize data transfers.
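The deduplication step described above could be sketched roughly like this. Note that `dedupInputs` and `semanticKey` are hypothetical names for illustration, not the PR's actual implementation; the real code would use Spark's semantic-equality machinery (e.g. `Expression.semanticEquals` / `canonicalized`) as the key.

```scala
import scala.collection.mutable

// Hypothetical sketch: deduplicate bridge inputs by a semantic key so each
// unique column is transferred to host only once. Returns the distinct inputs
// plus, for each original input, its index into the distinct list.
def dedupInputs[E, K](inputs: Seq[E])(semanticKey: E => K): (Seq[E], Seq[Int]) = {
  val seen = mutable.HashMap.empty[K, Int]
  val kept = mutable.ArrayBuffer.empty[E]
  val remap = inputs.map { e =>
    // First time we see this key, record the input and assign it a slot.
    seen.getOrElseUpdate(semanticKey(e), { kept += e; kept.size - 1 })
  }
  (kept.toSeq, remap)
}
```

With strings standing in for expressions, `dedupInputs(Seq("a", "b", "a"))(identity)` keeps two inputs and remaps the third back to slot 0, so the duplicate is never transferred twice.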

Safety Considerations:

  • Feature is off by default with explicit configuration required
  • Proper resource cleanup via task completion hooks
  • ThreadLocal usage prevents conflicts across threads
  • Comprehensive null handling and type support
  • Excludes non-deterministic and unevaluable expressions from bridge

Confidence Score: 4/5

  • Safe to merge - feature is disabled by default and has comprehensive test coverage
  • Strong implementation with proper resource management and extensive testing. Score is 4/5 rather than 5/5 due to: (1) large amount of new code (~3000 lines) requiring careful runtime validation, (2) complex ThreadLocal usage that needs production verification, (3) bridge optimizer modifying expression trees which could have edge cases, and (4) acknowledged performance concerns that will be addressed in future PRs
  • Pay close attention to RapidsMeta.scala (complex optimizer logic with AST interaction) and BridgeGenerateUnsafeProjection.scala (large codegen module)

Important Files Changed

File Analysis

Filename | Score | Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCpuBridgeExpression.scala | 5/5 | New GPU/CPU bridge expression implementation that enables CPU expression evaluation within GPU plans, with proper resource management and metrics
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/BridgeGenerateUnsafeProjection.scala | 4/5 | Large code generation module (~940 lines) for optimized bridge projections, with codegen and interpreted fallback support
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala | 4/5 | Extensive changes (~300 lines) adding bridge optimization logic, AST interaction handling, and expression tree conversion support
tests/src/test/scala/org/apache/spark/sql/rapids/BridgeUnsafeProjectionSuite.scala | 5/5 | Comprehensive tests (~1373 lines) for projection correctness across all data types

Sequence Diagram

sequenceDiagram
    participant Plan as SparkPlan
    participant Bridge as GpuCpuBridgeExpression
    participant GPU as GPU Memory
    participant Host as Host Memory
    participant CPU as CPU Expression
    participant Builder as RapidsHostColumnBuilder
    
    Plan->>Bridge: columnarEval(batch)
    Note over Bridge: Start wait time tracking
    
    Bridge->>Bridge: Evaluate GPU input expressions
    Bridge->>GPU: Get GPU column data
    GPU-->>Bridge: GPU columns
    
    Bridge->>Bridge: Create ColumnarBatch with GPU columns
    Note over Bridge: Start processing time tracking
    
    Bridge->>Host: ColumnarToRowIterator (GPU→Host)
    Note over Host: Data copied to host memory
    
    Host->>CPU: Iterate rows through projection
    loop For each row
        CPU->>CPU: Evaluate CPU expression
        CPU->>Builder: Append result to builder
    end
    
    Builder->>Host: Build host column
    Host->>GPU: buildAndPutOnDevice() (Host→GPU)
    Note over GPU: Result copied back to GPU
    
    GPU-->>Bridge: GPU result column
    Note over Bridge: End processing time
    Bridge-->>Plan: GpuColumnVector
    Note over Bridge: End wait time

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCpuBridgeExpression.scala, line 112-113 (link)

    style: Consider cleanup for thread-local state

    The ThreadLocal for projections is never explicitly removed. In long-running applications or thread pools, consider adding cleanup logic, though Spark's task model may naturally handle this through task completion.
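A minimal sketch of the suggested cleanup, with a plain callback standing in for Spark's `TaskContext.addTaskCompletionListener`; `CachedProjection` and the object name are illustrative, not the PR's code:

```scala
// Illustrative stand-in for a cached per-thread projection.
final class CachedProjection

object BridgeProjections {
  private val cached = new ThreadLocal[CachedProjection]()

  def get(): CachedProjection = cached.get()
  def set(p: CachedProjection): Unit = cached.set(p)

  // Register a cleanup hook so pooled threads do not retain the projection
  // after the task finishes. In Spark this would be wired through
  // TaskContext.get().addTaskCompletionListener.
  def installCleanup(registerCompletion: (() => Unit) => Unit): Unit =
    registerCompletion(() => cached.remove())
}
```

Running the registered hook at task completion clears the thread-local, so a reused executor thread starts the next task without a stale projection.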

11 files reviewed, 1 comment



Copilot AI left a comment


Pull request overview

This PR adds basic GPU/CPU bridge operation functionality that enables CPU expressions to run while keeping the overall execution plan on the GPU. The bridge is disabled by default since performance optimizations (thread pool, optimizer) are not yet implemented. This represents step 3 in splitting a larger feature into smaller, reviewable pieces.

Key changes include:

  • New GpuCpuBridgeExpression that wraps CPU expressions and manages data transfer between GPU and CPU
  • Code generation support via BridgeGenerateUnsafeProjection for efficient columnar-to-row conversion
  • Configuration options to enable/disable the bridge and maintain a disallow list
  • Comprehensive test coverage for various data types and nested structures

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

File | Description
BridgeUnsafeProjectionSuite.scala | Comprehensive test suite covering all data types, nested structures, and edge cases for the bridge projection functionality
GpuCpuBridgeSuite.scala | Unit tests for bridge expression properties, configuration, and compatibility checking
BridgeGenerateUnsafeProjection.scala | Code generation implementation for efficient bridge projections with fallback to interpreted mode
GpuCpuBridgeExpression.scala | Main bridge expression implementation handling GPU-to-CPU-to-GPU data flow with metrics
RapidsMeta.scala | Metadata support for bridge expressions, including compatibility checks and conversion logic
RapidsConf.scala | Configuration options for enabling the bridge and maintaining a disallow list
TypeConverter.java | Extracted public interface for type conversion (visibility change from Scala to Java)
GpuRowToColumnarExec.scala | Visibility changes to expose TypeConverter and related methods
GpuTypeShims.scala (both versions) | Import reorganization to use the new public TypeConverter interface
GpuMetrics.scala | New metrics for tracking CPU bridge processing and wait times


@revans2 revans2 changed the title from "Add in basic GPU/CPU bridge operation" to "Add in basic GPU/CPU bridge operation [databricks]" Dec 12, 2025
revans2 commented Dec 12, 2025

build


@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCpuBridgeExpression.scala, line 54 (link)

    syntax: Comment says "Only GPU inputs are children" but the code includes cpuExpression in children. This creates inconsistency with the comment on line 42 that says cpuExpression is "not included as children"

11 files reviewed, 1 comment



revans2 commented Dec 15, 2025

CI failed because of #14009, but our CI currently has no way to skip the Databricks tests when a change touches anything even remotely related to Databricks.


revans2 commented Dec 15, 2025

build


revans2 commented Dec 15, 2025

I upmerged to make sure I had the latest fixes for what caused CI to fail.


@greptile-apps greptile-apps bot left a comment


11 files reviewed, no comments


Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

revans2 commented Dec 15, 2025

build


@greptile-apps greptile-apps bot left a comment


12 files reviewed, no comments


Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

@greptile-apps greptile-apps bot left a comment


12 files reviewed, no comments



revans2 commented Dec 15, 2025

build

@sameerz sameerz added the task Work required that improves the product but is not user facing label Dec 19, 2025
/**
* Append row value to the column builder and return the number of data bytes written
*/
public abstract double append(SpecializedGetters row, int column, RapidsHostColumnBuilder builder);

it would be nice to add a comment about the return type here (double). Why can't it be a whole-number type?

@@ -0,0 +1,43 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.

copyrights need updating.

* ahead of time. Also because structs push nulls down to the children this size should
* assume a validity even if the schema says it cannot be null.
*/
public abstract double getNullSize();

same question, why double?

val gpuInputColumns = gpuInputs.safeMap(_.columnarEval(batch))

// Time the CPU processing (columnar->row->CPU expr->columnar)
val processingStartTime = System.nanoTime()

nit, we could use GpuMetric.ns(metric) { }, but it doesn't take an optional metric.
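One way to keep the `ns`-style ergonomics while supporting an optional metric is a small wrapper. `NsMetric` below is a self-contained stand-in for `GpuMetric`, and `timeNsOpt` is a hypothetical helper sketched for this comment, not an existing spark-rapids API:

```scala
// Stand-in for GpuMetric: accumulates elapsed nanoseconds.
final class NsMetric {
  private var total = 0L
  def add(ns: Long): Unit = total += ns
  def value: Long = total
}

// Hypothetical helper: time the block only when a metric is present,
// otherwise just run it with no timing overhead.
def timeNsOpt[T](metric: Option[NsMetric])(block: => T): T = metric match {
  case Some(m) =>
    val start = System.nanoTime()
    try block finally m.add(System.nanoTime() - start)
  case None => block
}
```

The `finally` ensures the metric is updated even if the block throws, mirroring the behavior one would expect from `GpuMetric.ns`.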

// Configuration Tests
// ============================================================================

test("Bridge config controls feature enablement") {

this test is probably not necessary. What I think we actually need is a test of the fallback, either in a suite or an integration test.

(Float.MaxValue, false),
(0.0f, true), // null
(3.14159f, false),
(Float.NaN, false),

should we also test negative floats and NaN ranges? We have tested this in the past; SparkQueryCompareTestSuite has float positive/negative NaN lower/upper values.
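For reference, the extra float edge cases this comment asks about could look like the following; the bit patterns are chosen here for illustration, and the exact values used in SparkQueryCompareTestSuite may differ:

```scala
// NaN is any float with an all-ones exponent and a nonzero mantissa, so there
// is a whole range of NaN bit patterns in both sign halves of the encoding.
val posNanLower = java.lang.Float.intBitsToFloat(0x7f800001) // smallest positive-sign NaN payload
val posNanUpper = java.lang.Float.intBitsToFloat(0x7fffffff) // largest positive-sign NaN payload
val negNanLower = java.lang.Float.intBitsToFloat(0xff800001) // smallest negative-sign NaN payload
val negNanUpper = java.lang.Float.intBitsToFloat(0xffffffff) // largest negative-sign NaN payload

// Candidate additions: negative ordinary values plus the NaN range endpoints.
val extraFloatCases =
  Seq(-3.14159f, -Float.MaxValue, posNanLower, posNanUpper, negNanLower, negNanUpper)
```

Covering both sign halves of the NaN range catches bugs where a conversion path preserves only the canonical quiet NaN.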

(Double.MaxValue, false),
(0.0, true), // null
(3.141592653589793, false),
(Double.NaN, false),

same question about NaN ranges and sign.

(0L, false), // 1970-01-01 00:00:00 UTC
(-1000000L, false), // Before epoch
(0L, true), // null
(1672531200000000L, false) // 2023-01-01 00:00:00 UTC

how about h/m/s and ms?

timezoneCheck()
}
// Update our expressions to allow them to run with the bridge if possible


nit: extra blank line

)
}

test("multiple expressions") {

should we add DayTimeIntervalType and YearMonthIntervalType and possibly an unsupported type?


val numVarLenFields = exprSchemas.count {
case Schema(dt, _) => !UnsafeRow.isFixedLength(dt)
// TODO: consider large decimal and interval type

is there a follow-on issue for this TODO?

}

val writeFieldsCode = if (isTopLevel && (row == null || ctx.currentVars != null)) {
// TODO: support whole stage codegen

follow-on issue for this TODO?
