Conversation
Walkthrough
This pull request introduces a new Apache Zeppelin module (`spark-connect`).

Changes
🚥 Pre-merge checks | ✅ 1 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
Actionable comments posted: 9
🧹 Nitpick comments (11)
spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java (1)
41-44: Consider extracting the lock policy literal and hardening the utility shape.
Line 43 inlines the fairness literal. Consider promoting it to a named constant (and optionally making the class non-instantiable) for consistency and maintainability.
As per coding guidelines, "we should not use the literal values directly inside functions, they should be stored into separate variables and the variables should be used".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java` around lines 41 - 44, Extract the fairness literal from the ReentrantLock instantiation into a named private static final boolean (e.g., FAIR_LOCK = true) and replace the inline literal in getNotebookLock (and any other uses) with that constant; also harden the utility by preventing instantiation of NotebookLockManager (add a private constructor) and ensure notebookLocks and getNotebookLock remain static so the class stays a proper non-instantiable utility holder.

spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java (1)
54-89: Extract shared Spark Connect integration test harness.
This setup/context-builder pattern is repeated across multiple test classes. Moving it to a shared base/helper will reduce drift and simplify future changes.
As per coding guidelines, "if duplicate code exists it should be moved to a common method".
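The finding targets Java/JUnit, but the shape of the suggested shared harness can be sketched language-agnostically. The Python `unittest` sketch below uses hypothetical stand-in names (`FakeInterpreter`, `SparkConnectTestBase`) purely to illustrate how one base class eliminates the duplicated setup/teardown:

```python
import unittest

class FakeInterpreter:
    """Hypothetical stand-in for SparkConnectInterpreter; only the
    lifecycle methods needed to illustrate a shared harness."""
    def __init__(self):
        self.opened = False
    def open(self):
        self.opened = True
    def close(self):
        self.opened = False

class SparkConnectTestBase(unittest.TestCase):
    """Shared harness: concrete test classes inherit setup/teardown
    instead of duplicating it in every file."""
    def setUp(self):
        self.interpreter = FakeInterpreter()
        self.interpreter.open()
    def tearDown(self):
        self.interpreter.close()

class ExampleInterpreterTest(SparkConnectTestBase):
    def test_interpreter_is_open(self):
        self.assertTrue(self.interpreter.opened)
```

The same structure maps directly onto a JUnit abstract base class with `@BeforeEach`/`@AfterEach` methods.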
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java` around lines 54 - 89, The test harness setup/teardown and context builder in SparkConnectInterpreterTest (setUp, tearDown, and getInterpreterContext) are duplicated across tests—extract them into a shared test base or helper class: create a common abstract base (e.g., SparkConnectTestBase) or a TestUtils helper that provides the static setup/cleanup methods (initializing InterpreterGroup, SparkConnectInterpreter, calling interpreter.open/close) and a reusable getInterpreterContext builder that returns the configured InterpreterContext (using AngularObjectRegistry, LocalResourcePool, InterpreterOutput, mocked RemoteInterpreterEventClient); then update SparkConnectInterpreterTest to extend or call that shared base/helper and remove the duplicated code from the test class.

spark-connect/pom.xml (1)
37-38: Remove duplicated `spark.connect.version` declaration to prevent drift.
The same version is declared both globally and again in the default-active profile. Keeping one source of truth reduces maintenance risk.
As per coding guidelines, "if duplicate code exists it should be moved to a common method".
Also applies to: 124-126
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/pom.xml` around lines 37 - 38, There are two declarations of the Maven property spark.connect.version (one global and one inside the default-active profile) causing duplication and drift; remove the duplicate inside the profile (or remove the global one if you intend the profile to be canonical) so only one spark.connect.version property remains, and update/remove the matching duplicate at the other spot referenced (the other occurrence around the default-active profile) to ensure a single source of truth for Spark Connect version.

spark-connect/src/main/resources/python/zeppelin_isparkconnect.py (3)
94-99: Chain the exception for better debugging context.
When re-raising ImportError, chain it with `from None` to indicate it's an intentional replacement.
♻️ Proposed fix

```diff
 except ImportError:
-    raise ImportError(
+    raise ImportError(
         "pandas is required for toPandas(). "
-        "Install it with: pip install pandas")
+        "Install it with: pip install pandas") from None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around lines 94 - 99, In the except ImportError block that handles "import pandas as pd" (used by toPandas()), re-raise the ImportError using exception chaining with "from None" so the new ImportError intentionally replaces the original (i.e., raise ImportError("pandas is required for toPandas(). Install it with: pip install pandas") from None).
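As a standalone illustration (plain Python, independent of the PR's files), `from None` sets `__suppress_context__` on the new exception, so tracebacks omit the "During handling of the above exception, another exception occurred" chain. The module name below is deliberately fake:

```python
def load_pandas_or_fail():
    """Mimics the suggested fix: replace a low-level ImportError with a
    user-facing one, suppressing the original context via `from None`."""
    try:
        import a_module_that_does_not_exist  # hypothetical missing dependency
    except ImportError:
        raise ImportError(
            "pandas is required for toPandas(). "
            "Install it with: pip install pandas") from None

try:
    load_pandas_or_fail()
except ImportError as e:
    # `from None` marks the context as intentionally suppressed.
    print(e.__suppress_context__)  # → True
```

Without `from None`, the original ImportError would still be attached as `__context__` and printed before the replacement, which adds noise for notebook users.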
264-266: Add `stacklevel=2` for proper warning attribution.
♻️ Proposed fix

```diff
 if isinstance(data, pd.DataFrame):
     warnings.warn(
         "createDataFrame from pandas goes through Py4j serialization. "
-        "For large DataFrames, consider writing to a temp table instead.")
+        "For large DataFrames, consider writing to a temp table instead.",
+        stacklevel=2)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around lines 264 - 266, The warnings.warn call that alerts users about "createDataFrame from pandas goes through Py4j serialization..." should include stacklevel=2 so the warning points to the user's call site; update the warnings.warn invocation in zeppelin_isparkconnect.py (the warnings.warn(...) call with that exact message) to pass stacklevel=2 as an argument (e.g., warnings.warn("createDataFrame ... temp table instead.", stacklevel=2)).
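The effect of `stacklevel` can be demonstrated in isolation (plain Python, not taken from the PR; `_collect_with_warning` and `user_code` are hypothetical stand-ins for the wrapper method and the notebook user's call site):

```python
import warnings

def _collect_with_warning(stacklevel):
    # Stand-in for the wrapper's collect(): warn about a large collect.
    warnings.warn("Collecting 100000 rows to driver. This may cause OOM.",
                  stacklevel=stacklevel)

def user_code(stacklevel):
    # Represents the notebook user's call site; capture the warning so we
    # can inspect which line it was attributed to.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        _collect_with_warning(stacklevel)
    return caught[0]

# With the default stacklevel=1 the warning points at the warnings.warn
# line inside _collect_with_warning; with stacklevel=2 it points at the
# caller's line inside user_code.
w1 = user_code(1)
w2 = user_code(2)
print(w1.lineno != w2.lineno)  # → True
```

This is why both `warnings.warn` findings in this file suggest the same one-argument change: it makes the warning actionable for the notebook author rather than pointing into Zeppelin internals.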
68-73: Add `stacklevel=2` to warnings for proper caller attribution.
Without explicit `stacklevel`, warnings will point to this internal line rather than the user's code that triggered the warning.
♻️ Proposed fix

```diff
 if row_count > _COLLECT_WARN_THRESHOLD:
     warnings.warn(
         "Collecting %d rows to driver. This may cause OOM. "
         "Consider using .limit() or .toPandas() with a smaller subset."
-        % row_count)
+        % row_count,
+        stacklevel=2)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around lines 68 - 73, The warnings.warn call inside the data-collection branch should include stacklevel=2 so the warning points to the user's calling code; update the warnings.warn invocation (the one immediately before return list(self._jdf.collectAsList())) to pass stacklevel=2 as an argument while keeping the existing message and interpolation unchanged.

spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java (1)
60-65: Consider opening SparkConnectInterpreter explicitly for robustness.
The `open()` method resolves the SparkConnectInterpreter but doesn't explicitly open it. While Zeppelin may guarantee ordering, explicitly calling `sparkConnectInterpreter.open()` would make the dependency clear and prevent issues if initialization order changes.
♻️ Proposed fix

```diff
 @Override
 public void open() throws InterpreterException {
   this.sparkConnectInterpreter =
       getInterpreterInTheSameSessionByClassName(SparkConnectInterpreter.class);
+  sparkConnectInterpreter.open();
   this.sqlSplitter = new SqlSplitter();
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java` around lines 60 - 65, The open() method currently looks up the SparkConnectInterpreter instance but doesn't explicitly initialize it; update SparkConnectSqlInterpreter.open to invoke sparkConnectInterpreter.open() after obtaining it (and before using sqlSplitter), propagating or handling InterpreterException as appropriate so the dependent SparkConnectInterpreter is explicitly opened; reference the sparkConnectInterpreter field and the open() method on SparkConnectInterpreter when making this change.

spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java (1)
102-109: Redundant null check - the else branch is unreachable.
`getProperty("zeppelin.python", "python")` always returns at least "python" (the default), so `StringUtils.isNotBlank(pythonExec)` is always true and the final `return "python"` is dead code.
♻️ Proposed simplification

```diff
 @Override
 protected String getPythonExec() {
-  String pythonExec = getProperty("zeppelin.python", "python");
-  if (StringUtils.isNotBlank(pythonExec)) {
-    return pythonExec;
-  }
-  return "python";
+  return getProperty("zeppelin.python", "python");
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java` around lines 102 - 109, The getPythonExec method contains a redundant non-blank check because getProperty("zeppelin.python", "python") will always return a non-null default; simplify by returning the property value directly. Update the getPythonExec method in PySparkConnectInterpreter to simply return getProperty("zeppelin.python", "python") (remove the StringUtils.isNotBlank check and the unreachable final return) so only the direct call to getProperty remains.

spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java (2)
41-42: Consider resetting the `opened` flag in `close()` to support interpreter restart.
The `opened` flag prevents double-open but is never reset in `close()`, which could prevent the interpreter from being reopened after closing.
♻️ Proposed fix in close()

```diff
 @Override
 public void close() throws InterpreterException {
   LOGGER.info("Close IPySparkConnectInterpreter");
   super.close();
+  opened = false;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java` around lines 41 - 42, The opened flag is set to prevent double-open but isn't reset on shutdown; update the IPySparkConnectInterpreter.close() method to set opened = false (and clear any state like curIntpContext if appropriate) so the interpreter can be reopened after close; apply this change inside the close() implementation of IPySparkConnectInterpreter to mirror the semantics of open() and ensure proper restart behavior.
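The open/close contract this finding asks for is language-independent; a minimal Python sketch (hypothetical `ReopenableInterpreter`, standing in for the Java class) shows why the flag must be reset:

```python
class ReopenableInterpreter:
    """Sketch of the lifecycle contract: `opened` guards against
    double-open and is reset in close() so the interpreter can be
    restarted later."""
    def __init__(self):
        self.opened = False
        self.open_count = 0

    def open(self):
        if self.opened:          # idempotent: a second open() is a no-op
            return
        self.open_count += 1
        self.opened = True

    def close(self):
        self.opened = False      # reset so a later open() works again

interp = ReopenableInterpreter()
interp.open()
interp.open()                    # ignored by the guard
interp.close()
interp.open()                    # restart succeeds after the reset
print(interp.open_count)         # → 2
```

Without the reset in `close()`, the final `open()` above would be silently skipped and the restarted interpreter would never reinitialize.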
122-125: Unchecked cast is documented but lacks runtime safety.
While `@SuppressWarnings` is present, if a non-Dataset object is passed, this will throw a ClassCastException at runtime. Consider adding a type check or documenting the contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java` around lines 122 - 125, The method formatDataFrame currently performs an unchecked cast to (Dataset<Row>) and can throw ClassCastException at runtime; update formatDataFrame to validate the input before casting (e.g., check that df instanceof Dataset<?> and that its row type is compatible) and handle invalid types by throwing a clear IllegalArgumentException or returning a descriptive error string instead of allowing a ClassCastException; keep the call to SparkConnectUtils.showDataFrame for valid Dataset<Row> inputs and include the method name formatDataFrame, the target cast to Dataset<Row>, and the use of SparkConnectUtils.showDataFrame when locating and modifying the code.

spark-connect/src/main/resources/python/zeppelin_sparkconnect.py (1)
1-310: Consider extracting shared code to reduce duplication.
`zeppelin_sparkconnect.py` and `zeppelin_isparkconnect.py` are nearly identical. Consider extracting the common `SparkConnectDataFrame` and `SparkConnectSession` classes into a shared module to reduce maintenance burden and ensure consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around lines 1 - 310, The two files duplicate the SparkConnectDataFrame and SparkConnectSession implementations; extract these classes and any helper functions/constants they use (e.g., SparkConnectDataFrame, SparkConnectSession, _rows_to_dicts, _COLLECT_LIMIT_DEFAULT, _COLLECT_WARN_THRESHOLD) into a new shared module (e.g., zeppelin_sparkconnect_common) and replace the in-file definitions in both zeppelin_sparkconnect.py and zeppelin_isparkconnect.py with imports from that module; ensure each file still initializes its gateway/entry_point/_max_result and passes or sets any module-level state the shared module relies on, update imports and references (SparkConnectDataFrame, SparkConnectSession) accordingly, and run tests to confirm behavior is unchanged.
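One way to make the extraction clean is to have the shared classes take their gateway state as constructor arguments instead of reading module globals, so each entry script only wires up its own gateway. The sketch below uses hypothetical names (`FakeIntp` stands in for the Py4j entry point; the real module name `zeppelin_sparkconnect_common` is the reviewer's suggestion, not existing code):

```python
# zeppelin_sparkconnect_common.py (hypothetical shared module)
class SparkConnectDataFrame:
    """Shared wrapper: all gateway state is injected, so both entry
    scripts can import this class unchanged."""
    def __init__(self, jdf, intp, max_result):
        self._jdf = jdf
        self._intp = intp
        self._max_result = max_result

    def show(self, n=20):
        effective_n = min(n, self._max_result)
        print(self._intp.formatDataFrame(self._jdf, effective_n))

# Each entry script then only supplies its own gateway objects:
class FakeIntp:
    """Stub for the Java-side interpreter reached over Py4j."""
    def formatDataFrame(self, jdf, n):
        return "rows: %d" % n

df = SparkConnectDataFrame(jdf=object(), intp=FakeIntp(), max_result=10)
df.show(n=50)   # capped by max_result, prints "rows: 10"
```

Dependency injection here avoids the module-level `intp`/`_max_result` globals that currently force the two files to duplicate the class bodies.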
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java`:
- Around line 51-53: removeNotebookLock unconditionally removes entries from
notebookLocks which can create a new lock while the old one is still held;
change removeNotebookLock to only remove the map entry when the current lock for
noteId is not held and has no queued threads. Locate notebookLocks and the
methods removeNotebookLock and getNotebookLock, obtain the ReentrantLock
instance from notebookLocks (e.g., via get/computeIfPresent), and check
lock.isLocked() and lock.hasQueuedThreads() (or attempt a non-blocking tryLock
and immediately unlock) before calling notebookLocks.remove(noteId) so you only
remove locks that are truly free.
In `@spark-connect/src/main/resources/interpreter-setting.json`:
- Around line 111-117: The "spark.connect.token" interpreter setting is
currently declared with "type": "string" and must be treated as sensitive;
update the "spark.connect.token" entry in interpreter-setting.json to use
"type": "password" (matching other secret fields like jdbc/mongodb entries) so
the token is masked in UI/exports, keeping the same envName, propertyName and
defaultValue.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py`:
- Around line 136-137: The groupBy method currently returns the raw Java
GroupedData from self._jdf, breaking the wrapper pattern; update
SparkConnectDataFrame.groupBy to wrap the result in the corresponding Python
wrapper (e.g., a GroupedData / SparkConnectGroupedData instance) instead of
returning the raw Java object so callers get the same high-level API as other
transformations; locate groupBy and construct/return the proper wrapper (passing
the Java object and any required session/context like self._session) using the
existing GroupedData wrapper class used elsewhere in the module.
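The wrapper pattern the comment asks for can be shown with stub objects (everything below is hypothetical illustration: `FakeJavaDataFrame`/`FakeJavaGroupedData` stand in for the Py4j proxies, and `SparkConnectGroupedData` is the kind of wrapper the reviewer suggests):

```python
class FakeJavaGroupedData:
    """Stub for the Java GroupedData reached over Py4j."""
    def count(self):
        return "java-grouped-count"

class FakeJavaDataFrame:
    """Stub for the Java DataFrame reached over Py4j."""
    def groupBy(self, *cols):
        return FakeJavaGroupedData()

class SparkConnectGroupedData:
    """Python-side wrapper so callers never touch the raw Java object."""
    def __init__(self, jgd):
        self._jgd = jgd
    def count(self):
        # A real wrapper would wrap the result back into the DataFrame
        # wrapper; here we simply delegate for illustration.
        return self._jgd.count()

class SparkConnectDataFrame:
    def __init__(self, jdf):
        self._jdf = jdf
    def groupBy(self, *cols):
        # Wrap the result instead of leaking the raw Java object:
        return SparkConnectGroupedData(self._jdf.groupBy(*cols))

grouped = SparkConnectDataFrame(FakeJavaDataFrame()).groupBy("id")
print(type(grouped).__name__)   # → SparkConnectGroupedData
```

Returning the wrapper keeps the whole fluent chain (`df.groupBy(...).count()...`) inside the Python API surface, which is the invariant the other transformation methods already maintain.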
- Around line 52-56: The show method currently ignores the truncate parameter;
update the show function (def show) to either remove the unused truncate
argument or forward it to intp.formatDataFrame so truncation is honored—e.g.,
use the truncate value when calling intp.formatDataFrame(self._jdf, effective_n,
truncate) if the Java/Python interop supports a truncate argument, or if not
supported, drop the truncate parameter from the show signature and update any
callers; reference the show method, its truncate parameter, self._jdf and the
intp.formatDataFrame call when making the change.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py`:
- Around line 54-56: The show method currently ignores the truncate parameter;
update the show(self, n=20, truncate=True) implementation to use truncate when
formatting the DataFrame (e.g., pass the truncate flag through to
intp.formatDataFrame or apply equivalent truncation logic before printing) so
that the truncate argument affects output; locate the show method and modify the
call to intp.formatDataFrame(self._jdf, effective_n) to include the truncate
behavior using the truncate variable.
- Around line 141-142: The groupBy method currently returns the raw Java
GroupedData object from self._jdf.groupBy(*cols); change it to return the
wrapped GroupedData wrapper (e.g., construct and return
GroupedData(self._jdf.groupBy(*cols)) or whatever local wrapper class is used)
so the wrapper pattern is preserved; update the groupBy method and add any
necessary import/reference to the wrapper class (GroupedData) used elsewhere in
this module.
- Around line 89-98: The docstring for toPandas incorrectly states that it tries
to use pyarrow; update the toPandas method docstring to accurately describe the
current implementation: remove any mention of pyarrow and state that it converts
rows row-by-row via Py4j with a safety limit (limit argument, default from
zeppelin.spark.maxResult, and limit=-1 for all rows). Keep the Args section and
wording consistent with the actual behavior in toPandas.
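The limit semantics the corrected docstring should describe can be sketched in isolation (pandas is deliberately omitted; `rows_to_records` and `_MAX_RESULT` are hypothetical stand-ins for the internal conversion step and `zeppelin.spark.maxResult`):

```python
_MAX_RESULT = 1000  # stand-in for the zeppelin.spark.maxResult default

def rows_to_records(java_rows, limit=None):
    """Row-by-row conversion with the safety-limit semantics described:
    default limit comes from maxResult, and limit=-1 fetches all rows."""
    if limit is None:
        limit = _MAX_RESULT
    if limit != -1:
        java_rows = java_rows[:limit]
    return [dict(r) for r in java_rows]

rows = [{"id": i} for i in range(5)]
print(len(rows_to_records(rows, limit=3)))    # → 3
print(len(rows_to_records(rows, limit=-1)))   # → 5
```

A real `toPandas` would then feed these records to `pandas.DataFrame(records)`; the point is that the docstring must describe this Py4j row-by-row path, not a pyarrow path that the code does not take.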
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/PySparkConnectInterpreterTest.java`:
- Around line 116-117: The OR-based assertions in PySparkConnectInterpreterTest
(the assertTrue checks that use output.contains("id") ||
output.contains("message") || output.contains("hello") and the ones at the other
noted locations) are too permissive — update them to assert specific,
deterministic output by checking all expected tokens together (e.g.,
assertTrue(output.contains("id") && output.contains("message") &&
output.contains("hello")) or better,
assertTrue(output.containsAll(expectedColumnsList)) or assertEquals on the exact
string/JSON you expect), replace the ambiguous SUCCESS/ERROR acceptance with an
assertEquals to the single expected status value, and for the delta-case create
or mock deterministic test data so the expected columns/values are known; locate
and change the assertions in PySparkConnectInterpreterTest (the OR-based
assertions and the SUCCESS/ERROR branch) and adjust test setup for the delta
path to produce deterministic output.
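The "all tokens, not any token" idea is language-neutral; a small Python helper (hypothetical `assert_contains_all`, a sketch of the stricter assertion the reviewer wants) makes the contrast with the OR-based check concrete:

```python
def assert_contains_all(output, expected_tokens):
    """Deterministic check: every expected token must appear in the
    output, instead of accepting any one of them with OR."""
    missing = [t for t in expected_tokens if t not in output]
    assert not missing, "missing tokens: %s" % missing

output = "id | message\n1 | hello"
assert_contains_all(output, ["id", "message", "hello"])  # passes

# The OR-based version would accept this output too, hiding a regression:
try:
    assert_contains_all(output, ["id", "absent_column"])
except AssertionError as e:
    print("absent_column" in str(e))   # → True
```

In JUnit the equivalent is a conjunction of `assertTrue(output.contains(...))` calls, or a stream `allMatch` over the expected tokens, plus `assertEquals` on a single expected status instead of accepting SUCCESS or ERROR.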
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java`:
- Around line 72-77: The teardown only closes the Interpreter instance; also
ensure the InterpreterGroup is cleaned up to avoid lingering resources by
updating the tearDown method to check if the shared InterpreterGroup
(InterpreterGroup variable) is non-null and invoke its close/cleanup method
after closing the interpreter (e.g., if (interpreterGroup != null) {
interpreterGroup.close(); }), making sure to reference the existing interpreter
and InterpreterGroup symbols and handle exceptions similar to
interpreter.close().
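The cleanup-ordering concern generalizes beyond JUnit; a Python sketch (hypothetical `Closable` stubs standing in for the interpreter and its group) shows the try/finally shape that guarantees the group is closed even if closing the interpreter throws:

```python
class Closable:
    """Stub resource that records when it is closed."""
    def __init__(self, name, log):
        self.name = name
        self.log = log
    def close(self):
        self.log.append(self.name)

def tear_down(interpreter, interpreter_group):
    """Close the interpreter first, then the group; try/finally makes
    sure the group is cleaned up even if interpreter.close() raises."""
    try:
        if interpreter is not None:
            interpreter.close()
    finally:
        if interpreter_group is not None:
            interpreter_group.close()

log = []
tear_down(Closable("interpreter", log), Closable("group", log))
print(log)   # → ['interpreter', 'group']
```

The Java equivalent is the same null-guarded pair of close calls, with the group's cleanup in a finally block (or a second try/catch) so a failing interpreter close cannot leak the group.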
---
Nitpick comments:
In `@spark-connect/pom.xml`:
- Around line 37-38: There are two declarations of the Maven property
spark.connect.version (one global and one inside the default-active profile)
causing duplication and drift; remove the duplicate inside the profile (or
remove the global one if you intend the profile to be canonical) so only one
spark.connect.version property remains, and update/remove the matching duplicate
at the other spot referenced (the other occurrence around the default-active
profile) to ensure a single source of truth for Spark Connect version.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java`:
- Around line 41-42: The opened flag is set to prevent double-open but isn't
reset on shutdown; update the IPySparkConnectInterpreter.close() method to set
opened = false (and clear any state like curIntpContext if appropriate) so the
interpreter can be reopened after close; apply this change inside the close()
implementation of IPySparkConnectInterpreter to mirror the semantics of open()
and ensure proper restart behavior.
- Around line 122-125: The method formatDataFrame currently performs an
unchecked cast to (Dataset<Row>) and can throw ClassCastException at runtime;
update formatDataFrame to validate the input before casting (e.g., check that df
instanceof Dataset<?> and that its row type is compatible) and handle invalid
types by throwing a clear IllegalArgumentException or returning a descriptive
error string instead of allowing a ClassCastException; keep the call to
SparkConnectUtils.showDataFrame for valid Dataset<Row> inputs and include the
method name formatDataFrame, the target cast to Dataset<Row>, and the use of
SparkConnectUtils.showDataFrame when locating and modifying the code.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java`:
- Around line 41-44: Extract the fairness literal from the ReentrantLock
instantiation into a named private static final boolean (e.g., FAIR_LOCK = true)
and replace the inline literal in getNotebookLock (and any other uses) with that
constant; also harden the utility by preventing instantiation of
NotebookLockManager (add a private constructor) and ensure notebookLocks and
getNotebookLock remain static so the class stays a proper non-instantiable
utility holder.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java`:
- Around line 102-109: The getPythonExec method contains a redundant non-blank
check because getProperty("zeppelin.python", "python") will always return a
non-null default; simplify by returning the property value directly. Update the
getPythonExec method in PySparkConnectInterpreter to simply return
getProperty("zeppelin.python", "python") (remove the StringUtils.isNotBlank
check and the unreachable final return) so only the direct call to getProperty
remains.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java`:
- Around line 60-65: The open() method currently looks up the
SparkConnectInterpreter instance but doesn't explicitly initialize it; update
SparkConnectSqlInterpreter.open to invoke sparkConnectInterpreter.open() after
obtaining it (and before using sqlSplitter), propagating or handling
InterpreterException as appropriate so the dependent SparkConnectInterpreter is
explicitly opened; reference the sparkConnectInterpreter field and the open()
method on SparkConnectInterpreter when making this change.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py`:
- Around line 94-99: In the except ImportError block that handles "import pandas
as pd" (used by toPandas()), re-raise the ImportError using exception chaining
with "from None" so the new ImportError intentionally replaces the original
(i.e., raise ImportError("pandas is required for toPandas(). Install it with:
pip install pandas") from None).
- Around line 264-266: The warnings.warn call that alerts users about
"createDataFrame from pandas goes through Py4j serialization..." should include
stacklevel=2 so the warning points to the user's call site; update the
warnings.warn invocation in zeppelin_isparkconnect.py (the warnings.warn(...)
call with that exact message) to pass stacklevel=2 as an argument (e.g.,
warnings.warn("createDataFrame ... temp table instead.", stacklevel=2)).
- Around line 68-73: The warnings.warn call inside the data-collection branch
should include stacklevel=2 so the warning points to the user's calling code;
update the warnings.warn invocation (the one immediately before return
list(self._jdf.collectAsList())) to pass stacklevel=2 as an argument while
keeping the existing message and interpolation unchanged.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py`:
- Around line 1-310: The two files duplicate the SparkConnectDataFrame and
SparkConnectSession implementations; extract these classes and any helper
functions/constants they use (e.g., SparkConnectDataFrame, SparkConnectSession,
_rows_to_dicts, _COLLECT_LIMIT_DEFAULT, _COLLECT_WARN_THRESHOLD) into a new
shared module (e.g., zeppelin_sparkconnect_common) and replace the in-file
definitions in both zeppelin_sparkconnect.py and zeppelin_isparkconnect.py with
imports from that module; ensure each file still initializes its
gateway/entry_point/_max_result and passes or sets any module-level state the
shared module relies on, update imports and references (SparkConnectDataFrame,
SparkConnectSession) accordingly, and run tests to confirm behavior is
unchanged.
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java`:
- Around line 54-89: The test harness setup/teardown and context builder in
SparkConnectInterpreterTest (setUp, tearDown, and getInterpreterContext) are
duplicated across tests—extract them into a shared test base or helper class:
create a common abstract base (e.g., SparkConnectTestBase) or a TestUtils helper
that provides the static setup/cleanup methods (initializing InterpreterGroup,
SparkConnectInterpreter, calling interpreter.open/close) and a reusable
getInterpreterContext builder that returns the configured InterpreterContext
(using AngularObjectRegistry, LocalResourcePool, InterpreterOutput, mocked
RemoteInterpreterEventClient); then update SparkConnectInterpreterTest to extend
or call that shared base/helper and remove the duplicated code from the test
class.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
pom.xml
spark-connect/pom.xml
spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java
spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectUtils.java
spark-connect/src/main/resources/interpreter-setting.json
spark-connect/src/main/resources/python/zeppelin_isparkconnect.py
spark-connect/src/main/resources/python/zeppelin_sparkconnect.py
spark-connect/src/test/java/org/apache/zeppelin/spark/PySparkConnectInterpreterTest.java
spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java
spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreterTest.java
spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectUtilsTest.java
```java
public static void removeNotebookLock(String noteId) {
  notebookLocks.remove(noteId);
}
```
Avoid removing notebook locks while they may still be active.
Line 52 removes the lock unconditionally. If a thread still holds the old lock and another thread calls getNotebookLock, a new lock can be created for the same notebook, allowing concurrent execution.
🧵 Safer cleanup approach

```diff
 public static void removeNotebookLock(String noteId) {
-  notebookLocks.remove(noteId);
+  notebookLocks.computeIfPresent(noteId, (id, lock) -> {
+    if (lock.isLocked() || lock.hasQueuedThreads()) {
+      return lock;
+    }
+    return null; // remove only when truly idle
+  });
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```java
public static void removeNotebookLock(String noteId) {
  notebookLocks.computeIfPresent(noteId, (id, lock) -> {
    if (lock.isLocked() || lock.hasQueuedThreads()) {
      return lock;
    }
    return null; // remove only when truly idle
  });
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java`
around lines 51 - 53, removeNotebookLock unconditionally removes entries from
notebookLocks which can create a new lock while the old one is still held;
change removeNotebookLock to only remove the map entry when the current lock for
noteId is not held and has no queued threads. Locate notebookLocks and the
methods removeNotebookLock and getNotebookLock, obtain the ReentrantLock
instance from notebookLocks (e.g., via get/computeIfPresent), and check
lock.isLocked() and lock.hasQueuedThreads() (or attempt a non-blocking tryLock
and immediately unlock) before calling notebookLocks.remove(noteId) so you only
remove locks that are truly free.
| "spark.connect.token": { | ||
| "envName": "SPARK_CONNECT_TOKEN", | ||
| "propertyName": "spark.connect.token", | ||
| "defaultValue": "", | ||
| "description": "Authentication token for Spark Connect (optional)", | ||
| "type": "string" | ||
| }, |
🧩 Analysis chain
🏁 Script executed:
```shell
# First, locate and verify the target file exists
find . -name "interpreter-setting.json" -type f
```
Repository: Meesho/zeppelin
Length of output: 2075
🏁 Script executed:
```shell
# Run the provided verification script to check for password-type usage
rg -n '"type"\s*:\s*"password"' --glob '**/interpreter-setting.json'
```
Repository: Meesho/zeppelin
Length of output: 559
🏁 Script executed:
```shell
# Examine the structure of the spark-connect interpreter-setting.json file around lines 111-117
if [ -f "spark-connect/src/main/resources/interpreter-setting.json" ]; then
  cat -n "spark-connect/src/main/resources/interpreter-setting.json" | sed -n '105,125p'
else
  echo "File not found at expected path"
fi
```
Repository: Meesho/zeppelin
Length of output: 1010
🏁 Script executed:
```shell
# Search for other authentication token fields to understand naming patterns
rg -i '"token"' --glob '**/interpreter-setting.json' -A 3
```
Repository: Meesho/zeppelin
Length of output: 41
Mark spark.connect.token as a secret field.
Line 116 uses "type": "string" for an authentication token. Authentication tokens should be masked as sensitive values in interpreter settings. The repository already uses "type": "password" for other authentication fields (e.g., in jdbc, mongodb, elasticsearch, cassandra, and influxdb interpreters).
🔐 Suggested change

```diff
 "spark.connect.token": {
   "envName": "SPARK_CONNECT_TOKEN",
   "propertyName": "spark.connect.token",
   "defaultValue": "",
   "description": "Authentication token for Spark Connect (optional)",
-  "type": "string"
+  "type": "password"
 },
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "spark.connect.token": { | |
| "envName": "SPARK_CONNECT_TOKEN", | |
| "propertyName": "spark.connect.token", | |
| "defaultValue": "", | |
| "description": "Authentication token for Spark Connect (optional)", | |
| "type": "string" | |
| }, | |
| "spark.connect.token": { | |
| "envName": "SPARK_CONNECT_TOKEN", | |
| "propertyName": "spark.connect.token", | |
| "defaultValue": "", | |
| "description": "Authentication token for Spark Connect (optional)", | |
| "type": "password" | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/interpreter-setting.json` around lines 111 -
117, The "spark.connect.token" interpreter setting is currently declared with
"type": "string" and must be treated as sensitive; update the
"spark.connect.token" entry in interpreter-setting.json to use "type":
"password" (matching other secret fields like jdbc/mongodb entries) so the token
is masked in UI/exports, keeping the same envName, propertyName and
defaultValue.
```python
def show(self, n=20, truncate=True):
    effective_n = min(n, _max_result)
    print(intp.formatDataFrame(self._jdf, effective_n))

def collect(self, limit=None):
```
Unused truncate parameter in show() method.
The truncate argument is accepted but never used. This could mislead users expecting truncation control.
🔧 Proposed fix
Either remove the unused parameter or pass it to formatDataFrame:
-    def show(self, n=20, truncate=True):
-        effective_n = min(n, _max_result)
-        print(intp.formatDataFrame(self._jdf, effective_n))
+    def show(self, n=20, truncate=True):
+        """Show DataFrame rows. Note: truncate parameter is currently not supported."""
+        effective_n = min(n, _max_result)
+        print(intp.formatDataFrame(self._jdf, effective_n))

Alternatively, if the Java side supports truncation, wire it through.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 52-52: Unused method argument: truncate
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around
lines 52 - 56, The show method currently ignores the truncate parameter; update
the show function (def show) to either remove the unused truncate argument or
forward it to intp.formatDataFrame so truncation is honored—e.g., use the
truncate value when calling intp.formatDataFrame(self._jdf, effective_n,
truncate) if the Java/Python interop supports a truncate argument, or if not
supported, drop the truncate parameter from the show signature and update any
callers; reference the show method, its truncate parameter, self._jdf and the
intp.formatDataFrame call when making the change.
    def groupBy(self, *cols):
        return self._jdf.groupBy(*cols)
groupBy returns raw Java object instead of wrapped DataFrame.
Unlike other transformation methods that return SparkConnectDataFrame, groupBy returns the raw Java GroupedData object, breaking the wrapper pattern and exposing internals.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around
lines 136 - 137, The groupBy method currently returns the raw Java GroupedData
from self._jdf, breaking the wrapper pattern; update
SparkConnectDataFrame.groupBy to wrap the result in the corresponding Python
wrapper (e.g., a GroupedData / SparkConnectGroupedData instance) instead of
returning the raw Java object so callers get the same high-level API as other
transformations; locate groupBy and construct/return the proper wrapper (passing
the Java object and any required session/context like self._session) using the
existing GroupedData wrapper class used elsewhere in the module.
    def show(self, n=20, truncate=True):
        effective_n = min(n, _max_result)
        print(intp.formatDataFrame(self._jdf, effective_n))
Unused truncate parameter - same issue as in zeppelin_isparkconnect.py.
The truncate argument is accepted but not used, which could mislead users.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 54-54: Unused method argument: truncate
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around
lines 54 - 56, The show method currently ignores the truncate parameter; update
the show(self, n=20, truncate=True) implementation to use truncate when
formatting the DataFrame (e.g., pass the truncate flag through to
intp.formatDataFrame or apply equivalent truncation logic before printing) so
that the truncate argument affects output; locate the show method and modify the
call to intp.formatDataFrame(self._jdf, effective_n) to include the truncate
behavior using the truncate variable.
    def toPandas(self, limit=None):
        """Convert to pandas DataFrame. Applies a safety limit.

        Tries to use pyarrow for efficient serialization if available,
        otherwise falls back to row-by-row conversion through Py4j.

        Args:
            limit: Max rows. Defaults to zeppelin.spark.maxResult.
                Pass limit=-1 for all rows (use with caution on large data).
        """
Misleading docstring: mentions pyarrow but doesn't use it.
The docstring claims "Tries to use pyarrow for efficient serialization if available" but the implementation doesn't check for or use pyarrow - it always uses row-by-row conversion through Py4j.
📝 Proposed fix - update docstring to match implementation
     def toPandas(self, limit=None):
         """Convert to pandas DataFrame. Applies a safety limit.

-        Tries to use pyarrow for efficient serialization if available,
-        otherwise falls back to row-by-row conversion through Py4j.
+        Uses row-by-row conversion through Py4j.

         Args:
             limit: Max rows. Defaults to zeppelin.spark.maxResult.
                 Pass limit=-1 for all rows (use with caution on large data).
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around
lines 89 - 98, The docstring for toPandas incorrectly states that it tries to
use pyarrow; update the toPandas method docstring to accurately describe the
current implementation: remove any mention of pyarrow and state that it converts
rows row-by-row via Py4j with a safety limit (limit argument, default from
zeppelin.spark.maxResult, and limit=-1 for all rows). Keep the Args section and
wording consistent with the actual behavior in toPandas.
    def groupBy(self, *cols):
        return self._jdf.groupBy(*cols)
groupBy returns raw Java object instead of wrapped DataFrame.
Same issue as in zeppelin_isparkconnect.py - breaks the wrapper pattern by returning unwrapped Java GroupedData.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around
lines 141 - 142, The groupBy method currently returns the raw Java GroupedData
object from self._jdf.groupBy(*cols); change it to return the wrapped
GroupedData wrapper (e.g., construct and return
GroupedData(self._jdf.groupBy(*cols)) or whatever local wrapper class is used)
so the wrapper pattern is preserved; update the groupBy method and add any
necessary import/reference to the wrapper class (GroupedData) used elsewhere in
this module.
    assertTrue(output.contains("id") || output.contains("message") || output.contains("hello"),
        "Output should contain query results: " + output);
Tighten assertions—current checks are too permissive to catch regressions reliably.
The OR-based output checks (Lines 116-117, 127-128, 151-152) can pass on partial/incorrect outputs, and Line 139-141 accepts both SUCCESS and ERROR. Consider making these assertions deterministic (e.g., assert expected columns/values together, or set up deterministic test data for the delta-case path).
Also applies to: 127-128, 139-141, 151-152
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/PySparkConnectInterpreterTest.java`
around lines 116 - 117, The OR-based assertions in PySparkConnectInterpreterTest
(the assertTrue checks that use output.contains("id") ||
output.contains("message") || output.contains("hello") and the ones at the other
noted locations) are too permissive — update them to assert specific,
deterministic output by checking all expected tokens together (e.g.,
assertTrue(output.contains("id") && output.contains("message") &&
output.contains("hello")) or better,
assertTrue(output.containsAll(expectedColumnsList)) or assertEquals on the exact
string/JSON you expect), replace the ambiguous SUCCESS/ERROR acceptance with an
assertEquals to the single expected status value, and for the delta-case create
or mock deterministic test data so the expected columns/values are known; locate
and change the assertions in PySparkConnectInterpreterTest (the OR-based
assertions and the SUCCESS/ERROR branch) and adjust test setup for the delta
path to produce deterministic output.
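The all-tokens-together check the prompt recommends reduces to a small predicate. A Python stand-in mirroring the JUnit change, for illustration only:

```python
# Require every expected token, not just any one of them, so a partial or
# wrong result fails the check instead of slipping through.

def contains_all(output, expected):
    return all(token in output for token in expected)

expected_tokens = ["id", "message", "hello"]
print(contains_all("id | message\n1 | hello", expected_tokens))  # True
print(contains_all("id only", expected_tokens))                  # False: partial output rejected
```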
    @AfterAll
    public static void tearDown() throws InterpreterException {
      if (interpreter != null) {
        interpreter.close();
      }
    }
Close the InterpreterGroup in teardown as well.
InterpreterGroup owns session/lifecycle state. Line 73 currently only closes the interpreter instance; group cleanup should also run to avoid lingering resources in integration test runs.
🧹 Suggested teardown update
     @AfterAll
     public static void tearDown() throws InterpreterException {
-      if (interpreter != null) {
-        interpreter.close();
-      }
+      if (intpGroup != null) {
+        intpGroup.close();
+      } else if (interpreter != null) {
+        interpreter.close();
+      }
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java`
around lines 72 - 77, The teardown only closes the Interpreter instance; also
ensure the InterpreterGroup is cleaned up to avoid lingering resources by
updating the tearDown method to check if the shared InterpreterGroup
(InterpreterGroup variable) is non-null and invoke its close/cleanup method
after closing the interpreter (e.g., if (interpreterGroup != null) {
interpreterGroup.close(); }), making sure to reference the existing interpreter
and InterpreterGroup symbols and handle exceptions similar to
interpreter.close().
What is this PR for?
A few sentences describing the overall goals of the pull request's commits.
First time? Check out the contributing guide - https://zeppelin.apache.org/contribution/contributions.html
What type of PR is it?
Bug Fix
Improvement
Feature
Documentation
Hot Fix
Refactoring
Please leave your type of PR only
Todos
What is the Jira issue?
How should this be tested?
Screenshots (if appropriate)
Questions:
Summary by CodeRabbit
New Features
Tests