Conversation
Walkthrough
This pull request introduces a new Apache Zeppelin module (`spark-connect`).

Changes
🚥 Pre-merge checks | ✅ 1 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
Actionable comments posted: 9
🧹 Nitpick comments (11)
spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java (1)
41-44: Consider extracting the lock policy literal and hardening the utility shape.
Line 43 inlines the fairness literal. Consider promoting it to a named constant (and optionally making the class non-instantiable) for consistency and maintainability.
As per coding guidelines, "we should not use the literal values directly inside functions, they should be stored into separate variables and the variables should be used".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java` around lines 41 - 44, Extract the fairness literal from the ReentrantLock instantiation into a named private static final boolean (e.g., FAIR_LOCK = true) and replace the inline literal in getNotebookLock (and any other uses) with that constant; also harden the utility by preventing instantiation of NotebookLockManager (add a private constructor) and ensure notebookLocks and getNotebookLock remain static so the class stays a proper non-instantiable utility holder.

spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java (1)
54-89: Extract shared Spark Connect integration test harness.
This setup/context-builder pattern is repeated across multiple test classes. Moving it to a shared base/helper will reduce drift and simplify future changes.
As per coding guidelines, "if duplicate code exists it should be moved to a common method".
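The finding targets Java/JUnit, but the shape of the suggested shared harness can be sketched language-agnostically. The Python `unittest` sketch below uses hypothetical stand-in names (`FakeInterpreter`, `SparkConnectTestBase`) purely to illustrate how one base class eliminates the duplicated setup/teardown:

```python
import unittest

class FakeInterpreter:
    """Hypothetical stand-in for SparkConnectInterpreter; only the
    lifecycle methods needed to illustrate a shared harness."""
    def __init__(self):
        self.opened = False
    def open(self):
        self.opened = True
    def close(self):
        self.opened = False

class SparkConnectTestBase(unittest.TestCase):
    """Shared harness: concrete test classes inherit setup/teardown
    instead of duplicating it in every file."""
    def setUp(self):
        self.interpreter = FakeInterpreter()
        self.interpreter.open()
    def tearDown(self):
        self.interpreter.close()

class ExampleInterpreterTest(SparkConnectTestBase):
    def test_interpreter_is_open(self):
        self.assertTrue(self.interpreter.opened)
```

The same structure maps directly onto a JUnit abstract base class with `@BeforeEach`/`@AfterEach` methods.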
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java` around lines 54 - 89, The test harness setup/teardown and context builder in SparkConnectInterpreterTest (setUp, tearDown, and getInterpreterContext) are duplicated across tests—extract them into a shared test base or helper class: create a common abstract base (e.g., SparkConnectTestBase) or a TestUtils helper that provides the static setup/cleanup methods (initializing InterpreterGroup, SparkConnectInterpreter, calling interpreter.open/close) and a reusable getInterpreterContext builder that returns the configured InterpreterContext (using AngularObjectRegistry, LocalResourcePool, InterpreterOutput, mocked RemoteInterpreterEventClient); then update SparkConnectInterpreterTest to extend or call that shared base/helper and remove the duplicated code from the test class.

spark-connect/pom.xml (1)
37-38: Remove duplicated `spark.connect.version` declaration to prevent drift.
The same version is declared both globally and again in the default-active profile. Keeping one source of truth reduces maintenance risk.
As per coding guidelines, "if duplicate code exists it should be moved to a common method".
Also applies to: 124-126
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/pom.xml` around lines 37 - 38, There are two declarations of the Maven property spark.connect.version (one global and one inside the default-active profile) causing duplication and drift; remove the duplicate inside the profile (or remove the global one if you intend the profile to be canonical) so only one spark.connect.version property remains, and update/remove the matching duplicate at the other spot referenced (the other occurrence around the default-active profile) to ensure a single source of truth for Spark Connect version.

spark-connect/src/main/resources/python/zeppelin_isparkconnect.py (3)
94-99: Chain the exception for better debugging context.
When re-raising ImportError, chain it with `from None` to indicate it's an intentional replacement.
♻️ Proposed fix

```diff
 except ImportError:
-    raise ImportError(
+    raise ImportError(
         "pandas is required for toPandas(). "
-        "Install it with: pip install pandas")
+        "Install it with: pip install pandas") from None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around lines 94 - 99, In the except ImportError block that handles "import pandas as pd" (used by toPandas()), re-raise the ImportError using exception chaining with "from None" so the new ImportError intentionally replaces the original (i.e., raise ImportError("pandas is required for toPandas(). Install it with: pip install pandas") from None).
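As a standalone illustration (plain Python, independent of the PR's files), `from None` sets `__suppress_context__` on the new exception, so tracebacks omit the "During handling of the above exception, another exception occurred" chain. The module name below is deliberately fake:

```python
def load_pandas_or_fail():
    """Mimics the suggested fix: replace a low-level ImportError with a
    user-facing one, suppressing the original context via `from None`."""
    try:
        import a_module_that_does_not_exist  # hypothetical missing dependency
    except ImportError:
        raise ImportError(
            "pandas is required for toPandas(). "
            "Install it with: pip install pandas") from None

try:
    load_pandas_or_fail()
except ImportError as e:
    # `from None` marks the context as intentionally suppressed.
    print(e.__suppress_context__)  # → True
```

Without `from None`, the original ImportError would still be attached as `__context__` and printed before the replacement, which adds noise for notebook users.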
264-266: Add `stacklevel=2` for proper warning attribution.
♻️ Proposed fix

```diff
 if isinstance(data, pd.DataFrame):
     warnings.warn(
         "createDataFrame from pandas goes through Py4j serialization. "
-        "For large DataFrames, consider writing to a temp table instead.")
+        "For large DataFrames, consider writing to a temp table instead.",
+        stacklevel=2)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around lines 264 - 266, The warnings.warn call that alerts users about "createDataFrame from pandas goes through Py4j serialization..." should include stacklevel=2 so the warning points to the user's call site; update the warnings.warn invocation in zeppelin_isparkconnect.py (the warnings.warn(...) call with that exact message) to pass stacklevel=2 as an argument (e.g., warnings.warn("createDataFrame ... temp table instead.", stacklevel=2)).
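The effect of `stacklevel` can be demonstrated in isolation (plain Python, not taken from the PR; `_collect_with_warning` and `user_code` are hypothetical stand-ins for the wrapper method and the notebook user's call site):

```python
import warnings

def _collect_with_warning(stacklevel):
    # Stand-in for the wrapper's collect(): warn about a large collect.
    warnings.warn("Collecting 100000 rows to driver. This may cause OOM.",
                  stacklevel=stacklevel)

def user_code(stacklevel):
    # Represents the notebook user's call site; capture the warning so we
    # can inspect which line it was attributed to.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        _collect_with_warning(stacklevel)
    return caught[0]

# With the default stacklevel=1 the warning points at the warnings.warn
# line inside _collect_with_warning; with stacklevel=2 it points at the
# caller's line inside user_code.
w1 = user_code(1)
w2 = user_code(2)
print(w1.lineno != w2.lineno)  # → True
```

This is why both `warnings.warn` findings in this file suggest the same one-argument change: it makes the warning actionable for the notebook author rather than pointing into Zeppelin internals.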
68-73: Add `stacklevel=2` to warnings for proper caller attribution.
Without explicit `stacklevel`, warnings will point to this internal line rather than the user's code that triggered the warning.
♻️ Proposed fix

```diff
 if row_count > _COLLECT_WARN_THRESHOLD:
     warnings.warn(
         "Collecting %d rows to driver. This may cause OOM. "
         "Consider using .limit() or .toPandas() with a smaller subset."
-        % row_count)
+        % row_count,
+        stacklevel=2)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around lines 68 - 73, The warnings.warn call inside the data-collection branch should include stacklevel=2 so the warning points to the user's calling code; update the warnings.warn invocation (the one immediately before return list(self._jdf.collectAsList())) to pass stacklevel=2 as an argument while keeping the existing message and interpolation unchanged.

spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java (1)
60-65: Consider opening SparkConnectInterpreter explicitly for robustness.
The `open()` method resolves the SparkConnectInterpreter but doesn't explicitly open it. While Zeppelin may guarantee ordering, explicitly calling `sparkConnectInterpreter.open()` would make the dependency clear and prevent issues if initialization order changes.
♻️ Proposed fix

```diff
 @Override
 public void open() throws InterpreterException {
   this.sparkConnectInterpreter =
       getInterpreterInTheSameSessionByClassName(SparkConnectInterpreter.class);
+  sparkConnectInterpreter.open();
   this.sqlSplitter = new SqlSplitter();
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java` around lines 60 - 65, The open() method currently looks up the SparkConnectInterpreter instance but doesn't explicitly initialize it; update SparkConnectSqlInterpreter.open to invoke sparkConnectInterpreter.open() after obtaining it (and before using sqlSplitter), propagating or handling InterpreterException as appropriate so the dependent SparkConnectInterpreter is explicitly opened; reference the sparkConnectInterpreter field and the open() method on SparkConnectInterpreter when making this change.

spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java (1)
102-109: Redundant null check - the else branch is unreachable.
`getProperty("zeppelin.python", "python")` always returns at least "python" (the default), so `StringUtils.isNotBlank(pythonExec)` is always true and the final `return "python"` is dead code.
♻️ Proposed simplification

```diff
 @Override
 protected String getPythonExec() {
-  String pythonExec = getProperty("zeppelin.python", "python");
-  if (StringUtils.isNotBlank(pythonExec)) {
-    return pythonExec;
-  }
-  return "python";
+  return getProperty("zeppelin.python", "python");
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java` around lines 102 - 109, The getPythonExec method contains a redundant non-blank check because getProperty("zeppelin.python", "python") will always return a non-null default; simplify by returning the property value directly. Update the getPythonExec method in PySparkConnectInterpreter to simply return getProperty("zeppelin.python", "python") (remove the StringUtils.isNotBlank check and the unreachable final return) so only the direct call to getProperty remains.

spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java (2)
41-42: Consider resetting the `opened` flag in `close()` to support interpreter restart.
The `opened` flag prevents double-open but is never reset in `close()`, which could prevent the interpreter from being reopened after closing.
♻️ Proposed fix in close()

```diff
 @Override
 public void close() throws InterpreterException {
   LOGGER.info("Close IPySparkConnectInterpreter");
   super.close();
+  opened = false;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java` around lines 41 - 42, The opened flag is set to prevent double-open but isn't reset on shutdown; update the IPySparkConnectInterpreter.close() method to set opened = false (and clear any state like curIntpContext if appropriate) so the interpreter can be reopened after close; apply this change inside the close() implementation of IPySparkConnectInterpreter to mirror the semantics of open() and ensure proper restart behavior.
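The open/close contract this finding asks for is language-independent; a minimal Python sketch (hypothetical `ReopenableInterpreter`, standing in for the Java class) shows why the flag must be reset:

```python
class ReopenableInterpreter:
    """Sketch of the lifecycle contract: `opened` guards against
    double-open and is reset in close() so the interpreter can be
    restarted later."""
    def __init__(self):
        self.opened = False
        self.open_count = 0

    def open(self):
        if self.opened:          # idempotent: a second open() is a no-op
            return
        self.open_count += 1
        self.opened = True

    def close(self):
        self.opened = False      # reset so a later open() works again

interp = ReopenableInterpreter()
interp.open()
interp.open()                    # ignored by the guard
interp.close()
interp.open()                    # restart succeeds after the reset
print(interp.open_count)         # → 2
```

Without the reset in `close()`, the final `open()` above would be silently skipped and the restarted interpreter would never reinitialize.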
122-125: Unchecked cast is documented but lacks runtime safety.
While `@SuppressWarnings` is present, if a non-Dataset object is passed, this will throw a ClassCastException at runtime. Consider adding a type check or documenting the contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java` around lines 122 - 125, The method formatDataFrame currently performs an unchecked cast to (Dataset<Row>) and can throw ClassCastException at runtime; update formatDataFrame to validate the input before casting (e.g., check that df instanceof Dataset<?> and that its row type is compatible) and handle invalid types by throwing a clear IllegalArgumentException or returning a descriptive error string instead of allowing a ClassCastException; keep the call to SparkConnectUtils.showDataFrame for valid Dataset<Row> inputs and include the method name formatDataFrame, the target cast to Dataset<Row>, and the use of SparkConnectUtils.showDataFrame when locating and modifying the code.

spark-connect/src/main/resources/python/zeppelin_sparkconnect.py (1)
1-310: Consider extracting shared code to reduce duplication.
`zeppelin_sparkconnect.py` and `zeppelin_isparkconnect.py` are nearly identical. Consider extracting the common `SparkConnectDataFrame` and `SparkConnectSession` classes into a shared module to reduce maintenance burden and ensure consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around lines 1 - 310, The two files duplicate the SparkConnectDataFrame and SparkConnectSession implementations; extract these classes and any helper functions/constants they use (e.g., SparkConnectDataFrame, SparkConnectSession, _rows_to_dicts, _COLLECT_LIMIT_DEFAULT, _COLLECT_WARN_THRESHOLD) into a new shared module (e.g., zeppelin_sparkconnect_common) and replace the in-file definitions in both zeppelin_sparkconnect.py and zeppelin_isparkconnect.py with imports from that module; ensure each file still initializes its gateway/entry_point/_max_result and passes or sets any module-level state the shared module relies on, update imports and references (SparkConnectDataFrame, SparkConnectSession) accordingly, and run tests to confirm behavior is unchanged.
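One way to make the extraction clean is to have the shared classes take their gateway state as constructor arguments instead of reading module globals, so each entry script only wires up its own gateway. The sketch below uses hypothetical names (`FakeIntp` stands in for the Py4j entry point; the real module name `zeppelin_sparkconnect_common` is the reviewer's suggestion, not existing code):

```python
# zeppelin_sparkconnect_common.py (hypothetical shared module)
class SparkConnectDataFrame:
    """Shared wrapper: all gateway state is injected, so both entry
    scripts can import this class unchanged."""
    def __init__(self, jdf, intp, max_result):
        self._jdf = jdf
        self._intp = intp
        self._max_result = max_result

    def show(self, n=20):
        effective_n = min(n, self._max_result)
        print(self._intp.formatDataFrame(self._jdf, effective_n))

# Each entry script then only supplies its own gateway objects:
class FakeIntp:
    """Stub for the Java-side interpreter reached over Py4j."""
    def formatDataFrame(self, jdf, n):
        return "rows: %d" % n

df = SparkConnectDataFrame(jdf=object(), intp=FakeIntp(), max_result=10)
df.show(n=50)   # capped by max_result, prints "rows: 10"
```

Dependency injection here avoids the module-level `intp`/`_max_result` globals that currently force the two files to duplicate the class bodies.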
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java`:
- Around line 51-53: removeNotebookLock unconditionally removes entries from
notebookLocks which can create a new lock while the old one is still held;
change removeNotebookLock to only remove the map entry when the current lock for
noteId is not held and has no queued threads. Locate notebookLocks and the
methods removeNotebookLock and getNotebookLock, obtain the ReentrantLock
instance from notebookLocks (e.g., via get/computeIfPresent), and check
lock.isLocked() and lock.hasQueuedThreads() (or attempt a non-blocking tryLock
and immediately unlock) before calling notebookLocks.remove(noteId) so you only
remove locks that are truly free.
In `@spark-connect/src/main/resources/interpreter-setting.json`:
- Around line 111-117: The "spark.connect.token" interpreter setting is
currently declared with "type": "string" and must be treated as sensitive;
update the "spark.connect.token" entry in interpreter-setting.json to use
"type": "password" (matching other secret fields like jdbc/mongodb entries) so
the token is masked in UI/exports, keeping the same envName, propertyName and
defaultValue.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py`:
- Around line 136-137: The groupBy method currently returns the raw Java
GroupedData from self._jdf, breaking the wrapper pattern; update
SparkConnectDataFrame.groupBy to wrap the result in the corresponding Python
wrapper (e.g., a GroupedData / SparkConnectGroupedData instance) instead of
returning the raw Java object so callers get the same high-level API as other
transformations; locate groupBy and construct/return the proper wrapper (passing
the Java object and any required session/context like self._session) using the
existing GroupedData wrapper class used elsewhere in the module.
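The wrapper pattern the comment asks for can be shown with stub objects (everything below is hypothetical illustration: `FakeJavaDataFrame`/`FakeJavaGroupedData` stand in for the Py4j proxies, and `SparkConnectGroupedData` is the kind of wrapper the reviewer suggests):

```python
class FakeJavaGroupedData:
    """Stub for the Java GroupedData reached over Py4j."""
    def count(self):
        return "java-grouped-count"

class FakeJavaDataFrame:
    """Stub for the Java DataFrame reached over Py4j."""
    def groupBy(self, *cols):
        return FakeJavaGroupedData()

class SparkConnectGroupedData:
    """Python-side wrapper so callers never touch the raw Java object."""
    def __init__(self, jgd):
        self._jgd = jgd
    def count(self):
        # A real wrapper would wrap the result back into the DataFrame
        # wrapper; here we simply delegate for illustration.
        return self._jgd.count()

class SparkConnectDataFrame:
    def __init__(self, jdf):
        self._jdf = jdf
    def groupBy(self, *cols):
        # Wrap the result instead of leaking the raw Java object:
        return SparkConnectGroupedData(self._jdf.groupBy(*cols))

grouped = SparkConnectDataFrame(FakeJavaDataFrame()).groupBy("id")
print(type(grouped).__name__)   # → SparkConnectGroupedData
```

Returning the wrapper keeps the whole fluent chain (`df.groupBy(...).count()...`) inside the Python API surface, which is the invariant the other transformation methods already maintain.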
- Around line 52-56: The show method currently ignores the truncate parameter;
update the show function (def show) to either remove the unused truncate
argument or forward it to intp.formatDataFrame so truncation is honored—e.g.,
use the truncate value when calling intp.formatDataFrame(self._jdf, effective_n,
truncate) if the Java/Python interop supports a truncate argument, or if not
supported, drop the truncate parameter from the show signature and update any
callers; reference the show method, its truncate parameter, self._jdf and the
intp.formatDataFrame call when making the change.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py`:
- Around line 54-56: The show method currently ignores the truncate parameter;
update the show(self, n=20, truncate=True) implementation to use truncate when
formatting the DataFrame (e.g., pass the truncate flag through to
intp.formatDataFrame or apply equivalent truncation logic before printing) so
that the truncate argument affects output; locate the show method and modify the
call to intp.formatDataFrame(self._jdf, effective_n) to include the truncate
behavior using the truncate variable.
- Around line 141-142: The groupBy method currently returns the raw Java
GroupedData object from self._jdf.groupBy(*cols); change it to return the
wrapped GroupedData wrapper (e.g., construct and return
GroupedData(self._jdf.groupBy(*cols)) or whatever local wrapper class is used)
so the wrapper pattern is preserved; update the groupBy method and add any
necessary import/reference to the wrapper class (GroupedData) used elsewhere in
this module.
- Around line 89-98: The docstring for toPandas incorrectly states that it tries
to use pyarrow; update the toPandas method docstring to accurately describe the
current implementation: remove any mention of pyarrow and state that it converts
rows row-by-row via Py4j with a safety limit (limit argument, default from
zeppelin.spark.maxResult, and limit=-1 for all rows). Keep the Args section and
wording consistent with the actual behavior in toPandas.
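The limit semantics the corrected docstring should describe can be sketched in isolation (pandas is deliberately omitted; `rows_to_records` and `_MAX_RESULT` are hypothetical stand-ins for the internal conversion step and `zeppelin.spark.maxResult`):

```python
_MAX_RESULT = 1000  # stand-in for the zeppelin.spark.maxResult default

def rows_to_records(java_rows, limit=None):
    """Row-by-row conversion with the safety-limit semantics described:
    default limit comes from maxResult, and limit=-1 fetches all rows."""
    if limit is None:
        limit = _MAX_RESULT
    if limit != -1:
        java_rows = java_rows[:limit]
    return [dict(r) for r in java_rows]

rows = [{"id": i} for i in range(5)]
print(len(rows_to_records(rows, limit=3)))    # → 3
print(len(rows_to_records(rows, limit=-1)))   # → 5
```

A real `toPandas` would then feed these records to `pandas.DataFrame(records)`; the point is that the docstring must describe this Py4j row-by-row path, not a pyarrow path that the code does not take.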
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/PySparkConnectInterpreterTest.java`:
- Around line 116-117: The OR-based assertions in PySparkConnectInterpreterTest
(the assertTrue checks that use output.contains("id") ||
output.contains("message") || output.contains("hello") and the ones at the other
noted locations) are too permissive — update them to assert specific,
deterministic output by checking all expected tokens together (e.g.,
assertTrue(output.contains("id") && output.contains("message") &&
output.contains("hello")) or better,
assertTrue(output.containsAll(expectedColumnsList)) or assertEquals on the exact
string/JSON you expect), replace the ambiguous SUCCESS/ERROR acceptance with an
assertEquals to the single expected status value, and for the delta-case create
or mock deterministic test data so the expected columns/values are known; locate
and change the assertions in PySparkConnectInterpreterTest (the OR-based
assertions and the SUCCESS/ERROR branch) and adjust test setup for the delta
path to produce deterministic output.
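The "all tokens, not any token" idea is language-neutral; a small Python helper (hypothetical `assert_contains_all`, a sketch of the stricter assertion the reviewer wants) makes the contrast with the OR-based check concrete:

```python
def assert_contains_all(output, expected_tokens):
    """Deterministic check: every expected token must appear in the
    output, instead of accepting any one of them with OR."""
    missing = [t for t in expected_tokens if t not in output]
    assert not missing, "missing tokens: %s" % missing

output = "id | message\n1 | hello"
assert_contains_all(output, ["id", "message", "hello"])  # passes

# The OR-based version would accept this output too, hiding a regression:
try:
    assert_contains_all(output, ["id", "absent_column"])
except AssertionError as e:
    print("absent_column" in str(e))   # → True
```

In JUnit the equivalent is a conjunction of `assertTrue(output.contains(...))` calls, or a stream `allMatch` over the expected tokens, plus `assertEquals` on a single expected status instead of accepting SUCCESS or ERROR.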
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java`:
- Around line 72-77: The teardown only closes the Interpreter instance; also
ensure the InterpreterGroup is cleaned up to avoid lingering resources by
updating the tearDown method to check if the shared InterpreterGroup
(InterpreterGroup variable) is non-null and invoke its close/cleanup method
after closing the interpreter (e.g., if (interpreterGroup != null) {
interpreterGroup.close(); }), making sure to reference the existing interpreter
and InterpreterGroup symbols and handle exceptions similar to
interpreter.close().
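The cleanup-ordering concern generalizes beyond JUnit; a Python sketch (hypothetical `Closable` stubs standing in for the interpreter and its group) shows the try/finally shape that guarantees the group is closed even if closing the interpreter throws:

```python
class Closable:
    """Stub resource that records when it is closed."""
    def __init__(self, name, log):
        self.name = name
        self.log = log
    def close(self):
        self.log.append(self.name)

def tear_down(interpreter, interpreter_group):
    """Close the interpreter first, then the group; try/finally makes
    sure the group is cleaned up even if interpreter.close() raises."""
    try:
        if interpreter is not None:
            interpreter.close()
    finally:
        if interpreter_group is not None:
            interpreter_group.close()

log = []
tear_down(Closable("interpreter", log), Closable("group", log))
print(log)   # → ['interpreter', 'group']
```

The Java equivalent is the same null-guarded pair of close calls, with the group's cleanup in a finally block (or a second try/catch) so a failing interpreter close cannot leak the group.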
---
Nitpick comments:
In `@spark-connect/pom.xml`:
- Around line 37-38: There are two declarations of the Maven property
spark.connect.version (one global and one inside the default-active profile)
causing duplication and drift; remove the duplicate inside the profile (or
remove the global one if you intend the profile to be canonical) so only one
spark.connect.version property remains, and update/remove the matching duplicate
at the other spot referenced (the other occurrence around the default-active
profile) to ensure a single source of truth for Spark Connect version.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java`:
- Around line 41-42: The opened flag is set to prevent double-open but isn't
reset on shutdown; update the IPySparkConnectInterpreter.close() method to set
opened = false (and clear any state like curIntpContext if appropriate) so the
interpreter can be reopened after close; apply this change inside the close()
implementation of IPySparkConnectInterpreter to mirror the semantics of open()
and ensure proper restart behavior.
- Around line 122-125: The method formatDataFrame currently performs an
unchecked cast to (Dataset<Row>) and can throw ClassCastException at runtime;
update formatDataFrame to validate the input before casting (e.g., check that df
instanceof Dataset<?> and that its row type is compatible) and handle invalid
types by throwing a clear IllegalArgumentException or returning a descriptive
error string instead of allowing a ClassCastException; keep the call to
SparkConnectUtils.showDataFrame for valid Dataset<Row> inputs and include the
method name formatDataFrame, the target cast to Dataset<Row>, and the use of
SparkConnectUtils.showDataFrame when locating and modifying the code.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java`:
- Around line 41-44: Extract the fairness literal from the ReentrantLock
instantiation into a named private static final boolean (e.g., FAIR_LOCK = true)
and replace the inline literal in getNotebookLock (and any other uses) with that
constant; also harden the utility by preventing instantiation of
NotebookLockManager (add a private constructor) and ensure notebookLocks and
getNotebookLock remain static so the class stays a proper non-instantiable
utility holder.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java`:
- Around line 102-109: The getPythonExec method contains a redundant non-blank
check because getProperty("zeppelin.python", "python") will always return a
non-null default; simplify by returning the property value directly. Update the
getPythonExec method in PySparkConnectInterpreter to simply return
getProperty("zeppelin.python", "python") (remove the StringUtils.isNotBlank
check and the unreachable final return) so only the direct call to getProperty
remains.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java`:
- Around line 60-65: The open() method currently looks up the
SparkConnectInterpreter instance but doesn't explicitly initialize it; update
SparkConnectSqlInterpreter.open to invoke sparkConnectInterpreter.open() after
obtaining it (and before using sqlSplitter), propagating or handling
InterpreterException as appropriate so the dependent SparkConnectInterpreter is
explicitly opened; reference the sparkConnectInterpreter field and the open()
method on SparkConnectInterpreter when making this change.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py`:
- Around line 94-99: In the except ImportError block that handles "import pandas
as pd" (used by toPandas()), re-raise the ImportError using exception chaining
with "from None" so the new ImportError intentionally replaces the original
(i.e., raise ImportError("pandas is required for toPandas(). Install it with:
pip install pandas") from None).
- Around line 264-266: The warnings.warn call that alerts users about
"createDataFrame from pandas goes through Py4j serialization..." should include
stacklevel=2 so the warning points to the user's call site; update the
warnings.warn invocation in zeppelin_isparkconnect.py (the warnings.warn(...)
call with that exact message) to pass stacklevel=2 as an argument (e.g.,
warnings.warn("createDataFrame ... temp table instead.", stacklevel=2)).
- Around line 68-73: The warnings.warn call inside the data-collection branch
should include stacklevel=2 so the warning points to the user's calling code;
update the warnings.warn invocation (the one immediately before return
list(self._jdf.collectAsList())) to pass stacklevel=2 as an argument while
keeping the existing message and interpolation unchanged.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py`:
- Around line 1-310: The two files duplicate the SparkConnectDataFrame and
SparkConnectSession implementations; extract these classes and any helper
functions/constants they use (e.g., SparkConnectDataFrame, SparkConnectSession,
_rows_to_dicts, _COLLECT_LIMIT_DEFAULT, _COLLECT_WARN_THRESHOLD) into a new
shared module (e.g., zeppelin_sparkconnect_common) and replace the in-file
definitions in both zeppelin_sparkconnect.py and zeppelin_isparkconnect.py with
imports from that module; ensure each file still initializes its
gateway/entry_point/_max_result and passes or sets any module-level state the
shared module relies on, update imports and references (SparkConnectDataFrame,
SparkConnectSession) accordingly, and run tests to confirm behavior is
unchanged.
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java`:
- Around line 54-89: The test harness setup/teardown and context builder in
SparkConnectInterpreterTest (setUp, tearDown, and getInterpreterContext) are
duplicated across tests—extract them into a shared test base or helper class:
create a common abstract base (e.g., SparkConnectTestBase) or a TestUtils helper
that provides the static setup/cleanup methods (initializing InterpreterGroup,
SparkConnectInterpreter, calling interpreter.open/close) and a reusable
getInterpreterContext builder that returns the configured InterpreterContext
(using AngularObjectRegistry, LocalResourcePool, InterpreterOutput, mocked
RemoteInterpreterEventClient); then update SparkConnectInterpreterTest to extend
or call that shared base/helper and remove the duplicated code from the test
class.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
pom.xml
spark-connect/pom.xml
spark-connect/src/main/java/org/apache/zeppelin/spark/IPySparkConnectInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java
spark-connect/src/main/java/org/apache/zeppelin/spark/PySparkConnectInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreter.java
spark-connect/src/main/java/org/apache/zeppelin/spark/SparkConnectUtils.java
spark-connect/src/main/resources/interpreter-setting.json
spark-connect/src/main/resources/python/zeppelin_isparkconnect.py
spark-connect/src/main/resources/python/zeppelin_sparkconnect.py
spark-connect/src/test/java/org/apache/zeppelin/spark/PySparkConnectInterpreterTest.java
spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java
spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectSqlInterpreterTest.java
spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectUtilsTest.java
```java
public static void removeNotebookLock(String noteId) {
  notebookLocks.remove(noteId);
}
```
Avoid removing notebook locks while they may still be active.
Line 52 removes the lock unconditionally. If a thread still holds the old lock and another thread calls getNotebookLock, a new lock can be created for the same notebook, allowing concurrent execution.
🧵 Safer cleanup approach

```diff
 public static void removeNotebookLock(String noteId) {
-  notebookLocks.remove(noteId);
+  notebookLocks.computeIfPresent(noteId, (id, lock) -> {
+    if (lock.isLocked() || lock.hasQueuedThreads()) {
+      return lock;
+    }
+    return null; // remove only when truly idle
+  });
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```java
public static void removeNotebookLock(String noteId) {
  notebookLocks.computeIfPresent(noteId, (id, lock) -> {
    if (lock.isLocked() || lock.hasQueuedThreads()) {
      return lock;
    }
    return null; // remove only when truly idle
  });
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark-connect/src/main/java/org/apache/zeppelin/spark/NotebookLockManager.java`
around lines 51 - 53, removeNotebookLock unconditionally removes entries from
notebookLocks which can create a new lock while the old one is still held;
change removeNotebookLock to only remove the map entry when the current lock for
noteId is not held and has no queued threads. Locate notebookLocks and the
methods removeNotebookLock and getNotebookLock, obtain the ReentrantLock
instance from notebookLocks (e.g., via get/computeIfPresent), and check
lock.isLocked() and lock.hasQueuedThreads() (or attempt a non-blocking tryLock
and immediately unlock) before calling notebookLocks.remove(noteId) so you only
remove locks that are truly free.
| "spark.connect.token": { | ||
| "envName": "SPARK_CONNECT_TOKEN", | ||
| "propertyName": "spark.connect.token", | ||
| "defaultValue": "", | ||
| "description": "Authentication token for Spark Connect (optional)", | ||
| "type": "string" | ||
| }, |
🧩 Analysis chain
🏁 Script executed:
```shell
# First, locate and verify the target file exists
find . -name "interpreter-setting.json" -type f
```
Repository: Meesho/zeppelin
Length of output: 2075
🏁 Script executed:
```shell
# Run the provided verification script to check for password-type usage
rg -n '"type"\s*:\s*"password"' --glob '**/interpreter-setting.json'
```
Repository: Meesho/zeppelin
Length of output: 559
🏁 Script executed:
```shell
# Examine the structure of the spark-connect interpreter-setting.json file around lines 111-117
if [ -f "spark-connect/src/main/resources/interpreter-setting.json" ]; then
  cat -n "spark-connect/src/main/resources/interpreter-setting.json" | sed -n '105,125p'
else
  echo "File not found at expected path"
fi
```
Repository: Meesho/zeppelin
Length of output: 1010
🏁 Script executed:
```shell
# Search for other authentication token fields to understand naming patterns
rg -i '"token"' --glob '**/interpreter-setting.json' -A 3
```
Repository: Meesho/zeppelin
Length of output: 41
Mark spark.connect.token as a secret field.
Line 116 uses "type": "string" for an authentication token. Authentication tokens should be masked as sensitive values in interpreter settings. The repository already uses "type": "password" for other authentication fields (e.g., in jdbc, mongodb, elasticsearch, cassandra, and influxdb interpreters).
🔐 Suggested change

```diff
 "spark.connect.token": {
   "envName": "SPARK_CONNECT_TOKEN",
   "propertyName": "spark.connect.token",
   "defaultValue": "",
   "description": "Authentication token for Spark Connect (optional)",
-  "type": "string"
+  "type": "password"
 },
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "spark.connect.token": { | |
| "envName": "SPARK_CONNECT_TOKEN", | |
| "propertyName": "spark.connect.token", | |
| "defaultValue": "", | |
| "description": "Authentication token for Spark Connect (optional)", | |
| "type": "string" | |
| }, | |
| "spark.connect.token": { | |
| "envName": "SPARK_CONNECT_TOKEN", | |
| "propertyName": "spark.connect.token", | |
| "defaultValue": "", | |
| "description": "Authentication token for Spark Connect (optional)", | |
| "type": "password" | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/interpreter-setting.json` around lines 111 -
117, The "spark.connect.token" interpreter setting is currently declared with
"type": "string" and must be treated as sensitive; update the
"spark.connect.token" entry in interpreter-setting.json to use "type":
"password" (matching other secret fields like jdbc/mongodb entries) so the token
is masked in UI/exports, keeping the same envName, propertyName and
defaultValue.
```python
def show(self, n=20, truncate=True):
    effective_n = min(n, _max_result)
    print(intp.formatDataFrame(self._jdf, effective_n))

def collect(self, limit=None):
```
Unused truncate parameter in show() method.
The truncate argument is accepted but never used. This could mislead users expecting truncation control.
🔧 Proposed fix
Either remove the unused parameter or pass it to formatDataFrame:
-    def show(self, n=20, truncate=True):
-        effective_n = min(n, _max_result)
-        print(intp.formatDataFrame(self._jdf, effective_n))
+    def show(self, n=20, truncate=True):
+        """Show DataFrame rows. Note: truncate parameter is currently not supported."""
+        effective_n = min(n, _max_result)
+        print(intp.formatDataFrame(self._jdf, effective_n))

Alternatively, if the Java side supports truncation, wire it through.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 52-52: Unused method argument: truncate
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around
lines 52 - 56, The show method currently ignores the truncate parameter; update
the show function (def show) to either remove the unused truncate argument or
forward it to intp.formatDataFrame so truncation is honored—e.g., use the
truncate value when calling intp.formatDataFrame(self._jdf, effective_n,
truncate) if the Java/Python interop supports a truncate argument, or if not
supported, drop the truncate parameter from the show signature and update any
callers; reference the show method, its truncate parameter, self._jdf and the
intp.formatDataFrame call when making the change.
    def groupBy(self, *cols):
        return self._jdf.groupBy(*cols)
groupBy returns raw Java object instead of wrapped DataFrame.
Unlike other transformation methods that return SparkConnectDataFrame, groupBy returns the raw Java GroupedData object, breaking the wrapper pattern and exposing internals.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_isparkconnect.py` around
lines 136 - 137, The groupBy method currently returns the raw Java GroupedData
from self._jdf, breaking the wrapper pattern; update
SparkConnectDataFrame.groupBy to wrap the result in the corresponding Python
wrapper (e.g., a GroupedData / SparkConnectGroupedData instance) instead of
returning the raw Java object so callers get the same high-level API as other
transformations; locate groupBy and construct/return the proper wrapper (passing
the Java object and any required session/context like self._session) using the
existing GroupedData wrapper class used elsewhere in the module.
    def show(self, n=20, truncate=True):
        effective_n = min(n, _max_result)
        print(intp.formatDataFrame(self._jdf, effective_n))
Unused truncate parameter - same issue as in zeppelin_isparkconnect.py.
The truncate argument is accepted but not used, which could mislead users.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 54-54: Unused method argument: truncate
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around
lines 54 - 56, The show method currently ignores the truncate parameter; update
the show(self, n=20, truncate=True) implementation to use truncate when
formatting the DataFrame (e.g., pass the truncate flag through to
intp.formatDataFrame or apply equivalent truncation logic before printing) so
that the truncate argument affects output; locate the show method and modify the
call to intp.formatDataFrame(self._jdf, effective_n) to include the truncate
behavior using the truncate variable.
    def toPandas(self, limit=None):
        """Convert to pandas DataFrame. Applies a safety limit.

        Tries to use pyarrow for efficient serialization if available,
        otherwise falls back to row-by-row conversion through Py4j.

        Args:
            limit: Max rows. Defaults to zeppelin.spark.maxResult.
                Pass limit=-1 for all rows (use with caution on large data).
        """
Misleading docstring: mentions pyarrow but doesn't use it.
The docstring claims "Tries to use pyarrow for efficient serialization if available" but the implementation doesn't check for or use pyarrow - it always uses row-by-row conversion through Py4j.
📝 Proposed fix - update docstring to match implementation
     def toPandas(self, limit=None):
         """Convert to pandas DataFrame. Applies a safety limit.

-        Tries to use pyarrow for efficient serialization if available,
-        otherwise falls back to row-by-row conversion through Py4j.
+        Uses row-by-row conversion through Py4j.

         Args:
             limit: Max rows. Defaults to zeppelin.spark.maxResult.
                 Pass limit=-1 for all rows (use with caution on large data).
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around
lines 89 - 98, The docstring for toPandas incorrectly states that it tries to
use pyarrow; update the toPandas method docstring to accurately describe the
current implementation: remove any mention of pyarrow and state that it converts
rows row-by-row via Py4j with a safety limit (limit argument, default from
zeppelin.spark.maxResult, and limit=-1 for all rows). Keep the Args section and
wording consistent with the actual behavior in toPandas.
    def groupBy(self, *cols):
        return self._jdf.groupBy(*cols)
groupBy returns raw Java object instead of wrapped DataFrame.
Same issue as in zeppelin_isparkconnect.py - breaks the wrapper pattern by returning unwrapped Java GroupedData.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spark-connect/src/main/resources/python/zeppelin_sparkconnect.py` around
lines 141 - 142, The groupBy method currently returns the raw Java GroupedData
object from self._jdf.groupBy(*cols); change it to return the wrapped
GroupedData wrapper (e.g., construct and return
GroupedData(self._jdf.groupBy(*cols)) or whatever local wrapper class is used)
so the wrapper pattern is preserved; update the groupBy method and add any
necessary import/reference to the wrapper class (GroupedData) used elsewhere in
this module.
    assertTrue(output.contains("id") || output.contains("message") || output.contains("hello"),
        "Output should contain query results: " + output);
Tighten assertions—current checks are too permissive to catch regressions reliably.
The OR-based output checks (Lines 116-117, 127-128, 151-152) can pass on partial/incorrect outputs, and Line 139-141 accepts both SUCCESS and ERROR. Consider making these assertions deterministic (e.g., assert expected columns/values together, or set up deterministic test data for the delta-case path).
Also applies to: 127-128, 139-141, 151-152
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/PySparkConnectInterpreterTest.java`
around lines 116 - 117, The OR-based assertions in PySparkConnectInterpreterTest
(the assertTrue checks that use output.contains("id") ||
output.contains("message") || output.contains("hello") and the ones at the other
noted locations) are too permissive — update them to assert specific,
deterministic output by checking all expected tokens together (e.g.,
assertTrue(output.contains("id") && output.contains("message") &&
output.contains("hello")) or better,
assertTrue(output.containsAll(expectedColumnsList)) or assertEquals on the exact
string/JSON you expect), replace the ambiguous SUCCESS/ERROR acceptance with an
assertEquals to the single expected status value, and for the delta-case create
or mock deterministic test data so the expected columns/values are known; locate
and change the assertions in PySparkConnectInterpreterTest (the OR-based
assertions and the SUCCESS/ERROR branch) and adjust test setup for the delta
path to produce deterministic output.
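The all-tokens-together check the prompt recommends reduces to a small predicate. A Python stand-in mirroring the JUnit change, for illustration only:

```python
# Require every expected token, not just any one of them, so a partial or
# wrong result fails the check instead of slipping through.

def contains_all(output, expected):
    return all(token in output for token in expected)

expected_tokens = ["id", "message", "hello"]
print(contains_all("id | message\n1 | hello", expected_tokens))  # True
print(contains_all("id only", expected_tokens))                  # False: partial output rejected
```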
    @AfterAll
    public static void tearDown() throws InterpreterException {
      if (interpreter != null) {
        interpreter.close();
      }
    }
Close the InterpreterGroup in teardown as well.
InterpreterGroup owns session/lifecycle state. Line 73 currently only closes the interpreter instance; group cleanup should also run to avoid lingering resources in integration test runs.
🧹 Suggested teardown update
     @AfterAll
     public static void tearDown() throws InterpreterException {
-      if (interpreter != null) {
-        interpreter.close();
-      }
+      if (intpGroup != null) {
+        intpGroup.close();
+      } else if (interpreter != null) {
+        interpreter.close();
+      }
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark-connect/src/test/java/org/apache/zeppelin/spark/SparkConnectInterpreterTest.java`
around lines 72 - 77, The teardown only closes the Interpreter instance; also
ensure the InterpreterGroup is cleaned up to avoid lingering resources by
updating the tearDown method to check if the shared InterpreterGroup
(InterpreterGroup variable) is non-null and invoke its close/cleanup method
after closing the interpreter (e.g., if (interpreterGroup != null) {
interpreterGroup.close(); }), making sure to reference the existing interpreter
and InterpreterGroup symbols and handle exceptions similar to
interpreter.close().
What is this PR for?
A few sentences describing the overall goals of the pull request's commits.
First time? Check out the contributing guide - https://zeppelin.apache.org/contribution/contributions.html
What type of PR is it?
Bug Fix
Improvement
Feature
Documentation
Hot Fix
Refactoring
Please leave your type of PR only
Todos
What is the Jira issue?
How should this be tested?
Screenshots (if appropriate)
Questions:
Summary by CodeRabbit
New Features
Tests