OpenIVM for Spark by mdrakiburrahman · Pull Request #9 · ila/lpts

mdrakiburrahman · 2026-05-18T02:39:31Z

Why this change is needed

OpenIVM for Spark reuses lpts to convert openivm's DuckDB-dialect refresh programs into Spark-executable SQL. The existing SqlDialect enum only carried DUCKDB / POSTGRES, so identifiers, qualified table refs, function names and window frames all came out in DuckDB form and Spark's parser rejected them.

How

Adds a third SqlDialect::SPARK variant and threads it end-to-end so every CTE/scan/function/window emission path can branch on dialect.

New header src/include/sql_dialect.hpp hosts SqlDialect (moved out of lpts_pipeline.hpp so cte_nodes.hpp can depend on it without a cyclic include).
New DialectQuoteIdent(name, dialect) helper: backticks for SPARK, KeywordHelper::WriteOptionallyQuoted for DUCKDB / POSTGRES.
Dialect-aware variants of the new upstream quoting helpers (DialectVecToQuotedIdentifierList, DialectQuoteTableWithOptionalSuffix, DialectQualifiedTableName) — every identifier emission site in GetNode::ToQuery, AstFlattener::AstToInlineSQL, and table-filter pushdown is routed through them.
CteBaseNode carries the dialect; AstFlattener::StampDialect propagates it onto every CTE node after construction.
ParseSqlDialect accepts "spark" / "SPARK"; the lpts_dialect setting docstring lists it.
BOUND_FUNCTION emission remaps a curated set of DuckDB function names to Spark equivalents that openivm-emitted plans actually exercise (strftime→date_format, strptime→to_timestamp, list_transform→transform, list_aggregate→aggregate, list_filter→filter, list_value→array, list_contains→array_contains, list_extract→element_at). Unsupported functions surface as a Spark-side error rather than being silently mistranslated.
Window-frame emission throws NotImplementedException on GROUPS units and EXCLUDE clauses when the dialect is SPARK (Spark SQL supports only ROWS / RANGE without exclusion).

Considerations:

The DUCKDB and POSTGRES paths fall back to the existing KeywordHelper-based quoting; the new Spark routing is gated on the enum.
Merged upstream/main (f08974212a3946…) cleanly: resolved overlapping additions in cte_nodes.cpp (kept upstream's GetNodeColumnsAreExpressions alongside dialect-aware helpers) and lpts_pipeline.cpp (kept StampDialect(...) and threaded upstream's new has_recursive_cte argument into both CteList constructions).

Test

make unittest — all upstream sqllogictest cases pass (44/44 openivm sqllogictests still green; no DuckDB/Postgres dialect regressions).
./spark-ext/dev/dev.sh verify against openivm-spark with LPTS_COMMIT bumped to this PR's HEAD: lint, compile, assembly, and the full parity suite (112 suites / 640 tests) succeed. The only test that flapped was OpenIvmRocksDBSpec.recovers committed values after an unclean process exit — a JVM-fork process.waitFor(10, SECONDS) that timed out under heavy concurrent load. Passes in 1.2s when re-run in isolation; unrelated to LPTS.
Spark dialect verified end-to-end via the openivm-spark parity suite, which executes the lpts-generated refresh SQL against Apache Spark 3.5 / Delta Lake 3.2.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Adds a new `SqlDialect::SPARK` variant alongside DUCKDB and POSTGRES, plus the plumbing to make the output dialect-aware end-to-end: - New `src/include/sql_dialect.hpp` hosts the `SqlDialect` enum (moved out of `lpts_pipeline.hpp` so `cte_nodes.hpp` can depend on it without a cyclic include). - New `DialectQuoteIdent(name, dialect)` helper: backticks for SPARK, DuckDB `KeywordHelper::WriteOptionallyQuoted` for DUCKDB/POSTGRES. - `CteBaseNode` carries the dialect (set by `AstFlattener::StampDialect` after construction); `GetNode` and `FinalReadNode` consult it to choose identifier quoting. - `ParseSqlDialect` accepts "spark" / "SPARK". - BOUND_FUNCTION emission remaps a curated set of DuckDB function names to their Spark equivalents (strftime->date_format, list_transform->transform, list_aggregate->aggregate, list_filter->filter, list_value->array, list_contains->array_contains, list_extract->element_at, strptime-> to_timestamp). - Window-frame emission throws `NotImplementedException` on `GROUPS` units and `EXCLUDE` clauses when the dialect is SPARK (Spark SQL supports only ROWS / RANGE without exclusion). - `lpts_extension.cpp` lpts_dialect setting docstring lists "spark" as a valid value. The DuckDB and POSTGRES dialects are unaffected (regression: all 44 upstream openivm sqllogictest cases still pass). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

# Conflicts: # src/cte_nodes.cpp # src/lpts_pipeline.cpp

ila · 2026-05-22T07:59:00Z

Thank you so much for the PR! Looking forward to having this feature in LPTS, I have a few comments:

I am not seeing a test case? Ideally every feature should be covered by a .test file in the sql folder. I expect testing dialects will be different than the usual workflow (lpts_check) since we cannot execute queries, so I think manually writing the expected output should be enough. No need to test all operators, just a bunch of edge cases is OK.
Here I see you embed the dialect in every CTE node (https://github.com/ila/lpts/pull/9/changes#diff-3371386e33d46e988689efa5dc09d721296bea2e246985c807ce26e692e7ec72R55) is this necessary? I assume the dialect is the same for every node, can we get rid of the duplication?
Not entirely sure why we need StampDialect either. I assume if 2) is sorted then we can get rid of it?
CI fails on format/tidy check (not sure why it was not immediately triggered, maybe it's my fault).
Perhaps this logic https://github.com/ila/lpts/pull/9/changes#diff-598db8133ad92954d12191e2a8ffdb140340cf3fdc84d5bfa6034fd20c873badR916 should be moved to a separate file? I assume its complexity is going to explode as we add more dialects.

Have a nice day!

mdrakiburrahman · 2026-05-22T10:41:22Z

Thanks @ila!

I'll get your comments addressed - will tag you here for a re-review 🙂

Spark-openivm's ivm-bench builder image is pinned to ubuntu:22.04 (GCC 11) so the resulting duckdb CLI + openivm.duckdb_extension binaries depend on glibc-2.35 — matching the spark-openivm runtime container. Under GCC 11 with -std=c++11, the implicit derived->base conversion from unique_ptr<AstRecursiveCteNode> to unique_ptr<AstNode> in the RecursiveTraversal return slot trips the diagnostic from gcc.gnu.org/bugzilla/100489. GCC >= 13 (ubuntu:24.04 default) elides the conversion via guaranteed copy elision and accepts the same code, which is why this only surfaces when one specifically builds on 22.04. One explicit std::move keeps the source compatible with both toolchains. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The temporary openivm-spark-glibc-2.35 backport branch has been replaced by a one-line std::move fix on mdrakiburrahman/lpts:openivm-spark (ila/lpts#9). Bumping both ARG defaults so the spark-openivm builder pulls the same lpts SHA that openivm-spark/spark-ext pins. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Until ila/lpts#9 (mdrakiburrahman/lpts:openivm-spark) is merged into ila/lpts:main, openivm PR ila#2 needs the lpts submodule to resolve to a commit that includes the Spark-facing changes (SPARK dialect, hidden-projection bindings, scan identifier quoting, etc.) plus the one-line std::move fix that lets the source compile under GCC 11 / -std=c++11. Pointing the submodule URL at the fork keeps those diffs out of the openivm PR view. Once ila/lpts#9 merges, this commit gets reverted back to ila/lpts.git with the lpts main SHA. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Eliminates the temporary openivm-spark-glibc-2.35 backport branch. LPTS_BRANCH/LPTS_COMMIT now point at mdrakiburrahman/lpts:openivm-spark (ila/lpts#9 head plus a one-line std::move fix for GCC-11/C++11 compatibility inside the ubuntu:22.04 spark-openivm-build image). OPENIVM_COMMIT advances to the matching openivm submodule-URL bump + branch hint commit. IVM_BENCH_COMMIT advances to the Dockerfile pin sync. Verified: - Check 1: lpts build on ubuntu:22.04 succeeds (638/638). - Check 2: openivm unittests match baseline (100 pass / 2 pre-existing chained.test:214 failures on both branches). - Check 3: spark-ext verify — ivm-it 646/646 green. ivm-extension's MaterializedViewCommandsSpec (11) fails on BOTH old and new pin sets; pre-existing, unrelated to this change. Plan: .research/LPTS-BACKPORT-PLAN.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mdrakiburrahman and others added 2 commits May 16, 2026 05:28

Merge remote-tracking branch 'upstream/main' into openivm-spark

f089742

# Conflicts: # src/cte_nodes.cpp # src/lpts_pipeline.cpp

mdrakiburrahman marked this pull request as ready for review May 22, 2026 01:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenIVM for Spark#9

OpenIVM for Spark#9
mdrakiburrahman wants to merge 3 commits into
ila:mainfrom
mdrakiburrahman:openivm-spark

mdrakiburrahman commented May 18, 2026 •

edited

Loading

Uh oh!

ila commented May 22, 2026 •

edited

Loading

Uh oh!

mdrakiburrahman commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mdrakiburrahman commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why this change is needed

How

Test

Uh oh!

ila commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mdrakiburrahman commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mdrakiburrahman commented May 18, 2026 •

edited

Loading

ila commented May 22, 2026 •

edited

Loading