Skip to content

OpenIVM for Spark#9

Open
mdrakiburrahman wants to merge 3 commits into
ila:mainfrom
mdrakiburrahman:openivm-spark
Open

OpenIVM for Spark#9
mdrakiburrahman wants to merge 3 commits into
ila:mainfrom
mdrakiburrahman:openivm-spark

Conversation

@mdrakiburrahman
Copy link
Copy Markdown

@mdrakiburrahman mdrakiburrahman commented May 18, 2026

Why this change is needed

OpenIVM for Spark reuses lpts to convert openivm's DuckDB-dialect refresh programs into Spark-executable SQL. The existing SqlDialect enum only carried DUCKDB / POSTGRES, so identifiers, qualified table refs, function names and window frames all came out in DuckDB form and Spark's parser rejected them.

How

Adds a third SqlDialect::SPARK variant and threads it end-to-end so every CTE/scan/function/window emission path can branch on dialect.

  • New header src/include/sql_dialect.hpp hosts SqlDialect (moved out of lpts_pipeline.hpp so cte_nodes.hpp can depend on it without a cyclic include).
  • New DialectQuoteIdent(name, dialect) helper: backticks for SPARK, KeywordHelper::WriteOptionallyQuoted for DUCKDB / POSTGRES.
  • Dialect-aware variants of the new upstream quoting helpers (DialectVecToQuotedIdentifierList, DialectQuoteTableWithOptionalSuffix, DialectQualifiedTableName) — every identifier emission site in GetNode::ToQuery, AstFlattener::AstToInlineSQL, and table-filter pushdown is routed through them.
  • CteBaseNode carries the dialect; AstFlattener::StampDialect propagates it onto every CTE node after construction.
  • ParseSqlDialect accepts "spark" / "SPARK"; the lpts_dialect setting docstring lists it.
  • BOUND_FUNCTION emission remaps a curated set of DuckDB function names to Spark equivalents that openivm-emitted plans actually exercise (strftimedate_format, strptimeto_timestamp, list_transformtransform, list_aggregateaggregate, list_filterfilter, list_valuearray, list_containsarray_contains, list_extractelement_at). Unsupported functions surface as a Spark-side error rather than being silently mistranslated.
  • Window-frame emission throws NotImplementedException on GROUPS units and EXCLUDE clauses when the dialect is SPARK (Spark SQL supports only ROWS / RANGE without exclusion).

Considerations:

  • The DUCKDB and POSTGRES paths fall back to the existing KeywordHelper-based quoting; the new Spark routing is gated on the enum.
  • Merged upstream/main (f08974212a3946…) cleanly: resolved overlapping additions in cte_nodes.cpp (kept upstream's GetNodeColumnsAreExpressions alongside dialect-aware helpers) and lpts_pipeline.cpp (kept StampDialect(...) and threaded upstream's new has_recursive_cte argument into both CteList constructions).

Test

  • make unittest — all upstream sqllogictest cases pass (44/44 openivm sqllogictests still green; no DuckDB/Postgres dialect regressions).
  • ./spark-ext/dev/dev.sh verify against openivm-spark with LPTS_COMMIT bumped to this PR's HEAD: lint, compile, assembly, and the full parity suite (112 suites / 640 tests) succeed. The only test that flapped was OpenIvmRocksDBSpec.recovers committed values after an unclean process exit — a JVM-fork process.waitFor(10, SECONDS) that timed out under heavy concurrent load. Passes in 1.2s when re-run in isolation; unrelated to LPTS.
  • Spark dialect verified end-to-end via the openivm-spark parity suite, which executes the lpts-generated refresh SQL against Apache Spark 3.5 / Delta Lake 3.2.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

mdrakiburrahman and others added 2 commits May 16, 2026 05:28
Adds a new `SqlDialect::SPARK` variant alongside DUCKDB and POSTGRES, plus
the plumbing to make the output dialect-aware end-to-end:

- New `src/include/sql_dialect.hpp` hosts the `SqlDialect` enum (moved out
  of `lpts_pipeline.hpp` so `cte_nodes.hpp` can depend on it without a
  cyclic include).
- New `DialectQuoteIdent(name, dialect)` helper: backticks for SPARK,
  DuckDB `KeywordHelper::WriteOptionallyQuoted` for DUCKDB/POSTGRES.
- `CteBaseNode` carries the dialect (set by `AstFlattener::StampDialect`
  after construction); `GetNode` and `FinalReadNode` consult it to choose
  identifier quoting.
- `ParseSqlDialect` accepts "spark" / "SPARK".
- BOUND_FUNCTION emission remaps a curated set of DuckDB function names to
  their Spark equivalents (strftime->date_format, list_transform->transform,
  list_aggregate->aggregate, list_filter->filter, list_value->array,
  list_contains->array_contains, list_extract->element_at, strptime->
  to_timestamp).
- Window-frame emission throws `NotImplementedException` on `GROUPS` units
  and `EXCLUDE` clauses when the dialect is SPARK (Spark SQL supports
  only ROWS / RANGE without exclusion).
- `lpts_extension.cpp` lpts_dialect setting docstring lists "spark" as a
  valid value.

The DuckDB and POSTGRES dialects are unaffected (regression: all 44
upstream openivm sqllogictest cases still pass).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	src/cte_nodes.cpp
#	src/lpts_pipeline.cpp
@mdrakiburrahman mdrakiburrahman marked this pull request as ready for review May 22, 2026 01:52
@ila
Copy link
Copy Markdown
Owner

ila commented May 22, 2026

Thank you so much for the PR! Looking forward to having this feature in LPTS, I have a few comments:

  1. I am not seeing a test case? Ideally every feature should be covered by a .test file in the sql folder. I expect testing dialects will be different than the usual workflow (lpts_check) since we cannot execute queries, so I think manually writing the expected output should be enough. No need to test all operators, just a bunch of edge cases is OK.
  2. Here I see you embed the dialect in every CTE node (https://github.com/ila/lpts/pull/9/changes#diff-3371386e33d46e988689efa5dc09d721296bea2e246985c807ce26e692e7ec72R55) is this necessary? I assume the dialect is the same for every node, can we get rid of the duplication?
  3. Not entirely sure why we need StampDialect either. I assume if 2) is sorted then we can get rid of it?
  4. CI fails on format/tidy check (not sure why it was not immediately triggered, maybe it's my fault).
  5. Perhaps this logic https://github.com/ila/lpts/pull/9/changes#diff-598db8133ad92954d12191e2a8ffdb140340cf3fdc84d5bfa6034fd20c873badR916 should be moved to a separate file? I assume its complexity is going to explode as we add more dialects.

Have a nice day!

@mdrakiburrahman
Copy link
Copy Markdown
Author

Thanks @ila!

I'll get your comments addressed - will tag you here for a re-review 🙂

Spark-openivm's ivm-bench builder image is pinned to ubuntu:22.04
(GCC 11) so the resulting duckdb CLI + openivm.duckdb_extension
binaries depend on glibc-2.35 — matching the spark-openivm runtime
container. Under GCC 11 with -std=c++11, the implicit derived->base
conversion from unique_ptr<AstRecursiveCteNode> to
unique_ptr<AstNode> in the RecursiveTraversal return slot trips
the diagnostic from gcc.gnu.org/bugzilla/100489. GCC >= 13
(ubuntu:24.04 default) elides the conversion via guaranteed copy
elision and accepts the same code, which is why this only surfaces
when one specifically builds on 22.04.

One explicit std::move keeps the source compatible with both
toolchains.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman added a commit to mdrakiburrahman/ivm-bench that referenced this pull request May 23, 2026
The temporary openivm-spark-glibc-2.35 backport branch has been
replaced by a one-line std::move fix on mdrakiburrahman/lpts:openivm-spark
(ila/lpts#9). Bumping both ARG defaults so the spark-openivm builder
pulls the same lpts SHA that openivm-spark/spark-ext pins.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman added a commit to mdrakiburrahman/openivm that referenced this pull request May 23, 2026
Until ila/lpts#9 (mdrakiburrahman/lpts:openivm-spark) is merged into
ila/lpts:main, openivm PR ila#2 needs the lpts submodule to
resolve to a commit that includes the Spark-facing changes (SPARK
dialect, hidden-projection bindings, scan identifier quoting, etc.)
plus the one-line std::move fix that lets the source compile under
GCC 11 / -std=c++11. Pointing the submodule URL at the fork keeps
those diffs out of the openivm PR view.

Once ila/lpts#9 merges, this commit gets reverted back to
ila/lpts.git with the lpts main SHA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman added a commit to mdrakiburrahman/openivm-spark that referenced this pull request May 23, 2026
Eliminates the temporary openivm-spark-glibc-2.35 backport branch.
LPTS_BRANCH/LPTS_COMMIT now point at mdrakiburrahman/lpts:openivm-spark
(ila/lpts#9 head plus a one-line std::move fix for GCC-11/C++11
compatibility inside the ubuntu:22.04 spark-openivm-build image).
OPENIVM_COMMIT advances to the matching openivm submodule-URL bump
+ branch hint commit. IVM_BENCH_COMMIT advances to the Dockerfile
pin sync.

Verified:
- Check 1: lpts build on ubuntu:22.04 succeeds (638/638).
- Check 2: openivm unittests match baseline (100 pass / 2 pre-existing
  chained.test:214 failures on both branches).
- Check 3: spark-ext verify — ivm-it 646/646 green. ivm-extension's
  MaterializedViewCommandsSpec (11) fails on BOTH old and new pin
  sets; pre-existing, unrelated to this change.

Plan: .research/LPTS-BACKPORT-PLAN.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants