OpenIVM for Spark #9
Open
mdrakiburrahman wants to merge 3 commits into
Adds a new `SqlDialect::SPARK` variant alongside DUCKDB and POSTGRES, plus the plumbing to make the output dialect-aware end-to-end:

- New `src/include/sql_dialect.hpp` hosts the `SqlDialect` enum (moved out of `lpts_pipeline.hpp` so `cte_nodes.hpp` can depend on it without a cyclic include).
- New `DialectQuoteIdent(name, dialect)` helper: backticks for SPARK, DuckDB `KeywordHelper::WriteOptionallyQuoted` for DUCKDB/POSTGRES.
- `CteBaseNode` carries the dialect (set by `AstFlattener::StampDialect` after construction); `GetNode` and `FinalReadNode` consult it to choose identifier quoting.
- `ParseSqlDialect` accepts "spark" / "SPARK".
- BOUND_FUNCTION emission remaps a curated set of DuckDB function names to their Spark equivalents (strftime->date_format, list_transform->transform, list_aggregate->aggregate, list_filter->filter, list_value->array, list_contains->array_contains, list_extract->element_at, strptime->to_timestamp).
- Window-frame emission throws `NotImplementedException` on `GROUPS` units and `EXCLUDE` clauses when the dialect is SPARK (Spark SQL supports only ROWS / RANGE without exclusion).
- `lpts_extension.cpp` lpts_dialect setting docstring lists "spark" as a valid value.

The DuckDB and POSTGRES dialects are unaffected (regression: all 44 upstream openivm sqllogictest cases still pass).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
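The quoting rule in the list above can be illustrated with a small standalone sketch. It is not the PR's code: `ParseSqlDialectSketch`, `DialectQuoteIdentSketch` and `QuoteWith` are hypothetical names, the parse is assumed to be case-insensitive, and the non-Spark branch always double-quotes instead of deferring to DuckDB's `KeywordHelper::WriteOptionallyQuoted`. Doubling an embedded backtick is assumed to be the right Spark escape.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the enum hosted in src/include/sql_dialect.hpp.
enum class SqlDialect { DUCKDB, POSTGRES, SPARK };

// Assumption: case-insensitive matching; the real ParseSqlDialect may be stricter.
SqlDialect ParseSqlDialectSketch(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (name == "duckdb") return SqlDialect::DUCKDB;
    if (name == "postgres") return SqlDialect::POSTGRES;
    if (name == "spark") return SqlDialect::SPARK;
    throw std::invalid_argument("unknown SQL dialect: " + name);
}

// Wraps `name` in `quote`, doubling any embedded quote character.
std::string QuoteWith(const std::string &name, char quote) {
    std::string out(1, quote);
    for (char c : name) {
        if (c == quote) out += quote;
        out += c;
    }
    out += quote;
    return out;
}

// Backticks for SPARK. Simplification for the other dialects: always
// double-quote, whereas the PR quotes only when needed via KeywordHelper.
std::string DialectQuoteIdentSketch(const std::string &name, SqlDialect dialect) {
    if (dialect == SqlDialect::SPARK) return QuoteWith(name, '`');
    return QuoteWith(name, '"');
}
```

Keeping the dialect as an explicit argument means each emission site chooses its quoting locally, which matches the description of nodes consulting a stamped dialect.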
# Conflicts:
#   src/cte_nodes.cpp
#   src/lpts_pipeline.cpp
Owner
Thank you so much for the PR! Looking forward to having this feature in LPTS, I have a few comments:
Have a nice day!
Author
Thanks @ila! I'll get your comments addressed - will tag you here for a re-review 🙂
Spark-openivm's ivm-bench builder image is pinned to ubuntu:22.04 (GCC 11) so the resulting duckdb CLI + openivm.duckdb_extension binaries depend on glibc-2.35 — matching the spark-openivm runtime container. Under GCC 11 with -std=c++11, the implicit derived->base conversion from unique_ptr<AstRecursiveCteNode> to unique_ptr<AstNode> in the RecursiveTraversal return slot trips the diagnostic from gcc.gnu.org/bugzilla/100489. GCC >= 13 (ubuntu:24.04 default) elides the conversion via guaranteed copy elision and accepts the same code, which is why this only surfaces when one specifically builds on 22.04. One explicit std::move keeps the source compatible with both toolchains. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
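The shape of that fix can be sketched as follows. This is an illustration under assumptions, not the actual `RecursiveTraversal` code: the two structs are minimal stand-ins for the real AST classes, and `MakeRecursiveNode` and `Kind` are hypothetical names. The explicit `std::move` makes the derived-to-base conversion in the return statement compile regardless of how a given compiler treats a converting return of a local.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Minimal stand-ins for the AST classes named in the commit message.
struct AstNode {
    virtual ~AstNode() = default;
    virtual int Kind() const { return 0; }
};

struct AstRecursiveCteNode : AstNode {
    int Kind() const override { return 1; }
};

// Hypothetical function: the local is a unique_ptr to the derived type while
// the return type is a unique_ptr to the base, so the return converts.
std::unique_ptr<AstNode> MakeRecursiveNode() {
    std::unique_ptr<AstRecursiveCteNode> node(new AstRecursiveCteNode());
    // Explicit move: selects the converting move constructor of
    // unique_ptr<AstNode> without relying on implicit move-on-return.
    return std::move(node);
}
```

Newer compilers may flag the move as redundant with a warning, which is the trade-off for keeping one source tree buildable on both toolchains.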
mdrakiburrahman added a commit to mdrakiburrahman/ivm-bench that referenced this pull request May 23, 2026
The temporary openivm-spark-glibc-2.35 backport branch has been replaced by a one-line std::move fix on mdrakiburrahman/lpts:openivm-spark (ila/lpts#9). Bumping both ARG defaults so the spark-openivm builder pulls the same lpts SHA that openivm-spark/spark-ext pins. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman added a commit to mdrakiburrahman/openivm that referenced this pull request May 23, 2026
Until ila/lpts#9 (mdrakiburrahman/lpts:openivm-spark) is merged into ila/lpts:main, openivm PR ila#2 needs the lpts submodule to resolve to a commit that includes the Spark-facing changes (SPARK dialect, hidden-projection bindings, scan identifier quoting, etc.) plus the one-line std::move fix that lets the source compile under GCC 11 / -std=c++11. Pointing the submodule URL at the fork keeps those diffs out of the openivm PR view. Once ila/lpts#9 merges, this commit gets reverted back to ila/lpts.git with the lpts main SHA. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman added a commit to mdrakiburrahman/openivm-spark that referenced this pull request May 23, 2026
Eliminates the temporary openivm-spark-glibc-2.35 backport branch. LPTS_BRANCH/LPTS_COMMIT now point at mdrakiburrahman/lpts:openivm-spark (ila/lpts#9 head plus a one-line std::move fix for GCC-11/C++11 compatibility inside the ubuntu:22.04 spark-openivm-build image). OPENIVM_COMMIT advances to the matching openivm submodule-URL bump + branch hint commit. IVM_BENCH_COMMIT advances to the Dockerfile pin sync.

Verified:
- Check 1: lpts build on ubuntu:22.04 succeeds (638/638).
- Check 2: openivm unittests match baseline (100 pass / 2 pre-existing chained.test:214 failures on both branches).
- Check 3: spark-ext verify: ivm-it 646/646 green. ivm-extension's MaterializedViewCommandsSpec (11) fails on BOTH old and new pin sets; pre-existing, unrelated to this change.

Plan: .research/LPTS-BACKPORT-PLAN.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Why this change is needed
OpenIVM for Spark reuses lpts to convert openivm's DuckDB-dialect refresh programs into Spark-executable SQL. The existing `SqlDialect` enum only carried `DUCKDB` / `POSTGRES`, so identifiers, qualified table refs, function names and window frames all came out in DuckDB form and Spark's parser rejected them.

How
Adds a third `SqlDialect::SPARK` variant and threads it end-to-end so every CTE/scan/function/window emission path can branch on dialect.

- `src/include/sql_dialect.hpp` hosts `SqlDialect` (moved out of `lpts_pipeline.hpp` so `cte_nodes.hpp` can depend on it without a cyclic include).
- `DialectQuoteIdent(name, dialect)` helper: backticks for `SPARK`, `KeywordHelper::WriteOptionallyQuoted` for `DUCKDB` / `POSTGRES`.
- `DialectVecToQuotedIdentifierList`, `DialectQuoteTableWithOptionalSuffix` and `DialectQualifiedTableName`: every identifier emission site in `GetNode::ToQuery`, `AstFlattener::AstToInlineSQL`, and table-filter pushdown is routed through them.
- `CteBaseNode` carries the dialect; `AstFlattener::StampDialect` propagates it onto every CTE node after construction.
- `ParseSqlDialect` accepts `"spark"` / `"SPARK"`; the `lpts_dialect` setting docstring lists it.
- `BOUND_FUNCTION` emission remaps a curated set of DuckDB function names to Spark equivalents that openivm-emitted plans actually exercise (`strftime`→`date_format`, `strptime`→`to_timestamp`, `list_transform`→`transform`, `list_aggregate`→`aggregate`, `list_filter`→`filter`, `list_value`→`array`, `list_contains`→`array_contains`, `list_extract`→`element_at`). Unsupported functions surface as a Spark-side error rather than being silently mistranslated.
- Window-frame emission throws `NotImplementedException` on `GROUPS` units and `EXCLUDE` clauses when the dialect is `SPARK` (Spark SQL supports only `ROWS` / `RANGE` without exclusion).

Considerations:
- `DUCKDB` and `POSTGRES` paths fall back to the existing `KeywordHelper`-based quoting; the new Spark routing is gated on the enum.
- Conflicts with `upstream/main` (`f08974212a3946…`) resolved cleanly: overlapping additions in `cte_nodes.cpp` (kept upstream's `GetNodeColumnsAreExpressions` alongside dialect-aware helpers) and `lpts_pipeline.cpp` (kept `StampDialect(...)` and threaded upstream's new `has_recursive_cte` argument into both `CteList` constructions).

Test
- `make unittest`: all upstream `sqllogictest` cases pass (44/44 openivm sqllogictests still green; no DuckDB/Postgres dialect regressions).
- `./spark-ext/dev/dev.sh verify` against `openivm-spark` with `LPTS_COMMIT` bumped to this PR's HEAD: lint, compile, assembly, and the full parity suite (112 suites / 640 tests) succeed. The only test that flapped was `OpenIvmRocksDBSpec.recovers committed values after an unclean process exit`: a JVM-fork `process.waitFor(10, SECONDS)` that timed out under heavy concurrent load. Passes in 1.2s when re-run in isolation; unrelated to LPTS.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
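For readers who want a concrete picture of the function remapping and the window-frame guard described under How, here is a minimal standalone sketch. It is an assumption-laden illustration, not the PR's implementation: `RemapFunctionName`, `FrameUnit` and `CheckWindowFrame` are hypothetical names, `std::runtime_error` stands in for DuckDB's `NotImplementedException`, and only function names are rewritten (arguments and format strings are left untouched).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the enum hosted in src/include/sql_dialect.hpp.
enum class SqlDialect { DUCKDB, POSTGRES, SPARK };

// Renames a DuckDB function for Spark using the pairs listed in the PR
// description; unknown names pass through so Spark reports them itself.
std::string RemapFunctionName(const std::string &name, SqlDialect dialect) {
    static const std::unordered_map<std::string, std::string> kSparkNames = {
        {"strftime", "date_format"},       {"strptime", "to_timestamp"},
        {"list_transform", "transform"},   {"list_aggregate", "aggregate"},
        {"list_filter", "filter"},         {"list_value", "array"},
        {"list_contains", "array_contains"}, {"list_extract", "element_at"},
    };
    if (dialect != SqlDialect::SPARK) return name;
    auto it = kSparkNames.find(name);
    return it == kSparkNames.end() ? name : it->second;
}

// Hypothetical frame-unit enum for the guard below.
enum class FrameUnit { ROWS, RANGE, GROUPS };

// Rejects frames the Spark dialect cannot express. std::runtime_error is a
// stand-in for the NotImplementedException the PR throws.
void CheckWindowFrame(FrameUnit unit, bool has_exclude, SqlDialect dialect) {
    if (dialect != SqlDialect::SPARK) return;
    if (unit == FrameUnit::GROUPS) {
        throw std::runtime_error("GROUPS frames are not supported for SPARK");
    }
    if (has_exclude) {
        throw std::runtime_error("EXCLUDE clauses are not supported for SPARK");
    }
}
```

Failing early on unsupported frames, while letting unmapped function names pass through, mirrors the split the description draws between errors raised by lpts and errors left to surface on the Spark side.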