OpenIVM for Spark by mdrakiburrahman · Pull Request #2 · ila/openivm

mdrakiburrahman · 2026-05-18T02:38:31Z

TODO

⚠️ Please do not review yet. Agents are still trying different approaches to get Spark to perform with OpenIVM by any means necessary.

…alect

Three additions targeted at the upcoming openivm-spark Spark 3.5 / Delta Lake extension, which consumes openivm as a SQL->SQL compiler:

1. `openivm_compile_only` (BOOLEAN, default false): when set, `PRAGMA refresh('v')` generates the refresh SQL artifact (under `openivm_files_path`) but skips execution. The new short-circuit lives in `RefreshViewLocked` immediately after `GenerateRefreshSQL`.

2. `openivm_target_dialect` (VARCHAR, default 'duckdb'): forwarded to every `LogicalPlanToAst` / `AstToCteList` call site via a new `ReadOpenIvmTargetDialect(context)` helper in `core/sql_utils`. The seven affected sites:
   - `src/rules/refresh_insert_rule.cpp` (DML interception, 5 sites)
   - `src/upsert/refresh_sql.cpp` (refresh SQL build path)
   - `src/core/parser.cpp` (CREATE MATERIALIZED VIEW round-trip)

3. `PRAGMA compile_refresh('v')`: new substitute-SQL pragma that returns one row (refresh_type INTEGER, refresh_type_name VARCHAR, sql VARCHAR) describing the refresh plan in the chosen target dialect. Calls `GenerateRefreshSQL` directly (read-only, no locks). Cross-system (DuckLake) views concatenate the pre/post metadata SQL into the returned `sql` column; the `DUCKLAKE_SNAPSHOT_PLACEHOLDER` token is intentionally left unresolved for the caller to handle.

Tests in `test/sql/compile_refresh.test` cover:
- compile_refresh returns the expected (refresh_type, name, non-empty sql)
- the MV state is unchanged after compile_refresh
- subsequent `PRAGMA refresh` still works (no setting leakage)
- `openivm_compile_only` globally suppresses execution in `PRAGMA refresh`
- `openivm_target_dialect='spark'` round-trips without error
- compile_refresh on a non-existent view fails cleanly

All 45 upstream sqllogictest files pass (5431 assertions, 0 failures).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
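As a rough illustration of how a caller might drive the two compile paths described in this commit message, the sketch below only builds the SQL statements. The setting and pragma names (`openivm_compile_only`, `openivm_target_dialect`, `PRAGMA refresh`, `PRAGMA compile_refresh`, `DUCKLAKE_SNAPSHOT_PLACEHOLDER`) come from the commit message. The Python helper names are hypothetical, and the use of DuckDB's ordinary `SET name = value` syntax for these settings is an assumption.

```python
from typing import List

# Token named in the commit message; how a caller resolves it is not specified there.
SNAPSHOT_PLACEHOLDER = "DUCKLAKE_SNAPSHOT_PLACEHOLDER"


def sql_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def compile_refresh_statements(view: str, dialect: str = "duckdb") -> List[str]:
    """Statements for the read-only path: pick a dialect, then ask for the plan row."""
    return [
        f"SET openivm_target_dialect = {sql_literal(dialect)};",
        f"PRAGMA compile_refresh({sql_literal(view)});",
    ]


def compile_only_refresh_statements(view: str) -> List[str]:
    """Statements for the artifact path: refresh SQL is generated but not executed."""
    return [
        "SET openivm_compile_only = true;",
        f"PRAGMA refresh({sql_literal(view)});",
    ]


def has_unresolved_snapshot(sql: str) -> bool:
    """True when the returned SQL still carries the placeholder the caller must handle."""
    return SNAPSHOT_PLACEHOLDER in sql


if __name__ == "__main__":
    for stmt in compile_refresh_statements("v", "spark"):
        print(stmt)
```

Keeping statement construction separate from execution means the same helpers could feed a DuckDB subprocess or a test harness; checking for the placeholder before handing the SQL to Spark is one way to catch the unresolved-token case early.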

…dd companions

When openivm-spark compiles each MV in an isolated DuckDB subprocess, the openivm_delta_tables metadata is empty (no downstream MVs registered yet), so the AGGREGATE_GROUP refresh path emits only an additive view-delta that breaks cascade for downstream non-additive aggregates (COUNT(*), DISTINCT etc.).

Two C++ changes:

* New `openivm_force_view_delta_cascade` setting biases has_downstream to true for AGGREGATE_GROUP and AGGREGATE_HAVING refreshes only, ensuring the companion emission paths fire. Scoped narrowly to avoid triggering the SIMPLE_AGGREGATE snapshot pre/post-companion (which uses a CREATE TEMP TABLE / DROP TABLE pair that the Spark-side rewriter does not currently transform).

* refresh_sql.cpp emits BOTH companions when the flag is set AND the data table carries the synthetic openivm_count_star column:
  - Positive companion (existing): `mult > 0 AND EXISTS data_table` → emit a -1 retract row with NULL (not 0) for non-key columns so the SUM/COUNT cascade on a NULL aggregate state is preserved.
  - Negative companion (new): `mult < 0 AND count_star > 0 AND post-merge count_star > 0` → emit a +1 add-back row so a partial DELETE that leaves the group alive doesn't spuriously decrement downstream COUNT(*). The gate (m.count_star + d.mult * d.count_star > 0) correctly accounts for the signed count_star delta even though openivm emits count_star as the absolute count of input rows.

MVs with a user-supplied COUNT(*) alias use that alias as the count-monoid; for those, the negative companion is skipped (the count column name discovery needs classifier-level metadata that refresh_sql.cpp doesn't currently see). Documented as a known gap; DELETE cascade may be inaccurate for downstream non-additive aggregates over such MVs.

Verified end-to-end by mdrakiburrahman/openivm-spark's ivm-it suite (623/623 tests passing).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
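The negative-companion gate quoted in this commit message is plain integer arithmetic, so a small worked sketch may help. The expression `m.count_star + d.mult * d.count_star > 0` is taken from the message. The function names are hypothetical, and reading the middle `count_star > 0` condition as the stored (pre-merge) count is an assumption, since the message does not say which side it refers to.

```python
def post_merge_count(m_count_star: int, d_mult: int, d_count_star: int) -> int:
    """Group row count after applying a signed delta.

    d_count_star is an absolute count of input rows, so the sign comes from d_mult.
    """
    return m_count_star + d_mult * d_count_star


def emits_negative_companion(m_count_star: int, d_mult: int, d_count_star: int) -> bool:
    """Gate for the +1 add-back row: a delete that leaves the group alive."""
    return (
        d_mult < 0
        and m_count_star > 0  # assumption: refers to the stored count
        and post_merge_count(m_count_star, d_mult, d_count_star) > 0
    )


if __name__ == "__main__":
    # Group holds 5 rows, 2 are deleted: 5 + (-1 * 2) = 3, the group survives.
    print(emits_negative_companion(5, -1, 2))  # True
    # All 5 rows deleted: 5 + (-1 * 5) = 0, the group disappears.
    print(emits_negative_companion(5, -1, 5))  # False
    # Insert delta: the negative companion does not apply.
    print(emits_negative_companion(5, 1, 2))   # False
```

The middle case shows why the gate exists: once a group is fully deleted, a downstream COUNT(*) should genuinely drop, so no add-back row is wanted.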

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ng-IN-list + derived window keys

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Brings 18 upstream commits onto the PR-2 ("OpenIVM for Spark") branch. All openivm SQL tests pass (102 cases; HTTP-dependent suites still skipped, same as upstream).

Conflict resolutions

- src/core/parser.cpp: kept the dialect-aware LPTS path (HEAD) layered inside upstream's profiling envelope; restored QueryNeedsOriginalSqlForLpts + PlanNeedsOriginalSqlForLpts gates (Spark-side needs string-level PIVOT/unsupported-aggregate detection); kept HEAD's window-partition metadata behavior (Refresh-time lineage analysis owns the multi-source decision). Also forces FULL_REFRESH classification for queries whose text matches QueryNeedsOriginalSqlForLpts, since DuckDB unfolds PIVOT before plan analysis so facts.has_pivot misses string-level signals.

- src/core/parser_plan_helpers.cpp: kept HEAD's OccurrenceColumnRef-based CollectInnerJoinEdges / CollectWindowLookupEdges / CollectWindowPartitionRefs (self-join occurrence precision + directional LEFT/RIGHT/INNER lookup edges) and adapted their signatures to the new CreateMVPlanFacts API introduced by upstream's b7b3fd0. Restored QueryNeedsOriginalSqlForLpts, PlanNeedsOriginalSqlForLpts, IsAggregateFunctionUnsupportedByLpts.

- src/include/upsert/refresh_internal.hpp: kept HEAD's BuildSignedMultisetDeltaInsertSQL + emit_cascade_delta WindowPartition argument; layered upstream's IsSummableLogicalType / BuildStandardDeltaRowsSQL / TryBuildGroupMeasureUpdateRefresh declarations.

- src/rules/join.cpp: combined upstream's SqlUtils::GetBoolSetting refactor with HEAD's openivm_compile_only empty-mask narrowing.

- src/rules/refresh_insert_rule.cpp: restored HEAD's IsRowIdColumn / IsSemiJoinOnRowId helpers (the pinned LPTS still needs the rowid SEMI JOIN workaround), then routed the general-DELETE path through upstream's refactored BuildDeleteDeltaInsertFromPlan helper.

- src/upsert/refresh_sql.cpp: combined upstream's SqlUtils::GetBoolSetting and lpts_start profiling with HEAD's PropagateRefreshPlanningSettings + dialect-aware LPTS call. Adopted upstream's build_affected_snapshot_companion(keys) helper for the AGGREGATE_GROUP / AGGREGATE_HAVING cascade companion (resolves chained.test:214 EXCEPT-ALL miss against agg_proj_top). Added an openivm_compile_only fallback in the GROUP_RECOMPUTE branch so PRAGMA compile_refresh still emits the full template even with zero active deltas (preserves cascade_group_recompute_delta.test behavior).

- src/upsert/refresh.cpp: returned 'true' from the openivm_compile_only short-circuit (upstream changed RefreshViewLocked from void to bool).

- third_party/lpts: kept ours at 70306c4 to preserve glibc-2.35 build compatibility (the ivm-bench spark-openivm-build builder image is on ubuntu:22.04). The companion lpts backport (mdrakiburrahman/lpts:openivm-spark-glibc-2.35 @ 2b2ff63) is used at pin-time via spark-ext/dev/pins.env.

Test-side adjustments

- test/sql/cascade_window_partition_coalesce.test: upstream's b7b3fd0 collapsed window_partition_lineage_json / projection_key_lineage_json into a single lineage_json column; updated the regex assertion.

- test/sql/full_refresh.test: mv_pivot now classifies as FULL_REFRESH (type=3) on our pinned LPTS rather than upstream's AGGREGATE_GROUP (type=0); data correctness EXCEPT ALL assertions still hold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
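The merge notes above say that queries whose text matches QueryNeedsOriginalSqlForLpts are forced to FULL_REFRESH, because PIVOT is unfolded before plan analysis. The sketch below is a deliberately simplified stand-in for that idea, not the C++ helper itself: the regex check and the function names are hypothetical, and only the type numbers (FULL_REFRESH = 3, AGGREGATE_GROUP = 0) are taken from the test-side notes.

```python
import re

# Refresh type numbers as quoted in the commit message.
AGGREGATE_GROUP = 0
FULL_REFRESH = 3

# Hypothetical string-level signal: the PIVOT keyword as a whole word.
# \b on both sides keeps UNPIVOT from matching.
_PIVOT_RE = re.compile(r"\bPIVOT\b", re.IGNORECASE)


def text_needs_original_sql(query: str) -> bool:
    """Stand-in for a text-level gate; the real helper also covers unsupported aggregates."""
    return _PIVOT_RE.search(query) is not None


def classify(query: str, plan_based_type: int) -> int:
    """Override the plan-based classification when the query text carries a PIVOT signal."""
    return FULL_REFRESH if text_needs_original_sql(query) else plan_based_type


if __name__ == "__main__":
    pivot_query = "SELECT * FROM (PIVOT sales ON year USING sum(amount))"
    plain_query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    print(classify(pivot_query, AGGREGATE_GROUP))  # 3
    print(classify(plain_query, AGGREGATE_GROUP))  # 0
```

A bare regex like this would also fire on the word PIVOT inside a string literal or comment, so it only illustrates why a text-level check can see what the unfolded plan no longer shows.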

…ution

Resolves three additional conflicts introduced by upstream's ae5596a ("Support schema evolution for aux state"):

- src/upsert/refresh_sql.cpp: kept HEAD's CopyOpenIvmSetting + PropagateRefreshPlanningSettings (Spark dialect/compile-only forwarding) and layered upstream's HasPendingDeltaRows + RequireNoPendingAuxRepairDeltas + EnsureAuxState template + per-aux Ensure* wrappers on top so both feature sets coexist.

- src/rules/refresh_insert_rule.cpp: kept HEAD's IsRowIdColumn / IsSemiJoinOnRowId / PlanReferencesColumn / FindMVReferencingColumn helpers. The rowid SEMI JOIN workaround is still needed for our pinned lpts, and the FindMVReferencingColumn diagnostic helper has no replacement upstream.

- src/core/parser_plan_helpers.cpp: kept HEAD's local WindowLineageOp (with source_occurrence / lookup_occurrence fields) and JSON serializer rather than RefreshMetadata::WindowPartitionLineageOp + the new RefreshMetadata::WindowPartitionLineageToJson, because upstream's struct drops the occurrence indices that our self-join-aware lineage analysis relies on. The companion CollectInnerJoinEdges / CollectWindowLookupEdges helpers stay OccurrenceColumnRef-typed for the same reason.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reverts the AGGREGATE_GROUP / AGGREGATE_HAVING cascade companion to PR-2's inline INSERT-with-NULLs approach. The build_affected_snapshot_companion(keys) helper from upstream's a522636 emits openivm_old_affected_/openivm_old_snapshot_ TEMP TABLEs that ivm-common/SparkRefreshRewriter's classifier does not recognize (it expects the old OPENIVM_OLD_<view> / OPENIVM_NEW_<view> naming), and Spark cannot carry TEMP TABLEs across the statement-by-statement refresh program the rewriter emits anyway.

NULL companions are safe: SQL's SUM/COUNT(col)/MIN/MAX ignore NULL inputs, so the retract/add-back rows, whose non-key columns are NULL, contribute nothing to the downstream aggregate state while still firing the count-monoid transitions cascade requires.

This breaks the upstream-added test test/sql/chained.test:214 (agg_proj_top EXCEPT ALL agg_proj_sales), which depends on the snapshot helper. Documenting this as a known divergence: spark-ext consumers take priority over upstream's openivm-only tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
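The claim that NULL companion rows are harmless to downstream aggregates rests on standard SQL NULL handling, which can be checked in a few lines. The sketch below uses Python's built-in SQLite as a stand-in engine (an assumption made for convenience; it is neither DuckDB nor Spark), with a made-up `delta` table, to compare a companion row carrying NULL against one carrying 0.

```python
import sqlite3


def aggregate_with_companion(companion_value):
    """Aggregate two real rows plus one companion row whose non-key column is companion_value."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE delta (k TEXT, v INTEGER)")
        conn.executemany(
            "INSERT INTO delta VALUES (?, ?)",
            [("a", 10), ("a", 20), ("a", companion_value)],
        )
        return conn.execute(
            "SELECT SUM(v), COUNT(v), MIN(v), MAX(v), COUNT(*) FROM delta WHERE k = 'a'"
        ).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    # NULL companion: SUM, COUNT(v), MIN, MAX are unchanged; only COUNT(*) sees the row.
    print(aggregate_with_companion(None))  # (30, 2, 10, 20, 3)
    # Zero companion: COUNT(v) and MIN are disturbed.
    print(aggregate_with_companion(0))     # (30, 3, 0, 20, 3)
```

The second call shows why the earlier commit insists on NULL rather than 0 for non-key columns: a zero leaves SUM alone but shifts COUNT(v) and MIN.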

…h (2b2ff63)

openivm @ 2b712a0 brings ila/openivm#2 up to date with upstream/main (through ae5596a 'Support schema evolution for aux state') and re-resolves the merge with PR-2's Spark integration intact. All 102 openivm SQL tests pass except the upstream-added chained.test:214, which depends on build_affected_snapshot_companion temp-table semantics that ivm-common/SparkRefreshRewriter cannot carry across statements; see the openivm-side commit message for the documented divergence.

lpts moves to the new openivm-spark-glibc-2.35 branch @ 2b2ff63, which backports ila/lpts@01fa3eb 'Quote scan identifiers in SQL output' onto the 6de6f67 base so the ivm-bench spark-openivm-build image stays on Ubuntu 22.04 / glibc-2.35.

Full ./spark-ext/dev/dev.sh verify is green at 646/646 with these pins.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Until ila/lpts#9 (mdrakiburrahman/lpts:openivm-spark) is merged into ila/lpts:main, openivm PR ila#2 needs the lpts submodule to resolve to a commit that includes the Spark-facing changes (SPARK dialect, hidden-projection bindings, scan identifier quoting, etc.) plus the one-line std::move fix that lets the source compile under GCC 11 / -std=c++11. Pointing the submodule URL at the fork keeps those diffs out of the openivm PR view.

Once ila/lpts#9 merges, this commit gets reverted back to ila/lpts.git with the lpts main SHA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

pins.env, ivm-bench's spark-openivm-build Dockerfile, and now .gitmodules all agree on the source branch (openivm-spark). The branch field is informational; submodule resolution still goes by the recorded gitlink SHA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mdrakiburrahman and others added 8 commits May 16, 2026 05:29

feat: emit openivm_delta_<view> from recompute paths

4471f4e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: emit cascade view-delta for joined WINDOW_PARTITION MVs

a1001e9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: emit cascade view-delta for multi-source SIMPLE_PROJECTION + lo…

9543f56

…ng-IN-list + derived window keys

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mdrakiburrahman force-pushed the openivm-spark branch from 1da9be7 to 2b712a0 Compare May 23, 2026 07:22

mdrakiburrahman and others added 4 commits May 23, 2026 16:07

chore(pins-fix): snapshot working tree

35eaed0

chore(pins-fix): snapshot working tree

465b299

mdrakiburrahman wants to merge 12 commits into ila:main from mdrakiburrahman:openivm-spark
