OpenIVM for Spark #2

Draft
mdrakiburrahman wants to merge 12 commits into ila:main from mdrakiburrahman:openivm-spark

@mdrakiburrahman mdrakiburrahman commented May 18, 2026

TODO

⚠️ Please do not review yet. Agents are still trying different approaches to get Spark to perform well with OpenIVM, by any means necessary.

mdrakiburrahman and others added 8 commits May 16, 2026 05:29
…alect

Three additions targeted at the upcoming openivm-spark Spark 3.5 / Delta
Lake extension that consumes openivm as a SQL->SQL compiler:

1. `openivm_compile_only` (BOOLEAN, default false): when set, `PRAGMA
   refresh('v')` generates the refresh SQL artifact (under
   `openivm_files_path`) but skips execution. The new short-circuit lives
   in `RefreshViewLocked` immediately after `GenerateRefreshSQL`.

2. `openivm_target_dialect` (VARCHAR, default 'duckdb'): forwarded to
   every `LogicalPlanToAst` / `AstToCteList` call site via a new
   `ReadOpenIvmTargetDialect(context)` helper in `core/sql_utils`. The
   seven affected sites:
   - `src/rules/refresh_insert_rule.cpp` (DML interception, 5 sites)
   - `src/upsert/refresh_sql.cpp` (refresh SQL build path)
   - `src/core/parser.cpp` (CREATE MATERIALIZED VIEW round-trip)

3. `PRAGMA compile_refresh('v')`: new substitute-SQL pragma that returns
   one row (refresh_type INTEGER, refresh_type_name VARCHAR, sql VARCHAR)
   describing the refresh plan in the chosen target dialect. Calls
   `GenerateRefreshSQL` directly (read-only, no locks). Cross-system
   (DuckLake) views concatenate the pre/post metadata SQL into the
   returned `sql` column; the `DUCKLAKE_SNAPSHOT_PLACEHOLDER` token is
   intentionally left unresolved for the caller to handle.
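
Because the placeholder is left for the caller, a consumer of the returned `sql` column needs a substitution step before running it. A minimal sketch of that step follows. The function name, the sample SQL string, and the choice of an integer snapshot id as the replacement value are all assumptions for illustration; the commit only states that the token is left unresolved.

```python
# Hypothetical caller-side helper; not part of openivm or openivm-spark.
PLACEHOLDER = "DUCKLAKE_SNAPSHOT_PLACEHOLDER"  # token name taken from the commit message


def resolve_snapshot_placeholder(sql: str, snapshot_id: int) -> str:
    """Replace every occurrence of the placeholder with a snapshot id.

    Assumption: the caller knows which snapshot id to use and the token
    stands for a plain integer literal in the generated SQL.
    """
    if PLACEHOLDER not in sql:
        return sql  # non-DuckLake views carry no placeholder
    return sql.replace(PLACEHOLDER, str(int(snapshot_id)))


# Made-up SQL text, only to show the substitution.
sample_sql = "SELECT * FROM t WHERE snapshot_id > DUCKLAKE_SNAPSHOT_PLACEHOLDER"
resolved_sql = resolve_snapshot_placeholder(sample_sql, 7)
print(resolved_sql)
```

Casting through `int` keeps arbitrary text from being spliced into the statement, which matters if the snapshot id ever comes from an untrusted source.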

Tests in `test/sql/compile_refresh.test` cover:
- compile_refresh returns the expected (refresh_type, name, non-empty sql)
- the MV state is unchanged after compile_refresh
- subsequent `PRAGMA refresh` still works (no setting leakage)
- `openivm_compile_only` globally suppresses execution in `PRAGMA refresh`
- `openivm_target_dialect='spark'` round-trips without error
- compile_refresh on a non-existent view fails cleanly

All 45 upstream sqllogictest files pass (5431 assertions, 0 failures).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dd companions

When openivm-spark compiles each MV in an isolated DuckDB subprocess,
the openivm_delta_tables metadata is empty (no downstream MVs are
registered yet), so the AGGREGATE_GROUP refresh path emits only an
additive view-delta, which breaks the cascade for downstream
non-additive aggregates (COUNT(*), DISTINCT, etc.).

Two C++ changes:
* New `openivm_force_view_delta_cascade` setting biases has_downstream
  to true for AGGREGATE_GROUP and AGGREGATE_HAVING refreshes only,
  ensuring the companion emission paths fire. Scoped narrowly to avoid
  triggering the SIMPLE_AGGREGATE snapshot pre/post-companion (which
  uses a CREATE TEMP TABLE / DROP TABLE pair that the Spark-side
  rewriter does not currently transform).

* refresh_sql.cpp emits BOTH companions when the flag is set AND the
  data table carries the synthetic openivm_count_star column:
   - Positive companion (existing): `mult > 0 AND EXISTS data_table` →
     emit a -1 retract row with NULL (not 0) for non-key columns so the
     SUM/COUNT cascade on a NULL aggregate state is preserved.
   - Negative companion (new): `mult < 0 AND count_star > 0 AND post-merge
     count_star > 0` → emit a +1 add-back row so a partial DELETE that
     leaves the group alive doesn't spuriously decrement downstream
     COUNT(*). The gate (m.count_star + d.mult * d.count_star > 0)
     correctly accounts for the signed count_star delta even though
     openivm emits count_star as the absolute count of input rows.
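
The arithmetic behind the negative-companion gate can be checked with a small worked example. This is a sketch of the quoted expression only, not the SQL that refresh_sql.cpp emits: the function names are hypothetical, `m_count_star` stands for the stored count on the data table, `d_count_star` for the absolute count carried by the delta row, and the other conjuncts of the condition are left out.

```python
def post_merge_count(m_count_star: int, d_mult: int, d_count_star: int) -> int:
    """Signed merge of the stored count with a delta row.

    d_count_star is an absolute row count; d_mult (+1 or -1) carries the sign.
    """
    return m_count_star + d_mult * d_count_star


def needs_add_back(m_count_star: int, d_mult: int, d_count_star: int) -> bool:
    """True when a DELETE delta leaves the group alive (gate from the commit text)."""
    return d_mult < 0 and post_merge_count(m_count_star, d_mult, d_count_star) > 0


# Partial DELETE: 5 rows in the group, 2 removed, 3 remain -> add-back row needed.
partial = needs_add_back(5, -1, 2)
# Full DELETE: 2 rows in the group, 2 removed, group disappears -> no add-back.
full = needs_add_back(2, -1, 2)
# INSERT delta: the negative companion does not apply.
insert = needs_add_back(5, +1, 2)
print(partial, full, insert)
```

Multiplying by `d_mult` is what turns the unsigned `count_star` on the delta row into a signed contribution, which is the point the commit message makes about the gate.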

MVs with a user-supplied COUNT(*) alias use that alias as the
count-monoid; for those, the negative companion is skipped (the count
column name discovery needs classifier-level metadata that
refresh_sql.cpp doesn't currently see). Documented as a known gap;
DELETE cascade may be inaccurate for downstream non-additive aggregates
over such MVs.

Verified end-to-end by mdrakiburrahman/openivm-spark's ivm-it suite
(623/623 tests passing).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ng-IN-list + derived window keys

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Brings 18 upstream commits onto the PR-2 ("OpenIVM for Spark") branch.
All openivm SQL tests pass (102 cases; HTTP-dependent suites still
skipped, same as upstream).

Conflict resolutions

- src/core/parser.cpp: kept the dialect-aware LPTS path (HEAD) layered
  inside upstream's profiling envelope; restored QueryNeedsOriginalSqlForLpts
  + PlanNeedsOriginalSqlForLpts gates (Spark-side needs string-level
  PIVOT/unsupported-aggregate detection); kept HEAD's window-partition
  metadata behavior (Refresh-time lineage analysis owns the multi-source
  decision). Also forces FULL_REFRESH classification for queries whose
  text matches QueryNeedsOriginalSqlForLpts, since DuckDB unfolds PIVOT
  before plan analysis so facts.has_pivot misses string-level signals.

- src/core/parser_plan_helpers.cpp: kept HEAD's OccurrenceColumnRef-based
  CollectInnerJoinEdges / CollectWindowLookupEdges / CollectWindowPartitionRefs
  (self-join occurrence precision + directional LEFT/RIGHT/INNER lookup
  edges) and adapted their signatures to the new CreateMVPlanFacts API
  introduced by upstream's b7b3fd0. Restored QueryNeedsOriginalSqlForLpts,
  PlanNeedsOriginalSqlForLpts, IsAggregateFunctionUnsupportedByLpts.

- src/include/upsert/refresh_internal.hpp: kept HEAD's BuildSignedMultisetDeltaInsertSQL
  + emit_cascade_delta WindowPartition argument; layered upstream's
  IsSummableLogicalType / BuildStandardDeltaRowsSQL / TryBuildGroupMeasureUpdateRefresh
  declarations.

- src/rules/join.cpp: combined upstream's SqlUtils::GetBoolSetting refactor
  with HEAD's openivm_compile_only empty-mask narrowing.

- src/rules/refresh_insert_rule.cpp: restored HEAD's IsRowIdColumn /
  IsSemiJoinOnRowId helpers (the pinned LPTS still needs the rowid SEMI
  JOIN workaround), then routed the general-DELETE path through upstream's
  refactored BuildDeleteDeltaInsertFromPlan helper.

- src/upsert/refresh_sql.cpp: combined upstream's SqlUtils::GetBoolSetting
  and lpts_start profiling with HEAD's PropagateRefreshPlanningSettings +
  dialect-aware LPTS call. Adopted upstream's build_affected_snapshot_companion(keys)
  helper for the AGGREGATE_GROUP / AGGREGATE_HAVING cascade companion
  (resolves chained.test:214 EXCEPT-ALL miss against agg_proj_top). Added
  an openivm_compile_only fallback in the GROUP_RECOMPUTE branch so
  PRAGMA compile_refresh still emits the full template even with zero
  active deltas (preserves cascade_group_recompute_delta.test behavior).

- src/upsert/refresh.cpp: returned 'true' from the openivm_compile_only
  short-circuit (upstream changed RefreshViewLocked from void to bool).

- third_party/lpts: kept ours at 70306c4 to preserve glibc-2.35 build
  compatibility (the ivm-bench spark-openivm-build builder image is on
  ubuntu:22.04). The companion lpts backport
  (mdrakiburrahman/lpts:openivm-spark-glibc-2.35 @ 2b2ff63) is used at
  pin-time via spark-ext/dev/pins.env.

Test-side adjustments

- test/sql/cascade_window_partition_coalesce.test: upstream's b7b3fd0
  collapsed window_partition_lineage_json / projection_key_lineage_json
  into a single lineage_json column; updated the regex assertion.
- test/sql/full_refresh.test: mv_pivot now classifies as FULL_REFRESH
  (type=3) on our pinned LPTS rather than upstream's AGGREGATE_GROUP
  (type=0); data correctness EXCEPT ALL assertions still hold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ution

Resolves three additional conflicts introduced by upstream's
ae5596a ("Support schema evolution for aux state"):

- src/upsert/refresh_sql.cpp: kept HEAD's CopyOpenIvmSetting +
  PropagateRefreshPlanningSettings (Spark dialect/compile-only forwarding)
  and layered upstream's HasPendingDeltaRows + RequireNoPendingAuxRepairDeltas
  + EnsureAuxState template + per-aux Ensure* wrappers on top so both
  feature sets coexist.

- src/rules/refresh_insert_rule.cpp: kept HEAD's IsRowIdColumn /
  IsSemiJoinOnRowId / PlanReferencesColumn / FindMVReferencingColumn
  helpers — the rowid SEMI JOIN workaround is still needed for our pinned
  lpts and the FindMVReferencingColumn diagnostic helper has no
  replacement upstream.

- src/core/parser_plan_helpers.cpp: kept HEAD's local WindowLineageOp
  (with source_occurrence / lookup_occurrence fields) and JSON serializer
  rather than RefreshMetadata::WindowPartitionLineageOp + the new
  RefreshMetadata::WindowPartitionLineageToJson — upstream's struct drops
  the occurrence indices that our self-join-aware lineage analysis relies
  on. The companion CollectInnerJoinEdges / CollectWindowLookupEdges
  helpers stay OccurrenceColumnRef-typed for the same reason.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reverts the AGGREGATE_GROUP / AGGREGATE_HAVING cascade companion to
PR-2's inline INSERT-with-NULLs approach. The build_affected_snapshot_companion(keys)
helper from upstream's a522636 emits openivm_old_affected_/openivm_old_snapshot_
TEMP TABLEs that ivm-common/SparkRefreshRewriter's classifier does not
recognize (it expects the old OPENIVM_OLD_<view> / OPENIVM_NEW_<view>
naming), and Spark cannot carry TEMP TABLEs across the statement-by-
statement refresh program the rewriter emits anyway.

NULL companions are safe: SQL's SUM/COUNT/MIN/MAX ignore NULL values in
non-key columns, so the NULL-valued retract/add-back rows contribute
nothing to the downstream aggregate state while still firing the
count-monoid transitions that the cascade requires.
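
The NULL-skipping behavior this relies on is standard SQL and can be observed in any engine. The sketch below uses Python's built-in SQLite rather than DuckDB or Spark, and the table and column names are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE delta (grp TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO delta VALUES (?, ?)",
    [("a", 10), ("a", 20), ("a", None)],  # the last row plays the NULL companion
)

row = conn.execute(
    "SELECT SUM(amount), COUNT(amount), MIN(amount), MAX(amount), COUNT(*) "
    "FROM delta WHERE grp = 'a'"
).fetchone()
conn.close()

# Column aggregates skip the NULL; COUNT(*) still sees the extra row.
print(row)  # (30, 2, 10, 20, 3)
```

Only `COUNT(*)` registers the companion row, which matches the description above: the value aggregates are untouched while the row count still changes.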

This breaks the upstream-added test test/sql/chained.test:214
(agg_proj_top EXCEPT ALL agg_proj_sales) which depends on the snapshot
helper. Documenting this as a known divergence: spark-ext consumers
take priority over upstream's openivm-only tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman pushed a commit to mdrakiburrahman/openivm-spark that referenced this pull request May 22, 2026
…h (2b2ff63)

openivm @ 2b712a0 brings ila/openivm#2 up to date with upstream/main
(through ae5596a 'Support schema evolution for aux state') and re-resolves
the merge with PR-2's Spark integration intact. All 102 openivm SQL tests
pass except the upstream-added chained.test:214, which depends on
build_affected_snapshot_companion temp-table semantics that
ivm-common/SparkRefreshRewriter cannot carry across statements — see
the openivm-side commit message for the documented divergence.

lpts moves to the new openivm-spark-glibc-2.35 branch @ 2b2ff63, which
backports ila/lpts@01fa3eb 'Quote scan identifiers in SQL output' onto
the 6de6f67 base so the ivm-bench spark-openivm-build image stays on
Ubuntu 22.04 / glibc-2.35.

Full ./spark-ext/dev/dev.sh verify is green at 646/646 with these pins.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman and others added 4 commits May 23, 2026 16:07
Until ila/lpts#9 (mdrakiburrahman/lpts:openivm-spark) is merged into
ila/lpts:main, openivm PR ila#2 needs the lpts submodule to
resolve to a commit that includes the Spark-facing changes (SPARK
dialect, hidden-projection bindings, scan identifier quoting, etc.)
plus the one-line std::move fix that lets the source compile under
GCC 11 / -std=c++11. Pointing the submodule URL at the fork keeps
those diffs out of the openivm PR view.

Once ila/lpts#9 merges, this commit gets reverted back to
ila/lpts.git with the lpts main SHA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pins.env, ivm-bench's spark-openivm-build Dockerfile, and now
.gitmodules all agree on the source branch (openivm-spark). The
field is informational — submodule resolution still goes by the
recorded gitlink SHA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>