OpenIVM for Spark #2
Draft
mdrakiburrahman wants to merge 12 commits into
…alect
Three additions targeted at the upcoming openivm-spark Spark 3.5 / Delta
Lake extension that consumes openivm as a SQL->SQL compiler:
1. `openivm_compile_only` (BOOLEAN, default false): when set, `PRAGMA
refresh('v')` generates the refresh SQL artifact (under
`openivm_files_path`) but skips execution. The new short-circuit lives
in `RefreshViewLocked` immediately after `GenerateRefreshSQL`.
2. `openivm_target_dialect` (VARCHAR, default 'duckdb'): forwarded to
every `LogicalPlanToAst` / `AstToCteList` call site via a new
`ReadOpenIvmTargetDialect(context)` helper in `core/sql_utils`. The
seven affected sites:
- `src/rules/refresh_insert_rule.cpp` (DML interception, 5 sites)
- `src/upsert/refresh_sql.cpp` (refresh SQL build path)
- `src/core/parser.cpp` (CREATE MATERIALIZED VIEW round-trip)
3. `PRAGMA compile_refresh('v')`: new substitute-SQL pragma that returns
one row (refresh_type INTEGER, refresh_type_name VARCHAR, sql VARCHAR)
describing the refresh plan in the chosen target dialect. Calls
`GenerateRefreshSQL` directly (read-only, no locks). Cross-system
(DuckLake) views concatenate the pre/post metadata SQL into the
returned `sql` column; the `DUCKLAKE_SNAPSHOT_PLACEHOLDER` token is
intentionally left unresolved for the caller to handle.
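As an illustration of what "left unresolved for the caller to handle" could mean on the consuming side, here is a minimal Python sketch. The row shape (refresh_type, refresh_type_name, sql) and the token name come from the description above. The sample SQL text, the integer snapshot id, and the plain string substitution are assumptions made for the example, not the behaviour of openivm-spark.

```python
from typing import NamedTuple

# Token named in the commit message; compile_refresh leaves it in the SQL.
PLACEHOLDER = "DUCKLAKE_SNAPSHOT_PLACEHOLDER"


class CompiledRefresh(NamedTuple):
    """One row as returned by PRAGMA compile_refresh('v')."""
    refresh_type: int
    refresh_type_name: str
    sql: str


def resolve_snapshot(row: CompiledRefresh, snapshot_id: int) -> CompiledRefresh:
    """Return a copy of `row` with every placeholder token substituted.

    Hypothetical caller-side helper: how the real consumer resolves the
    token is not specified in the source.
    """
    if PLACEHOLDER not in row.sql:
        return row
    return row._replace(sql=row.sql.replace(PLACEHOLDER, str(snapshot_id)))


# Made-up compiled row, for illustration only.
row = CompiledRefresh(
    refresh_type=0,
    refresh_type_name="AGGREGATE_GROUP",
    sql="SELECT * FROM delta_t WHERE snapshot_id > DUCKLAKE_SNAPSHOT_PLACEHOLDER;",
)
resolved = resolve_snapshot(row, 42)
print(resolved.sql)
```

Keeping the substitution outside the compiler means the same compiled artifact can be reused across snapshots, at the cost of the caller having to know the token name.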
Tests in `test/sql/compile_refresh.test` cover:
- compile_refresh returns the expected (refresh_type, name, non-empty sql)
- the MV state is unchanged after compile_refresh
- subsequent `PRAGMA refresh` still works (no setting leakage)
- `openivm_compile_only` globally suppresses execution in `PRAGMA refresh`
- `openivm_target_dialect='spark'` round-trips without error
- compile_refresh on a non-existent view fails cleanly
All 45 upstream sqllogictest files pass (5431 assertions, 0 failures).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dd companions
When openivm-spark compiles each MV in an isolated DuckDB subprocess,
the openivm_delta_tables metadata is empty (no downstream MVs registered
yet), so the AGGREGATE_GROUP refresh path emits only an additive view-delta
that breaks cascade for downstream non-additive aggregates (COUNT(*),
DISTINCT etc.).
Two C++ changes:
* New `openivm_force_view_delta_cascade` setting biases has_downstream
to true for AGGREGATE_GROUP and AGGREGATE_HAVING refreshes only,
ensuring the companion emission paths fire. Scoped narrowly to avoid
triggering the SIMPLE_AGGREGATE snapshot pre/post-companion (which
uses a CREATE TEMP TABLE / DROP TABLE pair that the Spark-side
rewriter does not currently transform).
* refresh_sql.cpp emits BOTH companions when the flag is set AND the
data table carries the synthetic openivm_count_star column:
- Positive companion (existing): `mult > 0 AND EXISTS data_table` →
emit a -1 retract row with NULL (not 0) for non-key columns so the
SUM/COUNT cascade on a NULL aggregate state is preserved.
- Negative companion (new): `mult < 0 AND count_star > 0 AND post-merge
count_star > 0` → emit a +1 add-back row so a partial DELETE that
leaves the group alive doesn't spuriously decrement downstream
COUNT(*). The gate (m.count_star + d.mult * d.count_star > 0)
correctly accounts for the signed count_star delta even though
openivm emits count_star as the absolute count of input rows.
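The gate arithmetic can be checked with a small Python model. This is a sketch under stated assumptions: `m` stands for the stored (pre-merge) data-table row and `d` for the delta row, and the middle condition `count_star > 0` is read as applying to the stored count. The function names are hypothetical and do not appear in refresh_sql.cpp.

```python
def post_merge_count(m_count_star: int, d_mult: int, d_count_star: int) -> int:
    """Group size after applying the delta.

    d_count_star is the absolute number of input rows in the delta, so the
    sign has to come from d_mult (+1 for inserts, -1 for deletes).
    """
    return m_count_star + d_mult * d_count_star


def needs_negative_companion(m_count_star: int, d_mult: int, d_count_star: int) -> bool:
    """True when a +1 add-back row should accompany a delete delta.

    Mirrors: mult < 0 AND count_star > 0 AND post-merge count_star > 0.
    """
    return (
        d_mult < 0
        and m_count_star > 0
        and post_merge_count(m_count_star, d_mult, d_count_star) > 0
    )


# Partial DELETE: 5 rows stored, 2 deleted, the group survives with 3 rows.
print(needs_negative_companion(5, -1, 2))
# Full DELETE: 5 rows stored, 5 deleted, the group disappears.
print(needs_negative_companion(5, -1, 5))
# INSERT delta: handled by the positive companion instead.
print(needs_negative_companion(5, +1, 2))
```

Under this reading, only the partial delete gets an add-back row; a delete that empties the group lets the retraction stand so the downstream COUNT(*) is decremented.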
MVs with a user-supplied COUNT(*) alias use that alias as the
count-monoid; for those, the negative companion is skipped (the count
column name discovery needs classifier-level metadata that
refresh_sql.cpp doesn't currently see). Documented as a known gap;
DELETE cascade may be inaccurate for downstream non-additive aggregates
over such MVs.
Verified end-to-end by mdrakiburrahman/openivm-spark's ivm-it suite
(623/623 tests passing).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ng-IN-list + derived window keys
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Brings 18 upstream commits onto the PR-2 ("OpenIVM for Spark") branch.
All openivm SQL tests pass (102 cases; HTTP-dependent suites still
skipped, same as upstream).
Conflict resolutions
- src/core/parser.cpp: kept the dialect-aware LPTS path (HEAD) layered
inside upstream's profiling envelope; restored QueryNeedsOriginalSqlForLpts
+ PlanNeedsOriginalSqlForLpts gates (Spark-side needs string-level
PIVOT/unsupported-aggregate detection); kept HEAD's window-partition
metadata behavior (Refresh-time lineage analysis owns the multi-source
decision). Also forces FULL_REFRESH classification for queries whose
text matches QueryNeedsOriginalSqlForLpts, since DuckDB unfolds PIVOT
before plan analysis so facts.has_pivot misses string-level signals.
- src/core/parser_plan_helpers.cpp: kept HEAD's OccurrenceColumnRef-based
CollectInnerJoinEdges / CollectWindowLookupEdges / CollectWindowPartitionRefs
(self-join occurrence precision + directional LEFT/RIGHT/INNER lookup
edges) and adapted their signatures to the new CreateMVPlanFacts API
introduced by upstream's b7b3fd0. Restored QueryNeedsOriginalSqlForLpts,
PlanNeedsOriginalSqlForLpts, IsAggregateFunctionUnsupportedByLpts.
- src/include/upsert/refresh_internal.hpp: kept HEAD's BuildSignedMultisetDeltaInsertSQL
+ emit_cascade_delta WindowPartition argument; layered upstream's
IsSummableLogicalType / BuildStandardDeltaRowsSQL / TryBuildGroupMeasureUpdateRefresh
declarations.
- src/rules/join.cpp: combined upstream's SqlUtils::GetBoolSetting refactor
with HEAD's openivm_compile_only empty-mask narrowing.
- src/rules/refresh_insert_rule.cpp: restored HEAD's IsRowIdColumn /
IsSemiJoinOnRowId helpers (the pinned LPTS still needs the rowid SEMI
JOIN workaround), then routed the general-DELETE path through upstream's
refactored BuildDeleteDeltaInsertFromPlan helper.
- src/upsert/refresh_sql.cpp: combined upstream's SqlUtils::GetBoolSetting
and lpts_start profiling with HEAD's PropagateRefreshPlanningSettings +
dialect-aware LPTS call. Adopted upstream's build_affected_snapshot_companion(keys)
helper for the AGGREGATE_GROUP / AGGREGATE_HAVING cascade companion
(resolves chained.test:214 EXCEPT-ALL miss against agg_proj_top). Added
an openivm_compile_only fallback in the GROUP_RECOMPUTE branch so
PRAGMA compile_refresh still emits the full template even with zero
active deltas (preserves cascade_group_recompute_delta.test behavior).
- src/upsert/refresh.cpp: returned 'true' from the openivm_compile_only
short-circuit (upstream changed RefreshViewLocked from void to bool).
- third_party/lpts: kept ours at 70306c4 to preserve glibc-2.35 build
compatibility (the ivm-bench spark-openivm-build builder image is on
ubuntu:22.04). The companion lpts backport
(mdrakiburrahman/lpts:openivm-spark-glibc-2.35 @ 2b2ff63) is used at
pin-time via spark-ext/dev/pins.env.
Test-side adjustments
- test/sql/cascade_window_partition_coalesce.test: upstream's b7b3fd0
collapsed window_partition_lineage_json / projection_key_lineage_json
into a single lineage_json column; updated the regex assertion.
- test/sql/full_refresh.test: mv_pivot now classifies as FULL_REFRESH
(type=3) on our pinned LPTS rather than upstream's AGGREGATE_GROUP
(type=0); data correctness EXCEPT ALL assertions still hold.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ution
Resolves three additional conflicts introduced by upstream's ae5596a
("Support schema evolution for aux state"):
- src/upsert/refresh_sql.cpp: kept HEAD's CopyOpenIvmSetting +
PropagateRefreshPlanningSettings (Spark dialect/compile-only forwarding)
and layered upstream's HasPendingDeltaRows +
RequireNoPendingAuxRepairDeltas + EnsureAuxState template + per-aux
Ensure* wrappers on top so both feature sets coexist.
- src/rules/refresh_insert_rule.cpp: kept HEAD's IsRowIdColumn /
IsSemiJoinOnRowId / PlanReferencesColumn / FindMVReferencingColumn
helpers. The rowid SEMI JOIN workaround is still needed for our pinned
lpts, and the FindMVReferencingColumn diagnostic helper has no
replacement upstream.
- src/core/parser_plan_helpers.cpp: kept HEAD's local WindowLineageOp
(with source_occurrence / lookup_occurrence fields) and JSON serializer
rather than RefreshMetadata::WindowPartitionLineageOp + the new
RefreshMetadata::WindowPartitionLineageToJson, because upstream's struct
drops the occurrence indices that our self-join-aware lineage analysis
relies on. The companion CollectInnerJoinEdges /
CollectWindowLookupEdges helpers stay OccurrenceColumnRef-typed for the
same reason.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reverts the AGGREGATE_GROUP / AGGREGATE_HAVING cascade companion to
PR-2's inline INSERT-with-NULLs approach. The
build_affected_snapshot_companion(keys) helper from upstream's a522636
emits openivm_old_affected_/openivm_old_snapshot_ TEMP TABLEs that
ivm-common/SparkRefreshRewriter's classifier does not recognize (it
expects the old OPENIVM_OLD_<view> / OPENIVM_NEW_<view> naming), and
Spark cannot carry TEMP TABLEs across the statement-by-statement refresh
program the rewriter emits anyway.
NULL companions are safe: SQL's SUM/COUNT/MIN/MAX ignore NULL non-key
columns, so the NULL-padded retract/add-back rows contribute nothing to
the downstream aggregate state while still firing the count-monoid
transitions that cascade requires.
This breaks the upstream-added test test/sql/chained.test:214
(agg_proj_top EXCEPT ALL agg_proj_sales), which depends on the snapshot
helper. Documenting this as a known divergence: spark-ext consumers take
priority over upstream's openivm-only tests.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
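The NULL-handling claim is standard SQL aggregate behaviour and can be checked with any engine. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in. It is not DuckDB or Spark, and the table and column names are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE delta (k TEXT, v INTEGER)")
con.executemany(
    "INSERT INTO delta VALUES (?, ?)",
    [
        ("a", 10),
        ("a", 5),
        ("a", None),  # companion-style row: key present, non-key column NULL
    ],
)

# Column aggregates skip the NULL; COUNT(*) still sees the extra row.
row = con.execute(
    "SELECT SUM(v), COUNT(v), MIN(v), MAX(v), COUNT(*) FROM delta WHERE k = 'a'"
).fetchone()
print(row)  # (15, 2, 5, 10, 3)
```

The last column is the relevant one for cascade: the NULL row leaves SUM, COUNT(v), MIN and MAX untouched, yet it is still counted by COUNT(*).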
mdrakiburrahman pushed a commit to mdrakiburrahman/openivm-spark that referenced this pull request on May 22, 2026
…h (2b2ff63)
openivm @ 2b712a0 brings ila/openivm#2 up to date with upstream/main
(through ae5596a 'Support schema evolution for aux state') and
re-resolves the merge with PR-2's Spark integration intact. All 102
openivm SQL tests pass except the upstream-added chained.test:214, which
depends on build_affected_snapshot_companion temp-table semantics that
ivm-common/SparkRefreshRewriter cannot carry across statements; see the
openivm-side commit message for the documented divergence.
lpts moves to the new openivm-spark-glibc-2.35 branch @ 2b2ff63, which
backports ila/lpts@01fa3eb 'Quote scan identifiers in SQL output' onto
the 6de6f67 base so the ivm-bench spark-openivm-build image stays on
Ubuntu 22.04 / glibc-2.35.
Full ./spark-ext/dev/dev.sh verify is green at 646/646 with these pins.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1da9be7 to
2b712a0
Until ila/lpts#9 (mdrakiburrahman/lpts:openivm-spark) is merged into
ila/lpts:main, openivm PR ila#2 needs the lpts submodule to resolve to a
commit that includes the Spark-facing changes (SPARK dialect,
hidden-projection bindings, scan identifier quoting, etc.) plus the
one-line std::move fix that lets the source compile under GCC 11 /
-std=c++11. Pointing the submodule URL at the fork keeps those diffs out
of the openivm PR view.
Once ila/lpts#9 merges, this commit gets reverted back to ila/lpts.git
with the lpts main SHA.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pins.env, ivm-bench's spark-openivm-build Dockerfile, and now
.gitmodules all agree on the source branch (openivm-spark). The branch
field in .gitmodules is informational only: submodule resolution still
goes by the recorded gitlink SHA.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>