Skip to content

Register DISTINCT_COUNT_APPROX logical marker in PPLFuncImpTable#5466

Open
songkant-aws wants to merge 1 commit into
opensearch-project:mainfrom
songkant-aws:dc-marker-registration
Open

Register DISTINCT_COUNT_APPROX logical marker in PPLFuncImpTable#5466
songkant-aws wants to merge 1 commit into
opensearch-project:mainfrom
songkant-aws:dc-marker-registration

Conversation

@songkant-aws
Copy link
Copy Markdown
Collaborator

Description

PPL parser fails with Cannot resolve function: DISTINCT_COUNT_APPROX on any execution path that does not construct OpenSearchExecutionEngine (unified-query / analytics-engine / DataFusion), because the UDAF was only registered to PPLFuncImpTable.aggExternalFunctionRegistry as a side effect of that constructor (registerOpenSearchFunctions).

This change adds a logical marker UDAF in core that lets PPL parser succeed on every path. Backends are expected to push down or rewrite the operator before execution.

Layered registration

  • core (this PR): register a marker DistinctCountApproxLogicalAggFunction in PPLFuncImpTable.AggBuilder.populate() via the new PPLBuiltinOperators.DISTINCT_COUNT_APPROX. Lookup precedence in PPLFuncImpTable.getImplementation() is aggExternalFunctionRegistry first, then aggFunctionRegistry.
  • opensearch (unchanged): OpenSearchExecutionEngine.registerOpenSearchFunctions() still installs the real DistinctCountApproxAggFunction (HyperLogLog++) into aggExternalFunctionRegistry. Because external is consulted first, the V3 path always sees the real implementation once the constructor has run, and the marker is effectively overridden.

This is the same pattern already used by RelevanceQueryFunction.RelevanceQueryImplementor for match/match_phrase/etc — relevance functions register a no-op operator whose RelevanceQueryImplementor.implement() throws UnsupportedOperationException, because their semantics live entirely on the OpenSearch index side. DISTINCT_COUNT_APPROX is in the same situation: real semantics on the OpenSearch side (cardinality aggregation) and on backends like DataFusion (approx_count_distinct).

Marker class behavior

init / add / result and the accumulator's value all throw UnsupportedOperationException with the message:

DISTINCT_COUNT_APPROX logical marker reached Enumerable execution; an engine-specific implementation must be registered or rewritten before execution.

Reaching the body means a backend either failed to push down or did not register an adapter — surfacing this as a clear error rather than silently producing wrong results is the intended behavior.

V3 path behavior is preserved

  • AggregateAnalyzer translates DISTINCT_COUNT_APPROX to OpenSearch cardinality aggregation through a BuiltinFunctionName switch, independent of the wrapped SqlAggFunction instance — unchanged by this PR.
  • When V3 falls back to Calcite enumerable execution (push-down failed for some sub-plan), RexImpTable.getAggImplementor reads SqlUserDefinedAggFunction.function, which is the real DistinctCountApproxAggFunction registered by OpenSearchExecutionEngine (external takes precedence) — unchanged.
  • Operand metadata for the marker is null to match the existing external registration's permissive type policy. No new type rejection is introduced.

How this unblocks unified-query / DataFusion path

RestUnifiedQueryAction does not construct OpenSearchExecutionEngine; before this change, dc() / distinct_count() / distinct_count_approx() failed at PPL parse stage on that path. After this change, the parser succeeds via the marker, and the downstream backend (e.g. analytics-engine's DataFusion adapter) rewrites RexOver(DISTINCT_COUNT_APPROX) to APPROX_COUNT_DISTINCT before substrait emission. End-to-end verified locally with the analytics-engine REST IT in the OpenSearch sandbox.

Related Issues

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

PPL parser fails with "Cannot resolve function: DISTINCT_COUNT_APPROX" on
any execution path that does not construct OpenSearchExecutionEngine
(unified-query / analytics-engine / DataFusion), because the UDAF was
only registered to PPLFuncImpTable.aggExternalFunctionRegistry as a
side effect of that constructor.

Add a logical marker class DistinctCountApproxLogicalAggFunction in
core, expose it as PPLBuiltinOperators.DISTINCT_COUNT_APPROX, and register
it inside PPLFuncImpTable.AggBuilder.populate() alongside other built-in
aggregates. The marker has no JVM execution: init / add / result throw
UnsupportedOperationException, mirroring the pattern already used by
RelevanceQueryFunction.RelevanceQueryImplementor for match-family
functions which similarly have no JVM semantics.

Behavior on V3 path is preserved: OpenSearchExecutionEngine still
registers the real HyperLogLog++ DistinctCountApproxAggFunction in
aggExternalFunctionRegistry, and getImplementation() consults that
external registry first, so the marker is overridden whenever the V3
constructor has run. AggregateAnalyzer continues to translate the
operator to OpenSearch cardinality DSL via the BuiltinFunctionName
switch which is independent of the wrapped SqlAggFunction instance.

Operand metadata for the marker is null to match the existing external
registration's permissive type policy and avoid introducing new type
rejections that would surface as regressions in existing dc IT.

Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Replace null operand checker

Passing null for operand type checker can cause NullPointerException during query
validation if Calcite attempts to validate operands before backend rewriting occurs.
Consider using PPLOperandTypes.ANY or a permissive operand checker instead of null
to ensure safe validation behavior.

core/src/main/java/org/opensearch/sql/expression/function/PPLBuiltinOperators.java [521-526]

 public static final SqlAggFunction DISTINCT_COUNT_APPROX =
     createUserDefinedAggFunction(
         DistinctCountApproxLogicalAggFunction.class,
         "DISTINCT_COUNT_APPROX",
         ReturnTypes.BIGINT_FORCE_NULLABLE,
-        null);
+        PPLOperandTypes.ANY);
Suggestion importance[1-10]: 7

__

Why: Valid concern about potential NullPointerException during query validation. The comment explicitly states null matches existing external registration's permissive policy, but using PPLOperandTypes.ANY would be safer while maintaining permissiveness. However, the suggestion assumes validation occurs before backend rewriting without confirming the actual execution flow.

Medium

@songkant-aws songkant-aws added PPL Piped processing language enhancement New feature or request labels May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant