Skip to content

Added UDT for IP and Binary support#21807

Open
vinaykpud wants to merge 3 commits into
opensearch-project:mainfrom
vinaykpud:feature/udt_ip_binary
Open

Added UDT for IP and Binary support#21807
vinaykpud wants to merge 3 commits into
opensearch-project:mainfrom
vinaykpud:feature/udt_ip_binary

Conversation

@vinaykpud
Copy link
Copy Markdown
Contributor

@vinaykpud vinaykpud commented May 22, 2026

Description

Adds Calcite UDTs for OpenSearch ip and binary columns to the analytics-engine path, plus two backend adapters and two Rust UDFs that make CIDRMATCH and cast(<ip|binary> as STRING) work end-to-end against DataFusion. Closes the rendering, schema-label, CIDRMATCH-literal and [B-writer-crash bugs.

Background

Today, OpenSearchSchemaBuilder.addLeafFields collapses both OpenSearch ip and binary field types to SqlTypeName.VARBINARY. Three downstream consequences:

  1. The Calcite plan's row type carries plain VARBINARY for the column. The DataFusion backend ships back a 16-byte ipv4-mapped-ipv6 buffer and byte[] reaches the SQL plugin's row converter, which has no byte[] case — unsupported object class [B.
  2. The response schema reports "type": "binary" for an IP column.
  3. CIDRMATCH('1.2.3.4', '1.2.3.0/24') (literal/literal) reaches DataFusion as a UDF call no backend implements.
  4. PPL cast(host as STRING) lands as SAFE_CAST(<ip-col> AS VARCHAR). DataFusion's cast(binary, utf8) kernel decodes the raw 16-byte buffer as UTF-8 — garbage strings or NULL.

Approach

Three coordinated pieces:

1. UDT split (analytics-api). New IpType and BinaryType extending AbstractSqlType with VARBINARY underneath. They keep getSqlTypeName() == VARBINARY so Substrait, Arrow, DataFusion, and operator dispatch (cidrmatch byte-range rewrite, =/IN/BETWEEN coercion) all keep working unchanged. The UDT distinction lives only in the Calcite RelNode and is read at exactly two sites in the SQL plugin (response-schema build + row converter). Wired through OpenSearchSchemaBuilder.buildLeafType.

2. CIDRMATCH backend adapter (analytics-backend-datafusion). New CidrMatchFunctionAdapter implementing ScalarFunctionAdapter.

3. Cast rewrite + Rust UDFs (analytics-backend-datafusion). New IpBinaryCastFunctionAdapter (also ScalarFunctionAdapter) registered for both ScalarFunction.CAST and ScalarFunction.SAFE_CAST (PPL's cast(... AS STRING) reliably emits SAFE_CAST via rexBuilder.makeCast(..., true, true), but Calcite's planner-internal coercion can also emit plain CAST). For VARCHAR-target casts whose source is IpType the adapter emits ip_to_string(<col>); for BinaryType, binary_to_base64(<col>). Otherwise returns the original unchanged.

The two emitted UDFs bind through the Substrait extension catalog (opensearch_scalar_functions.yaml) and are implemented in two new Rust UDFs:

  • ip_to_string — detects ipv4-mapped form (10 zero bytes + 0xff 0xff + 4 IPv4 bytes), emits dotted-quad; otherwise renders the 16-byte buffer via Ipv6Addr::to_string() (RFC 5952). Output matches InetAddresses.toAddrString semantics.
  • binary_to_base64 — base64-encodes per the OpenSearch binary field wire contract.

The cast rewrite has to happen before Substrait conversion because by the time the row reaches Java the cell is already a (mangled) String, not byte[].

Companion change in the SQL plugin

The SQL plugin recognizes the new UDT at the response-schema boundary (instanceof dispatch in OpenSearchTypeFactory, byte[] formatting in AnalyticsExecutionEngine.convertRows) and removes the legacy in-plugin CidrMatchAdapter now that dispatch lives on the backend. Companion PR: opensearch-project/sql#5463 — must merge together.

Testing

Unit tests added:

  • OpenSearchSchemaBuilderTests — projected ip/binary columns are IpType/BinaryType but getSqlTypeName() == VARBINARY; digest-based equality.
  • CidrMatchFunctionAdapterTests — both-literal in-range/out-of-range, IPv6 literal fold, varbinary column + cidr literal byte-range AND, dynamic cidr falls through, unparseable IP falls through.
  • IpBinaryCastFunctionAdapterTests — IpType cast, BinaryType cast, SAFE_CAST IpType, non-VARCHAR target unchanged, plain VARBINARY (no UDT) unchanged, INTEGER unchanged.
  • Rust per-module tests — ip_to_string: ipv4-mapped, pure IPv6 loopback, arbitrary IPv6, null pass-through, wrong-length null. binary_to_base64: round-trip a known payload, null pass-through.

End-to-end: ran the IP/binary PPL test set against a single-node gradle run cluster on x86_64 + Corretto JDK 25:

  • cidrmatch(host, '1.2.3.0/24') (column form) — passes, fires tier 2.
  • CIDRMATCH('1.2.3.4', '1.2.3.0/24') (literal/literal) — passes, fires tier 1.
  • cast(host as STRING) — returns "1.2.3.4" / "::1" (was a 16-byte garbage buffer).
  • cast(blob as STRING) — base64-encoded buffer matches | fields blob.
  • where host = '1.2.3.4' — coerces cleanly with no Unable to convert call IP(string) regression.
  • FieldTypeCoverageIT#testIpFilters — unchanged, still passing.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 0a18110.

PathLineSeverityDescription
sandbox/plugins/analytics-backend-datafusion/build.gradle84highNew external Java dependency added: 'com.github.seancfoley:ipaddress:5.4.2'. Per mandatory supply-chain rule, all new package additions must be flagged regardless of apparent legitimacy. Maintainers must verify the artifact's authenticity and SHA1 hash against the known-good release at the canonical source.
sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml85highNew external Rust dependency added: 'base64 = "0.22"'. Per mandatory supply-chain rule, all new package additions must be flagged regardless of apparent legitimacy. Maintainers must verify this crate resolves to the expected crates.io artifact and that Cargo.lock pins a known-good checksum.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 2 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@vinaykpud vinaykpud force-pushed the feature/udt_ip_binary branch 2 times, most recently from 8f66e01 to 18e6978 Compare May 22, 2026 19:36
@vinaykpud vinaykpud marked this pull request as ready for review May 22, 2026 19:38
@vinaykpud vinaykpud requested a review from a team as a code owner May 22, 2026 19:38
@vinaykpud vinaykpud requested a review from mch2 May 22, 2026 19:45
@vinaykpud vinaykpud force-pushed the feature/udt_ip_binary branch from 18e6978 to 0a18110 Compare May 22, 2026 19:45
@vinaykpud vinaykpud added skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. labels May 22, 2026
@vinaykpud vinaykpud force-pushed the feature/udt_ip_binary branch from 0a18110 to fd0296c Compare May 22, 2026 20:34
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for fd0296c: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vinaykpud vinaykpud force-pushed the feature/udt_ip_binary branch from fd0296c to e7a90c8 Compare May 22, 2026 22:50
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for e7a90c8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

vinaykpud added 3 commits May 24, 2026 06:16
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 29b1061: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.34%. Comparing base (19b99d8) to head (29b1061).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21807      +/-   ##
============================================
- Coverage     73.38%   73.34%   -0.04%     
+ Complexity    75271    75241      -30     
============================================
  Files          6028     6028              
  Lines        342014   342014              
  Branches      49186    49186              
============================================
- Hits         250972   250852     -120     
- Misses        71110    71164      +54     
- Partials      19932    19998      +66     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants