Added UDT for IP and Binary support #21807
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 0a18110.
❌ Gradle check result for fd0296c: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for e7a90c8: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #21807 +/- ##
============================================
- Coverage 73.38% 73.34% -0.04%
+ Complexity 75271 75241 -30
============================================
Files 6028 6028
Lines 342014 342014
Branches 49186 49186
============================================
- Hits 250972 250852 -120
- Misses 71110 71164 +54
- Partials 19932 19998 +66
Description
Adds Calcite UDTs for OpenSearch `ip` and `binary` columns to the analytics-engine path, plus two backend adapters and two Rust UDFs that make `CIDRMATCH` and `cast(<ip|binary> as STRING)` work end-to-end against DataFusion. Closes the rendering, schema-label, `CIDRMATCH`-literal and `[B`-writer-crash bugs.

Background
Today, `OpenSearchSchemaBuilder.addLeafFields` collapses both OpenSearch `ip` and `binary` field types to `SqlTypeName.VARBINARY`. Three downstream consequences:
- `VARBINARY` for the column. The DataFusion backend ships back a 16-byte ipv4-mapped-ipv6 buffer and `byte[]` reaches the SQL plugin's row converter, which has no `byte[]` case: `unsupported object class [B`. The schema label also reads `"type": "binary"` for an IP column.
- `CIDRMATCH('1.2.3.4', '1.2.3.0/24')` (literal/literal) reaches DataFusion as a UDF call no backend implements.
- `cast(host as STRING)` lands as `SAFE_CAST(<ip-col> AS VARCHAR)`. DataFusion's `cast(binary, utf8)` kernel decodes the raw 16-byte buffer as UTF-8, producing garbage strings or NULL.

Approach
Three coordinated pieces:
1. UDT split (`analytics-api`). New `IpType` and `BinaryType` extending `AbstractSqlType` with `VARBINARY` underneath. They keep `getSqlTypeName() == VARBINARY` so Substrait, Arrow, DataFusion, and operator dispatch (cidrmatch byte-range rewrite, `=`/`IN`/`BETWEEN` coercion) all keep working unchanged. The UDT distinction lives only in the Calcite RelNode and is read at exactly two sites in the SQL plugin (response-schema build + row converter). Wired through `OpenSearchSchemaBuilder.buildLeafType`.
2. CIDRMATCH backend adapter (`analytics-backend-datafusion`). New `CidrMatchFunctionAdapter` implementing `ScalarFunctionAdapter`.
3. Cast rewrite + Rust UDFs (`analytics-backend-datafusion`). New `IpBinaryCastFunctionAdapter` (also `ScalarFunctionAdapter`) registered for both `ScalarFunction.CAST` and `ScalarFunction.SAFE_CAST` (PPL's `cast(... AS STRING)` reliably emits `SAFE_CAST` via `rexBuilder.makeCast(..., true, true)`, but Calcite's planner-internal coercion can also emit plain `CAST`). For VARCHAR-target casts whose source is `IpType` the adapter emits `ip_to_string(<col>)`; for `BinaryType`, `binary_to_base64(<col>)`. Otherwise it returns the original call unchanged.

The two emitted UDFs bind through the Substrait extension catalog (`opensearch_scalar_functions.yaml`) and are implemented in two new Rust UDFs:
- `ip_to_string`: detects ipv4-mapped form (10 zero bytes + `0xff 0xff` + 4 IPv4 bytes), emits dotted-quad; otherwise renders the 16-byte buffer via `Ipv6Addr::to_string()` (RFC 5952). Output matches `InetAddresses.toAddrString` semantics.
- `binary_to_base64`: base64-encodes per the OpenSearch `binary` field wire contract.

The cast rewrite has to happen before Substrait conversion because by the time the row reaches Java the cell is already a (mangled) `String`, not `byte[]`.

Companion change in the SQL plugin
The SQL plugin recognizes the new UDT at the response-schema boundary (instanceof dispatch in `OpenSearchTypeFactory`, `byte[]` formatting in `AnalyticsExecutionEngine.convertRows`) and removes the legacy in-plugin `CidrMatchAdapter` now that dispatch lives on the backend. Companion PR: opensearch-project/sql#5463; the two must merge together.

Testing
Unit tests added:
- `OpenSearchSchemaBuilderTests`: projected ip/binary columns are `IpType`/`BinaryType` but `getSqlTypeName() == VARBINARY`; digest-based equality.
- `CidrMatchFunctionAdapterTests`: both-literal in-range/out-of-range, IPv6 literal fold, varbinary column + cidr literal byte-range AND, dynamic cidr falls through, unparseable IP falls through.
- `IpBinaryCastFunctionAdapterTests`: IpType cast, BinaryType cast, SAFE_CAST IpType, non-VARCHAR target unchanged, plain VARBINARY (no UDT) unchanged, INTEGER unchanged.
- `ip_to_string`: ipv4-mapped, pure IPv6 loopback, arbitrary IPv6, null pass-through, wrong-length null.
- `binary_to_base64`: round-trip a known payload, null pass-through.

End-to-end: ran the IP/binary PPL test set against a single-node `gradle run` cluster on x86_64 + Corretto JDK 25:
- `cidrmatch(host, '1.2.3.0/24')` (column form): passes, fires tier 2.
- `CIDRMATCH('1.2.3.4', '1.2.3.0/24')` (literal/literal): passes, fires tier 1.
- `cast(host as STRING)`: returns `"1.2.3.4"` / `"::1"` (was a 16-byte garbage buffer).
- `cast(blob as STRING)`: base64-encoded buffer matches `| fields blob`.
- `where host = '1.2.3.4'`: coerces cleanly with no `Unable to convert call IP(string)` regression.
- `FieldTypeCoverageIT#testIpFilters`: unchanged, still passing.
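
For reviewers who want to reason about the 16-byte layout without building the backend, here is a minimal standalone sketch of the two behaviours described above: the `ip_to_string` rendering rule and the column-form `cidrmatch` byte-range check. It is not the code in this PR. The helper names `to_mapped`, `cidr_bounds` and `cidr_match` are hypothetical, the real UDF operates on Arrow arrays rather than slices, and the assumption that the byte-range rewrite amounts to `lo <= col AND col <= hi` over the 16-byte buffer is inferred from the test names, not confirmed by the source.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical helper: widen any address to the 16-byte ipv4-mapped-ipv6 form.
fn to_mapped(ip: IpAddr) -> [u8; 16] {
    match ip {
        IpAddr::V4(v4) => v4.to_ipv6_mapped().octets(),
        IpAddr::V6(v6) => v6.octets(),
    }
}

/// Sketch of the rendering rule: dotted-quad for ipv4-mapped buffers,
/// RFC 5952 text otherwise, and None for a wrong-length buffer.
fn ip_to_string(buf: &[u8]) -> Option<String> {
    if buf.len() != 16 {
        return None;
    }
    let mut bytes = [0u8; 16];
    bytes.copy_from_slice(buf);
    // ipv4-mapped form: 10 zero bytes, then 0xff 0xff, then 4 IPv4 bytes.
    let mapped = bytes[..10].iter().all(|b| *b == 0) && bytes[10] == 0xff && bytes[11] == 0xff;
    if mapped {
        Some(Ipv4Addr::new(bytes[12], bytes[13], bytes[14], bytes[15]).to_string())
    } else {
        Some(Ipv6Addr::from(bytes).to_string())
    }
}

/// Hypothetical helper: lowest and highest 16-byte addresses covered by a CIDR literal.
/// Returns None when the literal does not parse (the "falls through" case).
fn cidr_bounds(cidr: &str) -> Option<([u8; 16], [u8; 16])> {
    let (addr, len) = cidr.split_once('/')?;
    let ip: IpAddr = addr.parse().ok()?;
    let len: u32 = len.parse().ok()?;
    let max = if ip.is_ipv4() { 32 } else { 128 };
    if len > max {
        return None;
    }
    // An IPv4 prefix sits behind the 96-bit mapped prefix.
    let prefix = if ip.is_ipv4() { len + 96 } else { len };
    let base = u128::from_be_bytes(to_mapped(ip));
    // checked_shr yields None for a 128-bit shift, meaning no host bits remain.
    let host_mask = u128::MAX.checked_shr(prefix).unwrap_or(0);
    Some(((base & !host_mask).to_be_bytes(), (base | host_mask).to_be_bytes()))
}

/// Assumed shape of the byte-range check: lexicographic comparison of
/// big-endian buffers is the same as numeric comparison of the addresses.
fn cidr_match(ip: &[u8; 16], cidr: &str) -> Option<bool> {
    let (lo, hi) = cidr_bounds(cidr)?;
    Some(lo <= *ip && *ip <= hi)
}

fn main() {
    let host = to_mapped("1.2.3.4".parse().unwrap());
    println!("{:?}", ip_to_string(&host));
    println!("{:?}", cidr_match(&host, "1.2.3.0/24"));
}
```

Comparing whole buffers keeps the column form expressible as two plain binary comparisons, which is consistent with `getSqlTypeName()` staying `VARBINARY`.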