Added UDT for IP and Binary support #5463

Open
vinaykpud wants to merge 2 commits into opensearch-project:main from vinaykpud:feature/udt_ip_binary

@vinaykpud vinaykpud commented May 22, 2026

Description

Companion change for opensearch-project/OpenSearch#21807 which adds Calcite UDTs (IpType, BinaryType) for OpenSearch ip and binary columns to the analytics-engine path. This PR teaches the SQL plugin to recognize the new UDTs at the response-schema boundary so:

  • byte[] cells from the analytics backend get rendered as canonical IP strings / base64 strings (was: unsupported object class [B).
  • The response schema reports "type": "ip" / "type": "binary" correctly (was: both labelled "binary").
  • Existing comparison and CIDRMATCH query shapes keep working unchanged — the UDT is invisible to Calcite's planner-internal coercion machinery.

Background

OpenSearchSchemaBuilder (in the OpenSearch sandbox) used to map both ip and binary field types to plain SqlTypeName.VARBINARY. That collapse caused three downstream bugs in this plugin's PPL path through the analytics engine:

  1. The DataFusion backend ships back a 16-byte ipv4-mapped-ipv6 buffer; AnalyticsExecutionEngine.convertRows calls ExprValueUtils.fromObjectValue(byte[]) which has no byte[] case → unsupported object class [B.
  2. The response schema reports "type": "binary" for an IP column (because convertRelDataTypeToExprType(VARBINARY) → BINARY ExprType).
  3. CIDRMATCH against a column was registered with PPLTypeChecker.family(SqlTypeFamily.BINARY, SqlTypeFamily.STRING), with the byte-range expansion living inline at the registration site.

The companion OpenSearch PR introduces IpType / BinaryType (Calcite UDTs that extend AbstractSqlType with VARBINARY underneath) and moves CIDRMATCH dispatch into a backend ScalarFunctionAdapter.

Approach

Two response-boundary changes plus one cleanup, all minimal:

1. Result-column UDT recognition (OpenSearchTypeFactory).
New sibling function convertAnalyticsEngineRelDataTypeToExprType does an instanceof dispatch:

if (type instanceof IpType) return IP;
if (type instanceof BinaryType) return BINARY;
return convertRelDataTypeToExprType(type);

The original convertRelDataTypeToExprType is deliberately unchanged — Calcite's coercion machinery round-trips through it, so returning IP ExprType for a VARBINARY column synthesizes IP(string) casts that DataFusion can't resolve. Keeping UDT recognition in a sibling function isolates it to the response-schema path.
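
The split between the two converters can be sketched without Calcite on the classpath. Everything below is a hypothetical stand-in: `ColumnType`, `Varbinary`, and the `ExprType` enum only mimic `RelDataType`, the VARBINARY base, and `ExprCoreType`; they are not the real classes. The point it illustrates is that only the response-schema variant looks at the UDT, while the planner-internal variant keeps treating both columns as plain binary.

```java
final class ResultColumnTypeSketch {
  // Hypothetical stand-ins for Calcite's RelDataType and the companion PR's UDTs.
  interface ColumnType {}

  static class Varbinary implements ColumnType {}

  static final class IpType extends Varbinary {}

  static final class BinaryType extends Varbinary {}

  // Hypothetical stand-in for ExprCoreType.
  enum ExprType { IP, BINARY }

  /** Planner-internal mapping: deliberately blind to the UDTs. */
  static ExprType convertRelDataTypeToExprType(ColumnType type) {
    if (type instanceof Varbinary) {
      return ExprType.BINARY;
    }
    throw new IllegalArgumentException("unsupported type: " + type);
  }

  /** Response-schema mapping: recognizes the UDTs, then delegates. */
  static ExprType convertAnalyticsEngineRelDataTypeToExprType(ColumnType type) {
    if (type instanceof IpType) {
      return ExprType.IP;
    }
    if (type instanceof BinaryType) {
      return ExprType.BINARY;
    }
    return convertRelDataTypeToExprType(type);
  }
}
```

With this shape, an `IpType` column is labelled IP only when the response schema is built; the planner-side function still answers BINARY for it, so no IP(string) cast is synthesized.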

2. Per-column UDT dispatch in row conversion (AnalyticsExecutionEngine.convertRows).
New static helper toExprValue(Object value, RelDataType type):

  • byte[] + IpType → InetAddresses.toAddrString(InetAddress.getByAddress(bytes)) (matches IpFieldMapper.valueFetcher semantics: dotted-quad for IPv4 / IPv4-mapped, RFC 5952 for pure IPv6).
  • byte[] + BinaryType → Base64.getEncoder().encodeToString(bytes) (matches the OpenSearch binary field wire contract).
  • Otherwise → ExprValueUtils.fromObjectValue(value) unchanged.

UnknownHostException from a malformed IP buffer falls through to the default handler so the user sees a clear error rather than a malformed address string.
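
A JDK-only approximation of the per-column dispatch is sketched below. It is not the plugin's `toExprValue`: the `Kind` enum is a hypothetical replacement for the `RelDataType` check, plain objects stand in for `ExprValue`, and `getHostAddress()` replaces `InetAddresses.toAddrString`. One consequence of that swap is that pure IPv6 addresses come out uncompressed here rather than in RFC 5952 form. The malformed-buffer branch follows the `IllegalStateException` wrapping quoted later in the review.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Base64;

final class CellRenderSketch {
  // Hypothetical stand-in for "type instanceof IpType / BinaryType".
  enum Kind { IP, BINARY, OTHER }

  static Object render(Object value, Kind kind) {
    if (value instanceof byte[]) {
      byte[] bytes = (byte[]) value;
      if (kind == Kind.IP) {
        try {
          // The JDK folds a 16-byte IPv4-mapped buffer into an Inet4Address,
          // so ::ffff:1.2.3.4 renders as dotted-quad.
          return InetAddress.getByAddress(bytes).getHostAddress();
        } catch (UnknownHostException e) {
          throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
        }
      }
      if (kind == Kind.BINARY) {
        return Base64.getEncoder().encodeToString(bytes);
      }
    }
    // Everything else passes through untouched in this sketch.
    return value;
  }
}
```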

3. CIDRMATCH cleanup (PPLFuncImpTable).

  • Removes udf/ip/CidrMatchAdapter and its inline registration. CIDRMATCH dispatch now lives in CidrMatchFunctionAdapter on the backend (in the companion OpenSearch PR), which means it serves both the production SQL-plugin variant and the sandbox test front-end with a single implementation.
  • Collapses the two-line registration (typecheck + withRexBuilderShim) to a single typecheck-only registerOperator(CIDRMATCH, PPLBuiltinOperators.CIDRMATCH).
  • The runtime CidrMatchFunction UDF stays as the dynamic last-resort fallback.
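
For readers unfamiliar with the byte-range expansion mentioned above, here is one way a CIDR block can be turned into inclusive lower and upper bounds over 16-byte buffers. This is an illustrative sketch, not the code of CidrMatchFunctionAdapter or the removed CidrMatchAdapter; the class and method names are hypothetical, and it assumes IPv4 values are stored in IPv4-mapped form, as the DataFusion buffers described earlier are.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

final class CidrRangeSketch {
  /** Widens a 4-byte IPv4 address to its 16-byte IPv4-mapped form (::ffff:a.b.c.d). */
  static byte[] to16Bytes(byte[] addr) {
    if (addr.length == 16) {
      return addr.clone();
    }
    if (addr.length != 4) {
      throw new IllegalArgumentException("expected 4 or 16 bytes, got " + addr.length);
    }
    byte[] mapped = new byte[16];
    mapped[10] = (byte) 0xff;
    mapped[11] = (byte) 0xff;
    System.arraycopy(addr, 0, mapped, 12, 4);
    return mapped;
  }

  /** Expands "base/prefix" into inclusive {lower, upper} bounds, 16 bytes each. */
  static byte[][] expand(String cidr) {
    int slash = cidr.indexOf('/');
    if (slash < 0) {
      throw new IllegalArgumentException("missing '/': " + cidr);
    }
    byte[] base;
    try {
      // Intended for IP literals only; a hostname here would trigger a DNS lookup.
      base = InetAddress.getByName(cidr.substring(0, slash)).getAddress();
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException("bad address in " + cidr, e);
    }
    int prefix = Integer.parseInt(cidr.substring(slash + 1));
    if (prefix < 0 || prefix > base.length * 8) {
      throw new IllegalArgumentException("bad prefix length in " + cidr);
    }
    // An IPv4 prefix counts bits inside the last 32 bits of the mapped form.
    int fixedBits = (base.length == 4) ? 96 + prefix : prefix;
    byte[] lower = to16Bytes(base);
    byte[] upper = lower.clone();
    for (int bit = fixedBits; bit < 128; bit++) {
      int mask = 0x80 >> (bit % 8);
      lower[bit / 8] &= ~mask; // host bits cleared for the lower bound
      upper[bit / 8] |= mask;  // host bits set for the upper bound
    }
    return new byte[][] {lower, upper};
  }

  /** True when the address falls inside the block, comparing bytes as unsigned. */
  static boolean matches(byte[] ip, String cidr) {
    byte[][] bounds = expand(cidr);
    byte[] value = to16Bytes(ip);
    return Arrays.compareUnsigned(bounds[0], value) <= 0
        && Arrays.compareUnsigned(value, bounds[1]) <= 0;
  }
}
```

Expressing the match as two unsigned byte-wise comparisons is what lets a backend evaluate it as a plain range predicate on the binary column.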

Testing

Unit tests added:

  • OpenSearchTypeFactoryTest:

    • testConvertResultColumnIpTypeReturnsIpExprType: IpType → ExprCoreType.IP.
    • testConvertResultColumnBinaryTypeReturnsBinaryExprType: BinaryType → ExprCoreType.BINARY.
    • testConvertResultColumnPlainVarbinaryFallsBackToBinary — plain VARBINARY (no UDT) keeps returning BINARY.
    • testConvertResultColumnDelegatesParityForNonUdtTypes — for every non-UDT RelDataType, the result-column variant must agree with the planner-internal variant. Drift here would mean response-schema labels diverge from what Calcite's coercion sees.
  • AnalyticsExecutionEngineTest:

    • executeRelNode_ipColumnRendersAsAddressString: both IPv4-mapped IPv6 (::ffff:1.2.3.4 → "1.2.3.4") and pure IPv6 (::1) buffers render to the right canonical form, and the schema reports "ip".
    • executeRelNode_binaryColumnRendersAsBase64: a byte[] payload base64-encodes to match the OpenSearch wire format, and the schema reports "binary".

Existing tests still pass. A regression to the "unsupported object class [B" error is caught by physicalPlanExecute_callsOnFailure, which is already in the file and exercises the same converter dispatch.

End-to-end (against a single-node gradle run cluster with the companion OpenSearch PR applied):

  • cast(host as STRING) returns "1.2.3.4" / "::1" (was a 16-byte garbage buffer).
  • cast(blob as STRING) matches | fields blob (base64-encoded).
  • where host = '1.2.3.4' coerces cleanly with no Unable to convert call IP(string) regression.
  • cidrmatch(host, '1.2.3.0/24') (column form) and CIDRMATCH('1.2.3.4', '1.2.3.0/24') (literal form) both return correct row counts via the backend adapter.
  • Response schema labels: "type": "ip" for ip columns, "type": "binary" for binary columns.

github-actions Bot commented May 22, 2026

PR Reviewer Guide 🔍

(Review updated until commit 0918d6b)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The loop at line 131 iterates over fields.size() but only accesses row[i] when i < row.length. If fields.size() > row.length, the loop continues but value is set to null for indices beyond row.length. However, valueMap.put(field.getName(), toExprValue(value, field.getType())) is never called for these iterations because the put statement is outside the loop body shown. This appears to be incomplete code that may skip populating fields when row length is shorter than field count.

for (int i = 0; i < fields.size(); i++) {
  RelDataTypeField field = fields.get(i);
  Object value = (i < row.length) ? row[i] : null;
  valueMap.put(field.getName(), toExprValue(value, field.getType()));
}
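
To make the padding behaviour of this loop concrete, here is a reduced sketch that drops the type dispatch. The field names are plain strings rather than `RelDataTypeField` and the values are raw objects, both simplifications made for illustration. Because the `put` sits inside the loop body, a row shorter than the field list still yields one entry per field, with the missing cells set to null.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowPaddingSketch {
  /** Builds one value per field, padding a short row with nulls. */
  static Map<String, Object> toValueMap(List<String> fieldNames, Object[] row) {
    Map<String, Object> valueMap = new LinkedHashMap<>();
    for (int i = 0; i < fieldNames.size(); i++) {
      Object value = (i < row.length) ? row[i] : null;
      valueMap.put(fieldNames.get(i), value);
    }
    return valueMap;
  }
}
```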

Exception Handling

When InetAddress.getByAddress(bytes) throws UnknownHostException at line 162, the catch block wraps it in IllegalStateException with a message about invalid buffer length. However, UnknownHostException is only thrown by getByAddress when the byte array length is not 4 or 16. If the analytics engine returns a buffer of unexpected length (e.g., due to a backend bug or data corruption), this will surface as an IllegalStateException during result conversion, potentially causing query failures. Consider whether this should be handled more gracefully or logged with additional context about the source column.

try {
  return ExprValueUtils.stringValue(
      InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
} catch (UnknownHostException e) {
  throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
}
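
The claim about buffer lengths can be checked with a small JDK-only probe. The class name is hypothetical and the probe is not part of the PR; it simply records which array lengths `InetAddress.getByAddress` accepts.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

final class IpBufferLengthProbe {
  /** Returns every length in [0, maxLength] that getByAddress accepts. */
  static List<Integer> acceptedLengths(int maxLength) {
    List<Integer> accepted = new ArrayList<>();
    for (int len = 0; len <= maxLength; len++) {
      try {
        InetAddress.getByAddress(new byte[len]);
        accepted.add(len);
      } catch (UnknownHostException e) {
        // Illegal length: neither an IPv4 (4) nor an IPv6 (16) address.
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    System.out.println(acceptedLengths(32)); // prints [4, 16]
  }
}
```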

github-actions Bot commented May 22, 2026

PR Code Suggestions ✨

Latest suggestions up to 0918d6b

Explore these optional code suggestions:

Category    Suggestion    Impact
General
Validate IP buffer length explicitly

The UnknownHostException is thrown only for invalid buffer lengths (not 4 or 16
bytes), but the error message doesn't clarify the expected lengths. Add validation
to check buffer length explicitly before calling InetAddress.getByAddress() and
provide a more descriptive error message indicating valid lengths (4 or 16 bytes).

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [157-170]

 private static ExprValue toExprValue(Object value, RelDataType type) {
   if (value instanceof byte[] bytes) {
     if (type instanceof IpType) {
+      if (bytes.length != 4 && bytes.length != 16) {
+        throw new IllegalStateException(
+            "invalid IP buffer length: " + bytes.length + " (expected 4 or 16 bytes)");
+      }
       try {
         return ExprValueUtils.stringValue(
             InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
       } catch (UnknownHostException e) {
-        throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
+        throw new IllegalStateException("failed to parse IP address", e);
       }
     } else if (type instanceof BinaryType) {
       return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
     }
   }
   return ExprValueUtils.fromObjectValue(value);
 }
Suggestion importance[1-10]: 7

Why: The suggestion improves error handling by explicitly validating IP buffer lengths before processing, providing clearer error messages. However, InetAddress.getByAddress() already validates buffer length internally, so this is primarily a defensive programming improvement rather than fixing a critical bug.

Medium
Log warning for row-field mismatch

When row.length < fields.size(), the code defaults missing values to null. However,
this silently masks potential data corruption or schema mismatches. Consider logging
a warning when row length doesn't match field count to aid debugging of upstream
data issues.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [131-135]

+if (row.length != fields.size()) {
+  logger.warn("Row length mismatch: expected {} fields but got {} values", 
+              fields.size(), row.length);
+}
 for (int i = 0; i < fields.size(); i++) {
   RelDataTypeField field = fields.get(i);
   Object value = (i < row.length) ? row[i] : null;
   valueMap.put(field.getName(), toExprValue(value, field.getType()));
 }
Suggestion importance[1-10]: 6

Why: Adding logging for row-field mismatches would help detect data corruption or schema issues during debugging. However, the existing code already handles this case gracefully by defaulting to null, and the suggestion only adds observability without changing behavior or fixing a bug.

Low

Previous suggestions

Suggestions up to commit b7fd95b
Category    Suggestion    Impact
General
Log invalid IP buffer errors

The UnknownHostException catch block silently falls through to default handling,
which may throw an unclear error. Consider logging the exception with the invalid
buffer length to aid debugging when the backend sends malformed IP data.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [158-167]

 if (value instanceof byte[] bytes) {
   if (type instanceof IpType) {
     try {
       return ExprValueUtils.stringValue(
           InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
     } catch (UnknownHostException e) {
-      // Defensive: backend gave us a buffer that isn't 4 or 16 bytes. Fall through to
-      // the default handling so the user sees a clear error rather than a malformed
-      // address string.
+      // Log the error with buffer length for debugging
+      logger.warn("Invalid IP buffer length: {} bytes, expected 4 or 16", bytes.length, e);
+      // Fall through to default handling
     }
   } else if (type instanceof BinaryType) {
     return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
   }
 }
Suggestion importance[1-10]: 5

Why: Adding logging for the UnknownHostException would help with debugging malformed IP buffers from the backend. However, the current fallthrough behavior is intentional and documented, and the default handling will already surface an error to the user. The logging is a minor enhancement rather than a critical fix.

Low
Suggestions up to commit 540074b
Category    Suggestion    Impact
General
Log invalid IP byte arrays

The UnknownHostException catch block silently falls through to default handling,
which may throw an unclear error. Consider logging the exception with context (e.g.,
buffer length) to aid debugging when invalid IP byte arrays are encountered.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [158-166]

 if (value instanceof byte[] bytes) {
   if (type instanceof IpType) {
     try {
       return ExprValueUtils.stringValue(InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
     } catch (UnknownHostException e) {
-      // Defensive: backend gave us a buffer that isn't 4 or 16 bytes. Fall through to
-      // the default handling so the user sees a clear error rather than a malformed
-      // address string.
+      // Log the invalid buffer for debugging
+      logger.warn("Invalid IP address byte array of length {}: {}", bytes.length, e.getMessage());
+      // Fall through to default handling
     }
   } else if (type instanceof BinaryType) {
     return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
   }
 }
Suggestion importance[1-10]: 5

Why: Adding logging for invalid IP byte arrays would help with debugging, but the current defensive approach of falling through to default handling is already reasonable. The suggestion improves observability without fixing a critical issue. The score reflects a moderate improvement in maintainability and debugging capability.

Low

Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
@vinaykpud vinaykpud force-pushed the feature/udt_ip_binary branch from 540074b to b7fd95b Compare May 22, 2026 18:21

Persistent review updated to latest commit b7fd95b

Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
@vinaykpud vinaykpud marked this pull request as ready for review May 22, 2026 20:22

Persistent review updated to latest commit 0918d6b
