API, Core: Add UDF leaf types — Representation and Parameter by huaxingao · Pull Request #15994 · apache/iceberg

huaxingao · 2026-04-16T02:59:33Z

Add the foundational model classes for UDF metadata

UdfRepresentation / SQLUdfRepresentation — interfaces for UDF implementation representations
UdfParameter — interface for function parameters with support for primitive and nested types
UnknownUdfRepresentation — forward-compatible fallback for unrecognized representation types
JSON parsers and type utilities for all of the above

huaxingao · 2026-04-16T17:19:19Z

@flyrain @amogh-jahagirdar @singhpk234 @szehon-ho

Could you please take a look at this PR when you have a moment?

This PR adds the leaf-level UDF model classes (Representation and Parameter) as defined in the UDF spec. I followed the same code pattern as the view implementation (api/.../view/, core/.../view/).
This is the first in a series of PRs to build out UDF metadata support. The next PR will add UdfDefinition and UdfDefinitionVersion on top of these types.

RussellSpitzer · 2026-04-20T18:31:51Z

+   * The parameter data type, encoded as a type string for primitives/semi-structured types or as a
+   * JSON object for nested types (struct, list, map).
+   */
+  Object type();


Why is this a raw object here? Within our API usage here, can we narrow the allowable types? For example, shouldn't this be an Iceberg Type?

UDF types in the spec are based on Iceberg types but don't include field IDs, for example, list only has type and element, and struct fields only have name and type (no element-id or field-id). So Iceberg's Type (which requires field IDs for nested types) doesn't map directly.

I added a UdfType class that wraps either a primitive type string or a nested type structure, so no raw object is used any more. Does this approach look good to you?

RussellSpitzer · 2026-04-20T19:13:45Z

+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface UnknownUdfRepresentation extends UdfRepresentation {}


Is this for a UDF where we have several Representations but only some are understood by the library?

Also should this be package private? I don't think we want other folks to do anything with the "unknown"?

Yes. A definition version can have multiple representations but only some are understood by the library. When deserializing, if the library encounters a type it doesn't recognize, it stores it as UnknownUdfRepresentation instead of throwing.

I have changed this to private.
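
The fallback described in this reply can be sketched as follows. This is a minimal illustration, not the PR's code: the class and record names are hypothetical stand-ins for UdfRepresentation, SQLUdfRepresentation, and UnknownUdfRepresentation, and a Map<String, String> stands in for the parsed JSON object. It uses Java 16+ records for brevity.

```java
import java.util.Locale;
import java.util.Map;

class UnknownRepresentationSketch {
  // Hypothetical stand-in for the UdfRepresentation interface.
  interface Representation {
    String type();
  }

  // Hypothetical stand-in for SQLUdfRepresentation.
  record SqlRepresentation(String sql, String dialect) implements Representation {
    @Override
    public String type() {
      return "sql";
    }
  }

  // Hypothetical stand-in for UnknownUdfRepresentation: it only remembers the type name.
  record UnknownRepresentation(String type) implements Representation {}

  // Assumption: a flat map of fields stands in for a Jackson JsonNode.
  static Representation fromFields(Map<String, String> fields) {
    String type = fields.get("type");
    if (type == null) {
      throw new IllegalArgumentException("Cannot parse representation, missing type: " + fields);
    }

    switch (type.toLowerCase(Locale.ROOT)) {
      case "sql":
        return new SqlRepresentation(fields.get("sql"), fields.get("dialect"));
      default:
        // Unrecognized types are kept as "unknown" instead of failing the whole parse.
        return new UnknownRepresentation(type);
    }
  }

  public static void main(String[] args) {
    System.out.println(fromFields(Map.of("type", "sql", "sql", "x + 1", "dialect", "spark")));
    System.out.println(fromFields(Map.of("type", "python")));
  }
}
```

With this shape, a definition version that mixes known and unknown representations can still be read, and callers can filter on the concrete class.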

RussellSpitzer · 2026-04-20T19:25:27Z

+    if (node.isTextual()) {
+      return node.asText();
+    } else if (node.isObject()) {
+      return JsonUtil.mapper().convertValue(node, java.util.Map.class);


This is a bit scary to me with the raw map cast. But if we switch to a strong type (Iceberg Type) here, I think we fix it.

I think it's safer now with the new UdfType

huaxingao · 2026-04-27T20:50:03Z

@RussellSpitzer Thanks for the review! I have addressed all the comments, could you please take one more look? Thanks!

RussellSpitzer · 2026-05-01T21:35:40Z

+  private final String primitiveType;
+  private final Map<String, Object> nestedType;
+
+  private UdfType(String primitiveType, Map<String, Object> nestedType) {


Still not a fan of using the Map here, I think we should probably just make all the correct recursive types

UDFType (interface) | UDFStructType | UDFFieldType | UDFListType | UDFMapType

Mimicking the Iceberg Type structure

Basically I think we should be building this exactly the same way we deal with Iceberg Types and Parsing
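
A rough sketch of what such a recursive hierarchy could look like is below. The names are hypothetical (the thread proposes UDFType, UDFStructType, UDFFieldType, UDFListType, and UDFMapType), and Java 16+ records are used only to keep the example short; the PR itself may use ordinary classes.

```java
import java.util.List;

class UdfTypeHierarchySketch {
  // Hypothetical common interface for all UDF types.
  interface UdfType {}

  // A primitive is identified by its type string, for example "int" or "decimal(9,2)".
  record PrimitiveType(String name) implements UdfType {}

  // A list has only an element type, with no element ID.
  record ListType(UdfType element) implements UdfType {}

  // A map has a key type and a value type, with no key or value IDs.
  record MapType(UdfType key, UdfType value) implements UdfType {}

  // A struct field has only a name and a type, with no field ID.
  record Field(String name, UdfType type) {}

  record StructType(List<Field> fields) implements UdfType {
    StructType {
      // Defensive, unmodifiable copy so later changes to the caller's list are not visible.
      fields = List.copyOf(fields);
    }
  }

  public static void main(String[] args) {
    UdfType type =
        new StructType(
            List.of(
                new Field("id", new PrimitiveType("int")),
                new Field("tags", new ListType(new PrimitiveType("string")))));
    System.out.println(type);
  }
}
```

Because every node is a typed value, equality and toString are structural and no raw Map cast is needed when walking the tree.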

RussellSpitzer · 2026-05-01T21:42:02Z

+      throw new IllegalArgumentException("Nested type map must not be null");
+    }
+
+    return new UdfType(null, type);


This is a dangerous spot since a mutable map here could be silently modified, which has some weird downstream implications (equality still matching even though the map is modified). If you take a look at the Iceberg StructType (or friends), each one defensively copies, which prevents that sort of thing.

Like

    private StructType(List<NestedField> fields) {
      Preconditions.checkNotNull(fields, "Field list cannot be null");
      this.fields = new NestedField[fields.size()];
      for (int i = 0; i < this.fields.length; i += 1) {
        this.fields[i] = fields.get(i);
      }
    }

I think if we shift to the Iceberg Types we can avoid this sort of potential issue
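
The aliasing hazard can be shown with a small, self-contained sketch. The Aliased and Copied classes are hypothetical and exist only to contrast keeping the caller's map by reference with taking a copy; they are not the PR's UdfType.

```java
import java.util.HashMap;
import java.util.Map;

class DefensiveCopySketch {
  // Hypothetical holder that keeps the caller's map by reference.
  static final class Aliased {
    private final Map<String, Object> nested;

    Aliased(Map<String, Object> nested) {
      this.nested = nested;
    }

    Map<String, Object> nested() {
      return nested;
    }
  }

  // Hypothetical holder that takes an unmodifiable copy at construction time.
  static final class Copied {
    private final Map<String, Object> nested;

    Copied(Map<String, Object> nested) {
      // Note: Map.copyOf is a shallow copy; nested mutable values would still be shared.
      this.nested = Map.copyOf(nested);
    }

    Map<String, Object> nested() {
      return nested;
    }
  }

  public static void main(String[] args) {
    Map<String, Object> source = new HashMap<>();
    source.put("type", "list");
    source.put("element", "string");

    Aliased aliased = new Aliased(source);
    Copied copied = new Copied(source);

    // The caller mutates the map after the objects were built.
    source.put("element", "int");

    System.out.println(aliased.nested().get("element")); // int
    System.out.println(copied.nested().get("element")); // string
  }
}
```

The aliased holder silently changes meaning after construction, while the copied one keeps the value it was built with.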

RussellSpitzer · 2026-05-01T21:46:42Z

+
+import org.immutables.value.Value;
+
+@Value.Immutable


@Value.Style(visibilityString = "PACKAGE")

RussellSpitzer · 2026-05-01T21:50:59Z

+   * Writes a UDF type value (without a field name) to a JSON generator. Used when writing array
+   * elements.
+   */
+  static void writeTypeValue(UdfType type, JsonGenerator generator) throws IOException {


What's this one for? Is there a time when we need a list of parameter types without the other info?

RussellSpitzer · 2026-05-01T21:52:29Z

+    if (node.isTextual()) {
+      return UdfType.primitive(node.asText());
+    } else if (node.isObject()) {
+      Map<String, Object> nested = JsonUtil.mapper().convertValue(node, Map.class);


Would not require a bare cast if we had a strong type here :)

RussellSpitzer · 2026-05-01T21:53:52Z

+
+  /** Creates a UdfType for a primitive or semi-structured type (e.g., "int", "decimal(9,2)"). */
+  public static UdfType primitive(String type) {
+    if (type == null) {


This (and other arg checks) are usually Preconditions instead of if/throw

RussellSpitzer · 2026-05-01T21:55:17Z

+
+  static void toJson(UdfRepresentation representation, JsonGenerator generator) throws IOException {
+    Preconditions.checkArgument(representation != null, "Invalid UDF representation: null");
+    switch (representation.type().toLowerCase(Locale.ENGLISH)) {


Use ROOT instead of ENGLISH. I know the view parser uses ENGLISH but it's wrong and we should fix that too in a followup
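
For context, a short sketch of why case normalization should name a locale explicitly. Locale.ENGLISH and Locale.ROOT produce the same lowercase result for these type names; ROOT is simply the locale-neutral choice. The Turkish locale is included only to show how locale-sensitive lowercasing can break a string switch. The class and method names are hypothetical.

```java
import java.util.Locale;

class LocaleLowerCaseSketch {
  // Locale-neutral normalization of a type name before switching on it.
  static String normalize(String typeName) {
    return typeName.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr");

    // Locale-neutral: "LIST" becomes "list".
    System.out.println(normalize("LIST"));

    // Turkish rules map 'I' to a dotless i (U+0131), so the result no longer equals "list".
    System.out.println("LIST".toLowerCase(turkish).equals("list")); // false
  }
}
```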

RussellSpitzer · 2026-05-01T21:57:37Z

+import com.fasterxml.jackson.databind.JsonNode;
+import org.junit.jupiter.api.Test;
+
+public class TestSQLUdfRepresentationParser {


For this (and the other new test files)

Classes and tests should all be package-private, and we no longer have to start test names with "test" in JUnit 5

RussellSpitzer · 2026-05-01T22:00:16Z

+            .build();
+
+    UdfParameter parsed = UdfParameterParser.fromJson(json);
+    assertThat(parsed.name()).isEqualTo("x");


Should be just a straightforward "isEqualTo" instead of an element-by-element comparison. I'm guessing this isn't working because of the Map issue I was highlighting before, so this also gets cleaned up if we fix the type issues. That's another payoff for making that change.

RussellSpitzer · 2026-05-01T22:01:52Z

+    }
+  }
+
+  static String toJson(UdfRepresentation entry) {


nit: representation instead of "entry" ?

RussellSpitzer · 2026-05-01T22:03:19Z

+
+  @Override
+  public String toString() {
+    return isPrimitive() ? primitiveType : nestedType.toString();


Another note on the Map here: this would use the Map toString format for all nested types

RussellSpitzer · 2026-05-01T22:04:47Z

+public interface UdfRepresentation {
+
+  class Type {
+    private Type() {}


nit: missing javadoc

RussellSpitzer · 2026-05-01T22:06:20Z

+ */
+package org.apache.iceberg.udf;
+
+/** A representation of a UDF implementation. */


This doc isn't really more useful than the interface name. Try something like

Describes how a UDF's logic is expressed, for example as a SQL body with a specific dialect. A UDF definition may have multiple representations for different engines.

RussellSpitzer · 2026-05-01T22:07:54Z

+  /** The parameter data type. */
+  UdfType type();
+
+  /** Optional documentation string. */


nit: Nullable already signals Optionality

pvary · 2026-05-04T10:29:20Z

+
+  @Override
+  public int hashCode() {
+    return Objects.hashCode(typeString);


Could you please help me understand why the Map/List contains their respective type in the hash, like:

    @Override
    public int hashCode() {
      return Objects.hash(UdfListType.class, elementType);
    }

But not the UdfPrimitiveType?

I just want to understand at this point

Good catch. It should contain the type in the hash. Fixed.
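
A minimal sketch of the consistent pattern, with every type mixing its own class into the hash. The class names are hypothetical stand-ins for UdfPrimitiveType and UdfListType, and the bodies are an assumption about the shape rather than the PR's code.

```java
import java.util.Objects;

class HashCodeSketch {
  // Hypothetical marker interface for UDF types.
  interface UdfType {}

  static final class PrimitiveType implements UdfType {
    private final String typeString;

    PrimitiveType(String typeString) {
      this.typeString = Objects.requireNonNull(typeString, "Invalid primitive type: null");
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof PrimitiveType)) {
        return false;
      }
      return typeString.equals(((PrimitiveType) other).typeString);
    }

    @Override
    public int hashCode() {
      // Same pattern as the list type below: class token first, then the fields.
      return Objects.hash(PrimitiveType.class, typeString);
    }
  }

  static final class ListType implements UdfType {
    private final UdfType elementType;

    ListType(UdfType elementType) {
      this.elementType = Objects.requireNonNull(elementType, "Invalid element type: null");
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof ListType)) {
        return false;
      }
      return elementType.equals(((ListType) other).elementType);
    }

    @Override
    public int hashCode() {
      return Objects.hash(ListType.class, elementType);
    }
  }

  public static void main(String[] args) {
    UdfType a = new ListType(new PrimitiveType("string"));
    UdfType b = new ListType(new PrimitiveType("string"));
    System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
  }
}
```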

pvary · 2026-05-04T10:36:36Z

+ * A UDF struct type with an ordered list of named fields. Unlike Iceberg struct types, UDF struct
+ * fields do not have field IDs.


I find it strange that we compare to the Iceberg struct types. Is there a specific reason for this?

Maybe: "UdfStructType types are based on Iceberg struct types but intentionally omit field IDs and element nullability."?

Thanks for the suggestion! Changed to "Based on Iceberg struct types but intentionally omits field IDs and element nullability."

pvary · 2026-05-04T10:50:44Z

+    Preconditions.checkArgument(
+        node.isObject(), "Cannot parse UDF representation from non-object: %s", node);
+    String type = JsonUtil.getString(TYPE, node).toLowerCase(Locale.ROOT);
+    switch (type) {


pvary · 2026-05-04T10:51:14Z

+      return UdfPrimitiveType.of(node.asText());
+    } else if (node.isObject()) {
+      String typeName = JsonUtil.getString(TYPE, node);
+      switch (typeName) {


pvary · 2026-05-04T10:53:33Z

+          return readStruct(node);
+        default:
+          throw new IllegalArgumentException(
+              String.format("Cannot parse UDF type from object with type: %s", typeName));


Should we print the full node? Maybe the user forgot to add the type to the node, and this way it will be hard to find the offending type

Updated the error to include the full node

pvary · 2026-05-04T11:04:17Z

+
+  @Test
+  void parseSqlUdfRepresentation() {
+    String json = "{\"type\":\"sql\", \"sql\": \"x + 1\", \"dialect\": \"spark\"}";


Is it more readable if we use something like this?

    String json = """
        {"type":"sql","sql":"x + 1","dialect":"spark"}""";

For me it is very hard to parse \"

If you agree, then please change the other string constants too.

pvary · 2026-05-04T11:08:22Z

+
+  @Test
+  void parseListTypeParameter() {
+    String json = "{\"name\":\"items\",\"type\":{\"type\":\"list\",\"element\":\"string\"}}";


Maybe like:

    String json = """
        {
          "name":"items",
          "type": {
            "type":"list",
            "element":"string"
          }
        }""";

pvary · 2026-05-04T11:09:27Z

+    String json =
+        "{\"name\":\"row\",\"type\":{\"type\":\"struct\",\"fields\":["
+            + "{\"name\":\"id\",\"type\":\"int\"},"
+            + "{\"name\":\"label\",\"type\":\"string\"}]}}";


This is very hard to read. Please apply the pattern I suggested above to make this parseable to an average human like me 😄
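
Applied to the struct example quoted above, the text-block pattern might look like the sketch below (text blocks need Java 15 or later). The class name and the stripWhitespace helper are hypothetical and only serve to show that both spellings carry the same JSON.

```java
class TextBlockSketch {
  // The original, escaped spelling from the diff.
  static final String ESCAPED =
      "{\"name\":\"row\",\"type\":{\"type\":\"struct\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"int\"},"
          + "{\"name\":\"label\",\"type\":\"string\"}]}}";

  // The same JSON as a text block; incidental indentation is stripped by the compiler.
  static final String TEXT_BLOCK =
      """
      {
        "name": "row",
        "type": {
          "type": "struct",
          "fields": [
            {"name": "id", "type": "int"},
            {"name": "label", "type": "string"}
          ]
        }
      }""";

  // Hypothetical helper: safe here only because no string value contains whitespace.
  static String stripWhitespace(String json) {
    return json.replaceAll("\\s+", "");
  }

  public static void main(String[] args) {
    System.out.println(stripWhitespace(TEXT_BLOCK).equals(ESCAPED)); // true
  }
}
```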

huaxingao · 2026-05-04T18:11:17Z

@RussellSpitzer @pvary I have addressed all the comments. Could you please take one more look? Thanks a lot!

RussellSpitzer · 2026-05-07T14:46:32Z

+ * UdfPrimitiveType} for primitive and semi-structured types, and the nested types {@link
+ * UdfListType}, {@link UdfMapType}, and {@link UdfStructType}.
+ */
+public interface UdfType {


Organization suggestion -

I feel like we have a lot of small class files here and there are a few ways to make it cleaner

Option 1, make a type package put all the files in there
Option 2 (or option 1 extension) - Follow the Iceberg Types model with UdfType Interfaces and UdfTypes Concrete Implementations

I would probably suggest doing both, but just option 2 is probably fine.

How Types is set up

    // Type.java: the interface
    public interface Type extends Serializable {
      enum TypeID { BOOLEAN, INTEGER, ... STRUCT, LIST, MAP, VARIANT }

      TypeID typeId();

      default boolean isPrimitiveType() { ... }
      default boolean isStructType() { ... }
      // ...

      abstract class PrimitiveType implements Type { ... }
      abstract class NestedType implements Type { ... }
    }

    // Types.java: ALL concrete implementations as static nested classes in ONE file
    public class Types {
      private Types() {}

      // Primitives
      public static class BooleanType extends PrimitiveType { ... }
      public static class IntegerType extends PrimitiveType { ... }
      public static class StringType extends PrimitiveType { ... }
      public static class DecimalType extends PrimitiveType { ... }
      // ... ~15 more primitives ...

      // Field (equivalent to UdfFieldType)
      public static class NestedField implements Serializable {
        public static NestedField optional(int id, String name, Type type) { ... }
        public static NestedField required(int id, String name, Type type) { ... }
        // ...
      }

      // Nested types
      public static class StructType extends NestedType {
        public static StructType of(NestedField... fields) { ... }
        // ...
      }

      public static class ListType extends NestedType {
        public static ListType ofOptional(int elementId, Type elementType) { ... }
        // ...
      }

      public static class MapType extends NestedType {
        public static MapType ofOptional(int keyId, int valueId, Type keyType, Type valueType) { ... }
        // ...
      }
    }

Done. Went with Option 2: collapsed into UdfTypes.java with PrimitiveType / ListType / MapType / StructType / NestedField as static nested classes, mirroring org.apache.iceberg.types.Types.

RussellSpitzer · 2026-05-07T14:52:04Z

+    if (node.isTextual()) {
+      return UdfPrimitiveType.of(node.asText());
+    } else if (node.isObject()) {
+      String typeName = JsonUtil.getString(TYPE, node);


Do we need a toLowerCase here?

Done: added toLowerCase(Locale.ROOT) and a test for case-insensitive parsing.

RussellSpitzer · 2026-05-07T14:56:37Z

+    Preconditions.checkArgument(typeString != null, "Invalid primitive type: null");
+    return new UdfPrimitiveType(typeString);


Should we add some validations here for the other types? We only allow the same ones that Iceberg supports, correct?

Done. PrimitiveType.of now validates via Types.fromTypeName.
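
The thread says validation is delegated to Iceberg's own type-name parsing. The sketch below does not call Iceberg; it is a stand-alone illustration of the idea, with a hand-written and deliberately incomplete set of type names and hypothetical class and method names.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class PrimitiveValidationSketch {
  // Assumption: an illustrative subset of type names, not Iceberg's full list.
  private static final Set<String> SIMPLE_TYPES =
      Set.of(
          "boolean", "int", "long", "float", "double", "date", "time",
          "timestamp", "timestamptz", "string", "uuid", "binary", "variant");

  // Parameterized types such as decimal(9,2) and fixed[16].
  private static final Pattern DECIMAL =
      Pattern.compile("decimal\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)");
  private static final Pattern FIXED = Pattern.compile("fixed\\[\\s*(\\d+)\\s*\\]");

  // Returns the normalized type string, or throws if the name is not recognized.
  static String validate(String typeString) {
    if (typeString == null) {
      throw new IllegalArgumentException("Invalid primitive type: null");
    }

    String normalized = typeString.toLowerCase(Locale.ROOT);
    if (SIMPLE_TYPES.contains(normalized)
        || DECIMAL.matcher(normalized).matches()
        || FIXED.matcher(normalized).matches()) {
      return normalized;
    }

    throw new IllegalArgumentException("Invalid primitive type: " + typeString);
  }

  public static void main(String[] args) {
    System.out.println(validate("decimal(9,2)"));
    System.out.println(validate("INT"));
  }
}
```

Rejecting unknown names at construction time means a type that is not an Iceberg type cannot be created in the first place, which is the goal stated in the review.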

RussellSpitzer · 2026-05-07T16:13:49Z

+    } else if (node.isObject()) {
+      String typeName = JsonUtil.getString(TYPE, node);
+      return switch (typeName) {
+        case LIST -> UdfListType.of(readType(node.get(ELEMENT)));


Should be using JsonUtil (like below)

Switched to JsonUtil.get(...)

RussellSpitzer · 2026-05-07T16:16:42Z

+
+  TypeId typeId();
+
+  default boolean isPrimitive() {


nit: isPrimitiveType

Changed isPrimitive -> isPrimitiveType
Also changed asPrimitive -> asPrimitiveType

RussellSpitzer · 2026-05-07T16:17:13Z

+    STRUCT
+  }
+
+  TypeId typeId();


nit: typeID

Done — renamed TypeId → TypeID to match Type.TypeID in Iceberg. I think that's what you meant?

RussellSpitzer · 2026-05-07T16:25:43Z

+
+    assertThat(deserialized).isInstanceOf(SQLUdfRepresentation.class);
+    SQLUdfRepresentation sqlRepr = (SQLUdfRepresentation) deserialized;
+    assertThat(sqlRepr.sql()).isEqualTo("x + 1.0");


Same issue @pvary mentioned above. We can just do a straight object equals

RussellSpitzer · 2026-05-07T16:31:31Z

+  }
+
+  @Test
+  void parseDecimalTypeParameter() {


Right now this test is a bit misleading, we currently don't have any special logic for decimals (or any types for that matter). That said, I think we should do validation and as I mentioned above match up with how Iceberg does this. Ideally I don't want us to be able to make a type that isn't an iceberg type

Good point. Both decimal and variant tests were exercising the same primitive-parsing pathway as parseParameterWithDoc with no new coverage, so I removed them.

RussellSpitzer · 2026-05-07T16:32:30Z

+  }
+
+  @Test
+  void parseVariantTypeParameter() {


Same comment as above, this test exercises the same pathway as the previous test just with a different string. I think we should get some validation in there or maybe even specific type classes?

RussellSpitzer · 2026-05-07T16:36:32Z

+import org.junit.jupiter.api.Test;
+
+class TestUdfParameterParser {
+


I think we are still missing a few of the negative tests

Test for list without element, Map without element

Test for structs with invalid fields

Test for a struct which isn't an object

Added these tests. Thanks!
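
The negative cases listed above can be sketched against a simplified parser. Everything here is a stand-in: Map, List, and String values take the place of Jackson JsonNode objects, the parser returns a display string instead of a typed UdfType, and the class, method names, and error messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NegativeParseSketch {
  // Assumption: a String is a primitive type name and a Map is a nested type object.
  static String readType(Object node) {
    if (node instanceof String) {
      return (String) node;
    }
    if (!(node instanceof Map)) {
      throw new IllegalArgumentException("Cannot parse UDF type from: " + node);
    }

    Map<?, ?> object = (Map<?, ?>) node;
    String typeName = String.valueOf(required(object, "type")).toLowerCase(Locale.ROOT);
    switch (typeName) {
      case "list":
        return "list<" + readType(required(object, "element")) + ">";
      case "map":
        return "map<"
            + readType(required(object, "key"))
            + ", "
            + readType(required(object, "value"))
            + ">";
      case "struct":
        return readStruct(required(object, "fields"));
      default:
        // The full node is included so a misspelled or missing type is easy to find.
        throw new IllegalArgumentException("Cannot parse UDF type from: " + node);
    }
  }

  private static String readStruct(Object fields) {
    if (!(fields instanceof List)) {
      throw new IllegalArgumentException("Cannot parse struct fields from non-array: " + fields);
    }

    List<String> parts = new ArrayList<>();
    for (Object field : (List<?>) fields) {
      if (!(field instanceof Map)) {
        throw new IllegalArgumentException("Cannot parse struct field from non-object: " + field);
      }
      Map<?, ?> fieldObject = (Map<?, ?>) field;
      parts.add(required(fieldObject, "name") + ": " + readType(required(fieldObject, "type")));
    }
    return "struct<" + String.join(", ", parts) + ">";
  }

  private static Object required(Map<?, ?> object, String property) {
    Object value = object.get(property);
    if (value == null) {
      throw new IllegalArgumentException("Cannot parse missing field: " + property + " from " + object);
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(readType(Map.of("type", "list", "element", "string")));
    try {
      readType(Map.of("type", "list"));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```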

API, Core: Add UDF leaf types — Representation and Parameter

8a21601

github-actions Bot added API core labels Apr 16, 2026

RussellSpitzer reviewed Apr 20, 2026

Add UdfType

5afb02f

RussellSpitzer reviewed May 1, 2026

address comments

b96a2fe

pvary reviewed May 4, 2026

address comments

eed9f23

RussellSpitzer reviewed May 7, 2026

huaxingao added 2 commits May 10, 2026 19:24

Address comments

0db77f5

fix checkstyle

690408d
