Skip to content

API, Core: Add UDF leaf types — Representation and Parameter#15994

Open
huaxingao wants to merge 6 commits intoapache:mainfrom
huaxingao:udf_a1
Open

API, Core: Add UDF leaf types — Representation and Parameter#15994
huaxingao wants to merge 6 commits intoapache:mainfrom
huaxingao:udf_a1

Conversation

@huaxingao
Copy link
Copy Markdown
Contributor

Add the foundational model classes for UDF metadata

  • UdfRepresentation / SQLUdfRepresentation — interfaces for UDF implementation representations
  • UdfParameter — interface for function parameters with support for primitive and nested types
  • UnknownUdfRepresentation — forward-compatible fallback for unrecognized representation types
  • JSON parsers and type utilities for all of the above

@huaxingao
Copy link
Copy Markdown
Contributor Author

@flyrain @amogh-jahagirdar @singhpk234 @szehon-ho

Could you please take a look at this PR when you have a moment?

This PR adds the leaf-level UDF model classes (Representation and Parameter) as defined in the UDF spec. I followed the same code pattern as the view implementation (api/.../view/, core/.../view/).
This is the first in a series of PRs to build out UDF metadata support. The next PR will add UdfDefinition and UdfDefinitionVersion on top of these types.

* The parameter data type, encoded as a type string for primitives/semi-structured types or as a
* JSON object for nested types (struct, list, map).
*/
Object type();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a a raw object here? Within our API usage here can we narrow the allowable types? For example shouldn't this be a Iceberg Type?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UDF types in the spec are based on Iceberg types but don't include field IDs, for example, list only has type and element, and struct fields only have name and type (no element-id or field-id). So Iceberg's Type (which requires field IDs for nested types) doesn't map directly.

I added a UdfType class that wraps either a primitive type string or a nested type structure, so no raw object is used any more. Does this approach look good to you?

import org.immutables.value.Value;

@Value.Immutable
public interface UnknownUdfRepresentation extends UdfRepresentation {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for a UDF where we have several Representations but only some are understood by the library?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should this be package private? I don't think we want other folks to do anything with the "unknown"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. A definition version can have multiple representations but only some are understood by the library. When deserializing, if the library encounters a type it doesn't recognize, it stores it as UnknownUdfRepresentation instead of throwing.

I have change this to private.

if (node.isTextual()) {
return node.asText();
} else if (node.isObject()) {
return JsonUtil.mapper().convertValue(node, java.util.Map.class);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit scarry to me with the raw map cast. But If we switch to a strong Type (Iceberg Type) here I think we fix it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safer now with the new UdfType

@huaxingao
Copy link
Copy Markdown
Contributor Author

@RussellSpitzer Thanks for the review! I have addressed all the comments, could you please take one more look? Thanks!

private final String primitiveType;
private final Map<String, Object> nestedType;

private UdfType(String primitiveType, Map<String, Object> nestedType) {
Copy link
Copy Markdown
Member

@RussellSpitzer RussellSpitzer May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not a fan of using the Map here, I think we should probably just make all the correct recursive types

UDFType (interface)
  | UDFStructType
  | UDFFieldType
  | UDFListType
  | UDFMapType

Mimicing the Iceberg Type Structure

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically I think we should be building this exactly the same way we deal with Iceberg Types and Parsing

throw new IllegalArgumentException("Nested type map must not be null");
}

return new UdfType(null, type);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a dangerous spot since a mutable map here could be silently modified. Which has some weird downstream implications (= still matching even though the map is modified). If you take a look at the Iceberg StructType (or friends) each one defensively copies which prevents that sort of thing.

Like

    private StructType(List<NestedField> fields) {
      Preconditions.checkNotNull(fields, "Field list cannot be null");
      this.fields = new NestedField[fields.size()];
      for (int i = 0; i < this.fields.length; i += 1) {
        this.fields[i] = fields.get(i);
      }
    }

I think if we shift to the Iceberg Types we can avoid this sort of potential issue


import org.immutables.value.Value;

@Value.Immutable
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@value.Style(visibilityString = "PACKAGE")

* Writes a UDF type value (without a field name) to a JSON generator. Used when writing array
* elements.
*/
static void writeTypeValue(UdfType type, JsonGenerator generator) throws IOException {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this one for? Is there a time when we need a list of parameter types without the other info?

if (node.isTextual()) {
return UdfType.primitive(node.asText());
} else if (node.isObject()) {
Map<String, Object> nested = JsonUtil.mapper().convertValue(node, Map.class);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not require a bare cast if we had a strong type here :)


/** Creates a UdfType for a primitive or semi-structured type (e.g., "int", "decimal(9,2)"). */
public static UdfType primitive(String type) {
if (type == null) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and other arg checks) are usually Precondition instead of If/ throws


static void toJson(UdfRepresentation representation, JsonGenerator generator) throws IOException {
Preconditions.checkArgument(representation != null, "Invalid UDF representation: null");
switch (representation.type().toLowerCase(Locale.ENGLISH)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ROOT instead of ENGLISH. I know the view parser uses ENGLISH but it's wrong and we should fix that too in a followup

import com.fasterxml.jackson.databind.JsonNode;
import org.junit.jupiter.api.Test;

public class TestSQLUdfRepresentationParser {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this (and the other new test files)

Classes and tests should all be package private and we no longer have to start tests with "test" in Junit5

.build();

UdfParameter parsed = UdfParameterParser.fromJson(json);
assertThat(parsed.name()).isEqualTo("x");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be just a straight forward "isEquals" instead of a element by element comparison. I'm guessing this isn't working because of the Map issue I was highlighting before so this also gets cleaned up if we fix the type issues, so another payoff for making that change.

}
}

static String toJson(UdfRepresentation entry) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: representation instead of "entry" ?


@Override
public String toString() {
return isPrimitive() ? primitiveType : nestedType.toString();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nother note on the Map here, this would use Map to string format for all nested types

public interface UdfRepresentation {

class Type {
private Type() {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing javadoc

*/
package org.apache.iceberg.udf;

/** A representation of a UDF implementation. */
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc isn't really more useful that the interface name. Try something like

Describes how a UDF's logic is expressed, for example as a SQL body with a specific dialect. A UDF definition may have multiple representations for different engines.

/** The parameter data type. */
UdfType type();

/** Optional documentation string. */
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Nullable already signals Optionality


@Override
public int hashCode() {
return Objects.hashCode(typeString);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please help me understand why the Map/List contains their respective type in the hash, like:

  @Override
  public int hashCode() {
    return Objects.hash(UdfListType.class, elementType);
  }

But not the UdfPrimitiveType?

I just want to understand at this point

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. it should contain the type in the hash. Fixed.

Comment on lines +29 to +30
* A UDF struct type with an ordered list of named fields. Unlike Iceberg struct types, UDF struct
* fields do not have field IDs.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it strange that we compare to the Iceberg struct types. Is there a specific reason for this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe UdfStructType types are based on Iceberg struct types but intentionally omit field IDs and element nullability.?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! Changed to Based on Iceberg struct types but intentionally omits field IDs and element nullability.

Preconditions.checkArgument(
node.isObject(), "Cannot parse UDF representation from non-object: %s", node);
String type = JsonUtil.getString(TYPE, node).toLowerCase(Locale.ROOT);
switch (type) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new switch

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return UdfPrimitiveType.of(node.asText());
} else if (node.isObject()) {
String typeName = JsonUtil.getString(TYPE, node);
switch (typeName) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new switch

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return readStruct(node);
default:
throw new IllegalArgumentException(
String.format("Cannot parse UDF type from object with type: %s", typeName));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we print the full node? Maybe the user forgot to add the type to the node, and this way it will be hard to find the offending type

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the error to include the full node


@Test
void parseSqlUdfRepresentation() {
String json = "{\"type\":\"sql\", \"sql\": \"x + 1\", \"dialect\": \"spark\"}";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it more readable if we use something like this?

String json =
    """
    {"type":"sql","sql":"x + 1","dialect":"spark"}""";

For me it is very hard to parse \"

If you agree, then please change the other string constants too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


@Test
void parseListTypeParameter() {
String json = "{\"name\":\"items\",\"type\":{\"type\":\"list\",\"element\":\"string\"}}";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe like:

    String json =
        """
        {
            "name":"items",
            "type": {
                "type":"list",
                "element":"string"
            }
        }""";

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +99 to +102
String json =
"{\"name\":\"row\",\"type\":{\"type\":\"struct\",\"fields\":["
+ "{\"name\":\"id\",\"type\":\"int\"},"
+ "{\"name\":\"label\",\"type\":\"string\"}]}}";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very hard to read. Please apply the pattern I suggested above to make this parseable to an average human like me 😄

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@huaxingao
Copy link
Copy Markdown
Contributor Author

@RussellSpitzer @pvary I have addressed all the comments. Could you please take one more look? Thanks a lot!

* UdfPrimitiveType} for primitive and semi-structured types, and the nested types {@link
* UdfListType}, {@link UdfMapType}, and {@link UdfStructType}.
*/
public interface UdfType {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Organization suggestion -

I feel like we have a lot of small class files here and there are a few ways to make it cleaner

Option 1, make a type package put all the files in there
Option 2 (or option 1 extension) - Follow the Iceberg Types model with UdfType Interfaces and UdfTypes Concerete Implemetnations

I would probably suggest doing both, but just option 2 is probably fine.


How Types is set up

// Type.java — the interface
public interface Type extends Serializable {
  enum TypeID { BOOLEAN, INTEGER, ... STRUCT, LIST, MAP, VARIANT }

  TypeID typeId();
  default boolean isPrimitiveType() { ... }
  default boolean isStructType() { ... }
  // ...

  abstract class PrimitiveType implements Type { ... }
  abstract class NestedType implements Type { ... }
}
// Types.java — ALL concrete implementations as static nested classes in ONE file
public class Types {
  private Types() {}

  // Primitives
  public static class BooleanType extends PrimitiveType { ... }
  public static class IntegerType extends PrimitiveType { ... }
  public static class StringType extends PrimitiveType { ... }
  public static class DecimalType extends PrimitiveType { ... }
  // ... ~15 more primitives ...

  // Field (equivalent to UdfFieldType)
  public static class NestedField implements Serializable {
    public static NestedField optional(int id, String name, Type type) { ... }
    public static NestedField required(int id, String name, Type type) { ... }
    // ...
  }

  // Nested types
  public static class StructType extends NestedType {
    public static StructType of(NestedField... fields) { ... }
    // ...
  }

  public static class ListType extends NestedType {
    public static ListType ofOptional(int elementId, Type elementType) { ... }
    // ...
  }

  public static class MapType extends NestedType {
    public static MapType ofOptional(int keyId, int valueId, Type keyType, Type valueType) { ... }
    // ...
  }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Went with Option 2: collapsed into UdfTypes.java with PrimitiveType / ListType / MapType / StructType / NestedField as static nested classes, mirroring org.apache.iceberg.types.Types.

if (node.isTextual()) {
return UdfPrimitiveType.of(node.asText());
} else if (node.isObject()) {
String typeName = JsonUtil.getString(TYPE, node);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a toLowerCase here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: added toLowerCase(Locale.ROOT) and a test for case-insensitive parsing.

Comment on lines +33 to +34
Preconditions.checkArgument(typeString != null, "Invalid primitive type: null");
return new UdfPrimitiveType(typeString);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some validations here for the other types? We only allow the same ones that Iceberg supports correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. PrimitiveType.of now validates via Types.fromTypeName.

} else if (node.isObject()) {
String typeName = JsonUtil.getString(TYPE, node);
return switch (typeName) {
case LIST -> UdfListType.of(readType(node.get(ELEMENT)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be using JsonUtil (like below)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to JsonUtil.get(...)


TypeId typeId();

default boolean isPrimitive() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isPrimitiveType

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed isPrimitive -> isPrimitiveType
Also changed asPrimitive -> asPrimitiveType

STRUCT
}

TypeId typeId();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typeID

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — renamed TypeIdTypeID to match Type.TypeID in Iceberg. I think that's what you meant?


assertThat(deserialized).isInstanceOf(SQLUdfRepresentation.class);
SQLUdfRepresentation sqlRepr = (SQLUdfRepresentation) deserialized;
assertThat(sqlRepr.sql()).isEqualTo("x + 1.0");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue @pvary mentioned above. We can just do a straight object equals

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}

@Test
void parseDecimalTypeParameter() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this test is a bit misleading, we currently don't have any special logic for decimals (or any types for that matter). That said, I think we should do validation and as I mentioned above match up with how Iceberg does this. Ideally I don't want us to be able to make a type that isn't an iceberg type

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. B
Both decimal and variant tests were exercising the same primitive-parsing pathway as parseParameterWithDoc with no new coverage, so I removed them.

}

@Test
void parseVariantTypeParameter() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, this test exercises the same pathway as the previous test just with a different string. I think we should get some validation in there or maybe even specific type classes?

import org.junit.jupiter.api.Test;

class TestUdfParameterParser {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are still missing a few of the negative tests

  1. Test for list without element, Map without element
  2. Test for structs with invalid fields
  3. Test for a struct which isn't an object

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added these tests. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants