API, Core: Add UDF leaf types — Representation and Parameter#15994
API, Core: Add UDF leaf types — Representation and Parameter#15994huaxingao wants to merge 6 commits intoapache:mainfrom
Conversation
|
@flyrain @amogh-jahagirdar @singhpk234 @szehon-ho Could you please take a look at this PR when you have a moment? This PR adds the leaf-level UDF model classes ( |
| * The parameter data type, encoded as a type string for primitives/semi-structured types or as a | ||
| * JSON object for nested types (struct, list, map). | ||
| */ | ||
| Object type(); |
There was a problem hiding this comment.
Why is this a a raw object here? Within our API usage here can we narrow the allowable types? For example shouldn't this be a Iceberg Type?
There was a problem hiding this comment.
UDF types in the spec are based on Iceberg types but don't include field IDs, for example, list only has type and element, and struct fields only have name and type (no element-id or field-id). So Iceberg's Type (which requires field IDs for nested types) doesn't map directly.
I added a UdfType class that wraps either a primitive type string or a nested type structure, so no raw object is used any more. Does this approach look good to you?
| import org.immutables.value.Value; | ||
|
|
||
| @Value.Immutable | ||
| public interface UnknownUdfRepresentation extends UdfRepresentation {} |
There was a problem hiding this comment.
Is this for a UDF where we have several Representations but only some are understood by the library?
There was a problem hiding this comment.
Also should this be package private? I don't think we want other folks to do anything with the "unknown"?
There was a problem hiding this comment.
Yes. A definition version can have multiple representations but only some are understood by the library. When deserializing, if the library encounters a type it doesn't recognize, it stores it as UnknownUdfRepresentation instead of throwing.
I have change this to private.
| if (node.isTextual()) { | ||
| return node.asText(); | ||
| } else if (node.isObject()) { | ||
| return JsonUtil.mapper().convertValue(node, java.util.Map.class); |
There was a problem hiding this comment.
This is a bit scarry to me with the raw map cast. But If we switch to a strong Type (Iceberg Type) here I think we fix it
There was a problem hiding this comment.
I think it's safer now with the new UdfType
|
@RussellSpitzer Thanks for the review! I have addressed all the comments, could you please take one more look? Thanks! |
| private final String primitiveType; | ||
| private final Map<String, Object> nestedType; | ||
|
|
||
| private UdfType(String primitiveType, Map<String, Object> nestedType) { |
There was a problem hiding this comment.
Still not a fan of using the Map here, I think we should probably just make all the correct recursive types
UDFType (interface)
| UDFStructType
| UDFFieldType
| UDFListType
| UDFMapType
Mimicing the Iceberg Type Structure
There was a problem hiding this comment.
Basically I think we should be building this exactly the same way we deal with Iceberg Types and Parsing
| throw new IllegalArgumentException("Nested type map must not be null"); | ||
| } | ||
|
|
||
| return new UdfType(null, type); |
There was a problem hiding this comment.
This is a dangerous spot since a mutable map here could be silently modified. Which has some weird downstream implications (= still matching even though the map is modified). If you take a look at the Iceberg StructType (or friends) each one defensively copies which prevents that sort of thing.
Like
private StructType(List<NestedField> fields) {
Preconditions.checkNotNull(fields, "Field list cannot be null");
this.fields = new NestedField[fields.size()];
for (int i = 0; i < this.fields.length; i += 1) {
this.fields[i] = fields.get(i);
}
}I think if we shift to the Iceberg Types we can avoid this sort of potential issue
|
|
||
| import org.immutables.value.Value; | ||
|
|
||
| @Value.Immutable |
| * Writes a UDF type value (without a field name) to a JSON generator. Used when writing array | ||
| * elements. | ||
| */ | ||
| static void writeTypeValue(UdfType type, JsonGenerator generator) throws IOException { |
There was a problem hiding this comment.
What's this one for? Is there a time when we need a list of parameter types without the other info?
| if (node.isTextual()) { | ||
| return UdfType.primitive(node.asText()); | ||
| } else if (node.isObject()) { | ||
| Map<String, Object> nested = JsonUtil.mapper().convertValue(node, Map.class); |
There was a problem hiding this comment.
Would not require a bare cast if we had a strong type here :)
|
|
||
| /** Creates a UdfType for a primitive or semi-structured type (e.g., "int", "decimal(9,2)"). */ | ||
| public static UdfType primitive(String type) { | ||
| if (type == null) { |
There was a problem hiding this comment.
This (and other arg checks) are usually Precondition instead of If/ throws
|
|
||
| static void toJson(UdfRepresentation representation, JsonGenerator generator) throws IOException { | ||
| Preconditions.checkArgument(representation != null, "Invalid UDF representation: null"); | ||
| switch (representation.type().toLowerCase(Locale.ENGLISH)) { |
There was a problem hiding this comment.
Use ROOT instead of ENGLISH. I know the view parser uses ENGLISH but it's wrong and we should fix that too in a followup
| import com.fasterxml.jackson.databind.JsonNode; | ||
| import org.junit.jupiter.api.Test; | ||
|
|
||
| public class TestSQLUdfRepresentationParser { |
There was a problem hiding this comment.
For this (and the other new test files)
Classes and tests should all be package private and we no longer have to start tests with "test" in Junit5
| .build(); | ||
|
|
||
| UdfParameter parsed = UdfParameterParser.fromJson(json); | ||
| assertThat(parsed.name()).isEqualTo("x"); |
There was a problem hiding this comment.
Should be just a straight forward "isEquals" instead of a element by element comparison. I'm guessing this isn't working because of the Map issue I was highlighting before so this also gets cleaned up if we fix the type issues, so another payoff for making that change.
| } | ||
| } | ||
|
|
||
| static String toJson(UdfRepresentation entry) { |
There was a problem hiding this comment.
nit: representation instead of "entry" ?
|
|
||
| @Override | ||
| public String toString() { | ||
| return isPrimitive() ? primitiveType : nestedType.toString(); |
There was a problem hiding this comment.
nother note on the Map here, this would use Map to string format for all nested types
| public interface UdfRepresentation { | ||
|
|
||
| class Type { | ||
| private Type() {} |
| */ | ||
| package org.apache.iceberg.udf; | ||
|
|
||
| /** A representation of a UDF implementation. */ |
There was a problem hiding this comment.
This doc isn't really more useful that the interface name. Try something like
Describes how a UDF's logic is expressed, for example as a SQL body with a specific dialect. A UDF definition may have multiple representations for different engines.
| /** The parameter data type. */ | ||
| UdfType type(); | ||
|
|
||
| /** Optional documentation string. */ |
There was a problem hiding this comment.
nit: Nullable already signals Optionality
|
|
||
| @Override | ||
| public int hashCode() { | ||
| return Objects.hashCode(typeString); |
There was a problem hiding this comment.
Could you please help me understand why the Map/List contains their respective type in the hash, like:
@Override
public int hashCode() {
return Objects.hash(UdfListType.class, elementType);
}
But not the UdfPrimitiveType?
I just want to understand at this point
There was a problem hiding this comment.
Good catch. it should contain the type in the hash. Fixed.
| * A UDF struct type with an ordered list of named fields. Unlike Iceberg struct types, UDF struct | ||
| * fields do not have field IDs. |
There was a problem hiding this comment.
I find it strange that we compare to the Iceberg struct types. Is there a specific reason for this?
There was a problem hiding this comment.
Maybe UdfStructType types are based on Iceberg struct types but intentionally omit field IDs and element nullability.?
There was a problem hiding this comment.
Thanks for the suggestion! Changed to Based on Iceberg struct types but intentionally omits field IDs and element nullability.
| Preconditions.checkArgument( | ||
| node.isObject(), "Cannot parse UDF representation from non-object: %s", node); | ||
| String type = JsonUtil.getString(TYPE, node).toLowerCase(Locale.ROOT); | ||
| switch (type) { |
| return UdfPrimitiveType.of(node.asText()); | ||
| } else if (node.isObject()) { | ||
| String typeName = JsonUtil.getString(TYPE, node); | ||
| switch (typeName) { |
| return readStruct(node); | ||
| default: | ||
| throw new IllegalArgumentException( | ||
| String.format("Cannot parse UDF type from object with type: %s", typeName)); |
There was a problem hiding this comment.
Should we print the full node? Maybe the user forgot to add the type to the node, and this way it will be hard to find the offending type
There was a problem hiding this comment.
Updated the error to include the full node
|
|
||
| @Test | ||
| void parseSqlUdfRepresentation() { | ||
| String json = "{\"type\":\"sql\", \"sql\": \"x + 1\", \"dialect\": \"spark\"}"; |
There was a problem hiding this comment.
Is it more readable if we use something like this?
String json =
"""
{"type":"sql","sql":"x + 1","dialect":"spark"}""";
For me it is very hard to parse \"
If you agree, then please change the other string constants too.
|
|
||
| @Test | ||
| void parseListTypeParameter() { | ||
| String json = "{\"name\":\"items\",\"type\":{\"type\":\"list\",\"element\":\"string\"}}"; |
There was a problem hiding this comment.
Maybe like:
String json =
"""
{
"name":"items",
"type": {
"type":"list",
"element":"string"
}
}""";
| String json = | ||
| "{\"name\":\"row\",\"type\":{\"type\":\"struct\",\"fields\":[" | ||
| + "{\"name\":\"id\",\"type\":\"int\"}," | ||
| + "{\"name\":\"label\",\"type\":\"string\"}]}}"; |
There was a problem hiding this comment.
This is very hard to read. Please apply the pattern I suggested above to make this parseable to an average human like me 😄
|
@RussellSpitzer @pvary I have addressed all the comments. Could you please take one more look? Thanks a lot! |
| * UdfPrimitiveType} for primitive and semi-structured types, and the nested types {@link | ||
| * UdfListType}, {@link UdfMapType}, and {@link UdfStructType}. | ||
| */ | ||
| public interface UdfType { |
There was a problem hiding this comment.
Organization suggestion -
I feel like we have a lot of small class files here and there are a few ways to make it cleaner
Option 1, make a type package put all the files in there
Option 2 (or option 1 extension) - Follow the Iceberg Types model with UdfType Interfaces and UdfTypes Concerete Implemetnations
I would probably suggest doing both, but just option 2 is probably fine.
How Types is set up
// Type.java — the interface
public interface Type extends Serializable {
enum TypeID { BOOLEAN, INTEGER, ... STRUCT, LIST, MAP, VARIANT }
TypeID typeId();
default boolean isPrimitiveType() { ... }
default boolean isStructType() { ... }
// ...
abstract class PrimitiveType implements Type { ... }
abstract class NestedType implements Type { ... }
}// Types.java — ALL concrete implementations as static nested classes in ONE file
public class Types {
private Types() {}
// Primitives
public static class BooleanType extends PrimitiveType { ... }
public static class IntegerType extends PrimitiveType { ... }
public static class StringType extends PrimitiveType { ... }
public static class DecimalType extends PrimitiveType { ... }
// ... ~15 more primitives ...
// Field (equivalent to UdfFieldType)
public static class NestedField implements Serializable {
public static NestedField optional(int id, String name, Type type) { ... }
public static NestedField required(int id, String name, Type type) { ... }
// ...
}
// Nested types
public static class StructType extends NestedType {
public static StructType of(NestedField... fields) { ... }
// ...
}
public static class ListType extends NestedType {
public static ListType ofOptional(int elementId, Type elementType) { ... }
// ...
}
public static class MapType extends NestedType {
public static MapType ofOptional(int keyId, int valueId, Type keyType, Type valueType) { ... }
// ...
}
}There was a problem hiding this comment.
Done. Went with Option 2: collapsed into UdfTypes.java with PrimitiveType / ListType / MapType / StructType / NestedField as static nested classes, mirroring org.apache.iceberg.types.Types.
| if (node.isTextual()) { | ||
| return UdfPrimitiveType.of(node.asText()); | ||
| } else if (node.isObject()) { | ||
| String typeName = JsonUtil.getString(TYPE, node); |
There was a problem hiding this comment.
Do we need a toLowerCase here?
There was a problem hiding this comment.
Done: added toLowerCase(Locale.ROOT) and a test for case-insensitive parsing.
| Preconditions.checkArgument(typeString != null, "Invalid primitive type: null"); | ||
| return new UdfPrimitiveType(typeString); |
There was a problem hiding this comment.
Should we add some validations here for the other types? We only allow the same ones that Iceberg supports correct?
There was a problem hiding this comment.
Done. PrimitiveType.of now validates via Types.fromTypeName.
| } else if (node.isObject()) { | ||
| String typeName = JsonUtil.getString(TYPE, node); | ||
| return switch (typeName) { | ||
| case LIST -> UdfListType.of(readType(node.get(ELEMENT))); |
There was a problem hiding this comment.
Should be using JsonUtil (like below)
There was a problem hiding this comment.
Switched to JsonUtil.get(...)
|
|
||
| TypeId typeId(); | ||
|
|
||
| default boolean isPrimitive() { |
There was a problem hiding this comment.
Changed isPrimitive -> isPrimitiveType
Also changed asPrimitive -> asPrimitiveType
| STRUCT | ||
| } | ||
|
|
||
| TypeId typeId(); |
There was a problem hiding this comment.
Done — renamed TypeId → TypeID to match Type.TypeID in Iceberg. I think that's what you meant?
|
|
||
| assertThat(deserialized).isInstanceOf(SQLUdfRepresentation.class); | ||
| SQLUdfRepresentation sqlRepr = (SQLUdfRepresentation) deserialized; | ||
| assertThat(sqlRepr.sql()).isEqualTo("x + 1.0"); |
There was a problem hiding this comment.
Same issue @pvary mentioned above. We can just do a straight object equals
| } | ||
|
|
||
| @Test | ||
| void parseDecimalTypeParameter() { |
There was a problem hiding this comment.
Right now this test is a bit misleading, we currently don't have any special logic for decimals (or any types for that matter). That said, I think we should do validation and as I mentioned above match up with how Iceberg does this. Ideally I don't want us to be able to make a type that isn't an iceberg type
There was a problem hiding this comment.
Good point. B
Both decimal and variant tests were exercising the same primitive-parsing pathway as parseParameterWithDoc with no new coverage, so I removed them.
| } | ||
|
|
||
| @Test | ||
| void parseVariantTypeParameter() { |
There was a problem hiding this comment.
Same comment as above, this test exercises the same pathway as the previous test just with a different string. I think we should get some validation in there or maybe even specific type classes?
| import org.junit.jupiter.api.Test; | ||
|
|
||
| class TestUdfParameterParser { | ||
|
|
There was a problem hiding this comment.
I think we are still missing a few of the negative tests
- Test for list without element, Map without element
- Test for structs with invalid fields
- Test for a struct which isn't an object
There was a problem hiding this comment.
Added these tests. Thanks!
Add the foundational model classes for UDF metadata