-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Interface based DataFile reader and writer API - PoC #12298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
907089c to
313c2d5
Compare
core/src/main/java/org/apache/iceberg/io/datafile/DataFileServiceRegistry.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/io/datafile/DataFileServiceRegistry.java
Outdated
Show resolved
Hide resolved
| public Key(FileFormat fileFormat, String dataType, String builderType) { | ||
| this.fileFormat = fileFormat; | ||
| this.dataType = dataType; | ||
| this.builderType = builderType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of defining the default one using a priority (int) based approach and let the one with the highest priority be the default one. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a concrete example for this: Comet vectorized parquet reader spark.sql.iceberg.parquet.reader-type
I think it is good if the reader/writer choice is a conscious decision, and not happening based on some behind the scenes algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for simplicity. This code should not determine things like whether Comet is used. This should have a single purpose, which is to standardize how object models plug in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the config to properties, and the builder method will create the different readers based on this config
| import org.apache.iceberg.io.FileAppender; | ||
|
|
||
| /** Builder API for creating {@link FileAppender}s. */ | ||
| public interface AppenderBuilder extends InternalData.WriteBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder that AppenderBuilder has a base interface but the other builders don't.
Guess it might help to have a common DataFileIoBuilder interface defining the common builder attributes (table, schema, properties, meta). It's a bit of an "adventure in Java generics", but doable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you take a look at the other PRs (#12164, #12069), you can see that first, I took that adventurous route, but the result was too many classes/interfaces and casts.
This PR is aiming for the minimal set of changes, and the InternalData.WriteBuilder is already introduced to Iceberg by #12060. We either need to widen that interface or inherit from it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also confused by this inheritance. We're extending but overriding everything and it's not clear to me what we really gain by going with this approach. It looks like it ends up as a completely different builder that produces the same build result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal with the PR was to show the minimal changes required to make the idea work.
We either create a different builder class for the InternalData.WriteBuilder and the DataFile.WriteBuilder, or we need to have inheritance of the interfaces.
Based on our discussion below we might end up using a different strategy, so revisit this comment later.
| return DataFileServiceRegistry.read( | ||
| task.file().format(), Record.class.getName(), input, fileProjection, partition) | ||
| .split(task.start(), task.length()) | ||
| .caseSensitive(caseSensitive) | ||
| .reuseContainers(reuseContainers) | ||
| .filter(task.residual()) | ||
| .build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like these simplifications!
c528a52 to
9975b4f
Compare
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pvary for this proposal, I left some comments.
|
|
||
| /** Enables reusing the containers returned by the reader. Decreases pressure on GC. */ | ||
| @Override | ||
| default ReaderBuilder reuseContainers() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it should not be here? These are parquet reader specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also used by Avro.
See:
| this.reuseContainers = reuseContainers; |
| * @param rowType of the native input data | ||
| * @return {@link DataWriterBuilder} for building the actual writer | ||
| */ | ||
| public static <S> DataWriterBuilder dataWriterBuilder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand in what case need this? I think append would be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check this. We might be able to remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the current approach, the file format api implementation creates the appender, and the PR creates the writers for the different data/delete files
| * @return {@link AppenderBuilder} for building the actual writer | ||
| */ | ||
| public static <S, B extends EqualityDeleteWriterBuilder<B>> | ||
| EqualityDeleteWriterBuilder<B> equalityDeleteWriterBuilder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think file format should consider eqaulity deletion/pos deletion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current Avro positional delete writer behaves differently than Parquet/ORC positional delete writers.
In case of the positional delete files the schema provided to the Avro writer should omit the PATH and the POS fields, and only needs the actual table schema. The writer handles the PATH/POS fields by static code:
iceberg/core/src/main/java/org/apache/iceberg/avro/Avro.java
Lines 614 to 618 in b8fdd84
| public void write(PositionDelete<D> delete, Encoder out) throws IOException { | |
| PATH_WRITER.write(delete.path(), out); | |
| POS_WRITER.write(delete.pos(), out); | |
| rowWriter.write(delete.row(), out); | |
| } |
The Parquet/ORC positional delete writers behave in the same way. They expect the same input.
If we are ready for a more invasive change we can harmonize the writers.
I have aimed for a minimal changeset to allow easier acceptance for the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The appender doesn't need to know about these, but the file formats and the writer implementations need this
| * issues. | ||
| */ | ||
| private static final class Registry { | ||
| private static final Map<Key, ReaderService> READ_BUILDERS = Maps.newConcurrentMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more like a convention problem, I think maybe we just need to store in FileFormatService in registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes we don't have writers (arrow), or we have multiple readers vectorized/non-vectorized readers. Also Parquet has Comet reader. So I kept the writers and the readers separate
| /** Key used to identify readers and writers in the {@link DataFileServiceRegistry}. */ | ||
| public static class Key { | ||
| private final FileFormat fileFormat; | ||
| private final String dataType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this things like arrow, internal row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah,
Currenly we have:
- Record - generic readers/writers
- ColumnarBatch (arrow) - arrow
- RowData - Flink
- InternalRow - Spark
- ColumnarBatch (spark) - Spark batch
|
I will start to collect the differences here between the different writer types (appender/dataWriter/equalityDeleteWriter/positionalDeleteWriter) for reference:
|
| import org.apache.iceberg.io.DataWriter; | ||
|
|
||
| /** Builder API for creating {@link DataWriter}s. */ | ||
| public interface DataWriterBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not put the builder interface into the DataWriter class and put it in the same package? It seems odd to me that we're introducing this new datafile package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The classes which has to be implemented by the file formats are kept in the io package, but moved the others to the data package
core/src/main/java/org/apache/iceberg/io/datafile/DataFileServiceRegistry.java
Outdated
Show resolved
Hide resolved
...urces/META-INF/services/org.apache.iceberg.io.datafile.DataFileServiceRegistry$WriterService
Outdated
Show resolved
Hide resolved
|
While I think the goal here is a good one, the implementation looks too complex to be workable in its current form. The primary issue that we currently have is adapting object models (like Iceber's internal - switch (format) {
- case AVRO:
- AvroIterable<ManifestEntry<F>> reader =
- Avro.read(file)
- .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
- .createResolvingReader(this::newReader)
- .reuseContainers()
- .build();
+ CloseableIterable<ManifestEntry<F>> reader =
+ InternalData.read(format, file)
+ .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
+ .reuseContainers()
+ .build();
- addCloseable(reader);
+ addCloseable(reader);
- return CloseableIterable.transform(reader, inheritableMetadata::apply);
+ return CloseableIterable.transform(reader, inheritableMetadata::apply);
-
- default:
- throw new UnsupportedOperationException("Invalid format for manifest file: " + format);
- }This shows:
In this PR, there are a lot of other changes as well. I'm looking at one of the simpler Spark cases in the row reader. The builder is initialized from return DataFileServiceRegistry.readerBuilder(
format, InternalRow.class.getName(), file, projection, idToConstant)There are also new static classes in the file. Each creates a new service and each service creates the builder and object model: public static class AvroReaderService implements DataFileServiceRegistry.ReaderService {
@Override
public DataFileServiceRegistry.Key key() {
return new DataFileServiceRegistry.Key(FileFormat.AVRO, InternalRow.class.getName());
}
@Override
public ReaderBuilder builder(
InputFile inputFile,
Schema readSchema,
Map<Integer, ?> idToConstant,
DeleteFilter<?> deleteFilter) {
return Avro.read(inputFile)
.project(readSchema)
.createResolvingReader(schema -> SparkPlannedAvroReader.create(schema, idToConstant));
}The In addition, there are now a lot more abstractions:
I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that:
|
I'm happy that we agree with the goals. I created a PR to start the conversation. If there are willing reviewers we can introduce more invasive changes to archive a better API. I'm all for it!
I think we need to keep this direct transformations to prevent the performance loss which would be caused by multiple transformations between object model -> common model -> file format. We have a matrix of transformation which we need to encode somewhere:
The InternalData reader has one advantage over the data file readers/writers. The internal object model is static for these readers/writers. For the DataFile readers/writers we have multiple object models to handle.
If we allow adding new builders for the file formats we can remove a good chunk of the boilerplate code. Let me see how this would look like
We need to refactor the Avro positional delete write for this, or add a positionalWriterFunc. Also need to consider that the format specific configurations which are different for the appenders and the delete files (DELETE_PARQUET_ROW_GROUP_SIZE_BYTES vs. PARQUET_ROW_GROUP_SIZE_BYTES)
If we are ok with having a new Builder for the readers/writers, then we don't need the service. It was needed to keep the current APIs and the new APIs compatible.
Will do
Will see what could be arcived |
c488d32 to
71ec538
Compare
…quetFormatModel.ReadBuilderWrapper
f50aad4 to
7206c7b
Compare
7206c7b to
b41cb69
Compare
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFormatModel.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/avro/AvroFormatModel.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFormatModel.java
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/GenericFormatModels.java
Outdated
Show resolved
Hide resolved
| private final Class<D> type; | ||
| private final Class<S> schemaType; | ||
| private final BiFunction<Schema, Map<Integer, ?>, DatumReader<D>> readerFunction; | ||
| private final BiFunction<org.apache.avro.Schema, S, DatumWriter<D>> writerFunction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going through the different reader/writer functions and how they are supplied by Spark and generics. All of the variations are pretty wacky. This is a good example, where the read side uses an Iceberg Schema but the write side uses an Avro schema. In addition, only the write paths for Spark pass the inputSchema from Spark (even though the readers are parameterized by its type).
I know that we're trying to avoid doing extra work right now, but it seems to me that this is an area where we should do work now since we're ensuring that we can pass the engine schema around and we aren't going to want to make substantial changes to the FormatModel interface. What happens if we want to use the Iceberg schema in addition to the Avro schema for writes? I'd like to know now rather than later and would lean toward plumbing it through these interfaces; at a minimum, I'd make sure that we have Iceberg schema, engine schema, and file schema for writes (which is what is needed for Parquet) and I'd pass Iceberg schema, file schema, and constants map for reads. I also suspect that we are going to want engine schema for reads so we may want to add it now as well. Having a common set of metadata supported by the builders and the formats is going to make this easier to think about and build for, without needing to make substantial changes to plumbing when we actually want to use these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a BaseFormatModel abstract class, and updated the PR to use that in the Parquet, Avro and ORC models.
21ada63 to
4328286
Compare
ec93457 to
9627db8
Compare
b03fd34 to
00626e2
Compare
| .createBatchedReaderFunc( | ||
| (icebergSchema, messageType) -> | ||
| (VectorizedReader<?>) | ||
| readerFunction.read(icebergSchema, messageType, null, idToConstant)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We send in "null", as currently the engine schema setter is removed from the API, as it is not used anywhere at this point. We might want to consider adding it back, so it is wired in correctly
00626e2 to
67e706b
Compare
67e706b to
a11216f
Compare
rashworld-max
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is what the PR does:
ReadBuilder- Builder for reading data from data filesAppenderBuilder- Builder for writing data to data filesObjectModel- Providing ReadBuilders, and AppenderBuilders for the specific data file format and object model pairAppenderBuilder- Builder for writing a fileDataWriterBuilder- Builder for generating a data filePositionDeleteWriterBuilder- Builder for generating a position delete fileEqualityDeleteWriterBuilder- Builder for generating an equality delete fileReadBuilderhere - the file format reader builder is reusedWriterBuilderclass which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specificAppenderBuilderObjectModelRegistrywhich stores the availableObjectModels, and engines and users could request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) from.GenericObjectModels- for reading and writing Iceberg RecordsSparkObjectModels- for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objectsFlinkObjectModels- for reading and writing Flink RowData objects