Usage with pyarrow parquet #10

@tanguycdls

Hello, I'm very interested in using this library, but I'm struggling to apply it to a Parquet file other than the Dremel example.

import pyarrow as pa
import pyarrow.parquet as pq
import struct2tensor as s2t
from struct2tensor import expression_impl

# Write a minimal single-column Parquet file with pyarrow.
tbl = pa.table([pa.array([0, 1])], names=['a'])
with pq.ParquetWriter('/tmp/test', tbl.schema) as writer:
    writer.write_table(tbl)
filenames = ["/tmp/test"]
batch_size = 2

# Build a struct2tensor expression from the file and project column 'a'.
exp = s2t.expression_impl.parquet.create_expression_from_parquet_file(filenames)
ps = exp.project(['a'])

val = s2t.expression_impl.parquet.calculate_parquet_values([ps], exp,
                                                           filenames, batch_size)
# Pull only the first batch from the resulting dataset.
for h in val:
    break

This segfaults, with the following log output:
2021-04-15 15:30:40.254237: E struct2tensor/kernels/parquet/parquet_reader.cc:198]
The repetition type of the root node was 0, but should be 2. There may be something wrong with your supplied parquet schema. We will treat it as a repeated field.

2021-04-15 15:31:46.428109: W tensorflow/core/framework/dataset.cc:477]
Input of ParquetDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
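
For readers decoding the first log message: the numbers 0 and 2 look like the repetition-type codes from the Parquet format's Thrift definition (REQUIRED = 0, OPTIONAL = 1, REPEATED = 2). That the reader logs the raw enum value is an assumption based on the message wording, not something confirmed from the struct2tensor source. The helper name `describe_root_repetition` below is hypothetical and only there for illustration.

```python
from enum import IntEnum


class FieldRepetitionType(IntEnum):
    """Repetition codes as listed in the Parquet format's Thrift definition."""
    REQUIRED = 0
    OPTIONAL = 1
    REPEATED = 2


def describe_root_repetition(found: int, expected: int) -> str:
    """Render the numeric codes from the log message as readable names.

    Hypothetical helper: assumes the logged integers are raw Parquet
    FieldRepetitionType values.
    """
    return (
        f"root node is {FieldRepetitionType(found).name}, "
        f"reader expects {FieldRepetitionType(expected).name}"
    )


print(describe_root_repetition(0, 2))
# prints: root node is REQUIRED, reader expects REPEATED
```

Read this way, the message would mean the file's root schema node is marked REQUIRED while the reader expects a REPEATED root.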

I also tried loading the Dremel example file with pyarrow and writing it straight back out, and the error reproduces with that file as well.

How do you recommend writing the Parquet file?

Thanks for your help!
