Streaming new records into an existing parquet file in S3 #125

@designreact

I'm attempting to aggregate records by id as they are processed from SQS via lambda into S3.

A merged file does get uploaded to S3, and I can see its size increase each time the lambda runs, but when I inspect the data with parquet-tools I only see one result. I have a feeling this is because multiple headers end up in the file and parquet-tools only reads the latest entry.
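To illustrate that suspicion, here is a minimal sketch that assumes only the documented Parquet layout: a `PAR1` magic marker, the data, a footer, a 4-byte little-endian footer length, and a closing `PAR1`. The helpers `fakeParquet` and `readFooter` are hypothetical, and the footer is plain text standing in for the real Thrift-encoded metadata. It suggests why a reader that locates the footer from the end of the file would only see the most recently appended file.

```javascript
// Not real Parquet: a toy file with the same outer layout
// (magic, body, footer, 4-byte LE footer length, magic).
const MAGIC = Buffer.from("PAR1");

// Hypothetical helper: builds a toy file whose "footer" is plain text.
function fakeParquet(footerText) {
  const footer = Buffer.from(footerText);
  const len = Buffer.alloc(4);
  len.writeUInt32LE(footer.length, 0);
  return Buffer.concat([MAGIC, Buffer.from("row-data"), footer, len, MAGIC]);
}

// Hypothetical helper: finds the footer by walking back from the end of the file.
function readFooter(file) {
  const end = file.length;
  if (!file.subarray(end - 4).equals(MAGIC)) {
    throw new Error("missing trailing PAR1 magic");
  }
  const footerLen = file.readUInt32LE(end - 8);
  return file.subarray(end - 8 - footerLen, end - 8).toString();
}

// Byte-level concatenation, which is what merging the two streams amounts to.
const merged = Buffer.concat([
  fakeParquet("footer-A: 1 row"),
  fakeParquet("footer-B: 1 row"),
]);

console.log(readFooter(merged)); // footer-B: 1 row
```

The first file's footer is still in the merged bytes, which would explain the growing file size, but nothing at the end of the file points back to it.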

Can anyone help me adapt my approach using part of the parquetjs library? The aim is to stitch the streams together correctly. I think I need to parse the oldStream chunks into parquet rows and then write them into a new write stream, but being new to parquet and parquetjs I don't know where to start.
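A rough sketch of that idea, as I understand it: copy every existing row into a fresh writer, append the new record, and close the writer so that a single footer covers all rows. The function name `rewriteWithNewRows` is made up for illustration. It only assumes a cursor whose `next()` resolves to `null` at the end and a writer with `appendRow()` and `close()`, which is how I read the parquetjs README. The wiring in the trailing comment is untested, and the `/tmp` paths are placeholders.

```javascript
// Copies all rows from an existing file's cursor into a new writer, then
// appends the new rows. Closing the writer is what emits the single footer.
async function rewriteWithNewRows(cursor, writer, newRows) {
  let count = 0;
  let row;
  // Assumes cursor.next() resolves to null once the rows are exhausted.
  while ((row = await cursor.next())) {
    await writer.appendRow(row);
    count++;
  }
  for (const newRow of newRows) {
    await writer.appendRow(newRow);
    count++;
  }
  await writer.close();
  return count; // total rows written to the new file
}

// Possible wiring with parquetjs (untested assumption; paths are placeholders):
//   const parquet = require("parquetjs");
//   const reader = await parquet.ParquetReader.openFile("/tmp/old.parquet");
//   const writer = await parquet.ParquetWriter.openFile(schema, "/tmp/new.parquet");
//   await rewriteWithNewRows(reader.getCursor(), writer, [record]);
//   await reader.close();
```

This rewrites the whole file on every invocation, so the cost grows with the file, and two lambdas handling the same Key at once could still overwrite each other's upload.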

My approach could well be a poor one, but if possible I'd rather not create and maintain another process, e.g. a CloudWatch scheduled event to aggregate and repartition my data.

I think this may relate to: #120

Thanks to all the contributors for all your hard work, this is a great library from what I've seen so far 👍🏻

My current approach (though a little broken):

Read existing parquet file as stream

const oldStream = new Stream.PassThrough();
getS3FileStream(Bucket, Key) // Key: guid=xyz/year=2021/month=04/2021.04.xyz.parquet
  .pipe(oldStream);

Create new parquet stream from SQS record

const recordStream = new Stream.PassThrough();
createRecordStream(record) // formats SQS Record data inline with schema
  .pipe(StreamArray.withParser())
  .pipe(new parquet.ParquetTransformer(schema, { useDataPageV2: false }))
  .pipe(recordStream);

Merge streams together

const combinedStream = new Stream.PassThrough();
mergeStream(oldStream, recordStream)
  .pipe(combinedStream);

Upload to S3

const upload = s3Stream.upload({
  Bucket,
  Key // Key: guid=xyz/year=2021/month=04/2021.04.xyz.parquet
});
combinedStream.pipe(upload);
