Streaming new records into an existing parquet file in S3 #125

I'm attempting to aggregate records by id as they are processed from SQS via Lambda into S3.

A merged file does get uploaded to S3, and I can see the file size increasing each time the Lambda runs, but when I inspect the data with parquet-tools I only see one result. I suspect this is because multiple headers end up in the file and parquet-tools only reads the latest entry.
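One way to check that suspicion is to look at the raw bytes. The Parquet format puts the 4-byte magic `PAR1` at the start and the end of a file, and the last 8 bytes are the footer length (little-endian uint32) followed by the magic. The sketch below is only an illustration under that assumption: the helper names `countMagic` and `footerLength` are made up for this example and are not part of parquetjs.

```TypeScript
// "PAR1" in ASCII: a Parquet file starts and ends with these 4 bytes.
const MAGIC = [0x50, 0x41, 0x52, 0x31];

// True if the magic bytes sit at the given offset.
function hasMagicAt(bytes: Uint8Array, offset: number): boolean {
  if (offset < 0 || offset + MAGIC.length > bytes.length) return false;
  return MAGIC.every((b, i) => bytes[offset + i] === b);
}

// Count every occurrence of the magic bytes. A single well-formed file
// normally has 2 (head and tail). More than 2 hints at concatenated files,
// although the same bytes can also appear inside column data by coincidence.
function countMagic(bytes: Uint8Array): number {
  let count = 0;
  for (let i = 0; i + MAGIC.length <= bytes.length; i++) {
    if (hasMagicAt(bytes, i)) count++;
  }
  return count;
}

// Read the footer length stored just before the trailing magic.
// Readers find the metadata from the END of the file, so in a
// concatenated file only the final footer is consulted.
function footerLength(bytes: Uint8Array): number {
  if (bytes.length < 12 || !hasMagicAt(bytes, bytes.length - 4)) {
    throw new Error("not a Parquet file: missing trailing magic");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(bytes.length - 8, true); // little-endian
}
```

If `countMagic` on the downloaded object keeps growing by 2 with every Lambda run, the object is most likely several complete Parquet files back to back. That would be consistent with a reader only using the metadata in the last footer.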

Can anyone help me figure out a way to adapt my approach using part of the parquetjs library? The aim is to stitch the streams together correctly. I think I need to parse the oldStream chunks into parquet rows and then write them into a new write stream, but being new to parquet and parquetjs I don't know where to start.
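To make the idea concrete, here is a rough sketch of "read the old rows, then write old and new rows through one writer", so that the output has a single footer covering everything. It is not tested against parquetjs itself: `RowCursor` and `RowWriter` are hypothetical interfaces that only mirror the methods I understand the parquetjs cursor (`next()`) and `ParquetWriter` (`appendRow()`, `close()`) to expose, and `rewriteWithNewRows` is a made-up name.

```TypeScript
type Row = Record<string, unknown>;

// Hypothetical structural types covering only the methods used here.
interface RowCursor {
  next(): Promise<Row | null>; // resolves to null when no rows are left
}
interface RowWriter {
  appendRow(row: Row): Promise<void>;
  close(): Promise<void>; // flushes buffered rows and writes the footer
}

// Copy every existing row into the writer, append the new rows,
// then close the writer. Returns the number of rows written.
async function rewriteWithNewRows(
  oldRows: RowCursor,
  writer: RowWriter,
  newRows: Row[]
): Promise<number> {
  let written = 0;
  let row: Row | null;
  while ((row = await oldRows.next()) !== null) {
    await writer.appendRow(row);
    written++;
  }
  for (const r of newRows) {
    await writer.appendRow(r);
    written++;
  }
  await writer.close();
  return written;
}

// Hypothetical wiring with parquetjs, using temp files (paths are examples):
//   const reader = await parquet.ParquetReader.openFile("/tmp/old.parquet");
//   const writer = await parquet.ParquetWriter.openFile(schema, "/tmp/new.parquet");
//   await rewriteWithNewRows(reader.getCursor(), writer, [record]);
//   await reader.close();
```

The trade-off is that every invocation rewrites the whole file, so the cost grows with the file size, and two Lambdas handling the same key at the same time could overwrite each other's result.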

My approach could well be a poor one, but if possible I'd rather not create and maintain another process, e.g. a CloudWatch scheduled event, to aggregate and repartition my data.

I think this may relate to: https://github.com/ironSource/parquetjs/issues/120

Thanks to all the contributors for all your hard work; this is a great library from what I've seen so far 👍🏻

My current approach (though a little broken):

Read existing parquet file as stream
```TypeScript
const oldStream = new Stream.PassThrough();
getS3FileStream(Bucket, Key) // Key: guid=xyz/year=2021/month=04/2021.04.xyz.parquet
  .pipe(oldStream);
```

Create new parquet stream from SQS record
```TypeScript
const recordStream = new Stream.PassThrough();
createRecordStream(record) // formats SQS Record data inline with schema
  .pipe(StreamArray.withParser())
  .pipe(new parquet.ParquetTransformer(schema, { useDataPageV2: false }))
  .pipe(recordStream);
```

Merge streams together
```TypeScript
const combinedStream = new Stream.PassThrough();
mergeStream(oldStream, recordStream)
  .pipe(combinedStream);
```

Upload to S3
```TypeScript
const upload = s3Stream.upload({
  Bucket,
  Key // Key: guid=xyz/year=2021/month=04/2021.04.xyz.parquet
});
combinedStream.pipe(upload);
```

