I'm attempting to aggregate records by id as they are processed from SQS via Lambda into S3.

I do get a merged file uploaded into S3, as I can see the file size increasing each time the Lambda runs, but when using parquet-tools to inspect the data I only see one result. I have a feeling this is because each run appends a complete parquet file (with its own header and footer) to the object, and parquet-tools only reads the latest entry.

Can anyone help me figure out a way to adapt my approach using parts of the parquetjs library? The aim is to correctly stitch the streams together. I think I need to parse the oldStream chunks into parquet rows and then write them into a new write stream, but being new to Parquet and parquetjs I don't know where to start.
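For what it's worth, this is roughly the shape I imagine the fix taking (untested sketch; it assumes I first download the existing object to a local path such as /tmp, since I can't see a stream-based reader in the API, and `mergeIntoNewFile` is just a placeholder name):

```js
const parquet = require('parquetjs');

// Untested sketch: copy the existing rows into a new file, then append the
// incoming SQS rows, so the output is a single valid parquet file with one
// header/footer. File paths and the function name are placeholders.
async function mergeIntoNewFile(schema, oldFilePath, newFilePath, newRows) {
  const writer = await parquet.ParquetWriter.openFile(schema, newFilePath);

  // Copy every row out of the existing file...
  const reader = await parquet.ParquetReader.openFile(oldFilePath);
  const cursor = reader.getCursor();
  let row;
  while ((row = await cursor.next())) {
    await writer.appendRow(row);
  }
  await reader.close();

  // ...then append the freshly arrived rows.
  for (const newRow of newRows) {
    await writer.appendRow(newRow);
  }
  await writer.close();
}
```

Is something like this the right direction, or is there a better way to do it with the streams directly?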
My approach could well be a poor one, but if possible I'd rather not create / maintain another process, e.g. a CloudWatch scheduled event to aggregate and repartition my data.
I think this may relate to: #120
Thanks to all the contributors for your hard work; this is a great library from what I've seen so far 👍🏻
My current approach (though a little broken):
Read existing parquet file as stream
```js
const Stream = require('stream');

const oldStream = new Stream.PassThrough();
getS3FileStream(Bucket, Key) // Key: guid=xyz/year=2021/month=04/2021.04.xyz.parquet
  .pipe(oldStream);
```
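For context, getS3FileStream is essentially a thin wrapper like this (simplified sketch, assuming aws-sdk v2):

```js
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Return a readable stream over the existing S3 object.
function getS3FileStream(Bucket, Key) {
  return s3.getObject({ Bucket, Key }).createReadStream();
}
```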
Create new parquet stream from SQS record
```js
const parquet = require('parquetjs');
const StreamArray = require('stream-json/streamers/StreamArray');

const recordStream = new Stream.PassThrough();
createRecordStream(record) // formats the SQS record data in line with the schema
  .pipe(StreamArray.withParser())
  .pipe(new parquet.ParquetTransformer(schema, { useDataPageV2: false }))
  .pipe(recordStream);
```
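createRecordStream is along these lines (simplified sketch; it assumes the record body already matches the parquet schema):

```js
const { Readable } = require('stream');

// Wrap the SQS record body in a JSON array string so that
// StreamArray.withParser() has an array to parse into objects.
function createRecordStream(record) {
  const row = JSON.parse(record.body); // shaped to match `schema`
  return Readable.from([JSON.stringify([row])]);
}
```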
Merge streams together
```js
const mergeStream = require('merge-stream'); // assuming the merge-stream package here

// I suspect this is the broken part: it byte-concatenates two complete
// parquet files rather than merging their rows.
const combinedStream = new Stream.PassThrough();
mergeStream(oldStream, recordStream)
  .pipe(combinedStream);
```
Upload to S3
```js
const s3Stream = require('s3-upload-stream')(new AWS.S3()); // assuming s3-upload-stream

const upload = s3Stream.upload({
  Bucket,
  Key // Key: guid=xyz/year=2021/month=04/2021.04.xyz.parquet
});
combinedStream.pipe(upload);
```
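If the row-level rewrite sketched further up pans out, I guess this last step becomes a plain file upload rather than a stream merge, something like this inside the async Lambda handler (sketch, aws-sdk v2; `newFilePath` is the /tmp path from the earlier sketch):

```js
const fs = require('fs');

// Upload the rewritten parquet file whole, replacing the partitioned object.
await s3.upload({
  Bucket,
  Key, // same partitioned key as before
  Body: fs.createReadStream(newFilePath),
}).promise();
```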