
Conversation

@KnorpelSenf
Contributor

@KnorpelSenf KnorpelSenf commented Feb 6, 2024

This is a collection of various tasks that are overdue and somewhat related to each other. Below is a breakdown of the changes; after reading through it, please review this PR on a per-commit basis. The diff is very large, but the actual changes are mostly self-contained and limited in complexity.

Getting rid of axios and form-data

Node has had global fetch for a long time now. It is no longer helpful to use dependencies like axios that introduce security vulnerabilities like #111. Consequently, this PR removes both axios and form-data, and replaces them with global fetch.

Unfortunately, the built-in fetch does not work with the form-data package, and the built-in FormData cannot be used either because it does not support ReadableStreams.

Hence, this PR introduces a tiny implementation of the multipart/form-data protocol for our use case. As a side-effect, we now also support a few more ways to input files.
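
For illustration, such an encoder can be sketched as an async generator over `Uint8Array` chunks. This is a simplified sketch with illustrative names, not the PR's actual code:

```typescript
// Simplified sketch of a multipart/form-data encoder (illustrative names,
// not the PR's actual code): an async generator that interleaves boundary
// framing with field values and the streamed file chunks.
const enc = new TextEncoder();

async function* multipart(
  boundary: string,
  fields: Record<string, string>,
  filename: string,
  file: AsyncIterable<Uint8Array>,
): AsyncGenerator<Uint8Array> {
  // Plain metadata fields come first.
  for (const [name, value] of Object.entries(fields)) {
    yield enc.encode(
      `--${boundary}\r\ncontent-disposition:form-data;name="${name}"\r\n\r\n${value}\r\n`,
    );
  }
  // The file is the last part.
  yield enc.encode(
    `--${boundary}\r\ncontent-disposition:form-data;name="file";filename="${filename}"\r\ncontent-type:application/octet-stream\r\n\r\n`,
  );
  yield* file;
  // The closing boundary terminates the body.
  yield enc.encode(`\r\n--${boundary}--\r\n`);
}
```

Because this is just an async generator, any stream-like file source can be plugged in, which is where the additional input formats come from.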

This leaves us with socket.io as the only remaining dependency.

Updating supported Node versions

This PR introduces an engines field in package.json that declares support for all active Node versions.

We also align the Node versions used in CI with this. They are currently set to 20 and 22 only, because 24 is not available on GitHub yet.
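
For illustration, the engines field might look like this (the exact version range shown is an assumption, not necessarily what this PR ships):

```json
{
  "engines": {
    "node": ">=20"
  }
}
```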

Updating Prettier and ESLint

The tooling was updated, config files were migrated, and the source code was reformatted slightly.

Migrating all tests to TypeScript

Previously, the test suite was written in JS. This PR migrates to TS and employs tsx to avoid a compile step.

@KnorpelSenf
Contributor Author

not tested at all, does not even compile yet

@kingmesal

I'm very interested in dumping axios in favor of the built-in fetch. I run a lot of things in Cloudflare Workers and this would be quite a helpful change.

Any update on when this might be ready?

@KnorpelSenf
Contributor Author

I'm ready to continue with this, I was just hoping for pointers regarding #111 (comment) before I go ahead. Shipping a custom implementation of multipart/form-data clearly gives us the best result, but I didn't feel like deciding alone whether or not we want to accept that in this library.

@josiasmontag
Contributor

Sorry for the delay. We should go with a custom multipart/form-data implementation then as @KnorpelSenf suggested.

I think we need to keep the optional size parameter in the upload() method though. If one passes a custom stream with unknown length, there is no way to calculate a correct Content-Length header.

@KnorpelSenf
Contributor Author

KnorpelSenf commented May 2, 2024

> Sorry for the delay. We should go with a custom multipart/form-data implementation then as @KnorpelSenf suggested.

Alright, I will start working on this soon, but my time for OSS is a bit limited right now, so please don't expect an immediate implementation :)

> I think we need to keep the optional size parameter in the upload() method though. If one passes a custom stream with unknown length, there is no way to calculate a correct Content-Length header.

What if somebody passes a read stream, or an AsyncIterator<Uint8Array>? For example, if people download a file from one server via fetch and then obtain the file stream from the response body, there is no way of knowing the file size in advance. Is the content-length header strictly required?

@josiasmontag
Contributor

> What if somebody passes a read stream, or an AsyncIterator&lt;Uint8Array&gt;? For example, if people download a file from one server via fetch and then obtain the file stream from the response body, there is no way of knowing the file size in advance. Is the content-length header strictly required?

Unfortunately, S3 strictly requires the Content-Length header for uploads. Therefore, I would like to automatically detect the file size if possible (file streams) and otherwise use a size parameter for custom streams.

In the mentioned example the server should return a Content-Length header which could be passed as the size parameter to the upload() method.
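
As a sketch of that suggestion, a caller could derive the size from the response headers before forwarding the stream (`sizeFromResponse` is an illustrative helper, not part of the SDK):

```typescript
// Sketch: derive the optional size argument from a response's headers.
// sizeFromResponse is an illustrative helper, not part of the SDK.
function sizeFromResponse(headers: Headers): number | null {
  const header = headers.get("content-length");
  if (header === null) return null;
  const size = Number(header);
  // Guard against malformed header values; fall back to "unknown".
  return Number.isInteger(size) && size >= 0 ? size : null;
}
```

A caller downloading a file via fetch could then pass the response body stream together with `sizeFromResponse(res.headers)` as the size argument to the upload() method.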

@KnorpelSenf
Contributor Author

> What if somebody passes a read stream, or an AsyncIterator&lt;Uint8Array&gt;? For example, if people download a file from one server via fetch and then obtain the file stream from the response body, there is no way of knowing the file size in advance. Is the content-length header strictly required?

> Unfortunately, S3 strictly requires the Content-Length header for uploads. Therefore, I would like to automatically detect the file size if possible (file streams) and otherwise use a size parameter for custom streams.

Alright!

> In the mentioned example the server should return a Content-Length header which could be passed as the size parameter to the upload() method.

Yeah, I'll add this to the doc string.

@dncnbuck

dncnbuck commented May 9, 2024

Hi @KnorpelSenf - I see you've mentioned you won't be able to dedicate much time to this PR - but regardless I'm going to ask! Do you have a rough timeframe here?

@KnorpelSenf
Contributor Author

Yep, I'm gonna give a talk at https://events.geekle.us/typescript24/ on Tuesday so there are a number of things to prepare until then. The rest of that week will be spent catching up with life in general, and the week after that I'm mostly free and looking forward to getting back to this (but no promises, life can be surprising and I don't want to set myself deadlines)

@KnorpelSenf
Contributor Author

If you feel like contributing, you can check out the base implementation of what I'll do here in this file: https://github.com/grammyjs/grammY/blob/main/src/core/payload.ts

@KnorpelSenf
Contributor Author

I will get back to this after my summer vacation, likely in September. Sorry for the delays, there were a few other important tasks on my agenda :)

@KnorpelSenf KnorpelSenf marked this pull request as ready for review April 20, 2025 07:32
@KnorpelSenf
Contributor Author

KnorpelSenf commented Apr 20, 2025

> Authorization header maybe? The request is presigned and S3 rejects any requests with an additional Authorization header.

That was it! Thanks, it's fixed now. We need to wait for the rate limit again before we can see if all tests pass now.

> Content-Length is required as well.

I did some research and this is actually not true. If you use the multipart/form-data API of S3 (which this implementation does now) then you only have to specify the content length for each chunk (done automatically by the runtime). You do not need to know the total number of bytes to be uploaded.

My implementation now makes use of that so we can stream the entire data without querying its size upfront. This enables support for uploading arbitrary data streams, removes the need for a size argument, and saves the file stat operation.

@KnorpelSenf
Contributor Author

@josiasmontag this is ready for review now. If you have a project that uses this SDK, it may be reasonable to drop in these changes and try them out. The diff ended up being fairly large (perhaps I should do maintenance more frequently in the future), so we had better test the changes well.

@KnorpelSenf
Contributor Author

@kingmesal @dncnbuck @elawad you might be interested in doing this, too

@KnorpelSenf KnorpelSenf changed the title chore: do various maintenance tasks chore: perform various maintenance tasks Apr 20, 2025
@josiasmontag
Contributor

> I did some research and this is actually not true. If you use the multipart/form-data API of S3 (which this implementation does now) then you only have to specify the content length for each chunk (done automatically by the runtime). You do not need to know the total number of bytes to be uploaded.

> My implementation now makes use of that so we can stream the entire data without querying its size upfront. This enables support for uploading arbitrary data streams, removes the need for a size argument, and saves the file stat operation.

Unfortunately, this does not seem to work for native S3. Some background about the infra: currently and by default, CloudConvert uses Ceph as object storage, which has an S3-compatible API. Ceph seems to be less strict than the "real" S3 and accepts uploads without a Content-Length header. However, as a fallback (and for future regions) we are using AWS S3, and the SDK should be compatible. For testing, I just switched the sandbox to use AWS S3, and now uploads are rejected with a Length Required error.

Also, I adjusted the rate limits for the sandbox. You should not get the rate limit errors any more.

@KnorpelSenf
Contributor Author

Hmmmm sad. But alright. I'll reintroduce the size argument and precompute the length whenever possible.

Thanks for adjusting the limits.

@KnorpelSenf
Contributor Author

@josiasmontag actually … does that mean that we cannot use the multipart/form-data API at all? The Content-Length header would specify the byte count of the entire request body, but the request body is different from the file size that is passed to S3. How should we proceed?

@KnorpelSenf
Contributor Author

Or maybe specify a Content-Length header only for the respective part that contains the file? Here?

```js
yield enc.encode(
  `content-disposition:form-data;name="file";filename=${this.filename}\r\ncontent-type:application/octet-stream\r\n\r\n`
);
```

@josiasmontag
Contributor

Content-Length headers for the respective parts are not enough as far as I know; it needs a Content-Length header covering the total size of the entire body, including the file. axios calculated this automatically.

I think it would be the best to keep the current signature of the upload() method to avoid breaking changes:

```ts
upload(
    task: Task | JobTask,
    stream: Stream,
    filename: string | null = null,
    size: number | null = null
)
```

If no filename is passed -> try to guess the filename.
If no size is passed -> try to guess the size. If it is a file stream, use stat to get the file size.
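
The fallback logic sketched above could look roughly like this (`guessSize` is an illustrative helper, not the SDK's actual code; it relies on Node file streams exposing a string `path` that can be stat'ed):

```typescript
import { stat } from "node:fs/promises";
import type { ReadStream } from "node:fs";

// Sketch: prefer an explicitly passed size; otherwise, if the stream is a
// Node file stream (it exposes a string `path`), stat the file. For other
// custom streams the size stays unknown.
async function guessSize(
  stream: unknown,
  explicit: number | null,
): Promise<number | null> {
  if (explicit !== null) return explicit;
  const path = (stream as Partial<ReadStream>)?.path;
  if (typeof path === "string") {
    const { size } = await stat(path);
    return size;
  }
  return null;
}
```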

@KnorpelSenf
Contributor Author

How is the backend implemented then? If I tell you that the total body size is 500 bytes, and I first send 350 bytes of file data followed by 100 bytes of metadata, with the remaining 50 bytes being the interspersed boundaries of the protocol, then the server will not be able to determine the file size until the entire request has been read.

On the other hand, if the server reads the entire request anyway, then what's the purpose of the content length?

@josiasmontag
Contributor

Unfortunately I have no idea how AWS S3 has implemented this on their end 🤷‍♂️. The docs only say it requires Content-Length for the total request: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPOST.html#RESTObjectPOST-requests.

@KnorpelSenf
Contributor Author

Very strange :D

It's an easy fix either way. We can restore the old signature of upload.

@KnorpelSenf
Contributor Author

Oh. The link you've shared actually contains the implementation details I was asking for.

> **Important**
>
> When constructing your request, make sure that the file field is the last field in the form.

They simply force you to pass all metadata upfront. That way, they can compute the file size based on the content length.

The example I described above is simply disallowed.

That clears it up, thanks for sharing!
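
Using the numbers from the earlier 500-byte example (all values illustrative), the server-side arithmetic becomes possible once the file part comes last:

```typescript
// Sketch: with all metadata preceding the file part, the server can subtract
// what it has already read from the declared Content-Length to learn the
// file size before consuming any file bytes. Numbers are illustrative.
const contentLength = 500; // declared total body size
const metadataBytes = 100; // non-file parts, read first
const framingBytes = 50;   // boundaries and part headers
const fileSize = contentLength - metadataBytes - framingBytes;
```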

@KnorpelSenf
Contributor Author

@josiasmontag done!

I'm running into rate limits again, will check again later.

@KnorpelSenf
Contributor Author

Awesome. Would you like to do anything else before merging?

@josiasmontag
Contributor

Looks good 👍 Will do some final review and then release this as 3.0.0.
Thanks!

@KnorpelSenf KnorpelSenf mentioned this pull request Apr 24, 2025
might get size of different file and result in unexpected results
@josiasmontag josiasmontag merged commit 2b7f4be into cloudconvert:master Apr 24, 2025
3 checks passed
@KnorpelSenf KnorpelSenf deleted the maintenance branch April 24, 2025 21:45