
Conversation

@aseigo
Contributor

aseigo commented Dec 16, 2025

In the documentation, there are several examples such as this one from the error handling guide:

GRPC.Stream.from([1, 2])
|> GRPC.Stream.map(fn
  2 -> raise "boom"
  x -> x
end)
|> GRPC.Stream.map_error(fn
  {:error, {:exception, _reason}} ->
    {:error, GRPC.RPCError.exception(message: "Booomm")}
end)

However, we have found that this does not actually work in practice and instead errors are thrown on the server side with no errors arriving on the client side. Which is unfortunate, because this would be a very nice API pattern :)

Thankfully, it's a pretty simple fix: GRPC.Server.Adapter.send_reply/2 needs to send errors to the client. This does mean adding one more function to the Adapter behaviour to facilitate access to the error sending capabilities of the adapter, however.

While pursuing this problem, I stumbled on a semi-related "foot-gun": the error exception did not require a status, and that would result in yet more errors being thrown. To address that, the status field now defaults to an unknown error, so that developers who only set a message don't end up with things blowing up on them.
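For illustration, a minimal sketch of that default-status idea; the real module defines more than this, and the exact implementation in the PR may differ:

# Minimal sketch: the key point is a non-nil default for :status.
defmodule GRPC.RPCError do
  # 2 is the gRPC UNKNOWN status code
  defexception status: 2, message: nil, details: nil
end

# Developers who only set a message no longer end up with a nil status:
GRPC.RPCError.exception(message: "Booomm")
#=> %GRPC.RPCError{status: 2, message: "Booomm", details: nil}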

With this PR, we now see errors appear in trailers. I think there are still discussions to be had in terms of multi-message streams and how errors could/should be handled there. As it currently stands, the stream finishes when an error occurs. Of course, the developer can use an in-band error-signalling message instead, leaving "actual" gRPC errors as communication-terminating events.

Aaron Seigo added 2 commits December 16, 2025 11:36
This avoids a foot-gun where the app dev does not know which status to use
and only sets the message (and/or potentially the details). Without a
default value, the status field becomes `nil`, which blows up during
serialization; that is not an easy bug to trace back to the right place in
the application.
This exposes a `send_error/2` function in the adapter behaviour.
end

def send_reply(%{adapter: adapter} = stream, {:error, %GRPC.RPCError{} = error}, _opts) do
  adapter.send_error(stream.payload, error)
Contributor

Do we really need a new send_error callback? What does send_reply not have?

Contributor Author

The function already exists; it's just not accessible via the Adapter behaviour.

The difference from send_reply is that the error path doesn't try to gRPC-encode the message; instead it puts the error information into the trailers and ends the connection immediately.

It would be possible to change the contract of Adapter.send_reply/3 so that implementations must pattern match on errors being passed in as the data, but that means overloading the purpose of send_reply: sometimes it sends a reply, sometimes it sends an error.

I felt it was cleaner to keep these two paths clearly separated, and since the adapter already needs to track its own errors and have a way to deal with them, this isn't actually introducing new complexity.

But if you'd prefer to overload send_reply and make it a requirement for adapters to handle RPCError structs there, that can also be done!
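For reference, a rough sketch of the shape under discussion; the real GRPC.Server.Adapter behaviour has more callbacks, and the typespecs here are assumptions rather than the exact diff:

defmodule GRPC.Server.Adapter do
  # Existing path: the reply is gRPC-encoded and streamed to the client.
  @callback send_reply(state :: any(), data :: iodata(), opts :: keyword()) :: any()

  # New path: no encoding; the RPCError goes into the trailers and the
  # connection is ended immediately.
  @callback send_error(state :: any(), error :: GRPC.RPCError.t()) :: any()
end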

Contributor
Second paragraph sold me on the new callback!

  do_send_reply(stream, [], opts)
end

def send_reply(%{adapter: adapter} = stream, {:error, %GRPC.RPCError{} = error}, _opts) do
Contributor

One thing we need to check is whether the only errors that can reach here are in the form of this struct, or if we need to add another clause.

Contributor Author

aseigo commented Dec 16, 2025

That's a very good question.

I've oscillated between having one and not having one, as it would be "easy" enough to try to match on any error, or even just errors with strings (an error message), and wrap those in an RPCError.

The flip side is that, should that occur, it is almost certainly a developer error. Either an RPCError, a raised exception, or a protobuf-encodable struct should be the result. Returning Something Else(tm) probably means something like a raw Ecto query has been passed back, and silently forwarding that to the client, even as an error, feels somewhere between dangerous and too much magic.

In those cases, I'd rather see a request crash on the server side that can be tracked and addressed.

@sleipnir
Collaborator

In the documentation, there are several examples such as this one from the error handling guide:

GRPC.Stream.from([1, 2])
|> GRPC.Stream.map(fn
  2 -> raise "boom"
  x -> x
end)
|> GRPC.Stream.map_error(fn
  {:error, {:exception, _reason}} ->
    {:error, GRPC.RPCError.exception(message: "Booomm")}
end)

However, we have found that this does not actually work in practice and instead errors are thrown on the server side with no errors arriving on the client side. Which is unfortunate, because this would be a very nice API pattern :)

@aseigo Thank you for bringing this up! I'd like to clarify that the current behavior is actually intentional by design, not a bug. Let me explain the flow and the reasoning behind it.

Current Error Flow (Working as Designed)

When using GRPC.Stream.map_error/2 in streaming handlers, here's what happens:

  • The handler calls run_with (inside GRPC.Server.send_reply for streaming)
  • The Flow executes and map_error transforms errors into {:error, %GRPC.RPCError{}}
  • run_with detects the error and calls send_response
  • send_response raises the GRPC.RPCError (an intentional control-flow mechanism)
  • The adapter catches the exception:
    • Cowboy catches it via catch and does exit({:shutdown, {error, []}})
    • The stream process/handler sends error trailers to the client
    • The stream terminates with the proper gRPC error status

This flow successfully sends errors to clients through HTTP/2 trailers with the correct grpc-status and grpc-message; a rough sketch of it follows below.
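A rough sketch of that raise-and-catch control flow, with simplified function shapes (FlowSketch, run_handler/2, and handler_fun are placeholders; the library's real code differs in detail):

defmodule FlowSketch do
  # The streaming pipeline re-raises transformed errors so control
  # leaves the Flow and reaches the adapter.
  def send_response(_stream, {:error, %GRPC.RPCError{} = error}) do
    raise error
  end

  def send_response(stream, response) do
    GRPC.Server.send_reply(stream, response)
  end

  # The adapter catches the raise and exits with the error; the handler
  # process then writes the grpc-status / grpc-message trailers and
  # closes the stream.
  def run_handler(stream, handler_fun) do
    handler_fun.(stream)
  catch
    :error, %GRPC.RPCError{} = error ->
      exit({:shutdown, {error, []}})
  end
end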

Why We Terminate the Stream on Error

The decision to terminate the stream when an error occurs is intentional for several important reasons:

  1. Trailer Accumulation Problem
     In HTTP/2 and the gRPC protocol, trailers are sent once, at the end of a stream, with the END_STREAM flag. If we were to continue streaming after an error:
     • Problematic scenario: [msg1, ERROR, msg2, ERROR2, msg3]. Which trailers do we send, the grpc-status from ERROR or from ERROR2? We can't send multiple trailers; that's a protocol limitation.
     HTTP/2 only allows one HEADERS frame with END_STREAM at the end. We can neither send multiple trailer frames (a protocol violation) nor accumulate errors indefinitely (a memory/state-management nightmare).

  2. gRPC Protocol Semantics
     Per the gRPC protocol specification: "Status is sent as a single trailer at the end of the stream."
     The protocol is designed around the concept of one status per RPC. Errors are meant to be terminating events.

  3. Clear Error Semantics
     When an error occurs in a stream transformation:

GRPC.Stream.from([1, 2, 3])
|> GRPC.Stream.map(fn x -> if x == 2, do: raise("error"), else: x end)
|> GRPC.Stream.map_error(fn _ ->
  GRPC.RPCError.exception(status: :internal, message: "Processing failed")
end)

The client receives:

✅ msg1 (successful)
✅ Trailers: grpc-status=13, grpc-message="Processing failed"
✅ Stream closes

The alternative (continuing after error) would be confusing:

❌ What does receiving more messages after an error mean?
❌ Was the error "recovered"?
❌ How does the client know the final status?

Alternative: In-Band Error Signaling

If you need to handle recoverable errors within a stream, the recommended pattern is in-band error signaling:

Define a response type that includes success/error variants; in other words, map errors to ordinary data types within your application's business logic (see the sketch after the list below).

This way:

✅ Errors are part of the message protocol
✅ Stream continues after recoverable errors
✅ Final grpc-status=0 indicates successful completion
✅ Client can handle per-message errors
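For illustration, a hedged sketch of that pattern; MyApp.ItemResponse and MyApp.ItemError are placeholder message types with a result oneof, not anything from this PR or the library:

GRPC.Stream.from([1, 2, 3])
|> GRPC.Stream.map(fn
  2 ->
    # The failure travels as an ordinary message, so the stream keeps
    # going and later terminates normally with grpc-status=0.
    %MyApp.ItemResponse{result: {:error, %MyApp.ItemError{reason: "boom"}}}

  x ->
    %MyApp.ItemResponse{result: {:item, x}}
end)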

The current behavior is:

  • raise → adapter catches → sends trailers works perfectly for both Cowboy and ThousandIsland
  • Stream termination on error prevents trailer-accumulation issues and follows gRPC protocol semantics
  • map_error does work: it transforms errors before sending them to clients as terminating trailers
  • For non-terminating errors, use in-band error messages in your protobuf definitions

The key insight is: gRPC errors are meant to be stream-terminating events. If you need to continue streaming despite errors, those "errors" should be modeled as regular response messages, not gRPC status codes.

Does this clarify the design? I'm happy to discuss further if you have a specific use case that this pattern doesn't address well!
But do you have any evidence to show us that the client didn't receive the error via the trailer? Couldn't it be a bug in the clients? What you're implementing seems to be exactly the original design, just in a different way. I'd like to discuss this a bit more before proceeding with the review.

@sleipnir
Collaborator

sleipnir commented Dec 16, 2025

@aseigo Remember that map_error was created precisely for scenarios where you intend to capture errors and transform them into valid business messages; and if an error is critical, you can also map it to whatever gRPC error status suits your case. (An illustrative sketch follows below.)
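An illustrative sketch of that dual role; recoverable?/1 and the MyApp message types are placeholders standing in for application code:

GRPC.Stream.from([1, 2])
|> GRPC.Stream.map(fn
  2 -> raise "boom"
  x -> %MyApp.ItemResponse{result: {:item, x}}
end)
|> GRPC.Stream.map_error(fn {:error, {:exception, reason}} ->
  if recoverable?(reason) do
    # Business message: the stream keeps going.
    %MyApp.ItemResponse{result: {:error, %MyApp.ItemError{reason: inspect(reason)}}}
  else
    # Critical: a gRPC status that terminates the stream with error trailers.
    {:error, GRPC.RPCError.exception(status: GRPC.Status.internal(), message: "internal error")}
  end
end)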

@aseigo
Contributor Author

aseigo commented Dec 16, 2025

Let me explain the flow and the reasoning behind it.

I agree that this is a fine way to handle errors, and I quite like the ergonomics of it.

The issue is that it currently does not work.

When returning GRPC.RPCError.exception from GRPC.Stream.map (or map_error) one gets:

[error] ** (UndefinedFunctionError) function GRPC.RPCError.transform_module/0 is undefined or private
    (grpc_core 1.0.0-rc.1) GRPC.RPCError.transform_module()
    (protobuf 0.15.0) lib/protobuf/encoder.ex:157: Protobuf.Encoder.transform_module/2
    (protobuf 0.15.0) lib/protobuf/encoder.ex:12: Protobuf.Encoder.encode_to_iodata/1
    (grpc_server 1.0.0-rc.1) lib/grpc/server/stream.ex:93: GRPC.Server.Stream.send_reply/3
    (grpc_server 1.0.0-rc.1) lib/grpc/stream.ex:179: GRPC.Stream.run/1

If returning {:error, %GRPC.RPCError{...}}, one gets:

[error] ** (FunctionClauseError) no function clause matching in Protobuf.Encoder.encode_to_iodata/1
    (protobuf 0.15.0) lib/protobuf/encoder.ex:10: Protobuf.Encoder.encode_to_iodata({:error, %GRPC.RPCError{status: 2, message: "Oh no!", details: nil}})
    (grpc_server 1.0.0-rc.1) lib/grpc/server/stream.ex:93: GRPC.Server.Stream.send_reply/3
    (grpc_server 1.0.0-rc.1) lib/grpc/stream.ex:179: GRPC.Stream.run/1

On the client side, a trailer is sent but it does not contain the contents of the RPCError the developer created. It contains the generic trailers with a status of 0 (!) and no message.

This can also be seen in the tests changed in this PR, where cases that exercise raised RPCError exceptions see the exception in the log rather than the message.

So this doesn't change the workflow currently in the library; it just makes it work such that the RPCError ends up on the client. We noticed this when map_error calls that generated RPCError responses, copied and pasted from the docs even, did not make it to the client while these transform_module/encode_to_iodata errors were showing up in the logs.
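For context, roughly how we observed this from the stock Elixir client; the connection details and MyApp.MyService.Stub.stream_items/2 are placeholders for our application's generated stub:

{:ok, channel} = GRPC.Stub.connect("localhost:50051")
{:ok, reply_stream} = MyApp.MyService.Stub.stream_items(channel, %MyApp.ItemsRequest{})

Enum.each(reply_stream, fn
  {:ok, reply} ->
    IO.inspect(reply, label: "reply")

  {:error, %GRPC.RPCError{} = error} ->
    # Expected: the RPCError built in map_error. Observed before this fix:
    # this clause never runs; the stream simply ends with grpc-status: 0
    # and an empty grpc-message in the trailers.
    IO.inspect(error, label: "rpc error")
end)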

@sleipnir
Collaborator

sleipnir commented Dec 16, 2025

On the client side, a trailer is sent but it does not contain the contents of the RPCError the developer created. It contains the generic trailers with a status of 0 (!) and no message.

This can also be seen in the tests changed in this PR, where cases that exercise raised RPCError exceptions see the exception in the log rather than the message.

So this doesn't change the workflow currently in the library; it just makes it work such that the RPCError ends up on the client. We noticed this when map_error calls that generated RPCError responses, copied and pasted from the docs even, did not make it to the client while these transform_module/encode_to_iodata errors were showing up in the logs.

Okay, let me explore a bit more. I created a suite of integration tests about this here: 1fb869b. Could you please run them like this on the feat/new-server-adapter branch: mix test test/grpc/integration/stream_test.exs --only map_error

All I had to do was adjust the patterns of the send_response/3 function. Sorry for insisting, but I just want to understand the bug.

@aseigo
Contributor Author

aseigo commented Dec 16, 2025

All I had to do was adjust the patterns of the send_response/3 function

With that adjustment, on the server side I see the exception in the logs rather than random errors (yes! nice!), but I am still not seeing them on the client side. I applied the patch to the master branch as well as my grpcweb-with-trailers branch, and tested with a native client and a grpcweb client, and I'm still seeing grpc-message: '', grpc-status: 0.

So maybe handling this in Stream.send_response is a good way forward, but there's still the matter of getting the response to the client, it seems...

@sleipnir
Collaborator

sleipnir commented Dec 16, 2025

All I had to do was adjust the patterns of the send_response/3 function

With that adjustment, on the server side I see the exception in the logs rather than random errors (yes! nice!), but I am still not seeing them on the client side. I applied the patch to the master branch as well as my grpcweb-with-trailers branch, and tested with a native client and a grpcweb client, and I'm still seeing grpc-message: '', grpc-status: 0.

So maybe handling this in Stream.send_response is a good way forward, but there's still the matter of getting the response to the client, it seems...

Did you run the tests? They're important for knowing whether it's something the clients are mishandling or a server error. In my case, I believe it's a client issue, because the integration tests worked with the standard Elixir client, correctly retrieving the errors.

@aseigo How can I replicate your problem scenario here on my end?

@aseigo
Contributor Author

aseigo commented Dec 16, 2025

Did you run the tests?

The tests in the feat/new-server-adapter branch work. I suspect (though I haven't confirmed due to time restrictions here) that it is because the ThousandIsland-based adapter does this correctly while the existing Cowboy adapter in the master branch does not?

I say this because the same change to Stream.send_response does not work in the master branch with the existing cowboy-based adapter. :/

If the Cowboy adapter is going to be removed soonish, or if it is also changed in the feat/new-server-adapter branch, then we just have to wait for that branch to be merged, I suppose. If the Cowboy adapter remains, then this will remain an issue. And, of course, if the new adapter isn't merged and released, then it also remains an issue :)

I could not test with the app I have here, unfortunately, as it does not start correctly with the feat/new-server-adapter branch. (Will comment about that on the other PR...)

@sleipnir
Collaborator

sleipnir commented Dec 16, 2025

Did you run the tests?

The tests in the feat/new-server-adapter branch work. I suspect (though I haven't confirmed due to time restrictions here) that it is because the ThousandIsland-based adapter does this correctly while the existing Cowboy adapter in the master branch does not?

Those tests ran with Cowboy; run_server without passing options will use the default adapter, which in this case is Cowboy. To run the tests with ThousandIsland, you would need to pass the option:

run_server([MyService], fn port ->
  # ...
end, 0, adapter: GRPC.Server.Adapters.ThousandIsland)

I only used that branch because it was convenient to be able to commit to it without impacting anything in production.
I will also continue investigating the issue before reviewing your PR. I don't like approving things without understanding the real problem they are trying to solve. 😸

@sleipnir
Collaborator

@aseigo Can you try again here? #487

@aseigo aseigo mentioned this pull request Dec 16, 2025
@aseigo
Contributor Author

aseigo commented Dec 16, 2025

Fixed in #487, closing! Nice work @sleipnir !

@aseigo aseigo closed this Dec 16, 2025
@sleipnir
Collaborator

Fixed in #487, closing! Nice work @sleipnir !

I'm the one who should be thanking everyone for their hard work.
