Skip to content

reverseproxy: add lb_retry_match condition on response status#7569

Open
seroperson wants to merge 2 commits intocaddyserver:masterfrom
seroperson:i6267-reverse-proxy-retry
Open

reverseproxy: add lb_retry_match condition on response status#7569
seroperson wants to merge 2 commits intocaddyserver:masterfrom
seroperson:i6267-reverse-proxy-retry

Conversation

@seroperson
Copy link
Copy Markdown

@seroperson seroperson commented Mar 14, 2026

Closes #6267

  • lb_retry_match now handles expression option which evaluates against the upstream response via {rp.status_code}, {rp.status_text}, and {rp.header.X-Header} placeholders in CEL expressions
  • Added new helper function to CEL expressions isTransportError()
  • Using both expression + old syntax (I mean just a plain old lb_retry_match options) isn't allowed
  • Fully backward-compatible

It wasn't possible to implement it like lb_retry_match `path('/foo*') && {rp.status_code} in [502, 503]` because of how lb_retry_match is handled, but you can use a oneliner like lb_retry_match expression `path('/foo*') && {rp.status_code} in [502, 503]`

Everything is covered with unit-tests, e2e tests and caddyfile tests.

⚠️ Initially I wanted to preserve both old syntax + new syntax to make everything AND'ed, but it wasn't possible because of how tough the logic is: some things rely on existence of lb_retry_match options, some things not, and sometimes it's unclear whether something would be retried or not.

For example, the previous logic:

  reverse_proxy upstream1 upstream2 {
      lb_retries 3
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          method PUT
      }
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘

GET loses its' default transport error retry because RetryMatch is non-nil and only PUT is defined. Adding expression handling would result into even more unclear logic. That's why using expression with plain options isn't allowed.

Just to remind what transport error is (roughly):

  • Upstream accepted the connection then reset it
  • TLS handshake succeeded but upstream closed during request/response transfer
  • Upstream timed out while sending the response body
  • Upstream sent malformed HTTP response headers

Now examples of how the new expression can be used:

  • Retry 502/503 responses only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry on response header only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌──────────────────────────────────────┬───────────────┐
  │               Scenario               │   Behavior    │
  ├──────────────────────────────────────┼───────────────┤
  │ Dial error, any method               │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Transport error, any method          │ not retried   │
  ├──────────────────────────────────────┼───────────────┤
  │ Response with X-Upstream-Retry: true │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Response without header              │ proxied as-is │
  └──────────────────────────────────────┴───────────────┘
  • Retry 503 only for POST
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 503 POST                    │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 503 GET                     │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST                    │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 POST                    │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry 5xx for /api* paths only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /api/foo            │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST /api/foo           │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /other              │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 GET /api/foo            │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry all transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry GET transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('GET')) || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, GET         │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, POST        │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200 response                 │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry PUT transport errors to /api* + any 502
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('PUT') && path('/api*')) || {rp.status_code} == 502`
      }
  }

  ┌────────────────────────────────────┬───────────────┐
  │              Scenario              │   Behavior    │
  ├────────────────────────────────────┼───────────────┤
  │ Dial error, any method             │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /api/foo      │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /other        │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, GET /api/foo      │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ 502 response, any method, any path │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ 200 response                       │ proxied as-is │
  └────────────────────────────────────┴───────────────┘
  • Retry POST on 503, GET on transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
      lb_retry_match {
          expression `isTransportError() && method('GET')`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 503 POST               │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ 503 GET                │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 502 POST               │ not retried       │
  └────────────────────────┴───────────────────┘
  • Retry idempotent transport errors + 502 for specific host
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT') || method('DELETE'))`
      }
      lb_retry_match {
          expression `host('api.example.com') && {rp.status_code} == 502`
      }
  }

  ┌──────────────────────────┬───────────────────┐
  │         Scenario         │     Behavior      │
  ├──────────────────────────┼───────────────────┤
  │ Dial error, any method   │ retried           │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, GET     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, PUT     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, DELETE  │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, POST    │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 502 to api.example.com   │ retried (block 2) │
  ├──────────────────────────┼───────────────────┤
  │ 502 to other.example.com │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 200 response             │ proxied as-is     │
  └──────────────────────────┴───────────────────┘
  • Retry transport errors for multiple methods + 5xx for /api*
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT'))`
      }
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, PUT   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /api/foo       │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 502 POST /api/foo      │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /other         │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 200 response           │ proxied as-is     │
  └────────────────────────┴───────────────────┘
  • Retry on response header + all transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌─────────────────────────────────┬───────────────┐
  │            Scenario             │   Behavior    │
  ├─────────────────────────────────┼───────────────┤
  │ Dial error, any method          │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ Transport error, any method     │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 502 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 without header              │ proxied as-is │
  └─────────────────────────────────┴───────────────┘

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 16, 2026

I've revisited this PR to make it simpler. Shortly, the only thing left is retrying based on status, which I moved to the lb_retry_match block.

I've removed retrying based on conditions as it messes with default behavior too much and it makes everything too complicated and unclear. I still can implement them in this PR, but feels like you should point me on how they must be defined and handled.

Updated the PR's description, previous one is now hidden under "Initial version" spoiler.

@seroperson seroperson changed the title feat(reverse_proxy): lb_retry_on conditions feat(reverse_proxy): lb_retry_match now contains status field Mar 16, 2026
@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from b8aaa26 to 4e31ada Compare March 16, 2026 14:50
@seroperson
Copy link
Copy Markdown
Author

@mohammed90 Hello! Sorry for ping, but would you mind reviewing this?

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from eeb231c to d106d8c Compare March 25, 2026 00:16
@francislavoie
Copy link
Copy Markdown
Member

I find this quite awkward. As documented, the config layer says lb_retry_match takes a request matcher https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#lb_retry_match. But with this, it's a weird mix of request and response matchers. I'm not sure I like that. I understand the motivation though.

Could we use response matchers https://caddyserver.com/docs/caddyfile/response-matchers directly instead of statusCodes? It would allow the status code matching to come "for free", and add header matching support. I don't know if header matching is that useful, but it might be if the backend explicitly responds with some X-Retry-Me: true header or something lol

This also needs Caddyfile adapt tests. See .caddyfiletest files, there should be coverage for various patterns to cover backwards compat and new config.

@francislavoie francislavoie added the feature ⚙️ New feature or request label Mar 25, 2026
@francislavoie francislavoie changed the title feat(reverse_proxy): lb_retry_match now contains status field reverseproxy: add lb_retry_match condition on response status Mar 25, 2026
@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 25, 2026

@francislavoie yea, the parsing looks weird in such scenario.

The reason why I did it exactly like this is that I wanted to preserve the ability to write the spec like this:

lb_retry_match {
  method POST
  status 503
}

This way it looks understandable for an end user and reads exactly like "retry on match method POST and status 503". We can preserve this approach and just re-use some parsing from response-matcher, so it will look like:

lb_retry_match {
  <request-matcher>
  <response-matcher>
}
// or like this
lb_retry_match {
  <request-matcher>
  on_response {
    <response-matcher>
  }
}

It will be still the same mix, but with re-used parsing logic.


OR we can split the matchers, so it will look like:

lb_retry_match {
  <request-matcher>
}
lb_retry_match_response {
  <response-matcher>
}

This way the code is cleaner and there is a strict separation of matching requests and responses, but this approach a bit less flexible because of how we treat conditions:

  • Currently every lb_retry_match is ORed and nested fields are ANDed, like lb_retry_match { ... AND ... } OR lb_retry_match { ... AND ... }.
  • With both lb_retry_match and lb_retry_match_response it intuitively feels like this: (lb_retry_match { ... AND ... } OR lb_retry_match { ... AND ... }) AND (lb_retry_match_response { ... AND ... } OR lb_retry_match_response { ... AND ... })

But then we will be unable to implement something like "retry on (DELETE and 401) or (POST and 503)". It will be only "retry on (DELETE or POST) and (401 or 503)".

Maybe I'm missing something.


@francislavoie @mohammed90 What do you think? Which approach to implement then?

@francislavoie
Copy link
Copy Markdown
Member

francislavoie commented Mar 27, 2026

Hmm.

I think we can just reframe this.

What we want is to match on response status or headers.

We can actually do this using CEL expressions already, as long as the Replacer is filled with properties of the response, placeholders can be used in the CEL expression to do complex conditions involving those.

We document in https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#intercepting-responses that {rp.status_code}, {rp.status_text}, and {rp.header.*} can be used in handle_response, we can ensure those are populated as well in the lb_retry codepath if it's not already. Then writing config like this should be possible:

lb_retry_match `path('/foo*') && {rp.status_code} in [502, 503]`

The backtick syntax might not work as-is in lb_retry_match, I don't remember how I implemented that, whether it only works on @named-matchers or if it would also work on subdirectives that take matchers too, but if not then the expression matcher name would need to be put in front of it.

We'll also need some .caddyfiletest Caddyfile -> JSON adapt fixtures for this functionality to prove is encodes properly.

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch 2 times, most recently from 03fcc72 to 8b1f9c6 Compare March 28, 2026 15:26
@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 28, 2026

@francislavoie thank you for the review and pointing in the right direction! I've revisited changes and re-implemented them.

Now:

  • lb_retry_match now handles expression option which evaluates against the upstream response via {rp.status_code}, {rp.status_text}, and {rp.header.X-Header} placeholders in CEL expressions
  • Added new helper function to CEL expressions isTransportError()
  • Using both expression + old syntax (I mean just a plain old lb_retry_match options) isn't allowed
  • Fully backward-compatible

It wasn't possible to implement it like lb_retry_match `path('/foo*') && {rp.status_code} in [502, 503]` because of how lb_retry_match is handled, but you can use a oneliner like lb_retry_match expression `path('/foo*') && {rp.status_code} in [502, 503]`

Everything is covered with unit-tests, e2e tests and caddyfile tests.

⚠️ Initially I wanted to preserve both old syntax + new syntax to make everything AND'ed, but it wasn't possible because of how tough the logic is: some things rely on existence of lb_retry_match options, some things not, and sometimes it's unclear whether something would be retried or not.

For example, the previous logic:

  reverse_proxy upstream1 upstream2 {
      lb_retries 3
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          method PUT
      }
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘

GET loses its' default transport error retry because RetryMatch is non-nil and only PUT is defined. Adding expression handling would result into even more unclear logic. That's why using expression with plain options isn't allowed.

Just to remind what transport error is (roughly):

  • Upstream accepted the connection then reset it
  • TLS handshake succeeded but upstream closed during request/response transfer
  • Upstream timed out while sending the response body
  • Upstream sent malformed HTTP response headers

Now examples of how the new expression can be used:

  • Retry 502/503 responses only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry on response header only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌──────────────────────────────────────┬───────────────┐
  │               Scenario               │   Behavior    │
  ├──────────────────────────────────────┼───────────────┤
  │ Dial error, any method               │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Transport error, any method          │ not retried   │
  ├──────────────────────────────────────┼───────────────┤
  │ Response with X-Upstream-Retry: true │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Response without header              │ proxied as-is │
  └──────────────────────────────────────┴───────────────┘
  • Retry 503 only for POST
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 503 POST                    │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 503 GET                     │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST                    │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 POST                    │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry 5xx for /api* paths only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /api/foo            │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST /api/foo           │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /other              │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 GET /api/foo            │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry all transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry GET transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('GET')) || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, GET         │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, POST        │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200 response                 │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry PUT transport errors to /api* + any 502
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('PUT') && path('/api*')) || {rp.status_code} == 502`
      }
  }

  ┌────────────────────────────────────┬───────────────┐
  │              Scenario              │   Behavior    │
  ├────────────────────────────────────┼───────────────┤
  │ Dial error, any method             │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /api/foo      │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /other        │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, GET /api/foo      │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ 502 response, any method, any path │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ 200 response                       │ proxied as-is │
  └────────────────────────────────────┴───────────────┘
  • Retry POST on 503, GET on transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
      lb_retry_match {
          expression `isTransportError() && method('GET')`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 503 POST               │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ 503 GET                │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 502 POST               │ not retried       │
  └────────────────────────┴───────────────────┘
  • Retry idempotent transport errors + 502 for specific host
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT') || method('DELETE'))`
      }
      lb_retry_match {
          expression `host('api.example.com') && {rp.status_code} == 502`
      }
  }

  ┌──────────────────────────┬───────────────────┐
  │         Scenario         │     Behavior      │
  ├──────────────────────────┼───────────────────┤
  │ Dial error, any method   │ retried           │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, GET     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, PUT     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, DELETE  │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, POST    │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 502 to api.example.com   │ retried (block 2) │
  ├──────────────────────────┼───────────────────┤
  │ 502 to other.example.com │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 200 response             │ proxied as-is     │
  └──────────────────────────┴───────────────────┘
  • Retry transport errors for multiple methods + 5xx for /api*
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT'))`
      }
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, PUT   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /api/foo       │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 502 POST /api/foo      │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /other         │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 200 response           │ proxied as-is     │
  └────────────────────────┴───────────────────┘
  • Retry on response header + all transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌─────────────────────────────────┬───────────────┐
  │            Scenario             │   Behavior    │
  ├─────────────────────────────────┼───────────────┤
  │ Dial error, any method          │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ Transport error, any method     │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 502 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 without header              │ proxied as-is │
  └─────────────────────────────────┴───────────────┘

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from 8b1f9c6 to f4ec226 Compare March 28, 2026 15:38
@francislavoie
Copy link
Copy Markdown
Member

I'm confused. That looks like way too many changes for this. Why do we need "isTransportError" and such? That seems very over engineered. It's also changing how the placeholders are being set outside of the retries, I don't understand why. And why is it blocking other matchers to be alongside expression? That's very arbitrary, nothing else does that.

In my mind, the change should've just been "make sure placeholders are set during the retry flow, and add tests". That's it really. Possibly also "allow omitting expression token if the first input to lb_retry_match is a backtick-wrapped token".

@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 28, 2026

@francislavoie thank you for a quick response

It's also changing how the placeholders are being set outside of the retries, I don't understand why.

It's now set earlier because the old place executes after the place where we decide to retry. Without the moving these variables would be unset in retry logic, which must go before HandleResponse, as HandleResponse is actually writes bytes to a downstream. It doesn't change the HandleResponse behaviour, just variables are set earlier.

Also I've added header_keys tracking to clear stale headers from previous attempts and prevent situations when, for example, upstream A returned header A, but upstream B didn't and read that previously set header A instead.

Why do we need "isTransportError" and such?

Everything else (path, query, protocol etc) already available in expression out-of-box, there is nothing new added besides isTransportError. But let's imagine we don't have isTransportError. When user writes something like this:

lb_retry_match { expression `{rp.status_code} in [502, 503]` }

Then in case of transport error, that's what happening in retry logic:

if lb.RetryMatch == nil && req.Method != "GET" {
// by default, don't retry requests if they aren't GET
return false
}
match, err := lb.RetryMatch.AnyMatchWithError(req)
if err != nil {
logger.Error("error matching request for retry", zap.Error(err))
return false
}
if !match {
return false
}

We don't get into the very first if because lb.RetryMatch != nil (lb_retry_match is set), and this match, err := lb.RetryMatch.AnyMatchWithError(req) evaluates into false because: rp.status_code is nil at the moment (got transport error, so no response from an upstream at all), so nil in [502,503] returns false.

It makes all transport error non-retriable if expression is set and contains rp.* checks (they always likely will be false). Which makes the whole transport error retrying logic even more unclear for user:

  • Default behavior: on transport error only GET is retriable, other methods are not
  • lb_retry_match { method PUT }: on transport error only PUT is retriable
  • lb_retry_match { method PUT; expression {rp.status_code} in [502, 503] } (or just only expression): on transport error nothing is retriable

I didn't found easy ways of fixing this + not break the existing logic + not code workarounds + let API remain simple and not confusing for user.

And why is it blocking other matchers to be alongside expression?

That's why I separated expression from existing retry logic:

  • This way everything remains backward compatible
  • Transport errors now have a transparent way of handling with isTransportError()
  • Less footguns for users with mixing both plain options and expression (which also allows to use method, path etc)

If we somehow deal with transport error handling, then I guess there is no point to restrict mixing. Actually if we stick to isTransportError, then there is already no strict necessity, just I thought that it will be more clear for users to define everything either using old or new approach, but not mixing them (because options-only approach has its' own way of handling transport errors).

"make sure placeholders are set during the retry flow, and add tests". That's it really.

I also thought that initially, but it's not that simple. Besides things described above, I've also added additional handling of failure counting for health checks (so upstream marked unhealthy only in case of dial/transport errors), and the actual response matcher handling in reverseProxy: the pre-existed retrying logic is done only in tryAgain where response isn't available.

I can revert the isTransportError() and the mixing restriction, but then transport error handling will be unpredictable for users. The rest is necessary for things to work, though maybe I'm missing something and some things may be simplified.

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from f4ec226 to 0283a77 Compare March 28, 2026 23:59
@seroperson
Copy link
Copy Markdown
Author

  • Removed mixing restriction
  • isTransportError() replaced with {rp.is_transport_error}
  • Added backtick shorthand in ParseCaddyfileNestedMatcherSet
  • Added DeleteByPrefix to Replacer
  • Removed CanMatchResponse() marker if favor of *MatchExpression type assertion

And updated all the tests, comments and docs everywhere.

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from 8f170bb to 61c00e9 Compare March 29, 2026 11:35
Copy link
Copy Markdown
Member

@francislavoie francislavoie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better! I think I'm happy where this is at now. We'll just need to spend a bit of effort on documentation for the website as well (if you'd like to help with that after merging) to make it clear how the different states work, especially because of the hidden magic regarding branching logic if there's an expression vs not.

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from 61c00e9 to 6087447 Compare March 29, 2026 15:01
@francislavoie francislavoie added this to the v2.11.3 milestone Mar 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature ⚙️ New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Custom conditions for retrying proxy requests

4 participants