feat: bound cluster pipeline parallelism by hltduong · Pull Request #1149 · envoyproxy/ratelimit

hltduong · 2026-05-29T03:47:12Z

Problem

In Redis Cluster mode, clientImpl.executeGroupedPipeline groups pipeline actions by key and then executes each key group serially. For a single ShouldRateLimit request that carries multiple descriptors whose keys map to different cluster slots, this adds one Redis round-trip per group. Request latency therefore scales with descriptor count even though the groups are independent.

Concretely: a request whose N descriptors map to N different Redis Cluster slots waits about N × RTT instead of about 1 × RTT.
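The scaling claim can be made concrete with a back-of-the-envelope model. The sketch below assumes every key group costs one identical round-trip and ignores pool contention and Redis-side queueing. `estimateLatency` and its parameters are illustrative names, not part of the ratelimit codebase, and the RTT value in `main` is made up.

```go
package main

import (
	"fmt"
	"time"
)

// estimateLatency is a hypothetical helper: it models the Redis portion of a
// request as a number of sequential "waves", each costing one round-trip.
// groups is the number of key groups; parallelism is how many groups may be
// in flight at once (values below 1 are treated as 1, i.e. serial).
func estimateLatency(groups, parallelism int, rtt time.Duration) time.Duration {
	if groups <= 0 {
		return 0
	}
	if parallelism < 1 {
		parallelism = 1
	}
	waves := (groups + parallelism - 1) / parallelism // ceil(groups / parallelism)
	return time.Duration(waves) * rtt
}

func main() {
	rtt := 2 * time.Millisecond // assumed round-trip time, for illustration only
	fmt.Println("serial,   N=5:", estimateLatency(5, 1, rtt))
	fmt.Println("parallel, N=5:", estimateLatency(5, 8, rtt))
}
```

Under this model, serial execution grows linearly with the number of groups, while parallel execution stays at one round-trip as long as the group count does not exceed the parallelism limit.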

Solution

Adds a new env var (and matching field on Settings):

| Setting | Behavior |
| --- | --- |
| `REDIS_CLUSTER_PIPELINE_PARALLELISM=1` | Legacy serial behavior (default; no change for existing users). |
| `REDIS_CLUSTER_PIPELINE_PARALLELISM=0` | Unbounded parallel group execution (one goroutine per group, `errgroup.WithContext`). |
| `REDIS_CLUSTER_PIPELINE_PARALLELISM>1` | Bounded parallel group execution; caps concurrent in-flight groups. |

A matching REDIS_PERSECOND_CLUSTER_PIPELINE_PARALLELISM covers the per-second pool.

A len(pipeline) == 1 fast-path skips grouping entirely so N=1 callers see zero overhead.
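As a rough illustration of the three modes and the single-group fast path, here is a minimal sketch. It is not the code from `src/redis/driver_impl.go`: the description says the PR uses `errgroup.WithContext`, whereas this sketch swaps in a buffered-channel semaphore and `sync.WaitGroup` so that it depends only on the standard library. `runGroups`, `measure`, and the `slot-N` group names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// runGroups calls fn once per group. parallelism 1 runs groups serially,
// 0 runs them all at once, and any other positive value caps how many are
// in flight. It returns the first error seen and cancels the remaining work.
func runGroups(ctx context.Context, groups []string, parallelism int,
	fn func(ctx context.Context, group string) error) error {
	if parallelism == 0 {
		parallelism = len(groups) // unbounded: one slot per group
	}
	// Fast path: serial mode or a single group needs no goroutines.
	if parallelism <= 1 || len(groups) <= 1 {
		for _, g := range groups {
			if err := fn(ctx, g); err != nil {
				return err
			}
		}
		return nil
	}

	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	sem := make(chan struct{}, parallelism) // counting semaphore
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
launch:
	for _, g := range groups {
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-ctx.Done(): // an earlier group failed or the caller gave up
			break launch
		}
		wg.Add(1)
		go func(g string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := fn(ctx, g); err != nil {
				once.Do(func() {
					firstErr = err
					cancel()
				})
			}
		}(g)
	}
	wg.Wait()
	if firstErr != nil {
		return firstErr
	}
	return ctx.Err()
}

// measure runs n fake groups and reports how many ran and the peak number
// that were in flight at the same moment.
func measure(n, parallelism int) (ran, peak int) {
	groups := make([]string, n)
	for i := range groups {
		groups[i] = fmt.Sprintf("slot-%d", i)
	}
	var mu sync.Mutex
	inFlight := 0
	_ = runGroups(context.Background(), groups, parallelism,
		func(ctx context.Context, g string) error {
			mu.Lock()
			ran++
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()
			time.Sleep(5 * time.Millisecond) // stands in for one Redis round-trip
			mu.Lock()
			inFlight--
			mu.Unlock()
			return nil
		})
	return ran, peak
}

func main() {
	ran, peak := measure(5, 2)
	fmt.Printf("ran=%d peak in-flight=%d\n", ran, peak)
}
```

With five groups and a limit of 2, all five groups run but no more than two are in flight at any moment; with a limit of 1 or a single group, no goroutines are started at all.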

Backward compatibility

The default of 1 preserves the upstream legacy serial behavior. There are no config-file changes, no metric changes, and no public API changes.

Test results

Bench: 4 RLS pods per variant, both running the same binary, differing only by REDIS_CLUSTER_PIPELINE_PARALLELISM env var. Backend: a managed Redis Cluster reached via TLS. Server-side latency measured via histogram_quantile(0.99, sum by (le) (rate(ratelimit_service_response_time_seconds_bucket[1m]))).

N=1 — confirms no regression (3 reps, 3000 RPS, 90s each)

| Variant | Server-side P99 (ms) per rep | Mean (ms) |
| --- | --- | --- |
| `=1` (legacy) | 9.75 / 9.70 / 9.49 | 9.65 |
| `=8` (bounded) | 9.95 / 9.71 / 9.49 | 9.72 |

Δ = +0.7% — within ambient noise. Both variants take the single-action fast-path at N=1, as designed.

N=2 @ 3000 RPS — typical multi-descriptor case (3 reps, 90s each)

| Variant | Server-side P99 (ms) per rep | Mean (ms) |
| --- | --- | --- |
| `=1` (legacy) | 390 / 334 / 258 | 327 |
| `=8` (bounded) | 179 / 144 / 81 | 135 |

Δ = −59% server-side P99 (and −62% on the bench client side; both metrics agree).

N=5 @ 1000 RPS — isolates the RTT-multiplier effect

| Variant | Server-side P99 (ms) |
| --- | --- |
| `=1` (legacy) | 99 |
| `=8` (bounded) | 23 |

Δ = −77%. The 4.3× ratio closely matches the theoretical 5× upper bound (legacy = N × RTT, parallel = 1 × RTT-batch).
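The reported deltas can be re-derived from the per-rep numbers in the tables above. The following sketch only repeats that arithmetic; `mean` and `pctChange` are illustrative helper names, and the inputs are copied from the N=2 and N=5 tables.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs (0 for an empty slice).
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// pctChange returns the relative change from base to value, in percent.
func pctChange(base, value float64) float64 {
	return (value - base) / base * 100
}

func main() {
	// Server-side P99 per rep, in milliseconds, from the N=2 table.
	legacyN2 := []float64{390, 334, 258}
	boundedN2 := []float64{179, 144, 81}
	fmt.Printf("N=2: mean %.0f ms -> %.0f ms, change %+.0f%%\n",
		mean(legacyN2), mean(boundedN2),
		pctChange(mean(legacyN2), mean(boundedN2)))

	// Single-run P99 values from the N=5 table.
	fmt.Printf("N=5: change %+.0f%%, ratio %.1fx\n",
		pctChange(99, 23), 99.0/23.0)
}
```

Rounded to whole percentages, this arithmetic agrees with the −59% and −77% figures and the 4.3× ratio quoted in the description.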

Resource cost

| Metric | Legacy | Bounded (=8) |
| --- | --- | --- |
| RLS pod CPU (peak during bench) | ~0.40 cores | ~0.45 cores |
| RLS pod RSS | ~480 MiB | ~480 MiB |
| `ratelimit_redis_pool_cx_active` per pod | 40 | 40 |
| Goroutines / pod (steady state) | n/a (single goroutine per pipeline) | ≤ N additional per in-flight pipeline |

Goroutine increase is bounded by concurrent in-flight pipelines × N (or × parallelism when bounded). At realistic load this is on the order of hundreds, not thousands.
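To put a number on that bound, the estimate can be written as a small function. This is a sketch of the stated formula only, assuming a non-negative parallelism setting. `extraGoroutines` is a hypothetical name, and the figure of 50 in-flight pipelines in `main` is an invented example, not a measurement from the bench.

```go
package main

import "fmt"

// extraGoroutines estimates an upper bound on additional goroutines:
// in-flight pipelines times the groups each may run concurrently.
// parallelism follows the PR's convention: 1 = serial, 0 = unbounded,
// >1 = cap on concurrent groups. Negative values are not handled here.
func extraGoroutines(inFlight, groupsPerPipeline, parallelism int) int {
	switch {
	case parallelism == 1 || groupsPerPipeline <= 1:
		return 0 // serial mode or single-group fast path: no extra goroutines
	case parallelism == 0 || parallelism > groupsPerPipeline:
		return inFlight * groupsPerPipeline
	default:
		return inFlight * parallelism
	}
}

func main() {
	// Assumed load: 50 pipelines in flight, 5 groups each, parallelism 8.
	fmt.Println(extraGoroutines(50, 5, 8))
}
```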

Caveat — when this does not help

Parallelism helps when the RLS-side pool / RTT is the bottleneck. When the Redis cluster itself is CPU-saturated (e.g., per-shard CPU > 75%), every Redis op queues at the shard regardless of how the client submits them, and the latency improvement disappears. In testing at N=5 + 3000 RPS, the managed Redis cluster hit 77–79% per-shard CPU and both variants converged to ~2300 ms P99. Operators should size Redis appropriately before relying on this knob to absorb high-N requests.

This is called out in the env-var docs in the README.

Tests

go test ./...            # 253 passed (19 new for this feature path)
go test -race ./...      # 253 passed with -race
make check_format

CI on the PR will run the full suite.

Files changed

README.md                     |   1 +
go.mod                        |   1 +
go.sum                        |   2 +
src/redis/cache_impl.go       |   8 +-
src/redis/driver_impl.go      | 107 +++++++++++++++++++-------
src/redis/driver_impl_test.go | 169 ++++++++++++++++++++++++++++++++++++++++++
src/settings/settings.go      |  26 +++++--
src/settings/settings_test.go |  31 ++++++++
8 files changed, 306 insertions(+), 39 deletions(-)

Screenshots of latency / RPS / CPU charts attached in a follow-up comment.

hltduong · 2026-05-29T03:57:58Z

Chart screenshots

All time ranges are in UTC; server-side latency is taken from the ratelimit_service_response_time_seconds_bucket histograms.
Two services in each chart: ratelimit-cscx193-current-service (legacy, =1) and ratelimit-cscx193-patched-service (bounded, =8). Both run the same binary, differ only by env var.

Server-side P99 — full bench window (all 14 cells)

14 spikes correspond to the 14 bench cells: N=1, N=2 alternating × 3 reps + 2 N=5-at-1k cells at the tail. Patched series consistently below current at N=2 and N=5.

P99 — Rep 1 zoom (N=1 r1, N=2 r1)

P99 — Rep 2 zoom

P99 — Rep 3 zoom

P99 — N=5 @ 1k RPS (sub-saturation — isolates parallelism)

Clearest single chart in this PR: the =8 series stays flat near baseline while =1 rises to ~5× as N goes from 1 to 5. Matches the theoretical legacy = N × RTT vs parallel = 1 × RTT-batch model.

Server-side P95 (full window)

Server-side P50 (full window)

Bench delivered RPS (proves load actually reached the 3k RPS target)

RLS pod CPU during bench (proves RLS not CPU-bound)

Peak ~0.5 cores per pod, against a 2-core limit ≈ 25% utilization. Latency originates downstream of RLS (RTT / pool / Redis CPU, not RLS CPU).

Rate-limit decision counters — control, verifies bench load exercised the rate-limit path

Chart panels: Request volume, Within-limit, Near-limit, Over-limit

Signed-off-by: dthuynh <dthuynh@axon.com>

hltduong · 2026-05-30T06:05:55Z

@collin-lee could you take a look at this?

This change adds configurable bounded parallelism for Redis Cluster pipeline groups while preserving the current serial behavior by default. The intent is to reduce P99 latency for multi-descriptor requests where descriptors map to different cluster slots, without changing behavior for existing users.

I’ve included benchmark results, resource impact, caveats, and test coverage in the PR description. The change is backward-compatible with REDIS_CLUSTER_PIPELINE_PARALLELISM=1 as the default.

Happy to adjust the implementation or docs based on your feedback. Thanks!

hltduong · 2026-06-03T09:32:15Z

@agrawroh could you please take a look?

collin-lee · 2026-06-04T18:49:20Z

@hltduong

What's the expected upper bound on distinct keys per request? If it can exceed REDIS_POOL_SIZE, unbounded mode risks pool starvation.

Should context.Background() in PipeDo be replaced with the gRPC request context to enable deadline propagation to parallel Redis calls?


hltduong · 2026-06-05T03:32:02Z

@collin-lee Thanks, both points are valid.

For distinct keys: there is no hard protocol/code upper bound on distinct Redis keys per request. In practice it is bounded by the request descriptors/non-empty generated cache keys in a pipeline phase, but that can exceed REDIS_POOL_SIZE.

I updated the implementation so this does not rely only on operator configuration. The configured cluster pipeline parallelism is normalized at Redis client creation:

- `1` preserves the legacy serial behavior (default)
- `0` now means auto, bounded to the corresponding Redis pool size
- values greater than the corresponding Redis pool size are capped to the pool size
- negative values are rejected
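The normalization rules listed above can be sketched as a single function. This is an illustration of the described behavior, not the code merged in the PR: `normalizeParallelism`, `errNegativeParallelism`, and the error message are hypothetical, and the sketch assumes the pool size is at least 1.

```go
package main

import (
	"errors"
	"fmt"
)

// errNegativeParallelism is a hypothetical error value for rejected settings.
var errNegativeParallelism = errors.New("cluster pipeline parallelism must not be negative")

// normalizeParallelism maps the configured value to the effective limit.
// It assumes poolSize >= 1.
func normalizeParallelism(configured, poolSize int) (int, error) {
	switch {
	case configured < 0:
		return 0, errNegativeParallelism // negative values are rejected
	case configured == 1:
		return 1, nil // legacy serial behavior (default)
	case configured == 0:
		return poolSize, nil // auto: bounded to the pool size
	case configured > poolSize:
		return poolSize, nil // capped to the pool size
	default:
		return configured, nil
	}
}

func main() {
	for _, configured := range []int{1, 0, 4, 64, -1} {
		effective, err := normalizeParallelism(configured, 10)
		fmt.Printf("configured=%d effective=%d err=%v\n", configured, effective, err)
	}
}
```

Capping at the pool size means a single request can never hold more connections than the pool offers, which addresses the pool-starvation concern raised in the review.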

For context propagation: agree. I updated PipeDo to accept the request context and pass it through the GET and INCR/EXPIRE Redis pipeline calls, so deadlines/cancellation now propagate to parallel Redis calls.
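To illustrate why passing the request context matters, the sketch below runs several fake Redis calls in parallel under one shared deadline. `fakePipeDo` is a stand-in that merely sleeps; it is not the real `PipeDo` signature, and `countDeadlineExceeded` and the millisecond values are invented for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// fakePipeDo stands in for a Redis pipeline call: it "takes" latency to
// complete, but returns early with the context's error if the context ends.
func fakePipeDo(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// countDeadlineExceeded issues one fake call per latency, all sharing one
// request-scoped deadline, and reports how many were cut off by it.
func countDeadlineExceeded(timeoutMs int, latenciesMs []int) int {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		exceeded int
	)
	for _, ms := range latenciesMs {
		wg.Add(1)
		go func(latency time.Duration) {
			defer wg.Done()
			err := fakePipeDo(ctx, latency)
			if errors.Is(err, context.DeadlineExceeded) {
				mu.Lock()
				exceeded++
				mu.Unlock()
			}
		}(time.Duration(ms) * time.Millisecond)
	}
	wg.Wait()
	return exceeded
}

func main() {
	// Two fast calls and one slow call under a 50 ms request deadline.
	n := countDeadlineExceeded(50, []int{1, 1, 500})
	fmt.Println("calls cut off by the deadline:", n)
}
```

With `context.Background()` instead of the request context, the slow call would run to completion even after the caller had given up; with the shared context it returns as soon as the deadline passes.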

The fork content was already synced via squash-merge in 95e672c (PR #33). This merge commit links the histories so GitHub no longer reports the fork as 100+ commits behind upstream.

* upstream/main: (154 commits)
  Update to golang-1.26.4 and update golang.org/x/net to 0.55.0 (envoyproxy#1154)
  feat: bound cluster pipeline parallelism (envoyproxy#1149)
  fix: correct typos in memcache error messages and variable name (envoyproxy#1150)
  Update to golang 1.26.3 (envoyproxy#1152)
  Add quota mode to rate limit descriptor proto (envoyproxy#1148)
  feat: add retry in init phase instead of panic directly (envoyproxy#1144)
  Add integration test for quota based service selection. (envoyproxy#1114)
  build: pin golang:1.26.2 to multi-arch index digest (envoyproxy#1131)
  Update third party libraries flagged for vulnerability scans (envoyproxy#1124)
  feat: add zipkin b3 header propagation (envoyproxy#1110)
  Fix Prometheus response time units (envoyproxy#1104)
  Dockerfile: add ENTRYPOINT (envoyproxy#1095)
  Send user defined metadata to the client (envoyproxy#1112)
  build(deps): bump google.golang.org/grpc from v1.74.2 to v1.80.0 (envoyproxy#1111)
  Fix quota result when all limits were exceeded (envoyproxy#1059)
  Update golang references to 1.26.1 (envoyproxy#1091)
  Add integration test for token based quota (envoyproxy#1092)
  Add quota integration test (envoyproxy#1090)
  Add debug logging for quota values (envoyproxy#1089)
  Wait for sevices to be up before running tests (envoyproxy#1088)
  ...

redis: bound cluster pipeline parallelism

47c0cae

Signed-off-by: dthuynh <dthuynh@axon.com>

hltduong force-pushed the cscx193-redis-cluster-pipeline-parallelism branch from 139df7d to 47c0cae Compare May 29, 2026 04:22

hltduong marked this pull request as ready for review May 29, 2026 04:28

hltduong changed the title ~~redis: bound cluster pipeline parallelism~~ feat: bound cluster pipeline parallelism May 29, 2026

Refactor to address comment: use gRPC request context in PipeDo, cap …

99caa22

…the parallelism to RedisPoolSize Signed-off-by: dthuynh <dthuynh@axon.com>

hltduong force-pushed the cscx193-redis-cluster-pipeline-parallelism branch from 60cee1f to 99caa22 Compare June 5, 2026 03:29

collin-lee approved these changes Jun 5, 2026

View reviewed changes

collin-lee merged commit 34d2b74 into envoyproxy:main Jun 5, 2026
6 checks passed

collin-lee merged 2 commits into envoyproxy:main from hltduong:cscx193-redis-cluster-pipeline-parallelism
