feat: bound cluster pipeline parallelism #1149

Merged
collin-lee merged 2 commits into envoyproxy:main from hltduong:cscx193-redis-cluster-pipeline-parallelism on Jun 5, 2026

@hltduong

@hltduong hltduong commented May 29, 2026

Contributor

Problem

In Redis Cluster mode, clientImpl.executeGroupedPipeline groups pipeline actions by key and then executes each key group serially. For a single ShouldRateLimit request that carries multiple descriptors whose keys map to different cluster slots, this adds one Redis round-trip per group. Request latency therefore scales with descriptor count even though the groups are independent.
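The grouping step described above can be pictured with a minimal sketch. The `Action` type and `groupByKey` helper are hypothetical names for illustration only and are not taken from the ratelimit codebase; the sketch assumes each queued command carries the key it targets.

```go
package main

import "fmt"

// Action is a hypothetical stand-in for one queued Redis command.
type Action struct {
	Key string // Redis key the command targets
	Cmd string // command name, e.g. "INCRBY" or "EXPIRE"
}

// groupByKey buckets actions by key. It preserves the order in which keys
// were first seen and the relative order of commands within each key.
func groupByKey(pipeline []Action) [][]Action {
	index := make(map[string]int) // key -> position in groups
	var groups [][]Action
	for _, a := range pipeline {
		i, ok := index[a.Key]
		if !ok {
			i = len(groups)
			index[a.Key] = i
			groups = append(groups, nil)
		}
		groups[i] = append(groups[i], a)
	}
	return groups
}

func main() {
	pipeline := []Action{
		{Key: "domain_a_1", Cmd: "INCRBY"}, {Key: "domain_a_1", Cmd: "EXPIRE"},
		{Key: "domain_b_1", Cmd: "INCRBY"}, {Key: "domain_b_1", Cmd: "EXPIRE"},
	}
	groups := groupByKey(pipeline)
	// In serial mode, each group costs one Redis round-trip.
	fmt.Println("round-trips in serial mode:", len(groups))
}
```

With two descriptors on two different keys, the sketch yields two groups, which is where the extra round-trip per descriptor comes from in serial mode.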

Concretely: a request with N descriptors whose keys land in N different Redis cluster slots waits N × RTT instead of roughly 1 × RTT, even though the N groups are independent and could be sent in parallel.

Solution

Adds a new env var (and matching field on Settings):

| Setting | Behavior |
| --- | --- |
| REDIS_CLUSTER_PIPELINE_PARALLELISM=1 | Legacy serial behavior (default; no change for existing users). |
| REDIS_CLUSTER_PIPELINE_PARALLELISM=0 | Unbounded parallel group execution (one goroutine per group, errgroup.WithContext). |
| REDIS_CLUSTER_PIPELINE_PARALLELISM>1 | Bounded parallel group execution; caps concurrent in-flight groups. |

A matching REDIS_PERSECOND_CLUSTER_PIPELINE_PARALLELISM covers the per-second pool.

A len(pipeline) == 1 fast-path skips grouping entirely so N=1 callers see zero overhead.

Backward compatibility

The default of 1 preserves the upstream legacy serial behavior. No config-file changes, no metric changes, no public API changes.

Test results

Bench: 4 RLS pods per variant, both running the same binary, differing only by REDIS_CLUSTER_PIPELINE_PARALLELISM env var. Backend: a managed Redis Cluster reached via TLS. Server-side latency measured via histogram_quantile(0.99, sum by (le) (rate(ratelimit_service_response_time_seconds_bucket[1m]))).

N=1 — confirms no regression (3 reps, 3000 RPS, 90s each)

| Variant | Server-side P99 (ms) per rep | Mean (ms) |
| --- | --- | --- |
| =1 (legacy) | 9.75 / 9.70 / 9.49 | 9.65 |
| =8 (bounded) | 9.95 / 9.71 / 9.49 | 9.72 |

Δ = +0.7% — within ambient noise. Both variants take the single-action fast-path at N=1, as designed.

N=2 @ 3000 RPS — typical multi-descriptor case (3 reps, 90s each)

| Variant | Server-side P99 (ms) per rep | Mean (ms) |
| --- | --- | --- |
| =1 (legacy) | 390 / 334 / 258 | 327 |
| =8 (bounded) | 179 / 144 / 81 | 135 |

Δ = −59% server-side P99 (and −62% on the bench client side; both metrics agree).

N=5 @ 1000 RPS — isolates the RTT-multiplier effect

| Variant | Server-side P99 (ms) |
| --- | --- |
| =1 (legacy) | 99 |
| =8 (bounded) | 23 |

Δ = −77%. The 4.3× ratio closely matches the theoretical 5× upper bound (legacy = N × RTT, parallel = 1 × RTT-batch).
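The N × RTT versus 1 × RTT-batch model can be written down as a small calculation. This is an idealized sketch, not code from the PR: `modelLatencyMs` is a hypothetical helper, and it assumes groups run in waves of at most `parallelism`, that each wave costs exactly one RTT, and that there is no pool contention or server-side queueing.

```go
package main

import "fmt"

// modelLatencyMs returns the idealized latency for a request with the given
// number of key groups. Groups are assumed to run in waves of at most
// `parallelism`, each wave costing one round-trip of rttMs milliseconds.
// parallelism <= 0 is treated as "one wave for all groups".
func modelLatencyMs(groups, parallelism int, rttMs float64) float64 {
	if groups <= 0 {
		return 0
	}
	if parallelism <= 0 || parallelism > groups {
		parallelism = groups
	}
	waves := (groups + parallelism - 1) / parallelism // ceil(groups / parallelism)
	return float64(waves) * rttMs
}

func main() {
	const rttMs = 1.0 // arbitrary unit; only the ratio matters here
	serial := modelLatencyMs(5, 1, rttMs)
	bounded := modelLatencyMs(5, 8, rttMs)
	fmt.Printf("model speedup at N=5:    %.1fx\n", serial/bounded)
	fmt.Printf("measured speedup at N=5: %.1fx\n", 99.0/23.0)
}
```

The model gives the 5× upper bound at N=5; the measured 99 ms versus 23 ms falls somewhat short of it, as expected once fixed per-request costs are included.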

Resource cost

| Metric | Legacy | Bounded (=8) |
| --- | --- | --- |
| RLS pod CPU (peak during bench) | ~0.40 cores | ~0.45 cores |
| RLS pod RSS | ~480 MiB | ~480 MiB |
| ratelimit_redis_pool_cx_active per pod | 40 | 40 |
| Goroutines / pod (steady state) | n/a (single goroutine per pipeline) | ≤ N additional per in-flight pipeline |

Goroutine increase is bounded by concurrent in-flight pipelines × N (or × parallelism when bounded). At realistic load this is on the order of hundreds, not thousands.

Caveat — when this does not help

Parallelism helps when the RLS-side pool / RTT is the bottleneck. When the Redis cluster itself is CPU-saturated (e.g., per-shard CPU > 75%), every Redis op queues at the shard regardless of how the client submits them, and the latency improvement disappears. In testing at N=5 + 3000 RPS, the managed Redis cluster hit 77–79% per-shard CPU and both variants converged to ~2300 ms P99. Operators should size Redis appropriately before relying on this knob to absorb high-N requests.

This is called out in the env-var docs in the README.

Tests

go test ./...            # 253 passed (19 new for this feature path)
go test -race ./...      # 253 passed with -race
make check_format

CI on the PR will run the full suite.

Files changed

README.md                     |   1 +
go.mod                        |   1 +
go.sum                        |   2 +
src/redis/cache_impl.go       |   8 +-
src/redis/driver_impl.go      | 107 +++++++++++++++++++-------
src/redis/driver_impl_test.go | 169 ++++++++++++++++++++++++++++++++++++++++++
src/settings/settings.go      |  26 +++++--
src/settings/settings_test.go |  31 ++++++++
8 files changed, 306 insertions(+), 39 deletions(-)

Screenshots of latency / RPS / CPU charts attached in a follow-up comment.

@hltduong

Contributor Author

Chart screenshots

All time ranges are in UTC; server-side latency comes from the ratelimit_service_response_time_seconds_bucket histograms.
Each chart shows two services: ratelimit-cscx193-current-service (legacy, =1) and ratelimit-cscx193-patched-service (bounded, =8). Both run the same binary and differ only by the env var.

Server-side P99 — full bench window (all 14 cells)

RLS_server_side_P99

14 spikes correspond to the 14 bench cells: N=1, N=2 alternating × 3 reps + 2 N=5-at-1k cells at the tail. Patched series consistently below current at N=2 and N=5.

P99 — Rep 1 zoom (N=1 r1, N=2 r1)

RLS_server_side_P99_rep1

P99 — Rep 2 zoom

RLS_server_side_P99_rep2

P99 — Rep 3 zoom

RLS_server_side_P99_rep3

P99 — N=5 @ 1k RPS (sub-saturation — isolates parallelism)

RLS_server_side_P99_rep4_N5_1kRPS

Clearest single chart in this PR: the =8 series stays flat near baseline while =1 rises to ~5× as N goes from 1 to 5. Matches the theoretical legacy = N × RTT vs parallel = 1 × RTT-batch model.

Server-side P95 (full window)

RLS_server_side_P95

Server-side P50 (full window)

RLS_server_side_P50

Bench delivered RPS (proves load actually reached the 3k RPS target)

Bench_RLS_delivered

RLS pod CPU during bench (proves RLS not CPU-bound)

RLS_pod_CPU

Peak ~0.5 cores per pod, against a 2-core limit ≈ 25% utilization. Latency originates downstream of RLS (RTT / pool / Redis CPU, not RLS CPU).

Rate-limit decision counters — control, verifies bench load exercised the rate-limit path

| Request volume | Within-limit | Near-limit | Over-limit |
| --- | --- | --- | --- |
| Request_volume_by_domain_key | Request_within_limit_by_domain_key | Request_near_limit_by_domain_key | Request_over_limit_by_domain_key |

Signed-off-by: dthuynh <dthuynh@axon.com>
@hltduong hltduong force-pushed the cscx193-redis-cluster-pipeline-parallelism branch from 139df7d to 47c0cae Compare May 29, 2026 04:22
@hltduong hltduong marked this pull request as ready for review May 29, 2026 04:28
@hltduong hltduong changed the title from "redis: bound cluster pipeline parallelism" to "feat: bound cluster pipeline parallelism" May 29, 2026
@hltduong

Contributor Author

@collin-lee could you take a look at this?

This change adds configurable bounded parallelism for Redis Cluster pipeline groups while preserving the current serial behavior by default. The intent is to reduce P99 latency for multi-descriptor requests where descriptors map to different cluster slots, without changing behavior for existing users.

I’ve included benchmark results, resource impact, caveats, and test coverage in the PR description. The change is backward-compatible with REDIS_CLUSTER_PIPELINE_PARALLELISM=1 as the default.

Happy to adjust the implementation or docs based on your feedback. Thanks!

@hltduong

hltduong commented Jun 3, 2026

Contributor Author

@agrawroh could you please take a look?

@collin-lee

Contributor

@hltduong

What's the expected upper bound on distinct keys per request? If it can exceed REDIS_POOL_SIZE, unbounded mode risks pool starvation.

Should context.Background() in PipeDo be replaced with the gRPC request context to enable deadline propagation to parallel Redis calls?

…the parallelism to RedisPoolSize

Signed-off-by: dthuynh <dthuynh@axon.com>
@hltduong hltduong force-pushed the cscx193-redis-cluster-pipeline-parallelism branch from 60cee1f to 99caa22 Compare June 5, 2026 03:29
@hltduong

hltduong commented Jun 5, 2026

Contributor Author

@collin-lee Thanks, both points are valid.

For distinct keys: there is no hard protocol/code upper bound on distinct Redis keys per request. In practice it is bounded by the request descriptors/non-empty generated cache keys in a pipeline phase, but that can exceed REDIS_POOL_SIZE.

I updated the implementation so this does not rely only on operator configuration. The configured cluster pipeline parallelism is normalized at Redis client creation:

  • 1 preserves the legacy serial behavior (default)
  • 0 now means auto, bounded to the corresponding Redis pool size
  • values greater than the corresponding Redis pool size are capped to the pool size
  • negative values are rejected
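The four normalization rules above can be sketched as a single function. This is an illustration under stated assumptions, not the merged code: `normalizeParallelism` and `errNegativeParallelism` are hypothetical names, and treating a pool size below 1 as 1 is an assumption added here so the sketch never returns 0.

```go
package main

import (
	"errors"
	"fmt"
)

// errNegativeParallelism is a hypothetical error value for this sketch.
var errNegativeParallelism = errors.New("cluster pipeline parallelism must not be negative")

// normalizeParallelism maps the configured value onto an effective value:
//   negative    -> rejected
//   0           -> auto: bounded to the pool size
//   > poolSize  -> capped to the pool size
//   otherwise   -> used as configured (1 keeps the serial behavior)
func normalizeParallelism(configured, poolSize int) (int, error) {
	if poolSize < 1 {
		poolSize = 1 // assumption made for this sketch only
	}
	switch {
	case configured < 0:
		return 0, errNegativeParallelism
	case configured == 0:
		return poolSize, nil
	case configured > poolSize:
		return poolSize, nil
	default:
		return configured, nil
	}
}

func main() {
	for _, configured := range []int{1, 0, 8, 64, -1} {
		effective, err := normalizeParallelism(configured, 10)
		fmt.Println(configured, effective, err)
	}
}
```

Normalizing once at client creation means the per-request path only ever sees a positive value that is no larger than the pool.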

For context propagation: agree. I updated PipeDo to accept the request context and pass it through the GET and INCR/EXPIRE Redis pipeline calls, so deadlines/cancellation now propagate to parallel Redis calls.

@collin-lee collin-lee merged commit 34d2b74 into envoyproxy:main Jun 5, 2026
6 checks passed
timcovar added a commit to goatapp/ratelimit that referenced this pull request Jun 10, 2026
The fork content was already synced via squash-merge in 95e672c (PR #33).
This merge commit links the histories so GitHub no longer reports
the fork as 100+ commits behind upstream.

* upstream/main: (154 commits)
  Update to golang-1.26.4 and update golang.org/x/net to 0.55.0 (envoyproxy#1154)
  feat: bound cluster pipeline parallelism (envoyproxy#1149)
  fix: correct typos in memcache error messages and variable name (envoyproxy#1150)
  Update to golang 1.26.3 (envoyproxy#1152)
  Add quota mode to rate limit descriptor proto (envoyproxy#1148)
  feat: add retry in init phase instead of panic directly (envoyproxy#1144)
  Add integration test for quota based service selection. (envoyproxy#1114)
  build: pin golang:1.26.2 to multi-arch index digest (envoyproxy#1131)
  Update third party libraries flagged for vulnerability scans (envoyproxy#1124)
  feat: add zipkin b3 header propagation (envoyproxy#1110)
  Fix Prometheus response time units (envoyproxy#1104)
  Dockerfile: add ENTRYPOINT (envoyproxy#1095)
  Send user defined metadata to the client (envoyproxy#1112)
  build(deps): bump google.golang.org/grpc from v1.74.2 to v1.80.0 (envoyproxy#1111)
  Fix quota result when all limits were exceeded (envoyproxy#1059)
  Update golang references to 1.26.1 (envoyproxy#1091)
  Add integration test for token based quota (envoyproxy#1092)
  Add quota integration test (envoyproxy#1090)
  Add debug logging for quota values (envoyproxy#1089)
  Wait for sevices to be up before running tests (envoyproxy#1088)
  ...