This document details the performance engineering process for VelocityGate: the methodology, bottleneck analysis, and optimizations behind our high-throughput capabilities.
To ensure low-overhead profiling in production-like environments, we utilized the following toolset:
- Async-profiler: For low-overhead CPU sampling and Flame Graph generation.
  - Command: `./profiler.sh -d 60 -f flamegraph_cpu.html -e itimer <pid>`
- Java Flight Recorder (JFR): Continuous monitoring of GC, Latency, and Allocations.
  - JVM Flags: `-XX:StartFlightRecording:disk=true,dumponexit=true,filename=recording.jfr,settings=profile`
- VisualVM: For real-time heap dump analysis during memory leak investigations.
Test Scenario: 5,000 RPS, Token Bucket Algorithm, JWT Auth.
The initial flame graph revealed two massive towers:
- `io.jsonwebtoken.impl.crypto.MacProvider.sign()`: 40% of CPU. JWT verification was re-calculating the HMAC for every request.
- `reactor.core.publisher.Flux.map()`: 15% of CPU. Excessive reactive stream object creation.
- Allocation Rate: 2.5 GB/sec.
- Top Allocator: `java.lang.String` (45%).
  - Cause: Concatenating `"rate_limit:" + userId + ":" + timestamp` on every request created millions of transient Strings.
- Contention: `java.util.concurrent.ConcurrentHashMap.computeIfAbsent` showed high contention in the internal metric registry when creating new counters for dynamic tags.
Observation: Validating the same JWT signature 1000 times/sec for the same active user is wasteful.
Decision: Implemented a short-lived (10s) Caffeine cache for valid JWT signatures.
Impact:
- JWT Crypto CPU usage dropped from 40% -> 5%.
- Throughput increased by 300%.
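The signature cache can be sketched with JDK-only types. Production used Caffeine with a 10-second TTL (roughly Caffeine's `expireAfterWrite`); the class and method names below are illustrative, not the gateway's actual API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the JWT signature-validation cache. On a cache hit the
// expensive HmacSHA256 verification is skipped entirely; entries expire
// after a short TTL so revocation lag is bounded (10s in production).
class JwtSignatureCache {
    private final Map<String, Long> validUntil = new ConcurrentHashMap<>();
    private final long ttlMillis;

    JwtSignatureCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true if this exact token was verified within the TTL. */
    public boolean isCachedValid(String token, long nowMillis) {
        Long expiry = validUntil.get(fingerprint(token));
        return expiry != null && expiry > nowMillis;
    }

    /** Records a successful HMAC verification so repeats skip the crypto. */
    public void markValid(String token, long nowMillis) {
        validUntil.put(fingerprint(token), nowMillis + ttlMillis);
    }

    // Key the cache on the token's SHA-256 fingerprint rather than the raw
    // token, so a heap dump does not leak live credentials.
    private static String fingerprint(String token) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Unlike Caffeine, this sketch never evicts stale entries; a real implementation needs size-bounded eviction, which is exactly what Caffeine provides out of the box.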
Observation: Each rate limit check involved 3 network round-trips (GET, INCR, EXPIRE).
Decision: Switched to Redis Lua Scripts, collapsing the check into a single atomic EVAL call.
Impact:
- Reduced Network I/O syscalls by 66%.
- P99 Latency dropped from 45ms -> 12ms.
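A sketch of the consolidation, using a simple counter-style limiter for brevity (the production token bucket script is more involved, and all names here are hypothetical): the Lua source is what a single EVAL executes server-side, and the Java class mirrors its semantics in memory for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// The GET/INCR/EXPIRE trio becomes one server-side script: Redis executes
// it atomically, so the check costs one round-trip instead of three.
class FixedWindowLimiter {
    // Lua equivalent, loaded once (SCRIPT LOAD) and invoked by SHA:
    static final String LUA_SCRIPT = String.join("\n",
            "local current = redis.call('INCR', KEYS[1])",
            "if current == 1 then",
            "  redis.call('EXPIRE', KEYS[1], ARGV[1])",
            "end",
            "return current");

    private final Map<String, Integer> counters = new HashMap<>();
    private final int limit;

    FixedWindowLimiter(int limit) {
        this.limit = limit;
    }

    /** Mirrors the script locally: increment, then allow if under limit. */
    public synchronized boolean tryAcquire(String key) {
        int current = counters.merge(key, 1, Integer::sum);
        return current <= limit;
    }
}
```

The atomicity matters as much as the round-trip count: with separate GET/INCR/EXPIRE commands, two concurrent requests can interleave and both pass a check that only one should.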
Observation: String concatenation for Redis keys was generating massive garbage.
Decision: Pre-compiled the static prefix (`rate_limit:`) into a byte array and reused per-thread StringBuilder buffers instead of per-request concatenation.
Impact:
- Allocation rate dropped to 800 MB/sec.
- GC Pause time (G1) improved from 15ms -> 4ms.
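A minimal sketch of the key-building change, assuming a per-thread reusable buffer (class and method names are illustrative, not the gateway's actual API):

```java
import java.nio.charset.StandardCharsets;

// Allocation-conscious key builder. The static prefix is encoded to bytes
// once at class load, and each thread reuses one StringBuilder instead of
// creating intermediate Strings on every request.
class RateLimitKeys {
    /** Pre-compiled bytes for the static prefix, for raw protocol writes. */
    static final byte[] PREFIX = "rate_limit:".getBytes(StandardCharsets.UTF_8);

    private static final ThreadLocal<StringBuilder> BUF =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    /** Builds "rate_limit:&lt;userId&gt;:&lt;timestamp&gt;" with a reused buffer. */
    static String build(String userId, long timestamp) {
        StringBuilder sb = BUF.get();
        sb.setLength(0); // reuse the buffer, don't reallocate it
        sb.append("rate_limit:").append(userId).append(':').append(timestamp);
        return sb.toString();
    }
}
```

The final `toString()` still allocates one String per call; the win is eliminating the chain of intermediate Strings and char arrays that `+` concatenation produced at 5,000 RPS.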
Before (CPU flame graph, simplified):
[----------------- JWT Signature Validation (40%) -----------------] [--- Netty I/O ---]
[------ HmacSHA256 ------] [--- String Alloc ---]
- Interpretation: The wide "plateau" on the left shows the application spending nearly half its time just doing math (Crypto), blocking the Event Loop.
After (CPU flame graph, simplified):
[JWT Cache (5%)] [------- Netty I/O Processing (80%) -------] [Redis (10%)]
- Interpretation: The CPU is now mostly spent doing actual work: reading from the network, parsing HTTP, and talking to Redis. This is a healthy profile for an I/O-bound Gateway.
Based on the profiling data, we tuned the application.yml:
| Parameter | Initial | Tuned | Rationale |
|---|---|---|---|
| `reactor.netty.ioWorkerCount` | Default (CPU count) | CPU * 2 | Profiling showed threads blocked on I/O wait, so slightly over-provisioning helped. |
| `spring.data.redis.lettuce.pool.max-active` | 8 | 50 | Under high load (10k RPS), threads were waiting 5ms just to borrow a Redis connection. |
| `server.jetty.threads.max` | 200 | N/A | Switched to Netty (Event Loop model), removing the need for 200+ distinct threads. |
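As a sketch, the pool setting above maps onto `application.yml` like this (note that `reactor.netty.ioWorkerCount` is read by Reactor Netty as a JVM system property, e.g. `-Dreactor.netty.ioWorkerCount=...`, rather than as a Spring property, so it is not shown here):

```yaml
spring:
  data:
    redis:
      lettuce:
        pool:
          max-active: 50   # was 8; removes the ~5ms wait to borrow a connection
```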
When analyzing flamegraph.html generated by Async-profiler:
- X-Axis (Width): Represents how often the function appeared in samples. Wider = more CPU time. Note the horizontal axis is not time: identical stacks are merged and sorted alphabetically.
- Y-Axis (Height): Represents the stack depth. Taller = Deeper call stack.
- Colors: Palette-dependent. Async-profiler colors frames by type (e.g., Java vs. native/kernel code), not by how hot or I/O-bound they are, so read width, not color.
Optimization Goal: Look for "Wide Plateaus". Narrow, spiky towers are fine. A wide block means one function is dominating your CPU. Flatten the widest blocks first.