From a267345e0330e3053293f6dbead092d5372b1834 Mon Sep 17 00:00:00 2001
From: szibis
Date: Thu, 14 May 2026 10:33:01 +0200
Subject: [PATCH] =?UTF-8?q?docs:=20reorder=20perf=20section=20=E2=80=94=20?=
 =?UTF-8?q?cold=20baseline=20first,=20label=20warm=20table=20as=20100%=20c?=
 =?UTF-8?q?ache-hit=20ceiling?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 48 ++++++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 4cc4711..a223f95 100644
--- a/README.md
+++ b/README.md
@@ -35,11 +35,29 @@ Project site: `https://reliablyobserve.github.io/Loki-VL-proxy/`
 
 Measured head-to-head against tuned Loki: Apple M5 Pro (18 cores, 64 GB RAM), ~8 M log entries across 15 services, 7-day window.
 
-### Dashboards and Explore — production steady state
+### Cold proxy — honest baseline
 
-Grafana dashboards auto-refresh every 30 s. After the first fetch, every repeated query is served from in-memory cache without touching VictoriaLogs.
+No cache, no coalescer. Pure translation overhead + HTTP proxying + VictoriaLogs response time. This is the floor — what you get before any caching kicks in.
-| Workload | Concurrency | Loki req/s | Proxy req/s | Throughput | P50 Loki | P50 Proxy | Latency |
+| Workload | Concurrency | Cold proxy | Loki | Ratio |
+|---|:---:|---:|---:|:---:|
+| Small metadata queries | c=10 | 1,212 req/s | ~880 req/s | **1.4× faster** |
+| Small metadata queries | c=50 | 1,583 req/s | ~780 req/s | **2× faster** |
+| Heavy pipeline queries | c=10 | 126–188 req/s | ~161–472 req/s | **~parity** |
+| Heavy pipeline queries | c=100 | 139 req/s | ~33 req/s | **4.2× faster** |
+| Long-range (6 h–72 h) | c=10 | 2× faster than Loki | — | parallel sub-window fetching |
+| Compute (rate, topk) | c=10 | 210 req/s | ~2,403 req/s | 0.09× — N VL calls per metric query |
+
+- **Small and metadata queries:** 1.4–2× faster than Loki cold — VL scans are faster than Loki's chunk store for label/series queries.
+- **Heavy pipeline queries:** parity to 4.2× faster depending on concurrency — `stats_query_range` fast path eliminates 39% cold CPU for `count_over_time`/`rate` queries.
+- **Long-range queries:** 2× faster cold — parallel sub-window fetching completes before Loki's sequential chunk scan.
+- **Compute aggregations (`quantile_over_time`, `topk`, multi-stage pipelines):** each metric query fans out to N VL calls; pprof-guided alloc fixes lifted cold throughput from 40 to 210 req/s. Historical sub-windows are cached on first fetch (24 h TTL), so repeated compute queries approach warm performance.
+
+### Warm cache — what production steady-state looks like
+
+Grafana dashboards auto-refresh every 30 s. After the first fetch, repeated queries are served from in-memory cache without touching VictoriaLogs. The numbers below are **100% cache-hit results** — they represent the ceiling, not the typical case. Real production performance sits between the cold floor above and these warm numbers, depending on your dashboard diversity and refresh interval.
+
+| Workload | Concurrency | Loki req/s | Proxy (warm) req/s | Throughput | P50 Loki | P50 Proxy | Latency |
 |---|:---:|---:|---:|:---:|---:|---:|:---:|
 | Small panels | c=10 | 2,011 | 15,626 | **7.8× faster** | 4 ms | 587 µs | **6.8× faster** |
 | Small panels | c=100 | 2,290 | 27,513 | **12× faster** | 42 ms | 3 ms | **14× faster** |
@@ -49,38 +67,20 @@ Grafana dashboards auto-refresh every 30 s. After the first fetch, every repeate
 | Compute | c=10 | 2,803 | 11,162 | **4× faster** | 1 ms | 675 µs | **1.5× faster** |
 | Compute | c=100 | 1,611 | 16,456 | **10.2× faster** | 4 ms | 4 ms | parity |
 
-CPU: **6–408× less** than Loki. RAM: **1.7–3.9× less** for most workloads.
+CPU: **6–408× less** than Loki under cache load. RAM: **1.7–3.9× less** for most workloads.
 
 † Loki heavy c=100 was saturated — P90=1,818 ms, P99=6,950 ms, delivering only 162 req/s vs 7,134 for the proxy.
 
 ### Dashboard load spikes — request coalescer
 
-When many panels hit the same query at once, the proxy collapses them into a single backend call. Everyone gets the result; the backend sees one request instead of N.
+When many panels hit the same query simultaneously, the proxy collapses them into a single backend call. The figures below assume the coalesced result is already cached; first-hit coalescing still avoids the N-fan-out but pays one backend round-trip.
 
-| Workload | Loki P50 | Proxy P50 |
+| Workload | Loki P50 | Proxy P50 (warm) |
 |---|---:|---:|
 | Metadata queries | 196 ms | **1 ms** |
 | Heavy aggregations | 2,399 ms | **1 ms** |
 | Content search | 13,415 ms | **1 ms** |
 
-### Cold cache, unique queries — honest floor
-
-No cache, no coalescer benefit. Pure translation overhead + HTTP proxying + VL response time.
-
-| Workload | Concurrency | Cold proxy | Loki | Ratio |
-|---|:---:|---:|---:|:---:|
-| Small metadata queries | c=10 | 1,212 req/s | ~880 req/s | **1.4× faster** |
-| Small metadata queries | c=50 | 1,583 req/s | ~780 req/s | **2× faster** |
-| Heavy pipeline queries | c=10 | 126–188 req/s | ~161–472 req/s | **~parity** |
-| Heavy pipeline queries | c=100 | 139 req/s | ~33 req/s | **4.2× faster** |
-| Long-range (6 h–72 h) | c=10 | 2× faster than Loki | — | parallel sub-window vs sequential scan |
-| Compute (rate, topk) | c=10 | 210 req/s | ~2,403 req/s | 0.09× — N VL calls per query; was 0.03× before alloc fixes |
-
-- **Small and metadata queries:** 1.4–2× faster than Loki cold — VL scans are faster than Loki's chunk store for label/series queries
-- **Heavy pipeline queries:** parity to **4.2× faster** depending on concurrency — `stats_query_range` fast path eliminates 39% cold CPU for `count_over_time`/`rate` queries
-- **Long-range queries:** **2× faster cold** — parallel sub-window fetching completes before Loki's sequential chunk scan
-- **Compute aggregations (`quantile_over_time`, `topk`, multi-stage pipelines):** N VL calls per metric query; pprof-guided alloc fixes lifted cold throughput 40→210 req/s (+5.25×); historical windows cache after first run (24 h TTL), bringing warm compute on par with or faster than Loki
-
 Full throughput tables, P90/P99 latency, CPU and RSS breakdowns: [Benchmarks](docs/benchmarks.md) · [Performance](docs/performance.md)
 
 ---