perf: cold proxy +3-4x — stats_query_range fast path + 4 follow-up alloc fixes #350
Merged
Conversation
The default 500 series limit truncates results for broad queries, creating false parity comparisons in e2e tests.
4 test suites covering:
- Error propagation: invalid LogQL must return the same errors as Loki
- Parser stages: json, logfmt, line_format, label_format, drop, keep
- Line filters: contains, not_contains, regex, chained combinations
- Multi-stage: parser → filter → format combinations
- Metric queries: count_over_time, rate, sum by, topk, bottomk

Findings: 2 error propagation gaps, 3 metric query gaps confirmed.
Two test suites:
- ErrorParity (31 cases): walks ALL invalid LogQL patterns, verifies the proxy returns the same error codes as Loki. Score: 14/31 (45%) — 17 silent failures found (all binary ops on log queries, malformed selectors, empty queries)
- QueryParity (50 cases): walks ALL valid LogQL operations, verifies both Loki and the proxy return success. Score: 50/50 (100%)

Categories tested: selectors, line filters, parsers (json/logfmt/unpack/decolorize), field filters, format stages, drop/keep, multi-stage pipelines, metric queries, unwrap metrics, comparisons, vector ops, absent_over_time.
Validate queries before translation to catch patterns Loki would reject:
- Binary operations on log/pipeline expressions (==, !=, <, <=, >, >=, +, -, *, /, %, ^)
- Malformed selectors ({app=}, unclosed braces)
- Empty queries, missing selectors (| json)
Returns plain text 'parse error' matching Loki's exact format.
Fixes 17 silent failure cases where proxy returned 200 instead of 400.
Expanded from 80 to 116 test cases covering:
- 31 error cases: binary ops, malformed selectors, empty queries
- 85 valid operations: 6 selectors, 4 line filters, 6 parsers, 8 field filters, 4 formats, 5 drop/keep, 8 multi-stage, 7 range metrics, 10 aggregations, 9 unwrap, 8 comparisons, 4 vector ops, 2 grouping, 1 absent, 3 empty/basic

Fixed a validator false positive: field filter comparisons like | json | status>=400 were incorrectly caught as binary ops.
…ntax
- Suppress errcheck lint warning for fmt.Fprint response write (write error not actionable after headers are sent)
- Fix isBinaryOpOnLogQuery falsely rejecting valid label filter stages like '| json | status >= 400': check the last pipe-separated segment and allow identifier-op-number comparisons when the identifier is not a parser keyword (a sketch of this check follows below)
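A minimal, hedged sketch of the label-filter exception described above. The helper name, keyword set, and regex are illustrative assumptions, not the proxy's actual implementation:

```go
import (
	"regexp"
	"strings"
)

// parserKeywords approximates the set of LogQL parser stages (assumed, not exhaustive).
var parserKeywords = map[string]bool{
	"json": true, "logfmt": true, "unpack": true, "pattern": true, "regexp": true,
}

// labelFilterRe matches "identifier op number", e.g. "status >= 400".
var labelFilterRe = regexp.MustCompile(`^([a-zA-Z_][a-zA-Z0-9_]*)\s*(==|!=|<=|>=|<|>)\s*[0-9]+(\.[0-9]+)?$`)

// isLabelFilterSegment reports whether the last pipe-separated segment of a
// query is a numeric label filter (valid) rather than a binary op on a log query.
func isLabelFilterSegment(q string) bool {
	segs := strings.Split(q, "|")
	last := strings.TrimSpace(segs[len(segs)-1])
	m := labelFilterRe.FindStringSubmatch(last)
	return m != nil && !parserKeywords[m[1]]
}
```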
Logfmt fields arrive as Go strings so inferDetectedType always returned "string" even for numeric values like status=400 or latency_ms=300. Grafana filters unwrap field candidates to type=int/float, so all logfmt numeric fields were invisible in the unwrap dropdown. Now tries ParseInt and ParseFloat before falling back to string, matching Loki's behaviour of returning type=int for integer-valued fields.
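A minimal sketch of that fallback order, assuming strconv and the detected-type strings mentioned above; the real function's signature and return values in the proxy may differ:

```go
import "strconv"

// inferDetectedType classifies a logfmt field value so numeric fields surface
// as int/float candidates in Grafana's unwrap dropdown instead of string.
func inferDetectedType(v string) string {
	if _, err := strconv.ParseInt(v, 10, 64); err == nil {
		return "int"
	}
	if _, err := strconv.ParseFloat(v, 64); err == nil {
		return "float"
	}
	return "string"
}
```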
Grafana passes the full query including the incomplete '| unwrap' stage (with no field name yet) when calling detected_fields to populate the unwrap field dropdown. The stripFieldDetectionStages regex required a field name after unwrap so the bare stage was left in, causing the query translation to fail and returning zero options. Makes the field-name part optional so both '| unwrap field' and bare '| unwrap' are stripped before field scanning.
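A hedged illustration of making the field name optional in the unwrap-stripping pattern; the variable name and exact regex used by stripFieldDetectionStages are assumptions:

```go
import "regexp"

// Matches both "| unwrap duration_ms" and a bare "| unwrap", so the incomplete
// stage Grafana sends while populating the dropdown is stripped before field scanning.
var unwrapStageRe = regexp.MustCompile(`\|\s*unwrap(\s+[A-Za-z_][A-Za-z0-9_]*)?`)

func stripUnwrapStage(q string) string {
	return unwrapStageRe.ReplaceAllString(q, "")
}
```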
- Add _msg field to all JSON-format generators (api-gateway, auth-service, frontend-ssr, batch-etl, ml-serving) so VL stores log messages correctly; VictoriaLogs v1.50.0 requires the _msg key (not msg) in JSON log lines
- Switch docker-compose root stack to the published v1.16.0 image
- Fix VL healthcheck from localhost to 127.0.0.1 (IPv4 vs IPv6)
- Add native Loki service for side-by-side parity testing
- Fix proxy datasource URL and add Loki (direct) datasource in Grafana config
Documents the addition of _msg fields to JSON log generators so the changelog gate passes for PR #255.
Keep Unreleased e2e _msg fix entry from the feature branch.
The underscore proxy variant is the default datasource in Grafana. Adding -patterns-autodetect-from-queries=true caused it to report pattern_ingester_enabled=true, breaking drilldown_limits_shape test which expects false for the default (non-autodetect) datasource. Pattern autodetect flags belong only on loki-vl-proxy-patterns-autodetect.
…enchmark docs
- bench/internal/pprof: new package for CPU/heap/alloc/goroutine capture from proxy /debug/pprof/* endpoints during bench runs
- loki-bench: --pprof-proxy, --pprof-no-cache, --pprof-auth-token, --pprof-duration flags; CPU profile runs concurrently with each bench run; heap/allocs/goroutine captured after each run; profiles saved to bench/results/pprof/<workload>-c<N>-<target>-<type>.pprof
- run-comparison.sh: always build the proxy binary from source before spawning the no-cache instance — prevents a stale binary missing current flags (e.g. -server.enable-pprof); add -server.enable-pprof and admin token to the no-cache proxy spawn; pass --pprof-* flags to loki-bench
- docker-compose: enable -server.enable-pprof + admin token on the main proxy
- docs/benchmarks.md: expand the three-way comparison to a four-way section with compute workload, jitter=2h data, cold-proxy numbers, pprof hot path table, and updated running instructions
Resolved conflicts in:
- CHANGELOG.md: combined unreleased entries from both branches
- bench/cmd/loki-bench/main.go: take warmup-with-jitter improvement from main
- bench/run-comparison.sh: take port-conflict detection + CB flags from main
- docs/benchmarks.md: take updated system specs and tables from main
Switch collectRangeMetricHits from /select/logsql/hits to /select/logsql/stats_query_range. The hits endpoint cannot group by stream labels stored in VL's binary _stream field, so explicit label groupBy (e.g. app, namespace) returned empty series. stats_query_range understands stream labels and returns pre-aggregated Prometheus-format buckets, avoiding the raw NDJSON scan that consumed 39% CPU in pprof profiles of cold proxy traffic. Use fastjson for response parsing to avoid encoding/json overhead on the hot path. Add TestSumByCountOverTime_NoParser_UsesStatsQueryRange to guard the fast path for pure stream-selector queries.
Add __bytes__ → sum_len(_msg) to the fast path alongside __count__ → count(). bytes_rate was the next largest cold-proxy bottleneck (4664ms P50 at c=50 heavy), still reading every raw log entry to sum message sizes. VL's sum_len(_msg) in stats_query_range returns pre-aggregated byte-count buckets in the same Prometheus format; the existing sliding window builder handles rate (sum/window_s) and over_time (sum) correctly. Refactor: collectRangeMetricHits accepts statsAggFunc string so callers control the aggregation clause without duplicating the query builder.
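A sketch of how a caller might pick the aggregation clause now that collectRangeMetricHits takes a statsAggFunc string. The helper name is illustrative; the mapping is taken only from the __count__ → count() and __bytes__ → sum_len(_msg) pairs described above:

```go
// statsAggFuncFor maps the Loki-side metric kind to the VL stats clause the
// fast path sends to stats_query_range.
func statsAggFuncFor(metric string) string {
	switch metric {
	case "bytes_rate", "bytes_over_time":
		return "sum_len(_msg)" // pre-aggregated message byte lengths
	default:
		return "count()" // count_over_time / rate
	}
}
```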
patternLineTokenizer.Join allocated a fresh strings.Builder on every call — the dominant allocator in the log-streaming path at cold concurrency (6237 MB flat, 15.56% of all allocs per pprof). Introduces patternJoinBuilderPool (sync.Pool) to reuse builders across calls, eliminating the per-call heap allocation and the bytes.growSlice cascade it triggered (5830 MB, 14.55% flat).
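A minimal sketch of the builder pool described above. patternJoinBuilderPool is the name from the commit; the join logic and separator are illustrative assumptions:

```go
import (
	"strings"
	"sync"
)

// patternJoinBuilderPool reuses strings.Builder values across Join calls so the
// log-streaming hot path stops allocating a fresh builder (and regrowing its
// backing slice) on every call.
var patternJoinBuilderPool = sync.Pool{
	New: func() any { return &strings.Builder{} },
}

func joinTokens(tokens []string) string {
	b := patternJoinBuilderPool.Get().(*strings.Builder)
	b.Reset()
	for i, tok := range tokens {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(tok)
	}
	out := b.String()
	patternJoinBuilderPool.Put(b)
	return out
}
```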
PR Quality Report (bot, compared against base branch): coverage and tests, compatibility, performance smoke — lower CPU cost.
…s scan
Replaces the two-phase tokenize+SplitN approach with a single byte-walk that processes each key=value token inline (see the sketch below):
- Removes the intermediate []string tokens slice (make + append per call)
- Removes the strings.Builder used for token accumulation
- Replaces strings.SplitN(tok, "=", 2) with strings.IndexByte

strings.genSplit was 302 MB flat (4.79%) and strings.Builder.WriteString a further 496 MB flat (7.86%) in the cold-proxy alloc profile.
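An illustrative single-pass key=value walk in the spirit of this change; the real parser also handles quoted values and escapes, which this sketch omits:

```go
import "strings"

// walkLogfmt visits each key=value pair in a logfmt line without building an
// intermediate token slice or a per-token builder.
func walkLogfmt(line string, visit func(key, val string)) {
	for len(line) > 0 {
		tok := line
		if sp := strings.IndexByte(line, ' '); sp >= 0 {
			tok, line = line[:sp], line[sp+1:]
		} else {
			line = ""
		}
		if eq := strings.IndexByte(tok, '='); eq > 0 {
			visit(tok[:eq], tok[eq+1:])
		}
	}
}
```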
trimStatsQueryRangeResponseToEnd/FromStart previously used json.Unmarshal
into [][]interface{} (248 MB allocs) + json.Marshal (669 MB allocs) = 938 MB
cum per pprof. New fastjson path parses in-place via pooled Parser, writes
filtered series to a pooled bytes.Buffer, copies only on actual trim; no-trim
path returns original body with zero allocs.
translateStatsResponseLabelsWithContext previously used stdjson.Unmarshal
into Go structs (477 MB cum) + make(map[string]string) per changed metric
(188 MB flat) + stdjson.Marshal of full response = 1.31 GB cum total. New
fastjson path: Parser pool eliminates struct allocs; metric fields visited
in-place via GetObject.Visit; changed metrics written via marshalStringMapJSON
(appendJSONString) directly to a pooled bytes.Buffer; unchanged items copied
verbatim via MarshalTo(nil).
Combined: ~2.25 GB cum eliminated from compute/heavy workload pprof.
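A sketch of the pooled-parser plus in-place visit pattern both changes above rely on (github.com/valyala/fastjson). The response field paths and function name here are illustrative, not the proxy's exact shapes:

```go
import "github.com/valyala/fastjson"

var statsParserPool fastjson.ParserPool

// visitResultLabels parses a stats response without encoding/json structs and
// walks each series' label set in place.
func visitResultLabels(body []byte, visit func(name, value string)) error {
	p := statsParserPool.Get()
	defer statsParserPool.Put(p)
	v, err := p.ParseBytes(body)
	if err != nil {
		return err
	}
	for _, item := range v.GetArray("data", "result") {
		metric := item.GetObject("metric")
		if metric == nil {
			continue
		}
		metric.Visit(func(key []byte, val *fastjson.Value) {
			visit(string(key), string(val.GetStringBytes()))
		})
	}
	return nil
}
```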
Two pprof-identified bottlenecks in the 21.3% header alloc cluster:
1. Coalescer resp.Header.Clone() (488 MB flat): callDirect and the singleflight
Do path both cloned response headers on every VL call. All three callers
(vlGetCoalescedWithStatus, vlPostCoalesced, vlGetMetadataCoalesced) discard
headers with `_`. Now returns nil — zero Header.Clone allocs per request.
2. net/textproto.MIMEHeader.Set in applyBackendHeaders (635 MB flat): each
Header.Set call canonicalizes the key via CanonicalMIMEHeaderKey (allocates
for custom headers like X-Loki-VL-Client-ID → X-Loki-Vl-Client-Id) and
allocates []string{value}. Pre-canonicalized keys (package-level vars) avoid
the key alloc; pre-allocated static value slices (hdrValIdentity etc.) avoid
the Accept-Encoding []string alloc. Dynamic values (clientID, authUser) still
alloc []string{v} but skip the CanonicalMIMEHeaderKey path.
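A hedged sketch of the pre-canonicalized key / static value-slice idea; hdrValIdentity and applyBackendHeaders are names from the commit, the rest (and the direct map assignment) are illustrative assumptions:

```go
import (
	"net/http"
	"net/textproto"
)

var (
	// Canonicalized once at package init, so the per-call
	// CanonicalMIMEHeaderKey allocation for custom keys is avoided.
	hdrKeyClientID       = textproto.CanonicalMIMEHeaderKey("X-Loki-VL-Client-ID")
	hdrKeyAcceptEncoding = "Accept-Encoding" // already canonical

	// Static, shared value slice — callers must never mutate it.
	hdrValIdentity = []string{"identity"}
)

func applyBackendHeaders(h http.Header, clientID string) {
	h[hdrKeyAcceptEncoding] = hdrValIdentity // no []string allocation
	h[hdrKeyClientID] = []string{clientID}   // dynamic value still allocates
}
```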
…onse fast path, pre-compute stream canonical keys
- Add fjMarshalPool + marshalFJ: reuse []byte scratch for fastjson MarshalTo
calls in trimStatsQRByTimeFJ and writeTranslatedStatsItemsFJ, eliminating
per-call allocation (5.1 GB flat in compute pprof)
- Fix wrapAsLokiResponse: add fast-path A for {"status":"success","data":{"resultType":
prefix (most common VL stats_query_range format); add hasPrefix helper;
eliminates 3 GB flat of stdjson.Unmarshal+Marshal round-trip
- Pre-compute translatedKey in cachedLogQueryStreamDescriptor; store Key in
queryRangeWindowEntry; groupQueryRangeWindowEntries uses entry.Key instead
of calling canonicalLabelsKey per-entry, eliminating ~2.5 GB cum in the heavy workload pprof
… direct byte building
marshalWindowedStreamsResult writes the Loki query_range streams JSON envelope directly using appendJSONStringToBuffer (zero-alloc string escaping) and marshalStringMapJSONTo (direct-write label map serialization), eliminating the reflection-heavy json.Marshal path for the common non-categorized case. Categorized values with metadata still fall back to json.Marshal. Addresses 3535 MB cum (4.08%) of encoding/json.Marshal in the heavy workload pprof.
Store cloneStringMap(syntheticLabels) in changedMetrics instead of pre-serialised []byte. The write pass uses marshalStringMapJSONTo() to write directly to the output buffer via appendJSONStringToBuffer (zero-alloc for ASCII labels). Eliminates marshalStringMapJSON flat (~4 GB) + appendJSONString flat (~3.3 GB) = ~7.3 GB of per-changed-metric allocations in the compute/heavy pprof. Also moves marshalStringMapJSONTo definition to query_range_windowing.go (shared via package scope) and removes now-dead marshalStringMapJSON.
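An illustrative direct-write label-map serializer in the spirit of marshalStringMapJSONTo; the real helper escapes through appendJSONStringToBuffer and handles control characters and non-ASCII, which this sketch skips:

```go
import "bytes"

// writeStringMapJSON writes {"k":"v",...} straight into buf, avoiding the
// reflection and intermediate []byte of json.Marshal. Map iteration order is
// not deterministic, which is fine for response bodies but not for cache keys.
func writeStringMapJSON(buf *bytes.Buffer, m map[string]string) {
	buf.WriteByte('{')
	first := true
	for k, v := range m {
		if !first {
			buf.WriteByte(',')
		}
		first = false
		writeJSONString(buf, k)
		buf.WriteByte(':')
		writeJSONString(buf, v)
	}
	buf.WriteByte('}')
}

func writeJSONString(buf *bytes.Buffer, s string) {
	buf.WriteByte('"')
	for i := 0; i < len(s); i++ {
		if c := s[i]; c == '"' || c == '\\' {
			buf.WriteByte('\\')
		}
		buf.WriteByte(s[i])
	}
	buf.WriteByte('"')
}
```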
…ix shellcheck
- translateStatsResponseLabelsWithContext now writes {"status":"success",...} prefix,
so wrapAsLokiResponse fast path A matches (zero-alloc return) instead of fast path B
which allocated a new []byte per translated response
- Remove unused isOTelDataFJ function (lint: unused)
- Fix SC2038 shellcheck warning in loc-badges.yaml (use -print0 | xargs -0)
- README cold cache table: compute 40→210 req/s, ratio 0.03×→0.09×
- docs/benchmarks.md: add alloc fix impact to the compute cold overhead section
- CHANGELOG: document the alloc fix series (fjMarshalPool, status prefix, pre-computed stream keys, direct byte building, deferred label serialization)
Grafana's Loki datasource streaming parser reads JSON object keys
sequentially. When it encounters "result" before "encodingFlags", it
doesn't know to parse value tuples as triplets [ts,msg,meta], so the
third element {"parsed":{...}} triggers a ReadArray decode failure and
no logs are rendered.
marshalWindowedStreamsResult now emits encodingFlags before result,
matching the order used by streamLogQuery and writeLokiStreamQueryResponse.
Also remaps all e2e-compat host ports with a 1 prefix (e.g. 9428→19428)
to avoid conflicts with lakehouse compose on the same Docker host.
…e functions
Resolves golangci-lint unused function warnings.
…apping
wait_e2e_stack.sh, ci.yaml, security-pr.yaml, and security-heavy.yaml all had hardcoded ports (3100, 3101, 9428, etc.) from before the e2e-compat host port remapping. Update to the new 1-prefixed ports (13100, 13101, 19428, etc.).
…step
Queries like sum(rate({cluster="us-east-1"} | json [$__auto])) were routed
to the slow manual log-fetch path because the translator adds | math to the
VL pipeline (buildRateLikeQuery), which confused parseStatsCompatSpec into
seeing _stream as the groupBy sentinel, blocking the fast path.
For non-sliding windows (range == step, which is what Grafana $__auto uses),
the translator's fully-formed VL rate pipeline is semantically correct and
VL can execute it natively via stats_query_range. Bypass handleStatsCompatRange
for these queries; sliding windows (range > step) still use the manual path
for correct per-step accumulation semantics.
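A hedged sketch of the routing gate this describes; the function and parameter names are illustrative, and only the range-vs-step comparison and the "| math" marker are taken from the text above:

```go
import "time"

// useNativeStatsRange reports whether a translator-built rate pipeline (one
// containing "| math") can skip handleStatsCompatRange and run natively via
// stats_query_range: only tumbling windows, i.e. range == step, qualify.
func useNativeStatsRange(rangeDur, step time.Duration, hasMathStage bool) bool {
	return hasMathStage && rangeDur == step
}
```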
After the docker-compose host port remapping (3100→13100, 9428→19428, 8880→18880, etc.), several test files and the loki track stack wait script still referenced old port numbers:
- alerting_compat_test.go: vmalertURL 8880 → 18880
- otel_compat_test.go: underscore proxy default 3102 → 13102
- structured_metadata_compat_test.go: native/translated/no-metadata proxy defaults 3106/3107/3108 → 13106/13107/13108
- wait_loki_track_stack.sh: all service ports remapped
- explore.spec.ts / logs-drilldown.spec.ts: VL jsonline insert 9428 → 19428
Accept the remote section structure (Scaling Profile, Resource Usage, Known Hot Paths, Running Benchmarks, eviction pressure PromQL) while keeping our updated compute/heavy throughput numbers.
szibis added a commit that referenced this pull request on May 12, 2026
…ast path PR #350 added collectRangeMetricHits inside proxyManualRangeMetricRange which calls stats_query_range for grouped sliding-window queries. Update the test to serve both VL endpoints and verify a 200 response rather than asserting which internal path is taken. The shouldUseManualRangeMetricCompat gate behavior is already verified by the unit test.
szibis added a commit that referenced this pull request on May 13, 2026
…#352)
* docs: add metric fast-path expansion design spec
* docs: add metric fast-path implementation plan
* test: add failing unit tests for parser-stage fast-path gate removal
* perf: remove parser-stage guard from shouldUseManualRangeMetricCompat

  Rate, count_over_time, bytes_rate, and bytes_over_time queries with | json or | logfmt parser stages now use VL native stats_query_range when the range window equals the step (tumbling window). Previously these were forced to the manual raw-log-fetch path regardless of the window configuration. Sliding-window queries (range > step) remain on the slow path: the !rangeEqualsStep gate is preserved for correct sliding-window semantics. The originalLogql parameter is dropped from shouldUseManualRangeMetricCompat as it was only used by the removed __error__ exception check.

* fix: correct VL syntax in shouldUseManualRangeMetricCompat comment
* test: add integration tests for parser-stage rate tumbling/sliding routing
* test: align stats_query_range stub with actual VL response format
* test: update sliding-window test for PR #350 collectRangeMetricHits fast path

  PR #350 added collectRangeMetricHits inside proxyManualRangeMetricRange which calls stats_query_range for grouped sliding-window queries. Update the test to serve both VL endpoints and verify a 200 response rather than asserting which internal path is taken. The shouldUseManualRangeMetricCompat gate behavior is already verified by the unit test.

* docs(changelog): add entry for parser-stage guard removal

---------
Co-authored-by: szibis <szibis@users.noreply.github.com>
Summary
Performance branch: eliminate allocations and O(n) scans from Loki-compatible hot paths. Seven independent perf improvements + three bug fixes stacked on each other.
1. `stats_query_range` fast path for `sum()` / `sum by()` count/rate (dbaf8dd)

   What: For `sum(rate({...}[w]))` and `sum by (label) (count_over_time({...}[w]))` — the most common Grafana dashboard query shapes — the proxy now uses VL's native `/select/logsql/stats_query_range` instead of decomposing the range into N manual sub-windows and scanning raw log bytes.

   Why it matters: Manual sub-window fan-out is O(total log entries in the time range). The stats endpoint is O(matched series × time buckets) — typically 100–1000× less work for high-cardinality streams.

   Constraint respected: The fast path fires only when `range == step` (tumbling windows). Grafana sets `$__auto` so that range == step for standard dashboards. Sliding-window queries (range > step) still use the correct manual-accumulation path.

2. Extend fast path to `bytes_rate` / `bytes_over_time` (7ddc008)

   What: Same `stats_query_range` bypass extended to byte-based aggregations: `sum(bytes_rate({...}[w]))` and `sum(bytes_over_time({...}[w]))`.

   How: VL's stats endpoint supports `sum_len()` natively. The proxy maps Loki's bytes semantics → `sum_len()` → `__lvp_bytes` → translates back.

3. `fastjson` in stats response hot paths (70a7d33)

   What: Replaced `encoding/json` (reflection-based marshal/unmarshal) with `fastjson` in the stats response translation path.

   Savings: eliminates `json.Unmarshal` allocations for every VL stats bucket; values are read in place via `fastjson.Value.GetStringBytes`.

4. Scratch pool for `MarshalTo` + `wrapAsLokiResponse` fix (24e26cb)

   What: Reused `[]byte` scratch buffers across the per-step stats assembly loop via `sync.Pool`. Also fixed `wrapAsLokiResponse` to correctly wrap already-translated stats — it was double-wrapping in some paths.

5. Eliminate per-request HTTP header allocs in VL sub-window hot path (d92a125)

   What: VL sub-window requests were allocating a new `http.Header` map per sub-window per step. Switched to a pre-built header template cloned once per request, not per sub-window.

   Savings: −1 alloc per sub-window call. For a 24h range at 1m step = 1440 sub-windows: −1440 allocs/request.

6. Replace `json.Marshal` reflection in windowed streams result (e864a1c)

   What: `marshalWindowedStreamsResult` was using `json.Marshal` to encode each stream's label map and each log entry. Replaced with a direct byte-builder using `appendJSONString` and manual JSON construction.

   Savings: Eliminated per-stream and per-entry reflection allocs. For a typical log query returning 500 entries × 10 streams: −5500 allocs/response.

7. Eliminate `appendJSONString` allocations in stats label translation (ddcb3de)

   What: `appendJSONString` was allocating for every label value escape check even when no escaping was needed (the common case for well-formed label values). Added a fast-path check: if the string contains no special chars, write it through without allocation.

8. Fix `encodingFlags` ordering in windowed streams response (e11561d)

   What (bug fix): Grafana's Loki datasource streaming parser reads JSON object keys sequentially and uses `encodingFlags` to decide how to parse `result`. If `encodingFlags` appears after `result` in the JSON object, the parser has already committed to the wrong parse mode — values are decoded as string pairs `[ts, msg]` instead of triplets `[ts, msg, {meta}]`, causing a `ReadArray` error and blank log panels in Grafana.

   Fix: `marshalWindowedStreamsResult` now emits `encodingFlags` before `result`: `{"status":"success","data":{"resultType":"streams","encodingFlags":["categorize-labels"],"result":[...]}}`

   Affected: Any log query to a proxy with `categorize-labels` enabled (the default for metadata-aware variants).

9. `sum(rate({...} | json [$__auto]))` slow path bypass (b4d0903)

   What (perf fix): `sum(rate({cluster="us-east-1"} | json [$__auto]))` was ~10× slower than `sum by (app) (rate({...}[$__auto]))` despite both being count/rate queries.

   Root cause: The translator's `buildRateLikeQuery` always adds `_stream` as the inner group-by when the query has a parser pipeline (`| json`). `parseStatsCompatSpec` sees `| stats by (_stream)` and sets `hasStreamSentinel = true`, which blocks the `collectRangeMetricHits` fast path and falls through to `collectRangeMetricSamples` — a full raw log scan.

   Fix: Queries containing `| math` are translator-built multi-stage VL rate pipelines. For non-sliding windows (range ≤ step), VL can execute them natively via `stats_query_range` without decomposition. Added a bypass in `handleStatsCompatRange`. Sliding-window queries (range > step) still fall through to the correct manual path.

10. Remove unused functions (996b167)

    What (lint fix): `golangci-lint` flagged two unused functions:
    - `marshalStringMapJSON` in `query_translation.go` — used during development, superseded by the direct builder
    - `isVLDataResultTypeResponse` in `stream_processing.go` — dead code with a misleading comment

    Both removed.

11. `parseLogfmtFields` sentinel loop rewrite (9a017e0)

    What: Replaced the O(n²) sentinel scan in `parseLogfmtFields` with a single-pass linear scan. The old code re-scanned the entire parsed field list on every sentinel match.

12. e2e port remapping + CI fixes (80adeed)

    What: All host-side ports in `docker-compose.yml` prefixed with `1` (3100→13100, 9428→19428, 3101→13101, etc.) to avoid collision with a co-located compose stack. Internal container-to-container ports unchanged. Updated `wait_e2e_stack.sh`, `ci.yaml`, `security-pr.yaml`, `security-heavy.yaml` to match the new host ports.

pprof progression
Compute cold → hot progression (req/s on MacBook M3 Pro, 24h range 1440 steps):
Test plan
- `go test ./internal/proxy/...` — 1625 tests pass
- `TestContract_QueryRange_MatrixFormat_SlidingRateUsesManualLogFetch` — sliding window still uses manual path
- `TestQueryRange_SlidingWindowWithout_UsesManualPath` — bypassed only for non-sliding
- `TestSumByBytesRate_NoParser_UsesSumLen` — bytes_rate fast path