Skip to content

fix: address code review issues — parser-stage correctness, memory cap, O(N²) chunked fetch, and docs#355

Merged
szibis merged 2 commits into
mainfrom
docs/accuracy-audit-pr
May 13, 2026
Merged

fix: address code review issues — parser-stage correctness, memory cap, O(N²) chunked fetch, and docs#355
szibis merged 2 commits into
mainfrom
docs/accuracy-audit-pr

Conversation

@szibis
Copy link
Copy Markdown
Collaborator

@szibis szibis commented May 13, 2026

Summary

  • Fix 1 (HIGH correctness): Restore __error__ guard for parser-stage metric queries. VL stats_query_range counts parse-failed lines while Loki excludes them. All rate/bytes_rate/count_over_time/bytes_over_time queries with parser stages now route to the slow log-fetch path unless the caller explicitly adds | drop __error__ to opt in to VL's count-all semantics. The opt-in check uses the inner LogQL base query (without outer aggregation or range window brackets) so hasDropErrorOnlyPostParserStage correctly identifies the clause.
  • Fix 2 (MEDIUM performance): Eliminate O(N²) copy amplification in coldBackwardChunkedFetch. Replaced the append(chunkBody, accumulated...) prepend loop with a slice of chunks reversed and joined once at the end.
  • Fix 3 (MEDIUM memory): Cap unbounded io.ReadAll in collectRangeMetricHits at 64 MB via the existing readBodyLimited helper to bound RSS on oversized VL error response bodies.
  • Fix 4 & 5 (documentation): Document optional tenant header behaviour and multi-tenant serial fanout in KNOWN_ISSUES.md. Add a TODO comment in multitenant.go referencing the tracking entry.

Test Plan

  • go test ./internal/proxy/... passes (1638 tests, 0 failures)
  • TestQueryRange_RateParserStageTumblingUsesSlowPath verifies parser-stage rate without | drop __error__ uses slow path
  • TestQueryRange_RateParserStageDropErrTumblingUsesStatsQueryRange verifies | drop __error__ opt-in restores fast path
  • TestQueryRange_CountOverTimeParserUsesDirectStatsRange verifies grouped count_over_time with | drop __error__, __error_details__ uses stats_query_range
  • TestCompatHelpers_ParseQuantileAndUnwrapErrorName verifies updated unit assertions for parser-stage slow path

szibis added 2 commits May 13, 2026 12:24
Fix 1 (correctness): Restore __error__ guard to parser-stage metric queries.
VL native stats_query_range counts parse-failed lines while Loki excludes them.
Queries with parser stages (| json, | logfmt, etc.) now route to the slow
log-fetch path unless the caller explicitly opts in via " | drop __error__".

The opt-in check uses origSpec.BaseQuery (inner LogQL pipeline without outer
aggregation or range window brackets) so hasDropErrorOnlyPostParserStage
correctly identifies the drop-error clause. The | math guard for grouped rate
queries and a new pre-check for non-| math queries (count_over_time etc.) both
apply this fast-path bypass before shouldUseManualRangeMetricCompat is called.

Fix 2 (performance): Eliminate O(N²) copy amplification in coldBackwardChunkedFetch.
Replaced prepend-into-growing-buffer (accumulated = append(chunkBody, accumulated...))
with a slice of chunks that is reversed once at the end and joined.

Fix 3 (memory): Cap unbounded io.ReadAll in collectRangeMetricHits at 64 MB via
the existing readBodyLimited helper to bound RSS on oversized VL error bodies.

Fix 4 (documentation): Document optional tenant header behaviour in KNOWN_ISSUES.md.

Fix 5 (documentation + code): Document multi-tenant serial fanout in KNOWN_ISSUES.md
and add a TODO comment in multitenant.go referencing the tracking entry.
@github-actions github-actions Bot added size/M Medium change scope/proxy Proxy core scope/docs Documentation scope/tests Tests bugfix Bug fix labels May 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Quality Report

Compared against base branch main.

Coverage and tests

Signal Base PR Delta
Test count 2506 2507 1
Coverage 87.2% 87.2% 0.0% (stable)

Compatibility

Track Base PR Delta
Loki API 100.0% 11/11 (100.0%) 0.0% (stable)
Logs Drilldown 100.0% 17/17 (100.0%) 0.0% (stable)
VictoriaLogs 100.0% 11/11 (100.0%) 0.0% (stable)

Performance smoke

Lower CPU cost (ns/op) is better. Lower benchmark memory cost (B/op, allocs/op) is better. Higher throughput is better. Lower load-test memory growth is better. Benchmark rows are medians from repeated samples.

Signal Base PR Delta
QueryRange cache-hit CPU cost 1603.0 ns/op 1653.0 ns/op +3.1% (stable)
QueryRange cache-hit memory 200.0 B/op 200.0 B/op 0.0% (stable)
QueryRange cache-hit allocations 7.0 allocs/op 7.0 allocs/op 0.0% (stable)
QueryRange cache-bypass CPU cost 1916.0 ns/op 1942.0 ns/op +1.4% (stable)
QueryRange cache-bypass memory 281.0 B/op 282.0 B/op +0.4% (stable)
QueryRange cache-bypass allocations 7.0 allocs/op 7.0 allocs/op 0.0% (stable)
Labels cache-hit CPU cost 631.8 ns/op 666.5 ns/op +5.5% (stable)
Labels cache-hit memory 48.0 B/op 48.0 B/op 0.0% (stable)
Labels cache-hit allocations 3.0 allocs/op 3.0 allocs/op 0.0% (stable)
Labels cache-bypass CPU cost 773.2 ns/op 779.0 ns/op +0.8% (stable)
Labels cache-bypass memory 53.0 B/op 52.0 B/op -1.9% (stable)
Labels cache-bypass allocations 3.0 allocs/op 3.0 allocs/op 0.0% (stable)

State

  • Coverage, compatibility, and sampled performance are reported here from the same PR workflow.
  • This is a delta report, not a release gate by itself. Required checks still decide merge safety.
  • Performance is a smoke comparison, not a full benchmark lab run.
  • Delta states use the same noise guards as the quality gate (percent + absolute + low-baseline checks), so report labels match merge-gate behavior.

@szibis szibis merged commit 7eef6a9 into main May 13, 2026
48 checks passed
@szibis szibis deleted the docs/accuracy-audit-pr branch May 13, 2026 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Bug fix scope/docs Documentation scope/proxy Proxy core scope/tests Tests size/M Medium change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant