Optimize index_writer by gshigin · Pull Request #379 · deckhouse/prompp

gshigin · 2026-06-15T10:35:04Z

Summary

Adds an entry-point benchmark for the TSDB index writer and optimizes two hot paths it exposes.

Changes

Benchmark

New index_writer_symbols_benchmark (pp/series_index/benchmarks/) that times each IndexWriter entry-point call (write_header, write_symbols in the no-shrink / fixed / shrunk states, write_series, write_label_indices, one write_postings batch, table offsets, TOC) against a real serialized LSS loaded from lss_file.
Review follow-ups: added the missing <algorithm> include, fail loudly when lss_file is absent, documented the single-threaded timing assumption and the safe mutate-while-iterating in mark_all_series_as_added.

Symbol collection (`index_write_context`)

collect_current no longer special-cases the shrunk state; symbol count is small and filtering per current series did not pay off.
estimate_count simplified (dropped the redundant is_shrunk branch).
Snapshot symbols are now enumerated once: for_each_snapshot_key_id / for_each_snapshot_value_id replace the single callback that re-emitted each name symbol per value.
Symbol table now built via a flat_hash_map keyed by the resolved string, then the unique keys are sorted once at the end. This replaces the original "collect all ids into a vector, sort with a string-resolving comparator, deduplicate" path, whose comparator re-resolved strings on every comparison (O(N log N) lookups) and dominated write_symbols in the shrunk state, where the collected vector is heavily duplicated. Each symbol is resolved exactly once at insertion; ordering costs a single sort over the ~46k unique strings instead of being maintained per insert. (An intermediate btree_map variant was also tried but the hash-map + single sort proved faster, see results.)
Ids sharing a string are grouped via intrusive singly-linked lists over a single pre-allocated node pool (one node per collected id, the map stores each list head). This replaces the per-string std::vector, removing ~46k small allocations (one per unique symbol) in favor of a single pool allocation.
The sort caches the first 8 bytes of each unique string as a big-endian integer inline next to the string_view. Most comparisons are resolved on this integer prefix without dereferencing the view into scattered LSS memory; only equal prefixes fall back to a full string compare. The list head is carried in the sort entry too, removing the per-symbol hash lookup when building the reverse index. (The sort was the single dominant component of write_symbols, ~5.5ms; a cedar radix-tree variant that avoids sorting entirely was also tried but its insertion cost was far higher, so it was rejected. A btree_map + linked-list pool variant — which keeps the strings sorted on insert and needs no prefix cache — was also re-measured on top of the pool, but it stayed 34–56% slower than flat_hash_map + single prefix-cached sort (dev box, µs: no shrink 10760 vs 8007, fixed 10595 vs 7905, after shrink 17031 vs 10940): the btree pays an O(log N) cache-missing string compare on every insert, which the prefix cache exists to avoid.)
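The three steps above (hash-map grouping, one node pool, prefix-cached sort) can be sketched together as follows. This is a minimal illustration, not the PR's code: `std::unordered_map` stands in for `flat_hash_map`, the names `SortEntry`, `SymbolTable`, `prefix_of` and `build_symbol_table` are hypothetical, and the prefix is packed with a portable shift loop rather than `memcpy` plus a byte swap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; std::unordered_map stands in for flat_hash_map.
constexpr uint32_t kNoNode = UINT32_MAX;

struct SymbolIdNode {
  uint32_t id;    // collected symbol id
  uint32_t next;  // index of the next node in the same list, or kNoNode
};

struct SortEntry {
  uint64_t prefix;          // first 8 bytes of the string, big-endian, zero-padded
  std::string_view symbol;  // full string, compared only when prefixes tie
  uint32_t head;            // head of the id list for this string
};

struct SymbolTable {
  std::vector<SymbolIdNode> nodes;  // single pool, one node per collected id
  std::vector<SortEntry> sorted;    // unique strings in ascending order
};

// Packs up to 8 leading bytes so that integer order matches bytewise string order.
inline uint64_t prefix_of(std::string_view s) {
  uint64_t p = 0;
  for (size_t i = 0; i < sizeof(uint64_t); ++i) {
    p = (p << 8) | (i < s.size() ? static_cast<unsigned char>(s[i]) : 0u);
  }
  return p;
}

// resolved[i] is the string that collected id i resolves to.
inline SymbolTable build_symbol_table(const std::vector<std::string_view>& resolved) {
  SymbolTable t;
  t.nodes.reserve(resolved.size());
  std::unordered_map<std::string_view, uint32_t> heads;  // string -> list head
  for (uint32_t id = 0; id < resolved.size(); ++id) {
    auto [it, inserted] = heads.try_emplace(resolved[id], kNoNode);
    (void)inserted;
    t.nodes.push_back({id, it->second});  // prepend to this string's list
    it->second = static_cast<uint32_t>(t.nodes.size() - 1);
  }
  t.sorted.reserve(heads.size());
  for (const auto& kv : heads) {
    t.sorted.push_back({prefix_of(kv.first), kv.first, kv.second});
  }
  std::sort(t.sorted.begin(), t.sorted.end(), [](const SortEntry& a, const SortEntry& b) {
    if (a.prefix != b.prefix) return a.prefix < b.prefix;  // cheap integer compare
    return a.symbol < b.symbol;                            // full compare on ties only
  });
  return t;
}
```

Walking `nodes` from `sorted[i].head` through `next` until `kNoNode` yields every collected id that resolves to the i-th sorted symbol, which is what the reverse index needs without a second hash lookup.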

Postings (`series references`)

Replaced the flat_hash_map<LabelSetID, SeriesReference> with a dense, pre-allocated vector indexed by LabelSetID. Series ids are dense and the total count is known up front, so a single allocation suffices. This removes the per-id hash lookup while writing postings. A written series reference is never zero (the symbols table precedes the series section), so zero is used as the "not written" sentinel.
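A minimal sketch of the dense lookup with a zero sentinel is shown below. The 32-bit widths of `LabelSetID` and `SeriesReference` and the `SeriesReferences` wrapper class are assumptions made for the illustration; only the idea (one pre-sized vector indexed by id, zero meaning "not written") comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed widths; the real typedefs may differ.
using LabelSetID = uint32_t;
using SeriesReference = uint32_t;

// Zero can serve as the sentinel because a written reference is never zero.
constexpr SeriesReference kUnwrittenSeriesReference = 0;

// Hypothetical wrapper around the dense vector.
class SeriesReferences {
 public:
  // One allocation up front: the total series count is known in advance.
  explicit SeriesReferences(size_t series_count)
      : refs_(series_count, kUnwrittenSeriesReference) {}

  void set(LabelSetID id, SeriesReference ref) {
    assert(ref != kUnwrittenSeriesReference);  // would collide with the sentinel
    refs_[id] = ref;
  }

  bool is_written(LabelSetID id) const { return refs_[id] != kUnwrittenSeriesReference; }

  // Plain indexed load; no hashing or probing on the postings path.
  SeriesReference get(LabelSetID id) const { return refs_[id]; }

 private:
  std::vector<SeriesReference> refs_;
};
```

The trade-off is memory proportional to the total series count even when only some series are written, which is acceptable here because ids are dense.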

`write_symbols` cgo call

write_symbols is the longest single C call on the index-writing path (a few ms, tens of ms in the shrunk state). Every other entry point goes through fastcgo, which runs on the system stack without releasing the P — fine for short calls, but for this one it stalls the Go scheduler and GC for the whole duration. write_symbols now uses a regular cgo call, which parks the goroutine in _Gsyscall and frees the P; the ~tens-of-ns cgo overhead is negligible against a multi-ms call. The result header is mirrored with a uintptr data field so the struct carries no Go pointer type (the buffer is always nil or prompp-allocated), keeping the call clear of the vet/cgocheck "Go pointer to C" guard.

Postings: batched cgo call (bounds the transient buffer)

WriteRestTo originally emitted postings in 1 MiB batches (a WriteNextPostingsBatch(max_batch_size) -> (data, has_more_data) loop), so each cgo call did ~8 ms of work on production hardware. We first tried to bound a call to ~1 ms by shrinking the batch, but the byte bound is only checked after a full posting, and individual large postings (the all-series posting, hot label values) are emitted atomically — so the tail cannot be split. The sweep below makes this concrete (prod-like x86_64, all series written, real LSS; per-call µs):

| batch size | batches | calls > 1 ms | mean / call (µs) | max / call (µs) |
| --- | --- | --- | --- | --- |
| 1 MiB | 38 | 38 | 8715 | 30921 |
| 64 KiB | 376 | 40 | 857 | 30610 |
| 16 KiB | 1065 | 35 | 299 | 29641 |

Smaller batches only multiply the number of tiny cgo calls (1065 calls at 16 KiB) while the calls > 1 ms count and the ~30 ms max (the all-series posting) barely move — batching can't meet the 1 ms latency goal.
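The effect can be reproduced with a toy model in which each posting is appended whole and the byte bound is tested only afterwards. The function name `simulate_batches` and the posting sizes used with it are illustrative assumptions, not measurements from the LSS.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct BatchStats {
  size_t batches = 0;          // number of calls needed to drain all postings
  size_t max_batch_bytes = 0;  // largest single batch, in bytes
};

// Hypothetical model: a posting is never split, so one oversized posting
// produces one oversized batch regardless of max_batch_size.
inline BatchStats simulate_batches(const std::vector<size_t>& posting_sizes, size_t max_batch_size) {
  BatchStats stats;
  size_t current = 0;
  for (size_t size : posting_sizes) {
    current += size;                  // append the whole posting first
    if (current >= max_batch_size) {  // then check the byte bound
      ++stats.batches;
      stats.max_batch_bytes = std::max(stats.max_batch_bytes, current);
      current = 0;
    }
  }
  if (current > 0) {  // trailing partial batch
    ++stats.batches;
    stats.max_batch_bytes = std::max(stats.max_batch_bytes, current);
  }
  return stats;
}
```

With one 5 000 000-byte posting followed by a hundred 1000-byte postings, shrinking the bound from 1 MiB to 16 KiB raises the batch count from 2 to 7 while the largest batch stays at 5 000 000 bytes, which mirrors the shape of the sweep.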

We briefly dropped batching for a single un-batched write_postings call, but that is not viable for memory: a single call buffers the entire postings section into one prompp-allocated buffer before flushing. On the real LSS (~1.2M series) that buffer measured 50 904 044 bytes (~48.5 MiB) — larger than the whole serialized LSS (~30 MB), an unacceptable transient.

So batching is kept, but the per-batch call now goes through a regular cgo call (not fastcgo): prompp_index_writer_write_postings(writer, max_batch_size) writes one batch into the writer's internal buffer and sets a has_more_postings flag (a stable pointer returned by the constructor, like the output buffer) that Go reads to decide whether to loop. Go flushes each 64 KiB batch to the writer and reuses the buffer, so the transient is bounded by one batch plus the single largest atomic posting (a few MiB) instead of the whole ~50 MB section. Each batch parks the goroutine in _Gsyscall and frees the P for its duration, and only the writer pointer (a stable prompp-arena address) and the scalar batch size cross the boundary by value — no goroutine stack pointer is handed to C. The benchmark times one steady-state 64 KiB batch (the first batch, which emits the unsplittable all-series posting, is skipped untimed).
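The writer-owned buffer and the `has_more_postings` flag can be sketched in plain C++ as below. `BatchedPostingsWriter` and `drain` are hypothetical stand-ins: the real writer, its entry point signature and the Go-side loop are not shown in this description, and the cgo boundary itself is not modelled.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the writer; the real type and signatures may differ.
class BatchedPostingsWriter {
 public:
  explicit BatchedPostingsWriter(std::vector<std::string> postings)
      : postings_(std::move(postings)) {
    has_more_ = !postings_.empty();
  }

  // Addresses handed out once; they stay valid as long as the writer is not moved.
  const std::string* buffer() const { return &buffer_; }
  const bool* has_more() const { return &has_more_; }

  // Writes one batch into the internal buffer and updates the flag.
  void write_postings(size_t max_batch_size) {
    buffer_.clear();  // reuse the same buffer for every batch
    while (next_ < postings_.size()) {
      buffer_ += postings_[next_++];                // a posting is never split
      if (buffer_.size() >= max_batch_size) break;  // bound checked after it
    }
    has_more_ = next_ < postings_.size();
  }

 private:
  std::vector<std::string> postings_;
  size_t next_ = 0;
  std::string buffer_;
  bool has_more_ = false;
};

// Caller-side loop: flush each batch to `out`, then ask for the next one.
// Only the writer and the scalar batch size are passed on each call.
inline size_t drain(BatchedPostingsWriter& writer, size_t max_batch_size, std::string& out) {
  const std::string* buffer = writer.buffer();
  const bool* has_more = writer.has_more();
  size_t calls = 0;
  while (*has_more) {
    writer.write_postings(max_batch_size);
    out += *buffer;
    ++calls;
  }
  return calls;
}
```

In this model the transient is the size of `buffer_`, which is at most one batch bound plus the largest single posting, rather than the whole postings section.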

Benchmark results

Real serialized LSS, --benchmark_repetitions=10, min per entry-point call (microseconds).

Note: the write_next_postings_batch / write_postings_table_offsets rows below are the original ARM dev-box numbers measured with a single series written (a full index traversal). Postings are written in 64 KiB batches; for the series-references map → vector comparison see the postings table in "Production-like hardware" below.

| entry-point call | before (µs) | after (µs) | delta |
| --- | --- | --- | --- |
| `write_header` | 0.083 | 1.58 | jitter (sub-us) |
| `write_symbols` (no shrink) | 13638 | 8384 | -39% |
| `write_symbols` (fixed state) | 13591 | 8122 | -40% |
| `write_symbols` (after shrink) | 37707 | 10995 | -71% |
| `write_next_series_batch` | 2.21 | 1.79 | jitter (sub-us) |
| `write_label_indices` | 3097 | 3245 | noise |
| `write_next_postings_batch` | 24678 | 16371 | non-representative (see note + prod-like postings table) |
| `write_label_indices_table` | 3.08 | 3.25 | jitter (sub-us) |
| `write_postings_table_offsets` | 0.500 | 0.750 | jitter (sub-us) |
| `write_table_of_contents` | 0.083 | 0.125 | jitter (sub-us) |

Symbol-collection structure comparison (write_symbols, min us):

| state | original (vector sort+dedup) | btree_map | flat_hash_map + single sort | + linked-list pool | + 8-byte prefix sort |
| --- | --- | --- | --- | --- | --- |
| no shrink | 13638 | 13414 | 12671 | 11502 | 8384 |
| fixed state | 13591 | 13471 | 13038 | 11390 | 8122 |
| after shrink | 37707 | 22002 | 16935 | 15815 | 10995 |

(The flat_hash_map + single sort and + linked-list pool columns are from an earlier benchmarking session; the + linked-list pool and + 8-byte prefix sort columns were re-measured back-to-back on the same machine for a fair delta.)

Production-like hardware (x86_64, 4 cores)

The numbers above were taken on an Apple-silicon dev box. The branch history was reordered so the benchmark and snapshot fixes land first and establish an honest baseline (d57f3b723: fair benchmark + snapshot names collected once, no optimizations yet); the optimizations are then stacked on top. The full progression was re-measured back-to-back on a production-like x86_64 machine (--copt=-march=native, --benchmark_repetitions=10, min, µs):

| stage | no shrink | fixed state | after shrink |
| --- | --- | --- | --- |
| `d57f3b723` baseline (bench + snapshot fixes) | 21 686 | 22 151 | 51 513 |
| + dense series-references vector | 24 430 | 23 870 | 55 018 |
| + btree symbol table | 22 609 | 22 026 | 37 117 |
| + `flat_hash_map` + single sort | 30 480 | 30 108 | 49 139 |
| + linked-list pool | 18 115 | 18 753 | 33 909 |
| + 8-byte prefix sort | 13 271 | 12 817 | 24 747 |

(Vector::resize(size, value) and "store resolved strings" are plumbing-only steps between btree and flat_hash_map; they don't move write_symbols and are omitted.)

End-to-end on prod-like hardware (write_symbols, honest baseline → tip): no shrink -39% (21.7k → 13.3k), fixed -42% (22.2k → 12.8k), after shrink -52% (51.5k → 24.7k). Notable honesty point: the flat_hash_map + single sort step on its own regresses vs the btree (after shrink 37.1k → 49.1k µs); it only pays off once the linked-list pool removes the per-symbol allocations and the 8-byte prefix cache speeds the final sort.

Postings — series-references map → dense vector (prod-like x86_64, real LSS, min of 10, µs). Measured per call at a fixed configuration so the only variable is the lookup structure:

| stage | write postings (µs) |
| --- | --- |
| series-references `flat_hash_map` | 15 145 |
| dense series-references vector | 6 820 |

Replacing the per-id hash lookup with a dense vector roughly halves the postings work (-55%). This win is a property of the lookup and is independent of batching (see "Postings: batched cgo call (bounds the transient buffer)"). Symbol-collection commits don't touch postings.

Symbol statistics (same real LSS)

The LSS contains 45 788 unique symbols. Label values are stored independently per key, so a value string is collected once per key it appears under (and again from the snapshot after shrink); the number of collected symbol-id entries is therefore much larger than the unique count:

| state | collected symbol ids | unique symbols |
| --- | --- | --- |
| no shrink | 68 972 | 45 788 |
| after shrink | 178 514 | 45 788 |

Distribution of how many collected ids back each unique symbol (i.e. duplicate multiplicity):

| occurrences | no shrink | after shrink |
| --- | --- | --- |
| 1x | 30 724 | 5 973 |
| 2x | 12 543 | 26 422 |
| 3x | 456 | 1 899 |
| 4x | 918 | 8 798 |
| 5x..10x | 977 | 1 926 |
| 11x..100x | 170 | 706 |
| >100x | 0 | 64 |
| max multiplicity | 29x | 4673x |

Takeaways:

Without shrink ~67% of symbols are unique (appear once); the duplication is mild and bounded (max 29x).
After shrink only ~13% are unique: the snapshot roughly doubles every symbol (the "2x" bucket dominates), the collected vector grows 69k -> 179k, and the tail is unbounded (a single hot label value reaches 4673 occurrences). This is exactly what made the old sort+dedup write_symbols path expensive in the shrunk state.
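For reference, a distribution with these buckets can be computed from the resolved strings of the collected ids roughly as follows. This is a sketch under assumptions: the function name `multiplicity_histogram` is hypothetical and the input is taken to be one resolved string per collected id.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <unordered_map>
#include <vector>

// Buckets, in order: 1x, 2x, 3x, 4x, 5x..10x, 11x..100x, >100x.
inline std::array<size_t, 7> multiplicity_histogram(const std::vector<std::string_view>& collected) {
  // Count how many collected ids resolve to each unique string.
  std::unordered_map<std::string_view, size_t> counts;
  for (std::string_view s : collected) ++counts[s];

  std::array<size_t, 7> buckets{};
  for (const auto& kv : counts) {
    const size_t n = kv.second;
    const size_t bucket = n <= 4 ? n - 1 : n <= 10 ? 4 : n <= 100 ? 5 : 6;
    ++buckets[bucket];
  }
  return buckets;
}
```

The bucket totals sum to the number of unique symbols, which gives a quick consistency check against the unique-symbol count.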

Memory

Measured with jemalloc (--enable-prof) on the same real LSS. Two complementary views.

Cumulative allocations over the whole benchmark run

MALLOC_CONF=prof:true,prof_accum:true,prof_final:true,lg_prof_sample:0 (every allocation sampled); the t* total line of the final heap profile. This is whole-program cumulative (not peak) and inflated by the many IndexWriter constructions in the benchmark, but the structure is identical across commits so the trend is meaningful.

| stage | allocations | bytes |
| --- | --- | --- |
| original PR | 390 074 | 626.4 MB |
| + dense series-references vector | 390 078 | 701.4 MB |
| + `flat_hash_map` + per-symbol `std::vector` | 1 697 378 | 842.5 MB |
| + linked-list pool | 390 114 | 791.4 MB |
| + 8-byte prefix sort | 390 133 | 819.9 MB |

The per-symbol std::vector approach exploded the allocation count (~46k tiny vectors per build); the linked-list pool brought it back to the baseline (~4.35x fewer allocations) and also cut bytes by ~51 MB. The remaining byte growth vs the original is the deliberate space-for-speed trade (dense series-references vector, prefix-sort cache).

Peak transient memory per `write_symbols` call

jemalloc thread.peak reset/read around the call (whole-program peak is useless here — it is dominated by the static multi-hundred-MB LSS).

| stage | no shrink / fixed | after shrink |
| --- | --- | --- |
| original (in-place sort + dedup) | 3.75 MB | 3.75 MB |
| `flat_hash_map` + per-symbol `std::vector` | 7.18 MB | 14.28 MB |
| + linked-list pool | 5.00 MB | 9.75 MB |
| + 8-byte prefix sort | 6.50 MB | 11.25 MB |

Peak transient memory grew from ~3.75 MB to 6.5/11.25 MB — single-digit MB, negligible against the LSS itself — in exchange for the speedups above. The per-symbol-vector approach was the worst on both allocation count and peak; the pool fixed it, the prefix cache adds ~1.5 MB for the sort.

Note on the original row being flat across states: in the original version the transient was dominated by the serialized-symbols output buffer written to the stream, which is sized by the unique symbols (~1.98 MB of payload, identical in both states) and does not depend on the collected count. The collected symbol_ids vector does grow after shrink (≈0.55 MB → ≈1.4 MB), but it is freed during rebuild() before the stream is written, and it never exceeds the output buffer, so it does not move the peak. The newer versions introduce build-time structures (node pool, hash map, sort cache) that scale with the collected count and exceed the output buffer, which is why their peak does reflect the shrink state.

To think about (not in this PR)

`write_label_indices` cost breakdown

write_label_indices takes ~3.15 ms on the dev box (real LSS). We expected the per-value symbol-ref map lookup and the per-uint32 CRC32 to dominate, but a subtractive measurement (compile out one operation at a time, attribute the delta) showed otherwise:

| component | cost | share |
| --- | --- | --- |
| trie enumeration + loop/bookkeeping | ≈ 2.30 ms | ~73% |
| `symbol_refs_` map lookup | ≈ 0.36 ms | ~11% |
| CRC32 (per `uint32`) | ≈ 0.27 ms | ~9% |
| stream write + byteswap | ≈ 0.21 ms | ~7% |

The cost is dominated by walking the names_trie and the per-name values_trie via make_enumerative_iterator() (Cedar/double-array), not by the lookup/CRC/write we suspected. Optimizing the lookup or CRC would buy at most ~20%.
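The subtractive attribution itself is simple arithmetic and can be sketched as below. The helper `attribute` and its inputs are hypothetical; the method also assumes the component costs are roughly additive, which a real measurement should sanity-check.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Component {
  std::string name;
  double cost_ms;  // full run time minus run time with this component compiled out
  double share;    // cost_ms as a fraction of the full run time
};

// without_ms[i] is the measured run time with component names[i] compiled out.
// Whatever is not attributed to a named component is reported as "everything else".
inline std::vector<Component> attribute(double full_ms,
                                        const std::vector<std::string>& names,
                                        const std::vector<double>& without_ms) {
  std::vector<Component> out;
  double attributed = 0.0;
  for (size_t i = 0; i < names.size(); ++i) {
    const double cost = full_ms - without_ms[i];
    attributed += cost;
    out.push_back({names[i], cost, cost / full_ms});
  }
  const double rest = full_ms - attributed;
  out.push_back({"everything else", rest, rest / full_ms});
  return out;
}
```

Feeding it a full time of 3.14 ms and reduced times of 2.78, 2.87 and 2.93 ms (values back-derived from the table above purely for illustration) attributes about 0.36, 0.27 and 0.21 ms to the three named operations and leaves about 2.30 ms, roughly 73%, for the trie walk and bookkeeping.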

Ideas to revisit:

Replace the nested values_trie enumeration with a cheaper per-name value source (e.g. whatever the reverse index can already provide contiguously).
Precompute, during the symbol-table build, flat per-name lists of value-sorted symbol refs so write_label_indices just streams ready refs — removing both the trie walk and the per-value map lookup at once.

Test plan

bazel test //:series_index_test (incl. symbols / shrunk / postings / series-writer / label-indices)
Benchmark runs against a real LSS fixture

- benchmark: add missing <algorithm> include, fail loudly when lss_file is absent, document single-threaded timing assumption and the safe mutate-while-iterating in mark_all_series_as_added
- index_write_context: simplify estimate_count (drop redundant is_shrunk branch) and fix non-ASCII characters in comments

Co-authored-by: Cursor <cursoragent@cursor.com>

The snapshot resolver only enumerated values() and collect_snapshot emitted the key-only name symbol for every value, so a high-cardinality label name produced one name entry per value (a single hot name reached ~4.7k copies on a real LSS). Split the resolver into for_each_key_id (names, once each) and for_each_value_id (values), mirroring the current-side collection. This shrinks the collected symbol vector in the shrunk state and cuts write_symbols (after shrink) by a further ~16% on top of the btree change. Co-authored-by: Cursor <cursoragent@cursor.com>

The index-writer benchmark wrote a single series, so the output never grew, the byte-based batch never filled and one write_postings call walked the whole index (the postings / table-offset / TOC numbers were a full traversal, not a batch). Write every series before measuring postings / table offsets / TOC, and sample a steady-state batch (skipping the unsplittable all-series first batch). This makes write_next_postings_batch comparable across the optimization progression (e.g. the dense series-references vector) independently of the later batch-size change. Co-authored-by: Cursor <cursoragent@cursor.com>

Series ids are dense and known up front, so index the series references by LabelSetID in a single pre-allocated vector instead of a flat_hash_map. This removes the per-id hash lookup during postings writing. A written series reference is never zero (the symbols table precedes the series section), so zero is used as the "not written" sentinel. Co-authored-by: Cursor <cursoragent@cursor.com>

Collect symbols into a btree_map keyed by the resolved string so each symbol is resolved exactly once (at insertion) and the table comes out already sorted and deduplicated. The previous approach sorted a vector of symbol ids with a comparator that re-resolved strings on every comparison (O(N log N) lookups), which dominated write_symbols in the shrunk state where the collected vector is heavily duplicated. write_symbols for a shrunk LSS drops by ~37% on a real LSS. Co-authored-by: Cursor <cursoragent@cursor.com>

Adds a resize overload that fills newly added elements with a given value, so callers no longer have to follow resize() with a manual std::fill (which was also a footgun for trivial element types that resize() leaves uninitialized). Use it for the dense series-references index in IndexWriter and its tests. Co-authored-by: Cursor <cursoragent@cursor.com>

The btree key is already the resolved symbol string, so keep it in symbols_ as a string_view (valid for the LSS lifetime) instead of an ExportSymbolId. SymbolsWriter walks the table twice (size estimation + write), so this drops two resolves per unique symbol; correctness is unchanged and output is byte-identical. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace the btree_map used for symbol collection with a flat_hash_map and sort the unique keys once at the end instead of maintaining sorted order on every insert. write_symbols (after_shrink) drops ~23% (22.0ms -> 16.9ms). Co-authored-by: Cursor <cursoragent@cursor.com>

Replace the per-string std::vector in the symbol-collection map with intrusive singly-linked lists over a single pre-allocated node pool (one node per collected id). This drops ~46k small per-symbol vector allocations into one pool allocation. write_symbols improves a further ~16-18% across states (after_shrink 16.9ms -> 14.2ms). Co-authored-by: Cursor <cursoragent@cursor.com>

Sort unique symbols by an inline big-endian 8-byte prefix carried next to the string_view, so most comparisons avoid chasing the view into scattered LSS memory; equal prefixes fall back to a full string compare. The list head is carried in the sort entry too, removing the per-symbol hash lookup when building the reverse index. write_symbols improves ~27-30% across states. Co-authored-by: Cursor <cursoragent@cursor.com>

WriteRestTo emitted postings in 1 MiB batches, so a single WriteNextPostingsBatch call did ~8ms of work on production hardware. Drop the bound to 64 KiB so a typical batch is ~1ms; the all-series posting and hot label values remain single atomic postings that can exceed the bound in one call. The benchmark's kPostingsBatchSize is lowered to match. Co-authored-by: Cursor <cursoragent@cursor.com>

write_symbols is the longest single C call on the index-writing path (multiple ms, tens of ms in the shrunk state). fastcgo runs on the system stack without releasing the P, which stalls the Go scheduler and GC for the whole call; a regular cgo call parks the goroutine in _Gsyscall and frees the P, at a ~tens-of-ns overhead that is negligible against a multi-ms call. res mirrors the []byte header with a uintptr data field so the struct carries no Go pointer type and the call stays clear of the vet/cgocheck "Go pointer to C" guard; the buffer is always nil or prompp-allocated. Co-authored-by: Cursor <cursoragent@cursor.com>

Even at 64 KiB the batch bound could not keep a call under 1 ms: the all-series posting and hot label values are single atomic postings (up to ~30 ms each) that the byte bound cannot split, so batching only multiplied the number of cgo calls without taming that tail. Drop it entirely: write_postings now writes the whole postings section in one pass, exposed as prompp_index_writer_write_postings and invoked via a regular cgo call (like write_symbols) so the long call parks the goroutine in _Gsyscall and frees the P instead of blocking the scheduler. Removes the max_batch_size / has_more_data ABI, the Go batch loop in WriteRestTo and the benchmark's batch knob — simpler, faster and more predictable. Co-authored-by: Cursor <cursoragent@cursor.com>

The long write_symbols/write_postings cgo calls no longer thread a buffer through a goroutine-stack struct. The writer now owns its output buffer: each write_* method resets and fills it, the constructor returns a stable pointer to it, and Go reads the result from there. Only the writer pointer (a stable prompp-arena address) crosses the boundary, by value, so no goroutine stack address is handed to C and a concurrent GC stack move during a long call is harmless. The buffer is freed with the writer in the destructor, so no extra C type or return value is needed and the generated void-signature entrypoints stay compatible with the symbol dispatcher. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

write_label_indices walks the whole name/value trie index (a few ms), long enough that fastcgo blocking the P stalls the Go scheduler/GC for the duration. Like write_symbols/write_postings it now uses a regular cgo call so the goroutine parks in _Gsyscall and frees its P; only the writer pointer (a stable prompp-arena address) crosses the boundary by value. Co-authored-by: Cursor <cursoragent@cursor.com>

A single un-batched write_postings call buffers the entire postings section into one prompp-allocated buffer before flushing: on the real LSS (~1.2M series) that buffer reached ~50 MB, larger than the whole serialized LSS (~30 MB) — an unacceptable transient. Bring batching back, but keep the per-batch call as a regular cgo call (not fastcgo): write_postings(writer, max_batch_size) emits one batch into the writer's internal buffer and sets a has_more_postings flag exposed as a stable pointer from the constructor (like the output buffer), which Go reads to drain the section in a WriteNextPostingsBatch loop. Each 64 KiB batch is flushed and the buffer reused, so the transient is bounded by one batch plus the largest atomic posting instead of the whole section. Only the writer pointer (a stable prompp-arena address) and the scalar batch size cross the boundary by value. Restores the PostingsWriter/IndexWriter batch state, the C++ PartialWrite test, the Go writePostings loop and the benchmark's steady-state batch sample. Co-authored-by: Cursor <cursoragent@cursor.com>

Seed the test stream with a one-alignment stub standing in for the preceding sections so the first series lands at a non-zero, aligned offset, matching production. Offset 0 maps to the kUnwrittenSeriesReference sentinel and trips the new SeriesWriter assert. Co-authored-by: Cursor <cursoragent@cursor.com>

gshigin · 2026-06-23T12:48:24Z

  };
  struct Result {
-    IndexWriterPtr writer;
+    IndexWriterHandle* writer;


The general style in the entrypoints is to use smart pointers rather than manual memory management.

gshigin · 2026-06-23T13:31:01Z

  };
  struct Result {
-    IndexWriterPtr writer;
+    IndexWriterHandle* writer;


nit: exposing pointers to the data inside IndexWriterHandle is a strong code smell to me. It does help avoid unnecessary cgo calls, but it makes the API dirty. We need a clearer mechanism for the "C++ writes -> Go reads" pattern.

gshigin · 2026-06-23T13:42:57Z

+    // Group the ids that resolve to the same string using intrusive singly-linked lists
+    // over a single pre-allocated pool (exactly one node per collected id). The map keeps
+    // the head index of each list; ids are resolved once and prepended to their list.
+    std::vector<SymbolIdNode> nodes;


nit: Some places use BareBones::Vector and others use std::vector. In this particular case there's not much difference, but consistent container usage would be better.

gshigin · 2026-06-23T13:46:27Z

+    if (symbol.size() >= sizeof(uint64_t)) {
+      uint64_t prefix = 0;
+      std::memcpy(&prefix, symbol.data(), sizeof(uint64_t));
+      return __builtin_bswap64(prefix);


I think std::byteswap does the same thing, but it's standard rather than a compiler builtin.

gshigin added 2 commits June 11, 2026 10:50

bench and fix

980604e

update benchmark for all index writer

ee77012

gshigin requested a review from vporoshok as a code owner June 15, 2026 10:35

gshigin self-assigned this Jun 15, 2026

vporoshok changed the title ~~Index writer benchmark~~ Optimize index_writer Jun 17, 2026

vporoshok force-pushed the index_writer_benchmark branch from 85f0cc7 to 0c5e97f Compare June 18, 2026 14:19

vporoshok and others added 9 commits June 18, 2026 18:39

vporoshok force-pushed the index_writer_benchmark branch from 0c5e97f to 04c088d Compare June 18, 2026 14:40

vporoshok and others added 7 commits June 19, 2026 00:21

performance_tests: ignore local test_data fixtures

51e0f54

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge remote-tracking branch 'origin/pp' into index_writer_benchmark

9108f7c

vporoshok self-assigned this Jun 22, 2026

gshigin commented Jun 23, 2026

gshigin wants to merge 21 commits into pp from index_writer_benchmark

gshigin commented Jun 15, 2026 • edited by vporoshok
