Skip to content

Optimize index_writer#379

Open
gshigin wants to merge 21 commits into
ppfrom
index_writer_benchmark
Open

Optimize index_writer#379
gshigin wants to merge 21 commits into
ppfrom
index_writer_benchmark

Conversation

@gshigin

@gshigin gshigin commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds an entry-point benchmark for the TSDB index writer and optimizes two hot paths it exposes.

Changes

Benchmark

  • New index_writer_symbols_benchmark (pp/series_index/benchmarks/) that times each IndexWriter entry-point call (write_header, write_symbols in the no-shrink / fixed / shrunk states, write_series, write_label_indices, one write_postings batch, table offsets, TOC) against a real serialized LSS loaded from lss_file.
  • Review follow-ups: added the missing <algorithm> include, fail loudly when lss_file is absent, documented the single-threaded timing assumption and the safe mutate-while-iterating in mark_all_series_as_added.

Symbol collection (index_write_context)

  • collect_current no longer special-cases the shrunk state; symbol count is small and filtering per current series did not pay off.
  • estimate_count simplified (dropped the redundant is_shrunk branch).
  • Snapshot symbols are now enumerated once: for_each_snapshot_key_id / for_each_snapshot_value_id replace the single callback that re-emitted each name symbol per value.
  • Symbol table now built via a flat_hash_map keyed by the resolved string, then the unique keys are sorted once at the end. This replaces the original "collect all ids into a vector, sort with a string-resolving comparator, deduplicate" path, whose comparator re-resolved strings on every comparison (O(N log N) lookups) and dominated write_symbols in the shrunk state, where the collected vector is heavily duplicated. Each symbol is resolved exactly once at insertion; ordering costs a single sort over the ~46k unique strings instead of being maintained per insert. (An intermediate btree_map variant was also tried but the hash-map + single sort proved faster, see results.)
  • Ids sharing a string are grouped via intrusive singly-linked lists over a single pre-allocated node pool (one node per collected id, the map stores each list head). This replaces the per-string std::vector, removing ~46k small allocations (one per unique symbol) in favor of a single pool allocation.
  • The sort caches the first 8 bytes of each unique string as a big-endian integer inline next to the string_view. Most comparisons are resolved on this integer prefix without dereferencing the view into scattered LSS memory; only equal prefixes fall back to a full string compare. The list head is carried in the sort entry too, removing the per-symbol hash lookup when building the reverse index. (The sort was the single dominant component of write_symbols, ~5.5ms; a cedar radix-tree variant that avoids sorting entirely was also tried but its insertion cost was far higher, so it was rejected. A btree_map + linked-list pool variant — which keeps the strings sorted on insert and needs no prefix cache — was also re-measured on top of the pool, but it stayed 34–56% slower than flat_hash_map + single prefix-cached sort (dev box, µs: no shrink 10760 vs 8007, fixed 10595 vs 7905, after shrink 17031 vs 10940): the btree pays an O(log N) cache-missing string compare on every insert, which the prefix cache exists to avoid.)

Postings (series references)

  • Replaced the flat_hash_map<LabelSetID, SeriesReference> with a dense, pre-allocated vector indexed by LabelSetID. Series ids are dense and the total count is known up front, so a single allocation suffices. This removes the per-id hash lookup while writing postings. A written series reference is never zero (the symbols table precedes the series section), so zero is used as the "not written" sentinel.

write_symbols cgo call

  • write_symbols is the longest single C call on the index-writing path (a few ms, tens of ms in the shrunk state). Every other entry point goes through fastcgo, which runs on the system stack without releasing the P — fine for short calls, but for this one it stalls the Go scheduler and GC for the whole duration. write_symbols now uses a regular cgo call, which parks the goroutine in _Gsyscall and frees the P; the ~tens-of-ns cgo overhead is negligible against a multi-ms call. The result header is mirrored with a uintptr data field so the struct carries no Go pointer type (the buffer is always nil or prompp-allocated), keeping the call clear of the vet/cgocheck "Go pointer to C" guard.

Postings: batched cgo call (bounds the transient buffer)

WriteRestTo originally emitted postings in 1 MiB batches (a WriteNextPostingsBatch(max_batch_size) -> (data, has_more_data) loop), so each cgo call did ~8 ms of work on production hardware. We first tried to bound a call to ~1 ms by shrinking the batch, but the byte bound is only checked after a full posting, and individual large postings (the all-series posting, hot label values) are emitted atomically — so the tail cannot be split. The sweep below makes this concrete (prod-like x86_64, all series written, real LSS; per-call µs):

batch size batches calls > 1 ms mean / call max / call
1 MiB 38 38 8715 30921
64 KiB 376 40 857 30610
16 KiB 1065 35 299 29641

Smaller batches only multiply the number of tiny cgo calls (1065 calls at 16 KiB) while the calls > 1 ms count and the ~30 ms max (the all-series posting) barely move — batching can't meet the 1 ms latency goal.

We briefly dropped batching for a single un-batched write_postings call, but that is not viable for memory: a single call buffers the entire postings section into one prompp-allocated buffer before flushing. On the real LSS (~1.2M series) that buffer measured 50 904 044 bytes (~48.5 MiB) — larger than the whole serialized LSS (~30 MB), an unacceptable transient.

So batching is kept, but the per-batch call now goes through a regular cgo call (not fastcgo): prompp_index_writer_write_postings(writer, max_batch_size) writes one batch into the writer's internal buffer and sets a has_more_postings flag (a stable pointer returned by the constructor, like the output buffer) that Go reads to decide whether to loop. Go flushes each 64 KiB batch to the writer and reuses the buffer, so the transient is bounded by one batch plus the single largest atomic posting (a few MiB) instead of the whole ~50 MB section. Each batch parks the goroutine in _Gsyscall and frees the P for its duration, and only the writer pointer (a stable prompp-arena address) and the scalar batch size cross the boundary by value — no goroutine stack pointer is handed to C. The benchmark times one steady-state 64 KiB batch (the first batch, which emits the unsplittable all-series posting, is skipped untimed).

Benchmark results

Real serialized LSS, --benchmark_repetitions=10, min per entry-point call (microseconds).

Note: the write_next_postings_batch / write_postings_table_offsets rows below are the original ARM dev-box numbers measured with a single series written (a full index traversal). Postings is written in 64 KiB batches; for the series-references map → vector comparison see the postings table in "Production-like hardware" below.

entry-point call before after delta
write_header 0.083 1.58 jitter (sub-us)
write_symbols (no shrink) 13638 8384 -39%
write_symbols (fixed state) 13591 8122 -40%
write_symbols (after shrink) 37707 10995 -71%
write_next_series_batch 2.21 1.79 jitter (sub-us)
write_label_indices 3097 3245 noise
write_next_postings_batch 24678 16371 non-representative (see note + prod-like postings table)
write_label_indices_table 3.08 3.25 jitter (sub-us)
write_postings_table_offsets 0.500 0.750 jitter (sub-us)
write_table_of_contents 0.083 0.125 jitter (sub-us)

Symbol-collection structure comparison (write_symbols, min us):

state original (vector sort+dedup) btree_map flat_hash_map + single sort + linked-list pool + 8-byte prefix sort
no shrink 13638 13414 12671 11502 8384
fixed state 13591 13471 13038 11390 8122
after shrink 37707 22002 16935 15815 10995

(The flat_hash_map + single sort and + linked-list pool columns are from an earlier benchmarking session; the + linked-list pool and + 8-byte prefix sort columns were re-measured back-to-back on the same machine for a fair delta.)

Production-like hardware (x86_64, 4 cores)

The numbers above were taken on an Apple-silicon dev box. The branch history was reordered so the benchmark and snapshot fixes land first and establish an honest baseline (d57f3b723: fair benchmark + snapshot names collected once, no optimizations yet); the optimizations are then stacked on top. The full progression was re-measured back-to-back on a production-like x86_64 machine (--copt=-march=native, --benchmark_repetitions=10, min, µs):

stage no shrink fixed state after shrink
d57f3b723 baseline (bench + snapshot fixes) 21 686 22 151 51 513
+ dense series-references vector 24 430 23 870 55 018
+ btree symbol table 22 609 22 026 37 117
+ flat_hash_map + single sort 30 480 30 108 49 139
+ linked-list pool 18 115 18 753 33 909
+ 8-byte prefix sort 13 271 12 817 24 747

(Vector::resize(size, value) and "store resolved strings" are plumbing-only steps between btree and flat_hash_map; they don't move write_symbols and are omitted.)

End-to-end on prod-like hardware (write_symbols, honest baseline → tip): no shrink -38% (21.7k → 13.4k), fixed -42% (22.2k → 12.8k), after shrink -53% (51.5k → 24.3k). Notable honesty point: the flat_hash_map + single sort step on its own regresses vs the btree (after shrink 37.1k → 49.1k µs) — it only pays off once the linked-list pool removes the per-symbol allocations and the 8-byte prefix cache speeds the final sort.

Postings — series-references map → dense vector (prod-like x86_64, real LSS, min of 10, µs). Measured per call at a fixed configuration so the only variable is the lookup structure:

stage write postings
series-references flat_hash_map 15 145
dense series-references vector 6 820

Replacing the per-id hash lookup with a dense vector roughly halves the postings work (-55%). This win is a property of the lookup and is independent of batching (see "Postings: batched cgo call (bounds the transient buffer)"). Symbol-collection commits don't touch postings.

Symbol statistics (same real LSS)

The LSS contains 45 788 unique symbols. Label values are stored independently per key, so a value string is collected once per key it appears under (and again from the snapshot after shrink); the number of collected symbol-id entries is therefore much larger than the unique count:

state collected symbol ids unique symbols
no shrink 68 972 45 788
after shrink 178 514 45 788

Distribution of how many collected ids back each unique symbol (i.e. duplicate multiplicity):

occurrences no shrink after shrink
1x 30 724 5 973
2x 12 543 26 422
3x 456 1 899
4x 918 8 798
5x..10x 977 1 926
11x..100x 170 706
>100x 0 64
max multiplicity 29x 4673x

Takeaways:

  • Without shrink ~67% of symbols are unique (appear once); the duplication is mild and bounded (max 29x).
  • After shrink only ~13% are unique: the snapshot roughly doubles every symbol (the "2x" bucket dominates), the collected vector grows 69k -> 179k, and the tail is unbounded (a single hot label value reaches 4673 occurrences). This is exactly what made the old sort+dedup write_symbols path expensive in the shrunk state.

Memory

Measured with jemalloc (--enable-prof) on the same real LSS. Two complementary views.

Cumulative allocations over the whole benchmark run

MALLOC_CONF=prof:true,prof_accum:true,prof_final:true,lg_prof_sample:0 (every allocation sampled); the t* total line of the final heap profile. This is whole-program cumulative (not peak) and inflated by the many IndexWriter constructions in the benchmark, but the structure is identical across commits so the trend is meaningful.

stage allocations bytes
original PR 390 074 626.4 MB
+ dense series-references vector 390 078 701.4 MB
+ flat_hash_map + per-symbol std::vector 1 697 378 842.5 MB
+ linked-list pool 390 114 791.4 MB
+ 8-byte prefix sort 390 133 819.9 MB

The per-symbol std::vector approach exploded the allocation count (~46k tiny vectors per build); the linked-list pool brought it back to the baseline (~4.35x fewer allocations) and also cut bytes by ~51 MB. The remaining byte growth vs the original is the deliberate space-for-speed trade (dense series-references vector, prefix-sort cache).

Peak transient memory per write_symbols call

jemalloc thread.peak reset/read around the call (whole-program peak is useless here — it is dominated by the static multi-hundred-MB LSS).

stage no shrink / fixed after shrink
original (in-place sort + dedup) 3.75 MB 3.75 MB
flat_hash_map + per-symbol std::vector 7.18 MB 14.28 MB
+ linked-list pool 5.00 MB 9.75 MB
+ 8-byte prefix sort 6.50 MB 11.25 MB

Peak transient memory grew from ~3.75 MB to 6.5/11.25 MB — single-digit MB, negligible against the LSS itself — in exchange for the speedups above. The per-symbol-vector approach was the worst on both allocation count and peak; the pool fixed it, the prefix cache adds ~1.5 MB for the sort.

Note on the original row being flat across states: in the original version the transient was dominated by the serialized-symbols output buffer written to the stream, which is sized by the unique symbols (~1.98 MB of payload, identical in both states) and does not depend on the collected count. The collected symbol_ids vector does grow after shrink (≈0.55 MB → ≈1.4 MB), but it is freed during rebuild() before the stream is written, and it never exceeds the output buffer, so it does not move the peak. The newer versions introduce build-time structures (node pool, hash map, sort cache) that scale with the collected count and exceed the output buffer, which is why their peak does reflect the shrink state.

To think about (not in this PR)

write_label_indices cost breakdown

write_label_indices takes ~3.15 ms on the dev box (real LSS). We expected the per-value symbol-ref map lookup and the per-uint32 CRC32 to dominate, but a subtractive measurement (compile out one operation at a time, attribute the delta) showed otherwise:

component cost share
trie enumeration + loop/bookkeeping ≈ 2.30 ms ~73%
symbol_refs_ map lookup ≈ 0.36 ms ~11%
CRC32 (per uint32) ≈ 0.27 ms ~9%
stream write + byteswap ≈ 0.21 ms ~7%

The cost is dominated by walking the names_trie and the per-name values_trie via make_enumerative_iterator() (Cedar/double-array), not by the lookup/CRC/write we suspected. Optimizing the lookup or CRC would buy at most ~20%.

Ideas to revisit:

  • Replace the nested values_trie enumeration with a cheaper per-name value source (e.g. whatever the reverse index can already provide contiguously).
  • Precompute, during the symbol-table build, flat per-name lists of value-sorted symbol refs so write_label_indices just streams ready refs — removing both the trie walk and the per-value map lookup at once.

Test plan

  • bazel test //:series_index_test (incl. symbols / shrunk / postings / series-writer / label-indices)
  • Benchmark runs against a real LSS fixture

@gshigin gshigin requested a review from vporoshok as a code owner June 15, 2026 10:35
@gshigin gshigin self-assigned this Jun 15, 2026
- benchmark: add missing <algorithm> include, fail loudly when lss_file is
  absent, document single-threaded timing assumption and the safe
  mutate-while-iterating in mark_all_series_as_added
- index_write_context: simplify estimate_count (drop redundant is_shrunk
  branch) and fix non-ASCII characters in comments

Co-authored-by: Cursor <cursoragent@cursor.com>
@vporoshok vporoshok changed the title Index writer benchmark Optimize index_writer Jun 17, 2026
The snapshot resolver only enumerated values() and collect_snapshot emitted the
key-only name symbol for every value, so a high-cardinality label name produced
one name entry per value (a single hot name reached ~4.7k copies on a real LSS).
Split the resolver into for_each_key_id (names, once each) and for_each_value_id
(values), mirroring the current-side collection. This shrinks the collected
symbol vector in the shrunk state and cuts write_symbols (after shrink) by a
further ~16% on top of the btree change.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vporoshok vporoshok force-pushed the index_writer_benchmark branch from 85f0cc7 to 0c5e97f Compare June 18, 2026 14:19
vporoshok and others added 9 commits June 18, 2026 18:39
The index-writer benchmark wrote a single series, so the output never grew, the
byte-based batch never filled and one write_postings call walked the whole index
(the postings / table-offset / TOC numbers were a full traversal, not a batch).
Write every series before measuring postings / table offsets / TOC, and sample a
steady-state batch (skipping the unsplittable all-series first batch). This makes
write_next_postings_batch comparable across the optimization progression (e.g. the
dense series-references vector) independently of the later batch-size change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Series ids are dense and known up front, so index the series references by
LabelSetID in a single pre-allocated vector instead of a flat_hash_map. This
removes the per-id hash lookup during postings writing. A written series
reference is never zero (the symbols table precedes the series section), so
zero is used as the "not written" sentinel.

Co-authored-by: Cursor <cursoragent@cursor.com>
Collect symbols into a btree_map keyed by the resolved string so each symbol
is resolved exactly once (at insertion) and the table comes out already sorted
and deduplicated. The previous approach sorted a vector of symbol ids with a
comparator that re-resolved strings on every comparison (O(N log N) lookups),
which dominated write_symbols in the shrunk state where the collected vector is
heavily duplicated. write_symbols for a shrunk LSS drops by ~37% on a real LSS.

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds a resize overload that fills newly added elements with a given value, so
callers no longer have to follow resize() with a manual std::fill (which was
also a footgun for trivial element types that resize() leaves uninitialized).
Use it for the dense series-references index in IndexWriter and its tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
The btree key is already the resolved symbol string, so keep it in symbols_ as a
string_view (valid for the LSS lifetime) instead of an ExportSymbolId. SymbolsWriter
walks the table twice (size estimation + write), so this drops two resolves per
unique symbol; correctness is unchanged and output is byte-identical.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the btree_map used for symbol collection with a flat_hash_map and
sort the unique keys once at the end instead of maintaining sorted order on
every insert. write_symbols (after_shrink) drops ~23% (22.0ms -> 16.9ms).

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the per-string std::vector in the symbol-collection map with intrusive
singly-linked lists over a single pre-allocated node pool (one node per collected
id). This drops ~46k small per-symbol vector allocations into one pool allocation.
write_symbols improves a further ~16-18% across states (after_shrink 16.9ms -> 14.2ms).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sort unique symbols by an inline big-endian 8-byte prefix carried next to the
string_view, so most comparisons avoid chasing the view into scattered LSS
memory; equal prefixes fall back to a full string compare. The list head is
carried in the sort entry too, removing the per-symbol hash lookup when building
the reverse index. write_symbols improves ~27-30% across states.

Co-authored-by: Cursor <cursoragent@cursor.com>
WriteRestTo emitted postings in 1 MiB batches, so a single WriteNextPostingsBatch
call did ~8ms of work on production hardware. Drop the bound to 64 KiB so a typical
batch is ~1ms; the all-series posting and hot label values remain single atomic
postings that can exceed the bound in one call. The benchmark's kPostingsBatchSize
is lowered to match.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vporoshok vporoshok force-pushed the index_writer_benchmark branch from 0c5e97f to 04c088d Compare June 18, 2026 14:40
vporoshok and others added 7 commits June 19, 2026 00:21
write_symbols is the longest single C call on the index-writing path (multiple ms,
tens of ms in the shrunk state). fastcgo runs on the system stack without releasing
the P, which stalls the Go scheduler and GC for the whole call; a regular cgo call
parks the goroutine in _Gsyscall and frees the P, at a ~tens-of-ns overhead that is
negligible against a multi-ms call. res mirrors the []byte header with a uintptr data
field so the struct carries no Go pointer type and the call stays clear of the
vet/cgocheck "Go pointer to C" guard; the buffer is always nil or prompp-allocated.

Co-authored-by: Cursor <cursoragent@cursor.com>
Even at 64 KiB the batch bound could not keep a call under 1 ms: the all-series
posting and hot label values are single atomic postings (up to ~30 ms each) that the
byte bound cannot split, so batching only multiplied the number of cgo calls without
taming that tail. Drop it entirely: write_postings now writes the whole postings
section in one pass, exposed as prompp_index_writer_write_postings and invoked via a
regular cgo call (like write_symbols) so the long call parks the goroutine in
_Gsyscall and frees the P instead of blocking the scheduler. Removes the
max_batch_size / has_more_data ABI, the Go batch loop in WriteRestTo and the
benchmark's batch knob — simpler, faster and more predictable.

Co-authored-by: Cursor <cursoragent@cursor.com>
The long write_symbols/write_postings cgo calls no longer thread a buffer
through a goroutine-stack struct. The writer now owns its output buffer:
each write_* method resets and fills it, the constructor returns a stable
pointer to it, and Go reads the result from there. Only the writer pointer
(a stable prompp-arena address) crosses the boundary, by value, so no
goroutine stack address is handed to C and a concurrent GC stack move
during a long call is harmless. The buffer is freed with the writer in the
destructor, so no extra C type or return value is needed and the generated
void-signature entrypoints stay compatible with the symbol dispatcher.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
write_label_indices walks the whole name/value trie index (a few ms),
long enough that fastcgo blocking the P stalls the Go scheduler/GC for
the duration. Like write_symbols/write_postings it now uses a regular
cgo call so the goroutine parks in _Gsyscall and frees its P; only the
writer pointer (a stable prompp-arena address) crosses the boundary by
value.

Co-authored-by: Cursor <cursoragent@cursor.com>
A single un-batched write_postings call buffers the entire postings section into
one prompp-allocated buffer before flushing: on the real LSS (~1.2M series) that
buffer reached ~50 MB, larger than the whole serialized LSS (~30 MB) — an
unacceptable transient. Bring batching back, but keep the per-batch call as a
regular cgo call (not fastcgo): write_postings(writer, max_batch_size) emits one
batch into the writer's internal buffer and sets a has_more_postings flag exposed
as a stable pointer from the constructor (like the output buffer), which Go reads
to drain the section in a WriteNextPostingsBatch loop. Each 64 KiB batch is
flushed and the buffer reused, so the transient is bounded by one batch plus the
largest atomic posting instead of the whole section. Only the writer pointer (a
stable prompp-arena address) and the scalar batch size cross the boundary by
value. Restores the PostingsWriter/IndexWriter batch state, the C++ PartialWrite
test, the Go writePostings loop and the benchmark's steady-state batch sample.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vporoshok vporoshok self-assigned this Jun 22, 2026
Seed the test stream with a one-alignment stub standing in for the
preceding sections so the first series lands at a non-zero, aligned
offset, matching production. Offset 0 maps to the kUnwrittenSeriesReference
sentinel and trips the new SeriesWriter assert.

Co-authored-by: Cursor <cursoragent@cursor.com>
};
struct Result {
IndexWriterPtr writer;
IndexWriterHandle* writer;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general style of entrypoint is to use smart pointers instead of manual memory management

};
struct Result {
IndexWriterPtr writer;
IndexWriterHandle* writer;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: exposing pointers to the data from IndexWriterHandle has strong code smell for me. It sure helps to avoid unnecessary cgo calls, but API becomes dirty. Need some clearer mechanism for pattern such as "C++ writes -> Go reads"

// Group the ids that resolve to the same string using intrusive singly-linked lists
// over a single pre-allocated pool (exactly one node per collected id). The map keeps
// the head index of each list; ids are resolved once and prepended to their list.
std::vector<SymbolIdNode> nodes;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In some places we have BareBones::Vector, in others -- std::vector. In particular case there's no mush difference, but having consistent container usage is better

if (symbol.size() >= sizeof(uint64_t)) {
uint64_t prefix = 0;
std::memcpy(&prefix, symbol.data(), sizeof(uint64_t));
return __builtin_bswap64(prefix);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think std::byteswap do the same thing, but it's standard instead of compiler builtin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants