Optimize index_writer#379
Open
gshigin wants to merge 21 commits into
Open
Conversation
- benchmark: add missing <algorithm> include, fail loudly when lss_file is absent, document single-threaded timing assumption and the safe mutate-while-iterating in mark_all_series_as_added - index_write_context: simplify estimate_count (drop redundant is_shrunk branch) and fix non-ASCII characters in comments Co-authored-by: Cursor <cursoragent@cursor.com>
The snapshot resolver only enumerated values() and collect_snapshot emitted the key-only name symbol for every value, so a high-cardinality label name produced one name entry per value (a single hot name reached ~4.7k copies on a real LSS). Split the resolver into for_each_key_id (names, once each) and for_each_value_id (values), mirroring the current-side collection. This shrinks the collected symbol vector in the shrunk state and cuts write_symbols (after shrink) by a further ~16% on top of the btree change. Co-authored-by: Cursor <cursoragent@cursor.com>
85f0cc7 to
0c5e97f
Compare
The index-writer benchmark wrote a single series, so the output never grew, the byte-based batch never filled and one write_postings call walked the whole index (the postings / table-offset / TOC numbers were a full traversal, not a batch). Write every series before measuring postings / table offsets / TOC, and sample a steady-state batch (skipping the unsplittable all-series first batch). This makes write_next_postings_batch comparable across the optimization progression (e.g. the dense series-references vector) independently of the later batch-size change. Co-authored-by: Cursor <cursoragent@cursor.com>
Series ids are dense and known up front, so index the series references by LabelSetID in a single pre-allocated vector instead of a flat_hash_map. This removes the per-id hash lookup during postings writing. A written series reference is never zero (the symbols table precedes the series section), so zero is used as the "not written" sentinel. Co-authored-by: Cursor <cursoragent@cursor.com>
Collect symbols into a btree_map keyed by the resolved string so each symbol is resolved exactly once (at insertion) and the table comes out already sorted and deduplicated. The previous approach sorted a vector of symbol ids with a comparator that re-resolved strings on every comparison (O(N log N) lookups), which dominated write_symbols in the shrunk state where the collected vector is heavily duplicated. write_symbols for a shrunk LSS drops by ~37% on a real LSS. Co-authored-by: Cursor <cursoragent@cursor.com>
Adds a resize overload that fills newly added elements with a given value, so callers no longer have to follow resize() with a manual std::fill (which was also a footgun for trivial element types that resize() leaves uninitialized). Use it for the dense series-references index in IndexWriter and its tests. Co-authored-by: Cursor <cursoragent@cursor.com>
The btree key is already the resolved symbol string, so keep it in symbols_ as a string_view (valid for the LSS lifetime) instead of an ExportSymbolId. SymbolsWriter walks the table twice (size estimation + write), so this drops two resolves per unique symbol; correctness is unchanged and output is byte-identical. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the btree_map used for symbol collection with a flat_hash_map and sort the unique keys once at the end instead of maintaining sorted order on every insert. write_symbols (after_shrink) drops ~23% (22.0ms -> 16.9ms). Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the per-string std::vector in the symbol-collection map with intrusive singly-linked lists over a single pre-allocated node pool (one node per collected id). This drops ~46k small per-symbol vector allocations into one pool allocation. write_symbols improves a further ~16-18% across states (after_shrink 16.9ms -> 14.2ms). Co-authored-by: Cursor <cursoragent@cursor.com>
Sort unique symbols by an inline big-endian 8-byte prefix carried next to the string_view, so most comparisons avoid chasing the view into scattered LSS memory; equal prefixes fall back to a full string compare. The list head is carried in the sort entry too, removing the per-symbol hash lookup when building the reverse index. write_symbols improves ~27-30% across states. Co-authored-by: Cursor <cursoragent@cursor.com>
WriteRestTo emitted postings in 1 MiB batches, so a single WriteNextPostingsBatch call did ~8ms of work on production hardware. Drop the bound to 64 KiB so a typical batch is ~1ms; the all-series posting and hot label values remain single atomic postings that can exceed the bound in one call. The benchmark's kPostingsBatchSize is lowered to match. Co-authored-by: Cursor <cursoragent@cursor.com>
0c5e97f to
04c088d
Compare
write_symbols is the longest single C call on the index-writing path (multiple ms, tens of ms in the shrunk state). fastcgo runs on the system stack without releasing the P, which stalls the Go scheduler and GC for the whole call; a regular cgo call parks the goroutine in _Gsyscall and frees the P, at a ~tens-of-ns overhead that is negligible against a multi-ms call. res mirrors the []byte header with a uintptr data field so the struct carries no Go pointer type and the call stays clear of the vet/cgocheck "Go pointer to C" guard; the buffer is always nil or prompp-allocated. Co-authored-by: Cursor <cursoragent@cursor.com>
Even at 64 KiB the batch bound could not keep a call under 1 ms: the all-series posting and hot label values are single atomic postings (up to ~30 ms each) that the byte bound cannot split, so batching only multiplied the number of cgo calls without taming that tail. Drop it entirely: write_postings now writes the whole postings section in one pass, exposed as prompp_index_writer_write_postings and invoked via a regular cgo call (like write_symbols) so the long call parks the goroutine in _Gsyscall and frees the P instead of blocking the scheduler. Removes the max_batch_size / has_more_data ABI, the Go batch loop in WriteRestTo and the benchmark's batch knob — simpler, faster and more predictable. Co-authored-by: Cursor <cursoragent@cursor.com>
The long write_symbols/write_postings cgo calls no longer thread a buffer through a goroutine-stack struct. The writer now owns its output buffer: each write_* method resets and fills it, the constructor returns a stable pointer to it, and Go reads the result from there. Only the writer pointer (a stable prompp-arena address) crosses the boundary, by value, so no goroutine stack address is handed to C and a concurrent GC stack move during a long call is harmless. The buffer is freed with the writer in the destructor, so no extra C type or return value is needed and the generated void-signature entrypoints stay compatible with the symbol dispatcher. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
write_label_indices walks the whole name/value trie index (a few ms), long enough that fastcgo blocking the P stalls the Go scheduler/GC for the duration. Like write_symbols/write_postings it now uses a regular cgo call so the goroutine parks in _Gsyscall and frees its P; only the writer pointer (a stable prompp-arena address) crosses the boundary by value. Co-authored-by: Cursor <cursoragent@cursor.com>
A single un-batched write_postings call buffers the entire postings section into one prompp-allocated buffer before flushing: on the real LSS (~1.2M series) that buffer reached ~50 MB, larger than the whole serialized LSS (~30 MB) — an unacceptable transient. Bring batching back, but keep the per-batch call as a regular cgo call (not fastcgo): write_postings(writer, max_batch_size) emits one batch into the writer's internal buffer and sets a has_more_postings flag exposed as a stable pointer from the constructor (like the output buffer), which Go reads to drain the section in a WriteNextPostingsBatch loop. Each 64 KiB batch is flushed and the buffer reused, so the transient is bounded by one batch plus the largest atomic posting instead of the whole section. Only the writer pointer (a stable prompp-arena address) and the scalar batch size cross the boundary by value. Restores the PostingsWriter/IndexWriter batch state, the C++ PartialWrite test, the Go writePostings loop and the benchmark's steady-state batch sample. Co-authored-by: Cursor <cursoragent@cursor.com>
Seed the test stream with a one-alignment stub standing in for the preceding sections so the first series lands at a non-zero, aligned offset, matching production. Offset 0 maps to the kUnwrittenSeriesReference sentinel and trips the new SeriesWriter assert. Co-authored-by: Cursor <cursoragent@cursor.com>
gshigin
commented
Jun 23, 2026
| }; | ||
| struct Result { | ||
| IndexWriterPtr writer; | ||
| IndexWriterHandle* writer; |
Collaborator
Author
There was a problem hiding this comment.
The general style of entrypoint is to use smart pointers instead of manual memory management
| }; | ||
| struct Result { | ||
| IndexWriterPtr writer; | ||
| IndexWriterHandle* writer; |
Collaborator
Author
There was a problem hiding this comment.
nit: exposing pointers to the data from IndexWriterHandle has strong code smell for me. It sure helps to avoid unnecessary cgo calls, but API becomes dirty. Need some clearer mechanism for pattern such as "C++ writes -> Go reads"
| // Group the ids that resolve to the same string using intrusive singly-linked lists | ||
| // over a single pre-allocated pool (exactly one node per collected id). The map keeps | ||
| // the head index of each list; ids are resolved once and prepended to their list. | ||
| std::vector<SymbolIdNode> nodes; |
Collaborator
Author
There was a problem hiding this comment.
nit: In some places we have BareBones::Vector, in others -- std::vector. In particular case there's no mush difference, but having consistent container usage is better
| if (symbol.size() >= sizeof(uint64_t)) { | ||
| uint64_t prefix = 0; | ||
| std::memcpy(&prefix, symbol.data(), sizeof(uint64_t)); | ||
| return __builtin_bswap64(prefix); |
Collaborator
Author
There was a problem hiding this comment.
I think std::byteswap do the same thing, but it's standard instead of compiler builtin
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an entry-point benchmark for the TSDB index writer and optimizes two hot paths it exposes.
Changes
Benchmark
index_writer_symbols_benchmark(pp/series_index/benchmarks/) that times eachIndexWriterentry-point call (write_header,write_symbolsin the no-shrink / fixed / shrunk states,write_series,write_label_indices, onewrite_postingsbatch, table offsets, TOC) against a real serialized LSS loaded fromlss_file.<algorithm>include, fail loudly whenlss_fileis absent, documented the single-threaded timing assumption and the safe mutate-while-iterating inmark_all_series_as_added.Symbol collection (
index_write_context)collect_currentno longer special-cases the shrunk state; symbol count is small and filtering per current series did not pay off.estimate_countsimplified (dropped the redundantis_shrunkbranch).for_each_snapshot_key_id/for_each_snapshot_value_idreplace the single callback that re-emitted each name symbol per value.flat_hash_mapkeyed by the resolved string, then the unique keys are sorted once at the end. This replaces the original "collect all ids into a vector, sort with a string-resolving comparator, deduplicate" path, whose comparator re-resolved strings on every comparison (O(N log N)lookups) and dominatedwrite_symbolsin the shrunk state, where the collected vector is heavily duplicated. Each symbol is resolved exactly once at insertion; ordering costs a singlesortover the ~46k unique strings instead of being maintained per insert. (An intermediatebtree_mapvariant was also tried but the hash-map + single sort proved faster, see results.)std::vector, removing ~46k small allocations (one per unique symbol) in favor of a single pool allocation.string_view. Most comparisons are resolved on this integer prefix without dereferencing the view into scattered LSS memory; only equal prefixes fall back to a full string compare. The list head is carried in the sort entry too, removing the per-symbol hash lookup when building the reverse index. (The sort was the single dominant component ofwrite_symbols, ~5.5ms; a cedar radix-tree variant that avoids sorting entirely was also tried but its insertion cost was far higher, so it was rejected. Abtree_map+ linked-list pool variant — which keeps the strings sorted on insert and needs no prefix cache — was also re-measured on top of the pool, but it stayed 34–56% slower thanflat_hash_map+ single prefix-cached sort (dev box, µs: no shrink 10760 vs 8007, fixed 10595 vs 7905, after shrink 17031 vs 10940): the btree pays anO(log N)cache-missing string compare on every insert, which the prefix cache exists to avoid.)Postings (
series references)flat_hash_map<LabelSetID, SeriesReference>with a dense, pre-allocated vector indexed byLabelSetID. Series ids are dense and the total count is known up front, so a single allocation suffices. This removes the per-id hash lookup while writing postings. A written series reference is never zero (the symbols table precedes the series section), so zero is used as the "not written" sentinel.write_symbolscgo callwrite_symbolsis the longest single C call on the index-writing path (a few ms, tens of ms in the shrunk state). Every other entry point goes throughfastcgo, which runs on the system stack without releasing the P — fine for short calls, but for this one it stalls the Go scheduler and GC for the whole duration.write_symbolsnow uses a regular cgo call, which parks the goroutine in_Gsyscalland frees the P; the ~tens-of-ns cgo overhead is negligible against a multi-ms call. The result header is mirrored with auintptrdata field so the struct carries no Go pointer type (the buffer is always nil or prompp-allocated), keeping the call clear of thevet/cgocheck"Go pointer to C" guard.Postings: batched cgo call (bounds the transient buffer)
WriteRestTooriginally emitted postings in 1 MiB batches (aWriteNextPostingsBatch(max_batch_size) -> (data, has_more_data)loop), so each cgo call did ~8 ms of work on production hardware. We first tried to bound a call to ~1 ms by shrinking the batch, but the byte bound is only checked after a full posting, and individual large postings (the all-series posting, hot label values) are emitted atomically — so the tail cannot be split. The sweep below makes this concrete (prod-like x86_64, all series written, real LSS; per-call µs):Smaller batches only multiply the number of tiny cgo calls (1065 calls at 16 KiB) while the
calls > 1 mscount and the ~30 ms max (the all-series posting) barely move — batching can't meet the 1 ms latency goal.We briefly dropped batching for a single un-batched
write_postingscall, but that is not viable for memory: a single call buffers the entire postings section into one prompp-allocated buffer before flushing. On the real LSS (~1.2M series) that buffer measured 50 904 044 bytes (~48.5 MiB) — larger than the whole serialized LSS (~30 MB), an unacceptable transient.So batching is kept, but the per-batch call now goes through a regular cgo call (not
fastcgo):prompp_index_writer_write_postings(writer, max_batch_size)writes one batch into the writer's internal buffer and sets ahas_more_postingsflag (a stable pointer returned by the constructor, like the output buffer) that Go reads to decide whether to loop. Go flushes each 64 KiB batch to the writer and reuses the buffer, so the transient is bounded by one batch plus the single largest atomic posting (a few MiB) instead of the whole ~50 MB section. Each batch parks the goroutine in_Gsyscalland frees the P for its duration, and only the writer pointer (a stable prompp-arena address) and the scalar batch size cross the boundary by value — no goroutine stack pointer is handed to C. The benchmark times one steady-state 64 KiB batch (the first batch, which emits the unsplittable all-series posting, is skipped untimed).Benchmark results
Real serialized LSS,
--benchmark_repetitions=10, min per entry-point call (microseconds).Note: the
write_next_postings_batch/write_postings_table_offsetsrows below are the original ARM dev-box numbers measured with a single series written (a full index traversal). Postings is written in 64 KiB batches; for the series-references map → vector comparison see the postings table in "Production-like hardware" below.write_headerwrite_symbols(no shrink)write_symbols(fixed state)write_symbols(after shrink)write_next_series_batchwrite_label_indiceswrite_next_postings_batchwrite_label_indices_tablewrite_postings_table_offsetswrite_table_of_contentsSymbol-collection structure comparison (
write_symbols, min us):(The
flat_hash_map + single sortand+ linked-list poolcolumns are from an earlier benchmarking session; the+ linked-list pooland+ 8-byte prefix sortcolumns were re-measured back-to-back on the same machine for a fair delta.)Production-like hardware (x86_64, 4 cores)
The numbers above were taken on an Apple-silicon dev box. The branch history was reordered so the benchmark and snapshot fixes land first and establish an honest baseline (
d57f3b723: fair benchmark + snapshot names collected once, no optimizations yet); the optimizations are then stacked on top. The full progression was re-measured back-to-back on a production-like x86_64 machine (--copt=-march=native,--benchmark_repetitions=10, min, µs):d57f3b723baseline (bench + snapshot fixes)flat_hash_map+ single sort(
Vector::resize(size, value)and "store resolved strings" are plumbing-only steps between btree andflat_hash_map; they don't movewrite_symbolsand are omitted.)End-to-end on prod-like hardware (
write_symbols, honest baseline → tip): no shrink -38% (21.7k → 13.4k), fixed -42% (22.2k → 12.8k), after shrink -53% (51.5k → 24.3k). Notable honesty point: theflat_hash_map + single sortstep on its own regresses vs the btree (after shrink 37.1k → 49.1k µs) — it only pays off once the linked-list pool removes the per-symbol allocations and the 8-byte prefix cache speeds the final sort.Postings — series-references map → dense vector (prod-like x86_64, real LSS, min of 10, µs). Measured per call at a fixed configuration so the only variable is the lookup structure:
flat_hash_mapReplacing the per-id hash lookup with a dense vector roughly halves the postings work (-55%). This win is a property of the lookup and is independent of batching (see "Postings: batched cgo call (bounds the transient buffer)"). Symbol-collection commits don't touch postings.
Symbol statistics (same real LSS)
The LSS contains 45 788 unique symbols. Label values are stored independently per key, so a value string is collected once per key it appears under (and again from the snapshot after shrink); the number of collected symbol-id entries is therefore much larger than the unique count:
Distribution of how many collected ids back each unique symbol (i.e. duplicate multiplicity):
Takeaways:
write_symbolspath expensive in the shrunk state.Memory
Measured with jemalloc (
--enable-prof) on the same real LSS. Two complementary views.Cumulative allocations over the whole benchmark run
MALLOC_CONF=prof:true,prof_accum:true,prof_final:true,lg_prof_sample:0(every allocation sampled); thet*total line of the final heap profile. This is whole-program cumulative (not peak) and inflated by the manyIndexWriterconstructions in the benchmark, but the structure is identical across commits so the trend is meaningful.flat_hash_map+ per-symbolstd::vectorThe per-symbol
std::vectorapproach exploded the allocation count (~46k tiny vectors per build); the linked-list pool brought it back to the baseline (~4.35x fewer allocations) and also cut bytes by ~51 MB. The remaining byte growth vs the original is the deliberate space-for-speed trade (dense series-references vector, prefix-sort cache).Peak transient memory per
write_symbolscalljemalloc
thread.peakreset/read around the call (whole-program peak is useless here — it is dominated by the static multi-hundred-MB LSS).flat_hash_map+ per-symbolstd::vectorPeak transient memory grew from ~3.75 MB to 6.5/11.25 MB — single-digit MB, negligible against the LSS itself — in exchange for the speedups above. The per-symbol-vector approach was the worst on both allocation count and peak; the pool fixed it, the prefix cache adds ~1.5 MB for the sort.
Note on the original row being flat across states: in the original version the transient was dominated by the serialized-symbols output buffer written to the stream, which is sized by the unique symbols (~1.98 MB of payload, identical in both states) and does not depend on the collected count. The collected
symbol_idsvector does grow after shrink (≈0.55 MB → ≈1.4 MB), but it is freed duringrebuild()before the stream is written, and it never exceeds the output buffer, so it does not move the peak. The newer versions introduce build-time structures (node pool, hash map, sort cache) that scale with the collected count and exceed the output buffer, which is why their peak does reflect the shrink state.To think about (not in this PR)
write_label_indicescost breakdownwrite_label_indicestakes ~3.15 ms on the dev box (real LSS). We expected the per-value symbol-ref map lookup and the per-uint32CRC32 to dominate, but a subtractive measurement (compile out one operation at a time, attribute the delta) showed otherwise:symbol_refs_map lookupuint32)The cost is dominated by walking the
names_trieand the per-namevalues_trieviamake_enumerative_iterator()(Cedar/double-array), not by the lookup/CRC/write we suspected. Optimizing the lookup or CRC would buy at most ~20%.Ideas to revisit:
values_trieenumeration with a cheaper per-name value source (e.g. whatever the reverse index can already provide contiguously).value-sortedsymbol refs sowrite_label_indicesjust streams ready refs — removing both the trie walk and the per-value map lookup at once.Test plan
bazel test //:series_index_test(incl. symbols / shrunk / postings / series-writer / label-indices)