Gate block-manager historical storage behind PROMPP feature flag by vporoshok · Pull Request #389 · deckhouse/prompp

vporoshok · 2026-06-20T05:46:32Z

Summary

* Add a staged rollout switch for historical block backend selection in server mode, keeping PP head + adapter as the write path in both modes.
* Default to the pre-PR-377 behavior (historical reads via TSDB) and enable the new block-manager backend only with PROMPP_FEATURES=enable_block_manager.
* Keep the additional compaction/blocks observability from earlier commits to make stage diagnostics explicit during rollout.
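
As an illustration of the gating described above, here is a minimal sketch of backend selection from the flag value. It is not the code from this PR: the helper names are hypothetical, and treating PROMPP_FEATURES as a comma-separated list is an assumption (the description only shows a single value).

```go
package main

import "strings"

// featureEnabled reports whether name appears in raw.
// Assumption: raw is a comma-separated list of feature names.
func featureEnabled(raw, name string) bool {
	for _, f := range strings.Split(raw, ",") {
		if strings.TrimSpace(f) == name {
			return true
		}
	}
	return false
}

// historicalBackend picks the backend used for historical reads.
// The default is the legacy TSDB path; the block manager is opt-in.
// The returned labels are illustrative only.
func historicalBackend(raw string) string {
	if featureEnabled(raw, "enable_block_manager") {
		return "block-manager"
	}
	return "tsdb"
}
```

A caller would typically pass `os.Getenv("PROMPP_FEATURES")` as `raw`. Only the read backend changes in this sketch, in line with the write path being the same in both modes.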

Why

A stage rollout showed a sharp CPU and mmap spike after switching storage behavior, so we need a safer migration path and better visibility before enabling the new scheme by default.

Test plan

devcontainer exec --workspace-folder . --config .devcontainer/arm/devcontainer.json go test -tags stringlabels ./pp/go/storage/block/...
devcontainer exec --workspace-folder . --config .devcontainer/arm/devcontainer.json go test -tags stringlabels ./cmd/prometheus/...

Made with Cursor

* created metrics for DataStorage
* added ability to store metadata in metrics page
* added encoder type count metrics for DataStorage
* added DataStorage finalized_chunks_count metric
* added DataStorage timestamp_states_count metric
* review fixes
* changed calculation logic of finalized_chunks metric
* removed DataStorage timestamp_states_count metric
* added ability to refresh metrics in metrics page
* added DataStorage timestamp_states_count metric
* fixed chunk_count_metric calculating
* created unit test for DataStorage metrics
* optimized encoding speed
* fixed clang-format
* fixed compilation error
* fixed comment
* fixed clang-tidy warning

* Add block.Manager with reload, retention and queryable support

Port reloadBlocks into a standalone pp/go/storage/block.Manager that reloads persisted blocks, applies retention via an injected tsdb.BlocksToDeleteFunc, and implements storage.Queryable/ChunkQueryable. Refactor pp-pkg/tsdb to a DB-free NewBlocksToDelete constructor that owns its retention metrics and limit gauges, and expose CatalogHeadsSize / CatalogHeadsExtraSize helpers. Add a tsdb.OpenBlocks wrapper. Co-authored-by: Cursor <cursoragent@cursor.com>

* Implement Blocks method in block.Manager to return currently loaded blocks

This update adds the Blocks method to the Manager struct, which provides a snapshot of the currently loaded blocks, implementing the BlockSource interface. The method ensures thread-safe access to the blocks using read locks.

* Wire block.Manager and block.Compactor into main, disable tsdb

In server mode, stop opening tsdb.DB and instead run block.Manager (persisted block reads + retention) and block.Compactor (compaction). block.Manager is plugged into the fanout via a querier-only storage.Storage adapter; localStorage stays an empty stub. Replace the TSDB run-group actor with a lifecycle actor and drop the dead openDBWithMetrics and its obsolete TestTimeMetrics. Co-authored-by: Cursor <cursoragent@cursor.com>

* review fix

* Add block manager coverage and fail-fast startup behavior.

Ensure server startup aborts when the initial block reload fails, and add manager/compactor tests to cover startup loading, retention-driven deletion, and compaction loop triggering. Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Vladimir Kavlakan <vladimir.kavlakan@flant.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Bastrykov Evgeniy <vporoshok@gmail.com>
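
The read-locked snapshot behind the Blocks method could look roughly like the sketch below. The `block` and `manager` types and the `setBlocks` helper are hypothetical stand-ins, not the real pp/go/storage/block types. The point is only that callers get a copy of the slice, so a later reload cannot change what they are iterating over.

```go
package main

import "sync"

// block is a hypothetical stand-in for a loaded persisted block.
type block struct{ ULID string }

// manager is a hypothetical, reduced version of a block manager.
type manager struct {
	mtx    sync.RWMutex
	blocks []*block
}

// Blocks returns a snapshot of the currently loaded blocks.
// The slice is copied under the read lock, so the caller's view
// is not affected by a concurrent reload swapping m.blocks.
func (m *manager) Blocks() []*block {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	out := make([]*block, len(m.blocks))
	copy(out, m.blocks)
	return out
}

// setBlocks replaces the loaded set, as a reload would do.
func (m *manager) setBlocks(bs []*block) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.blocks = bs
}
```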

Default to the pre-PR-377 historical TSDB path and enable block-manager only with PROMPP_FEATURES=enable_block_manager, keeping PP head+adapter as the write path in both modes. Co-authored-by: Cursor <cursoragent@cursor.com>

Add Grafana panels for loaded block layout and normalize loaded-block duration buckets to 5 minutes in both block-manager and legacy TSDB so duration heatmaps are stable and easy to read. Co-authored-by: Cursor <cursoragent@cursor.com>

Expose loaded-block duration labels in minutes with 1-minute rounding for both block-manager and legacy TSDB paths, and update Grafana panels to query duration_minutes for clearer heatmap grouping. Co-authored-by: Cursor <cursoragent@cursor.com>
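
A minimal sketch of the minute normalization described here, assuming block bounds are millisecond timestamps as in TSDB block metadata. The function name is hypothetical and the actual label formatting in the PR may differ.

```go
package main

import "time"

// blockDurationMinutes returns the span between minTimeMs and maxTimeMs
// (millisecond timestamps) rounded to the nearest whole minute.
// time.Duration.Round rounds halfway values away from zero.
func blockDurationMinutes(minTimeMs, maxTimeMs int64) int64 {
	d := time.Duration(maxTimeMs-minTimeMs) * time.Millisecond
	return int64(d.Round(time.Minute) / time.Minute)
}
```

With this rounding, a block that is a few milliseconds short of two hours still lands in the 120-minute group, which keeps heatmap rows from splitting on small offsets.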

Run reload (with deletion) and a single compaction pass sequentially in one goroutine, mirroring tsdb's compact/reload loop. This removes the race where the compactor's independent loop could plan/compact blocks that the manager's reload was concurrently deleting (open meta.json: no such file or directory) and the resulting repeated re-compaction of overlapping blocks (CPU/mmap churn). The compactor no longer runs its own goroutine, ticker or shared mutex: it exposes a one-shot Compact() called by Manager after each reload. Also render the compaction plan as a string in logs (go-kit cannot encode []string). Co-authored-by: Cursor <cursoragent@cursor.com>

When the C++ scrape parser rejects a buffer with invalid UTF-8 (in a HELP text or a label value), print the containing line, the buffer size and the line start offset to stdout. The Go-side buffer is mutated in place during parsing, so the offending bytes can only be inspected reliably here. Co-authored-by: Cursor <cursoragent@cursor.com>
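
The actual diagnostic lives in the C++ parser. The Go sketch below only illustrates the idea of locating the line that contains the first invalid UTF-8 byte and its start offset. The function name is hypothetical and none of this reflects the parser's real interface.

```go
package main

import (
	"bytes"
	"unicode/utf8"
)

// firstInvalidLine scans buf for the first invalid UTF-8 byte and returns
// the line containing it (without the trailing newline) and that line's
// start offset in buf. ok is false when buf is valid UTF-8.
func firstInvalidLine(buf []byte) (line []byte, lineStart int, ok bool) {
	for i := 0; i < len(buf); {
		r, size := utf8.DecodeRune(buf[i:])
		// RuneError with size 1 marks an invalid byte; a correctly
		// encoded U+FFFD has size 3 and is accepted.
		if r == utf8.RuneError && size == 1 {
			start := bytes.LastIndexByte(buf[:i], '\n') + 1
			end := bytes.IndexByte(buf[i:], '\n')
			if end < 0 {
				end = len(buf)
			} else {
				end += i
			}
			return buf[start:end], start, true
		}
		i += size
	}
	return nil, 0, false
}
```

A caller could then print the returned line together with `len(buf)` and `lineStart`, which are the three values the commit message mentions.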

After a successful compaction, immediately reload and compact again instead of waiting for the next ticker interval, so multiple pending compactions converge in one tick (mirroring tsdb's compact/reload loop). Compact now reports whether it did any work to drive the loop. Co-authored-by: Cursor <cursoragent@cursor.com>
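
The sequential reload/compact cycle described in the last commits can be sketched as below. This is an assumption-laden outline, not the Manager code: `runCycle` and its function parameters are hypothetical, and the real loop is driven by a ticker and has its own logging and error handling.

```go
package main

import "fmt"

// runCycle runs reload and a single compaction pass in sequence and
// repeats while compact reports that it did work. Because both steps
// run in the same goroutine, compaction never plans against blocks
// that a reload is deleting at the same time.
// It returns the number of compaction passes that did work.
func runCycle(reload func() error, compact func() (bool, error)) (int, error) {
	compactions := 0
	for {
		if err := reload(); err != nil {
			return compactions, fmt.Errorf("reload: %w", err)
		}
		didWork, err := compact()
		if err != nil {
			return compactions, fmt.Errorf("compact: %w", err)
		}
		if !didWork {
			// Nothing left to compact; wait for the next tick.
			return compactions, nil
		}
		compactions++
	}
}
```

In this sketch each tick of the outer ticker would call `runCycle` once, so several pending compactions converge within a single tick, and every compaction is followed by a reload before the next plan is made.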

deckhouse-BOaTswain and others added 21 commits June 16, 2026 18:33

chore(deps): update snappy digest to 27ab5f7 (#300)

b90cdbb

* chore(deps): update snappy digest to 27ab5f7
* removed patches for snappy

---------

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>
Co-authored-by: Vladimir Pustovalov <cherep@sura.ru>

chore(deps): update dependency form-data to v3.0.5 [security] (#381)

8fe89ab

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>

chore(deps): update dependency @babel/runtime to v7.29.2 (#382)

c5e092a

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>

chore(deps): update dependency ws to v8.21.0 [security] (#380)

78e2de8

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>

chore(deps): update dependency http-proxy-middleware to v3.0.7 [security] (#384)

d531461

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>

chore(deps): update dependency bazel_skylib to v1.9.0 (#372)

8082ee8

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>

chore(deps): update gtest digest to 0b1e895 (#362)

29e307e

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>

chore(deps): update module go.mongodb.org/mongo-driver to v1.17.7 [security] (#385)

9e35ee0

Co-authored-by: Renovate Bot <renovate@whitesourcesoftware.com>

Merge remote-tracking branch 'origin/pp' into index_writer_benchmark

9108f7c

Improve block compactor observability and range selection.

2890125

Add compaction plan/result logs, restore missing block-manager TSDB gauges, and cap compaction ranges by max block duration so 2h block setups follow configured bounds. Co-authored-by: Cursor <cursoragent@cursor.com>
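
Capping compaction ranges by the maximum block duration can be illustrated with the sketch below. It assumes the ranges are sorted in ascending order and expressed in milliseconds. The function name is hypothetical and the compactor's actual range selection may differ.

```go
package main

// capRanges drops every compaction range larger than maxBlockDuration.
// Assumption: ranges is sorted ascending, so the first range over the
// limit and everything after it can be cut off.
func capRanges(ranges []int64, maxBlockDuration int64) []int64 {
	for i, r := range ranges {
		if r > maxBlockDuration {
			return ranges[:i]
		}
	}
	return ranges
}
```

For a setup limited to 2h blocks, ranges of 2h, 6h and 18h reduce to the single 2h range, so the planner has no larger target to merge blocks into.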

Add loaded block size distribution metric for legacy TSDB.

3374bda

Expose prometheus_tsdb_blocks_loaded_by_size to track loaded block size buckets after reload and help diagnose startup spikes caused by unexpected compaction output. Co-authored-by: Cursor <cursoragent@cursor.com>

Switch legacy TSDB loaded-block metric to duration buckets.

7ab63d9

Replace the size-based gauge with prometheus_tsdb_blocks_loaded_by_duration so legacy TSDB exposes the same block-duration view as block-manager during startup diagnostics. Co-authored-by: Cursor <cursoragent@cursor.com>

Log loaded blocks on block manager startup.

84bbbaf

Mirror legacy tsdb "Found healthy block" output so operators can see the on-disk block layout when the block manager starts, including each block's normalized duration in minutes. Co-authored-by: Cursor <cursoragent@cursor.com>

vporoshok wants to merge 21 commits into index_writer_benchmark from fix/block-compactor-observability