Two-stage ANALYZE with adaptive count-min sketch params #30206

ztlpn · 2025-12-05T12:15:36Z

Changelog category

Not for changelog (changelog entry is not required)

Description for reviewers

Calculate column statistics in two stages. The first stage determines the total row count and the number of distinct values for each column. The second stage calculates count-min sketches for some of the columns.

ydbot · 2025-12-05T12:15:55Z

Run Extra Tests

Run additional tests for this PR. You can customize:

Test Size: small, medium, large (default: all)
Test Targets: any directory path (default: ydb/)
Sanitizers: ASAN, MSAN, TSAN
Coredumps: enable for debugging (default: off)
Additional args: custom ya make arguments

github-actions · 2025-12-05T12:17:04Z

⚪ 2025-12-05 12:17:03 UTC Pre-commit check linux-x86_64-relwithdebinfo for 3100eb4 has started.
⚪ 2025-12-05 12:17:21 UTC Artifacts will be uploaded here
⚪ 2025-12-05 12:19:30 UTC ya make is running...
🟡 2025-12-05 14:40:53 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
41828	38890	0	5	2903	30

⚪ 2025-12-05 14:41:07 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-12-05 14:53:24 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
57 (only retried tests)	44	0	0	0	13

🟢 2025-12-05 14:53:31 UTC Build successful.
🟢 2025-12-05 14:53:53 UTC ydbd size 2.3 GiB changed* by +59.6 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash	main: `3707c7f`	merge: `3100eb4`	diff	diff %
ydbd size	2 465 899 016 Bytes	2 465 960 064 Bytes	+59.6 KiB	+0.002%
ydbd stripped size	524 793 600 Bytes	524 808 192 Bytes	+14.2 KiB	+0.003%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

github-actions · 2025-12-05T12:17:19Z

⚪ 2025-12-05 12:17:19 UTC Pre-commit check linux-x86_64-release-asan for 3100eb4 has started.
⚪ 2025-12-05 12:17:39 UTC Artifacts will be uploaded here
⚪ 2025-12-05 12:19:46 UTC ya make is running...
🟡 2025-12-05 14:08:38 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
13546	13464	0	66	7	9

🟢 2025-12-05 14:08:46 UTC Build successful.
🟡 2025-12-05 14:09:15 UTC ydbd size 3.8 GiB changed* by +102.9 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash	main: `3707c7f`	merge: `3100eb4`	diff	diff %
ydbd size	4 127 315 488 Bytes	4 127 420 880 Bytes	+102.9 KiB	+0.003%
ydbd stripped size	1 532 447 576 Bytes	1 532 491 544 Bytes	+42.9 KiB	+0.003%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

github-actions · 2025-12-05T12:18:47Z

🟢 2025-12-05 12:18:46 UTC The validation of the Pull Request description is successful.

Copilot

Pull request overview

This PR implements a two-stage ANALYZE process with adaptive count-min sketch parameters. The first stage calculates total row count and distinct value counts for each column using HyperLogLog. The second stage uses these statistics to determine which columns need count-min sketches and calculates optimal parameters (width and depth) based on the data distribution.

Key changes:

Refactored test utilities to separate table creation from data population and statistics collection
Introduced IColumnStatisticEval interface for extensible statistics calculation
Added TSelectBuilder class to dynamically construct YQL queries for statistics aggregation
Modified statistics storage to support multiple statistic types per column
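To illustrate the query-builder idea from the list above, here is a minimal sketch of a builder that registers aggregations and hands back the position of each result in the SELECT list. `SelectBuilderSketch`, `AddAggregation`, and `Build` are hypothetical names for illustration only; the PR's actual `TSelectBuilder` API is richer (it also handles UDAF factories) and may look different. Like the code under review, this sketch does not escape identifiers.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the PR's TSelectBuilder.
class SelectBuilderSketch {
public:
    // Registers `func(column)` and returns the index of its result in the
    // SELECT list, so the caller can find the value in the response row.
    std::uint32_t AddAggregation(const std::string& func, const std::string& column) {
        Items_.push_back(func + "(`" + column + "`)");
        return static_cast<std::uint32_t>(Items_.size() - 1);
    }

    // Assembles the final query text. Names are inserted verbatim (no escaping).
    std::string Build(const std::string& table) const {
        std::string query = "SELECT ";
        for (std::size_t i = 0; i < Items_.size(); ++i) {
            if (i > 0) {
                query += ", ";
            }
            query += Items_[i];
        }
        query += " FROM `" + table + "`";
        return query;
    }

private:
    std::vector<std::string> Items_;
};
```

Returning an index from each registration lets the caller keep a per-column handle (similar in spirit to `CountDistinctSeq` in the diff) without parsing the generated query.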

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

File	Description
ydb/core/statistics/ut_common/ut_common.h	Added new test helper functions and structures, refactored function signatures to return `TTableInfo`
ydb/core/statistics/ut_common/ut_common.cpp	Implemented test helpers, changed table schemas from `Uint64` to `String` for Value column
ydb/core/statistics/service/ut/ya.make	Added dependencies for UDF functions (digest, hyperloglog)
ydb/core/statistics/service/ut/ut_http_request.cpp	Updated tests to use new helper functions
ydb/core/statistics/service/ut/ut_column_statistics.cpp	Simplified tests using new helper functions
ydb/core/statistics/service/ut/ut_basic_statistics.cpp	Updated to use `PrepareUniformTable` instead of `CreateUniformTable`
ydb/core/statistics/events.h	Added `SIMPLE_COLUMN` stat type and `TStatisticsItem` structure, extended `TEvFinishTraversal`
ydb/core/statistics/database/ut/ut_database.cpp	Updated tests to use new `TStatisticsItem` structure
ydb/core/statistics/database/database.h	Changed `CreateSaveStatisticsQuery` signature to accept `TStatisticsItem` vector
ydb/core/statistics/database/database.cpp	Refactored save query to handle heterogeneous statistics items with optional column tags
ydb/core/statistics/aggregator/ya.make	Added new source files for select builder
ydb/core/statistics/aggregator/ut/ya.make	Added UDF dependencies
ydb/core/statistics/aggregator/ut/ut_traverse_datashard.cpp	Updated tests to use `PrepareUniformTable`
ydb/core/statistics/aggregator/ut/ut_traverse_columnshard.cpp	Simplified tests using new helper functions
ydb/core/statistics/aggregator/ut/ut_analyze_datashard.cpp	Added validation functions, updated tests
ydb/core/statistics/aggregator/ut/ut_analyze_columnshard.cpp	Simplified tests using new helper functions
ydb/core/statistics/aggregator/tx_finish_trasersal.cpp	Added handling for `TEvFinishTraversal` with statistics payload
ydb/core/statistics/aggregator/tx_aggr_stat_response.cpp	Added filtering to skip columns without names
ydb/core/statistics/aggregator/select_builder.h	New class for building YQL SELECT queries with UDAF aggregations
ydb/core/statistics/aggregator/select_builder.cpp	Implementation of query builder
ydb/core/statistics/aggregator/analyze_actor.h	Added two-stage processing, `IColumnStatisticEval` interface, refactored column descriptor
ydb/core/statistics/aggregator/analyze_actor.cpp	Implemented two-stage analysis with adaptive CMS parameters
ydb/core/statistics/aggregator/aggregator_impl.h	Added `StatisticsToSave` member
ydb/core/statistics/aggregator/aggregator_impl.cpp	Updated to handle new statistics items format
ydb/core/protos/statistics.proto	Added `TSimpleColumnStatistics` message

ydb/core/statistics/aggregator/analyze_actor.h

ydb/core/statistics/service/ut/ut_column_statistics.cpp

ydb/core/statistics/aggregator/analyze_actor.cpp

github-actions · 2025-12-08T16:42:33Z

⚪ 2025-12-08 16:42:32 UTC Pre-commit check linux-x86_64-relwithdebinfo for b2b2610 has started.
⚪ 2025-12-08 16:43:07 UTC Artifacts will be uploaded here
⚪ 2025-12-08 16:45:14 UTC ya make is running...
🟡 2025-12-08 19:01:54 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
41841	38911	0	3	2902	25

⚪ 2025-12-08 19:02:08 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-12-08 19:12:41 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
44 (only retried tests)	31	0	1	0	12

⚪ 2025-12-08 19:12:48 UTC ya make is running... (failed tests rerun, try 3)
🟢 2025-12-08 19:20:03 UTC Tests successful.

Ya make output | Test bloat | Test bloat | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
26 (only retried tests)	15	0	0	0	11

🟢 2025-12-08 19:20:10 UTC Build successful.
🟢 2025-12-08 19:20:35 UTC ydbd size 2.3 GiB changed* by +60.4 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash	main: `33bff08`	merge: `b2b2610`	diff	diff %
ydbd size	2 467 827 992 Bytes	2 467 889 808 Bytes	+60.4 KiB	+0.003%
ydbd stripped size	525 130 656 Bytes	525 145 504 Bytes	+14.5 KiB	+0.003%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

github-actions · 2025-12-08T16:45:11Z

⚪ 2025-12-08 16:45:10 UTC Pre-commit check linux-x86_64-release-asan for b2b2610 has started.
⚪ 2025-12-08 16:45:30 UTC Artifacts will be uploaded here
⚪ 2025-12-08 16:47:40 UTC ya make is running...
🟡 2025-12-08 18:30:39 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
13550	13419	0	104	17	10

🟢 2025-12-08 18:30:47 UTC Build successful.
🟡 2025-12-08 18:31:22 UTC ydbd size 3.8 GiB changed* by +100.1 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash	main: `33bff08`	merge: `b2b2610`	diff	diff %
ydbd size	4 131 363 664 Bytes	4 131 466 168 Bytes	+100.1 KiB	+0.002%
ydbd stripped size	1 533 585 592 Bytes	1 533 625 976 Bytes	+39.4 KiB	+0.003%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

Copilot

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 12 comments.

Copilot · 2025-12-08T17:04:36Z

ydb/core/statistics/aggregator/analyze_actor.cpp

+    auto addColumn = [&](const TSysTables::TTableColumnInfo& colInfo) {
+        Columns.emplace_back(colInfo.Id, colInfo.PType, colInfo.Name);
+        // TODO: escape column names
+        Columns.back().CountDistinctSeq = stage1Builder.AddBuiltinAggregation(


The comment says "TODO: escape column names" but the column names are being passed directly to the query builder. If this is a security or correctness concern that needs to be addressed, it should be tracked properly. If not, the TODO should be removed.

ydb/core/statistics/aggregator/select_builder.cpp

Copilot · 2025-12-08T17:04:36Z

ydb/core/statistics/aggregator/analyze_actor.cpp


    // TODO: escape table path
-    auto table = "/" + JoinVectorIntoString(entry.Path, "/");
+    TableName = "/" + JoinVectorIntoString(entry.Path, "/");


The comment says "TODO: escape table path" but the table path is being used directly in the query. If this is a security or correctness concern that needs to be addressed, it should be tracked properly. If not, the TODO should be removed.
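One way the escaping TODO could be addressed is a small quoting helper applied before a name is spliced into the query text. The sketch below is an assumption, not the PR's code: `QuoteIdentifier` is a hypothetical name, and it assumes that a backslash escapes a backtick or another backslash inside a backtick-quoted YQL identifier. That convention should be checked against the YQL lexer before relying on it.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: wraps a name in backticks and backslash-escapes the
// characters that could otherwise terminate or alter the quoted identifier.
// Assumption: YQL accepts backslash escapes inside backtick-quoted names.
std::string QuoteIdentifier(std::string_view name) {
    std::string out;
    out.reserve(name.size() + 2);
    out.push_back('`');
    for (char ch : name) {
        if (ch == '`' || ch == '\\') {
            out.push_back('\\');
        }
        out.push_back(ch);
    }
    out.push_back('`');
    return out;
}
```

With such a helper the table path and column names would be quoted in one place instead of concatenating raw backticks at each call site.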

Copilot · 2025-12-08T17:04:37Z

ydb/core/statistics/events.h

 };

 enum EStatType {
    SIMPLE = 0,


The enum value SIMPLE_COLUMN (formerly HYPER_LOG_LOG) lacks documentation explaining what "simple column statistics" means and what data it represents. Consider adding a comment to document this statistic type.

Suggested change

-    SIMPLE = 0,
+    SIMPLE = 0,
+    // SIMPLE_COLUMN represents simple statistics for a specific column, such as row count and byte size.
+    // It is used to store basic metrics for individual columns, as opposed to the whole table.

Copilot · 2025-12-08T17:04:37Z

ydb/core/statistics/aggregator/analyze_actor.cpp

-    size_t ColumnCount() const {
-        return Columns.size();
+        const double c = 10;
+        const double eps = (c - 1) * (1 + std::log10(n / ndv)) / ndv;


The calculation (c - 1) * (1 + std::log10(n / ndv)) / ndv could potentially result in division by zero or very small ndv values leading to extremely large epsilon values. Although there's a check for ndv == 0 on line 27, if ndv is very small (e.g., 1), the epsilon calculation and subsequent width calculation could produce unexpected results. Consider adding validation or clamping for the calculated width to ensure it stays within reasonable bounds.
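The clamping suggested here could look roughly like the sketch below. Only the `eps` formula comes from the diff. The width rule `ceil(e / eps)` is the textbook count-min sizing and is an assumption about how the PR derives the width; the bounds `kMinSketchWidth` and `kMaxSketchWidth` and the function name are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical bounds; the PR does not state any.
constexpr std::uint64_t kMinSketchWidth = 16;
constexpr std::uint64_t kMaxSketchWidth = 1u << 20;

// Returns a clamped sketch width, or std::nullopt when the inputs make the
// formula meaningless (no rows or no distinct values).
std::optional<std::uint64_t> ClampedSketchWidth(double n, double ndv) {
    if (!(n > 0) || !(ndv > 0)) {
        return std::nullopt;
    }
    // An approximate distinct count can exceed n; keep the logarithm non-negative.
    const double ratio = std::max(n / ndv, 1.0);
    const double c = 10;
    const double eps = (c - 1) * (1 + std::log10(ratio)) / ndv;  // formula from the diff
    // Assumption: standard count-min sizing, width = ceil(e / eps).
    const double rawWidth = std::ceil(std::exp(1.0) / eps);
    const double clamped = std::clamp(rawWidth,
                                      static_cast<double>(kMinSketchWidth),
                                      static_cast<double>(kMaxSketchWidth));
    return static_cast<std::uint64_t>(clamped);
}
```

For `ndv = 1` the raw width collapses to 1 and is raised to the lower bound; for very large `ndv` the raw width grows roughly linearly with `ndv` and is capped by the upper bound.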

Copilot · 2025-12-08T17:04:38Z

ydb/core/statistics/aggregator/select_builder.h

+ui32 TSelectBuilder::AddUDAFAggregation(TString columnName, const TStringBuf& udafName, TArgs&&... params) {
+    auto factory = AddFactory(udafName);
+
+    // TODO: parameters escaping/binding


The comment says "TODO: parameters escaping/binding" but this is a potential security/correctness issue. If this TODO is not being addressed in this PR, it should be tracked properly as it relates to SQL injection prevention.

Copilot · 2025-12-08T17:04:38Z

ydb/core/statistics/aggregator/analyze_actor.cpp

+            if (!statEval) {
+                continue;
+            }
+            if (statEval->EstimateSize() >= 4_MB) {


The 4_MB threshold for statistic size is a magic number. It should be defined as a named constant (e.g., constexpr size_t MAX_STATISTIC_SIZE = 4_MB) to improve code maintainability and make it easier to adjust this limit in the future.
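A sketch of the named-constant variant, written without the project's size literals so it compiles on its own. The names are hypothetical, and it assumes that `4_MB` means 4 * 1024 * 1024 bytes and that statistics at or above the limit are not stored.

```cpp
#include <cstddef>

// Hypothetical name for the limit that appears as 4_MB in the diff.
// Assumption: 4_MB equals 4 * 1024 * 1024 bytes.
constexpr std::size_t kMaxStatisticSizeBytes = 4u * 1024 * 1024;

// True when a statistic with this estimated size stays under the limit.
// Mirrors the `>= 4_MB` comparison in the diff, inverted.
constexpr bool FitsStatisticSizeLimit(std::size_t estimatedSizeBytes) {
    return estimatedSizeBytes < kMaxStatisticSizeBytes;
}
```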

Copilot · 2025-12-08T17:04:39Z

ydb/core/statistics/aggregator/analyze_actor.cpp

-        res << " FROM `" << table << "`";
-        return res;
-    }
+        if (ndv >= 0.8 * n) {


The condition ndv >= 0.8 * n uses a magic number (0.8). This threshold for determining when to skip count-min sketch should be defined as a named constant (e.g., constexpr double NDV_THRESHOLD = 0.8) to make the code more maintainable and document the reasoning behind this value.
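The second-stage decision with a named threshold might be sketched as follows. The comparison itself is taken from the diff; the constant name, the function name, and the handling of empty inputs are assumptions made for illustration.

```cpp
// Hypothetical name for the 0.8 threshold from the diff: when almost every
// value is distinct, per-value frequencies are close to 1 and a sketch adds
// little information.
constexpr double kMaxDistinctRatioForSketch = 0.8;

// Decides whether stage two should build a count-min sketch for a column,
// given the stage-one row count n and distinct-value estimate ndv.
bool ShouldBuildCountMinSketch(double n, double ndv) {
    if (n <= 0 || ndv <= 0) {
        return false;  // assumption: empty or all-NULL columns are skipped
    }
    return ndv < kMaxDistinctRatioForSketch * n;  // inverse of `ndv >= 0.8 * n`
}
```

Keeping the rule in one function makes the threshold easy to document and to adjust later.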

Copilot · 2025-12-08T17:04:39Z

ydb/core/statistics/aggregator/analyze_actor.cpp


-    size_t ColumnCount() const {
-        return Columns.size();
+        const double c = 10;


The constant c = 10 in the epsilon calculation is a magic number without explanation. This should be defined as a named constant with a comment explaining its purpose in the count-min sketch parameter calculation formula.

Copilot · 2025-12-08T17:04:39Z

ydb/core/statistics/aggregator/aggregator_impl.cpp

+                // operation.Types field is not used, TAnalyzeActor will determine suitable
+                // statistic types itself.


The comment says "operation.Types field is not used" but this statement is incomplete. It should clarify what this field was previously used for and why it's no longer needed, to help future maintainers understand the change.

Suggested change

-                // operation.Types field is not used, TAnalyzeActor will determine suitable
-                // statistic types itself.
+                // Previously, the operation.Types field was used to specify which statistic types
+                // should be collected and analyzed for each force traversal operation. This approach
+                // was replaced to allow TAnalyzeActor to determine the suitable statistic types itself,
+                // based on the current table schema and configuration. As a result, operation.Types is
+                // no longer needed and is ignored here.

github-actions · 2025-12-09T08:57:59Z

⚪ 2025-12-09 08:57:59 UTC Pre-commit check linux-x86_64-relwithdebinfo for 2f844a1 has started.
⚪ 2025-12-09 08:58:16 UTC Artifacts will be uploaded here
⚪ 2025-12-09 09:00:31 UTC ya make is running...
🟡 2025-12-09 11:07:04 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
41865	38936	0	4	2902	23

⚪ 2025-12-09 11:07:17 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-12-09 11:19:52 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
48 (only retried tests)	34	0	0	0	14

🟢 2025-12-09 11:20:00 UTC Build successful.
🟢 2025-12-09 11:20:22 UTC ydbd size 2.3 GiB changed* by +54.7 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash	main: `059973d`	merge: `2f844a1`	diff	diff %
ydbd size	2 466 721 552 Bytes	2 466 777 528 Bytes	+54.7 KiB	+0.002%
ydbd stripped size	524 960 640 Bytes	524 970 976 Bytes	+10.1 KiB	+0.002%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

github-actions · 2025-12-09T08:58:27Z

⚪ 2025-12-09 08:58:26 UTC Pre-commit check linux-x86_64-release-asan for 2f844a1 has started.
⚪ 2025-12-09 08:58:45 UTC Artifacts will be uploaded here
⚪ 2025-12-09 09:01:01 UTC ya make is running...
🟡 2025-12-09 10:58:02 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
13577	13495	0	65	7	10

🟢 2025-12-09 10:58:12 UTC Build successful.
🟡 2025-12-09 10:58:48 UTC ydbd size 3.8 GiB changed* by +101.4 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash	main: `059973d`	merge: `2f844a1`	diff	diff %
ydbd size	4 130 791 632 Bytes	4 130 895 504 Bytes	+101.4 KiB	+0.003%
ydbd stripped size	1 533 426 776 Bytes	1 533 470 040 Bytes	+42.2 KiB	+0.003%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

ydbot · 2025-12-09T12:05:43Z

Backport

To backport this PR, click the button next to the target branch and then click "Run workflow" in the Run Actions UI.

Branch	Run
`stable-25-2, stable-25-2-1, stable-25-3, stable-25-3-1`
`stable-25-3, stable-25-3-1`
`stable-25-3`

ztlpn added 11 commits December 3, 2025 17:55

statistics: support multiple stat types in SaveStatisticsQuery

4a54224

add proto msg for simple column statistics

51be969

calculate simple column statistics (count distinct)

9baf8ae

save column type information

6355776

2-stage analyze with adaptive count-min sketch params

7970ffa

move TSelectBuilder to a separate file

e71469e

add clarifying comment

f6bfb00

change Value column type to String

1f8c255

statistics: refactor test helpers

026201c

tests: split CreateUniformTable into two functions

0b48dcc

more test helpers refactoring (passes existing tests)

00c0b07

ztlpn requested a review from azevaykin December 5, 2025 12:15

ztlpn requested a review from a team as a code owner December 5, 2025 12:15

github-actions bot added the not-for-changelog label Dec 5, 2025

ztlpn self-assigned this Dec 5, 2025

azevaykin requested a review from Copilot December 6, 2025 06:10

Copilot started reviewing on behalf of azevaykin December 6, 2025 06:12 View session

Copilot AI reviewed Dec 6, 2025

ydb/core/statistics/aggregator/analyze_actor.h Outdated

ydb/core/statistics/service/ut/ut_column_statistics.cpp Outdated

ydb/core/statistics/aggregator/analyze_actor.cpp Outdated

azevaykin reviewed Dec 6, 2025

ydb/core/statistics/aggregator/analyze_actor.cpp

ydb/core/statistics/aggregator/analyze_actor.cpp Outdated

ydb/core/statistics/aggregator/analyze_actor.cpp Outdated

ztlpn added 2 commits December 8, 2025 19:16

review fixes

443978f

tests: single function to check count-min values/absence

069fded

ztlpn requested a review from azevaykin December 8, 2025 16:43

ztlpn requested a review from Copilot December 8, 2025 16:51

Copilot started reviewing on behalf of ztlpn December 8, 2025 16:54 View session

Copilot AI reviewed Dec 8, 2025

azevaykin previously approved these changes Dec 8, 2025

review fixes

109e564

ztlpn dismissed azevaykin’s stale review via 109e564 December 9, 2025 08:56

ztlpn requested a review from azevaykin December 9, 2025 08:57

ztlpn enabled auto-merge (squash) December 9, 2025 12:00

azevaykin approved these changes Dec 9, 2025

ztlpn merged commit cd0aab0 into ydb-platform:main Dec 9, 2025
9 checks passed
