Conversation

@ztlpn
Collaborator

@ztlpn ztlpn commented Dec 5, 2025

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

Calculate column statistics in two stages: the first stage determines the total count and the number of distinct values for each column; the second stage calculates count-min sketches for some columns.

@ztlpn ztlpn requested a review from azevaykin December 5, 2025 12:15
@ztlpn ztlpn requested a review from a team as a code owner December 5, 2025 12:15
@ydbot
Collaborator

ydbot commented Dec 5, 2025

Run Extra Tests

Run additional tests for this PR. You can customize:

  • Test Size: small, medium, large (default: all)
  • Test Targets: any directory path (default: ydb/)
  • Sanitizers: ASAN, MSAN, TSAN
  • Coredumps: enable for debugging (default: off)
  • Additional args: custom ya make arguments

▶  Run tests

@ztlpn ztlpn self-assigned this Dec 5, 2025
@github-actions

github-actions bot commented Dec 5, 2025

2025-12-05 12:17:03 UTC Pre-commit check linux-x86_64-relwithdebinfo for 3100eb4 has started.
2025-12-05 12:17:21 UTC Artifacts will be uploaded here
2025-12-05 12:19:30 UTC ya make is running...
🟡 2025-12-05 14:40:53 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 41828 | 38890 | 0 | 5 | 2903 | 30 |

2025-12-05 14:41:07 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-12-05 14:53:24 UTC Tests successful.

Ya make output | Test bloat | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 57 (only retried tests) | 44 | 0 | 0 | 0 | 13 |

🟢 2025-12-05 14:53:31 UTC Build successful.
🟢 2025-12-05 14:53:53 UTC ydbd size 2.3 GiB changed* by +59.6 KiB, which is < 100.0 KiB vs main: OK

| ydbd size dash | main: 3707c7f | merge: 3100eb4 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 2 465 899 016 Bytes | 2 465 960 064 Bytes | +59.6 KiB | +0.002% |
| ydbd stripped size | 524 793 600 Bytes | 524 808 192 Bytes | +14.2 KiB | +0.003% |

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions

github-actions bot commented Dec 5, 2025

2025-12-05 12:17:19 UTC Pre-commit check linux-x86_64-release-asan for 3100eb4 has started.
2025-12-05 12:17:39 UTC Artifacts will be uploaded here
2025-12-05 12:19:46 UTC ya make is running...
🟡 2025-12-05 14:08:38 UTC Some tests failed, follow the links below. This failure is not in the blocking policy yet.

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 13546 | 13464 | 0 | 66 | 7 | 9 |

🟢 2025-12-05 14:08:46 UTC Build successful.
🟡 2025-12-05 14:09:15 UTC ydbd size 3.8 GiB changed* by +102.9 KiB, which is >= 100.0 KiB vs main: Warning

| ydbd size dash | main: 3707c7f | merge: 3100eb4 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 4 127 315 488 Bytes | 4 127 420 880 Bytes | +102.9 KiB | +0.003% |
| ydbd stripped size | 1 532 447 576 Bytes | 1 532 491 544 Bytes | +42.9 KiB | +0.003% |

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions

github-actions bot commented Dec 5, 2025

🟢 2025-12-05 12:18:46 UTC The validation of the Pull Request description is successful.

Contributor

Copilot AI left a comment

Pull request overview

This PR implements a two-stage ANALYZE process with adaptive count-min sketch parameters. The first stage calculates total row count and distinct value counts for each column using HyperLogLog. The second stage uses these statistics to determine which columns need count-min sketches and calculates optimal parameters (width and depth) based on the data distribution.
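
The two-stage flow summarized above can be sketched in standalone C++. This is an illustration under stated assumptions, not the PR's code: `TStage1Stats`, `RunStage1`, `RunStage2`, and `TCountMinSketch` are hypothetical names, an exact `std::unordered_set` stands in for HyperLogLog, and the per-row hash is a simple salted `std::hash`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>
#include <unordered_set>
#include <vector>

// Stage 1 output: total number of values and number of distinct values.
struct TStage1Stats {
    std::size_t Count = 0;
    std::size_t Distinct = 0;
};

// Stand-in for stage 1. Assumption: an exact set replaces HyperLogLog
// to keep the sketch short; a real implementation would approximate.
TStage1Stats RunStage1(const std::vector<std::string>& column) {
    const std::unordered_set<std::string> seen(column.begin(), column.end());
    return TStage1Stats{column.size(), seen.size()};
}

// Minimal count-min sketch: `depth` rows of `width` counters each.
// Both parameters are expected to be greater than zero.
class TCountMinSketch {
public:
    TCountMinSketch(std::size_t width, std::size_t depth)
        : Width_(width), Depth_(depth), Counters_(width * depth, 0) {}

    void Add(const std::string& value) {
        for (std::size_t row = 0; row < Depth_; ++row) {
            ++Counters_[row * Width_ + Bucket(value, row)];
        }
    }

    // Returns an upper bound on how many times `value` was added.
    std::uint64_t Estimate(const std::string& value) const {
        std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
        for (std::size_t row = 0; row < Depth_; ++row) {
            best = std::min(best, Counters_[row * Width_ + Bucket(value, row)]);
        }
        return best;
    }

private:
    // Illustrative per-row hash: the value is salted with the row index.
    std::size_t Bucket(const std::string& value, std::size_t row) const {
        return std::hash<std::string>{}(value + '#' + std::to_string(row)) % Width_;
    }

    std::size_t Width_;
    std::size_t Depth_;
    std::vector<std::uint64_t> Counters_;
};

// Stage 2: build a sketch whose dimensions were chosen from stage 1 output.
TCountMinSketch RunStage2(const std::vector<std::string>& column,
                          std::size_t width, std::size_t depth) {
    TCountMinSketch sketch(width, depth);
    for (const auto& value : column) {
        sketch.Add(value);
    }
    return sketch;
}
```

A caller would run `RunStage1` first, decide from `Count` and `Distinct` whether the column needs a sketch and how large it should be, and only then call `RunStage2`. The point of the split is that the second pass can be skipped or sized per column.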

Key changes:

  • Refactored test utilities to separate table creation from data population and statistics collection
  • Introduced IColumnStatisticEval interface for extensible statistics calculation
  • Added TSelectBuilder class to dynamically construct YQL queries for statistics aggregation
  • Modified statistics storage to support multiple statistic types per column

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| ydb/core/statistics/ut_common/ut_common.h | Added new test helper functions and structures, refactored function signatures to return TTableInfo |
| ydb/core/statistics/ut_common/ut_common.cpp | Implemented test helpers, changed table schemas from Uint64 to String for Value column |
| ydb/core/statistics/service/ut/ya.make | Added dependencies for UDF functions (digest, hyperloglog) |
| ydb/core/statistics/service/ut/ut_http_request.cpp | Updated tests to use new helper functions |
| ydb/core/statistics/service/ut/ut_column_statistics.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/service/ut/ut_basic_statistics.cpp | Updated to use PrepareUniformTable instead of CreateUniformTable |
| ydb/core/statistics/events.h | Added SIMPLE_COLUMN stat type and TStatisticsItem structure, extended TEvFinishTraversal |
| ydb/core/statistics/database/ut/ut_database.cpp | Updated tests to use new TStatisticsItem structure |
| ydb/core/statistics/database/database.h | Changed CreateSaveStatisticsQuery signature to accept TStatisticsItem vector |
| ydb/core/statistics/database/database.cpp | Refactored save query to handle heterogeneous statistics items with optional column tags |
| ydb/core/statistics/aggregator/ya.make | Added new source files for select builder |
| ydb/core/statistics/aggregator/ut/ya.make | Added UDF dependencies |
| ydb/core/statistics/aggregator/ut/ut_traverse_datashard.cpp | Updated tests to use PrepareUniformTable |
| ydb/core/statistics/aggregator/ut/ut_traverse_columnshard.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/aggregator/ut/ut_analyze_datashard.cpp | Added validation functions, updated tests |
| ydb/core/statistics/aggregator/ut/ut_analyze_columnshard.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/aggregator/tx_finish_trasersal.cpp | Added handling for TEvFinishTraversal with statistics payload |
| ydb/core/statistics/aggregator/tx_aggr_stat_response.cpp | Added filtering to skip columns without names |
| ydb/core/statistics/aggregator/select_builder.h | New class for building YQL SELECT queries with UDAF aggregations |
| ydb/core/statistics/aggregator/select_builder.cpp | Implementation of query builder |
| ydb/core/statistics/aggregator/analyze_actor.h | Added two-stage processing, IColumnStatisticEval interface, refactored column descriptor |
| ydb/core/statistics/aggregator/analyze_actor.cpp | Implemented two-stage analysis with adaptive CMS parameters |
| ydb/core/statistics/aggregator/aggregator_impl.h | Added StatisticsToSave member |
| ydb/core/statistics/aggregator/aggregator_impl.cpp | Updated to handle new statistics items format |
| ydb/core/protos/statistics.proto | Added TSimpleColumnStatistics message |

@github-actions

github-actions bot commented Dec 8, 2025

2025-12-08 16:42:32 UTC Pre-commit check linux-x86_64-relwithdebinfo for b2b2610 has started.
2025-12-08 16:43:07 UTC Artifacts will be uploaded here
2025-12-08 16:45:14 UTC ya make is running...
🟡 2025-12-08 19:01:54 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 41841 | 38911 | 0 | 3 | 2902 | 25 |

2025-12-08 19:02:08 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-12-08 19:12:41 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 44 (only retried tests) | 31 | 0 | 1 | 0 | 12 |

2025-12-08 19:12:48 UTC ya make is running... (failed tests rerun, try 3)
🟢 2025-12-08 19:20:03 UTC Tests successful.

Ya make output | Test bloat | Test bloat | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 26 (only retried tests) | 15 | 0 | 0 | 0 | 11 |

🟢 2025-12-08 19:20:10 UTC Build successful.
🟢 2025-12-08 19:20:35 UTC ydbd size 2.3 GiB changed* by +60.4 KiB, which is < 100.0 KiB vs main: OK

| ydbd size dash | main: 33bff08 | merge: b2b2610 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 2 467 827 992 Bytes | 2 467 889 808 Bytes | +60.4 KiB | +0.003% |
| ydbd stripped size | 525 130 656 Bytes | 525 145 504 Bytes | +14.5 KiB | +0.003% |

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@ztlpn ztlpn requested a review from azevaykin December 8, 2025 16:43
@github-actions

github-actions bot commented Dec 8, 2025

2025-12-08 16:45:10 UTC Pre-commit check linux-x86_64-release-asan for b2b2610 has started.
2025-12-08 16:45:30 UTC Artifacts will be uploaded here
2025-12-08 16:47:40 UTC ya make is running...
🟡 2025-12-08 18:30:39 UTC Some tests failed, follow the links below. This failure is not in the blocking policy yet.

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 13550 | 13419 | 0 | 104 | 17 | 10 |

🟢 2025-12-08 18:30:47 UTC Build successful.
🟡 2025-12-08 18:31:22 UTC ydbd size 3.8 GiB changed* by +100.1 KiB, which is >= 100.0 KiB vs main: Warning

| ydbd size dash | main: 33bff08 | merge: b2b2610 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 4 131 363 664 Bytes | 4 131 466 168 Bytes | +100.1 KiB | +0.002% |
| ydbd stripped size | 1 533 585 592 Bytes | 1 533 625 976 Bytes | +39.4 KiB | +0.003% |

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 12 comments.


auto addColumn = [&](const TSysTables::TTableColumnInfo& colInfo) {
Columns.emplace_back(colInfo.Id, colInfo.PType, colInfo.Name);
// TODO: escape column names
Columns.back().CountDistinctSeq = stage1Builder.AddBuiltinAggregation(

Copilot AI Dec 8, 2025

The comment says "TODO: escape column names" but the column names are being passed directly to the query builder. If this is a security or correctness concern that needs to be addressed, it should be tracked properly. If not, the TODO should be removed.
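
One way to address such a TODO is to quote identifiers through a single helper before they reach the query text. The sketch below is hypothetical: `QuoteIdentifier` is not part of the PR, and it assumes that backtick-quoted YQL identifiers accept backslash escapes for backticks and backslashes, which should be verified against the YQL lexical rules before use.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not from the PR. Wraps `name` in backticks and
// prefixes every backtick and backslash inside it with a backslash.
// Assumption: the target dialect honors backslash escapes in backtick quotes.
std::string QuoteIdentifier(std::string_view name) {
    std::string out;
    out.reserve(name.size() + 2);
    out.push_back('`');
    for (const char ch : name) {
        if (ch == '`' || ch == '\\') {
            out.push_back('\\');
        }
        out.push_back(ch);
    }
    out.push_back('`');
    return out;
}
```

Routing every column name and table path through one function of this kind keeps the escaping rule in a single place, so it can be corrected or replaced by parameter binding without touching the query builder's call sites.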


// TODO: escape table path
auto table = "/" + JoinVectorIntoString(entry.Path, "/");
TableName = "/" + JoinVectorIntoString(entry.Path, "/");

Copilot AI Dec 8, 2025

The comment says "TODO: escape table path" but the table path is being used directly in the query. If this is a security or correctness concern that needs to be addressed, it should be tracked properly. If not, the TODO should be removed.

};

enum EStatType {
SIMPLE = 0,

Copilot AI Dec 8, 2025

The enum value SIMPLE_COLUMN (formerly HYPER_LOG_LOG) lacks documentation explaining what "simple column statistics" means and what data it represents. Consider adding a comment to document this statistic type.

Suggested change
SIMPLE = 0,
SIMPLE = 0,
// SIMPLE_COLUMN represents simple statistics for a specific column, such as row count and byte size.
// It is used to store basic metrics for individual columns, as opposed to the whole table.

size_t ColumnCount() const {
return Columns.size();
const double c = 10;
const double eps = (c - 1) * (1 + std::log10(n / ndv)) / ndv;

Copilot AI Dec 8, 2025

The calculation (c - 1) * (1 + std::log10(n / ndv)) / ndv could potentially result in division by zero or very small ndv values leading to extremely large epsilon values. Although there's a check for ndv == 0 on line 27, if ndv is very small (e.g., 1), the epsilon calculation and subsequent width calculation could produce unexpected results. Consider adding validation or clamping for the calculated width to ensure it stays within reasonable bounds.
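
The clamping suggested here could look roughly like the sketch below. The epsilon formula and `c = 10` come from the quoted snippet; everything else is an assumption made for illustration: the function name `ClampedSketchWidth`, the bounds `MinWidth` and `MaxWidth`, and the standard count-min sizing `width = ceil(e / eps)`, since the PR's own width calculation is not shown in this thread.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <optional>

// Assumed bounds for illustration only; they are not taken from the PR.
constexpr std::uint64_t MinWidth = 16;
constexpr std::uint64_t MaxWidth = 1u << 20;

// Returns a sketch width within [MinWidth, MaxWidth], or std::nullopt when
// the inputs do not allow a meaningful epsilon (ndv <= 0, n < ndv, or NaN).
std::optional<std::uint64_t> ClampedSketchWidth(double n, double ndv) {
    if (!(ndv > 0) || !(n >= ndv)) {
        return std::nullopt;
    }
    const double c = 10;  // constant from the quoted snippet
    const double eps = (c - 1) * (1 + std::log10(n / ndv)) / ndv;
    // Assumption: standard count-min sizing, width = ceil(e / eps).
    const double rawWidth = std::ceil(std::exp(1.0) / eps);
    const double clamped = std::clamp(rawWidth,
                                      static_cast<double>(MinWidth),
                                      static_cast<double>(MaxWidth));
    return static_cast<std::uint64_t>(clamped);
}
```

With `n = ndv = 1` the formula gives `eps = 9`, so the unclamped width would be 1; with `n = ndv = 1e9` it would be on the order of 3e8 counters. The clamp keeps both extremes inside the chosen bounds.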

ui32 TSelectBuilder::AddUDAFAggregation(TString columnName, const TStringBuf& udafName, TArgs&&... params) {
auto factory = AddFactory(udafName);

// TODO: parameters escaping/binding

Copilot AI Dec 8, 2025

The comment says "TODO: parameters escaping/binding" but this is a potential security/correctness issue. If this TODO is not being addressed in this PR, it should be tracked properly as it relates to SQL injection prevention.

if (!statEval) {
continue;
}
if (statEval->EstimateSize() >= 4_MB) {

Copilot AI Dec 8, 2025

The 4_MB threshold for statistic size is a magic number. It should be defined as a named constant (e.g., constexpr size_t MAX_STATISTIC_SIZE = 4_MB) to improve code maintainability and make it easier to adjust this limit in the future.
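
A minimal sketch of that suggestion in standard C++ follows. The names `MaxStatisticSizeBytes` and `ReachesStatisticSizeLimit` are hypothetical, and the value is written out as 4 MiB on the assumption that the `4_MB` literal in the quoted code denotes 4 * 1024 * 1024 bytes. What the branch does once the limit is reached is not visible in the quoted fragment, so the helper only names the comparison.

```cpp
#include <cstddef>

// Hypothetical constant name; the value mirrors the 4_MB literal in the
// quoted snippet, assuming it means 4 * 1024 * 1024 bytes.
constexpr std::size_t MaxStatisticSizeBytes = 4 * 1024 * 1024;

// Mirrors the quoted `EstimateSize() >= 4_MB` comparison.
constexpr bool ReachesStatisticSizeLimit(std::size_t estimatedSizeBytes) {
    return estimatedSizeBytes >= MaxStatisticSizeBytes;
}
```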

res << " FROM `" << table << "`";
return res;
}
if (ndv >= 0.8 * n) {

Copilot AI Dec 8, 2025

The condition ndv >= 0.8 * n uses a magic number (0.8). This threshold for determining when to skip count-min sketch should be defined as a named constant (e.g., constexpr double NDV_THRESHOLD = 0.8) to make the code more maintainable and document the reasoning behind this value.
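
As a rough illustration of the suggestion, the literal can be lifted into a named constant next to a small predicate. The names `MaxDistinctRatioForSketch` and `IsNearlyUnique` are hypothetical; only the `ndv >= 0.8 * n` comparison comes from the quoted code.

```cpp
// Hypothetical name; 0.8 is the literal from the quoted condition.
// A column whose distinct-value ratio reaches this bound is treated as
// nearly unique, so a count-min sketch is skipped for it.
constexpr double MaxDistinctRatioForSketch = 0.8;

// Mirrors the quoted `ndv >= 0.8 * n` check.
constexpr bool IsNearlyUnique(double n, double ndv) {
    return ndv >= MaxDistinctRatioForSketch * n;
}
```

For example, `IsNearlyUnique(10, 9)` holds because 9 of 10 values are distinct, while `IsNearlyUnique(10, 5)` does not.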


size_t ColumnCount() const {
return Columns.size();
const double c = 10;

Copilot AI Dec 8, 2025

The constant c = 10 in the epsilon calculation is a magic number without explanation. This should be defined as a named constant with a comment explaining its purpose in the count-min sketch parameter calculation formula.

Comment on lines +678 to +679
// operation.Types field is not used, TAnalyzeActor will determine suitable
// statistic types itself.

Copilot AI Dec 8, 2025

The comment says "operation.Types field is not used" but this statement is incomplete. It should clarify what this field was previously used for and why it's no longer needed, to help future maintainers understand the change.

Suggested change
// operation.Types field is not used, TAnalyzeActor will determine suitable
// statistic types itself.
// Previously, the operation.Types field was used to specify which statistic types
// should be collected and analyzed for each force traversal operation. This approach
// was replaced to allow TAnalyzeActor to determine the suitable statistic types itself,
// based on the current table schema and configuration. As a result, operation.Types is
// no longer needed and is ignored here.

azevaykin previously approved these changes Dec 8, 2025
@github-actions

github-actions bot commented Dec 9, 2025

2025-12-09 08:57:59 UTC Pre-commit check linux-x86_64-relwithdebinfo for 2f844a1 has started.
2025-12-09 08:58:16 UTC Artifacts will be uploaded here
2025-12-09 09:00:31 UTC ya make is running...
🟡 2025-12-09 11:07:04 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 41865 | 38936 | 0 | 4 | 2902 | 23 |

2025-12-09 11:07:17 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-12-09 11:19:52 UTC Tests successful.

Ya make output | Test bloat | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 48 (only retried tests) | 34 | 0 | 0 | 0 | 14 |

🟢 2025-12-09 11:20:00 UTC Build successful.
🟢 2025-12-09 11:20:22 UTC ydbd size 2.3 GiB changed* by +54.7 KiB, which is < 100.0 KiB vs main: OK

| ydbd size dash | main: 059973d | merge: 2f844a1 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 2 466 721 552 Bytes | 2 466 777 528 Bytes | +54.7 KiB | +0.002% |
| ydbd stripped size | 524 960 640 Bytes | 524 970 976 Bytes | +10.1 KiB | +0.002% |

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions
Copy link

github-actions bot commented Dec 9, 2025

2025-12-09 08:58:26 UTC Pre-commit check linux-x86_64-release-asan for 2f844a1 has started.
2025-12-09 08:58:45 UTC Artifacts will be uploaded here
2025-12-09 09:01:01 UTC ya make is running...
🟡 2025-12-09 10:58:02 UTC Some tests failed, follow the links below. This failure is not in the blocking policy yet.

Ya make output | Test bloat

| TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED? |
|---|---|---|---|---|---|
| 13577 | 13495 | 0 | 65 | 7 | 10 |

🟢 2025-12-09 10:58:12 UTC Build successful.
🟡 2025-12-09 10:58:48 UTC ydbd size 3.8 GiB changed* by +101.4 KiB, which is >= 100.0 KiB vs main: Warning

| ydbd size dash | main: 059973d | merge: 2f844a1 | diff | diff % |
|---|---|---|---|---|
| ydbd size | 4 130 791 632 Bytes | 4 130 895 504 Bytes | +101.4 KiB | +0.003% |
| ydbd stripped size | 1 533 426 776 Bytes | 1 533 470 040 Bytes | +42.2 KiB | +0.003% |

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@ztlpn ztlpn enabled auto-merge (squash) December 9, 2025 12:00
@ztlpn ztlpn merged commit cd0aab0 into ydb-platform:main Dec 9, 2025
9 checks passed
@ydbot
Collaborator

ydbot commented Dec 9, 2025

Backport

To backport this PR, click the button next to the target branch and then click "Run workflow" in the Run Actions UI.

Branch Run
stable-25-2, stable-25-2-1, stable-25-3, stable-25-3-1 ▶  Backport
stable-25-3, stable-25-3-1 ▶  Backport
stable-25-3 ▶  Backport

▶  Backport manual

3 participants