-
Notifications
You must be signed in to change notification settings - Fork 739
Two-stage ANALYZE with adaptive count-min sketch params #30206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
🟢 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a two-stage ANALYZE process with adaptive count-min sketch parameters. The first stage calculates total row count and distinct value counts for each column using HyperLogLog. The second stage uses these statistics to determine which columns need count-min sketches and calculates optimal parameters (width and depth) based on the data distribution.
Key changes:
- Refactored test utilities to separate table creation from data population and statistics collection
- Introduced
IColumnStatisticEvalinterface for extensible statistics calculation - Added
TSelectBuilderclass to dynamically construct YQL queries for statistics aggregation - Modified statistics storage to support multiple statistic types per column
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| ydb/core/statistics/ut_common/ut_common.h | Added new test helper functions and structures, refactored function signatures to return TTableInfo |
| ydb/core/statistics/ut_common/ut_common.cpp | Implemented test helpers, changed table schemas from Uint64 to String for Value column |
| ydb/core/statistics/service/ut/ya.make | Added dependencies for UDF functions (digest, hyperloglog) |
| ydb/core/statistics/service/ut/ut_http_request.cpp | Updated tests to use new helper functions |
| ydb/core/statistics/service/ut/ut_column_statistics.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/service/ut/ut_basic_statistics.cpp | Updated to use PrepareUniformTable instead of CreateUniformTable |
| ydb/core/statistics/events.h | Added SIMPLE_COLUMN stat type and TStatisticsItem structure, extended TEvFinishTraversal |
| ydb/core/statistics/database/ut/ut_database.cpp | Updated tests to use new TStatisticsItem structure |
| ydb/core/statistics/database/database.h | Changed CreateSaveStatisticsQuery signature to accept TStatisticsItem vector |
| ydb/core/statistics/database/database.cpp | Refactored save query to handle heterogeneous statistics items with optional column tags |
| ydb/core/statistics/aggregator/ya.make | Added new source files for select builder |
| ydb/core/statistics/aggregator/ut/ya.make | Added UDF dependencies |
| ydb/core/statistics/aggregator/ut/ut_traverse_datashard.cpp | Updated tests to use PrepareUniformTable |
| ydb/core/statistics/aggregator/ut/ut_traverse_columnshard.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/aggregator/ut/ut_analyze_datashard.cpp | Added validation functions, updated tests |
| ydb/core/statistics/aggregator/ut/ut_analyze_columnshard.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/aggregator/tx_finish_trasersal.cpp | Added handling for TEvFinishTraversal with statistics payload |
| ydb/core/statistics/aggregator/tx_aggr_stat_response.cpp | Added filtering to skip columns without names |
| ydb/core/statistics/aggregator/select_builder.h | New class for building YQL SELECT queries with UDAF aggregations |
| ydb/core/statistics/aggregator/select_builder.cpp | Implementation of query builder |
| ydb/core/statistics/aggregator/analyze_actor.h | Added two-stage processing, IColumnStatisticEval interface, refactored column descriptor |
| ydb/core/statistics/aggregator/analyze_actor.cpp | Implemented two-stage analysis with adaptive CMS parameters |
| ydb/core/statistics/aggregator/aggregator_impl.h | Added StatisticsToSave member |
| ydb/core/statistics/aggregator/aggregator_impl.cpp | Updated to handle new statistics items format |
| ydb/core/protos/statistics.proto | Added TSimpleColumnStatistics message |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
⚪ Ya make output | Test bloat | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 12 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto addColumn = [&](const TSysTables::TTableColumnInfo& colInfo) { | ||
| Columns.emplace_back(colInfo.Id, colInfo.PType, colInfo.Name); | ||
| // TODO: escape column names | ||
| Columns.back().CountDistinctSeq = stage1Builder.AddBuiltinAggregation( |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "TODO: escape column names" but the column names are being passed directly to the query builder. If this is a security or correctness concern that needs to be addressed, it should be tracked properly. If not, the TODO should be removed.
|
|
||
| // TODO: escape table path | ||
| auto table = "/" + JoinVectorIntoString(entry.Path, "/"); | ||
| TableName = "/" + JoinVectorIntoString(entry.Path, "/"); |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "TODO: escape table path" but the table path is being used directly in the query. If this is a security or correctness concern that needs to be addressed, it should be tracked properly. If not, the TODO should be removed.
| }; | ||
|
|
||
| enum EStatType { | ||
| SIMPLE = 0, |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The enum value SIMPLE_COLUMN (formerly HYPER_LOG_LOG) lacks documentation explaining what "simple column statistics" means and what data it represents. Consider adding a comment to document this statistic type.
| SIMPLE = 0, | |
| SIMPLE = 0, | |
| // SIMPLE_COLUMN represents simple statistics for a specific column, such as row count and byte size. | |
| // It is used to store basic metrics for individual columns, as opposed to the whole table. |
| size_t ColumnCount() const { | ||
| return Columns.size(); | ||
| const double c = 10; | ||
| const double eps = (c - 1) * (1 + std::log10(n / ndv)) / ndv; |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calculation (c - 1) * (1 + std::log10(n / ndv)) / ndv could potentially result in division by zero or very small ndv values leading to extremely large epsilon values. Although there's a check for ndv == 0 on line 27, if ndv is very small (e.g., 1), the epsilon calculation and subsequent width calculation could produce unexpected results. Consider adding validation or clamping for the calculated width to ensure it stays within reasonable bounds.
| ui32 TSelectBuilder::AddUDAFAggregation(TString columnName, const TStringBuf& udafName, TArgs&&... params) { | ||
| auto factory = AddFactory(udafName); | ||
|
|
||
| // TODO: parameters escaping/binding |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "TODO: parameters escaping/binding" but this is a potential security/correctness issue. If this TODO is not being addressed in this PR, it should be tracked properly as it relates to SQL injection prevention.
| if (!statEval) { | ||
| continue; | ||
| } | ||
| if (statEval->EstimateSize() >= 4_MB) { |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 4_MB threshold for statistic size is a magic number. It should be defined as a named constant (e.g., constexpr size_t MAX_STATISTIC_SIZE = 4_MB) to improve code maintainability and make it easier to adjust this limit in the future.
| res << " FROM `" << table << "`"; | ||
| return res; | ||
| } | ||
| if (ndv >= 0.8 * n) { |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition ndv >= 0.8 * n uses a magic number (0.8). This threshold for determining when to skip count-min sketch should be defined as a named constant (e.g., constexpr double NDV_THRESHOLD = 0.8) to make the code more maintainable and document the reasoning behind this value.
|
|
||
| size_t ColumnCount() const { | ||
| return Columns.size(); | ||
| const double c = 10; |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant c = 10 in the epsilon calculation is a magic number without explanation. This should be defined as a named constant with a comment explaining its purpose in the count-min sketch parameter calculation formula.
| // operation.Types field is not used, TAnalyzeActor will determine suitable | ||
| // statistic types itself. |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "operation.Types field is not used" but this statement is incomplete. It should clarify what this field was previously used for and why it's no longer needed, to help future maintainers understand the change.
| // operation.Types field is not used, TAnalyzeActor will determine suitable | |
| // statistic types itself. | |
| // Previously, the operation.Types field was used to specify which statistic types | |
| // should be collected and analyzed for each force traversal operation. This approach | |
| // was replaced to allow TAnalyzeActor to determine the suitable statistic types itself, | |
| // based on the current table schema and configuration. As a result, operation.Types is | |
| // no longer needed and is ignored here. |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
Changelog category
Description for reviewers
Calculate column statistics in two stages, first stage determines total count and number of distinct values for each column, second stage calculates count-min sketches for some columns.