feat(tdigest): implement TDIGEST.TRIMMED_MEAN command #3312
chakkk309 wants to merge 23 commits into apache:unstable from
Conversation
# Conflicts:
#	src/types/redis_tdigest.h
#	src/types/tdigest.h
LindaSummer
left a comment
Hi @chakkk309 ,
😊 It seems this commit couldn't pass CI. Please check the error messages in GitHub Actions.
src/types/redis_tdigest.cc
Outdated
```cpp
if (auto status = dumpCentroids(ctx, ns_key, metadata, &centroids); !status.ok()) {
  return status;
}
auto dump_centroids = DummyCentroids(metadata, centroids);
```
Hi @chakkk309 ,
It seems this line has a compile error in CI. Please take a look.
Hi, thank you for your contribution! Before you start coding, could you please read our contribution guide (https://kvrocks.apache.org/community/contributing/)? It can be better if you build and test Kvrocks against your changes successfully in your local before pushing them. Also note that we have guidelines for AI-assisted contributions: https://kvrocks.apache.org/community/contributing/#guidelines-for-ai-assisted-contributions |
Hi @chakkk309 , Thanks very much for your effort. I will review later today.😊
Pull request overview
Implements the Redis-compatible TDIGEST.TRIMMED_MEAN command in Kvrocks’ TDigest module (Fixes #3066), exposing the functionality through the command layer and adding unit test coverage.
Changes:
- Add the core trimmed-mean computation helper (`TDigestTrimmedMean`) to the TDigest algorithm utilities.
- Wire trimmed mean into the Redis TDigest type (`redis::TDigest::TrimmedMean`) and register the `tdigest.trimmed_mean` command.
- Add Go and C++ unit tests for `TDIGEST.TRIMMED_MEAN`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/gocase/unit/type/tdigest/tdigest_test.go | Adds Go integration tests for TDIGEST.TRIMMED_MEAN including argument/quantile validation cases. |
| tests/cppunit/types/tdigest_test.cc | Adds a C++ unit test for trimmed mean behavior on a basic dataset. |
| src/types/tdigest.h | Introduces TDigestTrimmedMean helper to compute trimmed mean from centroids. |
| src/types/redis_tdigest.h | Adds TDigestTrimmedMeanResult and the TDigest::TrimmedMean API. |
| src/types/redis_tdigest.cc | Implements TDigest::TrimmedMean by dumping centroids and calling the helper. |
| src/commands/cmd_tdigest.cc | Adds CommandTDigestTrimmedMean and registers tdigest.trimmed_mean. |
src/types/tdigest.h
Outdated
```cpp
double low_boundary = std::numeric_limits<double>::quiet_NaN();
double high_boundary = std::numeric_limits<double>::quiet_NaN();

if (low_cut_quantile == 0.0) {
  low_boundary = td.Min();
} else {
  auto low_result = TDigestQuantile(td, low_cut_quantile);
  if (!low_result) {
    return low_result;
  }
  low_boundary = *low_result;
}

if (high_cut_quantile == 1.0) {
  high_boundary = td.Max();
} else {
  auto high_result = TDigestQuantile(td, high_cut_quantile);
  if (!high_result) {
    return high_result;
  }
  high_boundary = *high_result;
}

auto iter = td.Begin();
double total_weight_in_range = 0;
double weighted_sum = 0;

while (iter->Valid()) {
  auto centroid = GET_OR_RET(iter->GetCentroid());

  if ((low_cut_quantile == 0.0 && high_cut_quantile == 1.0) ||
      (centroid.mean >= low_boundary && centroid.mean <= high_boundary)) {
    total_weight_in_range += centroid.weight;
    weighted_sum += centroid.mean * centroid.weight;
  }

  iter->Next();
}

if (total_weight_in_range == 0) {
```
TDigestTrimmedMean can incorrectly return NaN when the low/high cut boundaries fall between centroid means (e.g., after quantile interpolation). The current logic only includes whole centroids whose mean is within [low_boundary, high_boundary], so it may exclude all centroids even though the quantile range contains weight. Consider computing overlap by weight position (e.g., iterate centroids with cumulative weight and add partial centroid weight for the boundary centroids) rather than filtering by centroid.mean.
Suggested change:

```cpp
// First, compute the total weight of the t-digest.
double total_weight = 0.0;
{
  auto iter = td.Begin();
  while (iter->Valid()) {
    auto centroid = GET_OR_RET(iter->GetCentroid());
    total_weight += centroid.weight;
    iter->Next();
  }
}
if (total_weight == 0.0) {
  return std::numeric_limits<double>::quiet_NaN();
}
// If no trimming is requested, just return the global weighted mean.
if (low_cut_quantile == 0.0 && high_cut_quantile == 1.0) {
  double weighted_sum = 0.0;
  auto iter = td.Begin();
  while (iter->Valid()) {
    auto centroid = GET_OR_RET(iter->GetCentroid());
    weighted_sum += centroid.mean * centroid.weight;
    iter->Next();
  }
  return weighted_sum / total_weight;
}
// Compute rank boundaries in weight space.
const double low_rank = low_cut_quantile * total_weight;
const double high_rank = high_cut_quantile * total_weight;
double cumulative_weight = 0.0;
double total_weight_in_range = 0.0;
double weighted_sum = 0.0;
auto iter = td.Begin();
while (iter->Valid()) {
  auto centroid = GET_OR_RET(iter->GetCentroid());
  const double start_rank = cumulative_weight;
  const double end_rank = cumulative_weight + centroid.weight;
  // If this centroid is entirely before the trimmed region, skip it.
  if (end_rank <= low_rank) {
    cumulative_weight = end_rank;
    iter->Next();
    continue;
  }
  // If we've passed the trimmed region, we can stop.
  if (start_rank >= high_rank) {
    break;
  }
  // Compute overlap of this centroid's weight with [low_rank, high_rank).
  double overlap_start = start_rank;
  if (overlap_start < low_rank) {
    overlap_start = low_rank;
  }
  double overlap_end = end_rank;
  if (overlap_end > high_rank) {
    overlap_end = high_rank;
  }
  const double overlap = overlap_end - overlap_start;
  if (overlap > 0.0) {
    total_weight_in_range += overlap;
    weighted_sum += centroid.mean * overlap;
  }
  cumulative_weight = end_rank;
  iter->Next();
}
if (total_weight_in_range == 0.0) {
```
```go
if meanStr == "nan" {
	return
}
mean, err := strconv.ParseFloat(meanStr, 64)
require.NoError(t, err)
require.Greater(t, mean, 0.0)
```
This test allows "nan" and returns early, which can mask real correctness issues (a non-empty digest with low_cut < high_cut should always have some weight in the trimmed range). It would be better to assert the result is not NaN for this dataset and verify it’s within an expected numeric range/value.
Suggested change:

```go
mean, err := strconv.ParseFloat(meanStr, 64)
require.NoError(t, err)
require.False(t, math.IsNaN(mean))
require.Greater(t, mean, 4.0)
require.Less(t, mean, 7.0)
```
LindaSummer
left a comment
Hi @chakkk309 ,
Left some comments, and please also go through Copilot's review comments. 😊
We should run the test cases on Redis and reproduce the same results in our code.
Test cases can be generated with AI's assistance, but correctness and stability are the most important principles.
Please follow the Kvrocks community's policy on AI-generated code if any of the code was generated by AI.
We'd better run our cases on Redis before testing on Kvrocks and confirm the results are the same.
Best Regards,
Edward
```cpp
if (!high_cut_quantile) {
  return {Status::RedisParseErr, errValueIsNotFloat};
}
high_cut_quantile_ = *high_cut_quantile;
```
Check the validation of high_cut_quantile and low_cut_quantile.
Parameter validation should happen at the earliest step (command parsing) rather than during command processing.
Thanks for the review! I've moved the validation to Parse stage and removed the duplicate checks in the TDigestTrimmedMean method.
src/types/tdigest.h
Outdated
```cpp
if (low_cut_quantile < 0.0 || low_cut_quantile > 1.0) {
  return Status{Status::InvalidArgument, "low cut quantile must be between 0 and 1"};
}
if (high_cut_quantile < 0.0 || high_cut_quantile > 1.0) {
  return Status{Status::InvalidArgument, "high cut quantile must be between 0 and 1"};
}
if (low_cut_quantile >= high_cut_quantile) {
  return Status{Status::InvalidArgument, "low cut quantile must be less than high cut quantile"};
}
```
Move this to the command parse step.
We could keep a guard here, but the validation belongs in the parsing step.
src/types/tdigest.h
Outdated
```cpp
double low_boundary = std::numeric_limits<double>::quiet_NaN();
double high_boundary = std::numeric_limits<double>::quiet_NaN();

if (low_cut_quantile == 0.0) {
```
Use a more stable way of comparing doubles.
src/types/tdigest.h
Outdated
```cpp
if (high_cut_quantile == 1.0) {
  high_boundary = td.Max();
} else {
  auto high_result = TDigestQuantile(td, high_cut_quantile);
```
Iterate through all the centroids to get the ones within the boundaries.
TDigestQuantile returns an estimated, linearly interpolated value with edge cases handled, rather than the real centroids you need.
Plus, you iterate the centroids twice after getting the quantiles.
With direct iteration, a single scan is enough.
```cpp
  EXPECT_TRUE(std::isinf(result[3])) << "Rank >= total_weight should be infinity";
}

TEST_F(RedisTDigestTest, TrimmedMean) {
```
Add cases for invalid arguments and more unordered and complex inputs.
```go
require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.InDelta(t, 5.5, mean, 1.0)
```
Is the delta 1.0 too large for this test case?
Thanks for the feedback! I verified this on Redis and the result is exactly 5.5, so I'll tighten the delta.
```go
require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.Less(t, mean, 50.0)
```
Why do we use Less rather than checking a precise result?
Will fix. I'll replace it with a precise value verified on Redis.
```go
require.ErrorContains(t, rdb.Do(ctx, "TDIGEST.TRIMMED_MEAN", key, "-0.1", "0.9").Err(), "low cut quantile must be between 0 and 1")
require.ErrorContains(t, rdb.Do(ctx, "TDIGEST.TRIMMED_MEAN", key, "0.1", "1.1").Err(), "high cut quantile must be between 0 and 1")
require.ErrorContains(t, rdb.Do(ctx, "TDIGEST.TRIMMED_MEAN", key, "0.9", "0.1").Err(), "low cut quantile must be less than high cut quantile")
```
The error messages could be constant strings to reduce duplication.
Thanks for the suggestion! I'll extract the error message strings as constants.
```go
require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.InDelta(t, 42.0, mean, 0.001)
```
Maybe we could use a consistent delta precision across all cases?
```go
}
mean, err := strconv.ParseFloat(meanStr, 64)
require.NoError(t, err)
require.Greater(t, mean, 0.0)
```
We should use precise values in the test cases for stability and correctness.
Got it, thanks! I've verified the tests on redis and updated them accordingly.
…N-command' into feat-implement-TDIGEST.TRIMMED_MEAN-command
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
```cpp
auto low_cut_quantile = ParseFloat(args[2]);
if (!low_cut_quantile) {
  return {Status::RedisParseErr, errValueIsNotFloat};
}
low_cut_quantile_ = *low_cut_quantile;

auto high_cut_quantile = ParseFloat(args[3]);
if (!high_cut_quantile) {
  return {Status::RedisParseErr, errValueIsNotFloat};
}
high_cut_quantile_ = *high_cut_quantile;

if (low_cut_quantile_ < 0.0 || low_cut_quantile_ > 1.0) {
  return {Status::RedisParseErr, errLowCutQuantileRange};
}
if (high_cut_quantile_ < 0.0 || high_cut_quantile_ > 1.0) {
  return {Status::RedisParseErr, errHighCutQuantileRange};
}
if (DoubleCompare(low_cut_quantile_, high_cut_quantile_) >= 0) {
  return {Status::RedisParseErr, errLowCutQuantileLess};
}
```
The quantile validation doesn’t handle NaN: ParseFloat("nan") succeeds, but comparisons like < 0.0 / > 1.0 will be false, and DoubleCompare with NaN will route to the "low cut quantile must be less..." error (or even allow invalid values through in other cases). Explicitly reject NaN (and ideally non-finite values) for both low_cut_quantile_ and high_cut_quantile_ so invalid inputs consistently return the intended range errors.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
LindaSummer
left a comment
Hi @chakkk309 ,
Thanks very much for your effort!
Left some comments. 😊
```cpp
}
high_cut_quantile_ = *high_cut_quantile;

if (!std::isfinite(low_cut_quantile_) || low_cut_quantile_ < 0.0 || low_cut_quantile_ > 1.0) {
```
Using string validation before numeric validation may be a more stable way to avoid unreliable floating-point comparisons.
The string would have to match ^(0(?:\.\d*)?)|(1(?:\.0*))$. Please double-check my regex.
We could also compare with a delta, but validating the literal text would be more stable.
I tested Redis's actual behavior and found that it accepts numeric forms like .5 and +0.5 for this command. If we add a strict string-level regex here, would that make it more restrictive than Redis?
To preserve Redis compatibility, I kept numeric parsing and aligned the validation with Redis. Could you please let me know if this approach would be acceptable? 👀
Thanks for your update.
In this case we can keep the float number validation.
Maybe we could add a decimal parser in the future. 😊
```cpp
count_add -= std::min(std::max(0.0, leftmost_weight - count_done), count_add);
count_add = std::min(std::max(0.0, rightmost_weight - count_done), count_add);
```
Could we have a comment on this? It may be hard to understand at first glance.
Sure, will add it.
src/commands/error_constants.h
Outdated
```cpp
inline constexpr const char *errNumkeysMustBePositive = "numkeys need to be a positive integer";
inline constexpr const char *errWrongKeyword = "wrong keyword";
inline constexpr const char *errInvalidRankValue = "rank needs to be non-negative";
inline constexpr const char *errLowCutQuantileRange = "low cut quantile must be between 0 and 1";
```
In Redis, I tested and got the error message below.
We'd better align with Redis's behavior.

```
localhost:6379> TDIGEST.TRIMMED_MEAN t -0.1 0.2
(error) ERR T-Digest: low_cut_percentile and high_cut_percentile should be in [0,1]
```
Thanks for the feedback, I tested on Redis and got these results:

```
> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n nan 0.9
ERR T-Digest: error parsing low_cut_percentile
> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n 0.1 nan
ERR T-Digest: error parsing high_cut_percentile
> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n -0.1 0.9
ERR T-Digest: low_cut_percentile and high_cut_percentile should be in [0,1]
> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n 0.9 0.1
ERR T-Digest: low_cut_percentile should be lower than high_cut_percentile
```
```go
require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.False(t, math.IsNaN(mean))
```
Maybe this check is redundant since we have the next line.
Yes, this line is useless and I will remove it.
Hi @chakkk309 , is this PR ready for review? Best Regards,
Yes, please have a review, thanks!
Fixes #3066