feat(tdigest): implement TDIGEST.TRIMMED_MEAN command #3312

Open
chakkk309 wants to merge 23 commits into apache:unstable from chakkk309:feat-implement-TDIGEST.TRIMMED_MEAN-command

@chakkk309 (Contributor)

Fixes #3066

@git-hulk git-hulk requested a review from LindaSummer December 27, 2025 14:47
@LindaSummer (Member) left a comment

Hi @chakkk309 ,

😊 It seems this commit doesn't pass CI. Please check the error messages in GitHub Actions.

if (auto status = dumpCentroids(ctx, ns_key, metadata, &centroids); !status.ok()) {
  return status;
}
auto dump_centroids = DummyCentroids(metadata, centroids);
Member

Hi @chakkk309 ,

It seems that this line causes a compile error in CI. Please check it.

@PragmaTwice (Member) commented Dec 27, 2025

Hi, thank you for your contribution!

Before you start coding, could you please read our contribution guide (https://kvrocks.apache.org/community/contributing/)? It would be better to build and test Kvrocks with your changes locally before pushing them.

Also note that we have guidelines for AI-assisted contributions: https://kvrocks.apache.org/community/contributing/#guidelines-for-ai-assisted-contributions

@chakkk309 chakkk309 requested a review from LindaSummer March 4, 2026 06:34
@LindaSummer LindaSummer requested review from Copilot and removed request for LindaSummer March 4, 2026 06:36
@LindaSummer (Member)

Hi @chakkk309 ,

Thanks very much for your effort. I will review later today.😊

Copilot AI (Contributor) left a comment

Pull request overview

Implements the Redis-compatible TDIGEST.TRIMMED_MEAN command in Kvrocks’ TDigest module (Fixes #3066), exposing the functionality through the command layer and adding unit test coverage.

Changes:

  • Add core trimmed-mean computation helper (TDigestTrimmedMean) to the TDigest algorithm utilities.
  • Wire trimmed-mean into the Redis TDigest type (redis::TDigest::TrimmedMean) and register the tdigest.trimmed_mean command.
  • Add Go and C++ unit tests for TDIGEST.TRIMMED_MEAN.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File | Description
tests/gocase/unit/type/tdigest/tdigest_test.go | Adds Go integration tests for TDIGEST.TRIMMED_MEAN including argument/quantile validation cases.
tests/cppunit/types/tdigest_test.cc | Adds a C++ unit test for trimmed mean behavior on a basic dataset.
src/types/tdigest.h | Introduces TDigestTrimmedMean helper to compute trimmed mean from centroids.
src/types/redis_tdigest.h | Adds TDigestTrimmedMeanResult and the TDigest::TrimmedMean API.
src/types/redis_tdigest.cc | Implements TDigest::TrimmedMean by dumping centroids and calling the helper.
src/commands/cmd_tdigest.cc | Adds CommandTDigestTrimmedMean and registers tdigest.trimmed_mean.

Comment on lines +329 to +368
double low_boundary = std::numeric_limits<double>::quiet_NaN();
double high_boundary = std::numeric_limits<double>::quiet_NaN();

if (low_cut_quantile == 0.0) {
  low_boundary = td.Min();
} else {
  auto low_result = TDigestQuantile(td, low_cut_quantile);
  if (!low_result) {
    return low_result;
  }
  low_boundary = *low_result;
}

if (high_cut_quantile == 1.0) {
  high_boundary = td.Max();
} else {
  auto high_result = TDigestQuantile(td, high_cut_quantile);
  if (!high_result) {
    return high_result;
  }
  high_boundary = *high_result;
}

auto iter = td.Begin();
double total_weight_in_range = 0;
double weighted_sum = 0;

while (iter->Valid()) {
  auto centroid = GET_OR_RET(iter->GetCentroid());

  if ((low_cut_quantile == 0.0 && high_cut_quantile == 1.0) ||
      (centroid.mean >= low_boundary && centroid.mean <= high_boundary)) {
    total_weight_in_range += centroid.weight;
    weighted_sum += centroid.mean * centroid.weight;
  }

  iter->Next();
}

if (total_weight_in_range == 0) {
Copilot AI Mar 4, 2026

TDigestTrimmedMean can incorrectly return NaN when the low/high cut boundaries fall between centroid means (e.g., after quantile interpolation). The current logic only includes whole centroids whose mean is within [low_boundary, high_boundary], so it may exclude all centroids even though the quantile range contains weight. Consider computing overlap by weight position (e.g., iterate centroids with cumulative weight and add partial centroid weight for the boundary centroids) rather than filtering by centroid.mean.

Suggested change
// First, compute the total weight of the t-digest.
double total_weight = 0.0;
{
  auto iter = td.Begin();
  while (iter->Valid()) {
    auto centroid = GET_OR_RET(iter->GetCentroid());
    total_weight += centroid.weight;
    iter->Next();
  }
}
if (total_weight == 0.0) {
  return std::numeric_limits<double>::quiet_NaN();
}
// If no trimming is requested, just return the global weighted mean.
if (low_cut_quantile == 0.0 && high_cut_quantile == 1.0) {
  double weighted_sum = 0.0;
  auto iter = td.Begin();
  while (iter->Valid()) {
    auto centroid = GET_OR_RET(iter->GetCentroid());
    weighted_sum += centroid.mean * centroid.weight;
    iter->Next();
  }
  return weighted_sum / total_weight;
}
// Compute rank boundaries in weight space.
const double low_rank = low_cut_quantile * total_weight;
const double high_rank = high_cut_quantile * total_weight;
double cumulative_weight = 0.0;
double total_weight_in_range = 0.0;
double weighted_sum = 0.0;
auto iter = td.Begin();
while (iter->Valid()) {
  auto centroid = GET_OR_RET(iter->GetCentroid());
  const double start_rank = cumulative_weight;
  const double end_rank = cumulative_weight + centroid.weight;
  // If this centroid is entirely before the trimmed region, skip it.
  if (end_rank <= low_rank) {
    cumulative_weight = end_rank;
    iter->Next();
    continue;
  }
  // If we've passed the trimmed region, we can stop.
  if (start_rank >= high_rank) {
    break;
  }
  // Compute overlap of this centroid's weight with [low_rank, high_rank).
  double overlap_start = start_rank;
  if (overlap_start < low_rank) {
    overlap_start = low_rank;
  }
  double overlap_end = end_rank;
  if (overlap_end > high_rank) {
    overlap_end = high_rank;
  }
  const double overlap = overlap_end - overlap_start;
  if (overlap > 0.0) {
    total_weight_in_range += overlap;
    weighted_sum += centroid.mean * overlap;
  }
  cumulative_weight = end_rank;
  iter->Next();
}
if (total_weight_in_range == 0.0) {
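
To make the difference between the two approaches in this comment concrete, here is a minimal self-contained sketch. It assumes a plain `Centroid` struct and a vector sorted by mean, both hypothetical stand-ins for the Kvrocks types, and the boundary values passed to the mean filter stand for whatever an interpolating quantile estimate might return.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for a t-digest centroid; not the Kvrocks type.
struct Centroid {
  double mean;
  double weight;
};

// Filter by centroid mean: keeps only centroids whose mean lies inside
// the value boundaries [low_boundary, high_boundary].
double MeanFilterTrimmedMean(const std::vector<Centroid>& centroids, double low_boundary,
                             double high_boundary) {
  double weight = 0.0;
  double sum = 0.0;
  for (const auto& c : centroids) {
    if (c.mean >= low_boundary && c.mean <= high_boundary) {
      weight += c.weight;
      sum += c.mean * c.weight;
    }
  }
  return weight == 0.0 ? std::numeric_limits<double>::quiet_NaN() : sum / weight;
}

// Filter by rank: intersects each centroid's rank interval with
// [low_cut * total, high_cut * total) and weights the mean by the overlap.
// Assumes the centroids are sorted by mean.
double RankOverlapTrimmedMean(const std::vector<Centroid>& centroids, double low_cut,
                              double high_cut) {
  assert(low_cut < high_cut);
  double total = 0.0;
  for (const auto& c : centroids) total += c.weight;
  if (total == 0.0) return std::numeric_limits<double>::quiet_NaN();

  const double low_rank = low_cut * total;
  const double high_rank = high_cut * total;
  double cumulative = 0.0;
  double weight = 0.0;
  double sum = 0.0;
  for (const auto& c : centroids) {
    const double start = cumulative;
    const double end = cumulative + c.weight;
    cumulative = end;
    if (end <= low_rank) continue;  // entirely below the kept range
    if (start >= high_rank) break;  // entirely above the kept range
    const double overlap = std::min(end, high_rank) - std::max(start, low_rank);
    if (overlap > 0.0) {
      weight += overlap;
      sum += c.mean * overlap;
    }
  }
  return weight == 0.0 ? std::numeric_limits<double>::quiet_NaN() : sum / weight;
}
```

With two unit-weight centroids at 1 and 10 and cuts 0.4 and 0.6, boundaries such as 4 and 6 (illustrative values only) contain no centroid mean, so the mean filter yields NaN, while the rank-overlap version takes a weight of 0.2 from each centroid and yields approximately 5.5.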

Comment on lines +808 to +813
if meanStr == "nan" {
    return
}
mean, err := strconv.ParseFloat(meanStr, 64)
require.NoError(t, err)
require.Greater(t, mean, 0.0)
Copilot AI Mar 4, 2026

This test allows "nan" and returns early, which can mask real correctness issues (a non-empty digest with low_cut < high_cut should always have some weight in the trimmed range). It would be better to assert the result is not NaN for this dataset and verify it’s within an expected numeric range/value.

Suggested change
mean, err := strconv.ParseFloat(meanStr, 64)
require.NoError(t, err)
require.False(t, math.IsNaN(mean))
require.Greater(t, mean, 4.0)
require.Less(t, mean, 7.0)

@LindaSummer (Member) left a comment

Hi @chakkk309 ,

Left some comments, and please also go through Copilot's review comments. 😊

We should run the test cases on Redis and reproduce the same results in our code.

Test cases can be generated with AI assistance, but correctness and stability are the most important principles.

Please follow the Kvrocks community's policy on AI-generated code if any of the code was generated by AI.

We should run our cases on Redis before testing on Kvrocks and confirm that the results are the same.

Best Regards,
Edward

if (!high_cut_quantile) {
  return {Status::RedisParseErr, errValueIsNotFloat};
}
high_cut_quantile_ = *high_cut_quantile;
Member

Check the validation of high_cut_quantile and low_cut_quantile.
Parameter validation should be done at the earliest step rather than during command processing.

Contributor Author

Thanks for the review! I've moved the validation to the Parse stage and removed the duplicate checks in the TDigestTrimmedMean method.

Comment on lines +319 to +327
if (low_cut_quantile < 0.0 || low_cut_quantile > 1.0) {
  return Status{Status::InvalidArgument, "low cut quantile must be between 0 and 1"};
}
if (high_cut_quantile < 0.0 || high_cut_quantile > 1.0) {
  return Status{Status::InvalidArgument, "high cut quantile must be between 0 and 1"};
}
if (low_cut_quantile >= high_cut_quantile) {
  return Status{Status::InvalidArgument, "low cut quantile must be less than high cut quantile"};
}
Member

Move this to the command parse step.
We could keep a guard here, but the validation belongs in the parsing step.

double low_boundary = std::numeric_limits<double>::quiet_NaN();
double high_boundary = std::numeric_limits<double>::quiet_NaN();

if (low_cut_quantile == 0.0) {
Member

Use a more stable way of comparing doubles.
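
One common way to do this is a tolerance-based comparison. The sketch below is only an illustration: `ApproxEqual` and its default tolerance are hypothetical and are not the project's own helper (a later revision in this thread calls `DoubleCompare`, whose implementation is not shown here).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most rel_eps, scaled by
// the larger magnitude (with a floor of 1.0 so values near zero are compared
// absolutely). NaN never compares equal to anything.
inline bool ApproxEqual(double a, double b, double rel_eps = 1e-9) {
  assert(rel_eps >= 0.0);
  if (a == b) return true;  // handles exact matches, including equal infinities
  const double scale = std::max({1.0, std::fabs(a), std::fabs(b)});
  return std::fabs(a - b) <= rel_eps * scale;
}
```

For example, `0.1 + 0.2 == 0.3` is false with IEEE 754 doubles, while `ApproxEqual(0.1 + 0.2, 0.3)` is true. The tolerance is a design choice: too large and distinct quantiles are treated as equal, too small and the helper degenerates to `==`.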

if (high_cut_quantile == 1.0) {
  high_boundary = td.Max();
} else {
  auto high_result = TDigestQuantile(td, high_cut_quantile);
Member

Iterate through all the centroids to find those within the boundaries.
TDigestQuantile returns a linearly interpolated estimate (with edge cases handled) rather than the actual centroids you need.

Member

Also, you iterate over the centroids a second time after getting the quantile.
With direct iteration, a single scan is enough.

EXPECT_TRUE(std::isinf(result[3])) << "Rank >= total_weight should be infinity";
}

TEST_F(RedisTDigestTest, TrimmedMean) {
Member

Add cases for invalid arguments and more unordered and complex inputs.

require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.InDelta(t, 5.5, mean, 1.0)
Member

Is the delta 1.0 too large for this test case?

Contributor Author

Thanks for the feedback! I verified this on Redis and the result is exactly 5.5, so I'll tighten the delta.

require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.Less(t, mean, 50.0)
Member

Why do we use Less rather than a precise result?

Contributor Author

Will fix. I'll replace it with a precise value verified on Redis.


require.ErrorContains(t, rdb.Do(ctx, "TDIGEST.TRIMMED_MEAN", key, "-0.1", "0.9").Err(), "low cut quantile must be between 0 and 1")
require.ErrorContains(t, rdb.Do(ctx, "TDIGEST.TRIMMED_MEAN", key, "0.1", "1.1").Err(), "high cut quantile must be between 0 and 1")
require.ErrorContains(t, rdb.Do(ctx, "TDIGEST.TRIMMED_MEAN", key, "0.9", "0.1").Err(), "low cut quantile must be less than high cut quantile")
Member

The error messages could be constant strings to reduce duplication.

Contributor Author

Thanks for the suggestion! I'll extract the error message strings as constants.

require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.InDelta(t, 42.0, mean, 0.001)
@LindaSummer (Member) Mar 4, 2026

Maybe we could use a consistent precision for the delta in all cases?

}
mean, err := strconv.ParseFloat(meanStr, 64)
require.NoError(t, err)
require.Greater(t, mean, 0.0)
Member

We should use precise values in test cases for stability and correctness.

Contributor Author

Got it, thanks! I've verified the tests on Redis and updated them accordingly.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment on lines +504 to +524
auto low_cut_quantile = ParseFloat(args[2]);
if (!low_cut_quantile) {
  return {Status::RedisParseErr, errValueIsNotFloat};
}
low_cut_quantile_ = *low_cut_quantile;

auto high_cut_quantile = ParseFloat(args[3]);
if (!high_cut_quantile) {
  return {Status::RedisParseErr, errValueIsNotFloat};
}
high_cut_quantile_ = *high_cut_quantile;

if (low_cut_quantile_ < 0.0 || low_cut_quantile_ > 1.0) {
  return {Status::RedisParseErr, errLowCutQuantileRange};
}
if (high_cut_quantile_ < 0.0 || high_cut_quantile_ > 1.0) {
  return {Status::RedisParseErr, errHighCutQuantileRange};
}
if (DoubleCompare(low_cut_quantile_, high_cut_quantile_) >= 0) {
  return {Status::RedisParseErr, errLowCutQuantileLess};
}
Copilot AI Mar 9, 2026

The quantile validation doesn’t handle NaN: ParseFloat("nan") succeeds, but comparisons like < 0.0 / > 1.0 will be false, and DoubleCompare with NaN will route to the "low cut quantile must be less..." error (or even allow invalid values through in other cases). Explicitly reject NaN (and ideally non-finite values) for both low_cut_quantile_ and high_cut_quantile_ so invalid inputs consistently return the intended range errors.
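
The NaN behaviour described here follows from IEEE 754: every ordered comparison involving NaN is false. A minimal sketch of the two range checks, with hypothetical function names that do not appear in the PR:

```cpp
#include <cassert>
#include <cmath>

// Range check in the style of the diff above: both comparisons are false for
// NaN, so NaN is not reported as out of range.
inline bool OutOfRangeNaive(double q) { return q < 0.0 || q > 1.0; }

// Guarded variant: reject NaN and infinities before the ordered comparisons.
inline bool OutOfRangeGuarded(double q) { return !std::isfinite(q) || q < 0.0 || q > 1.0; }

// Small self-check that documents the behaviour of both variants.
inline void SelfCheck() {
  const double nan = std::nan("");
  assert(!OutOfRangeNaive(nan));   // NaN slips through the naive check
  assert(OutOfRangeGuarded(nan));  // and is caught by the guarded one
  assert(OutOfRangeGuarded(INFINITY));
  assert(OutOfRangeGuarded(-0.1));
  assert(!OutOfRangeGuarded(0.0));
  assert(!OutOfRangeGuarded(1.0));
}
```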

chakkk309 and others added 2 commits March 9, 2026 23:24
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@chakkk309 chakkk309 requested a review from LindaSummer March 10, 2026 01:45
@LindaSummer (Member) left a comment

Hi @chakkk309 ,

Thanks very much for your effort!
Left some comments. 😊

}
high_cut_quantile_ = *high_cut_quantile;

if (!std::isfinite(low_cut_quantile_) || low_cut_quantile_ < 0.0 || low_cut_quantile_ > 1.0) {
Member

Validating the string before the numeric check may be a better way to avoid unstable floating-point comparisons.
The string must match ^(0(?:\.\d*)?)|(1(?:\.0*))$. Please double-check my regex.
We could also compare with a delta, but validating the literal text would be more stable.

Contributor Author

I tested Redis's actual behavior and found that it accepts numeric forms like .5 and +0.5 for this command. If we add a strict string-level regex here, would that make it more restrictive than Redis?

To preserve Redis compatibility, I kept numeric parsing and aligned the validation with Redis. Could you please let me know if this approach would be acceptable? 👀

Member

Thanks for your update.

In this case we could keep the float number validation.

Maybe we could add a decimal parser in the future. 😊
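
For anyone revisiting the string-validation idea, the regex proposed earlier in this thread can be checked with a short sketch using `std::regex` (ECMAScript grammar). `MatchesProposed` uses the pattern exactly as written in the review comment; `MatchesTightened` is a hypothetical variant, not something from the PR, that groups the alternation so both anchors apply to both branches and makes the fractional part of `1` optional.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Pattern exactly as proposed in the review comment. Because `|` has the lowest
// precedence, `^` applies only to the first branch and `$` only to the second.
inline bool MatchesProposed(const std::string& s) {
  static const std::regex re(R"(^(0(?:\.\d*)?)|(1(?:\.0*))$)");
  return std::regex_search(s, re);
}

// Hypothetical tightened variant: one anchored group around both branches.
inline bool MatchesTightened(const std::string& s) {
  static const std::regex re(R"(^(?:0(?:\.\d*)?|1(?:\.0*)?)$)");
  return std::regex_search(s, re);
}

// Self-check documenting how the two patterns differ.
inline void SelfCheck() {
  assert(MatchesProposed("0.5abc"));    // trailing garbage passes: no `$` on branch one
  assert(!MatchesProposed("1"));        // plain "1" fails: branch two requires a '.'
  assert(MatchesTightened("1"));
  assert(MatchesTightened("0.25"));
  assert(!MatchesTightened("0.5abc"));
  assert(!MatchesTightened("1.5"));
  // Forms the author reports Redis accepting are rejected by a strict literal check.
  assert(!MatchesTightened(".5"));
  assert(!MatchesTightened("+0.5"));
}
```

The last two assertions illustrate the compatibility concern raised above: even a corrected literal pattern is stricter than numeric parsing, which is why keeping the float validation preserves the reported Redis behavior.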

Comment on lines +333 to +334
count_add -= std::min(std::max(0.0, leftmost_weight - count_done), count_add);
count_add = std::min(std::max(0.0, rightmost_weight - count_done), count_add);
@LindaSummer (Member) Mar 11, 2026

Could we have a comment on this? It may be hard to understand at first glance.

@chakkk309 (Contributor Author) Mar 12, 2026

Sure, will add it.
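
As a reading aid for the two clamp lines, here is a minimal sketch of how they could sit inside a single pass over the centroids. Everything around the two lines is an assumption: the `Centroid` struct, the loop, and the meaning of `count_done` (weight of centroids already visited) and of `leftmost_weight` / `rightmost_weight` (rank bounds in weight units) are inferred from the names, not taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-in for a t-digest centroid; not the Kvrocks type.
struct Centroid {
  double mean;
  double weight;
};

// Assumes centroids are sorted by mean and the bounds are given in weight units.
double ClampedTrimmedMean(const std::vector<Centroid>& centroids, double leftmost_weight,
                          double rightmost_weight) {
  assert(leftmost_weight <= rightmost_weight);
  double count_done = 0.0;  // total weight of the centroids visited so far
  double trimmed_sum = 0.0;
  double trimmed_count = 0.0;
  for (const auto& c : centroids) {
    double count_add = c.weight;
    // Drop the part of this centroid that still lies below the left bound:
    // (leftmost_weight - count_done) is the weight left to skip, clamped to
    // [0, count_add] so it never goes negative or exceeds the centroid.
    count_add -= std::min(std::max(0.0, leftmost_weight - count_done), count_add);
    // Cap what remains at the weight available before the right bound, measured
    // from the start of this centroid; the result is 0 once the bound is passed.
    count_add = std::min(std::max(0.0, rightmost_weight - count_done), count_add);
    trimmed_sum += c.mean * count_add;
    trimmed_count += count_add;
    count_done += c.weight;
    if (count_done >= rightmost_weight) break;  // nothing further can contribute
  }
  if (trimmed_count == 0.0) return std::numeric_limits<double>::quiet_NaN();
  return trimmed_sum / trimmed_count;
}
```

Under these assumptions, four unit-weight centroids at 1, 2, 3, 4 with bounds 1 and 3 keep only the centroids at 2 and 3 and give 2.5. Note that the second clamp is measured from the centroid's start rather than from the end of the skipped part, so when one centroid straddles both bounds the weight kept can exceed the exact overlap with the range.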

inline constexpr const char *errNumkeysMustBePositive = "numkeys need to be a positive integer";
inline constexpr const char *errWrongKeyword = "wrong keyword";
inline constexpr const char *errInvalidRankValue = "rank needs to be non-negative";
inline constexpr const char *errLowCutQuantileRange = "low cut quantile must be between 0 and 1";
Member

I tested in Redis and got the error message below.
We should align with Redis's behavior.

localhost:6379> TDIGEST.TRIMMED_MEAN t -0.1 0.2
(error) ERR T-Digest: low_cut_percentile and high_cut_percentile should be in [0,1]

Contributor Author

Thanks for the feedback. I tested on Redis and got these results:

> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n nan 0.9
ERR T-Digest: error parsing low_cut_percentile

> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n 0.1 nan
ERR T-Digest: error parsing high_cut_percentile

> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n -0.1 0.9
ERR T-Digest: low_cut_percentile and high_cut_percentile should be in [0,1]

> redis-cli -p 6380 --raw TDIGEST.TRIMMED_MEAN n 0.9 0.1
ERR T-Digest: low_cut_percentile should be lower than high_cut_percentile
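
A minimal sketch of a validator that returns these messages, assuming the values have already been parsed to `double`. The constant names and `ValidateCuts` are hypothetical; the message texts are copied from the output above without the `ERR ` prefix. Treating NaN as a parse error mirrors the first two results, and rejecting equal cuts follows the `>=` check in the earlier diff; how Redis reports infinities or equal cuts is not shown in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <optional>
#include <string>

// Hypothetical constant names; texts taken from the Redis output quoted above.
inline constexpr const char *kErrParseLowCut = "T-Digest: error parsing low_cut_percentile";
inline constexpr const char *kErrParseHighCut = "T-Digest: error parsing high_cut_percentile";
inline constexpr const char *kErrCutRange =
    "T-Digest: low_cut_percentile and high_cut_percentile should be in [0,1]";
inline constexpr const char *kErrCutOrder =
    "T-Digest: low_cut_percentile should be lower than high_cut_percentile";

// Returns std::nullopt when the pair is valid, otherwise the error text.
inline std::optional<std::string> ValidateCuts(double low, double high) {
  if (std::isnan(low)) return std::string(kErrParseLowCut);
  if (std::isnan(high)) return std::string(kErrParseHighCut);
  if (low < 0.0 || low > 1.0 || high < 0.0 || high > 1.0) return std::string(kErrCutRange);
  if (low >= high) return std::string(kErrCutOrder);
  return std::nullopt;
}

// Replays the four invocations quoted above against the sketch.
inline void CheckAgainstQuotedOutput() {
  assert(ValidateCuts(std::nan(""), 0.9) == std::string(kErrParseLowCut));
  assert(ValidateCuts(0.1, std::nan("")) == std::string(kErrParseHighCut));
  assert(ValidateCuts(-0.1, 0.9) == std::string(kErrCutRange));
  assert(ValidateCuts(0.9, 0.1) == std::string(kErrCutOrder));
  assert(!ValidateCuts(0.1, 0.9).has_value());
}
```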

require.NoError(t, result.Err())
mean, err := strconv.ParseFloat(result.Val().(string), 64)
require.NoError(t, err)
require.False(t, math.IsNaN(mean))
Member

Maybe this check is redundant since we have the next line.

Contributor Author

Yes, this line is redundant; I'll remove it.

@LindaSummer (Member)

Hi @chakkk309 ,

Is this PR ready for review?
I can review it in the next few days if it is. 😊

Best Regards,
Edward

@chakkk309 (Contributor Author)

> Hi @chakkk309 ,
>
> Is this pr ready to review? I would review it in days if this pr is ready.😊
>
> Best Regards, Edward

Yes, please review it. Thanks!

Development

Successfully merging this pull request may close these issues.

TDigest: Implement TDIGEST.TRIMMED_MEAN command

4 participants