Skip to content

Conversation

@ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Jan 1, 2026

  • add binomial_bounds to support calculate lower_bound&&upper_bound
  • add get_lower_bound&&get_upper_bound in ThetaSketch

cc @notfilippo @Xuanwo @PsiACE @tisonkun

@leerho
Copy link
Member

leerho commented Jan 4, 2026

Waiting on requested change.

Comment on lines 154 to 159
pub fn lower_bound(&self, num_std_devs: u32) -> Result<f64, Error> {
if !self.is_estimation_mode() {
return Ok(self.num_retained() as f64);
}
binomial_bounds::lower_bound(self.num_retained() as u64, self.theta(), num_std_devs)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a NumStdDev enum:

/// Number of standard deviations for confidence bounds
///
/// This enum specifies the number of standard deviations to use when computing
/// upper and lower bounds for cardinality estimates. Higher values provide wider
/// confidence intervals with greater certainty that the true cardinality falls
/// within the bounds.
#[repr(u8)]
pub enum NumStdDev {
/// One standard deviation (\~68% confidence interval)
One = 1,
/// Two standard deviations (\~95% confidence interval)
Two = 2,
/// Three standard deviations (\~99.7% confidence interval)
Three = 3,
}

You can reuse it by moving the struct to a shared module level and thus we would never error here.

Besides, we may consider where to place these shared structs, including the ResizeFactor. datasketches::{NumStdDev, ResizeFactor} is fair enough now, but perhaps pollute the top-level namespace too much.

Comment on lines 748 to 764
// Comment for this larger test like Java
// for ci in 1..=3 {
// let arr = run_test_aux(2000, ci, 1e-7);
// for j in 0..5 {
// let ratio = arr[j] / STD[i][j];
// assert!(
// (ratio - 1.0).abs() < TOL,
// "ci={}, j={}: expected {}, got {}, ratio={}",
// ci,
// j,
// STD[i][j],
// arr[j],
// ratio
// );
// }
// i += 1;
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much time it costs? I'd prefer not to comment out tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find the source for this commented out test in either C++ or Java . Could you give me a full link?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much time it costs? I'd prefer not to comment out tests.

0.56s(comment out) vs 3.19s in mac m1 pro

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.56s(comment out) vs 3.19s in mac m1 pro

This can be fair enough to include.

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* add binomial_bounds to support calculate lower_bound&&upper_bound
* add get_lower_bound&&get_upper_bound in ThetaSketch
@tisonkun
Copy link
Member

tisonkun commented Jan 5, 2026

I'll appreciate it if you can, the next time, avoid force push after reviewers claim in. Then reviewers can easily get the changeset from the last review.

mod binomial_bounds;
mod codec;
mod hash;
pub mod num_std_dev;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub mod num_std_dev;
mod num_std_dev;

Why pub?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove it later

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Jan 5, 2026

I'll appreciate it if you can, the next time, avoid force push after reviewers claim in. Then reviewers can easily get the changeset from the last review.

I’ll be more careful next time. If I need to rebase, should I use merge in situations like this instead?

Copy link
Member

@leerho leerho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Looks OK to me. However, there are 2 unresolved requested changes, and the last CI produced errors.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Jan 6, 2026

This Looks OK to me. However, there are 2 unresolved requested changes, and the last CI produced errors.

Looks like ci fail at ./tools/generate_serialization_test_data.py, should we re run to try again. Test locally pass

@leerho
Copy link
Member

leerho commented Jan 6, 2026

Ran the tests again, still failed.

@leerho
Copy link
Member

leerho commented Jan 6, 2026

I think it would be a good idea to have the workflow_dispatch: trigger on any workflows. It simplifies troubleshooting.

@tisonkun
Copy link
Member

tisonkun commented Jan 7, 2026

Error: Expected generated files directory not found at /home/runner/work/datasketches-rust/datasketches-rust/tmp_datasketches_java/serialization_test_data/java_generated_files

Look like some Java side logic changed so that directory has changed as well.

I'll take a look after managing the release vote to the next stage. And perhaps #10 should help in this issue.

I think it would be a good idea to have the workflow_dispatch: trigger on any workflows. It simplifies troubleshooting.

Good point.

@PsiACE
Copy link
Member

PsiACE commented Jan 7, 2026

Ran the tests again, still failed.

I included a temporary fix in #62 (Density Sketch) to ensure CI passes.

@tisonkun tisonkun mentioned this pull request Jan 7, 2026
@tisonkun
Copy link
Member

tisonkun commented Jan 8, 2026

CI fixed. I'll give another look tomorrow and try to merge it to cut 0.2.0-rc.2/

Signed-off-by: tison <wander4096@gmail.com>
Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Push a new commit for reorg common types.

Now this looks good to me.

@tisonkun tisonkun requested a review from leerho January 8, 2026 16:09
@tisonkun
Copy link
Member

tisonkun commented Jan 8, 2026

cc @leerho @PsiACE you may take another look then.

Copy link
Member

@PsiACE PsiACE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. And looking forward to the new release.

@leerho leerho merged commit fdb38b8 into apache:main Jan 8, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants