Describe the bug
The Hurdle distribution has two separate but related issues:
- Incorrect variance formula in _var. The current implementation computes p * var_positive + p * (1 - p) * mean_positive. However, the correct variance for a hurdle distribution, with $\mu$ and $\sigma^2$ denoting the mean and variance of the positive (truncated) part, is:
$$
\mathrm{Var}(Y) = p\sigma^2 + p(1-p)\mu^2
$$
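The implementation does not square the mean term, which violates the law of total variance. Whenever the positive-part mean exceeds 1, the variance is underestimated: for the reproduction below (p = 0.5, positive-part mean about 10, variance about 1), the current formula gives about 3.0 where the correct value is about 25.5.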
- Incorrect capability tagging. Hurdle determines whether mean and var are exact by inspecting the base distribution (e.g., Normal). Internally, however, it wraps this distribution in LeftTruncated, which:
  - does not implement exact mean/var
  - instead relies on numerical integration (PPF-based approximation)

  As a result, Hurdle advertises exact capabilities but actually produces approximate results and triggers runtime warnings. This leads to misleading API behavior and inconsistent capability metadata.
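For the first issue, the squared mean term can be checked with a short derivation from the law of total variance. This is a sketch under the assumption that $B \sim \mathrm{Bernoulli}(p)$ indicates whether the hurdle is crossed, that $Y = 0$ when $B = 0$, and that $Y$ given $B = 1$ has mean $\mu$ and variance $\sigma^2$ (the symbol $B$ is introduced here for illustration only):

$$
\begin{aligned}
\mathbb{E}[Y \mid B] &= \mu B, \qquad \mathrm{Var}(Y \mid B) = \sigma^2 B \\
\mathrm{Var}(Y) &= \mathbb{E}[\mathrm{Var}(Y \mid B)] + \mathrm{Var}(\mathbb{E}[Y \mid B]) \\
&= \sigma^2\,\mathbb{E}[B] + \mu^2\,\mathrm{Var}(B) \\
&= p\sigma^2 + p(1-p)\mu^2
\end{aligned}
$$

The factor $\mu^2$ comes from $\mathrm{Var}(\mu B) = \mu^2 \mathrm{Var}(B)$, which is the step where an unsquared mean would go wrong.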
To Reproduce
import numpy as np
from skpro.distributions.normal import Normal
from skpro.distributions.hurdle import Hurdle
base = Normal(mu=10.0, sigma=1.0)
hurdle = Hurdle(p=0.5, distribution=base)
np.random.seed(42)
samples = hurdle.sample(100_000)
empirical_var = float(samples.var().iloc[0])
skpro_var = float(hurdle.var().iloc[0, 0])
print("Empirical variance:", empirical_var)
print("skpro variance:", skpro_var)
Screenshot of the issue is attached below:
Expected behavior
Variance should follow:
$$
\mathrm{Var}(Y) = p\sigma^2 + p(1-p)\mu^2
$$
- Empirical and analytical variance should match within numerical tolerance.
- Hurdle should correctly advertise mean and var as approximate, not exact.
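The expected agreement between empirical and analytical variance can be illustrated without skpro. The following is a minimal standard-library sketch, not the skpro implementation: it assumes the positive part is a Normal(10, 1) truncated to values above 0, and the helper names (`sample_positive_normal`, `hurdle_var_correct`, `hurdle_var_unsquared`) are hypothetical and chosen only for this example.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    # Population variance (divides by n).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def sample_positive_normal(rng, mu, sigma):
    """Rejection-sample one draw from Normal(mu, sigma) restricted to x > 0."""
    while True:
        x = rng.gauss(mu, sigma)
        if x > 0:
            return x


def hurdle_var_correct(p, mean_positive, var_positive):
    # Law of total variance: the mean term is squared.
    return p * var_positive + p * (1.0 - p) * mean_positive**2


def hurdle_var_unsquared(p, mean_positive, var_positive):
    # Formula as described in this report: the mean term is NOT squared.
    return p * var_positive + p * (1.0 - p) * mean_positive


rng = random.Random(42)
p, n = 0.5, 100_000

# Positive-part draws, then zero each one out with probability 1 - p.
positive = [sample_positive_normal(rng, 10.0, 1.0) for _ in range(n)]
samples = [x if rng.random() < p else 0.0 for x in positive]

mean_positive = mean(positive)
var_positive = variance(positive)
empirical_var = variance(samples)

print("Empirical variance:   ", empirical_var)
print("Squared-mean formula: ", hurdle_var_correct(p, mean_positive, var_positive))
print("Unsquared formula:    ", hurdle_var_unsquared(p, mean_positive, var_positive))
```

Under these assumptions the empirical variance and the squared-mean formula should both land close to 25.5, while the unsquared formula stays near 3.0.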
Environment
OS: macOS
Python: 3.x
skpro: latest (main branch)
NumPy / Pandas: standard versions
Additional context
- The discrepancy is structural, not due to numerical approximation.
- The related ZeroInflated distribution correctly implements this variance formula, which indicates an inconsistency between the two distributions.
- The issue impacts uncertainty estimation, probabilistic metrics, and downstream model evaluation.
Proposed fix
Correct variance formula:
return self.p * var_positive + self.p * (1.0 - self.p) * mean_positive**2
Fix capability tagging: inspect LeftTruncated(distribution) instead of the raw base distribution. Downgrade mean and var to approximate capabilities.
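The downgrade step could look roughly like the sketch below. It is illustrative only: the plain lists stand in for the capability metadata, the helper `downgrade_to_approx` is hypothetical, and how skpro actually stores and updates these capabilities is not shown in this report.

```python
def downgrade_to_approx(exact, approx, names):
    """Return new (exact, approx) lists with `names` moved from exact to approx.

    Hypothetical helper: `exact` and `approx` are plain lists of capability
    names, used here as a stand-in for the real capability metadata.
    """
    names = set(names)
    new_exact = [c for c in exact if c not in names]
    moved = [c for c in exact if c in names and c not in approx]
    return new_exact, list(approx) + moved


# Example capability lists (made up for illustration).
exact = ["mean", "var", "pdf", "cdf"]
approx = ["energy"]

new_exact, new_approx = downgrade_to_approx(exact, approx, ["mean", "var"])
print(new_exact)   # ['pdf', 'cdf']
print(new_approx)  # ['energy', 'mean', 'var']
```

Returning new lists rather than mutating the inputs keeps the base distribution's own metadata untouched, which matters if the same base object is reused elsewhere.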