[BUG] Incorrect variance computation and misleading capability tags in Hurdle #972

@ANANYA542

Describe the bug
The Hurdle distribution contains two separate but related issues:

  1. Incorrect variance formula in `_var`. The current implementation computes `p * var_positive + p * (1 - p) * mean_positive`. However, the correct variance for a hurdle distribution is:
    $$
    \mathrm{Var}(Y) = p\sigma^2 + p(1-p)\mu^2
    $$
    where $\mu$ (`mean_positive`) and $\sigma^2$ (`var_positive`) are the mean and variance of the positive part. The mean term is not squared in the implementation, which violates the law of total variance and severely underestimates the variance whenever $\mu > 1$.

  2. Incorrect capability tagging. Hurdle determines whether mean and var are exact by inspecting the base distribution (e.g., Normal). However, internally it wraps this distribution in LeftTruncated, which:
    - does not implement exact mean/var
    - instead relies on numerical integration (PPF-based approximation)

    As a result, Hurdle advertises exact capabilities but actually produces approximate results and triggers runtime warnings. This leads to misleading API behavior and inconsistent capability metadata.
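
For reference, the formula in item 1 can be derived in a few lines. This is a sketch under the assumption that $Y = B \cdot X$, where $B \sim \mathrm{Bernoulli}(p)$ indicates a non-zero outcome and $X$ is the positive part with mean $\mu$ and variance $\sigma^2$, independent of $B$ (the symbols $B$ and $X$ are introduced here for illustration and do not appear in the code):

$$
\begin{aligned}
\mathbb{E}[Y \mid B] &= B\mu, \qquad \mathrm{Var}(Y \mid B) = B\sigma^2 \quad (\text{since } B^2 = B) \\
\mathrm{Var}(Y) &= \mathbb{E}[\mathrm{Var}(Y \mid B)] + \mathrm{Var}(\mathbb{E}[Y \mid B]) \\
&= p\sigma^2 + \mu^2\,\mathrm{Var}(B) \\
&= p\sigma^2 + p(1-p)\mu^2
\end{aligned}
$$

The second term comes from $\mathrm{Var}(B\mu) = \mu^2\,\mathrm{Var}(B)$, which is why the mean must enter squared.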

To Reproduce

import numpy as np
from skpro.distributions.normal import Normal
from skpro.distributions.hurdle import Hurdle

base = Normal(mu=10.0, sigma=1.0)
hurdle = Hurdle(p=0.5, distribution=base)

np.random.seed(42)
samples = hurdle.sample(100_000)
empirical_var = float(samples.var().iloc[0])

skpro_var = float(hurdle.var().iloc[0, 0])

print("Empirical variance:", empirical_var)
print("skpro variance:", skpro_var)

A screenshot of the issue is attached below:

[Image: screenshot of the issue]

Expected behavior
Variance should follow:

$$ \mathrm{Var}(Y) = p\sigma^2 + p(1-p)\mu^2 $$

- Empirical and analytical variance should match within numerical tolerance.
- Hurdle should correctly advertise mean and var as approximate, not exact.
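
The expected behavior can also be checked without skpro. The sketch below is a standard-library simulation, not the library's code: the helper names (`hurdle_var_squared`, `hurdle_var_unsquared`, `sample_hurdle`, `population_variance`) are hypothetical, and it assumes that `p` is the probability of a non-zero outcome and that the positive part is the base Normal conditioned on being greater than zero.

```python
import random


def hurdle_var_squared(p, mean_positive, var_positive):
    """Law-of-total-variance formula: p*sigma^2 + p*(1-p)*mu^2."""
    return p * var_positive + p * (1.0 - p) * mean_positive ** 2


def hurdle_var_unsquared(p, mean_positive, var_positive):
    """Formula this issue reports as current behaviour (mean not squared)."""
    return p * var_positive + p * (1.0 - p) * mean_positive


def sample_hurdle(rng, p, mu, sigma):
    """Return 0 with probability 1-p, else a Normal(mu, sigma) draw given > 0."""
    if rng.random() >= p:
        return 0.0
    while True:  # rejection sampling for the left-truncated normal
        x = rng.gauss(mu, sigma)
        if x > 0.0:
            return x


def population_variance(values):
    """Plain population variance (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


p, mu, sigma = 0.5, 10.0, 1.0
rng = random.Random(42)
draws = [sample_hurdle(rng, p, mu, sigma) for _ in range(100_000)]

# Assumption: for Normal(10, 1), truncation at 0 removes negligible mass,
# so the positive part has mean ~= 10 and variance ~= 1.
empirical = population_variance(draws)
squared = hurdle_var_squared(p, mu, sigma ** 2)
unsquared = hurdle_var_unsquared(p, mu, sigma ** 2)

print("empirical variance:", round(empirical, 2))
print("squared-mean formula:", squared)
print("unsquared-mean formula:", unsquared)
```

For these parameters the squared-mean formula gives 25.5 and the unsquared one gives 3.0; the simulated variance should land close to the former.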

Environment
OS: macOS
Python: 3.x
skpro: latest (main branch)
NumPy / Pandas: standard versions

Additional context
- The discrepancy is structural, not due to numerical approximation.
- The related ZeroInflated distribution correctly implements this variance formula, indicating an inconsistency between the two.
- The issue impacts uncertainty estimation, probabilistic metrics, and downstream model evaluation.

Proposed fix
Correct variance formula:
return self.p * var_positive + self.p * (1.0 - self.p) * mean_positive ** 2

Fix capability tagging:
- Inspect LeftTruncated(distribution) instead of the raw base distribution.
- Downgrade mean and var to approximate capabilities.
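
The capability downgrade could look roughly like the sketch below. It deliberately does not call skpro: `downgrade_to_approx` is a hypothetical helper, and representing capabilities as two lists of method names (exact and approximate) is an assumption made for illustration. How Hurdle actually stores and sets its tags would need to be checked against the code.

```python
def downgrade_to_approx(exact, approx, names=("mean", "var")):
    """Return new (exact, approx) lists with `names` moved to approx.

    Hypothetical helper: inputs are plain lists of method names and are
    not modified in place.
    """
    # Drop the downgraded names from the exact list.
    new_exact = [m for m in exact if m not in names]
    # Append them to the approximate list, avoiding duplicates.
    new_approx = list(approx)
    for name in names:
        if name not in new_approx:
            new_approx.append(name)
    return new_exact, new_approx


# Illustrative capability lists, as they might be read from the
# truncated (wrapped) distribution and not from the raw base.
exact = ["mean", "var", "pdf", "cdf", "ppf"]
approx = ["energy"]

new_exact, new_approx = downgrade_to_approx(exact, approx)
print("exact:", new_exact)
print("approx:", new_approx)
```

Returning new lists and leaving the inputs untouched avoids accidentally altering capability metadata shared with the base distribution.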
