Skip to content

HyperGeometric distribution gives wrong results #4366

Closed
@ricardoV94

Description

@ricardoV94

Reading #4108 and #4249, it seems that there was some back and forth to try to match the behavior of scipy.stats.hypergeom with that of HyperGeometric, specially to pass the default unit tests when given invalid parameters. Unfortunately the current solution misbehaves often with invalid parameters.

Here is an example:

pm.HyperGeometric.dist(N=3, k=3, n=4).logp(3).eval()  # returns array(1.38629434)
st.hypergeom(M=3, n=3, N=4).logpmf(3)  # returns nan

(n=4 is impossible since it must be smaller or equal than N)

The initial PR draft had a bound to deal with invalid parameters, but it was removed so that logp would return the same values as the scipy counterpart (otherwise we would sometimes get pymc=-inf vs scipy=nan). Unfortunately, as shown in the example above, we cannot trust the pymc3 implementation to reliably return nan for invalid parameters. Scipy seems to have some parameter validation method, but it is not clear to me if and how it is used by the logpmf method:

https://github.com/scipy/scipy/blob/624fd766e2886c458012d98e3c6ec85551ab2916/scipy/stats/_discrete_distns.py#L455-L458

Besides not being reliable, returning nan also seems to deviate from the usual pymc behavior of getting -inf for invalid configurations. In addition, the logcdf method for the HyperGeometric (see #4331) will also be incompatible with results of the scipy counterpart, which actually gives wrong results for some configurations of impossible parameters:

st.hypergeom(M=3, n=3, N=4).logpmf(np.arange(5))  # returns array([nan, nan, nan, nan, nan])
st.hypergeom(M=3, n=3, N=4).logcdf(np.arange(5))  # returns array([nan, nan, nan,  0.,  0.])

Given these issues, I suggest we depart from the scipy implementation and return to using bound to replace impossible configurations with -inf. I am willing to do a PR (or give it to @Harivallabha if he is interested). To overcome the original issue with the unit tests, we can change the check_logp method in pymc3_matches_scipy to replace scipy nan with -inf (perhaps with an optional argument, in case this conflicts with other distributions tests):
https://github.com/pymc-devs/pymc3/blob/9dbf9eabf3b18bf2529a8855c086e90249be0b5e/pymc3/tests/test_distributions.py#L543-L551

What do you think?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions