Skip to content

BUG: Series.corr/cov raising with masked dtype #51422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 24, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1230,6 +1230,7 @@ Numeric
- Bug in arithmetic operations on :class:`Series` not propagating mask when combining masked dtypes and numpy dtypes (:issue:`45810`, :issue:`42630`)
- Bug in :meth:`DataFrame.sem` and :meth:`Series.sem` where an erroneous ``TypeError`` would always raise when using data backed by an :class:`ArrowDtype` (:issue:`49759`)
- Bug in :meth:`Series.__add__` casting to object for list and masked :class:`Series` (:issue:`22962`)
- Bug in :meth:`Series.corr` and :meth:`Series.cov` raising ``AttributeError`` for masked dtypes (:issue:`51422`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move to 2.1 as well?

- Bug in :meth:`~arrays.ArrowExtensionArray.mode` where ``dropna=False`` was not respected when there was ``NA`` values (:issue:`50982`)
- Bug in :meth:`DataFrame.query` with ``engine="numexpr"`` and column names are ``min`` or ``max`` would raise a ``TypeError`` (:issue:`50937`)
- Bug in :meth:`DataFrame.min` and :meth:`DataFrame.max` with tz-aware data containing ``pd.NaT`` and ``axis=1`` would return incorrect results (:issue:`51242`)
Expand Down
6 changes: 6 additions & 0 deletions pandas/core/nanops.py
Original file line number Diff line number Diff line change
Expand Up @@ -1610,6 +1610,9 @@ def nancorr(
if len(a) < min_periods:
return np.nan

a = _ensure_numeric(a)
b = _ensure_numeric(b)

f = get_corr_func(method)
return f(a, b)

Expand Down Expand Up @@ -1668,6 +1671,9 @@ def nancov(
if len(a) < min_periods:
return np.nan

a = _ensure_numeric(a)
b = _ensure_numeric(b)

return np.cov(a, b, ddof=ddof)[0, 1]


Expand Down
9 changes: 7 additions & 2 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -2647,9 +2647,12 @@ def corr(
if len(this) == 0:
return np.nan

this_values = np.asarray(this.values)
other_values = np.asarray(other.values)

if method in ["pearson", "spearman", "kendall"] or callable(method):
return nanops.nancorr(
this.values, other.values, method=method, min_periods=min_periods
this_values, other_values, method=method, min_periods=min_periods
)

raise ValueError(
Expand Down Expand Up @@ -2702,8 +2705,10 @@ def cov(
this, other = self.align(other, join="inner", copy=False)
if len(this) == 0:
return np.nan
this_values = np.asarray(this.values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use ._values here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think so. For my knowledge, if we need an ndarray for nanops.nancov, is there a benefit to using np.asarray(this._values) vs np.asarray(this.values)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its probably harmless in this case but in general .values can be either a) lossy (dt64tz) or b) convert to object (period) while ._values is safe from those failure modes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks! I'll update just in case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values also creates references and is read_only (in the future) for CoW. Better to use ._values

other_values = np.asarray(other.values)
return nanops.nancov(
this.values, other.values, min_periods=min_periods, ddof=ddof
this_values, other_values, min_periods=min_periods, ddof=ddof
)

@doc(
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/frame/methods/test_cov_corr.py
Original file line number Diff line number Diff line change
Expand Up @@ -356,7 +356,7 @@ def test_corrwith_mixed_dtypes(self, numeric_only):
else:
with pytest.raises(
TypeError,
match=r"unsupported operand type\(s\) for /: 'str' and 'int'",
match=r"Could not convert \['a' 'b' 'c' 'd'\] to numeric",
):
df.corrwith(s, numeric_only=numeric_only)

Expand Down
12 changes: 8 additions & 4 deletions pandas/tests/series/methods/test_cov_corr.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,13 +42,14 @@ def test_cov(self, datetime_series):
assert isna(ts1.cov(ts2, min_periods=12))

@pytest.mark.parametrize("test_ddof", [None, 0, 1, 2, 3])
def test_cov_ddof(self, test_ddof):
@pytest.mark.parametrize("dtype", ["float64", "Float64"])
def test_cov_ddof(self, test_ddof, dtype):
# GH#34611
np_array1 = np.random.rand(10)
np_array2 = np.random.rand(10)

s1 = Series(np_array1)
s2 = Series(np_array2)
s1 = Series(np_array1, dtype=dtype)
s2 = Series(np_array2, dtype=dtype)

result = s1.cov(s2, ddof=test_ddof)
expected = np.cov(np_array1, np_array2, ddof=test_ddof)[0][1]
Expand All @@ -57,9 +58,12 @@ def test_cov_ddof(self, test_ddof):

class TestSeriesCorr:
@td.skip_if_no_scipy
def test_corr(self, datetime_series):
@pytest.mark.parametrize("dtype", ["float64", "Float64"])
def test_corr(self, datetime_series, dtype):
from scipy import stats

datetime_series = datetime_series.astype(dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens here if there are NAs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both left and right are filtered for notna here:

valid = notna(a) & notna(b)

so this works:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: ser1 = pd.Series(np.random.randn(100), dtype="Float64")

In [4]: ser2 = pd.Series(np.random.randn(100), dtype="Float64")

In [5]: ser1[1] = pd.NA

In [6]: ser2[5:7] = pd.NA

In [7]: ser1.corr(ser2)
Out[7]: 0.09774253881093414

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. what about if there is an nan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean an nan within a masked array? e.g.

In [8]: ser1[1] = 0.0

In [9]: ser1 /= (ser1 != 0)

In [10]: ser1
Out[10]: 
0     0.148073
1          NaN
2     0.556972
3    -0.554886
4     1.216938
        ...   
95     0.33919
96    0.528683
97    1.590215
98     0.84015
99    0.333666
Length: 100, dtype: Float64

In [11]: ser1.corr(ser2)
Out[11]: 0.09774253881093414

It works either way since the notna is applied to the ndarray which will capture both np.nan and pd.NA


# full overlap
tm.assert_almost_equal(datetime_series.corr(datetime_series), 1)

Expand Down