Skip to content

BUG: Series.corr/cov raising with masked dtype #51422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 24, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v2.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -131,7 +131,7 @@ Timezones

Numeric
^^^^^^^
-
- Bug in :meth:`Series.corr` and :meth:`Series.cov` raising ``AttributeError`` for masked dtypes (:issue:`51422`)
-

Conversion
Expand Down
6 changes: 6 additions & 0 deletions pandas/core/nanops.py
Original file line number Diff line number Diff line change
Expand Up @@ -1610,6 +1610,9 @@ def nancorr(
if len(a) < min_periods:
return np.nan

a = _ensure_numeric(a)
b = _ensure_numeric(b)

f = get_corr_func(method)
return f(a, b)

Expand Down Expand Up @@ -1668,6 +1671,9 @@ def nancov(
if len(a) < min_periods:
return np.nan

a = _ensure_numeric(a)
b = _ensure_numeric(b)

return np.cov(a, b, ddof=ddof)[0, 1]


Expand Down
9 changes: 7 additions & 2 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -2643,9 +2643,12 @@ def corr(
if len(this) == 0:
return np.nan

this_values = np.asarray(this._values)
other_values = np.asarray(other._values)

if method in ["pearson", "spearman", "kendall"] or callable(method):
return nanops.nancorr(
this.values, other.values, method=method, min_periods=min_periods
this_values, other_values, method=method, min_periods=min_periods
)

raise ValueError(
Expand Down Expand Up @@ -2698,8 +2701,10 @@ def cov(
this, other = self.align(other, join="inner", copy=False)
if len(this) == 0:
return np.nan
this_values = np.asarray(this._values)
other_values = np.asarray(other._values)
return nanops.nancov(
this.values, other.values, min_periods=min_periods, ddof=ddof
this_values, other_values, min_periods=min_periods, ddof=ddof
)

@doc(
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/frame/methods/test_cov_corr.py
Original file line number Diff line number Diff line change
Expand Up @@ -356,7 +356,7 @@ def test_corrwith_mixed_dtypes(self, numeric_only):
else:
with pytest.raises(
TypeError,
match=r"unsupported operand type\(s\) for /: 'str' and 'int'",
match=r"Could not convert \['a' 'b' 'c' 'd'\] to numeric",
):
df.corrwith(s, numeric_only=numeric_only)

Expand Down
12 changes: 8 additions & 4 deletions pandas/tests/series/methods/test_cov_corr.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,13 +42,14 @@ def test_cov(self, datetime_series):
assert isna(ts1.cov(ts2, min_periods=12))

@pytest.mark.parametrize("test_ddof", [None, 0, 1, 2, 3])
def test_cov_ddof(self, test_ddof):
@pytest.mark.parametrize("dtype", ["float64", "Float64"])
def test_cov_ddof(self, test_ddof, dtype):
# GH#34611
np_array1 = np.random.rand(10)
np_array2 = np.random.rand(10)

s1 = Series(np_array1)
s2 = Series(np_array2)
s1 = Series(np_array1, dtype=dtype)
s2 = Series(np_array2, dtype=dtype)

result = s1.cov(s2, ddof=test_ddof)
expected = np.cov(np_array1, np_array2, ddof=test_ddof)[0][1]
Expand All @@ -57,9 +58,12 @@ def test_cov_ddof(self, test_ddof):

class TestSeriesCorr:
@td.skip_if_no_scipy
def test_corr(self, datetime_series):
@pytest.mark.parametrize("dtype", ["float64", "Float64"])
def test_corr(self, datetime_series, dtype):
from scipy import stats

datetime_series = datetime_series.astype(dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens here if there are NAs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both left and right are filtered for notna here:

valid = notna(a) & notna(b)

so this works:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: ser1 = pd.Series(np.random.randn(100), dtype="Float64")

In [4]: ser2 = pd.Series(np.random.randn(100), dtype="Float64")

In [5]: ser1[1] = pd.NA

In [6]: ser2[5:7] = pd.NA

In [7]: ser1.corr(ser2)
Out[7]: 0.09774253881093414

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. what about if there is an nan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean an nan within a masked array? e.g.

In [8]: ser1[1] = 0.0

In [9]: ser1 /= (ser1 != 0)

In [10]: ser1
Out[10]: 
0     0.148073
1          NaN
2     0.556972
3    -0.554886
4     1.216938
        ...   
95     0.33919
96    0.528683
97    1.590215
98     0.84015
99    0.333666
Length: 100, dtype: Float64

In [11]: ser1.corr(ser2)
Out[11]: 0.09774253881093414

It works either way since the notna is applied to the ndarray which will capture both np.nan and pd.NA


# full overlap
tm.assert_almost_equal(datetime_series.corr(datetime_series), 1)

Expand Down