Skip to content

BUG: String upgraded to complex128 (and cascading to other columns) in df.mean() and df.agg('mean') #36703

Closed
@jtkiley

Description

@jtkiley
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas (1.1.1; latest from conda).

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
df2 = pd.DataFrame([{'db': 'J'}])
df3 = pd.DataFrame([{'db': 'j'}])
df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
                    {'db': 'J', 'numeric': 456}])
df5 = pd.DataFrame([{'db': 'J'}], dtype='string')

# Initial columns are _correct_ types
df.dtypes

# complex128 and across columns
df.mean()

# agg('mean') is the same.
df.agg('mean')

# Happens even when the columns at issue is alone.
df2.mean()

# Case of the 'J' doesn't matter.
df3.mean()

# Numeric only works as expected.
df.mean(numeric_only=True)

# Two rows works as expected, too.
df4.mean()

# The new StringDtype doesn't appear to matter.
np.mean(df['db'].astype('string').array)
type(np.mean(df['db'].astype('string').array))
df['db'].astype('string').dtype
df5.mean()

# Happens in PandasArray.
np.mean(df['db'].array)
type(np.mean(df['db'].array))

# Not in a numpy array.
np.mean(df['db'].to_numpy())

# Also happens in a Series.
df['db'].mean()
type(df['db'])

# The conversion happens in nanops.nanmean().
pd.core.nanops.nanmean(df['db'])

# Doesn't look like _get_values() is to blame.
pd.core.nanops._get_values(df['db'], True)[0].mean()

# The sum doesn't appear to be it, either.
values = pd.core.nanops._get_values(df['db'], True)[0]
dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
type(values.sum(None, dtype=dtype_sum))

# The conversion happens in pd.core.nanops._ensure_numeric().
pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))

# Interestingly enough, it looks like a sum somewhere is concatenating
# the two 'J's together.
pd.Series(['J', 'J']).astype('complex128').mean()
pd.Series(['J', 'J']).mean()
pd.Series(['J']).mean()

# Here's an example. Then, that sum can't be cast.
pd.core.nanops._get_values(df4['db'], True)[0].sum()

With output:

>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
>>> df2 = pd.DataFrame([{'db': 'J'}])
>>> df3 = pd.DataFrame([{'db': 'j'}])
>>> df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
...                     {'db': 'J', 'numeric': 456}])
>>> df5 = pd.DataFrame([{'db': 'J'}], dtype='string')
>>>
>>> # Initial columns are _correct_ types
>>> df.dtypes
db         object
numeric     int64
dtype: object
>>>
>>> # complex128 and across columns
>>> df.mean()
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # agg('mean') is the same.
>>> df.agg('mean')
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # Happens even when the columns at issue is alone.
>>> df2.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Case of the 'J' doesn't matter.
>>> df3.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Numeric only works as expected.
>>> df.mean(numeric_only=True)
numeric    123.0
dtype: float64
>>>
>>> # Two rows works as expected, too.
>>> df4.mean()
numeric    289.5
dtype: float64
>>> # The new StringDtype doesn't appear to matter.
>>> np.mean(df['db'].astype('string').array)
1j
>>> type(np.mean(df['db'].astype('string').array))
<class 'complex'>
>>> df['db'].astype('string').dtype
StringDtype
>>> df5.mean()
Series([], dtype: float64)
>>>
>>> # Happens in PandasArray.
>>> np.mean(df['db'].array)
1j
>>> type(np.mean(df['db'].array))
<class 'complex'>
>>>
>>> # Not in a numpy array.
>>> np.mean(df['db'].to_numpy())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in mean
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3372, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # Also happens in a Series.
>>> df['db'].mean()
1j
>>> type(df['db'])
<class 'pandas.core.series.Series'>
>>>
>>> # The conversion happens in nanops.nanmean().
>>> pd.core.nanops.nanmean(df['db'])
1j
>>>
>>> # Doesn't look like _get_values() is to blame.
>>> pd.core.nanops._get_values(df['db'], True)[0].mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # The sum doesn't appear to be it, either.
>>> values = pd.core.nanops._get_values(df['db'], True)[0]
>>> dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
>>> type(values.sum(None, dtype=dtype_sum))
<class 'str'>
>>>
>>> # The conversion happens in pd.core.nanops._ensure_numeric().
>>> pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))
1j
>>>
>>> # Interestingly enough, it looks like a sum somewhere is concatenating
>>> # the two 'J's together.
>>> pd.Series(['J', 'J']).astype('complex128').mean()
1j
>>> pd.Series(['J', 'J']).mean()
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1427, in _ensure_numeric
    x = float(x)
ValueError: could not convert string to float: 'JJ'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1431, in _ensure_numeric
    x = complex(x)
ValueError: complex() arg is a malformed string

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/generic.py", line 11459, in stat_func
    return self._reduce(
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/series.py", line 4236, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 71, in _f
    return f(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 129, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 563, in nanmean
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1434, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
TypeError: Could not convert JJ to numeric
>>> pd.Series(['J']).mean()
1j
>>>
>>> # Here's an example. Then, that sum can't be cast.
>>> pd.core.nanops._get_values(df4['db'], True)[0].sum()
'JJ'

Problem description

It seems like df.mean() is being too aggressive about attempting to work with string columns in a one-row dataframe. In my use case, if I get a query result with one row, I get this bug. I have lots of these queries, and even one of them being affected will flow through to the concatenated dataframe downstream.

I first noticed this by getting an error that I couldn't write a parquet file with a complex128 column, and tracked it back from there.

Expected Output

Expected output is what we get with numeric_only=True set or more than one row. See above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : None
pytest : 6.0.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Metadata

Metadata

Assignees

Labels

BugDtype ConversionsUnexpected or buggy dtype conversionsReduction Operationssum, mean, min, max, etc.StringsString extension data type and string data

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions