Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas (1.1.1; latest from conda).
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
df2 = pd.DataFrame([{'db': 'J'}])
df3 = pd.DataFrame([{'db': 'j'}])
df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
                    {'db': 'J', 'numeric': 456}])
df5 = pd.DataFrame([{'db': 'J'}], dtype='string')
# Initial columns are _correct_ types
df.dtypes
# The mean comes back as complex128, and for both columns.
df.mean()
# agg('mean') is the same.
df.agg('mean')
# Happens even when the column at issue is alone.
df2.mean()
# Case of the 'J' doesn't matter.
df3.mean()
# numeric_only=True works as expected.
df.mean(numeric_only=True)
# Two rows work as expected, too.
df4.mean()
# The new StringDtype doesn't appear to matter.
np.mean(df['db'].astype('string').array)
type(np.mean(df['db'].astype('string').array))
df['db'].astype('string').dtype
df5.mean()
# Happens in PandasArray.
np.mean(df['db'].array)
type(np.mean(df['db'].array))
# Not in a numpy array.
np.mean(df['db'].to_numpy())
# Also happens in a Series.
df['db'].mean()
type(df['db'])
# The conversion happens in nanops.nanmean().
pd.core.nanops.nanmean(df['db'])
# Doesn't look like _get_values() is to blame.
pd.core.nanops._get_values(df['db'], True)[0].mean()
# The sum doesn't appear to be it, either.
values = pd.core.nanops._get_values(df['db'], True)[0]
dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
type(values.sum(None, dtype=dtype_sum))
# The conversion happens in pd.core.nanops._ensure_numeric().
pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))
# Interestingly enough, it looks like a sum somewhere is concatenating
# the two 'J's together.
pd.Series(['J', 'J']).astype('complex128').mean()
pd.Series(['J', 'J']).mean()
pd.Series(['J']).mean()
# Here's an example of that concatenation; the resulting sum can't be cast.
pd.core.nanops._get_values(df4['db'], True)[0].sum()
With output:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
>>> df2 = pd.DataFrame([{'db': 'J'}])
>>> df3 = pd.DataFrame([{'db': 'j'}])
>>> df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
... {'db': 'J', 'numeric': 456}])
>>> df5 = pd.DataFrame([{'db': 'J'}], dtype='string')
>>>
>>> # Initial columns are _correct_ types
>>> df.dtypes
db object
numeric int64
dtype: object
>>>
>>> # The mean comes back as complex128, and for both columns.
>>> df.mean()
db 0.000000+1.000000j
numeric 123.000000+0.000000j
dtype: complex128
>>>
>>> # agg('mean') is the same.
>>> df.agg('mean')
db 0.000000+1.000000j
numeric 123.000000+0.000000j
dtype: complex128
>>>
>>> # Happens even when the column at issue is alone.
>>> df2.mean()
db 0.000000+1.000000j
dtype: complex128
>>>
>>> # Case of the 'J' doesn't matter.
>>> df3.mean()
db 0.000000+1.000000j
dtype: complex128
>>>
>>> # numeric_only=True works as expected.
>>> df.mean(numeric_only=True)
numeric 123.0
dtype: float64
>>>
>>> # Two rows work as expected, too.
>>> df4.mean()
numeric 289.5
dtype: float64
>>> # The new StringDtype doesn't appear to matter.
>>> np.mean(df['db'].astype('string').array)
1j
>>> type(np.mean(df['db'].astype('string').array))
<class 'complex'>
>>> df['db'].astype('string').dtype
StringDtype
>>> df5.mean()
Series([], dtype: float64)
>>>
>>> # Happens in PandasArray.
>>> np.mean(df['db'].array)
1j
>>> type(np.mean(df['db'].array))
<class 'complex'>
>>>
>>> # Not in a numpy array.
>>> np.mean(df['db'].to_numpy())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 5, in mean
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3372, in mean
return _methods._mean(a, axis=axis, dtype=dtype,
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # Also happens in a Series.
>>> df['db'].mean()
1j
>>> type(df['db'])
<class 'pandas.core.series.Series'>
>>>
>>> # The conversion happens in nanops.nanmean().
>>> pd.core.nanops.nanmean(df['db'])
1j
>>>
>>> # Doesn't look like _get_values() is to blame.
>>> pd.core.nanops._get_values(df['db'], True)[0].mean()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # The sum doesn't appear to be it, either.
>>> values = pd.core.nanops._get_values(df['db'], True)[0]
>>> dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
>>> type(values.sum(None, dtype=dtype_sum))
<class 'str'>
>>>
>>> # The conversion happens in pd.core.nanops._ensure_numeric().
>>> pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))
1j
>>>
>>> # Interestingly enough, it looks like a sum somewhere is concatenating
>>> # the two 'J's together.
>>> pd.Series(['J', 'J']).astype('complex128').mean()
1j
>>> pd.Series(['J', 'J']).mean()
Traceback (most recent call last):
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1427, in _ensure_numeric
x = float(x)
ValueError: could not convert string to float: 'JJ'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1431, in _ensure_numeric
x = complex(x)
ValueError: complex() arg is a malformed string
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/generic.py", line 11459, in stat_func
return self._reduce(
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/series.py", line 4236, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 71, in _f
return f(*args, **kwargs)
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 129, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 563, in nanmean
the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1434, in _ensure_numeric
raise TypeError(f"Could not convert {x} to numeric") from err
TypeError: Could not convert JJ to numeric
>>> pd.Series(['J']).mean()
1j
>>>
>>> # Here's an example of that concatenation; the resulting sum can't be cast.
>>> pd.core.nanops._get_values(df4['db'], True)[0].sum()
'JJ'
Problem description
It seems like df.mean() is being too aggressive about attempting to work with string columns in a one-row DataFrame. In my use case, if I get a query result with one row, I hit this bug. I have lots of these queries, and even one of them being affected flows through to the concatenated DataFrame downstream.
I first noticed this by getting an error that I couldn't write a parquet file with a complex128 column, and tracked it back from there.
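As far as I can tell, the 1j comes straight from Python's own complex() parsing, which _ensure_numeric() falls back to once float() fails: a bare 'J' (or 'j') is a valid spelling of the imaginary unit, while 'JJ' is not. A minimal sketch of that fallback (the helper name is mine, reconstructed from the tracebacks above, not the actual pandas code):

def ensure_numeric_sketch(x):
    # Rough shape of the fallback in pandas.core.nanops._ensure_numeric(),
    # as suggested by the tracebacks above; not the real implementation.
    try:
        return float(x)      # float('J') raises ValueError ...
    except ValueError:
        return complex(x)    # ... but complex('J') parses as the imaginary unit

complex('J')                  # 1j -- 'j'/'J' alone is a valid complex literal string
ensure_numeric_sketch('J')    # 1j, matching the buggy mean above
ensure_numeric_sketch('123')  # 123.0, the normal path
# ensure_numeric_sketch('JJ') raises ValueError ("malformed string"), which is
# why the two-row case errors out instead of silently returning 1j.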
Expected Output
Expected output is what we get with numeric_only=True set, or with more than one row. See above.
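In the meantime, a possible workaround is to keep the reduction away from the string columns entirely, either with numeric_only=True or by selecting numeric columns first; a sketch using the one-row frame from above:

import pandas as pd

df = pd.DataFrame([{'db': 'J', 'numeric': 123}])

# Either restrict the reduction itself ...
df.mean(numeric_only=True)                 # numeric    123.0

# ... or drop non-numeric columns up front, which also keeps complex128
# out of anything written downstream (e.g. the parquet file mentioned above).
df.select_dtypes(include='number').mean()  # numeric    123.0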
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : None
pytest : 6.0.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None