Description
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
import numpy as np
a = pd.Series(np.random.normal(scale=0.1, size=(1_000_000,)).astype(np.float32)).pow(2)
assert isinstance(np.mean(a), float)
assert isinstance(np.mean(a.values), np.float32)
assert abs(1 - np.mean(a)/np.mean(a.values)) > 4e-4
Problem description
pd.DataFrame.mean
/pd.Series.mean
/np.mean(pd.Series)
outputs a Python float instead of a numpy float. Sincenp.mean(pd.Series.values)
does return an np float, I'm assuming for now that this should be fixed in pandas- if
dtype==np.float32
, then callingmean
on a pandas object gives a significantly different result vs callingmean
on the underlying numpy ndarray.
Expected Output
The output of np.mean(a)
should be the same as np.mean(a.values)
.
additional tests
# both b and c ~1e-2
b = a.mean() # the pandas impl of mean
assert isinstance(b, float) # PYTHON float, not numpy float? Ergo implicit f64
h = np.mean(a)
assert isinstance(h, float)
assert h == b
c = a.values.mean() # the numpy impl of mean
assert isinstance(c, np.float32) # as exprected
print('\nerrors between pandas mean and numpy mean')
print(f'relative error: {abs(1-b/c):.3e}') # ~ 5e-4
print(f'absolute error: {abs(b -c):.3e}') # ~ 5e-6
print(f'relative error after casting: {abs(1-np.float32(b)/c):.3e}') # ~ 5e-4
print(f'absolute error after casting: {abs(np.float32(b) -c):.3e}') # ~ 5e-6
d = a.sum() / len(a)
assert isinstance(d, np.float64) # expected, because division. Note `sum` returns an np.float32
e = a.values.sum() / len(a)
assert isinstance(e, np.float64) # expected, because division
# these methods are equivalent
assert d==e
# and up to f32 precision equal to the numpy impl
assert d.astype(np.float32) == c
# the cherry on the cake
f = a.astype(np.float64).mean()
assert isinstance(f, float) # still not ideal, should be np.float64
g = a.astype(np.float64).values.mean()
print('\nrelative error between pandas f64 mean and numpy f64 mean')
print(f'relative error numpy f64/pandas f64: {abs(1-g/f):.3e}') # ~ 1e-14 -- 1e-16, not bad but I would have expected equality
print('\nerrors between pandas f64 mean and numpy/pandas f32 mean')
print(f'relative error pandas f32/pandas f64: {abs(1-b/f):.3e}') # ~ 5e-4
print(f'absolute error numpy f32/pandas f64: {abs(1-c/f):.3e}') # ~ 1e-7 -- 1e-9
# finally...
h = np.mean(a)
assert isinstance(h, float)
assert h == b
output
errors between pandas mean and numpy mean
relative error: 5.210e-04
absolute error: 5.204e-06
relative error after casting: 5.210e-04
absolute error after casting: 5.204e-06
relative error between pandas f64 mean and numpy f64 mean
relative error numpy f64/pandas f64: 1.066e-14
errors between pandas f64 mean and numpy/pandas f32 mean
relative error pandas f32/pandas f64: 5.214e-04
absolute error numpy f32/pandas f64: 2.399e-07
Output of pd.show_versions()
INSTALLED VERSIONS
commit : c7f7443
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2