BUG: np.mean(pd.Series) != np.mean(pd.Series.values) #42878

Closed
@sebasv

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

a = pd.Series(np.random.normal(scale=0.1, size=(1_000_000,)).astype(np.float32)).pow(2)

assert isinstance(np.mean(a), float)
assert isinstance(np.mean(a.values), np.float32)
assert abs(1 - np.mean(a)/np.mean(a.values)) > 4e-4

Problem description

  1. pd.DataFrame.mean, pd.Series.mean, and np.mean(pd.Series) return a Python float instead of a NumPy float. Since np.mean(pd.Series.values) does return a NumPy float, I'm assuming for now that this should be fixed in pandas.
  2. If dtype == np.float32, calling mean on a pandas object gives a significantly different result (relative error above 4e-4 in the code sample) than calling mean on the underlying NumPy ndarray.
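One plausible mechanism for a float32 discrepancy of this kind, which this report does not confirm, is a difference in how the sum is accumulated. The sketch below uses only NumPy: `np.mean` on a float32 array sums pairwise, while the last element of `np.cumsum` is used here as a stand-in for a plain running float32 sum. The seed and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for this sketch
x = rng.normal(scale=0.1, size=1_000_000).astype(np.float32) ** 2

# Reference mean computed with a float64 accumulator.
reference = np.mean(x, dtype=np.float64)

# np.mean on a float32 array keeps float32 but uses pairwise summation.
pairwise_mean = np.mean(x)

# np.cumsum adds the elements one after another, so its last entry is a
# running float32 sum (a stand-in for a naive accumulation loop).
naive_mean = np.cumsum(x, dtype=np.float32)[-1] / np.float32(x.size)

pairwise_rel_err = abs(1 - float(pairwise_mean) / reference)
naive_rel_err = abs(1 - float(naive_mean) / reference)

print(f"pairwise float32 relative error: {pairwise_rel_err:.3e}")
print(f"running  float32 relative error: {naive_rel_err:.3e}")
```

If the running-sum error comes out orders of magnitude larger than the pairwise one, that would be consistent with item 2, but it does not by itself show which code path pandas takes.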

Expected Output

The output of np.mean(a) should be the same as np.mean(a.values).
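Until the two paths agree, a possible workaround is to reduce over the underlying ndarray and state the accumulator dtype explicitly. This is a minimal sketch, not a fix for the pandas behaviour; the seed and variable names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary fixed seed for this sketch
a = pd.Series(rng.normal(scale=0.1, size=1_000_000).astype(np.float32)).pow(2)

values = a.to_numpy()  # the underlying float32 ndarray

# NumPy keeps the input dtype for the result of a float32 mean.
mean_f32 = np.mean(values)

# Requesting a float64 accumulator gives a NumPy float64 scalar.
mean_f64 = np.mean(values, dtype=np.float64)

print(type(mean_f32).__name__, type(mean_f64).__name__)
print(f"relative difference: {abs(1 - float(mean_f32) / mean_f64):.3e}")
```

Going through `to_numpy()` sidesteps whatever reduction pandas dispatches to, at the cost of losing pandas' NaN-skipping semantics (use `np.nanmean` if the data may contain NaN).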

Additional tests

# both b and c ~1e-2
b = a.mean() # the pandas impl of mean
assert isinstance(b, float) # PYTHON float, not numpy float? Ergo implicit f64

h = np.mean(a)
assert isinstance(h, float)
assert h == b

c = a.values.mean() # the numpy impl of mean
assert isinstance(c, np.float32) # as expected

print('\nerrors between pandas mean and numpy mean')
print(f'relative error: {abs(1-b/c):.3e}') # ~ 5e-4
print(f'absolute error: {abs(b -c):.3e}') # ~ 5e-6

print(f'relative error after casting: {abs(1-np.float32(b)/c):.3e}') # ~ 5e-4
print(f'absolute error after casting: {abs(np.float32(b) -c):.3e}') # ~ 5e-6

d = a.sum() / len(a) 
assert isinstance(d, np.float64) # expected, because division. Note `sum` returns an np.float32

e = a.values.sum() / len(a)
assert isinstance(e, np.float64) # expected, because division

# these methods are equivalent
assert d==e

# and up to f32 precision equal to the numpy impl
assert d.astype(np.float32) == c

# the cherry on the cake
f = a.astype(np.float64).mean()
assert isinstance(f, float) # still not ideal, should be np.float64

g = a.astype(np.float64).values.mean()
print('\nrelative error between pandas f64 mean and numpy f64 mean')
print(f'relative error numpy f64/pandas f64: {abs(1-g/f):.3e}') # ~ 1e-14 -- 1e-16, not bad but I would have expected equality

print('\nerrors between pandas f64 mean and numpy/pandas f32 mean')
print(f'relative error pandas f32/pandas f64: {abs(1-b/f):.3e}') # ~ 5e-4
print(f'absolute error numpy f32/pandas f64: {abs(1-c/f):.3e}') # ~ 1e-7 -- 1e-9; despite the label, abs(1-c/f) is a relative error

# finally...
h = np.mean(a)
assert isinstance(h, float)
assert h == b

Output

errors between pandas mean and numpy mean
relative error: 5.210e-04
absolute error: 5.204e-06
relative error after casting: 5.210e-04
absolute error after casting: 5.204e-06

relative error between pandas f64 mean and numpy f64 mean
relative error numpy f64/pandas f64: 1.066e-14

errors between pandas f64 mean and numpy/pandas f32 mean
relative error pandas f32/pandas f64: 5.214e-04
absolute error numpy f32/pandas f64: 2.399e-07

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2
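The version list shows bottleneck 1.3.2 installed, and pandas can dispatch some reductions to bottleneck when it is available. Whether that is the cause here is an assumption, not something this report establishes. One way to probe it is to toggle the `compute.use_bottleneck` option and compare the results; the seed and variable names below are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary fixed seed for this sketch
a = pd.Series(rng.normal(scale=0.1, size=1_000_000).astype(np.float32)).pow(2)

# Reference mean with a float64 accumulator on the raw ndarray.
reference = np.mean(a.to_numpy(), dtype=np.float64)

# Force the pure pandas/NumPy code path.
with pd.option_context("compute.use_bottleneck", False):
    mean_without_bn = a.mean()

# Allow bottleneck; this has an effect only if bottleneck is installed.
with pd.option_context("compute.use_bottleneck", True):
    mean_with_bn = a.mean()

print(f"without bottleneck: {abs(1 - float(mean_without_bn) / reference):.3e}")
print(f"with bottleneck:    {abs(1 - float(mean_with_bn) / reference):.3e}")
```

If the two relative errors differ noticeably, the discrepancy originates in the optional dependency rather than in pandas itself.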

Labels: Bug; Dependencies (required and optional dependencies); Numeric Operations (arithmetic, comparison, and logical operations); Upstream issue (issue related to pandas dependency)
