Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Similar complaints have been raised before (e.g. "API: Signature of UDF methods" and "Higher Order Methods API"); here I present my own analysis and a proposed solution.
Reproducible Example
```python
import pandas as pd
import numpy as np

dat = pd.DataFrame({'color': ['A', 'A', 'E', 'E', 'J'],
                    'carat': [0.23, 0.21, 0.23, 0.29, 0.31],
                    'x': [3.95, 3.89, 4.05, 4.20, 4.34]},
                   index=[1, 2, 3, 4, 5])

dat['carat'].agg(np.mean)
## 0.254

dat['carat'].agg(np.mean, axis=0)
## 0.254

dat['carat'].agg('mean')
## 0.254

dat['carat'].agg(lambda x: np.mean(x))
## 1    0.23
## 2    0.21
## 3    0.23
## 4    0.29
## 5    0.31
## Name: carat, dtype: float64

dat['carat'].agg(lambda x, axis: np.mean(x, axis=axis), axis=0)
## TypeError: <lambda>() missing 1 required positional argument: 'axis'
```
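One reading that is consistent with the unchanged Series returned for `lambda x: np.mean(x)`, offered here as an assumption rather than a confirmed description of the pandas internals: if the callable is first tried on each element and only falls back to the whole Series when that raises, then the "mean" of each scalar is just the scalar itself. The helper `agg_like` below is hypothetical and pure Python; it only mimics that order of attempts.

```python
def agg_like(values, func):
    """Hypothetical stand-in: try func on each element, else on the whole list."""
    try:
        return [func(v) for v in values]
    except TypeError:
        return func(values)

def scalar_tolerant_mean(x):
    # Accepts a scalar or a list, loosely like np.mean does
    return sum(x) / len(x) if isinstance(x, list) else x

def strict_mean(xs):
    # Raises TypeError on a scalar, because len() of a float is undefined
    return sum(xs) / len(xs)

carat = [0.23, 0.21, 0.23, 0.29, 0.31]

# The elementwise attempt succeeds, so nothing is reduced
elementwise = agg_like(carat, scalar_tolerant_mean)

# The elementwise attempt raises, so the fallback reduces the whole list
reduced = agg_like(carat, strict_mean)
```

Under this assumption, whether a function reduces depends on whether it happens to accept a scalar, which would explain why the lambda and `np.mean` passed directly give different results.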
Compare the `Series.agg` results above with the `groupby().agg` results below:
```python
dat['carat'].groupby(dat['color']).agg(np.mean),
dat['carat'].groupby(dat['color']).agg(np.mean, axis=0),
dat['carat'].groupby(dat['color']).agg('mean'),
dat['carat'].groupby(dat['color']).agg(lambda x: np.mean(x)),
dat['carat'].groupby(dat['color']).agg(lambda x, axis: np.mean(x, axis=axis), axis=0)
## <all the same here>
## color
## A    0.22
## E    0.26
## J    0.31
## Name: carat, dtype: float64,
```
Call it a bug, an inconsistency, or poor design?
Issue Description
For consistency, `pd.Series.agg()` and `pd.Series.groupby().agg()` should behave the same way when given the same function.

Reading the code (quoted below), the `axis` parameter of `Series.aggregate()` seems useless (or am I missing something here?). Even if a named parameter really is needed, it would be better to name it something like `_axis_` so that it does not collide with parameter names of the function being applied. For example, `np.mean()` has a parameter named `axis`, and a user might well name one of their own parameters `axis` (for example, `lambda x, axis: np.mean(x, axis=axis)`).

As far as I know, `axis=0` and `self._get_axis_number(axis)` are not needed, at least for Series. (DataFrame has a similar signature, and I think its parameter should be renamed to something like `_axis_`, `_axis`, or `__axis__`.)

So I propose deleting the `axis` parameter from `Series.aggregate` and renaming `axis` to `_axis` in `DataFrame.aggregate`.

For reference, this is the current `Series.aggregate`:
```python
def aggregate(self, func=None, axis=0, *args, **kwargs):
    # Validate the axis parameter
    self._get_axis_number(axis)

    # if func is None, will switch to user-provided "named aggregation" kwargs
    if func is None:
        func = dict(kwargs.items())

    op = SeriesApply(self, func, convert_dtype=False, args=args, kwargs=kwargs)
    result = op.agg()
    return result
```
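The name collision can be reproduced without pandas. The sketch below is a simplified illustration, not the pandas implementation: `aggregate_current` and `aggregate_renamed` are hypothetical stand-ins that keep only the shape of the signature, and `user_mean` is a made-up user function. It shows that a keyword named `axis` is absorbed by the wrapper's own parameter and never reaches the user function, while a renamed parameter lets it pass through `**kwargs`.

```python
def aggregate_current(values, func, axis=0, *args, **kwargs):
    # `axis` binds to this function's own parameter and is not forwarded
    return func(values, *args, **kwargs)

def aggregate_renamed(values, func, _axis=0, *args, **kwargs):
    # With the parameter renamed, a user-supplied `axis` stays in **kwargs
    return func(values, *args, **kwargs)

def user_mean(values, axis):
    # User function that happens to name one of its own parameters `axis`
    return sum(values) / len(values)

carat = [0.23, 0.21, 0.23, 0.29, 0.31]

try:
    aggregate_current(carat, user_mean, axis=0)
    error_message = None
except TypeError as exc:
    # Same kind of failure as the TypeError in the reproducible example
    error_message = str(exc)

# The renamed variant forwards `axis` to the user function
forwarded = aggregate_renamed(carat, user_mean, axis=0)
```

Renaming only reduces the chance of a clash; a user function with a parameter called `_axis` would hit the same problem, which is one argument for dropping the parameter from `Series.aggregate` altogether.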
Expected Behavior
`dat['carat'].groupby(dat['color']).agg(lambda x: np.mean(x))`
and
`dat['carat'].agg(lambda x: np.mean(x))`
should work the same way.
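Until the two methods agree, one possible workaround, assuming a reduction over the whole Series is what is wanted, is to call the function through `Series.pipe`, which passes the Series itself as the first argument and forwards keyword arguments. This is only a sketch of a workaround, not a statement of how `agg` should be implemented.

```python
import numpy as np
import pandas as pd

carat = pd.Series([0.23, 0.21, 0.23, 0.29, 0.31],
                  index=[1, 2, 3, 4, 5], name='carat')

# pipe hands the whole Series to the function, so `axis` reaches the lambda
whole = carat.pipe(lambda x, axis: np.mean(x, axis=axis), axis=0)

# The string alias also reduces over the whole Series
alias = carat.agg('mean')
```

Both values should match the `0.254` that `agg(np.mean)` returns in the reproducible example.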
Installed Versions
INSTALLED VERSIONS
commit : 06d2301
python : 3.8.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Korean_Korea.949
pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.9.3
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.8.0
sqlalchemy : None
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None