Description
Similar issues: #10923, #9697, #9941
Please consider the following data:
import numpy
import pandas
df = pandas.DataFrame({'A': numpy.random.rand(20),
                       'B': numpy.random.rand(20) * 10,
                       'C': numpy.random.randint(0, 5, 20)})
df.loc[:4, 'C'] = None
The two lines of code below should do the same thing: output the group average as the value for every row of the group. The first one uses a string function name, the second one uses a lambda function. The first one works; the second one doesn't.
In [41]: df.groupby('C')['B'].transform('mean')
Out[41]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 5.670891
6 5.335332
7 0.580197
8 5.670891
9 5.670891
10 1.628290
11 1.628290
12 5.670891
13 8.493416
14 5.670891
15 8.493416
16 5.335332
17 5.670891
18 5.670891
19 5.335332
Name: B, dtype: float64
In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
3061
3062 result.name = self._selected_obj.name
-> 3063 result.index = self._selected_obj.index
3064 return result
3065
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
3092 try:
3093 object.__getattribute__(self, name)
-> 3094 return object.__setattr__(self, name, value)
3095 except AttributeError:
3096 pass
pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
306 object.__setattr__(self, '_index', labels)
307 if not fastpath:
--> 308 self._data.set_axis(axis, labels)
309
310 def _set_subtyp(self, is_all_dates):
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
2834 raise ValueError('Length mismatch: Expected axis has %d elements, '
2835 'new values have %d elements' %
-> 2836 (old_len, new_len))
2837
2838 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements
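The two counts in the error message look consistent with the five rows whose key was set to None being dropped from the grouping (20 - 5 = 15). Below is a minimal sketch of that arithmetic. It uses a small deterministic frame instead of the random data above, so the column values and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame in the report (assumed values).
df = pd.DataFrame({
    'B': np.arange(20, dtype=float),
    'C': np.array([0, 1, 2, 3, 4] * 4, dtype=float),
})
df.loc[:4, 'C'] = np.nan  # rows 0..4 lose their group key

n_total = len(df)
# groupby drops rows with a missing key by default,
# so the group sizes add up to fewer rows than the frame has.
n_grouped = int(df.groupby('C')['B'].size().sum())

print(n_total, n_grouped)
```

If this reading is right, the lambda path builds a result from the 15 grouped rows only and then fails when the original 20-element index is assigned to it.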
The first result, using 'mean', is what I was expecting. In any case, it seems strange to have two different behaviours for the same operation.
Note: the second one, with the lambda function, used to work in pandas version 0.19.1.
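Until the two paths agree, one possible workaround is to compute the group means once and map them back onto the key column, which avoids transform entirely. This is only a sketch: the seed, the float dtype for C (chosen so that assigning NaN does not rely on upcasting), and the variable names are my assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Seeded so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'A': rng.rand(20),
    'B': rng.rand(20) * 10,
    'C': rng.randint(0, 5, 20).astype(float),
})
df.loc[:4, 'C'] = np.nan  # rows 0..4 lose their group key

# One mean per non-missing key; rows with a missing key are not grouped.
group_means = df.groupby('C')['B'].mean()

# Look up each row's key in the group means.
# Rows whose key is NaN find no match and come back as NaN.
per_row = df['C'].map(group_means).rename('B')

print(per_row)
```

The result has the same length and index as df, with NaN in the rows that have no key, which matches the shape of the output I was expecting from transform.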
I first posted this question on Stack Overflow: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there, I began to suspect that this is a bug.
Thanks
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None