Ambiguous behaviour when transform groupby with NaNs #17093

@chbrandt

Similar issues: #10923, #9697, #9941

Please, consider the following data:

import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
                       'B':numpy.random.rand(20)*10,
                       'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None

The two lines of code below should do the same thing: replace each row's value with the mean of its group. The first passes the function name as a string, the second passes a lambda. The first works; the second does not.

In [41]: df.groupby('C')['B'].transform('mean')
Out[41]: 
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5     5.670891
6     5.335332
7     0.580197
8     5.670891
9     5.670891
10    1.628290
11    1.628290
12    5.670891
13    8.493416
14    5.670891
15    8.493416
16    5.335332
17    5.670891
18    5.670891
19    5.335332
Name: B, dtype: float64
In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
   3061 
   3062         result.name = self._selected_obj.name
-> 3063         result.index = self._selected_obj.index
   3064         return result
   3065 

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   3092         try:
   3093             object.__getattribute__(self, name)
-> 3094             return object.__setattr__(self, name, value)
   3095         except AttributeError:
   3096             pass

pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    306         object.__setattr__(self, '_index', labels)
    307         if not fastpath:
--> 308             self._data.set_axis(axis, labels)
    309 
    310     def _set_subtyp(self, is_all_dates):

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   2834             raise ValueError('Length mismatch: Expected axis has %d elements, '
   2835                              'new values have %d elements' %
-> 2836                              (old_len, new_len))
   2837 
   2838         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements
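
The 15 in the message is most likely the number of rows that actually belong to a group: by default, groupby leaves out rows whose key is NaN, so the 5 rows set to None above are dropped and only 15 of the 20 remain. A minimal sketch to check this count follows; the seeded RandomState and the float key column are assumptions added for reproducibility and are not part of the original report.

```python
import numpy
import pandas

# Seeded generator (an assumption, so the example is reproducible).
rng = numpy.random.RandomState(0)
df = pandas.DataFrame({'B': rng.rand(20) * 10,
                       # Float keys so NaN can be stored without a dtype change.
                       'C': rng.randint(0, 5, 20).astype(float)})
df.loc[:4, 'C'] = numpy.nan  # .loc slicing is inclusive: rows 0..4

# Rows that end up in some group (NaN keys are excluded by default).
rows_in_groups = int(df.groupby('C').size().sum())
# Rows whose key is NaN and therefore belong to no group.
rows_with_nan_key = int(df['C'].isnull().sum())

print(rows_in_groups, rows_with_nan_key)  # 15 5
```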

The first result, using 'mean', is what I was expecting. In any case, it seems strange to me that the same operation behaves in two different ways.
Note: The second form, with the lambda function, used to work in pandas 0.19.1.
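
A possible workaround that does not depend on how transform treats NaN keys is to compute the group means once and map them back onto the key column; rows whose key is NaN then simply get NaN. This is only a sketch: the seeded RandomState, the float key column, and the names group_means and result are assumptions, not part of the original report.

```python
import numpy
import pandas

# Seeded generator (an assumption, so the example is reproducible).
rng = numpy.random.RandomState(0)
df = pandas.DataFrame({'A': rng.rand(20),
                       'B': rng.rand(20) * 10,
                       # Float keys so NaN can be stored without a dtype change.
                       'C': rng.randint(0, 5, 20).astype(float)})
df.loc[:4, 'C'] = numpy.nan  # rows 0..4 get a missing key

# One mean of B per non-NaN value of C.
group_means = df.groupby('C')['B'].mean()

# Look up each row's key in group_means; NaN keys have no match and stay NaN.
result = df['C'].map(group_means)
result.name = 'B'

print(result)
```

The result keeps the original 20-row index, so it can be assigned back to the frame directly, for example as a new column.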

I first posted this question on SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there, I started to think that this might be a bug.

Thanks

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels

Apply: Apply, Aggregate, Transform, Map
Groupby
Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Needs Tests: Unit test(s) needed to prevent regressions
