Unexpected behaviour of groupby.transform when using 'fillna' #30918

Closed
@lfiedler

Code Sample, a copy-pastable example if possible

import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'baz'],
        'B': [1, 2, np.nan, 3, 3, np.nan, 4],
        'C': [np.nan]*7,
        'D': [0,1,2,3,4,5,6],
        'E': [np.nan] + [datetime.datetime(2020,1,1)]*3 + [datetime.datetime(2020,1,2)]*2 +[datetime.datetime(2020,1,3)],
        'F': list('abcdefg'),
        'G': list('abc') + [np.nan] + list('efg'),
        'id': range(0,7),
    }
).set_index('id')
df.groupby('A').transform('fillna', value=9999)

Output

     B       C  D                    E  F  G
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
   1.0  9999.0  0                 9999  a  a
   1.0  9999.0  0                 9999  a  a
   2.0  9999.0  1  2020-01-01 00:00:00  b  b

Problem description

When using GroupBy.transform with the string 'fillna', I expected it to behave like GroupBy.transform with lambda x: x.fillna(9999). Instead, it also changes values that are not NaN. Even worse, it appears to shuffle contents between groups: for example, the rows belonging to group 'bar' come back with values (F = 'a') that belong to a 'foo' row.

Is this how it is expected to work?

Expected Output

df.groupby('A').transform(lambda x: x.fillna(9999))
     B       C  D                    E  F     G
   1.0  9999.0  0                 9999  a     a
   2.0  9999.0  1  2020-01-01 00:00:00  b     b
9999.0  9999.0  2  2020-01-01 00:00:00  c     c
   3.0  9999.0  3  2020-01-01 00:00:00  d  9999
   3.0  9999.0  4  2020-01-02 00:00:00  e     e
9999.0  9999.0  5  2020-01-02 00:00:00  f     f
   4.0  9999.0  6  2020-01-03 00:00:00  g     g
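
The expectation above can also be stated as a check rather than a printed table. What follows is only a minimal sketch on a reduced, numeric-only version of the frame (columns B and D; the other columns are left out to keep the dtypes simple). It relies on the assumption that filling with a constant does not depend on the group, so the grouped result should equal a plain DataFrame.fillna on the same columns. It uses the callable form only, since the string form is the one under question here.

```python
import numpy as np
import pandas as pd

# Reduced version of the frame from the report: numeric columns only.
df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
        "D": [0, 1, 2, 3, 4, 5, 6],
    },
    index=pd.RangeIndex(7, name="id"),
)

# Callable form: fill each group's columns separately.
filled = df.groupby("A").transform(lambda x: x.fillna(9999))

# A constant fill is group-independent, so an ungrouped fillna
# on the same columns serves as the reference result.
reference = df[["B", "D"]].fillna(9999)

# Row order must be preserved and every cell must match the reference.
same_index = filled.index.equals(reference.index)
same_values = bool((filled == reference).all().all())

print(filled["B"].tolist())  # [1.0, 2.0, 9999.0, 3.0, 3.0, 9999.0, 4.0]
print(same_index, same_values)
```

If both flags are True, only the NaN cells were replaced and no row moved between groups, which is the behaviour I expected from the string form as well.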

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Labels: Groupby, Missing-data
