BUG: 1.3.0 DataFrame.agg over categorical columns with non-unique index returns wrong size result #42380

Closed
@ivirshup

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"a": list("abcde"), "b": list("abcde")}, index=list("aabbc"), dtype="category")
df.agg("-".join, axis=1)

Using pandas 1.2.5:

a    a-a
a    b-b
b    c-c
b    d-d
c    e-e
dtype: object

Using pandas 1.3.0:

a    b-b
b    d-d
c    e-e
dtype: object

It does not look like this is an issue if I use df.apply instead of df.agg.
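
For reference, a minimal sketch of the df.apply variant mentioned above, using the same frame as the code sample. The variable name `joined` is only for illustration, and the claim that this avoids the problem on 1.3.0 comes from the observation above rather than from a look at the pandas internals.

```python
import pandas as pd

# Same frame as in the code sample: categorical columns, non-unique index.
df = pd.DataFrame(
    {"a": list("abcde"), "b": list("abcde")},
    index=list("aabbc"),
    dtype="category",
)

# Row-wise join through apply rather than agg.
joined = df.apply("-".join, axis=1)

# One value per input row is expected, with the original index labels kept.
print(len(joined) == len(df))
print(joined)
```

Until the regression is fixed, swapping agg for apply in the row-wise call may be enough as a stopgap.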

Problem description

When a row-wise aggregation (axis=1) is run on a dataframe with categorical columns and a non-unique index, the result is the wrong length.

It's weird that the output isn't the right length. Since I'm computing a value per row, I expect the same number of rows in the output as in the input. It's especially weird that this only happens if the columns are categorical.

That is, the same frame built without dtype="category" gives one value per row:

df = pd.DataFrame({"a": list("abcde"), "b": list("abcde")}, index=list("aabbc"))
df.agg("-".join, axis=1)
a    a-a
a    b-b
b    c-c
b    d-d
c    e-e
dtype: object

in both versions.
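
The expectation that a row-wise agg returns one value per input row can be written as a small check. This is only a sketch: the helper name `rowwise_length_matches` is hypothetical and not part of pandas, and the result for the categorical frame depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd


def rowwise_length_matches(frame, func):
    """Return True if agg(func, axis=1) yields one value per input row.

    Hypothetical helper for this report, not a pandas API.
    """
    result = frame.agg(func, axis=1)
    return len(result) == len(frame)


data = {"a": list("abcde"), "b": list("abcde")}
index = list("aabbc")  # non-unique index, as in the report

obj_df = pd.DataFrame(data, index=index)                    # object columns
cat_df = pd.DataFrame(data, index=index, dtype="category")  # categorical columns

print("object:  ", rowwise_length_matches(obj_df, "-".join))
print("category:", rowwise_length_matches(cat_df, "-".join))
```

Based on the outputs above, the categorical line should be the only one that differs between 1.2.5 and 1.3.0.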

Expected Output

I would expect the same output between versions. The result given by 1.2.5 seems more correct to me at the moment.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.8.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.5.0
Version          : Darwin Kernel Version 20.5.0: Sat May  8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 21.1.3
setuptools       : 56.0.0
Cython           : 0.29.23
pytest           : 6.2.4
hypothesis       : None
sphinx           : 4.0.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.3
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.23.1
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : 2021.06.0
fastparquet      : 0.4.1
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : 2.7.2
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 4.0.1
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.7.0
sqlalchemy       : 1.3.18
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.18.2
xlrd             : 1.2.0
xlwt             : None
numba            : 0.53.1


Labels

Apply (Apply, Aggregate, Transform, Map)
Regression (Functionality that used to work in a prior pandas version)
