Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame.from_dict(
{
('column_1', 'a'): {
('group1', 'group2', 0): '01a',
('group1', 'group2', 1): '11a',
('group1', 'group2', 2): '21a'
},
('column_1', 'b'): {
('group1', 'group2', 0): '01b',
('group1', 'group2', 1): '11b',
('group1', 'group2', 2): '21b'
},
('column_2', 'a'): {
('group1', 'group2', 0): '02a',
('group1', 'group2', 1): '12a',
('group1', 'group2', 2): '22a'
},
('column_2', 'b'): {
('group1', 'group2', 0): '02b',
('group1', 'group2', 1): '12b',
('group1', 'group2', 2): '22b'
}}
)
desired_result = df.groupby(level=0, axis=1).apply(lambda df: df.apply(tuple, axis=1)) # semantically chrystal-clear, why deprecate?
should_be_equivalent = df.T.groupby(level=0).apply(lambda df: df.apply(tuple)).T # has one column level more, not less
this_works = df.T.groupby(level=0).apply(lambda df: df.apply(tuple, result_type="reduce")).T
desired_result.equals(should_be_equivalent) # False
desired_result.equals(this_works) # True
Issue Description
Related to #51203, #51778 where people fail to see the use of .groupby(axis=1)
. These issues fail to name good reasons for deprecating this that could counter this: It should be retained for API consistency and for backward compatibility. It does not matter if transforms are happening under the hood, as a user I don't care. I care about consistency and being able to use my code in the future.
So here comes the problem: Here's a dataframe with two-by-two multicolumns, four columns in total. The aim is to eliminate the second level and to end up with only two columns in total (according to the first level) that should contain tuples being pairs which correspond to a combination of the two squashed columns from the second level (desired_result
). The problem is, when, instead of grouping columns and applying tuple
to rows, I transform, group, and apply tuple
to all colums, the output of tuple
, by default, will be interpreted as a transformation of the column (corresponding to result_type="expand"
behaviour) and not as a "squashed" tuple that would get rid of the second, now-index level (corresponding to result_type="reduce"
behaviour). That is, I still end up with four columns. Even worse, the fact that group_keys=True
by default results in an additional column level duplicating the original first one.
As shown above, the fix is to include .apply(tuple, result_type="reduce")
. In this case, no need to include group_keys=False
to avoid the duplicated column level.
So my plea, again: Please do not deprecate .groupby(axis=1)
. Rather, seek to make the behaviours consistent.
Installed Versions
INSTALLED VERSIONS
commit : 2a953cf
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 6.1.62-1-MANJARO
Version : #1 SMP PREEMPT_DYNAMIC Thu Nov 9 03:01:44 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.3
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.23
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
...lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning:
Setuptools is replacing distutils.