DEPR: DataFrame.stack on columns containing duplicate values #53761

Closed
@rhshadrach


With MultiIndex columns that contain duplicate values, DataFrame.stack gives incorrect results:

import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
print(df)
# l1  a  b  a
# l2  x  y  x
# 0   0  1  2
# 1   3  4  5
  
print(df.stack(0))
# l2    x    y
#   l1
# 0 a   0  NaN
#   b   2  1.0
# 1 a   3  NaN
#   b   5  4.0

In particular, row 0 of the second (a, x) column holds the value 2, and stack moves it to the position indexed by ((0, b), x).
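To make the source of that value concrete, here is a small sketch that looks up both (a, x) columns by position. It deliberately avoids calling stack, since the stack result for duplicate columns may differ between pandas versions; the names `positions` and `row0_values` are only illustrative.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

# Iterating a MultiIndex yields one tuple per column, so the label
# ("a", "x") can be matched position by position.
positions = [i for i, label in enumerate(df.columns) if label == ("a", "x")]

# Row 0 of every column labelled ("a", "x"), selected by position so the
# duplicate labels cannot be confused with one another.
row0_values = df.iloc[0, positions].tolist()

print(positions)    # [0, 2]
print(row0_values)  # [0, 2]
```

The second entry, 2, is the value that shows up under (0, b) in the stacked output above.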

Taking the same example but with an Index gives a more reasonable result:

df = df.droplevel(1, axis=1)
print(df)
# l1  a  b  a
# 0   0  1  2
# 1   3  4  5

print(df.stack(0))
#    l1
# 0  a     0
#    b     1
#    a     2
# 1  a     3
#    b     4
#    a     5
# dtype: int64

However, I think this is still inconsistent with the rest of stack: it's the only case where the index level that comes from the columns repeats a label within a single row (here a appears twice under row 0 and twice under row 1).

Accessing particular subsets of a DataFrame when the columns have duplicate values is fraught with difficulty. I think we should deprecate supporting duplicate values and in the future raise instead.
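As a rough illustration of what "raise instead" could look like from the user's side, the sketch below checks for duplicate column labels before stacking. `require_unique_columns` is a hypothetical helper written for this example, not a pandas function, and the error message is made up; it relies only on `Index.is_unique` and `Index.duplicated`.

```python
import numpy as np
import pandas as pd


def require_unique_columns(df):
    """Hypothetical guard: raise if any column label occurs more than once."""
    if not df.columns.is_unique:
        # duplicated() marks every repeat after the first occurrence.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"duplicate column labels are not supported: {dupes}")


columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

try:
    require_unique_columns(df)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

A caller would run the guard first and only call `df.stack(...)` when it passes. The same check works for the single-level case, because `is_unique` and `duplicated` are available on both Index and MultiIndex.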


    Labels

    Bug
    Deprecate: Functionality to remove in pandas
    Needs Discussion: Requires discussion from core team before further action
    Reshaping: Concat, Merge/Join, Stack/Unstack, Explode
