DEPR: DataFrame.stack on columns containing duplicate values

With MultiIndex columns that contain duplicate entries, `DataFrame.stack` gives incorrect results:

```
import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
print(df)
# l1  a  b  a
# l2  x  y  x
# 0   0  1  2
# 1   3  4  5
  
print(df.stack(0))
# l2    x    y
#   l1
# 0 a   0  NaN
#   b   2  1.0
# 1 a   3  NaN
#   b   5  4.0
```

In particular, the value 2 in `df`, which sits in row 0 under the second `(a, x)` column, gets moved to the position indexed by `((0, b), x)`.
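To make the duplicate concrete, here is a small sketch that rebuilds the frame from the example and inspects the input only (it does not call `stack`). The variable names `dup_mask`, `labels`, and `value` are mine, chosen for illustration.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

# Mark every column whose label occurs more than once.
dup_mask = df.columns.duplicated(keep=False)

# The labels as plain tuples; positions 0 and 2 are both ("a", "x").
labels = list(df.columns)

# Positional lookup avoids the ambiguity of label-based access.
value = int(df.iloc[0, 2])

print(dup_mask)
print(labels)
print(value)
```

Positional access with `iloc` is used here because the label `("a", "x")` alone cannot distinguish the two columns.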

Taking the same example but with an Index gives a more reasonable result:

```
df = df.droplevel(1, axis=1)
print(df)
# l1  a  b  a
# 0   0  1  2
# 1   3  4  5

print(df.stack(0))
#    l1
# 0  a     0
#    b     1
#    a     2
# 1  a     3
#    b     4
#    a     5
# dtype: int64
```

However, I think this is still inconsistent with the rest of `stack`: it's the only case where the index level that comes from the columns contains duplicate values.

Accessing particular subsets of a DataFrame when the columns have duplicate values is fraught with difficulty. I think we should deprecate supporting duplicate values and in the future raise instead.
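Until something like this lands, user code can guard against the problem itself. The following is only a sketch of what "raise instead" could look like from the caller's side; `stack_unique_columns` is a hypothetical helper, not a pandas API, and it is not meant to reflect how pandas would implement the deprecation.

```python
import pandas as pd


def stack_unique_columns(df, level=-1):
    """Hypothetical helper: refuse to stack when column labels repeat."""
    if not df.columns.is_unique:
        # Collect each repeated label once for the error message.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"columns contain duplicate labels: {dupes}")
    return df.stack(level)


# Unique columns: stacking proceeds as usual.
ok = pd.DataFrame([[0, 1], [2, 3]], columns=["a", "b"])
stacked = stack_unique_columns(ok)

# Duplicate columns: the helper raises before stack is called.
bad = pd.DataFrame([[0, 1, 2]], columns=["a", "b", "a"])
try:
    stack_unique_columns(bad)
    raised = False
except ValueError:
    raised = True
```

The check relies on `Index.is_unique` and `Index.duplicated`, which work for both a flat `Index` and a `MultiIndex`.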

DEPR: DataFrame.stack on columns containing duplicate values #53761

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

DEPR: DataFrame.stack on columns containing duplicate values #53761

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions