Description
With MultiIndex columns that contain duplicate values, DataFrame.stack gives incorrect results:
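import numpy as np
import pandas as pd

# The codes make (a, x) appear twice, at column positions 0 and 2.
columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)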
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
print(df)
# l1  a  b  a
# l2  x  y  x
# 0   0  1  2
# 1   3  4  5
print(df.stack(0))
# l2    x    y
#   l1
# 0 a   0  NaN
#   b   2  1.0
# 1 a   3  NaN
#   b   5  4.0
In particular, the value of df indexed by (0, (a, x)) is 2 (from the second (a, x) column), and this gets moved to the value indexed by ((0, b), x).
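To see where the misplaced value comes from, the following minimal sketch rebuilds the same frame and lists the repeated column label together with the values row 0 stores under it. The variable names dup_mask, dup_labels and row0_values are illustrative only; the calls are standard pandas Index methods.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

# Mark every column whose full (l1, l2) label occurs more than once.
dup_mask = columns.duplicated(keep=False)
dup_labels = list(columns[dup_mask].unique())

# Values that row 0 stores under the repeated label, in column order.
row0_values = df.iloc[0].to_numpy()[dup_mask].tolist()

print(columns.is_unique)  # False
print(dup_labels)         # [('a', 'x')]
print(row0_values)        # [0, 2]
```

Row 0 holds both 0 and 2 under (a, x), so a correct stacked result would need to keep both values under (0, a) rather than moving one of them to (0, b).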
Taking the same example but with an Index gives a more reasonable result:
df = df.droplevel(1, axis=1)
print(df)
# l1  a  b  a
# 0   0  1  2
# 1   3  4  5
print(df.stack(0))
#    l1
# 0  a     0
#    b     1
#    a     2
# 1  a     3
#    b     4
#    a     5
# dtype: int64
However, I think this is still inconsistent with the rest of stack: it's the only case where the index level coming from the columns contains duplicate values.
Accessing particular subsets of a DataFrame when the columns have duplicate values is fraught with difficulty. I think we should deprecate supporting duplicate values and in the future raise instead.
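As a rough illustration of both points, the sketch below shows that selecting a duplicated label returns several columns at once, and then what a "raise instead" check could look like from user code. The helper require_unique_columns is hypothetical and not part of pandas; it only assumes that the check would be based on column uniqueness.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    columns=pd.Index(["a", "b", "a"], name="l1"),
)

# Selecting by a duplicated label returns every matching column,
# so the result is a two-column DataFrame rather than a Series.
selected = df["a"]
print(type(selected).__name__, selected.shape)  # DataFrame (2, 2)


def require_unique_columns(frame):
    """Hypothetical guard: raise if any column label is repeated."""
    cols = frame.columns
    if not cols.is_unique:
        repeated = list(cols[cols.duplicated()].unique())
        raise ValueError(f"duplicate column labels: {repeated}")
    return frame


try:
    require_unique_columns(df)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)  # duplicate column labels: ['a']
```

Calling such a guard before stack would turn the silent reshuffling shown above into an explicit error.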