
ENH: Add allow_duplicates to MultiIndex.to_frame #45318

Merged
merged 17 commits into from
Jan 22, 2022
Changes from 1 commit
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -226,6 +226,7 @@ Other enhancements
- :class:`ExtensionDtype` and :class:`ExtensionArray` are now (de)serialized when exporting a :class:`DataFrame` with :meth:`DataFrame.to_json` using ``orient='table'`` (:issue:`20612`, :issue:`44705`).
- Add support for `Zstandard <http://facebook.github.io/zstd/>`_ compression to :meth:`DataFrame.to_pickle`/:meth:`read_pickle` and friends (:issue:`43925`)
- :meth:`DataFrame.to_sql` now returns an ``int`` of the number of written rows (:issue:`23998`)
- :meth:`MultiIndex.to_frame` now supports the argument ``allow_duplicates`` and raises on duplicate labels if it is missing or False (:issue:`45245`)

Contributor

revert this

Contributor Author

Done

.. ---------------------------------------------------------------------------

28 changes: 22 additions & 6 deletions pandas/core/indexes/multi.py
@@ -1710,7 +1710,12 @@ def unique(self, level=None):
             level = self._get_level_number(level)
             return self._get_level_values(level=level, unique=True)

-    def to_frame(self, index: bool = True, name=lib.no_default) -> DataFrame:
+    def to_frame(
+        self,
+        index: bool = True,
+        name=lib.no_default,
+        allow_duplicates: bool = False,
+    ) -> DataFrame:
         """
         Create a DataFrame with the levels of the MultiIndex as columns.

@@ -1725,6 +1730,11 @@ def to_frame(self, index: bool = True, name=lib.no_default) -> DataFrame:
         name : list / sequence of str, optional
             The passed names should substitute index level names.

+        allow_duplicates : bool, optional default False
+            Allow duplicate column labels to be created.
+
+            .. versionadded:: 1.4.0
+
         Returns
         -------
         DataFrame : a DataFrame containing the original MultiIndex data.
@@ -1774,14 +1784,20 @@ def to_frame(self, index: bool = True, name=lib.no_default) -> DataFrame:
         else:
             idx_names = self.names

+        idx_names = [
Contributor

umm why are you repeating L1785?

Contributor Author

It is doing a transform: filling in None names with the level number.

Whether that is the right thing to do is a separate issue. I am just preserving the existing behavior.

Contributor

then this needs another argument similar to how this is done in .reset_index

Contributor Author

I am not changing anything here. The old code is https://github.com/johnzangwill/pandas/blob/6cc5584bba59ef8f06d4dc901dc39ddd08d1519f/pandas/core/indexes/multi.py#L1780:

(level if lvlname is None else lvlname): self._get_level_values(level)

and I have just moved that logic earlier, since I need unique dictionary keys.

In any case, insert and reset_index do this differently, replacing None level labels with level_n. As I say, that is a separate issue and I have raised it elsewhere (#45245), but it is not the subject of this PR.

I don't think that this is conditional in reset_index or that there is an argument for it. Which argument are you referring to?

This is the code in reset_index:

            if isinstance(self.index, MultiIndex):
                names = com.fill_missing_names(self.index.names)
                to_insert = zip(self.index.levels, self.index.codes)
            else:
                default = "index" if "index" not in self else "level_0"
                names = [default] if self.index.name is None else [self.index.name]
                to_insert = ((self.index, None),)

That puts in "level_n" for a MultiIndex and "index" or "level_0" for a simple Index.
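
The point about needing unique dictionary keys can be seen without pandas. The snippet below is a plain-Python illustration only, not code from the PR; `columns_by_name` and `columns_by_level` are hypothetical names introduced for this sketch.

```python
# Plain-Python illustration (not pandas code): why keying a dict by
# level name loses data when two levels share the same name.
names = ["a", "a"]                # duplicate level names
level_values = [[1, 3], [2, 4]]   # values of level 0 and level 1

# Keyed by name: the second "a" silently overwrites the first.
columns_by_name = {name: vals for name, vals in zip(names, level_values)}

# Keyed by level number: keys are always unique, so both levels survive.
columns_by_level = {level: vals for level, vals in enumerate(level_values)}

print(columns_by_name)    # {'a': [2, 4]}
print(columns_by_level)   # {0: [1, 3], 1: [2, 4]}
```

Keying by level number and attaching the labels afterwards is what allows possibly duplicated labels to be applied only once the frame has been built.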

Contributor

OK, can you just make a method on Index to do this then? Repeating this code is not great.

Contributor Author

I have searched pandas and I cannot find any other instance of this. The nearest is

name = self.name or 0

which does implement the policy (on self.name, not self.names). I can factor that down if you think that it is worth it.

Contributor

Yeah, I think a common method on Index is worth it here (to share between here and reset_index).

Contributor Author (@johnzangwill, Jan 17, 2022)

Ok, but I have already explained that reset_index does not do this.

reset_index, to_records and many other methods all fill the None entries with "level_n", not with "n". As you know, I factored those out into a common method (com.fill_missing_names #44878) which is invoked in 6 different places.

These MI/Index.to_frame methods are the only ones which do it differently, filling the gaps with the column number. This difference could be discussed, and I have made an issue (#45245), but I don't suggest changing it without a lot of thought. Changing to_frame would break virtually all its tests.
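
The two naming policies contrasted in this comment can be put side by side in a minimal sketch. Both functions are hypothetical stand-ins written for illustration, based only on the behaviour described in this thread; they are not the pandas implementations.

```python
from typing import Hashable, List, Sequence


def fill_with_position(names: Sequence[Hashable]) -> List[Hashable]:
    """Hypothetical stand-in for the to_frame policy: None -> level number."""
    return [level if name is None else name for level, name in enumerate(names)]


def fill_with_level_prefix(names: Sequence[Hashable]) -> List[Hashable]:
    """Hypothetical stand-in for the reset_index policy: None -> 'level_n'."""
    return [
        f"level_{level}" if name is None else name
        for level, name in enumerate(names)
    ]


names = ["a", None]
print(fill_with_position(names))      # ['a', 1]
print(fill_with_level_prefix(names))  # ['a', 'level_1']
```

Because the first policy produces integers and the second produces strings, switching to_frame from one to the other would change the resulting column labels, which is consistent with the concern about breaking existing tests.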

+            level if name is None else name for level, name in enumerate(idx_names)
+        ]
+
+        if not allow_duplicates and len(set(idx_names)) != len(idx_names):
+            raise ValueError(
+                "Cannot create duplicate column labels if allow_duplicates is False"
+            )
+
         # Guarantee resulting column order - PY36+ dict maintains insertion order
         result = DataFrame(
-            {
-                (level if lvlname is None else lvlname): self._get_level_values(level)
-                for lvlname, level in zip(idx_names, range(len(self.levels)))
-            },
+            {level: self._get_level_values(level) for level in range(len(self.levels))},
             copy=False,
-        )
+        ).set_axis(idx_names, axis=1)

         if index:
             result.index = self
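
The order of operations in this hunk matters: missing names are filled first, and the duplicate check runs on the filled labels. A minimal plain-Python sketch of that logic follows; `resolve_column_labels` is a hypothetical helper named for this example and does not exist in pandas.

```python
def resolve_column_labels(names, allow_duplicates=False):
    """Hypothetical helper mirroring the fill-then-check logic in the diff."""
    # Replace missing names with the level number, as to_frame does.
    labels = [level if name is None else name for level, name in enumerate(names)]
    # Duplicates are detected on the filled labels, not on the raw names.
    if not allow_duplicates and len(set(labels)) != len(labels):
        raise ValueError(
            "Cannot create duplicate column labels if allow_duplicates is False"
        )
    return labels


print(resolve_column_labels(["a", "b"]))                        # ['a', 'b']
print(resolve_column_labels([None, 0], allow_duplicates=True))  # [0, 0]

try:
    resolve_column_labels([None, 0])
except ValueError as err:
    print(err)
```

The `[None, 0]` case shows why the check cannot run on the raw names: they are distinct before filling, and only collide once `None` is replaced by its level number.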
22 changes: 22 additions & 0 deletions pandas/tests/indexes/multi/test_conversion.py
@@ -126,6 +126,28 @@ def test_to_frame_resulting_column_order():
     assert result == expected


+def test_to_frame_duplicate_labels():
+    # GH 45245
+    data = [(1, 2), (3, 4)]
+    names = ["a", "a"]
+    index = MultiIndex.from_tuples(data, names=names)
+    with pytest.raises(ValueError, match="Cannot create duplicate column labels"):
+        index.to_frame()
+
+    result = index.to_frame(allow_duplicates=True)
+    expected = DataFrame(data, index=index, columns=names)
+    tm.assert_frame_equal(result, expected)
+
+    names = [None, 0]
+    index = MultiIndex.from_tuples(data, names=names)
+    with pytest.raises(ValueError, match="Cannot create duplicate column labels"):
+        index.to_frame()
+
+    result = index.to_frame(allow_duplicates=True)
+    expected = DataFrame(data, index=index, columns=[0, 0])
+    tm.assert_frame_equal(result, expected)
+
+
 def test_to_flat_index(idx):
     expected = pd.Index(
         (