BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex #41162

taytzehao · 2021-04-26T08:18:46Z

closes pandas concat does not match index names when concatenating two dataframes with a multiindex #40849
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

pep8speaks · 2021-04-26T08:18:49Z

Hello @taytzehao! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-26 13:39:41 UTC


attack68 · 2021-04-26T11:31:09Z

Consider six indexes, each with a different combination of level names:

from pandas import MultiIndex

vals = [['i0', 'i1'], ['j0', 'j1']]
idx0 = MultiIndex.from_product(vals, names=['First', 'Second'])
idx1 = MultiIndex.from_product(vals, names=['First', 'Third'])
idx2 = MultiIndex.from_product(vals, names=['Second', 'Third'])
idx3 = MultiIndex.from_product(vals, names=[None, 'First'])
idx4 = MultiIndex.from_product(vals, names=['Second', None])
idx5 = MultiIndex.from_product(vals, names=[None, None])

It is not obvious to me what your PR is going to do when it concats, say, idx1 with idx3, given that in some cases it will find a name that matches, in some cases it won't, and in some cases there is no name at all.

This needs tests for the various cases, I think.

This might need complex logic, but as a start it might be sensible to match index names only when they overlap exactly:

e.g. {'name1', 'name2'} == {'name2', 'name1'}, and if they don't overlap exactly it falls back to the original behavior?
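The exact-overlap rule suggested above could look roughly like the following. This is only a sketch: `append_matching_names` is a hypothetical helper, not code from this PR, and it relies on the existing `MultiIndex.reorder_levels` and `MultiIndex.append` methods.

```python
import pandas as pd


def append_matching_names(left, right):
    """Hypothetical helper: reorder `right` by level name only when the
    two sets of names overlap exactly; otherwise append positionally."""
    names_l, names_r = list(left.names), list(right.names)
    exact_overlap = (
        None not in names_l
        and None not in names_r
        and len(names_l) == len(names_r)
        and len(set(names_l)) == len(names_l)  # no duplicate names
        and set(names_l) == set(names_r)
    )
    if exact_overlap:
        # Same names in a different order: line the levels up by name.
        right = right.reorder_levels(names_l)
    # Otherwise fall back to the existing positional behavior.
    return left.append(right)


vals = [["i0", "i1"], ["j0", "j1"]]
left = pd.MultiIndex.from_product(vals, names=["First", "Second"])
right = pd.MultiIndex.from_product(vals[::-1], names=["Second", "First"])
result = append_matching_names(left, right)
```

Here `right` carries the same two names in the opposite order, so its levels are reordered before appending and every value under "First" comes from the `i*` labels.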


taytzehao · 2021-05-11T13:48:33Z

The feedback in #41162 (comment) has been taken into consideration in this commit.

attack68 · 2021-05-15T09:38:29Z

The feedback in #41162 (comment) has been taken into consideration in this commit.

Can you describe qualitatively how you are handling the cases? What is the goal for the different cases?

attack68 · 2021-05-15T09:39:42Z

Also, have you run any performance checks (asv) to ensure there isn't any performance degradation here? Is any expected?

taytzehao · 2021-05-20T07:28:09Z

pandas/core/indexes/multi.py

+            if self.names.count(None) > 1 or any(
+                o.names.count(None) > 1 for o in other
+            ):
+


This if statement checks whether self or any of the "other" indexes has more than one None level (a level that doesn't have a name). If any of them has more than one unnamed level, the old method is used.

This is because when there are multiple None levels in any of the indexes, it is hard to decide which None level the unnamed levels of the other indexes should be placed into. For example:

import pandas as pd

df1 = pd.DataFrame(index=pd.MultiIndex.from_tuples([(3, 6)], names=[None, None]))
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([(9, 7)], names=["Light", None]))
pd.concat([df1, df2])

Should the index be

Light  None  None
NA     3     6
9      7     NA

or

Light  None  None
NA     3     6
9      NA    7

Since there is no way of deciding this, I have decided to fall back to the old method in these cases.
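The guard described above can be illustrated in isolation. This is a minimal sketch; `has_ambiguous_unnamed_levels` is a hypothetical name chosen for illustration and is not part of the PR or of pandas.

```python
import pandas as pd


def has_ambiguous_unnamed_levels(indexes):
    """Return True if any index has more than one level named None."""
    return any(list(mi.names).count(None) > 1 for mi in indexes)


mi_a = pd.MultiIndex.from_tuples([(3, 6)], names=[None, None])
mi_b = pd.MultiIndex.from_tuples([(9, 7)], names=["Light", None])

ambiguous = has_ambiguous_unnamed_levels([mi_a, mi_b])  # mi_a has two None names
unambiguous = has_ambiguous_unnamed_levels([mi_b])      # mi_b has only one
```

A single unnamed level can still be matched against another single unnamed level, which is why the threshold is "more than one".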

taytzehao · 2021-05-20T07:29:34Z

pandas/core/indexes/multi.py

+
+            else:
+                index_label_list = self.get_unique_indexes(other)
+


In this new method, a list of the unique level names from self and other is obtained first. For example:

import pandas as pd

mi1 = pd.MultiIndex.from_arrays([[3], [6]], names=["Stationary", "Food"])
mi2 = pd.MultiIndex.from_arrays([[9], [6]], names=["Light", "Food"])
mi1.get_unique_indexes([mi2])  # get_unique_indexes is the helper added in this PR

would return

["Stationary", "Food", "Light"]
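One way to collect the names in first-seen order is sketched below. The function name `unique_level_names` is hypothetical, and the ordering rule (names of self first, then new names from each other index) is an assumption inferred from the example output above rather than from the PR's implementation.

```python
import pandas as pd


def unique_level_names(*indexes):
    """Collect level names across indexes, keeping first-seen order."""
    seen = []
    for mi in indexes:
        for name in mi.names:
            if name not in seen:
                seen.append(name)
    return seen


mi1 = pd.MultiIndex.from_arrays([[3], [6]], names=["Stationary", "Food"])
mi2 = pd.MultiIndex.from_arrays([[9], [6]], names=["Light", "Food"])
names = unique_level_names(mi1, mi2)
```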

taytzehao · 2021-05-20T07:37:45Z

pandas/core/indexes/multi.py

+                        data_index=self, column_name=index_label, other=other
+                    )
+                    appended = []
+


This obtains the index data from the self DataFrame.

search_self is False here because self is the object being searched.

taytzehao · 2021-05-20T07:37:58Z

pandas/core/indexes/multi.py

+                            search_self=True,
+                        )
+                        appended.append(data)
+


Obtain the index data from the other dataframe.

taytzehao · 2021-05-20T08:12:00Z

pandas/core/indexes/multi.py

+                    NA_type = o.levels[Index_position].dtype
+                    data = Index([NA] * data_index.size, dtype=NA_type)
+                    return data
+


This function returns the data held by the requested index level. If the MultiIndex does not contain that level, it returns a pd.Index filled with pd.NA. For example,

import pandas as pd

mi0 = pd.MultiIndex.from_arrays([[0, 1], [2, 3]], names=["Stationary", "Tree"])
mi1 = pd.MultiIndex.from_arrays([[4, 5], [6, 7]], names=["Food", "Tree"])
mi2 = pd.MultiIndex.from_arrays([[8, 9], [10, 11]], names=["Stationary", "Tree"])
self = mi0
other = [mi1, mi2]
self.get_index_data(data_index=mi2, column_name="Food", other=other, search_self=True)
self.get_index_data(data_index=self, column_name="Stationary", other=other, search_self=False)

would yield

pd.Index([pd.NA, pd.NA]) and pd.Index([0, 1])
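The lookup-or-fill behavior described above can be approximated with public pandas calls. This is a simplified sketch: `level_values_or_na` is a hypothetical stand-in for the PR's `get_index_data`, and it uses `object` dtype for the filler, whereas the PR borrows the dtype from the index that actually has the level.

```python
import pandas as pd


def level_values_or_na(mi, name):
    """Return the values of level `name`, or an all-NA Index if absent."""
    if name in mi.names:
        return mi.get_level_values(name)
    # Level is missing from this index: one NA per row.
    return pd.Index([pd.NA] * len(mi), dtype=object, name=name)


mi0 = pd.MultiIndex.from_arrays([[0, 1], [2, 3]], names=["Stationary", "Tree"])
mi2 = pd.MultiIndex.from_arrays([[8, 9], [10, 11]], names=["Stationary", "Tree"])

present = level_values_or_na(mi0, "Stationary")
missing = level_values_or_na(mi2, "Food")
```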

taytzehao · 2021-05-20T08:13:18Z

pandas/core/indexes/multi.py

+            Index_position = data_index.names.index(column_name)
+            data = data_index._get_level_values(Index_position)
+            return data
+


If the MultiIndex being checked has the level name, it just returns the original data that is in the MultiIndex.


jreback

let's back up a second. you are simply trying to align these indices right? we already have .align functions for this. can you simply call those?

jreback · 2021-05-21T16:22:52Z

pandas/core/indexes/multi.py

@@ -2144,7 +2145,7 @@ def take(
            levels=self.levels, codes=taken, names=self.names, verify_integrity=False
        )

-    def append(self, other):
+    def append(self, other, concat_indexes=False):


-1 on adding any keywords, this should just work

jreback · 2021-05-21T16:23:22Z

pandas/core/indexes/multi.py

+
+                for i in range(self.nlevels):
+
+                    label = self._get_level_values(i)


pls do not write loops like this.

this could be a function itself. this is very hard to follow

jreback · 2021-05-21T16:43:36Z

Agreed with @attack68's comment above. What is the meaning of doing this type of concat? I would prefer to raise in any case that is not completely obvious and unambiguous.
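A stricter variant along these lines might raise instead of guessing. The sketch below is an assumption about what "raise in any ambiguous case" could mean; `append_or_raise` is a hypothetical helper and the error message is illustrative only.

```python
import pandas as pd


def append_or_raise(left, right):
    """Append by level name, raising when the mapping is not clear-cut."""
    names_l, names_r = list(left.names), list(right.names)
    if names_l == names_r:
        return left.append(right)  # identical layout, nothing to align
    unambiguous = (
        None not in names_l + names_r
        and len(names_l) == len(names_r)
        and len(set(names_l)) == len(names_l)
        and set(names_l) == set(names_r)
    )
    if not unambiguous:
        raise ValueError(f"cannot align levels {names_l} with {names_r} by name")
    return left.append(right.reorder_levels(names_l))


vals = [["i0", "i1"], ["j0", "j1"]]
idx0 = pd.MultiIndex.from_product(vals, names=["First", "Second"])
idx1 = pd.MultiIndex.from_product(vals, names=["First", "Third"])
swapped = idx0.swaplevel(0, 1)  # names become ["Second", "First"]

combined = append_or_raise(idx0, swapped)
```

Calling `append_or_raise(idx0, idx1)` raises `ValueError`, because "Second" and "Third" cannot be matched by name.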

taytzehao · 2021-05-21T17:04:51Z

Ah .. never thought of that. Thanks.. will look into that

taytzehao · 2021-05-23T13:20:21Z

@jreback, it seems the align function can only be used on the NDFrame class. It relies on the BlockManager for alignment.

Otherwise, should an align function be implemented in the MultiIndex class?
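For context, this is how alignment works at the DataFrame level with the existing `DataFrame.align` method. The example uses a flat index and made-up data purely to show the call; it does not demonstrate how align treats MultiIndex levels with different names.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0]}, index=[0, 1])
df2 = pd.DataFrame({"a": [3.0, 4.0]}, index=[1, 2])

# Outer-join the row labels so both frames share the same index.
left, right = df1.align(df2, join="outer", axis=0)
```

After the call, both `left` and `right` are indexed by `[0, 1, 2]`, with missing rows filled with NaN.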


taytzehao · 2021-05-26T15:11:23Z

I am actually not quite sure why the actions-38 test fails. When I run it locally, the test passes. Can anybody give me some suggestions?

github-actions · 2021-06-26T00:02:54Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

mroeschke · 2021-08-17T00:48:50Z

Thanks for the PR, but it appears that this PR has gone stale and needs ideally a simpler solution as explained in #41162 (review) (or why a simpler solution does not exist).

Going to close due to inactivity, but let us know if you'd like to continue working on this issue.

Update pandas concat multiindex

f8cf24e

taytzehao added 2 commits April 26, 2021 16:21

Update pandas concat multiindex 2

e5d2d06

Update pandas concat multiindex 3

2387a62

attack68 reviewed Apr 26, 2021 (pandas/tests/arrays/sparse/test_array.py, outdated)

attack68 reviewed Apr 26, 2021 (pandas/core/indexes/multi.py, outdated)

jreback requested changes Apr 26, 2021 (doc/source/whatsnew/v1.3.0.rst and pandas/core/indexes/multi.py, outdated)

jreback added MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 26, 2021

taytzehao added 7 commits May 9, 2021 09:43

Resilved test_sparse_array conflict

7a3cfa4

Resolved test_sparse_array conflict

b034486

Added multiindex concatenation with different column

0637c0f

Merge branch 'master' of github.com:pandas-dev/pandas into concat_multiindex_different_arrangement

524b8f5

Added NA library

d09ddbe

Resolved bug for multiple None column

6056f87

Merge branch 'master' into concat_multiindex_different_arrangement

0772202

taytzehao marked this pull request as draft May 11, 2021 03:06

taytzehao added 2 commits May 11, 2021 20:51

Include pull

85c1007

Resolved origin conflict

cc22775

taytzehao marked this pull request as ready for review May 11, 2021 13:46


Merge branch 'master' of github.com:pandas-dev/pandas into concat_multiindex_different_arrangement

f753066

jreback requested changes May 21, 2021


taytzehao added 9 commits May 24, 2021 01:26

Addressed comments

a881d17

Addressed comments

6a03335

Removed single index test

dc3e10e

Removed single index test

8ee9865

Merge branch 'master' of github.com:pandas-dev/pandas into concat_multiindex_different_arrangement

91697d0

Delete conftest

a714d0f

Merge branch 'master' of github.com:pandas-dev/pandas into concat_multiindex_different_arrangement

d7f0592

Resolve CI error

77f53ae

Resolve CI issue

fbfbb9b

simonjayhawkins changed the title ~~Update pandas concat multiindex~~ BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex May 25, 2021

simonjayhawkins added this to the 1.3 milestone May 25, 2021

attack68 mentioned this pull request May 25, 2021

BUG: pd.concat on MultiIndex joins named levels by order rather than by name #41660

Closed

jreback removed this from the 1.3 milestone May 26, 2021

taytzehao added 3 commits May 26, 2021 12:10

Merge branch 'pandas-dev:master' into concat_multiindex_different_arrangement

d5f010a

Merge branch 'master' of github.com:pandas-dev/pandas into concat_multiindex_different_arrangement

f82794e

Merge branch 'concat_multiindex_different_arrangement' of https://github.com/taytzehao/pandas into concat_multiindex_different_arrangement

312bd44

taytzehao force-pushed the concat_multiindex_different_arrangement branch 2 times, most recently from 5394be3 to 312bd44 Compare May 26, 2021 13:39

taytzehao mentioned this pull request May 28, 2021

Min max sparse fillna #41691

Merged

4 tasks

github-actions bot added the Stale label Jun 26, 2021

mroeschke closed this Aug 17, 2021

