Skip to content

BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex #41162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

taytzehao
Copy link
Contributor

@taytzehao taytzehao commented Apr 26, 2021

@pep8speaks
Copy link

pep8speaks commented Apr 26, 2021

Hello @taytzehao! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-26 13:39:41 UTC

@attack68
Copy link
Contributor

if you consider six indexes each with the combinations:

vals = [['i0', 'i1'], ['j0', 'j1']]
idx0 =  MultiIndex.from_product(vals, names=['First', 'Second'])
idx1 =  MultiIndex.from_product(vals, names=['First', 'Third'])
idx2 =  MultiIndex.from_product(vals, names=['Second', 'Third'])
idx3 =  MultiIndex.from_product(vals, names=[None, 'First'])
idx4 =  MultiIndex.from_product(vals, names=['Second', None])
idx5 =  MultiIndex.from_product(vals, names=[None, None])

It is not obvious to me what your PR is going to do do when it concats say: idx1 with idx3, given that in some cases it will find a name that matches, and in some cases it won't, and in some case there isn't a name.

This needs tests for various cases, I think.

This might need complex logic but as a start it might be sensible to match index names only in the case they exactly overlap:

e.g {'name1', 'name2'} == {'name2', 'name1'}, and if it doesn't overlap exactly it falls back to the original?

@jreback jreback added MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 26, 2021
@taytzehao taytzehao marked this pull request as draft May 11, 2021 03:06
@taytzehao taytzehao marked this pull request as ready for review May 11, 2021 13:46
@taytzehao
Copy link
Contributor Author

#41162 (comment) comment has been taken into consideration into this commit.

@attack68
Copy link
Contributor

#41162 (comment) comment has been taken into consideration into this commit.

can you describe qualitatively how you are handling the cases. what is the goal for different cases?

@attack68
Copy link
Contributor

also have you run any performance checks (asv) to ensure there isn't any performance degradation here? is any expected?

if self.names.count(None) > 1 or any(
o.names.count(None) > 1 for o in other
):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if statement checks if there are more than 2 None columns (Columns that don't have name) in the self and each "other" Dataframe. If there are any dataframe that have more than one column that do not have a name, the old method would be used.

This is because when there are multiple "None" multiindex column in any of the dataframe, it would be hard for the other dataframe that have any None to decide which None column to be placed into. For example:

DataFrame1=pd.Dataframe(Index=[3,6], Index_column_name=[None,None])
DataFrame2=pd.Dataframe(Index=[9,7], Index_column_name=["Light",None])
pd.concat(DataFrame1,DataFrame2)

Should the index be

Light None None
  NA   3    6
   9   7    NA

or

Light None None
  NA    3     6
   9    NA   7

Since there is no way of deciding this, I have decided to make them use the old method


else:
index_label_list = self.get_unique_indexes(other)

Copy link
Contributor Author

@taytzehao taytzehao May 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this new method, a list of unique column name indexes from self and other are firstly obtained. For example:

DataFrame1=MultiIndex(Index(data=[3], name="Stationary"),Index(data=[6], name="Food"))
DataFrame2=MultiIndex(Index(data=[9], name="Light"),Index(data=[6], name="Food"))
get_unique_indexes(DataFrame1, DataFrame2) 

would return

[ "Stationary","Food", "Light"]

data_index=self, column_name=index_label, other=other
)
appended = []

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obtain the index data from the self dataframe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search self is false here as the "self" Dataframe is searched being searched through here.

search_self=True,
)
appended.append(data)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obtain the index data from the other dataframe.

NA_type = o.levels[Index_position].dtype
data = Index([NA] * data_index.size, dtype=NA_type)
return data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire function is to return the data that the index have. if the MultiIndex does not contain the column, it would return a pd.Index filled with pd.NA. Example,

MultiIndex0 = pd.MultiIndex( Index(data=[ 0 , 1 ], name="Stationary"), Index(data=[ 2 , 3 ], name="Tree"))
MultiIndex1 = pd.MultiIndex( Index(data=[ 4 , 5 ], name="Food"), Index(data=[ 6 , 7 ], name="Tree"))
MultiIndex2 =  pd.MultiIndex( Index(data=[ 8 , 9 ], name="Stationary"), Index(data=[ 10 , 11 ], name="Tree"))
self = DataFrame0
other = [DataFrame1, DataFrame2]

self. get_index_data( data_index=MultiIndex2,
                            column_name= "Food",
                            other=other,
                            search_self=True ) 

self. get_index_data( data_index=self,
                            column_name= "Stationary",
                            other=other,
                            search_self=False ) 

would yield

pd.Index([ pd.NA, pd.NA ])

and 

pd.Index([ 0 , 1 ])

Index_position = data_index.names.index(column_name)
data = data_index._get_level_values(Index_position)
return data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the MultIndex being checked has the column name, it would just return the original data that is in the MultiIndex

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's back up a second. you are simply trying to align these indices right? we already have .align functions for this. can you simply call those?

@@ -2144,7 +2145,7 @@ def take(
levels=self.levels, codes=taken, names=self.names, verify_integrity=False
)

def append(self, other):
def append(self, other, concat_indexes=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on adding any keywords, this should just work


for i in range(self.nlevels):

label = self._get_level_values(i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls do not write loops like this.

this could be a function itself. this is very hard to follow

@jreback
Copy link
Contributor

jreback commented May 21, 2021

if you consider six indexes each with the combinations:

vals = [['i0', 'i1'], ['j0', 'j1']]
idx0 =  MultiIndex.from_product(vals, names=['First', 'Second'])
idx1 =  MultiIndex.from_product(vals, names=['First', 'Third'])
idx2 =  MultiIndex.from_product(vals, names=['Second', 'Third'])
idx3 =  MultiIndex.from_product(vals, names=[None, 'First'])
idx4 =  MultiIndex.from_product(vals, names=['Second', None])
idx5 =  MultiIndex.from_product(vals, names=[None, None])

It is not obvious to me what your PR is going to do do when it concats say: idx1 with idx3, given that in some cases it will find a name that matches, and in some cases it won't, and in some case there isn't a name.

This needs tests for various cases, I think.

This might need complex logic but as a start it might be sensible to match index names only in the case they exactly overlap:

e.g {'name1', 'name2'} == {'name2', 'name1'}, and if it doesn't overlap exactly it falls back to the original?

agreed what is the meaning of all of doing this tyep of concat. I would prefer to raise in any non-completely obvious / non-ambiguous case.

@taytzehao
Copy link
Contributor Author

Ah .. never thought of that. Thanks.. will look into that

@taytzehao
Copy link
Contributor Author

taytzehao commented May 23, 2021

@jreback , it seems the align function can only be used on NDFrame class. It utilizes the blockmanager for alignment.

Unless, should the align function be implemented in the MultiIndex class?

@simonjayhawkins simonjayhawkins changed the title Update pandas concat multiindex BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex May 25, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone May 25, 2021
@jreback jreback removed this from the 1.3 milestone May 26, 2021
@taytzehao taytzehao force-pushed the concat_multiindex_different_arrangement branch 2 times, most recently from 5394be3 to 312bd44 Compare May 26, 2021 13:39
@taytzehao
Copy link
Contributor Author

I am actually not quite sure why the test actions-38 fail. When I test it locally, the test passes. Can anybody give me some suggestion?

@taytzehao taytzehao mentioned this pull request May 28, 2021
4 tasks
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 26, 2021
@mroeschke
Copy link
Member

Thanks for the PR, but it appears that this PR has gone stale and needs ideally a simpler solution as explained in #41162 (review) (or why a simpler solution does not exist).

Going to close due to inactivity, but let us know if you'd like to continue working on this issue.

@mroeschke mroeschke closed this Aug 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas concat does not match index names when concatenating two dataframes with a multiindex
6 participants