-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex #41162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex #41162
Conversation
Hello @taytzehao! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-05-26 13:39:41 UTC |
if you consider six indexes each with the combinations:
It is not obvious to me what your PR is going to do do when it concats say: This needs tests for various cases, I think. This might need complex logic but as a start it might be sensible to match index names only in the case they exactly overlap: e.g {'name1', 'name2'} == {'name2', 'name1'}, and if it doesn't overlap exactly it falls back to the original? |
…tiindex_different_arrangement
#41162 (comment) comment has been taken into consideration into this commit. |
can you describe qualitatively how you are handling the cases. what is the goal for different cases? |
also have you run any performance checks (asv) to ensure there isn't any performance degradation here? is any expected? |
if self.names.count(None) > 1 or any( | ||
o.names.count(None) > 1 for o in other | ||
): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if statement checks if there are more than 2 None columns (Columns that don't have name) in the self and each "other" Dataframe. If there are any dataframe that have more than one column that do not have a name, the old method would be used.
This is because when there are multiple "None" multiindex column in any of the dataframe, it would be hard for the other dataframe that have any None to decide which None column to be placed into. For example:
DataFrame1=pd.Dataframe(Index=[3,6], Index_column_name=[None,None])
DataFrame2=pd.Dataframe(Index=[9,7], Index_column_name=["Light",None])
pd.concat(DataFrame1,DataFrame2)
Should the index be
Light None None
NA 3 6
9 7 NA
or
Light None None
NA 3 6
9 NA 7
Since there is no way of deciding this, I have decided to make them use the old method
|
||
else: | ||
index_label_list = self.get_unique_indexes(other) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this new method, a list of unique column name indexes from self and other are firstly obtained. For example:
DataFrame1=MultiIndex(Index(data=[3], name="Stationary"),Index(data=[6], name="Food"))
DataFrame2=MultiIndex(Index(data=[9], name="Light"),Index(data=[6], name="Food"))
get_unique_indexes(DataFrame1, DataFrame2)
would return
[ "Stationary","Food", "Light"]
data_index=self, column_name=index_label, other=other | ||
) | ||
appended = [] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obtain the index data from the self dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Search self is false here as the "self" Dataframe is searched being searched through here.
search_self=True, | ||
) | ||
appended.append(data) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obtain the index data from the other dataframe.
NA_type = o.levels[Index_position].dtype | ||
data = Index([NA] * data_index.size, dtype=NA_type) | ||
return data | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire function is to return the data that the index have. if the MultiIndex does not contain the column, it would return a pd.Index
filled with pd.NA
. Example,
MultiIndex0 = pd.MultiIndex( Index(data=[ 0 , 1 ], name="Stationary"), Index(data=[ 2 , 3 ], name="Tree"))
MultiIndex1 = pd.MultiIndex( Index(data=[ 4 , 5 ], name="Food"), Index(data=[ 6 , 7 ], name="Tree"))
MultiIndex2 = pd.MultiIndex( Index(data=[ 8 , 9 ], name="Stationary"), Index(data=[ 10 , 11 ], name="Tree"))
self = DataFrame0
other = [DataFrame1, DataFrame2]
self. get_index_data( data_index=MultiIndex2,
column_name= "Food",
other=other,
search_self=True )
self. get_index_data( data_index=self,
column_name= "Stationary",
other=other,
search_self=False )
would yield
pd.Index([ pd.NA, pd.NA ])
and
pd.Index([ 0 , 1 ])
Index_position = data_index.names.index(column_name) | ||
data = data_index._get_level_values(Index_position) | ||
return data | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the MultIndex being checked has the column name, it would just return the original data that is in the MultiIndex
…tiindex_different_arrangement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's back up a second. you are simply trying to align these indices right? we already have .align functions for this. can you simply call those?
pandas/core/indexes/multi.py
Outdated
@@ -2144,7 +2145,7 @@ def take( | |||
levels=self.levels, codes=taken, names=self.names, verify_integrity=False | |||
) | |||
|
|||
def append(self, other): | |||
def append(self, other, concat_indexes=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 on adding any keywords, this should just work
pandas/core/indexes/multi.py
Outdated
|
||
for i in range(self.nlevels): | ||
|
||
label = self._get_level_values(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls do not write loops like this.
this could be a function itself. this is very hard to follow
agreed what is the meaning of all of doing this tyep of concat. I would prefer to raise in any non-completely obvious / non-ambiguous case. |
Ah .. never thought of that. Thanks.. will look into that |
@jreback , it seems the align function can only be used on Unless, should the align function be implemented in the MultiIndex class? |
…tiindex_different_arrangement
…tiindex_different_arrangement
…tiindex_different_arrangement
…hub.com/taytzehao/pandas into concat_multiindex_different_arrangement
5394be3
to
312bd44
Compare
I am actually not quite sure why the test actions-38 fail. When I test it locally, the test passes. Can anybody give me some suggestion? |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
Thanks for the PR, but it appears that this PR has gone stale and needs ideally a simpler solution as explained in #41162 (review) (or why a simpler solution does not exist). Going to close due to inactivity, but let us know if you'd like to continue working on this issue. |
Uh oh!
There was an error while loading. Please reload this page.