BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex #41162
Changes from 12 commits
f8cf24e
e5d2d06
2387a62
7a3cfa4
b034486
0637c0f
524b8f5
d09ddbe
6056f87
0772202
85c1007
cc22775
f753066
a881d17
6a03335
dc3e10e
8ee9865
91697d0
a714d0f
d7f0592
77f53ae
fbfbb9b
d5f010a
f82794e
312bd44
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,7 @@ | |
lib, | ||
) | ||
from pandas._libs.hashtable import duplicated | ||
from pandas._libs.missing import NA | ||
from pandas._typing import ( | ||
AnyArrayLike, | ||
DtypeObj, | ||
|
@@ -2144,7 +2145,7 @@ def take( | |
levels=self.levels, codes=taken, names=self.names, verify_integrity=False | ||
) | ||
|
||
def append(self, other): | ||
def append(self, other, concat_indexes=False): | ||
""" | ||
Append a collection of Index options together | ||
|
||
|
@@ -2163,11 +2164,41 @@ def append(self, other): | |
(isinstance(o, MultiIndex) and o.nlevels >= self.nlevels) for o in other | ||
): | ||
arrays = [] | ||
for i in range(self.nlevels): | ||
label = self._get_level_values(i) | ||
appended = [o._get_level_values(i) for o in other] | ||
arrays.append(label.append(appended)) | ||
return MultiIndex.from_arrays(arrays, names=self.names) | ||
if self.names.count(None) > 1 or any( | ||
o.names.count(None) > 1 for o in other | ||
): | ||
|
||
This if statement checks whether self or any of the `other` indexes has more than one unnamed (None) level. If any of them does, the old method is used. When an index has several None levels, there is no way to decide which of them a None level from another index should be matched with, so I decided to fall back to the old method in that case. |
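To make the fallback condition concrete, here is a minimal sketch of the same check written against plain lists of level names rather than real `MultiIndex` objects. The function name `has_ambiguous_unnamed_levels` is hypothetical and is not part of the PR or of pandas; it only mirrors the `names.count(None) > 1` test in the diff.

```python
def has_ambiguous_unnamed_levels(all_names):
    """Return True if any sequence of level names holds more than one None.

    `all_names` is an iterable of level-name sequences, one per index
    (self first, then each of the others). Hypothetical helper, for
    illustration only.
    """
    return any(list(names).count(None) > 1 for names in all_names)


# One unnamed level per index: the unnamed levels can only pair with each other.
unambiguous = has_ambiguous_unnamed_levels([["a", None], [None, "b"]])

# Two unnamed levels in one index: matching by name cannot tell them apart,
# so the diff falls back to the positional (old) behaviour.
ambiguous = has_ambiguous_unnamed_levels([[None, None], ["a", None]])
```

Here `unambiguous` is `False` and `ambiguous` is `True`.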
||
for i in range(self.nlevels): | ||
|
||
label = self._get_level_values(i) | ||
Please do not write loops like this. This could be a function of its own; as written it is very hard to follow. |
||
appended = [o._get_level_values(i) for o in other] | ||
arrays.append(label.append(appended)) | ||
index_label_list = self.names | ||
|
||
else: | ||
index_label_list = self.get_unique_indexes(other) | ||
|
||
In the new method, the list of unique level names across self and other is obtained first. Names from self keep their original order, and any name that appears only in one of the other indexes is appended in the order it is first seen.
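The name collection described in this comment amounts to an order-preserving union. A minimal sketch on plain lists follows; `union_of_level_names` is a hypothetical name used for illustration, not the PR's `get_unique_indexes` itself.

```python
def union_of_level_names(self_names, other_names):
    """Order-preserving union of level names.

    `self_names` is the list of names of the calling index, and
    `other_names` is a list of name lists, one per appended index.
    Hypothetical helper, for illustration only.
    """
    union = list(self_names)
    for names in other_names:
        for name in names:
            # Only names not seen so far extend the result.
            if name not in union:
                union.append(name)
    return union


combined = union_of_level_names(["a", "b"], [["b", "c"], ["d", "a"]])
```

With these inputs `combined` is `["a", "b", "c", "d"]`: `"b"` and `"a"` are already present, so only `"c"` and `"d"` are added.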
|
||
for index_label in index_label_list: | ||
|
||
index = self.get_index_data( | ||
data_index=self, column_name=index_label, other=other | ||
) | ||
appended = [] | ||
|
||
Obtain the index data from the self DataFrame. `search_self` is False here because self is the index being searched. |
||
for o in other: | ||
|
||
data = self.get_index_data( | ||
data_index=o, | ||
column_name=index_label, | ||
other=other, | ||
search_self=True, | ||
) | ||
appended.append(data) | ||
|
||
Obtain the index data from the other DataFrame. |
||
index = index.append(data) | ||
|
||
arrays.append(index) | ||
return MultiIndex.from_arrays(arrays, names=index_label_list) | ||
|
||
to_concat = (self._values,) + tuple(k._values for k in other) | ||
new_tuples = np.concatenate(to_concat) | ||
|
@@ -2178,6 +2209,47 @@ def append(self, other): | |
except (TypeError, IndexError): | ||
return Index(new_tuples) | ||
|
||
def get_index_data(self, data_index, column_name, other, search_self=False): | ||
|
||
# Returns original data if the data_index input has data for this column name | ||
if column_name in data_index.names: | ||
Index_position = data_index.names.index(column_name) | ||
data = data_index._get_level_values(Index_position) | ||
return data | ||
|
||
If the MultiIndex being checked has the level name, the original data in that MultiIndex is returned as-is. |
||
else: | ||
|
||
# If the data_index input is from other and if it don't | ||
# have the column name, it returns an Index filled with pd.NA | ||
# with data type that the other dataframe has the column. | ||
if search_self is True: | ||
if column_name in self.names: | ||
Index_position = self.names.index(column_name) | ||
NA_type = self.levels[Index_position].dtype | ||
data = Index([NA] * data_index.size, dtype=NA_type) | ||
return data | ||
|
||
for o in other: | ||
if o is not data_index and column_name in o.names: | ||
Index_position = o.names.index(column_name) | ||
NA_type = o.levels[Index_position].dtype | ||
data = Index([NA] * data_index.size, dtype=NA_type) | ||
return data | ||
|
||
This function returns the data that the index holds for a given level name. If the MultiIndex does not contain that level, it returns an Index of the same length filled with pd.NA, using the dtype of that level in whichever index does have it.
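The fill-in behaviour can be sketched without pandas by modelling each index as a mapping from level name to its values. Everything below is an assumption made for illustration: `level_values_or_fill` and `append_by_name` are hypothetical names, and `None` stands in for `pd.NA` (the dtype handling in the diff is not reproduced).

```python
def level_values_or_fill(table, name, size, fill=None):
    """Return the values stored under `name`, or `size` copies of `fill`.

    `table` maps level name -> list of values and stands in for one
    MultiIndex; `fill` stands in for pd.NA. Hypothetical helper.
    """
    if name in table:
        return list(table[name])
    return [fill] * size


def append_by_name(tables, sizes, names):
    """Concatenate each named level across all tables, padding missing levels."""
    return {
        name: [
            value
            for table, size in zip(tables, sizes)
            for value in level_values_or_fill(table, name, size)
        ]
        for name in names
    }


left = {"a": [1, 2], "b": ["x", "y"]}   # two rows, levels "a" and "b"
right = {"b": ["z"], "c": [True]}       # one row, levels "b" and "c"
aligned = append_by_name([left, right], sizes=[2, 1], names=["a", "b", "c"])
```

In this sketch `aligned["a"]` is `[1, 2, None]`, `aligned["b"]` is `["x", "y", "z"]`, and `aligned["c"]` is `[None, None, True]`: values are matched by level name, and rows from an index lacking a level are padded with the fill value.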
|
||
def get_unique_indexes(self, other): | ||
|
||
Union_list = list(self.names) | ||
|
||
for o in other: | ||
if not set(o.names).issubset(Union_list): | ||
|
||
for element in o.names: | ||
if element not in Union_list: | ||
|
||
Union_list.append(element) | ||
|
||
return Union_list | ||
|
||
def argsort(self, *args, **kwargs) -> np.ndarray: | ||
return self._values.argsort(*args, **kwargs) | ||
|
||
|
-1 on adding any keywords, this should just work