BUG: pandas concat does not match index names when concatenating two dataframes with a multiindex #41162

Closed
Changes from 12 commits
25 commits
f8cf24e
Update pandas concat multiindex
taytzehao Apr 26, 2021
e5d2d06
Update pandas concat multiindex 2
taytzehao Apr 26, 2021
2387a62
Update pandas concat multiindex 3
taytzehao Apr 26, 2021
7a3cfa4
Resilved test_sparse_array conflict
taytzehao May 9, 2021
b034486
Resolved test_sparse_array conflict
taytzehao May 9, 2021
0637c0f
Added multiindex concatenation with different column
taytzehao May 10, 2021
524b8f5
Merge branch 'master' of github.com:pandas-dev/pandas into concat_mul…
taytzehao May 10, 2021
d09ddbe
Added NA library
taytzehao May 10, 2021
6056f87
Resolved bug for multiple None column
taytzehao May 11, 2021
0772202
Merge branch 'master' into concat_multiindex_different_arrangement
taytzehao May 11, 2021
85c1007
Include pull
taytzehao May 11, 2021
cc22775
Resolved origin conflict
taytzehao May 11, 2021
f753066
Merge branch 'master' of github.com:pandas-dev/pandas into concat_mul…
taytzehao May 20, 2021
a881d17
Addressed comments
taytzehao May 23, 2021
6a03335
Addressed comments
taytzehao May 23, 2021
dc3e10e
Removed single index test
taytzehao May 24, 2021
8ee9865
Removed single index test
taytzehao May 24, 2021
91697d0
Merge branch 'master' of github.com:pandas-dev/pandas into concat_mul…
taytzehao May 24, 2021
a714d0f
Delete conftest
taytzehao May 24, 2021
d7f0592
Merge branch 'master' of github.com:pandas-dev/pandas into concat_mul…
taytzehao May 24, 2021
77f53ae
Resolve CI error
taytzehao May 24, 2021
fbfbb9b
Resolve CI issue
taytzehao May 24, 2021
d5f010a
Merge branch 'pandas-dev:master' into concat_multiindex_different_arr…
taytzehao May 26, 2021
f82794e
Merge branch 'master' of github.com:pandas-dev/pandas into concat_mul…
taytzehao May 26, 2021
312bd44
Merge branch 'concat_multiindex_different_arrangement' of https://git…
taytzehao May 26, 2021
3 changes: 3 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -809,6 +809,7 @@ MultiIndex
- Bug in :meth:`MultiIndex.equals` incorrectly returning ``True`` when :class:`MultiIndex` containing ``NaN`` even when they are differently ordered (:issue:`38439`)
- Bug in :meth:`MultiIndex.intersection` always returning empty when intersecting with :class:`CategoricalIndex` (:issue:`38653`)


I/O
^^^

@@ -916,6 +917,8 @@ Reshaping
- Bug in :func:`to_datetime` raising error when input sequence contains unhashable items (:issue:`39756`)
- Bug in :meth:`Series.explode` preserving index when ``ignore_index`` was ``True`` and values were scalars (:issue:`40487`)
- Bug in :func:`to_datetime` raising ``ValueError`` when :class:`Series` contains ``None`` and ``NaT`` and has more than 50 elements (:issue:`39882`)
- Bug in :meth:`DataFrame.concat` does not match index names when concatenating two dataframes with a multiindex (:issue:`40849`)


Sparse
^^^^^^
84 changes: 78 additions & 6 deletions pandas/core/indexes/multi.py
@@ -26,6 +26,7 @@
lib,
)
from pandas._libs.hashtable import duplicated
from pandas._libs.missing import NA
from pandas._typing import (
AnyArrayLike,
DtypeObj,
@@ -2144,7 +2145,7 @@ def take(
levels=self.levels, codes=taken, names=self.names, verify_integrity=False
)

def append(self, other):
def append(self, other, concat_indexes=False):
Contributor

-1 on adding any keywords, this should just work

"""
Append a collection of Index options together

@@ -2163,11 +2164,41 @@ def append(self, other):
(isinstance(o, MultiIndex) and o.nlevels >= self.nlevels) for o in other
):
arrays = []
for i in range(self.nlevels):
label = self._get_level_values(i)
appended = [o._get_level_values(i) for o in other]
arrays.append(label.append(appended))
return MultiIndex.from_arrays(arrays, names=self.names)
if self.names.count(None) > 1 or any(
o.names.count(None) > 1 for o in other
):

Contributor Author

This if statement checks whether self or any index in "other" has more than one unnamed level (a level whose name is None). If any of them does, the old positional method is used.

The reason is that when an index has several None levels, there is no way to tell which None level of one index should be matched with a None level of another. For example:

df1 = pd.DataFrame(index=pd.MultiIndex.from_arrays([[3], [6]], names=[None, None]))
df2 = pd.DataFrame(index=pd.MultiIndex.from_arrays([[9], [7]], names=["Light", None]))
pd.concat([df1, df2])

Should the index be

Light  None  None
NA     3     6
9      7     NA

or

Light  None  None
NA     3     6
9      NA    7

Since there is no way of deciding this, I have decided to fall back to the old method in that case.
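
The fallback condition described in this comment can be sketched on its own. This is a minimal illustration, not the PR's code: the helper name `has_ambiguous_unnamed_levels` is hypothetical, and it only mirrors the `names.count(None) > 1` check from the diff.

```python
import pandas as pd


def has_ambiguous_unnamed_levels(indexes):
    """Hypothetical helper: True if any MultiIndex has more than one
    level whose name is None, so name-based alignment is ambiguous."""
    return any(list(mi.names).count(None) > 1 for mi in indexes)


# Two unnamed levels: it is unclear which one should pair with the
# single unnamed level of `right`.
left = pd.MultiIndex.from_arrays([[3], [6]], names=[None, None])
right = pd.MultiIndex.from_arrays([[9], [7]], names=["Light", None])
# Fully named index, used here as the unambiguous counterpart.
named = pd.MultiIndex.from_arrays([[3], [6]], names=["Stationary", "Food"])

ambiguous = has_ambiguous_unnamed_levels([left, right])
unambiguous = has_ambiguous_unnamed_levels([named, right])
```

With at most one None level per index, the unnamed level can still be treated as a single shared name, which is why the check uses "more than one" rather than "any".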

for i in range(self.nlevels):

label = self._get_level_values(i)
Contributor

Please do not write loops like this.

This could be a function of its own; as written it is very hard to follow.

appended = [o._get_level_values(i) for o in other]
arrays.append(label.append(appended))
index_label_list = self.names

else:
index_label_list = self.get_unique_indexes(other)

Contributor Author

@taytzehao taytzehao May 20, 2021

In the new method, the list of unique level names across self and other is obtained first. For example:

mi1 = pd.MultiIndex.from_arrays([[3], [6]], names=["Stationary", "Food"])
mi2 = pd.MultiIndex.from_arrays([[9], [6]], names=["Light", "Food"])
mi1.get_unique_indexes([mi2])

would return

["Stationary", "Food", "Light"]
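
The order-preserving union described here can be sketched without pandas at all. The function name `ordered_name_union` is hypothetical and stands in for the PR's `get_unique_indexes`; it takes plain lists of level names instead of MultiIndex objects.

```python
def ordered_name_union(first_names, *other_names):
    """Hypothetical stand-in for get_unique_indexes: union of level
    names, keeping the order in which each name first appears."""
    union = list(first_names)
    for names in other_names:
        for name in names:
            # A list (not a set) is used so first-seen order is kept.
            if name not in union:
                union.append(name)
    return union


result = ordered_name_union(["Stationary", "Food"], ["Light", "Food"])
```

Names from self come first, and each later index only contributes names not seen before, so the level order of self is never disturbed.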

for index_label in index_label_list:

index = self.get_index_data(
data_index=self, column_name=index_label, other=other
)
appended = []

Contributor Author

Obtain the index data from the self dataframe.

Contributor Author

search_self is False here because self is itself the index being searched.

for o in other:

data = self.get_index_data(
data_index=o,
column_name=index_label,
other=other,
search_self=True,
)
appended.append(data)

Contributor Author

Obtain the index data from the other dataframe.

index = index.append(data)

arrays.append(index)
return MultiIndex.from_arrays(arrays, names=index_label_list)

to_concat = (self._values,) + tuple(k._values for k in other)
new_tuples = np.concatenate(to_concat)
@@ -2178,6 +2209,47 @@ def append(self, other):
except (TypeError, IndexError):
return Index(new_tuples)

    def get_index_data(self, data_index, column_name, other, search_self=False):

        # Returns original data if the data_index input has data for this column name
        if column_name in data_index.names:
            Index_position = data_index.names.index(column_name)
            data = data_index._get_level_values(Index_position)
            return data

Contributor Author

If the MultiIndex being checked has the column name, this just returns the original data that is in the MultiIndex.

        else:

            # If the data_index input is from other and if it don't
            # have the column name, it returns an Index filled with pd.NA
            # with data type that the other dataframe has the column.
            if search_self is True:
                if column_name in self.names:
                    Index_position = self.names.index(column_name)
                    NA_type = self.levels[Index_position].dtype
                    data = Index([NA] * data_index.size, dtype=NA_type)
                    return data

            for o in other:
                if o is not data_index and column_name in o.names:
                    Index_position = o.names.index(column_name)
                    NA_type = o.levels[Index_position].dtype
                    data = Index([NA] * data_index.size, dtype=NA_type)
                    return data

Contributor Author

This function returns the level values that an index has for a given level name. If the MultiIndex does not contain that level, it returns a pd.Index filled with pd.NA. For example:

MultiIndex0 = pd.MultiIndex.from_arrays([[0, 1], [2, 3]], names=["Stationary", "Tree"])
MultiIndex1 = pd.MultiIndex.from_arrays([[4, 5], [6, 7]], names=["Food", "Tree"])
MultiIndex2 = pd.MultiIndex.from_arrays([[8, 9], [10, 11]], names=["Stationary", "Tree"])
self = MultiIndex0
other = [MultiIndex1, MultiIndex2]

self.get_index_data(data_index=MultiIndex2,
                    column_name="Food",
                    other=other,
                    search_self=True)

self.get_index_data(data_index=self,
                    column_name="Stationary",
                    other=other,
                    search_self=False)

would yield

pd.Index([pd.NA, pd.NA])

and

pd.Index([0, 1])
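
The lookup-or-fill behaviour described in this comment can be approximated with public pandas API only. This is a simplified sketch under stated assumptions: `level_values_or_na` is a hypothetical name, and it always uses `object` dtype for the filler, whereas the PR tries to reuse the dtype of the level found in another index.

```python
import pandas as pd


def level_values_or_na(mi, name):
    """Hypothetical helper: return the values of level `name`, or an
    all-NA Index of the same length when `mi` has no such level."""
    names = list(mi.names)
    if name in names:
        return mi.get_level_values(names.index(name))
    # Assumption: object dtype, so pd.NA can be stored as-is.
    return pd.Index([pd.NA] * len(mi), dtype=object, name=name)


mi0 = pd.MultiIndex.from_arrays([[0, 1], [2, 3]], names=["Stationary", "Tree"])
mi2 = pd.MultiIndex.from_arrays([[8, 9], [10, 11]], names=["Stationary", "Tree"])

present = level_values_or_na(mi0, "Stationary")  # level exists in mi0
missing = level_values_or_na(mi2, "Food")        # level absent from mi2
```

Because the filler does not depend on the other indexes, this variant needs neither the `other` nor the `search_self` argument.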

    def get_unique_indexes(self, other):

        Union_list = list(self.names)

        for o in other:
            if not set(o.names).issubset(Union_list):

                for element in o.names:
                    if element not in Union_list:

                        Union_list.append(element)

        return Union_list

def argsort(self, *args, **kwargs) -> np.ndarray:
return self._values.argsort(*args, **kwargs)

37 changes: 37 additions & 0 deletions pandas/tests/arrays/sparse/test_array.py
@@ -1333,3 +1333,40 @@ def test_maxmin(self, raw_data, max_expected, min_expected):
min_result = SparseArray(raw_data).min()
assert max_result in max_expected
assert min_result in min_expected


def test_concat_with_different_index_arrangement():
    df_first = pd.DataFrame(
        [["i1_top", "i2_top", 1]], columns=["index1", "index2", "value1"]
    )
    df_second = pd.DataFrame(
        [["i1_middle", "i2_middle", 1]], columns=["index1", "index3", "value1"]
    )
    df_third = pd.DataFrame(
        [["i1_bottom", "i2_bottom", 1]], columns=["index1", "index4", "value1"]
    )

    df_concatenated_result = pd.concat(
        [df_first, df_second, df_third], ignore_index=True
    )
    df_concatenated_expected = pd.DataFrame(
        [
            ["i1_top", "i2_top", 1, pd.NA, pd.NA],
            ["i1_middle", pd.NA, 1, "i2_middle", pd.NA],
            ["i1_bottom", pd.NA, 1, pd.NA, "i2_bottom"],
        ],
        columns=["index1", "index2", "value1", "index3", "index4"],
    )

    tm.assert_frame_equal(df_concatenated_result, df_concatenated_expected)

    df_first.set_index(["index1", "index2"], inplace=True)
    df_second.set_index(["index3", "index1"], inplace=True)
    df_third.set_index(["index4", "index1"], inplace=True)

    df_concatenated_result = pd.concat([df_first, df_second, df_third])
    df_concatenated_expected.set_index(
        ["index1", "index2", "index3", "index4"], inplace=True
    )

    tm.assert_frame_equal(df_concatenated_result, df_concatenated_expected)
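
For comparison, the name-aligned result this test expects can be reached with public API alone, by moving the index levels back into columns before concatenating. This is a sketch of a workaround, not the approach taken in this PR: the helper `concat_aligning_index_names` is hypothetical, and missing entries come out as NaN rather than `pd.NA`.

```python
import pandas as pd


def concat_aligning_index_names(frames, index_names):
    """Hypothetical helper: align index levels by name by turning them
    into ordinary columns, concatenating, then rebuilding the index."""
    flat = [frame.reset_index() for frame in frames]
    # Columns are matched by label, so differently ordered or missing
    # index levels line up by name instead of by position.
    combined = pd.concat(flat, ignore_index=True)
    return combined.set_index(index_names)


df_first = pd.DataFrame(
    [["i1_top", "i2_top", 1]], columns=["index1", "index2", "value1"]
).set_index(["index1", "index2"])
df_second = pd.DataFrame(
    [["i1_middle", "i2_middle", 1]], columns=["index1", "index3", "value1"]
).set_index(["index3", "index1"])
df_third = pd.DataFrame(
    [["i1_bottom", "i2_bottom", 1]], columns=["index1", "index4", "value1"]
).set_index(["index4", "index1"])

result = concat_aligning_index_names(
    [df_first, df_second, df_third],
    ["index1", "index2", "index3", "index4"],
)
```

The caller has to supply the final level order explicitly, which sidesteps the question of how unnamed levels should be matched.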