Skip to content

[ArrayManager] Implement concat with axis=1 (merge/join) #39841

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,7 @@ jobs:
run: |
source activate pandas-dev
pytest pandas/tests/frame/methods --array-manager
pytest pandas/tests/reshape/merge --array-manager
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only reshape/merge and not reshape/concat, as there are too many tests still failing in the concat tests (because axis=0 is not yet handled properly)


# indexing iset related (temporary since other tests don't pass yet)
pytest pandas/tests/frame/indexing/test_indexing.py::TestDataFrameIndexing::test_setitem_multi_index --array-manager
Expand Down
7 changes: 5 additions & 2 deletions pandas/core/internals/array_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -754,10 +754,12 @@ def reindex_indexer(
# ignored keywords
consolidate: bool = True,
only_slice: bool = False,
# ArrayManager specific keywords
do_integrity_check: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we call this integrity_check elsewhere

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BlockManager/ArrayManager constructors call this do_integrity_check, so it's consistent with that. We don't have integrity_check, but we do have verify_integrity elsewhere (and this is actually used more, also in public APIs).

I think it might be good to change do_integrity_check to verify_integrity, but given that currently do_integrity_check is consistently used within the (already existing) internals code, that's something for a different PR, I would say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right that's what i mean. can you normalize all to verify_integrity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you normalize all to verify_integrity

Since that's not strictly related to concat or ArrayManager, will do that in a follow-up PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> #39989

) -> T:
axis = self._normalize_axis(axis)
return self._reindex_indexer(
new_axis, indexer, axis, fill_value, allow_dups, copy
new_axis, indexer, axis, fill_value, allow_dups, copy, do_integrity_check
)

def _reindex_indexer(
Expand All @@ -768,6 +770,7 @@ def _reindex_indexer(
fill_value=None,
allow_dups: bool = False,
copy: bool = True,
do_integrity_check: bool = True,
) -> T:
"""
Parameters
Expand Down Expand Up @@ -822,7 +825,7 @@ def _reindex_indexer(
new_axes = list(self._axes)
new_axes[axis] = new_axis

return type(self)(new_arrays, new_axes)
return type(self)(new_arrays, new_axes, do_integrity_check=do_integrity_check)

def take(self, indexer, axis: int = 1, verify: bool = True, convert: bool = True):
"""
Expand Down
55 changes: 42 additions & 13 deletions pandas/core/internals/concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,47 @@
from pandas import Index


def concatenate_array_managers(
mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool
) -> Manager:
"""
Concatenate array managers into one.

Parameters
----------
mgrs_indexers : list of (ArrayManager, {axis: indexer,...}) tuples
axes : list of Index
concat_axis : int
copy : bool

Returns
-------
ArrayManager
"""
# reindex all arrays
mgrs = []
for mgr, indexers in mgrs_indexers:
for ax, indexer in indexers.items():
mgr = mgr.reindex_indexer(
axes[ax], indexer, axis=ax, allow_dups=True, do_integrity_check=False
)
mgrs.append(mgr)

# concatting along the rows -> concat the reindexed arrays
# TODO(ArrayManager) doesn't yet preserve the correct dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this comment apply to both cases or just concat_axis==1? (this ambiguity is why i like to put comments inside below the "if"/"else")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the comment (yes, it applies only to the axis==1)

if concat_axis == 1:
arrays = [
concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))])
for j in range(len(mgrs[0].arrays))
]
return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False)
# concatting along the columns -> combine reindexed arrays in a single manager
else:
assert concat_axis == 0
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs]))
return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False)


def concatenate_block_managers(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you at a higher level call concatenate_array_managers instead? (e.g. instead of dispatching thru the BM)

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require adding if/else checking the manager type in each of the places where currently concatenate_array_managers is called (3 places), so that would not improve the code there.

If you want I can create a new concatenate_managers in this file that simply does the dispatch to concatenate_block_managers / concatenate_array_managers, and call that instead of concatenate_block_managers outside of the /internals

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok maybe rename this then to concatenate_managers? to make it much more obvious.

mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool
) -> Manager:
Expand All @@ -49,19 +90,7 @@ def concatenate_block_managers(
BlockManager
"""
if isinstance(mgrs_indexers[0][0], ArrayManager):

if concat_axis == 1:
# TODO for now only fastpath without indexers
mgrs = [t[0] for t in mgrs_indexers]
arrays = [
concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))], axis=0)
for j in range(len(mgrs[0].arrays))
]
return ArrayManager(arrays, [axes[1], axes[0]])
elif concat_axis == 0:
mgrs = [t[0] for t in mgrs_indexers]
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs]))
return ArrayManager(arrays, [axes[1], axes[0]])
return concatenate_array_managers(mgrs_indexers, axes, concat_axis, copy)

concat_plans = [
_get_mgr_concatenation_plan(mgr, indexers) for mgr, indexers in mgrs_indexers
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/frame/methods/test_drop.py
Original file line number Diff line number Diff line change
Expand Up @@ -155,7 +155,7 @@ def test_drop(self):
assert return_value is None
tm.assert_frame_equal(df, expected)

@td.skip_array_manager_not_yet_implemented
@td.skip_array_manager_not_yet_implemented # TODO(ArrayManager) groupby
def test_drop_multiindex_not_lexsorted(self):
# GH#11640

Expand Down
5 changes: 0 additions & 5 deletions pandas/tests/frame/methods/test_explode.py
Original file line number Diff line number Diff line change
@@ -1,14 +1,9 @@
import numpy as np
import pytest

import pandas.util._test_decorators as td

import pandas as pd
import pandas._testing as tm

# TODO(ArrayManager) concat with reindexing
pytestmark = td.skip_array_manager_not_yet_implemented


def test_error():
df = pd.DataFrame(
Expand Down
10 changes: 3 additions & 7 deletions pandas/tests/frame/methods/test_join.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,15 +3,10 @@
import numpy as np
import pytest

import pandas.util._test_decorators as td

import pandas as pd
from pandas import DataFrame, Index, MultiIndex, date_range, period_range
import pandas._testing as tm

# TODO(ArrayManager) concat with reindexing
pytestmark = td.skip_array_manager_not_yet_implemented


@pytest.fixture
def frame_with_period_index():
Expand Down Expand Up @@ -234,8 +229,9 @@ def test_join(self, multiindex_dataframe_random_data):
b = frame.loc[frame.index[2:], ["B", "C"]]

joined = a.join(b, how="outer").reindex(frame.index)
expected = frame.copy()
expected.values[np.isnan(joined.values)] = np.nan
expected = frame.copy().values
expected[np.isnan(joined.values)] = np.nan
expected = DataFrame(expected, index=frame.index, columns=frame.columns)

assert not np.isnan(joined.values).all()

Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/io/formats/test_printing.py
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,7 @@ def test_ambiguous_width(self):
assert adjoined == expected


@td.skip_array_manager_not_yet_implemented
@td.skip_array_manager_not_yet_implemented # TODO(ArrayManager) JSON
class TestTableSchemaRepr:
@classmethod
def setup_class(cls):
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/io/test_fsspec.py
Original file line number Diff line number Diff line change
Expand Up @@ -247,7 +247,7 @@ def test_pickle_options(fsspectest):
tm.assert_frame_equal(df, out)


@td.skip_array_manager_not_yet_implemented
@td.skip_array_manager_not_yet_implemented # TODO(ArrayManager) JSON
def test_json_options(fsspectest):
df = DataFrame({"a": [0]})
df.to_json("testmem://afile", storage_options={"test": "json_write"})
Expand Down
3 changes: 3 additions & 0 deletions pandas/tests/reshape/merge/test_join.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
import numpy as np
import pytest

import pandas.util._test_decorators as td

import pandas as pd
from pandas import (
Categorical,
Expand Down Expand Up @@ -547,6 +549,7 @@ def test_join_non_unique_period_index(self):
)
tm.assert_frame_equal(result, expected)

@td.skip_array_manager_not_yet_implemented # TODO(ArrayManager) groupby
def test_mixed_type_join_with_suffix(self):
# GH #916
df = DataFrame(np.random.randn(20, 6), columns=["a", "b", "c", "d", "e", "f"])
Expand Down
29 changes: 22 additions & 7 deletions pandas/tests/reshape/merge/test_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -277,17 +277,27 @@ def test_merge_copy(self):
merged["d"] = "peekaboo"
assert (right["d"] == "bar").all()

def test_merge_nocopy(self):
def test_merge_nocopy(self, using_array_manager):
left = DataFrame({"a": 0, "b": 1}, index=range(10))
right = DataFrame({"c": "foo", "d": "bar"}, index=range(10))

merged = merge(left, right, left_index=True, right_index=True, copy=False)

merged["a"] = 6
assert (left["a"] == 6).all()
if using_array_manager:
# With ArrayManager, setting a column doesn't change the values inplace
# and thus does not propagate the changes to the original left/right
# dataframes -> need to check that no copy was made in a different way
# TODO(ArrayManager) we should be able to simplify this with a .loc
# setitem test: merged.loc[0, "a"] = 10; assert left.loc[0, "a"] == 10
# but this currently replaces the array (_setitem_with_indexer_split_path)
assert merged._mgr.arrays[0] is left._mgr.arrays[0]
assert merged._mgr.arrays[2] is right._mgr.arrays[0]
else:
merged["a"] = 6
assert (left["a"] == 6).all()

merged["d"] = "peekaboo"
assert (right["d"] == "peekaboo").all()
merged["d"] = "peekaboo"
assert (right["d"] == "peekaboo").all()

def test_intelligently_handle_join_key(self):
# #733, be a bit more 1337 about not returning unconsolidated DataFrame
Expand Down Expand Up @@ -1362,7 +1372,7 @@ def test_merge_take_missing_values_from_index_of_other_dtype(self):
expected = expected.reindex(columns=["a", "key", "b"])
tm.assert_frame_equal(result, expected)

def test_merge_readonly(self):
def test_merge_readonly(self, using_array_manager):
# https://github.com/pandas-dev/pandas/issues/27943
data1 = DataFrame(
np.arange(20).reshape((4, 5)) + 1, columns=["a", "b", "c", "d", "e"]
Expand All @@ -1371,7 +1381,12 @@ def test_merge_readonly(self):
np.arange(20).reshape((5, 4)) + 1, columns=["a", "b", "x", "y"]
)

data1._mgr.blocks[0].values.flags.writeable = False
if using_array_manager:
for arr in data1._mgr.arrays:
arr.flags.writeable = False
else:
data1._mgr.blocks[0].values.flags.writeable = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could implement BlockManager.arrays as a property yielding the values for compat

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but that would not help for this specific case (as those arrays would not be the actual arrays stored in the BlockManager, and setting the flag wouldn't have an impact on the actual values).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as those arrays would not be the actual arrays stored in the BlockManager

@property
def arrays(self):
   for blk in self.blocks:
       yield blk.values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, like that. I wouldn't find it great API design, as the same attribute/property would return something quite different (at least what you can do with it is different), but I am fine adding it for testing convenience

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added + test simplified


data1.merge(data2) # no error


Expand Down