-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
[ArrayManager] Implement concat with axis=1 (merge/join) #39841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
072ed39
dd14d2d
4da4433
31cfb0d
480b32d
3488d63
437c1d2
2597cd5
31d1305
b6a6b54
3f5fc38
54692ca
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -754,10 +754,12 @@ def reindex_indexer( | |
# ignored keywords | ||
consolidate: bool = True, | ||
only_slice: bool = False, | ||
# ArrayManager specific keywords | ||
do_integrity_check: bool = True, | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think we call this integrity_check elsewhere There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The BlockManager/ArrayManager constructors call this I think it might be good to change There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. right that's what i mean. can you normalize all to verify_integrity There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Since that's not strictly related to concat or ArrayManager, will do that in a follow-up PR There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. -> #39989 |
||
) -> T: | ||
axis = self._normalize_axis(axis) | ||
return self._reindex_indexer( | ||
new_axis, indexer, axis, fill_value, allow_dups, copy | ||
new_axis, indexer, axis, fill_value, allow_dups, copy, do_integrity_check | ||
) | ||
|
||
def _reindex_indexer( | ||
|
@@ -768,6 +770,7 @@ def _reindex_indexer( | |
fill_value=None, | ||
allow_dups: bool = False, | ||
copy: bool = True, | ||
do_integrity_check: bool = True, | ||
) -> T: | ||
""" | ||
Parameters | ||
|
@@ -822,7 +825,7 @@ def _reindex_indexer( | |
new_axes = list(self._axes) | ||
new_axes[axis] = new_axis | ||
|
||
return type(self)(new_arrays, new_axes) | ||
return type(self)(new_arrays, new_axes, do_integrity_check=do_integrity_check) | ||
|
||
def take(self, indexer, axis: int = 1, verify: bool = True, convert: bool = True): | ||
""" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,6 +31,47 @@ | |
from pandas import Index | ||
|
||
|
||
def concatenate_array_managers( | ||
mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool | ||
) -> Manager: | ||
""" | ||
Concatenate array managers into one. | ||
|
||
Parameters | ||
---------- | ||
mgrs_indexers : list of (ArrayManager, {axis: indexer,...}) tuples | ||
axes : list of Index | ||
concat_axis : int | ||
copy : bool | ||
|
||
Returns | ||
------- | ||
ArrayManager | ||
""" | ||
# reindex all arrays | ||
mgrs = [] | ||
for mgr, indexers in mgrs_indexers: | ||
for ax, indexer in indexers.items(): | ||
mgr = mgr.reindex_indexer( | ||
axes[ax], indexer, axis=ax, allow_dups=True, do_integrity_check=False | ||
) | ||
mgrs.append(mgr) | ||
|
||
# concatting along the rows -> concat the reindexed arrays | ||
# TODO(ArrayManager) doesn't yet preserve the correct dtype | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does this comment apply to both cases or just concat_axis==1? (this ambiguity is why i like to put comments inside below the "if"/"else") There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moved the comment (yes, it applies only to the axis==1) |
||
if concat_axis == 1: | ||
arrays = [ | ||
concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))]) | ||
for j in range(len(mgrs[0].arrays)) | ||
] | ||
return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False) | ||
# concatting along the columns -> combine reindexed arrays in a single manager | ||
else: | ||
assert concat_axis == 0 | ||
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs])) | ||
return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False) | ||
|
||
|
||
def concatenate_block_managers( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you at a higher level call concatenate_array_managers instead? (e.g. instead of dispatching thru the BM) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That would require adding if/else checking the manager type in each of the places where currently If you want I can create a new There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok maybe rename this then to concatenate_managers? to make it much more obvious. |
||
mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool | ||
) -> Manager: | ||
|
@@ -49,19 +90,7 @@ def concatenate_block_managers( | |
BlockManager | ||
""" | ||
if isinstance(mgrs_indexers[0][0], ArrayManager): | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
if concat_axis == 1: | ||
# TODO for now only fastpath without indexers | ||
mgrs = [t[0] for t in mgrs_indexers] | ||
arrays = [ | ||
concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))], axis=0) | ||
for j in range(len(mgrs[0].arrays)) | ||
] | ||
return ArrayManager(arrays, [axes[1], axes[0]]) | ||
elif concat_axis == 0: | ||
mgrs = [t[0] for t in mgrs_indexers] | ||
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs])) | ||
return ArrayManager(arrays, [axes[1], axes[0]]) | ||
return concatenate_array_managers(mgrs_indexers, axes, concat_axis, copy) | ||
|
||
concat_plans = [ | ||
_get_mgr_concatenation_plan(mgr, indexers) for mgr, indexers in mgrs_indexers | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -277,17 +277,27 @@ def test_merge_copy(self): | |
merged["d"] = "peekaboo" | ||
assert (right["d"] == "bar").all() | ||
|
||
def test_merge_nocopy(self): | ||
def test_merge_nocopy(self, using_array_manager): | ||
left = DataFrame({"a": 0, "b": 1}, index=range(10)) | ||
right = DataFrame({"c": "foo", "d": "bar"}, index=range(10)) | ||
|
||
merged = merge(left, right, left_index=True, right_index=True, copy=False) | ||
|
||
merged["a"] = 6 | ||
assert (left["a"] == 6).all() | ||
if using_array_manager: | ||
# With ArrayManager, setting a column doesn't change the values inplace | ||
# and thus does not propagate the changes to the original left/right | ||
# dataframes -> need to check that no copy was made in a different way | ||
# TODO(ArrayManager) we should be able to simplify this with a .loc | ||
# setitem test: merged.loc[0, "a"] = 10; assert left.loc[0, "a"] == 10 | ||
# but this currently replaces the array (_setitem_with_indexer_split_path) | ||
assert merged._mgr.arrays[0] is left._mgr.arrays[0] | ||
assert merged._mgr.arrays[2] is right._mgr.arrays[0] | ||
else: | ||
merged["a"] = 6 | ||
assert (left["a"] == 6).all() | ||
|
||
merged["d"] = "peekaboo" | ||
assert (right["d"] == "peekaboo").all() | ||
merged["d"] = "peekaboo" | ||
assert (right["d"] == "peekaboo").all() | ||
|
||
def test_intelligently_handle_join_key(self): | ||
# #733, be a bit more 1337 about not returning unconsolidated DataFrame | ||
|
@@ -1362,7 +1372,7 @@ def test_merge_take_missing_values_from_index_of_other_dtype(self): | |
expected = expected.reindex(columns=["a", "key", "b"]) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_merge_readonly(self): | ||
def test_merge_readonly(self, using_array_manager): | ||
# https://github.com/pandas-dev/pandas/issues/27943 | ||
data1 = DataFrame( | ||
np.arange(20).reshape((4, 5)) + 1, columns=["a", "b", "c", "d", "e"] | ||
|
@@ -1371,7 +1381,12 @@ def test_merge_readonly(self): | |
np.arange(20).reshape((5, 4)) + 1, columns=["a", "b", "x", "y"] | ||
) | ||
|
||
data1._mgr.blocks[0].values.flags.writeable = False | ||
if using_array_manager: | ||
for arr in data1._mgr.arrays: | ||
arr.flags.writeable = False | ||
else: | ||
data1._mgr.blocks[0].values.flags.writeable = False | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could implement BlockManager.arrays as a property yielding the values for compat There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, but that would not help for this specific case (as those arrays would not be the actual arrays stored in the BlockManager, and setting the flag wouldn't have an impact on the actual values). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, like that. I wouldn't find it great API design, as the same attribute/property would return something quite different (at least what you can do with it is different), but I am fine adding it for testing convenience There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added + test simplified |
||
|
||
data1.merge(data2) # no error | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only
reshape/merge
and notreshape/concat
, as there are too many tests still failing in the concat tests (because axis=0 is not yet handled properly)