Skip to content

BUG: Fixed handling of boolean indexing with 2-d ndarrays #18645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 30, 2017

Conversation

dhirschfeld
Copy link
Contributor

@dhirschfeld dhirschfeld commented Dec 5, 2017

@dhirschfeld
Copy link
Contributor Author

Looks like just one error. The rest are the same as before the change:

 1 failed, 17689 passed, 454 skipped, 47 xfailed, 1 xpassed, 44 warnings in 1607.96 seconds
================================== FAILURES ===================================
_________________ TestDataFrameIndexing.test_setitem_boolean __________________

self = <pandas.tests.frame.test_indexing.TestDataFrameIndexing object at 0x000002C9E29D2A90>

    def test_setitem_boolean(self):
        df = self.frame.copy()
        values = self.frame.values

        df[df['A'] > 0] = 4
        values[values[:, 0] > 0] = 4
        assert_almost_equal(df.values, values)

        # test that column reindexing works
        series = df['A'] == 4
        series = series.reindex(df.index[::-1])
        df[series] = 1
        values[values[:, 0] == 4] = 1
        assert_almost_equal(df.values, values)

        df[df > 0] = 5
        values[values > 0] = 5
        assert_almost_equal(df.values, values)

        df[df == 5] = 0
        values[values == 5] = 0
        assert_almost_equal(df.values, values)

        # a df that needs alignment first
        df[df[:-1] < 0] = 2
        np.putmask(values[:-1], values[:-1] < 0, 2)
        assert_almost_equal(df.values, values)

        # indexed with same shape but rows-reversed df
        df[df[::-1] == 2] = 3
        values[values == 2] = 3
        assert_almost_equal(df.values, values)

        with tm.assert_raises_regex(TypeError, 'Must pass '
                                    'DataFrame with '
                                    'boolean values only'):
>           df[df * 0] = 2

pandas\tests\frame\test_indexing.py:528:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =                    A         B    C         D
YZQTztQ8Ng  0.000000  0.000000  0.0  0.000000
lYi3UlJmyD  3.000000  3.00...00  0.000000  0.0  0.000000
8RLavvGvaC  3.000000  0.000000  0.0  3.000000
S4V8aOm8PB -2.325562 -0.382488  0.0 -0.010254
key =               A    B    C    D
YZQTztQ8Ng  0.0  0.0  0.0  0.0
lYi3UlJmyD  0.0  0.0  0.0  0.0
PAZ8DZYTxG  0.0  0.0  0.0...rCoQFA  0.0  0.0  0.0  0.0
O4jF5wCtd2  0.0  0.0  0.0  0.0
8RLavvGvaC  0.0  0.0  0.0  0.0
S4V8aOm8PB -0.0 -0.0  0.0 -0.0
value = 2

    def __setitem__(self, key, value):
        key = com._apply_if_callable(key, self)

        # see if we can slice the rows
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            return self._setitem_slice(indexer, value)

        if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
>           self._setitem_frame(key, value)

pandas\core\frame.py:2534:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =                    A         B    C         D
YZQTztQ8Ng  0.000000  0.000000  0.0  0.000000
lYi3UlJmyD  3.000000  3.00...00  0.000000  0.0  0.000000
8RLavvGvaC  3.000000  0.000000  0.0  3.000000
S4V8aOm8PB -2.325562 -0.382488  0.0 -0.010254
key =               A    B    C    D
YZQTztQ8Ng  0.0  0.0  0.0  0.0
lYi3UlJmyD  0.0  0.0  0.0  0.0
PAZ8DZYTxG  0.0  0.0  0.0...rCoQFA  0.0  0.0  0.0  0.0
O4jF5wCtd2  0.0  0.0  0.0  0.0
8RLavvGvaC  0.0  0.0  0.0  0.0
S4V8aOm8PB -0.0 -0.0  0.0 -0.0
value = 2

    def _setitem_frame(self, key, value):
        # support boolean setting with DataFrame input, e.g.
        # df[df > df2] = 0
        if isinstance(key, np.ndarray):
            if key.shape != self.shape:
                raise ValueError(
                    'Array conditional must be same shape as self'
                )
            key = self._constructor(key, **self._construct_axes_dict())

        if key.values.size and not is_bool_dtype(key.values):
            raise TypeError(
>               'Must pass DataFrame or 2-d ndarray with boolean values only'
            )
E           TypeError: Must pass DataFrame or 2-d ndarray with boolean values only

pandas\core\frame.py:2578: TypeError

During handling of the above exception, another exception occurred:

self = <pandas.tests.frame.test_indexing.TestDataFrameIndexing object at 0x000002C9E29D2A90>

    def test_setitem_boolean(self):
        df = self.frame.copy()
        values = self.frame.values

        df[df['A'] > 0] = 4
        values[values[:, 0] > 0] = 4
        assert_almost_equal(df.values, values)

        # test that column reindexing works
        series = df['A'] == 4
        series = series.reindex(df.index[::-1])
        df[series] = 1
        values[values[:, 0] == 4] = 1
        assert_almost_equal(df.values, values)

        df[df > 0] = 5
        values[values > 0] = 5
        assert_almost_equal(df.values, values)

        df[df == 5] = 0
        values[values == 5] = 0
        assert_almost_equal(df.values, values)

        # a df that needs alignment first
        df[df[:-1] < 0] = 2
        np.putmask(values[:-1], values[:-1] < 0, 2)
        assert_almost_equal(df.values, values)

        # indexed with same shape but rows-reversed df
        df[df[::-1] == 2] = 3
        values[values == 2] = 3
        assert_almost_equal(df.values, values)

        with tm.assert_raises_regex(TypeError, 'Must pass '
                                    'DataFrame with '
                                    'boolean values only'):
>           df[df * 0] = 2

pandas\tests\frame\test_indexing.py:528:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\util\testing.py:2489: in __exit__
    return self.exception_matches(exc_type, exc_value, trace_back)
pandas\util\testing.py:2525: in exception_matches
    raise_with_traceback(e, trace_back)
pandas\compat\__init__.py:385: in raise_with_traceback
    raise exc.with_traceback(traceback)
pandas\tests\frame\test_indexing.py:528: in test_setitem_boolean
    df[df * 0] = 2
pandas\core\frame.py:2534: in __setitem__
    self._setitem_frame(key, value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =                    A         B    C         D
YZQTztQ8Ng  0.000000  0.000000  0.0  0.000000
lYi3UlJmyD  3.000000  3.00...00  0.000000  0.0  0.000000
8RLavvGvaC  3.000000  0.000000  0.0  3.000000
S4V8aOm8PB -2.325562 -0.382488  0.0 -0.010254
key =               A    B    C    D
YZQTztQ8Ng  0.0  0.0  0.0  0.0
lYi3UlJmyD  0.0  0.0  0.0  0.0
PAZ8DZYTxG  0.0  0.0  0.0...rCoQFA  0.0  0.0  0.0  0.0
O4jF5wCtd2  0.0  0.0  0.0  0.0
8RLavvGvaC  0.0  0.0  0.0  0.0
S4V8aOm8PB -0.0 -0.0  0.0 -0.0
value = 2

    def _setitem_frame(self, key, value):
        # support boolean setting with DataFrame input, e.g.
        # df[df > df2] = 0
        if isinstance(key, np.ndarray):
            if key.shape != self.shape:
                raise ValueError(
                    'Array conditional must be same shape as self'
                )
            key = self._constructor(key, **self._construct_axes_dict())

        if key.values.size and not is_bool_dtype(key.values):
            raise TypeError(
>               'Must pass DataFrame or 2-d ndarray with boolean values only'
            )
E           AssertionError: "Must pass DataFrame with boolean values only" does not match "Must pass DataFrame or 2-d ndarray with boolean values only"

pandas\core\frame.py:2578: AssertionError

@gfyoung gfyoung added Bug Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Dec 5, 2017
@codecov
Copy link

codecov bot commented Dec 6, 2017

Codecov Report

Merging #18645 into master will decrease coverage by 0.02%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18645      +/-   ##
==========================================
- Coverage   91.57%   91.55%   -0.03%     
==========================================
  Files         150      150              
  Lines       48933    48937       +4     
==========================================
- Hits        44812    44803       -9     
- Misses       4121     4134      +13
Flag Coverage Δ
#multiple 89.91% <87.5%> (-0.03%) ⬇️
#single 41.75% <25%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.63% <87.5%> (-0.05%) ⬇️
pandas/plotting/_converter.py 65.22% <0%> (-1.74%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c24a0d2...c2da170. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. pls add a whatsnew note for 0.22, bug fixes in indexing.

expected.values[mask.values] = nan
assert_frame_equal(df, expected)

# index with DataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make these a separate test (test_setitem_boolean_ndarary)

@pep8speaks
Copy link

pep8speaks commented Dec 9, 2017

Hello @dhirschfeld! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 30, 2017 at 14:24 Hours UTC

@dhirschfeld
Copy link
Contributor Author

@jreback - test moved to it's own function and whatsnew entry added

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small changes, ping on green.

@@ -542,6 +541,16 @@ def test_setitem_boolean(self):
np.putmask(expected.values, mask.values, df.values * 2)
assert_frame_equal(df, expected)

def test_setitem_boolean_ndarary(self):
df = self.frame.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment here, also sp on ndarray

mask = df > np.abs(df)
expected = df.copy()
expected.values[mask.values] = nan
# index with 2-d boolean ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a blank line before the comment

@@ -542,6 +541,16 @@ def test_setitem_boolean(self):
np.putmask(expected.values, mask.values, df.values * 2)
assert_frame_equal(df, expected)

def test_setitem_boolean_ndarary(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parametrize to pass both lambda mask: mask, lambda mask: mask.value (IOW pass a dataframe and the ndarray)

Copy link
Contributor

@jreback jreback Dec 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this, IOW, use parametrize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You saw my comment? You definitely want to use parametrize rather than just adding it like I did below?

https://github.com/dhirschfeld/pandas/blob/520adf778867e9054ee3634e23654eadfc0507ab/pandas/tests/frame/test_indexing.py#L557-L560

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parametrize this

@jreback jreback added this to the 0.22.0 milestone Dec 9, 2017
@jreback jreback changed the title [WIP] Fixed handling of boolean indexing with 2-d ndarrays BUG: Fixed handling of boolean indexing with 2-d ndarrays Dec 9, 2017
# index with boolean DataFrame
actual = df.copy()
actual[mask] = nan
assert_frame_equal(actual, expected)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added this block to the test rather than try and use pytest parameterisation as it seem a more straightforward and direct way to do it. Possibly just my lack of familiarity with pytest fixtures but I think they might be overkill for this usecase.

@dhirschfeld
Copy link
Contributor Author

Just thinking about this - as a bug fix wouldn't it be more appropriate for 0.21.1?

@jreback
Copy link
Contributor

jreback commented Dec 10, 2017

Just thinking about this - as a bug fix wouldn't it be more appropriate for 0.21.1?

yes its a bug fix, but it is substantially different, so put in 0.22. folks may be relying on this behavior (event though its wrong).

@dhirschfeld
Copy link
Contributor Author

Fair enough. I think it's good to go then if there are no more changes you'd like?

@@ -542,6 +541,16 @@ def test_setitem_boolean(self):
np.putmask(expected.values, mask.values, df.values * 2)
assert_frame_equal(df, expected)

def test_setitem_boolean_ndarary(self):
Copy link
Contributor

@jreback jreback Dec 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this, IOW, use parametrize

@jreback
Copy link
Contributor

jreback commented Dec 15, 2017

pls rebase & ping on green (the 3.6 build is failing unrelated so that may fail the build), but otherwise.

@@ -268,6 +268,7 @@ Indexing
- Bug in :func:`IntervalIndex.symmetric_difference` where the symmetric difference with a non-``IntervalIndex`` did not raise (:issue:`18475`)
- Bug in indexing a datetimelike ``Index`` that raised ``ValueError`` instead of ``IndexError`` (:issue:`18386`).
- Bug in tz-aware :class:`DatetimeIndex` where addition/subtraction with a :class:`TimedeltaIndex` or array with ``dtype='timedelta64[ns]'`` was incorrect (:issue:`17558`)
- Bug in ``__setitem__`` when indexing a :class:`DataFrame` with a 2-d boolean ndarray (:issue:`18582`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to 0.23 (0.22 is a special release)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Should hopefully have the time to finish this off this weekend...

@@ -542,6 +541,16 @@ def test_setitem_boolean(self):
np.putmask(expected.values, mask.values, df.values * 2)
assert_frame_equal(df, expected)

def test_setitem_boolean_ndarary(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parametrize this

@dhirschfeld
Copy link
Contributor Author

@jreback - should be good to go I think...

@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

pushed a change to the more common form of parametrization. ping on green.

@jreback jreback merged commit 2dac793 into pandas-dev:master Dec 30, 2017
@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

thanks @dhirschfeld !

@dhirschfeld dhirschfeld deleted the ndarray-boolean-mask branch December 30, 2017 22:11
@dhirschfeld
Copy link
Contributor Author

Thanks for getting this over the line @jreback!

hexgnu pushed a commit to hexgnu/pandas that referenced this pull request Jan 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results
4 participants