ENH: Add axis and level keywords to where, so that the other argument can now be an alignable pandas object. #4781

Merged
merged 1 commit into from
Sep 10, 2013
32 changes: 25 additions & 7 deletions doc/source/indexing.rst
@@ -625,6 +625,18 @@ This can be done intuitively like so:
df2[df2 < 0] = 0
df2

By default, ``where`` returns a modified copy of the data. There is an
optional parameter ``inplace`` so that the original data can be modified
without creating a copy:

.. ipython:: python

df_orig = df.copy()
df_orig.where(df > 0, -df, inplace=True);
df_orig
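
The difference between the default copy and ``inplace=True`` can be sketched as follows. This is a minimal illustration assuming a recent pandas release; the frame and its values are made up and are not part of the PR.

```python
import pandas as pd

# Hypothetical data, chosen so that no entry is exactly zero.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

# Default behaviour: a modified copy is returned and df is left untouched.
flipped = df.where(df > 0, -df)

# inplace=True: df_orig itself is modified and the call returns None.
df_orig = df.copy()
ret = df_orig.where(df > 0, -df, inplace=True)
```

Here ``flipped`` and ``df_orig`` hold the same values, while ``df`` still contains its negative entries.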

**alignment**

Furthermore, ``where`` aligns the input boolean condition (ndarray or DataFrame),
such that partial selection with setting is possible. This is analogous to
partial setting via ``.ix`` (but on the contents rather than the axis labels).
@@ -635,24 +647,30 @@ partial setting via ``.ix`` (but on the contents rather than the axis labels)
df2[ df2[1:4] > 0 ] = 3
df2

By default, ``where`` returns a modified copy of the data. There is an
optional parameter ``inplace`` so that the original data can be modified
without creating a copy:
.. versionadded:: 0.13

``where`` also accepts ``axis`` and ``level`` parameters, which control how the
``other`` argument is aligned when performing the ``where``.

.. ipython:: python

df_orig = df.copy()
df2 = df.copy()
df2.where(df2>0,df2['A'],axis='index')

df_orig.where(df > 0, -df, inplace=True);
This is equivalent to (but faster than) the following.

df_orig
.. ipython:: python

df2 = df.copy()
df2.apply(lambda x, y: x.where(x>0,y), y=df2['A'])
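
The equivalence described above can be checked with a small self-contained sketch. It assumes a recent pandas release; the random frame and the column names ``A``, ``B``, ``C`` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 3)), columns=["A", "B", "C"])

# Replace non-positive entries with the value of column 'A' from the same row;
# axis='index' aligns the Series on the row labels.
result = df.where(df > 0, df["A"], axis="index")

# The slower column-by-column equivalent.
expected = df.apply(lambda col: col.where(col > 0, df["A"]))

pd.testing.assert_frame_equal(result, expected)
```

Passing ``axis`` is what lets a one-dimensional ``other`` be broadcast across the remaining axis.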

**mask**

``mask`` is the inverse boolean operation of ``where``.

.. ipython:: python

s.mask(s >= 0)

df.mask(df >= 0)
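
A minimal sketch of the relationship between ``mask`` and ``where``, assuming a recent pandas release; the Series values are made up for illustration.

```python
import pandas as pd

s = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical values

masked = s.mask(s >= 0)    # NaN wherever the condition is True
kept = s.where(~(s >= 0))  # NaN wherever the negated condition is False

pd.testing.assert_series_equal(masked, kept)
```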

Take Methods
27 changes: 27 additions & 0 deletions doc/source/missing_data.rst
@@ -205,6 +205,33 @@ To remind you, these are the available filling methods:
With time series data, using pad/ffill is extremely common so that the "last
known value" is available at every time point.

Filling with a PandasObject
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.12

You can also fill using a direct assignment with an alignable object. A typical
use case is to fill the missing values in each column of a DataFrame with that column's mean.

.. ipython:: python

df = DataFrame(np.random.randn(10,3))
df.iloc[3:5,0] = np.nan
df.iloc[4:6,1] = np.nan
df.iloc[5:8,2] = np.nan
df

df.fillna(df.mean())

.. versionadded:: 0.13

This gives the same result as above, but aligns the 'fill' value, which is
a Series in this case.

.. ipython:: python

df.where(pd.notnull(df),df.mean(),axis='columns')
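
The claim that both approaches give the same result can be verified with a short sketch. It assumes a recent pandas release and uses a seeded random frame that is not part of the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values in each column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 3)))
df.iloc[3:5, 0] = np.nan
df.iloc[4:6, 1] = np.nan
df.iloc[5:8, 2] = np.nan

# Fill each column's gaps with that column's mean.
filled = df.fillna(df.mean())

# Same result via where: keep non-null entries, otherwise take the
# column mean, aligned on the column labels.
via_where = df.where(df.notna(), df.mean(), axis="columns")

pd.testing.assert_frame_equal(via_where, filled)
```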

.. _missing_data.dropna:

Dropping axis labels with missing data: dropna
2 changes: 2 additions & 0 deletions doc/source/release.rst
@@ -102,6 +102,8 @@ Improvements to existing features
tests/test_frame, tests/test_multilevel (:issue:`4732`).
- Performance improvement of timeseries plotting with PeriodIndex and added
test to vbench (:issue:`4705` and :issue:`4722`)
- Add ``axis`` and ``level`` keywords to ``where``, so that the ``other`` argument
can now be an alignable pandas object.

API Changes
~~~~~~~~~~~
19 changes: 13 additions & 6 deletions pandas/core/generic.py
@@ -2173,6 +2173,8 @@ def align(self, other, join='outer', axis=None, level=None, copy=True,
from pandas import DataFrame, Series
method = com._clean_fill_method(method)

if axis is not None:
axis = self._get_axis_number(axis)
if isinstance(other, DataFrame):
return self._align_frame(other, join=join, axis=axis, level=level,
copy=copy, fill_value=fill_value,
@@ -2262,7 +2264,8 @@ def _align_series(self, other, join='outer', axis=None, level=None,
else:
return left_result, right_result

def where(self, cond, other=np.nan, inplace=False, try_cast=False, raise_on_error=True):
def where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
try_cast=False, raise_on_error=True):
"""
Return an object of same shape as self and whose corresponding
entries are from self where cond is True and otherwise are from other.
@@ -2273,6 +2276,8 @@ def where(self, cond, other=np.nan, inplace=False, try_cast=False, raise_on_erro
other : scalar or DataFrame
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
@@ -2306,15 +2311,17 @@ def where(self, cond, other=np.nan, inplace=False, try_cast=False, raise_on_erro
# align with me
if other.ndim <= self.ndim:

_, other = self.align(other, join='left', fill_value=np.nan)
_, other = self.align(other, join='left',
axis=axis, level=level,
fill_value=np.nan)

# if we are NOT aligned, raise as we cannot where index
if not all([ other._get_axis(i).equals(ax) for i, ax in enumerate(self.axes) ]):
if axis is None and not all([ other._get_axis(i).equals(ax) for i, ax in enumerate(self.axes) ]):
raise InvalidIndexError

# slice me out of the other
else:
raise NotImplemented
raise NotImplementedError("cannot align with a bigger dimensional PandasObject")

elif is_list_like(other):

@@ -2386,11 +2393,11 @@ def where(self, cond, other=np.nan, inplace=False, try_cast=False, raise_on_erro
if inplace:
# we may have different type blocks come out of putmask, so
# reconstruct the block manager
self._data = self._data.putmask(cond, other, inplace=True)
self._data = self._data.putmask(cond, other, align=axis is None, inplace=True)

else:
new_data = self._data.where(
other, cond, raise_on_error=raise_on_error, try_cast=try_cast)
other, cond, align=axis is None, raise_on_error=raise_on_error, try_cast=try_cast)

return self._constructor(new_data)

59 changes: 46 additions & 13 deletions pandas/core/internals.py
@@ -593,29 +593,52 @@ def setitem(self, indexer, value):

return [ self ]

def putmask(self, mask, new, inplace=False):
def putmask(self, mask, new, align=True, inplace=False):
""" putmask the data to the block; it is possible that we may create a new dtype of block
return the resulting block(s) """
return the resulting block(s)

Parameters
----------
mask : the condition to respect
new : a ndarray/object
align : boolean, perform alignment on other/cond, default is True
inplace : perform inplace modification, default is False

Returns
-------
a new block(s), the result of the putmask
"""

new_values = self.values if inplace else self.values.copy()

# may need to align the new
if hasattr(new, 'reindex_axis'):
axis = getattr(new, '_info_axis_number', 0)
new = new.reindex_axis(self.items, axis=axis, copy=False).values.T
if align:
axis = getattr(new, '_info_axis_number', 0)
new = new.reindex_axis(self.items, axis=axis, copy=False).values.T
else:
new = new.values.T

# may need to align the mask
if hasattr(mask, 'reindex_axis'):
axis = getattr(mask, '_info_axis_number', 0)
mask = mask.reindex_axis(
self.items, axis=axis, copy=False).values.T
if align:
axis = getattr(mask, '_info_axis_number', 0)
mask = mask.reindex_axis(
self.items, axis=axis, copy=False).values.T
else:
mask = mask.values.T

# if we are passed a scalar None, convert it here
if not is_list_like(new) and isnull(new):
new = np.nan

if self._can_hold_element(new):
new = self._try_cast(new)

# pseudo-broadcast
if isinstance(new,np.ndarray) and new.ndim == self.ndim-1:
new = np.repeat(new,self.shape[-1]).reshape(self.shape)

np.putmask(new_values, mask, new)

# maybe upcast me
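
The "pseudo-broadcast" step in ``putmask`` can be illustrated with plain NumPy. This is a sketch of the idea only, not the pandas internals: the arrays below are made up, with ``values`` laid out as items by rows in the way a block stores its data.

```python
import numpy as np

# Hypothetical block values: 2 items (rows here) by 3 observations.
values = np.array([[1.0, -2.0, 3.0],
                   [-4.0, 5.0, -6.0]])

# One replacement value per item, i.e. one dimension fewer than values.
new = np.array([10.0, 20.0])

# Repeat each per-item value along the last axis so that new matches
# the shape of values element for element.
expanded = np.repeat(new, values.shape[-1]).reshape(values.shape)

# Overwrite the masked positions with the corresponding expanded value.
mask = values < 0
out = values.copy()
np.putmask(out, mask, expanded)
# out is [[1., 10., 3.], [20., 5., 20.]]
```

Expanding ``new`` first matters because ``np.putmask`` cycles through a shorter replacement array rather than broadcasting it along a chosen axis.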
@@ -842,14 +865,15 @@ def handle_error():

return [make_block(result, self.items, self.ref_items, ndim=self.ndim, fastpath=True)]

def where(self, other, cond, raise_on_error=True, try_cast=False):
def where(self, other, cond, align=True, raise_on_error=True, try_cast=False):
"""
evaluate the block; return result block(s) from the result

Parameters
----------
other : a ndarray/object
cond : the condition to respect
align : boolean, perform alignment on other/cond
raise_on_error : if True, raise when I can't perform the function, True by default (if False, just return
the data that we had coming in)

@@ -862,21 +886,30 @@ def where(self, other, cond, raise_on_error=True, try_cast=False):

# see if we can align other
if hasattr(other, 'reindex_axis'):
axis = getattr(other, '_info_axis_number', 0)
other = other.reindex_axis(self.items, axis=axis, copy=True).values
if align:
axis = getattr(other, '_info_axis_number', 0)
other = other.reindex_axis(self.items, axis=axis, copy=True).values
else:
other = other.values

# make sure that we can broadcast
is_transposed = False
if hasattr(other, 'ndim') and hasattr(values, 'ndim'):
if values.ndim != other.ndim or values.shape == other.shape[::-1]:
values = values.T
is_transposed = True

# pseudo-broadcast (it's a 2-d vs 1-d case, say, and where needs it in a specific direction)
if other.ndim >= 1 and values.ndim-1 == other.ndim and values.shape[0] != other.shape[0]:
other = _block_shape(other).T
else:
values = values.T
is_transposed = True

# see if we can align cond
if not hasattr(cond, 'shape'):
raise ValueError(
"where must have a condition that is ndarray like")
if hasattr(cond, 'reindex_axis'):

if align and hasattr(cond, 'reindex_axis'):
axis = getattr(cond, '_info_axis_number', 0)
cond = cond.reindex_axis(self.items, axis=axis, copy=True).values
else:
3 changes: 2 additions & 1 deletion pandas/core/series.py
@@ -2725,7 +2725,7 @@ def apply(self, func, convert_dtype=True, args=(), **kwds):
else:
return self._constructor(mapped, index=self.index, name=self.name)

def align(self, other, join='outer', level=None, copy=True,
def align(self, other, join='outer', axis=None, level=None, copy=True,
fill_value=None, method=None, limit=None):
"""
Align two Series object with the specified join method
@@ -2734,6 +2734,7 @@ def align(self, other, join='outer', level=None, copy=True,
----------
other : Series
join : {'outer', 'inner', 'left', 'right'}, default 'outer'
axis : None, alignment axis (is 0 for Series)
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level
29 changes: 29 additions & 0 deletions pandas/tests/test_frame.py
@@ -7931,6 +7931,35 @@ def test_where_none(self):
expected = DataFrame({'series': Series([0,1,2,3,4,5,6,7,np.nan,np.nan]) })
assert_frame_equal(df, expected)

    def test_where_align(self):

        def create():
            df = DataFrame(np.random.randn(10,3))
            df.iloc[3:5,0] = np.nan
            df.iloc[4:6,1] = np.nan
            df.iloc[5:8,2] = np.nan
            return df

        # series
        df = create()
        expected = df.fillna(df.mean())
        result = df.where(pd.notnull(df),df.mean(),axis='columns')
        assert_frame_equal(result, expected)

        df.where(pd.notnull(df),df.mean(),inplace=True,axis='columns')
        assert_frame_equal(df, expected)

        df = create().fillna(0)
        expected = df.apply(lambda x, y: x.where(x>0,y), y=df[0])
        result = df.where(df>0,df[0],axis='index')
        assert_frame_equal(result, expected)

        # frame
        df = create()
        expected = df.fillna(1)
        result = df.where(pd.notnull(df),DataFrame(1,index=df.index,columns=df.columns))
        assert_frame_equal(result, expected)

def test_mask(self):
df = DataFrame(np.random.randn(5, 3))
cond = df > 0