ENH: Add pipe method to GroupBy (fixes #10353) #10466

Closed
wants to merge 1 commit into from
61 changes: 61 additions & 0 deletions doc/source/groupby.rst
@@ -1002,6 +1002,67 @@ See the :ref:`visualization documentation<visualization.box>` for more.
to ``df.boxplot(by="g")``. See :ref:`here<visualization.box.return>` for
an explanation.


.. _groupby.pipe:

Piping function calls
~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.17.0

Similar to the functionality provided by ``DataFrame`` and ``Series``, functions
that take ``GroupBy`` objects can be chained together using the ``pipe`` method to
allow for a cleaner, more readable syntax.

Imagine that one had functions f, g, and h that each take a ``DataFrameGroupBy``
as well as a single argument and return a ``DataFrameGroupBy``, and one wanted
to apply these functions in succession to a grouped DataFrame. Instead of having
to deeply nest these functions and their arguments, such as:

.. code-block:: python

>>> h(g(f(df.groupby('group'), arg1), arg2), arg3)

one can write the following:

.. code-block:: python

>>> (df
...    .groupby('group')
...    .pipe(f, arg1)
...    .pipe(g, arg2)
...    .pipe(h, arg3))

For a more concrete example, imagine one wanted to group a DataFrame by column
'A' and take the square of the difference between the maximum value of 'B' in
each group and the overall minimum value of 'C' (across all groups). One could
write this as a pipeline of functions applied to the original DataFrame:

.. code-block:: python

def f(dfgb):
    """
    Take a DataFrameGroupBy and return a Series
    where each value corresponds to the maximum
    value of column 'B' in each group minus the
Contributor

this doesn't do what your text says. the 'global minimum of C' is just df['C'].min(), not dfgb.C.min().min(), I think that's an error actually, as dfgb.C.min() is a scalar

Contributor

Your words are describing this.

In [27]: (g.B.transform('max')-g.C.min().min())**2
Out[27]: 
0    8.17517
1    8.99110
2    8.17517
3    8.99110
4    8.17517
5    8.99110
6    8.17517
7    8.17517
dtype: float64

Author

dfgb.C.min() is a series, I believe. It's the minimum value of column 'C' in each group. If I do dfgb.C.min().min(), I should get the global minimum of C, as I'm getting the minimum value of the minimum values in each group (essentially, this is the statement that "min" is associative).

import numpy as np
from pandas import DataFrame

random_state = np.random.RandomState(1234567890)
df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                      'foo', 'bar', 'foo', 'foo'],
                'B': random_state.randn(8),
                'C': random_state.randn(8)})

df.C.min() == df.groupby('A').C.min().min()

returns True

That seems to make sense to me.

However,

df.groupby('A').B.transform('max')

is not the same as

df.groupby('A').B.max().

The former returns the same number of rows as the original DataFrame, with each value equal to the max of B in whatever group that row falls into. The latter returns a Series with Nrows = Ngroups (of column A).

I agree that my text is somewhat ambiguous about whether I want my final result to contain len(df) rows or Ngroups rows. To me, it sounds more like the result should contain only 1 row per group, but I can try to update the text to make that even more explicit.

This example is what is run in the "test_pipe" unit test, which passes (I tested the asserted values in the test by hand and they are what you'd expect if you take my text to mean the Ngroups version)
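
A minimal sketch of that distinction on a toy frame (this frame is just for illustration, not the one from the test):

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': [1.0, 4.0, 3.0, 2.0]})

# transform('max'): same length as df, each row holds its group's max of B
df.groupby('A').B.transform('max')   # -> [3.0, 4.0, 3.0, 4.0]

# max(): one value per group (Nrows == Ngroups)
df.groupby('A').B.max()              # -> bar: 4.0, foo: 3.0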

Contributor

@ghl3 I guess this is what @jorisvandenbossche brought up before. To me this is just a very confusing example, and not motivating for adding this. Can you craft something that a real-world user would actually do?

e.g. maybe something that you yourself have done?

Author

Okay, sure. I'll describe it here before coding it up to ensure we're all on the same page.

There are two examples that come to mind. The first is a calculation of "gini_impurity", which is a common function used in statistics and classification (many of the examples in my head are related to classification since it involves discrete groups, and I find DataFrameGroupBys to be useful data structures in such problems). The other is a more general example of printing or plotting a summary of a grouped DataFrame:

import pandas as pd

df = pd.DataFrame({'class': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'weight': [1, 2, 3, 4, 3, 6, 7, 4]})

def gini_impurity(dfgb):

    tot = dfgb.weight.sum().sum()

    p2 = 0
    for _, group in dfgb:
        p2 += (group.weight.sum() / tot) ** 2

    return 1.0 - p2

# Example A
df.groupby('class').pipe(gini_impurity)

def save_description_of_groups(dfgb, path):
    with open(path, 'w+') as f:
        for name, group in dfgb:
            f.write("{}:\n".format(name))
            f.write(group.describe().to_string())
            f.write("\n\n")

# Example B
df.groupby('class').pipe(save_description_of_groups, 'description.txt')
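
As a quick sanity check on Example A (assuming the grouping on 'class' as above): the per-class weight sums are 6, 13, and 11 out of a total of 30, so gini_impurity should return roughly 1 - ((6/30)**2 + (13/30)**2 + (11/30)**2), i.e. about 0.64.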

My initial motivation for not using these was that the first one is a bit domain specific and the second one complicates things by doing file I/O. It initially seemed that something simpler and more mathematical would be more appropriate for general documentation, but wanting something that's better motivated and more "real-world" certainly makes sense to me.

Would cleaned-up versions of either of the above be better, or at least are they on the right track?

    global minimum of column 'C'.
    """
    return dfgb.B.max() - dfgb.C.min().min()
Member

Not to be too nitpicky, but this example also looks easier without pipe to me?

df.groupby('A')['B'].max() - df['C'].min()

Member

but OK, you can't do the above with one function call to the groupby object ..


def square(srs):
    """
    Take a Series and transform it by
    squaring each value.
    """
    return srs ** 2

res = df.groupby('A').pipe(f).pipe(square)


For more details on pipeline functionality, see :ref:`here<basics.pipe>`.


Examples
--------

3 changes: 3 additions & 0 deletions doc/source/whatsnew/v0.17.0.txt
@@ -468,6 +468,9 @@ Other enhancements
- ``pd.read_csv`` can now read bz2-compressed files incrementally, and the C parser can read bz2-compressed files from AWS S3 (:issue:`11070`, :issue:`11072`).


- ``GroupBy`` objects now have a ``pipe`` method, similar to the one on ``DataFrame`` and ``Series``, that allows functions that take a ``GroupBy`` to be composed in a clean, readable syntax. See the :ref:`documentation <groupby.pipe>` for more.


.. _whatsnew_0170.api:

.. _whatsnew_0170.api_breaking:
14 changes: 4 additions & 10 deletions pandas/core/generic.py
@@ -26,6 +26,7 @@
AbstractMethodError)
import pandas.core.nanops as nanops
from pandas.util.decorators import Appender, Substitution, deprecate_kwarg
from pandas.tools.util import _pipe
from pandas.core import config


@@ -2169,7 +2170,7 @@ def sample(self, n=None, frac=None, replace=False, weights=None, random_state=No
-----

Use ``.pipe`` when chaining together functions that expect
on Series or DataFrames. Instead of writing
on Series, DataFrames, or GroupBys. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)

@@ -2191,22 +2192,15 @@ def sample(self, n=None, frac=None, replace=False, weights=None, random_state=No

See Also
--------
pandas.GroupBy.pipe
pandas.DataFrame.apply
pandas.DataFrame.applymap
pandas.Series.map
"""
)
@Appender(_shared_docs['pipe'] % _shared_doc_kwargs)
def pipe(self, func, *args, **kwargs):
    if isinstance(func, tuple):
        func, target = func
        if target in kwargs:
            msg = '%s is both the pipe target and a keyword argument' % target
            raise ValueError(msg)
        kwargs[target] = self
        return func(*args, **kwargs)
    else:
        return func(self, *args, **kwargs)
    return _pipe(self, func, *args, **kwargs)

#----------------------------------------------------------------------
# Attribute access
56 changes: 55 additions & 1 deletion pandas/core/groupby.py
@@ -14,13 +14,14 @@
from pandas.core.base import PandasObject
from pandas.core.categorical import Categorical
from pandas.core.frame import DataFrame
from pandas.core.generic import NDFrame
from pandas.core.generic import NDFrame, _pipe
from pandas.core.index import Index, MultiIndex, CategoricalIndex, _ensure_index
from pandas.core.internals import BlockManager, make_block
from pandas.core.series import Series
from pandas.core.panel import Panel
from pandas.util.decorators import (cache_readonly, Appender, make_signature,
deprecate_kwarg)
from pandas.tools.util import _pipe
import pandas.core.algorithms as algos
import pandas.core.common as com
from pandas.core.common import(_possibly_downcast_to_dtype, isnull,
@@ -1076,6 +1077,59 @@ def tail(self, n=5):
tail = obj[in_tail]
return tail

def pipe(self, func, *args, **kwargs):
""" Apply a function with arguments to this GroupBy object

.. versionadded:: 0.17.0

Parameters
----------
func : callable or tuple of (callable, string)
Function to apply to this GroupBy or, alternatively, a
``(callable, data_keyword)`` tuple where ``data_keyword`` is a
string indicating the keyword of ``callable`` that expects the
%(klass)s.
args : iterable, optional
positional arguments passed into ``func``.
kwargs : dict, optional
a dictionary of keyword arguments passed into ``func``.

Returns
-------
object : the return type of ``func``.

Notes
-----

Use ``.pipe`` when chaining together functions that expect
a GroupBy, or when alternating between functions that take
a DataFrame and a GroupBy.

Assuming that one has a function f that takes and returns
a DataFrameGroupBy, a function g that takes a DataFrameGroupBy
and returns a DataFrame, and a function h that takes a DataFrame,
instead of having to write:

>>> h(g(f(df.groupby('group'), arg1), arg2), arg3)

You can write

>>> (df
... .groupby('group')
... .pipe(f, arg1)
... .pipe(g, arg2)
... .pipe(h, arg3))


See Also
--------
pandas.Series.pipe
pandas.DataFrame.pipe
pandas.GroupBy.apply
"""
return _pipe(self, func, *args, **kwargs)


def _cumcount_array(self, arr=None, ascending=True):
"""
arr is where cumcount gets its values from
63 changes: 62 additions & 1 deletion pandas/tests/test_groupby.py
@@ -5159,7 +5159,7 @@ def test_tab_completion(self):
'resample', 'cummin', 'fillna', 'cumsum', 'cumcount',
'all', 'shift', 'skew', 'bfill', 'ffill',
'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith',
'cov', 'dtypes', 'diff', 'idxmax', 'idxmin'
'cov', 'dtypes', 'diff', 'idxmax', 'idxmin', 'pipe'
])
self.assertEqual(results, expected)

@@ -5467,6 +5467,7 @@ def test_func(x):
expected = DataFrame()
tm.assert_frame_equal(result, expected)


def test_first_last_max_min_on_time_data(self):
# GH 10295
# Verify that NaT is not in the result of max, min, first and last on
@@ -5512,6 +5513,66 @@ def test_sort(x):
g.apply(test_sort)


def test_pipe(self):
# Test the pipe method of DataFrameGroupBy.
# Issue #10353

random_state = np.random.RandomState(1234567890)

df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': random_state.randn(8),
'C': random_state.randn(8)})

def f(dfgb):
    return dfgb.B.max() - dfgb.C.min().min()

def square(srs):
    return srs ** 2

# Note that the transformations are
# GroupBy -> Series
# Series -> Series
# This then chains the GroupBy.pipe and the
# NDFrame.pipe methods
res = df.groupby('A').pipe(f).pipe(square)

index = Index([u'bar', u'foo'], dtype='object', name=u'A')
expected = pd.Series([8.99110003361, 8.17516964785], name='B', index=index)

assert_series_equal(expected, res)


def test_pipe_args(self):
# Test passing args to the pipe method of DataFrameGroupBy.
# Issue #10353

df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C'],
'x': [1.0, 2.0, 3.0, 2.0, 5.0],
'y': [10.0, 100.0, 1000.0, -100.0, -1000.0]})

def f(dfgb, arg1):
return dfgb.filter(lambda grp: grp.y.mean() > arg1, dropna=False).groupby(dfgb.grouper)

def g(dfgb, arg2):
return dfgb.sum() / dfgb.sum().sum() + arg2

def h(df, arg3):
return df.x + df.y - arg3

res = (df
.groupby('group')
.pipe(f, 0)
.pipe(g, 10)
.pipe(h, 100))

# Assert the results here
index = pd.Index(['A', 'B', 'C'], name='group')
expected = pd.Series([-79.5160891089, -78.4839108911, None], index=index)

assert_series_equal(expected, res)


def assert_fp_equal(a, b):
assert (np.abs(a - b) < 1e-12).all()

22 changes: 22 additions & 0 deletions pandas/tools/util.py
@@ -48,3 +48,25 @@ def compose(*funcs):
"""Compose 2 or more callables"""
assert len(funcs) > 1, 'At least 2 callables must be passed to compose'
return reduce(_compose2, funcs)


def _pipe(obj, func, *args, **kwargs):
"""
Apply a function to a obj either by
passing the obj as the first argument
to the function or, in the case that
the func is a tuple, interpret the first
element of the tuple as a function and
pass the obj to that function as a keyword
arguemnt whose key is the value of the
second element of the tuple
"""
    if isinstance(func, tuple):
        func, target = func
        if target in kwargs:
            msg = '%s is both the pipe target and a keyword argument' % target
            raise ValueError(msg)
        kwargs[target] = obj
        return func(*args, **kwargs)
    else:
        return func(obj, *args, **kwargs)
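
A minimal usage sketch of the ``(callable, data_keyword)`` form handled above (illustrative only, not part of the diff; ``summarize`` is a hypothetical helper):

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'],
                   'x': [1, 2, 5]})

def summarize(prefix, data):
    # 'data' receives the GroupBy because of the (summarize, 'data') tuple below
    return {name: prefix + str(grp.x.sum()) for name, grp in data}

# Passing the GroupBy positionally vs. routing it to the 'data' keyword via pipe
res1 = summarize('total=', df.groupby('group'))
res2 = df.groupby('group').pipe((summarize, 'data'), 'total=')
assert res1 == res2 == {'a': 'total=3', 'b': 'total=5'}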