ENH: Add pipe method to GroupBy (fixes #10353) #10466

ghl3 · 2015-06-28T19:51:19Z

closes #10353, extends the "pipe" method to a GroupBy object to allow one to chain it with NDFrame.pipe calls

Moves the functionality of "pipe" from NDFrame into generics._pipe to avoid code duplication
Leverages this in GroupBy object
Adds unit test

jreback · 2015-06-28T20:13:54Z

pandas/core/generic.py

+    """
+    if isinstance(func, tuple):
+        func, target = func
+        if target in kwargs:


I would move this to here: https://github.com/pydata/pandas/blob/master/pandas/tools/util.py

and the tests to the analogous module

Moving the "_pipe" function from core/generic.py to tools.util.py makes sense. Is the second suggestion to move the NDFrame.pipe and GroupBy.pipe tests also to tools/tests? Or is it to add new stand-alone tests for tools.util._pipe in tools/tests (but to keep the tests on NDFrame and GroupBy where they are)?

you can leave the NDFrame/GroupBy tests where they are
I don't think there are any standalone tests

TomAugspurger · 2015-06-29T12:53:47Z

Sorry I wasn't able to comment on the other thread. Haven't looked at the code here either.

I think this is a good addition, but I've never had the need myself so I can't really speak from experience.

jreback · 2015-06-30T10:43:15Z

pandas/core/groupby.py

+        You can write
+
+        >>> (df.pipe(h)
+        ...    .pipe(g, arg1=a)


this example should be relevant to a groupby. Pls make an explicit test for the example as well.

jreback · 2015-08-05T21:25:04Z

can you rebase

jreback · 2015-08-16T00:06:48Z

can you rebase. we need a little section in the whatsnew for this. also pls add a small section (can be basically the same) to groupby.rst (and provide a link back to the main pipe usage in basics.rst).

ghl3 · 2015-08-17T02:28:23Z

Okay, based on your comments, I did the following:

Rebased against master
Added an additional test that covers argument passing (and presents a less contrived example of the 'pipe' method on a groupby)
Added a section to doc/source/groupby.rst that describes the pipe method
Added a similar section to doc/source/whatsnew/v0.17.0.txt

Remaining questions:

Is "v0.17.0.txt" the correct release file for the whatsnew?
Under whatsnew, I added the documentation under "enhancements", but would be happy to put it as a "new feature" if you think that's more accurate.
The example in the documentation is still very generic. If you'd prefer, I can give specific implementations of the f, g, and h functions to give more context or motivation to the groupby pipe method.

For the last point, I would simply use the implementations that are in the "tests.test_groupby:test_pipe_args" test that I'm adding here as well. I could also give some written description. But I hesitated to do that in this first pass because I wasn't sure whether that example was too specific for the level of detail in general documentation. I don't have strong opinions either way, though.

jreback · 2015-08-17T12:25:21Z

doc/source/groupby.rst

+
+Imagine that one has a function f that both takes and returns a ``DataFrameGroupBy``, a function g that takes a ``DataFrameGroupBy`` and a ``DataFrame``, and a function h that takes a ``DataFrame`` and returns a number.  Imagine further that each of these takes a single argument.
+
+Instead of having to deeply compose these functions and their arguments like the following


I think this should give an actual example of the utility here
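One way to make the f/g/h description concrete is sketched below. The three function bodies and the sample frame are invented for illustration, g is assumed to return a ``DataFrame``, and the chained form assumes a pandas release that ships `GroupBy.pipe`.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [10.0, 20.0, 30.0, 40.0],
})


def f(dfgb, cols):
    # DataFrameGroupBy -> DataFrameGroupBy restricted to the given columns
    return dfgb[cols]


def g(dfgb, how):
    # DataFrameGroupBy -> DataFrame with one row per group
    return dfgb.agg(how)


def h(frame, col):
    # DataFrame -> number
    return frame[col].sum()


# Deeply nested composition: reads inside-out
nested = h(g(f(df.groupby("A"), ["B"]), "max"), "B")

# Same computation as a left-to-right chain
piped = (df.groupby("A")
           .pipe(f, ["B"])
           .pipe(g, "max")
           .pipe(h, "B"))
```

The first two `.pipe` calls run on the GroupBy object and the last one on the resulting DataFrame, which is the chaining with `NDFrame.pipe` that the PR description mentions.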

jreback · 2015-08-17T12:28:05Z

where you put the docs is fine

jreback · 2015-08-26T01:39:23Z

@ghl3 can you update

jreback · 2015-09-01T11:59:30Z

@ghl3 can you rebase / update according to comments

ghl3 · 2015-09-01T12:15:04Z

Yes, will have an update in the next day or so. Apologies for the delay.

Best,

George


jreback · 2015-09-05T23:32:21Z

@ghl3 if you can update this would be gr8

ghl3 · 2015-09-06T17:39:50Z

@jreback Updates

Rebased against master
Added the "test" example to the documentation
Cleaned up the "test" example
The whatsnew now is shorter but provides a link to the docs

jreback · 2015-09-06T17:44:17Z

pandas/core/groupby.py

+        """
+        Apply func(self, \*args, \*\*kwargs)
+
+        .. versionadded:: 0.16.3


0.17.0

jreback · 2015-09-06T17:46:34Z

couple of comments. pls rebase and squash

jorisvandenbossche · 2015-09-06T22:24:13Z

doc/source/groupby.rst

+        Take a DataFrameGroupBy and return a Series
+        of counts of the values of column B
+        """
+        return dfgb['B'].value_counts()


I find this a bit of a strange (not very useful) example, as it just executes an existing DataFrameGroupBy method.

So in this case df.groupby('A').pipe(f) is the same as df.groupby('A')['B'].value_counts()

Yeah, that's a fair point. I think the main properties we want with an example here are:

A function that's not natively available on a GroupBy

A function that can't be expressed via an "apply" on the underlying DataFrames for each group

I think I have a slightly better example. In addition to these other comments, I'll push a better example and we can see if that's more useful for demonstrating the value of this feature.

ghl3 · 2015-09-07T18:50:46Z

Updates:

Addressed comments in documentation
Improved the example in the groupby documentation and in the first unit test to (hopefully) be more useful and instructive
Rebased and squashed all commits

jorisvandenbossche · 2015-09-09T22:22:26Z

doc/source/groupby.rst

+        value of column 'B' in each group minus the
+        global minimum of column 'C'.
+        """
+        return dfgb.B.max() - dfgb.C.min().min()


Not to be too nitpicky, but this example also looks easier without pipe to me:

df.groupby('A')['B'].max() - df['C'].min()

but OK, you can't do the above with one function call to the groupby object ..
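A quick check of that equivalence, as a sketch: the sample frame and the function name `max_b_minus_global_min_c` are made up here, and the `.pipe` call assumes a pandas release that ships `GroupBy.pipe`.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1.0, 5.0, 3.0, 2.0],
    "C": [4.0, 0.5, 2.0, 6.0],
})


def max_b_minus_global_min_c(dfgb):
    # Per-group max of B, minus the smallest of the per-group minima of C
    return dfgb["B"].max() - dfgb["C"].min().min()


via_pipe = df.groupby("A").pipe(max_b_minus_global_min_c)
direct = df.groupby("A")["B"].max() - df["C"].min()
```

Both expressions give one value per group; the difference is only that the piped version needs nothing but the GroupBy object, while the direct version also needs access to the original `df`.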

ghl3 · 2015-09-13T16:47:20Z

@jorisvandenbossche I've made the requested updates to the documentation style

jreback · 2015-09-13T18:02:03Z

doc/source/groupby.rst

+        """
+        Take a DataFrameGroupBy and return a Series
+        where each value corresponds to the maximum
+        value of column 'B' in each group minus the


this doesn't do what your text says. The 'global minimum of C' is just df['C'].min(), not dfgb.C.min().min(). I think that's an error actually, as dfgb.C.min() is a scalar.

Your words describe this:

In [27]: (g.B.transform('max')-g.C.min().min())**2
Out[27]:
0    8.17517
1    8.99110
2    8.17517
3    8.99110
4    8.17517
5    8.99110
6    8.17517
7    8.17517
dtype: float64

dfgb.C.min() is a series, I believe. It's the minimum value of column 'C' in each group. If I do dfgb.C.min().min(), I should get the global minimum of C, as I'm getting the minimum value of the minimum values in each group (essentially, this is the statement that "min" is associative).

import numpy as np
from pandas import DataFrame

random_state = np.random.RandomState(1234567890)
df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                'B': random_state.randn(8),
                'C': random_state.randn(8)})

df.C.min() == df.groupby('A').C.min().min()

returns True

That seems to make sense to me.

However,

df.groupby('A').B.transform('max')

is not the same as

df.groupby('A').B.max().

The former returns the same number of rows as the original dataframe, with each value equal to the max of B in whatever group that row falls into. The latter returns a Series with one row per group (Nrows = Ngroups of column A).
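The shape difference can be seen on a small frame. This is only an illustrative sketch; the data and variable names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": [1.0, 5.0, 3.0, 2.0, 0.0],
})
grouped = df.groupby("A")["B"]

# One value per original row, aligned with df.index
per_row = grouped.transform("max")

# One value per group, indexed by the group key
per_group = grouped.max()
```

Here `per_row` has five entries (one for each row of `df`), while `per_group` has two (one for each distinct value of column A).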

I agree that my text is somewhat ambiguous between whether I want my final DataFrame to contain len(df) rows or Ngroups rows. To me, it sounds more like the result should contain only 1 row per group, but I can try to update the text to make that even more explicit.

This example is what is run in the "test_pipe" unit test, which passes (I tested the asserted values in the test by hand and they are what you'd expect if you take my text to mean the Ngroups version)

@ghl3 I guess this is what @jorisvandenbossche brought up before. To me this is just a very confusing example, and not motivating for adding this. Can you craft something that a real-world user would actually do?

e.g. maybe something that you yourself have done?

Okay, sure. I'll describe it here before coding it up to ensure we're all on the same page.

There are two examples that come to mind. The first is a calculation of "gini_impurity", which is a common function used in statistics and classification (many of the examples in my head are related to classification since it involves discrete groups, and I find DataFrameGroupBys to be useful data structures in such problems). The other is a more general example of printing or plotting a summary of a grouped dataframe:

import pandas as pd

df = pd.DataFrame({'class': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'weight': [1, 2, 3, 4, 3, 6, 7, 4]})

def gini_impurity(dfgb):
    # Total weight across all groups; float() avoids integer division
    tot = float(dfgb.weight.sum().sum())
    p2 = 0.0
    for _, group in dfgb:
        # Squared share of the total weight held by this group
        p2 += (group.weight.sum() / tot) ** 2
    return 1.0 - p2

# Example A
df.groupby('class').pipe(gini_impurity)

def save_description_of_groups(dfgb, path):
    # Write a describe() summary of every group to a text file
    with open(path, 'w+') as f:
        for name, group in dfgb:
            f.write("{}:\n".format(name))
            f.write(group.describe().to_string())
            f.write("\n\n")

# Example B
df.groupby('class').pipe(save_description_of_groups, 'description.txt')

My initial motivation for not using these was that the first one is a bit domain specific and the second one complicates things by doing file I/O. It initially seemed that something simpler and more mathematical would be more appropriate for general documentation, but wanting something that's better motivated and more "real-world" certainly makes sense to me.

Would cleaned-up versions of either of the above be better, or at least are they on the right track?

jreback · 2015-09-17T17:29:10Z

@jorisvandenbossche what do you think here?

jreback · 2015-09-17T17:29:18Z

@ghl3 can you rebase pls

jreback · 2015-09-25T12:20:08Z

@ghl3 can you rebase this

shoyer · 2015-09-25T16:54:57Z

I actually think that this PR is a good example of where less documentation would be better. This is certainly a useful method, but the long motivating example mostly serves to overemphasize these niche uses.

jreback · 2015-09-25T17:17:40Z

well, I do appreciate good documentation. I think an example is key here. Otherwise people will be wondering why this exists and what it's for.

jreback · 2015-10-25T15:25:34Z

@ghl3 can you rebase / update

jreback · 2015-11-13T15:16:16Z

@ghl3 pls rebase / update

jreback · 2015-12-06T19:09:55Z

closing this, but pls reopen if you wish to update

jreback reviewed Jun 28, 2015

ghl3 force-pushed the groupby-pipe branch from bcb9b85 to a51cad7 Compare June 28, 2015 22:20

jreback added the API Design label Jun 30, 2015

jreback added this to the 0.17.0 milestone Jun 30, 2015

jreback reviewed Jun 30, 2015

ghl3 force-pushed the groupby-pipe branch from a51cad7 to b5b3c5c Compare August 16, 2015 15:21

jreback reviewed Aug 17, 2015

ghl3 force-pushed the groupby-pipe branch 2 times, most recently from 660cc2f to 253a2e6 Compare September 6, 2015 17:23

jreback reviewed Sep 6, 2015

pandas/core/groupby.py

"""

Apply func(self, \*args, \*\*kwargs)

.. versionadded:: 0.16.3

jreback Sep 6, 2015

0.17.0

jorisvandenbossche reviewed Sep 6, 2015

ghl3 force-pushed the groupby-pipe branch 2 times, most recently from cb2291a to 0a9b438 Compare September 7, 2015 18:48

jorisvandenbossche reviewed Sep 9, 2015

ENH: Add pipe method to GroupBy (fixes pandas-dev#10353)

b3686e1

ghl3 force-pushed the groupby-pipe branch from 0a9b438 to b3686e1 Compare September 13, 2015 16:46

jreback reviewed Sep 13, 2015

jreback modified the milestones: 0.17.0, 0.17.1 Sep 25, 2015

jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015

jreback added the Groupby label Nov 13, 2015

jreback closed this Dec 6, 2015

TomAugspurger mentioned this pull request Oct 13, 2017

ENH: Add .pipe to GroupBy objects #17863

Closed

topper-123 mentioned this pull request Oct 14, 2017

ENH: add GroupBy.pipe method #17871

Merged


