Skip to content

ENH: Add pipe method to GroupBy (fixes #10353) #10466

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ghl3
Copy link

@ghl3 ghl3 commented Jun 28, 2015

closes #10353, extends the "pipe" method to a GroupBy object to allow one to chain it with NDFrame.pipe calls

  • Moves the functionality of "pipe" from NDFrame into generics._pipe to avoid code duplication
  • Leverages this in GroupBy object
  • Adds unit test

"""
if isinstance(func, tuple):
func, target = func
if target in kwargs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and tests the the analogous module

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the "_pipe" function from core/generic.py to tools.util.py makes sense. Is the second suggestion to move the NDFrame.pipe and GroupBy.pipe tests also to tools/tests? Or is it to add new stand-alone tests for tools.util._pipe in tools/tests (but to keep the tests on NDFrame and GroupBy where they are)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can leave the ndframe/group u tests where they are
I don't there are any standalone tests

@TomAugspurger
Copy link
Contributor

Sorry I wasn't able to comment on the other thread. Haven't looked at the code here either.

I think this is a good addition, but I've never had the need myself so I can't really speak from experience.

@jreback jreback added this to the 0.17.0 milestone Jun 30, 2015
You can write

>>> (df.pipe(h)
... .pipe(g, arg1=a)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example should be relevant to a groupby. Pls test make an explicit test for the example as well.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2015

can you rebase

@jreback
Copy link
Contributor

jreback commented Aug 16, 2015

can you rebase. we need a little section in the whatsnew for this. also pls add a small section (can be basically the same) to groupby.rst (and provide a link back to the main pipe usage in basics.rst).

@ghl3
Copy link
Author

ghl3 commented Aug 17, 2015

Okay, based on your comments, I did the following:

  • Rebased against master
  • Added an additional tests that covered argument passing (and presents a less contrived example of the 'pipe' method on a groupby)
  • Added a section to doc/source/groupby.rst that describes the pipe method
  • Added a similar section to doc/source/whatsnew/v0.17.0.txt

Remaining questions:

  • Is the "v0.17.0.tx" the correct release file for the whatsnew?
  • Under whatsnew, I added the documentation under "enhancements", but would be happy to put it as a "new feature" if you think that's more accurate.
  • The example in the documentation is still very generic. If you'd prefer, I can give specific implementations of the f, g, and h functions to give more context or motivation to the groupby pipe method.

For the last point, I would simply use those implementations that are in the "tests.test_groupby:test_pipe_args" test that I'm adding here as well. I could also give some written description. But I hesitated to do that in this first pass because I wasn't sure that example was too specific for the level of detail in general documentation. I don't have strong opinions either way, though.


Imagine that one has a function f that both takes and returns a ``DataFrameGroupBy``, a function g that takes a ``DataFrameGroupBy`` and a ``DataFrame``, and a function h that takes a ``DataFrame`` and returns a number. Imagine further that each of these takes a single argument.

Instead of having to deeply compose these functions and their arguments like the following
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should give an actual example of the utility here

@jreback
Copy link
Contributor

jreback commented Aug 17, 2015

where you put the docs are fine

@jreback
Copy link
Contributor

jreback commented Aug 26, 2015

@ghl3 can you update

@jreback
Copy link
Contributor

jreback commented Sep 1, 2015

@ghl3 can you rebase / update according to comments

@ghl3
Copy link
Author

ghl3 commented Sep 1, 2015

Yes, will have an update in the next day or so. Apologies for the delay.

Best,

  • George

On Sep 1, 2015, at 7:59 AM, Jeff Reback notifications@github.com wrote:

@ghl3 https://github.com/ghl3 can you rebase / update according to comments


Reply to this email directly or view it on GitHub #10466 (comment).

@jreback
Copy link
Contributor

jreback commented Sep 5, 2015

@ghl3 if you can update this would be gr8

@ghl3 ghl3 force-pushed the groupby-pipe branch 2 times, most recently from 660cc2f to 253a2e6 Compare September 6, 2015 17:23
@ghl3
Copy link
Author

ghl3 commented Sep 6, 2015

@jreback Updates

  • Rebased against master
  • Added the "test" example to the documentation
  • Cleaned up the "test" example
  • The whatsnew now is shorter but provides a link to the docs

"""
Apply func(self, \*args, \*\*kwargs)

.. versionadded:: 0.16.3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.17.0

@jreback
Copy link
Contributor

jreback commented Sep 6, 2015

couple of comments. pls rebase and squash

Take a DataFrameGroupBy and return a Series
of counts of the values of column B
"""
return dfgb['B'].value_counts()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this a bit a strange (not very useful) example, as this is executing an actual method of a DataFrameGroupBy method.

So in this case df.groupby('A').pipe(f) is the same as df.groupby('A')['B'].value_counts()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a fair point. I think the main properties we want with an example here are:

  • A function that's not natively available on a GroupBy
  • A function that can't be expressed via an "apply" on the underlying DataFrames for each group

I think I have a slightly better example. In addition to these other comments, I'll push a better example and we can see if that's more useful for demonstrating the value of this feature.

@ghl3 ghl3 force-pushed the groupby-pipe branch 2 times, most recently from cb2291a to 0a9b438 Compare September 7, 2015 18:48
@ghl3
Copy link
Author

ghl3 commented Sep 7, 2015

Updates:

  • Addressed comments in documentation
  • Improved the example in the groupby documentation and in the first unit test to (hopefully) be more useful and instructive
  • Rebased and squashed all commits

value of column 'B' in each group minus the
global minimum of column 'C'.
"""
return dfgb.B.max() - dfgb.C.min().min()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to be too nitpicky, but also this example looks more easy without pipe to me?

df.groupby('A')['B'].max() - df['C'].min()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but OK, you can't do the above with one function call to the groupby object ..

@ghl3
Copy link
Author

ghl3 commented Sep 13, 2015

@jorisvandenbossche I've made the requested updates to the documentation style

"""
Take a DataFrameGroupBy and return a Series
where each value corresponds to the maximum
value of column 'B' in each group minus the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't do what your text says. the 'global mininum of C' is just df['C'].min(), not dfgb.C.min().min(), I think that's an error actually, as dfgb.C.min() is a scalar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your words are describing this.

In [27]: (g.B.transform('max')-g.C.min().min())**2
Out[27]: 
0    8.17517
1    8.99110
2    8.17517
3    8.99110
4    8.17517
5    8.99110
6    8.17517
7    8.17517
dtype: float64

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dfgb.C.min() is a series, I believe. It's the minimum value of column 'C' in each group. If I do dfgb.C.min().min(), I should get the global minimum of C, as I'm getting the minimum value of the minimum values in each group (essentially, this is the statement that "min" is associative).

import numpy as np
from pandas import DataFrame

random_state = np.random.RandomState(1234567890)
df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': random_state.randn(8),
                    'C': random_state.randn(8)})

df.C.min() == df.groupby('A').C.min().min()

returns True

That seems to make sense to me.

However,

df.groupby('A').B.transform('max')

is not the same as

df.groupby('A').B.max().

The former returns the same number or rows as the original dataframe but with each value equal to the max of B in whatever group that row falls into. The latter returns a dataframe with Nrows = Ngroups (of column A).

I agree that my text is somewhat ambiguous between whether I want my final DataFrame to contain len(df) rows or Ngroups rows. To me, it sounds more like the result should contain only 1 row per group, but I can try to update the text to make that even more explicit.

This example is what is run in the "test_pipe" unit test, which passes (I tested the asserted values in the test by hand and they are what you'd expect if you take my text to mean the Ngroups version)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghl3 I guess what @jorisvandenbossche brought up before. Is this to me is just a very confusing example, and not motivating for adding this. Can you craft something that a real world user would actually do?

e.g. maybe something that you yourself have done?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, sure. I'll describe it here before coding it up to ensure we're all on the same page.

There are two examples that come to mind. The first is a calculation of "gini_impurity", which is a common function used in statistics and classification (many of the examples in my head are related to classification since it involves discrete groups, and I find DataFrameGroupBys to be useful data structures in such problems). The other is a more general example of printing or plotting a summary of grouped dataframe:

import pandas as pd

df = pd.DataFrame({'class': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                                 'weight': [1, 2, 3, 4, 3, 6, 7, 4]})

def gini_impurity(dfgb):

    tot = dfgb.weight.sum().sum()

    p2 = 0
    for _, group in dfgb:
        p2 += (group.weight.sum() / tot) ** 2

return 1.0 - p2

# Example A
df.groupby('A').pipe(gini_impurity)

def save_description_of_groups(dfgb, path):
    with open(path, 'w+') as f:
        for name, group in dfgb:
            f.write("{}:\n".format(name))
            f.write(group.describe().to_string())
            f.write("\n\n")

# Example B
df.groupby('A').pipe(save_description_of_groups, 'description.txt')

My initial motivation for not using these was that the first one is a bit domain specific and the second one complicates things by doing file I/O. It initially seemed that something simpler and more mathematical would be more appropriate for general documentation, but wanting something that's better motivated and more "real-world" certainly makes sense to me.

Would cleaned-up versions of either of the above be better, or at least are they on the right track?

@jreback
Copy link
Contributor

jreback commented Sep 17, 2015

@jorisvandenbossche what do you think here?

@jreback
Copy link
Contributor

jreback commented Sep 17, 2015

@ghl3 can you rebase pls

@jreback
Copy link
Contributor

jreback commented Sep 25, 2015

@ghl3 can you rebase this

@shoyer
Copy link
Member

shoyer commented Sep 25, 2015

I actually think that this PR is a good example of where less documentation would be better. This is certainly a useful method, but the long motivating example mostly serves to over emphasize these niche uses.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2015

well, I do appreciate good documentation. I think an example is key here. Otherwise people will be wondering why this exists and what its for.

@jreback jreback modified the milestones: 0.17.0, 0.17.1 Sep 25, 2015
@jreback
Copy link
Contributor

jreback commented Oct 25, 2015

@ghl3 can you rebase / update

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015
@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

@ghl3 pls rebase / update

@jreback
Copy link
Contributor

jreback commented Dec 6, 2015

closing this, but pls reopen if you wish to update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Add pipe method to GroupBy objects
5 participants