-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
ENH: Add pipe method to GroupBy (fixes #10353) #10466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
""" | ||
if isinstance(func, tuple): | ||
func, target = func | ||
if target in kwargs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this to here: https://github.com/pydata/pandas/blob/master/pandas/tools/util.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and tests the the analogous module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the "_pipe" function from core/generic.py to tools.util.py makes sense. Is the second suggestion to move the NDFrame.pipe and GroupBy.pipe tests also to tools/tests? Or is it to add new stand-alone tests for tools.util._pipe in tools/tests (but to keep the tests on NDFrame and GroupBy where they are)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can leave the ndframe/group u tests where they are
I don't there are any standalone tests
Sorry I wasn't able to comment on the other thread. Haven't looked at the code here either. I think this is a good addition, but I've never had the need myself so I can't really speak from experience. |
You can write | ||
|
||
>>> (df.pipe(h) | ||
... .pipe(g, arg1=a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example should be relevant to a groupby. Pls test make an explicit test for the example as well.
can you rebase |
can you rebase. we need a little section in the whatsnew for this. also pls add a small section (can be basically the same) to groupby.rst (and provide a link back to the main pipe usage in basics.rst). |
Okay, based on your comments, I did the following:
Remaining questions:
For the last point, I would simply use those implementations that are in the "tests.test_groupby:test_pipe_args" test that I'm adding here as well. I could also give some written description. But I hesitated to do that in this first pass because I wasn't sure that example was too specific for the level of detail in general documentation. I don't have strong opinions either way, though. |
|
||
Imagine that one has a function f that both takes and returns a ``DataFrameGroupBy``, a function g that takes a ``DataFrameGroupBy`` and a ``DataFrame``, and a function h that takes a ``DataFrame`` and returns a number. Imagine further that each of these takes a single argument. | ||
|
||
Instead of having to deeply compose these functions and their arguments like the following |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should give an actual example of the utility here
where you put the docs are fine |
@ghl3 can you update |
@ghl3 can you rebase / update according to comments |
Yes, will have an update in the next day or so. Apologies for the delay. Best,
|
@ghl3 if you can update this would be gr8 |
660cc2f
to
253a2e6
Compare
@jreback Updates
|
""" | ||
Apply func(self, \*args, \*\*kwargs) | ||
|
||
.. versionadded:: 0.16.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.17.0
couple of comments. pls rebase and squash |
Take a DataFrameGroupBy and return a Series | ||
of counts of the values of column B | ||
""" | ||
return dfgb['B'].value_counts() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this a bit a strange (not very useful) example, as this is executing an actual method of a DataFrameGroupBy method.
So in this case df.groupby('A').pipe(f)
is the same as df.groupby('A')['B'].value_counts()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a fair point. I think the main properties we want with an example here are:
- A function that's not natively available on a GroupBy
- A function that can't be expressed via an "apply" on the underlying DataFrames for each group
I think I have a slightly better example. In addition to these other comments, I'll push a better example and we can see if that's more useful for demonstrating the value of this feature.
cb2291a
to
0a9b438
Compare
Updates:
|
value of column 'B' in each group minus the | ||
global minimum of column 'C'. | ||
""" | ||
return dfgb.B.max() - dfgb.C.min().min() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to be too nitpicky, but also this example looks more easy without pipe to me?
df.groupby('A')['B'].max() - df['C'].min()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but OK, you can't do the above with one function call to the groupby object ..
@jorisvandenbossche I've made the requested updates to the documentation style |
""" | ||
Take a DataFrameGroupBy and return a Series | ||
where each value corresponds to the maximum | ||
value of column 'B' in each group minus the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't do what your text says. the 'global mininum of C' is just df['C'].min()
, not dfgb.C.min().min()
, I think that's an error actually, as dfgb.C.min()
is a scalar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your words are describing this.
In [27]: (g.B.transform('max')-g.C.min().min())**2
Out[27]:
0 8.17517
1 8.99110
2 8.17517
3 8.99110
4 8.17517
5 8.99110
6 8.17517
7 8.17517
dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dfgb.C.min() is a series, I believe. It's the minimum value of column 'C' in each group. If I do dfgb.C.min().min(), I should get the global minimum of C, as I'm getting the minimum value of the minimum values in each group (essentially, this is the statement that "min" is associative).
import numpy as np
from pandas import DataFrame
random_state = np.random.RandomState(1234567890)
df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': random_state.randn(8),
'C': random_state.randn(8)})
df.C.min() == df.groupby('A').C.min().min()
returns True
That seems to make sense to me.
However,
df.groupby('A').B.transform('max')
is not the same as
df.groupby('A').B.max().
The former returns the same number or rows as the original dataframe but with each value equal to the max of B in whatever group that row falls into. The latter returns a dataframe with Nrows = Ngroups (of column A).
I agree that my text is somewhat ambiguous between whether I want my final DataFrame to contain len(df) rows or Ngroups rows. To me, it sounds more like the result should contain only 1 row per group, but I can try to update the text to make that even more explicit.
This example is what is run in the "test_pipe" unit test, which passes (I tested the asserted values in the test by hand and they are what you'd expect if you take my text to mean the Ngroups version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghl3 I guess what @jorisvandenbossche brought up before. Is this to me is just a very confusing example, and not motivating for adding this. Can you craft something that a real world user would actually do?
e.g. maybe something that you yourself have done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, sure. I'll describe it here before coding it up to ensure we're all on the same page.
There are two examples that come to mind. The first is a calculation of "gini_impurity", which is a common function used in statistics and classification (many of the examples in my head are related to classification since it involves discrete groups, and I find DataFrameGroupBys to be useful data structures in such problems). The other is a more general example of printing or plotting a summary of grouped dataframe:
import pandas as pd
df = pd.DataFrame({'class': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'weight': [1, 2, 3, 4, 3, 6, 7, 4]})
def gini_impurity(dfgb):
tot = dfgb.weight.sum().sum()
p2 = 0
for _, group in dfgb:
p2 += (group.weight.sum() / tot) ** 2
return 1.0 - p2
# Example A
df.groupby('A').pipe(gini_impurity)
def save_description_of_groups(dfgb, path):
with open(path, 'w+') as f:
for name, group in dfgb:
f.write("{}:\n".format(name))
f.write(group.describe().to_string())
f.write("\n\n")
# Example B
df.groupby('A').pipe(save_description_of_groups, 'description.txt')
My initial motivation for not using these was that the first one is a bit domain specific and the second one complicates things by doing file I/O. It initially seemed that something simpler and more mathematical would be more appropriate for general documentation, but wanting something that's better motivated and more "real-world" certainly makes sense to me.
Would cleaned-up versions of either of the above be better, or at least are they on the right track?
@jorisvandenbossche what do you think here? |
@ghl3 can you rebase pls |
@ghl3 can you rebase this |
I actually think that this PR is a good example of where less documentation would be better. This is certainly a useful method, but the long motivating example mostly serves to over emphasize these niche uses. |
well, I do appreciate good documentation. I think an example is key here. Otherwise people will be wondering why this exists and what its for. |
@ghl3 can you rebase / update |
@ghl3 pls rebase / update |
closing this, but pls reopen if you wish to update |
closes #10353, extends the "pipe" method to a GroupBy object to allow one to chain it with NDFrame.pipe calls