ENH: Add pipe method to GroupBy (fixes #10353) #10466
@@ -1002,6 +1002,67 @@ See the :ref:`visualization documentation<visualization.box>` for more.
to ``df.boxplot(by="g")``. See :ref:`here<visualization.box.return>` for
an explanation.

.. _groupby.pipe:

Piping function calls
~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.17.0

Similar to the functionality provided by ``DataFrame`` and ``Series``, functions
that take ``GroupBy`` objects can be chained together using a ``pipe`` method to
allow for a cleaner, more readable syntax.

Imagine that one had functions ``f``, ``g``, and ``h`` that each take a
``DataFrameGroupBy`` as well as a single argument and return a
``DataFrameGroupBy``, and one wanted to apply these functions in succession to a
grouped DataFrame. Instead of having to deeply nest these functions and their
arguments, such as:

.. code-block:: python

   >>> h(g(f(df.groupby('group'), arg1), arg2), arg3)

one can write the following:

.. code-block:: python

   >>> (df
   ...    .groupby('group')
   ...    .pipe(f, arg1)
   ...    .pipe(g, arg2)
   ...    .pipe(h, arg3))

For a more concrete example, suppose one wanted to group a DataFrame by column
'A' and take the square of the difference between the maximum value of 'B' in
each group and the overall minimum value of 'C' (across all groups). The result
has one row per group. One could write this as a pipeline of functions applied
to the grouped DataFrame:

.. code-block:: python

   def f(dfgb):
       """
       Take a DataFrameGroupBy and return a Series
       where each value corresponds to the maximum
       value of column 'B' in each group minus the
       global minimum of column 'C'.
       """
       return dfgb.B.max() - dfgb.C.min().min()
Inline review comment: Not to be too nitpicky, but this example also looks easier without pipe to me?
Inline review comment: But OK, you can't do the above with one function call to the groupby object.

   def square(srs):
       """
       Take a Series and transform it by
       squaring each value.
       """
       return srs ** 2

   res = df.groupby('A').pipe(f).pipe(square)


For more details on pipeline functionality, see :ref:`here<basics.pipe>`.


Examples
--------

This doesn't do what your text says. The 'global minimum of C' is just `df['C'].min()`, not `dfgb.C.min().min()`. I think that's an error actually, as `dfgb.C.min()` is a scalar.
Your words are describing this.
dfgb.C.min() is a series, I believe. It's the minimum value of column 'C' in each group. If I do dfgb.C.min().min(), I should get the global minimum of C, as I'm getting the minimum value of the minimum values in each group (essentially, this is the statement that "min" is associative).
returns True
That seems to make sense to me.
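As a minimal sketch of that check, the equality can be confirmed on a small made-up frame. The column values below are assumptions chosen for illustration, not data from the PR:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"A": ["x", "x", "y", "y"],
                   "B": [1, 5, 2, 8],
                   "C": [4, 3, 9, 7]})
dfgb = df.groupby("A")

# Minimum of 'C' within each group: a Series with one entry per group.
per_group_min = dfgb.C.min()

# Taking the minimum of the per-group minima gives the global minimum.
global_min_via_groups = per_group_min.min()
same = global_min_via_groups == df.C.min()
```

Here `per_group_min` is a Series indexed by the values of 'A', so calling `.min()` on it a second time reduces it to a single number.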
However,
is not the same as
The former returns the same number of rows as the original dataframe, with each value equal to the max of B in whatever group that row falls into. The latter returns a result with Nrows = Ngroups (of column A).
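A rough sketch of that distinction, assuming the two expressions being compared are a `transform("max")` and a plain `max()` on the grouped column (the exact snippets are an assumption here), with made-up data:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"A": ["x", "x", "y", "y"],
                   "B": [1, 5, 2, 8]})

# Broadcast form: one value per original row, each set to its group's max.
broadcast = df.groupby("A").B.transform("max")

# Reduced form: one value per group, indexed by the group key.
reduced = df.groupby("A").B.max()
```

With this frame, `broadcast` has `len(df)` entries and `reduced` has one entry per distinct value of 'A'.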
I agree that my text is somewhat ambiguous between whether I want my final DataFrame to contain len(df) rows or Ngroups rows. To me, it sounds more like the result should contain only 1 row per group, but I can try to update the text to make that even more explicit.
This example is what is run in the "test_pipe" unit test, which passes (I tested the asserted values in the test by hand and they are what you'd expect if you take my text to mean the Ngroups version)
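For reference, a self-contained sketch of the documented example under the Ngroups reading. The frame `df` is made up here and is not the data used in `test_pipe`:

```python
import pandas as pd

# Hypothetical data, not the frame used in the test_pipe unit test.
df = pd.DataFrame({"A": ["x", "x", "y", "y"],
                   "B": [1, 5, 2, 8],
                   "C": [4, 3, 9, 7]})


def f(dfgb):
    """Per-group maximum of 'B' minus the global minimum of 'C'."""
    return dfgb.B.max() - dfgb.C.min().min()


def square(srs):
    """Square each value of a Series."""
    return srs ** 2


# One row per group of 'A': (max of B in the group - global min of C) ** 2
res = df.groupby("A").pipe(f).pipe(square)
```

The result is a Series with one entry per group, which matches the one-row-per-group interpretation of the text.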
@ghl3 I guess this is what @jorisvandenbossche brought up before: to me this is just a very confusing example, and not motivating for adding this. Can you craft something that a real-world user would actually do?
e.g. maybe something that you yourself have done?
Okay, sure. I'll describe it here before coding it up to ensure we're all on the same page.
There are two examples that come to mind. The first is a calculation of "gini_impurity", which is a common function used in statistics and classification (many of the examples in my head are related to classification since it involves discrete groups, and I find DataFrameGroupBys to be useful data structures in such problems). The other is a more general example of printing or plotting a summary of a grouped dataframe:
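As a rough illustration of the first idea only (not the code originally posted), Gini impurity per group could be sketched as a function that takes a `DataFrameGroupBy` and is applied with `pipe`. The function name, the column names `leaf` and `label`, and the data are all hypothetical:

```python
import pandas as pd


def gini_impurity(dfgb, label_col):
    """
    Hypothetical helper: take a DataFrameGroupBy and return a Series
    holding the Gini impurity of `label_col` within each group.
    """
    def impurity(labels):
        # Fraction of rows in the group belonging to each class.
        p = labels.value_counts(normalize=True)
        return 1.0 - (p ** 2).sum()

    return dfgb[label_col].apply(impurity)


# Made-up data: two groups ("leaves") with binary class labels.
df = pd.DataFrame({"leaf": ["a", "a", "b", "b"],
                   "label": [0, 1, 1, 1]})

impurity_by_leaf = df.groupby("leaf").pipe(gini_impurity, "label")
```

Group "a" is evenly split between the two classes and group "b" is pure, so the impurity is largest for "a" and zero for "b".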
My initial motivation for not using these was that the first one is a bit domain specific and the second one complicates things by doing file I/O. It initially seemed that something simpler and more mathematical would be more appropriate for general documentation, but wanting something that's better motivated and more "real-world" certainly makes sense to me.
Would cleaned-up versions of either of the above be better, or at least are they on the right track?