From bcdbf7323f66243cae655dd6995dd4fd14eac8b4 Mon Sep 17 00:00:00 2001 From: Richard Shadrach Date: Fri, 24 Feb 2023 18:10:48 -0500 Subject: [PATCH 1/5] DOC: Overhaul groupby.rst in the User Guide --- doc/source/user_guide/groupby.rst | 572 ++++++++++++++++++------------ 1 file changed, 345 insertions(+), 227 deletions(-) diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index 2fdd36d861e15..d7e37a30e1cc8 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -36,9 +36,22 @@ following: * Discard data that belongs to groups with only a few members. * Filter out data based on the group sum or mean. -* Some combination of the above: GroupBy will examine the results of the apply - step and try to return a sensibly combined result if it doesn't fit into - either of the above two categories. +Many of these operations are defined on GroupBy objects. These operations are similar +to the :ref:`aggregating API `, :ref:`window API `, +and :ref:`resample API `. + +It is possible that a given operation does not fall into one of these categories or +is some combination of them. In such a case, it may be possible to compute the +operation using GroupBy's ``apply`` method. This method will examine the results of the +apply step and try to return a sensibly combined result if it doesn't fit into either +of the above two categories. + +.. note:: + + An operation that is split into multiple steps using built-in GroupBy operations + will be more efficient than using the ``apply`` method with a user-defined Python + function. + Since the set of object instance methods on pandas data structures are generally rich and expressive, we often simply want to invoke, say, a DataFrame function @@ -68,7 +81,7 @@ object (more on what the GroupBy object is later), you may do the following: .. 
ipython:: python - df = pd.DataFrame( + speeds = pd.DataFrame( [ ("bird", "Falconiformes", 389.0), ("bird", "Psittaciformes", 24.0), @@ -79,12 +92,12 @@ object (more on what the GroupBy object is later), you may do the following: index=["falcon", "parrot", "lion", "monkey", "leopard"], columns=("class", "order", "max_speed"), ) - df + speeds # default is axis=0 - grouped = df.groupby("class") - grouped = df.groupby("order", axis="columns") - grouped = df.groupby(["class", "order"]) + grouped = speeds.groupby("class") + grouped = speeds.groupby("order", axis="columns") + grouped = speeds.groupby(["class", "order"]) The mapping can be specified many different ways: @@ -465,41 +478,71 @@ Or for an object grouped on multiple columns: Aggregation ----------- -Once the GroupBy object has been created, several methods are available to -perform a computation on the grouped data. These operations are similar to the -:ref:`aggregating API `, :ref:`window API `, -and :ref:`resample API `. - -An obvious one is aggregation via the -:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently -:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method: +An aggregation is a GroupBy operation that reduces the dimension of the grouping +object. The result of an aggregation is, or is at least treated as, +a scalar value for each column in a group. For example, producing the sum of each +column in each group of values. .. ipython:: python - grouped = df.groupby("A") - grouped[["C", "D"]].aggregate(np.sum) - - grouped = df.groupby(["A", "B"]) - grouped.aggregate(np.sum) + animals = pd.DataFrame( + { + "kind": ["cat", "dog", "cat", "dog"], + "height": [9.1, 6.0, 9.5, 34.0], + "weight": [7.9, 7.5, 9.9, 198.0], + } + ) + animals + animals.groupby("kind").sum() -As you can see, the result of the aggregation will have the group names as the -new index along the grouped axis.
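To make the reduction concrete, a minimal sketch (the frame below mirrors the ``animals`` example from this guide; it is illustrative, not part of the patch):

```python
import pandas as pd

# A small frame mirroring the "animals" example used in this guide
animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

# An aggregation collapses each group to one row; the group labels
# become the index of the result by default.
agg = animals.groupby("kind").sum()
print(agg.index.tolist())  # the group names: ['cat', 'dog']

# With as_index=False the keys stay as an ordinary column instead.
flat = animals.groupby("kind", as_index=False).sum()
print("kind" in flat.columns)  # True
```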
In the case of multiple keys, the result is a -:ref:`MultiIndex ` by default, though this can be -changed by using the ``as_index`` option: +In the result, the keys of the groups appear in the index by default. They can be +instead included in the columns by passing ``as_index=False``. .. ipython:: python - grouped = df.groupby(["A", "B"], as_index=False) - grouped.aggregate(np.sum) + animals.groupby("kind", as_index=False).sum() - df.groupby("A", as_index=False)[["C", "D"]].sum() +.. _groupby.aggregate.builtin: -Note that you could use the ``reset_index`` DataFrame function to achieve the -same result as the column names are stored in the resulting ``MultiIndex``: +Built-in aggregation methods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. ipython:: python +Many common aggregations are built-in to GroupBy objects as methods. Of the methods +listed below, those with a ``*`` do _not_ have a Cython-optimized implementation. + +.. csv-table:: + :header: "Method", "Description" + :widths: 20, 80 + :delim: ; - df.groupby(["A", "B"]).sum().reset_index() + :meth:`~.DataFrameGroupBy.any`;Compute whether any of the values in the groups are truthy + :meth:`~.DataFrameGroupBy.all`;Compute whether all of the values in the groups are truthy + :meth:`~.DataFrameGroupBy.count`;Compute the number of non-NA values in the groups + :meth:`~.DataFrameGroupBy.cov` * ;Compute the covariance of the groups + :meth:`~.DataFrameGroupBy.first` *;Compute the first occurring value in each group + :meth:`~.DataFrameGroupBy.idxmax` *;Compute the index of the maximum value in each group + :meth:`~.DataFrameGroupBy.idxmin` *;Compute the index of the minimum value in each group + :meth:`~.DataFrameGroupBy.last` *;Compute the last occurring value in each group + :meth:`~.DataFrameGroupBy.max` *;Compute the maximum value in each group + :meth:`~.DataFrameGroupBy.mean`;Compute the mean of each group + :meth:`~.DataFrameGroupBy.median`;Compute the median of each group + :meth:`~.DataFrameGroupBy.min` *;Compute the 
minimum value in each group + :meth:`~.DataFrameGroupBy.nunique`;Compute the number of unique values in each group + :meth:`~.DataFrameGroupBy.prod` *;Compute the product of the values in each group + :meth:`~.DataFrameGroupBy.quantile`;Compute a given quantile of the values in each group + :meth:`~.DataFrameGroupBy.sem`;Compute the standard error of the mean of the values in each group + :meth:`~.DataFrameGroupBy.size`;Compute the number of values in each group + :meth:`~.DataFrameGroupBy.skew` *;Compute the skew of the values in each group + :meth:`~.DataFrameGroupBy.std`;Compute the standard deviation of the values in each group + :meth:`~.DataFrameGroupBy.sum`;Compute the sum of the values in each group + :meth:`~.DataFrameGroupBy.var`;Compute the variance of the values in each group + +Some examples: + +.. ipython:: python + + df.groupby("A")[["C", "D"]].max() + df.groupby(["A", "B"]).mean() Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the ``size`` method. It returns a Series whose @@ -507,6 +550,7 @@ index are the group names and whose values are the sizes of each group. .. ipython:: python + grouped = df.groupby(["A", "B"]) grouped.size() .. ipython:: python @@ -531,34 +575,76 @@ Another aggregation example is to compute the number of unique values of each gr Passing ``as_index=False`` **will** return the groups that you are aggregating over, if they are named *columns*. -Aggregating functions are the ones that reduce the dimension of the returned objects. -Some common aggregating functions are tabulated below: -.. csv-table:: - :header: "Function", "Description" - :widths: 20, 80 - :delim: ; +.. _groupby.aggregate.agg: + +The :meth:`~.DataFrameGroupBy.aggregate` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The :meth:`~.DataFrameGroupBy.aggregate` method can accept many different types of +inputs. 
This section details using string aliases for various GroupBy methods; other +inputs are detailed in the sections below. + +.. ipython:: python + + grouped = df.groupby("A") + grouped[["C", "D"]].aggregate("sum") + + grouped = df.groupby(["A", "B"]) + grouped.agg("sum") + +As you can see, the result of the aggregation will have the group names as the +new index along the grouped axis. In the case of multiple keys, the result is a +:ref:`MultiIndex ` by default. As mentioned above, this can be +changed by using the ``as_index`` option: + +.. ipython:: python + + grouped = df.groupby(["A", "B"], as_index=False) + grouped.aggregate("sum") + + df.groupby("A", as_index=False)[["C", "D"]].sum() + +Note that you could use the ``reset_index`` DataFrame function to achieve the +same result as the column names are stored in the resulting ``MultiIndex``: - :meth:`~pd.core.groupby.DataFrameGroupBy.mean`;Compute mean of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.sum`;Compute sum of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.size`;Compute group sizes - :meth:`~pd.core.groupby.DataFrameGroupBy.count`;Compute count of group - :meth:`~pd.core.groupby.DataFrameGroupBy.std`;Standard deviation of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.var`;Compute variance of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.sem`;Standard error of the mean of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.describe`;Generates descriptive statistics - :meth:`~pd.core.groupby.DataFrameGroupBy.first`;Compute first of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.last`;Compute last of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list - :meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values +.. ipython:: python + df.groupby(["A", "B"]).agg("sum").reset_index() The aggregating functions above will exclude NA values. 
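A short sketch of the NA handling described here (illustrative data, not from the document): built-in aggregations skip missing values within each group, while ``size`` still counts every row.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, np.nan, 3.0]})

# Built-in aggregations exclude NA values within each group ...
sums = df_na.groupby("key")["value"].sum()
counts = df_na.groupby("key")["value"].count()

# ... so group "a" sums to 1.0 and counts one non-NA value,
# while size() still reports both rows of group "a".
print(sums["a"], counts["a"])           # 1.0 1
print(df_na.groupby("key").size()["a"])  # 2
```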
Any function which reduces a :class:`Series` to a scalar value is an aggregation function and will work, -a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that -:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a -filter, see :ref:`here `. +a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. + +.. _groupby.aggregate.udf: + +Aggregation with User-Defined Functions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Users can also provide their own User-Defined Functions (UDFs) for custom aggregations. + +.. warning:: + + When aggregating with a UDF, the UDF should not mutate the + provided ``Series``, see :ref:`gotchas.udf-mutation` for more information. + +.. note:: + + Aggregating with a UDF is often less performant than using + the pandas built-in methods on GroupBy. Consider breaking up a complex operation + into a chain of operations that utilize the built-in methods. + +.. ipython:: python + + animals + animals.groupby("kind")[["height"]].agg(lambda x: set(x)) + +The resulting dtype will reflect that of the aggregating function. If the results from different groups have +different dtypes, then a common dtype will be determined in the same way as ``DataFrame`` construction. + +.. ipython:: python + + animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum()) .. _groupby.aggregate.multifunc: @@ -571,14 +657,14 @@ aggregation with, outputting a DataFrame: .. ipython:: python grouped = df.groupby("A") - grouped["C"].agg([np.sum, np.mean, np.std]) + grouped["C"].agg(["sum", "mean", "std"]) On a grouped ``DataFrame``, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index: .. ipython:: python - grouped[["C", "D"]].agg([np.sum, np.mean, np.std]) + grouped[["C", "D"]].agg(["sum", "mean", "std"]) The resulting aggregations are named for the functions themselves. 
If you @@ -588,7 +674,7 @@ need to rename, then you can add in a chained operation for a ``Series`` like th ( grouped["C"] - .agg([np.sum, np.mean, np.std]) + .agg(["sum", "mean", "std"]) .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"}) ) @@ -597,24 +683,23 @@ For a grouped ``DataFrame``, you can rename in a similar manner: .. ipython:: python ( - grouped[["C", "D"]].agg([np.sum, np.mean, np.std]).rename( + grouped[["C", "D"]].agg(["sum", "mean", "std"]).rename( columns={"sum": "foo", "mean": "bar", "std": "baz"} ) ) .. note:: - In general, the output column names should be unique. You can't apply - the same function (or two functions with the same name) to the same + In general, the output column names should be unique, but pandas will allow + you to apply the same function (or two functions with the same name) to the same column. .. ipython:: python - :okexcept: grouped["C"].agg(["sum", "sum"]) - pandas *does* allow you to provide multiple lambdas. In this case, pandas + pandas also allows you to provide multiple lambdas. In this case, pandas will mangle the name of the (nameless) lambda functions, appending ``_`` to each subsequent lambda. @@ -623,14 +708,13 @@ For a grouped ``DataFrame``, you can rename in a similar manner: grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()]) - .. _groupby.aggregate.named: Named aggregation ~~~~~~~~~~~~~~~~~ To support column-specific aggregation *with control over the output column names*, pandas -accepts the special syntax in :meth:`DataFrameGroupBy.agg` and :meth:`SeriesGroupBy.agg`, known as "named aggregation", where +accepts the special syntax in :meth:`.DataFrameGroupBy.agg` and :meth:`.SeriesGroupBy.agg`, known as "named aggregation", where - The keywords are the *output* column names - The values are tuples whose first element is the column to select @@ -641,19 +725,12 @@ accepts the special syntax in :meth:`DataFrameGroupBy.agg` and :meth:`SeriesGrou ..
ipython:: python - animals = pd.DataFrame( - { - "kind": ["cat", "dog", "cat", "dog"], - "height": [9.1, 6.0, 9.5, 34.0], - "weight": [7.9, 7.5, 9.9, 198.0], - } - ) animals animals.groupby("kind").agg( min_height=pd.NamedAgg(column="height", aggfunc="min"), max_height=pd.NamedAgg(column="height", aggfunc="max"), - average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean), + average_weight=pd.NamedAgg(column="weight", aggfunc="mean"), ) @@ -664,7 +741,7 @@ accepts the special syntax in :meth:`DataFrameGroupBy.agg` and :meth:`SeriesGrou animals.groupby("kind").agg( min_height=("height", "min"), max_height=("height", "max"), - average_weight=("weight", np.mean), + average_weight=("weight", "mean"), ) @@ -675,21 +752,15 @@ and unpack the keyword arguments animals.groupby("kind").agg( **{ - "total weight": pd.NamedAgg(column="weight", aggfunc=sum) + "total weight": pd.NamedAgg(column="weight", aggfunc="sum") } ) -Additional keyword arguments are not passed through to the aggregation functions. Only pairs +When using named aggregation, additional keyword arguments are not passed through +to the aggregation functions; only pairs of ``(column, aggfunc)`` should be passed as ``**kwargs``. If your aggregation functions requires additional arguments, partially apply them with :meth:`functools.partial`. -.. note:: - - For Python 3.5 and earlier, the order of ``**kwargs`` in a functions was not - preserved. This means that the output column ordering would not be - consistent. To ensure consistent ordering, the keys (and so output columns) - will always be sorted for Python 3.5. - Named aggregation is also valid for Series groupby aggregations. In this case there's no column selection, so the values are just the functions. @@ -708,59 +779,97 @@ columns of a DataFrame: .. ipython:: python - grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)}) + grouped.agg({"C": "sum", "D": lambda x: np.std(x, ddof=1)}) The function names can also be strings. 
In order for a string to be valid it -must be either implemented on GroupBy or available via :ref:`dispatching -`: +must be implemented on GroupBy: .. ipython:: python grouped.agg({"C": "sum", "D": "std"}) -.. _groupby.aggregate.cython: +.. _groupby.transform: -Cython-optimized aggregation functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Transformation +-------------- -Some common aggregations, currently only ``sum``, ``mean``, ``std``, and ``sem``, have -optimized Cython implementations: +A transformation is a GroupBy operation whose result is indexed the same +as the one being grouped. Common examples include ``cumsum`` and ``diff``. .. ipython:: python - df.groupby("A")[["C", "D"]].sum() - df.groupby(["A", "B"]).mean() + speeds + grouped = speeds.groupby("class")["max_speed"] + grouped.cumsum() + grouped.diff() -Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above -code would work even without the special versions via dispatching (see below). +Unlike aggregations, the groupings that are used to split +the original object are not included in the result. -.. _groupby.aggregate.udfs: +.. note:: -Aggregations with User-Defined Functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Since transformations do not include the groupings that are used to split the result, + the arguments ``as_index`` and ``sort`` in :meth:`DataFrame.groupby` and + :meth:`Series.groupby` have no effect. -Users can also provide their own functions for custom aggregations. When aggregating -with a User-Defined Function (UDF), the UDF should not mutate the provided ``Series``, see -:ref:`gotchas.udf-mutation` for more information. +A common use of a transformation is to add the result back into the original DataFrame. ..
ipython:: python - animals.groupby("kind")[["height"]].agg(lambda x: set(x)) + result = speeds.copy() + result["cumsum"] = grouped.cumsum() + result["diff"] = grouped.diff() + result -The resulting dtype will reflect that of the aggregating function. If the results from different groups have -different dtypes, then a common dtype will be determined in the same way as ``DataFrame`` construction. +Built-in transformation methods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. ipython:: python +The following methods on GroupBy act as transformations. Of these methods, only +``fillna`` does not have a Cython-optimized implementation. - animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum()) +.. csv-table:: + :header: "Method", "Description" + :widths: 20, 80 + :delim: ; -.. _groupby.transform: + :meth:`~.DataFrameGroupBy.bfill`;Back fill NA values within each group + :meth:`~.DataFrameGroupBy.cumcount`;Compute the cumulative count within each group + :meth:`~.DataFrameGroupBy.cummax`;Compute the cumulative max within each group + :meth:`~.DataFrameGroupBy.cummin`;Compute the cumulative min within each group + :meth:`~.DataFrameGroupBy.cumprod`;Compute the cumulative product within each group + :meth:`~.DataFrameGroupBy.cumsum`;Compute the cumulative sum within each group + :meth:`~.DataFrameGroupBy.diff`;Compute the difference between adjacent values within each group + :meth:`~.DataFrameGroupBy.ffill`;Forward fill NA values within each group + :meth:`~.DataFrameGroupBy.fillna`;Fill NA values within each group + :meth:`~.DataFrameGroupBy.pct_change`;Compute the percent change between adjacent values within each group + :meth:`~.DataFrameGroupBy.rank`;Compute the rank of each value within each group + :meth:`~.DataFrameGroupBy.shift`;Shift values up or down within each group -Transformation --------------- +In addition, passing any built-in aggregation method as a string to +:meth:`~.DataFrameGroupBy.transform` (see below) will broadcast the result across the group, 
+producing a transformed result. If the aggregation method is Cython-optimized, this +will be performant as well. -The ``transform`` method returns an object that is indexed the same -as the one being grouped. The transform function must: +.. _groupby.transformation.transform: + +The :meth:`~.DataFrameGroupBy.transform` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Similar to the :ref:`aggregation method `, the +:meth:`~.DataFrameGroupBy.transform` method can accept string aliases to the built-in +transform methods in the previous section. It can *also* accept string aliases to the +built-in aggregation methods. When an aggregation method is provided, the result will +be broadcast across the group. + +.. ipython:: python + + speeds + grouped = speeds.groupby("class")[["max_speed"]] + grouped.transform("cumsum") + grouped.transform("sum") + +In addition to string aliases, the :meth:`~.DataFrameGroupBy.transform` method can +also accept User-Defined Functions (UDFs). The UDF must: * Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, @@ -769,18 +878,29 @@ as the one being grouped. The transform function must: the first group chunk using chunk.apply. * Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected - results. -* (Optionally) operates on the entire group chunk. If this is supported, a - fast path is used starting from the *second* chunk. + results. See :ref:`gotchas.udf-mutation` for more information. +* (Optionally) operates on all columns of the entire group chunk at once. If this is + supported, a fast path is used starting from the *second* chunk. + +.. note:: + + Transforming by supplying ``transform`` with a UDF is + often less performant than using the built-in methods on GroupBy.
+ Consider breaking up a complex operation into a chain of operations that utilize + the built-in methods. + + All of the examples in this section can be made more performant by calling + built-in methods instead of using ``transform``. + See :ref:`below for examples `. .. versionchanged:: 2.0.0 When using ``.transform`` on a grouped DataFrame and the transformation function returns a DataFrame, pandas now aligns the result's index - with the input's index. You can call ``.to_numpy()`` on the - result of the transformation function to avoid alignment. + with the input's index. You can call ``.to_numpy()`` within the transformation + function to avoid alignment. -Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the +Similar to :ref:`groupby.aggregate.agg`, the resulting dtype will reflect that of the transformation function. If the results from different groups have different dtypes, then a common dtype will be determined in the same way as ``DataFrame`` construction. @@ -831,15 +951,6 @@ match the shape of the input array. ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min()) -Alternatively, the built-in methods could be used to produce the same outputs. - -.. ipython:: python - - max_ts = ts.groupby(lambda x: x.year).transform("max") - min_ts = ts.groupby(lambda x: x.year).transform("min") - - max_ts - min_ts - Another common data transform is to replace missing data with the group mean. .. ipython:: python @@ -880,18 +991,27 @@ and that the transformed data contains no NAs. grouped_trans.count() # counts after transformation grouped_trans.size() # Verify non-NA count equals group size -.. note:: +.. _groupby_efficient_transforms: - Some functions will automatically transform the input when applied to a - GroupBy object, but returning an object of the same shape as the original. - Passing ``as_index=False`` will not affect these transformation methods. 
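For instance, the group-standardization pattern can be written either way; in a sketch with toy data (the series below is illustrative, not part of the patch), the UDF and the chain of built-in transforms agree:

```python
import numpy as np
import pandas as pd

# A toy series spanning two years, so the lambda groups by year
ts = pd.Series(
    np.arange(6, dtype="float64"),
    index=pd.date_range("2000-12-29", periods=6),
)
grouped = ts.groupby(lambda x: x.year)

# Standardize within each year with a UDF ...
slow = grouped.transform(lambda x: (x - x.mean()) / x.std())

# ... or with a chain of Cython-optimized built-in transforms
fast = (ts - grouped.transform("mean")) / grouped.transform("std")

print(np.allclose(slow, fast))  # True: the two approaches agree
```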
+As mentioned in the note above, each of the examples in this section can be computed - For example: ``fillna, ffill, bfill, shift.``. +more efficiently using built-in methods. .. ipython:: python - .. ipython:: python + # ts.groupby(lambda x: x.year).transform( + # lambda x: (x - x.mean()) / x.std() + # ) + grouped = ts.groupby(lambda x: x.year) + result = (ts - grouped.transform("mean")) / grouped.transform("std") - grouped.ffill() + # ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min()) + grouped = ts.groupby(lambda x: x.year) + result = grouped.transform("max") - grouped.transform("min") + # grouped = data_df.groupby(key) + # grouped.transform(lambda x: x.fillna(x.mean())) + grouped = data_df.groupby(key) + result = data_df.fillna(grouped.transform("mean")) .. _groupby.transform.window_resample: @@ -943,127 +1063,134 @@ missing values with the ``ffill()`` method. Filtration ---------- -The ``filter`` method returns a subset of the original object. Suppose we -want to take only elements that belong to groups with a group sum greater -than 2. +A filtration is a GroupBy operation that subsets the original grouping object. It +may either filter out entire groups, part of groups, or both. Filtrations return +a filtered version of the calling object, including the grouping columns when provided. +In the following example, ``class`` is included in the result. .. ipython:: python - sf = pd.Series([1, 1, 2, 3, 3, 3]) - sf.groupby(sf).filter(lambda x: x.sum() > 2) -The argument of ``filter`` must be a function that, applied to the group as a -whole, returns ``True`` or ``False``. + speeds + speeds.groupby("class").nth(1) -Another useful operation is filtering out elements that belong to groups -with only a couple members. +.. note:: -.. ipython:: python + Unlike aggregations, filtrations do not add the group keys to the index of the + result. Because of this, passing ``as_index=False`` will not affect these + methods.
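A minimal sketch of the filtration behavior described above (hypothetical data mirroring the ``speeds`` frame; assumes pandas 2.x, where ``nth`` acts as a filtration):

```python
import pandas as pd

speeds = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.2, 21.4, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)

# nth(1) keeps the second row of each group; the original index is
# preserved and the grouping column stays in the result.
second = speeds.groupby("class").nth(1)
print(second)
```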
- dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")}) - dff.groupby("B").filter(lambda x: len(x) > 2) - -Alternatively, instead of dropping the offending groups, we can return a -like-indexed objects where the groups that do not pass the filter are filled -with NaNs. +Filtrations will respect subsetting the columns of the GroupBy object. .. ipython:: python - dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False) + speeds.groupby("class")[["order", "max_speed"]].nth(1) -For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion. +Built-in filtrations +~~~~~~~~~~~~~~~~~~~~ -.. ipython:: python +The following methods on GroupBy act as filtrations. All these methods have a +Cython-optimized implementation. - dff["C"] = np.arange(8) - dff.groupby("B").filter(lambda x: len(x["C"]) > 2) +.. csv-table:: + :header: "Method", "Description" + :widths: 20, 80 + :delim: ; -.. note:: + :meth:`~.DataFrameGroupBy.head`;Select the top row(s) of each group + :meth:`~.DataFrameGroupBy.nth`;Select the nth row(s) of each group + :meth:`~.DataFrameGroupBy.tail`;Select the bottom row(s) of each group - Some functions when applied to a groupby object will act as a **filter** on the input, returning - a reduced shape of the original (and potentially eliminating groups), but with the index unchanged. - Passing ``as_index=False`` will not affect these transformation methods. +Users can also use transformations along with Boolean indexing to construct complex +filtrations within groups. For example, suppose we are given groups of products and +their volumes, and we wish to subset the data to only the largest products capturing no +more than 90% of the total volume within each group. - For example: ``head, tail``. +.. ipython:: python - .. 
ipython:: python + product_volumes = pd.DataFrame( + { + "group": list("xxxxyyy"), + "product": list("abcdefg"), + "volume": [10, 30, 20, 15, 40, 10, 20], + } + ) + product_volumes - dff.groupby("B").head(2) + # Sort by volume to select the largest products first + product_volumes = product_volumes.sort_values("volume", ascending=False) + grouped = product_volumes.groupby("group")["volume"] + cumpct = grouped.cumsum() / grouped.transform("sum") + cumpct + significant_products = product_volumes[cumpct <= 0.9] + significant_products.sort_values(["group", "product"]) +The :class:`~DataFrameGroupBy.filter` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. _groupby.dispatch: +.. note:: -Dispatching to instance methods -------------------------------- + Filtering by supplying ``filter`` with a User-Defined Function (UDF) is + often less performant than using the built-in methods on GroupBy. + Consider breaking up a complex operation into a chain of operations that utilize + the built-in methods. + +The ``filter`` method takes a User-Defined Function (UDF) that, when applied to +an entire group, returns either ``True`` or ``False``. The result of the ``filter`` +method is then the subset of groups for which the UDF returned ``True``. -When doing an aggregation or transformation, you might just want to call an -instance method on each data group. This is pretty easy to do by passing lambda -functions: +Suppose we want to take only elements that belong to groups with a group sum greater +than 2. .. ipython:: python - :okwarning: - grouped = df.groupby("A")[["C", "D"]] - grouped.agg(lambda x: x.std()) + sf = pd.Series([1, 1, 2, 3, 3, 3]) + sf.groupby(sf).filter(lambda x: x.sum() > 2) -But, it's rather verbose and can be untidy if you need to pass additional -arguments. 
Using a bit of metaprogramming cleverness, GroupBy now has the -ability to "dispatch" method calls to the groups: +Another useful operation is filtering out elements that belong to groups +with only a couple members. .. ipython:: python - :okwarning: - grouped.std() + dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")}) + dff.groupby("B").filter(lambda x: len(x) > 2) -What is actually happening here is that a function wrapper is being -generated. When invoked, it takes any passed arguments and invokes the function -with any arguments on each group (in the above example, the ``std`` -function). The results are then combined together much in the style of ``agg`` -and ``transform`` (it actually uses ``apply`` to infer the gluing, documented -next). This enables some operations to be carried out rather succinctly: +Alternatively, instead of dropping the offending groups, we can return a +like-indexed objects where the groups that do not pass the filter are filled +with NaNs. .. ipython:: python - tsdf = pd.DataFrame( - np.random.randn(1000, 3), - index=pd.date_range("1/1/2000", periods=1000), - columns=["A", "B", "C"], - ) - tsdf.iloc[::2] = np.nan - grouped = tsdf.groupby(lambda x: x.year) - grouped.fillna(method="pad") - -In this example, we chopped the collection of time series into yearly chunks -then independently called :ref:`fillna ` on the -groups. + dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False) -The ``nlargest`` and ``nsmallest`` methods work on ``Series`` style groupbys: +For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion. .. ipython:: python - s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3]) - g = pd.Series(list("abababab")) - gb = s.groupby(g) - gb.nlargest(3) - gb.nsmallest(3) + dff["C"] = np.arange(8) + dff.groupby("B").filter(lambda x: len(x["C"]) > 2) .. 
_groupby.apply: Flexible ``apply`` ------------------ -Some operations on the grouped data might not fit into either the aggregate or -transform categories. Or, you may simply want GroupBy to infer how to combine -the results. For these, use the ``apply`` function, which can be substituted -for both ``aggregate`` and ``transform`` in many standard use cases. However, -``apply`` can handle some exceptional use cases. +Some operations on the grouped data might not fit into the aggregation, +transformation, or filtration categories. For these, you can use the ``apply`` +function. + +.. warning:: + + ``apply`` has to try to infer from the result whether it should act as a reducer, + transformer, *or* filter, depending on exactly what is passed to it. Thus the + grouped column(s) may be included in the output as well as set the indices. While + it tries to intelligently guess how to behave, it can sometimes guess wrong. .. note:: - ``apply`` can act as a reducer, transformer, *or* filter function, depending - on exactly what is passed to it. It can depend on the passed function and - exactly what you are grouping. Thus the grouped column(s) may be included in - the output as well as set the indices. + All of these examples can be more reliably, and more efficiently, computed using + other pandas functionality. In fact, pandas maintainers are interested if you + have an operation that you must use ``apply`` for. If you believe you do, please + `raise an issue on GitHub `_ .. ipython:: python @@ -1098,10 +1225,14 @@ that is itself a series, and possibly upcast the result to a DataFrame: s s.apply(f) +Similar to :ref:`groupby.aggregate.agg`, the resulting dtype will reflect that of the +apply function. If the results from different groups have different dtypes, then +a common dtype will be determined in the same way as ``DataFrame`` construction. + Control grouped column(s) placement with ``group_keys`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. note:: +.. 
versionchanged:: 1.5.0 If ``group_keys=True`` is specified when calling :meth:`~DataFrame.groupby`, functions passed to ``apply`` that return like-indexed outputs will have the @@ -1111,8 +1242,6 @@ Control grouped column(s) placement with ``group_keys`` not be added for like-indexed outputs. In the future this behavior will change to always respect ``group_keys``, which defaults to ``True``. - .. versionchanged:: 1.5.0 - To control whether the grouped column(s) are included in the indices, you can use the argument ``group_keys``. Compare @@ -1126,10 +1255,6 @@ with df.groupby("A", group_keys=False).apply(lambda x: x) -Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the -apply function. If the results from different groups have different dtypes, then -a common dtype will be determined in the same way as ``DataFrame`` construction. - Numba Accelerated Routines -------------------------- @@ -1153,8 +1278,8 @@ will be passed into ``values``, and the group index will be passed into ``index` Other useful features --------------------- -Automatic exclusion of "nuisance" columns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Exclusion of "nuisance" columns +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Again consider the example DataFrame we've been looking at: @@ -1164,8 +1289,8 @@ Again consider the example DataFrame we've been looking at: Suppose we wish to compute the standard deviation grouped by the ``A`` column. There is a slight problem, namely that we don't care about the data in -column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance -columns by specifying ``numeric_only=True``: +column ``B`` because it is not numeric. We refer to these non-numeric columns as +"nuisance" columns. You can avoid nuisance columns by specifying ``numeric_only=True``: .. ipython:: python @@ -1178,20 +1303,13 @@ is only interesting over one column (here ``colname``), it may be filtered .. 
note:: Any object column, also if it contains numerical values such as ``Decimal`` - objects, is considered as a "nuisance" columns. They are excluded from + objects, is considered as a "nuisance" column. They are excluded from aggregate functions automatically in groupby. If you do wish to include decimal or object columns in an aggregation with other non-nuisance data types, you must do so explicitly. -.. warning:: - The automatic dropping of nuisance columns has been deprecated and will be removed - in a future version of pandas. If columns are included that cannot be operated - on, pandas will instead raise an error. In order to avoid this, either select - the columns you wish to operate on or specify ``numeric_only=True``. - .. ipython:: python - :okwarning: from decimal import Decimal From fc158ee9de76c372ca3e42c71be8c403e00984a4 Mon Sep 17 00:00:00 2001 From: Richard Shadrach Date: Mon, 27 Feb 2023 18:59:49 -0500 Subject: [PATCH 2/5] Improvements --- doc/source/user_guide/groupby.rst | 56 ++++++++++++++++------------ doc/source/user_guide/timeseries.rst | 2 +- doc/source/whatsnew/v0.7.0.rst | 2 +- 3 files changed, 34 insertions(+), 26 deletions(-) diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index bc9802c0fa154..d35f3092ba1e5 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -508,7 +508,7 @@ Built-in aggregation methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Many common aggregations are built-in to GroupBy objects as methods. Of the methods -listed below, those with a ``*`` do _not_ have a Cython-optimized implementation. +listed below, those with a ``*`` do *not* have a Cython-optimized implementation. .. csv-table:: :header: "Method", "Description" @@ -553,11 +553,17 @@ index are the group names and whose values are the sizes of each group. 
grouped = df.groupby(["A", "B"]) grouped.size() +While the :meth:`~.DataFrameGroupBy.describe` method is not itself a reducer, it +can be used to conveniently produce a collection of summary statistics about each of +the groups. + .. ipython:: python grouped.describe() -Another aggregation example is to compute the number of unique values of each group. This is similar to the ``value_counts`` function, except that it only counts unique values. +Another aggregation example is to compute the number of unique values of each group. +This is similar to the ``value_counts`` function, except that it only counts the +number of unique values. .. ipython:: python @@ -568,12 +574,12 @@ Another aggregation example is to compute the number of unique values of each gr .. note:: - Aggregation functions **will not** return the groups that you are aggregating over + Aggregation functions **will not** operate on the groups that you are aggregating over if they are named *columns*, when ``as_index=True``, the default. The grouped columns will be the **indices** of the returned object. Passing ``as_index=False`` **will** return the groups that you are aggregating over, if they are - named *columns*. + named **indices** or *columns*. .. _groupby.aggregate.agg: @@ -581,9 +587,14 @@ Another aggregation example is to compute the number of unique values of each gr The :meth:`~.DataFrameGroupBy.aggregate` method ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The :meth:`~.DataFrameGroupBy.aggregate` method can accept many different types of -inputs. This section details using string aliases for various GroupBy methods; other -inputs are detailed in the sections below. +.. note:: + The :meth:`~.DataFrameGroupBy.aggregate` method can accept many different types of + inputs. This section details using string aliases for various GroupBy methods; other + inputs are detailed in the sections below. 
+ +Any reduction method that pandas implements can be passed as a string to +:meth:`~.DataFrameGroupBy.aggregate`. Users are encouraged to use the shorthand, +``agg``. It will operate as if the corresponding method was called. .. ipython:: python @@ -593,7 +604,7 @@ inputs are detailed in the sections below. grouped = df.groupby(["A", "B"]) grouped.agg("sum") -As you can see, the result of the aggregation will have the group names as the +The result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a :ref:`MultiIndex ` by default. As mentioned above, this can be changed by using the ``as_index`` option: @@ -601,9 +612,9 @@ changed by using the ``as_index`` option: .. ipython:: python grouped = df.groupby(["A", "B"], as_index=False) - grouped.aggregate("sum") + grouped.agg("sum") - df.groupby("A", as_index=False)[["C", "D"]].sum() + df.groupby("A", as_index=False)[["C", "D"]].agg("sum") Note that you could use the ``reset_index`` DataFrame function to achieve the same result as the column names are stored in the resulting ``MultiIndex``: @@ -612,10 +623,6 @@ same result as the column names are stored in the resulting ``MultiIndex``: df.groupby(["A", "B"]).agg("sum").reset_index() -The aggregating functions above will exclude NA values. Any function which -reduces a :class:`Series` to a scalar value is an aggregation function and will work, -a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. - .. _groupby.aggregate.udf: Aggregation with User-Defined Functions @@ -719,7 +726,7 @@ accepts the special syntax in :meth:`.DataFrameGroupBy.agg` and :meth:`.SeriesGr - The keywords are the *output* column names - The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. 
pandas - provides the ``pandas.NamedAgg`` namedtuple with the fields ``['column', 'aggfunc']`` + provides the :class:`NamedAgg` namedtuple with the fields ``['column', 'aggfunc']`` to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias. @@ -734,7 +741,7 @@ accepts the special syntax in :meth:`.DataFrameGroupBy.agg` and :meth:`.SeriesGr ) -``pandas.NamedAgg`` is just a ``namedtuple``. Plain tuples are allowed as well. +:class:`NamedAgg` is just a ``namedtuple``. Plain tuples are allowed as well. .. ipython:: python @@ -794,7 +801,8 @@ Transformation -------------- A transformation is a GroupBy operation whose result is indexed the same -as the one being grouped. Common examples include ``cumsum`` and ``diff``. +as the one being grouped. Common examples include :meth:`~.DataFrameGroupBy.cumsum` and +:meth:`~.DataFrameGroupBy.diff`. .. ipython:: python @@ -846,9 +854,9 @@ The following methods on GroupBy act as transformations. Of these methods, only :meth:`~.DataFrameGroupBy.shift`;Shift values up or down within each group In addition, passing any built-in aggregation method as a string to -:meth:`~.DataFrameGroupBy.transform` (see below) will broadcast the result across the group, -producing a transformed result. If the aggregation method is Cython-optimized, this -will be performant as well. +:meth:`~.DataFrameGroupBy.transform` (see the next section) will broadcast the result +across the group, producing a transformed result. If the aggregation method is +Cython-optimized, this will be performant as well. .. _groupby.transformation.transform: @@ -856,10 +864,10 @@ The :meth:`~.DataFrameGroupBy.transform` method ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Similar to the :ref:`aggregation method `, the -:meth:`~.DataFrameGroupBy.transform` can accept string aliases to the built-in -transform methods in the previous section. It can *also* accept string aliases to the -built-in aggregation methods. 
When an aggregation method is provided, the result will -be broadcast across the group. +:meth:`~.DataFrameGroupBy.transform` method can accept string aliases to the built-in +transformation methods in the previous section. It can *also* accept string aliases to +the built-in aggregation methods. When an aggregation method is provided, the result +will be broadcast across the group. .. ipython:: python diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst index a675e30823c89..4cd98c89e7180 100644 --- a/doc/source/user_guide/timeseries.rst +++ b/doc/source/user_guide/timeseries.rst @@ -1618,7 +1618,7 @@ The ``resample`` function is very flexible and allows you to specify many different parameters to control the frequency conversion and resampling operation. -Any function available via :ref:`dispatching ` is available as +Any built-in method available via :ref:`GroupBy ` is available as a method of the returned object, including ``sum``, ``mean``, ``std``, ``sem``, ``max``, ``min``, ``median``, ``first``, ``last``, ``ohlc``: diff --git a/doc/source/whatsnew/v0.7.0.rst b/doc/source/whatsnew/v0.7.0.rst index 1ee6a9899a655..2336ccaeac820 100644 --- a/doc/source/whatsnew/v0.7.0.rst +++ b/doc/source/whatsnew/v0.7.0.rst @@ -346,7 +346,7 @@ Other API changes Performance improvements ~~~~~~~~~~~~~~~~~~~~~~~~ -- :ref:`Cythonized GroupBy aggregations ` no longer +- :ref:`Cythonized GroupBy aggregations ` no longer presort the data, thus achieving a significant speedup (:issue:`93`). GroupBy aggregations with Python functions significantly sped up by clever manipulation of the ndarray data type in Cython (:issue:`496`). 
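The broadcasting behavior that patch 2 documents for :meth:`~.DataFrameGroupBy.transform` can be sketched in a few lines. This is a minimal standalone illustration (the frame and column names here are made up for the example, not taken from the patch): passing the string alias of an aggregation method to ``transform`` computes one value per group and then broadcasts it back to the original shape.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b"], "B": [1.0, 2.0, 3.0, 4.0]})

# Passing an aggregation alias ("mean") to transform() reduces each
# group to a single value and then broadcasts that value back across
# the group's rows, so the result is aligned with the original frame.
out = df.groupby("A")["B"].transform("mean")
print(out.tolist())  # [2.0, 3.0, 2.0, 3.0]
```

Group ``"a"`` (rows 0 and 2) has mean 2.0 and group ``"b"`` (rows 1 and 3) has mean 3.0, and each row receives its own group's value, which is what distinguishes this from calling ``agg("mean")``.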
From 436339740e6f0fad9e8ff2ba9162950d554b92d0 Mon Sep 17 00:00:00 2001 From: Dea Leon Date: Fri, 3 Mar 2023 18:34:24 +0100 Subject: [PATCH 3/5] DOC Checking groupby guide --- doc/source/user_guide/groupby.rst | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index d35f3092ba1e5..886581f9f45ea 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -479,9 +479,9 @@ Aggregation ----------- An aggregation is a GroupBy operation that reduces the dimension of the grouping -object. The result of an aggregation is, or at least treated as, -a scalar value for each column in a group. For example, producing a sum of each -column in group of values. +object. The result of an aggregation is, or at least is treated as, +a scalar value for each column in a group. For example, producing the sum of each +column in a group of values. .. ipython:: python @@ -633,7 +633,7 @@ Users can also provide their own User-Defined Functions (UDFs) for custom aggreg .. warning:: When aggregating with a UDF, the UDF should not mutate the - provided ``Series``, see :ref:`gotchas.udf-mutation` for more information. + provided ``Series``. See :ref:`gotchas.udf-mutation` for more information. .. note:: @@ -674,7 +674,7 @@ column, which produces an aggregated result with a hierarchical index: grouped[["C", "D"]].agg(["sum", "mean", "std"]) -The resulting aggregations are named for the functions themselves. If you +The resulting aggregations are named after the functions themselves. If you need to rename, then you can add in a chained operation for a ``Series`` like this: .. 
ipython:: python @@ -752,7 +752,7 @@ accepts the special syntax in :meth:`.DataFrameGroupBy.agg` and :meth:`.SeriesGr ) -If your desired output column names are not valid Python keywords, construct a dictionary +If column names you want are not valid Python keywords, construct a dictionary and unpack the keyword arguments .. ipython:: python @@ -766,7 +766,7 @@ and unpack the keyword arguments When using named aggregation, additional keyword arguments are not passed through to the aggregation functions; only pairs of ``(column, aggfunc)`` should be passed as ``**kwargs``. If your aggregation functions -requires additional arguments, partially apply them with :meth:`functools.partial`. +require additional arguments, apply them partially with :meth:`functools.partial`. Named aggregation is also valid for Series groupby aggregations. In this case there's no column selection, so the values are just the functions. @@ -789,7 +789,7 @@ columns of a DataFrame: grouped.agg({"C": "sum", "D": lambda x: np.std(x, ddof=1)}) The function names can also be strings. In order for a string to be valid it -must be either implemented on GroupBy: +must be implemented on GroupBy: .. ipython:: python @@ -912,7 +912,7 @@ Similar to :ref:`groupby.aggregate.agg`, the resulting dtype will reflect that o transformation function. If the results from different groups have different dtypes, then a common dtype will be determined in the same way as ``DataFrame`` construction. -Suppose we wished to standardize the data within each group: +Suppose we wish to standardize the data within each group: .. ipython:: python @@ -985,7 +985,7 @@ Another common data transform is to replace missing data with the group mean. transformed = grouped.transform(lambda x: x.fillna(x.mean())) -We can verify that the group means have not changed in the transformed data +We can verify that the group means have not changed in the transformed data, and that the transformed data contains no NAs. .. 
ipython:: python @@ -1030,7 +1030,7 @@ It is possible to use ``resample()``, ``expanding()`` and ``rolling()`` as methods on groupbys. The example below will apply the ``rolling()`` method on the samples of -the column B based on the groups of column A. +the column B, based on the groups of column A. .. ipython:: python @@ -1050,7 +1050,7 @@ group. Suppose you want to use the ``resample()`` method to get a daily -frequency in each group of your dataframe and wish to complete the +frequency in each group of your dataframe, and wish to complete the missing values with the ``ffill()`` method. .. ipython:: python From 7d28a97cca22593337d3868143478d272077f35d Mon Sep 17 00:00:00 2001 From: Richard Shadrach Date: Sat, 4 Mar 2023 08:06:36 -0500 Subject: [PATCH 4/5] Fix as_index language --- doc/source/user_guide/groupby.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index 886581f9f45ea..a2b072045bb96 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -574,8 +574,8 @@ number of unique values. .. note:: - Aggregation functions **will not** operate on the groups that you are aggregating over - if they are named *columns*, when ``as_index=True``, the default. The grouped columns will + Aggregation functions **will not** return the groups that you are aggregating over + as named *columns*, when ``as_index=True``, the default. The grouped columns will be the **indices** of the returned object. Passing ``as_index=False`` **will** return the groups that you are aggregating over, if they are @@ -752,7 +752,7 @@ accepts the special syntax in :meth:`.DataFrameGroupBy.agg` and :meth:`.SeriesGr ) -If column names you want are not valid Python keywords, construct a dictionary +If the column names you want are not valid Python keywords, construct a dictionary and unpack the keyword arguments .. 
ipython:: python From ec0d5f85ff037a1f2048cb5f07873e553c7ebdd7 Mon Sep 17 00:00:00 2001 From: Richard Shadrach Date: Sun, 5 Mar 2023 15:48:20 -0500 Subject: [PATCH 5/5] Improvements --- doc/source/user_guide/groupby.rst | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index 0fa605278c938..31c4bd1d7c87c 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -517,16 +517,16 @@ listed below, those with a ``*`` do *not* have a Cython-optimized implementation :meth:`~.DataFrameGroupBy.all`;Compute whether all of the values in the groups are truthy :meth:`~.DataFrameGroupBy.count`;Compute the number of non-NA values in the groups :meth:`~.DataFrameGroupBy.cov` * ;Compute the covariance of the groups - :meth:`~.DataFrameGroupBy.first` *;Compute the first occurring value in each group + :meth:`~.DataFrameGroupBy.first`;Compute the first occurring value in each group :meth:`~.DataFrameGroupBy.idxmax` *;Compute the index of the maximum value in each group :meth:`~.DataFrameGroupBy.idxmin` *;Compute the index of the minimum value in each group - :meth:`~.DataFrameGroupBy.last` *;Compute the last occurring value in each group - :meth:`~.DataFrameGroupBy.max` *;Compute the maximum value in each group + :meth:`~.DataFrameGroupBy.last`;Compute the last occurring value in each group + :meth:`~.DataFrameGroupBy.max`;Compute the maximum value in each group :meth:`~.DataFrameGroupBy.mean`;Compute the mean of each group :meth:`~.DataFrameGroupBy.median`;Compute the median of each group - :meth:`~.DataFrameGroupBy.min` *;Compute the minimum value in each group + :meth:`~.DataFrameGroupBy.min`;Compute the minimum value in each group :meth:`~.DataFrameGroupBy.nunique`;Compute the number of unique values in each group - :meth:`~.DataFrameGroupBy.prod` *;Compute the product of the values in each group + :meth:`~.DataFrameGroupBy.prod`;Compute 
the product of the values in each group
     :meth:`~.DataFrameGroupBy.quantile`;Compute a given quantile of the values in each group
     :meth:`~.DataFrameGroupBy.sem`;Compute the standard error of the mean of the values in each group
     :meth:`~.DataFrameGroupBy.size`;Compute the number of values in each group
@@ -614,8 +614,9 @@ changed by using the ``as_index`` option:
 
     df.groupby("A", as_index=False)[["C", "D"]].agg("sum")
 
-Note that you could use the ``reset_index`` DataFrame function to achieve the
-same result as the column names are stored in the resulting ``MultiIndex``:
+Note that you could use the :meth:`DataFrame.reset_index` method to achieve the
+same result, as the column names are stored in the resulting ``MultiIndex``, although
+this will make an extra copy.
 
 .. ipython:: python
 
@@ -1000,7 +1001,8 @@ and that the transformed data contains no NAs.
 .. _groupby_efficient_transforms:
 
 As mentioned in the note above, each of the examples in this section can be computed
-more efficiently using built-in methods.
+more efficiently using built-in methods. In the code below, the inefficient way
+using a UDF is commented out and the faster alternative follows.
 
 .. ipython:: python
 
@@ -1082,8 +1084,8 @@ In the following example, ``class`` is included in the result.
 
 .. note::
 
     Unlike aggregations, filtrations do not add the group keys to the index of the
-    result. Because of this, passing ``as_index=False`` will not affect these
-    transformation methods.
+    result. Because of this, passing ``as_index=False`` or ``sort=True`` will not
+    affect these methods.
 
 Filtrations will respect subsetting the columns of the GroupBy object.
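The note in the final hunk, that filtrations keep the original index and are therefore unaffected by ``as_index``, can be sketched with a small standalone example (the frame and the ``> 6`` threshold are illustrative assumptions, not drawn from the patch):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar", "foo"], "B": [1, 2, 3, 4, 5]}
)

# filter() keeps every row of each group whose reduction satisfies the
# predicate; here group "foo" sums to 9 (kept) and "bar" to 6 (dropped).
# The result preserves the original row index rather than using the
# group keys, which is why as_index=False has no effect on filtrations.
filtered = df.groupby("A").filter(lambda g: g["B"].sum() > 6)
print(filtered.index.tolist())  # [0, 2, 4]
```

Because the surviving rows retain labels 0, 2, and 4 from the original frame, the group key ``"foo"`` stays an ordinary column in the output.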