diff --git a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
index 298d0c4e0111c..346a5cecf601d 100644
--- a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
+++ b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
@@ -154,11 +154,11 @@ The apply and combine steps are typically done together in pandas.
 
 In the previous example, we explicitly selected the 2 columns first. If
 not, the ``mean`` method is applied to each column containing numerical
-columns:
+columns by passing ``numeric_only=True``:
 
 .. ipython:: python
 
-    titanic.groupby("Sex").mean()
+    titanic.groupby("Sex").mean(numeric_only=True)
 
 It does not make much sense to get the average value of the ``Pclass``.
 If we are only interested in the average age for each gender, the
diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst
index 9ccf191194e19..9916f13e015dd 100644
--- a/doc/source/user_guide/10min.rst
+++ b/doc/source/user_guide/10min.rst
@@ -532,7 +532,7 @@ groups:
 
 .. ipython:: python
 
-    df.groupby("A").sum()
+    df.groupby("A")[["C", "D"]].sum()
 
 Grouping by multiple columns forms a hierarchical index, and again we can
 apply the :meth:`~pandas.core.groupby.GroupBy.sum` function:
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
index f381d72069775..f2d83885df2d0 100644
--- a/doc/source/user_guide/groupby.rst
+++ b/doc/source/user_guide/groupby.rst
@@ -477,7 +477,7 @@ An obvious one is aggregation via the
 .. ipython:: python
 
     grouped = df.groupby("A")
-    grouped.aggregate(np.sum)
+    grouped[["C", "D"]].aggregate(np.sum)
 
     grouped = df.groupby(["A", "B"])
     grouped.aggregate(np.sum)
@@ -492,7 +492,7 @@ changed by using the ``as_index`` option:
     grouped = df.groupby(["A", "B"], as_index=False)
     grouped.aggregate(np.sum)
 
-    df.groupby("A", as_index=False).sum()
+    df.groupby("A", as_index=False)[["C", "D"]].sum()
 
 Note that you could use the ``reset_index`` DataFrame function to achieve the
 same result as the column names are stored in the resulting ``MultiIndex``:
@@ -730,7 +730,7 @@ optimized Cython implementations:
 
 .. ipython:: python
 
-    df.groupby("A").sum()
+    df.groupby("A")[["C", "D"]].sum()
     df.groupby(["A", "B"]).mean()
 
 Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above
@@ -1159,13 +1159,12 @@ Again consider the example DataFrame we've been looking at:
 
 Suppose we wish to compute the standard deviation grouped by the ``A``
 column. There is a slight problem, namely that we don't care about the data in
-column ``B``. We refer to this as a "nuisance" column. If the passed
-aggregation function can't be applied to some columns, the troublesome columns
-will be (silently) dropped. Thus, this does not pose any problems:
+column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance
+columns by specifying ``numeric_only=True``:
 
 .. ipython:: python
 
-    df.groupby("A").std()
+    df.groupby("A").std(numeric_only=True)
 
 Note that ``df.groupby('A').colname.std().`` is more efficient than
 ``df.groupby('A').std().colname``, so if the result of an aggregation function
@@ -1180,7 +1179,14 @@ is only interesting over one column (here ``colname``), it may be filtered
 If you do wish to include decimal or object columns in an aggregation with
 other non-nuisance data types, you must do so explicitly.
 
+.. warning::
+    The automatic dropping of nuisance columns has been deprecated and will be removed
+    in a future version of pandas. If columns are included that cannot be operated
+    on, pandas will instead raise an error. In order to avoid this, either select
+    the columns you wish to operate on or specify ``numeric_only=True``.
+
 .. ipython:: python
+    :okwarning:
 
     from decimal import Decimal
@@ -1304,7 +1310,7 @@ Groupby a specific column with the desired frequency. This is like resampling.
 
 .. ipython:: python
 
-    df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"]).sum()
+    df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"])[["Quantity"]].sum()
 
 You have an ambiguous specification in that you have a named index and a column
 that could be potential groupers.
@@ -1313,9 +1319,9 @@ that could be potential groupers.
 
     df = df.set_index("Date")
     df["Date"] = df.index + pd.offsets.MonthEnd(2)
-    df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"]).sum()
+    df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"])[["Quantity"]].sum()
 
-    df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"]).sum()
+    df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"])[["Quantity"]].sum()
 
 Taking the first rows of each group
diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index a94681924d211..3c08b5a498eea 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -583,7 +583,7 @@ without using a temporary variable.
 .. ipython:: python
 
     bb = pd.read_csv('data/baseball.csv', index_col='id')
-    (bb.groupby(['year', 'team']).sum()
+    (bb.groupby(['year', 'team']).sum(numeric_only=True)
        .loc[lambda df: df['r'] > 100])
diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst
index f9e68b1b39ddc..b24890564d1bf 100644
--- a/doc/source/user_guide/reshaping.rst
+++ b/doc/source/user_guide/reshaping.rst
@@ -414,12 +414,11 @@ We can produce pivot tables from this data very easily:
 
 The result object is a :class:`DataFrame` having potentially hierarchical
 indexes on the rows and columns. If the ``values`` column name is not given, the pivot table
-will include all of the data that can be aggregated in an additional level of
-hierarchy in the columns:
+will include all of the data in an additional level of hierarchy in the columns:
 
 .. ipython:: python
 
-    pd.pivot_table(df, index=["A", "B"], columns=["C"])
+    pd.pivot_table(df[["A", "B", "C", "D", "E"]], index=["A", "B"], columns=["C"])
 
 Also, you can use :class:`Grouper` for ``index`` and ``columns`` keywords. For detail of
 :class:`Grouper`, see :ref:`Grouping with a Grouper specification `.
@@ -432,7 +431,7 @@ calling :meth:`~DataFrame.to_string` if you wish:
 
 .. ipython:: python
 
-    table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
+    table = pd.pivot_table(df, index=["A", "B"], columns=["C"], values=["D", "E"])
     print(table.to_string(na_rep=""))
 
 Note that :meth:`~DataFrame.pivot_table` is also available as an instance method on DataFrame,
@@ -449,7 +448,13 @@ rows and columns:
 
 .. ipython:: python
 
-    table = df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
+    table = df.pivot_table(
+        index=["A", "B"],
+        columns="C",
+        values=["D", "E"],
+        margins=True,
+        aggfunc=np.std
+    )
     table
 
 Additionally, you can call :meth:`DataFrame.stack` to display a pivoted DataFrame
diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst
index 582620d8b6479..c67d028b65b3e 100644
--- a/doc/source/user_guide/timeseries.rst
+++ b/doc/source/user_guide/timeseries.rst
@@ -1821,7 +1821,7 @@ to resample based on datetimelike column in the frame, it can passed to the
         ),
     )
     df
-    df.resample("M", on="date").sum()
+    df.resample("M", on="date")[["a"]].sum()
 
 Similarly, if you instead want to resample by a datetimelike
 level of ``MultiIndex``, its name or location can be passed to the
@@ -1829,7 +1829,7 @@ level of ``MultiIndex``, its name or location can be passed to the
 
 .. ipython:: python
 
-    df.resample("M", level="d").sum()
+    df.resample("M", level="d")[["a"]].sum()
 
 .. _timeseries.iterating-label:
diff --git a/doc/source/whatsnew/v0.18.1.rst b/doc/source/whatsnew/v0.18.1.rst
index f873d320822ae..7d9008fdbdecd 100644
--- a/doc/source/whatsnew/v0.18.1.rst
+++ b/doc/source/whatsnew/v0.18.1.rst
@@ -166,7 +166,7 @@ without using temporary variable.
 .. ipython:: python
 
     bb = pd.read_csv("data/baseball.csv", index_col="id")
-    (bb.groupby(["year", "team"]).sum().loc[lambda df: df.r > 100])
+    (bb.groupby(["year", "team"]).sum(numeric_only=True).loc[lambda df: df.r > 100])
 
 .. _whatsnew_0181.partial_string_indexing:
diff --git a/doc/source/whatsnew/v0.19.0.rst b/doc/source/whatsnew/v0.19.0.rst
index a2bb935c708bc..113bbcf0a05bc 100644
--- a/doc/source/whatsnew/v0.19.0.rst
+++ b/doc/source/whatsnew/v0.19.0.rst
@@ -497,8 +497,8 @@ Other enhancements
             ),
         )
         df
-        df.resample("M", on="date").sum()
-        df.resample("M", level="d").sum()
+        df.resample("M", on="date")[["a"]].sum()
+        df.resample("M", level="d")[["a"]].sum()
 
 - The ``.get_credentials()`` method of ``GbqConnector`` can now first try to fetch `the application default credentials `__. See the docs for more details (:issue:`13577`).
 - The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behavior remains to raising a ``NonExistentTimeError`` (:issue:`13057`)
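The edits above all follow the same two migration patterns for avoiding nuisance columns: either select the columns to aggregate before calling the reduction, or pass ``numeric_only=True``. A minimal sketch of the equivalence on a hypothetical toy frame (the column names here are illustrative, not taken from the docs' datasets):

```python
import pandas as pd

# Hypothetical toy frame: "A" is the grouping key, "B" is a
# non-numeric "nuisance" column, "C" is the numeric data.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)

# Pattern 1: explicitly select the columns to aggregate.
by_selection = df.groupby("A")[["C"]].sum()

# Pattern 2: restrict the reduction to numeric columns.
by_numeric_only = df.groupby("A").sum(numeric_only=True)

# Both drop "B" explicitly instead of relying on silent exclusion.
print(by_selection)
print(by_numeric_only)
```

Either form makes the intent visible in the doc examples, which is the point of the deprecation: pandas will stop silently discarding columns it cannot aggregate.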