DOC: Fix deprecation warnings in docs for groupby nuisance columns #47065

Merged 2 commits on May 19, 2022
@@ -154,11 +154,11 @@ The apply and combine steps are typically done together in pandas.

In the previous example, we explicitly selected the 2 columns first. If
not, the ``mean`` method is applied to each column containing numerical
columns:
columns by passing ``numeric_only=True``:

.. ipython:: python

titanic.groupby("Sex").mean()
titanic.groupby("Sex").mean(numeric_only=True)

It does not make much sense to get the average value of the ``Pclass``.
If we are only interested in the average age for each gender, the
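The `numeric_only=True` idiom this hunk introduces can be sketched on a toy frame (the real titanic CSV is not assumed to be available here; the column names and values below are invented):

```python
import pandas as pd

# Toy stand-in for the titanic data used in the docs: one string
# column and two numeric ones.
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# numeric_only=True restricts the aggregation to numeric columns, so
# no object column has to be silently dropped (the behavior this PR
# documents as deprecated).
means = df.groupby("Sex").mean(numeric_only=True)
print(means)
```

The grouping key becomes the index, and only `Age` and `Fare` appear as result columns.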
2 changes: 1 addition & 1 deletion doc/source/user_guide/10min.rst
@@ -532,7 +532,7 @@ groups:

.. ipython:: python

df.groupby("A").sum()
df.groupby("A")[["C", "D"]].sum()

Grouping by multiple columns forms a hierarchical index, and again we can
apply the :meth:`~pandas.core.groupby.GroupBy.sum` function:
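The explicit column selection used in this hunk can be illustrated with a small frame in the spirit of the 10 minutes guide (the column names follow its A/B/C/D convention; the values are invented):

```python
import pandas as pd

# "A" and "B" are object columns, "C" and "D" are numeric.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10, 20, 30, 40],
    }
)

# Selecting the numeric columns up front makes the intent explicit and
# avoids the deprecated silent dropping of the non-numeric column "B".
result = df.groupby("A")[["C", "D"]].sum()
print(result)
```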
26 changes: 16 additions & 10 deletions doc/source/user_guide/groupby.rst
@@ -477,7 +477,7 @@ An obvious one is aggregation via the
.. ipython:: python

grouped = df.groupby("A")
grouped.aggregate(np.sum)
grouped[["C", "D"]].aggregate(np.sum)

grouped = df.groupby(["A", "B"])
grouped.aggregate(np.sum)
@@ -492,7 +492,7 @@ changed by using the ``as_index`` option:
grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate(np.sum)

df.groupby("A", as_index=False).sum()
df.groupby("A", as_index=False)[["C", "D"]].sum()

Note that you could use the ``reset_index`` DataFrame function to achieve the
same result as the column names are stored in the resulting ``MultiIndex``:
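The equivalence this note describes can be sketched as follows (the frame is hypothetical, with the guide's usual A/B/C/D column names):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "bar"],
        "B": ["one", "two", "one"],
        "C": [1.0, 2.0, 3.0],
        "D": [4.0, 5.0, 6.0],
    }
)

# as_index=False keeps the group keys as regular columns; calling
# reset_index() on the MultiIndexed result produces the same frame.
flat = df.groupby(["A", "B"], as_index=False)[["C", "D"]].sum()
via_reset = df.groupby(["A", "B"])[["C", "D"]].sum().reset_index()
assert flat.equals(via_reset)
```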
@@ -730,7 +730,7 @@ optimized Cython implementations:

.. ipython:: python

df.groupby("A").sum()
df.groupby("A")[["C", "D"]].sum()
df.groupby(["A", "B"]).mean()

Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above
@@ -1159,13 +1159,12 @@ Again consider the example DataFrame we've been looking at:

Suppose we wish to compute the standard deviation grouped by the ``A``
column. There is a slight problem, namely that we don't care about the data in
column ``B``. We refer to this as a "nuisance" column. If the passed
aggregation function can't be applied to some columns, the troublesome columns
will be (silently) dropped. Thus, this does not pose any problems:
column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance
columns by specifying ``numeric_only=True``:

.. ipython:: python

df.groupby("A").std()
df.groupby("A").std(numeric_only=True)

Note that ``df.groupby('A').colname.std().`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
@@ -1180,7 +1179,14 @@ is only interesting over one column (here ``colname``), it may be filtered
If you do wish to include decimal or object columns in an aggregation with
other non-nuisance data types, you must do so explicitly.

.. warning::
The automatic dropping of nuisance columns has been deprecated and will be removed
in a future version of pandas. If columns are included that cannot be operated
on, pandas will instead raise an error. In order to avoid this, either select
the columns you wish to operate on or specify ``numeric_only=True``.
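The two remedies named in the warning behave identically on numeric data; a minimal sketch with an invented frame (column names are assumptions):

```python
import pandas as pd

# "B" is a non-numeric "nuisance" column relative to std().
df = pd.DataFrame({"A": ["x", "x", "y"], "B": ["a", "b", "c"], "C": [1, 2, 3]})

# Remedy 1: select the columns to operate on explicitly.
by_selection = df.groupby("A")[["C"]].std()

# Remedy 2: restrict the aggregation to numeric columns.
by_numeric_only = df.groupby("A").std(numeric_only=True)

# Both remedies give the same result here; neither emits the
# deprecation warning about dropped columns.
assert by_selection.equals(by_numeric_only)
```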

.. ipython:: python
:okwarning:
Member Author: Leaving this example in with an okwarning because it demonstrates the note above - I think this should be in the docs until the behavior is fully removed.


from decimal import Decimal

@@ -1304,7 +1310,7 @@ Groupby a specific column with the desired frequency. This is like resampling.

.. ipython:: python

df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"]).sum()
df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"])[["Quantity"]].sum()

You have an ambiguous specification in that you have a named index and a column
that could be potential groupers.
@@ -1313,9 +1319,9 @@ that could be potential groupers.

df = df.set_index("Date")
df["Date"] = df.index + pd.offsets.MonthEnd(2)
df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"]).sum()
df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"])[["Quantity"]].sum()

df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"]).sum()
df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"])[["Quantity"]].sum()


Taking the first rows of each group
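The `Grouper`-plus-column-selection pattern from this hunk can be sketched with made-up data (the docs use `freq="1M"`; `"MS"`, month start, is used here only to stay version-agnostic about the month-end alias):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2013-01-01", "2013-01-15", "2013-02-03"]),
        "Buyer": ["Carl", "Carl", "Mark"],
        "Quantity": [5, 3, 9],
    }
)

# Selecting [["Quantity"]] keeps the aggregation away from the
# non-numeric "Buyer" column (which is also a grouping key here).
out = df.groupby([pd.Grouper(freq="MS", key="Date"), "Buyer"])[["Quantity"]].sum()
print(out)
```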
2 changes: 1 addition & 1 deletion doc/source/user_guide/indexing.rst
@@ -583,7 +583,7 @@ without using a temporary variable.
.. ipython:: python

bb = pd.read_csv('data/baseball.csv', index_col='id')
(bb.groupby(['year', 'team']).sum()
(bb.groupby(['year', 'team']).sum(numeric_only=True)
.loc[lambda df: df['r'] > 100])


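The chained form in this hunk works the same on any frame; since the baseball CSV is not assumed available, a tiny stand-in (invented rows, same column names) shows the shape of the chain:

```python
import pandas as pd

bb = pd.DataFrame(
    {
        "year": [2007, 2007, 2008],
        "team": ["BOS", "BOS", "NYA"],
        "player": ["a", "b", "c"],
        "r": [60, 50, 90],
    }
)

# numeric_only=True drops the object column "player" from the sum;
# the lambda then filters groups without a temporary variable.
result = (
    bb.groupby(["year", "team"]).sum(numeric_only=True)
    .loc[lambda df: df["r"] > 100]
)
print(result)
```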
15 changes: 10 additions & 5 deletions doc/source/user_guide/reshaping.rst
@@ -414,12 +414,11 @@ We can produce pivot tables from this data very easily:

The result object is a :class:`DataFrame` having potentially hierarchical indexes on the
rows and columns. If the ``values`` column name is not given, the pivot table
will include all of the data that can be aggregated in an additional level of
hierarchy in the columns:
will include all of the data in an additional level of hierarchy in the columns:

.. ipython:: python

pd.pivot_table(df, index=["A", "B"], columns=["C"])
pd.pivot_table(df[["A", "B", "C", "D", "E"]], index=["A", "B"], columns=["C"])

Also, you can use :class:`Grouper` for ``index`` and ``columns`` keywords. For detail of :class:`Grouper`, see :ref:`Grouping with a Grouper specification <groupby.specify>`.

@@ -432,7 +431,7 @@ calling :meth:`~DataFrame.to_string` if you wish:

.. ipython:: python

table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
table = pd.pivot_table(df, index=["A", "B"], columns=["C"], values=["D", "E"])
print(table.to_string(na_rep=""))

Note that :meth:`~DataFrame.pivot_table` is also available as an instance method on DataFrame,
@@ -449,7 +448,13 @@ rows and columns:

.. ipython:: python

table = df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
table = df.pivot_table(
index=["A", "B"],
columns="C",
values=["D", "E"],
margins=True,
aggfunc=np.std
)
table

Additionally, you can call :meth:`DataFrame.stack` to display a pivoted DataFrame
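Passing `values=` explicitly, as these hunks do, can be sketched on a minimal frame echoing the guide's A/B/C/D/E layout (the data itself is invented):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two"],
        "B": ["x", "y", "x", "y"],
        "C": ["foo", "foo", "bar", "bar"],
        "D": [1.0, 2.0, 3.0, 4.0],
        "E": [5.0, 6.0, 7.0, 8.0],
    }
)

# values=["D", "E"] limits the aggregation to numeric columns, so
# pivot_table never has to drop anything silently. The result's
# columns are a MultiIndex: values on top, "C" labels beneath.
table = pd.pivot_table(df, index=["A", "B"], columns="C", values=["D", "E"])
print(table)
```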
4 changes: 2 additions & 2 deletions doc/source/user_guide/timeseries.rst
@@ -1821,15 +1821,15 @@ to resample based on a datetimelike column in the frame, it can be passed to the
),
)
df
df.resample("M", on="date").sum()
df.resample("M", on="date")[["a"]].sum()

Similarly, if you instead want to resample by a datetimelike
level of ``MultiIndex``, its name or location can be passed to the
``level`` keyword.

.. ipython:: python

df.resample("M", level="d").sum()
df.resample("M", level="d")[["a"]].sum()

.. _timeseries.iterating-label:

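The column selection after `resample` shown here can be sketched with invented data (the docs use `freq="M"`; `"MS"`, month start, is used below only to avoid the version-dependent month-end alias):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-02-01"]),
        "a": [1, 2, 3],
    }
)

# Selecting [["a"]] keeps the datetimelike column "date" itself
# out of the aggregation.
monthly = df.resample("MS", on="date")[["a"]].sum()
print(monthly)
```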
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.18.1.rst
@@ -166,7 +166,7 @@ without using a temporary variable.
.. ipython:: python

bb = pd.read_csv("data/baseball.csv", index_col="id")
(bb.groupby(["year", "team"]).sum().loc[lambda df: df.r > 100])
(bb.groupby(["year", "team"]).sum(numeric_only=True).loc[lambda df: df.r > 100])

.. _whatsnew_0181.partial_string_indexing:

4 changes: 2 additions & 2 deletions doc/source/whatsnew/v0.19.0.rst
@@ -497,8 +497,8 @@ Other enhancements
),
)
df
df.resample("M", on="date").sum()
df.resample("M", level="d").sum()
df.resample("M", on="date")[["a"]].sum()
df.resample("M", level="d")[["a"]].sum()

- The ``.get_credentials()`` method of ``GbqConnector`` can now first try to fetch `the application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__. See the docs for more details (:issue:`13577`).
- The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behavior remains raising a ``NonExistentTimeError`` (:issue:`13057`)