API: categorical grouping will no longer return the cartesian product #20583
Changes from 3 commits
fa532b6
144a63d
19c9cf7
7ae10ba
bdb7ad3
bdf7525
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -991,24 +991,24 @@ is only interesting over one column (here ``colname``), it may be filtered | |
|
||
.. _groupby.observed: | ||
|
||
observed hanlding | ||
~~~~~~~~~~~~~~~~~ | ||
Handling of (un)observed Categorical values | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword | ||
controls whether to return a cartesian product of all possible groupers values (``observed=False``) or only those | ||
that are observed groupers (``observed=True``). The ``observed`` keyword will default to ``True`` in the future. | ||
that are observed groupers (``observed=True``). | ||
Review comment: "or only those that are observed groupers" -> "or only the observed categories"
||
|
||
Show only the observed values: | ||
Show all values: | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count() | ||
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count() | ||
Review comment: I would maybe just create
||
|
||
Show all values: | ||
Show only the observed values: | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count() | ||
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count() | ||
|
||
The returned dtype of the grouped will *always* include *all* of the catergories that were grouped. | ||
Review comment: catergories -> categories
||
|
||
|
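A minimal sketch of the two modes documented in this hunk, assuming pandas 0.23.0 or later (the version in which ``groupby`` accepts ``observed``). The variable names are illustrative and not part of the PR.

```python
import pandas as pd

# One observed category ("a") and one declared but unobserved category ("b").
cat = pd.Categorical(["a", "a", "a"], categories=["a", "b"])
s = pd.Series([1, 1, 1])

# observed=False: every declared category gets a row, unobserved ones count 0.
all_counts = s.groupby(cat, observed=False).count()

# observed=True: only categories that actually occur in the data get a row.
observed_counts = s.groupby(cat, observed=True).count()

print(all_counts)
print(observed_counts)

# The index dtype still carries all categories, even when rows are dropped.
print(observed_counts.index.dtype)
```

In both cases the result index is categorical and its dtype lists both ``a`` and ``b``, which is what the last sentence of the documented section describes.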
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our | |
|
||
.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/ | ||
|
||
.. _whatsnew_0230.enhancements.categorical_grouping: | ||
|
||
Categorical Groupers has gained an observed keyword | ||
Review comment: has -> have? Because "categorical Groupers" is plural right?
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for | ||
Review comment: To repeat my previous comment: I would not use the "cartesian product" to introduce this. The actual change is about whether to include unobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).
Review comment: I didn't change this on purpose, this is more correct.
Review comment: "Cartesian product" really only makes sense in the 2 or more case, right? But you say "1 or more" above. I would phrase it as "Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categories, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in high memory usage."
Review comment: Yep, the explanation of Tom is exactly what I meant. @jreback I have no problem at all with that you don't agree with a comment (it would be strange otherwise :-)) and thus not update for it, but can you then answer to that comment noting that? Otherwise I cannot know that I should not repeat a comment (or that I shouldn't get annoyed with my comments being ignored :))
||
each grouper, not just the observed values.``.groupby()`` has gained the ``observed`` keyword to toggle this behavior. The default remains backward | ||
compatible (generate a cartesian product). (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`) | ||
|
||
|
||
.. ipython:: python | ||
|
||
cat1 = pd.Categorical(["a", "a", "b", "b"], | ||
categories=["a", "b", "z"], ordered=True) | ||
cat2 = pd.Categorical(["c", "d", "c", "d"], | ||
categories=["c", "d", "y"], ordered=True) | ||
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]}) | ||
df['C'] = ['foo', 'bar'] * 2 | ||
df | ||
|
||
To show all values, the previous behavior: | ||
|
||
.. ipython:: python | ||
|
||
df.groupby(['A', 'B', 'C'], observed=False).count() | ||
|
||
|
||
To show only observed values: | ||
|
||
.. ipython:: python | ||
|
||
df.groupby(['A', 'B', 'C'], observed=True).count() | ||
|
||
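To make the size difference concrete, here is a hedged sketch that groups by the two categorical columns only. Column ``C`` from the example above is left out to keep the arithmetic simple, and the names ``full`` and ``seen`` are illustrative, not part of the PR. It assumes a pandas version that accepts ``observed`` (0.23.0 or later).

```python
import pandas as pd

cat1 = pd.Categorical(["a", "a", "b", "b"],
                      categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
                      categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# observed=False: one row per combination of declared categories (3 x 3).
full = df.groupby(["A", "B"], observed=False)["values"].count()

# observed=True: one row per combination that occurs in the data.
seen = df.groupby(["A", "B"], observed=True)["values"].count()

print(len(full))  # 9 combinations, 5 of them with a count of 0
print(len(seen))  # 4 combinations
```

With more groupers or more categories, the ``observed=False`` result grows as the product of the category counts, which is the memory concern raised in the review discussion.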
For pivotting operations, this behavior is *already* controlled by the ``dropna`` keyword: | ||
|
||
.. ipython:: python | ||
|
||
cat1 = pd.Categorical(["a", "a", "b", "b"], | ||
categories=["a", "b", "z"], ordered=True) | ||
cat2 = pd.Categorical(["c", "d", "c", "d"], | ||
categories=["c", "d", "y"], ordered=True) | ||
df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]}) | ||
df | ||
|
||
.. ipython:: python | ||
|
||
pd.pivot_table(df, values='values', index=['A', 'B'], | ||
dropna=True) | ||
pd.pivot_table(df, values='values', index=['A', 'B'], | ||
dropna=False) | ||
|
||
|
||
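A rough sketch of the ``dropna`` comparison above, with row counts made explicit. One assumption that is not in the diff: ``observed=False`` is passed to ``pivot_table`` explicitly, a keyword that only exists in later pandas releases, so that the result does not depend on that release's default. The names ``dropped`` and ``kept`` are illustrative.

```python
import pandas as pd

cat1 = pd.Categorical(["a", "a", "b", "b"],
                      categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
                      categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# dropna=True: rows whose entries are all NaN (unobserved combinations) are dropped.
# observed=False is an assumption added here, not part of the PR's example.
dropped = pd.pivot_table(df, values="values", index=["A", "B"],
                         dropna=True, observed=False)

# dropna=False: every combination of categories is kept, filled with NaN.
kept = pd.pivot_table(df, values="values", index=["A", "B"],
                      dropna=False, observed=False)

print(len(dropped))  # 4
print(len(kept))     # 9
```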
.. _whatsnew_0230.enhancements.other: | ||
|
||
Other Enhancements | ||
|
@@ -527,68 +579,6 @@ If you wish to retain the old behavior while using Python >= 3.6, you can use | |
'Taxes': -200, | ||
'Net result': 300}).sort_index() | ||
|
||
.. _whatsnew_0230.api_breaking.categorical_grouping: | ||
|
||
Categorical Groupers will now require passing the observed keyword | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for | ||
each grouper, not just the observed values.``.groupby()`` has gained the ``observed`` keyword to toggle this behavior. The default remains backward | ||
compatible (generate a cartesian product). Pandas will show a ``FutureWarning`` if the ``observed`` keyword is not passed; the default will | ||
change to ``observed=True`` in the future. (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`) | ||
|
||
|
||
.. ipython:: python | ||
|
||
cat1 = pd.Categorical(["a", "a", "b", "b"], | ||
categories=["a", "b", "z"], ordered=True) | ||
cat2 = pd.Categorical(["c", "d", "c", "d"], | ||
categories=["c", "d", "y"], ordered=True) | ||
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]}) | ||
df['C'] = ['foo', 'bar'] * 2 | ||
df | ||
|
||
``observed`` must now be passed when grouping by categoricals, or a | ||
``FutureWarning`` will show: | ||
|
||
.. ipython:: python | ||
:okwarning: | ||
|
||
df.groupby(['A', 'B', 'C']).count() | ||
|
||
|
||
To suppress the warning, with previous Behavior (show all values): | ||
|
||
.. ipython:: python | ||
|
||
df.groupby(['A', 'B', 'C'], observed=False).count() | ||
|
||
|
||
Future Behavior (show only observed values): | ||
|
||
.. ipython:: python | ||
|
||
df.groupby(['A', 'B', 'C'], observed=True).count() | ||
|
||
For pivotting operations, this behavior is *already* controlled by the ``dropna`` keyword: | ||
|
||
.. ipython:: python | ||
|
||
cat1 = pd.Categorical(["a", "a", "b", "b"], | ||
categories=["a", "b", "z"], ordered=True) | ||
cat2 = pd.Categorical(["c", "d", "c", "d"], | ||
categories=["c", "d", "y"], ordered=True) | ||
df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]}) | ||
df | ||
|
||
.. ipython:: python | ||
|
||
pd.pivot_table(df, values='values', index=['A', 'B'], | ||
dropna=True) | ||
pd.pivot_table(df, values='values', index=['A', 'B'], | ||
dropna=False) | ||
|
||
|
||
.. _whatsnew_0230.api_breaking.deprecate_panel: | ||
|
||
Deprecate Panel | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6633,10 +6633,10 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, | |
reduce the dimensionality of the return type if possible, | ||
otherwise return a consistent type | ||
observed : boolean, default None | ||
if True: only show observed values for categorical groupers | ||
if False: show all values for categorical groupers | ||
if True: only show observed values for categorical groupers. | ||
Review comment: capital If (below as well)
Review comment: Also, can you start this explanation with noting this keyword is only when grouping by categorical values?
||
if False: show all values for categorical groupers. | ||
if None: if any categorical groupers, show a FutureWarning, | ||
default to False | ||
default to False. | ||
Review comment: no indentation for rst formatting
||
|
||
.. versionadded:: 0.23.0 | ||
|
||
|
Review comment: we don't use "grouper" as terminology in our documentation (except for the pd.Grouper object), so I would write "groupby key" or "to group by". Also "multipler" -> "multiple".