Categorical type #16015
Changes from 4 commits
790cd42
ed5c814
416d1d7
e6c05a0
41172ce
141e509
43f90cc
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to | |
df["B"] = raw_cat | ||
df | ||
|
||
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``: | ||
Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of | ||
|
||
1. categories are inferred from the data | ||
2. categories are unordered. | ||
|
||
To control those behaviors, instead of passing ``'category'``, use an instance | ||
of :class:`~pandas.api.types.CategoricalDtype`. | ||
|
||
.. ipython:: python | ||
|
||
s = pd.Series(["a","b","c","a"]) | ||
s_cat = s.astype("category", categories=["b","c","d"], ordered=False) | ||
from pandas.api.types import CategoricalDtype | ||
|
||
s = pd.Series(["a", "b", "c", "a"]) | ||
cat_type = CategoricalDtype(categories=["b", "c", "d"], | ||
ordered=True) | ||
s_cat = s.astype(cat_type) | ||
s_cat | ||
|
||
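A minimal sketch of what the `CategoricalDtype` spelling does with a value that is absent from the requested categories. It assumes a pandas version where `CategoricalDtype` is importable from `pandas.api.types` (0.21.0 or later, per this PR); the helper names `missing_mask`, `category_list` and `is_ordered` are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = s.astype(cat_type)

# "a" is not one of the requested categories, so it becomes missing (NaN).
missing_mask = s_cat.isna().tolist()

# The categories come from the dtype, not from the data, so "d" is kept
# even though it never occurs in the values.
category_list = list(s_cat.cat.categories)
is_ordered = s_cat.cat.ordered
```

This is the main practical difference from ``dtype='category'``, which would have inferred the categories ``a``, ``b``, ``c`` from the data.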
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`: | ||
|
@@ -133,6 +143,73 @@ constructor to save the factorize step during normal constructor mode: | |
splitter = np.random.choice([0,1], 5, p=[0.5,0.5]) | ||
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"])) | ||
|
||
Review comment: add a ref |
||
.. _categorical.categoricaldtype: | ||
|
||
CategoricalDtype | ||
Review comment: add a sub-section ref here |
||
---------------- | ||
|
||
.. versionchanged:: 0.21.0 | ||
|
||
Review comment: maybe add these as bullet points |
||
A categorical's type is fully described by | ||
|
||
1. ``categories``: a sequence of unique values and no missing values | ||
2. ``ordered``: a boolean | ||
|
||
This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`. | ||
The ``categories`` argument is optional, which implies that the actual categories | ||
should be inferred from whatever is present in the data when the | ||
:class:`pandas.Categorical` is created. The categories are assumed to be unordered | ||
by default. | ||
|
||
.. ipython:: python | ||
|
||
Review comment: show the import |
||
from pandas.api.types import CategoricalDtype | ||
|
||
CategoricalDtype(['a', 'b', 'c']) | ||
CategoricalDtype(['a', 'b', 'c'], ordered=True) | ||
CategoricalDtype() | ||
|
||
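As a rough illustration of the two pieces of information a `CategoricalDtype` stores, the sketch below reads them back through the ``categories`` and ``ordered`` attributes named in this PR. The variable names are illustrative, not part of pandas.

```python
from pandas.api.types import CategoricalDtype

unordered = CategoricalDtype(["a", "b", "c"])
ordered = CategoricalDtype(["a", "b", "c"], ordered=True)
inferred = CategoricalDtype()

# The categories are stored on the dtype itself.
unordered_categories = list(unordered.categories)

# Orderedness is a plain flag on the dtype.
ordered_flag = ordered.ordered

# With no categories given, nothing is stored; they are inferred later
# from whatever data the dtype is applied to.
inferred_categories = inferred.categories
```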
A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas | ||
expects a ``dtype``, for example in :func:`pandas.read_csv`, | ||
:meth:`pandas.DataFrame.astype`, or the ``Series`` constructor. | ||
|
||
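The three entry points mentioned above can be sketched as follows. The column name ``size``, the category labels and the in-memory CSV text are made up for the example; only `read_csv`, `DataFrame.astype` and the `Series` constructor come from the text.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordered dtype used for all three entry points.
size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)

# 1. Series constructor
from_constructor = pd.Series(["small", "large"], dtype=size_type)

# 2. read_csv, with a per-column dtype mapping
csv_text = "size\nsmall\nlarge\n"
from_csv = pd.read_csv(io.StringIO(csv_text), dtype={"size": size_type})

# 3. DataFrame.astype, again with a per-column mapping
frame = pd.DataFrame({"size": ["medium", "small"]})
from_astype = frame.astype({"size": size_type})
```

In each case the resulting column should carry the same dtype object's categories and orderedness, regardless of which values happen to appear in the data.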
.. note:: | ||
|
||
As a convenience, you can use the string ``'category'`` in place of a | ||
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of | ||
the categories being unordered, and equal to the set of values present in the | ||
array. In other words, ``dtype='category'`` is equivalent to | ||
``dtype=CategoricalDtype()``. | ||
|
||
Equality Semantics | ||
~~~~~~~~~~~~~~~~~~ | ||
|
||
Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal | ||
whenever they have the same categories and orderedness. When comparing two | ||
unordered categoricals, the order of the ``categories`` is not considered. | ||
|
||
.. ipython:: python | ||
|
||
c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False) | ||
|
||
# Equal, since order is not considered when ordered=False | ||
Review comment: add a blank line before comments |
||
c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False) | ||
|
||
# Unequal, since the second CategoricalDtype is ordered | ||
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True) | ||
|
||
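The rule above only relaxes category order for unordered dtypes. A small sketch of the contrasting case, where both dtypes are ordered and the category order therefore matters (variable names are illustrative):

```python
from pandas.api.types import CategoricalDtype

unordered_abc = CategoricalDtype(["a", "b", "c"], ordered=False)
unordered_cba = CategoricalDtype(["c", "b", "a"], ordered=False)
ordered_abc = CategoricalDtype(["a", "b", "c"], ordered=True)
ordered_cba = CategoricalDtype(["c", "b", "a"], ordered=True)

# Same set of categories, unordered: category order is ignored.
same_when_unordered = unordered_abc == unordered_cba

# Same set of categories, ordered: the order is part of the type.
same_when_ordered = ordered_abc == ordered_cba
```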
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``: | ||
|
||
.. ipython:: python | ||
|
||
c1 == 'category' | ||
|
||
.. warning:: | ||
Review comment: this is confusing, better as a comment in the code itself.
Review comment: You're saying not in the user-docs at all? I think it's worthwhile including precisely because it's so confusing (I also have a comment in
Review comment: I think it's a very subtle point and will be lost on the user. If you can show a case where the user might be confused then the warning would be ok. |
||
|
||
Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``, | ||
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``, | ||
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``.
Review comment: so you should say |
||
|
||
Description | ||
----------- | ||
|
||
|
@@ -184,7 +261,7 @@ It's also possible to pass in the categories in a specific order: | |
|
||
.. ipython:: python | ||
|
||
s = pd.Series(list('babc')).astype('category', categories=list('abcd')) | ||
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd'))) | ||
s | ||
|
||
# categories | ||
|
@@ -297,7 +374,9 @@ meaning and certain operations are possible. If the categorical is unordered, `` | |
|
||
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False)) | ||
s.sort_values(inplace=True) | ||
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True) | ||
s = pd.Series(["a","b","c","a"]).astype( | ||
CategoricalDtype(ordered=True) | ||
) | ||
s.sort_values(inplace=True) | ||
s | ||
s.min(), s.max() | ||
|
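To make the effect of orderedness on ``min()`` and ``max()`` concrete, here is a hedged sketch using a deliberately reversed category order (an assumption chosen for illustration; the diff above lets the order be inferred). It also shows the unordered case raising ``TypeError``.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Category order is c < b < a, the reverse of lexical order.
reverse_order = CategoricalDtype(["c", "b", "a"], ordered=True)
s = pd.Series(["a", "b", "c", "a"]).astype(reverse_order)

# min/max follow the category order, not the lexical order of the values.
smallest, largest = s.min(), s.max()

# An unordered categorical has no notion of smallest or largest.
unordered = pd.Series(["a", "b", "c", "a"]).astype("category")
try:
    unordered.min()
    unordered_min_raised = False
except TypeError:
    unordered_min_raised = True
```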
@@ -397,9 +476,15 @@ categories or a categorical with any list-like object, will raise a TypeError. | |
|
||
.. ipython:: python | ||
|
||
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True) | ||
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True) | ||
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True) | ||
cat = pd.Series([1,2,3]).astype( | ||
CategoricalDtype([3, 2, 1], ordered=True) | ||
) | ||
cat_base = pd.Series([2,2,2]).astype( | ||
CategoricalDtype([3, 2, 1], ordered=True) | ||
) | ||
cat_base2 = pd.Series([2,2,2]).astype( | ||
CategoricalDtype(ordered=True) | ||
) | ||
|
||
cat | ||
cat_base | ||
|
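A short sketch of how the three series defined in this hunk behave under comparison, assuming the same definitions as above. Comparing ``cat`` with ``cat_base`` works because both share one dtype; comparing with ``cat_base2`` is expected to raise ``TypeError`` because its inferred categories differ.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

reversed_type = CategoricalDtype([3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(reversed_type)
cat_base = pd.Series([2, 2, 2]).astype(reversed_type)
# Categories are inferred from the data here, so they are just [2].
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Elementwise comparison uses the category order 3 < 2 < 1.
greater = (cat > cat_base).tolist()

# Different categories: the comparison is rejected.
try:
    cat > cat_base2
    mismatch_raised = False
except TypeError:
    mismatch_raised = True
```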
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -830,8 +830,10 @@ The left frame. | |
|
||
.. ipython:: python | ||
|
||
from pandas.api.types import CategoricalDtype | ||
|
||
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,))) | ||
X = X.astype('category', categories=['foo', 'bar']) | ||
Review comment: import first |
||
X = X.astype(CategoricalDtype(categories=['foo', 'bar'])) | ||
|
||
left = pd.DataFrame({'X': X, | ||
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))}) | ||
|
@@ -842,8 +844,11 @@ The right frame. | |
|
||
.. ipython:: python | ||
|
||
Review comment: same |
||
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']), | ||
'Z': [1, 2]}) | ||
right = pd.DataFrame({ | ||
'X': pd.Series(['foo', 'bar'], | ||
dtype=CategoricalDtype(['foo', 'bar'])), | ||
'Z': [1, 2] | ||
}) | ||
right | ||
right.dtypes | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,8 @@ users upgrade to this version. | |
Highlights include: | ||
|
||
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. | ||
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying | ||
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`. | ||
|
||
Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating. | ||
|
||
|
@@ -89,6 +91,30 @@ This does not raise any obvious exceptions, but also does not create a new colum | |
|
||
Setting a list-like data structure into a new attribute now raises a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`. | ||
|
||
.. _whatsnew_0210.enhancements.categorical_dtype: | ||
|
||
``CategoricalDtype`` for specifying categoricals | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:class:`pandas.api.types.CategoricalDtype` has been added to the public API and | ||
expanded to include the ``categories`` and ``ordered`` attributes. A | ||
``CategoricalDtype`` can be used to specify the set of categories and | ||
orderedness of an array, independent of the data themselves. This can be useful, | ||
e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`): | ||
Review comment: there is 1 more issue you listed in the top of the PR |
||
|
||
.. ipython:: python | ||
|
||
from pandas.api.types import CategoricalDtype | ||
|
||
s = pd.Series(['a', 'b', 'c', 'a']) # strings | ||
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True) | ||
s.astype(dtype) | ||
|
||
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a | ||
``Series`` with categorical type will now return an instance of ``CategoricalDtype``. | ||
|
||
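The statement about ``.dtype`` can be checked with a few lines. This is only a sketch with made-up sample values, assuming pandas 0.21.0 or later:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

values = ["a", "b", "a"]
cat = pd.Categorical(values)
idx = pd.CategoricalIndex(values)
ser = pd.Series(values, dtype="category")

# Each container exposes a CategoricalDtype instance through .dtype.
all_categorical_dtypes = all(
    isinstance(obj.dtype, CategoricalDtype) for obj in (cat, idx, ser)
)

# Same inferred categories and orderedness, so the dtypes compare equal.
dtypes_match = cat.dtype == ser.dtype

# The string alias still compares equal to the dtype instance.
string_alias_match = ser.dtype == "category"
```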
See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more. | ||
|
||
.. _whatsnew_0210.enhancements.other: | ||
Review comment: looks like tslib.html was included somehow? |
||
|
||
Other Enhancements | ||
|
Review comment: yep, we should use the auto* things more readily in other places. Maybe make an issue about this.
Review comment: I don't think we in general need to do this, as for most functions/methods, we already have the generated pages to link to
Review comment: I just did this instead of autosummary since there are a bunch of unrelated methods that are just there for NumPy duck-typing.