Skip to content

Categorical type #16015

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 23, 2017
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion doc/source/advanced.rst
Original file line number Diff line number Diff line change
Expand Up @@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup

.. ipython:: python

from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6),
'B': list('aabbca')})
df['B'] = df['B'].astype('category', categories=list('cab'))
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df
df.dtypes
df.B.cat.categories
Expand Down
5 changes: 4 additions & 1 deletion doc/source/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
Categorical
~~~~~~~~~~~

If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
.. autoclass:: api.types.CategoricalDtype
:members: categories, ordered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, we should use the auto* things more readily in other places. Maybe make an issue about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we in general need to do this, as for most functions/methods, we already have the generated pages to link to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did this instead of autosummary since there are a bunch of unrelated methods that are just there for NumPy duck-typing.


If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
following usable methods and properties:

Expand Down
101 changes: 93 additions & 8 deletions doc/source/categorical.rst
Original file line number Diff line number Diff line change
Expand Up @@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
df["B"] = raw_cat
df

You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of

1. categories are inferred from the data
2. categories are unordered.

To control those behaviors, instead of passing ``'category'``, use an instance
of :class:`~pandas.api.types.CategoricalDtype`.

.. ipython:: python

s = pd.Series(["a","b","c","a"])
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"],
ordered=True)
s_cat = s.astype(cat_type)
s_cat

Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
Expand Down Expand Up @@ -133,6 +143,73 @@ constructor to save the factorize step during normal constructor mode:
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a ref

.. _categorical.categoricaldtype:

CategoricalDtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a sub-section ref here

----------------

.. versionchanged:: 0.21.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add these as bullet points

A categorical's type is fully described by

1. ``categories``: a sequence of unique values and no missing values
2. ``ordered``: a boolean

This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
The ``categories`` argument is optional, which implies that the actual categories
should be inferred from whatever is present in the data when the
:class:`pandas.Categorical` is created. The categories are assumed to be unordered
by default.

.. ipython:: python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

show the import

from pandas.api.types import CategoricalDtype

CategoricalDtype(['a', 'b', 'c'])
CategoricalDtype(['a', 'b', 'c'], ordered=True)
CategoricalDtype()

A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
expects a `dtype`. For example :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or in the Series constructor.

.. note::

As a convenience, you can use the string ``'category'`` in place of a
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
the categories being unordered, and equal to the set values present in the
array. In other words, ``dtype='category'`` is equivalent to
``dtype=CategoricalDtype()``.

Equality Semantics
~~~~~~~~~~~~~~~~~~

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and orderedness. When comparing two
unordered categoricals, the order of the ``categories`` is not considered

.. ipython:: python

c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)

# Equal, since order is not considered when ordered=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a blank line before comments

c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)

# Unequal, since the second CategoricalDtype is ordered
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)

All instances of ``CategoricalDtype`` compare equal to the string ``'category'``

.. ipython:: python

c1 == 'category'

.. warning::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is confusing, better as a comment in the code itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're saying not in the user-docs at all? I think it's worthwhile including precisely becuas it's so confusing (I also have a comment in CategoricalDtype.__eq__ explaining this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its a very subtle point and will be lost on the user. If you can show a case where the user might be confused then the warning would be ok.


Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances ``CategoricalDtype`` compare equal to ``'`category'``,
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you should say CategoricalDtype(None, False) at the end.


Description
-----------

Expand Down Expand Up @@ -184,7 +261,7 @@ It's also possible to pass in the categories in a specific order:

.. ipython:: python

s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
s

# categories
Expand Down Expand Up @@ -297,7 +374,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``

s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
s.sort_values(inplace=True)
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
s = pd.Series(["a","b","c","a"]).astype(
CategoricalDtype(ordered=True)
)
s.sort_values(inplace=True)
s
s.min(), s.max()
Expand Down Expand Up @@ -397,9 +476,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

.. ipython:: python

cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
cat = pd.Series([1,2,3]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base = pd.Series([2,2,2]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base2 = pd.Series([2,2,2]).astype(
CategoricalDtype(ordered=True)
)

cat
cat_base
Expand Down
11 changes: 8 additions & 3 deletions doc/source/merging.rst
Original file line number Diff line number Diff line change
Expand Up @@ -830,8 +830,10 @@ The left frame.

.. ipython:: python

from pandas.api.types import CategoricalDtype

X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
X = X.astype('category', categories=['foo', 'bar'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import first

X = X.astype(CategoricalDtype(categories=['foo', 'bar']))

left = pd.DataFrame({'X': X,
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
Expand All @@ -842,8 +844,11 @@ The right frame.

.. ipython:: python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
'Z': [1, 2]})
right = pd.DataFrame({
'X': pd.Series(['foo', 'bar'],
dtype=CategoricalDtype(['foo', 'bar'])),
'Z': [1, 2]
})
right
right.dtypes

Expand Down
26 changes: 26 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ users upgrade to this version.
Highlights include:

- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.

Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

Expand Down Expand Up @@ -89,6 +91,30 @@ This does not raise any obvious exceptions, but also does not create a new colum

Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

.. _whatsnew_0210.enhancements.categorical_dtype:

``CategoricalDtype`` for specifying categoricals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is 1 more issue you listed in the top of the PR


.. ipython:: python

from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c', 'a']) # strings
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.

See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.

.. _whatsnew_0210.enhancements.other:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like tslib.html was included somehow?


Other Enhancements
Expand Down
Loading