Commit 738f3b7

fix issue with grouping with sort=True on an unordered Categorical
update categorical.rst docs test unsortable when ordered=True
1 parent b7238e6 commit 738f3b7

7 files changed

+198
-74
lines changed

doc/source/api.rst

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -585,6 +585,8 @@ following usable methods and properties (all available as ``Series.cat.<method_o
585585
Categorical.remove_categories
586586
Categorical.remove_unused_categories
587587
Categorical.set_categories
588+
Categorical.as_ordered
589+
Categorical.as_unordered
588590
Categorical.codes
589591

590592
To create a Series of dtype ``category``, use ``cat = s.astype("category")``.

doc/source/categorical.rst

Lines changed: 27 additions & 13 deletions
@@ -90,8 +90,6 @@ By using some special functions:
9090
See :ref:`documentation <reshaping.tile.cut>` for :func:`~pandas.cut`.
9191

9292
By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to a `DataFrame`.
93-
This is the only possibility to specify differently ordered categories (or no order at all) at
94-
creation time and the only reason to use :class:`pandas.Categorical` directly:
9593

9694
.. ipython:: python
9795
@@ -103,6 +101,14 @@ creation time and the only reason to use :class:`pandas.Categorical` directly:
103101
df["B"] = raw_cat
104102
df
105103
104+
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
105+
106+
.. ipython:: python
107+
108+
s = Series(["a","b","c","a"])
109+
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
110+
s_cat
111+
106112
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
107113

108114
.. ipython:: python
@@ -176,10 +182,9 @@ It's also possible to pass in the categories in a specific order:
176182
s.cat.ordered
177183
178184
.. note::
179-
New categorical data is automatically ordered if the passed in values are sortable or a
180-
`categories` argument is supplied. This is a difference to R's `factors`, which are unordered
181-
unless explicitly told to be ordered (``ordered=TRUE``). You can of course overwrite that by
182-
passing in an explicit ``ordered=False``.
185+
186+
New categorical data are NOT automatically ordered. You must explicitly pass ``ordered=True`` to
187+
indicate an ordered ``Categorical``.
183188

184189

185190
Renaming categories
@@ -270,6 +275,10 @@ Sorting and Order
270275

271276
.. _categorical.sort:
272277

278+
.. warning::
279+
280+
The default for construction changed in v0.16.0 to ``ordered=False``, from the prior implicit ``ordered=True``.
281+
273282
If categorical data is ordered (``s.cat.ordered == True``), then the order of the categories has a
274283
meaning and certain operations are possible. If the categorical is unordered, a `TypeError` is
275284
raised.
@@ -281,18 +290,26 @@ raised.
281290
s.sort()
282291
except TypeError as e:
283292
print("TypeError: " + str(e))
284-
s = Series(["a","b","c","a"], dtype="category") # ordered per default!
293+
s = Series(["a","b","c","a"]).astype('category',ordered=True)
285294
s.sort()
286295
s
287296
s.min(), s.max()
288297
298+
You can set categorical data to be ordered by using ``as_ordered()`` or unordered by using ``as_unordered()``. These will by
299+
default return a *new* object.
300+
301+
.. ipython:: python
302+
303+
s.cat.as_ordered()
304+
s.cat.as_unordered()
305+
289306
Sorting will use the order defined by categories, not any lexical order present on the data type.
290307
This is even true for strings and numeric data:
291308

292309
.. ipython:: python
293310
294311
s = Series([1,2,3,1], dtype="category")
295-
s.cat.categories = [2,3,1]
312+
s = s.cat.set_categories([2,3,1], ordered=True)
296313
s
297314
s.sort()
298315
s
@@ -310,7 +327,7 @@ necessarily make the sort order the same as the categories order.
310327
.. ipython:: python
311328
312329
s = Series([1,2,3,1], dtype="category")
313-
s = s.cat.reorder_categories([2,3,1])
330+
s = s.cat.reorder_categories([2,3,1], ordered=True)
314331
s
315332
s.sort()
316333
s
@@ -339,7 +356,7 @@ The ordering of the categorical is determined by the ``categories`` of that colu
339356

340357
.. ipython:: python
341358
342-
dfs = DataFrame({'A' : Categorical(list('bbeebbaa'),categories=['e','a','b']),
359+
dfs = DataFrame({'A' : Categorical(list('bbeebbaa'),categories=['e','a','b'],ordered=True),
343360
'B' : [1,2,1,2,2,1,2,1] })
344361
dfs.sort(['A','B'])
345362
@@ -664,9 +681,6 @@ The following differences to R's factor functions can be observed:
664681

665682
* R's `levels` are named `categories`
666683
* R's `levels` are always of type string, while `categories` in pandas can be of any dtype.
667-
* New categorical data is automatically ordered if the passed in values are sortable or a
668-
`categories` argument is supplied. This is a difference to R's `factors`, which are unordered
669-
unless explicitly told to be ordered (``ordered=TRUE``).
670684
* It's not possible to specify labels at creation time. Use ``s.cat.rename_categories(new_labels)``
671685
afterwards.
672686
* In contrast to R's `factor` function, using categorical data as the sole input to create a

doc/source/whatsnew/v0.16.0.txt

Lines changed: 50 additions & 26 deletions
@@ -312,51 +312,54 @@ Categorical Changes
312312

313313
.. _whatsnew_0160.api_breaking.categorical:
314314

315-
In prior versions, Categoricals that with had an unspecified ordering (meaning no ``ordered`` keyword was passed) were defaulted to have a lexographic ordering of the values and designated as ``ordered`` Categoricals. Going forward, the ``ordered`` keyword in the ``Categorical`` constructor will default to ``False``, so ordering must now be explicit.
315+
In prior versions, ``Categoricals`` that had an unspecified ordering (meaning no ``ordered`` keyword was passed) defaulted to being ``ordered`` Categoricals. Going forward, the ``ordered`` keyword in the ``Categorical`` constructor will default to ``False``, so ordering must now be explicit.
316316

317-
Furthermore, previously you *could* change the ``ordered`` attribute of a Categorical by just setting the attribute, e.g. ``cat.ordered=True``; This is now deprecated and you should use ``cat.set_ordered(True)``. This will by default return a **new** object and not modify the existing object.
317+
Furthermore, previously you *could* change the ``ordered`` attribute of a Categorical by just setting the attribute, e.g. ``cat.ordered=True``. This is now deprecated and you should use ``cat.as_ordered()`` or ``cat.as_unordered()``. These will by default return a **new** object and not modify the existing object. (:issue:`9347`, :issue:`9190`)
318318

319319
Previous Behavior
320320

321321
.. code-block:: python
322322

323-
In [1]: cat = pd.Categorical([0,1,2])
323+
In [3]: s = Series([0,1,2], dtype='category')
324324

325-
In [2]: cat
326-
Out[2]:
327-
[0, 1, 2]
325+
In [4]: s
326+
Out[4]:
327+
0 0
328+
1 1
329+
2 2
330+
dtype: category
328331
Categories (3, int64): [0 < 1 < 2]
329332

330-
In [3]: cat.ordered
331-
Out[3]: True
333+
In [5]: s.cat.ordered
334+
Out[5]: True
332335

333-
In [4]: cat.ordered=False
336+
In [6]: s.cat.ordered = False
334337

335-
In [5]: cat
336-
Out[5]:
337-
[0, 1, 2]
338+
In [7]: s
339+
Out[7]:
340+
0 0
341+
1 1
342+
2 2
343+
dtype: category
338344
Categories (3, int64): [0, 1, 2]
339345

340-
341346
New Behavior
342347

343348
.. ipython:: python
344349

345-
cat = pd.Categorical([0,1,2])
346-
cat
347-
cat.ordered
348-
cat.ordered=True
349-
cat = cat.set_ordered(True)
350-
cat
351-
cat.ordered
350+
s = Series([0,1,2], dtype='category')
351+
s
352+
s.cat.ordered
353+
s = s.cat.as_ordered()
354+
s
355+
s.cat.ordered
352356

353-
# you can set in the construtor
354-
cat = pd.Categorical([0,1,2],ordered=True)
355-
cat
356-
cat.ordered
357+
# you can set in the constructor of the Categorical
358+
s = Series(Categorical([0,1,2],ordered=True))
359+
s
360+
s.cat.ordered
357361

358-
For ease of creation of series of ``Categoricals``, we have added the ability to pass keywords when calling ``.astype()``, these
359-
are passed directly to the constructor.
362+
For ease of creation of series of categorical data, we have added the ability to pass keywords when calling ``.astype()``; these are passed directly to the constructor.
360363

361364
.. ipython:: python
362365

@@ -365,6 +368,27 @@ are passed directly to the constructor.
365368
s = Series(["a","b","c","a"]).astype('category',categories=list('abcdef'),ordered=False)
366369
s
367370

371+
.. warning::
372+
373+
This simple API change may have surprising effects if a user is relying on the previous defaulted behavior implicitly. In particular,
374+
sorting operations with a ``Categorical`` will now raise an error:
375+
376+
.. code-block:: python
377+
378+
In [1]: df = DataFrame({ 'A' : Series(list('aabc')).astype('category'), 'B' : np.arange(4) })
379+
380+
In [2]: df['A'].order()
381+
TypeError: Categorical not ordered
382+
you can use .as_ordered() to change the Categorical to an ordered one
383+
384+
In [3]: df.groupby('A').sum()
385+
ValueError: cannot sort by an unordered Categorical in the grouper
386+
you can set sort=False in the groupby expression or
387+
make the categorical ordered by using .as_ordered()
388+
389+
The solution is to make 'A' orderable, e.g. ``df['A'] = df['A'].cat.as_ordered()``
390+
391+
368392
Indexing Changes
369393
~~~~~~~~~~~~~~~~
370394

pandas/core/categorical.py

Lines changed: 48 additions & 15 deletions
@@ -139,7 +139,7 @@ class Categorical(PandasObject):
139139
categories : Index-like (unique), optional
140140
The unique categories for this categorical. If not given, the categories are assumed
141141
to be the unique values of values.
142-
ordered : boolean, optional
142+
ordered : boolean, (default False)
143143
Whether or not this categorical is treated as a ordered categorical. If not given,
144144
the resulting categorical will not be ordered.
145145
name : str, optional
@@ -259,12 +259,14 @@ def __init__(self, values, categories=None, ordered=False, name=None, fastpath=F
259259

260260
if categories is None:
261261
try:
262-
codes, categories = factorize(values, sort=ordered)
262+
codes, categories = factorize(values, sort=True)
263263
except TypeError:
264-
# raise, as we don't have a sortable data structure and so the user should
265-
# give us one by specifying categories
266-
raise TypeError("'values' is not factorizable, please pass "
267-
"categories order by passing in a categories argument.")
264+
codes, categories = factorize(values, sort=False)
265+
if ordered:
266+
# raise, as we don't have a sortable data structure and so the user should
267+
# give us one by specifying categories
268+
raise TypeError("'values' is not ordered, please explicitly specify the "
269+
"categories order by passing in a categories argument.")
268270
except ValueError:
269271

270272
### FIXME ####
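
Review note on the hunk above: when no categories are passed, the constructor now tries a sorted factorization first, falls back to order of appearance if the values cannot be sorted, and raises only if ``ordered=True`` was requested. The control flow can be sketched in plain Python as follows. This is a minimal illustration, assuming a hypothetical helper ``infer_categories``; it is not pandas code (the real implementation calls ``factorize``).

```python
def infer_categories(values, ordered=False):
    """Toy stand-in for the categories-inference branch; not pandas code."""
    # Unique values in order of first appearance (values must be hashable).
    seen = list(dict.fromkeys(values))
    try:
        # Preferred path: sorted categories, mirroring factorize(values, sort=True).
        return sorted(seen)
    except TypeError:
        # The values are not mutually comparable (e.g. mixed str/int on Python 3).
        if ordered:
            raise TypeError("'values' is not ordered, please explicitly specify the "
                            "categories order by passing in a categories argument.")
        # Unordered data can still be categorized in order of appearance.
        return seen
```

With this sketch, unsortable input is accepted for an unordered result and rejected only when an order was explicitly requested, which appears to be the behavior the hunk is after.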
@@ -290,7 +292,7 @@ def __init__(self, values, categories=None, ordered=False, name=None, fastpath=F
290292
warn("None of the categories were found in values. Did you mean to use\n"
291293
"'Categorical.from_codes(codes, categories)'?", RuntimeWarning)
292294

293-
self._ordered = ordered
295+
self.set_ordered(ordered, inplace=True)
294296
self.categories = categories
295297
self.name = name
296298
self._codes = _coerce_indexer_dtype(codes, categories)
@@ -345,7 +347,7 @@ def from_codes(cls, codes, categories, ordered=False, name=None):
345347
An integer array, where each integer points to a category in categories or -1 for NaN
346348
categories : index-like
347349
The categories for the categorical. Items need to be unique.
348-
ordered : boolean, optional
350+
ordered : boolean, (default False)
349351
Whether or not this categorical is treated as a ordered categorical. If not given,
350352
the resulting categorical will be unordered.
351353
name : str, optional
@@ -470,6 +472,30 @@ def set_ordered(self, value, inplace=False):
470472
if not inplace:
471473
return cat
472474

475+
def as_ordered(self, inplace=False):
476+
"""
477+
Sets the Categorical to be ordered
478+
479+
Parameters
480+
----------
481+
inplace : boolean (default: False)
482+
Whether or not to set the ordered attribute inplace or return a copy of this categorical
483+
with ordered set to True
484+
"""
485+
return self.set_ordered(True, inplace=inplace)
486+
487+
def as_unordered(self, inplace=False):
488+
"""
489+
Sets the Categorical to be unordered
490+
491+
Parameters
492+
----------
493+
inplace : boolean (default: False)
494+
Whether or not to set the ordered attribute inplace or return a copy of this categorical
495+
with ordered set to False
496+
"""
497+
return self.set_ordered(False, inplace=inplace)
498+
473499
def _get_ordered(self):
474500
""" Gets the ordered attribute """
475501
return self._ordered
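
Review note on the hunk above: ``as_ordered`` and ``as_unordered`` are thin wrappers that delegate to ``set_ordered``, returning a new object unless ``inplace=True``. A self-contained toy model of that pattern is sketched below. ``MiniCategorical`` is a hypothetical class for illustration only; its ``min`` uses the natural order of the values, whereas pandas uses the order of the categories.

```python
import copy


class MiniCategorical:
    """Toy model of the ordered flag; hypothetical, not pandas' Categorical."""

    def __init__(self, values, ordered=False):
        self.values = list(values)
        self.ordered = ordered

    def set_ordered(self, value, inplace=False):
        # Work on self when inplace is requested, otherwise on a shallow copy.
        cat = self if inplace else copy.copy(self)
        cat.ordered = bool(value)
        if not inplace:
            return cat

    def as_ordered(self, inplace=False):
        # Thin wrapper, as in the diff.
        return self.set_ordered(True, inplace=inplace)

    def as_unordered(self, inplace=False):
        # Thin wrapper, as in the diff.
        return self.set_ordered(False, inplace=inplace)

    def min(self):
        # Order-dependent operations refuse to run on unordered data.
        if not self.ordered:
            raise TypeError("Categorical not ordered\n"
                            "you can use .as_ordered() to change the "
                            "Categorical to an ordered one\n")
        return min(self.values)
```

Calling ``MiniCategorical(["b", "a"]).as_ordered()`` leaves the original object unordered and returns an ordered copy, matching the "returns a *new* object by default" wording in the docs changes.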
@@ -853,7 +879,8 @@ def searchsorted(self, v, side='left', sorter=None):
853879
array([3, 5]) # eggs after donuts, after switching milk and donuts
854880
"""
855881
if not self.ordered:
856-
raise ValueError("searchsorted requires an ordered Categorical.")
882+
raise ValueError("Categorical not ordered\n"
883+
"you can use .as_ordered() to change the Categorical to an ordered one\n")
857884

858885
from pandas.core.series import Series
859886
values_as_codes = self.categories.values.searchsorted(Series(v).values, side)
@@ -981,7 +1008,8 @@ def argsort(self, ascending=True, **kwargs):
9811008
argsorted : numpy array
9821009
"""
9831010
if not self.ordered:
984-
raise TypeError("Categorical not ordered")
1011+
raise TypeError("Categorical not ordered\n"
1012+
"you can use .as_ordered() to change the Categorical to an ordered one\n")
9851013
result = np.argsort(self._codes.copy(), **kwargs)
9861014
if not ascending:
9871015
result = result[::-1]
@@ -1013,7 +1041,8 @@ def order(self, inplace=False, ascending=True, na_position='last'):
10131041
Category.sort
10141042
"""
10151043
if not self.ordered:
1016-
raise TypeError("Categorical not ordered")
1044+
raise TypeError("Categorical not ordered\n"
1045+
"you can use .as_ordered() to change the Categorical to an ordered one\n")
10171046
if na_position not in ['last','first']:
10181047
raise ValueError('invalid na_position: {!r}'.format(na_position))
10191048

@@ -1394,7 +1423,8 @@ def min(self, numeric_only=None, **kwargs):
13941423
min : the minimum of this `Categorical`
13951424
"""
13961425
if not self.ordered:
1397-
raise TypeError("Categorical not ordered")
1426+
raise TypeError("Categorical not ordered\n"
1427+
"you can use .as_ordered() to change the Categorical to an ordered one\n")
13981428
if numeric_only:
13991429
good = self._codes != -1
14001430
pointer = self._codes[good].min(**kwargs)
@@ -1421,7 +1451,8 @@ def max(self, numeric_only=None, **kwargs):
14211451
max : the maximum of this `Categorical`
14221452
"""
14231453
if not self.ordered:
1424-
raise TypeError("Categorical not ordered")
1454+
raise TypeError("Categorical not ordered\n"
1455+
"you can use .as_ordered() to change the Categorical to an ordered one\n")
14251456
if numeric_only:
14261457
good = self._codes != -1
14271458
pointer = self._codes[good].max(**kwargs)
@@ -1524,7 +1555,8 @@ class CategoricalAccessor(PandasDelegate):
15241555
>>> s.cat.remove_categories(['d'])
15251556
>>> s.cat.remove_unused_categories()
15261557
>>> s.cat.set_categories(list('abcde'))
1527-
>>> s.cat.set_ordered(True)
1558+
>>> s.cat.as_ordered()
1559+
>>> s.cat.as_unordered()
15281560
15291561
"""
15301562

@@ -1561,7 +1593,8 @@ def _delegate_method(self, name, *args, **kwargs):
15611593
"remove_categories",
15621594
"remove_unused_categories",
15631595
"set_categories",
1564-
"set_ordered"],
1596+
"as_ordered",
1597+
"as_unordered"],
15651598
typ='method')
15661599

15671600
##### utility routines #####

pandas/core/groupby.py

Lines changed: 9 additions & 1 deletion
@@ -1923,8 +1923,16 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
19231923

19241924
# a passed Categorical
19251925
elif isinstance(self.grouper, Categorical):
1926+
1927+
# must have an ordered categorical
1928+
if self.sort:
1929+
if not self.grouper.ordered:
1930+
raise ValueError("cannot sort by an unordered Categorical in the grouper\n"
1931+
"you can set sort=False in the groupby expression or\n"
1932+
"make the categorical ordered by using .set_ordered(True)\n")
1933+
19261934
# fix bug #GH8868 sort=False being ignored in categorical groupby
1927-
if not self.sort:
1935+
else:
19281936
self.grouper = self.grouper.reorder_categories(self.grouper.unique())
19291937
self._labels = self.grouper.codes
19301938
self._group_index = self.grouper.categories
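
Review note on the hunk above: with ``sort=True`` the grouper now requires an ordered Categorical, and with ``sort=False`` the categories are reordered to follow the order in which values appear. A rough plain-Python sketch of that decision is shown below, assuming a hypothetical function ``resolve_group_order``; it only models the branching and is not the pandas implementation.

```python
def resolve_group_order(values, categories, ordered, sort=True):
    """Toy model of the Categorical branch in the grouper; not pandas code."""
    if sort:
        # Sorting groups only makes sense when the categories carry an order.
        if not ordered:
            raise ValueError("cannot sort by an unordered Categorical in the grouper")
        return list(categories)
    # sort=False: groups follow the order in which values first appear.
    return list(dict.fromkeys(values))
```

Under these assumptions, callers either pass ``sort=False`` or make the categorical ordered first, which is the same pair of remedies the new error message suggests.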
