Make categories and ordered part of CategoricalDtype #14711

Closed
@TomAugspurger

This is to discuss pushing the Categorical.categories and
Categorical.ordered information into the extension type CategoricalDtype.

pd.CategoricalDtype(categories, ordered=False)

Note that there is no values argument. This is a type constructor that
isn't attached to any specific Categorical instance.

Why?

In several places (read_csv(..., dtype=...), .astype(...), Series([], dtype=...))
we accept dtype='category'. Each of these takes the values available
in the method (the series, the column from the CSV, etc.)
and hands them off to the value constructor, with no control over the
categories and ordered arguments.

Categorical(values, categories=None, ordered=False)

The proposal here would add the categories and ordered
attributes / arguments to CategoricalDtype and provide a common API
for specifying non-default parameters for the Categorical constructor
in methods like read_csv, astype, etc.

t = pd.CategoricalDtype(['low', 'med', 'high'], ordered=True)
pd.read_csv('foo.csv', dtype={'A': int, 'B': t})
pd.Series(['high', 'low', 'high'], dtype=t)

s = pd.Series(['high', 'low', 'high'])
s.astype(t)

We would continue to accept dtype='category'.
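
To make the idea concrete without relying on any pandas internals, here is a minimal pure-Python sketch of a dtype object that carries categories and ordered and is handed to a value constructor. The names `CatType`, `resolve_dtype`, and `encode` are hypothetical and exist only for illustration. Coding unknown values as -1 is likewise an assumption (see the strictness question under Issues).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class CatType:
    """Hypothetical stand-in for the proposed CategoricalDtype."""
    categories: Optional[Tuple[str, ...]] = None  # None means "infer from values"
    ordered: bool = False


def resolve_dtype(dtype: Union[str, CatType]) -> CatType:
    """Map the user-facing dtype argument onto a CatType."""
    if isinstance(dtype, CatType):
        return dtype
    if dtype == "category":
        # The plain string keeps its current meaning: infer categories, unordered.
        return CatType()
    raise TypeError(f"unsupported dtype: {dtype!r}")


def encode(values: Sequence[str], dtype: Union[str, CatType]) -> Tuple[List[int], CatType]:
    """Return integer codes plus the fully resolved type."""
    t = resolve_dtype(dtype)
    if t.categories is None:
        cats = tuple(sorted(set(values)))  # inferred from the data
    else:
        cats = t.categories  # declared by the caller
    lookup = {c: i for i, c in enumerate(cats)}
    # Assumption: values outside the declared categories get code -1.
    codes = [lookup.get(v, -1) for v in values]
    return codes, CatType(cats, t.ordered)


t = CatType(("low", "med", "high"), ordered=True)
codes, resolved = encode(["high", "low", "high"], t)
inferred_codes, inferred = encode(["high", "low", "high"], "category")
```

With the declared type, the codes follow the caller's category order. With `"category"`, the order falls out of whatever values happen to be present.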

This becomes even more important when operating on larger-than-memory datasets with something like dask, or even with read_csv(..., chunksize=N). Right now there is no easy way to specify the categories or ordered for columns, even if you know them ahead of time.
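
The chunked case can be illustrated with a small pure-Python sketch. `encode_chunk` is a hypothetical helper, not a pandas function. It shows why inferring categories per chunk gives codes that cannot be compared across chunks, while declaring them up front does not have that problem.

```python
def encode_chunk(values, categories=None):
    """Return (codes, categories) for one chunk; infer categories if not given."""
    cats = list(categories) if categories is not None else sorted(set(values))
    lookup = {c: i for i, c in enumerate(cats)}
    return [lookup[v] for v in values], cats


chunks = [["low", "high"], ["med", "high"]]

# Inferred per chunk: code 1 means 'low' in the first chunk but 'med' in the second.
inferred = [encode_chunk(c) for c in chunks]

# Declared up front: every chunk shares one mapping, so codes line up.
known = ["low", "med", "high"]
fixed = [encode_chunk(c, known) for c in chunks]
```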

Issues

  1. CategoricalDtype currently isn't part of the public API. Which methods
    on it do we make public?
  2. Equality semantics: For backwards compat, I think all instances
    of CategoricalDtype should compare equal with all others. You can use
    identity to check if two types are the same:
t1 = pd.CategoricalDtype(['a', 'b'], ordered=True)
t2 = pd.CategoricalDtype(['a', 'b'], ordered=False)

t1 == t2  # True
t1 is t2  # False
t1 is t1  # True
  3. Should the categories argument be required? Currently dtype='category'
    says 1.) infer the categories based on the values, and 2.) it's unordered.
    Would CategoricalDtype(None, ordered=False) be allowed?
  4. Strictness? If I say:
pd.Series(['a', 'b', 'c'], dtype=pd.CategoricalDtype(['a', 'b']))

What happens? I would probably expect a TypeError or ValueError, as 'c'
isn't "supposed" to be there. Or do we replace 'c' with NA? Should
strict be another parameter to CategoricalDtype? (I don't think so.)
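
The equality and strictness questions can be made concrete with a short pure-Python sketch. Everything here is hypothetical: `CatType` is not the pandas class, and the `on_unknown` argument exists only to show the two candidate behaviours side by side, not to suggest it should become a real parameter.

```python
class CatType:
    """Hypothetical dtype-like object used to illustrate the open questions."""

    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)
        self.ordered = ordered

    def __eq__(self, other):
        # Equality as proposed above: every instance equals every other instance.
        return isinstance(other, CatType)

    def __hash__(self):
        # Equal objects must hash equally, so all instances share one hash.
        return hash(CatType)


def encode(values, dtype, on_unknown="raise"):
    """Encode values as integer codes against dtype.categories."""
    lookup = {c: i for i, c in enumerate(dtype.categories)}
    codes = []
    for v in values:
        if v in lookup:
            codes.append(lookup[v])
        elif on_unknown == "raise":
            # Strict option: an unexpected value is an error.
            raise ValueError(f"{v!r} is not in categories {dtype.categories}")
        else:
            # Lenient option: code -1 stands in for NA (an assumption).
            codes.append(-1)
    return codes


t1 = CatType(["a", "b"], ordered=True)
t2 = CatType(["a", "b"], ordered=False)

same_by_eq = t1 == t2        # equal despite different `ordered`
same_by_identity = t1 is t2  # distinct objects

lenient = encode(["a", "b", "c"], t2, on_unknown="na")
```

One consequence worth noting: with equality this loose, `t1 == t2` says nothing about whether the categories or ordering match, so callers that care have to compare those attributes themselves.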

I'm willing to work on this over the next couple weeks.

xref #14676 (astype)
xref #14503 (read_csv)
