Make categories and ordered part of CategoricalDtype

This is to discuss pushing the `Categorical.categories` and
`Categorical.ordered` information into the extension type `CategoricalDtype`.

```python
pd.CategoricalDtype(categories, ordered=False)
```

Note that there is no `values` argument. This is a type constructor that
isn't attached to any specific `Categorical` instance.

## Why?

In several places (`read_csv(..., dtype=...)`, `.astype(...)`, `Series([], dtype=...)`)
we accept `dtype='category'`. Each of these takes the values it is given
(the series, a column from the CSV, etc.) and hands them off to the
*value* constructor, with no way to control the `categories` and `ordered`
arguments.

```python
Categorical(values, categories=None, ordered=False)
```
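
For illustration, here is a sketch of the two-step workaround this implies with the existing API: convert with `dtype='category'` first, then fix up the categories and ordering afterwards. The sample values are made up.

```python
import pandas as pd

s = pd.Series(['high', 'low', 'high'])

# dtype='category' infers the categories from the values and leaves them unordered.
inferred = s.astype('category')

# A second step is needed to get the categories and ordering actually wanted.
fixed = inferred.cat.set_categories(['low', 'med', 'high'], ordered=True)
```

The proposal would let the same information be passed up front, in the `dtype` argument itself.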

The proposal here would add the `categories` and `ordered`
attributes / arguments to `CategoricalDtype` and provide a common API
for specifying non-default parameters for the `Categorical` constructor
in methods like `read_csv`, `astype`, etc.


```python
t = pd.CategoricalDtype(['low', 'med', 'high'], ordered=True)
pd.read_csv('foo.csv', dtype={'A': int, 'B': t})
pd.Series(['high', 'low', 'high'], dtype=t)

s = pd.Series(['high', 'low', 'high'])
s.astype(t)
```

We would continue to accept `dtype='category'`.

This becomes even more important when operating on larger-than-memory datasets with something like `dask`, or even `read_csv(..., chunksize=N)`. Right now there is no easy way to specify the `categories` or `ordered` for columns (assuming you know them ahead of time).
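
To make the chunked case concrete, here is a small sketch using an in-memory CSV (the data and the column name `B` are made up). With `dtype='category'`, each chunk infers its categories from its own rows, so the chunks end up with different categories. The workaround shown applies the existing value constructor to each chunk by hand.

```python
import io
import pandas as pd

# Made-up data standing in for a file too large to read at once.
csv_data = "B\nlow\nlow\nhigh\nmed\n"

# With dtype='category', each chunk infers categories from its own rows only.
chunks = list(pd.read_csv(io.StringIO(csv_data), dtype={'B': 'category'}, chunksize=2))
per_chunk = [set(chunk['B'].cat.categories) for chunk in chunks]

# Workaround with the existing value constructor, applied chunk by chunk.
known = ['low', 'med', 'high']
fixed = [
    pd.Categorical(chunk['B'].astype(object), categories=known, ordered=True)
    for chunk in chunks
]
```

Here `per_chunk` holds one set of categories per chunk, and the sets differ, while every entry of `fixed` shares the same ordered categories.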

## Issues

1. `CategoricalDtype` currently isn't part of the public API. Which methods
on it do we make public?
2. Equality semantics: For backwards compatibility, I think all instances
of `CategoricalDtype` should compare equal to each other. You can use
identity to check whether two types are the same:

```python
t1 = pd.CategoricalDtype(['a', 'b'], ordered=True)
t2 = pd.CategoricalDtype(['a', 'b'], ordered=False)

t1 == t2  # True
t1 is t2  # False
t1 is t1  # True
```

3. Should the `categories` argument be required? Currently `dtype='category'`
means (1) infer the categories from the *values*, and (2) treat them as unordered.
Would `CategoricalDtype(None, ordered=False)` be allowed?
4. Strictness? If I say

```python
pd.Series(['a', 'b', 'c'], dtype=pd.CategoricalDtype(['a', 'b']))
```

What happens? I would probably expect a `TypeError` or `ValueError`, as `'c'`
isn't "supposed" to be there. Or do we replace `'c'` with `NA`? Should
`strict` be another parameter to `CategoricalDtype`? (I don't think so.)
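
As a point of comparison, the existing value constructor does not raise in this situation; it silently treats values outside `categories` as missing. A small sketch of that current behavior (not a statement of what the dtype-based path should do):

```python
import pandas as pd

# Existing value constructor: 'c' is not among the given categories.
c = pd.Categorical(['a', 'b', 'c'], categories=['a', 'b'])

codes = list(c.codes)              # values outside the categories are coded as -1
n_missing = int(pd.isnull(c).sum())  # and show up as missing
```

Matching this behavior would keep the two constructors consistent; raising would be stricter but would diverge from `Categorical(values, categories=...)`.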

I'm willing to work on this over the next couple weeks.

xref https://github.com/pandas-dev/pandas/issues/14676 (astype)
xref https://github.com/pandas-dev/pandas/issues/14503 (read_csv)
