Description
Background
Follow-up from this specific chain of comments: #18710 (comment)
And these PRs in general: #18677, #18710
Issue
For the context of this discussion, I'm only referring to data that is already categorical; I don't think there is any ambiguity about converting non-categorical data to categorical. This applies to using .astype('category') on Categorical, CategoricalIndex, and Series.
The crux of the issue is whether .astype('category') should ever change data that is already categorical. One argument that it shouldn't is that .astype('category') doesn't explicitly specify any changes, so nothing should be changed; this is also the existing behavior.
The other argument is that .astype('category') should be equivalent to .astype(CategoricalDtype()). Note that CategoricalDtype() is the same as CategoricalDtype(categories=None, ordered=False):
In [2]: CategoricalDtype()
Out[2]: CategoricalDtype(categories=None, ordered=False)
This means that if the existing categorical data is ordered, then .astype(CategoricalDtype()) would change the categorical data from ordered=True to ordered=False, and so .astype('category') should do the same.
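To check the two spellings side by side on an installed pandas, something like the following minimal sketch could be used. The Series `s` and the variable names are made up for illustration. No output is shown because the `ordered` result of the dtype-instance form depends on the pandas version.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# An ordered categorical Series: the case under discussion (illustrative data).
s = pd.Series(
    list("abca"),
    dtype=CategoricalDtype(categories=list("abc"), ordered=True),
)

# String form: per this issue, the existing behavior leaves the data unchanged.
via_string = s.astype("category")

# Dtype-instance form: whether `ordered` is reset here is the open question,
# and the answer may differ between pandas versions.
via_dtype = s.astype(CategoricalDtype())

print("original ordered:", s.dtype.ordered)
print("'category' ordered:", via_string.dtype.ordered)
print("CategoricalDtype() ordered:", via_dtype.dtype.ordered)
```

In both cases the categories themselves should come out as a, b, c; only the `ordered` flag is in question.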
I don't think there are any scenarios where the categories themselves would change; the only thing that could change is ordered=True to ordered=False. See below for a summary of some potential options. Feel free to modify any of the pros/cons listed below, or suggest other options.
Option 1: .astype('category') does not change anything
This would not require any additional code changes, as it's the current behavior.
Pros:
- Maintains current behavior of .astype('category')
- Less likely to cause user confusion due to unforeseen changes
  - At least in my mind, but I could be convinced otherwise
- Forces the user to be explicit when making potentially unintended changes
Cons:
- Inconsistent with .astype(CategoricalDtype())
Option 2: .astype('category') changes ordered=True to ordered=False
This would require some additional code changes, but they would be relatively minor.
Pros:
- Makes .astype('category') consistent with .astype(CategoricalDtype())
- A bit cleaner/more maintainable in terms of code
  - No special-case checking for the string 'category'
Cons:
- Changes current behavior of .astype('category')
Option 3: Allow ordered=None in CategoricalDtype
Basically, make CategoricalDtype() return CategoricalDtype(categories=None, ordered=None). I should preface this by saying that I have not scoped out the amount of code that would need to be changed for this, nor the potential ramifications. This may not be a good idea.
Pros:
- Maintains current behavior of .astype('category')
- Makes .astype('category') consistent with .astype(CategoricalDtype())
Cons:
- Changes the default behavior of CategoricalDtype
- Could potentially involve a lot of code changes and unforeseen ramifications
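One way to picture what Option 3 implies is a toy model in which None means "not specified, keep whatever the existing data has". The sketch below is plain Python, not pandas code: `ToyDtype` and `resolve` are hypothetical names invented here, and an actual implementation might look quite different.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical stand-in for CategoricalDtype; not pandas code."""

    categories: Optional[Tuple[str, ...]] = None
    ordered: Optional[bool] = None  # None means "not specified"


def resolve(existing: ToyDtype, requested: ToyDtype) -> ToyDtype:
    """Fill any unspecified (None) field of `requested` from `existing`."""
    categories = (
        existing.categories if requested.categories is None else requested.categories
    )
    ordered = existing.ordered if requested.ordered is None else requested.ordered
    return ToyDtype(categories, ordered)


current = ToyDtype(("a", "b", "c"), ordered=True)

# A default-constructed dtype specifies nothing, so nothing changes.
unchanged = resolve(current, ToyDtype())

# An explicit ordered=False is honored; the categories are carried over.
unordered = resolve(current, ToyDtype(ordered=False))

print(unchanged)
print(unordered)
```

Under this model the string 'category' and a default-constructed dtype would behave identically, and dropping the ordering would require asking for ordered=False explicitly.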