
DISC: Behavior of .astype('category') on existing categorical data #18790

Closed
@jschendel


Background

Follow-up from this specific chain of comments: #18710 (comment)
And these PRs in general: #18677, #18710

Issue

For the context of this discussion, I'm only referring to data that is already categorical; I don't think there was any ambiguity with converting non-categorical to categorical. This applies to using .astype('category') on Categorical, CategoricalIndex, and Series.

The crux of the issue comes down to whether .astype('category') should ever change data that is already categorical. The argument that it shouldn't is that .astype('category') doesn't explicitly specify any changes, so nothing should change; this is also the existing behavior.

The other argument is that .astype('category') should be equivalent to .astype(CategoricalDtype()). Note that CategoricalDtype() is the same as CategoricalDtype(categories=None, ordered=False):

In [2]: CategoricalDtype()
Out[2]: CategoricalDtype(categories=None, ordered=False)

This means that if the existing categorical data is ordered, then .astype(CategoricalDtype()) would change the categorical data from having ordered=True to ordered=False, and so .astype('category') should do the same.
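To make the comparison concrete, here is a small sketch that runs both conversions on ordered data and prints the resulting `ordered` flag. The sample values are made up for illustration, and the result for `CategoricalDtype()` is deliberately not stated here, since it depends on the pandas version installed and on the outcome of this discussion.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categorical data to start from (illustrative values).
s = pd.Series(
    pd.Categorical(["low", "high", "low"], categories=["low", "high"], ordered=True)
)

via_string = s.astype("category")         # the string alias
via_dtype = s.astype(CategoricalDtype())  # a default-constructed dtype

# Compare the `ordered` flag before and after each conversion.
print("original           :", s.dtype.ordered)
print(".astype('category'):", via_string.dtype.ordered)
print("CategoricalDtype() :", via_dtype.dtype.ordered)

# The categories themselves are carried over in both cases.
print(list(via_string.cat.categories), list(via_dtype.cat.categories))
```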

I don't think there are any scenarios where the categories themselves would change; the only potential thing that could change is ordered=True to ordered=False. See below for a summary of some potential options. Feel free to modify any of the pro/cons listed below, or suggest any other potential options.

Option 1: .astype('category') does not change anything

This would not require any additional code changes, as it's the current behavior.

Pros:

  • Maintains the current behavior of .astype('category')
  • Less likely to cause user confusion due to unforeseen changes
    • At least in my mind, but I could be convinced otherwise
    • Forces the user to be explicit when making potentially unintended changes

Cons:

  • Inconsistent with .astype(CategoricalDtype())
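Under this option, dropping the ordering would have to be requested explicitly. A minimal sketch of two ways to do that with existing API, using made-up sample data:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(
    pd.Categorical(["low", "high", "low"], categories=["low", "high"], ordered=True)
)

# Explicit request 1: the .cat accessor method for dropping the ordering.
unordered_a = s.cat.as_unordered()

# Explicit request 2: spell out the full target dtype, ordering included.
target = CategoricalDtype(categories=s.cat.categories, ordered=False)
unordered_b = s.astype(target)

print(unordered_a.dtype.ordered, unordered_b.dtype.ordered)
```

Either form makes the change visible at the call site, which is the point of the "be explicit" argument above.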

Option 2: .astype('category') changes ordered=True to ordered=False

This would require some additional code changes, but they are relatively minor.

Pros:

  • Makes .astype('category') consistent with .astype(CategoricalDtype())
  • A bit cleaner/more maintainable in terms of code
    • No special case checking for the string 'category'

Cons:

  • Changes current behavior of .astype('category')

Option 3: Allow ordered=None in CategoricalDtype

Basically, make CategoricalDtype() return CategoricalDtype(categories=None, ordered=None). I should preface this by saying that I have not scoped out the amount of code that would need to be changed for this, nor the potential ramifications. This may not be a good idea.

Pros:

  • Maintains the current behavior of .astype('category')
  • Makes .astype('category') consistent with .astype(CategoricalDtype())

Cons:

  • Changes the default behavior of CategoricalDtype
  • Could potentially involve a lot of code change and unseen ramifications
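To illustrate the intended semantics of ordered=None, here is a toy model in plain Python. `DtypeSpec` and `resolve_dtype` are hypothetical names invented for this sketch and are not pandas API; the idea is only that a field left as None means "not specified, keep whatever the existing data has".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DtypeSpec:
    """Hypothetical stand-in for CategoricalDtype; not a pandas class."""
    categories: Optional[Tuple[str, ...]] = None
    ordered: Optional[bool] = None  # None means "not specified"


def resolve_dtype(existing: DtypeSpec, requested: DtypeSpec) -> DtypeSpec:
    """Fill every unspecified field of `requested` from `existing`."""
    categories = (
        existing.categories if requested.categories is None else requested.categories
    )
    ordered = existing.ordered if requested.ordered is None else requested.ordered
    return DtypeSpec(categories, ordered)


existing = DtypeSpec(("low", "high"), ordered=True)

# A default spec specifies nothing, so nothing changes.
unchanged = resolve_dtype(existing, DtypeSpec())

# An explicit ordered=False is honored; categories are still carried over.
unordered = resolve_dtype(existing, DtypeSpec(ordered=False))

print(unchanged)
print(unordered)
```

With this rule, the string 'category' and a default-constructed dtype would behave identically without special-casing the string, at the cost of a three-valued `ordered`.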
