API: .str ops on category should return category if result is non-boolean #15198

Open
@jreback

In the PR implementing .str/.dt on Categoricals (#11582), we perform the string op on the uniques. The following is perfectly reasonable: the routine has a boolean result, so we return a boolean result.

In [2]: s = pd.Series(list('aabb')).astype('category')

In [3]: s
Out[3]: 
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): [a, b]

In [4]:  s.str.contains("a")
Out[4]: 
0     True
1     True
2    False
3    False
dtype: bool
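
As a rough illustration of what "performing the op on the uniques" means for a boolean routine, the sketch below evaluates the predicate once per category and maps it back to the rows through the codes. This is an assumption about the general approach, not the actual pandas internals, and it ignores missing values (code -1).

```python
import numpy as np
import pandas as pd

s = pd.Series(list("aabb")).astype("category")

# Evaluate the predicate once per category instead of once per row.
per_category = np.asarray(s.cat.categories.str.contains("a"))

# Map the per-category answers back to the rows via the integer codes.
# NOTE: this sketch assumes no missing values (a code of -1 would index wrongly).
codes = s.cat.codes.to_numpy()
result = pd.Series(per_category[codes], index=s.index)

print(result)
```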

However, I don't recall the rationale for performing the op on the uniques (as it's a categorical) but then returning an object dtype:

In [5]: s.str.upper()
Out[5]: 
0    A
1    A
2    B
3    B
dtype: object

These are by definition pure transforms, so returning a new categorical makes sense, e.g. in this case:

In [6]: pd.Series(pd.Categorical.from_codes(s.cat.codes, s.cat.categories.str.upper()))
Out[6]: 
0    A
1    A
2    B
3    B
dtype: category
Categories (2, object): [A, B]

This will be way more efficient than actually converting to object.
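
A minimal sketch of how the proposal could be wrapped up, assuming a hypothetical helper named `str_op_on_categories` (not part of pandas). One caveat it tries to cover: a transform can collapse distinct categories into the same value (e.g. `upper()` on categories `a` and `A`), and categories must be unique, so the sketch falls back to the object path and re-categorizes in that case.

```python
import pandas as pd


def str_op_on_categories(s, method, *args, **kwargs):
    """Hypothetical helper: apply a non-boolean .str method to the categories only."""
    new_categories = pd.Index(getattr(s.cat.categories.str, method)(*args, **kwargs))

    if not new_categories.is_unique:
        # The transform merged some categories; categories must be unique,
        # so fall back to the object path and re-categorize the result.
        return getattr(s.astype(object).str, method)(*args, **kwargs).astype("category")

    # Reuse the existing codes; only the (few) categories were transformed.
    cat = pd.Categorical.from_codes(
        s.cat.codes.to_numpy(), new_categories, ordered=s.cat.ordered
    )
    return pd.Series(cat, index=s.index, name=s.name)


s = pd.Series(list("aabb")).astype("category")
upper = str_op_on_categories(s, "upper")
print(upper)
```

Whether the fallback should re-categorize or return object dtype is an open design choice; the sketch picks re-categorizing so the result dtype is consistently category.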

Metadata

    Labels

    Categorical (Categorical Data Type), Enhancement, Needs Discussion (Requires discussion from core team before further action), Strings (String extension data type and string data)
