ENH: support .astype('category') on DataFrame / aka co-factorization #12860

Closed
@jreback

xref to #10696, #8709

We don't allow astype of a DataFrame to category directly:

In [44]: df.astype('category')
NotImplementedError: > 1 ndim Categorical are not supported at this time

Instead, you can apply the astype per column:

In [35]: df = DataFrame({'A' : list('aabcda'), 'B' : list('bcdaae')})

In [36]: df
Out[36]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [37]: df.apply(lambda x: x.astype('category'))
Out[37]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [38]: df.apply(lambda x: x.astype('category')).B
Out[38]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]

In [39]: df.apply(lambda x: x.astype('category')).A
Out[39]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (4, object): [a, b, c, d]

But if the columns have 'similar' categories, you would usually want this done automatically,
astyping every column with the same uniques:

In [41]: uniques = np.sort(pd.unique(df.values.ravel()))

In [42]: df.apply(lambda x: x.astype('category', categories=uniques)).A
Out[42]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (5, object): [a, b, c, d, e]

In [43]: df.apply(lambda x: x.astype('category', categories=uniques)).B
Out[43]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]
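
For readers on newer pandas releases, where (as far as I know) Series.astype no longer accepts a categories= keyword, the same shared-uniques conversion can be sketched with a CategoricalDtype. This is a minimal sketch, not part of the original proposal; the variable names shared and result are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("aabcda"), "B": list("bcdaae")})

# Sorted unique values taken across every column of the frame.
uniques = np.sort(pd.unique(df.to_numpy().ravel()))

# One dtype object reused for all columns, so they share the same categories.
shared = CategoricalDtype(categories=uniques)
result = df.apply(lambda col: col.astype(shared))

print(result["A"].cat.categories)
print(result["B"].cat.categories)
```

Both columns end up with the same five categories even though column A never contains 'e'.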

This is fairly straightforward to actually implement, and I think it is a nice, easy way of coding this without having to actually support 2-D categoricals internally (we are moving away from internal 2-D structures anyhow).
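
One way the co-factorization could look as a standalone helper is sketched below. The function name astype_shared_category is hypothetical and is not a pandas API. The sketch assumes unique column names and mutually sortable values. It converts column by column, so no 2-D Categorical is ever constructed.

```python
import pandas as pd


def astype_shared_category(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: give every column the same categorical dtype."""
    # Union of the non-missing values seen in any column, sorted so that
    # the integer codes are stable and comparable across columns.
    values = pd.Series(frame.to_numpy().ravel()).dropna().unique()
    dtype = pd.CategoricalDtype(categories=sorted(values))
    # Each column is converted on its own, as a 1-D categorical.
    converted = {name: frame[name].astype(dtype) for name in frame.columns}
    return pd.DataFrame(converted, index=frame.index)


df = pd.DataFrame({"A": list("aabcda"), "B": list("bcdaae")})
out = astype_shared_category(df)

# With shared categories, the same label maps to the same code in every column.
codes = out.apply(lambda col: col.cat.codes)
print(codes)
```

Because the categories are shared, a code of 0 means 'a' in both A and B, which is not guaranteed when each column is converted independently.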
