PERF: categorical value_counts can be much faster #10804

Closed
@jreback

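The internal implementation of Categorical.value_counts should just do what result1 does below: count the integer codes, then map those counts back onto the categories. I think it is currently factorizing multiple times when that is not necessary.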

In [32]: np.random.seed(1234)

In [33]: n = 500000

In [34]: u = int(0.1*n)

In [35]: arr = [ "s%04d" % i for i in np.random.randint(0,u,size=n) ]

In [36]: c = pd.Series(arr).astype('category')

In [37]: result1 = Series(np.arange(len(c.cat.categories)),c.cat.categories).map(c.cat.codes.value_counts()).order(ascending=False)

In [38]: result2 = c.value_counts()

In [39]: %timeit Series(np.arange(len(c.cat.categories)),c.cat.categories).map(c.cat.codes.value_counts()).order(ascending=False)
100 loops, best of 3: 17.2 ms per loop

In [40]: %timeit c.value_counts()
10 loops, best of 3: 62.3 ms per loop

In [41]: result1.equals(result2)
Out[41]: True
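
The same idea can be sketched against current pandas, where Series.order is no longer available. This is only an illustration of counting the integer codes once and labelling the counts with the categories, not the change that was made in pandas. The helper name value_counts_from_codes is hypothetical, and a smaller n is used so the example runs quickly.

```python
import numpy as np
import pandas as pd


def value_counts_from_codes(cat: pd.Categorical) -> pd.Series:
    """Count occurrences per category using only the integer codes.

    Hypothetical helper for illustration; not part of the pandas API.
    """
    codes = np.asarray(cat.codes)
    # Missing values are encoded as -1; drop them before counting.
    valid = codes[codes >= 0]
    # Single pass over the codes; minlength keeps unused categories at 0.
    counts = np.bincount(valid, minlength=len(cat.categories))
    result = pd.Series(counts, index=cat.categories)
    # Sort descending, as the session above does for result1.
    return result.sort_values(ascending=False)


rng = np.random.RandomState(1234)
n = 5000  # smaller than the 500000 used in the session above
labels = ["s%04d" % i for i in rng.randint(0, n // 10, size=n)]
cat = pd.Categorical(labels)
fast = value_counts_from_codes(cat)
```

Because the categorical already stores one integer code per element, no further factorization is needed to produce the counts. Any speedup relative to c.value_counts() depends on the pandas version, so it should be measured with %timeit rather than assumed.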

Labels: Categorical (Categorical Data Type), Performance (Memory or execution speed performance)