
nunique performance for groupby with large number of groups #10820

Closed

Description

@aldanor

It looks like a naive len(set(...)) beats both len(np.unique(...)) and pd.Series.nunique. Here's an example with a large number of groups, where we compute the unique counts of one column while grouping by another:

>>> df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                       'b': np.random.randint(10, size=100000)})
>>> g = df.groupby('a')

>>> %timeit g.b.nunique()
1 loops, best of 3: 1 s per loop

>>> %timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop

>>> %timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop

>>> %timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop

The fastest way I know to accomplish the same thing is this:

>>> g = df.groupby(['a', 'b'])

>>> %timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop

... which is apparently a LOT faster. Grouping by both columns collapses each distinct (a, b) pair into a single row, so the level-0 group sizes are exactly the per-group unique counts.
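
For reference, an equivalent formulation (not from the original report, just a rewrite of the same idea) drops the duplicate (a, b) pairs up front and then takes plain group sizes, which should land in the same performance ballpark:

>>> df[['a', 'b']].drop_duplicates().groupby('a').size()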

I wonder if something similar could be done in GroupBy.nunique, since it's quite a common use case?
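
As a minimal sketch of how such a fast path might look internally, here is a factorize-then-lexsort version in plain NumPy. The function name groupby_nunique_sketch and its exact layout are hypothetical (not pandas API); it just illustrates counting unique values per group without a Python-level loop over groups:

import numpy as np
import pandas as pd

def groupby_nunique_sketch(keys, values):
    # Hypothetical helper, not pandas API: count unique `values` per
    # group in `keys` using only vectorized operations.
    ids, _ = pd.factorize(keys)      # integer group id per row
    codes, _ = pd.factorize(values)  # integer value code per row
    # (assumes no missing values; factorize would emit -1 codes for them)
    if len(ids) == 0:
        return np.array([], dtype=np.intp)
    # Sort by (group id, value code) so equal (id, code) pairs are adjacent.
    order = np.lexsort((codes, ids))
    ids, codes = ids[order], codes[order]
    # A row starts a new unique value whenever either component changes.
    new_pair = np.empty(len(ids), dtype=bool)
    new_pair[0] = True
    new_pair[1:] = (ids[1:] != ids[:-1]) | (codes[1:] != codes[:-1])
    # Count the new (id, code) pairs within each group.
    return np.bincount(ids[new_pair])

On the example frame above, groupby_nunique_sketch(df['a'].values, df['b'].values) should return the same counts as g.b.nunique() (up to group ordering), with the cost dominated by one O(n log n) sort plus a few vectorized passes.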
