
nunique performance for groupby with large number of groups #10820

Closed

Description

@aldanor

It looks like a naive len(set(...)) beats both len(np.unique(...)) and pd.Series.nunique. Here's an example with a large number of groups, where we compute the unique counts of one column while grouping by another:

>>> df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                       'b': np.random.randint(10, size=100000)})
>>> g = df.groupby('a')

>>> %timeit g.b.nunique()
1 loops, best of 3: 1 s per loop

>>> %timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop

>>> %timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop

>>> %timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop

The fastest way I know to accomplish the same thing is this:

>>> g = df.groupby(['a', 'b'])

>>> %timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop

... which is apparently a LOT faster. Grouping by both columns collapses each distinct (a, b) pair into a single row, so the level-0 group sizes are exactly the per-group unique counts.
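
For reference, an equivalent formulation (not from the original report, just a rewrite of the same idea) drops the duplicate (a, b) pairs up front and then takes plain group sizes, which should land in the same performance ballpark:

>>> df[['a', 'b']].drop_duplicates().groupby('a').size()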

I wonder if something similar could be done in GroupBy.nunique, since it's quite a common use case?
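
As a minimal sketch of how such a fast path might look internally, here is a factorize-then-lexsort version in plain NumPy. The function name groupby_nunique_sketch and its exact layout are hypothetical (not pandas API); it just illustrates counting unique values per group without a Python-level loop over groups:

import numpy as np
import pandas as pd

def groupby_nunique_sketch(keys, values):
    # Hypothetical helper, not pandas API: count unique `values` per
    # group in `keys` using only vectorized operations.
    ids, _ = pd.factorize(keys)      # integer group id per row
    codes, _ = pd.factorize(values)  # integer value code per row
    # (assumes no missing values; factorize would emit -1 codes for them)
    if len(ids) == 0:
        return np.array([], dtype=np.intp)
    # Sort by (group id, value code) so equal (id, code) pairs are adjacent.
    order = np.lexsort((codes, ids))
    ids, codes = ids[order], codes[order]
    # A row starts a new unique value whenever either component changes.
    new_pair = np.empty(len(ids), dtype=bool)
    new_pair[0] = True
    new_pair[1:] = (ids[1:] != ids[:-1]) | (codes[1:] != codes[:-1])
    # Count the new (id, code) pairs within each group.
    return np.bincount(ids[new_pair])

On the example frame above, groupby_nunique_sketch(df['a'].values, df['b'].values) should return the same counts as g.b.nunique() (up to group ordering), with the cost dominated by one O(n log n) sort plus a few vectorized passes.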
