It looks like len(set) beats both len(np.unique) and pd.Series.nunique if done naively. Here's an example with a large number of groups, where we compute the unique counts of one column while grouping by another:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
...                    'b': np.random.randint(10, size=100000)})
>>> g = df.groupby('a')
>>> %timeit g.b.nunique()
1 loops, best of 3: 1 s per loop
>>> %timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop
>>> %timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop
>>> %timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop
The fastest way I know to accomplish the same thing is this:
>>> g = df.groupby(['a', 'b'])
>>> %timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop
... which is apparently a LOT faster: roughly 75x here (469 ms vs. 6.2 ms).
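An equivalent formulation (my phrasing, not part of the timings above) is to drop duplicate (a, b) pairs first and then take group sizes; assuming no missing values (drop_duplicates and nunique treat NaN differently), this gives the same counts:

>>> df.drop_duplicates(['a', 'b']).groupby('a').size()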
Wonder if something similar could be done in GroupBy.nunique, since it's quite a common use case?
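For what it's worth, here's a rough sketch of how such a vectorized nunique might look, using factorize to encode each (group, value) pair as a single integer so duplicate pairs collapse in one pass. group_nunique is a hypothetical helper, not pandas' actual implementation, and it sidesteps NaN handling:

import numpy as np
import pandas as pd

def group_nunique(keys, values):
    # Hypothetical sketch, not pandas internals. Assumes no missing
    # values: factorize codes NaN as -1, which would break the encoding.
    key_codes, key_labels = pd.factorize(keys)
    val_codes, val_labels = pd.factorize(values)
    # Encode each (group, value) pair as one int64 so duplicate pairs
    # collapse in a single vectorized np.unique pass (assumes the
    # product of group and value counts fits in int64).
    pairs = key_codes.astype(np.int64) * len(val_labels) + val_codes
    unique_pairs = np.unique(pairs)
    # Recover each surviving pair's group code and count pairs per
    # group; that count is the number of distinct values in the group.
    counts = np.bincount(unique_pairs // len(val_labels),
                         minlength=len(key_labels))
    return pd.Series(counts, index=key_labels)

e.g. group_nunique(df['a'], df['b']) should match df.groupby('a').b.nunique() up to index order (factorize yields first-appearance order rather than sorted group keys).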