Closed
Description
I've noticed that Series.unique and Series.nunique can be slow on large datasets when used with a categorical series. Presumably they're not utilising the shortcuts:
Series.unique = Series.cat.categories
Series.nunique = len(Series.cat.categories)
Here's an example in IPython:
s = pd.Series(np.random.choice(['a','b','c'], 100000000)).astype('category')
%time s.nunique()
896 ms
%time len(s.cat.categories)
55.1 µs
It's significantly slower (roughly 16,000 times here), indicating it's not using the above shortcut.
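One caveat on the proposed shortcut: it only matches unique/nunique when every category actually occurs in the data and there are no missing values. The sketch below checks both cases. It assumes a reasonably recent pandas, and it uses a smaller series than the one above (my choice, so it runs quickly); the series `t` is a made-up example with one unused category.

```python
import numpy as np
import pandas as pd

# Smaller than the 100,000,000 rows in the report so it runs quickly.
s = pd.Series(np.random.choice(["a", "b", "c"], 1000000)).astype("category")

# When all categories are observed, the shortcut agrees with nunique/unique.
assert s.nunique() == len(s.cat.categories)
assert set(s.unique()) == set(s.cat.categories)

# Hypothetical counterexample: category "c" is declared but never used.
t = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b", "c"]))
n_observed = t.nunique()             # counts values present in the data
n_declared = len(t.cat.categories)   # counts declared categories
print(n_observed, n_declared)

# Dropping unused categories first makes the two counts agree again.
n_after_cleanup = len(t.cat.remove_unused_categories().cat.categories)
print(n_after_cleanup)
```

So a fast path would probably need to look at which category codes are actually present rather than return the categories directly.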
pd.show_versions()
pandas: 0.17.1
Cython: 0.23.4
numpy: 1.10.1
IPython: 4.0.1