Closed
Description
Currently, in Series.nunique, we first remove NaNs, then use len on the result of Series.unique. Except for Series that are mostly null values, it is more performant to switch the order of these operations:
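import numpy as np
import pandas as pd
n = 100_000
part_nan = 10  # number of NaNs per 100 non-null values
ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)
# current order: drop NaNs, then take unique values
%timeit ser.nunique()
# switched order: take unique values, then drop NaNs
%timeit (~np.isnan(ser.unique())).sum()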
gives (ser.nunique() first, then the switched order)
104 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67 ms ± 567 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Changing part_nan to 100 (50% null values) gives
126 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
96.5 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
On my machine, they are about equal when part_nan is 250 (~70% null values).
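To make the two orderings concrete, here is a minimal sketch that checks they agree on the count. The helper names nunique_dropna_first and nunique_unique_first are hypothetical and are not pandas internals; the first is only my approximation of the current behavior described above, and pd.isna is used instead of np.isnan as an assumption so that non-float dtypes would also work.

```python
import numpy as np
import pandas as pd


def nunique_dropna_first(ser: pd.Series) -> int:
    """Drop nulls from the full Series, then count the unique values."""
    return len(ser.dropna().unique())


def nunique_unique_first(ser: pd.Series) -> int:
    """Take unique values first, then discard nulls from the small result."""
    uniques = ser.unique()
    return int((~pd.isna(uniques)).sum())


# Smaller than the benchmark above so the check runs quickly.
ser = pd.Series(1_000 * (10 * [np.nan] + list(range(100)))).astype(float)
counts = (nunique_dropna_first(ser), nunique_unique_first(ser), ser.nunique())
print(counts)
```

In the second ordering the null mask is computed on the array of unique values rather than on the whole Series, which is presumably where the saving comes from when most values are non-null. For reference, part_nan = 250 corresponds to 250 / (250 + 100) ≈ 71% null values.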