PERF: Series.nunique can compute unique, then remove na #40865

Closed
@rhshadrach

Description

Currently, Series.nunique first removes NaNs and then takes len of the result of Series.unique on what remains. Except for Series that are mostly null values, it is faster to switch the order of these operations, i.e. compute the unique values first and then drop the NaN:

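import numpy as np
import pandas as pd

n = 100_000
part_nan = 10  # number of NaNs per 100 distinct non-null values
ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)

# Current behavior, then the proposed order (unique first, then drop NaN)
%timeit ser.nunique()
%timeit (~np.isnan(ser.unique())).sum()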

gives

104 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67 ms ± 567 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Changing part_nan to 100 gives

126 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
96.5 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On my machine, they are about equal when part_nan is 250 (~70% null values).
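To make the proposal concrete, here is a minimal sketch of the reordered computation as a standalone function. The name `nunique_unique_first` is hypothetical and not part of pandas. Using `pd.isna` instead of `np.isnan` is an assumption made so the sketch also works for non-float dtypes. It is not the actual pandas implementation.

```python
import numpy as np
import pandas as pd


def nunique_unique_first(ser: pd.Series, dropna: bool = True) -> int:
    """Count distinct values by deduplicating first, then discarding NA.

    Hypothetical helper illustrating the order of operations proposed
    in this issue; not the pandas implementation.
    """
    # Deduplicate the full Series, NaN included.
    uniques = ser.unique()
    if dropna:
        # Drop NA from the (usually much smaller) array of unique values.
        # pd.isna is used rather than np.isnan so object/datetime data work too.
        return int((~pd.isna(uniques)).sum())
    return len(uniques)


# Small check against the built-in method.
example = pd.Series([1.0, 2.0, np.nan, 2.0, np.nan])
result = nunique_unique_first(example)
builtin = example.nunique()
```

The NA mask is computed on the deduplicated array rather than on the full Series, which is presumably where the saving comes from when few values are null.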
