Optimising Series.nunique for Nan values #40865 #41236

KenilMehta · 2021-04-30T14:08:32Z

closes PERF: Series.nunique can compute unique, then remove na #40865

pep8speaks · 2021-04-30T14:08:35Z

Hello @KenilMehta! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-03 05:05:01 UTC

jreback · 2021-04-30T16:51:26Z

thanks @KenilMehta

can you add a whatsnew entry (improved perf of Series.nunique)

and do we have asvs for this? (if not, we would like to add them), and run them to show the improvements. (could also just show via a timeit on master vs your PR)

KenilMehta · 2021-04-30T17:24:12Z

@jreback
I am new to this. Can you please tell me what you mean by an asv?

jreback · 2021-04-30T17:29:15Z

https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#running-the-performance-test-suite
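For context, an asv (airspeed velocity) benchmark is a plain Python class: asv calls `setup` before timing and then times each method whose name starts with `time_`, once per entry in `params`. The sketch below only illustrates that shape. The class name `NuniqueWithNaN`, the parameter name `nan_repeats`, and the sizes are assumptions made for illustration, not the benchmark that was added in this PR.

```python
import numpy as np
import pandas as pd


class NuniqueWithNaN:
    # Hypothetical benchmark class; asv runs time_* once per value in params.
    params = [0, 10, 100]
    param_names = ["nan_repeats"]

    def setup(self, nan_repeats):
        # Build the Series outside the timed region so only nunique() is measured.
        values = nan_repeats * [np.nan] + list(range(100))
        self.ser = pd.Series(1_000 * values).astype(float)

    def time_nunique(self, nan_repeats):
        # The body of a time_* method is what asv times.
        self.ser.nunique()
```

A file like this would normally sit in the project's benchmark directory and be run through the asv command line described at the link above.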

KenilMehta · 2021-05-01T11:49:25Z

What is new in the PR?
As @rhshadrach pointed out when raising the issue:
"Currently we first remove nans, then use len() on the result of Series.unique. Except for Series that are mostly null values, it is more performant to switch the order of these operations."
This PR does exactly that: it first finds the unique values of the Series and then removes NaNs from those unique values.
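The two orderings can be sketched as follows. This is a minimal illustration using public pandas calls, not the actual code changed in pandas; the helper names `nunique_dropna_first` and `nunique_unique_first` are hypothetical.

```python
import numpy as np
import pandas as pd


def nunique_dropna_first(ser: pd.Series) -> int:
    """Older ordering: drop NaNs from the full Series, then deduplicate."""
    return len(ser.dropna().unique())


def nunique_unique_first(ser: pd.Series) -> int:
    """Reordered: deduplicate first, then drop NaNs from the unique values."""
    uniques = ser.unique()
    # The NaN mask is computed on the unique values only, which are usually
    # far fewer than the rows of the original Series.
    return int((~pd.isna(uniques)).sum())


ser = pd.Series([1.0, np.nan, 2.0, 2.0, np.nan, 3.0])
print(nunique_dropna_first(ser), nunique_unique_first(ser), ser.nunique())
```

Both helpers return the same count; the difference is only in how much data the NaN check has to scan.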
Is there an asv for this?
I could not find an existing asv for this, so I have added one in the PR.
Show the improvements.
I wrote a small script and ran it against both my PR and master.

import time

import numpy as np
import pandas as pd


def calculateTime(part_nan):
    # Each repeated chunk holds part_nan NaNs followed by the values 0..99,
    # so a larger part_nan means a larger share of NaN values.
    n = 100_000
    ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)

    # Time a single nunique() call and report it in milliseconds.
    startTime = time.time()
    ser.nunique()
    endTime = time.time()
    print(f"time taken {(endTime - startTime)*1000} {part_nan}")


calculateTime(0)
calculateTime(10)
calculateTime(100)
calculateTime(250)
calculateTime(300)

Following are the results from master:

time taken 134.84644889831543 0
time taken 128.6320686340332 10
time taken 159.5449447631836 100
time taken 196.3047981262207 250
time taken 206.98261260986328 300

Following are the results from PR:

time taken 72.15404510498047 0
time taken 75.88934898376465 10
time taken 108.44254493713379 100
time taken 166.73660278320312 250
time taken 187.24393844604492 300

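The PR is faster than master in every case, although the gain narrows as the share of NaN values grows: from roughly 135 ms down to 72 ms with no NaNs, versus roughly 207 ms down to 187 ms at part_nan = 300.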
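Since each figure above comes from a single timed call, a repeated measurement may give steadier numbers. The following is a sketch using the standard library `timeit` module; the helper name `best_nunique_ms` and the smaller default size are assumptions chosen so that it runs quickly, and it has not been run against either branch here.

```python
import timeit

import numpy as np
import pandas as pd


def best_nunique_ms(part_nan: int, n: int = 1_000, repeat: int = 5) -> float:
    """Return the best-of-`repeat` time in milliseconds for one nunique() call."""
    ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)
    # timeit.repeat accepts a callable; number=1 times one call per repetition.
    timings = timeit.repeat(ser.nunique, number=1, repeat=repeat)
    return min(timings) * 1000


for part_nan in (0, 10, 100):
    print(f"part_nan={part_nan}: {best_nunique_ms(part_nan):.3f} ms")
```

Taking the minimum over several repetitions reduces the influence of one-off noise from the machine; running the same script on master and on the PR branch gives a like-for-like comparison.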

@jreback
I have tried to answer everything you asked for.

jreback · 2021-05-02T23:52:19Z

can you add a whatsnew note in doc/source/whatsnew/v1.3.0.txt in the perf section. pls ping when green.

KenilMehta · 2021-05-03T06:58:34Z

@jreback
Added whatsnew in doc/source/whatsnew/v1.3.0.txt file.
All builds passed, it is green.

jreback · 2021-05-03T15:12:37Z

thanks @KenilMehta

Kenil Mehta added 3 commits April 29, 2021 21:34

PERF: Optimising Series.nunique() for NaN values pandas-dev#40865

f58048d

using isna inplace of np.isnan

b09c336

using numpy isnan function

47c7911

Kenil Mehta added 2 commits April 30, 2021 20:32

using isna isplace of np.isnan

b438052

small change

b2f9b0c

jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Apr 30, 2021

jreback added this to the 1.3 milestone Apr 30, 2021

adding asv for Series Nunique() with Nan values

7f20a6c

precommit check changes

2985b9d

KenilMehta force-pushed the Issue#40865 branch from aaad974 to 2985b9d Compare May 1, 2021 19:03

adding entry in whatsnew doc

6628cae

jreback merged commit c61e66e into pandas-dev:master May 3, 2021

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

Optimising Series.nunique for Nan values pandas-dev#40865 (pandas-dev…

175209d

…#41236)

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

Optimising Series.nunique for Nan values pandas-dev#40865 (pandas-dev…

0e44192

…#41236)

TendouArisu mentioned this pull request Feb 16, 2025

Potential performance issue: Unreliable performance of pd.Series.nunique in pandas 1.2.4 Priya22/EmotionDynamics#4

Open
