Skip to content

Optimising Series.nunique for Nan values #40865 #41236

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 3, 2021

Conversation

KenilMehta
Copy link
Contributor

@KenilMehta KenilMehta commented Apr 30, 2021

@pep8speaks
Copy link

pep8speaks commented Apr 30, 2021

Hello @KenilMehta! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-03 05:05:01 UTC

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Apr 30, 2021
@jreback jreback added this to the 1.3 milestone Apr 30, 2021
@jreback
Copy link
Contributor

jreback commented Apr 30, 2021

thanks @KenilMehta

can you add a whatsnew entry (improved perf of Series.nunique)

and do we have asv's for this? (if not we would like to add them), and run to show the improvements. (could also just show via a timeout on master vs your PR)

@KenilMehta
Copy link
Contributor Author

@jreback
I am new to this. Can you please tell me what do you mean by a asv?

@jreback
Copy link
Contributor

jreback commented Apr 30, 2021

@KenilMehta
Copy link
Contributor Author

  1. What is new in the PR?
    As @rhshadrach pointed out while raising this issue,
    Currently we first remove nans, then use len() on the result of Series.unique. Except for Series that are mostly null values, it is more performant to switch the order of these operations.
    I have done exactly the same thing in the PR by first finding the unique values of series and then removing nans from the uniques values.

  2. Is there an asv for this?
    I could not find any asv for this. I have added an asv for this in the PR.

  3. Show the improvements.
    I have written a small script which I ran against my PR and the master.

import time
import pandas as pd
import numpy as np

def calculateTime(part_nan):
    n = 100_000
    ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)

    startTime = time.time()
    n = ser.nunique()
    endTime = time.time()
    print(f"time taken {(endTime - startTime)*1000} {part_nan}")


calculateTime(0)
calculateTime(10)
calculateTime(100)
calculateTime(250)
calculateTime(300)

Following are the results from master:

time taken 134.84644889831543 0
time taken 128.6320686340332 10
time taken 159.5449447631836 100
time taken 196.3047981262207 250
time taken 206.98261260986328 300

Following are the results from PR:

time taken 72.15404510498047 0
time taken 75.88934898376465 10
time taken 108.44254493713379 100
time taken 166.73660278320312 250
time taken 187.24393844604492 300

We can clearly see that the time taken is less in the PR as compared to master.

@jreback
Tried to ans all the things that you asked for.

@jreback
Copy link
Contributor

jreback commented May 2, 2021

can you add a whatsnew note in doc/source/whatsnew/v1.3.0.txt in the perf section. pls ping when greeen.

@KenilMehta
Copy link
Contributor Author

@jreback
Added whatsnew in doc/source/whatsnew/v1.3.0.txt file.
All builds passed, it is green.

@jreback jreback merged commit c61e66e into pandas-dev:master May 3, 2021
@jreback
Copy link
Contributor

jreback commented May 3, 2021

thanks @KenilMehta

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Series.nunique can compute unique, then remove na
3 participants