[MRG] ENH: Vectorized SMOTE #596
Conversation
Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-11-17 19:57:54 UTC
Codecov Report

@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
- Coverage   97.97%   97.97%   -0.01%
==========================================
  Files          83       83
  Lines        4878     4877       -1
==========================================
- Hits         4779     4778       -1
  Misses         99       99

Continue to review full report at Codecov.
That would be a nice addition. A doctest is failing, though.
The failing doctest is just due to a slight difference in numpy random function usage. Both implementations use random_state, but by vectorizing the code I had to change how ties are randomly broken. The algorithm works as intended, so either the pull request needs to be declined for backwards incompatibility, or over_sampling.rst should change 'A' to 'B' so the test passes. I am not sure how you would like to go forward with this.
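The divergence can be illustrated with a toy tie-breaking sketch (the variable names and the noise trick here are illustrative, not the actual imbalanced-learn code): a per-row `rng.choice` loop and a vectorized noise-plus-argmin consume the random stream differently, so the same seed no longer guarantees identical tie-breaks.

```python
import numpy as np

dist = np.array([[1.0, 1.0, 2.0],
                 [3.0, 0.5, 0.5]])

# Loop version: break ties among each row's minima with rng.choice.
rng = np.random.RandomState(0)
loop_pick = np.array([rng.choice(np.flatnonzero(row == row.min()))
                      for row in dist])

# Vectorized version: perturb by tiny random noise, then a single argmin.
# The noise is far smaller than any real gap, so only ties are affected.
rng = np.random.RandomState(0)
vec_pick = (dist + 1e-9 * rng.uniform(size=dist.shape)).argmin(axis=1)

# Both picks are valid minima, but the two implementations draw from the
# random stream differently, so identical seeds can yield different picks.
```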
Additionally, I have performed more benchmark testing focusing on the Zenodo datasets, using timeit with 100 trials. The results are here:
Probably you mean that someone will not get the same results using the same parameters on the same dataset, right? Sometimes this is expected to happen. Since we have a significant performance gain in terms of speed, I would not decline the request for the reason you mention. @glemaitre what do you think?
@chkoar Yes, you are correct: I meant that the same parameters could produce different results. The API will remain the same. If this gets approved, I would gladly vectorize dense/sparse code for other algorithms, knowing that a change in random_state behavior under a new implementation is not a deal breaker (e.g. ADASYN would give different results, since its for loop currently calls randint on each iteration and would become a single call under vectorization).
@MattEding sorry for the delay. This seems to be significantly better for the sparse case.
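A hedged sketch of why the sparse case benefits so much (the names and shapes are illustrative, not the library internals): building synthetic samples row by row means repeated one-row sparse slices plus a final `vstack`, whereas fancy-indexing all rows at once does the same interpolation in a handful of sparse operations.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X = sparse.random(300, 40, density=0.05, format="csr", random_state=rng)
rows = rng.randint(0, 300, 100)   # base sample per synthetic point (toy)
nns = rng.randint(0, 300, 100)    # chosen neighbor per synthetic point
steps = rng.uniform(size=100)     # interpolation step in [0, 1)

# Row-by-row: repeated 1-row slicing plus a final vstack (slow path).
parts = [X[r] + float(s) * (X[n] - X[r])
         for r, n, s in zip(rows, nns, steps)]
X_loop = sparse.vstack(parts, format="csr")

# Vectorized: slice all rows at once and scale with a broadcasted multiply.
diff = X[nns] - X[rows]
X_vec = (X[rows] + diff.multiply(steps[:, None])).tocsr()

assert np.allclose(X_loop.toarray(), X_vec.toarray())
```

Both paths compute the same interpolation; the vectorized one simply avoids Python-level iteration over sparse rows.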
I think this is completely fine. I added an entry in what's new and mentioned that |
@MattEding Thanks a lot for your contribution. This is a major speed-up for the sparse case. The code was pretty bad there :).
Do not hesitate to optimize some other part samplers :) |
Glad to have helped! When I have the time, I will look into other sampler implementations. |
What does this implement/fix? Explain your changes.
Enhanced performance of dense and sparse matrix oversampling with `BaseSMOTE`, `SMOTE`, and `SMOTENC` by replacing for loops with vectorized operations.
For a speed comparison, see my benchmark gist:
https://gist.github.com/MattEding/97c3f36f508ed26e9b2e7dd22db17887
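As a minimal sketch of what replacing the for loop with vectorized operations looks like for the dense interpolation step (variable names are illustrative assumptions, not the actual imbalanced-learn internals):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(200, 8)                   # toy minority-class samples
nns = rng.randint(0, 200, size=200)    # one chosen neighbor per sample
steps = rng.uniform(size=(200, 1))     # interpolation fractions

# Loop version: one synthetic sample per iteration.
X_loop = np.empty_like(X)
for i in range(len(X)):
    X_loop[i] = X[i] + steps[i] * (X[nns[i]] - X[i])

# Vectorized version: a single fancy-indexed, broadcasted expression.
X_vec = X + steps * (X[nns] - X)

assert np.allclose(X_loop, X_vec)
```

The two produce identical synthetic samples here; the speed-up comes from moving the per-sample work out of Python-level loops and into NumPy.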
Any other comments?
With the original implementation of `SMOTENC`, `argmax` was not used for deterministic tie-breaking. To achieve random tie-breaking in the vectorized version, I had to forgo having identical tie-breaks to the original implementation, since it uses different numpy random functions.
Interestingly, `SMOTENC` does not have a unit test that validates exact values for a dataset, so while refactoring I was shocked that `SMOTENC` passed tests even though I knew it should not have.
All main tests passed (other than `show_versions`, due to `get_blas` being removed from scikit-learn). The tests I explicitly ignored were the `keras` tests and the `matplotlib` figure generation in the `doc` folder.