
[MRG] ENH: Vectorized SMOTE #596


Merged: 9 commits, Nov 17, 2019

Conversation

MattEding
Contributor

What does this implement/fix? Explain your changes.

Enhanced performance of dense and sparse matrix oversampling in BaseSMOTE, SMOTE, and SMOTENC by replacing for loops with vectorized operations.

For speed performance comparison see my benchmark gist:
https://gist.github.com/MattEding/97c3f36f508ed26e9b2e7dd22db17887
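To illustrate the kind of change involved (a minimal sketch with made-up shapes and random neighbor indices, not the PR's actual code): SMOTE generates a synthetic point by interpolating between a sample and one of its nearest neighbors, and the per-sample Python loop can be collapsed into a single broadcasted expression:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 4)                  # minority-class samples (made up)
nn = rng.randint(100, size=(100, 5))   # hypothetical k=5 neighbor indices

n_new = 50
rows = rng.randint(100, size=n_new)    # which sample to start from
cols = rng.randint(5, size=n_new)      # which neighbor to walk toward
steps = rng.uniform(size=(n_new, 1))   # interpolation factor in [0, 1)

# loop version: one synthetic sample per iteration
X_loop = np.empty((n_new, 4))
for i in range(n_new):
    neighbor = X[nn[rows[i], cols[i]]]
    X_loop[i] = X[rows[i]] + steps[i] * (neighbor - X[rows[i]])

# vectorized version: one broadcasted expression, no Python loop
neighbors = X[nn[rows, cols]]
X_vec = X[rows] + steps * (neighbors - X[rows])

assert np.allclose(X_loop, X_vec)
```

Both versions compute the same samples; the vectorized form simply pushes the loop down into NumPy.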

Any other comments?

In the original implementation of SMOTENC, argmax was not used for deterministic tie-breaking. To achieve random tie-breaking in the vectorized version, I had to forgo reproducing the original implementation's exact tie-breaks, since it uses different numpy random functions.
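The difference can be sketched as follows (illustrative counts, not the PR's actual code): np.argmax always breaks ties toward the lowest index, whereas a vectorized random tie-break can weight the tied maxima with random draws before taking the argmax:

```python
import numpy as np

rng = np.random.RandomState(0)

# counts of each categorical level among a sample's neighbors
# (rows: synthetic samples, columns: category levels) -- made-up data
counts = np.array([[2, 2, 0],
                   [0, 3, 1],
                   [1, 1, 1]])

# deterministic tie-breaking: argmax always picks the first maximum
deterministic = counts.argmax(axis=1)          # -> [0, 1, 0]

# random tie-breaking: mask the tied maxima, weight them with random
# numbers, and take the argmax of the weighted mask
is_max = counts == counts.max(axis=1, keepdims=True)
randomized = (is_max * rng.uniform(size=counts.shape)).argmax(axis=1)

# every randomized pick is one of that row's tied maxima
assert all(is_max[i, randomized[i]] for i in range(len(counts)))
```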

Interestingly, SMOTENC has no unit test that validates exact output values for a dataset, so while refactoring I was shocked that SMOTENC passed the tests even though I knew it should not have.

All main tests passed (other than show_versions, due to get_blas being removed from scikit-learn). The only tests I explicitly skipped were the keras tests and the matplotlib figure generation in the doc folder.

@pep8speaks

pep8speaks commented Sep 5, 2019

Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-17 19:57:54 UTC

@codecov

codecov bot commented Sep 5, 2019

Codecov Report

Merging #596 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
- Coverage   97.97%   97.97%   -0.01%     
==========================================
  Files          83       83              
  Lines        4878     4877       -1     
==========================================
- Hits         4779     4778       -1     
  Misses         99       99
Impacted Files Coverage Δ
imblearn/over_sampling/_smote.py 97.21% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b31677...98a0b2f. Read the comment docs.

@chkoar
Member

chkoar commented Sep 5, 2019

That would be a nice addition. A doctest is failing though.

@MattEding
Contributor Author

The failing doctest is due to a slight difference in numpy random function usage. Both implementations use random_state, but vectorizing the code required changing how tie-breaking is randomly selected. The algorithm works as intended, so either the pull request needs to be declined due to backwards incompatibility, or over_sampling.rst should change 'A' to 'B' so the test passes. I am not sure how you would like to go forward with this.

# taken from imbalanced-learn/doc/over_sampling.rst

from collections import Counter
import numpy as np

# create a synthetic data set with continuous and categorical features
rng = np.random.RandomState(42)
n_samples = 50
X = np.empty((n_samples, 3), dtype=object)
X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
X[:, 1] = rng.randn(n_samples)
X[:, 2] = rng.randint(3, size=n_samples)
y = np.array([0] * 20 + [1] * 30)
print(sorted(Counter(y).items()))
# [(0, 20), (1, 30)]


from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
# [(0, 30), (1, 30)]
print(X_resampled[-5:])
#   [['A' 0.5246469549655818 2]
#    ['B' -0.3657680728116921 2]
#    ['A' 0.9344237230779993 2]     <-- doctest difference ['B' 0.9344237230779993 2]
#    ['B' 0.3710891618824609 2]
#    ['B' 0.3327240726719727 2]]

# # # # # # # # #
# debugging breakpoint() inserted at _smote.py line 1088 in my pull request code
# 
# is_max[-5:]
#          'A'    'B'     'C'       Possible choices
# array([[ True, False,  True],     'A' or 'C' <-- randomly chooses same letter
#        [False,  True, False],     'B'
#        [ True,  True, False],     'A' or 'B' <-- working as expected, just randomly chose 'B' instead of 'A'
#        [False,  True, False],     'B'
#        [False,  True, False]])    'B'
# 
# is_max[-5:]
#           0      1       2       Possible choices
# array([[False, False,  True],     2
#        [False, False,  True],     2
#        [False, False,  True],     2
#        [False, False,  True],     2
#        [False, False,  True]])    2

Additionally, I have performed more benchmark testing focusing on the Zenodo datasets using timeit with 100 trials. The results are here:
https://github.com/MattEding/SMOTE-Benchmark/blob/master/SMOTE%20Vectorized%20vs%20Original%20Zenodo.ipynb
https://github.com/MattEding/SMOTE-Benchmark/blob/master/vectorize_test.py

@chkoar
Member

chkoar commented Sep 8, 2019

so either the pull-request needs to be declined due to backwards incompatibility

Probably you mean that someone will not get the same results using the same parameters on the same dataset, right? Sometimes this is expected to happen. Since we have a significant performance gain in terms of speed, I would not decline the pull request for the reason you mention. @glemaitre what do you think?

@MattEding
Contributor Author

@chkoar Yes, you are correct: I meant that the same parameters could produce different results. The API will remain the same.

If this gets approved, I would gladly vectorize the dense/sparse code for other algorithms, knowing that changed random_state behavior under a new implementation is not a deal breaker (e.g. ADASYN would produce different results, since its for loop calls randint on each iteration, which would become a single call under vectorization).
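One way batching the randint draws changes results, sketched with illustrative numbers (not ADASYN's actual code): when other random work happens inside the loop, the interleaved calls advance the generator between draws, so a single up-front randint call sees a different stream even with the same seed:

```python
import numpy as np

# loop version: a randint draw interleaved with other random work
rng = np.random.RandomState(0)
loop_draws = []
for _ in range(3):
    loop_draws.append(rng.randint(10))
    rng.uniform()          # e.g. an interpolation step drawn per iteration

# vectorized version: all randint draws batched into a single call
rng = np.random.RandomState(0)
batched = rng.randint(10, size=3)
steps = rng.uniform(size=3)

# the first draw matches, but later draws generally diverge because the
# loop's interleaved uniform() calls advanced the generator in between
assert batched[0] == loop_draws[0]
```

Both versions draw from the same distributions with the same seed, yet the concrete values need not match, which is why exact-reproducibility of results is the only thing sacrificed.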

@glemaitre
Member

@MattEding sorry for the delay. This seems to be significantly better for the sparse case.
Would you be able to resolve the conflicts and use this generation in the different SMOTE variants?

@glemaitre glemaitre self-assigned this Nov 17, 2019
@glemaitre glemaitre added this to the 0.6 milestone Nov 17, 2019
@glemaitre
Member

@glemaitre what do you think?

I think this is completely fine. I added an entry in what's new and mentioned that SMOTENC results could change. LGTM. Going to merge.

@glemaitre glemaitre merged commit b606cb9 into scikit-learn-contrib:master Nov 17, 2019
@glemaitre
Member

@MattEding Thanks a lot for your contribution. This is a major speed-up for the sparse case. The code was pretty bad there :).

@glemaitre
Member

Do not hesitate to optimize some of the other samplers :)

@MattEding
Contributor Author

Glad to have helped! When I have the time, I will look into other sampler implementations.

@MattEding MattEding deleted the smote-vectorized branch November 17, 2019 20:24