
[MRG] ENH: Vectorized SMOTE #596


Merged: 9 commits, Nov 17, 2019

Conversation

MattEding
Contributor

What does this implement/fix? Explain your changes.

Enhanced performance of dense and sparse matrix oversampling in BaseSMOTE, SMOTE, and SMOTENC by replacing for loops with vectorized operations.

For speed performance comparison see my benchmark gist:
https://gist.github.com/MattEding/97c3f36f508ed26e9b2e7dd22db17887
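To illustrate the kind of change involved (a minimal sketch with made-up shapes and random neighbor indices, not the PR's actual code): SMOTE generates a synthetic point by interpolating between a sample and one of its nearest neighbors, and the per-sample Python loop can be collapsed into a single broadcasted expression:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 4)                  # minority-class samples (made up)
nn = rng.randint(100, size=(100, 5))   # hypothetical k=5 neighbor indices

n_new = 50
rows = rng.randint(100, size=n_new)    # which sample to start from
cols = rng.randint(5, size=n_new)      # which neighbor to walk toward
steps = rng.uniform(size=(n_new, 1))   # interpolation factor in [0, 1)

# loop version: one synthetic sample per iteration
X_loop = np.empty((n_new, 4))
for i in range(n_new):
    neighbor = X[nn[rows[i], cols[i]]]
    X_loop[i] = X[rows[i]] + steps[i] * (neighbor - X[rows[i]])

# vectorized version: one broadcasted expression, no Python loop
neighbors = X[nn[rows, cols]]
X_vec = X[rows] + steps * (neighbors - X[rows])

assert np.allclose(X_loop, X_vec)
```

Both versions compute the same samples; the vectorized form simply pushes the loop down into NumPy.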

Any other comments?

In the original implementation of SMOTENC, argmax was not used for deterministic tie-breaking. To achieve random tie-breaking in the vectorized version, I had to forgo reproducing the original implementation's exact tie-breaks, since it uses different numpy random functions.
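The difference can be sketched as follows (illustrative counts, not the PR's actual code): np.argmax always breaks ties toward the lowest index, whereas a vectorized random tie-break can weight the tied maxima with random draws before taking the argmax:

```python
import numpy as np

rng = np.random.RandomState(0)

# counts of each categorical level among a sample's neighbors
# (rows: synthetic samples, columns: category levels) -- made-up data
counts = np.array([[2, 2, 0],
                   [0, 3, 1],
                   [1, 1, 1]])

# deterministic tie-breaking: argmax always picks the first maximum
deterministic = counts.argmax(axis=1)          # -> [0, 1, 0]

# random tie-breaking: mask the tied maxima, weight them with random
# numbers, and take the argmax of the weighted mask
is_max = counts == counts.max(axis=1, keepdims=True)
randomized = (is_max * rng.uniform(size=counts.shape)).argmax(axis=1)

# every randomized pick is one of that row's tied maxima
assert all(is_max[i, randomized[i]] for i in range(len(counts)))
```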

Interestingly, SMOTENC has no unit test that validates exact output values for a dataset, so while refactoring I was shocked that SMOTENC passed the tests even though I knew it should not have.

All main tests passed (other than show_versions, due to get_blas being removed from scikit-learn). The only tests I explicitly skipped were the keras tests and the matplotlib figure generation in the doc folder.

@pep8speaks

pep8speaks commented Sep 5, 2019

Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-17 19:57:54 UTC

@codecov

codecov bot commented Sep 5, 2019

Codecov Report

Merging #596 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
- Coverage   97.97%   97.97%   -0.01%     
==========================================
  Files          83       83              
  Lines        4878     4877       -1     
==========================================
- Hits         4779     4778       -1     
  Misses         99       99
Impacted Files Coverage Δ
imblearn/over_sampling/_smote.py 97.21% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b31677...98a0b2f. Read the comment docs.

@chkoar
Member

chkoar commented Sep 5, 2019

That would be a nice addition. A doctest is failing though.

@MattEding
Contributor Author

The failing doctest is due to a slight difference in numpy random function usage. Both implementations use random_state, but vectorizing the code required changing how tie-breaking is randomly selected. The algorithm works as intended, so either the pull request needs to be declined due to backwards incompatibility, or over_sampling.rst should change 'A' to 'B' so the test passes. I am not sure how you would like to go forward with this.

# taken from imbalanced-learn/doc/over_sampling.rst

from collections import Counter
import numpy as np

# create a synthetic data set with continuous and categorical features
rng = np.random.RandomState(42)
n_samples = 50
X = np.empty((n_samples, 3), dtype=object)
X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
X[:, 1] = rng.randn(n_samples)
X[:, 2] = rng.randint(3, size=n_samples)
y = np.array([0] * 20 + [1] * 30)
print(sorted(Counter(y).items()))
# [(0, 20), (1, 30)]


from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
# [(0, 30), (1, 30)]
print(X_resampled[-5:])
#   [['A' 0.5246469549655818 2]
#    ['B' -0.3657680728116921 2]
#    ['A' 0.9344237230779993 2]     <-- doctest difference ['B' 0.9344237230779993 2]
#    ['B' 0.3710891618824609 2]
#    ['B' 0.3327240726719727 2]]

# # # # # # # # #
# debugging breakpoint() inserted at _smote.py line 1088 in my pull request code
# 
# is_max[-5:]
#          'A'    'B'     'C'       Possible choices
# array([[ True, False,  True],     'A' or 'C' <-- randomly chooses same letter
#        [False,  True, False],     'B'
#        [ True,  True, False],     'A' or 'B' <-- working as expected, just randomly chose 'B' instead of 'A'
#        [False,  True, False],     'B'
#        [False,  True, False]])    'B'
# 
# is_max[-5:]
#           0      1       2       Possible choices
# array([[False, False,  True],     2
#        [False, False,  True],     2
#        [False, False,  True],     2
#        [False, False,  True],     2
#        [False, False,  True]])    2

Additionally, I have performed more benchmark testing focusing on the Zenodo datasets using timeit with 100 trials. The results are here:
https://github.com/MattEding/SMOTE-Benchmark/blob/master/SMOTE%20Vectorized%20vs%20Original%20Zenodo.ipynb
https://github.com/MattEding/SMOTE-Benchmark/blob/master/vectorize_test.py

@chkoar
Member

chkoar commented Sep 8, 2019

so either the pull-request needs to be declined due to backwards incompatibility

Probably you mean that someone will not get the same results using the same parameters on the same dataset, right? Sometimes this is expected to happen. Since we have a significant performance gain in terms of speed, I would not decline the pull request for the reason you mention. @glemaitre what do you think?

@MattEding
Contributor Author

@chkoar Yes, you are correct: I meant that the same parameters could produce different results. The API will remain the same.

If this gets approved, I would gladly vectorize the dense/sparse code for other algorithms, knowing that changed random_state behavior under a new implementation is not a deal breaker (e.g. ADASYN would produce different results, since its for loop calls randint on each iteration, which would become a single call under vectorization).
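One way batching the randint draws changes results, sketched with illustrative numbers (not ADASYN's actual code): when other random work happens inside the loop, the interleaved calls advance the generator between draws, so a single up-front randint call sees a different stream even with the same seed:

```python
import numpy as np

# loop version: a randint draw interleaved with other random work
rng = np.random.RandomState(0)
loop_draws = []
for _ in range(3):
    loop_draws.append(rng.randint(10))
    rng.uniform()          # e.g. an interpolation step drawn per iteration

# vectorized version: all randint draws batched into a single call
rng = np.random.RandomState(0)
batched = rng.randint(10, size=3)
steps = rng.uniform(size=3)

# the first draw matches, but later draws generally diverge because the
# loop's interleaved uniform() calls advanced the generator in between
assert batched[0] == loop_draws[0]
```

Both versions draw from the same distributions with the same seed, yet the concrete values need not match, which is why exact-reproducibility of results is the only thing sacrificed.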

@glemaitre
Member

@MattEding sorry for the delay. This seems to be significantly better for the sparse case.
Would you be able to resolve the conflicts and use this generation in the different SMOTE variants?

@glemaitre glemaitre self-assigned this Nov 17, 2019
@glemaitre glemaitre added this to the 0.6 milestone Nov 17, 2019
@glemaitre
Member

@glemaitre what do you think?

I think this is completely fine. I added an entry in what's new and mentioned that SMOTENC results could change. LGTM. Going to merge.

@glemaitre glemaitre merged commit b606cb9 into scikit-learn-contrib:master Nov 17, 2019
@glemaitre
Member

@MattEding Thanks a lot for your contribution. This is a major speed-up for the sparse case. The code was pretty bad there :).

@glemaitre
Member

Do not hesitate to optimize some of the other samplers :)

@MattEding
Contributor Author

Glad to have helped! When I have the time, I will look into other sampler implementations.

@MattEding MattEding deleted the smote-vectorized branch November 17, 2019 20:24