
[MRG] ENH: Vectorized ADASYN #649


Merged: 6 commits merged into master from adasyn-vectorized on Dec 5, 2019

Conversation


@MattEding MattEding commented Nov 18, 2019

What does this implement/fix? Explain your changes.

  • Simplify the ADASYN code logic by vectorizing its components.
  • Significantly improve sparse matrix speed while maintaining dense array performance.
  • Fix the ADASYN module docstring.

TODO:

  • Benchmark performance
  • Adjust unit tests due to change in random state usage

@pep8speaks

pep8speaks commented Nov 18, 2019

Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-22 07:02:01 UTC

@@ -731,13 +731,12 @@ def _fit_resample(self, X, y):
             X_resampled.append(X_new)
             y_resampled.append(y_new)

-        if sparse.issparse(X_new):
+        if sparse.issparse(X):
@MattEding (Contributor Author):

Changed because, in the edge case where the number of samples produced is 0, the loop body never runs and X_new is never assigned, so referencing it would raise an UnboundLocalError; the input X is always defined.
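
For illustration, a minimal sketch of the failure mode with hypothetical names (the real loop lives in _fit_resample):

from scipy import sparse
import numpy as np

def resample_sketch(X, counts):
    # Toy stand-in for the per-class resampling loop.
    for n in counts:
        X_new = sparse.csr_matrix(np.zeros((n, X.shape[1])))
    # If counts is empty, X_new was never assigned:
    # return sparse.issparse(X_new)  # would raise UnboundLocalError
    return sparse.issparse(X)        # the input X is always defined

X = sparse.csr_matrix(np.eye(2))
print(resample_sketch(X, counts=[]))  # True, and no crash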

@@ -98,7 +98,7 @@ def _make_samples(
         """
         random_state = check_random_state(self.random_state)
         samples_indices = random_state.randint(
-            low=0, high=len(nn_num.flatten()), size=n_samples
+            low=0, high=nn_num.size, size=n_samples
@MattEding (Contributor Author):

Avoids making a copy of the array: ndarray.flatten() allocates a new flattened array just so len() can be taken, whereas ndarray.size reads the element count directly.
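
For illustration, a quick check that the two expressions agree while only one allocates:

import numpy as np

nn_num = np.arange(12).reshape(3, 4)

# flatten() materializes a copy of the array just so len() can be taken...
assert len(nn_num.flatten()) == 12

# ...whereas .size is a plain attribute lookup with no allocation.
assert nn_num.size == 12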

@@ -131,7 +130,9 @@ def _fit_resample(self, X, y):
             )
             ratio_nn /= np.sum(ratio_nn)
             n_samples_generate = np.rint(ratio_nn * n_samples).astype(int)
-            if not np.sum(n_samples_generate):
+            # rounding may cause new amount for n_samples
+            n_samples = np.sum(n_samples_generate)
@MattEding (Contributor Author):

The original non-vectorized implementation took this same approach: after rounding, np.sum(n_samples_generate) becomes the effective n_samples, rather than truncating the rounded counts back down to the requested n_samples.
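
A small numeric illustration (toy ratios, not taken from this PR) of how rounding can change the total:

import numpy as np

ratio_nn = np.array([0.34, 0.33, 0.33])  # normalized density ratios
n_samples = 10

n_samples_generate = np.rint(ratio_nn * n_samples).astype(int)
print(n_samples_generate)          # [3 3 3]
print(np.sum(n_samples_generate))  # 9 -- one sample lost to rounding

# so the rounded counts define the effective total
n_samples = np.sum(n_samples_generate)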

@chkoar
Member

chkoar commented Nov 19, 2019

Hey @MattEding. It's great that you're working on this. Do you have any timings for your changes?

@MattEding
Contributor Author

Benchmarks with Zenodo datasets and Random Sparse Matrices

ADASYN Vectorized vs. Loop Benchmark

  • Performed on a MacBook Pro using 4 cores (n_jobs was maxed out so the timings reflect the refactored code rather than the NearestNeighbors search both implementations use internally).
  • Reported the minimum of 3 trials using timeit (taking the minimum is its default behavior).
  • As with the previous SMOTE vectorization, dense array timings are essentially unchanged.
  • There is a clear advantage in execution time when sparse matrices are used. A reproduction sketch follows this list.
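
The exact Zenodo datasets are not listed here, so as a sketch, comparable timings might be reproduced on synthetic data along these lines (sizes and parameters are placeholders):

import timeit
import numpy as np
from scipy import sparse
from imblearn.over_sampling import ADASYN

rng = np.random.RandomState(0)
X_dense = rng.rand(5000, 20)
y = np.array([0] * 4500 + [1] * 500)   # 9:1 class imbalance
X_sparse = sparse.csr_matrix(X_dense)  # the real benchmarks used genuinely sparse data

for name, X in [("dense", X_dense), ("sparse", X_sparse)]:
    ada = ADASYN(random_state=0, n_jobs=4)
    best = min(timeit.repeat(lambda: ada.fit_resample(X, y), number=1, repeat=3))
    print(f"{name}: {best:.3f} s")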

@MattEding
Contributor Author

Comparison of ADASYN Implementations

For both the For-Loop and Vectorized implementations,
n_samples_generate == [1 0 1 0 1 0 1 0]
so both will generate the same number of synthetic samples for each of
the 8 minority class samples. I will mark lines with *X to
highlight the rows that correspond to n_samples_generate.

Nearest Neighbors

For each minority sample, both implementations refer to the same neighbors. The
only difference is that the vectorized implementation does not store each sample
as its own nearest neighbor; see the sketch after the two arrays below.

For-Loop

nn_index
[[0 2 6 4 1 3]  *A
 [1 3 7 4 5 2]
 [2 0 1 4 6 5]  *B
 [3 1 7 5 2 4]
 [4 1 2 0 7 3]  *C
 [5 3 1 7 2 4]
 [6 0 2 4 1 5]  *D
 [7 3 1 5 4 2]]

Vectorized

nns
[[2 6 4 1 3]  *A
 [3 7 4 5 2]
 [0 1 4 6 5]  *B
 [1 7 5 2 4]
 [1 2 0 7 3]  *C
 [3 1 7 2 4]
 [0 2 4 1 5]  *D
 [3 1 5 4 2]]
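
A sketch of how the two layouts relate (toy data standing in for the 8 minority samples):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_class = rng.rand(8, 2)  # stand-in for the minority samples

nn = NearestNeighbors(n_neighbors=6).fit(X_class)
nn_index = nn.kneighbors(X_class, return_distance=False)

# Column 0 is each point itself (distance zero), as in the For-Loop layout.
assert np.array_equal(nn_index[:, 0], np.arange(8))

# Dropping it gives the (8, 5) layout used by the vectorized implementation.
nns = nn_index[:, 1:]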

Minority samples

For-Loop

For each pass of the loop where the continue statement is not triggered,
these are the minority samples from which synthetic samples will be
generated. We can confirm that they are identical across implementations.

x_i
[ 0.11622591 -0.0317206 ]  *A

[ 0.53366841 -0.30312976]  *B

[ 0.88407872  0.35454207]  *C

[-0.41635887 -0.38299653]  *D

Vectorized

X_class
[[ 0.11622591 -0.0317206 ]  *A
 [ 1.25192108 -0.22367336]
 [ 0.53366841 -0.30312976]  *B
 [ 1.52091956 -0.49283504]
 [ 0.88407872  0.35454207]  *C
 [ 1.31301027 -0.92648734]
 [-0.41635887 -0.38299653]  *D
 [ 1.70580611 -0.11219234]]

Selected Neighbors

The randomly selected neighbors for each of the samples conveniently align for
this test case. This lucky happenstance makes comparing the two implementations
easier. Note that the For-Loop indices are one unit larger by necessity, because
its nearest-neighbor array stores the sample point itself in column 0; a quick
check follows the outputs below.

For-Loop

nn_zs
[5]  *A

[1]  *B

[4]  *C

[4]  *D

Vectorized

cols
[4 0 3 3]
*A B C D
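
A quick check of that offset using the values above:

import numpy as np

nn_zs = np.array([5, 1, 4, 4])  # For-Loop selections (*A-*D)
cols = np.array([4, 0, 3, 3])   # Vectorized selections (*A-*D)

# The +1 accounts for the self-neighbor occupying column 0
# in the For-Loop's nearest-neighbor array.
assert np.array_equal(nn_zs, cols + 1)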

Differences of Sample and Neighbor

For-Loop

This is not directly observable in a single pdb frame, but by entering and
leaving interact mode between loop iterations we can capture the results.

The code this is derived from is:

x_class_gen.append([
    x_i + step * (X_class[x_i_nn[nn_z], :] - x_i)
    for step, nn_z in zip(steps, nn_zs)
])

Via the debugger:

[(X_class[x_i_nn[nn_z], :] - x_i) for step, nn_z in zip(steps, nn_zs)]

[array([ 1.40469365, -0.46111444])]
[array([-0.4174425 ,  0.27140916])]
[array([ 0.82172739, -0.46673441])]
[array([ 1.66827995,  0.15932317])]

Vectorized

diffs
[[ 1.40469365 -0.46111444]
 [-0.4174425   0.27140916]
 [ 0.82172739 -0.46673441]
 [ 1.66827995  0.15932317]]
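
To make the vectorized step concrete, this sketch reproduces the diffs above from the arrays already shown (the variable rows is my naming; the actual code may differ):

import numpy as np

X_class = np.array([[ 0.11622591, -0.0317206 ],
                    [ 1.25192108, -0.22367336],
                    [ 0.53366841, -0.30312976],
                    [ 1.52091956, -0.49283504],
                    [ 0.88407872,  0.35454207],
                    [ 1.31301027, -0.92648734],
                    [-0.41635887, -0.38299653],
                    [ 1.70580611, -0.11219234]])
nns = np.array([[2, 6, 4, 1, 3],
                [3, 7, 4, 5, 2],
                [0, 1, 4, 6, 5],
                [1, 7, 5, 2, 4],
                [1, 2, 0, 7, 3],
                [3, 1, 7, 2, 4],
                [0, 2, 4, 1, 5],
                [3, 1, 5, 4, 2]])
n_samples_generate = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# One entry per synthetic sample: the minority row it grows from.
rows = np.repeat(np.arange(len(X_class)), n_samples_generate)  # [0 2 4 6]
cols = np.array([4, 0, 3, 3])  # the randomly selected neighbors above

# All four sample-to-neighbor differences in one fancy-indexing step.
diffs = X_class[nns[rows, cols]] - X_class[rows]
print(diffs)  # matches the diffs array above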

Uniformly Random Steps

Both implementations use the same random function, random_state.uniform(),
just with different shapes and call patterns.
This is where the values diverge, but we can now be confident that the two
implementations of the ADASYN algorithm coincide except for random number
generation.

For-Loop

steps
[ 0.59284462]

[ 0.60276338]

[ 0.84725174]

[ 0.64589411]

Vectorized

steps
[[ 0.54488318]
 [ 0.4236548 ]
 [ 0.64589411]
 [ 0.43758721]]
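
For illustration (toy seed and draws, not the PR's): even with identical seeds, interleaving other draws inside the loop shifts where each uniform step lands in the stream:

import numpy as np

rs_loop = np.random.RandomState(0)
rs_vec = np.random.RandomState(0)

# Vectorized: one contiguous bulk draw.
vec_steps = rs_vec.uniform(size=(4, 1))

# For-Loop: some other draw (a stand-in randint here) happens before each
# uniform step, so the streams diverge despite the shared seed.
loop_steps = []
for _ in range(4):
    rs_loop.randint(0, 5)
    loop_steps.append(rs_loop.uniform(size=1))

print(vec_steps.ravel())
print(np.concatenate(loop_steps))  # different values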

Newly Synthesized Samples

The new samples are then generated from the above information; a snippet reproducing the vectorized result follows the outputs below.

For-Loop

X_new
[[ 0.94899098 -0.30508981]
 [ 0.28204936 -0.13953426]
 [ 1.58028868 -0.04089947]
 [ 0.66117333 -0.28009063]]

y_new
[0 0 0 0]

Vectorized

X_new
[[ 0.88161986 -0.2829741 ]
 [ 0.35681689 -0.18814597]
 [ 1.4148276   0.05308106]
 [ 0.3136591  -0.31327875]]

y_new
[0 0 0 0]
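
Continuing the diffs sketch above (reusing its X_class, rows, and diffs), the vectorized synthesis is a single broadcasted expression; the steps values are the vectorized ones printed earlier:

steps = np.array([[0.54488318],
                  [0.4236548 ],
                  [0.64589411],
                  [0.43758721]])

# Each (n, 1)-shaped step scales its entire row of diffs via broadcasting.
X_new = X_class[rows] + steps * diffs
print(X_new)  # matches the vectorized X_new above

y_new = np.zeros(len(X_new), dtype=int)  # all new samples belong to the minority class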

Unit Test

From the above results, I feel that the changes to the unit tests are accurate and justifiable.

@MattEding MattEding changed the title from [WIP] ENH: Vectorized ADASYN to [MRG] ENH: Vectorized ADASYN on Nov 22, 2019
@codecov

codecov bot commented Nov 22, 2019

Codecov Report

Merging #649 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #649      +/-   ##
==========================================
- Coverage   98.46%   98.46%   -0.01%     
==========================================
  Files          82       82              
  Lines        4886     4876      -10     
==========================================
- Hits         4811     4801      -10     
  Misses         75       75
Impacted Files Coverage Δ
imblearn/over_sampling/tests/test_adasyn.py 100% <ø> (ø) ⬆️
imblearn/over_sampling/_smote.py 96.69% <100%> (ø) ⬆️
imblearn/over_sampling/_adasyn.py 98.36% <100%> (-0.24%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3839df1...ad574ce.

@glemaitre glemaitre added this to the 0.6 milestone Nov 26, 2019
@glemaitre
Member

I'll have a look at it before the release.

@glemaitre glemaitre merged commit a0ac84d into scikit-learn-contrib:master Dec 5, 2019
@MattEding MattEding deleted the adasyn-vectorized branch December 5, 2019 15:52