
[MRG] ENH: Vectorized ADASYN #649


Merged: 6 commits merged into master from adasyn-vectorized on Dec 5, 2019

Conversation


@MattEding MattEding commented Nov 18, 2019

What does this implement/fix? Explain your changes.

  • Simplify the ADASYN code logic by vectorizing its components.
  • Significantly improve sparse matrix speed while maintaining dense array performance.
  • Fix the ADASYN module docstring.

TODO:

  • Benchmark performance
  • Adjust unit tests due to change in random state usage

@pep8speaks

pep8speaks commented Nov 18, 2019

Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-22 07:02:01 UTC

@@ -731,13 +731,12 @@ def _fit_resample(self, X, y):
             X_resampled.append(X_new)
             y_resampled.append(y_new)

-        if sparse.issparse(X_new):
+        if sparse.issparse(X):
@MattEding (Contributor Author):

Changed because, in the edge case where the number of samples produced is 0, the loop body never runs and X_new is never assigned, so referencing it would raise an UnboundLocalError; the input X is always defined.
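
For illustration, a minimal sketch of the failure mode with hypothetical names (the real loop lives in _fit_resample):

from scipy import sparse
import numpy as np

def resample_sketch(X, counts):
    # Toy stand-in for the per-class resampling loop.
    for n in counts:
        X_new = sparse.csr_matrix(np.zeros((n, X.shape[1])))
    # If counts is empty, X_new was never assigned:
    # return sparse.issparse(X_new)  # would raise UnboundLocalError
    return sparse.issparse(X)        # the input X is always defined

X = sparse.csr_matrix(np.eye(2))
print(resample_sketch(X, counts=[]))  # True, and no crash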

@@ -98,7 +98,7 @@ def _make_samples(
         """
         random_state = check_random_state(self.random_state)
         samples_indices = random_state.randint(
-            low=0, high=len(nn_num.flatten()), size=n_samples
+            low=0, high=nn_num.size, size=n_samples
@MattEding (Contributor Author):

Avoids making a copy of the array: ndarray.flatten() allocates a new flattened array just so len() can be taken, whereas ndarray.size reads the element count directly.
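
For illustration, a quick check that the two expressions agree while only one allocates:

import numpy as np

nn_num = np.arange(12).reshape(3, 4)

# flatten() materializes a copy of the array just so len() can be taken...
assert len(nn_num.flatten()) == 12

# ...whereas .size is a plain attribute lookup with no allocation.
assert nn_num.size == 12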

@@ -131,7 +130,9 @@ def _fit_resample(self, X, y):
             )
             ratio_nn /= np.sum(ratio_nn)
             n_samples_generate = np.rint(ratio_nn * n_samples).astype(int)
-            if not np.sum(n_samples_generate):
+            # rounding may cause new amount for n_samples
+            n_samples = np.sum(n_samples_generate)
@MattEding (Contributor Author):

The original non-vectorized implementation took this same approach: after rounding, np.sum(n_samples_generate) becomes the effective n_samples, rather than truncating the rounded counts back down to the requested n_samples.
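
A small numeric illustration (toy ratios, not taken from this PR) of how rounding can change the total:

import numpy as np

ratio_nn = np.array([0.34, 0.33, 0.33])  # normalized density ratios
n_samples = 10

n_samples_generate = np.rint(ratio_nn * n_samples).astype(int)
print(n_samples_generate)          # [3 3 3]
print(np.sum(n_samples_generate))  # 9 -- one sample lost to rounding

# so the rounded counts define the effective total
n_samples = np.sum(n_samples_generate)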

@chkoar
Member

chkoar commented Nov 19, 2019

Hey @MattEding. It's great that you're working on this. Do you have any timings for your changes?

@MattEding
Contributor Author

Benchmarks with Zenodo datasets and Random Sparse Matrices

ADASYN Vectorized vs. Loop Benchmark

  • Performed on a MacBook Pro using 4 cores (n_jobs was maxed out so the timings reflect the refactored code rather than the NearestNeighbors search both implementations use internally).
  • Reported the minimum of 3 trials using timeit (taking the minimum is its default behavior).
  • As with the previous SMOTE vectorization, dense array timings are essentially unchanged.
  • There is a clear advantage in execution time when sparse matrices are used. A reproduction sketch follows this list.
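
The exact Zenodo datasets are not listed here, so as a sketch, comparable timings might be reproduced on synthetic data along these lines (sizes and parameters are placeholders):

import timeit
import numpy as np
from scipy import sparse
from imblearn.over_sampling import ADASYN

rng = np.random.RandomState(0)
X_dense = rng.rand(5000, 20)
y = np.array([0] * 4500 + [1] * 500)   # 9:1 class imbalance
X_sparse = sparse.csr_matrix(X_dense)  # the real benchmarks used genuinely sparse data

for name, X in [("dense", X_dense), ("sparse", X_sparse)]:
    ada = ADASYN(random_state=0, n_jobs=4)
    best = min(timeit.repeat(lambda: ada.fit_resample(X, y), number=1, repeat=3))
    print(f"{name}: {best:.3f} s")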

@MattEding
Contributor Author

Comparison of ADASYN Implementations

For both the For-Loop and Vectorized implementations,
n_samples_generate == [1 0 1 0 1 0 1 0]
so both will generate the same number of synthetic samples for each of
the 8 minority class samples. I will mark lines with *X to
highlight the rows that correspond to n_samples_generate.

Nearest Neighbors

For each minority sample, both implementations refer to the same neighbors. The
only difference is that the vectorized implementation does not store each sample
as its own nearest neighbor; see the sketch after the two arrays below.

For-Loop

nn_index
[[0 2 6 4 1 3]  *A
 [1 3 7 4 5 2]
 [2 0 1 4 6 5]  *B
 [3 1 7 5 2 4]
 [4 1 2 0 7 3]  *C
 [5 3 1 7 2 4]
 [6 0 2 4 1 5]  *D
 [7 3 1 5 4 2]]

Vectorized

nns
[[2 6 4 1 3]  *A
 [3 7 4 5 2]
 [0 1 4 6 5]  *B
 [1 7 5 2 4]
 [1 2 0 7 3]  *C
 [3 1 7 2 4]
 [0 2 4 1 5]  *D
 [3 1 5 4 2]]
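
A sketch of how the two layouts relate (toy data standing in for the 8 minority samples):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_class = rng.rand(8, 2)  # stand-in for the minority samples

nn = NearestNeighbors(n_neighbors=6).fit(X_class)
nn_index = nn.kneighbors(X_class, return_distance=False)

# Column 0 is each point itself (distance zero), as in the For-Loop layout.
assert np.array_equal(nn_index[:, 0], np.arange(8))

# Dropping it gives the (8, 5) layout used by the vectorized implementation.
nns = nn_index[:, 1:]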

Minority samples

For-Loop

For each pass of the loop where the continue statement is not triggered,
these are the minority samples from which synthetic samples will be
generated. We can confirm that they are identical across implementations.

x_i
[ 0.11622591 -0.0317206 ]  *A

[ 0.53366841 -0.30312976]  *B

[ 0.88407872  0.35454207]  *C

[-0.41635887 -0.38299653]  *D

Vectorized

X_class
[[ 0.11622591 -0.0317206 ]  *A
 [ 1.25192108 -0.22367336]
 [ 0.53366841 -0.30312976]  *B
 [ 1.52091956 -0.49283504]
 [ 0.88407872  0.35454207]  *C
 [ 1.31301027 -0.92648734]
 [-0.41635887 -0.38299653]  *D
 [ 1.70580611 -0.11219234]]

Selected Neighbors

The randomly selected neighbors for each of the samples conveniently align for
this test case. This lucky happenstance makes comparing the two implementations
easier. Note that the For-Loop indices are one unit larger by necessity, because
its nearest-neighbor array stores the sample point itself in column 0; a quick
check follows the outputs below.

For-Loop

nn_zs
[5]  *A

[1]  *B

[4]  *C

[4]  *D

Vectorized

cols
[4 0 3 3]
*A B C D
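
A quick check of that offset using the values above:

import numpy as np

nn_zs = np.array([5, 1, 4, 4])  # For-Loop selections (*A-*D)
cols = np.array([4, 0, 3, 3])   # Vectorized selections (*A-*D)

# The +1 accounts for the self-neighbor occupying column 0
# in the For-Loop's nearest-neighbor array.
assert np.array_equal(nn_zs, cols + 1)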

Differences of Sample and Neighbor

For-Loop

This is not directly observable in a single pdb frame, but by entering and
leaving interact mode between loop iterations we can capture the results.

The code this is derived from is:

x_class_gen.append([
    x_i + step * (X_class[x_i_nn[nn_z], :] - x_i)
    for step, nn_z in zip(steps, nn_zs)
])

Via the debugger:

[(X_class[x_i_nn[nn_z], :] - x_i) for step, nn_z in zip(steps, nn_zs)]

[array([ 1.40469365, -0.46111444])]
[array([-0.4174425 ,  0.27140916])]
[array([ 0.82172739, -0.46673441])]
[array([ 1.66827995,  0.15932317])]

Vectorized

diffs
[[ 1.40469365 -0.46111444]
 [-0.4174425   0.27140916]
 [ 0.82172739 -0.46673441]
 [ 1.66827995  0.15932317]]
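
To make the vectorized step concrete, this sketch reproduces the diffs above from the arrays already shown (the variable rows is my naming; the actual code may differ):

import numpy as np

X_class = np.array([[ 0.11622591, -0.0317206 ],
                    [ 1.25192108, -0.22367336],
                    [ 0.53366841, -0.30312976],
                    [ 1.52091956, -0.49283504],
                    [ 0.88407872,  0.35454207],
                    [ 1.31301027, -0.92648734],
                    [-0.41635887, -0.38299653],
                    [ 1.70580611, -0.11219234]])
nns = np.array([[2, 6, 4, 1, 3],
                [3, 7, 4, 5, 2],
                [0, 1, 4, 6, 5],
                [1, 7, 5, 2, 4],
                [1, 2, 0, 7, 3],
                [3, 1, 7, 2, 4],
                [0, 2, 4, 1, 5],
                [3, 1, 5, 4, 2]])
n_samples_generate = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# One entry per synthetic sample: the minority row it grows from.
rows = np.repeat(np.arange(len(X_class)), n_samples_generate)  # [0 2 4 6]
cols = np.array([4, 0, 3, 3])  # the randomly selected neighbors above

# All four sample-to-neighbor differences in one fancy-indexing step.
diffs = X_class[nns[rows, cols]] - X_class[rows]
print(diffs)  # matches the diffs array above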

Uniformly Random Steps

Both implementations use the same random function, random_state.uniform(),
just with different shapes and call patterns.
This is where the values diverge, but we can now be confident that the two
implementations of the ADASYN algorithm coincide except for random number
generation.

For-Loop

steps
[ 0.59284462]

[ 0.60276338]

[ 0.84725174]

[ 0.64589411]

Vectorized

steps
[[ 0.54488318]
 [ 0.4236548 ]
 [ 0.64589411]
 [ 0.43758721]]
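
For illustration (toy seed and draws, not the PR's): even with identical seeds, interleaving other draws inside the loop shifts where each uniform step lands in the stream:

import numpy as np

rs_loop = np.random.RandomState(0)
rs_vec = np.random.RandomState(0)

# Vectorized: one contiguous bulk draw.
vec_steps = rs_vec.uniform(size=(4, 1))

# For-Loop: some other draw (a stand-in randint here) happens before each
# uniform step, so the streams diverge despite the shared seed.
loop_steps = []
for _ in range(4):
    rs_loop.randint(0, 5)
    loop_steps.append(rs_loop.uniform(size=1))

print(vec_steps.ravel())
print(np.concatenate(loop_steps))  # different values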

Newly Synthesized Samples

The new samples are then generated from the above information; a snippet reproducing the vectorized result follows the outputs below.

For-Loop

X_new
[[ 0.94899098 -0.30508981]
 [ 0.28204936 -0.13953426]
 [ 1.58028868 -0.04089947]
 [ 0.66117333 -0.28009063]]

y_new
[0 0 0 0]

Vectorized

X_new
[[ 0.88161986 -0.2829741 ]
 [ 0.35681689 -0.18814597]
 [ 1.4148276   0.05308106]
 [ 0.3136591  -0.31327875]]

y_new
[0 0 0 0]
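
Continuing the diffs sketch above (reusing its X_class, rows, and diffs), the vectorized synthesis is a single broadcasted expression; the steps values are the vectorized ones printed earlier:

steps = np.array([[0.54488318],
                  [0.4236548 ],
                  [0.64589411],
                  [0.43758721]])

# Each (n, 1)-shaped step scales its entire row of diffs via broadcasting.
X_new = X_class[rows] + steps * diffs
print(X_new)  # matches the vectorized X_new above

y_new = np.zeros(len(X_new), dtype=int)  # all new samples belong to the minority class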

Unit Test

From the above results, I feel that the changes to the unit tests are accurate and justifiable.

@MattEding MattEding changed the title from [WIP] ENH: Vectorized ADASYN to [MRG] ENH: Vectorized ADASYN on Nov 22, 2019
@codecov

codecov bot commented Nov 22, 2019

Codecov Report

Merging #649 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #649      +/-   ##
==========================================
- Coverage   98.46%   98.46%   -0.01%     
==========================================
  Files          82       82              
  Lines        4886     4876      -10     
==========================================
- Hits         4811     4801      -10     
  Misses         75       75
Impacted Files Coverage Δ
imblearn/over_sampling/tests/test_adasyn.py 100% <ø> (ø) ⬆️
imblearn/over_sampling/_smote.py 96.69% <100%> (ø) ⬆️
imblearn/over_sampling/_adasyn.py 98.36% <100%> (-0.24%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3839df1...ad574ce.

@glemaitre glemaitre added this to the 0.6 milestone Nov 26, 2019
@glemaitre
Member

I'll have a look at it before the release.

@glemaitre glemaitre merged commit a0ac84d into scikit-learn-contrib:master Dec 5, 2019
@MattEding MattEding deleted the adasyn-vectorized branch December 5, 2019 15:52