-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[MRG] ENH: Vectorized ADASYN #649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] ENH: Vectorized ADASYN #649
Conversation
…ests due to random state changes
Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-11-22 07:02:01 UTC |
@@ -731,13 +731,12 @@ def _fit_resample(self, X, y): | |||
X_resampled.append(X_new) | |||
y_resampled.append(y_new) | |||
|
|||
if sparse.issparse(X_new): | |||
if sparse.issparse(X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in case of some edge case where the number of samples produced is 0 and thus X_new would cause an UnboundLocalError
@@ -98,7 +98,7 @@ def _make_samples( | |||
""" | |||
random_state = check_random_state(self.random_state) | |||
samples_indices = random_state.randint( | |||
low=0, high=len(nn_num.flatten()), size=n_samples | |||
low=0, high=nn_num.size, size=n_samples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid making copy of array
@@ -131,7 +130,9 @@ def _fit_resample(self, X, y): | |||
) | |||
ratio_nn /= np.sum(ratio_nn) | |||
n_samples_generate = np.rint(ratio_nn * n_samples).astype(int) | |||
if not np.sum(n_samples_generate): | |||
# rounding may cause new amount for n_samples | |||
n_samples = np.sum(n_samples_generate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original non-vectorized implementation used this approach to make n-samples rather than truncating np.sum(n_samples_generate)
down to n_samples
Hey @MattEding. It is great that you work on this. Do you have any timings regarding your changes? |
Benchmarks with Zenodo datasets and Random Sparse MatricesADASYN Vectorize vs Loop Benchmark
|
Comparison of ADASYN ImplemenationsFor both the For-Loop and Vectorized implemenations, Nearest NeighborsFor each minority sample both implementations refer to the same neighbors. The For-Loop
Vectorized
Minority samplesFor-LoopFor each pass of the loop where the
Vectorized
Selected NeighborsThe randomly selected neighbors for each of the samples conveniently align for For-Loop
Vectorized
Differences of Sample and NeighborFor-LoopThis is not directly observable in The code this is derived from is:
Via the debugger:
Vectorized
Uniformly Random StepsBoth implementations use the same random function, For-Loop
Vectorized
Newly Synthesized SamplesThe new samples are generated afterward from the above information. For-Loop
Vectorized
Unit TestFrom the above results, I feel that the changes to the unit tests are accurate and justifiable. |
Codecov Report
@@ Coverage Diff @@
## master #649 +/- ##
==========================================
- Coverage 98.46% 98.46% -0.01%
==========================================
Files 82 82
Lines 4886 4876 -10
==========================================
- Hits 4811 4801 -10
Misses 75 75
Continue to review full report at Codecov.
|
I'll have a look at it before to the release. |
What does this implement/fix? Explain your changes.
TODO: