[MRG] EHN handling sparse matrices whenever possible #316

glemaitre · 2017-08-12T11:05:54Z

Reference Issue

closes #158

What does this implement/fix? Explain your changes.

Enhancement to handle sparse matrices (+pandas dataframe, list, etc.).

Supported samplers:

All prototype-selection samplers can be supported.
Ensemble classes.
RandomOverSampler should be supported.

ADASYN, SMOTE, and ClusterCentroids cannot be supported straightforwardly. I need to think more about them.

TODO:

Any other comments?

codecov · 2017-08-12T11:21:10Z

Codecov Report

Merging #316 into master will decrease coverage by 0.22%.
The diff coverage is 97.74%.

@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
- Coverage   98.19%   97.96%   -0.23%     
==========================================
  Files          66       66              
  Lines        3978     3924      -54     
==========================================
- Hits         3906     3844      -62     
- Misses         72       80       +8

| Impacted Files | Coverage Δ | |
|---|---|---|
| imblearn/ensemble/easy_ensemble.py | `100% <ø> (ø)` | ⬆️ |
| imblearn/combine/smote_enn.py | `85.29% <100%> (ø)` | ⬆️ |
| ...arn/under_sampling/prototype_selection/nearmiss.py | `98.66% <100%> (+1.26%)` | ⬆️ |
| imblearn/over_sampling/random_over_sampler.py | `100% <100%> (ø)` | ⬆️ |
| ...sampling/prototype_generation/cluster_centroids.py | `100% <100%> (ø)` | ⬆️ |
| imblearn/over_sampling/tests/test_adasyn.py | `100% <100%> (ø)` | ⬆️ |
| ...prototype_selection/neighbourhood_cleaning_rule.py | `100% <100%> (ø)` | ⬆️ |
| .../under_sampling/prototype_selection/tomek_links.py | `100% <100%> (ø)` | ⬆️ |
| imblearn/combine/smote_tomek.py | `93.22% <100%> (ø)` | ⬆️ |
| ...mpling/prototype_selection/random_under_sampler.py | `100% <100%> (ø)` | ⬆️ |
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 488a0e8...f8ebd0e. Read the comment docs.

pep8speaks · 2017-08-13T18:15:14Z

Hello @glemaitre! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 24, 2017 at 19:40 Hours UTC

glemaitre · 2017-08-14T16:20:38Z

@massich @chkoar Here comes the sparse handling.

It is a bit tricky for ClusterCentroids, since we create a sparse matrix for the centroids, which are not sparse, and then stack it with the original sparse matrix. I will add an entry in the docstring to state that.
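To make the ClusterCentroids caveat concrete, here is a minimal sketch using only NumPy and SciPy. It is not the code from this PR: `stack_centroids` and `density` are hypothetical helpers, and the "centroids" are simple row-group means standing in for k-means output. It only shows that wrapping dense centroids in a sparse container keeps the stacked result sparse in type while the centroid rows themselves stay mostly non-zero.

```python
import numpy as np
from scipy import sparse


def stack_centroids(X_sparse, centroids):
    """Stack dense centroids under a sparse matrix (hypothetical helper)."""
    # Converting to CSR changes the container, not the content: every
    # non-zero centroid value is still stored explicitly.
    return sparse.vstack([X_sparse, sparse.csr_matrix(centroids)], format="csr")


def density(X):
    """Fraction of explicitly stored non-zero entries in a sparse matrix."""
    return X.nnz / (X.shape[0] * X.shape[1])


X_sparse = sparse.random(100, 50, density=0.05, format="csr", random_state=0)

# Stand-in for k-means centroids: the mean of each block of 20 rows.
centroids = np.vstack([
    np.asarray(X_sparse[i:i + 20].mean(axis=0)).ravel()
    for i in range(0, 100, 20)
])

X_res = stack_centroids(X_sparse, centroids)
print(type(X_res).__name__, X_res.shape)
print("density of original rows:", density(X_sparse))
print("density of centroid rows:", density(X_res[100:]))
```

The centroid block comes out far denser than the input rows, which is the behaviour the docstring entry is meant to warn about.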

For SMOTE, we can effectively compute the sparse matrix, but I am not sure it makes sense to have a lot of zeros there. I think it would make more sense inside a categorical SMOTE.

For the review: I did not modify the existing tests. Instead, I added a common test which checks that sampling a dense matrix gives the same result as sampling the equivalent sparse matrix.
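The idea behind such a common test can be sketched as follows. This is an illustration under assumptions, not the test added in this PR: `random_under_sample` is a hypothetical stand-in for a sampler, and `check_sparse_matches_dense` is a hypothetical name for the check. Only NumPy and SciPy calls are used.

```python
import numpy as np
from scipy import sparse


def random_under_sample(X, y, random_state=0):
    """Hypothetical sampler: keep as many rows per class as the smallest class has."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    # Integer-array row indexing works on both ndarrays and CSR matrices.
    return X[idx], y[idx]


def check_sparse_matches_dense(sampler, X, y):
    """Resample dense and sparse versions of X and compare the results."""
    X_dense, y_dense = sampler(X, y)
    X_sp, y_sp = sampler(sparse.csr_matrix(X), y)
    assert sparse.issparse(X_sp)  # the sparse input must stay sparse
    np.testing.assert_allclose(X_sp.toarray(), X_dense)
    np.testing.assert_array_equal(y_sp, y_dense)


rng = np.random.RandomState(42)
X = rng.rand(30, 4)
y = np.array([0] * 20 + [1] * 10)
check_sparse_matches_dense(random_under_sample, X, y)
```

The comparison relies on the sampler being seeded identically for both calls. Using `assert_allclose` rather than exact equality leaves room for small floating-point differences in samplers that compute new values.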

glemaitre · 2017-08-14T16:28:37Z

Regarding ClusterCentroids, I think a PR that uses nearest-neighbors voting instead of generating the centroids would be good.

I can imagine a voting parameter taking 'hard'/'soft'/'auto'. 'hard' would be the default with sparse input, and 'soft' otherwise.

@chkoar WDYT? I know that you have been thinking about that for a while.
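A rough sketch of what the proposed voting could look like, assuming that 'hard' means replacing each centroid with its nearest original sample, 'soft' means keeping the generated centroid, and 'auto' picks between them based on sparsity. Both `hard_vote` and `resolve_voting` are hypothetical names, not the API that was eventually merged.

```python
import numpy as np
from scipy import sparse


def resolve_voting(voting, X):
    """Assumed 'auto' rule: 'hard' for sparse input, 'soft' otherwise."""
    if voting == "auto":
        return "hard" if sparse.issparse(X) else "soft"
    return voting


def hard_vote(X, centroids):
    """Replace each dense centroid with its nearest row of X."""
    # Squared Euclidean distance expanded as ||x||^2 - 2 x.c + ||c||^2,
    # so a sparse X is never converted to a dense array.
    if sparse.issparse(X):
        x_sq = np.asarray(X.multiply(X).sum(axis=1)).ravel()
        cross = np.asarray(X @ centroids.T)
    else:
        x_sq = (X ** 2).sum(axis=1)
        cross = X @ centroids.T
    c_sq = (centroids ** 2).sum(axis=1)
    dist = x_sq[:, None] - 2 * cross + c_sq[None, :]
    nearest = dist.argmin(axis=0)  # one row index per centroid
    return X[nearest], nearest


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [4.0, 4.0]])
centroids = np.array([[0.9, 0.1], [3.0, 4.0]])

X_hard, idx = hard_vote(sparse.csr_matrix(X), centroids)
print(idx)                       # indices of the selected original samples
print(sparse.issparse(X_hard))   # the output keeps the sparsity of X
```

With hard voting the output rows are original samples, so a sparse input produces an output that is as sparse as the input rows, which is the motivation for making it the default in that case.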

massich · 2017-08-22T13:56:53Z

imblearn/utils/estimator_checks.py

 from sklearn.utils.estimator_checks import _yield_all_checks \
    as sklearn_yield_all_checks, check_estimator \
    as sklearn_check_estimator, check_parameters_default_constructible
 from sklearn.exceptions import NotFittedError
 from sklearn.utils.testing import (assert_warns, assert_raises_regex,
                                   assert_true, set_random_state,
-                                   assert_equal)
+                                   assert_equal, assert_allclose,


I don't understand why, but this PR would reinsert assert_equal, and the diff looks weird because it is no longer there in master (see this).

Actually, this needs rebasing on master.

glemaitre · 2017-08-22T14:26:09Z

@massich rebased. Note that you will have an assert_raises_regex in a new test_adasyn

edit: ping #321

massich · 2017-08-22T15:18:27Z

LGTM, if all the CIs are happy

glemaitre · 2017-08-24T19:13:40Z

doc/introduction.rst

+API's of imbalanced-learn samplers
+----------------------------------
+
+The sampler available follows the scikit-learn API using the estimator base


The available samplers follow

using the base estimator and adding a sampling functionality through the sample method

glemaitre · 2017-08-24T19:15:34Z

doc/over_sampling.rst


 The augmented data set should be used instead of the original data set to train
 a classifier::

  >>> from sklearn.svm import LinearSVC
  >>> clf = LinearSVC()
-  >>> clf.fit(X_resampled, y_resampled) # doctest: +ELLIPSIS


Add this back

glemaitre · 2017-08-24T19:16:32Z

doc/under_sampling.rst

+.. warning::
+
+   :class:`ClusterCentroids` supports sparse matrices. However, the new samples
+   are generated are not specifically sparse. Therefore, even if the resulting


the new samples generated

glemaitre · 2017-08-24T19:18:43Z

imblearn/ensemble/balance_cascade.py

-            index_classified = index_under_sample[pred_target == y_subset]
+            pred_target = pred[:index_under_sample.size]
+            index_classified = index_under_sample[
+                pred_target == y_subset[:index_under_sample.size]]


safe_indexing y_subset
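The point of the comment is that plain `y_subset[...]` indexing assumes an array, while the PR aims to accept lists, sparse matrices, and pandas objects too. The helper below is a hypothetical illustration of that kind of container-agnostic row selection. It is not scikit-learn's `safe_indexing` and makes no claim about how that utility is implemented.

```python
import numpy as np
from scipy import sparse


def take_rows(X, indices):
    """Hypothetical helper: select rows from several container types."""
    indices = np.asarray(indices)
    if hasattr(X, "iloc"):
        # pandas DataFrame / Series: positional indexing
        return X.iloc[indices]
    if hasattr(X, "shape"):
        # NumPy arrays and SciPy sparse matrices
        return X[indices]
    # Plain Python sequences such as lists
    return [X[i] for i in indices]


y_list = [10, 20, 30, 40]
y_array = np.array(y_list)
X_sparse = sparse.csr_matrix(np.eye(4))

print(take_rows(y_list, [2, 0]))
print(take_rows(y_array, [2, 0]))
print(take_rows(X_sparse, [2, 0]).toarray())
```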

EHN POC sparse handling for RandomUnderSampler

a68e8eb

glemaitre added 7 commits August 12, 2017 13:49

EHN support sparse ENN

0062d6d

iter

6197d80

EHN sparse indexing IHT

f669843

EHN sparse support nearmiss

4adc6db

Merge branch 'master' into is/158

9c93dab

EHN support sparse matrices for NCR

bba7835

EHN support sparse Tomek and OSS

9cd917b

glemaitre added 17 commits August 13, 2017 21:31

EHN support sparsity for CNN

c3ba307

EHN support sparse for SMOTE

d195868

EHN support sparse adasyn

bcf44ab

EHN support sparsity for sombine methods

c405aa9

EHN support sparsity BC

79637d7

DOC update docstring

c199af9

DOC fix example topic classification

425928f

FIX fix test and class clustercentroids

4ba8c4e

TST add common test

8298fdc

TST add ensemble

e4c6ebb

TST use allclose

1226a91

TST install conda with ubuntu container

68b16b5

TST increase tolerance

35c638b

TST increase tolerance

004f920

TST test all versions NearMiss and SMOTE

d3ceb5a

TST set the algorithm of KMeans

d9c4e55

DOC add entry in user guide

b469747

glemaitre changed the title ~~[WIP] EHN handling sparse matrices whenever possible~~ [MRG] EHN handling sparse matrices whenever possible Aug 14, 2017

glemaitre added 2 commits August 14, 2017 18:30

DOC add entry sparse for CC

c05d0ba

DOC whatsnew entry

1625879

This was referenced Aug 14, 2017

[MRG] EHN add voting paramter for ClusterCentroids #318

Merged

Release 0.3.0 #319

Closed

DOC fix api

709dec3

glemaitre force-pushed the master branch from 9304aa5 to f02c565 Compare August 17, 2017 17:59

massich reviewed Aug 22, 2017


glemaitre added 2 commits August 22, 2017 16:22

Merge branch 'master' into is/158

14b686f

TST adapt pytest

3e0cdc9

DOC update user guide

0595dab

glemaitre commented Aug 24, 2017


glemaitre added 3 commits August 24, 2017 21:27

Merge remote-tracking branch 'origin/master' into is/158

a540dc1

address comments

2d0e730

TST remove the last assert_regex

f8ebd0e

glemaitre merged commit cddf39b into scikit-learn-contrib:master Aug 24, 2017
