-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[MRG] EHN handling sparse matrices whenever possible #316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #316 +/- ##
==========================================
- Coverage 98.19% 97.96% -0.23%
==========================================
Files 66 66
Lines 3978 3924 -54
==========================================
- Hits 3906 3844 -62
- Misses 72 80 +8
Continue to review full report at Codecov.
|
Hello @glemaitre! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 24, 2017 at 19:40 Hours UTC |
@massich @chkoar Here comes the sparse handling. It is a bit tricky for the For SMOTE, we can compute effectively the sparse matrix but then I am not sure it makes sense to have a lot of zero there. It would make sense inside a categorical-SMOTE I think. For the review, I did not modified the test and added a common test in which we check that dense sampling provide the same thing than a sparse sampling. |
Regarding the ClusterCentroids, I think that a PR making a nearest-neighbors voting instead of generating the centroids will be good. I can imagine a parameter @chkoar WDYT? I know that you were thinking about that since a while. |
imblearn/utils/estimator_checks.py
Outdated
from sklearn.utils.estimator_checks import _yield_all_checks \ | ||
as sklearn_yield_all_checks, check_estimator \ | ||
as sklearn_check_estimator, check_parameters_default_constructible | ||
from sklearn.exceptions import NotFittedError | ||
from sklearn.utils.testing import (assert_warns, assert_raises_regex, | ||
assert_true, set_random_state, | ||
assert_equal) | ||
assert_equal, assert_allclose, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why but this PR would insert back the assert_equal
and the diff looks wierd 'cos they are no longer there in master (see this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually needs rebasing master.
LGTM, if all the CIs are happy |
doc/introduction.rst
Outdated
API's of imbalanced-learn samplers | ||
---------------------------------- | ||
|
||
The sampler available follows the scikit-learn API using the estimator base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The available samplers follow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the base estimator and adding a sampling functionality throw the sample
method
|
||
The augmented data set should be used instead of the original data set to train | ||
a classifier:: | ||
|
||
>>> from sklearn.svm import LinearSVC | ||
>>> clf = LinearSVC() | ||
>>> clf.fit(X_resampled, y_resampled) # doctest: +ELLIPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this back
doc/under_sampling.rst
Outdated
.. warning:: | ||
|
||
:class:`ClusterCentroids` supports sparse matrices. However, the new samples | ||
are generated are not specifically sparse. Therefore, even if the resulting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new samples generated
imblearn/ensemble/balance_cascade.py
Outdated
index_classified = index_under_sample[pred_target == y_subset] | ||
pred_target = pred[:index_under_sample.size] | ||
index_classified = index_under_sample[ | ||
pred_target == y_subset[:index_under_sample.size]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
safe_indexing y_subset
Reference Issue
closes #158
What does this implement/fix? Explain your changes.
Enhancement to handle sparse matrices (+pandas dataframe, list, etc.).
Supported samplers:
ADASYN and SMOTE and ClusterCentroids cannot be supported straightforwardly. I need more thought about it.
TODO:
safe_indexing
in the different methods.Any other comments?