-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[MRG] Refactor ratio to pick up any class #290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @glemaitre! Thanks for updating the PR.
Comment last updated on May 30, 2017 at 22:15 Hours UTC |
imblearn/utils/validation.py
Outdated
raise ValueError("When 'ratio' is a string, it needs to be one of" | ||
" {}. Got '{}' instead.".format(RATIO_KIND, | ||
ratio)) | ||
if ratio == 'all' or ratio == 'auto': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO we could avoid that if-elses by using a dict
That's big PR @glemaitre! |
@chkoar yes indeed. I am fixing back the tests, right now. I get why we had trouble to put it out. It required lot of works |
And it should be good to go fast on this one since any further development will benefit from those changes |
Hints: for a faster review process, you should actually focus on the base classes and mixin. All methods were adapted to use those and there is not real changes apart to make some of them multiclass. |
imblearn/under_sampling/base.py
Outdated
X, y = check_X_y(X, y) | ||
y = check_target_type(y) | ||
self.X_hash_, self.y_hash_ = hash_X_y(X, y) | ||
self.ratio_ = check_ratio(self.ratio, y, 'cleaning-sampling') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-sampling
might be more appropriate, no?
That review it will take some time!
p.s. I proposed class variables in order to avoid to override the |
I agree with those. I would also deprecate the combine class.
|
* change cleaning-sampler to clean-sampler * Refactor the over_sampling * [WIP] adapt ensamble class
* change cleaning-sampler to clean-sampler * Refactor the over_sampling * [WIP] adapt ensamble class * iterate * fix PEP8
@glemaitre I will scan it this weekend! |
imblearn/base.py
Outdated
if not type_of_target(y) == 'binary': | ||
warnings.simplefilter('always', UserWarning) | ||
warnings.warn('The target type should be binary.') | ||
class BaseSampler(SamplerMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we merge BaseSampler
and SamplerMixin
into BaseSampler
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not in favor. I would like the sampler to have a SamplerMixin
as the ClassifierMixin
or TransformerMixin
imblearn/base.py
Outdated
------- | ||
self : object, | ||
Return self. | ||
def _validate_size_ngh_deprecation(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we could make all these _vaildate_methods(estimator)
as module level functions to keep the class clean. So, it will be clean for any newcomer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I agree with that
imblearn/utils/validation.py
Outdated
y_hash: str | ||
Hash identifier of the ``y`` matrix. | ||
""" | ||
return joblib.hash(X), joblib.hash(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this fast for big arrays? Does it worth instead of the naive check of the array shape?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it needs to be profiled. But I think that this is too slow. We could randomly pick some samples which should be better than the shape. But I am still unsure about that part.
@massich @chkoar Does anybody how to optimize this function: without cythonizing at first. |
LGTM |
What was the progress on that? |
In general this PR should have been splitted. Since, it hasn't and the tests are passing I propose to merge the changes. @glemaitre you put a lot of effort on this. We could improve or fix any bug from this PR in the feature. Thanks! |
* EHN enable multiclass ratio handling * FIX simplify call to dictionary * FIX RUS done * FIX Refactor ADASYN * FIX partial * FIX refactor SMOTE * FIX refactor SMOTE * DOC add proper docstring * PEP8 * FIX ClusterCentroids * FIX refactor IHT * FIX Nearmiss refactoring * FIX tomek links refactor * FIX refactor OSS * FIX NCR refactoring * FIX refactor combined methods with Pipeline * FIX combine method targetting all classes when cleaning * FIX balance cascade refactoring * EHN add the possibility to add a dict for ratio * TST add test for check_ratio * TST add test for float * FIX/TST adapt common test * TST fix IHT tests * TST fix NCR * FIX combine test * TST fix balance * FIX doctest * FIX doctest * FIX solve the pickle issue * FIX remove comments * TST add test for NCR * TST add knn balance cascade * EHN add callable option for the ratio * DOC make doc cleaner * FIX/DOC remove useless comments and clean doc * DEP deprecation of ratio as float * EHN add base class for cleaning methods * TST add common test for multi class * MAINT downgrade sphinx for the moment * TST/EHN add test for the ratio and specific ratio for cleaning sampling * EHN remove redundant code * FIX warning * Remove useless base class * MAINT add christos back to some file * EHN rename test and add a comment * DOC add hash_X_y in the API * [MRG] Incorporate chkoar remarks (#6) * change cleaning-sampler to clean-sampler * Refactor the over_sampling * [WIP] adapt ensamble class * [MRG] Remove the init in base class (#7) * change cleaning-sampler to clean-sampler * Refactor the over_sampling * [WIP] adapt ensamble class * iterate * fix PEP8 * EHN doc * FIX add extension for sphinx * EHN make deprecatin great again * EHN Improve SMOTE and ADASYN
* EHN enable multiclass ratio handling * FIX simplify call to dictionary * FIX RUS done * FIX Refactor ADASYN * FIX partial * FIX refactor SMOTE * FIX refactor SMOTE * DOC add proper docstring * PEP8 * FIX ClusterCentroids * FIX refactor IHT * FIX Nearmiss refactoring * FIX tomek links refactor * FIX refactor OSS * FIX NCR refactoring * FIX refactor combined methods with Pipeline * FIX combine method targetting all classes when cleaning * FIX balance cascade refactoring * EHN add the possibility to add a dict for ratio * TST add test for check_ratio * TST add test for float * FIX/TST adapt common test * TST fix IHT tests * TST fix NCR * FIX combine test * TST fix balance * FIX doctest * FIX doctest * FIX solve the pickle issue * FIX remove comments * TST add test for NCR * TST add knn balance cascade * EHN add callable option for the ratio * DOC make doc cleaner * FIX/DOC remove useless comments and clean doc * DEP deprecation of ratio as float * EHN add base class for cleaning methods * TST add common test for multi class * MAINT downgrade sphinx for the moment * TST/EHN add test for the ratio and specific ratio for cleaning sampling * EHN remove redundant code * FIX warning * Remove useless base class * MAINT add christos back to some file * EHN rename test and add a comment * DOC add hash_X_y in the API * [MRG] Incorporate chkoar remarks (#6) * change cleaning-sampler to clean-sampler * Refactor the over_sampling * [WIP] adapt ensamble class * [MRG] Remove the init in base class (#7) * change cleaning-sampler to clean-sampler * Refactor the over_sampling * [WIP] adapt ensamble class * iterate * fix PEP8 * EHN doc * FIX add extension for sphinx * EHN make deprecatin great again * EHN Improve SMOTE and ADASYN
Reference Issue
Fixes #121
What does this implement/fix? Explain your changes.
Refactor the
ratio
parameterTODO:
'auto'
corresponds to'all'
for over-sampling and'not minority'
for under-sampling.OneSidedSelection
NeighborhoodCleaningRule
Counter
occurrences when this is useless.X_shape_
from the documentationAny other comments?