Skip to content

[MRG] Refactor ratio to pick up any class #290

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 53 commits into from
Jun 10, 2017

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented May 8, 2017

Reference Issue

Fixes #121

What does this implement/fix? Explain your changes.

Refactor the ratio parameter

TODO:

  • Need to be back compatible
  • Base class for: under-sampling and cleaning
  • Correct every test for array equality
  • Change the common test regarding the ratio
  • Add multi-class test for SMOTE, ADASYN, BalanceCascade, and others which were not binary
  • Correct the ratio docstring. 'auto' corresponds to 'all' for over-sampling and 'not minority' for under-sampling.
  • Add ratio docstring for cleaning methods.
  • Check that all ratio make sense for the methods:
    • 'all' makes no sense for OneSidedSelection
    • 'all' makes no sense for NeighborhoodCleaningRule
  • Make a proper logger
  • Add authorship
  • Add whats new entry
  • Remove a much comments as possible when it is not necessary
  • Remove the Counter occurrences when this is useless.
  • Remove X_shape_ from the documentation
  • Make a fast profiling to know if the hashing is an issue
  • Use deprecated directive from sphinx.
  • Check double quote in docstring

Any other comments?

@pep8speaks
Copy link

pep8speaks commented May 8, 2017

Hello @glemaitre! Thanks for updating the PR.

  • In the file doc/conf.py, following are the PEP8 issues :

Line 30:1: E722 do not use bare except'
Line 41:1: E402 module level import not at top of file
Line 309:80: E501 line too long (86 > 79 characters)
Line 335:80: E501 line too long (84 > 79 characters)

Line 142:5: E722 do not use bare except'

Comment last updated on May 30, 2017 at 22:15 Hours UTC

raise ValueError("When 'ratio' is a string, it needs to be one of"
" {}. Got '{}' instead.".format(RATIO_KIND,
ratio))
if ratio == 'all' or ratio == 'auto':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO we could avoid that if-elses by using a dict

@chkoar
Copy link
Member

chkoar commented May 14, 2017

That's big PR @glemaitre!

@glemaitre
Copy link
Member Author

@chkoar yes indeed. I am fixing back the tests, right now.
Mainly, I still have to check the case of a function given for the ratio but we should be almost done code wise. I still have to go through all the docstring.

I get why we had trouble to put it out. It required lot of works

@glemaitre
Copy link
Member Author

glemaitre commented May 19, 2017

@chkoar @massich I think the PR is ready.

I removed the multiclass and binary mixins since we only have multiclass from now.
I think that any integration should be only be multiclass.

I still have the issue with the dictionary with the cleaning method thought.

@glemaitre
Copy link
Member Author

And it should be good to go fast on this one since any further development will benefit from those changes

@glemaitre
Copy link
Member Author

Hints: for a faster review process, you should actually focus on the base classes and mixin. All methods were adapted to use those and there is not real changes apart to make some of them multiclass.

@glemaitre glemaitre changed the title [MRG] Refactor ratio to pick up any class [WIP] Refactor ratio to pick up any class May 21, 2017
X, y = check_X_y(X, y)
y = check_target_type(y)
self.X_hash_, self.y_hash_ = hash_X_y(X, y)
self.ratio_ = check_ratio(self.ratio, y, 'cleaning-sampling')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean-sampling might be more appropriate, no?

@chkoar
Copy link
Member

chkoar commented May 22, 2017

That review it will take some time!

BaseUnderSampler, BaseCleaningSampler and BaseOverSampler have the same __init__ and almost same fit. We could use a parent class that will derive from the SamplerMixin lets say BaseSampler . In the fit in the check_ratio function as a third argument we will use a class variable that it will "become" an instance variable. So, in the base class we will have self.ratio_ = check_ratio(self.ratio, y, self.clean_kind) . Thus, in any child classes we only have to define the clean_kind class variable variable.

p.s. I proposed class variables in order to avoid to override the __init__. You can access class level variables from self.

@glemaitre
Copy link
Member Author

glemaitre commented May 22, 2017 via email

massich added 2 commits May 22, 2017 14:45
* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class
* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* iterate

* fix PEP8
@glemaitre
Copy link
Member Author

@massich @chkoar anything else?

@chkoar
Copy link
Member

chkoar commented May 25, 2017

@glemaitre I will scan it this weekend!

imblearn/base.py Outdated
if not type_of_target(y) == 'binary':
warnings.simplefilter('always', UserWarning)
warnings.warn('The target type should be binary.')
class BaseSampler(SamplerMixin):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we merge BaseSampler and SamplerMixin into BaseSampler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in favor. I would like the sampler to have a SamplerMixin as the ClassifierMixin or TransformerMixin

imblearn/base.py Outdated
-------
self : object,
Return self.
def _validate_size_ngh_deprecation(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we could make all these _vaildate_methods(estimator) as module level functions to keep the class clean. So, it will be clean for any newcomer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I agree with that

y_hash: str
Hash identifier of the ``y`` matrix.
"""
return joblib.hash(X), joblib.hash(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this fast for big arrays? Does it worth instead of the naive check of the array shape?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it needs to be profiled. But I think that this is too slow. We could randomly pick some samples which should be better than the shape. But I am still unsure about that part.

@glemaitre glemaitre changed the title [WIP] Refactor ratio to pick up any class [MRG] Refactor ratio to pick up any class May 30, 2017
@glemaitre
Copy link
Member Author

@massich @chkoar Does anybody how to optimize this function:
https://github.com/scikit-learn-contrib/imbalanced-learn/pull/290/files#diff-d928243d0cfeca2e8bd7d6084e5018c6R244

without cythonizing at first.

@massich
Copy link
Contributor

massich commented Jun 8, 2017

LGTM

@chkoar
Copy link
Member

chkoar commented Jun 8, 2017

@massich @chkoar Does anybody how to optimize this function:
https://github.com/scikit-learn-contrib/imbalanced-learn/pull/290/files#diff-d928243d0cfeca2e8bd7d6084e5018c6R244
without cythonizing at first.

What was the progress on that?

@chkoar
Copy link
Member

chkoar commented Jun 8, 2017

In general this PR should have been splitted. Since, it hasn't and the tests are passing I propose to merge the changes. @glemaitre you put a lot of effort on this. We could improve or fix any bug from this PR in the feature. Thanks!

@glemaitre glemaitre merged commit c2c6565 into scikit-learn-contrib:master Jun 10, 2017
glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017
* EHN enable multiclass ratio handling

* FIX simplify call to dictionary

* FIX RUS done

* FIX Refactor ADASYN

* FIX partial

* FIX refactor SMOTE

* FIX refactor SMOTE

* DOC add proper docstring

* PEP8

* FIX ClusterCentroids

* FIX refactor IHT

* FIX Nearmiss refactoring

* FIX tomek links refactor

* FIX refactor OSS

* FIX NCR refactoring

* FIX refactor combined methods with Pipeline

* FIX combine method targetting all classes when cleaning

* FIX balance cascade refactoring

* EHN add the possibility to add a dict for ratio

* TST add test for check_ratio

* TST add test for float

* FIX/TST adapt common test

* TST fix IHT tests

* TST fix NCR

* FIX combine test

* TST fix balance

* FIX doctest

* FIX doctest

* FIX solve the pickle issue

* FIX remove comments

* TST add test for NCR

* TST add knn balance cascade

* EHN add callable option for the ratio

* DOC make doc cleaner

* FIX/DOC remove useless comments and clean doc

* DEP deprecation of ratio as float

* EHN add base class for cleaning methods

* TST add common test for multi class

* MAINT downgrade sphinx for the moment

* TST/EHN add test for the ratio and specific ratio for cleaning sampling

* EHN remove redundant code

* FIX warning

* Remove useless base class

* MAINT add christos back to some file

* EHN rename test and add a comment

* DOC add hash_X_y in the API

* [MRG] Incorporate chkoar remarks (#6)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* [MRG] Remove the init in base class (#7)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* iterate

* fix PEP8

* EHN doc

* FIX add extension for sphinx

* EHN make deprecatin great again

* EHN Improve SMOTE and ADASYN
glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017
* EHN enable multiclass ratio handling

* FIX simplify call to dictionary

* FIX RUS done

* FIX Refactor ADASYN

* FIX partial

* FIX refactor SMOTE

* FIX refactor SMOTE

* DOC add proper docstring

* PEP8

* FIX ClusterCentroids

* FIX refactor IHT

* FIX Nearmiss refactoring

* FIX tomek links refactor

* FIX refactor OSS

* FIX NCR refactoring

* FIX refactor combined methods with Pipeline

* FIX combine method targetting all classes when cleaning

* FIX balance cascade refactoring

* EHN add the possibility to add a dict for ratio

* TST add test for check_ratio

* TST add test for float

* FIX/TST adapt common test

* TST fix IHT tests

* TST fix NCR

* FIX combine test

* TST fix balance

* FIX doctest

* FIX doctest

* FIX solve the pickle issue

* FIX remove comments

* TST add test for NCR

* TST add knn balance cascade

* EHN add callable option for the ratio

* DOC make doc cleaner

* FIX/DOC remove useless comments and clean doc

* DEP deprecation of ratio as float

* EHN add base class for cleaning methods

* TST add common test for multi class

* MAINT downgrade sphinx for the moment

* TST/EHN add test for the ratio and specific ratio for cleaning sampling

* EHN remove redundant code

* FIX warning

* Remove useless base class

* MAINT add christos back to some file

* EHN rename test and add a comment

* DOC add hash_X_y in the API

* [MRG] Incorporate chkoar remarks (#6)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* [MRG] Remove the init in base class (#7)

* change cleaning-sampler to clean-sampler

* Refactor the over_sampling

* [WIP] adapt ensamble class

* iterate

* fix PEP8

* EHN doc

* FIX add extension for sphinx

* EHN make deprecatin great again

* EHN Improve SMOTE and ADASYN
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ratio should allow to specify which class to target when resampling
4 participants