Skip to content

[MRG] EHN add collections of imbalanced datasets #249

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Apr 6, 2017

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Mar 18, 2017

Reference Issue

Fixes #247

What does this implement/fix? Explain your changes.

Make a fetcher for an imbalanced datasets collections available on Zenodo maintained by ourself.

Any other comments?

TODO:

  • Write the fetcher;
  • Unit tests;
  • Write documentation

@pep8speaks
Copy link

pep8speaks commented Mar 18, 2017

Hello @glemaitre! Thanks for updating the PR.

Line 108:18: E128 continuation line under-indented for visual indent
Line 109:18: E128 continuation line under-indented for visual indent
Line 110:18: E128 continuation line under-indented for visual indent
Line 111:18: E128 continuation line under-indented for visual indent

Comment last updated on April 03, 2017 at 13:01 Hours UTC

@codecov
Copy link

codecov bot commented Mar 18, 2017

Codecov Report

Merging #249 into master will decrease coverage by 0.09%.
The diff coverage is 95.19%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #249     +/-   ##
=========================================
- Coverage   98.27%   98.18%   -0.1%     
=========================================
  Files          58       60      +2     
  Lines        3427     3530    +103     
=========================================
+ Hits         3368     3466     +98     
- Misses         59       64      +5
Impacted Files Coverage Δ
imblearn/datasets/__init__.py 100% <100%> (ø) ⬆️
imblearn/datasets/tests/test_zenodo.py 89.18% <89.18%> (ø)
imblearn/datasets/zenodo.py 98.46% <98.46%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ca3037...28e7830. Read the comment docs.

@glemaitre glemaitre changed the title [WIP] EHN add collections of imbalanced datasets [MRG] EHN add collections of imbalanced datasets Mar 19, 2017
@glemaitre glemaitre changed the title [MRG] EHN add collections of imbalanced datasets [WIP] EHN add collections of imbalanced datasets Mar 19, 2017
@glemaitre
Copy link
Member Author

Still have an issue with the docstring regarding the table. It is not so nice. I have to check what can be done.

@glemaitre glemaitre changed the title [WIP] EHN add collections of imbalanced datasets [MRG] EHN add collections of imbalanced datasets Mar 25, 2017
@glemaitre
Copy link
Member Author

@chkoar This one can be merged also. The PEP8 has been corrected and the table looks nice.

doc/api.rst Outdated
@@ -158,7 +158,7 @@ Functions
:toctree: generated/

datasets.make_imbalance

datasets.fetch_zenodo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just fetch_datasets?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted something similar to fetch_mldata

@@ -14,6 +14,8 @@ New features

- Turn off steps in :class:`pipeline.Pipeline` using the `None`
object. By `Christos Aridas`_.
- Add a fetching method `datasets.fetch_zenodo` in order to get some
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say function instead of method


__all__ = ['make_imbalance']
__all__ = ['make_imbalance',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort them if you like

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm not sure since that they are not in the same file.

download_if_missing=True,
random_state=None,
shuffle=False):
"""Load the Higgs dataset, downloading it if necessary.
Copy link
Member

@chkoar chkoar Mar 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L112 could be removed


Returns
-------
datasets : OrderedDict of Bunch object,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dictionary of Bunch objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to specified that this is an ordered dictionary. I change the code in accordance

@@ -0,0 +1,279 @@
"""Collection of imbalanced datasets.

This collection of datasets have been proposed in [1]_. The
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is: "This collection of datasets has been proposed..." or "These datasets have been proposed..."

@glemaitre
Copy link
Member Author

@chkoar done

----------
data_home : string, optional (default=None)
Specify another download and cache folder for the datasets. By default
all scikit learn data is stored in '~/scikit_learn_data' subfolders.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scikit-learn :)

@chkoar
Copy link
Member

chkoar commented Apr 3, 2017

My main concern here is the name of the function. I would call it like fetch_datasets , fetch_data or something like that, since we choose to host them over zenodo. Apart from that it could be easily merged! :P

@glemaitre
Copy link
Member Author

Ok so I went for fetch_datasets

@chkoar
Copy link
Member

chkoar commented Apr 3, 2017

Will we change the filenames?

@glemaitre
Copy link
Member Author

Will we change the filenames?

Not for the moment. It has no influence on the API or the import. So if we come with a better name we can change it afterwards.

@chkoar chkoar merged commit 3c54ea6 into scikit-learn-contrib:master Apr 6, 2017
glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017
glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EHN allow automatic fetching from some imbalance dataset
3 participants