
Commit cc66b75

Merge remote-tracking branch 'origin/master' into keras_batch_generator

2 parents: 032c791 + eafae67

26 files changed (+882 / -437 lines)

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -42,6 +42,7 @@ htmlcov/
 nosetests.xml
 coverage.xml
 *,cover
+.pytest_cache/

 # Translations
 *.mo
@@ -66,4 +67,7 @@ target/
 *.sln
 *.pyproj
 *.suo
-*.vs
+*.vs
+
+# PyCharm
+.idea/

README.rst

Lines changed: 15 additions & 15 deletions
@@ -166,32 +166,32 @@ The different algorithms are presented in the sphinx-gallery_.
 References:
 -----------

-.. [1] : I. Tomek, “Two modifications of CNN,” In Systems, Man, and Cybernetics, IEEE Transactions on, vol. 6, pp 769-772, 2010.
+.. [1] : I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976. [`bib <references.bib#L148>`_]

-.. [2] : I. Mani, I. Zhang. “kNN approach to unbalanced data distributions: a case study involving information extraction,” In Proceedings of workshop on learning from imbalanced datasets, 2003.
+.. [2] : I. Mani, J. Zhang. “kNN approach to unbalanced data distributions: A case study involving information extraction,” In Proceedings of the Workshop on Learning from Imbalanced Data Sets, pp. 1-7, 2003. [`pdf <https://www.site.uottawa.ca/~nat/Workshop2003/jzhang.pdf>`_] [`bib <references.bib#L113>`_]

-.. [3] : P. Hart, “The condensed nearest neighbor rule,” In Information Theory, IEEE Transactions on, vol. 14(3), pp. 515-516, 1968.
+.. [3] : P. E. Hart, “The condensed nearest neighbor rule,” IEEE Transactions on Information Theory, vol. 14(3), pp. 515-516, 1968. [`pdf <http://sci2s.ugr.es/keel/pdf/algorithm/articulo/hart1968.pdf>`_] [`bib <references.bib#L51>`_]

-.. [4] : M. Kubat, S. Matwin, “Addressing the curse of imbalanced training sets: one-sided selection,” In ICML, vol. 97, pp. 179-186, 1997.
+.. [4] : M. Kubat, S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” In Proceedings of the 14th International Conference on Machine Learning, vol. 97, pp. 179-186, 1997. [`pdf <http://sci2s.ugr.es/keel/pdf/algorithm/congreso/kubat97addressing.pdf>`_] [`bib <references.bib#L76>`_]

-.. [5] : J. Laurikkala, “Improving identification of difficult small classes by balancing class distribution,” Springer Berlin Heidelberg, 2001.
+.. [5] : J. Laurikkala, “Improving identification of difficult small classes by balancing class distribution,” Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe, pp. 63-66, 2001. [`pdf <https://pdfs.semanticscholar.org/0e75/4db8253e84cde4ade4b6f5ba768a6150569a.pdf>`_] [`bib <references.bib#L89>`_]

-.. [6] : D. Wilson, “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,” In IEEE Transactions on Systems, Man, and Cybernetrics, vol. 2 (3), pp. 408-421, 1972.
+.. [6] : D. Wilson, “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,” IEEE Transactions on Systems, Man, and Cybernetrics, vol. 2(3), pp. 408-421, 1972. [`pdf <http://sci2s.ugr.es/keel/pdf/algorithm/articulo/1972-Wilson-IEEETSMC.pdf>`_] [`bib <references.bib#L168>`_]

-.. [7] : D. Smith, Michael R., Tony Martinez, and Christophe Giraud-Carrier. “An instance level analysis of data complexity.” Machine learning 95.2 (2014): 225-256.
+.. [7] : M. R. Smith, T. Martinez, C. Giraud-Carrier, “An instance level analysis of data complexity,” Machine learning, vol. 95(2), pp. 225-256, 2014. [`pdf <https://pdfs.semanticscholar.org/5796/8c07abe6a734977db47b08cf4c567733aede.pdf>`_] [`bib <references.bib#L136>`_]

-.. [8] : N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.
+.. [8] : N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002. [`pdf <http://www.jair.org/media/953/live-953-2037-jair.pdf>`_] [`bib <references.bib#L28>`_]

-.. [9] : H. Han, W. Wen-Yuan, M. Bing-Huan, “Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning,” Advances in intelligent computing, 878-887, 2005.
+.. [9] : H. Han, W.-Y. Wang, B.-H. Mao, “Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,” In Proceedings of the 1st International Conference on Intelligent Computing, pp. 878-887, 2005. [`pdf <http://sci2s.ugr.es/keel/pdf/specific/congreso/han_borderline_smote.pdf>`_] [`bib <references.bib#L38>`_]

-.. [10] : H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline over-sampling for imbalanced data classification,” International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), pp.4-21, 2001.
+.. [10] : H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline over-sampling for imbalanced data classification,” In Proceedings of the 5th International Workshop on computational Intelligence and Applications, pp. 24-29, 2009. [`pdf <http://ousar.lib.okayama-u.ac.jp/files/public/1/19617/20160528004522391723/IWCIA2009_A1005.pdf>`_] [`bib <references.bib#L126>`_]

-.. [11] : G. Batista, R. C. Prati, M. C. Monard. “A study of the behavior of several methods for balancing machine learning training data,” ACM Sigkdd Explorations Newsletter 6 (1), 20-29, 2004.
+.. [11] : G. E. A. P. A. Batista, R. C. Prati, M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM Sigkdd Explorations Newsletter, vol. 6(1), pp. 20-29, 2004. [`pdf <http://sci2s.ugr.es/keel/dataset/includes/catImbFiles/2004-Batista-SIGKDD.pdf>`_] [`bib <references.bib#L15>`_]

-.. [12] : G. Batista, B. Bazzan, M. Monard, [“Balancing Training Data for Automated Annotation of Keywords: a Case Study,” In WOB, 10-18, 2003.
+.. [12] : G. E. A. P. A. Batista, A. L. C. Bazzan, M. C. Monard, “Balancing training data for automated annotation of keywords: A case study,” In Proceedings of the 2nd Brazilian Workshop on Bioinformatics, pp. 10-18, 2003. [`pdf <http://www.inf.ufrgs.br/maslab/pergamus/pubs/balancing-training-data-for.pdf>`_] [`bib <references.bib#L2>`_]

-.. [13] : X. Y. Liu, J. Wu and Z. H. Zhou, “Exploratory Undersampling for Class-Imbalance Learning,” in IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550, April 2009.
+.. [13] : X.-Y. Liu, J. Wu and Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 39(2), pp. 539-550, 2009. [`pdf <https://pdfs.semanticscholar.org/beac/3afc6a2cbdefe8dae03de25a139193ef6021.pdf>`_] [`bib <references.bib#L102>`_]

-.. [14] : I. Tomek, “An Experiment with the Edited Nearest-Neighbor Rule,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), pp. 448-452, June 1976.
+.. [14] : I. Tomek, “An experiment with the edited nearest-neighbor rule,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), pp. 448-452, 1976. [`bib <references.bib#L158>`_]

-.. [15] : He, Haibo, Yang Bai, Edwardo A. Garcia, and Shutao Li. “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” In IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322-1328, 2008.
+.. [15] : H. He, Y. Bai, E. A. Garcia, S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” In Proceedings of the 5th IEEE International Joint Conference on Neural Networks, pp. 1322-1328, 2008. [`pdf <https://pdfs.semanticscholar.org/4823/4756b7cf798bfeb47328f7c5d597fd4838c2.pdf>`_] [`bib <references.bib#L62>`_]

doc/over_sampling.rst

Lines changed: 12 additions & 14 deletions
@@ -127,11 +127,11 @@ nearest neighbors class. Those variants are presented in the figure below.
    :align: center


-The parameter ``kind`` is controlling this feature and the following types are
-available: (i) ``'borderline1'``, (ii) ``'borderline2'``, and (iii) ``'svm'``::
+The :class:`BorderlineSMOTE` and :class:`SVMSMOTE` offer some variant of the SMOTE
+algorithm::

-    >>> from imblearn.over_sampling import SMOTE, ADASYN
-    >>> X_resampled, y_resampled = SMOTE(kind='borderline1').fit_sample(X, y)
+    >>> from imblearn.over_sampling import BorderlineSMOTE
+    >>> X_resampled, y_resampled = BorderlineSMOTE().fit_sample(X, y)
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 4674), (1, 4674), (2, 4674)]

@@ -168,12 +168,11 @@ interpolation will create a sample on the line between :math:`x_{i}` and
 Each SMOTE variant and ADASYN differ from each other by selecting the samples
 :math:`x_i` ahead of generating the new samples.

-The **regular** SMOTE algorithm --- cf. to ``kind='regular'`` when
-instantiating a :class:`SMOTE` object --- does not impose any rule and will
-randomly pick-up all possible :math:`x_i` available.
+The **regular** SMOTE algorithm --- cf. to the :class:`SMOTE` object --- does not
+impose any rule and will randomly pick-up all possible :math:`x_i` available.

-The **borderline** SMOTE --- cf. to ``kind='borderline1'`` and
-``kind='borderline2'`` when instantiating a :class:`SMOTE` object --- will
+The **borderline** SMOTE --- cf. to the :class:`BorderlineSMOTE` with the
+parameters ``kind='borderline-1'`` and ``kind='borderline-2'`` --- will
 classify each sample :math:`x_i` to be (i) noise (i.e. all nearest-neighbors
 are from a different class than the one of :math:`x_i`), (ii) in danger
 (i.e. at least half of the nearest neighbors are from the same class than

@@ -184,10 +183,9 @@ samples *in danger* to generate new samples. In **Borderline-1** SMOTE,
 :math:`x_i`. On the contrary, **Borderline-2** SMOTE will consider
 :math:`x_{zi}` which can be from any class.

-**SVM** SMOTE --- cf. to ``kind='svm'`` when instantiating a :class:`SMOTE`
-object --- uses an SVM classifier to find support vectors and generate samples
-considering them. Note that the ``C`` parameter of the SVM classifier allows to
-select more or less support vectors.
+**SVM** SMOTE --- cf. to :class:`SVMSMOTE` --- uses an SVM classifier to find
+support vectors and generate samples considering them. Note that the ``C``
+parameter of the SVM classifier allows to select more or less support vectors.

 For both borderline and SVM SMOTE, a neighborhood is defined using the
 parameter ``m_neighbors`` to decide if a sample is in danger, safe, or noise.

@@ -196,7 +194,7 @@ ADASYN is working similarly to the regular SMOTE. However, the number of
 samples generated for each :math:`x_i` is proportional to the number of samples
 which are not from the same class than :math:`x_i` in a given
 neighborhood. Therefore, more samples will be generated in the area that the
-nearest neighbor rule is not respected. The parameter ``n_neighbors`` is
+nearest neighbor rule is not respected. The parameter ``m_neighbors`` is
 equivalent to ``k_neighbors`` in :class:`SMOTE`.

 Multi-class management
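The hunks above refer to SMOTE's interpolation step: a synthetic sample is drawn on the segment between :math:`x_i` and one of its nearest neighbors :math:`x_{zi}`, i.e. ``x_new = x_i + lam * (x_zi - x_i)`` with ``lam`` in [0, 1]. A minimal stand-alone sketch of that rule (plain Python, not the library's implementation; the function name is made up for illustration):

```python
import random

def smote_interpolate(x_i, x_zi, lam=None):
    """Return one synthetic sample on the segment between x_i and x_zi:
    x_new = x_i + lam * (x_zi - x_i), with lam drawn uniformly from [0, 1]."""
    if lam is None:
        lam = random.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_zi)]

# With lam = 0.5 the synthetic point is the midpoint of the segment.
x_new = smote_interpolate([0.0, 0.0], [1.0, 2.0], lam=0.5)
```

The variants differ only in how :math:`x_i` is chosen (all minority samples, *in danger* samples, or support vectors), not in this interpolation rule.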

doc/whats_new/v0.0.4.rst

Lines changed: 10 additions & 0 deletions
@@ -36,6 +36,10 @@ Enhancement
 - Add support for one-vs-all encoded target to support keras. :issue:`409` by
   :user:`Guillaume Lemaitre <glemaitre>`.

+- Adding specific class for borderline and SVM SMOTE using
+  :class:`BorderlineSMOTE` and :class:`SVMSMOTE`.
+  :issue:`440` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Bug fixes
 .........

@@ -69,3 +73,9 @@ Deprecation
 :class:`imblearn.under_sampling.NeighbourhoodCleaningRule`,
 :class:`imblearn.under_sampling.InstanceHardnessThreshold`,
 :class:`imblearn.under_sampling.CondensedNearestNeighbours`.
+
+- Deprecate ``kind``, ``out_step``, ``svm_estimator``, ``m_neighbors`` in
+  :class:`imblearn.over_sampling.SMOTE`. User should use
+  :class:`imblearn.over_sampling.SVMSMOTE` and
+  :class:`imblearn.over_sampling.BorderlineSMOTE`.
+  :issue:`440` by :user:`Guillaume Lemaitre <glemaitre>`.
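The ``kind`` deprecation listed above typically follows the scikit-learn convention: the old parameter is kept for a release cycle and emits a warning pointing at the replacement classes. A generic, stdlib-only sketch of that pattern (the class below is a stub for illustration, not imblearn's actual ``SMOTE``):

```python
import warnings

class SMOTE:
    """Stub illustrating the deprecated-parameter pattern."""
    def __init__(self, kind='deprecated'):
        self.kind = kind
        if kind != 'deprecated':
            # The legacy call path still works, but users are nudged
            # toward the dedicated BorderlineSMOTE / SVMSMOTE classes.
            warnings.warn("'kind' is deprecated; use BorderlineSMOTE or "
                          "SVMSMOTE instead", DeprecationWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    SMOTE(kind='borderline1')  # legacy usage triggers the warning
    SMOTE()                    # the new default path stays silent
```

Using a sentinel default (``'deprecated'``) rather than ``None`` lets the class distinguish "user passed nothing" from "user explicitly passed a value".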

examples/over-sampling/plot_comparison_over_sampling.py

Lines changed: 11 additions & 12 deletions
@@ -20,7 +20,9 @@
 from sklearn.svm import LinearSVC

 from imblearn.pipeline import make_pipeline
-from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
+from imblearn.over_sampling import ADASYN
+from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
+from imblearn.over_sampling import RandomOverSampler
 from imblearn.base import SamplerMixin
 from imblearn.utils import hash_X_y

@@ -220,21 +222,18 @@ def fit_sample(self, X, y):
                            class_sep=0.8)

 ax_arr = ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8))
-string_add = ['regular', 'borderline-1', 'borderline-2', 'SVM']
-for str_add, ax, sampler in zip(string_add,
-                                ax_arr,
-                                (SMOTE(random_state=0),
-                                 SMOTE(random_state=0, kind='borderline1'),
-                                 SMOTE(random_state=0, kind='borderline2'),
-                                 SMOTE(random_state=0, kind='svm'))):
+for ax, sampler in zip(ax_arr,
+                       (SMOTE(random_state=0),
+                        BorderlineSMOTE(random_state=0, kind='borderline-1'),
+                        BorderlineSMOTE(random_state=0, kind='borderline-2'),
+                        SVMSMOTE(random_state=0))):
     clf = make_pipeline(sampler, LinearSVC())
     clf.fit(X, y)
     plot_decision_function(X, y, clf, ax[0])
-    ax[0].set_title('Decision function for {} {}'.format(
-        str_add, sampler.__class__.__name__))
+    ax[0].set_title('Decision function for {}'.format(
+        sampler.__class__.__name__))
     plot_resampling(X, y, sampler, ax[1])
-    ax[1].set_title('Resampling using {} {}'.format(
-        str_add, sampler.__class__.__name__))
+    ax[1].set_title('Resampling using {}'.format(sampler.__class__.__name__))
 fig.tight_layout()

 plt.show()
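The refactor in this hunk drops the parallel ``string_add`` list: once each variant is its own class, the plot title can be derived from the sampler's class name alone, so there is no second list to keep in sync. A tiny sketch of that pattern using stub classes (not the real imblearn samplers):

```python
# Stub classes standing in for the imblearn samplers.
class SMOTE: pass
class BorderlineSMOTE: pass
class SVMSMOTE: pass

samplers = (SMOTE(), BorderlineSMOTE(), BorderlineSMOTE(), SVMSMOTE())
# One title per sampler, derived from the class name itself.
titles = ['Decision function for {}'.format(s.__class__.__name__)
          for s in samplers]
```

This is why the diff can delete ``string_add`` without losing any information from the figure titles.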

examples/over-sampling/plot_smote.py

Lines changed: 7 additions & 4 deletions
@@ -17,6 +17,8 @@
 from sklearn.decomposition import PCA

 from imblearn.over_sampling import SMOTE
+from imblearn.over_sampling import BorderlineSMOTE
+from imblearn.over_sampling import SVMSMOTE

 print(__doc__)

@@ -49,8 +51,8 @@ def plot_resampling(ax, X, y, title):
 X_vis = pca.fit_transform(X)

 # Apply regular SMOTE
-kind = ['regular', 'borderline1', 'borderline2', 'svm']
-sm = [SMOTE(kind=k) for k in kind]
+sm = [SMOTE(), BorderlineSMOTE(kind='borderline-1'),
+      BorderlineSMOTE(kind='borderline-2'), SVMSMOTE()]
 X_resampled = []
 y_resampled = []
 X_res_vis = []

@@ -67,9 +69,10 @@ def plot_resampling(ax, X, y, title):
 ax_res = [ax3, ax4, ax5, ax6]

 c0, c1 = plot_resampling(ax1, X_vis, y, 'Original set')
-for i in range(len(kind)):
+for i, name in enumerate(['SMOTE', 'SMOTE Borderline-1',
+                          'SMOTE Borderline-2', 'SMOTE SVM']):
     plot_resampling(ax_res[i], X_res_vis[i], y_resampled[i],
-                    'SMOTE {}'.format(kind[i]))
+                    '{}'.format(name))

 ax2.legend((c0, c1), ('Class #0', 'Class #1'), loc='center',
            ncol=1, labelspacing=0.)

imblearn/combine/smote_enn.py

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ class SMOTEENN(SamplerMixin):
     -----
     The method is presented in [1]_.

-    Supports mutli-class resampling. Refer to SMOTE and ENN regarding the
+    Supports multi-class resampling. Refer to SMOTE and ENN regarding the
     scheme which used.

     See :ref:`sphx_glr_auto_examples_combine_plot_smote_enn.py` and

imblearn/combine/smote_tomek.py

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ class SMOTETomek(SamplerMixin):
     -----
     The methos is presented in [1]_.

-    Supports mutli-class resampling. Refer to SMOTE and TomekLinks regarding
+    Supports multi-class resampling. Refer to SMOTE and TomekLinks regarding
     the scheme which used.

     See :ref:`sphx_glr_auto_examples_combine_plot_smote_tomek.py` and

imblearn/ensemble/balance_cascade.py

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ class BalanceCascade(BaseEnsembleSampler):
     -----
     The method is described in [1]_.

-    Supports mutli-class resampling. A one-vs.-rest scheme is used as
+    Supports multi-class resampling. A one-vs.-rest scheme is used as
     originally proposed in [1]_.

     See :ref:`sphx_glr_auto_examples_ensemble_plot_balance_cascade.py`.

imblearn/ensemble/easy_ensemble.py

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ class EasyEnsemble(BaseEnsembleSampler):
     -----
     The method is described in [1]_.

-    Supports mutli-class resampling by sampling each class independently.
+    Supports multi-class resampling by sampling each class independently.

     See :ref:`sphx_glr_auto_examples_ensemble_plot_easy_ensemble.py`.

imblearn/over_sampling/__init__.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,5 +6,8 @@
66
from .adasyn import ADASYN
77
from .random_over_sampler import RandomOverSampler
88
from .smote import SMOTE
9+
from .smote import BorderlineSMOTE
10+
from .smote import SVMSMOTE
911

10-
__all__ = ['ADASYN', 'RandomOverSampler', 'SMOTE']
12+
__all__ = ['ADASYN', 'RandomOverSampler',
13+
'SMOTE', 'BorderlineSMOTE', 'SVMSMOTE']
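The ``__all__`` update above controls what ``from imblearn.over_sampling import *`` exposes. A self-contained sketch of that mechanism using a throwaway module (the module and attribute names here are illustrative, not imblearn's):

```python
import sys
import types

# Build a stand-in module mirroring the __all__ shape in the diff above.
mod = types.ModuleType("oversampling_demo")
mod.SMOTE = type("SMOTE", (), {})
mod.BorderlineSMOTE = type("BorderlineSMOTE", (), {})
mod.SVMSMOTE = type("SVMSMOTE", (), {})
mod._helper = object()  # present in the module but deliberately not exported
mod.__all__ = ['SMOTE', 'BorderlineSMOTE', 'SVMSMOTE']
sys.modules["oversampling_demo"] = mod

# Star-import binds exactly the names listed in __all__.
ns = {}
exec("from oversampling_demo import *", ns)
exported = sorted(k for k in ns if not k.startswith('__'))
```

Without an ``__all__`` entry, ``BorderlineSMOTE`` and ``SVMSMOTE`` would be importable by name but invisible to star-imports, which is why the diff extends the list alongside the new imports.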

imblearn/over_sampling/adasyn.py

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ class ADASYN(BaseOverSampler):
     -----
     The implementation is based on [1]_.

-    Supports mutli-class resampling. A one-vs.-rest scheme is used.
+    Supports multi-class resampling. A one-vs.-rest scheme is used.

     See
     :ref:`sphx_glr_auto_examples_applications_plot_over_sampling_benchmark_lfw.py`,

imblearn/over_sampling/random_over_sampler.py

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ class RandomOverSampler(BaseOverSampler):

     Notes
     -----
-    Supports mutli-class resampling by sampling each class independently.
+    Supports multi-class resampling by sampling each class independently.

     See
     :ref:`sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py`,
