@@ -28,13 +28,13 @@ K-means method instead of the original samples::
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
- >>> print(Counter(y))
- Counter({2: 4674, 1: 262, 0: 64})
+ >>> print(sorted(Counter(y).items()))
+ [(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.under_sampling import ClusterCentroids
>>> cc = ClusterCentroids(random_state=0)
>>> X_resampled, y_resampled = cc.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({0: 64, 1: 64, 2: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 64), (2, 64)]

The figure below illustrates such under-sampling.
@@ -49,6 +49,12 @@ your data are grouped into clusters. In addition, the number of centroids
should be set such that the under-sampled clusters are representative of the
original ones.

+ .. warning::
+
+    :class:`ClusterCentroids` supports sparse matrices. However, the new samples
+    generated are not specifically sparse. Therefore, even if the resulting
+    matrix is sparse, the algorithm will be inefficient in this regard.
+
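
A minimal sketch of what the warning describes, reusing the ``X``, ``y`` and
``cc`` objects from the example above and assuming ``scipy`` is installed::

>>> from scipy import sparse
>>> X_sparse = sparse.csr_matrix(X)  # sparse input is accepted ...
>>> X_res, y_res = cc.fit_sample(X_sparse, y)  # ... but centroids are computed densely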
See :ref:`sphx_glr_auto_examples_under-sampling_plot_cluster_centroids.py` and
:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
@@ -77,8 +83,8 @@ randomly selecting a subset of data for the targeted classes::
>>> from imblearn.under_sampling import RandomUnderSampler
>>> rus = RandomUnderSampler(random_state=0)
>>> X_resampled, y_resampled = rus.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({0: 64, 1: 64, 2: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 64), (2, 64)]
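
The subset can also be drawn as a bootstrap. A minimal sketch, assuming a
``replacement`` parameter is available in the installed version (hypothetical
usage, reusing ``X`` and ``y`` from above)::

>>> rus_boot = RandomUnderSampler(random_state=0, replacement=True)
>>> X_resampled, y_resampled = rus_boot.fit_sample(X, y)  # samples may repeat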

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_002.png
   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -108,8 +114,8 @@ be selected with the parameter ``version``::
>>> from imblearn.under_sampling import NearMiss
>>> nm1 = NearMiss(random_state=0, version=1)
>>> X_resampled_nm1, y_resampled = nm1.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({0: 64, 1: 64, 2: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 64), (2, 64)]
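
The other heuristics are selected in the same way; a minimal sketch for
NearMiss-2, reusing the ``X`` and ``y`` from above (the variable names are
illustrative only)::

>>> nm2 = NearMiss(random_state=0, version=2)
>>> X_resampled_nm2, y_resampled_nm2 = nm2.fit_sample(X, y)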

As stated in the next section, the :class:`NearMiss` heuristic rules are
based on a nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
@@ -238,13 +244,13 @@ available: (i) the majority (i.e., ``kind_sel='mode'``) or (ii) all (i.e.,
``kind_sel='all'``) the nearest neighbors have to belong to the same class as
the sample inspected to keep it in the dataset::

- >>> Counter(y)
- Counter({2: 4674, 1: 262, 0: 64})
+ >>> sorted(Counter(y).items())
+ [(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.under_sampling import EditedNearestNeighbours
>>> enn = EditedNearestNeighbours(random_state=0)
>>> X_resampled, y_resampled = enn.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({2: 4568, 1: 213, 0: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 213), (2, 4568)]
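
As the paragraph below explains, ``n_neighbors`` also accepts a nearest
neighbors estimator. A minimal sketch, assuming the installed version accepts
a :class:`sklearn.neighbors.NearestNeighbors` instance (hypothetical usage)::

>>> from sklearn.neighbors import NearestNeighbors
>>> enn_custom = EditedNearestNeighbours(random_state=0,
...                                      n_neighbors=NearestNeighbors(n_neighbors=4))
>>> X_resampled, y_resampled = enn_custom.fit_sample(X, y)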

The parameter ``n_neighbors`` allows one to give a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
@@ -257,8 +263,8 @@ Generally, repeating the algorithm will delete more data::
>>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
>>> renn = RepeatedEditedNearestNeighbours(random_state=0)
>>> X_resampled, y_resampled = renn.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({2: 4551, 1: 208, 0: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 208), (2, 4551)]
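
The number of repetitions can be bounded; a minimal sketch, assuming a
``max_iter`` parameter is available in the installed version (hypothetical
usage, reusing ``X`` and ``y`` from above)::

>>> renn_short = RepeatedEditedNearestNeighbours(random_state=0, max_iter=10)
>>> X_resampled, y_resampled = renn_short.fit_sample(X, y)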

:class:`AllKNN` differs from the previous
:class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
@@ -267,8 +273,8 @@ internal nearest neighbors algorithm is increased at each iteration::
>>> from imblearn.under_sampling import AllKNN
>>> allknn = AllKNN(random_state=0)
>>> X_resampled, y_resampled = allknn.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({2: 4601, 1: 220, 0: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 220), (2, 4601)]

In the example below, it can be seen that the three algorithms have a similar
impact, cleaning noisy samples next to the boundaries of the classes.
@@ -305,8 +311,8 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::
>>> from imblearn.under_sampling import CondensedNearestNeighbour
>>> cnn = CondensedNearestNeighbour(random_state=0)
>>> X_resampled, y_resampled = cnn.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({2: 116, 0: 64, 1: 25})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 24), (2, 115)]

However, as illustrated in the figure below, :class:`CondensedNearestNeighbour`
is sensitive to noise and will add noisy samples.
@@ -320,8 +326,8 @@ used as::
>>> from imblearn.under_sampling import OneSidedSelection
>>> oss = OneSidedSelection(random_state=0)
>>> X_resampled, y_resampled = oss.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({2: 4403, 1: 174, 0: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 174), (2, 4403)]

Our implementation offers control over the number of seeds initially put in
the set :math:`C` through the parameter ``n_seeds_S``.
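
A minimal sketch, reusing ``X`` and ``y`` from above (the seed count of 10 is
arbitrary, for illustration only)::

>>> oss_seeds = OneSidedSelection(random_state=0, n_seeds_S=10)
>>> X_resampled, y_resampled = oss_seeds.fit_sample(X, y)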
@@ -334,8 +340,8 @@ neighbors classifier. The class can be used as::
>>> from imblearn.under_sampling import NeighbourhoodCleaningRule
>>> ncr = NeighbourhoodCleaningRule(random_state=0)
>>> X_resampled, y_resampled = ncr.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({2: 4666, 1: 234, 0: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 234), (2, 4666)]

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_005.png
   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -362,8 +368,8 @@ removed. The class can be used as::
>>> iht = InstanceHardnessThreshold(random_state=0,
...                                 estimator=LogisticRegression())
>>> X_resampled, y_resampled = iht.fit_sample(X, y)
- >>> print(Counter(y_resampled))
- Counter({0: 64, 1: 64, 2: 64})
+ >>> print(sorted(Counter(y_resampled).items()))
+ [(0, 64), (1, 64), (2, 64)]
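
As described just below, any classifier exposing ``predict_proba`` can be
plugged in; a minimal sketch with a random forest (the choice of estimator is
illustrative only)::

>>> from sklearn.ensemble import RandomForestClassifier
>>> iht_rf = InstanceHardnessThreshold(random_state=0,
...                                    estimator=RandomForestClassifier(random_state=0))
>>> X_resampled, y_resampled = iht_rf.fit_sample(X, y)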

This class has two important parameters. ``estimator`` will accept any
scikit-learn classifier which has a method ``predict_proba``. The classifier