@@ -28,13 +28,13 @@ K-means method instead of the original samples::
    ...                            n_clusters_per_class=1,
    ...                            weights=[0.01, 0.05, 0.94],
    ...                            class_sep=0.8, random_state=0)
-   >>> print(Counter(y))
-   Counter({2: 4674, 1: 262, 0: 64})
+   >>> print(sorted(Counter(y).items()))
+   [(0, 64), (1, 262), (2, 4674)]
    >>> from imblearn.under_sampling import ClusterCentroids
    >>> cc = ClusterCentroids(random_state=0)
    >>> X_resampled, y_resampled = cc.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({0: 64, 1: 64, 2: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 64), (2, 64)]

 The figure below illustrates such under-sampling.

@@ -49,6 +49,12 @@ your data are grouped into clusters. In addition, the number of centroids
 should be set such that the under-sampled clusters are representative of the
 original ones.

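For instance, the targeted number of samples per class, and hence the number of
centroids, can be set explicitly. A minimal sketch, assuming the ``ratio``
parameter accepts a dict mapping each class label to its desired sample count::

   >>> # illustrative counts only; choose values that keep each under-sampled
   >>> # class representative of its original clusters
   >>> cc = ClusterCentroids(ratio={0: 64, 1: 64, 2: 64}, random_state=0)
   >>> X_resampled, y_resampled = cc.fit_sample(X, y)
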
+.. warning::
+
+   :class:`ClusterCentroids` supports sparse matrices. However, the new
+   samples generated are not sparse themselves. Therefore, even though the
+   resulting matrix is returned in a sparse format, the algorithm will be
+   inefficient in this regard.
+
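To illustrate the point above, a sparse matrix can be passed directly; this is
only a sketch of the call pattern, reusing ``X``, ``y``, and ``cc`` from the
example above::

   >>> from scipy import sparse
   >>> X_sparse = sparse.csr_matrix(X)  # sparse input is accepted
   >>> X_res, y_res = cc.fit_sample(X_sparse, y)
   >>> # the centroids themselves are dense, so do not expect a speed or
   >>> # memory benefit from the sparse format here
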
 See :ref:`sphx_glr_auto_examples_under-sampling_plot_cluster_centroids.py` and
 :ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.

@@ -77,8 +83,8 @@ randomly selecting a subset of data for the targeted classes::
    >>> from imblearn.under_sampling import RandomUnderSampler
    >>> rus = RandomUnderSampler(random_state=0)
    >>> X_resampled, y_resampled = rus.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({0: 64, 1: 64, 2: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 64), (2, 64)]

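A bootstrap can be generated instead; a minimal sketch, assuming the
``replacement`` parameter (``False`` by default) is available in the installed
version::

   >>> rus_bootstrap = RandomUnderSampler(random_state=0, replacement=True)
   >>> X_res, y_res = rus_bootstrap.fit_sample(X, y)
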
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_002.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -108,8 +114,8 @@ be selected with the parameter ``version``::
    >>> from imblearn.under_sampling import NearMiss
    >>> nm1 = NearMiss(random_state=0, version=1)
    >>> X_resampled_nm1, y_resampled = nm1.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({0: 64, 1: 64, 2: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 64), (2, 64)]

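The other two heuristics are selected the same way; a minimal sketch, with
outputs omitted::

   >>> nm2 = NearMiss(random_state=0, version=2)
   >>> X_resampled_nm2, y_resampled_nm2 = nm2.fit_sample(X, y)
   >>> nm3 = NearMiss(random_state=0, version=3)
   >>> X_resampled_nm3, y_resampled_nm3 = nm3.fit_sample(X, y)
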
 As stated in the next section, the :class:`NearMiss` heuristic rules are
 based on a nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
@@ -238,13 +244,13 @@ available: (i) the majority (i.e., ``kind_sel='mode'``) or (ii) all (i.e.,
 ``kind_sel='all'``) the nearest-neighbors have to belong to the same class as
 the sample inspected to keep it in the dataset::

-   >>> Counter(y)
-   Counter({2: 4674, 1: 262, 0: 64})
+   >>> sorted(Counter(y).items())
+   [(0, 64), (1, 262), (2, 4674)]
    >>> from imblearn.under_sampling import EditedNearestNeighbours
    >>> enn = EditedNearestNeighbours(random_state=0)
    >>> X_resampled, y_resampled = enn.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({2: 4568, 1: 213, 0: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 213), (2, 4568)]

 The parameter ``n_neighbors`` accepts a classifier subclassed from
 ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
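For instance, a minimal sketch passing an explicit nearest-neighbors estimator,
assuming any estimator inheriting from ``KNeighborsMixin`` is accepted as the
text states::

   >>> from sklearn.neighbors import NearestNeighbors
   >>> nn = NearestNeighbors(n_neighbors=4)  # a KNeighborsMixin subclass
   >>> enn_custom = EditedNearestNeighbours(random_state=0, n_neighbors=nn)
   >>> X_res, y_res = enn_custom.fit_sample(X, y)
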
@@ -257,8 +263,8 @@ Generally, repeating the algorithm will delete more data::
    >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
    >>> renn = RepeatedEditedNearestNeighbours(random_state=0)
    >>> X_resampled, y_resampled = renn.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({2: 4551, 1: 208, 0: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 208), (2, 4551)]

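The number of repetitions can be bounded; a sketch assuming the ``max_iter``
parameter, which caps the iterations rather than fixing them::

   >>> renn_capped = RepeatedEditedNearestNeighbours(random_state=0,
   ...                                               max_iter=50)
   >>> X_res, y_res = renn_capped.fit_sample(X, y)
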
 :class:`AllKNN` differs from the previous
 :class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
@@ -267,8 +273,8 @@ internal nearest neighbors algorithm is increased at each iteration::
    >>> from imblearn.under_sampling import AllKNN
    >>> allknn = AllKNN(random_state=0)
    >>> X_resampled, y_resampled = allknn.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({2: 4601, 1: 220, 0: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 220), (2, 4601)]

 In the example below, it can be seen that the three algorithms have a similar
 impact, cleaning noisy samples next to the class boundaries.
@@ -305,8 +311,8 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::
    >>> from imblearn.under_sampling import CondensedNearestNeighbour
    >>> cnn = CondensedNearestNeighbour(random_state=0)
    >>> X_resampled, y_resampled = cnn.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({2: 116, 0: 64, 1: 25})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 24), (2, 115)]

 However, as illustrated in the figure below, :class:`CondensedNearestNeighbour`
 is sensitive to noise and will add noisy samples.
@@ -320,8 +326,8 @@ used as::
    >>> from imblearn.under_sampling import OneSidedSelection
    >>> oss = OneSidedSelection(random_state=0)
    >>> X_resampled, y_resampled = oss.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({2: 4403, 1: 174, 0: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 174), (2, 4403)]

 Our implementation offers the possibility to set the number of seeds initially
 put in the set :math:`C` through the parameter ``n_seeds_S``.
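For example, a sketch seeding :math:`C` with several samples, where the value
10 is arbitrary::

   >>> oss_seeded = OneSidedSelection(random_state=0, n_seeds_S=10)
   >>> X_res, y_res = oss_seeded.fit_sample(X, y)
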
@@ -334,8 +340,8 @@ neighbors classifier. The class can be used as::
    >>> from imblearn.under_sampling import NeighbourhoodCleaningRule
    >>> ncr = NeighbourhoodCleaningRule(random_state=0)
    >>> X_resampled, y_resampled = ncr.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({2: 4666, 1: 234, 0: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 234), (2, 4666)]

 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_005.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -362,8 +368,8 @@ removed. The class can be used as::
    >>> iht = InstanceHardnessThreshold(random_state=0,
    ...                                 estimator=LogisticRegression())
    >>> X_resampled, y_resampled = iht.fit_sample(X, y)
-   >>> print(Counter(y_resampled))
-   Counter({0: 64, 1: 64, 2: 64})
+   >>> print(sorted(Counter(y_resampled).items()))
+   [(0, 64), (1, 64), (2, 64)]

 This class has two important parameters. ``estimator`` will accept any
 scikit-learn classifier which has a method ``predict_proba``. The classifier
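For instance, any classifier exposing ``predict_proba`` can be swapped in; a
sketch using a random forest, where the choice of estimator is purely
illustrative::

   >>> from sklearn.ensemble import RandomForestClassifier
   >>> iht_rf = InstanceHardnessThreshold(
   ...     random_state=0, estimator=RandomForestClassifier(random_state=0))
   >>> X_res, y_res = iht_rf.fit_sample(X, y)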