@@ -6,6 +6,9 @@ Under-sampling
 
 .. currentmodule:: imblearn.under_sampling
 
+You can refer to
+:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
+
 .. _cluster_centroids:
 
 Prototype generation
@@ -55,9 +58,6 @@ original one.
 generated are not specifically sparse. Therefore, even if the resulting
 matrix will be sparse, the algorithm will be inefficient in this regard.
 
-See :ref:`sphx_glr_auto_examples_under-sampling_plot_cluster_centroids.py` and
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
-
 Prototype selection
 ===================
 
@@ -116,13 +116,9 @@ In addition, :class:`RandomUnderSampler` allows to sample heterogeneous data
   >>> print(y_resampled)
   [0 1]
 
-See :ref:`sphx_glr_auto_examples_plot_sampling_strategy_usage.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-and :ref:`sphx_glr_auto_examples_under-sampling_plot_random_under_sampler.py`.
-
-:class:`NearMiss` adds some heuristic rules to select
-samples. :class:`NearMiss` implements 3 different types of heuristic which can
-be selected with the parameter ``version``::
+:class:`NearMiss` adds some heuristic rules to select samples [MZ2003]_.
+:class:`NearMiss` implements 3 different types of heuristic which can be
+selected with the parameter ``version``::
 
   >>> from imblearn.under_sampling import NearMiss
   >>> nm1 = NearMiss(version=1)
@@ -137,10 +133,12 @@ from scikit-learn. The former parameter is used to compute the average distance
 to the neighbors while the latter is used for the pre-selection of the samples
 of interest.
 
-See
-:ref:`sphx_glr_auto_examples_applications_plot_multi_class_under_sampling.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-and :ref:`sphx_glr_auto_examples_under-sampling_plot_nearmiss.py`.
+.. topic:: References
+
+   .. [MZ2003] I. Mani, I. Zhang. "kNN approach to unbalanced data
+      distributions: a case study involving information extraction," In
+      Proceedings of workshop on learning from imbalanced datasets,
+      2003.
 
 Mathematical formulation
 ^^^^^^^^^^^^^^^^^^^^^^^^
@@ -194,9 +192,6 @@ affected by noise due to the first step sample selection.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_illustration_nearmiss.py`.
-
 Cleaning under-sampling techniques
 ----------------------------------
 
@@ -209,9 +204,9 @@ which will clean the dataset.
 Tomek's links
 ^^^^^^^^^^^^^
 
-:class:`TomekLinks` detects the so-called Tomek's links. A Tomek's link
-between two samples of different class :math:`x` and :math:`y` is defined such
-that there is no example :math:`z` such that:
+:class:`TomekLinks` detects the so-called Tomek's links [T2010]_. A Tomek's
+link between two samples :math:`x` and :math:`y` of different classes is
+defined such that there is no example :math:`z` such that:
 
 .. math::
 
@@ -238,10 +233,10 @@ figure illustrates this behaviour.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_illustration_tomek_links.py`
-and
-:ref:`sphx_glr_auto_examples_under-sampling_plot_tomek_links.py`.
+.. topic:: References
+
+   .. [T2010] I. Tomek, "Two modifications of CNN," In Systems, Man, and
+      Cybernetics, IEEE Transactions on, vol. 6, pp. 769-772, 1976.
 
 .. _edited_nearest_neighbors:
 
@@ -250,7 +245,7 @@ Edited data set using nearest neighbours
 
 :class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
 "edits" the dataset by removing samples which do not agree "enough" with their
-neighboorhood. For each sample in the class to be under-sampled, the
+neighborhood [W1972]_. For each sample in the class to be under-sampled, the
 nearest-neighbours are computed and if the selection criterion is not
 fulfilled, the sample is removed. Two selection criteria are currently
 available: (i) the majority (i.e., ``kind_sel='mode'``) or (ii) all (i.e.,
@@ -270,8 +265,8 @@ The parameter ``n_neighbors`` allows to give a classifier subclassed from
 the decision to keep a given sample or not.
 
 :class:`RepeatedEditedNearestNeighbours` extends
-:class:`EditedNearestNeighbours` by repeating the algorithm multiple times.
-Generally, repeating the algorithm will delete more data::
+:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
+[T1976]_. Generally, repeating the algorithm will delete more data::
 
   >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
   >>> renn = RepeatedEditedNearestNeighbours()
@@ -281,7 +276,7 @@ Generally, repeating the algorithm will delete more data::
 
 :class:`AllKNN` differs from the previous
 :class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
-internal nearest neighbors algorithm is increased at each iteration::
+internal nearest neighbors algorithm is increased at each iteration [T1976]_::
 
   >>> from imblearn.under_sampling import AllKNN
   >>> allknn = AllKNN()
@@ -297,19 +292,24 @@ impact by cleaning noisy samples next to the boundaries of the classes.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_pipeline_plot_pipeline_classification.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-and :ref:`sphx_glr_auto_examples_under-sampling_plot_enn_renn_allknn.py`.
+.. topic:: References
+
+   .. [W1972] D. Wilson, "Asymptotic Properties of Nearest Neighbor Rules
+      Using Edited Data," In IEEE Transactions on Systems, Man, and
+      Cybernetics, vol. 2 (3), pp. 408-421, 1972.
+
+   .. [T1976] I. Tomek, "An Experiment with the Edited Nearest-Neighbor
+      Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol.
+      6(6), pp. 448-452, June 1976.
 
 .. _condensed_nearest_neighbors:
 
 Condensed nearest neighbors and derived algorithms
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 :class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
-iteratively decide if a sample should be removed or not. The algorithm is
-running as followed:
+iteratively decide if a sample should be removed or not [H1968]_. The
+algorithm runs as follows:
 
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
@@ -331,10 +331,10 @@ However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
 is sensitive to noise and will add noisy samples.
 
 On the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
-remove noisy samples. In addition, the 1 nearest neighbor rule is applied to
-all samples and the one which are misclassified will be added to the set
-:math:`C`. No iteration on the set :math:`S` will take place. The class can be
-used as::
+remove noisy samples [KM1997]_. In addition, the 1 nearest neighbor rule is
+applied to all samples and the ones which are misclassified will be added to
+the set :math:`C`. No iteration on the set :math:`S` will take place. The
+class can be used as::
 
   >>> from imblearn.under_sampling import OneSidedSelection
   >>> oss = OneSidedSelection(random_state=0)
@@ -346,9 +346,9 @@ Our implementation offers to set the number of seeds to put in the set :math:`C`
 originally by setting the parameter ``n_seeds_S``.
 
 :class:`NeighbourhoodCleaningRule` will focus on cleaning the data rather than
-condensing them. Therefore, it will used the union of samples to be rejected
-between the :class:`EditedNearestNeighbours` and the output a 3 nearest
-neighbors classifier. The class can be used as::
+condensing them [J2001]_. Therefore, it will use the union of the samples
+rejected by :class:`EditedNearestNeighbours` and by a 3 nearest neighbors
+classifier. The class can be used as::
 
   >>> from imblearn.under_sampling import NeighbourhoodCleaningRule
   >>> ncr = NeighbourhoodCleaningRule()
@@ -361,11 +361,18 @@ neighbors classifier. The class can be used as::
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_condensed_nearest_neighbour.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_one_sided_selection.py`, and
-:ref:`sphx_glr_auto_examples_under-sampling_plot_neighbourhood_cleaning_rule.py`.
+.. topic:: References
+
+   .. [H1968] P. Hart, "The condensed nearest neighbor rule,"
+      In Information Theory, IEEE Transactions on, vol. 14(3), pp.
+      515-516, 1968.
+
+   .. [KM1997] M. Kubat, S. Matwin, "Addressing the curse of imbalanced
+      training sets: one-sided selection," In ICML, vol. 97, pp. 179-186,
+      1997.
+
+   .. [J2001] J. Laurikkala, "Improving identification of difficult small
+      classes by balancing class distribution," Springer Berlin
+      Heidelberg, 2001.
 
 .. _instance_hardness_threshold:
 
@@ -374,7 +381,7 @@ Instance hardness threshold
 
 :class:`InstanceHardnessThreshold` is a specific algorithm in which a
 classifier is trained on the data and the samples with lower probabilities are
-removed. The class can be used as::
+removed [SMMG2014]_. The class can be used as::
   >>> from sklearn.linear_model import LogisticRegression
   >>> from imblearn.under_sampling import InstanceHardnessThreshold
   >>> iht = InstanceHardnessThreshold(random_state=0,
@@ -403,6 +410,8 @@ The figure below gives another example on some toy data.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_instance_hardness_threshold.py`.
+.. topic:: References
+
+   .. [SMMG2014] M. R. Smith, T. Martinez, C. Giraud-Carrier. "An instance
+      level analysis of data complexity." Machine Learning 95.2 (2014):
+      225-256.