
Commit 7f93dfc

DOC: Removing duplicate examples and cross-referencing (#471)
1 parent 56b624c commit 7f93dfc

38 files changed: +124 −1313 lines

doc/combine.rst

Lines changed: 17 additions & 8 deletions
@@ -15,11 +15,11 @@ from over-sampling.
 
 In this regard, Tomek's link and edited nearest-neighbours are the two cleaning
 methods that have been added to the pipeline after applying SMOTE over-sampling
-to obtain a cleaner space. The two ready-to use classes imbalanced-learn implements
-for combining over- and undersampling methods are: (i) :class:`SMOTETomek`
-and (ii) :class:`SMOTEENN`.
+to obtain a cleaner space. The two ready-to-use classes imbalanced-learn
+implements for combining over- and under-sampling methods are: (i)
+:class:`SMOTETomek` [BPM2004]_ and (ii) :class:`SMOTEENN` [BBM2003]_.
 
-Those two classes can be used like any other sampler with parameters identical
+Those two classes can be used like any other sampler with parameters identical
 to their former samplers::
 
     >>> from collections import Counter
@@ -50,7 +50,16 @@ noisy samples than :class:`SMOTETomek`.
    :scale: 60
    :align: center
 
-See :ref:`sphx_glr_auto_examples_combine_plot_smote_enn.py`,
-:ref:`sphx_glr_auto_examples_combine_plot_smote_tomek.py`,
-and
-:ref:`sphx_glr_auto_examples_combine_plot_comparison_combine.py`.
+.. topic:: Examples
+
+  * :ref:`sphx_glr_auto_examples_combine_plot_comparison_combine.py`
+
+.. topic:: References
+
+  .. [BPM2004] G. Batista, R. C. Prati, M. C. Monard. "A study of the behavior
+               of several methods for balancing machine learning training
+               data," ACM SIGKDD Explorations Newsletter 6 (1), 20-29, 2004.
+
+  .. [BBM2003] G. Batista, B. Bazzan, M. Monard, "Balancing Training Data for
+               Automated Annotation of Keywords: a Case Study," In WOB, 10-18,
+               2003.
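To orient readers of this commit, a minimal sketch of the two combined samplers referenced above; the toy dataset built with scikit-learn's ``make_classification`` is an illustrative assumption, not taken from the docs:

    >>> from sklearn.datasets import make_classification
    >>> from imblearn.combine import SMOTEENN, SMOTETomek
    >>> # assumed toy 3-class imbalanced problem
    >>> X, y = make_classification(n_samples=5000, n_classes=3, n_informative=4,
    ...                            weights=[0.01, 0.05, 0.94], random_state=10)
    >>> X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)  # SMOTE then ENN cleaning
    >>> X_tl, y_tl = SMOTETomek(random_state=0).fit_resample(X, y)  # SMOTE then Tomek cleaning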

doc/ensemble.rst

Lines changed: 3 additions & 3 deletions
@@ -50,7 +50,6 @@ takes the same parameters than the scikit-learn
 ``sampling_strategy`` and ``replacement`` to control the behaviour of the
 random under-sampler::
 
-
     >>> from imblearn.ensemble import BalancedBaggingClassifier
     >>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
     ...                                 sampling_strategy='auto',
@@ -115,8 +114,9 @@ ensemble as::
     >>> balanced_accuracy_score(y_test, y_pred) # doctest: +ELLIPSIS
     0.62484778593026025
 
-See
-:ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+.. topic:: Examples
+
+  * :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`
 
 .. topic:: References
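As a companion to this hunk, a hedged end-to-end sketch of the balanced-bagging usage shown above; the dataset and split are assumptions for illustration, and only the estimator parameters come from the docs:

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.model_selection import train_test_split
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> from sklearn.metrics import balanced_accuracy_score
    >>> from imblearn.ensemble import BalancedBaggingClassifier
    >>> X, y = make_classification(n_samples=10000, weights=[0.99, 0.01],
    ...                            random_state=10)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    >>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
    ...                                 sampling_strategy='auto',
    ...                                 replacement=False,
    ...                                 random_state=0)
    >>> y_pred = bbc.fit(X_train, y_train).predict(X_test)
    >>> score = balanced_accuracy_score(y_test, y_pred)  # value depends on the split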

doc/miscellaneous.rst

Lines changed: 4 additions & 0 deletions
@@ -149,3 +149,7 @@ will be passed to ``fit_generator``::
     ...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)
     >>> callback_history = model.fit_generator(generator=training_generator,
     ...                                        epochs=10, verbose=0)
+
+.. topic:: References
+
+  * :ref:`sphx_glr_auto_examples_applications_porto_seguro_keras_under_sampling.py`
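The call above is truncated by the diff; in recent imbalanced-learn releases the helper shown in this section is ``balanced_batch_generator`` from ``imblearn.keras`` (an assumption here, and it requires Keras). A minimal sketch with random data purely for illustration:

    >>> import numpy as np
    >>> from imblearn.keras import balanced_batch_generator  # assumed helper; requires keras
    >>> from imblearn.under_sampling import RandomUnderSampler
    >>> X = np.random.uniform(size=(1000, 20)).astype(np.float32)
    >>> y = np.random.binomial(1, 0.05, size=1000)   # roughly 5% positives
    >>> training_generator, steps_per_epoch = balanced_batch_generator(
    ...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)
    >>> # then, e.g.: model.fit_generator(generator=training_generator,
    >>> #                                 steps_per_epoch=steps_per_epoch, epochs=10)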

doc/over_sampling.rst

Lines changed: 33 additions & 16 deletions
@@ -9,6 +9,9 @@ Over-sampling
 A practical guide
 =================
 
+You can refer to
+:ref:`sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py`.
+
 .. _random_over_sampler:
 
 Naive random over-sampling
@@ -68,18 +71,15 @@ In addition, :class:`RandomOverSampler` allows to sample heterogeneous data
     >>> print(y_resampled)
     [0 0 1 1]
 
-See :ref:`sphx_glr_auto_examples_over-sampling_plot_random_over_sampling.py`
-for usage example.
-
 .. _smote_adasyn:
 
 From random over-sampling to SMOTE and ADASYN
 ---------------------------------------------
 
 Apart from the random sampling with replacement, there are two popular methods
-to over-sample minority classes: (i) the Synthetic Minority Oversampling Technique
-(SMOTE) and (ii) the Adaptive Synthetic (ADASYN) sampling method. These algorithms
-can be used in the same manner::
+to over-sample minority classes: (i) the Synthetic Minority Oversampling
+Technique (SMOTE) [CBHK2002]_ and (ii) the Adaptive Synthetic (ADASYN)
+[HBGL2008]_ sampling method. These algorithms can be used in the same manner::
 
     >>> from imblearn.over_sampling import SMOTE, ADASYN
     >>> X_resampled, y_resampled = SMOTE().fit_resample(X, y)
@@ -91,16 +91,25 @@ can be used in the same manner::
     [(0, 4673), (1, 4662), (2, 4674)]
     >>> clf_adasyn = LinearSVC().fit(X_resampled, y_resampled)
 
-The figure below illustrates the major difference of the different over-sampling
-methods.
+The figure below illustrates the major difference of the different
+over-sampling methods.
 
 .. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_003.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
 
-See :ref:`sphx_glr_auto_examples_over-sampling_plot_smote.py` and
-:ref:`sphx_glr_auto_examples_over-sampling_plot_adasyn.py` for usage example.
+.. topic:: References
+
+  .. [HBGL2008] He, Haibo, Yang Bai, Edwardo A. Garcia, and Shutao Li. "ADASYN:
+                Adaptive synthetic sampling approach for imbalanced learning,"
+                In IEEE International Joint Conference on Neural Networks (IEEE
+                World Congress on Computational Intelligence), pp. 1322-1328,
+                2008.
+
+  .. [CBHK2002] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer,
+                "SMOTE: synthetic minority over-sampling technique," Journal of
+                artificial intelligence research, 321-357, 2002.
 
 Ill-posed examples
 ------------------
@@ -143,25 +152,33 @@ nearest neighbors class. Those variants are presented in the figure below.
    :align: center
 
 
-The :class:`BorderlineSMOTE` and :class:`SVMSMOTE` offer some variant of the SMOTE
-algorithm::
+The :class:`BorderlineSMOTE` [HWB2005]_ and :class:`SVMSMOTE` [NCK2009]_ offer
+some variant of the SMOTE algorithm::
 
     >>> from imblearn.over_sampling import BorderlineSMOTE
     >>> X_resampled, y_resampled = BorderlineSMOTE().fit_resample(X, y)
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 4674), (1, 4674), (2, 4674)]
 
-See :ref:`sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py`
-to see a comparison between the different over-sampling methods.
+.. topic:: References
+
+  .. [HWB2005] H. Han, W. Wen-Yuan, M. Bing-Huan, "Borderline-SMOTE: a new
+               over-sampling method in imbalanced data sets learning," Advances
+               in intelligent computing, 878-887, 2005.
+
+  .. [NCK2009] H. M. Nguyen, E. W. Cooper, K. Kamei, "Borderline over-sampling
+               for imbalanced data classification," International Journal of
+               Knowledge Engineering and Soft Data Paradigms, 3(1), pp. 4-21,
+               2009.
 
 Mathematical formulation
 ========================
 
 Sample generation
 -----------------
 
-Both SMOTE and ADASYN use the same algorithm to generate new
-samples. Considering a sample :math:`x_i`, a new sample :math:`x_{new}` will be
+Both SMOTE and ADASYN use the same algorithm to generate new samples.
+Considering a sample :math:`x_i`, a new sample :math:`x_{new}` will be
 generated considering its k nearest-neighbors (corresponding to
 ``k_neighbors``). For instance, the 3 nearest-neighbors are included in the
 blue circle as illustrated in the figure below. Then, one of these
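The text above describes how SMOTE and ADASYN interpolate a synthetic sample between :math:`x_i` and one of its nearest neighbours :math:`x_{zi}`, i.e. :math:`x_{new} = x_i + \lambda (x_{zi} - x_i)` with :math:`\lambda \in [0, 1]`. A minimal numpy sketch of that generation step; names and values are illustrative:

    >>> import numpy as np
    >>> rng = np.random.RandomState(0)
    >>> x_i = np.array([1.0, 2.0])         # sample being over-sampled
    >>> x_zi = np.array([2.0, 3.0])        # one of its k nearest neighbours
    >>> lam = rng.uniform(0, 1)            # lambda drawn uniformly in [0, 1]
    >>> x_new = x_i + lam * (x_zi - x_i)   # synthetic sample on the segment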

doc/under_sampling.rst

Lines changed: 59 additions & 50 deletions
@@ -6,6 +6,9 @@ Under-sampling
 
 .. currentmodule:: imblearn.under_sampling
 
+You can refer to
+:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
+
 .. _cluster_centroids:
 
 Prototype generation
@@ -55,9 +58,6 @@ original one.
    generated are not specifically sparse. Therefore, even if the resulting
    matrix will be sparse, the algorithm will be inefficient in this regard.
 
-See :ref:`sphx_glr_auto_examples_under-sampling_plot_cluster_centroids.py` and
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
-
 Prototype selection
 ===================

@@ -116,13 +116,9 @@ In addition, :class:`RandomUnderSampler` allows to sample heterogeneous data
     >>> print(y_resampled)
     [0 1]
 
-See :ref:`sphx_glr_auto_examples_plot_sampling_strategy_usage.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-and :ref:`sphx_glr_auto_examples_under-sampling_plot_random_under_sampler.py`.
-
-:class:`NearMiss` adds some heuristic rules to select
-samples. :class:`NearMiss` implements 3 different types of heuristic which can
-be selected with the parameter ``version``::
+:class:`NearMiss` adds some heuristic rules to select samples [MZ2003]_.
+:class:`NearMiss` implements 3 different types of heuristic which can be
+selected with the parameter ``version``::
 
     >>> from imblearn.under_sampling import NearMiss
     >>> nm1 = NearMiss(version=1)
@@ -137,10 +133,12 @@ from scikit-learn. The former parameter is used to compute the average distance
 to the neighbors while the latter is used for the pre-selection of the samples
 of interest.
 
-See
-:ref:`sphx_glr_auto_examples_applications_plot_multi_class_under_sampling.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-and :ref:`sphx_glr_auto_examples_under-sampling_plot_nearmiss.py`.
+.. topic:: References
+
+  .. [MZ2003] I. Mani, I. Zhang. "kNN approach to unbalanced data
+              distributions: a case study involving information extraction," In
+              Proceedings of workshop on learning from imbalanced datasets,
+              2003.
 
 Mathematical formulation
 ^^^^^^^^^^^^^^^^^^^^^^^^
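Before the mathematical formulation continues below, a hedged sketch of the three NearMiss heuristics discussed in this hunk; the toy dataset is an assumption:

    >>> from sklearn.datasets import make_classification
    >>> from imblearn.under_sampling import NearMiss
    >>> X, y = make_classification(n_samples=5000, weights=[0.1, 0.9],
    ...                            random_state=0)
    >>> for version in (1, 2, 3):  # the three heuristics selected via ``version``
    ...     X_res, y_res = NearMiss(version=version).fit_resample(X, y)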
@@ -194,9 +192,6 @@ affected by noise due to the first step sample selection.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_illustration_nearmiss.py`.
-
 Cleaning under-sampling techniques
 ----------------------------------

@@ -209,9 +204,9 @@ which will clean the dataset.
 Tomek's links
 ^^^^^^^^^^^^^
 
-:class:`TomekLinks` detects the so-called Tomek's links. A Tomek's link
-between two samples of different class :math:`x` and :math:`y` is defined such
-that there is no example :math:`z` such that:
+:class:`TomekLinks` detects the so-called Tomek's links [T2010]_. A Tomek's
+link between two samples of different class :math:`x` and :math:`y` is defined
+such that there is no example :math:`z` such that:
 
 .. math::
@@ -238,10 +233,10 @@ figure illustrates this behaviour.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_illustration_tomek_links.py`
-and
-:ref:`sphx_glr_auto_examples_under-sampling_plot_tomek_links.py`.
+.. topic:: References
+
+  .. [T2010] I. Tomek, "Two modifications of CNN," In Systems, Man, and
+             Cybernetics, IEEE Transactions on, vol. 6, pp. 769-772, 2010.
 
 .. _edited_nearest_neighbors:
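Before the edited-nearest-neighbours section below, a hedged usage sketch for :class:`TomekLinks` to go with the [T2010]_ reference above; the dataset is assumed, and by default only the majority-class member of each link is removed:

    >>> from sklearn.datasets import make_classification
    >>> from imblearn.under_sampling import TomekLinks
    >>> X, y = make_classification(n_samples=5000, weights=[0.1, 0.9],
    ...                            random_state=0)
    >>> X_res, y_res = TomekLinks().fit_resample(X, y)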

@@ -250,7 +245,7 @@ Edited data set using nearest neighbours
 
 :class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
 "edit" the dataset by removing samples which do not agree "enough" with their
-neighboorhood. For each sample in the class to be under-sampled, the
+neighbourhood [W1972]_. For each sample in the class to be under-sampled, the
 nearest-neighbours are computed and if the selection criterion is not
 fulfilled, the sample is removed. Two selection criteria are currently
 available: (i) the majority (i.e., ``kind_sel='mode'``) or (ii) all (i.e.,
@@ -270,8 +265,8 @@ The parameter ``n_neighbors`` allows to give a classifier subclassed from
 the decision to keep a given sample or not.
 
 :class:`RepeatedEditedNearestNeighbours` extends
-:class:`EditedNearestNeighbours` by repeating the algorithm multiple times.
-Generally, repeating the algorithm will delete more data::
+:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
+[T1976]_. Generally, repeating the algorithm will delete more data::
 
     >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
     >>> renn = RepeatedEditedNearestNeighbours()
@@ -281,7 +276,7 @@ Generally, repeating the algorithm will delete more data::
 
 :class:`AllKNN` differs from the previous
 :class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
-internal nearest neighbors algorithm is increased at each iteration::
+internal nearest neighbors algorithm is increased at each iteration [T1976]_::
 
     >>> from imblearn.under_sampling import AllKNN
     >>> allknn = AllKNN()
@@ -297,19 +292,24 @@ impact by cleaning noisy samples next to the boundaries of the classes.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_pipeline_plot_pipeline_classification.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-and :ref:`sphx_glr_auto_examples_under-sampling_plot_enn_renn_allknn.py`.
+.. topic:: References
+
+  .. [W1972] D. Wilson, "Asymptotic Properties of Nearest Neighbor Rules Using
+             Edited Data," In IEEE Transactions on Systems, Man, and
+             Cybernetics, vol. 2 (3), pp. 408-421, 1972.
+
+  .. [T1976] I. Tomek, "An Experiment with the Edited Nearest-Neighbor
+             Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol.
+             6(6), pp. 448-452, June 1976.
 
 .. _condensed_nearest_neighbors:
 
 Condensed nearest neighbors and derived algorithms
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 :class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
-iteratively decide if a sample should be removed or not. The algorithm is
-running as followed:
+iteratively decide if a sample should be removed or not [H1968]_. The
+algorithm runs as follows:
 
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
@@ -331,10 +331,10 @@ However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
 is sensitive to noise and will add noisy samples.
 
 On the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
-remove noisy samples. In addition, the 1 nearest neighbor rule is applied to
-all samples and the one which are misclassified will be added to the set
-:math:`C`. No iteration on the set :math:`S` will take place. The class can be
-used as::
+remove noisy samples [KM1997]_. In addition, the 1 nearest neighbor rule is
+applied to all samples and the ones which are misclassified will be added to the
+set :math:`C`. No iteration on the set :math:`S` will take place. The class can
+be used as::
 
     >>> from imblearn.under_sampling import OneSidedSelection
     >>> oss = OneSidedSelection(random_state=0)
@@ -346,9 +346,9 @@ Our implementation offer to set the number of seeds to put in the set :math:`C`
 originally by setting the parameter ``n_seeds_S``.
 
 :class:`NeighbourhoodCleaningRule` will focus on cleaning the data rather than
-condensing them. Therefore, it will used the union of samples to be rejected
-between the :class:`EditedNearestNeighbours` and the output a 3 nearest
-neighbors classifier. The class can be used as::
+condensing them [J2001]_. Therefore, it will use the union of samples to be
+rejected between the :class:`EditedNearestNeighbours` and the output of a 3
+nearest neighbors classifier. The class can be used as::
 
     >>> from imblearn.under_sampling import NeighbourhoodCleaningRule
     >>> ncr = NeighbourhoodCleaningRule()
@@ -361,11 +361,18 @@ neighbors classifier. The class can be used as::
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_condensed_nearest_neighbour.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_one_sided_selection.py`, and
-:ref:`sphx_glr_auto_examples_under-sampling_plot_neighbourhood_cleaning_rule.py`.
+.. topic:: References
+
+  .. [H1968] P. Hart, "The condensed nearest neighbor rule,"
+             In Information Theory, IEEE Transactions on, vol. 14(3), pp.
+             515-516, 1968.
+
+  .. [KM1997] M. Kubat, S. Matwin, "Addressing the curse of imbalanced training
+              sets: one-sided selection," In ICML, vol. 97, pp. 179-186, 1997.
+
+  .. [J2001] J. Laurikkala, "Improving identification of difficult small
+             classes by balancing class distribution," Springer Berlin
+             Heidelberg, 2001.
 
 .. _instance_hardness_threshold:
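Before the instance-hardness section below, a hedged sketch tying the three references above back to code; the dataset is an assumption, and these samplers can be slow on large data:

    >>> from sklearn.datasets import make_classification
    >>> from imblearn.under_sampling import (CondensedNearestNeighbour,
    ...                                      OneSidedSelection,
    ...                                      NeighbourhoodCleaningRule)
    >>> X, y = make_classification(n_samples=1000, weights=[0.1, 0.9],
    ...                            random_state=0)
    >>> X_cnn, y_cnn = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)
    >>> X_oss, y_oss = OneSidedSelection(random_state=0).fit_resample(X, y)
    >>> X_ncr, y_ncr = NeighbourhoodCleaningRule().fit_resample(X, y)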

@@ -374,7 +381,7 @@ Instance hardness threshold
 
 :class:`InstanceHardnessThreshold` is a specific algorithm in which a
 classifier is trained on the data and the samples with lower probabilities are
-removed. The class can be used as::
+removed [SMMG2014]_. The class can be used as::
 
     >>> from sklearn.linear_model import LogisticRegression
     >>> from imblearn.under_sampling import InstanceHardnessThreshold
@@ -403,6 +410,8 @@ The figure below gives another examples on some toy data.
    :scale: 60
    :align: center
 
-See
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
-:ref:`sphx_glr_auto_examples_under-sampling_plot_instance_hardness_threshold.py`.
+.. topic:: References
+
+  .. [SMMG2014] M. R. Smith, T. Martinez, C. Giraud-Carrier. "An instance
+                level analysis of data complexity." Machine Learning 95.2
+                (2014): 225-256.
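Finally, a hedged sketch for :class:`InstanceHardnessThreshold`, matching the imports shown in this hunk; the dataset and solver choice are assumptions:

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.linear_model import LogisticRegression
    >>> from imblearn.under_sampling import InstanceHardnessThreshold
    >>> X, y = make_classification(n_samples=5000, weights=[0.1, 0.9],
    ...                            random_state=0)
    >>> iht = InstanceHardnessThreshold(
    ...     estimator=LogisticRegression(solver='lbfgs'), random_state=0)
    >>> X_res, y_res = iht.fit_resample(X, y)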
