@@ -6,80 +6,30 @@ Ensemble of samplers
.. currentmodule:: imblearn.ensemble

- .. _ensemble_samplers:
-
- Samplers
- --------
-
- .. warning::
- Note that :class:`EasyEnsemble` is deprecated and you should use
- :class:`EasyEnsembleClassifier` instead. :class:`EasyEnsembleClassifier` is
- presented in the next section.
-
- An imbalanced data set can be balanced by creating several balanced
- subsets. The module :mod:`imblearn.ensemble` allows to create such sets.
-
- :class:`EasyEnsemble` creates an ensemble of data set by randomly
- under-sampling the original set::
-
- >>> from collections import Counter
- >>> from sklearn.datasets import make_classification
- >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
- ... n_redundant=0, n_repeated=0, n_classes=3,
- ... n_clusters_per_class=1,
- ... weights=[0.01, 0.05, 0.94],
- ... class_sep=0.8, random_state=0)
- >>> print(sorted(Counter(y).items()))
- [(0, 64), (1, 262), (2, 4674)]
- >>> from imblearn.ensemble import EasyEnsemble
- >>> ee = EasyEnsemble(random_state=0, n_subsets=10) # doctest: +SKIP
- >>> X_resampled, y_resampled = ee.fit_resample(X, y) # doctest: +SKIP
- >>> print(X_resampled.shape) # doctest: +SKIP
- (10, 192, 2)
- >>> print(sorted(Counter(y_resampled[0]).items())) # doctest: +SKIP
- [(0, 64), (1, 64), (2, 64)]
-
- :class:`EasyEnsemble` has two important parameters: (i) ``n_subsets`` will be
- used to return number of subset and (ii) ``replacement`` to randomly sample
- with or without replacement.
-
- :class:`BalanceCascade` differs from the previous method by using a classifier
- (using the parameter ``estimator``) to ensure that misclassified samples can
- again be selected for the next subset. In fact, the classifier play the role of
- a "smart" replacement method. The maximum number of subset can be set using the
- parameter ``n_max_subset`` and an additional bootstraping can be activated with
- ``bootstrap`` set to ``True``::
-
- >>> from imblearn.ensemble import BalanceCascade
- >>> from sklearn.linear_model import LogisticRegression
- >>> bc = BalanceCascade(random_state=0,
- ... estimator=LogisticRegression(solver='lbfgs',
- ... multi_class='auto',
- ... random_state=0),
- ... n_max_subset=4)
- >>> X_resampled, y_resampled = bc.fit_resample(X, y)
- >>> print(X_resampled.shape)
- (4, 192, 2)
- >>> print(sorted(Counter(y_resampled[0]).items()))
- [(0, 64), (1, 64), (2, 64)]
+ .. _ensemble_meta_estimators:
- See
- :ref:`sphx_glr_auto_examples_ensemble_plot_easy_ensemble.py` and
- :ref:`sphx_glr_auto_examples_ensemble_plot_balance_cascade.py`.
+ Classifier including inner balancing samplers
+ =============================================

- .. _ensemble_meta_estimators:
+ .. _bagging:

- Chaining ensemble of samplers and estimators
- --------------------------------------------
+ Bagging classifier
+ ------------------

In ensemble classifiers, bagging methods build several estimators on different
randomly selected subsets of data. In scikit-learn, this classifier is named
``BaggingClassifier``. However, this classifier does not allow each subset of
data to be balanced. Therefore, when training on an imbalanced data set, this
classifier will favor the majority classes::

+ >>> from sklearn.datasets import make_classification
+ >>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
+ ... n_redundant=0, n_repeated=0, n_classes=3,
+ ... n_clusters_per_class=1,
+ ... weights=[0.01, 0.05, 0.94], class_sep=0.8,
+ ... random_state=0)
>>> from sklearn.model_selection import train_test_split
- >>> from sklearn.metrics import confusion_matrix
+ >>> from sklearn.metrics import balanced_accuracy_score
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
@@ -88,10 +38,8 @@ classifier will favor the majority classes::
>>> bc.fit(X_train, y_train) # doctest: +ELLIPSIS
BaggingClassifier(...)
>>> y_pred = bc.predict(X_test)
- >>> confusion_matrix(y_test, y_pred)
- array([[ 9, 1, 2],
- [ 0, 54, 5],
- [ 1, 6, 1172]])
+ >>> balanced_accuracy_score(y_test, y_pred) # doctest: +ELLIPSIS
+ 0.77...
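+
+ The balanced accuracy reported above averages recall over the classes and can
+ hide a very uneven per-class behaviour. As a small illustrative sketch (not
+ part of the original example), the bias towards the majority class can be made
+ visible with scikit-learn's ``classification_report``::
+
+ >>> from sklearn.metrics import classification_report
+ >>> print(classification_report(y_test, y_pred)) # doctest: +SKIP
+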
:class:`BalancedBaggingClassifier` allows each subset of data to be resampled
before training each estimator of the ensemble. In short, it combines the
@@ -111,45 +59,77 @@ random under-sampler::
>>> bbc.fit(X_train, y_train) # doctest: +ELLIPSIS
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
- >>> confusion_matrix(y_test, y_pred)
- array([[ 9, 1, 2],
- [ 0, 55, 4],
- [ 42, 46, 1091]])
+ >>> balanced_accuracy_score(y_test, y_pred) # doctest: +ELLIPSIS
+ 0.80...
+
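+ For reference, such an ensemble is built by passing a base estimator and the
+ resampling options to the constructor, reusing the ``DecisionTreeClassifier``
+ imported above. The following lines are an illustrative sketch only: the
+ parameter names (``base_estimator``, ``sampling_strategy``, ``replacement``)
+ follow the imbalanced-learn API at the time of writing and may be renamed in
+ later releases::
+
+ >>> from imblearn.ensemble import BalancedBaggingClassifier
+ >>> balanced_bagging = BalancedBaggingClassifier(
+ ...     base_estimator=DecisionTreeClassifier(),
+ ...     sampling_strategy='auto', replacement=False, random_state=0)
+ >>> balanced_bagging.fit(X_train, y_train) # doctest: +SKIP
+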
+ .. _forest:
+
+ Forest of randomized trees
+ --------------------------

:class:`BalancedRandomForestClassifier` is another ensemble method in which
- each tree of the forest will be provided a balanced boostrap sample. This class
+ each tree of the forest will be provided a balanced bootstrap sample [CLB2004]_. This class
provides all functionality of the
:class:`sklearn.ensemble.RandomForestClassifier` and notably the
`feature_importances_` attribute::

-
>>> from imblearn.ensemble import BalancedRandomForestClassifier
- >>> brf = BalancedRandomForestClassifier(n_estimators=10, random_state=0)
+ >>> brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
>>> brf.fit(X_train, y_train) # doctest: +ELLIPSIS
BalancedRandomForestClassifier(...)
>>> y_pred = brf.predict(X_test)
- >>> confusion_matrix(y_test, y_pred)
- array([[ 9, 1, 2],
- [ 3, 54, 2],
- [ 113, 47, 1019]])
- >>> brf.feature_importances_
- array([ 0.63501243, 0.36498757])
-
- A specific method which uses ``AdaBoost`` as learners in the bagging
- classifier is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows
- to bag AdaBoost learners which are trained on balanced bootstrap samples.
- Similarly to the :class:`BalancedBaggingClassifier` API, one can construct
- the ensemble as::
+ >>> balanced_accuracy_score(y_test, y_pred) # doctest: +ELLIPSIS
+ 0.81...
+ >>> brf.feature_importances_ # doctest: +ELLIPSIS
+ array([ 0.55..., 0.44...])
+
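+ The importances above are reported in the order of the input columns. As an
+ illustrative sketch (not part of the original example), they can be ranked to
+ identify the most informative features::
+
+ >>> import numpy as np
+ >>> ranking = np.argsort(brf.feature_importances_)[::-1] # most important first
+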
+ .. _boosting:
+
+ Boosting
+ --------
+
+ Several methods taking advantage of boosting have been designed.
+
+ :class:`RUSBoostClassifier` randomly under-samples the dataset before
+ performing a boosting iteration [SKHN2010]_::
+
+ >>> from imblearn.ensemble import RUSBoostClassifier
+ >>> rusboost = RUSBoostClassifier(random_state=0)
+ >>> rusboost.fit(X_train, y_train) # doctest: +ELLIPSIS
+ RUSBoostClassifier(...)
+ >>> y_pred = rusboost.predict(X_test)
+ >>> balanced_accuracy_score(y_test, y_pred) # doctest: +ELLIPSIS
+ 0.74...
+
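+ :class:`RUSBoostClassifier` also exposes the usual AdaBoost hyper-parameters
+ such as ``n_estimators`` and ``learning_rate``. The following lines are an
+ illustrative sketch only; the chosen values are not taken from the original
+ documentation::
+
+ >>> rusboost_tuned = RUSBoostClassifier(n_estimators=200, learning_rate=0.1,
+ ...                                     random_state=0)
+ >>> rusboost_tuned.fit(X_train, y_train) # doctest: +SKIP
+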
+ A specific method which uses ``AdaBoost`` as learners in the bagging classifier
+ is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows bagging
+ AdaBoost learners which are trained on balanced bootstrap samples [LWZ2009]_.
+ Similarly to the :class:`BalancedBaggingClassifier` API, one can construct the
+ ensemble as::

>>> from imblearn.ensemble import EasyEnsembleClassifier
>>> eec = EasyEnsembleClassifier(random_state=0)
>>> eec.fit(X_train, y_train) # doctest: +ELLIPSIS
EasyEnsembleClassifier(...)
>>> y_pred = eec.predict(X_test)
- >>> confusion_matrix(y_test, y_pred)
- array([[ 9, 1, 2],
- [ 5, 52, 2],
- [252, 45, 882]])
+ >>> balanced_accuracy_score(y_test, y_pred) # doctest: +ELLIPSIS
+ 0.62...
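+
+ Each bag of the :class:`EasyEnsembleClassifier` is an AdaBoost learner, so both
+ the number of bags and the underlying boosting can be configured. The snippet
+ below is an illustrative sketch only; the parameter names follow the API at
+ the time of writing and are not taken from the original example::
+
+ >>> from sklearn.ensemble import AdaBoostClassifier
+ >>> eec_tuned = EasyEnsembleClassifier(
+ ...     n_estimators=20, base_estimator=AdaBoostClassifier(n_estimators=10),
+ ...     random_state=0)
+ >>> eec_tuned.fit(X_train, y_train) # doctest: +SKIP
+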
See
- :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+ :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+
+ .. topic:: References
+
+ .. [CLB2004] Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forest to
+ learn imbalanced data." University of California, Berkeley 110
+ (2004): 1-12.
+
+ .. [LWZ2009] X. Y. Liu, J. Wu and Z. H. Zhou, "Exploratory Undersampling for
+ Class-Imbalance Learning," in IEEE Transactions on Systems, Man,
+ and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp.
+ 539-550, April 2009.
+
+ .. [SKHN2010] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., &
+ Napolitano, A. "RUSBoost: A hybrid approach to alleviating
+ class imbalance." IEEE Transactions on Systems, Man, and
+ Cybernetics-Part A: Systems and Humans 40.1 (2010): 185-197.