Commit dcb1546

Merge pull request #8 from FalseNegativeLab/feature/multiclass_new

Mainly documentation changes

2 parents cff2911 + 7125700

40 files changed: 330 additions & 95 deletions

.pylintrc

Lines changed: 2 additions & 0 deletions
@@ -6,3 +6,5 @@ ignored-modules = numpy
 # Minimum lines number of a similarity.
 min-similarity-lines=30
+
+disable = too-many-arguments

README.rst

Lines changed: 15 additions & 2 deletions
@@ -76,7 +76,7 @@ If you use the package, please consider citing the following paper:
 .. code-block:: BibTex

   @misc{fazekas2023testing,
-    title={Testing the Consistency of Performance Scores Reported for Binary Classification Problems},
+    title={Testing the Consistency of Performance Scores Reported for Binary Classification Problems},
     author={Attila Fazekas and György Kovács},
     year={2023},
     eprint={2310.12527},
@@ -159,6 +159,8 @@ A simple binary classification testset consisting of ``p`` positive samples (usu

     testset = {"p": 10, "n": 20}

+We note that alternative notations, like using ``n_positive``, ``n_minority`` or ``n_1`` instead of ``p`` and similarly, ``n_negative``, ``n_majority`` and ``n_0`` instead of ``n`` are supported.
+
 One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` counts of the datasets from its internal registry (based on the representations in the ``common-datasets`` package):

 .. code-block:: Python
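The alternative testset notations introduced in this hunk can be illustrated with a small sketch. The ``ALIASES`` table and the ``canonicalize_testset`` helper below are reconstructed from the documentation text purely for illustration; they are not part of the mlscorecheck API, which resolves these aliases internally:

```python
# Illustrative sketch: map the alternative testset keys described in the
# README back to the canonical "p"/"n" names. The alias table is taken
# from the documentation text, not imported from mlscorecheck.
ALIASES = {
    "n_positive": "p", "n_minority": "p", "n_1": "p",
    "n_negative": "n", "n_majority": "n", "n_0": "n",
}

def canonicalize_testset(testset: dict) -> dict:
    """Return a copy of the testset with alias keys renamed to 'p'/'n'."""
    return {ALIASES.get(key, key): value for key, value in testset.items()}

# The two specifications below are intended to be interchangeable:
print(canonicalize_testset({"n_positive": 10, "n_negative": 20}))  # {'p': 10, 'n': 20}
print(canonicalize_testset({"p": 10, "n": 20}))                    # {'p': 10, 'n': 20}
```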
@@ -261,7 +263,18 @@ Depending on the experimental setup, the consistency tests developed for binary
 * prevalence threshold (``pt``),
 * diagnostic odds ratio (``dor``),
 * Jaccard index (``ji``),
-* Cohen's kappa (``kappa``)
+* Cohen's kappa (``kappa``).
+
+We note that synonyms and full names are also supported, for example:
+
+* alternatives to ``sens`` are ``sensitivity``, ``true_positive_rate``, ``tpr`` and ``recall``,
+* alternatives to ``spec`` are ``specificity``, ``true_negative_rate``, ``tnr`` and ``selectivity``,
+* alternatives to ``ppv`` are ``positive_predictive_value`` and ``precision``.
+
+Similarly, complements are supported as:
+
+* one can specify ``false_positive_rate`` or ``fpr`` as a complement of ``spec``,
+* and similarly, ``false_negative_rate`` or ``fnr`` as a complement of ``sens``.

 The tests are designed to detect inconsistencies. If the resulting ``inconsistency`` flag is ``False``, the scores can still be calculated in non-standard ways. However, **if the resulting ``inconsistency`` flag is ``True``, it conclusively indicates that inconsistencies are detected, and the reported scores could not be the outcome of the presumed experiment**.
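The synonym and complement handling documented above can be sketched in a few lines. The tables below are reconstructed from the bullet lists in this hunk and the normalization helper is hypothetical; the actual resolution happens inside mlscorecheck:

```python
# Illustrative sketch of the synonym/complement resolution the README
# describes. The tables are reconstructed from the documentation text,
# not imported from mlscorecheck.
SYNONYMS = {
    "sensitivity": "sens", "true_positive_rate": "sens", "tpr": "sens", "recall": "sens",
    "specificity": "spec", "true_negative_rate": "spec", "tnr": "spec", "selectivity": "spec",
    "positive_predictive_value": "ppv", "precision": "ppv",
}
# Complements: a reported value v is stored as 1 - v under the base score.
COMPLEMENTS = {
    "false_positive_rate": "spec", "fpr": "spec",
    "false_negative_rate": "sens", "fnr": "sens",
}

def canonicalize_scores(scores: dict) -> dict:
    """Rename synonym keys to their base score and fold in complements."""
    result = {}
    for name, value in scores.items():
        if name in COMPLEMENTS:
            result[COMPLEMENTS[name]] = 1 - value
        else:
            result[SYNONYMS.get(name, name)] = value
    return result

print(canonicalize_scores({"recall": 0.9, "fpr": 0.25, "acc": 0.85}))
# {'sens': 0.9, 'spec': 0.75, 'acc': 0.85}
```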

docs/01a_requirements.rst

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ A simple binary classification testset consisting of ``p`` positive samples (usu

     testset = {"p": 10, "n": 20}

+We note that alternative notations, like using ``n_positive``, ``n_minority`` or ``n_1`` instead of ``p`` and similarly, ``n_negative``, ``n_majority`` and ``n_0`` instead of ``n`` are supported.
+
 One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` counts of the datasets from its internal registry (based on the representations in the ``common-datasets`` package):

 .. code-block:: Python

docs/01c_consistency_checking.rst

Lines changed: 12 additions & 1 deletion
@@ -24,7 +24,18 @@ Depending on the experimental setup, the consistency tests developed for binary
 * prevalence threshold (``pt``),
 * diagnostic odds ratio (``dor``),
 * Jaccard index (``ji``),
-* Cohen's kappa (``kappa``)
+* Cohen's kappa (``kappa``).
+
+We note that synonyms and full names are also supported, for example:
+
+* alternatives to ``sens`` are ``sensitivity``, ``true_positive_rate``, ``tpr`` and ``recall``,
+* alternatives to ``spec`` are ``specificity``, ``true_negative_rate``, ``tnr`` and ``selectivity``,
+* alternatives to ``ppv`` are ``positive_predictive_value`` and ``precision``.
+
+Similarly, complements are supported as:
+
+* one can specify ``false_positive_rate`` or ``fpr`` as a complement of ``spec``,
+* and similarly, ``false_negative_rate`` or ``fnr`` as a complement of ``sens``.

 The tests are designed to detect inconsistencies. If the resulting ``inconsistency`` flag is ``False``, the scores can still be calculated in non-standard ways. However, **if the resulting ``inconsistency`` flag is ``True``, it conclusively indicates that inconsistencies are detected, and the reported scores could not be the outcome of the presumed experiment**.

mlscorecheck/aggregated/_fold_enumeration.py

Lines changed: 8 additions & 4 deletions
@@ -370,7 +370,7 @@ def experiment_kfolds_generator(experiment: dict, available_scores: list):
         "aggregation": experiment["aggregation"],
     }

-def multiclass_fold_partitioning_generator_22(n0: int, n1: int, c0: int) -> dict:
+def multiclass_fold_partitioning_generator_22(n0: int, n1: int, c0: int):
     """
     Generates the configurations for two folds of cardinalities n0 and n1 and two
     classes of cardinalities c0 and n0 + n1 - c0

@@ -392,7 +392,7 @@ def multiclass_fold_partitioning_generator_22(n0: int, n1: int, c0: int) -> dict
             1: (c0 - c_00, n1 - c0 + c_00)
         }

-def multiclass_fold_partitioning_generator_2n(n0: int, n1: int, cs: list) -> dict:
+def multiclass_fold_partitioning_generator_2n(n0: int, n1: int, cs: list):
     """
     Generates the configurations for two folds of cardinalities n0 and n1 and a list
     of classes with sizes in cs

@@ -409,13 +409,17 @@ def multiclass_fold_partitioning_generator_2n(n0: int, n1: int, cs: list) -> dic
         if len(cs) == 2:
             yield part
         else:
-            for part_deep in multiclass_fold_partitioning_generator_2n(part[0][1], part[1][1], cs[1:]):
+            for part_deep in multiclass_fold_partitioning_generator_2n(
+                part[0][1],
+                part[1][1],
+                cs[1:]
+            ):
                 yield {
                     0: (part[0][0], *(part_deep[0])),
                     1: (part[1][0], *(part_deep[1]))
                 }

-def multiclass_fold_partitioning_generator_kn(ns: list, cs: list) -> dict:
+def multiclass_fold_partitioning_generator_kn(ns: list, cs: list):
     """
     Generates the configurations for a list of folds of sizes ns and a list
     of classes with sizes in cs
mlscorecheck/check/binary/_check_1_dataset_kfold_som.py

Lines changed: 7 additions & 2 deletions
@@ -5,7 +5,7 @@
 """

 from ...core import NUMERICAL_TOLERANCE
-from ...individual import check_scores_tptn_pairs
+from ...individual import check_scores_tptn_pairs, translate_metadata
 from ...aggregated import Experiment

 __all__ = ["check_1_dataset_kfold_som"]

@@ -32,7 +32,10 @@ def check_1_dataset_kfold_som(
             'f1', 'fm', 'f1n', 'fbp', 'fbn', 'upm', 'gm', 'mk', 'lrp', 'lrn',
             'mcc', 'bm', 'pt', 'dor', 'ji', 'kappa'). When using f-beta
             positive or f-beta negative, also set 'beta_positive' and
-            'beta_negative'.
+            'beta_negative'. Full names in camel case, like
+            'positive_predictive_value', synonyms, like 'true_positive_rate'
+            or 'tpr' instead of 'sens' and complements, like
+            'false_positive_rate' for (1 - 'spec') can also be used.
         eps (float|dict(str,float)): The numerical uncertainty(ies) of the scores.
         numerical_tolerance (float, optional): In practice, beyond the numerical uncertainty of
             the scores, some further tolerance is applied. This

@@ -90,6 +93,8 @@ def check_1_dataset_kfold_som(
     # True

     """
+    folding = translate_metadata(folding)
+
     if folding.get("folds") is None and folding.get("strategy") is None:
         # any folding strategy results the same
         folding = {**folding} | {"strategy": "stratified_sklearn"}

mlscorecheck/check/binary/_check_1_dataset_known_folds_mos.py

Lines changed: 8 additions & 1 deletion
@@ -6,6 +6,7 @@

 from ...core import NUMERICAL_TOLERANCE
 from ...aggregated import check_aggregated_scores, Experiment, Evaluation
+from ...individual import translate_metadata

 __all__ = ["check_1_dataset_known_folds_mos"]

@@ -31,7 +32,10 @@ def check_1_dataset_known_folds_mos(
     The test can only check the consistency of the 'acc', 'sens', 'spec' and 'bacc'
     scores. For a stronger test, one can add ``fold_score_bounds`` when, for example, the minimum
-    and the maximum scores over the folds are also provided.
+    and the maximum scores over the folds are also provided. Full names in camel case, like
+    'positive_predictive_value', synonyms, like 'true_positive_rate'
+    or 'tpr' instead of 'sens' and complements, like
+    'false_positive_rate' for (1 - 'spec') can also be used.

     Args:
         dataset (dict): The dataset specification.

@@ -105,6 +109,9 @@ def check_1_dataset_known_folds_mos(
     # True
     """

+    dataset = translate_metadata(dataset)
+    folding = translate_metadata(folding)
+
     evaluation = Evaluation(
         dataset=dataset,
         folding=folding,

mlscorecheck/check/binary/_check_1_dataset_unknown_folds_mos.py

Lines changed: 8 additions & 1 deletion
@@ -5,6 +5,7 @@

 from ...core import NUMERICAL_TOLERANCE
 from ...aggregated import Dataset, repeated_kfolds_generator, kfolds_generator
+from ...individual import translate_metadata
 from ._check_1_dataset_known_folds_mos import check_1_dataset_known_folds_mos

 __all__ = ["check_1_dataset_unknown_folds_mos", "estimate_n_evaluations"]

@@ -63,7 +64,10 @@ def check_1_dataset_unknown_folds_mos(
     The test can only check the consistency of the 'acc', 'sens', 'spec' and 'bacc'
     scores. For a stronger test, one can add fold_score_bounds when, for example, the minimum and
-    the maximum scores over the folds are also provided.
+    the maximum scores over the folds are also provided. Full names in camel case, like
+    'positive_predictive_value', synonyms, like 'true_positive_rate'
+    or 'tpr' instead of 'sens' and complements, like
+    'false_positive_rate' for (1 - 'spec') can also be used.

     Note that depending on the size of the dataset (especially the number of minority instances)
     and the folding configuration, this test might lead to an untractable number of problems to

@@ -126,6 +130,9 @@ def check_1_dataset_unknown_folds_mos(
     >>> result['inconsistency']
     # True
     """
+    dataset = translate_metadata(dataset)
+    folding = translate_metadata(folding)
+
     evaluation = {
         "dataset": dataset,
         "folding": folding,

mlscorecheck/check/binary/_check_1_testset_no_kfold.py

Lines changed: 8 additions & 2 deletions
@@ -6,7 +6,7 @@
 import warnings

 from ...core import logger, NUMERICAL_TOLERANCE
-from ...individual import check_scores_tptn_pairs
+from ...individual import check_scores_tptn_pairs, translate_metadata
 from ...experiments import dataset_statistics

 __all__ = ["check_1_testset_no_kfold"]

@@ -32,7 +32,11 @@ def check_1_testset_no_kfold(
             'fbp', 'fbn', 'upm', 'gm', 'mk', 'lrp', 'lrn', 'mcc',
             'bm', 'pt', 'dor', 'ji', 'kappa'), when using
             f-beta positive or f-beta negative, also set
-            'beta_positive' and 'beta_negative'.
+            'beta_positive' and 'beta_negative'. Full names in camel case,
+            like 'positive_predictive_value', synonyms, like
+            'true_positive_rate' or 'tpr' instead of 'sens' and
+            complements, like 'false_positive_rate' for (1 - 'spec') can
+            also be used.
         eps (float|dict(str,float)): the numerical uncertainty (potentially for each score)
         numerical_tolerance (float): in practice, beyond the numerical uncertainty of
             the scores, some further tolerance is applied. This is

@@ -90,6 +94,8 @@ def check_1_testset_no_kfold(
         "no aggregation of any kind."
     )

+    testset = translate_metadata(testset)
+
     if ("p" not in testset or "n" not in testset) and ("name" not in testset):
         raise ValueError('either "p" and "n" or "name" should be specified')
mlscorecheck/check/binary/_check_n_datasets_mos_kfold_som.py

Lines changed: 7 additions & 1 deletion
@@ -7,6 +7,7 @@
 import copy

 from ...aggregated import check_aggregated_scores, Experiment
+from ...individual import translate_metadata
 from ...core import NUMERICAL_TOLERANCE

 __all__ = ["check_n_datasets_mos_kfold_som"]

@@ -33,7 +34,10 @@ def check_n_datasets_mos_kfold_som(
     The test can only check the consistency of the 'acc', 'sens', 'spec' and 'bacc'
     scores. For a stronger test, one can add ``dataset_score_bounds`` when, for example, the minimum
-    and the maximum scores over the datasets are also provided.
+    and the maximum scores over the datasets are also provided. Full names in camel case, like
+    'positive_predictive_value', synonyms, like 'true_positive_rate'
+    or 'tpr' instead of 'sens' and complements, like
+    'false_positive_rate' for (1 - 'spec') can also be used.

     Args:
         evaluations (list(dict)): the list of evaluation specifications

@@ -105,6 +109,8 @@ def check_n_datasets_mos_kfold_som(
     # True
     """

+    evaluations = translate_metadata(evaluations)
+
     if any(evaluation.get("aggregation", "som") != "som" for evaluation in evaluations):
         raise ValueError(
             'the aggregation specified in each dataset must be "rom" or nothing.'
