In Gpredomics, an Individual represents a unique predictive model, defined by a specific subset of features, their coefficients, and associated parameters. Individuals are the fundamental units manipulated by algorithms such as genetic algorithms, beam search, and ant colony optimization (ACO).
Currently, each model is a binary classification model, dependent on a threshold. Gpredomics is specifically designed for interpretable and easy-to-use models.
Let's take the example of a fictitious target disease called Lightheadedness, which causes patients' heads to transform into giant light bulbs. The disease is mainly associated with three features called L, U and X. A Gpredomics equation characterising this relationship could therefore be:
L + U + X ≥ threshold
This equation is quite simple: if the sum of L, U and X is equal to or greater than the given threshold, the sample is classified as having Lightheadedness, otherwise it is classified as healthy.
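As a minimal sketch (the function name is hypothetical, not the Gpredomics API), this decision rule can be written as:

```rust
// Sketch of the decision rule above: a sample is classified as positive
// (Lightheadedness) when the sum of its selected feature values reaches
// the threshold. `selected` holds the values of L, U and X.
fn predict_binary(selected: &[f64], threshold: f64) -> bool {
    let score: f64 = selected.iter().sum();
    score >= threshold
}
```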
Legacy Predomics supports three model languages (i.e. model formula types), collectively called BTR. Each one is designed to describe a particular ecological relation:

| Language | Ecological relation | Coefficients | Example |
|---|---|---|---|
| Binary | Cooperation / cumulative effect | {0, 1} | L + U + X ≥ threshold |
| Ternary | Competition / predation | {-1, 0, 1} | L + U + X - (D + A + R + K) ≥ threshold |
| Ratio | Imbalance | {-1, 0, 1} | (L + U + X) / (D + A + R + K + ε) ≥ threshold |
Gpredomics introduces a new language called Pow2, a variant of Ternary in which each coefficient is a signed power of two:

| Language | Coefficients | Example |
|---|---|---|
| Pow2 | {-64, -32, -16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16, 32, 64} | L + U + 2 X - (64 D + 4 A + 2 R + K) ≥ threshold |
See Predomics' original paper for more details on BTR languages.
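The languages above can be sketched as two scoring functions (an illustrative reconstruction, not the actual Gpredomics implementation): for bin/ter/pow2 the score is the coefficient-weighted sum of feature values, and for ratio it is the sum of positively-signed features divided by the sum of negatively-signed features plus ε.

```rust
use std::collections::BTreeMap;

// Score for the bin/ter/pow2 languages: coefficient-weighted sum.
fn score_linear(features: &BTreeMap<usize, i8>, sample: &[f64]) -> f64 {
    features
        .iter()
        .map(|(&i, &coef)| coef as f64 * sample[i])
        .sum()
}

// Score for the ratio language: positive-side sum over negative-side
// sum plus epsilon (coefficient magnitudes are ignored, since ratio
// coefficients are restricted to {-1, 0, 1}).
fn score_ratio(features: &BTreeMap<usize, i8>, sample: &[f64], epsilon: f64) -> f64 {
    let pos: f64 = features
        .iter()
        .filter(|(_, &c)| c > 0)
        .map(|(&i, _)| sample[i])
        .sum();
    let neg: f64 = features
        .iter()
        .filter(|(_, &c)| c < 0)
        .map(|(&i, _)| sample[i])
        .sum();
    pos / (neg + epsilon)
}
```

A model then classifies a sample by comparing the score against its threshold, as in the examples above.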
Each model is additionally characterized by a data type:
| Data type | Description |
|---|---|
| raw | Raw data |
| prev | Binarized data |
| log | Natural log of data |
| zscore | Standardized (z-score) data |

Here, ε is set with `datatype_epsilon` in the YAML configuration. For the zscore data type, each feature is standardized using training-data statistics (mean and standard deviation) before the model equation is applied.
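As a sketch of the zscore data type (hypothetical helper names, assuming population statistics), standardization computes mean and standard deviation on the training data and reuses them to transform values:

```rust
// Compute the training-set mean and (population) standard deviation
// for one feature's values.
fn train_stats(values: &[f64]) -> (f64, f64) {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

// Standardize one value with the training statistics.
fn standardize(x: f64, mean: f64, sd: f64) -> f64 {
    (x - mean) / sd
}
```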
An Individual contains the model definition, a universal fitness score, and nested classification metrics.
| Field | Type | Description |
|---|---|---|
| `features` | `BTreeMap<usize, i8>` | Feature indices mapped to their coefficients. Uses `BTreeMap` (not `HashMap`) for deterministic iteration order. |
| `k` | `usize` | Number of variables used in the model. |
| `language` | `u8` | Language of the model (bin, ter, ratio, pow2). |
| `data_type` | `u8` | Data type of the model (raw, prev, log, zscore). |
| `epsilon` | `f64` | Epsilon value used during score calculation. |
| `fit` | `f64` | Universal penalized objective (see Fitness below). |
| `cls` | `ClassificationMetrics` | Nested classification metrics (see below). |
| `epoch` | `usize` | Iteration that produced this model. |
| `parents` | `Option<Vec<u64>>` | Parent hashes in the generational context. |
| `hash` | `u64` | Identifier hash of the model. |
| `betas` | `Option<Betas>` | Beta coefficients (MCMC individuals only). |
Note on `features`: prior to v0.8.3, `features` was a `HashMap<usize, i8>`. It is now a `BTreeMap<usize, i8>` so that feature iteration order is deterministic across runs and platforms, which matters for reproducibility of serialised models and score computation.
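The determinism point can be seen directly: `BTreeMap` iterates in key order, so the same model always yields the same feature sequence (and therefore the same serialisation and hash), whereas `HashMap` iteration order can vary between runs.

```rust
use std::collections::BTreeMap;

// Collect a model's features in iteration order. With BTreeMap this
// is always ascending feature-index order, regardless of insertion
// order, run, or platform.
fn ordered_features(features: &BTreeMap<usize, i8>) -> Vec<(usize, i8)> {
    features.iter().map(|(&i, &c)| (i, c)).collect()
}
```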
The ClassificationMetrics struct groups all metrics that are specific to binary classification. It is stored in Individual.cls.
| Field | Type | Description |
|---|---|---|
| `auc` | `f64` | Area Under the ROC Curve on the training set. |
| `threshold` | `f64` | Decision threshold used for binary classification. |
| `threshold_ci` | `Option<ThresholdCI>` | Confidence interval for the threshold and rejection rate. |
| `sensitivity` | `f64` | True Positive Rate on the training set. |
| `specificity` | `f64` | True Negative Rate on the training set. |
| `accuracy` | `f64` | Accuracy on the training set. |
| `additional` | `AdditionalMetrics` | Additional derived metrics (see below). |
The AdditionalMetrics struct holds extra classification metrics that are computed on demand. It is stored in Individual.cls.additional.
| Field | Type | Description |
|---|---|---|
| `mcc` | `Option<f64>` | Matthews Correlation Coefficient. |
| `f1_score` | `Option<f64>` | Harmonic mean of Precision and Sensitivity. |
| `npv` | `Option<f64>` | Negative Predictive Value — TN / (TN + FN). |
| `ppv` | `Option<f64>` | Positive Predictive Value — TP / (TP + FP). |
| `g_mean` | `Option<f64>` | Geometric mean of Sensitivity and Specificity. |
```rust
// Universal objective (always on Individual)
individual.fit

// Core classification metrics
individual.cls.auc
individual.cls.threshold
individual.cls.sensitivity
individual.cls.specificity
individual.cls.accuracy
individual.cls.threshold_ci

// Additional metrics (Option values)
individual.cls.additional.mcc
individual.cls.additional.f1_score
individual.cls.additional.npv
individual.cls.additional.ppv
individual.cls.additional.g_mean
```

To compute the best threshold for a given model and dataset, Gpredomics first computes the model's ROC curve. It then selects the best threshold according to the targeted fit metric:
- for AUC, the best threshold is the one maximizing the Youden index: Specificity + Sensitivity - 1;
- for an asymmetric target (specificity vs sensitivity), the best threshold is the one maximizing (target + antagonist × *fr_penalty*) / (1 + *fr_penalty*);
- for other metrics, the best threshold is the one maximizing the targeted metric directly.
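The Youden-index case can be sketched as an exhaustive search over candidate thresholds (an illustrative reconstruction, not the actual Gpredomics code):

```rust
// Try each observed score as a candidate threshold and keep the one
// maximizing the Youden index: sensitivity + specificity - 1.
// `labels[i]` is true for a positive sample with model score `scores[i]`.
fn best_youden_threshold(scores: &[f64], labels: &[bool]) -> f64 {
    let mut best = (f64::NEG_INFINITY, 0.0);
    for &t in scores {
        let (mut tp, mut fne, mut tn, mut fp) = (0.0, 0.0, 0.0, 0.0);
        for (&s, &y) in scores.iter().zip(labels) {
            match (y, s >= t) {
                (true, true) => tp += 1.0,
                (true, false) => fne += 1.0,
                (false, false) => tn += 1.0,
                (false, true) => fp += 1.0,
            }
        }
        let youden = tp / (tp + fne) + tn / (tn + fp) - 1.0;
        if youden > best.0 {
            best = (youden, t);
        }
    }
    best.1 // threshold with the highest Youden index
}
```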
As medical applications require decisive models, Gpredomics can, since v0.7.3, surround the threshold with an empirical confidence interval. For more details, consult the associated documentation: rejection.md.
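A sketch of the resulting rejection behaviour (hypothetical helper names; see rejection.md for the actual semantics): samples whose score falls inside the threshold confidence interval are rejected rather than classified.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Positive,
    Negative,
    Rejected,
}

// Classify a score against a threshold confidence interval [low, high]:
// scores inside the interval are too close to the threshold to call.
fn classify_with_ci(score: f64, low: f64, high: f64) -> Decision {
    if score < low {
        Decision::Negative
    } else if score > high {
        Decision::Positive
    } else {
        Decision::Rejected
    }
}
```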
To assess how well a model fits the data, an individual is also characterised by a fitness metric, defined as a performance metric minus penalties.
Gpredomics allows several fit metrics:
| Metric | Description | Formula |
|---|---|---|
| `auc` | Area Under the Curve | Calculated from the ROC curve |
| `sensitivity` | True Positive Rate | TP / (TP + FN) |
| `specificity` | True Negative Rate | TN / (TN + FP) |
| `g_mean` | Geometric mean, balance between Sens. & Spec. | √(Sensitivity × Specificity) |
| `mcc` | Matthews Correlation Coefficient | (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) |
| `f1_score` | Harmonic mean of Prec. & Sens. | 2 × (Precision × Sensitivity) / (Precision + Sensitivity) |
Where:
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- Precision = TP / (TP + FP)
This diversity of metrics allows for varied exploration of the search space. Certain metrics, notably MCC and G-mean, are preferable in the case of unbalanced datasets.
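The metrics in the table follow directly from the confusion-matrix counts; the helper names below are illustrative, not the Gpredomics API:

```rust
// Geometric mean of sensitivity and specificity.
fn g_mean(tp: f64, fne: f64, tn: f64, fp: f64) -> f64 {
    let sens = tp / (tp + fne);
    let spec = tn / (tn + fp);
    (sens * spec).sqrt()
}

// Matthews Correlation Coefficient.
fn mcc(tp: f64, fne: f64, tn: f64, fp: f64) -> f64 {
    (tp * tn - fp * fne) / ((tp + fp) * (tp + fne) * (tn + fp) * (tn + fne)).sqrt()
}

// F1 score: harmonic mean of precision and sensitivity.
fn f1(tp: f64, fne: f64, fp: f64) -> f64 {
    let prec = tp / (tp + fp);
    let sens = tp / (tp + fne);
    2.0 * prec * sens / (prec + sens)
}
```

All three reach 1.0 for a perfect classifier, which makes them interchangeable as maximization targets while weighting errors differently.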
Several penalties can be applied to the fit to take specific constraints into account during model selection at each iteration.

| Param | Description | Equation | When to use it |
|---|---|---|---|
| `k_penalty` | Penalty proportional to the number of variables used in the model. | `fit -= k * k_penalty` | To favor simpler models and avoid overfitting |
| `bias_penalty` | Penalizes models with specificity/sensitivity < 0.5. | `fit -= (1.0 - bad_metrics) * bias_penalty` | To avoid unbalanced models (very low specificity/sensitivity) |
| `threshold_ci_penalty` | Penalizes according to the rejection rate if a threshold confidence interval exists. | `fit -= rejection_rate * threshold_ci_penalty` | To limit the rejection rate or control threshold uncertainty |
| `overfit_penalty` | Penalizes overfitting during cross-validation. | `fit -= mean(fit on k-1 folds - abs(delta with last fold)) * overfit_penalty` | To avoid overfitting detected during cross-validation |
| `feature_penalty` | Feature-specific penalty, defined in annotations. | `fit -= Σ(feature_penalty)` | To penalize the use of specific variables according to their annotation |
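A minimal sketch of how such terms combine, shown for two of them (the exact combination in Gpredomics may differ; names mirror the table above):

```rust
// Penalized fit: start from the raw performance metric and subtract
// the configured penalty terms, here k_penalty and threshold_ci_penalty.
fn penalized_fit(
    metric: f64,
    k: usize,
    k_penalty: f64,
    rejection_rate: f64,
    threshold_ci_penalty: f64,
) -> f64 {
    metric - k as f64 * k_penalty - rejection_rate * threshold_ci_penalty
}
```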
Beyond global statistics, Gpredomics can evaluate the importance of variables within a specific individual model. This is achieved using a Mean Decrease Accuracy (MDA) approach via permutations.
For each feature present in the individual, the algorithm randomly shuffles the feature's values across samples and measures the resulting decrease in model performance; the larger the decrease, the more important the feature.
Please note that this approach evaluates the importance of each feature individually; it does not evaluate a feature's importance in conjunction with other features. In a metagenomic context, it can therefore be used to isolate features whose presence or absence is decisive on its own, but it cannot capture group effects.
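The permutation step can be sketched as follows (an illustrative single-feature example with hypothetical helpers; a deterministic reversal stands in for the random permutation so the example stays reproducible):

```rust
// Fraction of predictions matching the labels.
fn accuracy(preds: &[bool], labels: &[bool]) -> f64 {
    let correct = preds.iter().zip(labels).filter(|(p, y)| p == y).count();
    correct as f64 / labels.len() as f64
}

// Mean Decrease Accuracy for one feature: compare accuracy before and
// after permuting the feature's values across samples.
fn mda_single_feature(feature: &[f64], labels: &[bool], threshold: f64) -> f64 {
    let base: Vec<bool> = feature.iter().map(|&x| x >= threshold).collect();
    let mut permuted = feature.to_vec();
    permuted.reverse(); // stand-in for a random permutation
    let perm: Vec<bool> = permuted.iter().map(|&x| x >= threshold).collect();
    accuracy(&base, labels) - accuracy(&perm, labels)
}
```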
Last updated: v0.8.3