
Commit 63216b4

Merge pull request #31 from ChEB-AI/fix/predict_pipeline
Prediction functional for Graphs
2 parents 301b7c6 + 2406f95

File tree

5 files changed (+90 -15 lines)

README.md

Lines changed: 12 additions & 8 deletions
@@ -73,7 +73,7 @@ The dataset has a customizable list of properties for atoms, bonds and molecules
 The list can be found in the `configs/data/chebi50_graph_properties.yml` file.

 ```bash
-python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/csv_logger.yml --model=../python-chebai-graph/configs/model/gnn_res_gated.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_graph_properties.yml --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce.yml
+python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/csv_logger.yml --model=../python-chebai-graph/configs/model/gnn_res_gated.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_graph_properties.yml --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce_weighted.yml
 ```
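The commands in this commit swap `configs/loss/bce.yml` for `configs/loss/bce_weighted.yml`. The repository's exact config is not shown here, but a class-weighted binary cross-entropy typically just scales the positive term per label; a minimal NumPy sketch (the `pos_weight` values below are illustrative, not taken from the repo):

```python
import numpy as np

def weighted_bce(probs: np.ndarray, targets: np.ndarray, pos_weight: np.ndarray) -> float:
    """Binary cross-entropy with a per-label weight on the positive term."""
    eps = 1e-12  # guard against log(0)
    pos = pos_weight * targets * np.log(probs + eps)
    neg = (1.0 - targets) * np.log(1.0 - probs + eps)
    return float(-(pos + neg).mean())

probs = np.array([0.9, 0.2, 0.7])    # predicted probabilities
targets = np.array([1.0, 0.0, 1.0])  # multi-label targets
# up-weight rare positive labels, e.g. weight 3 on the last label
loss = weighted_bce(probs, targets, pos_weight=np.array([1.0, 1.0, 3.0]))
```

With `pos_weight` all ones this reduces to plain BCE; larger weights penalize missed positives on rare classes more heavily, which is the usual motivation for a weighted criterion on imbalanced ChEBI labels.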

 ## Augmented Graphs
@@ -94,7 +94,7 @@ Among all the connection schemes we evaluated, this configuration delivered the
 Below is the command for the model and data configuration that achieved the best classification performance using augmented graphs.

 ```bash
-python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/wandb_logger.yml --model=../python-chebai-graph/configs/model/gat_aug_amgpool.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --model.config.v2=True --data=../python-chebai-graph/configs/data/chebi50_aug_prop_as_per_node.yml --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce.yml --trainer.logger.init_args.name=gatv2_amg_s0
+python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/wandb_logger.yml --model=../python-chebai-graph/configs/model/gat_aug_amgpool.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_aug_prop_as_per_node.yml --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce_weighted.yml --trainer.logger.init_args.name=gatv2_amg_s0
 ```

 ### Model Hyperparameters
@@ -104,7 +104,7 @@ python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.lo
 To use a GAT-based model, choose **one** of the following configs:

 - **Standard Pooling**: `--model=../python-chebai-graph/configs/model/gat.yml`
-  > Standard pooling sums the learned representations from all the nodes to produce a single representation which is used for classification.
+  > Standard pooling sums the learned representations from all the nodes to produce a single representation which is used for classification.
 - **Atom-Augmented Node Pooling**: `--model=../python-chebai-graph/configs/model/gat_aug_aagpool.yml`
   > With this pooling strategy, the learned representations are first separated into **two distinct sets**: those from atom nodes and those from all artificial nodes (both functional groups and the graph node). The representations within each set are aggregated separately (using summation) to yield two distinct single vectors. These two resulting vectors are then concatenated before being passed to the classification layer.
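The atom-augmented pooling described above can be sketched in plain NumPy: split the node embeddings by an atom/artificial mask, sum each set, and concatenate the two vectors (array shapes are illustrative, not the repo's actual tensors):

```python
import numpy as np

def atom_augmented_pool(node_repr: np.ndarray, is_atom_node: np.ndarray) -> np.ndarray:
    """Sum atom-node and artificial-node representations separately, then concatenate."""
    atom_vec = node_repr[is_atom_node].sum(axis=0)         # atoms only
    artificial_vec = node_repr[~is_atom_node].sum(axis=0)  # functional groups + graph node
    return np.concatenate([atom_vec, artificial_vec])      # shape: (2 * hidden_dim,)

node_repr = np.random.randn(6, 8)  # 6 nodes, hidden_dim = 8
is_atom_node = np.array([True, True, True, True, False, False])
pooled = atom_augmented_pool(node_repr, is_atom_node)  # shape (16,)
```

The concatenated vector is twice the hidden size, so the classification layer after this pooling must accept `2 * hidden_dim` inputs rather than `hidden_dim`.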
@@ -117,9 +117,13 @@ To use a GAT-based model, choose **one** of the following configs:
 - **Number of message-passing layers**: `--model.config.num_layers=5` (default: 4)
 - **Attention heads**: `--model.config.heads=4` (default: 8)
   > **Note**: The number of heads should be divisible by the output channels (or hidden channels if output channels are not specified).
-- **Use GATv2**: `--model.config.v2=True` (default: False)
-  > **Note**: GATv2 addresses the limitation of static attention in GAT by introducing a dynamic attention mechanism. For further details, please refer to the [original GATv2 paper](https://arxiv.org/abs/2105.14491).
-
+
+- **To use different GAT versions**:
+  - **Use GAT**: `--model.config.v2=False`
+
+  - **Use GATv2**: `--model.config.v2=True` (__default__)
+    > **Note**: GATv2 addresses the limitation of static attention in GAT by introducing a dynamic attention mechanism. For further details, please refer to the [original GATv2 paper](https://arxiv.org/abs/2105.14491).
+
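The static-vs-dynamic distinction behind the `v2` flag: GAT applies the learned attention vector `a` before the nonlinearity, which makes the ranking of neighbours effectively independent of the query node, while GATv2 applies `a` after the nonlinearity, making attention query-dependent. A schematic single-head NumPy comparison of the two scoring rules (toy dimensions; this is not the library implementation):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
h_i, h_j = rng.normal(size=4), rng.normal(size=4)  # features of two nodes

# GAT (static attention): nonlinearity applied after the dot product with `a`
W = rng.normal(size=(4, 4))   # shared linear map applied to each node
a = rng.normal(size=8)        # attention vector over [Wh_i || Wh_j]
e_gat = leaky_relu(a @ np.concatenate([W @ h_i, W @ h_j]))

# GATv2 (dynamic attention): `a` applied after the nonlinearity
W_v2 = rng.normal(size=(8, 8))  # linear map on the concatenated pair
e_gatv2 = a @ leaky_relu(W_v2 @ np.concatenate([h_i, h_j]))
```

Both rules produce one scalar score per edge; softmax over a node's neighbours then yields the attention coefficients.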
 #### **ResGated Architecture**

 To use a ResGated GNN model, choose **one** of the following configs:
@@ -142,7 +146,7 @@ These can be used for both GAT and ResGated architectures:
 In this type of node initialization, the node features (and/or edge features) of the given molecular graph are initialized only once during dataset creation with the given initialization scheme.

 ```bash
-python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/wandb_logger.yml --model=../python-chebai-graph/configs/model/resgated.yml --model.config.in_channels=203 --model.config.edge_dim=11 --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_graph_properties.yml --data.pad_node_features=45 --data.pad_edge_features=4 --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --data.init_args.persistent_workers=False --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce.yml --trainer.logger.init_args.name=gni_res_props+zeros_s0
+python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/wandb_logger.yml --model=../python-chebai-graph/configs/model/resgated.yml --model.config.in_channels=203 --model.config.edge_dim=11 --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_graph_properties.yml --data.pad_node_features=45 --data.pad_edge_features=4 --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --data.init_args.persistent_workers=False --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce_weighted.yml --trainer.logger.init_args.name=gni_res_props+zeros_s0
 ```

 In the above command, for each node we use the 158 node features (corresponding to the node properties defined in `chebi50_graph_properties.yml`) which are retrieved from RDKit, plus 45 additional features (specified by `--data.pad_node_features=45`) drawn from a normal distribution (default).
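The static padding described above (158 RDKit-derived features plus 45 random ones, giving `in_channels=203`) amounts to a one-time augmentation at dataset creation. A NumPy sketch where the numbers mirror the command but the function name is ours:

```python
import numpy as np

def pad_node_features(features: np.ndarray, pad: int, rng=None) -> np.ndarray:
    """Append `pad` extra features per node, drawn once from a standard normal."""
    rng = rng or np.random.default_rng()
    extra = rng.normal(size=(features.shape[0], pad))
    return np.concatenate([features, extra], axis=1)

atoms = np.zeros((12, 158))           # 12 atoms, 158 RDKit property features
x = pad_node_features(atoms, pad=45)  # shape (12, 203), matching in_channels=203
```

Because the random columns are drawn once and stored with the dataset, every epoch sees the same values; that is what distinguishes this static scheme from the dynamic variant below.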
@@ -184,5 +188,5 @@ If all features should be initialized from the given distribution, remove the co
 Please find below the command for a typical dynamic node initialization:

 ```bash
-python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/wandb_logger.yml --model=../python-chebai-graph/configs/model/resgated_dynamic_gni.yml --model.config.in_channels=203 --model.config.edge_dim=11 --model.config.complete_randomness=False --model.config.pad_node_features=45 --model.config.pad_edge_features=4 --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_graph_properties.yml --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --data.init_args.persistent_workers=False --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce.yml --trainer.logger.init_args.name=gni_dres_props+rand_s0
+python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/wandb_logger.yml --model=../python-chebai-graph/configs/model/resgated_dynamic_gni.yml --model.config.in_channels=203 --model.config.edge_dim=11 --model.config.complete_randomness=False --model.config.pad_node_features=45 --model.config.pad_edge_features=4 --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_graph_properties.yml --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --data.init_args.persistent_workers=False --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce_weighted.yml --trainer.logger.init_args.name=gni_dres_props+rand_s0
 ```
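Dynamic node initialization differs from the static variant in *when* the random features are drawn: instead of once at dataset creation, they are re-sampled in the model, so each pass sees fresh values. A rough sketch of that re-sampling step, with `complete_randomness=False` keeping the RDKit-derived part fixed (function and argument names here are illustrative, not the repo's API):

```python
import numpy as np

def dynamic_pad(features, pad_node_features, complete_randomness, rng):
    """Re-draw random features on every call (e.g. once per forward pass)."""
    n = features.shape[0]
    if complete_randomness:
        # replace *all* features with fresh random values
        return rng.normal(size=(n, features.shape[1] + pad_node_features))
    # keep the RDKit-derived part, re-sample only the padding
    return np.concatenate([features, rng.normal(size=(n, pad_node_features))], axis=1)

rng = np.random.default_rng(0)
feats = np.ones((5, 158))
a = dynamic_pad(feats, 45, False, rng)
b = dynamic_pad(feats, 45, False, rng)  # the padded columns differ between calls
```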

chebai_graph/preprocessing/datasets/chebi.py

Lines changed: 75 additions & 2 deletions
@@ -77,7 +77,7 @@ def __init__(
             properties = self._sort_properties(properties)
         else:
             properties = []
-        self.properties = properties
+        self.properties: list[MolecularProperty] = properties
         assert isinstance(self.properties, list) and all(
             isinstance(p, MolecularProperty) for p in self.properties
         )
@@ -184,6 +184,54 @@ def _after_setup(self, **kwargs) -> None:
         self._setup_properties()
         super()._after_setup(**kwargs)

+    def _preprocess_smiles_for_pred(
+        self, idx, smiles: str, model_hparams: Optional[dict] = None
+    ) -> dict:
+        """Preprocess prediction data."""
+        # Add dummy labels because the collate function requires them.
+        # Note: If labels are set to `None`, the collator will insert a `non_null_labels` entry into `loss_kwargs`,
+        # which later causes `_get_prediction_and_labels` method in the prediction pipeline to treat the data as empty.
+        result = self.reader.to_data(
+            {"id": f"smiles_{idx}", "features": smiles, "labels": [1, 2]}
+        )
+        if result is None or result["features"] is None:
+            return None
+        for property in self.properties:
+            property.encoder.eval = True
+            property_value = self.reader.read_property(smiles, property)
+            if property_value is None or len(property_value) == 0:
+                encoded_value = None
+            else:
+                encoded_value = torch.stack(
+                    [property.encoder.encode(v) for v in property_value]
+                )
+                if len(encoded_value.shape) == 3:
+                    encoded_value = encoded_value.squeeze(0)
+            result[property.name] = encoded_value
+
+        result["features"] = self._prediction_merge_props_into_base_wrapper(
+            result, model_hparams
+        )
+
+        # apply transformation, e.g. masking for pretraining task
+        if self.transform is not None:
+            result["features"] = self.transform(result["features"])
+
+        return result
+
+    def _prediction_merge_props_into_base_wrapper(
+        self, row: pd.Series | dict, model_hparams: Optional[dict] = None
+    ) -> GeomData:
+        """
+        Wrapper to merge properties into base features for prediction.
+
+        Args:
+            row: A dictionary or pd.Series containing 'features' and encoded properties.
+        Returns:
+            A GeomData object with merged features.
+        """
+        return self._merge_props_into_base(row)
+

 class GraphPropertiesMixIn(DataPropertiesSetter, ABC):
     def __init__(
@@ -220,7 +268,7 @@ def __init__(
             f"Data module uses these properties (ordered): {', '.join([str(p) for p in self.properties])}"
         )

-    def _merge_props_into_base(self, row: pd.Series) -> GeomData:
+    def _merge_props_into_base(self, row: pd.Series | dict) -> GeomData:
         """
         Merge encoded molecular properties into the GeomData object.
@@ -488,6 +536,8 @@ def _merge_props_into_base(
             A GeomData object with merged features.
         """
         geom_data = row["features"]
+        if geom_data is None:
+            return None
         assert isinstance(geom_data, GeomData)

         is_atom_node = geom_data.is_atom_node
@@ -571,6 +621,29 @@ def _merge_props_into_base(
             is_graph_node=is_graph_node,
         )

+    def _prediction_merge_props_into_base_wrapper(
+        self, row: pd.Series | dict, model_hparams: Optional[dict] = None
+    ) -> GeomData:
+        """
+        Wrapper to merge properties into base features for prediction.
+
+        Args:
+            row: A dictionary or pd.Series containing 'features' and encoded properties.
+        Returns:
+            A GeomData object with merged features.
+        """
+        if (
+            model_hparams is None
+            or "in_channels" not in model_hparams["config"]
+            or model_hparams["config"]["in_channels"] is None
+        ):
+            raise ValueError(
+                f"model_hparams must be provided for data class: {self.__class__.__name__}"
+                f" which should contain 'in_channels' key with valid value in 'config' dictionary."
+            )
+        max_len_node_properties = int(model_hparams["config"]["in_channels"])
+        return self._merge_props_into_base(row, max_len_node_properties)
+

 class ChEBI50_StaticGNI(DataPropertiesSetter, ChEBIOver50):
     READER = RandomFeatureInitializationReader

configs/model/gat.yml

Lines changed: 1 addition & 2 deletions
@@ -9,7 +9,6 @@ init_args:
   num_layers: 4
   edge_dim: 7 # number of bond properties
   heads: 8 # the number of heads should be divisible by output channels (hidden channels if output channel not given)
-  v2: False # set True to use `torch_geometric.nn.conv.GATv2Conv` convolution layers, default is GATConv
-  dropout: 0
+  v2: True # This uses `torch_geometric.nn.conv.GATv2Conv` convolution layers, set False to use `GATConv`
   n_molecule_properties: 0
   n_linear_layers: 1

configs/model/gat_aug_aapool.yml

Lines changed: 1 addition & 2 deletions
@@ -9,7 +9,6 @@ init_args:
   num_layers: 4
   edge_dim: 11 # number of bond properties
   heads: 8 # the number of heads should be divisible by output channels (hidden channels if output channel not given)
-  v2: False # set True to use `torch_geometric.nn.conv.GATv2Conv` convolution layers, default is GATConv
-  dropout: 0
+  v2: True # This uses `torch_geometric.nn.conv.GATv2Conv` convolution layers, set False to use `GATConv`
  n_molecule_properties: 0
  n_linear_layers: 1

configs/model/gat_aug_amgpool.yml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ init_args:
   num_layers: 4
   edge_dim: 11 # number of bond properties
   heads: 8 # the number of heads should be divisible by output channels (hidden channels if output channel not given)
-  v2: True # set True to use `torch_geometric.nn.conv.GATv2Conv` convolution layers, default is GATConv
+  v2: True # This uses `torch_geometric.nn.conv.GATv2Conv` convolution layers, set False to use `GATConv`
   dropout: 0
   n_molecule_properties: 0
   n_linear_layers: 1

0 commit comments