Conversation
```python
loss = np.array(res_ens[4])

if self.n_ensemble == 1:
    raise Warning("The model can't be fit with n_ensemble = 1")
```
@jpaillard Can you check whether it's correct that, without multiple ensembles, there is no fitting?
I think there is a fitting, as suggested by the call to joblib_ensemble_dnnet (by the way, should this be a private function? Should we integrate it into the same module for clarity?)
The following lines select the n_ensemble best models (which is useless when n_ensemble == 1).
I would suggest a function _keep_n_ensemble, called at line 243 and gathering the code up to line 261, to manage this distinction.
See issue #75. This will be modified in the future.
```python
y = y.reshape(-1, 1)
if self.problem_type == "regression":
    list_y.append(y)
# Encoding the target with the ordinal case
```
@jpaillard @bthirion
Can you tell me, if you know, what the "ordinal method" is?
If yes, do you think it's worth keeping? (The function was only half implemented.)
If it's worth keeping, can you check whether my modification is correct?
It stands for regression problems where the ordering of the values matters, but not the values themselves. Usually the values are discretized. I propose to keep it for the moment.
This will be addressed in the future. #76
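For reference, a common way to encode an ordinal target is cumulative ("thermometer") encoding; this is a sketch of the general idea, not necessarily what the half-implemented function did:

```python
import numpy as np


def ordinal_encode(y):
    """Cumulative ("thermometer") encoding for ordinal targets.

    A class of rank k among K ordered classes maps to a vector whose
    first k entries are 1. This is one standard ordinal encoding; whether
    it matches the code under review is an assumption.
    """
    classes = np.sort(np.unique(y))
    ranks = np.searchsorted(classes, y)
    # Column j is 1 when the sample's rank is strictly greater than j.
    return (ranks[:, None] > np.arange(len(classes) - 1)[None, :]).astype(float)
```

With three ordered classes, the targets 0, 1, 2 become [0, 0], [1, 0], [1, 1], so the ordering is preserved in the representation.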
```python
if samples_in_leaf.size > 0:
    leaf_samples.append(
        y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
    )
```
@jpaillard @bthirion
I modified the function to handle the case where the leaf samples are empty.
However, I don't know what the function was doing.
Can you validate that this is the correct way to do it?
This should never be empty by construction (random forests represent the samples in a tree structure). By default, there is a minimum number of samples in each leaf.
The usage of samples_in_leaf is explained in #42.
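The guarantee mentioned here can be checked directly with scikit-learn; a small sketch (the data and parameter values are made up, and `bootstrap=False` is used so the trees train on exactly these samples):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

# With min_samples_leaf=2, every leaf of every tree holds at least two
# training samples, so a per-leaf sample lookup can never be empty.
forest = RandomForestRegressor(
    n_estimators=5, min_samples_leaf=2, bootstrap=False, random_state=0
)
forest.fit(X, y)
for tree in forest.estimators_:
    # apply() returns the leaf index reached by each training sample.
    _, counts = np.unique(tree.apply(X), return_counts=True)
    assert counts.min() >= 2
```

This supports dropping the `samples_in_leaf.size > 0` guard rather than keeping dead-branch handling.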
```python
assert not np.all(predict_prob[:, 0] == 0)
assert not np.all(predict_prob[:, 1] == 0)
# Check if the predicted probabilities are not all ones for each class
assert not np.all(predict_prob[:, 0] == 1)
```
@jpaillard
Can you check whether there are too many assertions and whether I missed any?
There are probably enough ;-)
We should just make sure they're not redundant.
Advice for new tests is addressed in issue #79.
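One way to keep the assertions non-redundant is to test the defining properties of a probability matrix once; a sketch with a stand-in array (the values are hypothetical, `predict_prob` stands in for the learner's output):

```python
import numpy as np

# Stand-in for learner.predict_proba(X) on a binary task (made-up values).
predict_prob = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])

# Each row must be a probability distribution over the classes.
assert np.all((predict_prob >= 0) & (predict_prob <= 1))
assert np.allclose(predict_prob.sum(axis=1), 1.0)

# Non-degeneracy: column 0 is neither constantly 0 nor constantly 1.
# Given rows summing to 1, this also covers column 1 in the binary case,
# which is where the redundancy in the original four assertions comes from.
assert not np.all(predict_prob[:, 0] == 0)
assert not np.all(predict_prob[:, 0] == 1)
```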
```python
learner = DnnLearnerSingle(do_hypertuning=True, problem_type="ordinal", n_jobs=10, verbose=0)
learner.fit(X, y)
predict_prob = learner.predict_proba(X)[:, 0]
# Check if the predicted class labels match the true labels for at least one instance
```
@jpaillard @bthirion
Can you help me define some tests for this method?
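One property-based test that could work here, assuming the ordinal head outputs cumulative probabilities P(y > k) (an assumption about the interface, not something the code confirms), is to check monotonicity across columns:

```python
import numpy as np

# Hypothetical cumulative output of an ordinal model: column k holds
# P(y > k). For a valid cumulative encoding these probabilities must be
# non-increasing across columns for every sample.
cum_prob = np.array([[0.9, 0.6, 0.2], [0.7, 0.5, 0.1]])

assert np.all(np.diff(cum_prob, axis=1) <= 0)
assert np.all((cum_prob >= 0) & (cum_prob <= 1))
```

Unlike value-based checks, this holds for any correctly implemented ordinal model, so it does not depend on a particular dataset.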
```python
# Check if the feature importances are not all close to zero
assert not np.allclose(learner.feature_importances_, 0)
# Check if the feature importances are not all close to one
assert not np.allclose(learner.feature_importances_, 1)
```
@jpaillard
Can you check whether there are too many assertions and whether I missed any?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff            @@
##             main      #57       +/- ##
==========================================
+ Coverage   77.09%   94.32%   +17.23%
==========================================
  Files          46       52        +6
  Lines        2471     2608      +137
==========================================
+ Hits         1905     2460      +555
+ Misses        566      148      -418
```

☔ View full report in Codecov by Sentry.
bthirion
left a comment
Thx for opening this! I have a bunch of comments.
```python
leaf_samples.append(
    y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
)
if samples_in_leaf.size > 0:
```
You can remove the condition here too.
```python
    Data matrix
y : np.array
    Target vector
grps : np.array
```
Suggested change:

```diff
-grps : np.array
+groups : np.array
```
```python
self.loss = 0

def forward(self, x):
    if self.group_stacking:
```
This will be addressed by the issue #81.
```python
x = torch.cat(list_stacking, dim=1)
return self.layers(x)

def training_step(self, batch, device, problem_type):
```
This will be addressed by the issue #81.
```python
loss = F.binary_cross_entropy_with_logits(y_pred, y)
return loss

def validation_step(self, batch, device, problem_type):
```
This will be addressed by the issue #81.
```python
    "batch_size": len(X),
}

def validation_epoch_end(self, outputs, problem_type):
```
```python
print("Epoch [{}], val_mse: {:.4f}".format(epoch + 1, result["val_mse"]))

def _evaluate(model, loader, device, problem_type):
```
This will be addressed by the issue #81.
```python
loss = np.array(res_ens[4])

if self.n_ensemble == 1:
    raise Warning("The model can't be fit with n_ensemble = 1")
```
As discussed with you: as a user, I don't like integrating the ensembling (n_ensemble) and the hyper-parameter tuning (do_hypertuning, dict_hypertuning) in a single class, which becomes huge.
Also, I think other libraries (sklearn for ensembling, optuna for hyper-parameters) offer more and better options for these advanced training strategies.
I suggest separating these aspects from the DNN_learner class and leaving it up to the user to optimize the training separately from hidimstat.
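A sketch of this separation using only scikit-learn (GridSearchCV stands in for optuna here, and the MLPRegressor base model is just an illustration, not hidimstat's DNN; all names and parameter values are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# Hyper-parameter tuning handled outside the learner class...
search = GridSearchCV(
    MLPRegressor(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(8,), (16,)]},
    cv=3,
)
search.fit(X, y)

# ...and ensembling handled separately, wrapping the tuned base model.
ensemble = BaggingRegressor(search.best_estimator_, n_estimators=5, random_state=0)
ensemble.fit(X, y)
```

With this split, the DNN class would only need to implement fit/predict, and each training strategy stays swappable.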
The primary aim of this pull request was to increase test coverage and separate the estimation functions from the other methods to reduce the dependency on torch. Most of your comments concern code that was there before. I had no intention of dealing with it at the moment because, for me, it wasn't the priority. However, if you think it's very important, I can do it.
These new commits address the simplest modifications. The other requests will be addressed in future PRs; to track them, 7 issues were opened. This is a particular case because the estimators require significant refactoring and are not the focus of the library.
After discussion, the DNN was removed from the source in PR #166. This estimator can be added back later if it proves really necessary.
bthirion
left a comment
There are a few glitches. LGTM overall.
```
@@ -0,0 +1,655 @@
import numpy as np
```
The module should be renamed to dnn.py.
```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
```
Suggested change:

```diff
-import torch.nn.functional as F
+import torch.nn.functional as functional
```
```python
in_features=len(grp),
out_features=input_dim[grp_ind + 1] - input_dim[grp_ind],
)
# nn.Sequential(
```
Commented-out code should not be included.
```python
# Specify whether to use GPU or CPU
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")
```
It is weird to set this up here...
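One alternative is to make the device a constructor argument with a sensible default, so the module itself stays device-agnostic; a minimal sketch (the class and attribute names are hypothetical, not taken from the PR):

```python
import torch


class TinyNet(torch.nn.Module):
    """Sketch: take the device as a parameter instead of hard-coding it
    inside the module body."""

    def __init__(self, device=None):
        super().__init__()
        # Fall back to CUDA-if-available only when the caller doesn't choose.
        self.device = device or torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )
        self.layers = torch.nn.Linear(3, 1).to(self.device)

    def forward(self, x):
        return self.layers(x.to(self.device))
```

The caller then decides where computation happens, e.g. `TinyNet(device=torch.device("cpu"))`, which also makes tests reproducible on CPU-only machines.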
```python
optimizer.step()
for name, param in model.named_parameters():
    if "bias" not in name:
        # if name.split(".")[0] == "layers_stacking":
```
This is not ready to be reviewed. I just added it here to keep track of it.
The estimators Dnn_learner and RandomForestModified were not tested after the BBI file was removed. I tried to implement some tests to improve the test coverage.
I moved these files into a sub-module of hidimstat because they are not part of the core methods proposed by the library.
Moreover, Dnn_learner includes a dependency on torch and torchmetrics which is not essential for the other methods of the library. Consequently, moving them to a sub-module removes this requirement for using the other methods of the library.