Conversation
```python
loss = np.array(res_ens[4])

if self.n_ensemble == 1:
    raise Warning("The model can't be fit with n_ensemble = 1")
```
@jpaillard Can you check whether it's correct that, without multiple ensembles, there is no fitting?
I think there is a fitting, as suggested by the call to joblib_ensemble_dnnet (by the way, should this be a private function? Should we integrate it into the same module for clarity?)
The following lines select the n_ensemble best models (which is useless when n_ensemble == 1).
I would suggest a function _keep_n_ensemble, called at line 243 and gathering the code up to line 261, to manage this distinction.
See issue #75. This will be modified in the future.
```python
y = y.reshape(-1, 1)
if self.problem_type == "regression":
    list_y.append(y)
# Encoding the target with the ordinal case
```
@jpaillard @bthirion
Can you tell me, if you know, what the "ordinal method" is?
If yes, do you think it's worth keeping? (The function was only half implemented.)
If it's worth keeping, can you check whether my modification is correct?
It stands for regression problems where the ordering of the values matters, but not the values themselves. Usually the values are discretized. I propose to keep it for the moment.
This will be addressed in the future. #76
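For reference, a common way to encode an ordinal target is cumulative ("thermometer") encoding; this is a sketch of the general idea, not necessarily what the half-implemented function did:

```python
import numpy as np


def ordinal_encode(y):
    """Cumulative ("thermometer") encoding for ordinal targets.

    A class of rank k among K ordered classes maps to a vector whose
    first k entries are 1. This is one standard ordinal encoding; whether
    it matches the code under review is an assumption.
    """
    classes = np.sort(np.unique(y))
    ranks = np.searchsorted(classes, y)
    # Column j is 1 when the sample's rank is strictly greater than j.
    return (ranks[:, None] > np.arange(len(classes) - 1)[None, :]).astype(float)
```

With three ordered classes, the targets 0, 1, 2 become [0, 0], [1, 0], [1, 1], so the ordering is preserved in the representation.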
```python
if samples_in_leaf.size > 0:
    leaf_samples.append(
        y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
    )
```
@jpaillard @bthirion
I modified the function to handle the case where the leaf samples are empty.
However, I don't know what the function was doing.
Can you validate that this is the correct way to do it?
This should never be empty by construction (random forests represent the samples in a tree structure). By default, there is a minimum number of samples in each leaf.
The usage of samples_in_leaf is explained in #42.
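The guarantee mentioned here can be checked directly with scikit-learn; a small sketch (the data and parameter values are made up, and `bootstrap=False` is used so the trees train on exactly these samples):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

# With min_samples_leaf=2, every leaf of every tree holds at least two
# training samples, so a per-leaf sample lookup can never be empty.
forest = RandomForestRegressor(
    n_estimators=5, min_samples_leaf=2, bootstrap=False, random_state=0
)
forest.fit(X, y)
for tree in forest.estimators_:
    # apply() returns the leaf index reached by each training sample.
    _, counts = np.unique(tree.apply(X), return_counts=True)
    assert counts.min() >= 2
```

This supports dropping the `samples_in_leaf.size > 0` guard rather than keeping dead-branch handling.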
```python
assert not np.all(predict_prob[:, 0] == 0)
assert not np.all(predict_prob[:, 1] == 0)
# Check if the predicted probabilities are not all ones for each class
assert not np.all(predict_prob[:, 0] == 1)
```
@jpaillard
Can you check whether there are too many assertions and whether I missed any?
There are probably enough ;-)
We should just make sure they're not redundant.
Advice for new tests is addressed in issue #79.
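One way to keep the assertions non-redundant is to test the defining properties of a probability matrix once; a sketch with a stand-in array (the values are hypothetical, `predict_prob` stands in for the learner's output):

```python
import numpy as np

# Stand-in for learner.predict_proba(X) on a binary task (made-up values).
predict_prob = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])

# Each row must be a probability distribution over the classes.
assert np.all((predict_prob >= 0) & (predict_prob <= 1))
assert np.allclose(predict_prob.sum(axis=1), 1.0)

# Non-degeneracy: column 0 is neither constantly 0 nor constantly 1.
# Given rows summing to 1, this also covers column 1 in the binary case,
# which is where the redundancy in the original four assertions comes from.
assert not np.all(predict_prob[:, 0] == 0)
assert not np.all(predict_prob[:, 0] == 1)
```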
```python
learner = DnnLearnerSingle(do_hypertuning=True, problem_type="ordinal", n_jobs=10, verbose=0)
learner.fit(X, y)
predict_prob = learner.predict_proba(X)[:, 0]
# Check if the predicted class labels match the true labels for at least one instance
```
@jpaillard @bthirion
Can you help me define some tests for this method?
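One property-based test that could work here, assuming the ordinal head outputs cumulative probabilities P(y > k) (an assumption about the interface, not something the code confirms), is to check monotonicity across columns:

```python
import numpy as np

# Hypothetical cumulative output of an ordinal model: column k holds
# P(y > k). For a valid cumulative encoding these probabilities must be
# non-increasing across columns for every sample.
cum_prob = np.array([[0.9, 0.6, 0.2], [0.7, 0.5, 0.1]])

assert np.all(np.diff(cum_prob, axis=1) <= 0)
assert np.all((cum_prob >= 0) & (cum_prob <= 1))
```

Unlike value-based checks, this holds for any correctly implemented ordinal model, so it does not depend on a particular dataset.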
```python
# Check if the feature importances are not all close to zero
assert not np.allclose(learner.feature_importances_, 0)
# Check if the feature importances are not all close to one
assert not np.allclose(learner.feature_importances_, 1)
```
@jpaillard
Can you check whether there are too many assertions and whether I missed any?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff            @@
##             main      #57       +/- ##
==========================================
+ Coverage   77.09%   94.32%   +17.23%
==========================================
  Files          46       52        +6
  Lines        2471     2608      +137
==========================================
+ Hits         1905     2460      +555
+ Misses        566      148      -418
```

☔ View full report in Codecov by Sentry.
bthirion
left a comment
Thx for opening this! I have a bunch of comments.
```python
leaf_samples.append(
    y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
)
if samples_in_leaf.size > 0:
```
You can remove the condition here too.
```python
    Data matrix
y : np.array
    Target vector
grps : np.array
```
Suggested change:

```diff
-grps : np.array
+groups : np.array
```
```python
self.loss = 0

def forward(self, x):
    if self.group_stacking:
```
This will be addressed by the issue #81.
```python
x = torch.cat(list_stacking, dim=1)
return self.layers(x)

def training_step(self, batch, device, problem_type):
```
This will be addressed by the issue #81.
```python
loss = F.binary_cross_entropy_with_logits(y_pred, y)
return loss

def validation_step(self, batch, device, problem_type):
```
This will be addressed by the issue #81.
```python
    "batch_size": len(X),
}

def validation_epoch_end(self, outputs, problem_type):
```
```python
print("Epoch [{}], val_mse: {:.4f}".format(epoch + 1, result["val_mse"]))

def _evaluate(model, loader, device, problem_type):
```
This will be addressed by the issue #81.
```python
loss = np.array(res_ens[4])

if self.n_ensemble == 1:
    raise Warning("The model can't be fit with n_ensemble = 1")
```
As discussed with you: as a user, I don't like integrating the ensembling (n_ensemble) and the hyper-parameter tuning (do_hypertuning, dict_hypertuning) in a single class, which becomes huge.
Also, I think other libraries (sklearn for ensembling, optuna for hyper-parameters) offer more and better options for these advanced training strategies.
I suggest separating these aspects from the DNN_learner class and leaving it up to the user to optimize the training separately from hidimstat.
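A sketch of this separation using only scikit-learn (GridSearchCV stands in for optuna here, and the MLPRegressor base model is just an illustration, not hidimstat's DNN; all names and parameter values are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# Hyper-parameter tuning handled outside the learner class...
search = GridSearchCV(
    MLPRegressor(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(8,), (16,)]},
    cv=3,
)
search.fit(X, y)

# ...and ensembling handled separately, wrapping the tuned base model.
ensemble = BaggingRegressor(search.best_estimator_, n_estimators=5, random_state=0)
ensemble.fit(X, y)
```

With this split, the DNN class would only need to implement fit/predict, and each training strategy stays swappable.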
The primary aim of this pull request was to increase test coverage and separate the estimation functions from the other methods to reduce the dependency on torch. Most of your comments concern code that was there before. I had no intention of dealing with it at the moment because, for me, it wasn't the priority. However, if you think it's very important, I can do it.
These new commits address the simplest modifications. The other requests will be addressed in future PRs; to track them, 7 issues were opened. This is a particular case because the estimators require significant refactoring and are not the focus of the library.
After discussion, the DNN was removed from the source in PR #166. This estimator can be added back later if it proves really necessary.
bthirion
left a comment
There are a few glitches. LGTM overall.
```
@@ -0,0 +1,655 @@
import numpy as np
```
The module should be renamed to dnn.py.
```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
```
Suggested change:

```diff
-import torch.nn.functional as F
+import torch.nn.functional as functional
```
```python
in_features=len(grp),
out_features=input_dim[grp_ind + 1] - input_dim[grp_ind],
)
# nn.Sequential(
```
Commented-out code should not be included.
```python
# Specify whether to use GPU or CPU
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")
```
It is weird to set this up here...
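One alternative is to make the device a constructor argument with a sensible default, so the module itself stays device-agnostic; a minimal sketch (the class and attribute names are hypothetical, not taken from the PR):

```python
import torch


class TinyNet(torch.nn.Module):
    """Sketch: take the device as a parameter instead of hard-coding it
    inside the module body."""

    def __init__(self, device=None):
        super().__init__()
        # Fall back to CUDA-if-available only when the caller doesn't choose.
        self.device = device or torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )
        self.layers = torch.nn.Linear(3, 1).to(self.device)

    def forward(self, x):
        return self.layers(x.to(self.device))
```

The caller then decides where computation happens, e.g. `TinyNet(device=torch.device("cpu"))`, which also makes tests reproducible on CPU-only machines.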
```python
optimizer.step()
for name, param in model.named_parameters():
    if "bias" not in name:
        # if name.split(".")[0] == "layers_stacking":
```
This is not ready to be reviewed. I just added it here to keep track of it.
The estimators Dnn_learner and RandomForestModified were not tested after the BBI file was removed. I tried to implement some tests to improve the test coverage.
I moved these files into a sub-module of hidimstat because they are not part of the core methods proposed by the library.
Moreover, Dnn_learner includes a dependency on torch and torchmetrics which is not essential for the other methods of the library. Consequently, moving them to a sub-module removes this requirement for using the other methods of the library.