Conversation
rcurtin
left a comment
There was a problem hiding this comment.
Here is what I wrote up on the long plane ride ...
Hello Kirill,
This flight is really long... there are still some hours left in it too...
I've done
$ git checkout k_fold_cv
$ git diff simple_cv
in order to produce a diff, which I've snipped down and annotated as needed
below. This is hopefully semi-equivalent to Github PR comments, although the
interface is less nice. :)
Overall everything looks good to me---because of the nice way you built the
CVBase code, it means that there is little that's actually necessary to add for
KFoldCV, so there is not too much for me to comment on. The tests look good. I
have a handful of comments about little things, which you can find below.
(I took the comments from below and put them here as a code review.)
| * To construct a KFoldCV object you need to pass the k parameter and arguments | ||
| * that specify data. For the latter see the CVBase constructors as a reference | ||
| * - the CVBase constructors take exactly the same arguments as ones that are | ||
| * supposed to be passed after the k parameter in the KFoldCV constructor. |
There was a problem hiding this comment.
I think this way of referring someone to the other arguments of the constructor
could be a little bit confusing, and because we are not exposing the CVBase
constructors directly to the user, we can't try to force Doxygen to print all of
the CVBase constructors directly along with the KFoldCV constructors.
The only solution I could see would be to replicate the four constructors of
CVBase here. This would cause extra code duplication, but it would be for the
sake of clarity of documentation. Would you be willing to do that? I think
that that would also allow inlining of the Init() functions into the
constructors directly, so maybe the amount of extra code and functions would not
be that many...
There was a problem hiding this comment.
and because we are not exposing the CVBase
constructors directly to the user
What do you mean? The CVBase constructors are public.
Anyway, you have mentioned it already the second time (the first time was when you did it for the SimpleCV constructors). So, I guess it's pretty much important in general and to be consisted with other mlpack in particular - more important than I thought initially. In such a case I'm ready to do what you ask, even though I like the idea of avoiding code duplication. If we want to do it here, I guess it makes sense to do it in SimpleCV too.
There was a problem hiding this comment.
Correct me if I have misunderstood---I think that because the inheritance is private, the constructors of CVBase are not available in KFoldCV, and instead the single constructor of KFoldCV is delegating to CVBase. So in the end, when a user looks at the KFoldCV documentation, they only see this overload, and have to dig further to find the CVBase constructors that are being referred to. What do you think, do you like the idea of adding more constructors for clarity, or adding more documentation to be fully clear about what should be passed in?
There was a problem hiding this comment.
Correct me if I have misunderstood---I think that because the inheritance is private, the constructors of CVBase are not available in KFoldCV, and instead the single constructor of KFoldCV is delegating to CVBase.
Even if the inheritance was public, the constructors of CVBase would not be available in KFoldCV. This simple code snippet illustrates the point.
class A
{
public:
A(int a) {}
};
class B : public A
{
};
int main()
{
B b(1); // This line will fail to be compiled.
}
What do you think, do you like the idea of adding more constructors for clarity, or adding more documentation to be fully clear about what should be passed in?
I don't see any way how the documentation can become significantly clearer than now. So, I guess it's easier just to add explicit constructors here and to SimpleCV for uniformity.
There was a problem hiding this comment.
Yeah---sorry that that is extra work but I think it is the best option here.
| * | ||
| * @param k Number of folds (should be at least 2). | ||
| * @param args Basic constructor arguments for MLAlgortithm (see the CVBase | ||
| * constructors for reference). |
There was a problem hiding this comment.
Minor misspelling---this should be MLAlgorithm not MLAlgortithm. :)
There was a problem hiding this comment.
I will remove this constructor eventually, so it can be skipped.
| */ | ||
| MatType xs; | ||
| PredictionsType ys; | ||
| WeightsType weights; |
There was a problem hiding this comment.
Can you put one comment per line here? Otherwise Doxygen will pick up that
entire comment for just the variable xs, and the documentation will be empty for
both ys and weights.
There was a problem hiding this comment.
Ok. I thought that Doxygen doesn't generate documentation for private fields and methods.
Is it fine to use this approach for private methods like here? Or it is better to add a separate comment for each method too?
There was a problem hiding this comment.
Hmm, I looked and EXTRACT_PRIVATE is set to true in the Doxyfile so it is generating documentation for private members. I think that should be changed; I will try and do a PR for that in the future (but maybe it will be after I get back from my trip), so private members will not be shown in the generated Doxygen documentation. It might still be better to put a separate comment for each variable, since they are slightly different, and for each overload in CVBase like you linked to, since each overload is used in a slightly different situation.
There was a problem hiding this comment.
Ok, I will add separate comments.
| size_t trainingSubsetSize; | ||
|
|
||
| //! A pointer to a model from the last run of k-fold cross-validation. | ||
| std::unique_ptr<MLAlgorithm> modelPtr; |
There was a problem hiding this comment.
I forgot to follow up with this on the last simple_cv PR, and we're about to
merge that so I don't want to bring this up again there. Nowhere else in the
mlpack codebase is std::unique_ptr<> used, so for consistency I'd prefer to
avoid the use of the class.
I see a couple of possibilities---you can use a bare pointer or actually hold an
MLAlgorithm internally and I think this will not affect the external API you
have already provided. At a quick glance, I think that both of those should be
possible; but perhaps there is a catch I have overlooked.
This is a minor comment that doesn't really affect the actual functionality of
the CV code at all (I think), so we can consider it low-priority. If you
prefer, I can submit the PR to make these changes, but it will be a few weeks
until that happens. (Maybe I will prepare it on the flight home, but that will
be August 12...)
There was a problem hiding this comment.
Initially I thought that some machine learning algorithms are not construable without any argument. But when I reviewed the mlpack code recently, the situation looked differently to me. So, I guess we can store objects themselves and use the move semantic for light-weight assignments.
There was a problem hiding this comment.
I think most are default-constructible, but not all of them---I don't think we've made any strict requirement on that (and I'm not sure we need to).
There was a problem hiding this comment.
If so, why the usage of standard smart pointers is bad? As an alternative, we can keep smart pointers only as a part of implementation and keep them away of any public interface.
There was a problem hiding this comment.
Just consistency, there is no particular reason that smart pointers are bad. If we were making tons of them, it would be faster to use bare pointers or objects (due to the reference counting of unique_ptr), but that is not a concern here. If you really want to use unique_ptr, I guess it is ok, but like you said we should keep them away from the public interface since they do not appear anywhere else in the mlpack API. Let me know what you'd like to do.
There was a problem hiding this comment.
I guess we should revert the changes made by this commit, but CVBase methods will need to return MLAlgorithm objects themselves rather than smart pointers.
There was a problem hiding this comment.
(due to the reference counting of
unique_ptr)
Isn't it about shared_ptr?
There was a problem hiding this comment.
Your call, if that's what you'd like to do feel free. Personally I think the code is cleaner without smart pointers but that is just personal preference and this is your code so it is your choice. :) My only request is that we don't use smart pointers in the public API for consistency with the rest of mlpack.
There was a problem hiding this comment.
Isn't it about shared_ptr?
Ah sorry you are right, there is no reference counting for unique_ptr.
| * We take the ith validation subset after the ith training subset if | ||
| * i < k - 1 and before it otherwise. | ||
| */ | ||
| size_t ValidationSubsetFirtsCol(const size_t i) |
There was a problem hiding this comment.
I think there is a misspelling here, should this be ValidationSubsetFirstCol()?
| */ | ||
| size_t ValidationSubsetFirtsCol(const size_t i) | ||
| { | ||
| return i < k - 1? binSize * i + trainingSubsetSize : binSize * (i - 1); |
There was a problem hiding this comment.
Minor style comment---can you add a space between the '1' and '?'?
Personally I like to make the ternary operands more "grouped" by grouping them
in parentheses, like this:
return (i < k - 1) ? (binSize * i + trainingSubsetSize) : (binSize * (i - 1));
I think it's a little easier to read like that. However it is up to you whether
or not you want to group them like that. :)
There was a problem hiding this comment.
I have nothing against using this style, so I will add the changes.
| { | ||
| return arma::Row<ElementType>(r.colptr(ValidationSubsetFirtsCol(i)), | ||
| binSize, false, true); | ||
| } |
There was a problem hiding this comment.
Since there's already a k_fold_cv_impl.hpp, it may be better just to split the
implementation of these simple functions into that file (since they don't fit
all on one line), and then just mark them 'inline' in this file. That way
someone who reads the header file gets a quick idea of all the methods in the
class, but doesn't need to think about implementation at all.
| trainingSubsetSize = binSize * (k - 1); | ||
|
|
||
| destination = arma::join_rows(sourceAsDT, | ||
| sourceAsDT.cols(0, trainingSubsetSize - binSize - 1)); |
There was a problem hiding this comment.
This approach ends up causing two copies of 'source': the first line allocates
'source.n_elem' elements and copies from source to sourceAsDT, but then
join_rows must allocate an entirely new region of memory and copy everything
from sourceAsDT.
So I think you could just do arma::join_rows(source, source.cols(...)), but I
have not tested that. (This laptop is a little Chromebook that takes ~60-90
minutes to compile all of mlpack... so I have not even tried building this
branch... :))
There was a problem hiding this comment.
The problem with the approach you describe is that we can't call the method cols for some objects that can be passed as data (e.g. temporary objects holding sum of matrixes or results of the arma::join_rows function).
You probably have noticed that sourceAsDT has the type const DestinationType&. It means that if a user has passed data in the type in which we are going to store the data, there will be no copy in this step.
There was a problem hiding this comment.
Ah, I overlooked the const reference. In this case it seems like what you have done is the best solution.
|
Since mlpack#1044 is merged, I can make a PR directly for the main mlpack repository now. |
|
Sure, that sounds good to me. |
|
I worked on all your comments except making explicit constructors and adapting smart pointers. See these changes there. |
|
Thanks---I think that this can be merged into the main repo then, do you want to open a PR for that? |
|
Ah, sorry, ok. I guess we can wait a couple days then. I know the process is a little bit slow, but it helps allow everyone time to comment on each change before it's committed. In this case, is there a particular hurry? Or is continuing to wait a little ok? |
|
There is no hurry from my side. The only reason to hurry can be if you want to review the last changes now, and will not have too much time to do it on the next week. |
|
Next week is just fine for me; I will finally be home. So there will be enough time for sure. :) |
No description provided.