Add k-fold cross-validation by micyril · Pull Request #1 · micyril/mlpack

micyril · 2017-07-25T13:32:13Z

No description provided.

rcurtin

Here is what I wrote up on the long plane ride ...

Hello Kirill,

This flight is really long... there are still some hours left in it too...

I've done

$ git checkout k_fold_cv
$ git diff simple_cv

in order to produce a diff, which I've snipped down and annotated as needed
below.  This is hopefully semi-equivalent to Github PR comments, although the
interface is less nice. :)

Overall everything looks good to me---because of the nice way you built the
CVBase code, it means that there is little that's actually necessary to add for
KFoldCV, so there is not too much for me to comment on.  The tests look good.  I
have a handful of comments about little things, which you can find below.

(I took the comments from below and put them here as a code review.)

rcurtin · 2017-07-29T07:48:46Z

+ * To construct a KFoldCV object you need to pass the k parameter and arguments
+ * that specify data. For the latter see the CVBase constructors as a reference
+ * - the CVBase constructors take exactly the same arguments as ones that are
+ * supposed to be passed after the k parameter in the KFoldCV constructor.


I think this way of referring someone to the other arguments of the constructor
could be a little bit confusing, and because we are not exposing the CVBase
constructors directly to the user, we can't try to force Doxygen to print all of
the CVBase constructors directly along with the KFoldCV constructors.

The only solution I could see would be to replicate the four constructors of
CVBase here. This would cause extra code duplication, but it would be for the
sake of clarity of documentation. Would you be willing to do that? I think
that that would also allow inlining of the Init() functions into the
constructors directly, so maybe the amount of extra code and functions would not
be that many...

and because we are not exposing the CVBase
constructors directly to the user

What do you mean? The CVBase constructors are public.

Anyway, you have mentioned it already the second time (the first time was when you did it for the SimpleCV constructors). So, I guess it's pretty much important in general and to be consisted with other mlpack in particular - more important than I thought initially. In such a case I'm ready to do what you ask, even though I like the idea of avoiding code duplication. If we want to do it here, I guess it makes sense to do it in SimpleCV too.

Correct me if I have misunderstood---I think that because the inheritance is private, the constructors of CVBase are not available in KFoldCV, and instead the single constructor of KFoldCV is delegating to CVBase. So in the end, when a user looks at the KFoldCV documentation, they only see this overload, and have to dig further to find the CVBase constructors that are being referred to. What do you think, do you like the idea of adding more constructors for clarity, or adding more documentation to be fully clear about what should be passed in?

Correct me if I have misunderstood---I think that because the inheritance is private, the constructors of CVBase are not available in KFoldCV, and instead the single constructor of KFoldCV is delegating to CVBase.

Even if the inheritance was public, the constructors of CVBase would not be available in KFoldCV. This simple code snippet illustrates the point.

class A { public: A(int a) {} }; class B : public A { }; int main() { B b(1); // This line will fail to be compiled. }

What do you think, do you like the idea of adding more constructors for clarity, or adding more documentation to be fully clear about what should be passed in?

I don't see any way how the documentation can become significantly clearer than now. So, I guess it's easier just to add explicit constructors here and to SimpleCV for uniformity.

Yeah---sorry that that is extra work but I think it is the best option here.

rcurtin · 2017-07-29T07:49:17Z

+   *
+   * @param k Number of folds (should be at least 2).
+   * @param args Basic constructor arguments for MLAlgortithm (see the CVBase
+   *     constructors for reference).


Minor misspelling---this should be MLAlgorithm not MLAlgortithm. :)

I will remove this constructor eventually, so it can be skipped.

rcurtin · 2017-07-29T07:49:38Z

+   */
+  MatType xs;
+  PredictionsType ys;
+  WeightsType weights;


Can you put one comment per line here? Otherwise Doxygen will pick up that
entire comment for just the variable xs, and the documentation will be empty for
both ys and weights.

Ok. I thought that Doxygen doesn't generate documentation for private fields and methods.

Is it fine to use this approach for private methods like here? Or it is better to add a separate comment for each method too?

Hmm, I looked and EXTRACT_PRIVATE is set to true in the Doxyfile so it is generating documentation for private members. I think that should be changed; I will try and do a PR for that in the future (but maybe it will be after I get back from my trip), so private members will not be shown in the generated Doxygen documentation. It might still be better to put a separate comment for each variable, since they are slightly different, and for each overload in CVBase like you linked to, since each overload is used in a slightly different situation.

Ok, I will add separate comments.

rcurtin · 2017-07-29T07:50:14Z

+  size_t trainingSubsetSize;
+
+  //! A pointer to a model from the last run of k-fold cross-validation.
+  std::unique_ptr<MLAlgorithm> modelPtr;


I forgot to follow up with this on the last simple_cv PR, and we're about to
merge that so I don't want to bring this up again there. Nowhere else in the
mlpack codebase is std::unique_ptr<> used, so for consistency I'd prefer to
avoid the use of the class.

I see a couple of possibilities---you can use a bare pointer or actually hold an
MLAlgorithm internally and I think this will not affect the external API you
have already provided. At a quick glance, I think that both of those should be
possible; but perhaps there is a catch I have overlooked.

This is a minor comment that doesn't really affect the actual functionality of
the CV code at all (I think), so we can consider it low-priority. If you
prefer, I can submit the PR to make these changes, but it will be a few weeks
until that happens. (Maybe I will prepare it on the flight home, but that will
be August 12...)

Initially I thought that some machine learning algorithms are not construable without any argument. But when I reviewed the mlpack code recently, the situation looked differently to me. So, I guess we can store objects themselves and use the move semantic for light-weight assignments.

I think most are default-constructible, but not all of them---I don't think we've made any strict requirement on that (and I'm not sure we need to).

If so, why the usage of standard smart pointers is bad? As an alternative, we can keep smart pointers only as a part of implementation and keep them away of any public interface.

Just consistency, there is no particular reason that smart pointers are bad. If we were making tons of them, it would be faster to use bare pointers or objects (due to the reference counting of unique_ptr), but that is not a concern here. If you really want to use unique_ptr, I guess it is ok, but like you said we should keep them away from the public interface since they do not appear anywhere else in the mlpack API. Let me know what you'd like to do.

I guess we should revert the changes made by this commit, but CVBase methods will need to return MLAlgorithm objects themselves rather than smart pointers.

(due to the reference counting of unique_ptr)

Isn't it about shared_ptr?

Your call, if that's what you'd like to do feel free. Personally I think the code is cleaner without smart pointers but that is just personal preference and this is your code so it is your choice. :) My only request is that we don't use smart pointers in the public API for consistency with the rest of mlpack.

Isn't it about shared_ptr?

Ah sorry you are right, there is no reference counting for unique_ptr.

rcurtin · 2017-07-29T07:50:39Z

+   * We take the ith validation subset after the ith training subset if
+   * i < k - 1 and before it otherwise.
+   */
+  size_t ValidationSubsetFirtsCol(const size_t i)


I think there is a misspelling here, should this be ValidationSubsetFirstCol()?

Yes, I will fix it.

rcurtin · 2017-07-29T07:50:56Z

+   */
+  size_t ValidationSubsetFirtsCol(const size_t i)
+  {
+    return i < k - 1? binSize * i + trainingSubsetSize : binSize * (i - 1);


Minor style comment---can you add a space between the '1' and '?'?

Personally I like to make the ternary operands more "grouped" by grouping them
in parentheses, like this:

return (i < k - 1) ? (binSize * i + trainingSubsetSize) : (binSize * (i - 1));

I think it's a little easier to read like that. However it is up to you whether
or not you want to group them like that. :)

I have nothing against using this style, so I will add the changes.

rcurtin · 2017-07-29T07:51:19Z

+  {
+    return arma::Row<ElementType>(r.colptr(ValidationSubsetFirtsCol(i)),
+        binSize, false, true);
+  }


Since there's already a k_fold_cv_impl.hpp, it may be better just to split the
implementation of these simple functions into that file (since they don't fit
all on one line), and then just mark them 'inline' in this file. That way
someone who reads the header file gets a quick idea of all the methods in the
class, but doesn't need to think about implementation at all.

rcurtin · 2017-07-29T07:51:47Z

+  trainingSubsetSize = binSize * (k - 1);
+
+  destination = arma::join_rows(sourceAsDT,
+      sourceAsDT.cols(0, trainingSubsetSize - binSize - 1));


This approach ends up causing two copies of 'source': the first line allocates
'source.n_elem' elements and copies from source to sourceAsDT, but then
join_rows must allocate an entirely new region of memory and copy everything
from sourceAsDT.

So I think you could just do arma::join_rows(source, source.cols(...)), but I
have not tested that. (This laptop is a little Chromebook that takes ~60-90
minutes to compile all of mlpack... so I have not even tried building this
branch... :))

The problem with the approach you describe is that we can't call the method cols for some objects that can be passed as data (e.g. temporary objects holding sum of matrixes or results of the arma::join_rows function).

You probably have noticed that sourceAsDT has the type const DestinationType&. It means that if a user has passed data in the type in which we are going to store the data, there will be no copy in this step.

Ah, I overlooked the const reference. In this case it seems like what you have done is the best solution.

micyril · 2017-08-08T04:28:36Z

Since mlpack#1044 is merged, I can make a PR directly for the main mlpack repository now.

rcurtin · 2017-08-08T04:30:14Z

Sure, that sounds good to me.

micyril · 2017-08-08T04:30:41Z

I worked on all your comments except making explicit constructors and adapting smart pointers. See these changes there.

rcurtin · 2017-08-11T12:00:07Z

Thanks---I think that this can be merged into the main repo then, do you want to open a PR for that?

micyril · 2017-08-11T12:03:24Z

I have written some additional tests, and one of them fails sometimes due to the reasons described in the issue #1086. So, I need to wait until #1089 is merged.

rcurtin · 2017-08-11T12:04:50Z

Ah, sorry, ok. I guess we can wait a couple days then. I know the process is a little bit slow, but it helps allow everyone time to comment on each change before it's committed. In this case, is there a particular hurry? Or is continuing to wait a little ok?

micyril · 2017-08-11T12:08:22Z

There is no hurry from my side. The only reason to hurry can be if you want to review the last changes now, and will not have too much time to do it on the next week.

rcurtin · 2017-08-11T12:19:45Z

Next week is just fine for me; I will finally be home. So there will be enough time for sure. :)

Add k-fold cross-validation

e5e3ad6

micyril assigned micyril and unassigned micyril Jul 25, 2017

rcurtin reviewed Jul 29, 2017

View reviewed changes

micyril added 4 commits August 5, 2017 10:22

Fix typo

2e59c94

Add separate comments

05bb442

Move method definitions

164064e

Make the ternary operands more "grouped"

e35d631

micyril mentioned this pull request Aug 6, 2017

Add simple cross-validation mlpack/mlpack#1044

Merged

micyril mentioned this pull request Aug 15, 2017

Add k-fold cross-validation mlpack/mlpack#1095

Merged

Conversation

micyril commented Jul 25, 2017

Uh oh!

rcurtin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rcurtin Aug 1, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

micyril commented Aug 8, 2017

Uh oh!

rcurtin commented Aug 8, 2017

Uh oh!

micyril commented Aug 8, 2017

Uh oh!

rcurtin commented Aug 11, 2017

Uh oh!

micyril commented Aug 11, 2017

Uh oh!

rcurtin commented Aug 11, 2017

Uh oh!

micyril commented Aug 11, 2017

Uh oh!

rcurtin Aug 1, 2017 •

edited

Loading