Skip to content

Add k-fold cross-validation#1

Open
micyril wants to merge 5 commits into
simple_cvfrom
k_fold_cv
Open

Add k-fold cross-validation#1
micyril wants to merge 5 commits into
simple_cvfrom
k_fold_cv

Conversation

@micyril

@micyril micyril commented Jul 25, 2017

Copy link
Copy Markdown
Owner

No description provided.

@micyril micyril assigned micyril and unassigned micyril Jul 25, 2017

@rcurtin rcurtin left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is what I wrote up on the long plane ride ...

Hello Kirill,

This flight is really long... there are still some hours left in it too...

I've done

$ git checkout k_fold_cv
$ git diff simple_cv

in order to produce a diff, which I've snipped down and annotated as needed
below.  This is hopefully semi-equivalent to Github PR comments, although the
interface is less nice. :)

Overall everything looks good to me---because of the nice way you built the
CVBase code, it means that there is little that's actually necessary to add for
KFoldCV, so there is not too much for me to comment on.  The tests look good.  I
have a handful of comments about little things, which you can find below.

(I took the comments from below and put them here as a code review.)

* To construct a KFoldCV object you need to pass the k parameter and arguments
* that specify data. For the latter see the CVBase constructors as a reference
* - the CVBase constructors take exactly the same arguments as ones that are
* supposed to be passed after the k parameter in the KFoldCV constructor.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this way of referring someone to the other arguments of the constructor
could be a little bit confusing, and because we are not exposing the CVBase
constructors directly to the user, we can't try to force Doxygen to print all of
the CVBase constructors directly along with the KFoldCV constructors.

The only solution I could see would be to replicate the four constructors of
CVBase here. This would cause extra code duplication, but it would be for the
sake of clarity of documentation. Would you be willing to do that? I think
that that would also allow inlining of the Init() functions into the
constructors directly, so maybe the amount of extra code and functions would not
be that many...

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and because we are not exposing the CVBase
constructors directly to the user

What do you mean? The CVBase constructors are public.

Anyway, you have mentioned it already the second time (the first time was when you did it for the SimpleCV constructors). So, I guess it's pretty much important in general and to be consisted with other mlpack in particular - more important than I thought initially. In such a case I'm ready to do what you ask, even though I like the idea of avoiding code duplication. If we want to do it here, I guess it makes sense to do it in SimpleCV too.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I have misunderstood---I think that because the inheritance is private, the constructors of CVBase are not available in KFoldCV, and instead the single constructor of KFoldCV is delegating to CVBase. So in the end, when a user looks at the KFoldCV documentation, they only see this overload, and have to dig further to find the CVBase constructors that are being referred to. What do you think, do you like the idea of adding more constructors for clarity, or adding more documentation to be fully clear about what should be passed in?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I have misunderstood---I think that because the inheritance is private, the constructors of CVBase are not available in KFoldCV, and instead the single constructor of KFoldCV is delegating to CVBase.

Even if the inheritance was public, the constructors of CVBase would not be available in KFoldCV. This simple code snippet illustrates the point.

class A
{
public:
    A(int a) {}
};

class B : public A
{
};

int main()
{
  B b(1); // This line will fail to be compiled.
}

What do you think, do you like the idea of adding more constructors for clarity, or adding more documentation to be fully clear about what should be passed in?

I don't see any way how the documentation can become significantly clearer than now. So, I guess it's easier just to add explicit constructors here and to SimpleCV for uniformity.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah---sorry that that is extra work but I think it is the best option here.

*
* @param k Number of folds (should be at least 2).
* @param args Basic constructor arguments for MLAlgortithm (see the CVBase
* constructors for reference).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor misspelling---this should be MLAlgorithm not MLAlgortithm. :)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this constructor eventually, so it can be skipped.

*/
MatType xs;
PredictionsType ys;
WeightsType weights;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put one comment per line here? Otherwise Doxygen will pick up that
entire comment for just the variable xs, and the documentation will be empty for
both ys and weights.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I thought that Doxygen doesn't generate documentation for private fields and methods.

Is it fine to use this approach for private methods like here? Or it is better to add a separate comment for each method too?

@rcurtin rcurtin Aug 1, 2017

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I looked and EXTRACT_PRIVATE is set to true in the Doxyfile so it is generating documentation for private members. I think that should be changed; I will try and do a PR for that in the future (but maybe it will be after I get back from my trip), so private members will not be shown in the generated Doxygen documentation. It might still be better to put a separate comment for each variable, since they are slightly different, and for each overload in CVBase like you linked to, since each overload is used in a slightly different situation.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will add separate comments.

size_t trainingSubsetSize;

//! A pointer to a model from the last run of k-fold cross-validation.
std::unique_ptr<MLAlgorithm> modelPtr;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to follow up with this on the last simple_cv PR, and we're about to
merge that so I don't want to bring this up again there. Nowhere else in the
mlpack codebase is std::unique_ptr<> used, so for consistency I'd prefer to
avoid the use of the class.

I see a couple of possibilities---you can use a bare pointer or actually hold an
MLAlgorithm internally and I think this will not affect the external API you
have already provided. At a quick glance, I think that both of those should be
possible; but perhaps there is a catch I have overlooked.

This is a minor comment that doesn't really affect the actual functionality of
the CV code at all (I think), so we can consider it low-priority. If you
prefer, I can submit the PR to make these changes, but it will be a few weeks
until that happens. (Maybe I will prepare it on the flight home, but that will
be August 12...)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I thought that some machine learning algorithms are not construable without any argument. But when I reviewed the mlpack code recently, the situation looked differently to me. So, I guess we can store objects themselves and use the move semantic for light-weight assignments.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most are default-constructible, but not all of them---I don't think we've made any strict requirement on that (and I'm not sure we need to).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, why the usage of standard smart pointers is bad? As an alternative, we can keep smart pointers only as a part of implementation and keep them away of any public interface.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just consistency, there is no particular reason that smart pointers are bad. If we were making tons of them, it would be faster to use bare pointers or objects (due to the reference counting of unique_ptr), but that is not a concern here. If you really want to use unique_ptr, I guess it is ok, but like you said we should keep them away from the public interface since they do not appear anywhere else in the mlpack API. Let me know what you'd like to do.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should revert the changes made by this commit, but CVBase methods will need to return MLAlgorithm objects themselves rather than smart pointers.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(due to the reference counting of unique_ptr)

Isn't it about shared_ptr?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your call, if that's what you'd like to do feel free. Personally I think the code is cleaner without smart pointers but that is just personal preference and this is your code so it is your choice. :) My only request is that we don't use smart pointers in the public API for consistency with the rest of mlpack.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it about shared_ptr?

Ah sorry you are right, there is no reference counting for unique_ptr.

Comment thread src/mlpack/core/cv/k_fold_cv.hpp Outdated
* We take the ith validation subset after the ith training subset if
* i < k - 1 and before it otherwise.
*/
size_t ValidationSubsetFirtsCol(const size_t i)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a misspelling here, should this be ValidationSubsetFirstCol()?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will fix it.

Comment thread src/mlpack/core/cv/k_fold_cv.hpp Outdated
*/
size_t ValidationSubsetFirtsCol(const size_t i)
{
return i < k - 1? binSize * i + trainingSubsetSize : binSize * (i - 1);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style comment---can you add a space between the '1' and '?'?

Personally I like to make the ternary operands more "grouped" by grouping them
in parentheses, like this:

return (i < k - 1) ? (binSize * i + trainingSubsetSize) : (binSize * (i - 1));

I think it's a little easier to read like that. However it is up to you whether
or not you want to group them like that. :)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have nothing against using this style, so I will add the changes.

Comment thread src/mlpack/core/cv/k_fold_cv.hpp Outdated
{
return arma::Row<ElementType>(r.colptr(ValidationSubsetFirtsCol(i)),
binSize, false, true);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's already a k_fold_cv_impl.hpp, it may be better just to split the
implementation of these simple functions into that file (since they don't fit
all on one line), and then just mark them 'inline' in this file. That way
someone who reads the header file gets a quick idea of all the methods in the
class, but doesn't need to think about implementation at all.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

trainingSubsetSize = binSize * (k - 1);

destination = arma::join_rows(sourceAsDT,
sourceAsDT.cols(0, trainingSubsetSize - binSize - 1));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach ends up causing two copies of 'source': the first line allocates
'source.n_elem' elements and copies from source to sourceAsDT, but then
join_rows must allocate an entirely new region of memory and copy everything
from sourceAsDT.

So I think you could just do arma::join_rows(source, source.cols(...)), but I
have not tested that. (This laptop is a little Chromebook that takes ~60-90
minutes to compile all of mlpack... so I have not even tried building this
branch... :))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with the approach you describe is that we can't call the method cols for some objects that can be passed as data (e.g. temporary objects holding sum of matrixes or results of the arma::join_rows function).

You probably have noticed that sourceAsDT has the type const DestinationType&. It means that if a user has passed data in the type in which we are going to store the data, there will be no copy in this step.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I overlooked the const reference. In this case it seems like what you have done is the best solution.

@micyril

micyril commented Aug 8, 2017

Copy link
Copy Markdown
Owner Author

Since mlpack#1044 is merged, I can make a PR directly for the main mlpack repository now.

@rcurtin

rcurtin commented Aug 8, 2017

Copy link
Copy Markdown

Sure, that sounds good to me.

@micyril

micyril commented Aug 8, 2017

Copy link
Copy Markdown
Owner Author

I worked on all your comments except making explicit constructors and adapting smart pointers. See these changes there.

@rcurtin

rcurtin commented Aug 11, 2017

Copy link
Copy Markdown

Thanks---I think that this can be merged into the main repo then, do you want to open a PR for that?

@micyril

micyril commented Aug 11, 2017

Copy link
Copy Markdown
Owner Author

I have written some additional tests, and one of them fails sometimes due to the reasons described in the issue #1086. So, I need to wait until #1089 is merged.

@rcurtin

rcurtin commented Aug 11, 2017

Copy link
Copy Markdown

Ah, sorry, ok. I guess we can wait a couple days then. I know the process is a little bit slow, but it helps allow everyone time to comment on each change before it's committed. In this case, is there a particular hurry? Or is continuing to wait a little ok?

@micyril

micyril commented Aug 11, 2017

Copy link
Copy Markdown
Owner Author

There is no hurry from my side. The only reason to hurry can be if you want to review the last changes now, and will not have too much time to do it on the next week.

@rcurtin

rcurtin commented Aug 11, 2017

Copy link
Copy Markdown

Next week is just fine for me; I will finally be home. So there will be enough time for sure. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants