This repo is for collecting experiments in how to use resampling to tune hyperparameters in unsupervised learning.
Unlike in supervised learning, in unsupervised learning we cannot identify a single metric that measures the fit of the model. Instead, we seek principles of "good fit" that can be compared across hyperparameter values.
Possible resampling structures for clustering include:
Some clustering algorithms, such as
(initial_conditions folder)
Many common clustering metrics are guaranteed to shrink as the number of clusters increases.
However, it may be interesting to see how these metrics behave over cross-validation folds. Perhaps they are more variable, or more stable, across folds at the correct number of clusters.
(cross_validation folder)
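
As a rough illustration of the idea above, the sketch below fits k-means on the training portion of each fold and records the within-cluster sum of squares on the held-out portion, so that the spread across folds can be compared for each candidate number of clusters. It is a minimal sketch, not the code in the cross_validation folder: the helper name `held_out_inertia`, the synthetic data, and the choice of within-cluster sum of squares as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold


def held_out_inertia(X, k_values, n_splits=5, seed=0):
    """Hypothetical helper: per-fold held-out within-cluster sum of squares."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {k: [] for k in k_values}
    for train_idx, test_idx in folds.split(X):
        for k in k_values:
            model = KMeans(n_clusters=k, n_init=10, random_state=seed)
            model.fit(X[train_idx])
            # KMeans.score returns the negative k-means objective on the
            # given data, so negate it to get a sum of squared distances.
            scores[k].append(-model.score(X[test_idx]))
    return {k: np.array(v) for k, v in scores.items()}


# Synthetic data with three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
results = held_out_inertia(X, k_values=range(2, 7))
for k, vals in results.items():
    print(f"k={k}: mean={vals.mean():.1f}, std={vals.std():.1f}")
```

Comparing the per-fold standard deviation across values of k is one way to probe whether the metric is more stable at some values than others; whether that signal is reliable is exactly the open question here.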
I believe this is the most promising avenue.
In principle, if a clustering structure is "real", it should emerge in any random sample of data. Can we use some kind of bootstrapping or subsampling approach on our data to see if the same structure emerges in different samples?
The challenge here is measuring "same structure". Do we use cluster centers? Cluster membership (but then how to compare across subsamples with different items in them)? Other cluster properties?
(subsampling folder)
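
One possible answer to the "same structure" question is to compare cluster membership only on the items that two subsamples share, using a label-permutation-invariant score such as the adjusted Rand index. The sketch below does this for k-means. It is a sketch under assumptions, not the approach taken in the subsampling folder: the helper name `subsample_stability`, the 80% subsample fraction, and the synthetic data are all illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def subsample_stability(X, k, n_pairs=10, frac=0.8, seed=0):
    """Hypothetical helper: mean adjusted Rand index over subsample pairs,
    computed only on the items present in both subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    size = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        # Draw two subsamples without replacement.
        idx_a = rng.choice(n, size=size, replace=False)
        idx_b = rng.choice(n, size=size, replace=False)
        # Store labels in full-length arrays so shared items line up by index.
        labels_a = np.full(n, -1)
        labels_b = np.full(n, -1)
        labels_a[idx_a] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx_a])
        labels_b[idx_b] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx_b])
        shared = np.intersect1d(idx_a, idx_b)
        scores.append(adjusted_rand_score(labels_a[shared], labels_b[shared]))
    return float(np.mean(scores))


# Synthetic data with three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
stability = {k: subsample_stability(X, k) for k in range(2, 7)}
for k, s in stability.items():
    print(f"k={k}: mean ARI on shared items = {s:.3f}")
```

Because the comparison only needs a `fit_predict` call, a hierarchical method such as scikit-learn's `AgglomerativeClustering` could be substituted for `KMeans` without changing the rest of the function.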
In Phase 1, we focus only on the number of clusters as the hyperparameter of interest, and we restrict our experiments to k-means clustering and hierarchical clustering.