GitHub - kbodwin/Cluster_Tuning: Experiments for resampling methods on unsupervised learning

Overview

This repo is for collecting experiments in how to use resampling to tune hyperparameters in unsupervised learning.

Unlike in supervised learning, in unsupervised learning we cannot identify a singular metric that measures the fit of the model. Instead, we seek to find principles of "good fit" that can be compared across hyperparameter values.

Possible resampling structures for clusting include:

Cluster consistency across reruns with random conditions

Some clustering algorithms, such as $k$-means, include randomized starting conditions. Can we exploit this as a form of "resampling", and look for consistency of certain cluster properties across runs?

(initial_conditions folder)

Cross-validation and classic metrics

Many common clustering metrics are guaranteed to shrink as $k$ increases; for example, the within-sum-of-squares metric is smaller when $k$ is larger. For this reason, a classic approach is to use the "elbow method" to identify when the "gains" start to diminish. We believe this approach is ambiguous and unsatisfying.

However, it may be interesting to see how these metrics behave over cross-validation folds. Perhaps they are more variable, or more stable, across folds at the correct $k$? Perhaps a metric can be created to compare each fold to a "correct" answer?

(cross_validation folder)

Cluster consistency across subsamples

I believe this is the most promising avenue.

In principle, if a clustering structure is "real", it should emerge in any random sample of data. Can we use some kind of bootstrapping or subsampling approach on our data to see if the same structure emerges in different samples?

The challenge here is measuring "same structure". Do we use cluster centers? Cluster membership (but then how to compare across subsamples with different items in them)? Other cluster properties?

(subsampling folder)

Phase 1

In Phase 1, we focus only on number of clusters as the hyperparameter of interest, and we restrict our experiments to k-means clustering and hierarchical clustering.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
cross_validation		cross_validation
initial_conditions		initial_conditions
subsampling		subsampling
.gitignore		.gitignore
Cluster_Tuning.Rproj		Cluster_Tuning.Rproj
ReadMe.md		ReadMe.md
template.qmd		template.qmd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Cluster consistency across reruns with random conditions

Cross-validation and classic metrics

Cluster consistency across subsamples

Phase 1

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Overview

Cluster consistency across reruns with random conditions

Cross-validation and classic metrics

Cluster consistency across subsamples

Phase 1

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages