UtilsGGSV provides ggplot2-based utilities that solve two common pain
points in exploratory data analysis:
- Cluster / group characterisation: the plot_group_* family creates publication-ready plots that help you understand what makes each cluster (or any labelled group) distinctive:
  - plot_group_heatmap(): ECDF-percentile heat map showing each group's relative position for every variable.
  - plot_group_density(): per-variable density plots with per-group overlays (density curves and/or median lines).
  - plot_group_scatter(): biaxial scatter with optional PCA / t-SNE / UMAP projection and cluster centroids.
  - plot_group_mst(): minimum-spanning-tree layout coloured by the same ECDF scale as the heat map.
- Correlation visualisation: ggcorr() creates paired scatter plots with Spearman, Pearson, Kendall, or concordance correlation coefficients overlaid as a formatted table, with support for log / asinh / any scales transformation.

Additional helpers round out the toolkit:
- axis_limits(): force equal axis limits or expand axis coordinates without manually computing values.
- add_text_column(): place a column of text annotations at a consistent relative position regardless of the underlying axis transformation.
- get_trans(): retrieve any scales transformation by name, including higher-root and asinh transformations not available in base scales.

You can install UtilsGGSV from GitHub with:
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("SATVILab/UtilsGGSV")

library(UtilsGGSV)
library(ggplot2)
library(magrittr) # provides the %>% pipe used in the examples below
theme_set(cowplot::theme_cowplot())

The function ggcorr plots correlation coefficients:
set.seed(3)
response_vec_a <- rnorm(5)
response_tbl <- data.frame(
group = rep(letters[1:3], each = 5),
response = c(
response_vec_a,
response_vec_a * 1.2 + rnorm(5, sd = 0.2),
response_vec_a * 2 + rnorm(5, sd = 2)
),
pid = rep(paste0("id_", 1:5), 3)
)
ggcorr(
data = response_tbl %>% dplyr::filter(group %in% c("a", "b")),
grp = "group",
y = "response",
id = "pid"
)

We can display multiple correlation coefficients:
ggcorr(
data = response_tbl %>% dplyr::filter(group %in% c("a", "b")),
grp = "group",
y = "response",
id = "pid",
corr_method = c("spearman", "pearson")
)

We can compare more than two groups:
ggcorr(
data = response_tbl,
grp = "group",
y = "response",
id = "pid",
corr_method = "kendall"
)

We can compare more than two groups and multiple correlation coefficients:
ggcorr(
data = response_tbl,
grp = "group",
y = "response",
id = "pid",
corr_method = c("spearman", "pearson")
)

Specific functionality to make appropriate plots for the concordance correlation coefficient is available:
ggcorr(
data = response_tbl %>% dplyr::filter(group %in% c("a", "b")),
grp = "group",
y = "response",
id = "pid",
corr_method = "concordance",
abline = TRUE,
limits_equal = TRUE
)

The text in the table can be moved and resized:
ggcorr(
data = response_tbl %>% dplyr::filter(group %in% c("a", "b")),
grp = "group",
y = "response",
id = "pid",
corr_method = c("spearman", "pearson", "concordance"),
abline = TRUE,
limits_equal = TRUE,
coord = c(0.4, 0.17),
font_size = 3,
skip = 0.04,
pval_signif = 2,
est_signif = 2,
ci_signif = 2
)

Finally, the text placement is kept consistent when the axes are visually transformed:
ggcorr(
data = response_tbl %>% dplyr::mutate(response = abs(response + 1)^4),
grp = "group",
y = "response",
id = "pid",
corr_method = "spearman",
abline = TRUE,
limits_equal = TRUE,
trans = "log10",
skip = 0.06
)

Fix axis limits to be equal between x- and y-axes, and/or expand axis
coordinates. The primary use of axis_limits is forcing the x- and
y-axes to have the same limits “automatically” (i.e. by inspecting the
ggplot object, thus not requiring the user to manually calculate
limits to pass to ggplot2::expand_limits).
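
As a rough sketch of the idea (not necessarily how axis_limits is implemented), the ranges that ggplot2 has trained on the data can be read from the plot object with ggplot2::layer_scales, and a shared range passed to expand_limits. The helper name equal_limits_sketch is hypothetical, and the sketch assumes continuous, untransformed x and y scales.

```r
library(ggplot2)

# Hypothetical helper, for illustration only: read the trained x and y
# ranges from a ggplot object and expand both axes to their union.
equal_limits_sketch <- function(p) {
  scales_p <- layer_scales(p)
  lims <- range(c(scales_p$x$range$range, scales_p$y$range$range))
  list(plot = p + expand_limits(x = lims, y = lims), limits = lims)
}

p_cars <- ggplot(datasets::cars, aes(speed, dist)) + geom_point()
res <- equal_limits_sketch(p_cars)
res$limits # shared limits applied to both axes
```

Reading the ranges from the built plot, rather than from the raw data, means the limits reflect every layer that contributes to the axes.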
data("cars", package = "datasets")
p0 <- ggplot(cars, aes(speed, dist)) +
cowplot::background_grid(major = "xy") +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Axes unadjusted") +
labs(x = "Speed", y = "Distance")
p1 <- axis_limits(
p = p0,
limits_equal = TRUE
) +
labs(title = "Axes limits equal")
p2 <- axis_limits(
p = p0,
limits_expand = list(
x = c(0, 50),
y = c(-10, 200)
)
) +
labs(title = "Axes limits expanded")
cowplot::plot_grid(p0, p1, p2)

Add a column of text easily to a plot, regardless of the underlying
transformation, using add_text_column.
data_mod <- data.frame(x = rnorm(10, mean = 1)^2)
data_mod$y <- data_mod$x * 3 + rnorm(10, sd = 0.5)
fit <- lm(y ~ x, data = data_mod)
coef_tbl <- coefficients(summary(fit))
results_vec <- c(
paste0(
"Intercept: ",
signif(coef_tbl[1, "Estimate"][[1]], 2),
" (",
signif(coef_tbl[1, 1][[1]] - 2 * coef_tbl[1, 2][[1]], 3),
", ",
signif(coef_tbl[1, 1][[1]] + 2 * coef_tbl[1, 2][[1]], 3),
"; p = ",
signif(coef_tbl[1, 4][[1]], 3),
")"
),
paste0(
"Slope: ",
signif(coef_tbl[2, "Estimate"][[1]], 2),
" (",
signif(coef_tbl[2, 1][[1]] - 2 * coef_tbl[2, 2][[1]], 3),
", ",
signif(coef_tbl[2, 1][[1]] + 2 * coef_tbl[2, 2][[1]], 3),
"; p = ",
signif(coef_tbl[2, 4][[1]], 3),
")"
)
)
p <- ggplot(
data = data_mod,
aes(x = x, y = y)
) +
geom_point() +
cowplot::background_grid(major = "xy")
add_text_column(
p = p,
x = data_mod$x,
y = data_mod$y,
text = results_vec,
coord = c(0.05, 0.95),
skip = 0.07
)

Note that add_text_column places text in the same position, regardless
of underlying transformation.
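
One way to think about this behaviour, as a sketch under assumptions rather than a description of the package internals: a relative position is interpolated on the transformed scale and then mapped back to data units. The helper rel_to_data and its arguments are hypothetical.

```r
# Hypothetical helper, for illustration only: map a relative position
# (0 = bottom of the axis range, 1 = top) to a data value so that it sits
# at the same visual height under a given axis transformation.
rel_to_data <- function(rel, range_data,
                        transform = identity, inverse = identity) {
  range_trans <- transform(range_data)
  inverse(range_trans[1] + rel * (range_trans[2] - range_trans[1]))
}

y_range <- c(1, 100)
# Untransformed axis: halfway up is the arithmetic midpoint
y_mid_linear <- rel_to_data(0.5, y_range)
# log10 axis: halfway up is the geometric midpoint
y_mid_log <- rel_to_data(
  0.5, y_range,
  transform = log10, inverse = function(x) 10^x
)
c(linear = y_mid_linear, log10 = y_mid_log)
```

The two data values differ, yet each is drawn halfway up its own axis, which is the sense in which the text position stays consistent.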
p <- p +
scale_y_continuous(
trans = UtilsGGSV::get_trans("asinh")
)
add_text_column(
p = p,
x = data_mod$x,
y = data_mod$y,
text = results_vec,
trans = UtilsGGSV::get_trans("asinh"),
coord = c(0.05, 0.95),
skip = 0.07
)

The plot_cluster_* family of functions helps visualise the
characteristics of clusters identified by an unsupervised learning
method.
The function plot_cluster_heatmap creates a heat map where each tile
shows, for one cluster and one variable, the percentile of the
cluster's median value. The percentile is computed against the ECDF of
that variable across all observations not in the cluster. Clusters and
variables are ordered by
hierarchical clustering.
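
The percentile behind a single tile can be sketched in base R as follows. This is an illustration of the description above, not the package's code; the helper cluster_percentile and the object sketch_data are hypothetical.

```r
set.seed(1)
sketch_data <- data.frame(
  cluster = rep(paste0("C", 1:3), each = 20),
  var1 = c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2))
)

# Hypothetical helper, for illustration only: percentile of one cluster's
# median, evaluated against the ECDF of all observations outside the cluster.
cluster_percentile <- function(data, cluster_col, var, cluster_id) {
  in_cluster <- data[[cluster_col]] == cluster_id
  ecdf_rest <- stats::ecdf(data[[var]][!in_cluster])
  ecdf_rest(stats::median(data[[var]][in_cluster]))
}

# One percentile per cluster for var1
perc_vec <- vapply(
  unique(sketch_data$cluster),
  function(cl) cluster_percentile(sketch_data, "cluster", "var1", cl),
  numeric(1)
)
perc_vec
```

Values near 1 indicate a cluster whose median sits above most other observations, and values near 0 the opposite.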
set.seed(1)
cluster_data <- data.frame(
cluster = rep(paste0("C", 1:3), each = 20),
var1 = c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2)),
var2 = c(rnorm(20, -1), rnorm(20, 1), rnorm(20, 0))
)
plot_cluster_heatmap(cluster_data, cluster = "cluster")

The function plot_cluster_density visualises, for each variable, how
each cluster’s observations are distributed relative to the overall
population. The density argument controls what is shown: "overall"
(default, overall density plus cluster median lines), "cluster" (one
density curve per cluster), or "both" (overall density plus
per-cluster density curves). When showing per-cluster densities, the
scale argument controls scaling: by default ("max_overall") each
cluster density is rescaled so its maximum equals the overall density
maximum.
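
The "max_overall" rescaling described above can be sketched with stats::density. This is a minimal illustration under the assumption that default kernel density estimates are used; the object names are hypothetical.

```r
set.seed(1)
x_all <- c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2))
x_cluster <- x_all[1:20] # observations of the first cluster

dens_overall <- stats::density(x_all)
dens_cluster <- stats::density(x_cluster)

# Rescale the cluster density so its peak matches the overall peak
dens_cluster_scaled <-
  dens_cluster$y * max(dens_overall$y) / max(dens_cluster$y)

c(
  overall_max = max(dens_overall$y),
  cluster_scaled_max = max(dens_cluster_scaled)
)
```

After rescaling, the curve no longer integrates to one; it is meant for visual comparison of shape and location only.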
set.seed(1)
cluster_data <- data.frame(
cluster = rep(paste0("C", 1:3), each = 20),
var1 = c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2)),
var2 = c(rnorm(20, -1), rnorm(20, 1), rnorm(20, 0))
)
# Default: overall density with cluster median lines
plot_cluster_density(cluster_data, cluster = "cluster")
#> $var1
#>
#> $var2
# Both overall and per-cluster densities (scaled to overall maximum)
plot_cluster_density(cluster_data, cluster = "cluster", density = "both")
#> $var1
#>
#> $var2
The function plot_cluster_scatter creates a biaxial scatter plot with
observations coloured by cluster and median centroids overlaid. When
more than two variables are supplied it defaults to a PCA projection.
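
For intuition, a PCA projection with median centroids can be sketched in base R. Centring and scaling the variables is an assumption of this sketch, and the object names are hypothetical; the package may compute these differently.

```r
set.seed(123)
sketch_data <- data.frame(
  cluster = rep(c("A", "B", "C"), each = 20),
  var1 = c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2)),
  var2 = c(rnorm(20, -1), rnorm(20, 1), rnorm(20, 0)),
  var3 = c(rnorm(20, 1), rnorm(20, -1), rnorm(20, 0))
)

# Project the numeric variables onto the first two principal components
# (centring and scaling are assumptions of this sketch)
pca_fit <- stats::prcomp(
  sketch_data[, c("var1", "var2", "var3")],
  center = TRUE, scale. = TRUE
)
proj_tbl <- data.frame(
  cluster = sketch_data$cluster,
  PC1 = pca_fit$x[, 1],
  PC2 = pca_fit$x[, 2]
)

# Median centroid of each cluster in the projected space
centroid_tbl <- stats::aggregate(
  cbind(PC1, PC2) ~ cluster,
  data = proj_tbl, FUN = median
)
centroid_tbl
```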
set.seed(123)
example_data <- data.frame(
cluster = rep(c("A", "B", "C"), each = 20),
var1 = c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2)),
var2 = c(rnorm(20, -1), rnorm(20, 1), rnorm(20, 0)),
var3 = c(rnorm(20, 1), rnorm(20, -1), rnorm(20, 0))
)
# Default: PCA projection (> 2 numeric variables)
plot_cluster_scatter(example_data, cluster = "cluster")
#> dim_red automatically set to 'pca' because more than two numeric variables are available.

Raw variables can also be used directly:
plot_cluster_scatter(
example_data,
cluster = "cluster",
dim_red = "none",
vars = c("var1", "var2")
)

The function plot_cluster_mst computes the minimum-spanning tree (MST)
over clusters, using Euclidean distance between cluster median profiles.
Clusters are laid out in two dimensions via classical multidimensional
scaling (MDS). For each variable, a separate plot is produced in which
each node is filled according to the ECDF-standardised percentile of
that cluster’s median — the same colour scale used by
plot_cluster_heatmap. By default a named list of plots is returned;
supplying n_col or n_row returns a combined cowplot::plot_grid
figure.
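
The tree and layout steps can be sketched in base R. The sketch below uses Prim's algorithm for the MST, which is an assumption made for illustration (the package may use a different algorithm or a dedicated library), and all object names are hypothetical.

```r
set.seed(1)
sketch_data <- data.frame(
  cluster = rep(paste0("C", 1:3), each = 20),
  var1 = c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2)),
  var2 = c(rnorm(20, -1), rnorm(20, 1), rnorm(20, 0))
)

# Median profile of each cluster (one row per cluster)
median_tbl <- stats::aggregate(
  cbind(var1, var2) ~ cluster,
  data = sketch_data, FUN = median
)
median_mat <- as.matrix(median_tbl[, c("var1", "var2")])
rownames(median_mat) <- median_tbl$cluster

# Euclidean distances between cluster median profiles
dist_obj <- stats::dist(median_mat)
dist_mat <- as.matrix(dist_obj)

# Prim's algorithm: grow the tree by repeatedly adding the shortest edge
# that joins a cluster in the tree to a cluster outside it
in_tree <- rownames(dist_mat)[1]
edge_tbl <- data.frame(from = character(0), to = character(0))
while (length(in_tree) < nrow(dist_mat)) {
  out_tree <- setdiff(rownames(dist_mat), in_tree)
  sub_mat <- dist_mat[in_tree, out_tree, drop = FALSE]
  idx <- which(sub_mat == min(sub_mat), arr.ind = TRUE)[1, ]
  edge_tbl <- rbind(
    edge_tbl,
    data.frame(from = in_tree[idx[["row"]]], to = out_tree[idx[["col"]]])
  )
  in_tree <- c(in_tree, out_tree[idx[["col"]]])
}

# Two-dimensional layout of the clusters via classical MDS
layout_mat <- stats::cmdscale(dist_obj, k = 2)
edge_tbl
```

A tree over k clusters always has k - 1 edges, so edge_tbl has two rows here; layout_mat gives the node coordinates at which those edges would be drawn.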
set.seed(1)
cluster_data <- data.frame(
cluster = rep(paste0("C", 1:3), each = 20),
var1 = c(rnorm(20, 2), rnorm(20, 0), rnorm(20, -2)),
var2 = c(rnorm(20, -1), rnorm(20, 1), rnorm(20, 0))
)
# Default: returns a named list of plots, one per variable
plot_list <- plot_cluster_mst(cluster_data, cluster = "cluster")
plot_list[["var1"]]

Combine into a grid with variable-name labels:
plot_cluster_mst(cluster_data, cluster = "cluster", n_col = 2)

The utility function get_trans returns trans objects (as implemented
by the scales package) when given the name of a transformation as a
character string. It also provides higher roots (such as cube and
fourth roots) and the asinh transformation.
get_trans("log10")
#> Transformer: log-10 [1e-100, Inf]
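
To illustrate what such additional transformations might look like, the sketch below builds asinh and cube-root transformations with scales::trans_new. The object names are hypothetical and the definitions are assumptions; UtilsGGSV's own versions may differ, for example in their breaks or domain.

```r
library(scales)

# Illustrative only: an asinh transformation and a sign-preserving cube root
asinh_trans_sketch <- trans_new(
  name = "asinh_sketch",
  transform = asinh,
  inverse = sinh
)
cube_root_trans_sketch <- trans_new(
  name = "cube_root_sketch",
  transform = function(x) sign(x) * abs(x)^(1 / 3),
  inverse = function(x) x^3
)

# Round trip through the asinh transformation
asinh_round_trip <- asinh_trans_sketch$inverse(asinh_trans_sketch$transform(5))
# Cube root of a negative value keeps its sign
cube_root_neg <- cube_root_trans_sketch$transform(-8)
c(asinh_round_trip, cube_root_neg)
```

Objects of this kind can be passed to the trans argument of ggplot2 scale functions, in the same way as the object returned by get_trans.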

















