
Commit ee12ded

updated README
1 parent ab7b390 commit ee12ded

2 files changed: 104 additions & 92 deletions

File tree

README.md

Lines changed: 88 additions & 63 deletions
@@ -14,8 +14,12 @@ Brusco et al. (2020;
 <a href="https://doi.org/10.1111/bmsp.12186" class="uri">https://doi.org/10.1111/bmsp.12186</a>),
 Papenberg (2024;
 <a href="https://doi.org/10.1111/bmsp.12315" class="uri">https://doi.org/10.1111/bmsp.12315</a>),
-and Papenberg et al. (2025;
-<a href="https://doi.org/10.1101/2025.03.03.641320" class="uri">https://doi.org/10.1101/2025.03.03.641320</a>).
+Papenberg, Wang, et al. (2025;
+<a href="https://doi.org/10.1016/j.crmeth.2025.101137" class="uri">https://doi.org/10.1016/j.crmeth.2025.101137</a>),
+Papenberg, Breuer, et al. (2025;
+<a href="https://doi.org/10.1017/psy.2025.10052" class="uri">https://doi.org/10.1017/psy.2025.10052</a>),
+and Yang et al. (2022;
+<a href="https://doi.org/10.1016/j.ejor.2022.02.003" class="uri">https://doi.org/10.1016/j.ejor.2022.02.003</a>).
 
 Installation
 ------------
@@ -73,20 +77,19 @@ This README contains some basic information on the `R` package
 [Preprint](https://doi.org/10.31234/osf.io/dhzrc)).
 - A new paper describes the must-link feature and provides
 additional comparisons to alternative methods, focusing on
-categorical variables (Papenberg et al., 2025;
-<a href="https://doi.org/10.1101/2025.03.03.641320" class="uri">https://doi.org/10.1101/2025.03.03.641320</a>).
+categorical variables (Papenberg, Wang, et al., 2025;
+<a href="https://doi.org/10.1016/j.crmeth.2025.101137" class="uri">https://doi.org/10.1016/j.crmeth.2025.101137</a>).
+- Another new paper describes several new algorithms for
+anticlustering and the cannot-link feature (Papenberg, Breuer,
+et al., 2025;
+<a href="https://doi.org/10.1017/psy.2025.10052" class="uri">https://doi.org/10.1017/psy.2025.10052</a>).
 - The R documentation of the main functions is actually quite rich
 and up to date, so you should definitely check that out when
 using the `anticlust` package. The most important background is
 provided in `?anticlustering`.
-- A [video](https://youtu.be/YGrhSmi1oA8) is available in German
-language where I illustrate the main functionalities of the
-`anticlustering()` function. My plan is to make a similar video in
-English in the future.
 - The [package website](https://m-py.github.io/anticlust/) contains
-all documentation as a convenient website. At the current time, the
-website also has four package vignettes, while additional vignettes
-are planned.
+all documentation as a convenient website. Also check out the
+vignettes on that website.
 
 A quick start
 -------------
@@ -101,22 +104,23 @@ First, load the package via
 Call the `anticlustering()` method:
 
 anticlusters <- anticlustering(
-  iris[, -5],
+  iris,
   K = 5,
   objective = "kplus",
   method = "local-maximum",
-  repetitions = 10
+  repetitions = 10,
+  standardize = TRUE
 )
 
 The output is a vector that assigns a group (i.e., a number between 1 and
 `K`) to each input element:
 
 anticlusters
-#> [1] 1 2 4 5 3 4 2 3 2 2 1 5 1 2 4 1 2 3 2 5 1 5 4 5 1 1 3 4 5 5 5 4 5 2 1 1 3
-#> [38] 4 3 3 4 2 3 5 2 5 3 4 3 1 2 2 5 1 2 3 3 4 4 1 5 1 2 3 3 1 2 4 4 4 4 1 3 4
-#> [75] 2 4 5 2 5 2 3 3 1 5 4 1 5 3 2 1 2 5 3 4 1 4 1 2 4 5 2 2 3 1 4 1 3 4 4 5 3
-#> [112] 2 3 1 5 2 5 3 1 5 4 1 2 5 1 2 3 1 3 3 5 1 2 5 5 4 3 5 4 3 5 5 1 4 4 1 3 4
-#> [149] 2 2
+#> [1] 1 3 4 2 1 5 5 3 4 1 2 3 2 2 2 3 1 5 3 2 3 5 1 2 1 5 4 3 4 3 5 2 4 4 2 3 5
+#> [38] 1 4 4 5 5 1 1 5 4 4 3 1 2 2 4 5 1 3 2 4 4 4 3 1 1 5 5 3 1 1 5 2 1 2 4 1 5
+#> [75] 3 3 1 1 4 2 4 3 3 3 2 3 2 4 4 2 5 4 1 5 2 5 3 5 5 2 3 3 5 5 1 4 3 4 1 5 1
+#> [112] 4 4 2 4 2 2 3 3 2 5 1 5 3 4 1 5 5 4 1 3 2 1 3 2 2 2 1 2 4 1 1 3 2 5 3 4 5
+#> [149] 5 4
 
 By default, each group has the same number of elements (but the argument
 `K` can be adjusted to request different group sizes):
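Assembled from the added lines of the hunk above, the updated quick-start call reads as follows. This is a sketch: it assumes an `anticlust` version recent enough to accept the `standardize` argument and a data frame containing a factor column, as the commit implies.

```r
library("anticlust")

anticlusters <- anticlustering(
  iris,                      # full data set, now including the Species factor
  K = 5,                     # number of (equal-sized) groups
  objective = "kplus",       # k-plus anticlustering objective
  method = "local-maximum",  # exchange search until a local maximum is reached
  repetitions = 10,          # restart the search 10 times, keep the best result
  standardize = TRUE         # standardize variables before optimization
)

# By default the 150 elements are split into 5 equal-sized groups of 30:
table(anticlusters)
```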
@@ -144,38 +148,82 @@ groups to find out if the five groups are similar to each other:
 <tbody>
 <tr class="odd">
 <td style="text-align: left;">1</td>
-<td style="text-align: left;">5.84 (0.84)</td>
+<td style="text-align: left;">5.85 (0.84)</td>
 <td style="text-align: left;">3.06 (0.44)</td>
-<td style="text-align: left;">3.76 (1.79)</td>
+<td style="text-align: left;">3.76 (1.78)</td>
 <td style="text-align: left;">1.20 (0.77)</td>
 </tr>
 <tr class="even">
 <td style="text-align: left;">2</td>
 <td style="text-align: left;">5.84 (0.84)</td>
-<td style="text-align: left;">3.06 (0.45)</td>
-<td style="text-align: left;">3.76 (1.79)</td>
+<td style="text-align: left;">3.06 (0.44)</td>
+<td style="text-align: left;">3.77 (1.79)</td>
 <td style="text-align: left;">1.20 (0.77)</td>
 </tr>
 <tr class="odd">
 <td style="text-align: left;">3</td>
 <td style="text-align: left;">5.84 (0.84)</td>
 <td style="text-align: left;">3.06 (0.44)</td>
-<td style="text-align: left;">3.75 (1.79)</td>
+<td style="text-align: left;">3.76 (1.79)</td>
 <td style="text-align: left;">1.20 (0.77)</td>
 </tr>
 <tr class="even">
 <td style="text-align: left;">4</td>
-<td style="text-align: left;">5.85 (0.84)</td>
-<td style="text-align: left;">3.05 (0.45)</td>
-<td style="text-align: left;">3.76 (1.79)</td>
-<td style="text-align: left;">1.21 (0.77)</td>
+<td style="text-align: left;">5.84 (0.84)</td>
+<td style="text-align: left;">3.06 (0.44)</td>
+<td style="text-align: left;">3.75 (1.79)</td>
+<td style="text-align: left;">1.19 (0.77)</td>
 </tr>
 <tr class="odd">
 <td style="text-align: left;">5</td>
-<td style="text-align: left;">5.84 (0.84)</td>
+<td style="text-align: left;">5.85 (0.84)</td>
 <td style="text-align: left;">3.06 (0.44)</td>
-<td style="text-align: left;">3.76 (1.79)</td>
-<td style="text-align: left;">1.19 (0.78)</td>
+<td style="text-align: left;">3.75 (1.79)</td>
+<td style="text-align: left;">1.20 (0.77)</td>
+</tr>
+</tbody>
+</table>
+
+We can also verify that the species of plants (a categorical feature) is
+evenly distributed among groups:
+
+knitr::kable(table(iris[, 5], anticlusters), row.names = TRUE)
+
+<table>
+<thead>
+<tr class="header">
+<th style="text-align: left;"></th>
+<th style="text-align: right;">1</th>
+<th style="text-align: right;">2</th>
+<th style="text-align: right;">3</th>
+<th style="text-align: right;">4</th>
+<th style="text-align: right;">5</th>
+</tr>
+</thead>
+<tbody>
+<tr class="odd">
+<td style="text-align: left;">setosa</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">versicolor</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+</tr>
+<tr class="odd">
+<td style="text-align: left;">virginica</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
+<td style="text-align: right;">10</td>
 </tr>
 </tbody>
 </table>
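For readers who want to reproduce the two tables in the hunk above outside of a knitr document, the underlying calls are as follows (a sketch, reusing the `anticlusters` vector produced by the quick-start chunk):

```r
# Means and standard deviations of the numeric variables per group,
# as rendered in the first table:
mean_sd_tab(iris[, -5], anticlusters)

# Species counts per group, as rendered in the second table; the
# updated README reports an exact 10/10/10/10/10 split per species:
table(iris[, 5], anticlusters)
```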
@@ -184,15 +232,16 @@ As illustrated in the example, we can use the function
 `anticlustering()` to create similar groups of plants. In this case
 “similar” primarily means that the means and standard deviations (in
 parentheses) of the variables are pretty much the same across the five
-groups. The function `anticlustering()` takes as input a data table
-describing the elements that should be assigned to sets. In the data
-table, each row represents an element (here a plant, but it can be
-anything; for example a person, word, or a photo). Each column is a
-numeric variable describing one of the elements’ features. The number of
-groups is specified through the argument `K`. The argument `objective`
-specifies how between-group similarity is quantified; the argument
-`method` specifies the algorithm by which this measure is optimized. See
-the documentation `?anticlustering` for more details.
+groups, and that the species category was evenly assigned to groups. The
+function `anticlustering()` takes as input a data table describing the
+elements that should be assigned to sets. In the data table, each row
+represents an element (here a plant, but it can be anything; for example
+a person, word, or a photo). Each column is a numeric variable
+describing one of the elements’ features. The number of groups is
+specified through the argument `K`. The argument `objective` specifies
+how between-group similarity is quantified; the argument `method`
+specifies the algorithm by which this measure is optimized. See the
+documentation `?anticlustering` for more details.
 
 Five anticlustering objectives are natively supported in
 `anticlustering()`:
@@ -216,30 +265,6 @@ and the references therein. It is also possible to optimize user-defined
 objectives, which is also described in the documentation
 (`?anticlustering`).
 
-Categorical variables
----------------------
-
-Sometimes, it is required that sets are not only similar with regard to
-some numeric variables, but we also want to ensure that each set
-contains an equal number of elements of a certain category. Coming back
-to the initial iris data set, we may want to require that each set has a
-balanced number of plants of the three iris species. To this end, we can
-use the argument `categories` as follows:
-
-anticlusters <- anticlustering(
-  iris[, -5],
-  K = 3,
-  categories = iris$Species
-)
-
-## The species are as balanced as possible across anticlusters:
-table(anticlusters, iris$Species)
-#>
-#> anticlusters setosa versicolor virginica
-#>   1              17         17        16
-#>   2              17         16        17
-#>   3              16         17        17
-
 Matching and clustering
 -----------------------
 
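The hunk above deletes the "Categorical variables" section from the README. For reference, the removed lines reassemble into the following runnable snippet; whether the `categories` argument remains supported after this commit is not stated in the diff, so treat this as documentation of the old example rather than of the current API.

```r
library("anticlust")

# The example removed by this commit: balance the three iris species
# across K = 3 anticlusters via the `categories` argument.
anticlusters <- anticlustering(
  iris[, -5],
  K = 3,
  categories = iris$Species
)

# The species are as balanced as possible across anticlusters
# (the old README showed counts of 16-17 per cell):
table(anticlusters, iris$Species)
```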
inst/README.Rmd

Lines changed: 16 additions & 29 deletions
@@ -9,7 +9,7 @@ output:
 
 # anticlust <a href='https://m-py.github.io/anticlust/'><img src='man/figures/anticlustStickerV1-0.svg' style="float:right; height:160px" /></a>
 
-Anticlustering partitions a pool of elements into clusters (or *anticlusters*) with the goal of achieving high between-cluster similarity and high within-cluster heterogeneity. This is accomplished by maximizing instead of minimizing a clustering objective function, such as the intra-cluster variance (used in k-means clustering) or the sum of pairwise distances within clusters. The package `anticlust` implements anticlustering methods as described in Papenberg and Klau (2021; https://doi.org/10.1037/met0000301), Brusco et al. (2020; https://doi.org/10.1111/bmsp.12186), Papenberg (2024; https://doi.org/10.1111/bmsp.12315), and Papenberg et al. (2025; https://doi.org/10.1101/2025.03.03.641320).
+Anticlustering partitions a pool of elements into clusters (or *anticlusters*) with the goal of achieving high between-cluster similarity and high within-cluster heterogeneity. This is accomplished by maximizing instead of minimizing a clustering objective function, such as the intra-cluster variance (used in k-means clustering) or the sum of pairwise distances within clusters. The package `anticlust` implements anticlustering methods as described in Papenberg and Klau (2021; https://doi.org/10.1037/met0000301), Brusco et al. (2020; https://doi.org/10.1111/bmsp.12186), Papenberg (2024; https://doi.org/10.1111/bmsp.12315), Papenberg, Wang, et al. (2025; https://doi.org/10.1016/j.crmeth.2025.101137), Papenberg, Breuer, et al. (2025; https://doi.org/10.1017/psy.2025.10052), and Yang et al. (2022; https://doi.org/10.1016/j.ejor.2022.02.003).
 
 ```{r setup, include = FALSE}
 library("anticlust")
@@ -68,13 +68,13 @@ Another great way of showing your appreciation of `anticlust` is to leave a star
 
 This README contains some basic information on the `R` package `anticlust`. More information is available via the following sources:
 
-- Up until now, we published 3 papers describing the theoretical background of `anticlust`.
+- Up until now, we have published 4 papers describing the theoretical background of `anticlust`.
   * The initial presentation of the `anticlust` package is given in Papenberg and Klau (2021) (https://doi.org/10.1037/met0000301; [Preprint](https://doi.org/10.31234/osf.io/7jw6v)).
   * The k-plus anticlustering method is described in Papenberg (2024) (https://doi.org/10.1111/bmsp.12315; [Preprint](https://doi.org/10.31234/osf.io/dhzrc)).
-  * A new paper describes the must-link feature and provides additional comparisons to alternative methods, focusing on categorical variables (Papenberg et al., 2025; https://doi.org/10.1101/2025.03.03.641320).
+  * A new paper describes the must-link feature and provides additional comparisons to alternative methods, focusing on categorical variables (Papenberg, Wang, et al., 2025; https://doi.org/10.1016/j.crmeth.2025.101137).
+  * Another new paper describes several new algorithms for anticlustering and the cannot-link feature (Papenberg, Breuer, et al., 2025; https://doi.org/10.1017/psy.2025.10052).
 - The R documentation of the main functions is actually quite rich and up to date, so you should definitely check that out when using the `anticlust` package. The most important background is provided in `?anticlustering`.
-- A [video](https://youtu.be/YGrhSmi1oA8) is available in German language where I illustrate the main functionalities of the `anticlustering()` function. My plan is to make a similar video in English in the future.
-- The [package website](https://m-py.github.io/anticlust/) contains all documentation as a convenient website. At the current time, the website also has four package vignettes, while additional vignettes are planned.
+- The [package website](https://m-py.github.io/anticlust/) contains all documentation as a convenient website. Also check out the vignettes on that website.
 
 ## A quick start
 
@@ -89,11 +89,12 @@ Call the `anticlustering()` method:
 
 ```{r}
 anticlusters <- anticlustering(
-  iris[, -5],
+  iris,
   K = 5,
   objective = "kplus",
   method = "local-maximum",
-  repetitions = 10
+  repetitions = 10,
+  standardize = TRUE
 )
 ```
 
@@ -115,7 +116,14 @@ Last, let's compare the features' means and standard deviations across groups to
 knitr::kable(mean_sd_tab(iris[, -5], anticlusters), row.names = TRUE)
 ```
 
-As illustrated in the example, we can use the function `anticlustering()` to create similar groups of plants. In this case "similar" primarily means that the means and standard deviations (in parentheses) of the variables are pretty much the same across the five groups. The function `anticlustering()` takes as input a data table describing the elements that should be assigned to sets. In the data table, each row represents an element (here a plant, but it can be anything; for example a person, word, or a photo). Each column is a numeric variable describing one of the elements' features. The number of groups is specified through the argument `K`. The argument `objective` specifies how between-group similarity is quantified; the argument `method` specifies the algorithm by which this measure is optimized. See the documentation `?anticlustering` for more details.
+We can also verify that the species of plants (a categorical feature) is evenly distributed among groups:
+
+
+```{r}
+knitr::kable(table(iris[, 5], anticlusters), row.names = TRUE)
+```
+
+As illustrated in the example, we can use the function `anticlustering()` to create similar groups of plants. In this case "similar" primarily means that the means and standard deviations (in parentheses) of the variables are pretty much the same across the five groups, and that the species was evenly assigned to groups. The function `anticlustering()` takes as input a data table describing the elements that should be assigned to sets. In the data table, each row represents an element (here a plant, but it can be anything; for example a person, word, or a photo). Each column is a numeric variable describing one of the elements' features. The number of groups is specified through the argument `K`. The argument `objective` specifies how between-group similarity is quantified; the argument `method` specifies the algorithm by which this measure is optimized. See the documentation `?anticlustering` for more details.
 
 Five anticlustering objectives are natively supported in `anticlustering()`:
 
@@ -127,27 +135,6 @@ Five anticlustering objectives are natively supported in `anticlustering()`:
 
 The anticlustering objectives are described in detail in the documentation (`?anticlustering`, `?diversity_objective`, `?variance_objective`, `?kplus_anticlustering`, `?dispersion_objective`) and the references therein. It is also possible to optimize user-defined objectives, which is also described in the documentation (`?anticlustering`).
 
-## Categorical variables
-
-Sometimes, it is required that sets are not only similar with regard to
-some numeric variables, but we also want to ensure that each set
-contains an equal number of elements of a certain category. Coming back
-to the initial iris data set, we may want to require that each set has a
-balanced number of plants of the three iris species. To this end, we can
-use the argument `categories` as follows:
-
-```{r}
-anticlusters <- anticlustering(
-  iris[, -5],
-  K = 3,
-  categories = iris$Species
-)
-
-## The species are as balanced as possible across anticlusters:
-table(anticlusters, iris$Species)
-
-```
-
 ## Matching and clustering
 
 Anticlustering creates sets of dissimilar elements; the heterogeneity within anticlusters is maximized. This is the opposite of clustering problems that strive for high within-cluster similarity and good separation between clusters. The `anticlust` package also provides functions for "classical" clustering applications: `balanced_clustering()` creates sets of elements that are similar while ensuring that clusters are of equal size. This is an example:

0 commit comments