2020-01-28
Major update. Many changes follow from the philosophy that future
maintainability should be maximized. This mostly means simplifying the
exported functions, for example reducing the number of exported functions
and the number of arguments each takes. I took the liberty of introducing
some major changes to the existing functions; this version will be
submitted to CRAN, after which such breaking changes will no longer
occur. Other changes bring new possibilities for stimulus selection in
experimental psychology (in particular, the new function `matching()`).
Major
- A new exported function: `matching()` finds groups of similar elements.
  Internally, `matching()` calls the same clustering algorithm as
  `balanced_clustering()`, but it differs with regard to the arguments: in
  `matching()`, you specify the size of the groups rather than the number
  of groups. Moreover, `matching()` offers some unique functionality:
  - Conduct K-partite matching, that is: include a grouping vector (via
    the argument `match_between`) and matches are selected between
    elements of different groups
  - Only match elements that are part of the same group (this categorical
    constraint is induced through the argument `match_within`)
  - The matches returned by `matching()` are numbered by similarity; the
    group identified by `1` has the lowest sum of distances between
    matched elements, etc.
  - In `balanced_clustering()`, nearest neighbours are first sought for
    extreme elements that are far away from the cluster centroid. In
    `matching()`, it is possible to start searching for matches at the
    center (by setting `match_extreme_first` to `FALSE`)
  - When a grouping restriction via `match_between` is included, it is
    possible to further specify how matches are selected through the
    argument `target_group`. When specifying `"none"`, matches are always
    selected for extreme elements first (or central elements first when
    `match_extreme_first = FALSE`). With option `"smallest"` (the
    default), matches are selected from the smallest group specified in
    `match_between`. With option `"diverse"`, matches are selected from
    the group having the largest variance.
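As a sketch of the new interface (a minimal example; the group size is
passed via the argument `p` according to the description above, but check
`?matching` for the exact signature):

``` r
library(anticlust)

# Find triplets of similar iris plants (groups of size 3):
triplets <- matching(iris[, -5], p = 3)

# 2-partite matching: match each setosa plant to a similar
# versicolor plant (grouping vector passed via match_between):
first_two <- iris[1:100, ]
pairs <- matching(
  first_two[, -5],
  match_between = first_two$Species
)

# The group identified by 1 contains the most similar match:
first_two[which(pairs == 1), ]
```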
- All clustering and anticlustering functions now only take one data
  argument (called `x`). Internally, `anticlust` tests whether the input
  is a distance matrix or not. This is a major improvement for the
  function interfaces.
- The argument `iv` was removed from the function `anticlustering()` as
  it does not fit the anticlustering semantics. Its functionality is
  available in other functions (in particular: `matching()`).
- The random sampling method for anticlustering was removed. It was an
  unnecessary burden for the `anticlust` code base as it is much less
  performant than the exchange method. (This implies that the
  `anticlustering()` function no longer has an argument `nrep`.)
- The functions `initialize_K()` and `generate_exchange_partners()` were
  removed. They did not fit into the current design philosophy of the
  project (as soon as `anticlust` hits CRAN, I will no longer remove
  exported functions).
- It is no longer possible to pass a cluster assignment to
  `anticlustering()` via the argument `K` that contains some `NA`. This
  functionality is now obsolete because the function `matching()` exists
  for subset selection (see the package vignette).
- Alas, support was dropped for the commercial linear programming solvers
  that hinder compatibility with CRAN checks.
  - Using the commercial solvers to solve anticluster editing or cluster
    editing is still possible via version 0.3.0, which is tagged on
    Github
  - Install version 0.3.0 through
    `remotes::install_github("m-Py/anticlust", ref = "v0.3.0")`
Minor
- A new exported convenience function: `plot_similarity()` plots the sum
  of pairwise distances by cluster to find out which clusters are highly
  similar
- The function `mean_sd_obj()` no longer computes the discrepancy in
  medians, only in means and standard deviations (as the name would also
  suggest)
- Function `plot_clusters()`:
  - removed arguments `col` and `pch`
  - renamed argument `clustering` to `clusters`
- Function `generate_partitions()`:
  - changed order of the arguments `N` and `K` (now consistent with
    `n_partitions()`)
- Function `balanced_clustering()`:
  - the default heuristic option is now called `"centroid"`
2019-10-30
This is a rather big update including several changes that may break code used with earlier versions.
- The package title was changed to »anticlust: Subset partitioning via
  anticlustering«
- The arguments `parallelize`, `seed` and `standardize` were removed from
  the function `anticlustering()`. Parallelization was used to speed up
  the random sampling method, but as the exchange algorithm has proven to
  be much better---for which reason I do not intend to extend the code
  for the random sampling method---having options for parallelization is
  an unnecessary burden in my code base. Consequently, the option `seed`
  is also no longer needed, as it was only used to ensure reproducibility
  when using the parallel option. The argument `standardize` has also
  been removed. Users now have to call the function `scale()` on the
  input manually before calling the `anticlustering()` function if they
  wish to standardize the input. The philosophy behind these changes is
  that the main function of the `anticlust` package should not have too
  many arguments, making it easier to use. I also expect that this makes
  it easier to adapt the code base in the future.
- Minor change: The argument `preclustering` now only accepts
  `TRUE`/`FALSE`, as it used to before (see the discussion here). To
  induce customized clustering constraints, we can still use the argument
  `categories`. In the following example, we restrict the exchange
  partners for each element to four similar elements by creating clusters
  of size 5:
``` r
anticlustering(
  iris[, -5],
  K = 3,
  categories = balanced_clustering(iris[, -5], K = 5)
)
```

- An enhancement was added to the `balanced_clustering()` function: the
  heuristic algorithm to create equal-sized clusters was replaced by one
  that is much faster and better (courteously contributed by Meik
  Michalke). This method is also called when the argument `preclustering`
  is `TRUE` in a call to the `anticlustering()` function (unless the
  argument `method` is `"ilp"`, in which case an exact clustering
  algorithm is called).
- A new function is available: `n_partitions()` computes the number of
  equal-sized partitions for given N and K.
- A new function is available: `generate_exchange_partners()` can be used
  to group elements that serve as exchange partners when using one of the
  functions `anticlustering()` or `fast_anticlustering()`. This is useful
  because users can now easily speed up anticlustering computations by
  reducing the number of exchange partners. The function also makes it
  possible to combine categorical and preclustering restrictions, see
  `?generate_exchange_partners()` (it is still not possible to combine
  `preclustering = TRUE` and the `categories` argument). In the following
  example, we restrict the exchange partners to four random elements
  within the same species:
``` r
anticlustering(
  iris[, -5],
  K = 3,
  categories = generate_exchange_partners(
    categories = iris[, 5],
    p = 4
  )
)
```

Note that the function `generate_exchange_partners()` takes as argument
the number of exchange partners `p`; this is different from the
anticlustering and clustering functions that take as input the size of
the expected cluster `K`. Hence, `K = 5` and `p = 4` will lead to the
same number of exchange partners.[^1] In another example, we restrict the
exchange partners to four similar elements within the same species (here,
balanced input is necessary again because internally, clusters of equal
size are computed using the function `balanced_clustering()` on each
category separately):
``` r
anticlustering(
  iris[, -5],
  K = 3,
  categories = generate_exchange_partners(
    features = iris[, -5],
    categories = iris[, 5],
    p = 4,
    similar = TRUE
  )
)
```

Moreover, several internal changes were implemented in the code base to
enhance future maintainability. For example, to differentiate between
distance and feature input, custom S3 classes are added to the data
matrices. For anyone inspecting the source code and wondering why I use
customized classes instead of just using the class `dist` for distance
input: usually, I want a complete matrix (having upper and lower
triangular parts) and not a reduced matrix of class `dist`. Also, it is
not really more effort to add a custom S3 class.
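The count that `n_partitions()` computes follows the multinomial
coefficient for splitting N elements into K unordered groups of size N/K.
A minimal base-R sketch of the underlying formula (an illustration, not
the package internals):

``` r
# Number of ways to partition N elements into K equal-sized,
# unlabeled groups: N! / ((N/K)!^K * K!)
n_equal_partitions <- function(N, K) {
  factorial(N) / (factorial(N / K)^K * factorial(K))
}

n_equal_partitions(6, 2)  # 10 ways to split 6 elements into 2 triplets
```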
2019-09-17
- Fix: Random assignment under categorical restrictions failed if a
  category only had one member, because in this case `sample()` was
  called with a vector input of length 1; e.g., `sample(9)` actually
  returns `sample(1:9)` but should return 9 in this case. This was fixed
  with f53bf4e.
- There is no longer any error tolerance with the CPLEX solver;
  previously there was an optimality gap of 1e-11, but now it is 0.
- There is a new exported function: `wce()`. It corresponds to optimal
  weighted cluster editing without group size restrictions. Optimality
  means: an ILP solver is required to use this function. Note that the
  (only) argument `weights` is a similarity matrix, i.e., larger values
  indicate stronger agreement between elements. This is unlike the other
  functions, but common for cluster editing.
2019-07-23
Internal change: Optimizing the exchange method with the default distance
objective is now much faster. This is accomplished by only updating the
sum of distances after each exchange, instead of recomputing all
distances (see d51e59d).

This example illustrates the run time improvement:

``` r
# For N = 20 to N = 300, test run time for old and new
# optimization of the distance criterion:
n <- seq(20, 300, by = 20)
times <- matrix(nrow = length(n), ncol = 4)
times[, 1] <- n
colnames(times) <- c("n", "old_features_input", "old_distance_input", "new_distance")
for (i in seq_along(n)) {
  # Simulate 2 features as input data
  data <- matrix(rnorm(n[i] * 2), ncol = 2)
  ## Old version: feature table as input
  start <- Sys.time()
  ac1 <- anticlustering(
    data,
    K = rep_len(1:2, nrow(data)),
    objective = anticlust:::obj_value_distance
  )
  times[i, "old_features_input"] <- difftime(Sys.time(), start, units = "s")
  ## Old version: distance matrix as input
  start <- Sys.time()
  ac2 <- anticlustering(
    dist(data),
    K = rep_len(1:2, nrow(data)),
    objective = anticlust:::distance_objective_
  )
  times[i, "old_distance_input"] <- difftime(Sys.time(), start, units = "s")
  ## New version: fast update of the distance criterion
  start <- Sys.time()
  ac3 <- anticlustering(
    data,
    K = rep_len(1:2, nrow(data)),
    objective = "distance"
  )
  times[i, "new_distance"] <- difftime(Sys.time(), start, units = "s")
  ## Ensure that all methods produce the same output
  stopifnot(all(ac1 == ac2))
  stopifnot(all(ac1 == ac3))
}
round(times, 2)

#        n old_features_input old_distance_input new_distance
#  [1,]  20               0.08               0.12         0.01
#  [2,]  40               0.26               0.50         0.03
#  [3,]  60               0.72               1.36         0.10
#  [4,]  80               1.07               2.62         0.22
#  [5,] 100               1.81               4.98         0.46
#  [6,] 120               3.84              11.17         0.82
#  [7,] 140               3.72              13.17         1.33
#  [8,] 160               5.20              20.65         2.16
#  [9,] 180               7.30              31.48         2.44
# [10,] 200               8.63              37.96         3.38
# [11,] 220              10.97              53.26         4.80
# [12,] 240              13.78              74.17         6.66
# [13,] 260              17.49             106.81         8.43
# [14,] 280              20.40             149.38        12.23
# [15,] 300              27.21             178.46        15.20
```

As shown in the code and in the output table, two different
objective functions could be called when the exchange algorithm was
employed, depending on the input: when a feature table was passed, the
internal function `anticlust:::obj_value_distance` was called in each
iteration of the exchange algorithm; when a distance matrix was passed,
the internal function `anticlust:::distance_objective_` was called
instead. The former function computes all between-element distances
within each set and returns their sum (using the R functions `by`,
`dist`, `sapply` and `sum`). The latter stores all distances and indexes
the relevant distances to return their sum. Interestingly, this indexing
approach was a lot slower than recomputing all distances in every
iteration of the exchange algorithm.

In the new version, there is no longer a difference between feature and
distance input; in both cases, the sum of distances is updated based on
only the relevant columns/rows of a distance matrix (that is, in each
iteration of the exchange method, 4 rows/columns need to be inspected,
independent of N). The new approach is a lot faster and especially
beneficial when a distance matrix is passed as input.
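The idea behind the speedup can be sketched as follows (an illustration
of the principle, not the actual `anticlust` internals): when elements
`i` and `j` are swapped between two clusters, only distances involving
`i` and `j` change, so the objective can be updated in O(N) per swap
instead of recomputed in O(N^2):

``` r
# d: full N x N distance matrix; cl: cluster assignment vector.
# Change in the sum of within-cluster distances when swapping
# element i (in cluster cl[i]) with element j (in cluster cl[j]):
delta_swap <- function(d, cl, i, j) {
  same_i <- cl == cl[i]
  same_j <- cl == cl[j]
  same_i[c(i, j)] <- FALSE
  same_j[c(i, j)] <- FALSE
  # distances gained after the swap minus distances lost before it:
  (sum(d[i, same_j]) + sum(d[j, same_i])) -
    (sum(d[i, same_i]) + sum(d[j, same_j]))
}
```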
2019-07-22
New feature:
- There is now an argument `iv` for the function `anticlustering()`. It
  can be used for »min-max anticlustering«; `iv` contains numeric
  features (a vector, matrix, or data frame) whose values are made
  dissimilar between sets -- as opposed to the usual anticlustering where
  all features are made similar. See `?schaper2019` for an example.
2019-07-18
New features:
- In the `anticlustering()` function, the argument `K` can now be a
  vector that serves as the initialization of the anticlusters (this
  functionality is only available when `method = "exchange"`).
  - Subset selection is now possible. Subset selection means that not all
    input items are assigned to a set; instead, a subset of the total
    input is selected and assigned to the different sets. This
    functionality is enabled by passing an initial cluster assignment via
    the argument `K` that contains some `NA`. (For example, if N = 50 and
    two sets of 20 items should be created, the argument `K` will contain
    10 elements that are `NA`, 20 times 1, and 20 times 2.)
  - By using a customized `K` as input, it is now also possible to create
    sets of different size (e.g., `K = c(1, 1, 1, 1, 2, 2)`)
  - The function `initialize_K()` can be used to generate initial
    anticluster assignments in a user-friendly way. The documentation of
    `initialize_K()` contains example code showing how to conduct subset
    selection and anticlustering with different set sizes.
- A new objective function was added: `mean_sd_obj()`. Maximizing this
  objective will simply make all sets similar with regard to the mean,
  median, and standard deviation of all input features.
Internal changes:
- Major internal restructuring to improve the expected maintainability in
  the future. In the last weeks, a lot of features were added to
  `anticlust`, and a restructuring seemed necessary.
- Many test cases were added to test the features that were added in the
  previous weeks.
- To accommodate the possibility that the argument `K` contains `NA`, the
  objective function to be optimized is restructured internally. In
  particular, before the objective is computed, all cases are removed
  where the cluster is `NA` (i.e., cases that are currently not assigned
  to any set). This also works for user-defined objective functions, so
  users do not need to deal with the handling of `NA` themselves.
2019-07-09
- A bug was fixed that led to an incorrect computation of the objective
  function for anticluster editing when employing the exchange method
  (see 243ca64). Tests show that the exchange method now outperforms
  random sampling for anticluster editing (as well as for k-means
  anticlustering). Therefore, the exchange method is now the default
  method (see b101073).
- The fast exchange method is now used when optimizing the variance
  criterion in a call to `anticlustering()`. This improves run time by a
  large margin for this important application. See 2f47fea.
- Two changes with regard to the functionality of arguments in
  `anticlustering()`:
  - The argument `objective` can now take a function as input. The passed
    function has to take two arguments: the first is a cluster assignment
    vector (such as returned by `anticlustering()`); the second is the
    data the objective is computed on (e.g., an N x M matrix where rows
    are elements and columns are features). Larger return values must
    indicate a better objective, as the objective is maximized by the
    existing methods (exchange method and random sampling). This
    functionality makes it possible for users to implement their own
    operationalization of set similarity.
  - The argument `preclustering` can now take a preclustering vector as
    input, not only `TRUE` or `FALSE` (in the former case, the
    preclustering vector was computed within the `anticlustering()`
    function). This allows for more flexibility in combining
    preclustering and anticlustering methods. For example, it is now
    possible to conduct optimal preclustering using integer linear
    programming with the function `balanced_clustering()` and then use a
    heuristic anticlustering method that incorporates this preclustering.
  - Both of these changes have not yet been added to the function
    documentation as they require some more testing.
- The `fast_anticlustering()` function has been documented more
  thoroughly, and parts of the `anticlustering()` docs have been reworked
  (now advocating the exchange method as the preferable option).
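For illustration, a user-defined objective might look like this (a
hypothetical example following the argument order described above,
clusters first, then data; larger values must indicate better solutions):

``` r
# Make the feature means of all sets as similar as possible by
# returning the negative range of set means, summed over features:
similar_means <- function(clusters, features) {
  features <- as.matrix(features)
  set_means <- apply(features, 2, function(f) tapply(f, clusters, mean))
  -sum(apply(as.matrix(set_means), 2, function(m) diff(range(m))))
}

# Usage sketch:
# anticlustering(iris[, -5], K = 3, objective = similar_means)
```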
2019-07-05
A new exported function is available: `fast_anticlustering()`. As the
name suggests, it is optimized for speed and particularly useful for
large data sets (many thousand elements). It uses the k-means variance
objective because computing all pairwise distances for the cluster
editing objective becomes computationally infeasible for large data
sets. Additionally, it employs a speed-optimized exchange method. The
number of exchange partners can be adjusted using the argument
`k_neighbours`. Fewer exchange partners make it possible to apply the
`fast_anticlustering()` function to very large data sets. The default
value for `k_neighbours` is `Inf`, meaning that in the default case,
each element is swapped with all other elements.
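A usage sketch (assuming the interface described above; the
`k_neighbours` value is arbitrary):

``` r
library(anticlust)

# Large simulated data set with two features:
N <- 10000
features <- matrix(rnorm(N * 2), ncol = 2)

# Restrict each element's exchange partners to its 10 nearest
# neighbours to keep the run time manageable:
groups <- fast_anticlustering(features, K = 2, k_neighbours = 10)
table(groups)
```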
2019-07-01
A big update:
- A new algorithm for anticlustering is available: the exchange method.
  Building on an initial random assignment, elements are swapped between
  anticlusters in such a way that each swap improves anticluster
  similarity by the largest amount possible (cf. Späth, 1986). This
  procedure is repeated for each element; because each possible swap is
  investigated for each element, the total number of exchanges grows
  quadratically with input size, rendering the exchange method unsuitable
  for large N. Setting `preclustering = TRUE` will limit the legal
  exchange partners to very similar elements, resulting in improved run
  time while preserving a rather good solution. The exchange method
  outperforms the random sampling heuristic for k-means anticlustering.
  The exchange method may incorporate both categorical and preclustering
  constraints, which is not possible for the random sampling approach.
  As there are now two heuristic methods (random sampling and exchange),
  the argument `method` of the function `anticlustering()` now has the
  following three possible values: `"sampling"`, `"exchange"`, `"ilp"`.
  In earlier versions, the two options were `"heuristic"` and `"ilp"`;
  this change does not break earlier code because using
  `method = "heuristic"` will still refer to the random sampling method.
- A new function `generate_partitions()` can be used to generate all
  partitions, making it possible to solve anticlustering via complete
  enumeration. In particular, it is now possible---for small problem
  instances---to solve k-means anticlustering optimally, which cannot be
  done with integer linear programming. The help file
  (`?generate_partitions`) contains example code illustrating how to do
  this.
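The complete-enumeration approach can be sketched as follows
(illustrative; the k-means objective is recomputed by hand here, and
since the argument order of `generate_partitions()` changed in a later
release, named arguments are used):

``` r
library(anticlust)

# Sum of squared within-set deviations from the set centroids
# (the k-means criterion, which anticlustering maximizes):
kmeans_objective <- function(clusters, data) {
  sum(sapply(split(as.data.frame(data), clusters),
             function(set) sum(scale(set, scale = FALSE)^2)))
}

small <- iris[1:10, -5]
partitions <- generate_partitions(N = 10, K = 2)  # all 126 partitions
scores <- sapply(partitions, kmeans_objective, data = small)
best <- partitions[[which.max(scores)]]  # optimal k-means anticlustering
```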
Minor changes:
- The "variance" criterion can now be computed when there are missing values in the input data.
- In `plot_clusters()`, it is now possible to adjust the size of the
  cluster centroid using the new argument `cex_centroid`
2019-06-19
Minor update: `plot_clusters()` now has an additional argument
`illustrate_variance`. If this argument is set to `TRUE`, a cluster
solution is illustrated with the k-means variance criterion.
2019-05-27
The new version of anticlust enables parallelization of the random
sampling method, improving run time.

- The `anticlustering()` function now has an additional argument
  (`parallelize`) that can be used to activate parallel computation when
  using the heuristic method
- For now, the default value of `parallelize` is `FALSE`
- Another argument was added (`seed`) to make the random sampling method
  reproducible
  - Just using `set.seed()` prior to the computation does not make a
    function call reproducible when `parallelize` is `TRUE` because each
    core has its own random seed
  - The `seed` argument is optional

An example data set is now included with the package, courteously
provided by Marie Luisa Schaper and Ute Bayen. For details, see
`?schaper2019`.
2019-04-26
The new version of anticlust includes support for constraints induced by
grouping variables.

- The `anticlustering()` function now has an optional argument
  (`categories`) that can be used to induce categorical constraints
- `categories` can be a vector if there is one grouping variable, or a
  matrix/data frame if there is more than one grouping variable
- Currently, `categories` can only be used with the random sampling
  method (`method = "heuristic"`)
- `categories` overrides the value of `preclustering`; it is not possible
  to use categorical and preclustering constraints at the same time

In `anticlustering()`, the default value of `preclustering` is now
`FALSE`.
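A usage sketch of the new argument (assuming the interface described
above):

``` r
library(anticlust)

# Create 3 sets that are balanced with regard to species membership:
groups <- anticlustering(
  iris[, -5],
  K = 3,
  categories = iris$Species,
  method = "heuristic"  # categories requires the random sampling method
)

# Each species should be spread evenly across the three sets:
table(groups, iris$Species)
```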
Footnotes

[^1]: Maybe the argument `p` will be added to the `anticlustering()`
function in a future release (in this case, the function
`generate_exchange_partners()` would be called internally). This is
probably a useful extension.