|
1 | | -# ParallelKMeans.jl Documentation |
| 1 | +# ParallelKMeans.jl Package |
2 | 2 |
|
3 | 3 | ```@contents |
| 4 | +Depth = 4 |
4 | 5 | ``` |
5 | 6 |
|
| 7 | +## Motivation |
| 8 | +A funny story led to the development of this package. |
| 9 | +What started off as a personal toy project trying to reconstruct the K-Means algorithm in native Julia blew up into a heated discussion on the Julia Discourse forums after I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey Oskin offered his help and together we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world as a maintained Julia package. |
| 10 | + |
| 11 | +Say hello to our baby, `ParallelKMeans`! |
| 12 | + |
| 13 | +This package aims to utilize the speed of Julia and parallelization (both CPU & GPU) by offering an extremely fast implementation of the K-Means clustering algorithm with a user-friendly interface. |
| 14 | + |
| 15 | + |
| 16 | +## K-Means Algorithm Implementation Notes |
| 17 | +TODO: explain the main algorithms and add a few lines about the input dimensions. |
| 18 | + |
6 | 19 | ## Installation |
| 20 | +You can grab the latest stable version of this package by running the command below in Julia. |
| 21 | +Don't forget to invoke Julia's package manager with `]` first. |
7 | 22 |
|
| 23 | +```julia |
| 24 | +pkg> add ParallelKMeans |
| 25 | +``` |
| 26 | + |
| 27 | +For the brave few, you can grab the current experimental features by adding the experimental branch to your development environment after invoking the package manager with `]`: |
| 28 | + |
| 29 | +```julia |
| 30 | +dev git@github.com:PyDataBlog/ParallelKMeans.jl.git |
| 31 | +``` |
| 32 | + |
| 33 | +Don't forget to check out the experimental branch, and you are good to go with bleeding-edge features and breakages! |
| 34 | +```bash |
| 35 | +git checkout experimental |
| 36 | +``` |
8 | 37 |
|
9 | 38 | ## Features |
| 39 | +- Lightning-fast implementation of the K-Means clustering algorithm, even on a single thread, in native Julia. |
| 40 | +- Support for a multi-threaded implementation of the K-Means clustering algorithm. |
| 41 | +- K-Means++ initialization for faster and better convergence. |
| 42 | +- Modified version of Elkan's triangle inequality algorithm to speed up the K-Means algorithm. |
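
To illustrate the K-Means++ idea mentioned in the list above, here is a minimal sketch in plain Julia. It is not the package's implementation: `kmeanspp_seeds` is a hypothetical helper written for this page, and it assumes one observation per column. Each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```julia
using Random

# Hypothetical helper for illustration only; not exported by ParallelKMeans.
# Returns the column indices of k seeds chosen with K-Means++ style sampling.
function kmeanspp_seeds(rng::AbstractRNG, X::AbstractMatrix{<:Real}, k::Integer)
    n = size(X, 2)                       # assumption: one observation per column
    1 <= k <= n || throw(ArgumentError("k must be between 1 and the number of columns"))
    sqdist(a, b) = sum(abs2, view(X, :, a) .- view(X, :, b))

    seeds = [rand(rng, 1:n)]             # first seed: uniform at random
    # squared distance from every point to its nearest chosen seed
    mindist = [sqdist(j, seeds[1]) for j in 1:n]

    while length(seeds) < k
        # draw an index with probability proportional to mindist
        r = rand(rng) * sum(mindist)
        next = something(findlast(>(0), mindist), n)  # fallback for rounding
        acc = 0.0
        for j in 1:n
            acc += mindist[j]
            if mindist[j] > 0 && acc >= r
                next = j
                break
            end
        end
        push!(seeds, next)
        # refresh nearest-seed distances with the new seed
        for j in 1:n
            mindist[j] = min(mindist[j], sqdist(j, next))
        end
    end
    return seeds
end

# Toy data: three well-separated groups on a line
X = [0.0 0.1 10.0 10.1 20.0 20.2;
     0.0 0.0  0.0  0.0  0.0  0.0]
seeds = kmeanspp_seeds(MersenneTwister(1), X, 3)
```

Points that are already seeds have zero weight, so they are never drawn twice as long as the data contains at least `k` distinct points.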
| 43 | + |
| 44 | + |
| 45 | +## Pending Features |
| 46 | +- [X] Implementation of Triangle inequality based on [Elkan C. (2003) "Using the Triangle Inequality to Accelerate |
| 47 | +K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) |
| 48 | +- [ ] Support for DataFrame inputs. |
| 49 | +- [ ] Refactoring and finalization of API design. |
| 50 | +- [ ] GPU support. |
| 51 | +- [ ] Even faster Kmeans implementation based on current literature. |
| 52 | +- [ ] Optimization of code base. |
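
The triangle inequality trick from Elkan's paper cited in the list above can be shown with a tiny numeric sketch. This is only the underlying geometric fact, not the package's code, and the points are made up for illustration: if the distance between two centers is at least twice the distance from a point to the first center, the second center cannot be closer, so its distance never has to be computed.

```julia
using LinearAlgebra

# Made-up point and centers, for illustration only
x  = [1.0, 1.0]
c1 = [1.5, 1.0]   # center currently assigned to x
c2 = [6.0, 1.0]   # candidate center

d_x_c1  = norm(x - c1)    # distance from x to its current center
d_c1_c2 = norm(c1 - c2)   # distance between the two centers

# Triangle inequality: d(c1, c2) <= d(x, c1) + d(x, c2),
# so d(c1, c2) >= 2 * d(x, c1) implies d(x, c2) >= d(x, c1).
can_skip = d_c1_c2 >= 2 * d_x_c1

# Verify the bound by computing the distance we were allowed to skip
d_x_c2 = norm(x - c2)
```

Center-to-center distances are computed once per iteration, which is cheap when the number of clusters is much smaller than the number of points.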
10 | 53 |
|
11 | 54 |
|
12 | 55 | ## How To Use |
| 56 | +Taking advantage of Julia's brilliant multiple dispatch system, the package exposes a very easy-to-use API. |
| 57 | + |
| 58 | +```julia |
| 59 | +using ParallelKMeans |
| 60 | + |
| 61 | +# Use only 1 CPU core |
| 62 | +results = kmeans(X, 3, ParallelKMeans.SingleThread(), tol=1e-6, max_iters=300) |
| 63 | + |
| 64 | +# Use all available CPU cores |
| 65 | +multi_results = kmeans(X, 3, ParallelKMeans.MultiThread(), tol=1e-6, max_iters=300) |
| 66 | +``` |
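
The calls above assume a matrix `X` is already defined. Judging from the iris example in this document, which transposes the feature table before clustering, the expected layout appears to be one feature per row and one observation per column. A minimal sketch of preparing such a matrix from a row-per-observation table, with made-up values:

```julia
# Made-up table: rows are observations, columns are features
table = [5.1 3.5 1.4 0.2;
         4.9 3.0 1.4 0.2;
         6.3 3.3 6.0 2.5]

# Transpose, then collect into a dense Matrix (assumed layout: features × observations)
X = collect(table')

n_features, n_observations = size(X)
```

`collect` turns the lazy transpose into an ordinary `Matrix`, which is the same step the iris example performs.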
| 67 | + |
| 68 | +### Practical Usage Examples |
| 69 | +Some of the common usage examples of this package are as follows: |
| 70 | + |
| 71 | +#### Clustering With A Desired Number Of Groups |
| 72 | + |
| 73 | +```julia |
| 74 | +using ParallelKMeans, RDatasets, Plots |
| 75 | + |
| 76 | +# load the data |
| 77 | +iris = dataset("datasets", "iris"); |
| 78 | + |
| 79 | +# features to use for clustering |
| 80 | +features = collect(Matrix(iris[:, 1:4])'); |
| 81 | + |
| 82 | +result = kmeans(features, 3, ParallelKMeans.MultiThread()); |
| 83 | + |
| 84 | +# plot with the point color mapped to the assigned cluster index |
| 85 | +scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments, |
| 86 | + color=:lightrainbow, legend=false) |
| 87 | + |
| 88 | +# TODO: Add scatter plot image |
| 89 | +``` |
| 90 | + |
| 91 | +#### Elbow Method For The Selection Of The Optimal Number Of Clusters |
| 92 | +```julia |
| 93 | +using ParallelKMeans |
| 94 | + |
| 95 | +# Single Thread Implementation of Lloyd's Algorithm |
| 96 | +b = [ParallelKMeans.kmeans(X, i, ParallelKMeans.SingleThread(), |
| 97 | + tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10] |
| 98 | + |
| 99 | +# Multi Thread Implementation of Lloyd's Algorithm |
| 100 | +c = [ParallelKMeans.kmeans(X, i, ParallelKMeans.MultiThread(), |
| 101 | + tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10] |
| 102 | + |
| 103 | +# Multi Thread Implementation plus a modified version of Elkan's triangle inequality |
| 104 | +# to boost speed |
| 105 | +d = [ParallelKMeans.kmeans(ParallelKMeans.LightElkan(), X, i, ParallelKMeans.MultiThread(), |
| 106 | + tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10] |
| 107 | + |
| 108 | +# Single Thread Implementation plus a modified version of Elkan's triangle inequality |
| 109 | +# to boost speed |
| 110 | +e = [ParallelKMeans.kmeans(ParallelKMeans.LightElkan(), X, i, ParallelKMeans.SingleThread(), |
| 111 | + tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10] |
| 112 | +``` |
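
Once a vector of total costs like the ones above is available, the "elbow" is usually read off a plot. As a rough sketch, it can also be estimated numerically by looking for the sharpest bend in the cost curve. `elbow_index` is a hypothetical helper, not part of the package, and the costs are made-up placeholder numbers rather than benchmark results.

```julia
# Hypothetical helper for illustration only; not part of ParallelKMeans.
# Returns the position in `costs` where the curve bends most sharply,
# measured by the largest second difference.
function elbow_index(costs::AbstractVector{<:Real})
    length(costs) >= 3 || throw(ArgumentError("need at least three costs"))
    second_diff = [costs[i-1] - 2 * costs[i] + costs[i+1] for i in 2:length(costs)-1]
    return argmax(second_diff) + 1   # shift back to an index into `costs`
end

ks = 2:10
# Made-up placeholder costs, one per value of k in `ks`
costs = [1000.0, 400.0, 350.0, 320.0, 300.0, 285.0, 275.0, 268.0, 262.0]

best_k = ks[elbow_index(costs)]
```

This heuristic is sensitive to noise in the costs, so it is best treated as a starting point and checked against the plot.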
| 113 | + |
| 114 | + |
| 115 | +## Benchmarks |
| 116 | + |
| 117 | + |
| 118 | +## Release History |
| 119 | +- 0.1.0 Initial release |
| 120 | + |
13 | 121 |
|
| 122 | +## Contributing |
14 | 123 |
|
15 | 124 |
|
16 | 125 | ```@index |
|