mlpack_kmeans(1)                 User Commands                mlpack_kmeans(1)

NAME
       mlpack_kmeans - k-means clustering

SYNOPSIS
       mlpack_kmeans -c int -i string [-a string] [-e bool] [-P bool]
       [-I string] [-E bool] [-l bool] [-m int] [-p double] [-r bool]
       [-S int] [-s int] [-v bool] [-C string] [-o string] [-h -V]

DESCRIPTION
       This program performs k-means clustering on the given dataset. It can
       return the learned cluster assignments and the centroids of the
       clusters. Empty clusters are not allowed by default; when a cluster
       becomes empty, the point furthest from the centroid of the cluster
       with maximum variance is taken to fill that cluster.

       Optionally, the Bradley and Fayyad approach ("Refining initial points
       for k-means clustering", 1998) can be used to select initial points
       by specifying the '--refined_start (-r)' parameter. This approach
       works by taking random samplings of the dataset. The number of
       samplings is set with the '--samplings (-S)' parameter, and the
       fraction of the dataset used in each sample is set with the
       '--percentage (-p)' parameter (a value between 0.0 and 1.0).
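
       The refined start options above might be combined as in the
       following sketch. The file names 'data.csv' and 'centroids.csv' and
       the values 20 and 0.05 are illustrative assumptions, not
       recommendations; only flags documented on this page are used. The
       guard prints the command instead of running it when the binary or
       the input file is missing.

```shell
#!/bin/sh
# Sketch only: data.csv and centroids.csv are hypothetical file names.
# 20 samplings, each drawing 5% of the points (0.05 as a fraction).
cmd="mlpack_kmeans --input_file data.csv --clusters 10"
cmd="$cmd --refined_start --samplings 20 --percentage 0.05"
cmd="$cmd --centroid_file centroids.csv"

# Run only if the binary and the input file are present; otherwise just
# print the command so the sketch is harmless on any machine.
if command -v mlpack_kmeans >/dev/null 2>&1 && [ -f data.csv ]; then
    $cmd
else
    echo "would run: $cmd"
fi
```

       Because '--percentage' is a fraction, 0.05 here means that each
       sampling uses about 5% of the dataset.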

       Several algorithms are available for each Lloyd iteration, selected
       with the '--algorithm (-a)' option. The standard O(kN) approach can
       be used ('naive'). Other options include the Pelleg-Moore tree-based
       algorithm ('pelleg-moore'), Elkan's triangle-inequality based
       algorithm ('elkan'), Hamerly's modification to Elkan's algorithm
       ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the
       dual-tree k-means algorithm using the cover tree
       ('dualtree-covertree').
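
       One way to compare the algorithm choices is to loop over the
       documented names and read the timers printed by '--verbose'. This is
       a sketch under assumptions: 'data.csv' and the per-algorithm centroid
       file names are hypothetical, and the fixed seed is assumed to keep
       the random initialization from varying between runs.

```shell
#!/bin/sh
# Sketch only: data.csv is a hypothetical input file.
count=0
for alg in naive pelleg-moore elkan hamerly dualtree dualtree-covertree; do
    # Non-zero seed so the random seed does not change from run to run;
    # --verbose prints the parameters and timers at the end of each run.
    cmd="mlpack_kmeans --input_file data.csv --clusters 10 --seed 42"
    cmd="$cmd --algorithm $alg --verbose"
    cmd="$cmd --centroid_file centroids-$alg.csv"

    # Run only if the binary and the input file are present.
    if command -v mlpack_kmeans >/dev/null 2>&1 && [ -f data.csv ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
    count=$((count + 1))
done
```

       Which algorithm is fastest depends on the dataset, so timings from
       one dataset should not be assumed to carry over to another.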

       The behavior when an empty cluster is encountered can be modified
       with the '--allow_empty_clusters (-e)' option. When this option is
       specified and a cluster owns no points at the end of an iteration,
       that cluster's centroid simply remains in its position from the
       previous iteration. If the '--kill_empty_clusters (-E)' option is
       specified, then when a cluster owns no points at the end of an
       iteration, the cluster centroid is filled with DBL_MAX, killing it
       and effectively reducing k for the rest of the computation. Note
       that the default behavior, used when neither empty cluster option is
       specified, can be time-consuming to calculate; specifying either of
       these parameters will therefore often reduce runtime.
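
       The two empty cluster policies can be tried side by side, as in this
       sketch. The input and output file names are hypothetical, and the
       two flags are used one at a time, as the text above describes them.

```shell
#!/bin/sh
# Sketch only: data.csv and the centroid file names are hypothetical.
base="mlpack_kmeans --input_file data.csv --clusters 10"

# Policy 1: an empty cluster keeps its centroid from the previous iteration.
allow_cmd="$base --allow_empty_clusters --centroid_file centroids-allow.csv"

# Policy 2: an empty cluster is killed, effectively reducing k.
kill_cmd="$base --kill_empty_clusters --centroid_file centroids-kill.csv"

for cmd in "$allow_cmd" "$kill_cmd"; do
    # Run only if the binary and the input file are present.
    if command -v mlpack_kmeans >/dev/null 2>&1 && [ -f data.csv ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
done
```

       Omitting both flags selects the default behavior, which refills an
       empty cluster and may take longer.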

       Initial centroids may be specified using the
       '--initial_centroids_file (-I)' parameter, and the maximum number of
       iterations may be specified with the '--max_iterations (-m)'
       parameter.

       As an example, to use Hamerly's algorithm to perform k-means
       clustering with k=10 on the dataset 'data.csv', saving the centroids
       to 'centroids.csv' and the assignments for each point to
       'assignments.csv', the following command could be used:

       $ mlpack_kmeans --input_file data.csv --clusters 10 \
           --algorithm hamerly --output_file assignments.csv \
           --centroid_file centroids.csv

       To run k-means on that same dataset with initial centroids specified
       in 'initial.csv' with a maximum of 500 iterations, storing the output
       centroids in 'final.csv', the following command may be used:

       $ mlpack_kmeans --input_file data.csv \
           --initial_centroids_file initial.csv --clusters 10 \
           --max_iterations 500 --centroid_file final.csv

REQUIRED INPUT OPTIONS
       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial
              centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS
       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive',
              'pelleg-moore', 'elkan', 'hamerly', 'dualtree', or
              'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e) [bool]
              Allow empty clusters to persist.

       --help (-h) [bool]
              Default help info.

       --in_place (-P) [bool]
              If specified, a column containing the learned cluster
              assignments will be added to the input dataset file. In this
              case, --output_file is overridden. (Do not use in Python.)

       --info [string]
              Get help on a specific module or option. Default value ''.

       --initial_centroids_file (-I) [string]
              Start with the specified initial centroids. Default value ''.
96
97 --kill_empty_clusters (-E) [bool]
98 Remove empty clusters when they occur.
99
100 --labels_only (-l) [bool]
101 Only output labels into output file.
102
103 --max_iterations (-m) [int]
104 Maximum number of iterations before k-means terminates. Default
105 value 1000.
106
107 --percentage (-p) [double]
108 Percentage of dataset to use for each refined start sampling
109 (use when --refined_start is specified). Default value 0.02.

       --refined_start (-r) [bool]
              Use the refined initial point strategy by Bradley and Fayyad
              to choose initial points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when
              --refined_start is specified). Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used. Default value 0.

       --verbose (-v) [bool]
              Display informational messages and the full list of
              parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS
       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written
              to the given file. Default value ''.

       --output_file (-o) [string]
              Matrix to store output labels or labeled data to. Default
              value ''.
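
       The output options can be combined with '--labels_only' and a fixed
       '--seed', as in this sketch. The file names 'data.csv', 'labels.csv'
       and 'centroids.csv' and the seed value 42 are hypothetical; any
       non-zero seed avoids the time-based default.

```shell
#!/bin/sh
# Sketch only: all file names here are hypothetical.
cmd="mlpack_kmeans --input_file data.csv --clusters 10 --seed 42"

# With --labels_only, the output file holds only the labels rather than
# the labeled data.
cmd="$cmd --labels_only --output_file labels.csv"
cmd="$cmd --centroid_file centroids.csv"

# Run only if the binary and the input file are present.
if command -v mlpack_kmeans >/dev/null 2>&1 && [ -f data.csv ]; then
    $cmd
else
    echo "would run: $cmd"
fi
```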

ADDITIONAL INFORMATION
       For further information, including relevant papers, citations, and
       theory, consult the documentation found at http://www.mlpack.org or
       included with your distribution of mlpack.

mlpack-3.0.4                   21 February 2019               mlpack_kmeans(1)