mlpack_kmeans(1) General Commands Manual mlpack_kmeans(1)

mlpack_kmeans - k-means clustering

mlpack_kmeans [-h] [-v]

This program performs K-Means clustering on the given dataset, storing
the learned cluster assignments either as a column of labels in the
file containing the input dataset or in a separate file. Empty clusters
are not allowed by default; when a cluster becomes empty, the point
furthest from the centroid of the cluster with maximum variance is
taken to fill that cluster.

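The default fill rule described above can be sketched as follows. This
is a minimal illustration in plain Python, not mlpack's implementation.
The function names (squared_distance, centroid, refill_empty_cluster)
are hypothetical, variance is taken here to be the mean squared distance
to the centroid, and the candidate points are assumed to be those
currently owned by the maximum-variance cluster.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def refill_empty_cluster(points, assignments, empty):
    """Reassign one point to the cluster `empty`; return that point's index.

    Assumption: the point is taken from the cluster with maximum variance,
    and it is the member of that cluster furthest from its centroid.
    """
    # Group point indices by the cluster that currently owns them.
    members = {}
    for index, cluster in enumerate(assignments):
        members.setdefault(cluster, []).append(index)

    def variance(indices):
        center = centroid([points[i] for i in indices])
        total = sum(squared_distance(points[i], center) for i in indices)
        return total / len(indices)

    # Cluster with the largest variance donates a point.
    donor = max(members, key=lambda c: variance(members[c]))
    donor_center = centroid([points[i] for i in members[donor]])

    # The donated point is the one furthest from the donor's centroid.
    furthest = max(
        members[donor],
        key=lambda i: squared_distance(points[i], donor_center),
    )
    assignments[furthest] = empty
    return furthest
```

For example, with two tight points in cluster 0 and three spread-out
points in cluster 1, the outlier of cluster 1 is moved into the empty
cluster 2.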
Optionally, the Bradley and Fayyad approach ("Refining initial points
for k-means clustering", 1998) can be used to select initial points by
specifying the --refined_start (-r) option. This approach works by
taking random samples of the dataset; the number of samples is set with
the --samplings (-S) parameter, and the fraction of the dataset used in
each sample is set with the --percentage (-p) parameter (a value
between 0.0 and 1.0).

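The sampling step of the refined start can be pictured with the sketch
below. It covers only the drawing of the random samples; clustering each
sample and combining the results, as Bradley and Fayyad describe, is
left out. The function name draw_samples is hypothetical, and sampling
without replacement and rounding the sample size are assumptions, not
documented mlpack behavior. The defaults mirror the documented defaults
of the sampling-count and --percentage options (100 and 0.02).

```python
import random


def draw_samples(points, samplings=100, percentage=0.02, rng=None):
    """Return `samplings` random subsets of `points`.

    Each subset holds roughly `percentage` (a fraction in 0.0..1.0) of the
    dataset. Assumption: points are drawn without replacement in a sample.
    """
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    rng = rng or random.Random()
    # Keep at least one point per sample so a later clustering step has input.
    size = max(1, round(percentage * len(points)))
    return [rng.sample(points, size) for _ in range(samplings)]
```

With 1000 points, samplings=5, and percentage=0.5, this yields five
samples of 500 distinct points each.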
There are several options available for the algorithm used for each
Lloyd iteration, specified with the --algorithm (-a) option. The
standard O(kN) approach can be used ('naive'). Other options include
the Pelleg-Moore tree-based algorithm ('pelleg-moore'), Elkan's
triangle-inequality based algorithm ('elkan'), Hamerly's modification
to Elkan's algorithm ('hamerly'), the dual-tree k-means algorithm
('dualtree'), and the dual-tree k-means algorithm using the cover tree
('dualtree-covertree').

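To make the O(kN) cost of the 'naive' option concrete, here is a minimal
sketch of one Lloyd iteration in plain Python: each of the N points is
compared against all k centroids, then every centroid is recomputed as
the mean of its points. The function names are hypothetical and this is
not mlpack's code. As a simplification, a cluster that ends up empty
keeps its old centroid here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_iteration(points, centroids):
    """Run one naive Lloyd iteration.

    Returns (assignments, new_centroids, counts).
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    assignments = []

    # Assignment step: N points times k distance evaluations.
    for point in points:
        best = min(range(k), key=lambda j: squared_distance(point, centroids[j]))
        assignments.append(best)
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += point[d]

    # Update step: each centroid becomes the mean of the points it owns.
    # Simplification: an empty cluster keeps its previous centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return assignments, new_centroids, counts
```

Repeating this step until the assignments stop changing, or until an
iteration limit is reached, gives the basic k-means procedure.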
The behavior when an empty cluster is encountered can be modified with
the --allow_empty_clusters (-e) option. When this option is specified
and a cluster owns no points at the end of an iteration, that cluster's
centroid simply remains in its position from the previous iteration. If
the --kill_empty_clusters (-E) option is specified, then when a cluster
owns no points at the end of an iteration, the cluster centroid is
filled with DBL_MAX, killing it and effectively reducing k for the rest
of the computation. Note that the default behavior, used when neither
empty-cluster option is specified, can be time-consuming to compute;
specifying -e or -E will therefore often reduce runtime.

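The two alternative policies can be sketched as a post-processing step
applied to the centroids after an iteration. This is an illustration
only; the function name handle_empty_clusters and the policy strings
"allow" and "kill" are hypothetical stand-ins for -e and -E. Python's
sys.float_info.max is the largest finite double and stands in for
DBL_MAX.

```python
import sys

DBL_MAX = sys.float_info.max  # largest finite double


def handle_empty_clusters(new_centroids, previous_centroids, counts, policy):
    """Apply an empty-cluster policy and return the adjusted centroids.

    policy "allow": an empty cluster keeps its previous centroid (like -e).
    policy "kill":  an empty cluster's centroid is filled with DBL_MAX, so
                    no point will be assigned to it again (like -E).
    """
    if policy not in ("allow", "kill"):
        raise ValueError("policy must be 'allow' or 'kill'")
    adjusted = [list(c) for c in new_centroids]
    for j, count in enumerate(counts):
        if count > 0:
            continue  # cluster owns points, nothing to do
        if policy == "allow":
            adjusted[j] = list(previous_centroids[j])
        else:
            adjusted[j] = [DBL_MAX] * len(previous_centroids[j])
    return adjusted
```

Both policies cost a single pass over the k clusters, which is
consistent with the note above that they are cheaper than the default
fill rule.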
As of October 2014, the --overclustering option has been removed. If
you want this support back, let us know: file a bug at
https://github.com/mlpack/mlpack/ or get in touch through other means.

--clusters (-c) [int]
Number of clusters to find (0 autodetects from initial centroids).

--input_file (-i) [string]
Input dataset to perform clustering on.

--algorithm (-a) [string]
Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore',
'elkan', 'hamerly', 'dualtree', or 'dualtree-covertree'). Default value
'naive'.

--allow_empty_clusters (-e)
Allow empty clusters to persist.

--help (-h)
Default help info.

--in_place (-P)
If specified, a column containing the learned cluster assignments will
be added to the input dataset file. In this case, --output_file is
overridden.

--info [string]
Get help on a specific module or option. Default value ''.

--initial_centroids (-I) [string]
Start with the specified initial centroids. Default value ''.

--kill_empty_clusters (-E)
Remove empty clusters when they occur.

--labels_only (-l)
Only output labels into output file.

--max_iterations (-m) [int]
Maximum number of iterations before k-means terminates. Default value
1000.

--percentage (-p) [double]
Percentage of dataset to use for each refined start sampling (use when
--refined_start is specified). Default value 0.02.

--refined_start (-r)
Use the refined initial point strategy by Bradley and Fayyad to choose
initial points.

--samplings (-S) [int]
Number of samplings to perform for refined start (use when
--refined_start is specified). Default value 100.

--seed (-s) [int]
Random seed. If 0, 'std::time(NULL)' is used. Default value 0.

--verbose (-v)
Display informational messages and the full list of parameters and
timers at the end of execution.

--version (-V)
Display the version of mlpack.

--centroid_file (-C) [string]
If specified, the centroids of each cluster will be written to the
given file. Default value ''.

--output_file (-o) [string]
File to write output labels or labeled data to. Default value ''.

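Two of the options above give the value 0 a special meaning: --clusters
0 takes k from the initial centroids, and --seed 0 falls back to the
current time. The sketch below shows that resolution logic in plain
Python. The function names are hypothetical, and raising an error when k
cannot be inferred is an assumption about reasonable behavior, not
something this page documents.

```python
import time


def resolve_cluster_count(clusters, initial_centroids):
    """Return the effective k for a --clusters value.

    A value of 0 means: use the number of initial centroids supplied.
    """
    if clusters != 0:
        return clusters
    if not initial_centroids:
        # Assumption: with nothing to infer k from, this is treated as an error.
        raise ValueError("cannot autodetect k without initial centroids")
    return len(initial_centroids)


def resolve_seed(seed, clock=time.time):
    """Return the effective random seed for a --seed value.

    A value of 0 means: use the current time, analogous to std::time(NULL).
    """
    return seed if seed != 0 else int(clock())
```

Passing a fixed non-zero seed makes runs repeatable, whereas a seed of 0
gives a different random initialization on each run.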
For further information, including relevant papers, citations, and
theory, consult the documentation found at http://www.mlpack.org or
included with your distribution of mlpack.