mlpack_kmeans(1)

mlpack_kmeans(1)                 User Commands                mlpack_kmeans(1)

NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

       mlpack_kmeans -c int -i string [-a string] [-e bool] [-P bool] [-I string] [-E bool] [-l bool] [-m int] [-p double] [-r bool] [-S int] [-s int] [-V bool] [-C string] [-o string] [-h -v]

DESCRIPTION

       This program performs K-Means clustering on the given dataset. It can
       return the learned cluster assignments, and the centroids of the
       clusters. Empty clusters are not allowed by default; when a cluster
       becomes empty, the point furthest from the centroid of the cluster with
       maximum variance is taken to fill that cluster.
       
       Optionally, the Bradley and Fayyad approach ("Refining initial points
       for k-means clustering", 1998) can be used to select initial points by
       specifying the '--refined_start (-r)' parameter. This approach works by
       taking random samplings of the dataset. The '--samplings (-S)'
       parameter sets the number of samplings, and the '--percentage (-p)'
       parameter sets the fraction of the dataset used in each sample (despite
       its name, it should be a value between 0.0 and 1.0).
       
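       As a rough illustration of how '--samplings' and '--percentage'
       interact, the following Python sketch draws the point indices for each
       sampling. The helper name, the rounding rule, and sampling without
       replacement are assumptions made for illustration and are not taken
       from the mlpack sources. The later steps of the Bradley and Fayyad
       procedure (clustering each sample and combining the results) are
       omitted.

```python
import random


def draw_samplings(n_points, samplings=100, percentage=0.02, rng_seed=1):
    """Return `samplings` lists of point indices, each covering roughly
    `percentage` of a dataset that has `n_points` points.

    Hypothetical helper: the defaults mirror the documented defaults of
    --samplings and --percentage; everything else is an assumption.
    """
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be a fraction in (0.0, 1.0]")
    rng = random.Random(rng_seed)
    # Assumed rounding rule: truncate, but keep at least one point.
    sample_size = max(1, int(percentage * n_points))
    # Assumed sampling scheme: without replacement within one sampling.
    return [rng.sample(range(n_points), sample_size) for _ in range(samplings)]


samples = draw_samplings(n_points=1000)
print(len(samples), len(samples[0]))
```

       With the documented defaults and a 1000-point dataset, this gives 100
       samplings of 20 points each, which shows why the default value of 0.02
       only makes sense as a fraction.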
       There are several options available for the algorithm used for each
       Lloyd iteration, specified with the '--algorithm (-a)' option. The
       standard O(kN) approach can be used ('naive'). Other options include
       the Pelleg-Moore tree-based algorithm ('pelleg-moore'), Elkan's
       triangle-inequality based algorithm ('elkan'), Hamerly's modification
       to Elkan's algorithm ('hamerly'), the dual-tree k-means algorithm
       ('dualtree'), and the dual-tree k-means algorithm using the cover tree
       ('dualtree-covertree').
       
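       To make the O(kN) cost of the 'naive' option concrete, here is a
       minimal Python sketch of a Lloyd iteration. It is not mlpack's
       implementation: the function names are hypothetical, the stopping rule
       (assignments no longer change) is an assumption, and a cluster that
       owns no points simply keeps its old centroid.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_step(points, centroids):
    """One naive Lloyd iteration over all N points and k centroids."""
    k, dim = len(centroids), len(points[0])
    # Assignment step: a linear scan over k centroids for each of the
    # N points, which is where the O(kN) cost comes from.
    assignments = [
        min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
    ]
    # Update step: move each centroid to the mean of the points it owns.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, a in zip(points, assignments):
        counts[a] += 1
        for d in range(dim):
            sums[a][d] += p[d]
    # A cluster with no points keeps its old centroid in this sketch.
    updated = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return assignments, updated


def naive_kmeans(points, centroids, max_iterations=1000):
    """Repeat Lloyd steps until the assignments stop changing or the
    iteration cap is reached (the stopping rule is an assumption)."""
    assignments, iterations = None, 0
    while iterations < max_iterations:
        new_assignments, centroids = lloyd_step(points, centroids)
        iterations += 1
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return assignments, centroids, iterations


data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers, used = naive_kmeans(data, [[0.0, 0.0], [10.0, 10.0]])
print(labels, centers, used)
```

       The other algorithms listed above avoid most of the distance
       evaluations in the assignment step by using trees or distance bounds.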
       The behavior when an empty cluster is encountered can be modified with
       the '--allow_empty_clusters (-e)' option. When this option is specified
       and a cluster owns no points at the end of an iteration, that cluster's
       centroid simply remains in its position from the previous iteration. If
       the '--kill_empty_clusters (-E)' option is specified, then when a
       cluster owns no points at the end of an iteration, the cluster centroid
       is filled with DBL_MAX, killing it and effectively reducing k for the
       rest of the computation. Note that the default behavior, used when
       neither empty cluster option is specified, can be time-consuming to
       calculate; therefore, specifying either of these parameters will often
       reduce runtime.
       
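       The three empty-cluster behaviors described above can be compared
       side by side in a short Python sketch. All names here are hypothetical,
       and two details are assumptions rather than documented behavior:
       variance is taken to be the mean squared distance to the centroid, and
       only clusters with at least two points may give a point away.

```python
import sys

DBL_MAX = sys.float_info.max  # largest finite double, like C's DBL_MAX


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refill_from_max_variance(empty, points, assignments, centroids):
    """Default policy sketch: hand the empty cluster the point that lies
    furthest from the centroid of the cluster with maximum variance."""
    k = len(centroids)
    total, count = [0.0] * k, [0] * k
    for p, a in zip(points, assignments):  # one full pass over the data
        total[a] += sq_dist(p, centroids[a])
        count[a] += 1
    # Assumption: variance = mean squared distance to the centroid, and
    # only clusters with at least two points may give one away.
    donors = [j for j in range(k) if count[j] >= 2]
    if not donors:
        return list(centroids[empty])
    donor = max(donors, key=lambda j: total[j] / count[j])
    members = [i for i, a in enumerate(assignments) if a == donor]
    furthest = max(members, key=lambda i: sq_dist(points[i], centroids[donor]))
    assignments[furthest] = empty  # the point changes owner
    return list(points[furthest])


def fix_empty_cluster(policy, empty, points, assignments, centroids, previous):
    """Return the centroid to use for the empty cluster `empty`."""
    if policy == "allow":  # cf. --allow_empty_clusters
        return list(previous[empty])
    if policy == "kill":  # cf. --kill_empty_clusters
        return [DBL_MAX] * len(previous[empty])
    return refill_from_max_variance(empty, points, assignments, centroids)


points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [12.0, 0.0], [20.0, 0.0]]
owners = [0, 0, 1, 1, 1]  # cluster 2 owns no points
centroids = [[0.5, 0.0], [14.0, 0.0], [100.0, 0.0]]

kept = fix_empty_cluster("allow", 2, points, list(owners), centroids, centroids)
killed = fix_empty_cluster("kill", 2, points, list(owners), centroids, centroids)
refilled_owners = list(owners)
refilled = fix_empty_cluster("default", 2, points, refilled_owners,
                             centroids, centroids)
print(kept, refilled, refilled_owners)
```

       In this sketch only the default policy needs a pass over every point,
       which is consistent with the note that it can be time-consuming.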
       Initial centroids may be specified using the
       '--initial_centroids_file (-I)' parameter, and the maximum number of
       iterations may be specified with the '--max_iterations (-m)' parameter.
       
       As an example, to use Hamerly's algorithm to perform k-means clustering
       with k=10 on the dataset 'data.csv', saving the centroids to
       'centroids.csv' and the assignments for each point to
       'assignments.csv', the following command could be used:

       $ mlpack_kmeans --input_file data.csv --clusters 10 \
           --algorithm hamerly --output_file assignments.csv \
           --centroid_file centroids.csv
       
       To run k-means on that same dataset with initial centroids specified in
       'initial.csv' with a maximum of 500 iterations, storing the output
       centroids in 'final.csv', the following command may be used:

       $ mlpack_kmeans --input_file data.csv \
           --initial_centroids_file initial.csv --clusters 10 \
           --max_iterations 500 --centroid_file final.csv
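       When the program is driven from a script, a command like the ones
       above can be assembled as an argument list. The Python helper below is
       hypothetical and only uses long option names listed in this manual
       page. It assumes that boolean options are passed as bare switches,
       which should be checked against the installed version.

```python
import shlex

# Long option names taken from the option lists of this manual page.
KNOWN_OPTIONS = {
    "algorithm", "allow_empty_clusters", "in_place", "initial_centroids_file",
    "kill_empty_clusters", "labels_only", "max_iterations", "percentage",
    "refined_start", "samplings", "seed", "verbose", "centroid_file",
    "output_file",
}


def build_kmeans_argv(input_file, clusters, **options):
    """Assemble an argument list for mlpack_kmeans (hypothetical helper)."""
    argv = ["mlpack_kmeans", "--input_file", str(input_file),
            "--clusters", str(clusters)]
    for name, value in options.items():
        if name not in KNOWN_OPTIONS:
            raise ValueError("unknown option: " + name)
        if value is True:
            argv.append("--" + name)  # assumed: boolean options are bare
        elif value is not False and value is not None:
            argv.extend(["--" + name, str(value)])
    return argv


argv = build_kmeans_argv("data.csv", 10, algorithm="hamerly",
                         output_file="assignments.csv",
                         centroid_file="centroids.csv")
print(" ".join(shlex.quote(arg) for arg in argv))
```

       If mlpack_kmeans is installed, the list can be passed to
       subprocess.run(argv, check=True); building a list rather than a single
       string avoids quoting problems with file names that contain spaces.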

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial
              centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive',
              'pelleg-moore', 'elkan', 'hamerly', 'dualtree', or
              'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e) [bool]
              Allow empty clusters to persist.

       --help (-h) [bool]
              Default help info.

       --in_place (-P) [bool]
              If specified, a column containing the learned cluster
              assignments will be added to the input dataset file. In this
              case, --output_file is overridden. (Do not use in Python.)

       --info [string]
              Get help on a specific module or option. Default value ''.

       --initial_centroids_file (-I) [string]
              Start with the specified initial centroids. Default value ''.

       --kill_empty_clusters (-E) [bool]
              Remove empty clusters when they occur.

       --labels_only (-l) [bool]
              Output only the labels to the output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default
              value 1000.

       --percentage (-p) [double]
              Fraction of the dataset (between 0.0 and 1.0) to use for each
              refined start sampling (use when --refined_start is specified).
              Default value 0.02.

       --refined_start (-r) [bool]
              Use the refined initial point strategy by Bradley and Fayyad to
              choose initial points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when
              --refined_start is specified). Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used. Default value 0.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters
              and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written to
              the given file. Default value ''.

       --output_file (-o) [string]
              Matrix to store output labels or labeled data to. Default value
              ''.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and
       theory, consult the documentation found at http://www.mlpack.org or
       included with your distribution of mlpack.

mlpack-3.0.4                   21 February 2019               mlpack_kmeans(1)