mlpack_kmeans(1)

mlpack_kmeans(1)            General Commands Manual           mlpack_kmeans(1)

NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

       mlpack_kmeans [-h] [-v]

DESCRIPTION

       This program performs K-Means clustering on the given dataset, storing
       the learned cluster assignments either as a column of labels in the
       file containing the input dataset or in a separate file. Empty clusters
       are not allowed by default; when a cluster becomes empty, the point
       furthest from the centroid of the cluster with maximum variance is
       taken to fill that cluster.

       Optionally, the Bradley and Fayyad approach ("Refining initial points
       for k-means clustering", 1998) can be used to select initial points by
       specifying the --refined_start (-r) option. This approach works by
       taking random samples of the dataset; to specify the number of samples,
       use the --samplings (-S) parameter, and to specify the percentage of
       the dataset to be used in each sample, use the --percentage (-p)
       parameter (it should be a value between 0.0 and 1.0).

       Several algorithms are available for each Lloyd iteration, selected
       with the --algorithm (-a) option. The standard O(kN) approach can be
       used ('naive'). Other options include the Pelleg-Moore tree-based
       algorithm ('pelleg-moore'), Elkan's triangle-inequality based
       algorithm ('elkan'), Hamerly's modification to Elkan's algorithm
       ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the
       dual-tree k-means algorithm using the cover tree
       ('dualtree-covertree').

       The behavior when an empty cluster is encountered can be modified with
       the --allow_empty_clusters (-e) option. When this option is specified
       and a cluster owns no points at the end of an iteration, that
       cluster's centroid simply remains in its position from the previous
       iteration. If the --kill_empty_clusters (-E) option is specified, then
       when a cluster owns no points at the end of an iteration, the cluster
       centroid is filled with DBL_MAX, killing it and effectively reducing k
       for the rest of the computation. Note that the default behavior, used
       when neither empty cluster option is specified, can be time-consuming
       to calculate; therefore, specifying -e or -E will often reduce
       runtime.

       As of October 2014, the --overclustering option has been removed. If
       you want this support back, let us know: file a bug at
       https://github.com/mlpack/mlpack/ or get in touch through another
       means.
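       The naive O(kN) Lloyd iteration and the three empty-cluster behaviors
       described above can be illustrated with a minimal pure-Python sketch.
       It is not mlpack's implementation: the function name lloyd_step, the
       empty_policy argument, and the variance definition (mean squared
       distance to the centroid) are assumptions made for illustration only.

```python
import math
import sys


def lloyd_step(points, centroids, empty_policy="fill"):
    """One naive Lloyd iteration (hypothetical sketch, not mlpack code).

    empty_policy: "fill" (default behavior), "allow" (like -e),
    or "kill" (like -E).
    """
    if empty_policy not in ("fill", "allow", "kill"):
        raise ValueError("unknown empty_policy: %r" % (empty_policy,))
    k, dim = len(centroids), len(points[0])

    # Assignment step: every point is compared with every centroid, O(kN).
    members = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        members[nearest].append(p)

    def mean(cluster):
        return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

    # Update step: empty clusters are marked with None for now.
    new_centroids = [mean(m) if m else None for m in members]

    for c in range(k):
        if new_centroids[c] is not None:
            continue
        if empty_policy == "allow":
            # Centroid stays where it was in the previous iteration.
            new_centroids[c] = list(centroids[c])
        elif empty_policy == "kill":
            # Centroid is filled with the largest double (DBL_MAX in C).
            new_centroids[c] = [sys.float_info.max] * dim
        else:
            # Default: take the point furthest from the centroid of the
            # cluster with maximum variance (assumed: mean squared distance).
            donors = [j for j in range(k) if len(members[j]) > 1]
            if not donors:
                new_centroids[c] = list(centroids[c])
                continue

            def variance(j):
                return sum(math.dist(p, new_centroids[j]) ** 2
                           for p in members[j]) / len(members[j])

            donor = max(donors, key=variance)
            far = max(members[donor],
                      key=lambda p: math.dist(p, new_centroids[donor]))
            members[donor].remove(far)
            members[c].append(far)
            new_centroids[donor] = mean(members[donor])
            new_centroids[c] = list(far)
    return new_centroids, members
```

       The default branch performs extra variance and distance computations
       for every empty cluster, which is consistent with the note that -e or
       -E will often reduce runtime.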

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial
              centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive',
              'pelleg-moore', 'elkan', 'hamerly', 'dualtree', or
              'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e)
              Allow empty clusters to persist.

       --help (-h)
              Default help info.

       --in_place (-P)
              If specified, a column containing the learned cluster
              assignments will be added to the input dataset file. In this
              case, --output_file is overridden.

       --info [string]
              Get help on a specific module or option. Default value ''.

       --initial_centroids (-I) [string]
              Start with the specified initial centroids. Default value ''.

       --kill_empty_clusters (-E)
              Remove empty clusters when they occur.

       --labels_only (-l)
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default
              value 1000.

       --percentage (-p) [double]
              Percentage of dataset to use for each refined start sampling
              (use when --refined_start is specified). Default value 0.02.

       --refined_start (-r)
              Use the refined initial point strategy by Bradley and Fayyad to
              choose initial points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when
              --refined_start is specified). Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used. Default value 0.

       --verbose (-v)
              Display informational messages and the full list of parameters
              and timers at the end of execution.

       --version (-V)
              Display the version of mlpack.
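       How --samplings, --percentage, and --seed interact during a refined
       start can be sketched in pure Python as follows. This covers only the
       subsampling step; the later clustering of each subsample is omitted.
       The function name draw_samplings, the truncation used to compute the
       sample size, and sampling without replacement are assumptions, not
       documented mlpack behavior.

```python
import random
import time


def draw_samplings(points, samplings=100, percentage=0.02, seed=0):
    """Draw `samplings` random subsamples, each holding `percentage` of points.

    Hypothetical helper: defaults mirror the documented option defaults.
    """
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be in (0.0, 1.0]")
    # As documented for --seed, 0 means "seed from the current time".
    rng = random.Random(seed if seed != 0 else int(time.time()))
    # Assumed rounding rule: truncate, but keep at least one point.
    size = max(1, int(percentage * len(points)))
    return [rng.sample(points, size) for _ in range(samplings)]


# With the default percentage, 1000 points give 20 points per subsample.
subsamples = draw_samplings(list(range(1000)), samplings=5, seed=1)
print(len(subsamples), len(subsamples[0]))
```

       With the default values (100 samplings of 2% each), a dataset must be
       reasonably large for every subsample to contain enough points to form
       the requested number of clusters.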

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written to
              the given file. Default value ''.

       --output_file (-o) [string]
              File to write output labels or labeled data to. Default value
              ''.
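       As an illustration of how the options above combine, the following
       Python sketch assembles a command line using only options documented
       on this page. The helper name kmeans_command and the file names
       data.csv, labels.csv, and centroids.csv are hypothetical.

```python
import shlex

# Algorithm names as listed under --algorithm.
ALGORITHMS = ("naive", "pelleg-moore", "elkan", "hamerly",
              "dualtree", "dualtree-covertree")


def kmeans_command(input_file, clusters, output_file,
                   centroid_file=None, algorithm="naive",
                   refined_start=False):
    """Build an argument list for mlpack_kmeans (hypothetical helper)."""
    if algorithm not in ALGORITHMS:
        raise ValueError("unknown algorithm: %r" % (algorithm,))
    argv = ["mlpack_kmeans",
            "--input_file", input_file,
            "--clusters", str(clusters),
            "--output_file", output_file,
            "--algorithm", algorithm]
    if centroid_file:
        argv += ["--centroid_file", centroid_file]
    if refined_start:
        argv.append("--refined_start")
    return argv


cmd = kmeans_command("data.csv", 3, "labels.csv",
                     centroid_file="centroids.csv")
print(shlex.join(cmd))
```

       The returned list can be passed to subprocess.run on a system where
       mlpack_kmeans is installed; shlex.join is used here only to display
       the command safely quoted.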

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and
       theory, consult the documentation found at http://www.mlpack.org or
       included with your distribution of mlpack.