mlpack_gmm_train(1)              User Commands             mlpack_gmm_train(1)


NAME
       mlpack_gmm_train - gaussian mixture model (gmm) training

SYNOPSIS
       mlpack_gmm_train -g int -i string [-d bool] [-m unknown] [-n int]
       [-P bool] [-N double] [-p double] [-r bool] [-S int] [-s int]
       [-T double] [-t int] [-V bool] [-M unknown] [-h -v]

DESCRIPTION
       This program fits a parametric Gaussian mixture model (GMM) using
       the EM algorithm to find the maximum likelihood estimate.  The
       model may be saved and reused by other mlpack GMM tools.

       The input data to train on must be specified with the
       '--input_file (-i)' parameter, and the number of Gaussians in the
       model must be specified with the '--gaussians (-g)' parameter.
       Optionally, many trials with different random initializations may
       be run, and the result with the highest log-likelihood on the
       training data will be taken.  The number of trials to run is
       specified with the '--trials (-t)' parameter.  By default, only
       one trial is run.

       The tolerance for convergence and the maximum number of iterations
       of the EM algorithm are specified with the '--tolerance (-T)' and
       '--max_iterations (-n)' parameters, respectively.  The GMM may be
       initialized for training with another model, specified with the
       '--input_model_file (-m)' parameter.  Otherwise, the model is
       initialized by running k-means on the data.  The k-means
       clustering initialization can be controlled with the
       '--refined_start (-r)', '--samplings (-S)', and '--percentage
       (-p)' parameters.  If '--refined_start (-r)' is specified, then
       the Bradley-Fayyad refined start initialization will be used.
       This can often lead to better clustering results.

       The '--diagonal_covariance (-d)' flag will cause the learned
       covariances to be diagonal matrices.  This significantly
       simplifies the model itself and makes training faster, but
       restricts the ability to fit more complex GMMs.

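       As a minimal sketch (plain NumPy, not mlpack's internals), a
       diagonal covariance keeps only the per-dimension variances of the
       full sample covariance; inverting it reduces to reciprocating the
       diagonal, which is part of why training with '-d' is faster:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))  # 200 points in 4 dimensions

full_cov = np.cov(X, rowvar=False)             # full 4x4 covariance
diag_cov = np.diag(np.var(X, axis=0, ddof=1))  # diagonal approximation

# Inverting a diagonal matrix is just the elementwise reciprocal of
# its diagonal -- no general matrix inversion needed.
inv_diag = np.diag(1.0 / np.diag(diag_cov))
print(np.allclose(inv_diag @ diag_cov, np.eye(4)))  # True
```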
       If GMM training fails with an error indicating that a covariance
       matrix could not be inverted, make sure that the
       '--no_force_positive (-P)' parameter is not specified.
       Alternately, adding a small amount of Gaussian noise (using the
       '--noise (-N)' parameter) to the entire dataset may help prevent
       Gaussians with zero variance in a particular dimension, which is
       usually the cause of non-invertible covariance matrices.

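       A small sketch (plain NumPy, not mlpack code) of why a
       zero-variance dimension yields a non-invertible covariance, and
       how zero-mean noise avoids it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 5.0  # one dimension is constant: zero variance

cov = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(cov))  # 2 -- singular, cannot be inverted

# Adding zero-mean Gaussian noise restores full rank.  Note that
# '--noise' takes a variance, while NumPy's scale= is a standard
# deviation, so scale=0.1 corresponds roughly to --noise 0.01.
X_noisy = X + rng.normal(scale=0.1, size=X.shape)
cov_noisy = np.cov(X_noisy, rowvar=False)
print(np.linalg.matrix_rank(cov_noisy))  # 3 -- invertible
```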
       The '--no_force_positive (-P)' parameter, if set, will avoid the
       checks after each iteration of the EM algorithm which ensure that
       the covariance matrices are positive definite.  Specifying the
       flag can give faster runtime, but may also produce non-positive
       definite covariance matrices, which will cause the program to
       crash.

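       The kind of check being skipped can be sketched with a Cholesky
       factorization, which succeeds exactly for symmetric positive
       definite matrices (an illustration, not mlpack's actual
       implementation):

```python
import numpy as np

def is_positive_definite(cov):
    """Return True if the symmetric matrix cov is positive definite."""
    try:
        np.linalg.cholesky(cov)  # raises LinAlgError if not pos. def.
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite(np.array([[2.0, 0.5], [0.5, 1.0]])))  # True
print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False
```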
       As an example, to train a 6-Gaussian GMM on the data in
       'data.csv' with a maximum of 100 iterations of EM and 3 trials,
       saving the trained GMM to 'gmm.bin', the following command can be
       used:

       $ mlpack_gmm_train --input_file data.csv --gaussians 6 --trials 3
             --max_iterations 100 --output_model_file gmm.bin

       To re-train that GMM on another set of data 'data2.csv', the
       following command may be used:

       $ mlpack_gmm_train --input_model_file gmm.bin --input_file
             data2.csv --gaussians 6 --output_model_file new_gmm.bin

REQUIRED INPUT OPTIONS
       --gaussians (-g) [int]
              Number of Gaussians in the GMM.

       --input_file (-i) [string]
              The training data on which the model will be fit.

OPTIONAL INPUT OPTIONS
       --diagonal_covariance (-d) [bool]
              Force the covariance of the Gaussians to be diagonal.
              This can significantly accelerate training.

       --help (-h) [bool]
              Default help info.

       --info [string]
              Get help on a specific module or option.  Default value ''.

       --input_model_file (-m) [unknown]
              Initial input GMM model to start training with.  Default
              value ''.

       --max_iterations (-n) [int]
              Maximum number of iterations of the EM algorithm (passing
              0 will run until convergence).  Default value 250.

       --no_force_positive (-P) [bool]
              Do not force the covariance matrices to be positive
              definite.

       --noise (-N) [double]
              Variance of zero-mean Gaussian noise to add to the data.
              Default value 0.

       --percentage (-p) [double]
              If using --refined_start, specify the percentage of the
              dataset used for each sampling (should be between 0.0 and
              1.0).  Default value 0.02.

       --refined_start (-r) [bool]
              During initialization, use refined initial positions for
              k-means clustering (Bradley and Fayyad, 1998).

       --samplings (-S) [int]
              If using --refined_start, specify the number of samplings
              used for initial points.  Default value 100.

       --seed (-s) [int]
              Random seed.  If 0, 'std::time(NULL)' is used.  Default
              value 0.

       --tolerance (-T) [double]
              Tolerance for convergence of EM.  Default value 1e-10.

       --trials (-t) [int]
              Number of trials to perform when training the GMM.
              Default value 1.

       --verbose (-v) [bool]
              Display informational messages and the full list of
              parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS
       --output_model_file (-M) [unknown]
              Output file for the trained GMM model.  Default value ''.

ADDITIONAL INFORMATION
       For further information, including relevant papers, citations,
       and theory, consult the documentation found at
       http://www.mlpack.org or included with your distribution of
       mlpack.


mlpack-3.0.4                  21 February 2019             mlpack_gmm_train(1)