1hmmsim(1) HMMER Manual hmmsim(1)
2
3
4
6 hmmsim - collect score distributions on random sequences
7
8
10 hmmsim [options] <hmmfile>
11
12
13
15 The hmmsim program generates random sequences, scores them with the
16 model(s) in <hmmfile>, and outputs various sorts of histograms, plots,
17 and fitted distributions for the resulting scores.
18
19
20 hmmsim is not a mainstream part of the HMMER package. Most users would
21 have no reason to use it. It is used to develop and test the statisti‐
22 cal methods used to determine P-values and E-values in HMMER3. For
23 example, it was used to generate most of the results in a 2008 paper on
24 H3's local alignment statistics (PLoS Comp Bio 4:e1000069, 2008;
25 http://www.ploscompbiol.org/doi/pcbi.1000069).
26
27
28 Because it is a research testbed, you should not expect it to be as
29 robust as other programs in the package. For example, options may
30 interact in weird ways; we haven't tested nor tried to anticipate all
31 different possible combinations.
32
33
34 The main task is to fit a maximum likelihood Gumbel distribution to
35 Viterbi scores or an maximum likelihood exponential tail to high-scor‐
36 ing Forward scores, and to test that these fitted distributions obey
37 the conjecture that lambda ~ log_2 for both the Viterbi Gumbel and the
38 Forward exponential tail.
39
40
41 The output is a table of numbers, one row for each model. Four differ‐
42 ent parametric fits to the score data are tested: (1) maximum likeli‐
43 hood fits to both location (mu/tau) and slope (lambda) parameters; (2)
44 assuming lambda=log_2, maximum likelihood fit to the location parameter
45 only; (3) same but assuming an edge-corrected lambda, using current
46 procedures in H3 [Eddy, 2008]; and (4) using both parameters determined
47 by H3's current procedures. The standard simple, quick and dirty
48 statistic for goodness-of-fit is 'E@10', the calculated E-value of the
49 10th ranked top hit, which we expect to be about 10.
50
51
52 In detail, the columns of the output are:
53
54
55 name Name of the model.
56
57
58 tailp Fraction of the highest scores used to fit the distribution. For
59 Viterbi, MSV, and Hybrid scores, this defaults to 1.0 (a Gumbel
60 distribution is fitted to all the data). For Forward scores,
61 this defaults to 0.02 (an exponential tail is fitted to the
62 highest 2% scores).
63
64
65 mu/tau Location parameter for the maximum likelihood fit to the data.
66
67
68 lambda Slope parameter for the maximum likelihood fit to the data.
69
70
71 E@10 The E-value calculated for the 10th ranked high score ('E@10')
72 using the ML mu/tau and lambda. By definition, this expected to
73 be about 10, if E-value estimation were accurate.
74
75
76 mufix Location parameter, for a maximum likelihood fit with a known
77 (fixed) slope parameter lambda of log_2 (0.693).
78
79
80 E@10fix
81 The E-value calculated for the 10th ranked score using mufix and
82 the expected lambda = log_2 = 0.693.
83
84
85
86 mufix2 Location parameter, for a maximum likelihood fit with an edge-
87 effect-corrected lambda.
88
89
90 E@10fix2
91 The E-value calculated for the 10th ranked score using mufix2
92 and the edge-effect-corrected lambda.
93
94
95 pmu Location parameter as determined by H3's estimation procedures.
96
97
98 plambda
99 Slope parameter as determined by H3's estimation procedures.
100
101
102 pE@10 The E-value calculated for the 10th ranked score using pmu,
103 plambda.
104
105
106
107 At the end of this table, one more line is printed, starting with # and
108 summarizing the overall CPU time used by the simulations.
109
110
111 Some of the optional output files are in xmgrace xy format. xmgrace is
112 powerful and freely available graph-plotting software.
113
114
115
117 -h Help; print a brief reminder of command line usage and all
118 available options.
119
120
121 -a Collect expected Viterbi alignment length statistics from each
122 simulated sequence. This only works with Viterbi scores (the
123 default; see --vit). Two additional fields are printed in the
124 output table for each model: the mean length of Viterbi align‐
125 ments, and the standard deviation.
126
127
128 -v (Verbose). Print the scores too, one score per line.
129
130
131 -L <n> Set the length of the randomly sampled (nonhomologous) sequences
132 to <n>. The default is 100.
133
134
135
136 -N <n> Set the number of randomly sampled sequences to <n>. The
137 default is 1000.
138
139
140 --mpi Run in MPI parallel mode, under mpirun. It is parallelized at
141 the level of sending one profile at a time to an MPI worker
142 process, so parallelization only helps if you have more than one
143 profile in the <hmmfile>, and you want to have at least as many
144 profiles as MPI worker processes. (Only available if optional
145 MPI support was enabled at compile-time.)
146
147
148
149
151 -o <f> Save the main output table to a file <f> rather than sending it
152 to stdout.
153
154
155 --afile <f>
156 When collecting Viterbi alignment statistics (the -a option),
157 for each sampled sequence, output two fields per line to a file
158 <f>: the length of the optimal alignment, and the Viterbi bit
159 score. Requires that the -a option is also used.
160
161
162 --efile <f>
163 Output a rank vs. E-value plot in XMGRACE xy format to file <f>.
164 The x-axis is the rank of this sequence, from highest score to
165 lowest; the y-axis is the E-value calculated for this sequence.
166 E-values are calculated using H3's default procedures (i.e. the
167 pmu, plambda parameters in the output table). You expect a rough
168 match between rank and E-value if E-values are accurately esti‐
169 mated.
170
171
172
173 --ffile <f>
174 Output a "filter power" file to <f>: for each model, a line with
175 three fields: model name, number of sequences passing the P-
176 value threshold, and fraction of sequences passing the P-value
177 threshold. See --pthresh for setting the P-value threshold,
178 which defaults to 0.02 (the default MSV filter threshold in H3).
179 The P-values are as determined by H3's default procedures (the
180 pmu,plambda parameters in the output table). If all is well,
181 you expect to see filter power equal to the predicted P-value
182 setting of the threshold.
183
184
185 --pfile <f>
186 Output cumulative survival plots (P(S>x)) to file <f> in XMGRACE
187 xy format. There are three plots: (1) the observed score distri‐
188 bution; (2) the maximum likelihood fitted distribution; (3) a
189 maximum likelihood fit to the location parameter (mu/tau) while
190 assuming lambda=log_2.
191
192
193 --xfile <f>
194 Output the bit scores as a binary array of double-precision
195 floats (8 bytes per score) to file <f>. Programs like Easel's
196 esl-histplot can read such binary files. This is useful when
197 generating extremely large sample sizes.
198
199
200
202 H3 only uses multihit local alignment ( --fs mode), and this is where
203 we believe the statistical fits. Unihit local alignment scores
204 (Smith/Waterman; --sw mode) also obey our statistical conjectures.
205 Glocal alignment statistics (either multihit or unihit) are still not
206 adequately understood nor adequately fitted.
207
208
209 --fs Collect multihit local alignment scores. This is the default.
210 alignment as 'fragment search mode'.
211
212
213 --sw Collect unihit local alignment scores. The H3 J state is dis‐
214 abled. alignment as 'Smith/Waterman search mode'.
215
216
217 --ls Collect multihit glocal alignment scores. In glocal
218 (global/local) alignment, the entire model must align, to a sub‐
219 sequence of the target. The H3 local entry/exit transition prob‐
220 abilities are disabled. 'ls' comes from HMMER2's historical ter‐
221 minology for multihit local alignment as 'local search mode'.
222
223
224 --s Collect unihit glocal alignment scores. Both the H3 J state and
225 local entry/exit transition probabilities are disabled. 's'
226 comes from HMMER2's historical terminology for unihit glocal
227 alignment.
228
229
230
231
233 --vit Collect Viterbi maximum likelihood alignment scores. This is the
234 default.
235
236
237 --fwd Collect Forward log-odds likelihood scores, summed over align‐
238 ment ensemble.
239
240
241 --hyb Collect 'Hybrid' scores, as described in papers by Yu and Hwa
242 (for instance, Bioinformatics 18:864, 2002). These involve cal‐
243 culating a Forward matrix and taking the maximum cell value. The
244 number itself is statistically somewhat unmotivated, but the
245 distribution is expected be a well-behaved extreme value distri‐
246 bution (Gumbel).
247
248
249 --msv Collect MSV (multiple ungapped segment Viterbi) scores, using
250 H3's main acceleration heuristic.
251
252
253 --fast For any of the above options, use H3's optimized production
254 implementation (using SIMD vectorization). The default is to use
255 the implementations sacrifice a small amount of numerical preci‐
256 sion. This can introduce confounding noise into statistical sim‐
257 ulations and fits, so when one gets super-concerned about exact
258 details, it's better to be able to factor that source of noise
259 out.
260
261
263 In some experiments, it was useful to fit Forward scores to a range of
264 different tail masses, rather than just one. These options provide a
265 mechanism for fitting an evenly-spaced range of different tail masses.
266 For each different tail mass, a line is generated in the output.
267
268
269 --tmin <x>
270 Set the lower bound on the tail mass distribution. (The default
271 is 0.02 for the default single tail mass.)
272
273
274 --tmax <x>
275 Set the upper bound on the tail mass distribution. (The default
276 is 0.02 for the default single tail mass.)
277
278
279 --tpoints <n>
280 Set the number of tail masses to sample, starting from --tmin
281 and ending at --tmax. (The default is 1, for the default 0.02
282 single tail mass.)
283
284
285 --tlinear
286 Sample a range of tail masses with uniform linear spacing. The
287 default is to use uniform logarithmic spacing.
288
289
290
291
293 H3 uses three short random sequence simulations to estimating the loca‐
294 tion parameters for the expected score distributions for MSV scores,
295 Viterbi scores, and Forward scores. These options allow these simula‐
296 tions to be modified.
297
298
299 --EmL <n>
300 Sets the sequence length in simulation that estimates the loca‐
301 tion parameter mu for MSV E-values. Default is 200.
302
303
304 --EmN <n>
305 Sets the number of sequences in simulation that estimates the
306 location parameter mu for MSV E-values. Default is 200.
307
308
309 --EvL <n>
310 Sets the sequence length in simulation that estimates the loca‐
311 tion parameter mu for Viterbi E-values. Default is 200.
312
313
314 --EvN <n>
315 Sets the number of sequences in simulation that estimates the
316 location parameter mu for Viterbi E-values. Default is 200.
317
318
319 --EfL <n>
320 Sets the sequence length in simulation that estimates the loca‐
321 tion parameter tau for Forward E-values. Default is 100.
322
323
324 --EfN <n>
325 Sets the number of sequences in simulation that estimates the
326 location parameter tau for Forward E-values. Default is 200.
327
328
329 --Eft <x>
330 Sets the tail mass fraction to fit in the simulation that esti‐
331 mates the location parameter tau for Forward evalues. Default is
332 0.04.
333
334
335
337 --stall
338 For debugging the MPI master/worker version: pause after start,
339 to enable the developer to attach debuggers to the running mas‐
340 ter and worker(s) processes. Send SIGCONT signal to release the
341 pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
342 optional MPI support was enabled at compile-time.)
343
344
345 --seed <n>
346 Set the random number seed to <n>. The default is 0, which
347 makes the random number generator use an arbitrary seed, so that
348 different runs of hmmsim will almost certainly generate a dif‐
349 ferent statistical sample. For debugging, it is useful to force
350 reproducible results, by fixing a random number seed.
351
352
353
354
356 These options were used in a small variety of different exploratory
357 experiments.
358
359
360 --bgflat
361 Set the background residue distribution to a uniform distribu‐
362 tion, both for purposes of the null model used in calculating
363 scores, and for generating the random sequences. The default is
364 to use a standard amino acid background frequency distribution.
365
366
367 --bgcomp
368 Set the background residue distribution to the mean composition
369 of the profile. This was used in exploring some of the effects
370 of biased composition.
371
372
373 --x-no-lengthmodel
374 Turn the H3 target sequence length model off. Set the self-tran‐
375 sitions for N,C,J and the null model to 350/351 instead; this
376 emulates HMMER2. Not a good idea in general. This was used to
377 demonstrate one of the main H2 vs. H3 differences.
378
379
380 --nu <x>
381 Set the nu parameter for the MSV algorithm -- the expected num‐
382 ber of ungapped local alignments per target sequence. The
383 default is 2.0, corresponding to a E->J transition probability
384 of 0.5. This was used to test whether varying nu has significant
385 effect on result (it doesn't seem to, within reason). This
386 option only works if --msv is selected (it only affects MSV),
387 and it will not work with --fast (because the optimized imple‐
388 mentations are hardwired to assume nu=2.0).
389
390
391 --pthresh <x>
392 Set the filter P-value threshold to use in generating filter
393 power files with --ffile. The default is 0.02 (which would be
394 appropriate for testing MSV scores, since this is the default
395 MSV filter threshold in H3's acceleration pipeline.) Other
396 appropriate choices (matching defaults in the acceleration pipe‐
397 line) would be 0.001 for Viterbi, and 1e-5 for Forward.
398
399
400
401
402
403
405 See hmmer(1) for a master man page with a list of all the individual
406 man pages for programs in the HMMER package.
407
408
409 For complete documentation, see the user guide that came with your
410 HMMER distribution (Userguide.pdf); or see the HMMER web page ().
411
412
413
414
416 Copyright (C) 2015 Howard Hughes Medical Institute.
417 Freely distributed under the GNU General Public License (GPLv3).
418
419 For additional information on copyright and licensing, see the file
420 called COPYRIGHT in your HMMER source distribution, or see the HMMER
421 web page ().
422
423
424
426 Eddy/Rivas Laboratory
427 Janelia Farm Research Campus
428 19700 Helix Drive
429 Ashburn VA 20147 USA
430 http://eddylab.org
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447HMMER 3.1b2 February 2015 hmmsim(1)