1hmmsim(1) HMMER Manual hmmsim(1)
2
3
4
6 hmmsim - collect profile score distributions on random sequences
7
8
10 hmmsim [options] hmmfile
11
12
13
15 The hmmsim program generates random sequences, scores them with the
16 model(s) in hmmfile, and outputs various sorts of histograms, plots,
17 and fitted distributions for the resulting scores.
18
19
20 hmmsim is not a mainstream part of the HMMER package and most users
21 would have no reason to use it. It is used to develop and test the sta‐
22 tistical methods used to determine P-values and E-values in HMMER3. For
23 example, it was used to generate most of the results in a 2008 paper on
24 H3's local alignment statistics (PLoS Comp Bio 4:e1000069, 2008;
25 http://www.ploscompbiol.org/doi/pcbi.1000069).
26
27
28 Because it is a research testbed, you should not expect it to be as ro‐
29 bust as other programs in the package. For example, options may inter‐
30 act in weird ways; we haven't tested nor tried to anticipate all dif‐
31 ferent possible combinations.
32
33
34 The main task is to fit a maximum likelihood Gumbel distribution to
35 Viterbi scores or an maximum likelihood exponential tail to high-scor‐
36 ing Forward scores, and to test that these fitted distributions obey
37 the conjecture that lambda ~ log_2 for both the Viterbi Gumbel and the
38 Forward exponential tail.
39
40
41 The output is a table of numbers, one row for each model. Four differ‐
42 ent parametric fits to the score data are tested: (1) maximum likeli‐
43 hood fits to both location (mu/tau) and slope (lambda) parameters; (2)
44 assuming lambda=log_2, maximum likelihood fit to the location parameter
45 only; (3) same but assuming an edge-corrected lambda, using current
46 procedures in H3 [Eddy, 2008]; and (4) using both parameters determined
47 by H3's current procedures. The standard simple, quick and dirty
48 statistic for goodness-of-fit is 'E@10', the calculated E-value of the
49 10th ranked top hit, which we expect to be about 10.
50
51
52 In detail, the columns of the output are:
53
54
55 name Name of the model.
56
57
58 tailp Fraction of the highest scores used to fit the distribution. For
59 Viterbi, MSV, and Hybrid scores, this defaults to 1.0 (a Gumbel
60 distribution is fitted to all the data). For Forward scores,
61 this defaults to 0.02 (an exponential tail is fitted to the
62 highest 2% scores).
63
64
65 mu/tau Location parameter for the maximum likelihood fit to the data.
66
67
68 lambda Slope parameter for the maximum likelihood fit to the data.
69
70
71 E@10 The E-value calculated for the 10th ranked high score ('E@10')
72 using the ML mu/tau and lambda. By definition, this expected to
73 be about 10, if E-value estimation were accurate.
74
75
76 mufix Location parameter, for a maximum likelihood fit with a known
77 (fixed) slope parameter lambda of log_2 (0.693).
78
79
80 E@10fix
81 The E-value calculated for the 10th ranked score using mufix and
82 the expected lambda = log_2 = 0.693.
83
84
85
86 mufix2 Location parameter, for a maximum likelihood fit with an edge-
87 effect-corrected lambda.
88
89
90 E@10fix2
91 The E-value calculated for the 10th ranked score using mufix2
92 and the edge-effect-corrected lambda.
93
94
95 pmu Location parameter as determined by H3's estimation procedures.
96
97
98 plambda
99 Slope parameter as determined by H3's estimation procedures.
100
101
102 pE@10 The E-value calculated for the 10th ranked score using pmu,
103 plambda.
104
105
106
107 At the end of this table, one more line is printed, starting with # and
108 summarizing the overall CPU time used by the simulations.
109
110
111 Some of the optional output files are in xmgrace xy format. xmgrace is
112 powerful and freely available graph-plotting software.
113
114
115
117 -h Help; print a brief reminder of command line usage and all
118 available options.
119
120
121 -a Collect expected Viterbi alignment length statistics from each
122 simulated sequence. This only works with Viterbi scores (the de‐
123 fault; see --vit). Two additional fields are printed in the
124 output table for each model: the mean length of Viterbi align‐
125 ments, and the standard deviation.
126
127
128 -v (Verbose). Print the scores too, one score per line.
129
130
131 -L <n> Set the length of the randomly sampled (nonhomologous) sequences
132 to <n>. The default is 100.
133
134
135
136 -N <n> Set the number of randomly sampled sequences to <n>. The de‐
137 fault is 1000.
138
139
140 --mpi Run under MPI control with master/worker parallelization (using
141 mpirun, for example, or equivalent). Only available if optional
142 MPI support was enabled at compile-time.
143
144 It is parallelized at the level of sending one profile at a time
145 to an MPI worker process, so parallelization only helps if you
146 have more than one profile in the hmmfile, and you want to have
147 at least as many profiles as MPI worker processes.
148
149
150
151
152
154 -o <f> Save the main output table to a file <f> rather than sending it
155 to stdout.
156
157
158 --afile <f>
159 When collecting Viterbi alignment statistics (the -a option),
160 for each sampled sequence, output two fields per line to a file
161 <f>: the length of the optimal alignment, and the Viterbi bit
162 score. Requires that the -a option is also used.
163
164
165 --efile <f>
166 Output a rank vs. E-value plot in XMGRACE xy format to file <f>.
167 The x-axis is the rank of this sequence, from highest score to
168 lowest; the y-axis is the E-value calculated for this sequence.
169 E-values are calculated using H3's default procedures (i.e. the
170 pmu, plambda parameters in the output table). You expect a rough
171 match between rank and E-value if E-values are accurately esti‐
172 mated.
173
174
175
176 --ffile <f>
177 Output a "filter power" file to <f>: for each model, a line with
178 three fields: model name, number of sequences passing the P-
179 value threshold, and fraction of sequences passing the P-value
180 threshold. See --pthresh for setting the P-value threshold,
181 which defaults to 0.02 (the default MSV filter threshold in H3).
182 The P-values are as determined by H3's default procedures (the
183 pmu,plambda parameters in the output table). If all is well,
184 you expect to see filter power equal to the predicted P-value
185 setting of the threshold.
186
187
188 --pfile <f>
189 Output cumulative survival plots (P(S>x)) to file <f> in XMGRACE
190 xy format. There are three plots: (1) the observed score distri‐
191 bution; (2) the maximum likelihood fitted distribution; (3) a
192 maximum likelihood fit to the location parameter (mu/tau) while
193 assuming lambda=log_2.
194
195
196 --xfile <f>
197 Output the bit scores as a binary array of double-precision
198 floats (8 bytes per score) to file <f>. Programs like Easel's
199 esl-histplot can read such binary files. This is useful when
200 generating extremely large sample sizes.
201
202
203
205 H3 only uses multihit local alignment ( --fs mode), and this is where
206 we believe the statistical fits. Unihit local alignment scores
207 (Smith/Waterman; --sw mode) also obey our statistical conjectures.
208 Glocal alignment statistics (either multihit or unihit) are still not
209 adequately understood nor adequately fitted.
210
211
212 --fs Collect multihit local alignment scores. This is the default.
213 "fs" comes from HMMER2's historical terminology for multihit lo‐
214 cal alignment as 'fragment search mode'.
215
216
217 --sw Collect unihit local alignment scores. The H3 J state is dis‐
218 abled. "sw" comes from HMMER2's historical terminology for uni‐
219 hit local alignment as 'Smith/Waterman search mode'.
220
221
222 --ls Collect multihit glocal alignment scores. In glocal (global/lo‐
223 cal) alignment, the entire model must align, to a subsequence of
224 the target. The H3 local entry/exit transition probabilities are
225 disabled. 'ls' comes from HMMER2's historical terminology for
226 multihit local alignment as 'local search mode'.
227
228
229 --s Collect unihit glocal alignment scores. Both the H3 J state and
230 local entry/exit transition probabilities are disabled. 's'
231 comes from HMMER2's historical terminology for unihit glocal
232 alignment.
233
234
235
236
238 --vit Collect Viterbi maximum likelihood alignment scores. This is the
239 default.
240
241
242 --fwd Collect Forward log-odds likelihood scores, summed over align‐
243 ment ensemble.
244
245
246 --hyb Collect 'Hybrid' scores, as described in papers by Yu and Hwa
247 (for instance, Bioinformatics 18:864, 2002). These involve cal‐
248 culating a Forward matrix and taking the maximum cell value. The
249 number itself is statistically somewhat unmotivated, but the
250 distribution is expected be a well-behaved extreme value distri‐
251 bution (Gumbel).
252
253
254 --msv Collect MSV (multiple ungapped segment Viterbi) scores, using
255 H3's main acceleration heuristic.
256
257
258 --fast For any of the above options, use H3's optimized production im‐
259 plementation (using SIMD vectorization). The default is to use
260 the "generic" implementation (slow and non-vectorized). The op‐
261 timized implementations sacrifice a small amount of numerical
262 precision. This can introduce confounding noise into statistical
263 simulations and fits, so when one gets super-concerned about ex‐
264 act details, it's better to be able to factor that source of
265 noise out.
266
267
269 In some experiments, it was useful to fit Forward scores to a range of
270 different tail masses, rather than just one. These options provide a
271 mechanism for fitting an evenly-spaced range of different tail masses.
272 For each different tail mass, a line is generated in the output.
273
274
275 --tmin <x>
276 Set the lower bound on the tail mass distribution. (The default
277 is 0.02 for the default single tail mass.)
278
279
280 --tmax <x>
281 Set the upper bound on the tail mass distribution. (The default
282 is 0.02 for the default single tail mass.)
283
284
285 --tpoints <n>
286 Set the number of tail masses to sample, starting from --tmin
287 and ending at --tmax. (The default is 1, for the default 0.02
288 single tail mass.)
289
290
291 --tlinear
292 Sample a range of tail masses with uniform linear spacing. The
293 default is to use uniform logarithmic spacing.
294
295
296
297
299 H3 uses three short random sequence simulations to estimating the loca‐
300 tion parameters for the expected score distributions for MSV scores,
301 Viterbi scores, and Forward scores. These options allow these simula‐
302 tions to be modified.
303
304
305 --EmL <n>
306 Sets the sequence length in simulation that estimates the loca‐
307 tion parameter mu for MSV E-values. Default is 200.
308
309
310 --EmN <n>
311 Sets the number of sequences in simulation that estimates the
312 location parameter mu for MSV E-values. Default is 200.
313
314
315 --EvL <n>
316 Sets the sequence length in simulation that estimates the loca‐
317 tion parameter mu for Viterbi E-values. Default is 200.
318
319
320 --EvN <n>
321 Sets the number of sequences in simulation that estimates the
322 location parameter mu for Viterbi E-values. Default is 200.
323
324
325 --EfL <n>
326 Sets the sequence length in simulation that estimates the loca‐
327 tion parameter tau for Forward E-values. Default is 100.
328
329
330 --EfN <n>
331 Sets the number of sequences in simulation that estimates the
332 location parameter tau for Forward E-values. Default is 200.
333
334
335 --Eft <x>
336 Sets the tail mass fraction to fit in the simulation that esti‐
337 mates the location parameter tau for Forward evalues. Default is
338 0.04.
339
340
341
343 --stall
344 For debugging the MPI master/worker version: pause after start,
345 to enable the developer to attach debuggers to the running mas‐
346 ter and worker(s) processes. Send SIGCONT signal to release the
347 pause. (Under gdb: (gdb) signal SIGCONT) (Only available if op‐
348 tional MPI support was enabled at compile-time.)
349
350
351 --seed <n>
352 Set the random number seed to <n>. The default is 0, which
353 makes the random number generator use an arbitrary seed, so that
354 different runs of hmmsim will almost certainly generate a dif‐
355 ferent statistical sample. For debugging, it is useful to force
356 reproducible results, by fixing a random number seed.
357
358
359
360
362 These options were used in a small variety of different exploratory ex‐
363 periments.
364
365
366 --bgflat
367 Set the background residue distribution to a uniform distribu‐
368 tion, both for purposes of the null model used in calculating
369 scores, and for generating the random sequences. The default is
370 to use a standard amino acid background frequency distribution.
371
372
373 --bgcomp
374 Set the background residue distribution to the mean composition
375 of the profile. This was used in exploring some of the effects
376 of biased composition.
377
378
379 --x-no-lengthmodel
380 Turn the H3 target sequence length model off. Set the self-tran‐
381 sitions for N,C,J and the null model to 350/351 instead; this
382 emulates HMMER2. Not a good idea in general. This was used to
383 demonstrate one of the main H2 vs. H3 differences.
384
385
386 --nu <x>
387 Set the nu parameter for the MSV algorithm -- the expected num‐
388 ber of ungapped local alignments per target sequence. The de‐
389 fault is 2.0, corresponding to a E->J transition probability of
390 0.5. This was used to test whether varying nu has significant
391 effect on result (it doesn't seem to, within reason). This op‐
392 tion only works if --msv is selected (it only affects MSV), and
393 it will not work with --fast (because the optimized implementa‐
394 tions are hardwired to assume nu=2.0).
395
396
397 --pthresh <x>
398 Set the filter P-value threshold to use in generating filter
399 power files with --ffile. The default is 0.02 (which would be
400 appropriate for testing MSV scores, since this is the default
401 MSV filter threshold in H3's acceleration pipeline.) Other ap‐
402 propriate choices (matching defaults in the acceleration pipe‐
403 line) would be 0.001 for Viterbi, and 1e-5 for Forward.
404
405
406
407
408
409
411 See hmmer(1) for a master man page with a list of all the individual
412 man pages for programs in the HMMER package.
413
414
415 For complete documentation, see the user guide that came with your HM‐
416 MER distribution (Userguide.pdf); or see the HMMER web page (http://hm‐
417 mer.org/).
418
419
420
421
423 Copyright (C) 2020 Howard Hughes Medical Institute.
424 Freely distributed under the BSD open source license.
425
426 For additional information on copyright and licensing, see the file
427 called COPYRIGHT in your HMMER source distribution, or see the HMMER
428 web page (http://hmmer.org/).
429
430
431
433 http://eddylab.org
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450HMMER 3.3.2 Nov 2020 hmmsim(1)