hmmsim(1)

1hmmsim(1)                        HMMER Manual                        hmmsim(1)
2
3
4

NAME

6       hmmsim - collect profile score distributions on random sequences
7
8

SYNOPSIS

10       hmmsim [options] hmmfile
11
12
13

DESCRIPTION

15       The  hmmsim  program  generates  random sequences, scores them with the
16       model(s) in hmmfile, and outputs various sorts  of  histograms,  plots,
17       and fitted distributions for the resulting scores.
18
19
20       hmmsim  is  not  a  mainstream part of the HMMER package and most users
21       would have no reason to use it. It is used to develop and test the sta‐
22       tistical methods used to determine P-values and E-values in HMMER3. For
23       example, it was used to generate most of the results in a 2008 paper on
24       H3's  local  alignment  statistics  (PLoS  Comp  Bio  4:e1000069, 2008;
25       http://www.ploscompbiol.org/doi/pcbi.1000069).
26
27
28       Because it is a research testbed, you should not expect it to be as ro‐
29       bust  as other programs in the package. For example, options may inter‐
30       act in weird ways; we haven't tested nor tried to anticipate  all  dif‐
31       ferent possible combinations.
32
33
34       The  main  task  is  to fit a maximum likelihood Gumbel distribution to
35       Viterbi scores or an maximum likelihood exponential tail to  high-scor‐
36       ing  Forward  scores,  and to test that these fitted distributions obey
37       the conjecture that lambda ~ log_2 for both the Viterbi Gumbel and  the
38       Forward exponential tail.
39
40
41       The  output is a table of numbers, one row for each model. Four differ‐
42       ent parametric fits to the score data are tested: (1)  maximum  likeli‐
43       hood  fits to both location (mu/tau) and slope (lambda) parameters; (2)
44       assuming lambda=log_2, maximum likelihood fit to the location parameter
45       only;  (3)  same  but  assuming an edge-corrected lambda, using current
46       procedures in H3 [Eddy, 2008]; and (4) using both parameters determined
47       by  H3's  current  procedures.  The  standard  simple,  quick and dirty
48       statistic for goodness-of-fit is 'E@10', the calculated E-value of  the
49       10th ranked top hit, which we expect to be about 10.
50
51
52       In detail, the columns of the output are:
53
54
55       name   Name of the model.
56
57
58       tailp  Fraction of the highest scores used to fit the distribution. For
59              Viterbi, MSV, and Hybrid scores, this defaults to 1.0 (a  Gumbel
60              distribution  is  fitted  to  all the data). For Forward scores,
61              this defaults to 0.02 (an exponential  tail  is  fitted  to  the
62              highest 2% scores).
63
64
65       mu/tau Location parameter for the maximum likelihood fit to the data.
66
67
68       lambda Slope parameter for the maximum likelihood fit to the data.
69
70
71       E@10   The  E-value  calculated for the 10th ranked high score ('E@10')
72              using the ML mu/tau and lambda. By definition, this expected  to
73              be about 10, if E-value estimation were accurate.
74
75
76       mufix  Location  parameter,  for  a maximum likelihood fit with a known
77              (fixed) slope parameter lambda of log_2 (0.693).
78
79
80       E@10fix
81              The E-value calculated for the 10th ranked score using mufix and
82              the expected lambda = log_2 = 0.693.
83
84
85
86       mufix2 Location  parameter,  for a maximum likelihood fit with an edge-
87              effect-corrected lambda.
88
89
90       E@10fix2
91              The E-value calculated for the 10th ranked  score  using  mufix2
92              and the edge-effect-corrected lambda.
93
94
95       pmu    Location parameter as determined by H3's estimation procedures.
96
97
98       plambda
99              Slope parameter as determined by H3's estimation procedures.
100
101
102       pE@10  The  E-value  calculated  for  the  10th ranked score using pmu,
103              plambda.
104
105
106
107       At the end of this table, one more line is printed, starting with # and
108       summarizing the overall CPU time used by the simulations.
109
110
111       Some  of the optional output files are in xmgrace xy format. xmgrace is
112       powerful and freely available graph-plotting software.
113
114
115

OPTIONS

117       -h     Help; print a brief reminder  of  command  line  usage  and  all
118              available options.
119
120
121       -a     Collect  expected  Viterbi alignment length statistics from each
122              simulated sequence. This only works with Viterbi scores (the de‐
123              fault;  see  --vit).   Two  additional fields are printed in the
124              output table for each model: the mean length of  Viterbi  align‐
125              ments, and the standard deviation.
126
127
128       -v     (Verbose). Print the scores too, one score per line.
129
130
131       -L <n> Set the length of the randomly sampled (nonhomologous) sequences
132              to <n>.  The default is 100.
133
134
135
136       -N <n> Set the number of randomly sampled sequences to  <n>.   The  de‐
137              fault is 1000.
138
139
140       --mpi  Run  under MPI control with master/worker parallelization (using
141              mpirun, for example, or equivalent). Only available if  optional
142              MPI support was enabled at compile-time.
143
144              It is parallelized at the level of sending one profile at a time
145              to an MPI worker process, so parallelization only helps  if  you
146              have  more than one profile in the hmmfile, and you want to have
147              at least as many profiles as MPI worker processes.
148
149
150
151
152

OPTIONS CONTROLLING OUTPUT

154       -o <f> Save the main output table to a file <f> rather than sending  it
155              to stdout.
156
157
158       --afile <f>
159              When  collecting  Viterbi  alignment statistics (the -a option),
160              for each sampled sequence, output two fields per line to a  file
161              <f>:  the  length  of the optimal alignment, and the Viterbi bit
162              score.  Requires that the -a option is also used.
163
164
165       --efile <f>
166              Output a rank vs. E-value plot in XMGRACE xy format to file <f>.
167              The  x-axis  is the rank of this sequence, from highest score to
168              lowest; the y-axis is the E-value calculated for this  sequence.
169              E-values  are calculated using H3's default procedures (i.e. the
170              pmu, plambda parameters in the output table). You expect a rough
171              match  between rank and E-value if E-values are accurately esti‐
172              mated.
173
174
175
176       --ffile <f>
177              Output a "filter power" file to <f>: for each model, a line with
178              three  fields:  model  name,  number of sequences passing the P-
179              value threshold, and fraction of sequences passing  the  P-value
180              threshold.  See  --pthresh  for  setting  the P-value threshold,
181              which defaults to 0.02 (the default MSV filter threshold in H3).
182              The  P-values  are as determined by H3's default procedures (the
183              pmu,plambda parameters in the output table).  If  all  is  well,
184              you  expect  to  see filter power equal to the predicted P-value
185              setting of the threshold.
186
187
188       --pfile <f>
189              Output cumulative survival plots (P(S>x)) to file <f> in XMGRACE
190              xy format. There are three plots: (1) the observed score distri‐
191              bution; (2) the maximum likelihood fitted  distribution;  (3)  a
192              maximum likelihood fit to the location parameter (mu/tau) while
193                  assuming lambda=log_2.
194
195
196       --xfile <f>
197              Output  the  bit  scores  as  a binary array of double-precision
198              floats (8 bytes per score) to file <f>.  Programs  like  Easel's
199              esl-histplot  can  read  such  binary files. This is useful when
200              generating extremely large sample sizes.
201
202
203

OPTIONS CONTROLLING MODEL CONFIGURATION (MODE)

205       H3 only uses multihit local alignment ( --fs mode), and this  is  where
206       we  believe  the  statistical  fits.   Unihit  local  alignment  scores
207       (Smith/Waterman; --sw mode)  also  obey  our  statistical  conjectures.
208       Glocal  alignment  statistics (either multihit or unihit) are still not
209       adequately understood nor adequately fitted.
210
211
212       --fs   Collect multihit local alignment scores. This  is  the  default.
213              "fs" comes from HMMER2's historical terminology for multihit lo‐
214              cal alignment as 'fragment search mode'.
215
216
217       --sw   Collect unihit local alignment scores. The H3 J  state  is  dis‐
218              abled.  "sw" comes from HMMER2's historical terminology for uni‐
219              hit local alignment as 'Smith/Waterman search mode'.
220
221
222       --ls   Collect multihit glocal alignment scores. In glocal  (global/lo‐
223              cal) alignment, the entire model must align, to a subsequence of
224              the target. The H3 local entry/exit transition probabilities are
225              disabled.  'ls'  comes  from HMMER2's historical terminology for
226              multihit local alignment as 'local search mode'.
227
228
229       --s    Collect unihit glocal alignment scores.  Both the H3 J state and
230              local  entry/exit  transition  probabilities  are  disabled. 's'
231              comes from HMMER2's historical  terminology  for  unihit  glocal
232              alignment.
233
234
235
236

OPTIONS CONTROLLING SCORING ALGORITHM

238       --vit  Collect Viterbi maximum likelihood alignment scores. This is the
239              default.
240
241
242       --fwd  Collect Forward log-odds likelihood scores, summed  over  align‐
243              ment ensemble.
244
245
246       --hyb  Collect  'Hybrid'  scores,  as described in papers by Yu and Hwa
247              (for instance, Bioinformatics 18:864, 2002). These involve  cal‐
248              culating a Forward matrix and taking the maximum cell value. The
249              number itself is statistically  somewhat  unmotivated,  but  the
250              distribution is expected be a well-behaved extreme value distri‐
251              bution (Gumbel).
252
253
254       --msv  Collect MSV (multiple ungapped segment  Viterbi)  scores,  using
255              H3's main acceleration heuristic.
256
257
258       --fast For  any of the above options, use H3's optimized production im‐
259              plementation (using SIMD vectorization). The default is  to  use
260              the  "generic" implementation (slow and non-vectorized). The op‐
261              timized implementations sacrifice a small  amount  of  numerical
262              precision. This can introduce confounding noise into statistical
263              simulations and fits, so when one gets super-concerned about ex‐
264              act  details,  it's  better  to be able to factor that source of
265              noise out.
266
267

OPTIONS CONTROLLING FITTED TAIL MASSES FOR FORWARD

269       In some experiments, it was useful to fit Forward scores to a range  of
270       different  tail  masses,  rather than just one. These options provide a
271       mechanism for fitting an evenly-spaced range of different tail  masses.
272       For each different tail mass, a line is generated in the output.
273
274
275       --tmin <x>
276              Set  the lower bound on the tail mass distribution. (The default
277              is 0.02 for the default single tail mass.)
278
279
280       --tmax <x>
281              Set the upper bound on the tail mass distribution. (The  default
282              is 0.02 for the default single tail mass.)
283
284
285       --tpoints <n>
286              Set  the  number  of tail masses to sample, starting from --tmin
287              and ending at --tmax.  (The default is 1, for the  default  0.02
288              single tail mass.)
289
290
291       --tlinear
292              Sample  a  range of tail masses with uniform linear spacing. The
293              default is to use uniform logarithmic spacing.
294
295
296
297

OPTIONS CONTROLLING H3 PARAMETER ESTIMATION METHODS

299       H3 uses three short random sequence simulations to estimating the loca‐
300       tion  parameters  for  the expected score distributions for MSV scores,
301       Viterbi scores, and Forward scores. These options allow  these  simula‐
302       tions to be modified.
303
304
305       --EmL <n>
306              Sets  the sequence length in simulation that estimates the loca‐
307              tion parameter mu for MSV E-values. Default is 200.
308
309
310       --EmN <n>
311              Sets the number of sequences in simulation  that  estimates  the
312              location parameter mu for MSV E-values. Default is 200.
313
314
315       --EvL <n>
316              Sets  the sequence length in simulation that estimates the loca‐
317              tion parameter mu for Viterbi E-values. Default is 200.
318
319
320       --EvN <n>
321              Sets the number of sequences in simulation  that  estimates  the
322              location parameter mu for Viterbi E-values. Default is 200.
323
324
325       --EfL <n>
326              Sets  the sequence length in simulation that estimates the loca‐
327              tion parameter tau for Forward E-values. Default is 100.
328
329
330       --EfN <n>
331              Sets the number of sequences in simulation  that  estimates  the
332              location parameter tau for Forward E-values. Default is 200.
333
334
335       --Eft <x>
336              Sets  the tail mass fraction to fit in the simulation that esti‐
337              mates the location parameter tau for Forward evalues. Default is
338              0.04.
339
340
341

DEBUGGING OPTIONS

343       --stall
344              For  debugging the MPI master/worker version: pause after start,
345              to enable the developer to attach debuggers to the running  mas‐
346              ter  and worker(s) processes. Send SIGCONT signal to release the
347              pause.  (Under gdb: (gdb) signal SIGCONT) (Only available if op‐
348              tional MPI support was enabled at compile-time.)
349
350
351       --seed <n>
352              Set  the  random  number  seed  to <n>.  The default is 0, which
353              makes the random number generator use an arbitrary seed, so that
354              different  runs  of hmmsim will almost certainly generate a dif‐
355              ferent statistical sample.  For debugging, it is useful to force
356              reproducible results, by fixing a random number seed.
357
358
359
360

EXPERIMENTAL OPTIONS

362       These options were used in a small variety of different exploratory ex‐
363       periments.
364
365
366       --bgflat
367              Set the background residue distribution to a  uniform  distribu‐
368              tion,  both  for  purposes of the null model used in calculating
369              scores, and for generating the random sequences. The default  is
370              to use a standard amino acid background frequency distribution.
371
372
373       --bgcomp
374              Set  the background residue distribution to the mean composition
375              of the profile. This was used in exploring some of  the  effects
376              of biased composition.
377
378
379       --x-no-lengthmodel
380              Turn the H3 target sequence length model off. Set the self-tran‐
381              sitions for N,C,J and the null model to  350/351  instead;  this
382              emulates  HMMER2.   Not a good idea in general. This was used to
383              demonstrate one of the main H2 vs. H3 differences.
384
385
386       --nu <x>
387              Set the nu parameter for the MSV algorithm -- the expected  num‐
388              ber  of  ungapped  local alignments per target sequence. The de‐
389              fault is 2.0, corresponding to a E->J transition probability  of
390              0.5.  This  was  used to test whether varying nu has significant
391              effect on result (it doesn't seem to, within reason).  This  op‐
392              tion  only works if --msv is selected (it only affects MSV), and
393              it will not work with --fast (because the optimized  implementa‐
394              tions are hardwired to assume nu=2.0).
395
396
397       --pthresh <x>
398              Set  the  filter  P-value  threshold to use in generating filter
399              power files with --ffile.  The default is 0.02 (which  would  be
400              appropriate  for  testing  MSV scores, since this is the default
401              MSV filter threshold in H3's acceleration pipeline.)  Other  ap‐
402              propriate  choices  (matching defaults in the acceleration pipe‐
403              line) would be 0.001 for Viterbi, and 1e-5 for Forward.
404
405
406
407
408
409

COPYRIGHT

423       Copyright (C) 2020 Howard Hughes Medical Institute.
424       Freely distributed under the BSD open source license.
425
426       For  additional  information  on  copyright and licensing, see the file
427       called COPYRIGHT in your HMMER source distribution, or  see  the  HMMER
428       web page (http://hmmer.org/).
429
430
431

AUTHOR

433       http://eddylab.org
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450HMMER 3.3.2                        Nov 2020                          hmmsim(1)