hmmsim(1)

1hmmsim(1)                        HMMER Manual                        hmmsim(1)
2
3
4

NAME

6       hmmsim - collect score distributions on random sequences
7
8

SYNOPSIS

10       hmmsim [options] <hmmfile>
11
12
13

DESCRIPTION

15       The  hmmsim  program  generates  random sequences, scores them with the
16       model(s) in <hmmfile>, and outputs various sorts of histograms,  plots,
17       and fitted distributions for the resulting scores.
18
19
20       hmmsim  is not a mainstream part of the HMMER package. Most users would
21       have no reason to use it. It is used to develop and test the  statisti‐
22       cal  methods  used  to  determine  P-values and E-values in HMMER3. For
23       example, it was used to generate most of the results in a 2008 paper on
24       H3's  local  alignment  statistics  (PLoS  Comp  Bio  4:e1000069, 2008;
25       http://www.ploscompbiol.org/doi/pcbi.1000069).
26
27
28       Because it is a research testbed, you should not expect  it  to  be  as
29       robust  as  other  programs  in  the  package. For example, options may
30       interact in weird ways; we haven't tested nor tried to  anticipate  all
31       different possible combinations.
32
33
34       The  main  task  is  to fit a maximum likelihood Gumbel distribution to
35       Viterbi scores or an maximum likelihood exponential tail to  high-scor‐
36       ing  Forward  scores,  and to test that these fitted distributions obey
37       the conjecture that lambda ~ log_2 for both the Viterbi Gumbel and  the
38       Forward exponential tail.
39
40
41       The  output is a table of numbers, one row for each model. Four differ‐
42       ent parametric fits to the score data are tested: (1)  maximum  likeli‐
43       hood  fits to both location (mu/tau) and slope (lambda) parameters; (2)
44       assuming lambda=log_2, maximum likelihood fit to the location parameter
45       only;  (3)  same  but  assuming an edge-corrected lambda, using current
46       procedures in H3 [Eddy, 2008]; and (4) using both parameters determined
47       by  H3's  current  procedures.  The  standard  simple,  quick and dirty
48       statistic for goodness-of-fit is 'E@10', the calculated E-value of  the
49       10th ranked top hit, which we expect to be about 10.
50
51
52       In detail, the columns of the output are:
53
54
55       name   Name of the model.
56
57
58       tailp  Fraction of the highest scores used to fit the distribution. For
59              Viterbi, MSV, and Hybrid scores, this defaults to 1.0 (a  Gumbel
60              distribution  is  fitted  to  all the data). For Forward scores,
61              this defaults to 0.02 (an exponential  tail  is  fitted  to  the
62              highest 2% scores).
63
64
65       mu/tau Location parameter for the maximum likelihood fit to the data.
66
67
68       lambda Slope parameter for the maximum likelihood fit to the data.
69
70
71       E@10   The  E-value  calculated for the 10th ranked high score ('E@10')
72              using the ML mu/tau and lambda. By definition, this expected  to
73              be about 10, if E-value estimation were accurate.
74
75
76       mufix  Location  parameter,  for  a maximum likelihood fit with a known
77              (fixed) slope parameter lambda of log_2 (0.693).
78
79
80       E@10fix
81              The E-value calculated for the 10th ranked score using mufix and
82              the expected lambda = log_2 = 0.693.
83
84
85
86       mufix2 Location  parameter,  for a maximum likelihood fit with an edge-
87              effect-corrected lambda.
88
89
90       E@10fix2
91              The E-value calculated for the 10th ranked  score  using  mufix2
92              and the edge-effect-corrected lambda.
93
94
95       pmu    Location parameter as determined by H3's estimation procedures.
96
97
98       plambda
99              Slope parameter as determined by H3's estimation procedures.
100
101
102       pE@10  The  E-value  calculated  for  the  10th ranked score using pmu,
103              plambda.
104
105
106
107       At the end of this table, one more line is printed, starting with # and
108       summarizing the overall CPU time used by the simulations.
109
110
111       Some  of the optional output files are in xmgrace xy format. xmgrace is
112       powerful and freely available graph-plotting software.
113
114
115

MISCELLANEOUS OPTIONS

117       -h     Help; print a brief reminder  of  command  line  usage  and  all
118              available options.
119
120
121       -a     Collect  expected  Viterbi alignment length statistics from each
122              simulated sequence. This only works  with  Viterbi  scores  (the
123              default;  see  --vit).  Two additional fields are printed in the
124              output table for each model: the mean length of  Viterbi  align‐
125              ments, and the standard deviation.
126
127
128       -v     (Verbose). Print the scores too, one score per line.
129
130
131       -L <n> Set the length of the randomly sampled (nonhomologous) sequences
132              to <n>.  The default is 100.
133
134
135
136       -N <n> Set the number  of  randomly  sampled  sequences  to  <n>.   The
137              default is 1000.
138
139
140       --mpi  Run  in  MPI parallel mode, under mpirun.  It is parallelized at
141              the level of sending one profile at a  time  to  an  MPI  worker
142              process, so parallelization only helps if you have more than one
143              profile in the <hmmfile>, and you want to have at least as  many
144              profiles  as  MPI worker processes.  (Only available if optional
145              MPI support was enabled at compile-time.)
146
147
148
149

OPTIONS CONTROLLING OUTPUT

151       -o <f> Save the main output table to a file <f> rather than sending  it
152              to stdout.
153
154
155       --afile <f>
156              When  collecting  Viterbi  alignment statistics (the -a option),
157              for each sampled sequence, output two fields per line to a  file
158              <f>:  the  length  of the optimal alignment, and the Viterbi bit
159              score.  Requires that the -a option is also used.
160
161
162       --efile <f>
163              Output a rank vs. E-value plot in XMGRACE xy format to file <f>.
164              The  x-axis  is the rank of this sequence, from highest score to
165              lowest; the y-axis is the E-value calculated for this  sequence.
166              E-values  are calculated using H3's default procedures (i.e. the
167              pmu, plambda parameters in the output table). You expect a rough
168              match  between rank and E-value if E-values are accurately esti‐
169              mated.
170
171
172
173       --ffile <f>
174              Output a "filter power" file to <f>: for each model, a line with
175              three  fields:  model  name,  number of sequences passing the P-
176              value threshold, and fraction of sequences passing  the  P-value
177              threshold.  See  --pthresh  for  setting  the P-value threshold,
178              which defaults to 0.02 (the default MSV filter threshold in H3).
179              The  P-values  are as determined by H3's default procedures (the
180              pmu,plambda parameters in the output table).  If  all  is  well,
181              you  expect  to  see filter power equal to the predicted P-value
182              setting of the threshold.
183
184
185       --pfile <f>
186              Output cumulative survival plots (P(S>x)) to file <f> in XMGRACE
187              xy format. There are three plots: (1) the observed score distri‐
188              bution; (2) the maximum likelihood fitted  distribution;  (3)  a
189              maximum likelihood fit to the location parameter (mu/tau) while
190                  assuming lambda=log_2.
191
192
193       --xfile <f>
194              Output  the  bit  scores  as  a binary array of double-precision
195              floats (8 bytes per score) to file <f>.  Programs  like  Easel's
196              esl-histplot  can  read  such  binary files. This is useful when
197              generating extremely large sample sizes.
198
199
200

OPTIONS CONTROLLING MODEL CONFIGURATION (MODE)

202       H3 only uses multihit local alignment ( --fs mode), and this  is  where
203       we  believe  the  statistical  fits.   Unihit  local  alignment  scores
204       (Smith/Waterman; --sw mode)  also  obey  our  statistical  conjectures.
205       Glocal  alignment  statistics (either multihit or unihit) are still not
206       adequately understood nor adequately fitted.
207
208
209       --fs   Collect multihit local alignment scores. This  is  the  default.
210              alignment as 'fragment search mode'.
211
212
213       --sw   Collect  unihit  local  alignment scores. The H3 J state is dis‐
214              abled.  alignment as 'Smith/Waterman search mode'.
215
216
217       --ls   Collect   multihit   glocal   alignment   scores.   In    glocal
218              (global/local) alignment, the entire model must align, to a sub‐
219              sequence of the target. The H3 local entry/exit transition prob‐
220              abilities are disabled. 'ls' comes from HMMER2's historical ter‐
221              minology for multihit local alignment as 'local search mode'.
222
223
224       --s    Collect unihit glocal alignment scores.  Both the H3 J state and
225              local  entry/exit  transition  probabilities  are  disabled. 's'
226              comes from HMMER2's historical  terminology  for  unihit  glocal
227              alignment.
228
229
230
231

OPTIONS CONTROLLING SCORING ALGORITHM

233       --vit  Collect Viterbi maximum likelihood alignment scores. This is the
234              default.
235
236
237       --fwd  Collect Forward log-odds likelihood scores, summed  over  align‐
238              ment ensemble.
239
240
241       --hyb  Collect  'Hybrid'  scores,  as described in papers by Yu and Hwa
242              (for instance, Bioinformatics 18:864, 2002). These involve  cal‐
243              culating a Forward matrix and taking the maximum cell value. The
244              number itself is statistically  somewhat  unmotivated,  but  the
245              distribution is expected be a well-behaved extreme value distri‐
246              bution (Gumbel).
247
248
249       --msv  Collect MSV (multiple ungapped segment  Viterbi)  scores,  using
250              H3's main acceleration heuristic.
251
252
253       --fast For  any  of  the  above  options, use H3's optimized production
254              implementation (using SIMD vectorization). The default is to use
255              the implementations sacrifice a small amount of numerical preci‐
256              sion. This can introduce confounding noise into statistical sim‐
257              ulations  and fits, so when one gets super-concerned about exact
258              details, it's better to be able to factor that source  of  noise
259              out.
260
261

OPTIONS CONTROLLING FITTED TAIL MASSES FOR FORWARD

263       In  some experiments, it was useful to fit Forward scores to a range of
264       different tail masses, rather than just one. These  options  provide  a
265       mechanism  for fitting an evenly-spaced range of different tail masses.
266       For each different tail mass, a line is generated in the output.
267
268
269       --tmin <x>
270              Set the lower bound on the tail mass distribution. (The  default
271              is 0.02 for the default single tail mass.)
272
273
274       --tmax <x>
275              Set  the upper bound on the tail mass distribution. (The default
276              is 0.02 for the default single tail mass.)
277
278
279       --tpoints <n>
280              Set the number of tail masses to sample,  starting  from  --tmin
281              and  ending  at --tmax.  (The default is 1, for the default 0.02
282              single tail mass.)
283
284
285       --tlinear
286              Sample a range of tail masses with uniform linear  spacing.  The
287              default is to use uniform logarithmic spacing.
288
289
290
291

OPTIONS CONTROLLING H3 PARAMETER ESTIMATION METHODS

293       H3 uses three short random sequence simulations to estimating the loca‐
294       tion parameters for the expected score distributions  for  MSV  scores,
295       Viterbi  scores,  and Forward scores. These options allow these simula‐
296       tions to be modified.
297
298
299       --EmL <n>
300              Sets the sequence length in simulation that estimates the  loca‐
301              tion parameter mu for MSV E-values. Default is 200.
302
303
304       --EmN <n>
305              Sets  the  number  of sequences in simulation that estimates the
306              location parameter mu for MSV E-values. Default is 200.
307
308
309       --EvL <n>
310              Sets the sequence length in simulation that estimates the  loca‐
311              tion parameter mu for Viterbi E-values. Default is 200.
312
313
314       --EvN <n>
315              Sets  the  number  of sequences in simulation that estimates the
316              location parameter mu for Viterbi E-values. Default is 200.
317
318
319       --EfL <n>
320              Sets the sequence length in simulation that estimates the  loca‐
321              tion parameter tau for Forward E-values. Default is 100.
322
323
324       --EfN <n>
325              Sets  the  number  of sequences in simulation that estimates the
326              location parameter tau for Forward E-values. Default is 200.
327
328
329       --Eft <x>
330              Sets the tail mass fraction to fit in the simulation that  esti‐
331              mates the location parameter tau for Forward evalues. Default is
332              0.04.
333
334
335

DEBUGGING OPTIONS

337       --stall
338              For debugging the MPI master/worker version: pause after  start,
339              to  enable the developer to attach debuggers to the running mas‐
340              ter and worker(s) processes. Send SIGCONT signal to release  the
341              pause.   (Under  gdb:  (gdb)  signal SIGCONT) (Only available if
342              optional MPI support was enabled at compile-time.)
343
344
345       --seed <n>
346              Set the random number seed to <n>.   The  default  is  0,  which
347              makes the random number generator use an arbitrary seed, so that
348              different runs of hmmsim will almost certainly generate  a  dif‐
349              ferent statistical sample.  For debugging, it is useful to force
350              reproducible results, by fixing a random number seed.
351
352
353
354

EXPERIMENTAL OPTIONS

356       These options were used in a small  variety  of  different  exploratory
357       experiments.
358
359
360       --bgflat
361              Set  the  background residue distribution to a uniform distribu‐
362              tion, both for purposes of the null model  used  in  calculating
363              scores,  and for generating the random sequences. The default is
364              to use a standard amino acid background frequency distribution.
365
366
367       --bgcomp
368              Set the background residue distribution to the mean  composition
369              of  the  profile. This was used in exploring some of the effects
370              of biased composition.
371
372
373       --x-no-lengthmodel
374              Turn the H3 target sequence length model off. Set the self-tran‐
375              sitions  for  N,C,J  and the null model to 350/351 instead; this
376              emulates HMMER2.  Not a good idea in general. This was  used  to
377              demonstrate one of the main H2 vs. H3 differences.
378
379
380       --nu <x>
381              Set  the nu parameter for the MSV algorithm -- the expected num‐
382              ber of  ungapped  local  alignments  per  target  sequence.  The
383              default  is  2.0, corresponding to a E->J transition probability
384              of 0.5. This was used to test whether varying nu has significant
385              effect  on  result  (it  doesn't  seem to, within reason).  This
386              option only works if --msv is selected (it  only  affects  MSV),
387              and  it  will not work with --fast (because the optimized imple‐
388              mentations are hardwired to assume nu=2.0).
389
390
391       --pthresh <x>
392              Set the filter P-value threshold to  use  in  generating  filter
393              power  files  with --ffile.  The default is 0.02 (which would be
394              appropriate for testing MSV scores, since this  is  the  default
395              MSV  filter  threshold  in  H3's  acceleration  pipeline.) Other
396              appropriate choices (matching defaults in the acceleration pipe‐
397              line) would be 0.001 for Viterbi, and 1e-5 for Forward.
398
399
400
401
402
403

COPYRIGHT

416       Copyright (C) 2015 Howard Hughes Medical Institute.
417       Freely distributed under the GNU General Public License (GPLv3).
418
419       For  additional  information  on  copyright and licensing, see the file
420       called COPYRIGHT in your HMMER source distribution, or  see  the  HMMER
421       web page ().
422
423
424

AUTHOR

426       Eddy/Rivas Laboratory
427       Janelia Farm Research Campus
428       19700 Helix Drive
429       Ashburn VA 20147 USA
430       http://eddylab.org
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447HMMER 3.1b2                      February 2015                       hmmsim(1)