jackhmmer(1)

jackhmmer(1)                     HMMER Manual                     jackhmmer(1)

NAME

       jackhmmer - iteratively search sequence(s) against a protein database

SYNOPSIS

       jackhmmer [options] <seqfile> <seqdb>

DESCRIPTION

       jackhmmer iteratively searches each query sequence in <seqfile> against
       the target sequence(s) in <seqdb>. The first iteration is identical to
       a phmmer search. For each subsequent iteration, a multiple alignment of
       the query together with all target sequences satisfying the inclusion
       thresholds is assembled, a profile is constructed from this alignment
       (identical to using hmmbuild on the alignment), and a profile search of
       <seqdb> is done (identical to an hmmsearch with the profile).

       The output format is designed to be human-readable, but it is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option redirects
       the main output, which includes discarding it by writing to /dev/null.
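
       As an illustration of the options just mentioned, the following Python
       sketch assembles a jackhmmer command line that discards the main output
       and keeps only the tabular summary. The helper name
       build_jackhmmer_cmd and the file names are hypothetical; only the
       option names come from this manual. The sketch prints the command and
       does not run it.

```python
import shlex


def build_jackhmmer_cmd(seqfile, seqdb, tblout=None, domtblout=None,
                        max_iterations=None, discard_main_output=False):
    """Assemble a jackhmmer argument list (hypothetical helper)."""
    cmd = ["jackhmmer"]
    if max_iterations is not None:
        cmd += ["-N", str(max_iterations)]      # maximum number of iterations
    if discard_main_output:
        cmd += ["-o", "/dev/null"]              # throw away the main output
    if tblout is not None:
        cmd += ["--tblout", tblout]             # per-sequence tabular summary
    if domtblout is not None:
        cmd += ["--domtblout", domtblout]       # per-domain tabular summary
    cmd += [seqfile, seqdb]                     # positional arguments go last
    return cmd


# File names below are placeholders, not files shipped with HMMER.
cmd = build_jackhmmer_cmd("query.fa", "targets.fa", tblout="hits.tbl",
                          max_iterations=3, discard_main_output=True)
print(shlex.join(cmd))
```

       The resulting list could be passed to subprocess.run(cmd, check=True)
       on a machine where jackhmmer is installed.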

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

       -N <n> Set the maximum number of iterations to <n>. The default is 5.
              If <n> is 1, the result is equivalent to a phmmer search.

OPTIONS CONTROLLING OUTPUT

       By default, output for each iteration appears on stdout in a somewhat
       human-readable, somewhat parseable format. These options allow
       redirecting that output or saving additional kinds of output to files,
       including checkpoint files for each iteration.

       -o <f> Direct the human-readable output to a file <f>.

       -A <f> After the final iteration, save an annotated multiple alignment
              of all hits satisfying inclusion thresholds (also including the
              original query) to <f> in Stockholm format.

       --tblout <f>
              After the final iteration, save a tabular summary of top
              sequence hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --domtblout <f>
              After the final iteration, save a tabular summary of top domain
              hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --chkhmm <prefix>
              At the start of each iteration, checkpoint the query HMM, saving
              it to a file named <prefix>-<n>.hmm, where <n> is the iteration
              number (from 1..N).

       --chkali <prefix>
              At the end of each iteration, checkpoint an alignment of all
              domains satisfying inclusion thresholds (i.e. what will become
              the query HMM for the next iteration), saving it to a file named
              <prefix>-<n>.sto in Stockholm format, where <n> is the iteration
              number (from 1..N).

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The default
              is a limit of 120 characters per line, which helps in displaying
              the output cleanly on terminals and in editors, but can truncate
              target sequence description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
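
       Because the --tblout format is whitespace-delimited, it can be read
       with a few lines of code. The Python sketch below is one possible
       reader, not an official parser. It assumes that comment lines start
       with '#' and that each data row has 18 whitespace-separated columns
       followed by a free-text description, with the target name, query name,
       full-sequence E-value and score in columns 1, 3, 5 and 6. That column
       layout is an assumption that should be checked against the header
       lines of your own file. The sample row is made up for illustration.

```python
def parse_tblout(text):
    """Parse --tblout text into a list of dicts, one per sequence hit."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' comment/header lines
        # Split at most 18 times so that a description containing spaces
        # stays in one piece as the final field (assumed column layout).
        fields = line.split(None, 18)
        hits.append({
            "target": fields[0],
            "query": fields[2],
            "evalue": float(fields[4]),
            "score": float(fields[5]),
            "description": fields[18].strip() if len(fields) > 18 else "",
        })
    return hits


# Made-up example content; real files are produced by --tblout <f>.
sample = (
    "# target name  accession  query name  accession  E-value  score ...\n"
    "seqA  -  myquery  -  1.2e-30  105.3  0.1  1.5e-30  105.0  0.1  "
    "1.0  1  0  0  1  1  1  1  hypothetical protein A\n"
)
hits = parse_tblout(sample)
for hit in hits:
    print(hit["target"], hit["evalue"], hit["score"], hit["description"])
```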

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING (FIRST ITERATION)

       By default, the first iteration uses a search model constructed from a
       single query sequence. This model is constructed using a standard 20x20
       substitution matrix for residue probabilities, and two additional
       parameters for position-independent gap open and gap extend
       probabilities. These options allow the default single-sequence scoring
       parameters to be changed.

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model
              to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.

       --mxfile <mxfile>
              Obtain residue alignment probabilities from the substitution
              matrix in file <mxfile>. The default score matrix is BLOSUM62
              (this matrix is internal to HMMER and does not have to be
              available as a file). The format of a substitution matrix
              <mxfile> is the standard format accepted by BLAST, FASTA, and
              other sequence analysis software.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout). In each iteration,
       sequence hits and domain hits are ranked by statistical significance
       (E-value) and output is generated in two sections called per-target and
       per-domain output. In per-target output, by default, all sequence hits
       with an E-value <= 10 are reported. In the per-domain output, for each
       target that has passed per-target reporting thresholds, all domains
       satisfying per-domain reporting thresholds are reported. By default,
       these are domains with conditional E-values of <= 10. The following
       options allow you to change the default E-value reporting thresholds,
       or to use bit score thresholds instead.

       -E <x> Report sequences with E-values <= <x> in per-sequence output.
              The default is 10.0.

       -T <x> Use a bit score threshold for per-sequence output instead of an
              E-value threshold (any setting of -E is ignored). Report
              sequences with a bit score of >= <x>. By default this option is
              unset.

       -Z <x> Declare the total size of the database to be <x> sequences, for
              purposes of E-value calculation. Normally E-values are
              calculated relative to the size of the database you actually
              searched (i.e. the number of sequences in <seqdb>). In some
              cases (for instance, if you've split your target sequence
              database into multiple files for parallelization of your
              search), you may know better what the actual size of your search
              space is.

       --domE <x>
              Report domains with conditional E-values <= <x> in per-domain
              output, in addition to the top-scoring domain per significant
              sequence hit. The default is 10.0.

       --domT <x>
              Use a bit score threshold for per-domain output instead of an
              E-value threshold (any setting of --domE is ignored). Report
              domains with a bit score of >= <x> in per-domain output, in
              addition to the top-scoring domain per significant sequence hit.
              By default this option is unset.

       --domZ <x>
              Declare the number of significant sequences to be <x> sequences,
              for purposes of conditional E-value calculation for additional
              domain significance. Normally conditional E-values are
              calculated relative to the number of sequences passing the
              per-sequence reporting threshold.
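
       The effect of -Z on a split database can be sketched numerically. The
       Python example below assumes that an E-value is a P-value multiplied
       by the search space size Z, so that an E-value computed against one
       shard scales linearly to the full database size. The function name
       and the numbers are hypothetical and chosen only for illustration.

```python
def rescale_evalue(evalue, searched_size, declared_size):
    """Rescale an E-value from the searched size to a declared size.

    Assumes E = P * Z, so changing Z rescales E linearly.
    """
    if searched_size <= 0 or declared_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * declared_size / searched_size


# A hit found in one shard of 100,000 sequences, where the full search
# space is 1,000,000 sequences (made-up numbers).
shard_evalue = 0.5
full_evalue = rescale_evalue(shard_evalue, 100_000, 1_000_000)
print(full_evalue)
```

       Passing the full size with -Z on each shard's search should make the
       reported E-values comparable across shards without this kind of
       post hoc rescaling.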

OPTIONS CONTROLLING INCLUSION THRESHOLDS

       Inclusion thresholds control which hits are included in the multiple
       alignment and profile constructed for the next search iteration. By
       default, a sequence must have a per-sequence E-value of <= 0.001 (see
       the --incE option) to be included, and any additional domains in it
       besides the top-scoring one must have a conditional E-value of <= 0.001
       (see the --incdomE option). The difference between reporting thresholds
       and inclusion thresholds is that inclusion thresholds control which
       hits actually get used in the next iteration (or the final output
       multiple alignment if the -A option is used), whereas reporting
       thresholds control what you see in output. Reporting thresholds are
       generally looser so you can see borderline hits in the top of the noise
       that might be of interest.

       --incE <x>
              Include sequences with E-values <= <x> in subsequent iteration
              or final alignment output by -A. The default is 0.001.

       --incT <x>
              Use a bit score threshold for per-sequence inclusion instead of
              an E-value threshold (any setting of --incE is ignored). Include
              sequences with a bit score of >= <x>. By default this option is
              unset.

       --incdomE <x>
              Include domains with conditional E-values <= <x> in subsequent
              iteration or final alignment output by -A, in addition to the
              top-scoring domain per significant sequence hit. The default is
              0.001.

       --incdomT <x>
              Use a bit score threshold for per-domain inclusion instead of an
              E-value threshold (any setting of --incdomE is ignored). Include
              domains with a bit score of >= <x>. By default this option is
              unset.
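
       The relationship between the two kinds of threshold can be sketched
       as a three-way classification of per-sequence E-values. The Python
       example below uses the documented defaults for -E and --incE; the
       function name and the E-values are made up, and the sketch assumes
       the inclusion threshold is not looser than the reporting threshold.

```python
REPORT_E = 10.0     # default for -E
INCLUDE_E = 0.001   # default for --incE


def classify_hit(evalue, report_e=REPORT_E, include_e=INCLUDE_E):
    """Say what happens to a sequence hit with this per-sequence E-value."""
    if evalue <= include_e:
        return "included"       # shown in output and used in next profile
    if evalue <= report_e:
        return "reported only"  # shown in output, not used in next profile
    return "not reported"


# Made-up E-values for illustration.
for evalue in (1e-20, 0.001, 0.5, 25.0):
    print(evalue, classify_hit(evalue))
```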

OPTIONS CONTROLLING ACCELERATION HEURISTICS

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm, slowest but most accurate. There is also a
       bias filter step between MSV and Viterbi. Targets that pass all the
       steps in the acceleration pipeline are then subjected to
       postprocessing: domain identification and scoring using the
       Forward/Backward algorithm.

       Essentially the only free parameters that control HMMER's heuristic
       filters are the P-value thresholds controlling the expected fraction of
       nonhomologous sequences that pass the filters. Setting the thresholds
       higher than the defaults will pass a higher proportion of nonhomologous
       sequences, increasing sensitivity at the expense of speed; conversely,
       setting lower P-value thresholds will pass a smaller proportion,
       decreasing sensitivity and increasing speed. Setting a filter's P-value
       threshold to 1.0 means it will pass all sequences, which effectively
       disables the filter.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Maximum sensitivity. Turn off all filters, including the bias
              filter, and run full Forward/Backward postprocessing on every
              target. This increases sensitivity slightly, at a large cost in
              speed.

       --F1 <x>
              First filter threshold; set the P-value threshold for the MSV
              filter step. The default is 0.02, meaning that roughly 2% of
              the highest scoring nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Second filter threshold; set the P-value threshold for the
              Viterbi filter step. The default is 0.001.

       --F3 <x>
              Third filter threshold; set the P-value threshold for the
              Forward filter step. The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
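
       To get a feel for what the default thresholds mean for running time,
       the Python sketch below estimates how many nonhomologous targets would
       be expected to pass each filter. It is a back-of-envelope calculation
       under two assumptions: each P-value threshold is read as the fraction
       of all nonhomologous targets that pass that step, and the bias filter
       is ignored. The function name and database size are hypothetical.

```python
# Documented defaults for --F1, --F2 and --F3.
DEFAULT_THRESHOLDS = {"F1": 0.02, "F2": 0.001, "F3": 1e-5}


def expected_survivors(n_nonhomologous, thresholds=None):
    """Expected number of nonhomologous targets passing each filter."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    for name, p in thresholds.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"--{name} must be a P-value in [0, 1]")
    return {name: n_nonhomologous * p for name, p in thresholds.items()}


# Made-up database of one million nonhomologous sequences.
survivors = expected_survivors(1_000_000)
for name, count in survivors.items():
    print(f"--{name}: about {count:.0f} nonhomologous targets expected to pass")
```

       Under these assumptions a threshold of 1.0 passes every target, which
       matches the statement above that such a setting disables the filter.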

OPTIONS CONTROLLING PROFILE CONSTRUCTION (LATER ITERATIONS)

       These options control how consensus columns are defined in multiple
       alignments when building profiles. By default, jackhmmer always
       includes your original query sequence in the alignment result at every
       iteration, and consensus positions are defined by that query sequence:
       that is, a default jackhmmer profile is always the same length as your
       original query, at every iteration.

       --fast Define consensus columns as those that have a fraction >=
              symfrac of residues as opposed to gaps. (See below for the
              --symfrac option.) Although this is the default profile
              construction option elsewhere (in hmmbuild, in particular), it
              may have undesirable effects in jackhmmer, because a profile
              could iteratively walk in sequence space away from your original
              query, leaving few or no consensus columns corresponding to its
              residues.

       --hand Define consensus columns in the next profile using reference
              annotation to the multiple alignment. jackhmmer propagates
              reference annotation from the previous profile to the multiple
              alignment, and thence to the next profile. This is the default.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a
              consensus column when using the --fast option. The default is
              0.5. The symbol fraction in each column is calculated after
              taking relative sequence weighting into account, and ignoring
              gap characters corresponding to ends of sequence fragments (as
              opposed to internal insertions/deletions). Setting this to 0.0
              means that every alignment column will be assigned as consensus,
              which may be useful in some cases. Setting it to 1.0 means that
              only columns that contain no gaps will be assigned as consensus.

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). HMMER
              uses a simple rule to infer fragments: if the sequence length L
              is less than a fraction <x> times the mean sequence length of
              all the sequences in the alignment, then the sequence is handled
              as a fragment. The default is 0.5.
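
       The --symfrac and --fragthresh rules can be sketched on a toy
       alignment. The Python example below is a simplified illustration, not
       HMMER's implementation: it treats every sequence as having equal
       weight, and it does not exclude the terminal gaps of fragments from
       the column counts. The function names and the alignment are made up.

```python
GAPS = set("-.")


def sequence_length(aligned):
    """Number of residues (non-gap characters) in an aligned sequence."""
    return sum(1 for ch in aligned if ch not in GAPS)


def find_fragments(alignment, fragthresh=0.5):
    """Flag sequences shorter than fragthresh times the mean length."""
    lengths = [sequence_length(s) for s in alignment]
    mean_len = sum(lengths) / len(lengths)
    return [length < fragthresh * mean_len for length in lengths]


def consensus_columns(alignment, symfrac=0.5):
    """Flag columns whose residue fraction is >= symfrac (unweighted)."""
    flags = []
    for col in range(len(alignment[0])):
        residues = sum(1 for s in alignment if s[col] not in GAPS)
        flags.append(residues / len(alignment) >= symfrac)
    return flags


# Toy alignment of four sequences and ten columns.
alignment = [
    "ACDEFGHIKL",
    "ACDE-GHIKL",
    "AC-E-GHI-L",
    "---E-G----",
]
print(find_fragments(alignment))
print("".join("x" if c else "." for c in consensus_columns(alignment)))
```

       With this unweighted rule, a threshold of 0.0 marks every column as
       consensus and a threshold of 1.0 keeps only gap-free columns.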

OPTIONS CONTROLLING RELATIVE WEIGHTS

       Whenever a profile is built from a multiple alignment, HMMER uses an ad
       hoc sequence weighting algorithm to downweight closely related
       sequences and upweight distantly related ones. This has the effect of
       making models less biased by uneven phylogenetic representation. For
       example, two identical sequences would typically each receive half the
       weight that one sequence would (and this is why jackhmmer isn't
       concerned about always including your original query sequence in each
       iteration's alignment, even if it finds it again in the database you're
       searching). These options control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al., J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci. 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid) and within each cluster of c sequences, each sequence
              gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering
              when using --wblosum. Invalid with any other weighting scheme.
              Default is 0.62.
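
       The --wblosum scheme described above can be sketched in a few lines.
       The Python example below is an illustration under stated assumptions,
       not HMMER's code: pairwise identity is computed over columns where
       both sequences have a residue, and two sequences are linked when
       their identity is >= the threshold. Both choices are assumptions, as
       are the function names and the toy alignment.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns aligned in both."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_style_weights(alignment, wid=0.62):
    """Single-linkage cluster at identity wid; weight 1/c per cluster."""
    n = len(alignment)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage: any pair at or above the threshold joins clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    return [1.0 / roots.count(r) for r in roots]


# Two near-identical sequences and one unrelated sequence (made up).
alignment = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
weights = blosum_style_weights(alignment)
print(weights)
```

       The two similar sequences share one cluster and split its weight,
       while the unrelated sequence keeps a full weight of its own.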

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a
       total effective sequence number, eff_nseq. This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that. The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.

       --eent Adjust effective sequence number to achieve a specific relative
              entropy per position (see --ere). This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid). This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to
              <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.
              Requires --eent. Default depends on the sequence alphabet; for
              protein sequences, it is 0.59 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire model
              alignment, over its whole length. This has the effect of making
              short models have higher relative entropy per position than
              --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by
              single-linkage clustering with the --eclust option. The default
              is 0.62.

OPTIONS CONTROLLING E-VALUE CALIBRATION

       Estimating the location parameters for the expected score distributions
       for MSV filter scores, Viterbi filter scores, and Forward scores
       requires three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EfL <n>
              Sets the sequence length in simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is > 0, any stochastic simulations will be reproducible; the
              same command will give the same results. If <n> is 0, the
              random number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command. The
              default seed is 42.

       --qformat <s>
              Declare that the input <seqfile> is in format <s>. Accepted
              sequence file formats include FASTA, EMBL, Genbank, DDBJ,
              Uniprot, Stockholm, and SELEX. Default is to autodetect the
              format of the file.

       --tformat <s>
              Declare that the input <seqdb> is in format <s>. Accepted
              sequence file formats include FASTA, EMBL, Genbank, DDBJ,
              Uniprot, Stockholm, and SELEX. Default is to autodetect the
              format of the file.

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine; that is, it tries to maximize the use of your
              available processor cores. Setting <n> higher than the number of
              available cores is of little if any value, but you may want to
              set it to something less. You can also control this number by
              setting an environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off at compile-time for your site or machine for some
              reason.

       --stall
              For debugging the MPI master/worker version: pause after
              start, to enable the developer to attach debuggers to the
              running master and worker process(es). Send a SIGCONT signal to
              release the pause. (Under gdb: (gdb) signal SIGCONT) (Only
              available if optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun. (Only available if
              optional MPI support was enabled at compile-time.)

COPYRIGHT

       @HMMER_COPYRIGHT@
       @HMMER_LICENSE@

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (@HMMER_URL@).

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER @HMMER_VERSION@            @HMMER_DATE@                     jackhmmer(1)