jackhmmer(1)                     HMMER Manual                     jackhmmer(1)

NAME

       jackhmmer - iteratively search sequence(s) against a protein database

SYNOPSIS

       jackhmmer [options] <seqfile> <seqdb>

DESCRIPTION

       jackhmmer iteratively searches each query sequence in <seqfile> against
       the target sequence(s) in <seqdb>.  The first iteration is identical to
       a phmmer search.  For the next iteration, a multiple alignment of the
       query together with all target sequences satisfying inclusion
       thresholds is assembled, a profile is constructed from this alignment
       (identical to using hmmbuild on the alignment), and a profile search of
       the <seqdb> is done (identical to an hmmsearch with the profile).

       The query <seqfile> may be '-' (a dash character), in which case the
       query sequences are read from a <stdin> pipe instead of from a file.
       The <seqdb> cannot be read from a <stdin> stream, because jackhmmer
       needs to do multiple passes over the database.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse.  The -o option allows
       redirecting the main output, including throwing it away in /dev/null.
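
       As a minimal sketch of the advice above, the following Python snippet
       assembles an argument list that discards the main output and keeps
       only the tabular summary. The helper name build_jackhmmer_argv and the
       file names are hypothetical; the flags are the ones documented in this
       page. Nothing is executed here.

```python
import shlex


def build_jackhmmer_argv(seqfile, seqdb, tblout, max_iterations=5):
    """Build an argument list for one jackhmmer run (nothing is executed)."""
    if seqdb == "-":
        # This page states that <seqdb> cannot be read from stdin.
        raise ValueError("<seqdb> must be a file: jackhmmer reads it several times")
    return [
        "jackhmmer",
        "-N", str(max_iterations),  # maximum number of iterations
        "-o", "/dev/null",          # discard the voluminous main output
        "--tblout", tblout,         # keep the per-sequence tabular summary
        seqfile,                    # may be "-" to read queries from stdin
        seqdb,
    ]


# Hypothetical file names, for illustration only.
argv = build_jackhmmer_argv("query.fa", "targets.fa", "hits.tbl")
print(shlex.join(argv))
```

       Where jackhmmer is installed, the list could be handed to
       subprocess.run(argv, check=True).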

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

       -N <n> Set the maximum number of iterations to <n>.  The default is 5.
              If N=1, the result is equivalent to a phmmer search.

OPTIONS CONTROLLING OUTPUT

       By default, output for each iteration appears on stdout in a somewhat
       human readable, somewhat parseable format. These options allow
       redirecting that output or saving additional kinds of output to files,
       including checkpoint files for each iteration.

       -o <f> Direct the human-readable output to a file <f>.

       -A <f> After the final iteration, save an annotated multiple alignment
              of all hits satisfying inclusion thresholds (also including the
              original query) to <f> in Stockholm format.

       --tblout <f>
              After the final iteration, save a tabular summary of top
              sequence hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --domtblout <f>
              After the final iteration, save a tabular summary of top domain
              hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --chkhmm <prefix>
              At the start of each iteration, checkpoint the query HMM, saving
              it to a file named <prefix>-<n>.hmm where <n> is the iteration
              number (from 1..N).

       --chkali <prefix>
              At the end of each iteration, checkpoint an alignment of all
              domains satisfying inclusion thresholds (i.e. what will become
              the query HMM for the next iteration), saving it to a file named
              <prefix>-<n>.sto in Stockholm format, where <n> is the iteration
              number (from 1..N).

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The default
              is a limit of 120 characters per line, which helps in displaying
              the output cleanly on terminals and in editors, but can truncate
              target sequence description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
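
       The whitespace-delimited file written by --tblout can be read with a
       few lines of Python. The sketch below is not part of HMMER. It assumes
       the column layout described in the HMMER User's Guide (target name in
       column 1, query name in column 3, full-sequence E-value and bit score
       in columns 5 and 6, 18 fixed columns followed by a free-text
       description) and that comment lines start with '#'. The sample line is
       made up for illustration.

```python
def parse_tblout(lines):
    """Yield (target, query, evalue, score, description) for each hit line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        # Assumed layout: 18 fixed columns, then a free-text description.
        fields = line.rstrip("\n").split(None, 18)
        description = fields[18] if len(fields) > 18 else ""
        yield fields[0], fields[2], float(fields[4]), float(fields[5]), description


# Made-up example content, not real search output.
sample = [
    "# target name  accession  query name ...",
    "seqA - query1 - 1.2e-30 105.3 0.1 2.0e-30 104.6 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein A",
]
hits = list(parse_tblout(sample))
for target, query, evalue, score, description in hits:
    print(target, query, evalue, score, description)
```

       In practice the lines would come from open("hits.tbl"), where hits.tbl
       is whatever file name was given to --tblout.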

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING (FIRST ITERATION)

       By default, the first iteration uses a search model constructed from a
       single query sequence. This model is constructed using a standard 20x20
       substitution matrix for residue probabilities, and two additional
       parameters for position-independent gap open and gap extend
       probabilities. These options allow the default single-sequence scoring
       parameters to be changed.

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>.  The default is 0.02.  <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model
              to <x>.  The default is 0.4.  <x> must be >= 0 and < 1.0.

       --mx <s>
              Obtain residue alignment probabilities from the built-in
              substitution matrix named <s>.  Several standard matrices are
              built-in, and do not need to be read from files.  The matrix
              name <s> can be PAM30, PAM70, PAM120, PAM240, BLOSUM45,
              BLOSUM50, BLOSUM62, BLOSUM80, or BLOSUM90.  Only one of the --mx
              and --mxfile options may be used.

       --mxfile <mxfile>
              Obtain residue alignment probabilities from the substitution
              matrix in file <mxfile>.  The default score matrix is BLOSUM62
              (this matrix is internal to HMMER and does not have to be
              available as a file).  The format of a substitution matrix
              <mxfile> is the standard format accepted by BLAST, FASTA, and
              other sequence analysis software.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout).  In each iteration,
       sequence hits and domain hits are ranked by statistical significance
       (E-value) and output is generated in two sections called per-target and
       per-domain output. In per-target output, by default, all sequence hits
       with an E-value <= 10 are reported. In the per-domain output, for each
       target that has passed per-target reporting thresholds, all domains
       satisfying per-domain reporting thresholds are reported. By default,
       these are domains with conditional E-values of <= 10. The following
       options allow you to change the default E-value reporting thresholds,
       or to use bit score thresholds instead.

       -E <x> Report sequences with E-values <= <x> in per-sequence output.
              The default is 10.0.

       -T <x> Use a bit score threshold for per-sequence output instead of an
              E-value threshold (any setting of -E is ignored). Report
              sequences with a bit score of >= <x>.  By default this option is
              unset.

       -Z <x> Declare the total size of the database to be <x> sequences, for
              purposes of E-value calculation.  Normally E-values are
              calculated relative to the size of the database you actually
              searched (i.e. the number of sequences in <seqdb>).  In some
              cases (for instance, if you've split your target sequence
              database into multiple files for parallelization of your
              search), you may know better what the actual size of your search
              space is.

       --domE <x>
              Report domains with conditional E-values <= <x> in per-domain
              output, in addition to the top-scoring domain per significant
              sequence hit. The default is 10.0.

       --domT <x>
              Use a bit score threshold for per-domain output instead of an
              E-value threshold (any setting of --domE is ignored). Report
              domains with a bit score of >= <x> in per-domain output, in
              addition to the top-scoring domain per significant sequence hit.
              By default this option is unset.

       --domZ <x>
              Declare the number of significant sequences to be <x> sequences,
              for purposes of conditional E-value calculation for additional
              domain significance.  Normally conditional E-values are
              calculated relative to the number of sequences passing the
              per-sequence reporting threshold.
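
       To illustrate why -Z matters when a database has been split, the
       sketch below rescales an E-value from the size actually searched to a
       declared size. It assumes that an E-value is proportional to the
       number of target sequences (E = P-value x Z); this proportionality is
       an assumption of the sketch and is not stated in this page. The
       function name and the numbers are hypothetical.

```python
def rescale_evalue(evalue, searched_size, declared_size):
    """Rescale an E-value to a different database size.

    Assumes E-values scale linearly with the number of target sequences.
    """
    if searched_size <= 0 or declared_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * declared_size / searched_size


# A hit found in one chunk of 250,000 sequences out of a 1,000,000
# sequence database that was split into four files (made-up numbers).
chunk_evalue = 0.01
full_evalue = rescale_evalue(chunk_evalue, searched_size=250_000, declared_size=1_000_000)
print(f"{full_evalue:.3g}")
```

       Passing -Z with the full database size to every chunk search has the
       same intent: each chunk then reports E-values on the scale of the
       whole search space.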

OPTIONS CONTROLLING INCLUSION THRESHOLDS

       Inclusion thresholds control which hits are included in the multiple
       alignment and profile constructed for the next search iteration.  By
       default, a sequence must have a per-sequence E-value of <= 0.001 (see
       --incE option) to be included, and any additional domains in it besides
       the top-scoring one must have a conditional E-value of <= 0.001 (see
       --incdomE option). The difference between reporting thresholds and
       inclusion thresholds is that inclusion thresholds control which hits
       actually get used in the next iteration (or the final output multiple
       alignment if the -A option is used), whereas reporting thresholds
       control what you see in output. Reporting thresholds are generally
       looser so you can see borderline hits in the top of the noise that
       might be of interest.

       --incE <x>
              Include sequences with E-values <= <x> in subsequent iteration
              or final alignment output by -A.  The default is 0.001.

       --incT <x>
              Use a bit score threshold for per-sequence inclusion instead of
              an E-value threshold (any setting of --incE is ignored). Include
              sequences with a bit score of >= <x>.  By default this option is
              unset.

       --incdomE <x>
              Include domains with conditional E-values <= <x> in subsequent
              iteration or final alignment output by -A, in addition to the
              top-scoring domain per significant sequence hit.  The default is
              0.001.

       --incdomT <x>
              Use a bit score threshold for per-domain inclusion instead of an
              E-value threshold (any setting of --incdomE is ignored). Include
              domains with a bit score of >= <x>.  By default this option is
              unset.
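
       The relationship between reporting and inclusion thresholds for
       per-sequence hits can be sketched as follows. This is an illustration
       under stated assumptions, not HMMER's implementation: the function
       names are hypothetical, only per-sequence thresholds are modelled, and
       the sketch assumes that a hit must be reported before it can be
       included.

```python
def passes(evalue, score, e_threshold, t_threshold=None):
    """Apply one threshold pair; a bit score threshold, when set, replaces the E-value one."""
    if t_threshold is not None:
        return score >= t_threshold
    return evalue <= e_threshold


def classify_hit(evalue, score, rep_e=10.0, rep_t=None, inc_e=0.001, inc_t=None):
    """Return 'included', 'reported only', or 'not reported' for one sequence hit.

    rep_e / rep_t stand for -E / -T, and inc_e / inc_t for --incE / --incT.
    """
    if not passes(evalue, score, rep_e, rep_t):
        return "not reported"
    if passes(evalue, score, inc_e, inc_t):
        return "included"
    return "reported only"


# Made-up E-values and bit scores.
print(classify_hit(1e-5, 50.0))              # below both default E-value thresholds
print(classify_hit(0.5, 20.0))               # borderline hit, visible but unused
print(classify_hit(50.0, 5.0))               # above the reporting threshold
print(classify_hit(1e-5, 50.0, inc_t=60.0))  # --incT set, so --incE is ignored
```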

OPTIONS CONTROLLING ACCELERATION HEURISTICS

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm, slowest but most accurate. There is also a
       bias filter step between MSV and Viterbi. Targets that pass all the
       steps in the acceleration pipeline are then subjected to
       postprocessing: domain identification and scoring using the
       Forward/Backward algorithm.

       Essentially the only free parameters that control HMMER's heuristic
       filters are the P-value thresholds controlling the expected fraction of
       nonhomologous sequences that pass the filters. Setting the thresholds
       higher than the defaults will pass a higher proportion of nonhomologous
       sequences, increasing sensitivity at the expense of speed; conversely,
       setting lower P-value thresholds will pass a smaller proportion,
       decreasing sensitivity and increasing speed. Setting a filter's P-value
       threshold to 1.0 means it will pass all sequences, which effectively
       disables the filter.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Maximum sensitivity.  Turn off all filters, including the bias
              filter, and run full Forward/Backward postprocessing on every
              target.  This increases sensitivity slightly, at a large cost in
              speed.

       --F1 <x>
              First filter threshold; set the P-value threshold for the MSV
              filter step.  The default is 0.02, meaning that roughly 2% of
              the highest scoring nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Second filter threshold; set the P-value threshold for the
              Viterbi filter step.  The default is 0.001.

       --F3 <x>
              Third filter threshold; set the P-value threshold for the
              Forward filter step.  The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
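
       As a rough worked example of what the default P-value thresholds
       imply, the sketch below estimates how many nonhomologous targets are
       expected to survive each filter. It assumes perfectly calibrated
       P-values and ignores the bias filter; the function name and the
       database size are hypothetical.

```python
# Default thresholds documented above for --F1, --F2 and --F3.
DEFAULT_THRESHOLDS = {"F1 (MSV)": 0.02, "F2 (Viterbi)": 0.001, "F3 (Forward)": 1e-5}


def expected_survivors(n_nonhomologous, thresholds=DEFAULT_THRESHOLDS):
    """Expected number of nonhomologous targets passing each filter threshold."""
    return {name: n_nonhomologous * p for name, p in thresholds.items()}


# Made-up database of one million unrelated target sequences.
survivors = expected_survivors(1_000_000)
for name, count in survivors.items():
    print(f"{name}: about {count:,.0f} targets")
```

       With these assumptions, roughly 20,000 targets pass MSV, 1,000 pass
       Viterbi, and 10 reach postprocessing, which is why raising a threshold
       trades speed for sensitivity.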

OPTIONS CONTROLLING PROFILE CONSTRUCTION (LATER ITERATIONS)

       These options control how consensus columns are defined in multiple
       alignments when building profiles. By default, jackhmmer always
       includes your original query sequence in the alignment result at every
       iteration, and consensus positions are defined by that query sequence:
       that is, a default jackhmmer profile is always the same length as your
       original query, at every iteration.

       --fast Define consensus columns as those that have a fraction >=
              symfrac of residues as opposed to gaps. (See below for the
              --symfrac option.) Although this is the default profile
              construction option elsewhere (in hmmbuild, in particular), it
              may have undesirable effects in jackhmmer, because a profile
              could iteratively walk in sequence space away from your original
              query, leaving few or no consensus columns corresponding to its
              residues.

       --hand Define consensus columns in next profile using reference
              annotation to the multiple alignment.  jackhmmer propagates
              reference annotation from the previous profile to the multiple
              alignment, and thence to the next profile. This is the default.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a
              consensus column when using the --fast option. The default is
              0.5.  The symbol fraction in each column is calculated after
              taking relative sequence weighting into account, and ignoring
              gap characters corresponding to ends of sequence fragments (as
              opposed to internal insertions/deletions).  Setting this to 0.0
              means that every alignment column will be assigned as consensus,
              which may be useful in some cases. Setting it to 1.0 means that
              only columns that include 0 gaps (internal insertions/deletions)
              will be assigned as consensus.

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). HMMER
              uses a simple rule to infer fragments: if the sequence length L
              is less than or equal to a fraction <x> times the alignment
              length in columns, then the sequence is handled as a fragment.
              The default is 0.5.  Setting --fragthresh 0 will define no
              (nonempty) sequence as a fragment; you might want to do this if
              you know you've got a carefully curated alignment of full-length
              sequences.  Setting --fragthresh 1 will define all sequences as
              fragments; you might want to do this if you know your alignment
              is entirely composed of fragments, such as translated short
              reads in metagenomic shotgun data.
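
       The fragment rule described under --fragthresh can be written directly
       as a comparison. The sketch below is a plain restatement of that rule
       in Python with a hypothetical function name; it is not taken from the
       HMMER source.

```python
def is_fragment(seq_len, alignment_len, fragthresh=0.5):
    """Return True if a sequence of seq_len residues is handled as a fragment.

    Rule from the text: L <= fragthresh * (alignment length in columns).
    """
    return seq_len <= fragthresh * alignment_len


# Made-up lengths for a 100-column alignment.
print(is_fragment(40, 100))        # 40 <= 50
print(is_fragment(80, 100))        # 80 > 50
print(is_fragment(1, 100, 0.0))    # threshold 0: no nonempty sequence qualifies
print(is_fragment(100, 100, 1.0))  # threshold 1: every sequence qualifies
```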

OPTIONS CONTROLLING RELATIVE WEIGHTS

       Whenever a profile is built from a multiple alignment, HMMER uses an ad
       hoc sequence weighting algorithm to downweight closely related
       sequences and upweight distantly related ones. This has the effect of
       making models less biased by uneven phylogenetic representation. For
       example, two identical sequences would typically each receive half the
       weight that one sequence would (and this is why jackhmmer isn't
       concerned about always including your original query sequence in each
       iteration's alignment, even if it finds it again in the database you're
       searching). These options control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994].  This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al., J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid) and within each cluster of c sequences, each sequence
              gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering
              when using --wblosum.  Invalid with any other weighting scheme.
              Default is 0.62.
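
       The --wblosum scheme (single-linkage clustering, then weight 1/c
       within a cluster of c sequences) can be sketched in a few lines. This
       is an illustration, not HMMER's code: the function names are
       hypothetical, and the pairwise identity used here (identical residues
       divided by the number of columns where both sequences have a residue)
       is an assumption that may differ from HMMER's own definition.

```python
def pairwise_identity(a, b, gaps="-."):
    """Fraction of identical residues over columns where both rows have a residue."""
    pairs = [(x.upper(), y.upper()) for x, y in zip(a, b)
             if x not in gaps and y not in gaps]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(aligned, wid=0.62):
    """Single-linkage cluster aligned rows at identity >= wid; weight = 1/cluster size."""
    n = len(aligned)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(aligned[i], aligned[j]) >= wid:
                parent[find(i)] = find(j)  # any single link merges two clusters

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(n)]


# Made-up toy alignment: two identical rows and one unrelated row.
weights = blosum_weights(["ACDEFGHIKL", "ACDEFGHIKL", "WWWWWYYYYY"])
print(weights)
```

       The two identical rows share one cluster and receive half weight each,
       matching the example given in the section introduction.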

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a
       total effective sequence number, eff_nseq.  This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that.  The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.

       --eent Adjust effective sequence number to achieve a specific relative
              entropy per position (see --ere).  This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid).  This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to
              <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.
              Requires --eent.  Default depends on the sequence alphabet; for
              protein sequences, it is 0.59 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire model
              alignment, over its whole length. This has the effect of making
              short models have higher relative entropy per position than
              --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by single
              linkage clustering with the --eclust option. The default is
              0.62.
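
       The normalization step described at the start of this section, in
       which relative weights are rescaled to sum to eff_nseq, amounts to a
       single multiplication. The sketch below shows only that step, with a
       hypothetical function name and made-up numbers; how eff_nseq itself is
       chosen (--eent, --eclust, --enone, --eset) is not modelled.

```python
def normalize_weights(relative_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(relative_weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    return [w * eff_nseq / total for w in relative_weights]


# Made-up relative weights for three sequences and an effective number of 1.5.
scaled = normalize_weights([0.5, 0.5, 1.0], eff_nseq=1.5)
print(scaled, sum(scaled))
```

       The ratios between sequences are preserved; only the total amount of
       evidence contributed by the alignment changes.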

OPTIONS CONTROLLING PRIORS

       In profile construction, by default, weighted counts are converted to
       mean posterior probability parameter estimates using mixture Dirichlet
       priors.  Default mixture Dirichlet prior parameters for protein models
       and for nucleic acid (RNA and DNA) models are built in.  The following
       options allow you to override the default priors.

       --pnone
              Don't use any priors. Probability parameters will simply be the
              observed frequencies, after relative sequence weighting.

       --plaplace
              Use a Laplace +1 prior in place of the default mixture Dirichlet
              prior.
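
       The difference between --pnone and --plaplace can be shown on a small
       vector of weighted counts. The sketch below uses the textbook
       definitions (observed frequencies, and add-one smoothing) with
       hypothetical names and a toy four-letter alphabet; the default mixture
       Dirichlet prior is not modelled.

```python
def estimate_probabilities(weighted_counts, prior="laplace"):
    """Turn weighted counts into probabilities with no prior or a Laplace +1 prior."""
    k = len(weighted_counts)
    total = sum(weighted_counts)
    if prior == "none":
        if total <= 0:
            raise ValueError("observed frequencies need at least one count")
        return [c / total for c in weighted_counts]
    if prior == "laplace":
        # Add one pseudocount per symbol, so unseen symbols keep nonzero probability.
        return [(c + 1.0) / (total + k) for c in weighted_counts]
    raise ValueError("prior must be 'none' or 'laplace'")


# Made-up weighted counts for a toy 4-letter alphabet.
counts = [3.0, 1.0, 0.0, 0.0]
print(estimate_probabilities(counts, prior="none"))
print(estimate_probabilities(counts, prior="laplace"))
```

       Without a prior, symbols that were never observed get probability
       zero; the Laplace prior avoids that at the cost of flattening the
       distribution.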

OPTIONS CONTROLLING E-VALUE CALIBRATION

       Estimating the location parameters for the expected score distributions
       for MSV filter scores, Viterbi filter scores, and Forward scores
       requires three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EfL <n>
              Sets the sequence length in simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0.  If
              <n> is >0, any stochastic simulations will be reproducible; the
              same command will give the same results.  If <n> is 0, the
              random number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command.  The
              default seed is 42.

       --qformat <s>
              Declare that the input query <seqfile> is in format <s>.
              Accepted sequence file formats include FASTA, EMBL, GenBank,
              DDBJ, UniProt, Stockholm, and SELEX. Default is to autodetect
              the format of the file.

       --tformat <s>
              Declare that the input target <seqdb> is in format <s>.
              Accepted sequence file formats include FASTA, EMBL, GenBank,
              DDBJ, UniProt, Stockholm, and SELEX. Default is to autodetect
              the format of the file.

       --cpu <n>
              Set the number of parallel worker threads to <n>.  By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine; that is, it tries to maximize the use of your available
              processor cores. Setting <n> higher than the number of available
              cores is of little if any value, but you may want to set it to
              something less. You can also control this number by setting an
              environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off at compile-time for your site or machine for some
              reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause.  (Under gdb: (gdb) signal SIGCONT)  (Only available
              if optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun.  (Only available if
              optional MPI support was enabled at compile-time.)

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page.

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER 3.1b2                      February 2015                    jackhmmer(1)