jackhmmer(1)

jackhmmer(1)                     HMMER Manual                     jackhmmer(1)

NAME

       jackhmmer - iteratively search sequence(s) against a sequence database

SYNOPSIS

       jackhmmer [options] seqfile seqdb

DESCRIPTION

       jackhmmer iteratively searches each query sequence in seqfile against
       the target sequence(s) in seqdb. The first iteration is identical to a
       phmmer search. For the next iteration, a multiple alignment of the
       query together with all target sequences satisfying inclusion
       thresholds is assembled, a profile is constructed from this alignment
       (identical to using hmmbuild on the alignment), and a profile search of
       the seqdb is done (identical to an hmmsearch with the profile).

       The query seqfile may be '-' (a dash character), in which case the
       query sequences are read from a stdin pipe instead of from a file. The
       seqdb cannot be read from a stdin stream, because jackhmmer needs to do
       multiple passes over the database.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option allows
       redirecting the main output, including throwing it away in /dev/null.
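
       Example (an illustrative sketch, not part of HMMER): the Python
       helper below assembles an argument list following the SYNOPSIS,
       using only options described on this page. The function name
       build_jackhmmer_command and the file names query.fa, targets.fa and
       hits.tbl are hypothetical. The command is printed, not executed.

```python
import shlex


def build_jackhmmer_command(seqfile, seqdb, iterations=5, tblout=None,
                            domtblout=None, main_output=None):
    """Return an argument list of the form: jackhmmer [options] seqfile seqdb."""
    if seqdb == "-":
        # jackhmmer makes multiple passes over seqdb, so it cannot be a stdin stream.
        raise ValueError("seqdb cannot be '-'; only the query seqfile may come from stdin")
    cmd = ["jackhmmer", "-N", str(iterations)]
    if main_output is not None:
        cmd += ["-o", main_output]        # redirect the main human-readable output
    if tblout is not None:
        cmd += ["--tblout", tblout]       # per-sequence tabular summary
    if domtblout is not None:
        cmd += ["--domtblout", domtblout]  # per-domain tabular summary
    cmd += [seqfile, seqdb]
    return cmd


# Hypothetical file names; discard the main output and keep only the table.
cmd = build_jackhmmer_command("query.fa", "targets.fa", iterations=3,
                              tblout="hits.tbl", main_output="/dev/null")
print(shlex.join(cmd))
```

       The resulting list can be passed to subprocess.run(cmd, check=True)
       on a machine where jackhmmer is installed.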

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

       -N <n> Set the maximum number of iterations to <n>. The default is 5.
              If N=1, the result is equivalent to a phmmer search.

OPTIONS CONTROLLING OUTPUT

       By default, output for each iteration appears on stdout in a somewhat
       human readable, somewhat parseable format. These options allow
       redirecting that output or saving additional kinds of output to files,
       including checkpoint files for each iteration.

       -o <f> Direct the human-readable output to a file <f>.

       -A <f> After the final iteration, save an annotated multiple alignment
              of all hits satisfying inclusion thresholds (also including the
              original query) to <f> in Stockholm format.

       --tblout <f>
              After the final iteration, save a tabular summary of top
              sequence hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --domtblout <f>
              After the final iteration, save a tabular summary of top domain
              hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --chkhmm prefix
              At the start of each iteration, checkpoint the query HMM, saving
              it to a file named prefix-n.hmm where n is the iteration number
              (from 1..N).

       --chkali prefix
              At the end of each iteration, checkpoint an alignment of all
              domains satisfying inclusion thresholds (i.e. what will become
              the query HMM for the next iteration), saving it to a file named
              prefix-n.sto in Stockholm format, where n is the iteration
              number (from 1..N).

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The default
              is a limit of 120 characters per line, which helps in displaying
              the output cleanly on terminals and in editors, but can truncate
              target description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
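
       Example (an illustrative sketch, not part of HMMER): reading a
       --tblout file in Python. It assumes the HMMER 3 layout in which
       comment lines start with '#', the first field is the target name,
       and the fifth and sixth fields are the full-sequence E-value and bit
       score; check this against the header of your own file. The function
       name parse_tblout and the sample lines are made up for illustration.

```python
def parse_tblout(lines):
    """Yield (target_name, full_sequence_evalue, full_sequence_score) per hit."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' comment/header lines
        fields = line.split()  # the table is whitespace-delimited
        # Assumed column positions: 0 = target name, 4 = E-value, 5 = bit score.
        yield fields[0], float(fields[4]), float(fields[5])


# Made-up sample content, not real search results.
example = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "seqA  -  query1  -  1.2e-30  105.3  0.1  2.0e-30  104.6  0.1  1.0  1  0  0  1  1  1  1  example protein",
]
hits = list(parse_tblout(example))
print(hits)
```

       In practice the lines would come from open("hits.tbl"), where
       hits.tbl is whatever file name was given to --tblout.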

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING (FIRST ITERATION)

       By default, the first iteration uses a search model constructed from a
       single query sequence. This model is constructed using a standard 20x20
       substitution matrix for residue probabilities, and two additional
       parameters for position-independent gap open and gap extend
       probabilities. These options allow the default single-sequence scoring
       parameters to be changed.

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model
              to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.

       --mx <s>
              Obtain residue alignment probabilities from the built-in
              substitution matrix named <s>. Several standard matrices are
              built-in, and do not need to be read from files. The matrix name
              <s> can be PAM30, PAM70, PAM120, PAM240, BLOSUM45, BLOSUM50,
              BLOSUM62, BLOSUM80, or BLOSUM90. Only one of the --mx and
              --mxfile options may be used.

       --mxfile mxfile
              Obtain residue alignment probabilities from the substitution
              matrix in file mxfile. The default score matrix is BLOSUM62
              (this matrix is internal to HMMER and does not have to be
              available as a file). The format of a substitution matrix mxfile
              is the standard format accepted by BLAST, FASTA, and other
              sequence analysis software. See
              ftp.ncbi.nlm.nih.gov/blast/matrices/ for example files. (The
              only exception: we require matrices to be square, so for DNA,
              use files like NCBI's NUC.4.4, not NUC.4.2.)

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout). In each iteration,
       sequence hits and domain hits are ranked by statistical significance
       (E-value) and output is generated in two sections called per-target and
       per-domain output. In per-target output, by default, all sequence hits
       with an E-value <= 10 are reported. In the per-domain output, for each
       target that has passed per-target reporting thresholds, all domains
       satisfying per-domain reporting thresholds are reported. By default,
       these are domains with conditional E-values of <= 10. The following
       options allow you to change the default E-value reporting thresholds,
       or to use bit score thresholds instead.

       -E <x> Report sequences with E-values <= <x> in per-sequence output.
              The default is 10.0.

       -T <x> Use a bit score threshold for per-sequence output instead of an
              E-value threshold (any setting of -E is ignored). Report
              sequences with a bit score of >= <x>. By default this option is
              unset.

       -Z <x> Declare the total size of the database to be <x> sequences, for
              purposes of E-value calculation. Normally E-values are
              calculated relative to the size of the database you actually
              searched (i.e. the number of sequences in seqdb). In some cases
              (for instance, if you've split your target sequence database
              into multiple files for parallelization of your search), you may
              know better what the actual size of your search space is.

       --domE <x>
              Report domains with conditional E-values <= <x> in per-domain
              output, in addition to the top-scoring domain per significant
              sequence hit. The default is 10.0.

       --domT <x>
              Use a bit score threshold for per-domain output instead of an
              E-value threshold (any setting of --domE is ignored). Report
              domains with a bit score of >= <x> in per-domain output, in
              addition to the top-scoring domain per significant sequence hit.
              By default this option is unset.

       --domZ <x>
              Declare the number of significant sequences to be <x> sequences,
              for purposes of conditional E-value calculation for additional
              domain significance. Normally conditional E-values are
              calculated relative to the number of sequences passing the
              per-sequence reporting threshold.
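
       Example (an illustrative sketch, not part of HMMER): why -Z matters
       when a database is split into chunks. The sketch assumes that an
       E-value is a per-comparison P-value multiplied by the number of
       targets Z, so an E-value computed against one chunk can be rescaled
       to the full search space. The function name rescale_evalue and the
       numbers are hypothetical.

```python
def rescale_evalue(evalue, searched_size, declared_size):
    """Rescale an E-value from the searched database size to a declared size.

    Assumes E = P * Z, i.e. the E-value grows linearly with the number of
    target sequences Z.
    """
    if searched_size <= 0 or declared_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * declared_size / searched_size


# A hit with E = 0.5 in one chunk of 25,000 sequences, where the full
# database holds 100,000 sequences (made-up numbers).
full_evalue = rescale_evalue(0.5, searched_size=25_000, declared_size=100_000)
print(full_evalue)
```

       Running each chunk with -Z set to the full database size makes
       jackhmmer report E-values on the full-database scale directly, so no
       rescaling afterwards is needed.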

OPTIONS CONTROLLING INCLUSION THRESHOLDS

       Inclusion thresholds control which hits are included in the multiple
       alignment and profile constructed for the next search iteration. By
       default, a sequence must have a per-sequence E-value of <= 0.001 (see
       --incE option) to be included, and any additional domains in it besides
       the top-scoring one must have a conditional E-value of <= 0.001 (see
       --incdomE option). The difference between reporting thresholds and
       inclusion thresholds is that inclusion thresholds control which hits
       actually get used in the next iteration (or the final output multiple
       alignment if the -A option is used), whereas reporting thresholds
       control what you see in output. Reporting thresholds are generally
       looser so you can see borderline hits in the top of the noise that
       might be of interest.

       --incE <x>
              Include sequences with E-values <= <x> in subsequent iteration
              or final alignment output by -A. The default is 0.001.

       --incT <x>
              Use a bit score threshold for per-sequence inclusion instead of
              an E-value threshold (any setting of --incE is ignored). Include
              sequences with a bit score of >= <x>. By default this option is
              unset.

       --incdomE <x>
              Include domains with conditional E-values <= <x> in subsequent
              iteration or final alignment output by -A, in addition to the
              top-scoring domain per significant sequence hit. The default is
              0.001.

       --incdomT <x>
              Use a bit score threshold for per-domain inclusion instead of an
              E-value threshold (any setting of --incdomE is ignored). Include
              domains with a bit score of >= <x>. By default this option is
              unset.
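
       Example (an illustrative sketch, not part of HMMER): how the default
       per-sequence reporting threshold (-E 10) and inclusion threshold
       (--incE 0.001) sort hits into three groups. The function name
       classify_hit and the E-values are made up; the sketch covers only
       per-sequence E-value thresholds, not bit score or per-domain ones.

```python
REPORT_E = 10.0    # default for -E
INCLUDE_E = 0.001  # default for --incE


def classify_hit(evalue, report_e=REPORT_E, include_e=INCLUDE_E):
    """Classify one sequence hit by its per-sequence E-value."""
    if evalue <= include_e:
        return "included"       # used to build the next iteration's profile
    if evalue <= report_e:
        return "reported only"  # visible in output, not used in the profile
    return "not reported"


# Made-up E-values for three hypothetical hits.
for e in (1e-12, 0.3, 25.0):
    print(e, classify_hit(e))
```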

OPTIONS CONTROLLING ACCELERATION HEURISTICS

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm, slowest but most accurate. There is also a
       bias filter step between MSV and Viterbi. Targets that pass all the
       steps in the acceleration pipeline are then subjected to
       postprocessing -- domain identification and scoring using the
       Forward/Backward algorithm.

       Essentially the only free parameters that control HMMER's heuristic
       filters are the P-value thresholds controlling the expected fraction of
       nonhomologous sequences that pass the filters. Raising the thresholds
       above their defaults will pass a higher proportion of nonhomologous
       sequences, increasing sensitivity at the expense of speed; conversely,
       setting lower P-value thresholds will pass a smaller proportion,
       decreasing sensitivity and increasing speed. Setting a filter's P-value
       threshold to 1.0 means it will pass all sequences, which effectively
       disables the filter.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Maximum sensitivity. Turn off all filters, including the bias
              filter, and run full Forward/Backward postprocessing on every
              target. This increases sensitivity slightly, at a large cost in
              speed.

       --F1 <x>
              First filter threshold; set the P-value threshold for the MSV
              filter step. The default is 0.02, meaning that roughly the
              highest-scoring 2% of nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Second filter threshold; set the P-value threshold for the
              Viterbi filter step. The default is 0.001.

       --F3 <x>
              Third filter threshold; set the P-value threshold for the
              Forward filter step. The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
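
       Example (an illustrative sketch, not part of HMMER): a rough estimate
       of how many nonhomologous targets survive each filter at the default
       thresholds. It treats each P-value threshold as the fraction of all
       nonhomologous targets expected to score above it, and ignores the
       bias filter and any true homologs; both are simplifying assumptions.
       The function name and the database size are hypothetical.

```python
def expected_nonhomolog_survivors(n_targets, f1=0.02, f2=0.001, f3=1e-5):
    """Expected number of nonhomologous targets passing each filter stage.

    Simplification: each threshold is read as the fraction of the whole
    (nonhomologous) database expected to pass that stage.
    """
    for name, p in (("F1", f1), ("F2", f2), ("F3", f3)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a P-value between 0 and 1")
    return {
        "MSV": n_targets * f1,
        "Viterbi": n_targets * f2,
        "Forward": n_targets * f3,
    }


# A made-up database of one million target sequences.
for stage, count in expected_nonhomolog_survivors(1_000_000).items():
    print(f"{stage:8s} ~{count:,.0f} sequences")
```

       Setting a threshold to 1.0 in this sketch gives a survivor count
       equal to the database size, matching the statement that a threshold
       of 1.0 effectively disables that filter.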

OPTIONS CONTROLLING PROFILE CONSTRUCTION (LATER ITERATIONS)

       jackhmmer always includes your original query sequence in the alignment
       result at every iteration, and consensus positions are always defined
       by that query sequence. That is, a jackhmmer profile is always the same
       length as your original query, at every iteration. Therefore jackhmmer
       gives you less control over profile construction than hmmbuild does; it
       does not have the --fast, or --hand, or --symfrac options. The only
       profile construction option available in jackhmmer is --fragthresh:

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). HMMER
              uses a simple rule to infer fragments: if the sequence length L
              is less than or equal to a fraction <x> times the alignment
              length in columns, then the sequence is handled as a fragment.
              The default is 0.5. Setting --fragthresh 0 will define no
              (nonempty) sequence as a fragment; you might want to do this if
              you know you've got a carefully curated alignment of full-length
              sequences. Setting --fragthresh 1 will define all sequences as
              fragments; you might want to do this if you know your alignment
              is entirely composed of fragments, such as translated short
              reads in metagenomic shotgun data.
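
       Example (an illustrative sketch, not HMMER's own code): the fragment
       rule described for --fragthresh, written as a Python predicate. The
       function name is_fragment is hypothetical; seq_len is the sequence
       length L in residues and alignment_columns is the alignment length.

```python
def is_fragment(seq_len, alignment_columns, fragthresh=0.5):
    """Return True if a sequence would be handled as a fragment.

    Rule from the text: fragment if L <= fragthresh * alignment length.
    """
    if not 0.0 <= fragthresh <= 1.0:
        raise ValueError("fragthresh must be between 0 and 1")
    return seq_len <= fragthresh * alignment_columns


# A 100-column alignment with the default threshold of 0.5.
print(is_fragment(40, 100))    # shorter than half the alignment
print(is_fragment(51, 100))    # longer than half the alignment
# The two extreme settings described above.
print(is_fragment(1, 100, fragthresh=0.0))
print(is_fragment(100, 100, fragthresh=1.0))
```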

OPTIONS CONTROLLING RELATIVE WEIGHTS

       Whenever a profile is built from a multiple alignment, HMMER uses an ad
       hoc sequence weighting algorithm to downweight closely related
       sequences and upweight distantly related ones. This has the effect of
       making models less biased by uneven phylogenetic representation. For
       example, two identical sequences would typically each receive half the
       weight that one sequence would (and this is why jackhmmer isn't
       concerned about always including your original query sequence in each
       iteration's alignment, even if it finds it again in the database you're
       searching). These options control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al, J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid) and within each cluster of c sequences, each sequence
              gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering
              when using --wblosum. Invalid with any other weighting scheme.
              Default is 0.62.
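
       Example (an illustrative sketch, not HMMER's own code): the 1/c
       weighting described for --wblosum. Two details are assumptions made
       for this sketch: pairwise identity is computed over columns where
       both sequences have a residue, and two sequences are linked when
       their identity is greater than or equal to the threshold. Function
       names and sequences are made up.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(aligned, wid=0.62):
    """Single-linkage cluster aligned sequences; weight each one by 1/cluster size."""
    n = len(aligned)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(aligned[i], aligned[j]) >= wid:
                parent[find(i)] = find(j)  # one link is enough to merge clusters

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(n)]


# Two identical sequences and one unrelated sequence (made-up data).
weights = blosum_weights(["ACDEFGHIKL", "ACDEFGHIKL", "WWWWWWWWWW"])
print(weights)
```

       The two identical sequences share one cluster and each receive half
       the weight of the unrelated sequence, which is the behaviour the
       introductory paragraph describes.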

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a
       total effective sequence number, eff_nseq. This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that. The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.

       --eent Adjust effective sequence number to achieve a specific relative
              entropy per position (see --ere). This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid). This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to
              <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.
              Requires --eent. Default depends on the sequence alphabet; for
              protein sequences, it is 0.59 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire model
              alignment, over its whole length. This has the effect of making
              short models have higher relative entropy per position than
              --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by single
              linkage clustering with the --eclust option. The default is
              0.62.
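
       Example (an illustrative sketch, not HMMER's own code): normalizing
       relative weights so that they sum to a chosen effective sequence
       number. How eff_nseq itself is chosen by --eent is not shown here;
       the value 1.2 is made up, as is the function name.

```python
def scale_to_effective_number(relative_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(relative_weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    return [w * eff_nseq / total for w in relative_weights]


relative = [0.5, 0.5, 1.0]  # e.g. two near-duplicates and one distinct sequence

# A hypothetical effective sequence number smaller than the 3 real sequences.
scaled = scale_to_effective_number(relative, eff_nseq=1.2)
print(scaled, sum(scaled))

# Using the actual number of sequences, as --enone does.
unscaled = scale_to_effective_number(relative, eff_nseq=len(relative))
print(unscaled, sum(unscaled))
```

       The ratios between sequences are unchanged by the rescaling; only the
       total amount of evidence the alignment contributes is reduced.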

OPTIONS CONTROLLING PRIORS

       In profile construction, by default, weighted counts are converted to
       mean posterior probability parameter estimates using mixture Dirichlet
       priors. Default mixture Dirichlet prior parameters for protein models
       and for nucleic acid (RNA and DNA) models are built in. The following
       options allow you to override the default priors.

       --pnone
              Don't use any priors. Probability parameters will simply be the
              observed frequencies, after relative sequence weighting.

       --plaplace
              Use a Laplace +1 prior in place of the default mixture Dirichlet
              prior.
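
       Example (an illustrative sketch, not HMMER's own code): the two
       alternatives to the default prior, applied to weighted counts for a
       toy four-letter alphabet. The default mixture Dirichlet prior is not
       reproduced here. The function name and the counts are made up.

```python
def estimate_probabilities(counts, prior="none"):
    """Convert weighted counts to probabilities.

    prior="none":    observed frequencies, as with --pnone.
    prior="laplace": add one pseudocount per symbol, as with --plaplace.
    """
    k = len(counts)
    total = sum(counts)
    if prior == "laplace":
        return [(c + 1.0) / (total + k) for c in counts]
    if prior == "none":
        if total == 0:
            raise ValueError("cannot estimate frequencies from zero counts")
        return [c / total for c in counts]
    raise ValueError(f"unknown prior: {prior!r}")


counts = [3.0, 1.0, 0.0, 0.0]  # made-up weighted counts
print(estimate_probabilities(counts, "none"))
print(estimate_probabilities(counts, "laplace"))
```

       Without a prior, unobserved symbols get probability zero; the +1
       pseudocounts keep every symbol possible.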

OPTIONS CONTROLLING E-VALUE CALIBRATION

       Estimating the location parameters for the expected score distributions
       for MSV filter scores, Viterbi filter scores, and Forward scores
       requires three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EfL <n>
              Sets the sequence length in simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.
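
       Example (an illustrative sketch, not HMMER's own code): how a
       location parameter turns a bit score into a P-value. The
       distribution shapes are assumptions not stated on this page: a
       Gumbel distribution for MSV and Viterbi scores, an exponential tail
       for Forward scores, and a slope lambda of ln 2 for scores in bits.
       The values of mu, tau and the scores are made up.

```python
import math

LAMBDA = math.log(2)  # assumed slope for scores measured in bits


def gumbel_pvalue(score, mu, lam=LAMBDA):
    """P(S > score) for a Gumbel distribution with location mu."""
    # 1 - exp(-x) computed as -expm1(-x) for accuracy when x is tiny.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def exponential_tail_pvalue(score, tau, lam=LAMBDA):
    """P(S > score) for an exponential tail starting at location tau."""
    return min(1.0, math.exp(-lam * (score - tau)))


# Made-up location parameters and scores.
print(gumbel_pvalue(score=20.0, mu=-5.0))
print(exponential_tail_pvalue(score=20.0, tau=-3.0))
```

       Under these assumptions, shifting mu or tau upward raises the
       P-value of a given score, which is why the location parameters have
       to be estimated for each profile.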

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is >0, any stochastic simulations will be reproducible; the
              same command will give the same results. If <n> is 0, the random
              number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command. The
              default seed is 42.

       --qformat <s>
              Assert that input query seqfile is in format <s>, bypassing
              format autodetection. Common choices for <s> include: fasta,
              embl, genbank. Alignment formats also work; common choices
              include: stockholm, a2m, afa, psiblast, clustal, phylip.
              jackhmmer always uses a single sequence query to start its
              search, so when the input seqfile is an alignment, jackhmmer
              reads it one unaligned query sequence at a time, not as an
              alignment. For more information, and for codes for some less
              common formats, see the main documentation. The string <s> is
              case-insensitive (fasta or FASTA both work).

       --tformat <s>
              Assert that the input target sequence seqdb is in format <s>.
              See --qformat above for accepted choices for <s>.

       --cpu <n>
              Set the number of parallel worker threads to <n>. On multicore
              machines, the default is 2. You can also control this number by
              setting an environment variable, HMMER_NCPU. There is also a
              master thread, so the actual number of threads that HMMER spawns
              is <n>+1.

              This option is not available if HMMER was compiled with POSIX
              threads support turned off.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
              optional MPI support was enabled at compile-time.)

       --mpi  Run under MPI control with master/worker parallelization (using
              mpirun, for example, or equivalent). Only available if optional
              MPI support was enabled at compile-time.

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org

HMMER 3.3.2                        Nov 2020                       jackhmmer(1)