jackhmmer(1) HMMER Manual jackhmmer(1)

jackhmmer - iteratively search sequence(s) against a sequence database

jackhmmer [options] seqfile seqdb

jackhmmer iteratively searches each query sequence in seqfile against
the target sequence(s) in seqdb. The first iteration is identical to a
phmmer search. For the next iteration, a multiple alignment of the
query together with all target sequences satisfying inclusion
thresholds is assembled, a profile is constructed from this alignment
(identical to using hmmbuild on the alignment), and profile search of
the seqdb is done (identical to an hmmsearch with the profile).

The query seqfile may be '-' (a dash character), in which case the
query sequences are read from a stdin pipe instead of from a file. The
seqdb cannot be read from a stdin stream, because jackhmmer needs to do
multiple passes over the database.
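
As an illustration of the synopsis and the stdin rule above, here is a
minimal Python sketch that assembles a jackhmmer command line. The helper
name jackhmmer_argv and the file name targets.fasta are hypothetical, and
the check on seqdb is this wrapper's own guard, not jackhmmer's error
handling.

```python
def jackhmmer_argv(seqfile, seqdb, options=()):
    """Build an argv list of the form: jackhmmer [options] seqfile seqdb."""
    if seqdb == "-":
        # The database is read in several passes, so it cannot be a stdin stream.
        raise ValueError("seqdb must be a file, not '-'")
    return ["jackhmmer", *options, seqfile, seqdb]


# Query read from stdin ('-'), database read from a (hypothetical) file.
argv = jackhmmer_argv("-", "targets.fasta", options=["-N", "3"])
print(" ".join(argv))
```

Such a list could then be handed to subprocess.run(argv, input=fasta_text,
text=True), which supplies the query sequences on stdin.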

The output format is designed to be human-readable, but is often so
voluminous that reading it is impractical, and parsing it is a pain. The
--tblout and --domtblout options save output in simple tabular formats
that are concise and easier to parse. The -o option allows redirecting
the main output, including throwing it away in /dev/null.
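
A rough sketch of how a --tblout file might be parsed in Python follows.
It assumes, beyond what this page states, that comment lines start with
'#' and that the first six whitespace-delimited columns are target name,
target accession, query name, query accession, full-sequence E-value, and
full-sequence bit score. The sample rows are made up and truncated for
illustration; check the column layout against your own HMMER version.

```python
def parse_tblout(lines):
    """Yield (target, query, evalue, score) tuples from --tblout text lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Assumed column order: target name, accession, query name,
        # accession, full-sequence E-value, full-sequence bit score, ...
        yield fields[0], fields[2], float(fields[4]), float(fields[5])


# Made-up, truncated rows used only to exercise the parser.
sample = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "seqA           -          query1      -          1.2e-30  105.3  0.1",
    "seqB           -          query1      -          0.0045   18.7   0.0",
]
hits = list(parse_tblout(sample))
for target, query, evalue, score in hits:
    print(target, query, evalue, score)
```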

-h     Help; print a brief reminder of command line usage and all
       available options.

-N <n> Set the maximum number of iterations to <n>. The default is 5.
       If <n> is 1, the result is equivalent to a phmmer search.

By default, output for each iteration appears on stdout in a somewhat
human readable, somewhat parseable format. These options allow
redirecting that output or saving additional kinds of output to files,
including checkpoint files for each iteration.

-o <f> Direct the human-readable output to a file <f>.

-A <f> After the final iteration, save an annotated multiple alignment
       of all hits satisfying inclusion thresholds (also including the
       original query) to <f> in Stockholm format.

--tblout <f>
       After the final iteration, save a tabular summary of top
       sequence hits to <f> in a readily parseable, columnar,
       whitespace-delimited format.

--domtblout <f>
       After the final iteration, save a tabular summary of top domain
       hits to <f> in a readily parseable, columnar,
       whitespace-delimited format.

--chkhmm prefix
       At the start of each iteration, checkpoint the query HMM, saving
       it to a file named prefix-n.hmm where n is the iteration number
       (from 1..N).

--chkali prefix
       At the end of each iteration, checkpoint an alignment of all
       domains satisfying inclusion thresholds (i.e. what will become
       the query HMM for the next iteration), saving it to a file named
       prefix-n.sto in Stockholm format, where n is the iteration
       number (from 1..N).

--acc  Use accessions instead of names in the main output, where
       available for profiles and/or sequences.

--noali
       Omit the alignment section from the main output. This can
       greatly reduce the output volume.

--notextw
       Remove the limit on the length of each line in the main output.
       The default is a limit of 120 characters per line, which helps
       in displaying the output cleanly on terminals and in editors,
       but can truncate target profile description lines.

--textw <n>
       Set the main output's line length limit to <n> characters per
       line. The default is 120.
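
The checkpoint naming scheme described for --chkhmm and --chkali can be
sketched in Python as below. The function name checkpoint_files and the
prefix run1 are hypothetical. Because -N is a maximum, a real run may
leave fewer files than this list suggests.

```python
def checkpoint_files(prefix, n_iterations=5):
    """List (HMM, alignment) checkpoint file names for iterations 1..N."""
    return [
        (f"{prefix}-{n}.hmm", f"{prefix}-{n}.sto")
        for n in range(1, n_iterations + 1)
    ]


# Names expected for: --chkhmm run1 --chkali run1 -N 3
for hmm_file, ali_file in checkpoint_files("run1", n_iterations=3):
    print(hmm_file, ali_file)
```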

By default, the first iteration uses a search model constructed from a
single query sequence. This model is constructed using a standard 20x20
substitution matrix for residue probabilities, and two additional
parameters for position-independent gap open and gap extend
probabilities. These options allow the default single-sequence scoring
parameters to be changed.

--popen <x>
       Set the gap open probability for a single sequence query model
       to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

--pextend <x>
       Set the gap extend probability for a single sequence query model
       to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.

--mx <s>
       Obtain residue alignment probabilities from the built-in
       substitution matrix named <s>. Several standard matrices are
       built-in, and do not need to be read from files. The matrix name
       <s> can be PAM30, PAM70, PAM120, PAM240, BLOSUM45, BLOSUM50,
       BLOSUM62, BLOSUM80, or BLOSUM90. Only one of the --mx and
       --mxfile options may be used.

--mxfile mxfile
       Obtain residue alignment probabilities from the substitution
       matrix in file mxfile. The default score matrix is BLOSUM62 (this
       matrix is internal to HMMER and does not have to be available as
       a file). The format of a substitution matrix mxfile is the
       standard format accepted by BLAST, FASTA, and other sequence
       analysis software. See ftp.ncbi.nlm.nih.gov/blast/matrices/ for
       example files. (The only exception: we require matrices to be
       square, so for DNA, use files like NCBI's NUC.4.4, not NUC.4.2.)
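
The constraints on the single-sequence scoring options can be checked
before launching a search. The following Python sketch is a hypothetical
wrapper (the name scoring_options is not part of HMMER); it only encodes
the ranges, matrix names, and the --mx/--mxfile exclusion stated above.

```python
BUILTIN_MATRICES = {
    "PAM30", "PAM70", "PAM120", "PAM240",
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
}


def scoring_options(popen=0.02, pextend=0.4, mx=None, mxfile=None):
    """Validate first-iteration scoring settings and return option tokens."""
    if not 0 <= popen < 0.5:
        raise ValueError("--popen must be >= 0 and < 0.5")
    if not 0 <= pextend < 1.0:
        raise ValueError("--pextend must be >= 0 and < 1.0")
    if mx is not None and mxfile is not None:
        raise ValueError("only one of --mx and --mxfile may be used")
    if mx is not None and mx not in BUILTIN_MATRICES:
        raise ValueError(f"unknown built-in matrix: {mx}")

    opts = ["--popen", str(popen), "--pextend", str(pextend)]
    if mx is not None:
        opts += ["--mx", mx]
    if mxfile is not None:
        opts += ["--mxfile", mxfile]
    return opts


print(scoring_options(mx="BLOSUM80"))
```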

Reporting thresholds control which hits are reported in output files
(the main output, --tblout, and --domtblout). In each iteration,
sequence hits and domain hits are ranked by statistical significance
(E-value) and output is generated in two sections called per-target and
per-domain output. In per-target output, by default, all sequence hits
with an E-value <= 10 are reported. In the per-domain output, for each
target that has passed per-target reporting thresholds, all domains
satisfying per-domain reporting thresholds are reported. By default,
these are domains with conditional E-values of <= 10. The following
options allow you to change the default E-value reporting thresholds,
or to use bit score thresholds instead.

-E <x> Report sequences with E-values <= <x> in per-sequence output.
       The default is 10.0.

-T <x> Use a bit score threshold for per-sequence output instead of an
       E-value threshold (any setting of -E is ignored). Report
       sequences with a bit score of >= <x>. By default this option is
       unset.

-Z <x> Declare the total size of the database to be <x> sequences, for
       purposes of E-value calculation. Normally E-values are
       calculated relative to the size of the database you actually
       searched (i.e. the number of sequences in seqdb). In some cases
       (for instance, if you've split your target sequence database
       into multiple files for parallelization of your search), you may
       know better what the actual size of your search space is.

--domE <x>
       Report domains with conditional E-values <= <x> in per-domain
       output, in addition to the top-scoring domain per significant
       sequence hit. The default is 10.0.

--domT <x>
       Use a bit score threshold for per-domain output instead of an
       E-value threshold (any setting of --domE is ignored). Report
       domains with a bit score of >= <x> in per-domain output, in
       addition to the top-scoring domain per significant sequence hit.
       By default this option is unset.

--domZ <x>
       Declare the number of significant sequences to be <x> sequences,
       for purposes of conditional E-value calculation for additional
       domain significance. Normally conditional E-values are
       calculated relative to the number of sequences passing the
       per-sequence reporting threshold.
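
To see why -Z matters when a database is split, consider this small
Python sketch. It assumes the usual relation that an E-value is the
P-value multiplied by the search-space size, which this page implies but
does not spell out; the shard size and P-value are made-up numbers.

```python
import math


def evalue(pvalue, z):
    """E-value as P-value times search-space size (assumed relation)."""
    return pvalue * z


shard_size = 25_000          # hypothetical: one of four equal shards
full_size = 4 * shard_size   # the size you would declare with -Z
p = 1e-6                     # hypothetical per-sequence P-value

e_shard = evalue(p, shard_size)  # E-value relative to one shard only
e_full = evalue(p, full_size)    # E-value relative to the whole database

print(f"per-shard E-value: {e_shard:.3g}")
print(f"with -Z {full_size}: {e_full:.3g}")
```

Under this assumption, searching one shard without -Z understates the
E-value by the ratio of the full database size to the shard size.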

Inclusion thresholds control which hits are included in the multiple
alignment and profile constructed for the next search iteration. By
default, a sequence must have a per-sequence E-value of <= 0.001 (see
--incE option) to be included, and any additional domains in it besides
the top-scoring one must have a conditional E-value of <= 0.001 (see
--incdomE option). The difference between reporting thresholds and
inclusion thresholds is that inclusion thresholds control which hits
actually get used in the next iteration (or the final output multiple
alignment if the -A option is used), whereas reporting thresholds
control what you see in output. Reporting thresholds are generally
looser so you can see borderline hits in the top of the noise that
might be of interest.

--incE <x>
       Include sequences with E-values <= <x> in subsequent iteration
       or final alignment output by -A. The default is 0.001.

--incT <x>
       Use a bit score threshold for per-sequence inclusion instead of
       an E-value threshold (any setting of --incE is ignored). Include
       sequences with a bit score of >= <x>. By default this option is
       unset.

--incdomE <x>
       Include domains with conditional E-values <= <x> in subsequent
       iteration or final alignment output by -A, in addition to the
       top-scoring domain per significant sequence hit. The default is
       0.001.

--incdomT <x>
       Use a bit score threshold for per-domain inclusion instead of an
       E-value threshold (any setting of --incdomE is ignored). Include
       domains with a bit score of >= <x>. By default this option is
       unset.
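
The difference between reporting and inclusion can be sketched as a pair
of per-sequence tests in Python. This is a simplified illustration, not
HMMER's code: the function names are hypothetical, and treating inclusion
as requiring that the hit is also reported is an assumption.

```python
def passes(evalue, score, e_cutoff, t_cutoff=None):
    """A bit score cutoff, when set, replaces the E-value cutoff."""
    if t_cutoff is not None:
        return score >= t_cutoff
    return evalue <= e_cutoff


def classify(evalue, score, E=10.0, T=None, incE=0.001, incT=None):
    """Return (reported, included) for one per-sequence hit."""
    reported = passes(evalue, score, E, T)
    # Assumption: only reported hits are considered for inclusion.
    included = reported and passes(evalue, score, incE, incT)
    return reported, included


# Made-up hits: strong, borderline, and noise.
print(classify(1e-5, 60.0))   # reported and included
print(classify(0.5, 15.0))    # reported only (borderline)
print(classify(50.0, 2.0))    # neither
```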

HMMER3 searches are accelerated in a three-step filter pipeline: the
MSV filter, the Viterbi filter, and the Forward filter. The first
filter is the fastest and most approximate; the last is the full
Forward scoring algorithm, slowest but most accurate. There is also a
bias filter step between MSV and Viterbi. Targets that pass all the
steps in the acceleration pipeline are then subjected to
postprocessing: domain identification and scoring using the
Forward/Backward algorithm.

Essentially the only free parameters that control HMMER's heuristic
filters are the P-value thresholds controlling the expected fraction of
nonhomologous sequences that pass the filters. Setting the thresholds
higher than the defaults will pass a higher proportion of nonhomologous
sequences, increasing sensitivity at the expense of speed; conversely,
setting lower P-value thresholds will pass a smaller proportion,
decreasing sensitivity and increasing speed. Setting a filter's P-value
threshold to 1.0 means it will pass all sequences, which effectively
disables the filter.

Changing filter thresholds only removes or includes targets from
consideration; changing filter thresholds does not alter bit scores,
E-values, or alignments, all of which are determined solely in
postprocessing.

--max  Maximum sensitivity. Turn off all filters, including the bias
       filter, and run full Forward/Backward postprocessing on every
       target. This increases sensitivity slightly, at a large cost in
       speed.

--F1 <x>
       First filter threshold; set the P-value threshold for the MSV
       filter step. The default is 0.02, meaning that roughly 2% of
       the highest scoring nonhomologous targets are expected to pass
       the filter.

--F2 <x>
       Second filter threshold; set the P-value threshold for the
       Viterbi filter step. The default is 0.001.

--F3 <x>
       Third filter threshold; set the P-value threshold for the
       Forward filter step. The default is 1e-5.

--nobias
       Turn off the bias filter. This increases sensitivity somewhat,
       but can come at a high cost in speed, especially if the query
       has biased residue composition (such as a repetitive sequence
       region, or if it is a membrane protein with large regions of
       hydrophobicity). Without the bias filter, too many sequences may
       pass the filter with biased queries, leading to slower than
       expected performance as the computationally intensive
       Forward/Backward algorithms shoulder an abnormally heavy load.
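
Since each threshold is the expected fraction of nonhomologous targets
that pass a filter, the workload reaching each stage can be estimated
with simple arithmetic. The Python sketch below is a back-of-envelope
estimate only; the function name and the database size are hypothetical,
and the bias filter is ignored.

```python
def expected_survivors(n_nonhomologs, f1=0.02, f2=0.001, f3=1e-5):
    """Rough count of nonhomologous targets expected to pass each filter."""
    return {
        "MSV": n_nonhomologs * f1,
        "Viterbi": n_nonhomologs * f2,
        "Forward": n_nonhomologs * f3,
    }


# Hypothetical database of one million unrelated sequences, default thresholds.
survivors = expected_survivors(1_000_000)
for stage, count in survivors.items():
    print(f"{stage}: about {count:.0f} sequences pass")
```

With the defaults, roughly 20,000 sequences would pass MSV, 1,000 would
pass Viterbi, and 10 would reach postprocessing; raising --F1 to 1.0
would send all one million to the next stage.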

jackhmmer always includes your original query sequence in the alignment
result at every iteration, and consensus positions are always defined
by that query sequence. That is, a jackhmmer profile is always the same
length as your original query, at every iteration. Therefore jackhmmer
gives you less control over profile construction than hmmbuild does; it
does not have the --fast, or --hand, or --symfrac options. The only
profile construction option available in jackhmmer is --fragthresh:

--fragthresh <x>
       We only want to count terminal gaps as deletions if the aligned
       sequence is known to be full-length, not if it is a fragment
       (for instance, because only part of it was sequenced). HMMER
       uses a simple rule to infer fragments: if the sequence length L
       is less than or equal to a fraction <x> times the alignment
       length in columns, then the sequence is handled as a fragment.
       The default is 0.5. Setting --fragthresh 0 will define no
       (nonempty) sequence as a fragment; you might want to do this if
       you know you've got a carefully curated alignment of full-length
       sequences. Setting --fragthresh 1 will define all sequences as
       fragments; you might want to do this if you know your alignment
       is entirely composed of fragments, such as translated short
       reads in metagenomic shotgun data.
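
The fragment rule stated for --fragthresh reduces to a single
comparison. Here is a minimal Python rendering of that rule; the function
and parameter names are illustrative and not taken from HMMER's source.

```python
def is_fragment(seq_len, ali_columns, fragthresh=0.5):
    """True if sequence length L <= fragthresh * alignment length in columns."""
    return seq_len <= fragthresh * ali_columns


# A 100-column alignment with the default threshold of 0.5.
print(is_fragment(40, 100))        # True: 40 <= 50
print(is_fragment(51, 100))        # False: 51 > 50

# The two extreme settings described above.
print(is_fragment(1, 100, 0.0))    # False: no nonempty sequence is a fragment
print(is_fragment(100, 100, 1.0))  # True: every sequence is a fragment
```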

Whenever a profile is built from a multiple alignment, HMMER uses an ad
hoc sequence weighting algorithm to downweight closely related
sequences and upweight distantly related ones. This has the effect of
making models less biased by uneven phylogenetic representation. For
example, two identical sequences would typically each receive half the
weight that one sequence would (and this is why jackhmmer isn't
concerned about always including your original query sequence in each
iteration's alignment, even if it finds it again in the database you're
searching). These options control which algorithm gets used.

--wpb  Use the Henikoff position-based sequence weighting scheme
       [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
       the default.

--wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
       [Gerstein et al, J. Mol. Biol. 235:1067, 1994].

--wblosum
       Use the same clustering scheme that was used to weight data in
       calculating BLOSUM substitution matrices [Henikoff and Henikoff,
       Proc. Natl. Acad. Sci 89:10915, 1992]. Sequences are
       single-linkage clustered at an identity threshold (default 0.62;
       see --wid) and within each cluster of c sequences, each sequence
       gets relative weight 1/c.

--wnone
       No relative weights. All sequences are assigned uniform weight.

--wid <x>
       Sets the identity threshold used by single-linkage clustering
       when using --wblosum. Invalid with any other weighting scheme.
       Default is 0.62.
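
The --wblosum scheme (single-linkage clustering, then weight 1/c within
each cluster of c sequences) can be sketched in Python as follows. This
is an illustration under stated assumptions, not HMMER's implementation:
the pairwise identity definition, the use of >= at the threshold, and the
toy alignment are all choices made for this example.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither has a gap.

    Assumed definition for illustration; HMMER's own may differ.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(aligned, wid=0.62):
    """Single-linkage cluster aligned sequences; weight each one 1/c."""
    n = len(aligned)
    parent = list(range(n))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage: one sufficiently similar pair merges two clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(aligned[i], aligned[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    return [1 / roots.count(r) for r in roots]


# Toy alignment: two identical sequences and one unrelated sequence.
alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY"]
weights = blosum_weights(alignment)
print(weights)
```

In this toy case the two identical sequences share one cluster and get
half the weight of the unrelated sequence, matching the example in the
paragraph above.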

After relative weights are determined, they are normalized to sum to a
total effective sequence number, eff_nseq. This number may be the
actual number of sequences in the alignment, but it is almost always
smaller than that. The default entropy weighting method (--eent)
reduces the effective sequence number to reduce the information content
(relative entropy, or average expected score on true homologs) per
consensus position. The target relative entropy is controlled by a
two-parameter function, where the two parameters are settable with
--ere and --esigma.

--eent Adjust effective sequence number to achieve a specific relative
       entropy per position (see --ere). This is the default.

--eclust
       Set effective sequence number to the number of single-linkage
       clusters at a specific identity threshold (see --eid). This
       option is not recommended; it's for experiments evaluating how
       much better --eent is.

--enone
       Turn off effective sequence number determination and just use
       the actual number of sequences. One reason you might want to do
       this is to try to maximize the relative entropy/position of your
       model, which may be useful for short models.

--eset <x>
       Explicitly set the effective sequence number for all models to
       <x>.

--ere <x>
       Set the minimum relative entropy/position target to <x>.
       Requires --eent. Default depends on the sequence alphabet; for
       protein sequences, it is 0.59 bits/position.

--esigma <x>
       Sets the minimum relative entropy contributed by an entire model
       alignment, over its whole length. This has the effect of making
       short models have higher relative entropy per position than
       --ere alone would give. The default is 45.0 bits.

--eid <x>
       Sets the fractional pairwise identity cutoff used by single
       linkage clustering with the --eclust option. The default is
       0.62.
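
The normalization step described at the start of this section (relative
weights rescaled to sum to eff_nseq) amounts to one line of arithmetic.
The Python sketch below shows only that rescaling; how eff_nseq itself is
chosen by --eent is not modeled, and the value 1.5 is a made-up example,
comparable to what --eset 1.5 would impose.

```python
def normalize_weights(relative_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(relative_weights)
    return [w * eff_nseq / total for w in relative_weights]


# Three sequences with relative weights 0.5, 0.5, 1.0 and eff_nseq = 1.5.
weights = normalize_weights([0.5, 0.5, 1.0], eff_nseq=1.5)
print(weights, sum(weights))
```

The ratios between sequences are preserved; only the total amount of
evidence the alignment contributes is scaled down.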

In profile construction, by default, weighted counts are converted to
mean posterior probability parameter estimates using mixture Dirichlet
priors. Default mixture Dirichlet prior parameters for protein models
and for nucleic acid (RNA and DNA) models are built in. The following
options allow you to override the default priors.

--pnone
       Don't use any priors. Probability parameters will simply be the
       observed frequencies, after relative sequence weighting.

--plaplace
       Use a Laplace +1 prior in place of the default mixture Dirichlet
       prior.
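
The two alternatives to the default prior are easy to show on a toy
count vector. The Python sketch below covers only --pnone (observed
frequencies) and --plaplace (add one to every count); the mixture
Dirichlet default is not reproduced, and the four-letter count vector is
a made-up example.

```python
def estimate_probs(counts, prior="laplace"):
    """Turn weighted counts into probabilities under a simple prior."""
    if prior == "none":
        # Observed frequencies only; unseen residues get probability 0.
        total = sum(counts)
        return [c / total for c in counts]
    if prior == "laplace":
        # Laplace +1: mean posterior estimate with one pseudocount each.
        total = sum(counts) + len(counts)
        return [(c + 1) / total for c in counts]
    raise ValueError(f"unknown prior: {prior}")


counts = [3.0, 1.0, 0.0, 0.0]  # toy weighted counts for a 4-letter alphabet
print(estimate_probs(counts, prior="none"))
print(estimate_probs(counts, prior="laplace"))
```

With no prior, residues never observed in a column get probability zero,
which is the main practical reason a prior is normally used.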

Estimating the location parameters for the expected score distributions
for MSV filter scores, Viterbi filter scores, and Forward scores
requires three short random sequence simulations.

--EmL <n>
       Sets the sequence length in simulation that estimates the
       location parameter mu for MSV filter E-values. Default is 200.

--EmN <n>
       Sets the number of sequences in simulation that estimates the
       location parameter mu for MSV filter E-values. Default is 200.

--EvL <n>
       Sets the sequence length in simulation that estimates the
       location parameter mu for Viterbi filter E-values. Default is
       200.

--EvN <n>
       Sets the number of sequences in simulation that estimates the
       location parameter mu for Viterbi filter E-values. Default is
       200.

--EfL <n>
       Sets the sequence length in simulation that estimates the
       location parameter tau for Forward E-values. Default is 100.

--EfN <n>
       Sets the number of sequences in simulation that estimates the
       location parameter tau for Forward E-values. Default is 200.

--Eft <x>
       Sets the tail mass fraction to fit in the simulation that
       estimates the location parameter tau for Forward E-values.
       Default is 0.04.

--nonull2
       Turn off the null2 score corrections for biased composition.

-Z <x> Assert that the total number of targets in your searches is <x>,
       for the purposes of per-sequence E-value calculations, rather
       than the actual number of targets seen.

--domZ <x>
       Assert that the total number of targets in your searches is <x>,
       for the purposes of per-domain conditional E-value calculations,
       rather than the number of targets that passed the reporting
       thresholds.

--seed <n>
       Seed the random number generator with <n>, an integer >= 0. If
       <n> is >0, any stochastic simulations will be reproducible; the
       same command will give the same results. If <n> is 0, the
       random number generator is seeded arbitrarily, and stochastic
       simulations will vary from run to run of the same command. The
       default seed is 42.

--qformat <s>
       Assert that input query seqfile is in format <s>, bypassing
       format autodetection. Common choices for <s> include: fasta,
       embl, genbank. Alignment formats also work; common choices
       include: stockholm, a2m, afa, psiblast, clustal, phylip.
       jackhmmer always uses a single sequence query to start its
       search, so when the input seqfile is an alignment, jackhmmer
       reads it one unaligned query sequence at a time, not as an
       alignment. For more information, and for codes for some less
       common formats, see main documentation. The string <s> is
       case-insensitive (fasta or FASTA both work).

--tformat <s>
       Assert that the input target sequence seqdb is in format <s>.
       See --qformat above for accepted choices for <s>.

--cpu <n>
       Set the number of parallel worker threads to <n>. On multicore
       machines, the default is 2. You can also control this number by
       setting an environment variable, HMMER_NCPU. There is also a
       master thread, so the actual number of threads that HMMER spawns
       is <n>+1.

       This option is not available if HMMER was compiled with POSIX
       threads support turned off.

--stall
       For debugging the MPI master/worker version: pause after start,
       to enable the developer to attach debuggers to the running
       master and worker(s) processes. Send SIGCONT signal to release
       the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
       optional MPI support was enabled at compile-time.)

--mpi  Run under MPI control with master/worker parallelization (using
       mpirun, for example, or equivalent). Only available if optional
       MPI support was enabled at compile-time.

See hmmer(1) for a master man page with a list of all the individual
man pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf); or see the HMMER web page
(http://hmmer.org/).

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER
web page (http://hmmer.org/).

http://eddylab.org

HMMER 3.3.2 Nov 2020 jackhmmer(1)