jackhmmer(1) HMMER Manual jackhmmer(1)

jackhmmer - iteratively search sequence(s) against a protein database

jackhmmer [options] <seqfile> <seqdb>

jackhmmer iteratively searches each query sequence in <seqfile> against
the target sequence(s) in <seqdb>. The first iteration is identical to
a phmmer search. For the next iteration, a multiple alignment of the
query together with all target sequences satisfying inclusion
thresholds is assembled, a profile is constructed from this alignment
(identical to using hmmbuild on the alignment), and a profile search of
the <seqdb> is done (identical to an hmmsearch with the profile).

The query <seqfile> may be '-' (a dash character), in which case the
query sequences are read from a <stdin> pipe instead of from a file.
The <seqdb> cannot be read from a <stdin> stream, because jackhmmer
needs to do multiple passes over the database.

The output format is designed to be human-readable, but is often so
voluminous that reading it is impractical, and parsing it is a pain.
The --tblout and --domtblout options save output in simple tabular
formats that are concise and easier to parse. The -o option allows
redirecting the main output, including throwing it away in /dev/null.
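
As an illustration of how the tabular output might be consumed, the
following is a minimal Python sketch of a --tblout reader. It assumes
the layout used by HMMER 3.1 tabular files: comment lines start with
'#', each data line has 18 whitespace-delimited fixed columns, and the
free-text target description follows them. The field names and the
function name parse_tblout are hypothetical labels chosen for this
example, not identifiers defined by HMMER.

```python
from typing import Dict, Iterable, Iterator

# Assumed names for the 18 fixed columns of a --tblout data line.
# These labels are illustrative; HMMER itself does not define them.
TBLOUT_FIELDS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "dom_evalue", "dom_score", "dom_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
]


def parse_tblout(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per hit line, skipping '#' comments and blank lines."""
    n = len(TBLOUT_FIELDS)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # At most n splits: n fixed fields plus the free-text description.
        parts = line.split(None, n)
        if len(parts) < n:
            raise ValueError(f"expected at least {n} columns, got {len(parts)}")
        row: Dict[str, object] = dict(zip(TBLOUT_FIELDS, parts))
        row["description"] = parts[n] if len(parts) > n else ""
        # Convert the numeric columns most often used for filtering.
        for key in ("full_evalue", "full_score", "dom_evalue", "dom_score"):
            row[key] = float(parts[TBLOUT_FIELDS.index(key)])
        yield row
```

A file written with --tblout could then be read by passing an open
file handle to parse_tblout and filtering on row["full_evalue"].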

-h Help; print a brief reminder of command line usage and all
available options.

-N <n> Set the maximum number of iterations to <n>. The default is 5.
If <n> is 1, the result is equivalent to a phmmer search.

By default, output for each iteration appears on stdout in a somewhat
human readable, somewhat parseable format. These options allow
redirecting that output or saving additional kinds of output to files,
including checkpoint files for each iteration.

-o <f> Direct the human-readable output to a file <f>.

-A <f> After the final iteration, save an annotated multiple alignment
of all hits satisfying inclusion thresholds (also including the
original query) to <f> in Stockholm format.

--tblout <f>
After the final iteration, save a tabular summary of top sequence hits
to <f> in a readily parseable, columnar, whitespace-delimited format.

--domtblout <f>
After the final iteration, save a tabular summary of top domain hits
to <f> in a readily parseable, columnar, whitespace-delimited format.

--chkhmm <prefix>
At the start of each iteration, checkpoint the query HMM, saving it to
a file named <prefix>-<n>.hmm, where <n> is the iteration number (from
1..N).

--chkali <prefix>
At the end of each iteration, checkpoint an alignment of all domains
satisfying inclusion thresholds (i.e. what will become the query HMM
for the next iteration), saving it to a file named <prefix>-<n>.sto in
Stockholm format, where <n> is the iteration number (from 1..N).

--acc Use accessions instead of names in the main output, where
available for profiles and/or sequences.

--noali
Omit the alignment section from the main output. This can greatly
reduce the output volume.

--notextw
Unlimit the length of each line in the main output. The default is a
limit of 120 characters per line, which helps in displaying the output
cleanly on terminals and in editors, but can truncate target profile
description lines.

--textw <n>
Set the main output's line length limit to <n> characters per line.
The default is 120.
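
The output options above can be combined on one command line. The
following Python sketch assembles such a command as an argument list
(suitable for subprocess.run, which is not called here) and lists the
checkpoint file names described for --chkhmm and --chkali. The function
names build_jackhmmer_argv and checkpoint_files, and all file names in
the usage note, are hypothetical; only the flags themselves come from
this page.

```python
from typing import List, Optional, Tuple


def build_jackhmmer_argv(
    seqfile: str,
    seqdb: str,
    iterations: int = 5,
    main_output: Optional[str] = None,
    tblout: Optional[str] = None,
    domtblout: Optional[str] = None,
    chk_prefix: Optional[str] = None,
    noali: bool = False,
) -> List[str]:
    """Assemble a jackhmmer argument list from the output options."""
    if iterations < 1:
        raise ValueError("iterations must be >= 1")
    argv = ["jackhmmer", "-N", str(iterations)]
    if main_output is not None:
        argv += ["-o", main_output]          # e.g. /dev/null to discard
    if tblout is not None:
        argv += ["--tblout", tblout]
    if domtblout is not None:
        argv += ["--domtblout", domtblout]
    if chk_prefix is not None:
        # Same prefix for both checkpoint kinds; suffixes keep them apart.
        argv += ["--chkhmm", chk_prefix, "--chkali", chk_prefix]
    if noali:
        argv.append("--noali")
    argv += [seqfile, seqdb]                 # positional arguments go last
    return argv


def checkpoint_files(prefix: str, iterations: int) -> List[Tuple[str, str]]:
    """Names <prefix>-<n>.hmm and <prefix>-<n>.sto for n in 1..iterations."""
    return [(f"{prefix}-{n}.hmm", f"{prefix}-{n}.sto")
            for n in range(1, iterations + 1)]
```

For example, build_jackhmmer_argv("query.fa", "db.fa", iterations=3,
main_output="/dev/null", tblout="hits.tbl") yields a command that
discards the main output and keeps only the per-sequence table.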

By default, the first iteration uses a search model constructed from a
single query sequence. This model is constructed using a standard 20x20
substitution matrix for residue probabilities, and two additional
parameters for position-independent gap open and gap extend
probabilities. These options allow the default single-sequence scoring
parameters to be changed.

--popen <x>
Set the gap open probability for a single sequence query model to <x>.
The default is 0.02. <x> must be >= 0 and < 0.5.

--pextend <x>
Set the gap extend probability for a single sequence query model to
<x>. The default is 0.4. <x> must be >= 0 and < 1.0.

--mx <s>
Obtain residue alignment probabilities from the built-in substitution
matrix named <s>. Several standard matrices are built-in, and do not
need to be read from files. The matrix name <s> can be PAM30, PAM70,
PAM120, PAM240, BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, or BLOSUM90.
Only one of the --mx and --mxfile options may be used.

--mxfile <mxfile>
Obtain residue alignment probabilities from the substitution matrix in
file <mxfile>. The default score matrix is BLOSUM62 (this matrix is
internal to HMMER and does not have to be available as a file). The
format of a substitution matrix <mxfile> is the standard format
accepted by BLAST, FASTA, and other sequence analysis software.
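
The constraints stated for these options can be checked before
launching a search. The Python sketch below encodes only the ranges
and matrix names listed above; the function name
check_single_seq_scoring is hypothetical, and jackhmmer performs its
own validation regardless.

```python
from typing import List, Optional

# Built-in matrix names as listed for --mx.
BUILTIN_MATRICES = {
    "PAM30", "PAM70", "PAM120", "PAM240",
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
}


def check_single_seq_scoring(
    popen: float = 0.02,
    pextend: float = 0.4,
    mx: Optional[str] = None,
    mxfile: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if not (0.0 <= popen < 0.5):
        problems.append("--popen must be >= 0 and < 0.5")
    if not (0.0 <= pextend < 1.0):
        problems.append("--pextend must be >= 0 and < 1.0")
    if mx is not None and mxfile is not None:
        problems.append("only one of --mx and --mxfile may be used")
    if mx is not None and mx not in BUILTIN_MATRICES:
        problems.append(f"unknown built-in matrix: {mx}")
    return problems
```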

Reporting thresholds control which hits are reported in output files
(the main output, --tblout, and --domtblout). In each iteration,
sequence hits and domain hits are ranked by statistical significance
(E-value) and output is generated in two sections called per-target and
per-domain output. In per-target output, by default, all sequence hits
with an E-value <= 10 are reported. In the per-domain output, for each
target that has passed per-target reporting thresholds, all domains
satisfying per-domain reporting thresholds are reported. By default,
these are domains with conditional E-values of <= 10. The following
options allow you to change the default E-value reporting thresholds,
or to use bit score thresholds instead.

-E <x> Report sequences with E-values <= <x> in per-sequence output.
The default is 10.0.

-T <x> Use a bit score threshold for per-sequence output instead of an
E-value threshold (any setting of -E is ignored). Report sequences
with a bit score of >= <x>. By default this option is unset.

-Z <x> Declare the total size of the database to be <x> sequences, for
purposes of E-value calculation. Normally E-values are calculated
relative to the size of the database you actually searched (i.e. the
number of sequences in <seqdb>). In some cases (for instance, if you've
split your target sequence database into multiple files for
parallelization of your search), you may know better what the actual
size of your search space is.

--domE <x>
Report domains with conditional E-values <= <x> in per-domain output,
in addition to the top-scoring domain per significant sequence hit.
The default is 10.0.

--domT <x>
Use a bit score threshold for per-domain output instead of an E-value
threshold (any setting of --domE is ignored). Report domains with a
bit score of >= <x> in per-domain output, in addition to the
top-scoring domain per significant sequence hit. By default this
option is unset.

--domZ <x>
Declare the number of significant sequences to be <x> sequences, for
purposes of conditional E-value calculation for additional domain
significance. Normally conditional E-values are calculated relative to
the number of sequences passing per-sequence reporting threshold.
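
To make the role of -Z concrete, the following Python sketch rescales
an E-value from the number of sequences actually searched to a
declared database size. It assumes that an E-value is the per-target
P-value multiplied by the search space size Z, so that changing Z
scales the E-value linearly. The function name rescale_evalue and the
numbers in the usage note are hypothetical.

```python
def rescale_evalue(evalue: float, z_searched: float, z_declared: float) -> float:
    """Convert an E-value computed for z_searched targets to z_declared targets.

    Assumes E = P * Z, so the per-target P-value is evalue / z_searched.
    """
    if z_searched <= 0 or z_declared <= 0:
        raise ValueError("database sizes must be positive")
    pvalue = evalue / z_searched
    return pvalue * z_declared
```

For instance, a hit with E-value 0.001 found in one chunk of 250,000
sequences, out of a database of 1,000,000 split into four files, would
correspond to an E-value of about 0.004 for the whole database. Passing
-Z 1000000 to each per-chunk search is intended to have jackhmmer
report E-values on that scale directly.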

Inclusion thresholds control which hits are included in the multiple
alignment and profile constructed for the next search iteration. By
default, a sequence must have a per-sequence E-value of <= 0.001 (see
--incE option) to be included, and any additional domains in it besides
the top-scoring one must have a conditional E-value of <= 0.001 (see
--incdomE option). The difference between reporting thresholds and
inclusion thresholds is that inclusion thresholds control which hits
actually get used in the next iteration (or the final output multiple
alignment if the -A option is used), whereas reporting thresholds
control what you see in output. Reporting thresholds are generally
looser, so you can see borderline hits at the top of the noise that
might be of interest.

--incE <x>
Include sequences with E-values <= <x> in subsequent iteration or
final alignment output by -A. The default is 0.001.

--incT <x>
Use a bit score threshold for per-sequence inclusion instead of an
E-value threshold (any setting of --incE is ignored). Include
sequences with a bit score of >= <x>. By default this option is unset.

--incdomE <x>
Include domains with conditional E-values <= <x> in subsequent
iteration or final alignment output by -A, in addition to the
top-scoring domain per significant sequence hit. The default is 0.001.

--incdomT <x>
Use a bit score threshold for per-domain inclusion instead of an
E-value threshold (any setting of --incdomE is ignored). Include
domains with a bit score of >= <x>. By default this option is unset.
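
The distinction between reporting and inclusion can be sketched as two
independent tests on the same per-sequence hit. The Python example
below is a simplified model of the rules stated above, in which a bit
score threshold, when set, replaces the corresponding E-value
threshold. The function names passes and classify_sequence_hit are
hypothetical, and the sketch covers per-sequence thresholds only.

```python
from typing import Optional, Tuple


def passes(evalue: float, score: float,
           e_cutoff: float, t_cutoff: Optional[float] = None) -> bool:
    """Apply a bit score cutoff if one is set, otherwise the E-value cutoff."""
    if t_cutoff is not None:
        return score >= t_cutoff
    return evalue <= e_cutoff


def classify_sequence_hit(
    evalue: float,
    score: float,
    E: float = 10.0,              # -E default
    T: Optional[float] = None,    # -T, unset by default
    incE: float = 0.001,          # --incE default
    incT: Optional[float] = None, # --incT, unset by default
) -> Tuple[bool, bool]:
    """Return (reported, included) for one per-sequence hit."""
    reported = passes(evalue, score, E, T)
    included = passes(evalue, score, incE, incT)
    return reported, included
```

With the defaults, a hit with E-value 0.5 is reported but not included:
it appears in the output as a borderline hit, but does not contribute
to the next iteration's profile.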

HMMER3 searches are accelerated in a three-step filter pipeline: the
MSV filter, the Viterbi filter, and the Forward filter. The first
filter is the fastest and most approximate; the last is the full
Forward scoring algorithm, slowest but most accurate. There is also a
bias filter step between MSV and Viterbi. Targets that pass all the
steps in the acceleration pipeline are then subjected to
postprocessing: domain identification and scoring using the
Forward/Backward algorithm.

Essentially the only free parameters that control HMMER's heuristic
filters are the P-value thresholds controlling the expected fraction of
nonhomologous sequences that pass the filters. Setting the thresholds
higher than the defaults will pass a higher proportion of nonhomologous
sequences, increasing sensitivity at the expense of speed; conversely,
setting lower P-value thresholds will pass a smaller proportion,
decreasing sensitivity and increasing speed. Setting a filter's P-value
threshold to 1.0 means it will pass all sequences, which effectively
disables the filter.

Changing filter thresholds only removes or includes targets from
consideration; changing filter thresholds does not alter bit scores,
E-values, or alignments, all of which are determined solely in
postprocessing.

--max Maximum sensitivity. Turn off all filters, including the bias
filter, and run full Forward/Backward postprocessing on every target.
This increases sensitivity slightly, at a large cost in speed.

--F1 <x>
First filter threshold; set the P-value threshold for the MSV filter
step. The default is 0.02, meaning that roughly 2% of the highest
scoring nonhomologous targets are expected to pass the filter.

--F2 <x>
Second filter threshold; set the P-value threshold for the Viterbi
filter step. The default is 0.001.

--F3 <x>
Third filter threshold; set the P-value threshold for the Forward
filter step. The default is 1e-5.

--nobias
Turn off the bias filter. This increases sensitivity somewhat, but can
come at a high cost in speed, especially if the query has biased
residue composition (such as a repetitive sequence region, or if it is
a membrane protein with large regions of hydrophobicity). Without the
bias filter, too many sequences may pass the filter with biased
queries, leading to slower than expected performance as the
computationally intensive Forward/Backward algorithms shoulder an
abnormally heavy load.
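
As a rough guide to what the filter thresholds mean in practice, the
Python sketch below multiplies a database size by each P-value
threshold to estimate how many nonhomologous targets are expected to
survive each filter. It treats each threshold as the expected passing
fraction, as described above, and ignores the bias filter and true
homologs. The names DEFAULT_THRESHOLDS and expected_survivors are
hypothetical.

```python
from typing import Dict, Optional

# Default P-value thresholds for --F1, --F2 and --F3.
DEFAULT_THRESHOLDS = {"MSV": 0.02, "Viterbi": 0.001, "Forward": 1e-5}


def expected_survivors(
    n_targets: int,
    thresholds: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Expected number of nonhomologous targets passing each filter."""
    if thresholds is None:
        thresholds = DEFAULT_THRESHOLDS
    for name, p in thresholds.items():
        if not (0.0 <= p <= 1.0):
            raise ValueError(f"{name} threshold must be a probability")
    return {name: n_targets * p for name, p in thresholds.items()}
```

Under these assumptions, a database of one million nonhomologous
sequences would be expected to leave about 20,000 after MSV, about
1,000 after Viterbi, and about 10 after Forward. Setting a threshold
to 1.0 leaves the count unchanged at that step, which corresponds to
disabling the filter.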

These options control how consensus columns are defined in multiple
alignments when building profiles. By default, jackhmmer always
includes your original query sequence in the alignment result at every
iteration, and consensus positions are defined by that query sequence:
that is, a default jackhmmer profile is always the same length as your
original query, at every iteration.

--fast Define consensus columns as those that have a fraction >=
symfrac of residues as opposed to gaps. (See below for the --symfrac
option.) Although this is the default profile construction option
elsewhere (in hmmbuild, in particular), it may have undesirable
effects in jackhmmer, because a profile could iteratively walk in
sequence space away from your original query, leaving few or no
consensus columns corresponding to its residues.

--hand Define consensus columns in next profile using reference
annotation to the multiple alignment. jackhmmer propagates reference
annotation from the previous profile to the multiple alignment, and
thence to the next profile. This is the default.

--symfrac <x>
Define the residue fraction threshold necessary to define a consensus
column when using the --fast option. The default is 0.5. The symbol
fraction in each column is calculated after taking relative sequence
weighting into account, and ignoring gap characters corresponding to
ends of sequence fragments (as opposed to internal
insertions/deletions). Setting this to 0.0 means that every alignment
column will be assigned as consensus, which may be useful in some
cases. Setting it to 1.0 means that only columns that include 0 gaps
(internal insertions/deletions) will be assigned as consensus.

--fragthresh <x>
We only want to count terminal gaps as deletions if the aligned
sequence is known to be full-length, not if it is a fragment (for
instance, because only part of it was sequenced). HMMER uses a simple
rule to infer fragments: if the sequence length L is less than or
equal to a fraction <x> times the alignment length in columns, then
the sequence is handled as a fragment. The default is 0.5. Setting
--fragthresh 0 will define no (nonempty) sequence as a fragment; you
might want to do this if you know you've got a carefully curated
alignment of full-length sequences. Setting --fragthresh 1 will define
all sequences as fragments; you might want to do this if you know your
alignment is entirely composed of fragments, such as translated short
reads in metagenomic shotgun data.
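
The fragment rule and the --symfrac rule can be sketched in a few
lines of Python. This is a simplified model, not HMMER's
implementation: it takes relative weights as an optional argument
(uniform if omitted) and, unlike the description above, it does not
exclude the terminal gaps of fragments from the column fractions. The
function names and the choice of '-' and '.' as gap characters are
assumptions made for the example.

```python
from typing import List, Optional, Sequence

GAP_CHARS = "-."  # assumed gap symbols for this sketch


def is_fragment(seq_len: int, alignment_len: int, fragthresh: float = 0.5) -> bool:
    """A sequence is a fragment if L <= fragthresh * alignment length."""
    return seq_len <= fragthresh * alignment_len


def consensus_columns(
    alignment: Sequence[str],
    symfrac: float = 0.5,
    weights: Optional[Sequence[float]] = None,
) -> List[int]:
    """Return 0-based indices of columns whose weighted residue fraction
    is >= symfrac."""
    if not alignment:
        return []
    ncol = len(alignment[0])
    if any(len(row) != ncol for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if weights is None:
        weights = [1.0] * len(alignment)
    total = sum(weights)
    columns = []
    for j in range(ncol):
        residue_weight = sum(w for row, w in zip(alignment, weights)
                             if row[j] not in GAP_CHARS)
        if residue_weight / total >= symfrac:
            columns.append(j)
    return columns
```

With --symfrac 0.0 every column qualifies, and with 1.0 only gap-free
columns do, which matches the two limiting cases described for the
option.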

Whenever a profile is built from a multiple alignment, HMMER uses an ad
hoc sequence weighting algorithm to downweight closely related
sequences and upweight distantly related ones. This has the effect of
making models less biased by uneven phylogenetic representation. For
example, two identical sequences would typically each receive half the
weight that one sequence would (and this is why jackhmmer isn't
concerned about always including your original query sequence in each
iteration's alignment, even if it finds it again in the database you're
searching). These options control which algorithm gets used.

--wpb Use the Henikoff position-based sequence weighting scheme
[Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is the
default.

--wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
[Gerstein et al., J. Mol. Biol. 235:1067, 1994].

--wblosum
Use the same clustering scheme that was used to weight data in
calculating BLOSUM substitution matrices [Henikoff and Henikoff, Proc.
Natl. Acad. Sci 89:10915, 1992]. Sequences are single-linkage
clustered at an identity threshold (default 0.62; see --wid) and within
each cluster of c sequences, each sequence gets relative weight 1/c.

--wnone
No relative weights. All sequences are assigned uniform weight.

--wid <x>
Sets the identity threshold used by single-linkage clustering when
using --wblosum. Invalid with any other weighting scheme. Default is
0.62.
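
The --wblosum scheme can be illustrated with a small Python sketch:
sequences are joined into single-linkage clusters whenever a pair
reaches the identity threshold, and each member of a cluster of c
sequences receives weight 1/c. The definition of pairwise identity
used here (identical residues divided by columns where both sequences
have a residue) and the use of >= at the threshold are assumptions for
the example; HMMER's own definition may differ. The function names are
hypothetical.

```python
from typing import Dict, List, Sequence

GAP_CHARS = "-."  # assumed gap symbols for this sketch


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where both are ungapped."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(alignment: Sequence[str], wid: float = 0.62) -> List[float]:
    """Single-linkage cluster at identity >= wid; weight 1/c per member."""
    n = len(alignment)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)

    sizes: Dict[int, int] = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(n)]
```

Two identical sequences and one unrelated sequence give weights of
0.5, 0.5 and 1.0, which is the "half the weight" behavior described
above for duplicated sequences.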

After relative weights are determined, they are normalized to sum to a
total effective sequence number, eff_nseq. This number may be the
actual number of sequences in the alignment, but it is almost always
smaller than that. The default entropy weighting method (--eent)
reduces the effective sequence number to reduce the information content
(relative entropy, or average expected score on true homologs) per
consensus position. The target relative entropy is controlled by a
two-parameter function, where the two parameters are settable with
--ere and --esigma.

--eent Adjust effective sequence number to achieve a specific relative
entropy per position (see --ere). This is the default.

--eclust
Set effective sequence number to the number of single-linkage clusters
at a specific identity threshold (see --eid). This option is not
recommended; it's for experiments evaluating how much better --eent is.

--enone
Turn off effective sequence number determination and just use the
actual number of sequences. One reason you might want to do this is to
try to maximize the relative entropy/position of your model, which may
be useful for short models.

--eset <x>
Explicitly set the effective sequence number for all models to <x>.

--ere <x>
Set the minimum relative entropy/position target to <x>. Requires
--eent. Default depends on the sequence alphabet; for protein
sequences, it is 0.59 bits/position.

--esigma <x>
Sets the minimum relative entropy contributed by an entire model
alignment, over its whole length. This has the effect of making short
models have higher relative entropy per position than --ere alone
would give. The default is 45.0 bits.

--eid <x>
Sets the fractional pairwise identity cutoff used by single linkage
clustering with the --eclust option. The default is 0.62.
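
The normalization step described at the start of this section, in
which relative weights are rescaled to sum to eff_nseq, can be written
as a short Python sketch. It shows only the rescaling; how eff_nseq is
chosen (by --eent, --eclust, --enone or --eset) is outside its scope.
The function name scale_to_effective_number is hypothetical.

```python
from typing import List, Sequence


def scale_to_effective_number(rel_weights: Sequence[float],
                              eff_nseq: float) -> List[float]:
    """Rescale relative weights so that they sum to eff_nseq,
    preserving their ratios."""
    total = sum(rel_weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    if eff_nseq < 0:
        raise ValueError("eff_nseq must be non-negative")
    return [w * eff_nseq / total for w in rel_weights]
```

With --enone, eff_nseq would be the actual number of sequences; with
--eset <x> it would be <x>. In both cases the relative proportions
between sequences stay as the weighting scheme assigned them.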

In profile construction, by default, weighted counts are converted to
mean posterior probability parameter estimates using mixture Dirichlet
priors. Default mixture Dirichlet prior parameters for protein models
and for nucleic acid (RNA and DNA) models are built in. The following
options allow you to override the default priors.

--pnone Don't use any priors. Probability parameters will simply be the
observed frequencies, after relative sequence weighting.

--plaplace Use a Laplace +1 prior in place of the default mixture
Dirichlet prior.
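
The two alternatives to the default prior are simple enough to show
directly. The Python sketch below converts a vector of weighted counts
into probabilities using either observed frequencies (as with --pnone)
or a Laplace +1 prior (as with --plaplace), in which one pseudocount is
added to every symbol. The default mixture Dirichlet prior is not
reproduced here, and the function name estimate_probs is hypothetical.

```python
from typing import List, Sequence


def estimate_probs(counts: Sequence[float], prior: str = "none") -> List[float]:
    """Estimate probabilities from weighted counts.

    prior="none":    observed frequencies, c_i / sum(c)
    prior="laplace": (c_i + 1) / (sum(c) + K), where K = len(counts)
    """
    total = sum(counts)
    k = len(counts)
    if prior == "none":
        if total <= 0:
            raise ValueError("cannot normalize an all-zero count vector")
        return [c / total for c in counts]
    if prior == "laplace":
        return [(c + 1.0) / (total + k) for c in counts]
    raise ValueError(f"unknown prior: {prior}")
```

With counts of 3, 1, 0 and 0, observed frequencies give 0.75, 0.25, 0
and 0, while the Laplace prior gives 0.5, 0.25, 0.125 and 0.125, so
unobserved symbols keep a nonzero probability.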

Estimating the location parameters for the expected score distributions
for MSV filter scores, Viterbi filter scores, and Forward scores
requires three short random sequence simulations.

--EmL <n>
Sets the sequence length in simulation that estimates the location
parameter mu for MSV filter E-values. Default is 200.

--EmN <n>
Sets the number of sequences in simulation that estimates the location
parameter mu for MSV filter E-values. Default is 200.

--EvL <n>
Sets the sequence length in simulation that estimates the location
parameter mu for Viterbi filter E-values. Default is 200.

--EvN <n>
Sets the number of sequences in simulation that estimates the location
parameter mu for Viterbi filter E-values. Default is 200.

--EfL <n>
Sets the sequence length in simulation that estimates the location
parameter tau for Forward E-values. Default is 100.

--EfN <n>
Sets the number of sequences in simulation that estimates the location
parameter tau for Forward E-values. Default is 200.

--Eft <x>
Sets the tail mass fraction to fit in the simulation that estimates
the location parameter tau for Forward E-values. Default is 0.04.

--nonull2
Turn off the null2 score corrections for biased composition.

-Z <x> Assert that the total number of targets in your searches is <x>,
for the purposes of per-sequence E-value calculations, rather than the
actual number of targets seen.

--domZ <x>
Assert that the total number of targets in your searches is <x>, for
the purposes of per-domain conditional E-value calculations, rather
than the number of targets that passed the reporting thresholds.

--seed <n>
Seed the random number generator with <n>, an integer >= 0. If <n> is
> 0, any stochastic simulations will be reproducible; the same command
will give the same results. If <n> is 0, the random number generator
is seeded arbitrarily, and stochastic simulations will vary from run to
run of the same command. The default seed is 42.

--qformat <s>
Declare that the input <seqfile> is in format <s>. Accepted sequence
file formats include FASTA, EMBL, GenBank, DDBJ, UniProt, Stockholm,
and SELEX. Default is to autodetect the format of the file.

--tformat <s>
Declare that the input <seqdb> is in format <s>. Accepted sequence
file formats include FASTA, EMBL, GenBank, DDBJ, UniProt, Stockholm,
and SELEX. Default is to autodetect the format of the file.

--cpu <n>
Set the number of parallel worker threads to <n>. By default, HMMER
sets this to the number of CPU cores it detects in your machine; that
is, it tries to maximize the use of your available processor cores.
Setting <n> higher than the number of available cores is of little if
any value, but you may want to set it to something less. You can also
control this number by setting an environment variable, HMMER_NCPU.

This option is only available if HMMER was compiled with POSIX threads
support. This is the default, but it may have been turned off at
compile-time for your site or machine for some reason.

--stall
For debugging the MPI master/worker version: pause after start, to
enable the developer to attach debuggers to the running master and
worker(s) processes. Send SIGCONT signal to release the pause. (Under
gdb: (gdb) signal SIGCONT) (Only available if optional MPI support was
enabled at compile-time.)

--mpi Run in MPI master/worker mode, using mpirun. (Only available if
optional MPI support was enabled at compile-time.)

See hmmer(1) for a master man page with a list of all the individual
man pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf); or see the HMMER web page.

Copyright (C) 2015 Howard Hughes Medical Institute.
Freely distributed under the GNU General Public License (GPLv3).

For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER
web page.

Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org

HMMER 3.1b2 February 2015 jackhmmer(1)