jackhmmer(1)                      HMMER Manual                      jackhmmer(1)

NAME
       jackhmmer - iteratively search sequence(s) against a protein database

SYNOPSIS
       jackhmmer [options] <seqfile> <seqdb>

DESCRIPTION
       jackhmmer iteratively searches each query sequence in <seqfile> against
       the target sequence(s) in <seqdb>. The first iteration is identical to
       a phmmer search. For the next iteration, a multiple alignment of the
       query together with all target sequences satisfying inclusion
       thresholds is assembled, a profile is constructed from this alignment
       (identical to using hmmbuild on the alignment), and a profile search of
       <seqdb> is done (identical to an hmmsearch with the profile).

       The output format is designed to be human-readable, but it is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option redirects
       the main output to a file; sending it to /dev/null throws it away.
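As an illustration of why the tabular output is easier to parse, here is a minimal Python sketch that reads --tblout lines. It assumes the layout documented in the HMMER user guide (not restated on this page): lines starting with '#' are comments, and each data line has 18 whitespace-delimited fields followed by a free-text description, with the target name, query name, full-sequence E-value, and full-sequence score in fields 1, 3, 5, and 6. The names SeqHit and parse_tblout are hypothetical helpers, not part of HMMER.

```python
from typing import Iterable, Iterator, NamedTuple


class SeqHit(NamedTuple):
    """One per-sequence hit from a --tblout file (hypothetical helper type)."""
    target: str
    query: str
    evalue: float
    score: float
    description: str


def parse_tblout(lines: Iterable[str]) -> Iterator[SeqHit]:
    """Yield one SeqHit per data line, skipping blank and '#' comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # ASSUMPTION: 18 fixed fields, then the description as free text.
        fields = line.split(None, 18)
        yield SeqHit(
            target=fields[0],
            query=fields[2],
            evalue=float(fields[4]),
            score=float(fields[5]),
            description=fields[18] if len(fields) > 18 else "",
        )


# Made-up example content, only to exercise the parser.
example = [
    "# target name  accession  query name  accession  E-value  score ...",
    "sp|P1|A_HUMAN - query1 - 1.2e-30 105.3 0.1 1.5e-30 105.0 0.1"
    " 1.0 1 0 0 1 1 1 1 Example protein A",
]
hits = list(parse_tblout(example))
```

In practice the lines would come from an open file, for example `parse_tblout(open("hits.tbl"))`, where hits.tbl is whatever name was passed to --tblout.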

OPTIONS
       -h     Help; print a brief reminder of command line usage and all
              available options.

       -N <n> Set the maximum number of iterations to <n>. The default is 5.
              If <n> is 1, the result is equivalent to a phmmer search.
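A short Python sketch of how a script might assemble a jackhmmer command line. It uses only options described on this page (-N, --tblout, -o); the function name jackhmmer_command and the file names are hypothetical, and the command is only printed here, not executed.

```python
import shlex


def jackhmmer_command(seqfile, seqdb, iterations=5, tblout=None, main_output=None):
    """Return an argument list for jackhmmer (hypothetical helper)."""
    if iterations < 1:
        raise ValueError("the maximum number of iterations must be at least 1")
    args = ["jackhmmer", "-N", str(iterations)]
    if tblout is not None:
        args += ["--tblout", tblout]      # tabular per-sequence summary
    if main_output is not None:
        args += ["-o", main_output]       # redirect the human-readable output
    return args + [seqfile, seqdb]


cmd = jackhmmer_command(
    "query.fa", "targets.fa", iterations=3,
    tblout="hits.tbl", main_output="/dev/null",
)
print(shlex.join(cmd))
# prints: jackhmmer -N 3 --tblout hits.tbl -o /dev/null query.fa targets.fa
```

To actually run the search, the list could be handed to `subprocess.run(cmd, check=True)` on a machine where jackhmmer is installed.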

OPTIONS CONTROLLING OUTPUT
       By default, output for each iteration appears on stdout in a somewhat
       human-readable, somewhat parseable format. These options allow
       redirecting that output or saving additional kinds of output to files,
       including checkpoint files for each iteration.

       -o <f> Direct the human-readable output to a file <f>.

       -A <f> After the final iteration, save an annotated multiple alignment
              of all hits satisfying inclusion thresholds (also including the
              original query) to <f> in Stockholm format.

       --tblout <f>
              After the final iteration, save a tabular summary of top
              sequence hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --domtblout <f>
              After the final iteration, save a tabular summary of top domain
              hits to <f> in a readily parseable, columnar,
              whitespace-delimited format.

       --chkhmm <prefix>
              At the start of each iteration, checkpoint the query HMM, saving
              it to a file named <prefix>-<n>.hmm, where <n> is the iteration
              number (from 1..N).

       --chkali <prefix>
              At the end of each iteration, checkpoint an alignment of all
              domains satisfying inclusion thresholds (i.e. the alignment from
              which the query HMM for the next iteration is built), saving it
              to a file named <prefix>-<n>.sto in Stockholm format, where <n>
              is the iteration number (from 1..N).

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Remove the limit on the length of each line in the main output.
              The default is a limit of 120 characters per line, which helps
              in displaying the output cleanly on terminals and in editors,
              but can truncate target description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
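The checkpoint naming scheme described for --chkhmm and --chkali can be sketched in Python as follows. The function name checkpoint_files is hypothetical; the sketch assumes both options were given the same prefix and that all N iterations ran. Because -N is only a maximum, fewer files may exist after a real run.

```python
def checkpoint_files(prefix, n_iterations):
    """List the (HMM, alignment) checkpoint file names for iterations 1..N."""
    return [
        (f"{prefix}-{n}.hmm", f"{prefix}-{n}.sto")
        for n in range(1, n_iterations + 1)
    ]


for hmm_name, ali_name in checkpoint_files("run", 3):
    print(hmm_name, ali_name)
# prints:
# run-1.hmm run-1.sto
# run-2.hmm run-2.sto
# run-3.hmm run-3.sto
```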

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING (FIRST ITERATION)
       By default, the first iteration uses a search model constructed from a
       single query sequence. This model is constructed using a standard 20x20
       substitution matrix for residue probabilities, and two additional
       parameters for position-independent gap open and gap extend
       probabilities. These options allow the default single-sequence scoring
       parameters to be changed.

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model
              to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.

       --mxfile <mxfile>
              Obtain residue alignment probabilities from the substitution
              matrix in file <mxfile>. The default score matrix is BLOSUM62
              (this matrix is internal to HMMER and does not have to be
              available as a file). The format of a substitution matrix
              <mxfile> is the standard format accepted by BLAST, FASTA, and
              other sequence analysis software.
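A wrapper script may want to check the gap parameters before launching a search. The following Python sketch encodes only the ranges and defaults stated above for --popen and --pextend; the function name check_gap_probabilities is hypothetical, and jackhmmer performs its own validation regardless.

```python
def check_gap_probabilities(popen=0.02, pextend=0.4):
    """Validate --popen and --pextend values against the documented ranges."""
    if not 0.0 <= popen < 0.5:
        raise ValueError(f"--popen must be >= 0 and < 0.5, got {popen}")
    if not 0.0 <= pextend < 1.0:
        raise ValueError(f"--pextend must be >= 0 and < 1.0, got {pextend}")
    return popen, pextend


print(check_gap_probabilities())            # the documented defaults pass
print(check_gap_probabilities(0.01, 0.6))   # a custom pair inside both ranges
```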

OPTIONS CONTROLLING REPORTING THRESHOLDS
       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout). In each iteration,
       sequence hits and domain hits are ranked by statistical significance
       (E-value) and output is generated in two sections called per-target and
       per-domain output. In per-target output, by default, all sequence hits
       with an E-value <= 10 are reported. In the per-domain output, for each
       target that has passed per-target reporting thresholds, all domains
       satisfying per-domain reporting thresholds are reported. By default,
       these are domains with conditional E-values <= 10. The following
       options allow you to change the default E-value reporting thresholds,
       or to use bit score thresholds instead.

       -E <x> Report sequences with E-values <= <x> in per-sequence output.
              The default is 10.0.

       -T <x> Use a bit score threshold for per-sequence output instead of an
              E-value threshold (any setting of -E is ignored). Report
              sequences with a bit score >= <x>. By default this option is
              unset.

       -Z <x> Declare the total size of the database to be <x> sequences, for
              purposes of E-value calculation. Normally E-values are
              calculated relative to the size of the database you actually
              searched (i.e. the number of sequences in <seqdb>). In some
              cases (for instance, if you've split your target sequence
              database into multiple files for parallelization of your
              search), you may know better what the actual size of your search
              space is.

       --domE <x>
              Report domains with conditional E-values <= <x> in per-domain
              output, in addition to the top-scoring domain per significant
              sequence hit. The default is 10.0.

       --domT <x>
              Use a bit score threshold for per-domain output instead of an
              E-value threshold (any setting of --domE is ignored). Report
              domains with a bit score >= <x> in per-domain output, in
              addition to the top-scoring domain per significant sequence hit.
              By default this option is unset.

       --domZ <x>
              Declare the number of significant sequences to be <x> sequences,
              for purposes of conditional E-value calculation for additional
              domain significance. Normally conditional E-values are
              calculated relative to the number of sequences passing the
              per-sequence reporting threshold.
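The effect of -Z on a split database can be sketched with a little arithmetic. This Python example assumes that an E-value scales linearly with the number of sequences it is calculated relative to (E-value = P-value x database size), which is the usual reading of "calculated relative to the size of the database". The function name rescale_evalue and the numbers are hypothetical.

```python
def rescale_evalue(evalue, searched_size, declared_size):
    """Re-express an E-value relative to a different database size.

    ASSUMPTION: E-value = P-value * number of sequences, so the E-value
    scales linearly with the database size.
    """
    if searched_size <= 0 or declared_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * declared_size / searched_size


# A hit found in one of four 250,000-sequence chunks of a database,
# re-expressed relative to the full 1,000,000 sequences:
print(rescale_evalue(0.002, searched_size=250_000, declared_size=1_000_000))
# roughly 0.008, i.e. four times less significant than the chunk search reported
```

Passing `-Z 1000000` to each chunk search would make jackhmmer report E-values on the full-database scale directly, so thresholds such as -E behave the same as in an unsplit search.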

OPTIONS CONTROLLING INCLUSION THRESHOLDS
       Inclusion thresholds control which hits are included in the multiple
       alignment and profile constructed for the next search iteration. By
       default, a sequence must have a per-sequence E-value <= 0.001 (see the
       --incE option) to be included, and any additional domains in it besides
       the top-scoring one must have a conditional E-value <= 0.001 (see the
       --incdomE option). The difference between reporting thresholds and
       inclusion thresholds is that inclusion thresholds control which hits
       actually get used in the next iteration (or the final output multiple
       alignment if the -A option is used), whereas reporting thresholds
       control what you see in output. Reporting thresholds are generally
       looser so you can see borderline hits in the top of the noise that
       might be of interest.

       --incE <x>
              Include sequences with E-values <= <x> in the subsequent
              iteration or in the final alignment output by -A. The default is
              0.001.

       --incT <x>
              Use a bit score threshold for per-sequence inclusion instead of
              an E-value threshold (any setting of --incE is ignored). Include
              sequences with a bit score >= <x>. By default this option is
              unset.

       --incdomE <x>
              Include domains with conditional E-values <= <x> in the
              subsequent iteration or in the final alignment output by -A, in
              addition to the top-scoring domain per significant sequence hit.
              The default is 0.001.

       --incdomT <x>
              Use a bit score threshold for per-domain inclusion instead of an
              E-value threshold (any setting of --incdomE is ignored). Include
              domains with a bit score >= <x>. By default this option is
              unset.
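The relationship between reporting and inclusion thresholds can be sketched for per-sequence E-values as follows. This is a simplified Python illustration using the default values of -E and --incE; it ignores bit score thresholds and per-domain rules, and the function name classify_hit is hypothetical.

```python
def classify_hit(evalue, report_e=10.0, include_e=0.001):
    """Classify a per-sequence hit by its E-value.

    ASSUMPTION: include_e <= report_e, so every included hit is also reported.
    """
    if evalue <= include_e:
        return "included"   # reported, and used to build the next profile
    if evalue <= report_e:
        return "reported"   # shown in output, but not used in the next round
    return "not reported"


for e in (1e-20, 0.5, 50.0):
    print(e, classify_hit(e))
# prints:
# 1e-20 included
# 0.5 reported
# 50.0 not reported
```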

OPTIONS CONTROLLING ACCELERATION HEURISTICS
       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm, slowest but most accurate. There is also a
       bias filter step between MSV and Viterbi. Targets that pass all the
       steps in the acceleration pipeline are then subjected to
       postprocessing: domain identification and scoring using the
       Forward/Backward algorithm.

       Essentially the only free parameters that control HMMER's heuristic
       filters are the P-value thresholds controlling the expected fraction of
       nonhomologous sequences that pass the filters. Setting the thresholds
       higher than the defaults will pass a higher proportion of nonhomologous
       sequences, increasing sensitivity at the expense of speed; conversely,
       setting lower P-value thresholds will pass a smaller proportion,
       decreasing sensitivity and increasing speed. Setting a filter's P-value
       threshold to 1.0 means it will pass all sequences, which effectively
       disables the filter.

       Changing filter thresholds only removes or includes targets from
       consideration; it does not alter bit scores, E-values, or alignments,
       all of which are determined solely in postprocessing.

       --max  Maximum sensitivity. Turn off all filters, including the bias
              filter, and run full Forward/Backward postprocessing on every
              target. This increases sensitivity slightly, at a large cost in
              speed.

       --F1 <x>
              First filter threshold; set the P-value threshold for the MSV
              filter step. The default is 0.02, meaning that roughly 2% of
              the highest scoring nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Second filter threshold; set the P-value threshold for the
              Viterbi filter step. The default is 0.001.

       --F3 <x>
              Third filter threshold; set the P-value threshold for the
              Forward filter step. The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
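To get a feel for what the filter thresholds mean in practice, this Python sketch estimates how many nonhomologous targets are expected to get through each filter. It assumes each P-value threshold is exactly the fraction of nonhomologous targets passing that step, and it ignores the bias filter and any true homologs; the function name expected_survivors is hypothetical.

```python
def expected_survivors(n_targets, f1=0.02, f2=0.001, f3=1e-5):
    """Expected number of nonhomologous targets passing each filter step.

    ASSUMPTION: a P-value threshold p lets a fraction p of nonhomologous
    targets through, so about n_targets * p of them pass that step.
    """
    return {
        "MSV": n_targets * f1,
        "Viterbi": n_targets * f2,
        "Forward": n_targets * f3,
    }


for step, count in expected_survivors(1_000_000).items():
    print(f"{step}: about {count:.0f} targets")
# With the defaults and one million targets: about 20000 after MSV,
# about 1000 after Viterbi, and about 10 reaching postprocessing.
```

Raising --F1 to 1.0 in this model passes every target through the MSV step, which matches the statement above that a threshold of 1.0 effectively disables a filter.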

OPTIONS CONTROLLING PROFILE CONSTRUCTION (LATER ITERATIONS)
       These options control how consensus columns are defined in multiple
       alignments when building profiles. By default, jackhmmer always
       includes your original query sequence in the alignment result at every
       iteration, and consensus positions are defined by that query sequence:
       that is, a default jackhmmer profile is always the same length as your
       original query, at every iteration.

       --fast Define consensus columns as those that have a fraction >=
              symfrac of residues as opposed to gaps. (See below for the
              --symfrac option.) Although this is the default profile
              construction option elsewhere (in hmmbuild, in particular), it
              may have undesirable effects in jackhmmer, because a profile
              could iteratively walk in sequence space away from your original
              query, leaving few or no consensus columns corresponding to its
              residues.

       --hand Define consensus columns in the next profile using reference
              annotation to the multiple alignment. jackhmmer propagates
              reference annotation from the previous profile to the multiple
              alignment, and thence to the next profile. This is the default.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a
              consensus column when using the --fast option. The default is
              0.5. The symbol fraction in each column is calculated after
              taking relative sequence weighting into account, and ignoring
              gap characters corresponding to ends of sequence fragments (as
              opposed to internal insertions/deletions). Setting this to 0.0
              means that every alignment column will be assigned as consensus,
              which may be useful in some cases. Setting it to 1.0 means that
              only columns with no gaps (internal insertions/deletions) will
              be assigned as consensus.

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). HMMER
              uses a simple rule to infer fragments: if the sequence length L
              is less than a fraction <x> times the mean sequence length of
              all the sequences in the alignment, then the sequence is handled
              as a fragment. The default is 0.5.
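The --fragthresh rule and the --fast/--symfrac column rule can be sketched in Python as below. This is a deliberately simplified illustration: it counts every sequence equally (no relative weighting) and does not discount terminal gaps of fragments, both of which HMMER does. The names GAP_CHARS, is_fragment, and consensus_columns are hypothetical.

```python
GAP_CHARS = "-."  # characters treated as gaps in this sketch


def is_fragment(seq_len, mean_len, fragthresh=0.5):
    """Fragment rule: length L less than fragthresh times the mean length."""
    return seq_len < fragthresh * mean_len


def consensus_columns(alignment, symfrac=0.5):
    """Return indices of columns whose residue fraction is >= symfrac.

    SIMPLIFICATION: unweighted counts, and terminal gaps are not ignored.
    """
    n_seqs = len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        residues = sum(1 for row in alignment if row[j] not in GAP_CHARS)
        if residues / n_seqs >= symfrac:
            columns.append(j)
    return columns


alignment = ["ACDE-", "AC-EF", "A--E-"]
print(consensus_columns(alignment))               # [0, 1, 3]
print(consensus_columns(alignment, symfrac=1.0))  # [0, 3]
print(is_fragment(40, mean_len=100))              # True
```

With symfrac set to 0.0 every column qualifies, because any residue fraction is >= 0.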

OPTIONS CONTROLLING RELATIVE WEIGHTS
       Whenever a profile is built from a multiple alignment, HMMER uses an ad
       hoc sequence weighting algorithm to downweight closely related
       sequences and upweight distantly related ones. This has the effect of
       making models less biased by uneven phylogenetic representation. For
       example, two identical sequences would typically each receive half the
       weight that one sequence would (and this is why jackhmmer isn't
       concerned about always including your original query sequence in each
       iteration's alignment, even if it finds it again in the database you're
       searching). These options control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al., J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci. 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid) and within each cluster of c sequences, each sequence
              gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Set the identity threshold used by single-linkage clustering
              when using --wblosum. Invalid with any other weighting scheme.
              The default is 0.62.
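The --wblosum scheme can be illustrated with a small Python sketch: single-linkage clustering at an identity threshold, then weight 1/c for each member of a cluster of c sequences. The identity measure used here (identical residues over columns where both sequences have a residue) and the use of >= at the threshold are assumptions made for the sketch, and the function names are hypothetical; HMMER's own implementation may differ in these details.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns aligned in both sequences.

    ASSUMPTION: gap characters are '-' and '.', and gapped columns are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(alignment, wid=0.62):
    """Single-linkage cluster at identity >= wid; weight each sequence 1/c."""
    n = len(alignment)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)  # link the two clusters

    roots = [find(i) for i in range(n)]
    return [1.0 / roots.count(r) for r in roots]


alignment = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
print(blosum_weights(alignment))  # [0.5, 0.5, 1.0]
```

The first two sequences are 90% identical and share one cluster, so each gets weight 1/2, while the unrelated third sequence keeps weight 1. The number of distinct clusters (two here) is the kind of count that --eclust uses as the effective sequence number.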

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER
       After relative weights are determined, they are normalized to sum to a
       total effective sequence number, eff_nseq. This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that. The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.

       --eent Adjust effective sequence number to achieve a specific relative
              entropy per position (see --ere). This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid). This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to
              <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.
              Requires --eent. The default depends on the sequence alphabet;
              for protein sequences, it is 0.59 bits/position.

       --esigma <x>
              Set the minimum relative entropy contributed by an entire model
              alignment, over its whole length. This has the effect of making
              short models have higher relative entropy per position than
              --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Set the fractional pairwise identity cutoff used by
              single-linkage clustering with the --eclust option. The default
              is 0.62.

OPTIONS CONTROLLING E-VALUE CALIBRATION
       Estimating the location parameters for the expected score distributions
       for MSV filter scores, Viterbi filter scores, and Forward scores
       requires three short random sequence simulations.

       --EmL <n>
              Set the sequence length in the simulation that estimates the
              location parameter mu for MSV filter E-values. The default is
              200.

       --EmN <n>
              Set the number of sequences in the simulation that estimates the
              location parameter mu for MSV filter E-values. The default is
              200.

       --EvL <n>
              Set the sequence length in the simulation that estimates the
              location parameter mu for Viterbi filter E-values. The default
              is 200.

       --EvN <n>
              Set the number of sequences in the simulation that estimates the
              location parameter mu for Viterbi filter E-values. The default
              is 200.

       --EfL <n>
              Set the sequence length in the simulation that estimates the
              location parameter tau for Forward E-values. The default is 100.

       --EfN <n>
              Set the number of sequences in the simulation that estimates the
              location parameter tau for Forward E-values. The default is 200.

       --Eft <x>
              Set the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values. The
              default is 0.04.

OTHER OPTIONS
       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the number of significant targets in your searches
              is <x>, for the purposes of per-domain conditional E-value
              calculations, rather than the number of targets that passed the
              reporting thresholds.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is > 0, any stochastic simulations will be reproducible; the
              same command will give the same results. If <n> is 0, the random
              number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command. The
              default seed is 42.

       --qformat <s>
              Declare that the input <seqfile> is in format <s>. Accepted
              sequence file formats include FASTA, EMBL, GenBank, DDBJ,
              UniProt, Stockholm, and SELEX. The default is to autodetect the
              format of the file.

       --tformat <s>
              Declare that the input <seqdb> is in format <s>. Accepted
              sequence file formats include FASTA, EMBL, GenBank, DDBJ,
              UniProt, Stockholm, and SELEX. The default is to autodetect the
              format of the file.

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine; that is, it tries to maximize the use of your available
              processor cores. Setting <n> higher than the number of available
              cores is of little if any value, but you may want to set it to
              something less. You can also control this number by setting an
              environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off at compile-time for your site or machine for some
              reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send a SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
              optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun. (Only available if
              optional MPI support was enabled at compile-time.)

SEE ALSO
       See hmmer(1) for a master man page with a list of all the individual
       man pages for programs in the HMMER package.

       For complete documentation, see the user guide that came with your
       HMMER distribution (Userguide.pdf); or see the HMMER web page
       (@HMMER_URL@).

COPYRIGHT
       @HMMER_COPYRIGHT@
       @HMMER_LICENSE@

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (@HMMER_URL@).

AUTHOR
       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER @HMMER_VERSION@               @HMMER_DATE@               jackhmmer(1)