nhmmer(1)                        HMMER Manual                        nhmmer(1)
2
3
4
6 nhmmer - search DNA queries against a DNA sequence database
7
8
9
11 nhmmer [options] queryfile seqdb
12
13
14
nhmmer is used to search one or more nucleotide queries against a
nucleotide sequence database. For each query in queryfile, nhmmer
searches the target database of sequences in seqdb and outputs a
ranked list of the hits with the most significant matches to the query.
A query may be a profile model built using hmmbuild, a sequence
alignment, or a single sequence. Sequence-based queries can be in a
number of formats (see --qformat), and the format can typically be
autodetected. Note that only Stockholm format supports queries made up
of more than one sequence alignment.
25
26
27
28
29 Either the query queryfile or the target seqdb may be '-' (a dash char‐
30 acter), in which case the query file or target database input will be
31 read from a <stdin> pipe instead of from a file. Only one input source
32 can come through <stdin>, not both. If the queryfile contains more
33 than one query, then seqdb cannot come from stdin, because we can't
34 rewind the streaming target database to search it with another profile.
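
The stdin rules above can be checked before launching a search. The following is a minimal sketch in Python, not part of HMMER; the function name input_source_problems and the n_queries argument are hypothetical, and the caller is assumed to know how many queries queryfile contains.

```python
from typing import List


def input_source_problems(queryfile: str, seqdb: str, n_queries: int = 1) -> List[str]:
    """Return a list of problems with the use of '-' (stdin); empty if usage is fine.

    Hypothetical helper: it only encodes the two rules stated in the text above.
    """
    problems = []
    # Only one input source can come through stdin, not both.
    if queryfile == "-" and seqdb == "-":
        problems.append("queryfile and seqdb cannot both be read from stdin")
    # A streamed target database cannot be rewound for a second query.
    if seqdb == "-" and n_queries > 1:
        problems.append("seqdb cannot come from stdin when queryfile has more than one query")
    return problems
```

An empty list means the combination is allowed, for example a single-query file searched against a database piped in on stdin.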
35
36
37 If the query is sequence-based (unaligned or aligned), a new file con‐
38 taining the HMM(s) built from the input(s) in queryfile may optionally
39 be produced, with the filename set using the --hmmout flag.
40
41
42
43 The output format is designed to be human-readable, but is often so vo‐
44 luminous that reading it is impractical, and parsing it is a pain. The
45 --tblout option saves output in a simple tabular format that is concise
46 and easier to parse. The -o option allows redirecting the main output,
47 including throwing it away in /dev/null.
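
As an illustration of the options just described, a wrapper script might assemble the command line as follows. This is a sketch in Python that uses only the -o and --tblout flags documented here; the function name build_nhmmer_cmd and the file names are hypothetical.

```python
from typing import List, Optional


def build_nhmmer_cmd(queryfile: str, seqdb: str,
                     tblout: Optional[str] = None,
                     main_output: Optional[str] = None) -> List[str]:
    """Assemble an nhmmer argument list: nhmmer [options] queryfile seqdb."""
    cmd = ["nhmmer"]
    if main_output is not None:
        cmd += ["-o", main_output]      # redirect the main human-readable output
    if tblout is not None:
        cmd += ["--tblout", tblout]     # save the concise tabular summary
    cmd += [queryfile, seqdb]
    return cmd


# Keep only the tabular output and discard the voluminous main output.
cmd = build_nhmmer_cmd("query.hmm", "genome.fa",
                       tblout="hits.tbl", main_output="/dev/null")
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where nhmmer is installed.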
48
49
50
51
53 -h Help; print a brief reminder of command line usage and all
54 available options.
55
56
57
58
60 -o <f> Direct the main human-readable output to a file <f> instead of
61 the default stdout.
62
63
64 -A <f> Save a multiple alignment of all significant hits (those satis‐
65 fying "inclusion thresholds") to the file <f>.
66
67
68 --tblout <f>
69 Save a simple tabular (space-delimited) file summarizing the
70 per-target output, with one data line per homologous target se‐
71 quence found.
72
73
74 --dfamtblout <f>
75 Save a tabular (space-delimited) file summarizing the per-hit
76 output, similar to --tblout but more succinct.
77
78
79 --aliscoresout <f>
80 Save to file a list of per-position scores for each hit. This
81 is useful, for example, in identifying regions of high score
82 density for use in resolving overlapping hits from different
83 models.
84
85
86 --hmmout <f>
87 If queryfile is sequence-based, write the internally-computed
88 HMM(s) to file <f>.
89
90
91
92 --acc Use accessions instead of names in the main output, where avail‐
93 able for profiles and/or sequences.
94
95
96 --noali
97 Omit the alignment section from the main output. This can
98 greatly reduce the output volume.
99
100
101 --notextw
102 Unlimit the length of each line in the main output. The default
103 is a limit of 120 characters per line, which helps in displaying
104 the output cleanly on terminals and in editors, but can truncate
105 target profile description lines.
106
107
108 --textw <n>
109 Set the main output's line length limit to <n> characters per
110 line. The default is 120.
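
A file written with --tblout can be read with a few lines of code. The sketch below, in Python, assumes that comment lines begin with '#' and that the columns appear in the order listed in TBLOUT_COLUMNS; that column order is an assumption and is not stated on this page, so check it against the header of your own file or the user guide. The function name parse_tblout is hypothetical.

```python
from typing import Dict, Iterable, List

# ASSUMED column order for nhmmer --tblout output; verify against the
# header lines of a real file before relying on it.
TBLOUT_COLUMNS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "hmm_from", "hmm_to", "ali_from", "ali_to", "env_from", "env_to",
    "sq_len", "strand", "evalue", "score", "bias", "description",
]


def parse_tblout(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse space-delimited tabular lines into one dict per hit."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # Split on runs of whitespace, but keep the free-text description
        # (the last column) intact even if it contains spaces.
        fields = line.split(None, len(TBLOUT_COLUMNS) - 1)
        hits.append(dict(zip(TBLOUT_COLUMNS, fields)))
    return hits
```

All values are kept as strings; convert evalue and score with float() where numeric comparison is needed.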
111
112
113
114
By default, if a query is a single sequence from a file in fasta
format, nhmmer uses a search model constructed from that sequence and a
standard nucleotide substitution matrix for residue probabilities,
along with two additional parameters for position-independent gap open
and gap extend probabilities. The following options allow the default
single-sequence scoring parameters to be changed, and allow
single-sequence scoring to be applied to a single sequence coming from
an aligned format.
124
125
126 --singlemx
127 If a single sequence query comes from a multiple sequence align‐
128 ment file, such as in Stockholm format, the search model is by
129 default constructed as is typically done for multiple sequence
130 alignments. This option forces nhmmer to use the single-sequence
131 method with substitution score matrix.
132
133
--mxfile <mxfile>
Obtain residue alignment probabilities from the substitution matrix
in file <mxfile>. The default score matrix is DNA1 (this matrix is
internal to HMMER and does not have to be available as a file). The
format of a substitution matrix <mxfile> is the standard format
accepted by BLAST, FASTA, and other sequence analysis software. See
ftp.ncbi.nlm.nih.gov/blast/matrices/ for example files. (The only
exception: we require matrices to be square, so for DNA, use files
like NCBI's NUC.4.4, not NUC.4.2.)
143
144
145
146 --popen <x>
147 Set the gap open probability for a single sequence query model
148 to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.
149
150
151 --pextend <x>
152 Set the gap extend probability for a single sequence query model
153 to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.
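
The allowed ranges for the two gap probabilities can be checked before a search is launched. The following Python sketch encodes only the ranges and defaults given above; the function name single_seq_options is hypothetical and nothing here is HMMER's own validation code.

```python
from typing import List, Optional


def single_seq_options(popen: float = 0.02, pextend: float = 0.4,
                       singlemx: bool = False,
                       mxfile: Optional[str] = None) -> List[str]:
    """Build single-sequence scoring options, rejecting out-of-range values."""
    if not 0.0 <= popen < 0.5:
        raise ValueError("--popen must be >= 0 and < 0.5")
    if not 0.0 <= pextend < 1.0:
        raise ValueError("--pextend must be >= 0 and < 1.0")
    opts = []
    if singlemx:
        opts.append("--singlemx")
    if mxfile is not None:
        opts += ["--mxfile", mxfile]
    opts += ["--popen", str(popen), "--pextend", str(pextend)]
    return opts
```

Calling single_seq_options(popen=0.5) raises ValueError, since 0.5 lies outside the documented range.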
154
155
156
157
159 Reporting thresholds control which hits are reported in output files
160 (the main output, --tblout, and --dfamtblout). Hits are ranked by sta‐
161 tistical significance (E-value).
162
163
164
165 -E <x> Report target sequences with an E-value of <= <x>. The default
166 is 10.0, meaning that on average, about 10 false positives will
167 be reported per query, so you can see the top of the noise and
168 decide for yourself if it's really noise.
169
170
-T <x> Instead of thresholding output on E-value, report target
sequences with a bit score of >= <x>.
173
174
175
176
177
179 Inclusion thresholds are stricter than reporting thresholds. Inclusion
180 thresholds control which hits are considered to be reliable enough to
181 be included in an output alignment or a subsequent search round, or
182 marked as significant ("!") as opposed to questionable ("?") in hit
183 output.
184
185
186 --incE <x>
187 Use an E-value of <= <x> as the inclusion threshold. The de‐
188 fault is 0.01, meaning that on average, about 1 false positive
189 would be expected in every 100 searches with different query se‐
190 quences.
191
192
193 --incT <x>
194 Instead of using E-values for setting the inclusion threshold,
195 use a bit score of >= <x> as the inclusion threshold. By de‐
196 fault this option is unset.
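
How the reporting and inclusion thresholds interact can be sketched as follows. This Python example is an illustration of the rules described above, not HMMER's implementation; the function name classify_hit is hypothetical, and it assumes that a bit score threshold, when given, replaces the corresponding E-value threshold.

```python
from typing import Optional


def classify_hit(evalue: float, score: float,
                 E: float = 10.0, incE: float = 0.01,
                 T: Optional[float] = None,
                 incT: Optional[float] = None) -> Optional[str]:
    """Return '!' (significant), '?' (questionable), or None (not reported)."""
    # Reporting threshold: bit score if -T is set, otherwise E-value.
    reported = (score >= T) if T is not None else (evalue <= E)
    if not reported:
        return None
    # Inclusion threshold: bit score if --incT is set, otherwise E-value.
    included = (score >= incT) if incT is not None else (evalue <= incE)
    return "!" if included else "?"
```

With the defaults, a hit with an E-value of 0.5 is reported but marked "?", because it passes -E 10.0 and fails --incE 0.01.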
197
198
199
200
202 Curated profile databases may define specific bit score thresholds for
203 each profile, superseding any thresholding based on statistical signif‐
204 icance alone.
205
To use these options, the profile must contain the appropriate (GA, TC,
and/or NC) optional score threshold annotation; this is picked up by
hmmbuild from Stockholm format alignment files. For a nucleotide model,
each thresholding option has a single per-hit threshold <x>. This acts
as if -T <x> --incT <x> had been applied specifically using each
model's curated thresholds.
212
213
214 --cut_ga
215 Use the GA (gathering) bit score threshold in the model to set
216 per-hit reporting and inclusion thresholds. GA thresholds are
217 generally considered to be the reliable curated thresholds
218 defining family membership; for example, in Dfam, these thresh‐
219 olds are applied when annotating a genome with a model of a fam‐
220 ily known to be found in that organism. They may allow for mini‐
221 mal expected false discovery rate.
222
223
224 --cut_nc
225 Use the NC (noise cutoff) bit score threshold in the model to
226 set per-hit reporting and inclusion thresholds. NC thresholds
227 are less stringent than GA; in the context of Pfam, they are
228 generally used to store the score of the highest-scoring known
229 false positive.
230
231
232 --cut_tc
233 Use the TC (trusted cutoff) bit score threshold in the model to
234 set per-hit reporting and inclusion thresholds. TC thresholds
235 are more stringent than GA, and are generally considered to be
236 the score of the lowest-scoring known true positive that is
237 above all known false positives; for example, in Dfam, these
238 thresholds are applied when annotating a genome with a model of
239 a family not known to be found in that organism.
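
The equivalence between a curated cutoff and the pair -T <x> --incT <x> can be illustrated with a short Python sketch. The dictionary of cutoffs stands in for the GA/TC/NC annotation stored in a model; the function name cutoff_options and the example values are hypothetical.

```python
from typing import Dict, List


def cutoff_options(model_cutoffs: Dict[str, float], which: str) -> List[str]:
    """Translate a curated cutoff (GA, TC, or NC) into the equivalent -T/--incT pair."""
    which = which.upper()
    if which not in ("GA", "TC", "NC"):
        raise ValueError("cutoff must be one of GA, TC, NC")
    if which not in model_cutoffs:
        # Mirrors the requirement that the profile carry the annotation.
        raise ValueError("model has no %s threshold annotation" % which)
    x = str(model_cutoffs[which])
    return ["-T", x, "--incT", x]


# Hypothetical annotation for one model; TC >= GA as described above.
example_cutoffs = {"GA": 25.0, "TC": 30.5}
```

Here cutoff_options(example_cutoffs, "ga") yields the same thresholds that --cut_ga would apply for that model, while requesting "NC" fails because the annotation is absent.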
240
241
242
243
244
246 HMMER3 searches are accelerated in a three-step filter pipeline: the
247 scanning-SSV filter, the Viterbi filter, and the Forward filter. The
248 first filter is the fastest and most approximate; the last is the full
249 Forward scoring algorithm. There is also a bias filter step between SSV
250 and Viterbi. Targets that pass all the steps in the acceleration pipe‐
251 line are then subjected to postprocessing -- domain identification and
252 scoring using the Forward/Backward algorithm.
253
254 Changing filter thresholds only removes or includes targets from con‐
255 sideration; changing filter thresholds does not alter bit scores, E-
256 values, or alignments, all of which are determined solely in postpro‐
257 cessing.
258
259
260 --max Turn off (nearly) all filters, including the bias filter, and
261 run full Forward/Backward postprocessing on most of the target
262 sequence. In contrast to phmmer and hmmsearch, where this flag
263 really does turn off the filters entirely, the --max flag in
264 nhmmer sets the scanning-SSV filter threshold to 0.4, not 1.0.
265 Use of this flag increases sensitivity somewhat, at a large cost
266 in speed.
267
268
269 --F1 <x>
270 Set the P-value threshold for the SSV filter step. The default
271 is 0.02, meaning that roughly 2% of the highest scoring nonho‐
272 mologous targets are expected to pass the filter.
273
274
275 --F2 <x>
276 Set the P-value threshold for the Viterbi filter step. The de‐
277 fault is 0.001.
278
279
280 --F3 <x>
281 Set the P-value threshold for the Forward filter step. The de‐
282 fault is 1e-5.
283
284
--nobias
Turn off the bias filter. This increases sensitivity somewhat,
but can come at a high cost in speed, especially if the query
has biased residue composition (such as a repetitive sequence
region). Without the bias filter, too many sequences may pass
the filter with biased queries, leading to slower than expected
performance as the computationally intensive Forward/Backward
algorithms shoulder an abnormally heavy load.
294
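
The cascade of P-value thresholds can be pictured with the following Python sketch. It is a simplified illustration using the default values of --F1, --F2, and --F3 given above; it leaves out the bias filter, assumes a target passes a stage when its P-value is at or below the threshold, and uses a hypothetical function name.

```python
from typing import Optional


def first_failed_filter(p_ssv: float, p_vit: float, p_fwd: float,
                        F1: float = 0.02, F2: float = 0.001,
                        F3: float = 1e-5) -> Optional[str]:
    """Return the name of the first filter a target fails, or None if it passes all.

    Simplified: the bias filter between SSV and Viterbi is not modeled.
    """
    stages = (("SSV", p_ssv, F1), ("Viterbi", p_vit, F2), ("Forward", p_fwd, F3))
    for name, p_value, threshold in stages:
        if p_value > threshold:
            return name  # rejected here; later stages are never run
    return None  # survives the pipeline and goes on to postprocessing
```

Loosening a threshold lets more targets through to the slower stages, but, as noted above, it does not change the scores computed in postprocessing.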
295
296
297
--dna  Assert that the input sequences are DNA, bypassing alphabet
autodetection.


--rna  Assert that the input sequences are RNA, bypassing alphabet
autodetection.
305
306
307
308
When searching with nhmmer, one may optionally precompute a binary
version of the target database, using makehmmerdb, then search against
that database. Using default settings, this yields a roughly 10-fold
acceleration with small loss of sensitivity on benchmarks. This is
achieved using a heuristic method that searches for seeds (ungapped
alignments) around which full processing is done. This is essentially a
replacement for the SSV stage. (This method has been extensively tested,
but should still be treated as somewhat experimental.) The following
options only impact nhmmer if the value of --tformat is fmindex, the
format of a database produced by makehmmerdb.
319
320 Changing parameters for this seed-finding step will impact both speed
321 and sensitivity - typically faster search leads to lower sensitivity.
322
323
324 --seed_max_depth <n>
325 The seed step requires that a seed reach a specified bit score
326 in length no longer than <n>. By default, this value is 15.
327 Longer seeds allow a greater chance of meeting the bit score
328 threshold, leading to diminished filtering (greater sensitivity,
329 slower run time).
330
331
332 --seed_sc_thresh <x>
333 The seed must reach score <x> (in bits). The default is 15.0
334 bits. A higher threshold increases filtering stringency, leading
335 to faster run times and lower sensitivity.
336
337
338 --seed_sc_density <x>
339 Either all prefixes or all suffixes of a seed must have bit den‐
340 sity (bits per aligned position) of at least <x>. The default
341 is 0.8 bits/position. An increase in the density requirement
342 leads to increased filtering stringency, thus faster run times
343 and lower sensitivity.
344
345
346 --seed_drop_max_len <n>
347 A seed may not have a run of length <n> in which the score drops
348 by --seed_drop_lim or more. Basically, this prunes seeds that go
349 through long slightly-negative seed extensions. The default is
350 4. Increasing the limit causes (slightly) diminished filtering
351 efficiency, thus slower run times and higher sensitivity. (minor
352 tuning option)
353
354
--seed_drop_lim <x>
In a seed, there may be no run of length --seed_drop_max_len in
which the score drops by <x> or more. The default is 0.3 bits.
Larger numbers mean less filtering. (minor tuning option)
359
360
361 --seed_req_pos <n>
362 A seed must contain a run of at least <n> positive-scoring
363 matches. The default is 5. Larger values mean increased filter‐
364 ing. (minor tuning option)
365
366
367 --seed_ssv_length <n>
368 After finding a short seed, an ungapped alignment is extended in
369 both directions in an attempt to meet the --F1 score threshold.
370 The window through which this ungapped alignment extends is
371 length <n>. The default is 70. Decreasing this value slightly
372 reduces run time, at a small risk of reduced sensitivity. (minor
373 tuning option)
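
To make the main seed criteria concrete, the Python sketch below applies the length, score, density, and positive-run requirements to a list of per-position scores in bits. It is an illustration of the criteria as worded above, not HMMER's implementation: the function names are hypothetical, a "positive-scoring match" is taken to mean a position with score > 0, and the --seed_drop_* checks are left out.

```python
from typing import Sequence


def _all_prefixes_dense(scores: Sequence[float], min_density: float) -> bool:
    """True if every prefix has at least min_density bits per position."""
    total = 0.0
    for length, s in enumerate(scores, start=1):
        total += s
        if total / length < min_density:
            return False
    return True


def longest_positive_run(scores: Sequence[float]) -> int:
    """Length of the longest run of consecutive positions with score > 0."""
    best = run = 0
    for s in scores:
        run = run + 1 if s > 0 else 0
        best = max(best, run)
    return best


def seed_passes(scores: Sequence[float], max_depth: int = 15,
                sc_thresh: float = 15.0, sc_density: float = 0.8,
                req_pos: int = 5) -> bool:
    """Apply the seed length, score, density, and positive-run criteria."""
    if not scores or len(scores) > max_depth:
        return False                       # --seed_max_depth
    if sum(scores) < sc_thresh:
        return False                       # --seed_sc_thresh
    prefixes_ok = _all_prefixes_dense(scores, sc_density)
    suffixes_ok = _all_prefixes_dense(list(reversed(scores)), sc_density)
    if not (prefixes_ok or suffixes_ok):
        return False                       # --seed_sc_density
    return longest_positive_run(scores) >= req_pos   # --seed_req_pos
```

Raising sc_thresh, sc_density, or req_pos makes fewer seeds pass, which corresponds to the faster but less sensitive searches described above.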
374
375
376
378 --qformat <s>
379 Assert that input queryfile is a sequence file (unaligned or
380 aligned), in format <s>, bypassing format autodetection. Common
381 choices for <s> include: fasta, embl, genbank. Alignment for‐
382 mats also work, and will serve as the basis for automatic cre‐
383 ation of a profile HMM used for searching; common choices in‐
384 clude: stockholm, a2m, afa, psiblast, clustal, phylip. For more
385 information, and for codes for some less common formats, see
386 main documentation.
387
388
389
--qsingle_seqs
Force queryfile to be read as individual sequences, even if it
is in an MSA format. For example, if the input is in aligned
Stockholm format, the --qsingle_seqs flag will cause each
sequence in that alignment to be used as a separate query
sequence.
396
397
398 --tformat <s>
399 Assert that target sequence database seqdb is in format <s>, by‐
400 passing format autodetection. Common choices for <s> include:
401 fasta, embl, genbank, ncbi, fmindex. Alignment formats also
402 work; common choices include: stockholm, a2m, afa, psiblast,
403 clustal, phylip. For more information, and for codes for some
404 less common formats, see main documentation. The string <s> is
405 case-insensitive (fasta or FASTA both work). The format ncbi
406 indicates that the database file is a binary file produced using
407 makeblastdb. The format fmindex indicates that the database
408 file is a binary file produced using makehmmerdb.
409
410
411
412 --nonull2
413 Turn off the null2 score corrections for biased composition.
414
415
-Z <x> For the purposes of per-hit E-value calculations, assert that
the total size of the target database is <x> million nucleotides,
rather than the actual number of targets seen.
419
420
421
422 --seed <n>
423 Set the random number seed to <n>. Some steps in postprocessing
424 require Monte Carlo simulation. The default is to use a fixed
425 seed (42), so that results are exactly reproducible. Any other
426 positive integer will give different (but also reproducible) re‐
427 sults. A choice of 0 uses a randomly chosen seed.
428
429
430
431 --w_beta <x>
432 Window length tail mass. The upper bound, W, on the length at
433 which nhmmer expects to find an instance of the model is set
434 such that the fraction of all sequences generated by the model
435 with length >= W is less than <x>. The default is 1e-7. This
436 flag may be used to override the value of W established for the
437 model by hmmbuild, or when the query is sequence-based.
438
439
440
441
442 --w_length <n>
443 Override the model instance length upper bound, W, which is oth‐
444 erwise controlled by --w_beta. It should be larger than the
445 model length. The value of W is used deep in the acceleration
446 pipeline, and modest changes are not expected to impact results
447 (though larger values of W do lead to longer run time). This
448 flag may be used to override the value of W established for the
449 model by hmmbuild, or when the query is sequence-based.
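
The definition of W used by --w_beta can be illustrated on a toy length distribution. The Python sketch below finds the smallest W for which the probability of generating a sequence of length >= W falls below the tail mass; the function name window_length and the example distribution are hypothetical, and this is not how hmmbuild or nhmmer compute W internally.

```python
from typing import Dict


def window_length(length_probs: Dict[int, float], beta: float = 1e-7) -> int:
    """Smallest W such that P(length >= W) < beta for a discrete length distribution."""
    if not length_probs:
        raise ValueError("length distribution is empty")
    w = max(length_probs) + 1      # no mass at or beyond this length: tail is 0
    tail = 0.0                     # P(length >= w)
    while w > 1:
        candidate_tail = tail + length_probs.get(w - 1, 0.0)
        if candidate_tail >= beta:
            break                  # lowering W further would violate the bound
        tail = candidate_tail
        w -= 1
    return w


# Toy distribution over instance lengths (probabilities sum to 1).
toy_lengths = {98: 0.05, 99: 0.2, 100: 0.5, 101: 0.2, 102: 0.05}
```

A larger tail mass gives a smaller W: with this toy distribution, a tail mass of 0.1 allows W to drop to 102, while the default of 1e-7 keeps W just past the longest possible length.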
450
451
452
453
454 --watson
455 Only search the top strand. By default both the query sequence
456 and its reverse-complement are searched.
457
458
459 --crick
460 Only search the bottom (reverse-complement) strand. By default
461 both the query sequence and its reverse-complement are searched.
462
463
464
465 --cpu <n>
466 Set the number of parallel worker threads to <n>. On multicore
467 machines, the default is 2. You can also control this number by
468 setting an environment variable, HMMER_NCPU. There is also a
469 master thread, so the actual number of threads that HMMER spawns
470 is <n>+1.
471
472 This option is not available if HMMER was compiled with POSIX
473 threads support turned off.
474
475
476
477
478
479 --stall
480 For debugging the MPI master/worker version: pause after start,
481 to enable the developer to attach debuggers to the running mas‐
482 ter and worker(s) processes. Send SIGCONT signal to release the
483 pause. (Under gdb: (gdb) signal SIGCONT) (Only available if op‐
484 tional MPI support was enabled at compile-time.)
485
486
487 --mpi Run under MPI control with master/worker parallelization (using
488 mpirun, for example, or equivalent). Only available if optional
489 MPI support was enabled at compile-time.
490
491
492
493
494
495
496
498 See hmmer(1) for a master man page with a list of all the individual
499 man pages for programs in the HMMER package.
500
501
For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf); or see the HMMER web page
(http://hmmer.org/).
505
506
507
508
510 Copyright (C) 2020 Howard Hughes Medical Institute.
511 Freely distributed under the BSD open source license.
512
513 For additional information on copyright and licensing, see the file
514 called COPYRIGHT in your HMMER source distribution, or see the HMMER
515 web page (http://hmmer.org/).
516
517
518
520 http://eddylab.org
521
522
523
524
525
526
HMMER 3.3.2                        Nov 2020                          nhmmer(1)