nhmmer(1)                      HMMER Manual                      nhmmer(1)

nhmmer - search DNA/RNA queries against a DNA/RNA sequence database

nhmmer [options] <queryfile> <seqdb>

nhmmer searches one or more nucleotide queries against a nucleotide
sequence database. For each query in <queryfile>, it searches the target
database of sequences in <seqdb> and outputs a ranked list of the hits
with the most significant matches to the query. A query may be a profile
model built using hmmbuild, a sequence alignment, or a single sequence.
Sequence-based queries can be in a number of formats (see --qformat),
and the format can typically be autodetected. Note that only Stockholm
format supports queries made up of more than one sequence alignment.

Either the query <queryfile> or the target <seqdb> may be '-' (a dash
character), in which case the query file or target database input will
be read from a <stdin> pipe instead of from a file. Only one input
source can come through <stdin>, not both. If the query is
sequence-based and passed via <stdin>, the --qformat flag must be used.
If the <queryfile> contains more than one query, then <seqdb> cannot
come from <stdin>, because we can't rewind the streaming target database
to search it with another profile.
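
The <stdin> rules above can be checked before launching a search. The following Python sketch is illustrative only and is not part of HMMER: the function name check_inputs and its parameters are hypothetical, and the caller is assumed to already know whether the query is sequence-based and how many queries the file holds.

```python
def check_inputs(queryfile, seqdb, qformat=None,
                 query_is_sequence_based=False, n_queries=1):
    """Return a list of problems with an nhmmer input combination.

    Hypothetical helper: it only encodes the three <stdin> rules stated
    in the description and does not inspect any files.
    """
    problems = []
    query_on_stdin = (queryfile == "-")
    target_on_stdin = (seqdb == "-")

    # Rule 1: only one input source may come through <stdin>.
    if query_on_stdin and target_on_stdin:
        problems.append("only one input may come from stdin")

    # Rule 2: a sequence-based query on <stdin> needs --qformat.
    if query_on_stdin and query_is_sequence_based and qformat is None:
        problems.append("sequence-based query on stdin requires --qformat")

    # Rule 3: a streamed target cannot be rewound for a second query.
    if target_on_stdin and n_queries > 1:
        problems.append("multi-query file cannot use a stdin target")

    return problems


print(check_inputs("-", "genome.fa", query_is_sequence_based=True))
```

An empty list means that none of the three restrictions is violated.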

If the query is sequence-based, and not from <stdin>, a new file
containing the HMM(s) built from the input(s) in <queryfile> may
optionally be produced, with the filename set using the --hmmout flag.

The output format is designed to be human-readable, but is often so
voluminous that reading it is impractical, and parsing it is a pain.
The --tblout option saves output in a simple tabular format that is
concise and easier to parse. The -o option allows redirecting the main
output, including throwing it away in /dev/null.
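
As a minimal sketch of the pattern just described (keep the tabular summary, discard the main output), the Python snippet below assembles an nhmmer argument list. The helper name nhmmer_command and the file names are hypothetical; only the -o and --tblout flags come from this page. The command is built but not executed.

```python
def nhmmer_command(queryfile, seqdb, tblout=None, main_output=None):
    """Build an argument list for nhmmer (hypothetical helper).

    Options are placed before the two positional arguments, following
    the synopsis: nhmmer [options] <queryfile> <seqdb>.
    """
    cmd = ["nhmmer"]
    if main_output is not None:
        cmd += ["-o", main_output]      # redirect the human-readable output
    if tblout is not None:
        cmd += ["--tblout", tblout]     # save the tabular summary
    cmd += [queryfile, seqdb]
    return cmd


# Example file names are placeholders.
cmd = nhmmer_command("query.hmm", "genome.fa",
                     tblout="hits.tbl", main_output="/dev/null")
print(" ".join(cmd))
```

If nhmmer is installed, the list could be handed to subprocess.run(cmd, check=True); passing a list rather than a shell string avoids quoting problems with unusual file names.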

-h      Help; print a brief reminder of command line usage and all
        available options.

-o <f>  Direct the main human-readable output to a file <f> instead of
        the default stdout.

-A <f>  Save a multiple alignment of all significant hits (those
        satisfying inclusion thresholds) to the file <f>.

--tblout <f>
        Save a simple tabular (space-delimited) file summarizing the
        per-target output, with one data line per homologous target
        sequence found.
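
A space-delimited file of this kind can be read with a few lines of Python. The sketch below rests on two assumptions that this page does not state: that comment and header lines begin with '#', and that each data line has 15 whitespace-separated columns followed by a free-text description. Check the header of your own --tblout file before relying on either; read_tblout and N_FIXED_COLUMNS are hypothetical names.

```python
# Assumption: number of whitespace-delimited columns that precede the
# free-text description. Verify against the header of your own file.
N_FIXED_COLUMNS = 15


def read_tblout(lines):
    """Split tabular output lines into fields (hypothetical helper).

    Blank lines and lines starting with '#' (assumed to be comments)
    are skipped. The trailing description, which may contain spaces,
    is kept as a single final field.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(line.split(None, N_FIXED_COLUMNS))
    return rows


# Synthetic example line, not real nhmmer output.
example = [
    "# comment line\n",
    "seq1 - query1 - 1 50 100 149 95 155 10000 + 1e-10 45.2 0.1 some description text\n",
]
rows = read_tblout(example)
print(len(rows), rows[0][0], rows[0][-1])
```

In practice the function would be called on an open file handle, for example read_tblout(open("hits.tbl")).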

--dfamtblout <f>
        Save a tabular (space-delimited) file summarizing the per-hit
        output, similar to --tblout but more succinct.

--aliscoresout <f>
        Save to file a list of per-position scores for each hit. This
        is useful, for example, in identifying regions of high score
        density for use in resolving overlapping hits from different
        models.

--hmmout <f>
        If <queryfile> is sequence-based, write the internally-computed
        HMM(s) to <f>.

--acc   Use accessions instead of names in the main output, where
        available for profiles and/or sequences.

--noali
        Omit the alignment section from the main output. This can
        greatly reduce the output volume.

--notextw
        Unlimit the length of each line in the main output. The default
        is a limit of 120 characters per line, which helps in displaying
        the output cleanly on terminals and in editors, but can truncate
        target profile description lines.

--textw <n>
        Set the main output's line length limit to <n> characters per
        line. The default is 120.

Reporting thresholds control which hits are reported in output files
(the main output, --tblout, and --dfamtblout). Hits are ranked by
statistical significance (E-value).

-E <x>  Report target sequences with an E-value of <= <x>. The default
        is 10.0, meaning that on average, about 10 false positives will
        be reported per query, so you can see the top of the noise and
        decide for yourself if it's really noise.

-T <x>  Instead of thresholding output on E-value, report target
        sequences with a bit score of >= <x>.

Inclusion thresholds are stricter than reporting thresholds. Inclusion
thresholds control which hits are considered to be reliable enough to
be included in an output alignment or a subsequent search round, or
marked as significant ("!") as opposed to questionable ("?") in hit
output.

--incE <x>
        Use an E-value of <= <x> as the inclusion threshold. The
        default is 0.01, meaning that on average, about 1 false positive
        would be expected in every 100 searches with different query
        sequences.

--incT <x>
        Instead of using E-values for setting the inclusion threshold,
        use a bit score of >= <x> as the inclusion threshold. By
        default this option is unset.
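
The relationship between the two E-value thresholds can be sketched in Python. This is an illustration of the rules stated above, not HMMER's own code: classify_hit is a hypothetical name, the defaults mirror -E 10.0 and --incE 0.01, and the bit score alternatives (-T, --incT) are not modeled.

```python
def classify_hit(evalue, report_e=10.0, include_e=0.01):
    """Classify one hit by E-value (hypothetical helper).

    Returns "!" for hits that satisfy the inclusion threshold, "?" for
    hits that are reported but not included, and None for hits that
    would not be reported at all. Assumes include_e <= report_e.
    """
    if evalue <= include_e:
        return "!"
    if evalue <= report_e:
        return "?"
    return None


for e in (1e-5, 0.5, 50.0):
    print(e, classify_hit(e))
```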

Curated profile databases may define specific bit score thresholds for
each profile, superseding any thresholding based on statistical
significance alone.

To use these options, the profile must contain the appropriate (GA, TC,
and/or NC) optional score threshold annotation; this is picked up by
hmmbuild from Stockholm format alignment files. For a nucleotide model,
each thresholding option has a single per-hit threshold <x>. This acts
as if -T <x> --incT <x> has been applied specifically using each model's
curated thresholds.

--cut_ga
        Use the GA (gathering) bit score threshold in the model to set
        per-hit reporting and inclusion thresholds. GA thresholds are
        generally considered to be the reliable curated thresholds
        defining family membership; for example, in Dfam, these
        thresholds are applied when annotating a genome with a model of
        a family known to be found in that organism. They may allow for
        minimal expected false discovery rate.

--cut_nc
        Use the NC (noise cutoff) bit score threshold in the model to
        set per-hit reporting and inclusion thresholds. NC thresholds
        are less stringent than GA; in the context of Pfam, they are
        generally used to store the score of the highest-scoring known
        false positive.

--cut_tc
        Use the TC (trusted cutoff) bit score threshold in the model to
        set per-hit reporting and inclusion thresholds. TC thresholds
        are more stringent than GA, and are generally considered to be
        the score of the lowest-scoring known true positive that is
        above all known false positives; for example, in Dfam, these
        thresholds are applied when annotating a genome with a model of
        a family not known to be found in that organism.
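
The statement that a curated cutoff acts as if -T <x> --incT <x> had been given with the model's own value can be made concrete with a small Python sketch. The function name curated_threshold_args and the cutoff values are hypothetical; nhmmer reads the real values from the profile's annotation.

```python
def curated_threshold_args(cutoffs, which):
    """Return the -T/--incT arguments equivalent to a curated cutoff.

    cutoffs: mapping of annotation name to bit score for one model,
             e.g. {"GA": 25.0, "TC": 30.0, "NC": 20.0} (made-up values).
    which:   "GA", "TC", or "NC" (for --cut_ga, --cut_tc, --cut_nc).
    """
    if which not in ("GA", "TC", "NC"):
        raise ValueError("which must be GA, TC, or NC")
    if which not in cutoffs:
        # The option is only usable if the model carries the annotation.
        raise KeyError("model has no %s annotation" % which)
    x = cutoffs[which]
    # A single per-hit threshold sets both reporting and inclusion.
    return ["-T", str(x), "--incT", str(x)]


model_cutoffs = {"GA": 25.0, "TC": 30.0, "NC": 20.0}  # hypothetical
print(curated_threshold_args(model_cutoffs, "GA"))
```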

HMMER3 searches are accelerated in a three-step filter pipeline: the
scanning-SSV filter, the Viterbi filter, and the Forward filter. The
first filter is the fastest and most approximate; the last is the full
Forward scoring algorithm. There is also a bias filter step between SSV
and Viterbi. Targets that pass all the steps in the acceleration
pipeline are then subjected to postprocessing: domain identification
and scoring using the Forward/Backward algorithm.

Changing filter thresholds only removes or includes targets from
consideration; changing filter thresholds does not alter bit scores,
E-values, or alignments, all of which are determined solely in
postprocessing.

--max   Turn off (nearly) all filters, including the bias filter, and
        run full Forward/Backward postprocessing on most of the target
        sequence. In contrast to phmmer and hmmsearch, where this flag
        really does turn off the filters entirely, the --max flag in
        nhmmer sets the scanning-SSV filter threshold to 0.4, not 1.0.
        Use of this flag increases sensitivity somewhat, at a large cost
        in speed.

--F1 <x>
        Set the P-value threshold for the SSV filter step. The default
        is 0.02, meaning that roughly 2% of the highest scoring
        nonhomologous targets are expected to pass the filter.

--F2 <x>
        Set the P-value threshold for the Viterbi filter step. The
        default is 0.001.

--F3 <x>
        Set the P-value threshold for the Forward filter step. The
        default is 1e-5.
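
To get a feel for how much work each filter removes, the default thresholds can be turned into rough expected counts. The Python sketch below assumes that each P-value threshold can be read as the fraction of nonhomologous targets expected to get past that stage, extending the reading given for --F1 to the other two filters; that extension is an assumption, and real counts also depend on the bias filter and on sequence composition. The names are hypothetical.

```python
# Default P-value thresholds as listed for --F1, --F2, and --F3.
DEFAULT_THRESHOLDS = {
    "SSV (--F1)": 0.02,
    "Viterbi (--F2)": 0.001,
    "Forward (--F3)": 1e-5,
}


def expected_survivors(n_null_targets, thresholds=DEFAULT_THRESHOLDS):
    """Rough expected number of nonhomologous targets passing each stage.

    Assumption: a stage with P-value threshold p lets through about a
    fraction p of the nonhomologous targets that entered the pipeline.
    """
    return {stage: n_null_targets * p for stage, p in thresholds.items()}


survivors = expected_survivors(1_000_000)
for stage, count in survivors.items():
    print(f"{stage}: about {count:.0f} of 1000000")
```

Under this reading, loosening a threshold (for example a larger --F1) mainly increases the load on the slower stages that follow.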

--nobias
        Turn off the bias filter. This increases sensitivity somewhat,
        but can come at a high cost in speed, especially if the query
        has biased residue composition (such as a repetitive sequence
        region, or if it is a membrane protein with large regions of
        hydrophobicity). Without the bias filter, too many sequences may
        pass the filter with biased queries, leading to slower than
        expected performance as the computationally intensive
        Forward/Backward algorithms shoulder an abnormally heavy load.

The alphabet type of the target database (DNA or RNA) is autodetected
by default, by looking at the composition of the <seqdb>. Autodetection
is normally quite reliable, but occasionally alphabet type may be
ambiguous and autodetection can fail (for instance, when the first
sequence starts with a run of ambiguous characters). To avoid this, or
to increase robustness in automated analysis pipelines, you may specify
the alphabet type of <seqdb> with these options.

--dna   Specify that all sequences in <seqdb> are DNAs.

--rna   Specify that all sequences in <seqdb> are RNAs.

When searching with nhmmer, one may optionally precompute a binary
version of the target database, using makehmmerdb, then search against
that database. Using default settings, this yields a roughly 10-fold
acceleration with small loss of sensitivity on benchmarks. This is
achieved using a heuristic method that searches for seeds (ungapped
alignments) around which full processing is done. This is essentially a
replacement for the SSV stage. (This method has been extensively tested,
but should still be treated as somewhat experimental.) The following
options only impact nhmmer if the value of --tformat is hmmerfm.

Changing parameters for this seed-finding step will impact both speed
and sensitivity; typically, faster search leads to lower sensitivity.

--seed_max_depth <n>
        The seed step requires that a seed reach a specified bit score
        in length no longer than <n>. By default, this value is 15.
        Longer seeds allow a greater chance of meeting the bit score
        threshold, leading to diminished filtering (greater sensitivity,
        slower run time).

--seed_sc_thresh <x>
        The seed must reach score <x> (in bits). The default is 15.0
        bits. A higher threshold increases filtering stringency, leading
        to faster run times and lower sensitivity.

--seed_sc_density <x>
        Either all prefixes or all suffixes of a seed must have bit
        density (bits per aligned position) of at least <x>. The default
        is 0.8 bits/position. An increase in the density requirement
        leads to increased filtering stringency, thus faster run times
        and lower sensitivity.

--seed_drop_max_len <n>
        A seed may not have a run of length <n> in which the score drops
        by --seed_drop_lim or more. Basically, this prunes seeds that go
        through long slightly-negative seed extensions. The default is
        4. Increasing the limit causes (slightly) diminished filtering
        efficiency, thus slower run times and higher sensitivity. (minor
        tuning option)

--seed_drop_lim <x>
        In a seed, there may be no run of length --seed_drop_max_len in
        which the score drops by --seed_drop_lim. The default is 0.3
        bits. Larger numbers mean less filtering. (minor tuning option)

--seed_req_pos <n>
        A seed must contain a run of at least <n> positive-scoring
        matches. The default is 5. Larger values mean increased
        filtering. (minor tuning option)

--seed_ssv_length <n>
        After finding a short seed, an ungapped alignment is extended in
        both directions in an attempt to meet the --F1 score threshold.
        The window through which this ungapped alignment extends is
        length <n>. The default is 70. Decreasing this value slightly
        reduces run time, at a small risk of reduced sensitivity. (minor
        tuning option)
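
The seed options and their documented defaults can be collected in one place, so that a script passes only the values that differ from the defaults. The Python sketch below is illustrative: seed_args and SEED_DEFAULTS are hypothetical names, the defaults are copied from the option descriptions above, and the format name hmmerfm is the one listed under --tformat.

```python
# Defaults as documented for each seed option.
SEED_DEFAULTS = {
    "--seed_max_depth": 15,
    "--seed_sc_thresh": 15.0,
    "--seed_sc_density": 0.8,
    "--seed_drop_max_len": 4,
    "--seed_drop_lim": 0.3,
    "--seed_req_pos": 5,
    "--seed_ssv_length": 70,
}


def seed_args(overrides):
    """Build nhmmer arguments for a seed-based search (hypothetical helper).

    Only options whose value differs from the documented default are
    emitted. The target format is set explicitly because the binary
    database format is not autodetected.
    """
    unknown = set(overrides) - set(SEED_DEFAULTS)
    if unknown:
        raise ValueError("unknown seed option(s): %s" % sorted(unknown))
    args = ["--tformat", "hmmerfm"]
    for flag, default in SEED_DEFAULTS.items():
        if flag in overrides and overrides[flag] != default:
            args += [flag, str(overrides[flag])]
    return args


# A stricter score threshold trades sensitivity for speed.
print(seed_args({"--seed_sc_thresh": 18.0, "--seed_req_pos": 5}))
```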

--tformat <s>
        Assert that the target sequence database file is in format <s>.
        Accepted formats include fasta, embl, genbank, ddbj, uniprot,
        stockholm, pfam, a2m, afa, and hmmerfm. The default is to
        autodetect the format of the file. The format hmmerfm indicates
        that the database file is a binary file produced using
        makehmmerdb (this format is not currently autodetected).

--qformat <s>
        Declare that the input <queryfile> is in format <s>. This is
        used when the query is sequence-based, rather than made up of
        profile model(s). Currently the accepted multiple alignment
        sequence file formats include Stockholm, Aligned FASTA, Clustal,
        NCBI PSI-BLAST, PHYLIP, Selex, and UCSC SAM A2M. The default is
        to autodetect the format of the file.

--nonull2
        Turn off the null2 score corrections for biased composition.

-Z <x>  For the purposes of per-hit E-value calculations, assert that
        the total size of the target database is <x> million
        nucleotides, rather than the actual number of targets seen.

--seed <n>
        Set the random number seed to <n>. Some steps in postprocessing
        require Monte Carlo simulation. The default is to use a fixed
        seed (42), so that results are exactly reproducible. Any other
        positive integer will give different (but also reproducible)
        results. A choice of 0 uses a randomly chosen seed.

--w_beta <x>
        Window length tail mass. The upper bound, W, on the length at
        which nhmmer expects to find an instance of the model is set
        such that the fraction of all sequences generated by the model
        with length >= W is less than <x>. The default is 1e-7. This
        flag may be used to override the value of W established for the
        model by hmmbuild, or when the query is sequence-based.

--w_length <n>
        Override the model instance length upper bound, W, which is
        otherwise controlled by --w_beta. It should be larger than the
        model length. The value of W is used deep in the acceleration
        pipeline, and modest changes are not expected to impact results
        (though larger values of W do lead to longer run time). This
        flag may be used to override the value of W established for the
        model by hmmbuild, or when the query is sequence-based.

--toponly
        Only search the top strand. By default both the query sequence
        and its reverse-complement are searched.

--bottomonly
        Only search the bottom (reverse-complement) strand. By default
        both the query sequence and its reverse-complement are searched.
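
For readers less familiar with strand terminology, the bottom strand is the reverse complement of the top strand. The Python sketch below shows that relationship for a plain DNA string. It is a generic illustration, not how nhmmer handles strands internally, and reverse_complement is a hypothetical helper that leaves characters other than A, C, G, and T unchanged.

```python
# Translation table for the four DNA bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string.

    Characters outside A/C/G/T (for example N) are passed through
    unchanged; RNA (U) is not handled in this sketch.
    """
    return seq.translate(_COMPLEMENT)[::-1]


top = "AACGT"
bottom = reverse_complement(top)
print(top, bottom)
```

Applying the function twice returns the original sequence, which is a quick sanity check when strand handling is scripted around --toponly or --bottomonly.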
413
414
415
416
417 --cpu <n>
418 Set the number of parallel worker threads to <n>. By default,
419 HMMER sets this to the number of CPU cores it detects in your
420 machine - that is, it tries to maximize the use of your avail‐
421 able processor cores. Setting <n> higher than the number of
422 available cores is of little if any value, but you may want to
423 set it to something less. You can also control this number by
424 setting an environment variable, HMMER_NCPU.
425
426 This option is only available if HMMER was compiled with POSIX
427 threads support. This is the default, but it may have been
428 turned off at compile-time for your site or machine for some
429 reason.
430
431
432
433 --stall
434 For debugging the MPI master/worker version: pause after start,
435 to enable the developer to attach debuggers to the running mas‐
436 ter and worker(s) processes. Send SIGCONT signal to release the
437 pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
438 optional MPI support was enabled at compile-time.)
439
440
441 --mpi Run in MPI master/worker mode, using mpirun. (Only available if
442 optional MPI support was enabled at compile-time.)

See hmmer(1) for a master man page with a list of all the individual
man pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf); or see the HMMER web page ().

Copyright (C) 2015 Howard Hughes Medical Institute.
Freely distributed under the GNU General Public License (GPLv3).

For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER
web page ().

Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org

HMMER 3.1b2                    February 2015                    nhmmer(1)