nhmmscan(1)                      HMMER Manual                      nhmmscan(1)

nhmmscan - search nucleotide sequence(s) against a nucleotide profile
database

nhmmscan [options] <hmmdb> <seqfile>

nhmmscan is used to search nucleotide sequences against collections of
nucleotide profiles. For each sequence in <seqfile>, use that query
sequence to search the target database of profiles in <hmmdb>, and
output ranked lists of the profiles with the most significant matches
to the sequence.

The <seqfile> may contain more than one query sequence. It can be in
FASTA format, or several other common sequence file formats (genbank,
embl, and uniprot, among others), or in alignment file formats
(stockholm, aligned fasta, and others). See the --qformat option for a
complete list.

The <hmmdb> needs to be pressed using hmmpress before it can be
searched with nhmmscan. This creates four binary files, suffixed
.h3{fimp} (that is, .h3f, .h3i, .h3m, and .h3p).
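
The requirement above can be checked before launching a search. The
following is a minimal Python sketch, not part of HMMER: the function
name missing_pressed_files is hypothetical, and the sketch only assumes
that the four files sit next to <hmmdb> with the suffixes listed above.

```python
from pathlib import Path

# Suffixes of the four binary files that hmmpress creates for <hmmdb>.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pressed_files(hmmdb):
    """Return the expected pressed files that do not exist for <hmmdb>.

    Hypothetical helper: it only tests for file presence and does not
    validate the contents of the files.
    """
    return [hmmdb + suffix
            for suffix in PRESSED_SUFFIXES
            if not Path(hmmdb + suffix).is_file()]
```

An empty list suggests the database has been pressed; a non-empty list
names the files that hmmpress would still need to create.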

The query <seqfile> may be '-' (a dash character), in which case the
query sequences are read from a <stdin> pipe instead of from a file.
The <hmmdb> cannot be read from a <stdin> stream, because it needs to
have those four auxiliary binary files generated by hmmpress.

The output format is designed to be human-readable, but is often so
voluminous that reading it is impractical, and parsing it is a pain.
The --tblout option saves output in a simple tabular format that is
concise and easier to parse. The -o option allows redirecting the main
output, including throwing it away in /dev/null.
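
As an illustration of the combination described above, the sketch below
builds an nhmmscan command line that keeps only the tabular output. It
uses only the -o and --tblout options documented on this page; the
function name and the file names (models.hmm, queries.fa, hits.tbl) are
hypothetical, and the command is only printed, not executed.

```python
import shlex


def nhmmscan_command(hmmdb, seqfile, tblout, main_output="/dev/null"):
    """Build an argv list: tabular hits go to <tblout>, and the main
    human-readable output goes to <main_output> (discarded by default)."""
    return ["nhmmscan", "-o", main_output, "--tblout", tblout,
            hmmdb, seqfile]


# Hypothetical file names, for illustration only.
cmd = nhmmscan_command("models.hmm", "queries.fa", "hits.tbl")
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run on a machine where
nhmmscan is installed.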


-h
    Help; print a brief reminder of command line usage and all
    available options.


-o <f>
    Direct the main human-readable output to a file <f> instead of
    the default stdout.

--tblout <f>
    Save a simple tabular (space-delimited) file summarizing the
    per-hit output, with one data line per homologous target model
    hit found.

--dfamtblout <f>
    Save a tabular (space-delimited) file summarizing the per-hit
    output, similar to --tblout but more succinct.

--aliscoresout <f>
    Save to file a list of per-position scores for each hit. This
    is useful, for example, in identifying regions of high score
    density for use in resolving overlapping hits from different
    models.

--acc
    Use accessions instead of names in the main output, where
    available for profiles and/or sequences.

--noali
    Omit the alignment section from the main output. This can
    greatly reduce the output volume.

--notextw
    Unlimit the length of each line in the main output. The default
    is a limit of 120 characters per line, which helps in displaying
    the output cleanly on terminals and in editors, but can truncate
    target profile description lines.

--textw <n>
    Set the main output's line length limit to <n> characters per
    line. The default is 120.
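
Because the --tblout file is space-delimited with one data line per
hit, a reader for it can be quite small. The sketch below is an
illustration, not a definitive parser: it assumes that comment and
header lines begin with '#', and it deliberately does not name the
columns, since their layout is not given on this page. The function
name is hypothetical.

```python
def read_tblout_rows(lines):
    """Split each data line of a --tblout file into whitespace-separated
    fields.

    Assumption: lines starting with '#' are comments or headers and are
    skipped. A free-text field that contains spaces (for example, a
    description) would be split into several tokens by this sketch.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append(line.split())
    return rows
```

It accepts any iterable of lines, so an open file object can be passed
directly.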


Reporting thresholds control which hits are reported in output files
(the main output, --tblout, and --dfamtblout). Hits are ranked by
statistical significance (E-value).

-E <x>
    Report target profiles with an E-value of <= <x>. The default
    is 10.0, meaning that on average, about 10 false positives will
    be reported per query, so you can see the top of the noise and
    decide for yourself if it's really noise.

-T <x>
    Instead of thresholding output on E-value, report target
    profiles with a bit score of >= <x>.


Inclusion thresholds are stricter than reporting thresholds. Inclusion
thresholds control which hits are considered to be reliable enough to
be included in an output alignment or a subsequent search round. In
nhmmscan, which (unlike nhmmer) does not have any alignment output,
inclusion thresholds have little effect. They only affect which hits
get marked as significant (!) or questionable (?) in hit output.

--incE <x>
    Use an E-value of <= <x> as the inclusion threshold. The
    default is 0.01, meaning that on average, about 1 false positive
    would be expected in every 100 searches with different query
    sequences.

--incT <x>
    Instead of using E-values for setting the inclusion threshold,
    use a bit score of >= <x> as the inclusion threshold. It would
    be unusual to use bit score thresholds with nhmmscan, because
    you don't expect a single score threshold to work for different
    profiles; different profiles have slightly different expected
    score distributions.
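
The interplay between the reporting and inclusion thresholds can be
sketched as follows. This is an illustration rather than HMMER's own
logic: it assumes that a hit at or below the inclusion E-value is
marked '!', that a hit which is reported but above the inclusion
E-value is marked '?', and that anything above the reporting E-value is
not reported at all. The function name is hypothetical; the default
values are the -E and --incE defaults given above.

```python
def classify_hit(evalue, report_e=10.0, inc_e=0.01):
    """Return '!' for an included hit, '?' for a reported but not
    included hit, and None for a hit that would not be reported.

    Assumed behaviour, based on the threshold descriptions only.
    """
    if evalue <= inc_e:
        return "!"
    if evalue <= report_e:
        return "?"
    return None
```

Under these assumptions, tightening --incE changes only the marks, not
the set of reported hits.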


Curated profile databases may define specific bit score thresholds for
each profile, superseding any thresholding based on statistical
significance alone.

To use these options, the profile must contain the appropriate (GA, TC,
and/or NC) optional score threshold annotation; this is picked up by
hmmbuild from Stockholm format alignment files. For a nucleotide model,
each thresholding option has a single per-hit threshold <x>. This acts
as if -T <x> --incT <x> has been applied specifically using each
model's curated thresholds.

--cut_ga
    Use the GA (gathering) bit score threshold in the model to set
    per-hit reporting and inclusion thresholds. GA thresholds are
    generally considered to be the reliable curated thresholds
    defining family membership; for example, in Dfam, these
    thresholds are applied when annotating a genome with a model of
    a family known to be found in that organism. They may allow for
    minimal expected false discovery rate.

--cut_nc
    Use the NC (noise cutoff) bit score threshold in the model to
    set per-hit reporting and inclusion thresholds. NC thresholds
    are less stringent than GA; in the context of Pfam, they are
    generally used to store the score of the highest-scoring known
    false positive.

--cut_tc
    Use the TC (trusted cutoff) bit score threshold in the model to
    set per-hit reporting and inclusion thresholds. TC thresholds
    are more stringent than GA, and are generally considered to be
    the score of the lowest-scoring known true positive that is
    above all known false positives; for example, in Dfam, these
    thresholds are applied when annotating a genome with a model of
    a family not known to be found in that organism.
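
The effect of model-specific cutoffs, as opposed to one global -T
value, can be sketched as below. The model names and cutoff values are
invented for illustration (real values come from the GA, TC, and NC
annotation stored in each profile), and the function is a hypothetical
stand-in for the per-hit comparison, not HMMER code.

```python
# Hypothetical curated bit score cutoffs; ordered NC < GA < TC per model.
CUTOFFS = {
    "modelA": {"NC": 20.0, "GA": 25.0, "TC": 30.0},
    "modelB": {"NC": 35.0, "GA": 40.0, "TC": 45.0},
}


def apply_cutoff(hits, cutoffs, which):
    """Keep (model, score) hits whose bit score is >= that model's own
    cutoff of the requested kind ('GA', 'TC', or 'NC')."""
    return [(model, score) for model, score in hits
            if score >= cutoffs[model][which]]


# Two hits with the same bit score against different models.
hits = [("modelA", 27.5), ("modelB", 27.5)]
kept_ga = apply_cutoff(hits, CUTOFFS, "GA")
kept_tc = apply_cutoff(hits, CUTOFFS, "TC")
```

With these invented numbers, the same bit score passes the GA cutoff
for one model and fails it for the other, which is why a single score
threshold rarely suits every profile.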


HMMER3 searches are accelerated in a three-step filter pipeline: the
scanning-SSV filter, the Viterbi filter, and the Forward filter. The
first filter is the fastest and most approximate; the last is the full
Forward scoring algorithm. There is also a bias filter step between SSV
and Viterbi. Targets that pass all the steps in the acceleration
pipeline are then subjected to postprocessing: domain identification
and scoring using the Forward/Backward algorithm.

Changing filter thresholds only removes or includes targets from
consideration; changing filter thresholds does not alter bit scores,
E-values, or alignments, all of which are determined solely in
postprocessing.

--max
    Turn off (nearly) all filters, including the bias filter, and
    run full Forward/Backward postprocessing on most of the target
    sequence. In contrast to hmmscan, where this flag really does
    turn off the filters entirely, the --max flag in nhmmscan sets
    the scanning-SSV filter threshold to 0.4, not 1.0. Use of this
    flag increases sensitivity somewhat, at a large cost in speed.

--F1 <x>
    Set the P-value threshold for the SSV filter step. The default
    is 0.02, meaning that roughly 2% of the highest scoring
    nonhomologous targets are expected to pass the filter.

--F2 <x>
    Set the P-value threshold for the Viterbi filter step. The
    default is 0.001.

--F3 <x>
    Set the P-value threshold for the Forward filter step. The
    default is 1e-5.

--nobias
    Turn off the bias filter. This increases sensitivity somewhat,
    but can come at a high cost in speed, especially if the query
    has biased residue composition (such as a repetitive sequence
    region, or if it is a membrane protein with large regions of
    hydrophobicity). Without the bias filter, too many sequences may
    pass the filter with biased queries, leading to slower than
    expected performance as the computationally intensive
    Forward/Backward algorithms shoulder an abnormally heavy load.
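
To get a feel for the --F1, --F2, and --F3 defaults, the sketch below
estimates how many nonhomologous targets would be expected to pass
each filter. It rests on an assumption: that each P-value threshold
can be read as the fraction of all nonhomologous targets passing that
step, which is how --F1 is described above, extended here to --F2 and
--F3. The function name is hypothetical, and the bias filter is
ignored.

```python
def expected_survivors(n_targets, f1=0.02, f2=0.001, f3=1e-5):
    """Rough expected number of nonhomologous targets passing each
    filter, assuming each threshold is the fraction of all targets
    that passes that step (bias filter not modelled)."""
    return {
        "SSV": n_targets * f1,
        "Viterbi": n_targets * f2,
        "Forward": n_targets * f3,
    }


for step, count in expected_survivors(1_000_000).items():
    print(f"{step}: about {count:.0f} targets")
```

Raising any threshold toward 1.0 in this sketch lets more targets
through to the slower steps, which is the speed cost the options above
warn about.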


--nonull2
    Turn off the null2 score corrections for biased composition.

-Z <x>
    Assert that the total number of targets in your searches is <x>,
    for the purposes of per-sequence E-value calculations, rather
    than the actual number of targets seen.

--seed <n>
    Set the random number seed to <n>. Some steps in postprocessing
    require Monte Carlo simulation. The default is to use a fixed
    seed (42), so that results are exactly reproducible. Any other
    positive integer will give different (but also reproducible)
    results. A choice of 0 uses an arbitrarily chosen seed.

--qformat <s>
    Assert that the query sequence file is in format <s>. Accepted
    formats include fasta, embl, genbank, ddbj, uniprot, stockholm,
    pfam, a2m, and afa. The default is to autodetect the format of
    the file.

--w_beta <x>
    Window length tail mass. The upper bound, W, on the length at
    which nhmmscan expects to find an instance of the model is set
    such that the fraction of all sequences generated by the model
    with length >= W is less than <x>. The default is 1e-7. This
    flag may be used to override the value of W established for the
    model by hmmbuild.

--w_length <n>
    Override the model instance length upper bound, W, which is
    otherwise controlled by --w_beta. It should be larger than the
    model length. The value of W is used deep in the acceleration
    pipeline, and modest changes are not expected to impact results
    (though larger values of W do lead to longer run time). This
    flag may be used to override the value of W established for the
    model by hmmbuild.

--toponly
    Only search the top strand. By default both the query sequence
    and its reverse-complement are searched.

--bottomonly
    Only search the bottom (reverse-complement) strand. By default
    both the query sequence and its reverse-complement are searched.

--cpu <n>
    Set the number of parallel worker threads to <n>. By default,
    HMMER sets this to the number of CPU cores it detects in your
    machine; that is, it tries to maximize the use of your available
    processor cores. Setting <n> higher than the number of available
    cores is of little if any value, but you may want to set it to
    something less. You can also control this number by setting an
    environment variable, HMMER_NCPU.

    This option is only available if HMMER was compiled with POSIX
    threads support. This is the default, but it may have been
    turned off for your site or machine for some reason.

--stall
    For debugging the MPI master/worker version: pause after start,
    to enable the developer to attach debuggers to the running
    master and worker(s) processes. Send SIGCONT signal to release
    the pause. (Under gdb: (gdb) signal SIGCONT)

    (Only available if optional MPI support was enabled at
    compile-time.)

--mpi
    Run in MPI master/worker mode, using mpirun.

    (Only available if optional MPI support was enabled at
    compile-time.)


See hmmer(1) for a master man page with a list of all the individual
man pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf); or see the HMMER web page.


Copyright (C) 2015 Howard Hughes Medical Institute.
Freely distributed under the GNU General Public License (GPLv3).

For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER
web page.


Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org

HMMER 3.1b2                     February 2015                     nhmmscan(1)