nhmmscan(1)                      HMMER Manual                      nhmmscan(1)

nhmmscan - search DNA sequence(s) against a DNA profile database

nhmmscan [options] hmmdb seqfile

nhmmscan is used to search nucleotide sequences against collections of
nucleotide profiles. For each sequence in seqfile, use that query sequence
to search the target database of profiles in hmmdb, and output ranked
lists of the profiles with the most significant matches to the sequence.

The seqfile may contain more than one query sequence. It can be in FASTA
format, or several other common sequence file formats (genbank, embl, and
uniprot, among others), or in alignment file formats (stockholm, aligned
fasta, and others). See the --qformat option for a complete list.

The hmmdb must be pressed with hmmpress before it can be searched with
nhmmscan. This creates four binary files, suffixed .h3{fimp} (that is,
.h3f, .h3i, .h3m, and .h3p).

The query seqfile may be '-' (a dash character), in which case the query
sequences are read from a stdin pipe instead of from a file. The hmmdb
cannot be read from a stdin stream, because it needs to have the four
auxiliary binary files generated by hmmpress.
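
The two-step workflow above (press the database, then scan) can be sketched
as a small Python helper that only builds the command lines. This is a
minimal sketch, not part of HMMER: the function names and the file name
profiles.hmm are hypothetical, and it assumes hmmpress writes its four index
files next to the hmmdb using the suffixes listed above.

```python
import os
import shlex

# Suffixes of the four binary files that hmmpress creates next to the hmmdb.
PRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_press_files(hmmdb):
    """Return the pressed index files that do not yet exist for hmmdb."""
    return [hmmdb + s for s in PRESS_SUFFIXES if not os.path.exists(hmmdb + s)]


def hmmpress_cmd(hmmdb):
    """Build the argument list for pressing a profile database."""
    return ["hmmpress", hmmdb]


def nhmmscan_cmd(hmmdb, seqfile="-"):
    """Build the argument list for a scan; seqfile '-' means read stdin."""
    if hmmdb == "-":
        # The hmmdb needs its auxiliary files, so it cannot come from a pipe.
        raise ValueError("hmmdb cannot be read from stdin")
    return ["nhmmscan", hmmdb, seqfile]


if __name__ == "__main__":
    db = "profiles.hmm"  # hypothetical file name
    if missing_press_files(db):
        print(shlex.join(hmmpress_cmd(db)))
    print(shlex.join(nhmmscan_cmd(db)))
```

The helpers return argument lists rather than running anything, so they can
be passed to subprocess.run() once the HMMER programs are on the PATH.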

The output format is designed to be human-readable, but is often so
voluminous that reading it is impractical, and parsing it is a pain. The
--tblout option saves output in a simple tabular format that is concise
and easier to parse. The -o option allows redirecting the main output,
including throwing it away in /dev/null.
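
As an illustration of why the tabular format is easier to parse, here is a
minimal Python sketch of a reader for a --tblout file. It rests on
assumptions that are not stated on this page: that comment and header lines
begin with '#', and that each data line has a fixed number of
whitespace-delimited columns (15 is assumed here) followed by a free-text
description. Check the header of your own file before relying on that count.
The function name parse_tblout is hypothetical.

```python
def parse_tblout(lines, n_fixed=15):
    """Yield one list of fields per data line of a tabular output file.

    Assumptions (not taken from the man page): lines starting with '#' are
    comments, and the first n_fixed columns contain no spaces. Whatever
    follows them is kept as one trailing description field.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # maxsplit keeps a description that contains spaces in one piece.
        yield line.split(None, n_fixed)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for fields in parse_tblout(handle):
            print(fields[0], fields[-1], sep="\t")
```

Limiting the number of splits is the main design choice: a plain split()
would break a multi-word description into several columns.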

-h Help; print a brief reminder of command line usage and all available
options.

-o <f> Direct the main human-readable output to a file <f> instead of the
default stdout.

--tblout <f>
Save a simple tabular (space-delimited) file summarizing the per-hit
output, with one data line per homologous target model hit found.

--dfamtblout <f>
Save a tabular (space-delimited) file summarizing the per-hit output,
similar to --tblout but more succinct.

--aliscoresout <f>
Save to file a list of per-position scores for each hit. This is useful,
for example, in identifying regions of high score density for use in
resolving overlapping hits from different models.

--acc Use accessions instead of names in the main output, where available
for profiles and/or sequences.

--noali
Omit the alignment section from the main output. This can greatly reduce
the output volume.

--notextw
Unlimit the length of each line in the main output. The default is a limit
of 120 characters per line, which helps in displaying the output cleanly
on terminals and in editors, but can truncate target profile description
lines.

--textw <n>
Set the main output's line length limit to <n> characters per line. The
default is 120.

Reporting thresholds control which hits are reported in output files (the
main output, --tblout, and --dfamtblout). Hits are ranked by statistical
significance (E-value).

-E <x> Report target profiles with an E-value of <= <x>. The default is
10.0, meaning that on average, about 10 false positives will be reported
per query, so you can see the top of the noise and decide for yourself if
it's really noise.

-T <x> Instead of thresholding output on E-value, report target profiles
with a bit score of >= <x>.

Inclusion thresholds are stricter than reporting thresholds. Inclusion
thresholds control which hits are considered to be reliable enough to be
included in an output alignment or a subsequent search round. In nhmmscan,
which (unlike nhmmer) does not have any alignment output, inclusion
thresholds have little effect. They only affect which hits get marked as
significant (!) or questionable (?) in hit output.
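
The relationship between the two kinds of threshold can be sketched in a
few lines of Python. This is only an illustration of the E-value case,
using the defaults documented for -E (10.0) and --incE (0.01); the function
name hit_flag is hypothetical and this is not HMMER's own code.

```python
def hit_flag(evalue, report_e=10.0, inc_e=0.01):
    """Classify a hit by E-value.

    Returns None if the hit is not reported at all, '!' if it also passes
    the stricter inclusion threshold, and '?' if it is reported but not
    included.
    """
    if evalue > report_e:
        return None
    return "!" if evalue <= inc_e else "?"


if __name__ == "__main__":
    for e in (1e-20, 0.5, 50.0):
        print(e, hit_flag(e))
```

Because the inclusion threshold is stricter than the reporting threshold,
every hit flagged '!' is necessarily also a reported hit.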

--incE <x>
Use an E-value of <= <x> as the inclusion threshold. The default is 0.01,
meaning that on average, about 1 false positive would be expected in every
100 searches with different query sequences.

--incT <x>
Instead of using E-values for setting the inclusion threshold, use a bit
score of >= <x> as the inclusion threshold. It would be unusual to use bit
score thresholds with nhmmscan, because you don't expect a single score
threshold to work for different profiles; different profiles have slightly
different expected score distributions.

Curated profile databases may define specific bit score thresholds for
each profile, superseding any thresholding based on statistical
significance alone.

To use these options, the profile must contain the appropriate (GA, TC,
and/or NC) optional score threshold annotation; this is picked up by
hmmbuild from Stockholm format alignment files. For a nucleotide model,
each thresholding option has a single per-hit threshold <x>. This acts as
if -T <x> --incT <x> has been applied specifically using each model's
curated thresholds.

--cut_ga
Use the GA (gathering) bit score threshold in the model to set per-hit
reporting and inclusion thresholds. GA thresholds are generally considered
to be the reliable curated thresholds defining family membership; for
example, in Dfam, these thresholds are applied when annotating a genome
with a model of a family known to be found in that organism. They may
allow for minimal expected false discovery rate.

--cut_nc
Use the NC (noise cutoff) bit score threshold in the model to set per-hit
reporting and inclusion thresholds. NC thresholds are less stringent than
GA; in the context of Pfam, they are generally used to store the score of
the highest-scoring known false positive.

--cut_tc
Use the TC (trusted cutoff) bit score threshold in the model to set
per-hit reporting and inclusion thresholds. TC thresholds are more
stringent than GA, and are generally considered to be the score of the
lowest-scoring known true positive that is above all known false
positives; for example, in Dfam, these thresholds are applied when
annotating a genome with a model of a family not known to be found in that
organism.
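
The statement that a cutoff option "acts as if -T <x> --incT <x> has been
applied" for each model can be made concrete with a short Python sketch.
The dictionary of per-model cutoffs and the function name are hypothetical
stand-ins for the GA/TC/NC annotation stored in a profile; the scores used
in the example are made up.

```python
# Map each command-line option to the annotation it reads from the model.
CUT_OPTIONS = {"--cut_ga": "GA", "--cut_tc": "TC", "--cut_nc": "NC"}


def thresholds_for(option, model_cutoffs):
    """Return the per-hit reporting (T) and inclusion (incT) bit scores.

    model_cutoffs is a hypothetical dict such as {"GA": 25.0, "TC": 30.0}
    holding whichever curated thresholds the profile actually carries.
    """
    key = CUT_OPTIONS[option]
    if key not in model_cutoffs:
        raise ValueError("model has no %s threshold annotation" % key)
    x = model_cutoffs[key]
    # The same value serves as both reporting and inclusion threshold.
    return {"T": x, "incT": x}


if __name__ == "__main__":
    cutoffs = {"GA": 25.0, "TC": 30.0, "NC": 20.0}  # made-up scores
    for opt in CUT_OPTIONS:
        print(opt, thresholds_for(opt, cutoffs))
```

The made-up scores follow the ordering described above: TC is more
stringent than GA, and NC is less stringent than GA.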

HMMER3 searches are accelerated in a three-step filter pipeline: the
scanning-SSV filter, the Viterbi filter, and the Forward filter. The first
filter is the fastest and most approximate; the last is the full Forward
scoring algorithm. There is also a bias filter step between SSV and
Viterbi. Targets that pass all the steps in the acceleration pipeline are
then subjected to postprocessing -- domain identification and scoring
using the Forward/Backward algorithm.

Changing filter thresholds only removes or includes targets from
consideration; changing filter thresholds does not alter bit scores,
E-values, or alignments, all of which are determined solely in
postprocessing.
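
The cascade can be pictured with a small Python sketch. It is a conceptual
illustration only, not HMMER's implementation: it uses the default P-value
thresholds documented for --F1, --F2, and --F3, leaves out the bias filter,
and assumes a target passes a stage when its P-value is at or below that
stage's threshold. The names used are hypothetical.

```python
# Stage names with the default P-value thresholds of --F1, --F2, and --F3.
DEFAULT_THRESHOLDS = (("SSV", 0.02), ("Viterbi", 0.001), ("Forward", 1e-5))


def first_failed_filter(pvalues, thresholds=DEFAULT_THRESHOLDS):
    """Return the name of the first stage a target fails, or None.

    pvalues maps each stage name to the P-value the target obtained there.
    A return value of None means the target reaches postprocessing.
    """
    for name, cutoff in thresholds:
        if pvalues[name] > cutoff:
            return name
    return None


if __name__ == "__main__":
    target = {"SSV": 0.01, "Viterbi": 0.01, "Forward": 1e-9}  # made-up values
    print(first_failed_filter(target))
```

The function only decides whether a target is kept. It never modifies a
score, which mirrors the point above that filter thresholds do not alter
bit scores, E-values, or alignments.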

--max Turn off (nearly) all filters, including the bias filter, and run
full Forward/Backward postprocessing on most of the target sequence. In
contrast to hmmscan, where this flag really does turn off the filters
entirely, the --max flag in nhmmscan sets the scanning-SSV filter
threshold to 0.4, not 1.0. Use of this flag increases sensitivity
somewhat, at a large cost in speed.

--F1 <x>
Set the P-value threshold for the SSV filter step. The default is 0.02,
meaning that roughly 2% of the highest scoring nonhomologous targets are
expected to pass the filter.

--F2 <x>
Set the P-value threshold for the Viterbi filter step. The default is
0.001.

--F3 <x>
Set the P-value threshold for the Forward filter step. The default is
1e-5.

--nobias
Turn off the bias filter. This increases sensitivity somewhat, but can
come at a high cost in speed, especially if the query has biased residue
composition (such as a repetitive sequence region, or if it is a membrane
protein with large regions of hydrophobicity). Without the bias filter,
too many sequences may pass the filter with biased queries, leading to
slower than expected performance as the computationally intensive
Forward/Backward algorithms shoulder an abnormally heavy load.

--nonull2
Turn off the null2 score corrections for biased composition.

-Z <x> Assert that the total number of targets in your searches is <x>,
for the purposes of per-sequence E-value calculations, rather than the
actual number of targets seen.

--seed <n>
Set the random number seed to <n>. Some steps in postprocessing require
Monte Carlo simulation. The default is to use a fixed seed (42), so that
results are exactly reproducible. Any other positive integer will give
different (but also reproducible) results. A choice of 0 uses an
arbitrarily chosen seed.

--qformat <s>
Assert that input query seqfile is in format <s>, bypassing format
autodetection. Common choices for <s> include: fasta, embl, genbank.
Alignment formats also work; common choices include: stockholm, a2m, afa,
psiblast, clustal, phylip. For more information, and for codes for some
less common formats, see the main documentation. The string <s> is
case-insensitive (fasta or FASTA both work).

--w_beta <x>
Window length tail mass. The upper bound, W, on the length at which
nhmmscan expects to find an instance of the model is set such that the
fraction of all sequences generated by the model with length >= W is less
than <x>. The default is 1e-7. This flag may be used to override the value
of W established for the model by hmmbuild.

--w_length <n>
Override the model instance length upper bound, W, which is otherwise
controlled by --w_beta. It should be larger than the model length. The
value of W is used deep in the acceleration pipeline, and modest changes
are not expected to impact results (though larger values of W do lead to
longer run time). This flag may be used to override the value of W
established for the model by hmmbuild.

--watson
Only search the top strand. By default both the query sequence and its
reverse-complement are searched.

--crick
Only search the bottom (reverse-complement) strand. By default both the
query sequence and its reverse-complement are searched.

--cpu <n>
Set the number of parallel worker threads to <n>. On multicore machines,
the default is 2. You can also control this number by setting an
environment variable, HMMER_NCPU. There is also a master thread, so the
actual number of threads that HMMER spawns is <n>+1.

This option is not available if HMMER was compiled with POSIX threads
support turned off.

--stall
For debugging the MPI master/worker version: pause after start, to enable
the developer to attach debuggers to the running master and worker(s)
processes. Send SIGCONT signal to release the pause. (Under gdb: (gdb)
signal SIGCONT)

(Only available if optional MPI support was enabled at compile-time.)

--mpi Run under MPI control with master/worker parallelization (using
mpirun, for example, or equivalent). Only available if optional MPI
support was enabled at compile-time.

See hmmer(1) for a master man page with a list of all the individual man
pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your HMMER
distribution (Userguide.pdf); or see the HMMER web page
(http://hmmer.org/).

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

For additional information on copyright and licensing, see the file called
COPYRIGHT in your HMMER source distribution, or see the HMMER web page
(http://hmmer.org/).

http://eddylab.org

HMMER 3.3.2                        Nov 2020                        nhmmscan(1)