nhmmscan(1)                      HMMER Manual                      nhmmscan(1)

NAME

       nhmmscan - search DNA sequence(s) against a DNA profile database

SYNOPSIS

       nhmmscan [options] hmmdb seqfile

DESCRIPTION

       nhmmscan searches nucleotide sequences against collections of
       nucleotide profiles. For each sequence in seqfile, it uses that query
       sequence to search the target database of profiles in hmmdb, and
       outputs a ranked list of the profiles with the most significant
       matches to the sequence.

       The seqfile may contain more than one query sequence. It can be in
       FASTA format, in one of several other common sequence file formats
       (genbank, embl, and uniprot, among others), or in an alignment file
       format (stockholm, aligned fasta, and others). See the --qformat
       option for a complete list.

       The hmmdb must be pressed with hmmpress before it can be searched with
       nhmmscan. Pressing creates four binary files, suffixed .h3{fimp}.

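       As a minimal sketch of the requirement above, the following Python
       helper checks whether the four pressed files exist next to a profile
       database. The helper name and the example path are hypothetical; only
       the .h3f, .h3i, .h3m and .h3p suffixes come from this page.

```python
from pathlib import Path

# Suffixes of the four binary files that hmmpress writes next to hmmdb.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pressed_files(hmmdb):
    """Return the hmmpress index files that are absent for hmmdb.

    An empty list means all four files are present.
    """
    base = str(hmmdb)
    return [base + suffix for suffix in PRESSED_SUFFIXES
            if not Path(base + suffix).is_file()]
```

       If the returned list is not empty, running hmmpress on the database
       first is the likely fix.
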
       The query seqfile may be '-' (a dash character), in which case the
       query sequences are read from standard input instead of from a file.
       The hmmdb cannot be read from standard input, because nhmmscan needs
       the four auxiliary binary files generated by hmmpress.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout option saves output in a simple tabular format that is
       concise and easier to parse. The -o option redirects the main output,
       which includes discarding it by sending it to /dev/null.
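
       A typical scripted run keeps only the tabular output. The Python
       sketch below assembles such a command line from the options described
       above; the function name and all file names are hypothetical, and the
       command is only printed, not executed.

```python
import shlex


def nhmmscan_argv(hmmdb, seqfile, tblout, main_output="/dev/null"):
    """Build an nhmmscan command line that keeps only the tabular output."""
    # Options come first, then hmmdb and seqfile, as in the SYNOPSIS.
    return ["nhmmscan", "-o", main_output, "--tblout", tblout, hmmdb, seqfile]


# Hypothetical file names, for illustration only.
argv = nhmmscan_argv("profiles.hmm", "queries.fa", "hits.tbl")
print(shlex.join(argv))
```

       To run the search, the list could be passed to subprocess.run(argv,
       check=True) once the database has been pressed.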

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-hit output, with one data line per homologous target model
              hit found.

       --dfamtblout <f>
              Save a tabular (space-delimited) file summarizing the per-hit
              output, similar to --tblout but more succinct.

       --aliscoresout <f>
              Save to file a list of per-position scores for each hit. This
              is useful, for example, in identifying regions of high score
              density for use in resolving overlapping hits from different
              models.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Remove the limit on the length of each line in the main output.
              The default is a limit of 120 characters per line, which helps
              in displaying the output cleanly on terminals and in editors,
              but can truncate target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.

OPTIONS FOR REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --dfamtblout). Hits are ranked by
       statistical significance (E-value).

       -E <x> Report target profiles with an E-value of <= <x>. The default
              is 10.0, meaning that on average, about 10 false positives will
              be reported per query, so you can see the top of the noise and
              decide for yourself if it's really noise.

       -T <x> Instead of thresholding output on E-value, report target
              profiles with a bit score of >= <x>.
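
       The reporting rule can be sketched in a few lines of Python. This is
       an illustration of the logic described above, not HMMER's code; the
       function name and the (name, evalue, score) tuple layout are
       assumptions made for the example.

```python
def reported_hits(hits, E=10.0, T=None):
    """Filter hits by the reporting threshold and rank them by E-value.

    hits is a list of (name, evalue, score) tuples (a hypothetical layout).
    With T unset, a hit is kept when its E-value is <= E. With T set, the
    bit score test (score >= T) replaces the E-value test.
    """
    if T is not None:
        kept = [h for h in hits if h[2] >= T]
    else:
        kept = [h for h in hits if h[1] <= E]
    # Most significant (smallest E-value) first.
    return sorted(kept, key=lambda h: h[1])
```

       With the default E of 10.0, a hit with an E-value of 3.0 is still
       listed, which is what lets you inspect the top of the noise.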

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. Inclusion
       thresholds control which hits are considered to be reliable enough to
       be included in an output alignment or a subsequent search round.
       nhmmscan, unlike nhmmer, does not produce any alignment output, so
       inclusion thresholds have little effect. They only affect which hits
       get marked as significant (!) or questionable (?) in hit output.

       --incE <x>
              Use an E-value of <= <x> as the inclusion threshold. The
              default is 0.01, meaning that on average, about 1 false
              positive would be expected in every 100 searches with different
              query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the inclusion threshold. It would
              be unusual to use bit score thresholds with nhmmscan, because
              you don't expect a single score threshold to work for different
              profiles; different profiles have slightly different expected
              score distributions.
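
       As a rough illustration of how the inclusion threshold maps to the
       marker shown in hit output, consider the Python sketch below. The
       function name is hypothetical, and it models only the rule stated
       above.

```python
def significance_marker(evalue, score, incE=0.01, incT=None):
    """Return '!' for a hit that meets the inclusion threshold, else '?'."""
    if incT is not None:
        # --incT replaces the E-value test with a bit score test.
        included = score >= incT
    else:
        included = evalue <= incE
    return "!" if included else "?"
```

       A hit with an E-value of 0.5 is reported under the default -E of 10.0
       but is marked '?', because 0.5 exceeds the default --incE of 0.01.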

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific bit score thresholds for
       each profile, superseding any thresholding based on statistical
       significance alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or NC) optional score threshold annotation; this is picked up by
       hmmbuild from Stockholm format alignment files. For a nucleotide model,
       each thresholding option has a single per-hit threshold <x>. This acts
       as if -T <x> --incT <x> has been applied specifically using each
       model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit score threshold in the model to set
              per-hit reporting and inclusion thresholds. GA thresholds are
              generally considered to be the reliable curated thresholds
              defining family membership; for example, in Dfam, these
              thresholds are applied when annotating a genome with a model of
              a family known to be found in that organism. They may allow for
              a minimal expected false discovery rate.

       --cut_nc
              Use the NC (noise cutoff) bit score threshold in the model to
              set per-hit reporting and inclusion thresholds. NC thresholds
              are less stringent than GA; in the context of Pfam, they are
              generally used to store the score of the highest-scoring known
              false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score threshold in the model to
              set per-hit reporting and inclusion thresholds. TC thresholds
              are more stringent than GA, and are generally considered to be
              the score of the lowest-scoring known true positive that is
              above all known false positives; for example, in Dfam, these
              thresholds are applied when annotating a genome with a model of
              a family not known to be found in that organism.
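
       The "acts as if -T <x> --incT <x>" behavior can be pictured with the
       Python sketch below. The function name, the dictionary layout and the
       example scores are invented for illustration, and raising an error
       for a missing annotation is an assumption of the sketch rather than a
       statement about nhmmscan's behavior.

```python
def cutoff_thresholds(model_cutoffs, option):
    """Map --cut_ga/--cut_tc/--cut_nc to the equivalent (T, incT) pair.

    model_cutoffs holds one model's curated bit scores, for example
    {"GA": 25.0, "TC": 30.0, "NC": 20.0} (made-up values).
    """
    key = {"--cut_ga": "GA", "--cut_tc": "TC", "--cut_nc": "NC"}[option]
    if key not in model_cutoffs:
        # Assumption of this sketch: treat a missing annotation as an error.
        raise ValueError(f"model has no {key} threshold annotation")
    x = model_cutoffs[key]
    return x, x  # same value for reporting (-T) and inclusion (--incT)
```

       Because the pair is looked up per model, each profile in hmmdb gets
       its own thresholds, unlike a single -T value applied to all of them.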

CONTROL OF THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       scanning-SSV filter, the Viterbi filter, and the Forward filter. The
       first filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm. There is also a bias filter step between SSV
       and Viterbi. Targets that pass all the steps in the acceleration
       pipeline are then subjected to postprocessing -- domain identification
       and scoring using the Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Turn off (nearly) all filters, including the bias filter, and
              run full Forward/Backward postprocessing on most of the target
              sequence. In contrast to hmmscan, where this flag really does
              turn off the filters entirely, the --max flag in nhmmscan sets
              the scanning-SSV filter threshold to 0.4, not 1.0. Use of this
              flag increases sensitivity somewhat, at a large cost in speed.

       --F1 <x>
              Set the P-value threshold for the SSV filter step. The default
              is 0.02, meaning that roughly 2% of the highest scoring
              nonhomologous targets are expected to pass the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The
              default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The
              default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences
              may pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
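
       The filter cascade can be pictured as a sequence of P-value tests. The
       Python sketch below is a simplified illustration under stated
       assumptions: the function name and the dictionary layout are
       hypothetical, the bias filter is left out, and a target is taken to
       pass a step when its P-value does not exceed that step's threshold.

```python
def first_failed_filter(pvalues, F1=0.02, F2=0.001, F3=1e-5):
    """Return the name of the first filter a target fails, or None.

    pvalues maps "SSV", "Viterbi" and "Forward" to the target's P-value at
    that step. The bias filter is omitted from this sketch.
    """
    for name, threshold in (("SSV", F1), ("Viterbi", F2), ("Forward", F3)):
        if pvalues[name] > threshold:
            return name
    return None  # survived every step; the target goes on to postprocessing
```

       Raising F1 to 0.4, the value this page gives for --max, lets a target
       with an SSV P-value of 0.3 through the first step. Its eventual score
       and E-value are unaffected, since those are computed in
       postprocessing.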

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --seed <n>
              Set the random number seed to <n>. Some steps in postprocessing
              require Monte Carlo simulation. The default is to use a fixed
              seed (42), so that results are exactly reproducible. Any other
              positive integer will give different (but also reproducible)
              results. A choice of 0 uses an arbitrarily chosen seed.

       --qformat <s>
              Assert that input query seqfile is in format <s>, bypassing
              format autodetection. Common choices for <s> include: fasta,
              embl, genbank. Alignment formats also work; common choices
              include: stockholm, a2m, afa, psiblast, clustal, phylip. For
              more information, and for codes for some less common formats,
              see main documentation. The string <s> is case-insensitive
              (fasta or FASTA both work).

       --w_beta <x>
              Window length tail mass. The upper bound, W, on the length at
              which nhmmscan expects to find an instance of the model is set
              such that the fraction of all sequences generated by the model
              with length >= W is less than <x>. The default is 1e-7. This
              flag may be used to override the value of W established for the
              model by hmmbuild.

       --w_length <n>
              Override the model instance length upper bound, W, which is
              otherwise controlled by --w_beta. It should be larger than the
              model length. The value of W is used deep in the acceleration
              pipeline, and modest changes are not expected to impact results
              (though larger values of W do lead to longer run time). This
              flag may be used to override the value of W established for the
              model by hmmbuild.

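       To make the tail-mass definition of W concrete, the Python sketch
       below finds W for an explicit length distribution. It is a toy
       illustration of the definition only: the function name is
       hypothetical, and this page does not describe how HMMER itself
       computes W for a profile.

```python
def window_length(length_probs, beta=1e-7):
    """Return the smallest W whose tail mass P(length >= W) is below beta.

    length_probs[i] is the probability that the model emits a sequence of
    length i; the list is assumed to sum to 1, and beta to be positive.
    """
    tail = 0.0
    w = len(length_probs)  # P(length >= len(length_probs)) is 0
    # Walk down from the longest length, accumulating the tail mass.
    for length in range(len(length_probs) - 1, -1, -1):
        if tail + length_probs[length] >= beta:
            break
        tail += length_probs[length]
        w = length
    return w
```

       A smaller beta pushes W upward, which matches the note above that
       larger values of W lead to longer run time.
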
       --watson
              Only search the top strand. By default both the query sequence
              and its reverse-complement are searched.

       --crick
              Only search the bottom (reverse-complement) strand. By default
              both the query sequence and its reverse-complement are searched.

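       For readers less familiar with strand terminology, the Python sketch
       below shows which sequences the two flags select. The function names
       and the "+" and "-" labels are choices made for this example, and the
       sketch does not model what nhmmscan does if both flags are given (it
       simply returns an empty list in that case).

```python
# Complement table for unambiguous DNA bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement (bottom strand) of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strands_to_search(seq, watson=False, crick=False):
    """List the (label, sequence) pairs searched for the given flags."""
    strands = []
    if not crick:
        strands.append(("+", seq))  # top strand, as given
    if not watson:
        strands.append(("-", reverse_complement(seq)))  # bottom strand
    return strands
```
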
       --cpu <n>
              Set the number of parallel worker threads to <n>. On multicore
              machines, the default is 2. You can also control this number by
              setting an environment variable, HMMER_NCPU. There is also a
              master thread, so the actual number of threads that HMMER spawns
              is <n>+1.

              This option is not available if HMMER was compiled with POSIX
              threads support turned off.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT)

              (Only available if optional MPI support was enabled at
              compile-time.)

       --mpi  Run under MPI control with master/worker parallelization (using
              mpirun, for example, or equivalent). Only available if optional
              MPI support was enabled at compile-time.

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org

HMMER 3.3.2                        Nov 2020                        nhmmscan(1)