nhmmscan(1)

nhmmscan(1)                      HMMER Manual                      nhmmscan(1)

NAME

       nhmmscan - search nucleotide sequence(s) against a nucleotide profile
       database

SYNOPSIS

       nhmmscan [options] <hmmdb> <seqfile>

DESCRIPTION

       nhmmscan is used to search nucleotide sequences against collections of
       nucleotide profiles. For each sequence in <seqfile>, it uses that query
       sequence to search the target database of profiles in <hmmdb>, and
       outputs ranked lists of the profiles with the most significant matches
       to the sequence.

       The <seqfile> may contain more than one query sequence. It can be in
       FASTA format, or several other common sequence file formats (genbank,
       embl, and uniprot, among others), or in alignment file formats
       (stockholm, aligned fasta, and others). See the --qformat option for a
       complete list.

       The <hmmdb> needs to be pressed using hmmpress before it can be
       searched with nhmmscan. This creates four binary files, suffixed
       .h3{fimp} (that is, .h3f, .h3i, .h3m, and .h3p).

       The query <seqfile> may be '-' (a dash character), in which case the
       query sequences are read from a <stdin> pipe instead of from a file.
       The <hmmdb> cannot be read from a <stdin> stream, because it needs to
       have those four auxiliary binary files generated by hmmpress.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout option saves output in a simple tabular format that is
       concise and easier to parse. The -o option allows redirecting the main
       output, including throwing it away in /dev/null.
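
       The workflow described above (press the database once, then scan it
       and keep only the tabular output) can be sketched as a small Python
       helper that builds the two command lines. This is a minimal sketch
       and not part of HMMER: the helper name build_commands and the file
       names families.hmm, queries.fa, and hits.tbl are hypothetical. Only
       the hmmpress and nhmmscan programs and the --tblout and -o options
       are taken from this page.

```python
import shlex


def build_commands(hmmdb, seqfile, tblout):
    """Return argument lists for pressing <hmmdb> and then scanning it.

    All three arguments are caller-supplied paths; seqfile may be "-"
    to make nhmmscan read the query sequences from stdin.
    """
    press = ["hmmpress", hmmdb]
    # -o /dev/null discards the main human-readable output;
    # --tblout keeps the concise tabular summary.
    scan = ["nhmmscan", "--tblout", tblout, "-o", "/dev/null", hmmdb, seqfile]
    return press, scan


# Hypothetical file names, for illustration only.
press_cmd, scan_cmd = build_commands("families.hmm", "queries.fa", "hits.tbl")
print(shlex.join(press_cmd))
print(shlex.join(scan_cmd))
```

       Each list can be passed to subprocess.run(cmd, check=True) on a
       machine where HMMER is installed. The sketch only builds and prints
       the commands, so it runs without HMMER.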

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-hit output, with one data line per homologous target model
              hit found.

       --dfamtblout <f>
              Save a tabular (space-delimited) file summarizing the per-hit
              output, similar to --tblout but more succinct.

       --aliscoresout <f>
              Save to file a list of per-position scores for each hit. This
              is useful, for example, in identifying regions of high score
              density for use in resolving overlapping hits from different
              models.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The default
              is a limit of 120 characters per line, which helps in displaying
              the output cleanly on terminals and in editors, but can truncate
              target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
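
       Because --tblout and --dfamtblout write space-delimited tables, they
       can be read with a few lines of Python. The following is a minimal
       sketch under stated assumptions: this page does not list the columns,
       so the number of columns is left as a parameter, and the sketch
       assumes that header and comment lines begin with '#'. The function
       name read_table and the three-column sample are made up for
       illustration and do not reflect the real column layout.

```python
def read_table(lines, n_columns):
    """Split space-delimited data lines into at most n_columns fields."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: comment/header lines begin with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        # maxsplit keeps a trailing free-text field (which may itself
        # contain spaces) together as the last column.
        rows.append(line.split(None, n_columns - 1))
    return rows


# Made-up three-column input; real --tblout files have more columns.
sample = [
    "# name   score  description\n",
    "modelA   57.3   first example family\n",
    "modelB   12.1   second example family\n",
]
rows = read_table(sample, n_columns=3)
for row in rows:
    print(row)
```

       Passing the column count explicitly keeps a trailing description
       field intact even when it contains spaces.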

OPTIONS FOR REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --dfamtblout). Hits are ranked by
       statistical significance (E-value).

       -E <x> Report target profiles with an E-value of <= <x>. The default
              is 10.0, meaning that on average, about 10 false positives will
              be reported per query, so you can see the top of the noise and
              decide for yourself if it's really noise.

       -T <x> Instead of thresholding output on E-value, report target
              profiles with a bit score of >= <x>.
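
       The reporting rule above can be illustrated with a short Python
       sketch: keep hits with E-value <= <x> by default, or hits with bit
       score >= <x> when a score threshold is given instead, and rank the
       survivors by E-value. The function name reported_hits, the dictionary
       keys, and the three hits are hypothetical; this mimics the documented
       behavior and is not HMMER's implementation.

```python
def reported_hits(hits, max_evalue=10.0, min_score=None):
    """Filter hits as -E (default) or -T would, then rank by E-value."""
    if min_score is not None:
        # Analogous to -T <x>: threshold on bit score instead of E-value.
        kept = [h for h in hits if h["score"] >= min_score]
    else:
        # Analogous to -E <x>: the default threshold is 10.0.
        kept = [h for h in hits if h["evalue"] <= max_evalue]
    # Most significant (smallest E-value) first.
    return sorted(kept, key=lambda h: h["evalue"])


# Made-up hits, for illustration only.
hits = [
    {"model": "famA", "evalue": 3.0, "score": 18.2},
    {"model": "famB", "evalue": 1e-30, "score": 110.5},
    {"model": "famC", "evalue": 45.0, "score": 9.7},
]
print([h["model"] for h in reported_hits(hits)])
print([h["model"] for h in reported_hits(hits, min_score=9.0)])
```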

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. Inclusion
       thresholds control which hits are considered to be reliable enough to
       be included in an output alignment or a subsequent search round. In
       nhmmscan, which (unlike nhmmer) does not have any alignment output,
       inclusion thresholds have little effect. They only affect which hits
       get marked as significant (!) or questionable (?) in hit output.

       --incE <x>
              Use an E-value of <= <x> as the inclusion threshold. The
              default is 0.01, meaning that on average, about 1 false positive
              would be expected in every 100 searches with different query
              sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the inclusion threshold. It would
              be unusual to use bit score thresholds with nhmmscan, because
              you don't expect a single score threshold to work for different
              profiles; different profiles have slightly different expected
              score distributions.
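
       The marking of hits as significant (!) or questionable (?) can be
       pictured with the following Python sketch. It is an illustration of
       the rule described above, not HMMER code; the function name mark_hit
       and its parameter names are hypothetical.

```python
def mark_hit(evalue, score, inc_e=0.01, inc_t=None):
    """Return "!" if the hit meets the inclusion threshold, else "?"."""
    if inc_t is not None:
        # Analogous to --incT <x>: include on bit score >= <x>.
        included = score >= inc_t
    else:
        # Analogous to --incE <x>: the default threshold is 0.01.
        included = evalue <= inc_e
    return "!" if included else "?"


print(mark_hit(1e-5, 50.0))             # meets the default E-value threshold
print(mark_hit(0.5, 20.0))              # reported, but not included
print(mark_hit(0.5, 20.0, inc_t=15.0))  # score threshold used instead
```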

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific bit score thresholds for
       each profile, superseding any thresholding based on statistical
       significance alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or NC) optional score threshold annotation; this is picked up by
       hmmbuild from Stockholm format alignment files. For a nucleotide model,
       each thresholding option has a single per-hit threshold <x>. This acts
       as if -T <x> --incT <x> has been applied specifically using each
       model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit score threshold in the model to set
              per-hit reporting and inclusion thresholds. GA thresholds are
              generally considered to be the reliable curated thresholds
              defining family membership; for example, in Dfam, these
              thresholds are applied when annotating a genome with a model of
              a family known to be found in that organism. They may allow for
              a minimal expected false discovery rate.

       --cut_nc
              Use the NC (noise cutoff) bit score threshold in the model to
              set per-hit reporting and inclusion thresholds. NC thresholds
              are less stringent than GA; in the context of Pfam, they are
              generally used to store the score of the highest-scoring known
              false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score threshold in the model to
              set per-hit reporting and inclusion thresholds. TC thresholds
              are more stringent than GA, and are generally considered to be
              the score of the lowest-scoring known true positive that is
              above all known false positives; for example, in Dfam, these
              thresholds are applied when annotating a genome with a model of
              a family not known to be found in that organism.
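
       The relationship between the three curated cutoffs can be shown with
       a small Python sketch. It assumes a model whose thresholds follow the
       ordering described above (NC less stringent than GA, TC more
       stringent than GA). The function name passes_cutoff and the threshold
       values are hypothetical and are not taken from any real profile.

```python
def passes_cutoff(score, thresholds, which):
    """Apply one model's curated cutoff ("GA", "NC", or "TC") to a hit.

    Mirrors the documented behavior of acting like -T <x> --incT <x>
    with <x> taken from the model's own annotation.
    """
    if which not in thresholds:
        # The options require the model to carry the annotation.
        raise ValueError(f"model has no {which} threshold annotation")
    return score >= thresholds[which]


# Made-up thresholds for one model, ordered NC <= GA <= TC.
model = {"NC": 20.0, "GA": 25.0, "TC": 30.0}

for name in ("NC", "GA", "TC"):
    print(name, passes_cutoff(27.0, model, name))
```

       A hit scoring 27.0 bits against this hypothetical model would be
       kept under --cut_nc and --cut_ga but dropped under --cut_tc.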

CONTROL OF THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       scanning-SSV filter, the Viterbi filter, and the Forward filter. The
       first filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm. There is also a bias filter step between SSV
       and Viterbi. Targets that pass all the steps in the acceleration
       pipeline are then subjected to postprocessing -- domain identification
       and scoring using the Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Turn off (nearly) all filters, including the bias filter, and
              run full Forward/Backward postprocessing on most of the target
              sequence. In contrast to hmmscan, where this flag really does
              turn off the filters entirely, the --max flag in nhmmscan sets
              the scanning-SSV filter threshold to 0.4, not 1.0. Use of this
              flag increases sensitivity somewhat, at a large cost in speed.

       --F1 <x>
              Set the P-value threshold for the SSV filter step. The default
              is 0.02, meaning that roughly 2% of the highest scoring
              nonhomologous targets are expected to pass the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The
              default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The
              default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
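
       To get a feel for what the filter thresholds mean, the following
       Python sketch estimates how many nonhomologous targets would be
       expected to get past each filter step. It reads each P-value
       threshold as the expected fraction of nonhomologous targets that
       pass that step, which is how this page describes --F1; applying the
       same reading to --F2 and --F3 is an assumption of the sketch. The
       names DEFAULT_THRESHOLDS and expected_survivors and the target count
       are hypothetical.

```python
# Default P-value thresholds as documented for --F1, --F2, and --F3.
DEFAULT_THRESHOLDS = {"F1": 0.02, "F2": 0.001, "F3": 1e-5}


def expected_survivors(n_targets, thresholds=None):
    """Expected number of nonhomologous targets passing each filter step."""
    if thresholds is None:
        thresholds = DEFAULT_THRESHOLDS
    return {name: n_targets * p for name, p in thresholds.items()}


# One million nonhomologous targets is an arbitrary illustrative count.
for step, count in expected_survivors(1_000_000).items():
    print(f"{step}: about {count:.0f} expected to pass")
```

       Loosening a threshold lets more targets reach the slower later
       stages, which matches the speed and sensitivity trade-off described
       above; it does not change the scores of the hits that are reported.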

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --seed <n>
              Set the random number seed to <n>. Some steps in postprocessing
              require Monte Carlo simulation. The default is to use a fixed
              seed (42), so that results are exactly reproducible. Any other
              positive integer will give different (but also reproducible)
              results. A choice of 0 uses an arbitrarily chosen seed.

       --qformat <s>
              Assert that the query sequence file is in format <s>. Accepted
              formats include fasta, embl, genbank, ddbj, uniprot, stockholm,
              pfam, a2m, and afa. The default is to autodetect the format of
              the file.

       --w_beta <x>
              Window length tail mass. The upper bound, W, on the length at
              which nhmmscan expects to find an instance of the model is set
              such that the fraction of all sequences generated by the model
              with length >= W is less than <x>. The default is 1e-7. This
              flag may be used to override the value of W established for the
              model by hmmbuild.

       --w_length <n>
              Override the model instance length upper bound, W, which is
              otherwise controlled by --w_beta. It should be larger than the
              model length. The value of W is used deep in the acceleration
              pipeline, and modest changes are not expected to impact results
              (though larger values of W do lead to longer run time). This
              flag may be used to override the value of W established for the
              model by hmmbuild.

       --toponly
              Only search the top strand. By default both the query sequence
              and its reverse-complement are searched.

       --bottomonly
              Only search the bottom (reverse-complement) strand. By default
              both the query sequence and its reverse-complement are searched.

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine - that is, it tries to maximize the use of your
              available processor cores. Setting <n> higher than the number of
              available cores is of little if any value, but you may want to
              set it to something less. You can also control this number by
              setting an environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off for your site or machine for some reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT)

              (Only available if optional MPI support was enabled at
              compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun.

              (Only available if optional MPI support was enabled at
              compile-time.)
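
       The strand options above (--toponly and --bottomonly) refer to the
       query sequence and its reverse-complement. The following Python
       sketch shows which sequences each setting would cover. It is an
       illustration only: the function names are hypothetical, only the
       unambiguous bases A, C, G, and T are complemented (other characters
       are left unchanged), and rejecting both flags together is a choice
       made by this sketch, not documented nhmmscan behavior.

```python
# Translation table for the four unambiguous DNA bases, both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse-complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strands_to_search(seq, toponly=False, bottomonly=False):
    """List the (strand, sequence) pairs covered by the given flags."""
    if toponly and bottomonly:
        # Sketch-level choice; see the lead-in.
        raise ValueError("choose at most one of toponly and bottomonly")
    strands = []
    if not bottomonly:
        strands.append(("top", seq))
    if not toponly:
        strands.append(("bottom", reverse_complement(seq)))
    return strands


print(strands_to_search("AACG"))
print(strands_to_search("AACG", toponly=True))
```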

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page.

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER 3.1b2                      February 2015                     nhmmscan(1)