nhmmer(1)

nhmmer(1)                        HMMER Manual                        nhmmer(1)

NAME

       nhmmer - search DNA/RNA queries against a DNA/RNA sequence database

SYNOPSIS

       nhmmer [options] <queryfile> <seqdb>

DESCRIPTION

       nhmmer is used to search one or more nucleotide queries against a
       nucleotide sequence database. For each query in <queryfile>, use that
       query to search the target database of sequences in <seqdb>, and output
       a ranked list of the hits with the most significant matches to the
       query. A query may be a profile model built using hmmbuild, a sequence
       alignment, or a single sequence. Sequence-based queries can be in a
       number of formats (see --qformat), and can typically be autodetected.
       Note that only Stockholm format supports queries made up of more than
       one sequence alignment.

       Either the query <queryfile> or the target <seqdb> may be '-' (a dash
       character), in which case the query file or target database input will
       be read from a <stdin> pipe instead of from a file. Only one input
       source can come through <stdin>, not both. If the query is
       sequence-based and passed via <stdin>, the --qformat flag must be used.
       If the <queryfile> contains more than one query, then <seqdb> cannot
       come from <stdin>, because we can't rewind the streaming target
       database to search it with another profile.
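
       The three <stdin> rules above can be checked before launching a
       search. The following is a minimal sketch in Python, not part of
       HMMER; the function and all of its parameter names are hypothetical,
       and nhmmer performs its own validation regardless.

```python
def check_stdin_usage(queryfile, seqdb, qformat=None,
                      query_is_sequence_based=False, n_queries=1):
    """Return a list of problems with a proposed use of '-' (stdin).

    Hypothetical helper: it only mirrors the three rules stated in the
    manual text and does not inspect any files.
    """
    problems = []
    # Rule 1: only one of the two inputs may be read from stdin.
    if queryfile == "-" and seqdb == "-":
        problems.append("only one input may come from stdin")
    # Rule 2: a sequence-based query on stdin needs an explicit --qformat.
    if queryfile == "-" and query_is_sequence_based and qformat is None:
        problems.append("sequence-based query on stdin requires --qformat")
    # Rule 3: a streamed target cannot be rewound for a second query.
    if seqdb == "-" and n_queries > 1:
        problems.append("multiple queries cannot search a stdin target")
    return problems
```

       An empty list means none of the three rules is violated.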

       If the query is sequence-based, and not from <stdin>, a new file
       containing the HMM(s) built from the input(s) in <queryfile> may
       optionally be produced, with the filename set using the --hmmout flag.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout option saves output in a simple tabular format that is
       concise and easier to parse. The -o option allows redirecting the main
       output, including throwing it away in /dev/null.
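
       As an illustration of the -o and --tblout combination described
       above, here is a small Python sketch. It only assembles an argument
       list and splits table lines; nothing is executed. The helper names
       are hypothetical, and the treatment of lines starting with '#' as
       comments is an assumption about the table layout, which this page
       does not specify.

```python
def build_nhmmer_command(queryfile, seqdb, tblout, main_output="/dev/null"):
    """Assemble an nhmmer argument list (hypothetical helper; runs nothing).

    Uses only options documented on this page: -o and --tblout.
    """
    return ["nhmmer", "-o", main_output, "--tblout", tblout, queryfile, seqdb]


def read_table_rows(lines):
    """Split space-delimited table lines into fields.

    Assumption: lines beginning with '#' are comments. Fields are split
    naively on whitespace, so free text containing spaces is fragmented.
    """
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        rows.append(stripped.split())
    return rows
```

       The returned list could be handed to a process launcher such as
       Python's subprocess.run.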

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       -A <f> Save a multiple alignment of all significant hits (those
              satisfying inclusion thresholds) to the file <f>.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-target output, with one data line per homologous target
              sequence found.

       --dfamtblout <f>
              Save a tabular (space-delimited) file summarizing the per-hit
              output, similar to --tblout but more succinct.

       --aliscoresout <f>
              Save to file a list of per-position scores for each hit. This
              is useful, for example, in identifying regions of high score
              density for use in resolving overlapping hits from different
              models.

       --hmmout <f>
              If <queryfile> is sequence-based, write the internally-computed
              HMM(s) to <f>.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The default
              is a limit of 120 characters per line, which helps in displaying
              the output cleanly on terminals and in editors, but can truncate
              target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --dfamtblout). Hits are ranked by
       statistical significance (E-value).

       -E <x> Report target sequences with an E-value of <= <x>. The default
              is 10.0, meaning that on average, about 10 false positives will
              be reported per query, so you can see the top of the noise and
              decide for yourself if it's really noise.

       -T <x> Instead of thresholding output on E-value, report target
              sequences with a bit score of >= <x>.
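
       The relationship between -E and -T can be sketched in Python as
       follows. This illustrates the rule as stated above and is not
       HMMER's own code; the function and parameter names are hypothetical.

```python
def is_reported(evalue, score, max_evalue=10.0, min_score=None):
    """Decide whether a hit passes the reporting threshold.

    max_evalue plays the role of -E (default 10.0). If min_score is
    given it plays the role of -T and replaces the E-value test.
    """
    if min_score is not None:
        return score >= min_score
    return evalue <= max_evalue
```

       With the defaults, a hit with E-value 5.0 is reported and a hit with
       E-value 11.0 is not; once a bit score threshold is supplied, the
       E-value no longer affects the decision.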

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. Inclusion
       thresholds control which hits are considered to be reliable enough to
       be included in an output alignment or a subsequent search round, or
       marked as significant ("!") as opposed to questionable ("?") in hit
       output.

       --incE <x>
              Use an E-value of <= <x> as the inclusion threshold. The
              default is 0.01, meaning that on average, about 1 false positive
              would be expected in every 100 searches with different query
              sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the inclusion threshold. By
              default this option is unset.
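
       A short Python sketch of how the "!" and "?" markers follow from the
       inclusion threshold, together with the arithmetic behind the "1 in
       100 searches" statement. The function names are hypothetical and the
       code only restates the rules given above.

```python
def hit_marker(evalue, score, inc_evalue=0.01, inc_score=None):
    """Return '!' for a hit meeting the inclusion threshold, else '?'.

    inc_evalue plays the role of --incE (default 0.01); inc_score, when
    given, plays the role of --incT and replaces the E-value test.
    """
    if inc_score is not None:
        included = score >= inc_score
    else:
        included = evalue <= inc_evalue
    return "!" if included else "?"


def expected_false_positives(evalue_threshold, n_searches):
    """Expected false positives over n_searches independent searches."""
    return evalue_threshold * n_searches
```

       At the default --incE of 0.01, one hundred independent searches are
       expected to include about one false positive in total.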

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific bit score thresholds for
       each profile, superseding any thresholding based on statistical
       significance alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or NC) optional score threshold annotation; this is picked up by
       hmmbuild from Stockholm format alignment files. For a nucleotide model,
       each thresholding option has a single per-hit threshold <x>. This acts
       as if -T <x> --incT <x> has been applied specifically using each
       model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit score threshold in the model to set
              per-hit reporting and inclusion thresholds. GA thresholds are
              generally considered to be the reliable curated thresholds
              defining family membership; for example, in Dfam, these
              thresholds are applied when annotating a genome with a model of
              a family known to be found in that organism. They may allow for
              minimal expected false discovery rate.

       --cut_nc
              Use the NC (noise cutoff) bit score threshold in the model to
              set per-hit reporting and inclusion thresholds. NC thresholds
              are less stringent than GA; in the context of Pfam, they are
              generally used to store the score of the highest-scoring known
              false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score threshold in the model to
              set per-hit reporting and inclusion thresholds. TC thresholds
              are more stringent than GA, and are generally considered to be
              the score of the lowest-scoring known true positive that is
              above all known false positives; for example, in Dfam, these
              thresholds are applied when annotating a genome with a model of
              a family not known to be found in that organism.
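
       The effect of the three cutoff options can be pictured with a small
       Python sketch. The dictionary of cutoffs and the example scores are
       made up for illustration, and the function name is hypothetical;
       real cutoffs come from the model's own annotation.

```python
def apply_model_cutoff(hit_scores, model_cutoffs, which):
    """Keep hits whose bit score meets the chosen curated cutoff.

    which is 'GA', 'NC' or 'TC'; model_cutoffs maps those labels to bit
    scores (hypothetical structure). The one threshold is used for both
    reporting and inclusion, as described for nucleotide models.
    """
    if which not in model_cutoffs:
        raise ValueError(f"model has no {which} annotation")
    threshold = model_cutoffs[which]
    return [score for score in hit_scores if score >= threshold]


# Made-up cutoffs showing the usual ordering NC <= GA <= TC.
example_cutoffs = {"NC": 18.0, "GA": 25.0, "TC": 31.5}
example_scores = [12.0, 20.0, 27.0, 40.0]
```

       With these made-up numbers, the NC cutoff keeps three hits, GA keeps
       two, and TC keeps one, reflecting the increasing stringency.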

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       scanning-SSV filter, the Viterbi filter, and the Forward filter. The
       first filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm. There is also a bias filter step between SSV
       and Viterbi. Targets that pass all the steps in the acceleration
       pipeline are then subjected to postprocessing -- domain identification
       and scoring using the Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Turn off (nearly) all filters, including the bias filter, and
              run full Forward/Backward postprocessing on most of the target
              sequence. In contrast to phmmer and hmmsearch, where this flag
              really does turn off the filters entirely, the --max flag in
              nhmmer sets the scanning-SSV filter threshold to 0.4, not 1.0.
              Use of this flag increases sensitivity somewhat, at a large cost
              in speed.

       --F1 <x>
              Set the P-value threshold for the SSV filter step. The default
              is 0.02, meaning that roughly 2% of the highest scoring
              nonhomologous targets are expected to pass the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The
              default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The
              default is 1e-5.
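
       To get a feel for the default thresholds, the following Python
       sketch estimates how many nonhomologous candidates would survive
       each stage. It assumes each P-value threshold can be read as the
       fraction of all nonhomologous candidates that get past that stage,
       and it ignores the bias filter; both are simplifications, and the
       names are hypothetical.

```python
# Default P-value thresholds quoted on this page.
DEFAULT_FILTER_PVALUES = {
    "SSV (--F1)": 0.02,
    "Viterbi (--F2)": 0.001,
    "Forward (--F3)": 1e-5,
}


def expected_survivors(n_nonhomologous, thresholds=DEFAULT_FILTER_PVALUES):
    """Rough expected count of nonhomologous candidates passing each stage.

    Simplification: count = n * P-value threshold, stage by stage.
    """
    return {stage: n_nonhomologous * p for stage, p in thresholds.items()}
```

       Under these assumptions, a million nonhomologous candidates shrink
       to roughly 20,000 after SSV, 1,000 after Viterbi, and 10 after
       Forward.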
242
243
244       --nobias
245              Turn  off  the bias filter. This increases sensitivity somewhat,
246              but can come at a high cost in speed, especially  if  the  query
247              has  biased  residue  composition (such as a repetitive sequence
248              region, or if it is a membrane protein  with  large  regions  of
249              hydrophobicity). Without the bias filter, too many sequences may
250              pass the filter with biased  queries,  leading  to  slower  than
251              expected  performance  as  the  computationally  intensive  For‐
252              ward/Backward algorithms shoulder an abnormally heavy load.
253
254
255
256

OPTIONS FOR SPECIFYING THE ALPHABET

       The alphabet type of the target database (DNA or RNA) is autodetected
       by default, by looking at the composition of the <seqdb>.
       Autodetection is normally quite reliable, but occasionally alphabet
       type may be ambiguous and autodetection can fail (for instance, when
       the first sequence starts with a run of ambiguous characters). To avoid
       this, or to increase robustness in automated analysis pipelines, you
       may specify the alphabet type of <seqdb> with these options.

       --dna  Specify that all sequences in <seqdb> are DNAs.

       --rna  Specify that all sequences in <seqdb> are RNAs.

OPTIONS CONTROLLING SEED SEARCH HEURISTIC

       When searching with nhmmer, one may optionally precompute a binary
       version of the target database, using makehmmerdb, then search against
       that database. Using default settings, this yields a roughly 10-fold
       acceleration with small loss of sensitivity on benchmarks. This is
       achieved using a heuristic method that searches for seeds (ungapped
       alignments) around which full processing is done. This is essentially a
       replacement for the SSV stage. (This method has been extensively
       tested, but should still be treated as somewhat experimental.) The
       following options only impact nhmmer if the value of --tformat is
       hmmerfm.

       Changing parameters for this seed-finding step will impact both speed
       and sensitivity - typically faster search leads to lower sensitivity.

       --seed_max_depth <n>
              The seed step requires that a seed reach a specified bit score
              in length no longer than <n>. By default, this value is 15.
              Longer seeds allow a greater chance of meeting the bit score
              threshold, leading to diminished filtering (greater sensitivity,
              slower run time).

       --seed_sc_thresh <x>
              The seed must reach score <x> (in bits). The default is 15.0
              bits. A higher threshold increases filtering stringency, leading
              to faster run times and lower sensitivity.

       --seed_sc_density <x>
              Either all prefixes or all suffixes of a seed must have bit
              density (bits per aligned position) of at least <x>. The default
              is 0.8 bits/position. An increase in the density requirement
              leads to increased filtering stringency, thus faster run times
              and lower sensitivity.

       --seed_drop_max_len <n>
              A seed may not have a run of length <n> in which the score drops
              by --seed_drop_lim or more. Basically, this prunes seeds that go
              through long slightly-negative seed extensions. The default is
              4. Increasing the limit causes (slightly) diminished filtering
              efficiency, thus slower run times and higher sensitivity. (minor
              tuning option)

       --seed_drop_lim <x>
              In a seed, there may be no run of length --seed_drop_max_len in
              which the score drops by <x> or more. The default is 0.3 bits.
              Larger numbers mean less filtering. (minor tuning option)

       --seed_req_pos <n>
              A seed must contain a run of at least <n> positive-scoring
              matches. The default is 5. Larger values mean increased
              filtering. (minor tuning option)

       --seed_ssv_length <n>
              After finding a short seed, an ungapped alignment is extended in
              both directions in an attempt to meet the --F1 score threshold.
              The window through which this ungapped alignment extends is
              length <n>. The default is 70. Decreasing this value slightly
              reduces run time, at a small risk of reduced sensitivity. (minor
              tuning option)
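
       Several of the seed criteria above can be expressed as simple checks
       on a list of per-position scores. The Python sketch below is one
       reading of the descriptions of --seed_max_depth, --seed_sc_thresh,
       --seed_sc_density and --seed_req_pos; it is not HMMER's
       implementation, all names are hypothetical, and the drop-related and
       window-length options are not modelled.

```python
from itertools import accumulate


def density_ok(scores, min_density=0.8):
    """True if every prefix, or every suffix, averages >= min_density bits."""
    def all_prefixes_ok(values):
        return all(total / (i + 1) >= min_density
                   for i, total in enumerate(accumulate(values)))
    # Suffixes of the seed are prefixes of the reversed seed.
    return all_prefixes_ok(scores) or all_prefixes_ok(scores[::-1])


def has_positive_run(scores, req_pos=5):
    """True if at least req_pos consecutive positions score above zero."""
    run = 0
    for s in scores:
        run = run + 1 if s > 0 else 0
        if run >= req_pos:
            return True
    return False


def seed_passes(scores, sc_thresh=15.0, max_depth=15,
                sc_density=0.8, req_pos=5):
    """Combine the four checks using the default values quoted above."""
    return (len(scores) <= max_depth
            and sum(scores) >= sc_thresh
            and density_ok(scores, sc_density)
            and has_positive_run(scores, req_pos))
```

       For example, ten positions scoring 1.5 bits each satisfy all four
       checks, while ten positions scoring 1.0 bit each fall short of the
       15.0-bit threshold.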

OTHER OPTIONS

       --tformat <s>
              Assert that the target sequence database file is in format <s>.
              Accepted formats include fasta, embl, genbank, ddbj, uniprot,
              stockholm, pfam, a2m, afa, and hmmerfm. The default is to
              autodetect the format of the file. The format hmmerfm indicates
              that the database file is a binary file produced using
              makehmmerdb (this format is not currently autodetected).

       --qformat <s>
              Declare that the input <queryfile> is in format <s>. This is
              used when the query is sequence-based, rather than made up of
              profile model(s). Currently the accepted multiple alignment
              sequence file formats include Stockholm, Aligned FASTA, Clustal,
              NCBI PSI-BLAST, PHYLIP, Selex, and UCSC SAM A2M. The default is
              to autodetect the format of the file.

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> For the purposes of per-hit E-value calculations, assert that
              the total size of the target database is <x> million
              nucleotides, rather than the actual number of targets seen.

       --seed <n>
              Set the random number seed to <n>. Some steps in postprocessing
              require Monte Carlo simulation. The default is to use a fixed
              seed (42), so that results are exactly reproducible. Any other
              positive integer will give different (but also reproducible)
              results. A choice of 0 uses a randomly chosen seed.

       --w_beta <x>
              Window length tail mass. The upper bound, W, on the length at
              which nhmmer expects to find an instance of the model is set
              such that the fraction of all sequences generated by the model
              with length >= W is less than <x>. The default is 1e-7. This
              flag may be used to override the value of W established for the
              model by hmmbuild, or when the query is sequence-based.
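
       The definition of W in terms of tail mass can be illustrated with a
       toy length distribution. The Python sketch below finds the smallest
       W whose tail probability falls below the given tail mass; the input
       list is a hypothetical stand-in for the model's length distribution,
       which HMMER derives internally, and the function name is likewise
       hypothetical.

```python
def window_length_for_tail_mass(length_probs, beta=1e-7):
    """Smallest W with P(length >= W) < beta.

    length_probs[i] is the probability that a generated sequence has
    length exactly i (hypothetical input, assumed to sum to about 1).
    """
    tail = sum(length_probs)          # P(length >= 0)
    for w, p in enumerate(length_probs):
        if tail < beta:
            return w
        tail -= p                     # now P(length >= w + 1)
    # Beyond the listed lengths the tail mass is zero.
    return len(length_probs)
```

       A smaller tail mass yields a larger W, since more of the length
       distribution must fit inside the window.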

       --w_length <n>
              Override the model instance length upper bound, W, which is
              otherwise controlled by --w_beta. It should be larger than the
              model length. The value of W is used deep in the acceleration
              pipeline, and modest changes are not expected to impact results
              (though larger values of W do lead to longer run time). This
              flag may be used to override the value of W established for the
              model by hmmbuild, or when the query is sequence-based.

       --toponly
              Only search the top strand. By default both the query sequence
              and its reverse-complement are searched.

       --bottomonly
              Only search the bottom (reverse-complement) strand. By default
              both the query sequence and its reverse-complement are searched.
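
       For readers unfamiliar with the term, the reverse complement of a
       DNA sequence can be computed as in the Python sketch below. It is an
       illustration only: it handles the four DNA bases in either case and
       leaves any other character unchanged, which is a simplification of
       how ambiguity codes are treated in practice.

```python
# Translation table for the four DNA bases, upper and lower case.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_DNA_COMPLEMENT)[::-1]
```

       Applying the function twice returns the original sequence.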

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine - that is, it tries to maximize the use of your
              available processor cores. Setting <n> higher than the number of
              available cores is of little if any value, but you may want to
              set it to something less. You can also control this number by
              setting an environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off at compile-time for your site or machine for some
              reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
              optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun. (Only available if
              optional MPI support was enabled at compile-time.)

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page ().

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER 3.1b2                      February 2015                       nhmmer(1)