nhmmer(1)

nhmmer(1)                        HMMER Manual                        nhmmer(1)

NAME

       nhmmer - search DNA queries against a DNA sequence database

SYNOPSIS

       nhmmer [options] queryfile seqdb

DESCRIPTION

       nhmmer is used to search one or more nucleotide queries against a
       nucleotide sequence database. For each query in queryfile, use that
       query to search the target database of sequences in seqdb, and output
       a ranked list of the hits with the most significant matches to the
       query. A query may be a profile model built using hmmbuild, a sequence
       alignment, or a single sequence. Sequence-based queries can be in a
       number of formats (see --qformat), and can typically be autodetected.
       Note that only Stockholm format supports queries made up of more than
       one sequence alignment.

       Either the query queryfile or the target seqdb may be '-' (a dash
       character), in which case the query file or target database input
       will be read from a <stdin> pipe instead of from a file. Only one
       input source can come through <stdin>, not both. If the queryfile
       contains more than one query, then seqdb cannot come from <stdin>,
       because we can't rewind the streaming target database to search it
       with another query.
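
       The <stdin> rules above can be checked before a search is launched.
       The following is a minimal sketch in Python; check_stdin_usage and
       its n_queries argument are hypothetical names used for illustration
       and are not part of HMMER.

```python
def check_stdin_usage(queryfile, seqdb, n_queries=1):
    """Return None if the inputs obey the '-' rules, else an error string.

    n_queries is the caller's own count of queries in queryfile
    (an assumption: this sketch does not read the file itself).
    """
    if queryfile == "-" and seqdb == "-":
        return "only one of queryfile and seqdb may be read from stdin"
    if seqdb == "-" and n_queries > 1:
        return "seqdb cannot come from stdin when queryfile has several queries"
    return None
```

       A return value of None means the combination is acceptable; any
       string describes the rule that was violated.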

       If the query is sequence-based (unaligned or aligned), a new file
       containing the HMM(s) built from the input(s) in queryfile may
       optionally be produced, with the filename set using the --hmmout flag.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout option saves output in a simple tabular format that is
       concise and easier to parse. The -o option allows redirecting the
       main output, including throwing it away in /dev/null.
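
       As an illustration of combining -o and --tblout, the sketch below
       assembles an argument list in Python without running anything. The
       file names query.hmm, genome.fa, and hits.tbl, and the helper name
       build_nhmmer_command, are placeholders assumed for this example.

```python
import shlex

def build_nhmmer_command(queryfile, seqdb, tblout, main_output="/dev/null"):
    """Assemble an nhmmer argument list; nothing is executed here."""
    return ["nhmmer", "-o", main_output, "--tblout", tblout, queryfile, seqdb]

# Hypothetical file names, for illustration only.
cmd = build_nhmmer_command("query.hmm", "genome.fa", "hits.tbl")
print(shlex.join(cmd))
```

       The list can be handed to subprocess.run() as is; the main output is
       discarded and only the tabular file is kept.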

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       -A <f> Save a multiple alignment of all significant hits (those
              satisfying "inclusion thresholds") to the file <f>.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-target output, with one data line per homologous target
              sequence found.

       --dfamtblout <f>
              Save a tabular (space-delimited) file summarizing the per-hit
              output, similar to --tblout but more succinct.

       --aliscoresout <f>
              Save to file <f> a list of per-position scores for each hit.
              This is useful, for example, in identifying regions of high
              score density for use in resolving overlapping hits from
              different models.

       --hmmout <f>
              If queryfile is sequence-based, write the internally computed
              HMM(s) to file <f>.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The
              default is a limit of 120 characters per line, which helps in
              displaying the output cleanly on terminals and in editors, but
              can truncate target sequence description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
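
       Because the --tblout and --dfamtblout files are space-delimited with
       one data line per hit, they can be read with a few lines of code. The
       sketch below is a generic reader, not a description of the exact
       column layout: it assumes that lines beginning with '#' are comments,
       and the n_fields argument is a hypothetical knob that keeps a trailing
       free-text field (which may itself contain spaces) in one piece.

```python
def read_table_rows(text, n_fields=None):
    """Split a space-delimited table into rows of string fields.

    Assumption: blank lines and lines starting with '#' carry no data.
    If n_fields is given, everything after the first n_fields - 1
    fields is kept together as the last field.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        if n_fields is None:
            rows.append(line.split())
        else:
            rows.append(line.split(None, n_fields - 1))
    return rows
```

       The number and meaning of the columns should be taken from the header
       lines of the file actually produced.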

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING

       By default, if a query is a single sequence from a file in fasta
       format, nhmmer uses a search model constructed from that sequence and
       a standard nucleotide substitution matrix for residue probabilities,
       along with two additional parameters for position-independent gap
       open and gap extend probabilities. These options allow the default
       single-sequence scoring parameters to be changed, and allow
       single-sequence scoring options to be applied to a single sequence
       coming from an aligned format.

       --singlemx
              If a single sequence query comes from a multiple sequence
              alignment file, such as in Stockholm format, the search model
              is by default constructed as is typically done for multiple
              sequence alignments. This option forces nhmmer to use the
              single-sequence method with a substitution score matrix.

       --mxfile <mxfile>
              Obtain residue alignment probabilities from the substitution
              matrix in file <mxfile>. The default score matrix is DNA1
              (this matrix is internal to HMMER and does not have to be
              available as a file). The format of a substitution matrix
              <mxfile> is the standard format accepted by BLAST, FASTA, and
              other sequence analysis software. See
              ftp.ncbi.nlm.nih.gov/blast/matrices/ for example files. (The
              only exception: we require matrices to be square, so for DNA,
              use files like NCBI's NUC.4.4, not NUC.4.2.)

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query
              model to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.
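
       The documented ranges for --popen and --pextend can be enforced by a
       wrapper script before nhmmer is called. A minimal sketch, in which
       check_gap_probabilities is a hypothetical helper name:

```python
def check_gap_probabilities(popen=0.02, pextend=0.4):
    """Validate the gap parameters and return them as command-line arguments.

    Raises ValueError if either value is outside the documented range.
    """
    if not (0.0 <= popen < 0.5):
        raise ValueError("--popen must be >= 0 and < 0.5")
    if not (0.0 <= pextend < 1.0):
        raise ValueError("--pextend must be >= 0 and < 1.0")
    return ["--popen", str(popen), "--pextend", str(pextend)]
```

       Called with no arguments, the helper returns the documented defaults.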

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --dfamtblout). Hits are ranked by
       statistical significance (E-value).

       -E <x> Report target sequences with an E-value of <= <x>. The default
              is 10.0, meaning that on average, about 10 false positives
              will be reported per query, so you can see the top of the
              noise and decide for yourself if it's really noise.

       -T <x> Instead of thresholding output on E-value, report target
              sequences with a bit score of >= <x>.
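
       The effect of -E and -T on a list of hits can be pictured with the
       following sketch. The tuple layout (name, E-value, bit score) and the
       function name reported_hits are assumptions made for this example;
       they do not mirror nhmmer's internal data structures.

```python
def reported_hits(hits, max_evalue=10.0, min_score=None):
    """Filter and rank hits the way -E / -T are described.

    hits is a list of (name, evalue, score) tuples. If min_score is
    given it replaces the E-value test, as -T replaces -E.
    """
    if min_score is not None:
        kept = [h for h in hits if h[2] >= min_score]
    else:
        kept = [h for h in hits if h[1] <= max_evalue]
    # Rank by statistical significance: smallest E-value first.
    return sorted(kept, key=lambda h: h[1])
```

       With the default of 10.0, a hit with an E-value of 5 is still listed,
       which is what lets the top of the noise be inspected.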

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds.
       Inclusion thresholds control which hits are considered to be reliable
       enough to be included in an output alignment or a subsequent search
       round, or marked as significant ("!") as opposed to questionable
       ("?") in hit output.

       --incE <x>
              Use an E-value of <= <x> as the inclusion threshold. The
              default is 0.01, meaning that on average, about 1 false
              positive would be expected in every 100 searches with
              different query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the inclusion threshold. By
              default this option is unset.
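
       The "!" and "?" marks follow directly from the inclusion test. A
       small sketch, with significance_mark as a hypothetical helper name:

```python
def significance_mark(evalue, score, inc_evalue=0.01, inc_score=None):
    """Return "!" for a hit that meets the inclusion threshold, else "?".

    If inc_score is given it replaces the E-value test, as --incT
    replaces --incE.
    """
    if inc_score is not None:
        included = score >= inc_score
    else:
        included = evalue <= inc_evalue
    return "!" if included else "?"
```

       A reported hit with an E-value of 0.5 would therefore appear in the
       output but be marked "?" under the default settings.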

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific bit score thresholds
       for each profile, superseding any thresholding based on statistical
       significance alone.

       To use these options, the profile must contain the appropriate (GA,
       TC, and/or NC) optional score threshold annotation; this is picked up
       by hmmbuild from Stockholm format alignment files. For a nucleotide
       model, each thresholding option has a single per-hit threshold <x>.
       This acts as if -T <x> --incT <x> has been applied specifically using
       each model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit score threshold in the model to set
              per-hit reporting and inclusion thresholds. GA thresholds are
              generally considered to be the reliable curated thresholds
              defining family membership; for example, in Dfam, these
              thresholds are applied when annotating a genome with a model
              of a family known to be found in that organism. They may allow
              for minimal expected false discovery rate.

       --cut_nc
              Use the NC (noise cutoff) bit score threshold in the model to
              set per-hit reporting and inclusion thresholds. NC thresholds
              are less stringent than GA; in the context of Pfam, they are
              generally used to store the score of the highest-scoring known
              false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score threshold in the model
              to set per-hit reporting and inclusion thresholds. TC
              thresholds are more stringent than GA, and are generally
              considered to be the score of the lowest-scoring known true
              positive that is above all known false positives; for example,
              in Dfam, these thresholds are applied when annotating a genome
              with a model of a family not known to be found in that
              organism.
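
       The equivalence between a --cut_* option and the pair -T <x>
       --incT <x> can be written out as follows. This is an illustrative
       sketch only: the dictionary of per-model cutoffs and the function
       name cutoff_as_thresholds are assumptions, and the bit scores in the
       usage note are made up.

```python
def cutoff_as_thresholds(option, model_cutoffs):
    """Map --cut_ga / --cut_nc / --cut_tc to the equivalent -T / --incT values.

    model_cutoffs holds the model's curated bit scores, keyed by
    "GA", "NC", or "TC" (only the annotations the model actually has).
    """
    key = {"--cut_ga": "GA", "--cut_nc": "NC", "--cut_tc": "TC"}[option]
    if key not in model_cutoffs:
        raise ValueError(f"model has no {key} annotation")
    x = model_cutoffs[key]
    # The same per-hit value is used for reporting and for inclusion.
    return {"T": x, "incT": x}
```

       For a model annotated with a hypothetical GA of 25.0 bits,
       cutoff_as_thresholds("--cut_ga", {"GA": 25.0}) yields 25.0 for both
       thresholds.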

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       scanning-SSV filter, the Viterbi filter, and the Forward filter. The
       first filter is the fastest and most approximate; the last is the
       full Forward scoring algorithm. There is also a bias filter step
       between SSV and Viterbi. Targets that pass all the steps in the
       acceleration pipeline are then subjected to postprocessing -- domain
       identification and scoring using the Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Turn off (nearly) all filters, including the bias filter, and
              run full Forward/Backward postprocessing on most of the target
              sequence. In contrast to phmmer and hmmsearch, where this flag
              really does turn off the filters entirely, the --max flag in
              nhmmer sets the scanning-SSV filter threshold to 0.4, not 1.0.
              Use of this flag increases sensitivity somewhat, at a large
              cost in speed.

       --F1 <x>
              Set the P-value threshold for the SSV filter step. The default
              is 0.02, meaning that roughly 2% of the highest scoring
              nonhomologous targets are expected to pass the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The
              default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The
              default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences
              may pass the filter with biased queries, leading to slower
              than expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
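
       The cascade of P-value thresholds can be sketched as below. This is a
       simplified picture, not HMMER's implementation: it ignores the bias
       filter, assumes each stage passes a target whose P-value is at or
       below the threshold, and uses hypothetical function names.

```python
def passes_filters(p_ssv, p_vit, p_fwd, f1=0.02, f2=0.001, f3=1e-5):
    """True if a target survives the SSV, Viterbi, and Forward stages.

    Each argument p_* is the P-value the target obtained at that stage;
    f1, f2, f3 default to the documented --F1, --F2, --F3 values.
    """
    return p_ssv <= f1 and p_vit <= f2 and p_fwd <= f3

def expected_false_passes(n_nonhomologous, threshold):
    """Expected number of nonhomologous targets passing one filter stage."""
    return n_nonhomologous * threshold
```

       For example, expected_false_passes(1000000, 0.02) is about 20000,
       which is the "roughly 2%" figure quoted for --F1.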

OPTIONS FOR SPECIFYING THE ALPHABET

       --dna  Assert that the input sequences are DNA, bypassing alphabet
              autodetection.

       --rna  Assert that the input sequences are RNA, bypassing alphabet
              autodetection.

OPTIONS CONTROLLING SEED SEARCH HEURISTIC

       When searching with nhmmer, one may optionally precompute a binary
       version of the target database, using makehmmerdb, then search
       against that database. Using default settings, this yields a roughly
       10-fold acceleration with small loss of sensitivity on benchmarks.
       This is achieved using a heuristic method that searches for seeds
       (ungapped alignments) around which full processing is done. This is
       essentially a replacement for the SSV stage. (This method has been
       extensively tested, but should still be treated as somewhat
       experimental.) The following options only impact nhmmer if the value
       of --tformat is fmindex.

       Changing parameters for this seed-finding step will impact both speed
       and sensitivity - typically faster search leads to lower sensitivity.

       --seed_max_depth <n>
              The seed step requires that a seed reach a specified bit score
              in length no longer than <n>. By default, this value is 15.
              Longer seeds allow a greater chance of meeting the bit score
              threshold, leading to diminished filtering (greater
              sensitivity, slower run time).

       --seed_sc_thresh <x>
              The seed must reach score <x> (in bits). The default is 15.0
              bits. A higher threshold increases filtering stringency,
              leading to faster run times and lower sensitivity.

       --seed_sc_density <x>
              Either all prefixes or all suffixes of a seed must have bit
              density (bits per aligned position) of at least <x>. The
              default is 0.8 bits/position. An increase in the density
              requirement leads to increased filtering stringency, thus
              faster run times and lower sensitivity.

       --seed_drop_max_len <n>
              A seed may not have a run of length <n> in which the score
              drops by --seed_drop_lim or more. Basically, this prunes seeds
              that go through long slightly-negative seed extensions. The
              default is 4. Increasing the limit causes (slightly)
              diminished filtering efficiency, thus slower run times and
              higher sensitivity. (minor tuning option)

       --seed_drop_lim <x>
              In a seed, there may be no run of length --seed_drop_max_len
              in which the score drops by --seed_drop_lim. The default is
              0.3 bits. Larger numbers mean less filtering. (minor tuning
              option)

       --seed_req_pos <n>
              A seed must contain a run of at least <n> positive-scoring
              matches. The default is 5. Larger values mean increased
              filtering. (minor tuning option)

       --seed_ssv_length <n>
              After finding a short seed, an ungapped alignment is extended
              in both directions in an attempt to meet the --F1 score
              threshold. The window through which this ungapped alignment
              extends is length <n>. The default is 70. Decreasing this
              value slightly reduces run time, at a small risk of reduced
              sensitivity. (minor tuning option)
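
       To make the seed criteria concrete, the sketch below applies a
       literal reading of four of them (--seed_max_depth, --seed_sc_thresh,
       --seed_sc_density, and --seed_req_pos) to a list of per-position bit
       scores. It is an illustration of the rules as worded above, not
       HMMER's actual seed-finding code; the function name seed_passes and
       the score-list representation are assumptions, and the two drop
       options are left out.

```python
def seed_passes(scores, max_depth=15, sc_thresh=15.0, sc_density=0.8, req_pos=5):
    """Check one ungapped seed, given its per-position bit scores."""
    n = len(scores)
    # Length and total-score rules.
    if n == 0 or n > max_depth or sum(scores) < sc_thresh:
        return False

    # Density rule: every prefix, or every suffix, averages >= sc_density.
    def all_dense(seq):
        total = 0.0
        for i, s in enumerate(seq, start=1):
            total += s
            if total / i < sc_density:
                return False
        return True

    # Walking the reversed list accumulates the suffixes.
    if not (all_dense(scores) or all_dense(scores[::-1])):
        return False

    # Positive-run rule: at least req_pos consecutive positive scores.
    run = best = 0
    for s in scores:
        run = run + 1 if s > 0 else 0
        best = max(best, run)
    return best >= req_pos
```

       Under this reading, ten positions scoring 1.5 bits each pass with the
       defaults, while nine such positions fall short of the 15.0-bit
       threshold.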

OTHER OPTIONS

       --qformat <s>
              Assert that input queryfile is a sequence file (unaligned or
              aligned), in format <s>, bypassing format autodetection.
              Common choices for <s> include: fasta, embl, genbank.
              Alignment formats also work, and will serve as the basis for
              automatic creation of a profile HMM used for searching; common
              choices include: stockholm, a2m, afa, psiblast, clustal,
              phylip. For more information, and for codes for some less
              common formats, see main documentation.

       --qsingle_seqs
              Force queryfile to be read as individual sequences, even if it
              is in an MSA format. For example, if the input is in aligned
              Stockholm format, the --qsingle_seqs flag will cause each
              sequence in that alignment to be used as a separate query
              sequence.

       --tformat <s>
              Assert that target sequence database seqdb is in format <s>,
              bypassing format autodetection. Common choices for <s>
              include: fasta, embl, genbank, ncbi, fmindex. Alignment
              formats also work; common choices include: stockholm, a2m,
              afa, psiblast, clustal, phylip. For more information, and for
              codes for some less common formats, see main documentation.
              The string <s> is case-insensitive (fasta or FASTA both work).
              The format ncbi indicates that the database file is a binary
              file produced using makeblastdb. The format fmindex indicates
              that the database file is a binary file produced using
              makehmmerdb.

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> For the purposes of per-hit E-value calculations, assert that
              the total size of the target database is <x> million
              nucleotides, rather than the actual number of targets seen.

       --seed <n>
              Set the random number seed to <n>. Some steps in
              postprocessing require Monte Carlo simulation. The default is
              to use a fixed seed (42), so that results are exactly
              reproducible. Any other positive integer will give different
              (but also reproducible) results. A choice of 0 uses a randomly
              chosen seed.

       --w_beta <x>
              Window length tail mass. The upper bound, W, on the length at
              which nhmmer expects to find an instance of the model is set
              such that the fraction of all sequences generated by the model
              with length >= W is less than <x>. The default is 1e-7. This
              flag may be used to override the value of W established for
              the model by hmmbuild, or when the query is sequence-based.

       --w_length <n>
              Override the model instance length upper bound, W, which is
              otherwise controlled by --w_beta. It should be larger than the
              model length. The value of W is used deep in the acceleration
              pipeline, and modest changes are not expected to impact
              results (though larger values of W do lead to longer run
              time). This flag may be used to override the value of W
              established for the model by hmmbuild, or when the query is
              sequence-based.
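
       The definition of W by tail mass can be illustrated on a toy length
       distribution. In the sketch below, length_pmf is a hypothetical list
       in which entry i is the probability that the model emits a sequence
       of length i; how HMMER obtains that distribution from a real model is
       not shown, and window_length is not an HMMER function.

```python
def window_length(length_pmf, beta=1e-7):
    """Return the smallest W such that P(length >= W) < beta.

    length_pmf[i] is P(length == i); lengths beyond the end of the
    list are taken to have probability zero.
    """
    tail = 0.0
    w = len(length_pmf)  # tail mass at or beyond this length is zero
    for i in range(len(length_pmf) - 1, -1, -1):
        if tail + length_pmf[i] >= beta:
            break  # including length i would reach or exceed beta
        tail += length_pmf[i]
        w = i
    return w
```

       With the toy distribution [0.0, 0.5, 0.3, 0.15, 0.05] and a beta of
       0.1, the result is 4, because only 5% of sequences have length 4 or
       more while 20% have length 3 or more. A smaller beta gives a larger
       W.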

       --watson
              Only search the top strand. By default both the query sequence
              and its reverse-complement are searched.

       --crick
              Only search the bottom (reverse-complement) strand. By default
              both the query sequence and its reverse-complement are
              searched.
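
       For readers unfamiliar with the term, the reverse complement of a DNA
       sequence can be computed as in this short sketch. It handles only the
       four unambiguous bases; any other character is left unchanged, which
       is a simplification made for this example.

```python
# Translation table for the four unambiguous DNA bases, both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

       For instance, reverse_complement("ATGC") gives "GCAT".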

       --cpu <n>
              Set the number of parallel worker threads to <n>. On multicore
              machines, the default is 2. You can also control this number
              by setting an environment variable, HMMER_NCPU. There is also
              a master thread, so the actual number of threads that HMMER
              spawns is <n>+1.

              This option is not available if HMMER was compiled with POSIX
              threads support turned off.
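
       A wrapper script might set HMMER_NCPU and account for the extra
       master thread as sketched below. The helper names worker_env and
       total_threads are hypothetical.

```python
import os

def worker_env(n_workers, base_env=None):
    """Return a copy of the environment with HMMER_NCPU set to n_workers."""
    env = dict(os.environ if base_env is None else base_env)
    env["HMMER_NCPU"] = str(n_workers)
    return env

def total_threads(n_workers=2):
    """Worker threads plus the one master thread."""
    return n_workers + 1
```

       The dictionary returned by worker_env() can be passed as the env
       argument of subprocess.run(); with the default of 2 workers, three
       threads are spawned in total.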

       --stall
              For debugging the MPI master/worker version: pause after
              start, to enable the developer to attach debuggers to the
              running master and worker(s) processes. Send SIGCONT signal to
              release the pause. (Under gdb: (gdb) signal SIGCONT) (Only
              available if optional MPI support was enabled at
              compile-time.)

       --mpi  Run under MPI control with master/worker parallelization
              (using mpirun, for example, or equivalent). Only available if
              optional MPI support was enabled at compile-time.

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org

HMMER 3.3.2                        Nov 2020                          nhmmer(1)