hmmsearch(1)

hmmsearch(1)                     HMMER Manual                     hmmsearch(1)

NAME

       hmmsearch - search profile(s) against a sequence database

SYNOPSIS

       hmmsearch [options] <hmmfile> <seqdb>

DESCRIPTION

       hmmsearch searches one or more profiles against a sequence database.
       For each profile in <hmmfile>, it uses that query profile to search the
       target database of sequences in <seqdb>, and outputs ranked lists of
       the sequences with the most significant matches to the profile. To
       build profiles from multiple alignments, see hmmbuild.

       Either the query <hmmfile> or the target <seqdb> may be '-' (a dash
       character), in which case the query profile or target database input
       is read from a <stdin> pipe instead of from a file. Only one input
       source can come through <stdin>, not both. A further restriction is
       that if <hmmfile> contains more than one query profile, <seqdb> cannot
       come from <stdin>, because the streaming target database cannot be
       rewound to search it with another profile.

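       The <stdin> rules above can be checked before launching a search. The
       following is a minimal sketch, not part of HMMER: the function name
       build_hmmsearch_argv and the multi_profile flag are hypothetical, and
       the caller is assumed to know whether <hmmfile> holds more than one
       profile.

```python
def build_hmmsearch_argv(hmmfile, seqdb, options=(), multi_profile=False):
    """Build an argument list for hmmsearch, enforcing the <stdin> rules.

    hmmfile / seqdb: file paths, or "-" to read that input from stdin.
    multi_profile:   hypothetical flag; True if hmmfile holds >1 profile.
    """
    # Only one input source may come through stdin.
    if hmmfile == "-" and seqdb == "-":
        raise ValueError("only one of <hmmfile> and <seqdb> may be '-'")
    # A streamed target database cannot be rewound for a second profile.
    if seqdb == "-" and multi_profile:
        raise ValueError("<seqdb> cannot be '-' with a multi-profile <hmmfile>")
    return ["hmmsearch", *options, hmmfile, seqdb]
```

       The returned list has the shape described in SYNOPSIS (options first,
       then <hmmfile>, then <seqdb>) and could be handed to a process
       launcher such as Python's subprocess.run.
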
       The output format is designed to be human-readable, but it is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option redirects
       the main output, which includes discarding it by writing to /dev/null.
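
       As an illustration of why the tabular output is easier to parse, here
       is a minimal sketch of a --tblout reader. The column layout is an
       assumption not stated on this page: 18 whitespace-separated fields
       (target name first, query name third, full-sequence E-value fifth, bit
       score sixth) followed by a free-text description, with comment lines
       starting with '#'. The function name parse_tblout is hypothetical.

```python
def parse_tblout(lines):
    """Parse per-target table lines into a list of dicts.

    Assumed layout (not defined on this page): 18 whitespace-separated
    fields, then a free-text description that may contain spaces.
    """
    hits = []
    for line in lines:
        # Skip blank lines and '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split off the 18 fixed fields; the remainder is the description.
        fields = line.split(None, 18)
        hits.append({
            "target": fields[0],
            "query": fields[2],
            "evalue": float(fields[4]),
            "score": float(fields[5]),
            "description": fields[18].strip() if len(fields) > 18 else "",
        })
    return hits
```

       Splitting with a maximum field count keeps a multi-word description in
       one piece instead of scattering it across extra columns.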

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       -A <f> Save a multiple alignment of all significant hits (those
              satisfying inclusion thresholds) to the file <f>.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-target output, with one data line per homologous target
              sequence found.

       --domtblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-domain output, with one data line per homologous domain
              detected in a target sequence for each query profile.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Remove the limit on the length of each line in the main output.
              The default is a limit of 120 characters per line, which helps
              in displaying the output cleanly on terminals and in editors,
              but can truncate target sequence description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout). Sequence hits and domain
       hits are ranked by statistical significance (E-value), and output is
       generated in two sections called per-target and per-domain output. In
       per-target output, by default, all sequence hits with an E-value <= 10
       are reported. In the per-domain output, for each target that has passed
       per-target reporting thresholds, all domains satisfying per-domain
       reporting thresholds are reported. By default, these are domains with
       conditional E-values of <= 10. The following options allow you to
       change the default E-value reporting thresholds, or to use bit score
       thresholds instead.

       -E <x> In the per-target output, report target sequences with an
              E-value of <= <x>. The default is 10.0, meaning that on average
              about 10 false positives will be reported per query, so you can
              see the top of the noise and decide for yourself if it's really
              noise.

       -T <x> Instead of thresholding per-target output on E-value, report
              target sequences with a bit score of >= <x>.

       --domE <x>
              In the per-domain output, for target sequences that have already
              satisfied the per-target reporting threshold, report individual
              domains with a conditional E-value of <= <x>. The default is
              10.0. A conditional E-value means the expected number of
              additional false positive domains in the smaller search space of
              those comparisons that already satisfied the per-target
              reporting threshold (and thus must have at least one homologous
              domain already).

       --domT <x>
              Instead of thresholding per-domain output on E-value, report
              domains with a bit score of >= <x>.

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. Inclusion
       thresholds control which hits are considered reliable enough to be
       included in an output alignment or a subsequent search round, or marked
       as significant ("!") as opposed to questionable ("?") in domain output.

       --incE <x>
              Use an E-value of <= <x> as the per-target inclusion threshold.
              The default is 0.01, meaning that on average about 1 false
              positive would be expected in every 100 searches with different
              query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the per-target inclusion threshold.
              By default this option is unset.

       --incdomE <x>
              Use a conditional E-value of <= <x> as the per-domain inclusion
              threshold, in targets that have already satisfied the overall
              per-target inclusion threshold. The default is 0.01.

       --incdomT <x>
              Instead of using E-values, use a bit score of >= <x> as the
              per-domain inclusion threshold.
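
       The relationship between the default per-target reporting threshold
       (-E 10.0) and inclusion threshold (--incE 0.01) can be sketched as a
       small classifier. This is an illustration only: the function name
       classify_hit and its return labels are hypothetical, and the bit score
       alternatives (-T, --incT) are not modeled.

```python
def classify_hit(evalue, report_e=10.0, incl_e=0.01):
    """Classify a per-target hit by E-value.

    report_e mirrors the default of -E and incl_e the default of --incE.
    The returned labels are illustrative, not HMMER output.
    """
    if evalue <= incl_e:
        return "included"      # reliable enough for alignments / next round
    if evalue <= report_e:
        return "reported"      # shown in output, but not included
    return "not reported"      # above the reporting threshold
```

       Because the inclusion threshold is the stricter of the two, every
       included hit is also a reported hit under the defaults.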

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific bit score thresholds for
       each profile, superseding any thresholding based on statistical
       significance alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or NC) optional score threshold annotation; this is picked up by
       hmmbuild from Stockholm format alignment files. Each thresholding
       option has two scores: the per-sequence threshold <x1> and the
       per-domain threshold <x2>. These act as if -T <x1> --incT <x1>
       --domT <x2> --incdomT <x2> had been applied specifically using each
       model's curated thresholds.

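       The equivalence stated above can be written out explicitly. This is a
       minimal sketch with a hypothetical helper name; it only spells out
       which explicit bit score options a curated pair (<x1>, <x2>) stands
       for, and does not read thresholds from a profile file.

```python
def cutoff_equivalent_options(x1, x2):
    """Return the explicit options that a curated cutoff pair acts like.

    x1: per-sequence bit score threshold (for example GA1, TC1 or NC1).
    x2: per-domain bit score threshold (for example GA2, TC2 or NC2).
    """
    return [
        "-T", str(x1),        # per-target reporting threshold
        "--incT", str(x1),    # per-target inclusion threshold
        "--domT", str(x2),    # per-domain reporting threshold
        "--incdomT", str(x2), # per-domain inclusion threshold
    ]
```

       With a curated cutoff, reporting and inclusion therefore use the same
       score at each level, unlike the E-value defaults where they differ.
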
       --cut_ga
              Use the GA (gathering) bit scores in the model to set
              per-sequence (GA1) and per-domain (GA2) reporting and inclusion
              thresholds. GA thresholds are generally considered to be the
              reliable curated thresholds defining family membership; for
              example, in Pfam, these thresholds define what gets included in
              Pfam Full alignments based on searches with Pfam Seed models.

       --cut_nc
              Use the NC (noise cutoff) bit score thresholds in the model to
              set per-sequence (NC1) and per-domain (NC2) reporting and
              inclusion thresholds. NC thresholds are generally considered to
              be the score of the highest-scoring known false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score thresholds in the model to
              set per-sequence (TC1) and per-domain (TC2) reporting and
              inclusion thresholds. TC thresholds are generally considered to
              be the score of the lowest-scoring known true positive that is
              above all known false positives.

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm. There is also a bias filter step between MSV
       and Viterbi. Targets that pass all the steps in the acceleration
       pipeline are then subjected to postprocessing: domain identification
       and scoring using the Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Turn off all filters, including the bias filter, and run full
              Forward/Backward postprocessing on every target. This increases
              sensitivity somewhat, at a large cost in speed.

       --F1 <x>
              Set the P-value threshold for the MSV filter step. The default
              is 0.02, meaning that roughly 2% of the highest scoring
              nonhomologous targets are expected to pass the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The
              default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The
              default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
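
       To get a feel for what the default filter thresholds mean, the sketch
       below multiplies a database size by each P-value threshold. It is a
       rough estimate under stated assumptions: all targets are treated as
       nonhomologous, each threshold is read as the fraction of such targets
       passing that stage, and the bias filter is ignored. The function name
       is hypothetical.

```python
def expected_survivors(n_targets, f1=0.02, f2=0.001, f3=1e-5):
    """Rough count of nonhomologous targets expected to pass each filter.

    Defaults mirror --F1, --F2 and --F3. The bias filter is not modeled.
    """
    return {
        "MSV": n_targets * f1,
        "Viterbi": n_targets * f2,
        "Forward": n_targets * f3,
    }
```

       For a database of one million sequences this gives on the order of
       20000, 1000 and 10 nonhomologous targets after the MSV, Viterbi and
       Forward filters respectively, which is why only a small fraction of
       targets reaches full Forward/Backward postprocessing.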

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Set the random number seed to <n>. Some steps in postprocessing
              require Monte Carlo simulation. The default is to use a fixed
              seed (42), so that results are exactly reproducible. Any other
              positive integer will give different (but also reproducible)
              results. A choice of 0 uses a randomly chosen seed.

       --tformat <s>
              Assert that the target sequence database file is in format <s>.
              Accepted formats include fasta, embl, genbank, ddbj, uniprot,
              stockholm, pfam, a2m, and afa. The default is to autodetect the
              format of the file.

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine; that is, it tries to maximize the use of your available
              processor cores. Setting <n> higher than the number of available
              cores is of little if any value, but you may want to set it to
              something less. You can also control this number by setting an
              environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off at compile-time for your site or machine for some
              reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker processes. Send a SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
              optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun. (Only available if
              optional MPI support was enabled at compile-time.)
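
       The effect of -Z on per-sequence E-values can be sketched as a simple
       rescaling. This assumes that an E-value is the per-comparison P-value
       multiplied by the number of targets, so that an E-value scales
       linearly with the asserted database size; the function name
       rescale_evalue is hypothetical.

```python
def rescale_evalue(evalue, z_seen, z_asserted):
    """Rescale an E-value from the number of targets actually searched
    (z_seen) to an asserted number of targets (z_asserted, as with -Z).

    Assumes E-value = P-value * Z, so the E-value is linear in Z.
    """
    return evalue * z_asserted / z_seen
```

       For example, under this assumption an E-value of 0.5 obtained against
       1000 targets corresponds to 50 when the search space is asserted to be
       100000 targets, which makes results from differently sized databases
       comparable.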

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page ().

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER 3.1b2                      February 2015                    hmmsearch(1)