hmmscan(1)                       HMMER Manual                       hmmscan(1)

NAME

       hmmscan - search sequence(s) against a profile database

SYNOPSIS

       hmmscan [options] <hmmdb> <seqfile>

DESCRIPTION

       hmmscan is used to search sequences against collections of profiles.
       For each sequence in <seqfile>, use that query sequence to search the
       target database of profiles in <hmmdb>, and output ranked lists of the
       profiles with the most significant matches to the sequence.

       The <seqfile> may contain more than one query sequence. It can be in
       FASTA format, or several other common sequence file formats (genbank,
       embl, and uniprot, among others), or in alignment file formats
       (stockholm, aligned fasta, and others). See the --qformat option for a
       complete list.

       The <hmmdb> needs to be pressed using hmmpress before it can be
       searched with hmmscan. Pressing creates four binary files, suffixed
       .h3{fimp}.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option allows
       redirecting the main output, including throwing it away in /dev/null.
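
       The workflow described above (press the database, then scan while
       keeping only the tabular output) can be sketched as a pair of small
       Python helpers that build the argument lists. This is an
       illustrative sketch, not part of HMMER: the helper names are
       hypothetical, and the file names in the usage note are placeholders.

```python
from typing import List, Optional


def hmmpress_cmd(hmmdb: str) -> List[str]:
    """Argument list that presses <hmmdb> so hmmscan can search it."""
    return ["hmmpress", hmmdb]


def hmmscan_cmd(hmmdb: str,
                seqfile: str,
                tblout: Optional[str] = None,
                domtblout: Optional[str] = None,
                main_output: str = "/dev/null") -> List[str]:
    """Argument list for hmmscan; options come before <hmmdb> <seqfile>.

    By default the main human-readable output is discarded with -o,
    on the assumption that only the tabular files will be parsed.
    """
    cmd = ["hmmscan", "-o", main_output]
    if tblout is not None:
        cmd += ["--tblout", tblout]
    if domtblout is not None:
        cmd += ["--domtblout", domtblout]
    return cmd + [hmmdb, seqfile]
```

       For example, hmmscan_cmd("profiles.hmm", "queries.fasta",
       tblout="hits.tbl") yields a list that can be passed to
       subprocess.run() once hmmpress_cmd("profiles.hmm") has been run.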

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-target output, with one data line per homologous target
              model found.

       --domtblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-domain output, with one data line per homologous domain
              detected in a query sequence for each homologous model.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Remove the limit on the length of each line in the main
              output. The default is a limit of 120 characters per line,
              which helps in displaying the output cleanly on terminals and
              in editors, but can truncate target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
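
       As an illustration of why the --tblout format is easier to parse
       than the main output, the sketch below reads per-target lines in
       Python. It rests on assumptions not stated on this page: that
       comment lines start with "#", and that each data line has 18
       whitespace-separated columns (target name, target accession, query
       name, query accession, full-sequence E-value, score, bias, and so
       on) followed by a free-text description. Check the header of your
       own file before relying on the column positions.

```python
from typing import Dict, Iterable, Iterator


def parse_tblout(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per data line of a --tblout file.

    ASSUMPTION: 18 fixed whitespace-separated columns, then a description
    that may itself contain spaces. Only a few columns are extracted.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(None, 18)  # at most 18 splits -> 19 fields
        if len(fields) < 18:
            continue  # not a data line under the assumed layout
        yield {
            "target": fields[0],
            "target_accession": fields[1],
            "query": fields[2],
            "evalue": float(fields[4]),  # assumed full-sequence E-value
            "score": float(fields[5]),   # assumed full-sequence bit score
            "description": fields[18].strip() if len(fields) > 18 else "",
        }
```

       Passing an open file object to parse_tblout() streams the hits one
       at a time, so large result files need not be held in memory.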

OPTIONS FOR REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout).

       -E <x> In the per-target output, report target profiles with an E-value
              of <= <x>. The default is 10.0, meaning that on average, about
              10 false positives will be reported per query, so you can see
              the top of the noise and decide for yourself if it's really
              noise.

       -T <x> Instead of thresholding per-profile output on E-value, report
              target profiles with a bit score of >= <x>.

       --domE <x>
              In the per-domain output, for target profiles that have already
              satisfied the per-profile reporting threshold, report individual
              domains with a conditional E-value of <= <x>. The default is
              10.0. A conditional E-value means the expected number of
              additional false positive domains in the smaller search space
              of those comparisons that already satisfied the per-profile
              reporting threshold (and thus must have at least one homologous
              domain already).

       --domT <x>
              Instead of thresholding per-domain output on E-value, report
              domains with a bit score of >= <x>.
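
       The two-level logic of -E and --domE can be sketched as follows.
       This is a simplified model for illustration, not HMMER's code: the
       TargetHit and Domain classes are hypothetical, and only the
       E-value thresholds (not -T or --domT) are modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Domain:
    c_evalue: float  # conditional E-value of this domain


@dataclass
class TargetHit:
    name: str
    evalue: float    # per-target E-value
    domains: List[Domain] = field(default_factory=list)


def reported(hits: List[TargetHit],
             E: float = 10.0,
             domE: float = 10.0) -> List[TargetHit]:
    """Apply the per-target threshold first, then the per-domain one.

    Domains are only considered for targets that already passed -E.
    """
    result = []
    for hit in hits:
        if hit.evalue > E:
            continue  # target fails the per-target reporting threshold
        kept = [d for d in hit.domains if d.c_evalue <= domE]
        result.append(TargetHit(hit.name, hit.evalue, kept))
    return result
```

       A domain with a low conditional E-value is therefore never reported
       if its target profile failed the per-target threshold.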

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. Inclusion
       thresholds control which hits are considered to be reliable enough to
       be included in an output alignment or a subsequent search round.
       hmmscan has neither alignment output (unlike hmmsearch and phmmer) nor
       iterative search steps (unlike jackhmmer), so inclusion thresholds
       have little effect. They only affect which domains get marked as
       significant (!) or questionable (?) in domain output.

       --incE <x>
              Use an E-value of <= <x> as the per-target inclusion threshold.
              The default is 0.01, meaning that on average, about 1 false
              positive would be expected in every 100 searches with different
              query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the per-target inclusion
              threshold. It would be unusual to use bit score thresholds with
              hmmscan, because you don't expect a single score threshold to
              work for different profiles; different profiles have slightly
              different expected score distributions.

       --incdomE <x>
              Use a conditional E-value of <= <x> as the per-domain inclusion
              threshold, in targets that have already satisfied the overall
              per-target inclusion threshold. The default is 0.01.

       --incdomT <x>
              Instead of using E-values, use a bit score of >= <x> as the
              per-domain inclusion threshold. As with --incT above, it would
              be unusual to use a single bit score threshold in hmmscan.
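
       The effect of the inclusion thresholds on the domain markers can be
       sketched as below. This is one simplified reading of the
       description above, assuming a domain is marked "!" only when both
       its target and the domain itself satisfy the E-value inclusion
       thresholds; the function name is hypothetical and bit score
       thresholds are not modeled.

```python
def domain_mark(target_evalue: float,
                domain_c_evalue: float,
                incE: float = 0.01,
                incdomE: float = 0.01) -> str:
    """Return "!" (significant) or "?" (questionable) for a reported domain.

    ASSUMPTION: the per-domain threshold only applies in targets that
    already satisfied the per-target inclusion threshold.
    """
    target_included = target_evalue <= incE
    domain_included = domain_c_evalue <= incdomE
    return "!" if (target_included and domain_included) else "?"
```

       Under the defaults, a domain with a conditional E-value of 0.5 in a
       strongly matching target would still be marked "?".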

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific bit score thresholds for
       each profile, superseding any thresholding based on statistical
       significance alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or NC) optional score threshold annotation; this is picked up by
       hmmbuild from Stockholm format alignment files. Each thresholding
       option has two scores: the per-sequence threshold <x1> and the
       per-domain threshold <x2>. These act as if -T <x1> --incT <x1>
       --domT <x2> --incdomT <x2> had been applied specifically using each
       model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit scores in the model to set
              per-sequence (GA1) and per-domain (GA2) reporting and inclusion
              thresholds. GA thresholds are generally considered to be the
              reliable curated thresholds defining family membership; for
              example, in Pfam, these thresholds define what gets included in
              Pfam Full alignments based on searches with Pfam Seed models.

       --cut_nc
              Use the NC (noise cutoff) bit score thresholds in the model to
              set per-sequence (NC1) and per-domain (NC2) reporting and
              inclusion thresholds. NC thresholds are generally considered to
              be the score of the highest-scoring known false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score thresholds in the model to
              set per-sequence (TC1) and per-domain (TC2) reporting and
              inclusion thresholds. TC thresholds are generally considered to
              be the score of the lowest-scoring known true positive that is
              above all known false positives.
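
       The equivalence stated above, between a model's curated pair of
       thresholds and the four explicit bit score options, can be written
       out as a small helper. This is an illustration only: the function
       name is hypothetical, and in a real search the thresholds differ
       from profile to profile, so a single explicit command line cannot
       replace --cut_ga, --cut_nc, or --cut_tc for a multi-profile
       database.

```python
from typing import List


def cutoff_equivalent_options(x1: float, x2: float) -> List[str]:
    """Options that one model's curated thresholds behave like.

    x1 is the per-sequence threshold (e.g. GA1) and x2 is the per-domain
    threshold (e.g. GA2) stored in a single profile.
    """
    return [
        "-T", str(x1),        # per-target reporting threshold
        "--incT", str(x1),    # per-target inclusion threshold
        "--domT", str(x2),    # per-domain reporting threshold
        "--incdomT", str(x2), # per-domain inclusion threshold
    ]
```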

CONTROL OF THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm. There is also a bias filter step between
       MSV and Viterbi. Targets that pass all the steps in the acceleration
       pipeline are then subjected to postprocessing: domain identification
       and scoring using the Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Turn off all filters, including the bias filter, and run full
              Forward/Backward postprocessing on every target. This increases
              sensitivity somewhat, at a large cost in speed.

       --F1 <x>
              Set the P-value threshold for the MSV filter step. The default
              is 0.02, meaning that roughly 2% of the highest scoring
              nonhomologous targets are expected to pass the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The
              default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The
              default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
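
       To give a feel for the default filter thresholds, the sketch below
       estimates how many nonhomologous targets would be expected to
       survive each filter step. It assumes each P-value threshold can be
       read as the fraction of nonhomologous targets passing that step (as
       stated for --F1) and ignores the bias filter; the function name is
       hypothetical.

```python
from typing import Dict, Optional

# Default P-value thresholds for --F1, --F2, and --F3.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "MSV": 0.02,
    "Viterbi": 0.001,
    "Forward": 1e-5,
}


def expected_survivors(n_nonhomologous: int,
                       thresholds: Optional[Dict[str, float]] = None
                       ) -> Dict[str, float]:
    """Expected count of nonhomologous targets passing each filter step."""
    if thresholds is None:
        thresholds = DEFAULT_THRESHOLDS
    return {step: n_nonhomologous * p for step, p in thresholds.items()}
```

       With 100,000 nonhomologous profiles in the database, this gives on
       the order of 2,000 targets past MSV, 100 past Viterbi, and 1 past
       Forward, which is why the expensive postprocessing step sees so few
       targets.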

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Set the random number seed to <n>. Some steps in postprocessing
              require Monte Carlo simulation. The default is to use a fixed
              seed (42), so that results are exactly reproducible. Any other
              positive integer will give different (but also reproducible)
              results. A choice of 0 uses an arbitrarily chosen seed.

       --qformat <s>
              Assert that the query sequence file is in format <s>. Accepted
              formats include fasta, embl, genbank, ddbj, uniprot, stockholm,
              pfam, a2m, and afa.

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine; that is, it tries to maximize the use of your
              available processor cores. Setting <n> higher than the number
              of available cores is of little if any value, but you may want
              to set it to something less. You can also control this number
              by setting an environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off for your site or machine for some reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker processes. Send a SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT)

              (Only available if optional MPI support was enabled at
              compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun.

              (Only available if optional MPI support was enabled at
              compile-time.)

COPYRIGHT

       @HMMER_COPYRIGHT@
       @HMMER_LICENSE@

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (@HMMER_URL@).

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER @HMMER_VERSION@            @HMMER_DATE@                       hmmscan(1)