hmmscan(1)

hmmscan(1)                       HMMER Manual                       hmmscan(1)

NAME

       hmmscan - search protein sequence(s) against a protein profile database

SYNOPSIS

       hmmscan [options] <hmmdb> <seqfile>

DESCRIPTION

       hmmscan is used to search protein sequences against collections of
       protein profiles. For each sequence in <seqfile>, use that query
       sequence to search the target database of profiles in <hmmdb>, and
       output ranked lists of the profiles with the most significant matches
       to the sequence.

       The <seqfile> may contain more than one query sequence. It can be in
       FASTA format, or several other common sequence file formats (genbank,
       embl, and uniprot, among others), or in alignment file formats
       (stockholm, aligned fasta, and others). See the --qformat option for a
       complete list.

       The <hmmdb> needs to be press'ed using hmmpress before it can be
       searched with hmmscan. This creates four binary files, suffixed
       .h3{fimp}.

       The query <seqfile> may be '-' (a dash character), in which case the
       query sequences are read from a <stdin> pipe instead of from a file.
       The <hmmdb> cannot be read from a <stdin> stream, because it needs to
       have those four auxiliary binary files generated by hmmpress.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option allows
       redirecting the main output, including throwing it away in /dev/null.
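
       The requirement that <hmmdb> be press'ed can be checked before a
       search is launched. The following is a minimal Python sketch, not part
       of HMMER: the function names and the file name db.hmm are hypothetical,
       and the only facts taken from this page are the four .h3{fimp}
       suffixes, the '-' convention for stdin, and the --tblout and -o
       options.

```python
from pathlib import Path

# Suffixes of the four binary files that hmmpress writes next to <hmmdb>.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def pressed_paths(hmmdb):
    """Return the four auxiliary file paths expected for a pressed <hmmdb>."""
    return [Path(str(hmmdb) + suffix) for suffix in PRESSED_SUFFIXES]


def missing_pressed_files(hmmdb):
    """List the auxiliary files that are absent (an empty list means ready)."""
    return [p for p in pressed_paths(hmmdb) if not p.is_file()]


def scan_command(hmmdb, seqfile="-", tblout=None):
    """Build an argv list for hmmscan; seqfile '-' reads queries from stdin.

    When a table file is requested, the main output is sent to /dev/null
    (an illustrative choice, not a requirement).
    """
    argv = ["hmmscan"]
    if tblout is not None:
        argv += ["--tblout", str(tblout), "-o", "/dev/null"]
    return argv + [str(hmmdb), str(seqfile)]
```

       The argv list could be passed to subprocess.run() once
       missing_pressed_files() returns an empty list.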

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-target output, with one data line per homologous target
              model found.

       --domtblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-domain output, with one data line per homologous domain
              detected in a query sequence for each homologous model.

       --pfamtblout <f>
              Save an especially succinct tabular (space-delimited) file
              summarizing the per-target output, with one data line per
              homologous target model found.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Remove the limit on the length of each line in the main output.
              The default is a limit of 120 characters per line, which helps
              in displaying the output cleanly on terminals and in editors,
              but can truncate target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
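
       Because the --tblout file is space-delimited with one data line per
       target, it can be read with a few lines of Python. The sketch below
       rests on assumptions that this page does not state: that comment lines
       begin with '#', and that each data line has 18 whitespace-separated
       fields (target name, target accession, query name, query accession,
       full-sequence E-value and bit score first) followed by a free-text
       description. Check these against the table header written by your
       HMMER version. The names TargetHit and parse_tblout are hypothetical.

```python
from typing import NamedTuple

# Assumed number of fixed whitespace-delimited fields before the description.
N_FIXED_FIELDS = 18


class TargetHit(NamedTuple):
    target: str
    target_acc: str
    query: str
    query_acc: str
    evalue: float       # full-sequence E-value (assumed column 5)
    score: float        # full-sequence bit score (assumed column 6)
    description: str


def parse_tblout(lines):
    """Yield one TargetHit per data line, skipping blanks and '#' comments."""
    for line in lines:
        line = line.rstrip()
        if not line or line.startswith("#"):
            continue
        # Split only the fixed fields so the description keeps its spaces.
        fields = line.split(None, N_FIXED_FIELDS)
        if len(fields) < N_FIXED_FIELDS:
            raise ValueError("unexpected number of fields: %r" % line)
        description = fields[N_FIXED_FIELDS] if len(fields) > N_FIXED_FIELDS else ""
        yield TargetHit(fields[0], fields[1], fields[2], fields[3],
                        float(fields[4]), float(fields[5]), description)
```

       In use, open the file named by --tblout and pass the file object to
       parse_tblout(); the function accepts any iterable of lines.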

OPTIONS FOR REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout).

       -E <x> In the per-target output, report target profiles with an E-value
              of <= <x>. The default is 10.0, meaning that on average, about
              10 false positives will be reported per query, so you can see
              the top of the noise and decide for yourself if it's really
              noise.

       -T <x> Instead of thresholding per-profile output on E-value, report
              target profiles with a bit score of >= <x>.

       --domE <x>
              In the per-domain output, for target profiles that have already
              satisfied the per-profile reporting threshold, report individual
              domains with a conditional E-value of <= <x>. The default is
              10.0. A conditional E-value means the expected number of
              additional false positive domains in the smaller search space of
              those comparisons that already satisfied the per-profile
              reporting threshold (and thus must have at least one homologous
              domain already).

       --domT <x>
              Instead of thresholding per-domain output on E-value, report
              domains with a bit score of >= <x>.
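
       The two-level reporting logic above (a per-target test, then a
       per-domain test applied only to targets that passed) can be sketched
       in Python. This is an illustration of the rules as described on this
       page, not HMMER's own code; the function names and the representation
       of a domain as a (conditional E-value, bit score) pair are
       hypothetical.

```python
def target_is_reported(evalue, score, E=10.0, T=None):
    """Per-target test: a bit score threshold (-T) replaces -E when given."""
    if T is not None:
        return score >= T
    return evalue <= E


def reported_domains(domains, domE=10.0, domT=None):
    """Filter the domains of a target that already passed the target test.

    Each domain is a (conditional_evalue, bit_score) pair; --domT replaces
    --domE when given.
    """
    if domT is not None:
        return [d for d in domains if d[1] >= domT]
    return [d for d in domains if d[0] <= domE]
```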

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. Inclusion
       thresholds control which hits are considered to be reliable enough to
       be included in an output alignment or a subsequent search round. In
       hmmscan, which does not have any alignment output (like hmmsearch or
       phmmer) nor any iterative search steps (like jackhmmer), inclusion
       thresholds have little effect. They only affect what domains get marked
       as significant (!) or questionable (?) in domain output.

       --incE <x>
              Use an E-value of <= <x> as the per-target inclusion threshold.
              The default is 0.01, meaning that on average, about 1 false
              positive would be expected in every 100 searches with different
              query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the per-target inclusion threshold.
              It would be unusual to use bit score thresholds with hmmscan,
              because you don't expect a single score threshold to work for
              different profiles; different profiles have slightly different
              expected score distributions.

       --incdomE <x>
              Use a conditional E-value of <= <x> as the per-domain inclusion
              threshold, in targets that have already satisfied the overall
              per-target inclusion threshold. The default is 0.01.

       --incdomT <x>
              Instead of using E-values, use a bit score of >= <x> as the
              per-domain inclusion threshold. As with --incT above, it would
              be unusual to use a single bit score threshold in hmmscan.
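
       The effect of inclusion thresholds on the domain markers can be
       illustrated with a short Python sketch. It covers only the E-value
       form of the thresholds (--incE and --incdomE) and assumes that a
       domain is marked significant when both the per-target and per-domain
       inclusion tests pass; the function name is hypothetical and this is
       not HMMER's own code.

```python
def domain_mark(target_evalue, domain_cond_evalue, incE=0.01, incdomE=0.01):
    """Return '!' for an included (significant) domain, '?' otherwise.

    The per-domain test only counts for targets that already satisfy the
    per-target inclusion threshold.
    """
    target_included = target_evalue <= incE
    domain_included = domain_cond_evalue <= incdomE
    return "!" if (target_included and domain_included) else "?"
```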

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific bit score thresholds for
       each profile, superseding any thresholding based on statistical
       significance alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or NC) optional score threshold annotation; this is picked up by
       hmmbuild from Stockholm format alignment files. Each thresholding
       option has two scores: the per-sequence threshold <x1> and the
       per-domain threshold <x2>. These act as if -T <x1> --incT <x1>
       --domT <x2> --incdomT <x2> had been applied specifically using each
       model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit scores in the model to set
              per-sequence (GA1) and per-domain (GA2) reporting and inclusion
              thresholds. GA thresholds are generally considered to be the
              reliable curated thresholds defining family membership; for
              example, in Pfam, these thresholds define what gets included in
              Pfam Full alignments based on searches with Pfam Seed models.

       --cut_nc
              Use the NC (noise cutoff) bit score thresholds in the model to
              set per-sequence (NC1) and per-domain (NC2) reporting and
              inclusion thresholds. NC thresholds are generally considered to
              be the score of the highest-scoring known false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score thresholds in the model to
              set per-sequence (TC1) and per-domain (TC2) reporting and
              inclusion thresholds. TC thresholds are generally considered to
              be the score of the lowest-scoring known true positive that is
              above all known false positives.
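
       The equivalence described above, where a curated pair of scores stands
       in for four explicit bit score options, can be written out in Python.
       This is a sketch for illustration only: the function name, the
       dictionary layout, and the example scores are hypothetical, and hmmscan
       applies the thresholds per model internally rather than by expanding
       options.

```python
def equivalent_options(x1, x2):
    """Expand a per-sequence score x1 and per-domain score x2 into the
    explicit options that a --cut_* flag acts like for one model."""
    return ["-T", str(x1), "--incT", str(x1),
            "--domT", str(x2), "--incdomT", str(x2)]


def options_for_cutoff(model_cutoffs, kind):
    """Look up one model's curated pair by kind ('GA', 'NC', or 'TC').

    model_cutoffs maps a kind to its (x1, x2) pair; a missing kind means
    the profile lacks that annotation, so the option cannot be used.
    """
    if kind not in model_cutoffs:
        raise KeyError("profile has no %s threshold annotation" % kind)
    x1, x2 = model_cutoffs[kind]
    return equivalent_options(x1, x2)
```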

CONTROL OF THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm. There is also a bias filter step between
       MSV and Viterbi. Targets that pass all the steps in the acceleration
       pipeline are then subjected to postprocessing -- domain identification
       and scoring using the Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Turn off all filters, including the bias filter, and run full
              Forward/Backward postprocessing on every target. This increases
              sensitivity somewhat, at a large cost in speed.

       --F1 <x>
              Set the P-value threshold for the MSV filter step. The default
              is 0.02, meaning that roughly 2% of the highest scoring
              nonhomologous targets are expected to pass the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The
              default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The
              default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
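
       The cascade of P-value thresholds can be pictured with a small Python
       sketch. It is a simplification under stated assumptions: the bias
       filter is left out, a target is taken to pass a stage when its P-value
       is less than or equal to the threshold (this page does not say whether
       the comparison is strict), and the function name is hypothetical. Only
       the stage order, the default thresholds, and the behaviour of --max
       come from this page.

```python
def first_failed_filter(p_msv, p_vit, p_fwd,
                        F1=0.02, F2=1e-3, F3=1e-5, use_max=False):
    """Return the name of the first filter a target fails, or None if it
    reaches postprocessing. With use_max (as with --max) nothing is filtered.
    """
    if use_max:
        return None
    stages = (("MSV", p_msv, F1), ("Viterbi", p_vit, F2), ("Forward", p_fwd, F3))
    for name, pvalue, threshold in stages:
        if pvalue > threshold:
            return name  # later, slower stages are never run for this target
    return None
```

       Raising a threshold lets more targets through to the slower stages;
       it changes which targets are scored, not the scores themselves.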

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Set the random number seed to <n>. Some steps in postprocessing
              require Monte Carlo simulation. The default is to use a fixed
              seed (42), so that results are exactly reproducible. Any other
              positive integer will give different (but also reproducible)
              results. A choice of 0 uses an arbitrarily chosen seed.

       --qformat <s>
              Assert that the query sequence file is in format <s>. Accepted
              formats include fasta, embl, genbank, ddbj, uniprot, stockholm,
              pfam, a2m, and afa.

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine - that is, it tries to maximize the use of your
              available processor cores. Setting <n> higher than the number of
              available cores is of little if any value, but you may want to
              set it to something less. You can also control this number by
              setting an environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off for your site or machine for some reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT)

              (Only available if optional MPI support was enabled at
              compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun.

              (Only available if optional MPI support was enabled at
              compile-time.)
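
       The role of -Z can be made concrete with a little arithmetic. The
       Python sketch below assumes that an E-value is the per-comparison
       P-value multiplied by the number of targets Z; that relationship is
       not spelled out on this page, so treat it as an assumption. The
       function names and the example numbers are hypothetical.

```python
def evalue_from_pvalue(pvalue, z):
    """E-value under the assumed model E = P * Z."""
    return pvalue * z


def rescale_evalue(evalue, z_old, z_new):
    """Convert an E-value computed for z_old targets to what it would be
    had the search been run with -Z z_new (same P-value, new Z)."""
    return evalue * z_new / z_old
```

       For example, an E-value of 0.5 obtained against 1000 profiles would
       become 10.0 under an asserted -Z of 20000, which is why fixing -Z
       keeps E-values comparable across databases of different sizes.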

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page ().

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER 3.1b2                      February 2015                      hmmscan(1)