phmmer(1)                        HMMER Manual                        phmmer(1)

NAME

       phmmer - search protein sequence(s) against a protein sequence database

SYNOPSIS

       phmmer [options] <seqfile> <seqdb>

DESCRIPTION

       phmmer is used to search one or more query protein sequences against a
       protein sequence database. For each query sequence in <seqfile>, use
       that sequence to search the target database of sequences in <seqdb>,
       and output ranked lists of the sequences with the most significant
       matches to the query.

       Either the query <seqfile> or the target <seqdb> may be '-' (a dash
       character), in which case the query sequences or target database input
       will be read from a <stdin> pipe instead of from a file. Only one input
       source can come through <stdin>, not both. A further restriction is
       that if the <seqfile> contains more than one query sequence, then
       <seqdb> cannot come from <stdin>, because we can't rewind the streaming
       target database to search it with another query.

       The output format is designed to be human-readable, but is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option allows
       redirecting the main output, including throwing it away in /dev/null.
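
       The input rules and output options above can be sketched as a small
       Python helper that assembles a phmmer command line without running it.
       This is an illustration only: the function name, its parameters, and
       the n_queries argument (the caller's own count of query sequences) are
       hypothetical and are not part of HMMER.

```python
def build_phmmer_argv(seqfile, seqdb, n_queries=1, tblout=None,
                      discard_main=False):
    """Assemble an argument list for phmmer (hypothetical helper).

    Enforces the two stdin rules from the DESCRIPTION section before
    building the command line.
    """
    # Only one of the two inputs may be read from stdin ('-').
    if seqfile == "-" and seqdb == "-":
        raise ValueError("only one of <seqfile> and <seqdb> may be '-'")
    # A streamed target database cannot be rewound for a second query.
    if seqdb == "-" and n_queries > 1:
        raise ValueError("<seqdb> cannot be '-' with more than one query")

    argv = ["phmmer"]
    if discard_main:
        # Throw away the voluminous human-readable output.
        argv += ["-o", "/dev/null"]
    if tblout is not None:
        # Keep the concise per-target table instead.
        argv += ["--tblout", tblout]
    argv += [seqfile, seqdb]
    return argv


if __name__ == "__main__":
    print(build_phmmer_argv("query.fa", "targets.fa",
                            tblout="hits.tbl", discard_main=True))
```

       The returned list could be handed to subprocess.run() on a machine
       where phmmer is installed; the file names shown are placeholders.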

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       -A <f> Save a multiple alignment of all significant hits (those
              satisfying inclusion thresholds) to the file <f> in Stockholm
              format.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-target output, with one data line per homologous target
              sequence found.

       --domtblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-domain output, with one data line per homologous domain
              detected in a target sequence for each query.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The default
              is a limit of 120 characters per line, which helps in displaying
              the output cleanly on terminals and in editors, but can truncate
              target sequence description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
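
       Because the --tblout file is space-delimited with one data line per
       target, it can be read with a few lines of Python. The sketch below
       rests on two assumptions that this page does not state: that comment
       and header lines begin with '#', and that each data line has 18
       whitespace-separated columns followed by a free-text description.
       Check both against your own output before relying on it. The function
       name and the n_fixed parameter are hypothetical.

```python
def parse_tblout(lines, n_fixed=18):
    """Split --tblout data lines into fields (illustrative sketch).

    ASSUMPTION: lines starting with '#' are comments, and the first
    n_fixed columns are whitespace-delimited, with any remainder being
    a free-text description that may itself contain spaces.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # maxsplit keeps the trailing description as a single field
        rows.append(line.split(None, n_fixed))
    return rows


if __name__ == "__main__":
    sample = [
        "# made-up header line",
        "tgt1 - qry1 - 1e-30 100.5 0.1 1e-30 100.5 0.1 "
        "1.0 1 0 0 1 1 1 1 some protein description",
    ]
    for row in parse_tblout(sample):
        print(row[0], row[-1])
```

       The sample line is invented for the demonstration and is not real
       phmmer output.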

OPTIONS CONTROLLING SCORING SYSTEM

       The probability model in phmmer is constructed by inferring residue
       probabilities from a standard 20x20 substitution score matrix, plus two
       additional parameters for position-independent gap open and gap extend
       probabilities.

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model
              to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.

       --mx <s>
              Obtain residue alignment probabilities from the built-in
              substitution matrix named <s>. Several standard matrices are
              built-in, and do not need to be read from files. The matrix
              name <s> can be PAM30, PAM70, PAM120, PAM240, BLOSUM45,
              BLOSUM50, BLOSUM62, BLOSUM80, or BLOSUM90. Only one of the --mx
              and --mxfile options may be used.

       --mxfile <mxfile>
              Obtain residue alignment probabilities from the substitution
              matrix in file <mxfile>. The default score matrix is BLOSUM62
              (this matrix is internal to HMMER and does not have to be
              available as a file). The format of a substitution matrix
              <mxfile> is the standard format accepted by BLAST, FASTA, and
              other sequence analysis software. Only one of the --mx and
              --mxfile options may be used.
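
       The constraints on the scoring options above can be checked before a
       search is launched. The following Python sketch simply restates the
       documented ranges, matrix names, and the --mx/--mxfile exclusivity
       rule; the function name and its return convention are hypothetical,
       and phmmer performs its own validation regardless.

```python
# Built-in matrix names as listed under --mx.
BUILTIN_MATRICES = {
    "PAM30", "PAM70", "PAM120", "PAM240",
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
}


def check_scoring_options(popen=0.02, pextend=0.4, mx=None, mxfile=None):
    """Return a list of problems with the given scoring options.

    An empty list means the values satisfy the documented constraints.
    """
    problems = []
    if not (0.0 <= popen < 0.5):
        problems.append("--popen must be >= 0 and < 0.5")
    if not (0.0 <= pextend < 1.0):
        problems.append("--pextend must be >= 0 and < 1.0")
    if mx is not None and mxfile is not None:
        problems.append("only one of --mx and --mxfile may be used")
    if mx is not None and mx not in BUILTIN_MATRICES:
        problems.append("unknown built-in matrix: " + mx)
    return problems


if __name__ == "__main__":
    print(check_scoring_options(popen=0.6, mx="BLOSUM62", mxfile="my.mat"))
```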

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout). Sequence hits and domain
       hits are ranked by statistical significance (E-value) and output is
       generated in two sections called per-target and per-domain output. In
       per-target output, by default, all sequence hits with an E-value <= 10
       are reported. In the per-domain output, for each target that has passed
       per-target reporting thresholds, all domains satisfying per-domain
       reporting thresholds are reported. By default, these are domains with
       conditional E-values of <= 10. The following options allow you to
       change the default E-value reporting thresholds, or to use bit score
       thresholds instead.

       -E <x> In the per-target output, report target sequences with an
              E-value of <= <x>. The default is 10.0, meaning that on average,
              about 10 false positives will be reported per query, so you can
              see the top of the noise and decide for yourself if it's really
              noise.

       -T <x> Instead of thresholding per-target output on E-value, report
              target sequences with a bit score of >= <x>.

       --domE <x>
              In the per-domain output, for target sequences that have already
              satisfied the per-target reporting threshold, report individual
              domains with a conditional E-value of <= <x>. The default is
              10.0. A conditional E-value means the expected number of
              additional false positive domains in the smaller search space of
              those comparisons that already satisfied the per-target
              reporting threshold (and thus must have at least one homologous
              domain already).

       --domT <x>
              Instead of thresholding per-domain output on E-value, report
              domains with a bit score of >= <x>.
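
       The two-level reporting logic described above (a target must pass the
       per-target threshold before any of its domains are considered) can be
       sketched in Python as follows. This illustrates the documented rules
       and is not HMMER's implementation; the dictionary layout and function
       names are hypothetical.

```python
def passes(evalue, score, max_evalue, min_score=None):
    """Apply a bit score threshold if one is given, else an E-value one."""
    if min_score is not None:
        return score >= min_score
    return evalue <= max_evalue


def reported_domains(target, E=10.0, T=None, domE=10.0, domT=None):
    """Return the domains of one target that would be reported.

    `target` is a hypothetical dict with keys 'evalue', 'score', and
    'domains' (a list of dicts with 'c_evalue' and 'score').
    """
    # Per-target reporting threshold (-E or -T) comes first.
    if not passes(target["evalue"], target["score"], E, T):
        return []
    # Per-domain threshold (--domE or --domT) applies only to survivors.
    return [d for d in target["domains"]
            if passes(d["c_evalue"], d["score"], domE, domT)]


if __name__ == "__main__":
    hit = {"evalue": 0.5, "score": 25.0,
           "domains": [{"c_evalue": 0.01, "score": 24.0},
                       {"c_evalue": 40.0, "score": 2.0}]}
    print(reported_domains(hit))
```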

OPTIONS CONTROLLING INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. They
       control which hits are included in any output multiple alignment (the
       -A option) and which domains are marked as significant ("!") as opposed
       to questionable ("?") in domain output.

       --incE <x>
              Use an E-value of <= <x> as the per-target inclusion threshold.
              The default is 0.01, meaning that on average, about 1 false
              positive would be expected in every 100 searches with different
              query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the per-target inclusion
              threshold. By default this option is unset.

       --incdomE <x>
              Use a conditional E-value of <= <x> as the per-domain inclusion
              threshold, in targets that have already satisfied the overall
              per-target inclusion threshold. The default is 0.01.

       --incdomT <x>
              Instead of using E-values, use a bit score of >= <x> as the
              per-domain inclusion threshold. By default this option is unset.
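
       As an illustration of how the inclusion thresholds map onto the "!"
       and "?" markers, the Python sketch below marks a domain as significant
       only when both the per-target and the per-domain inclusion thresholds
       are met. It is a reading of the rules above, not HMMER's own code, and
       the function names and argument layout are hypothetical.

```python
def meets(evalue, score, max_evalue, min_score=None):
    """Bit score threshold takes over from the E-value one when set."""
    if min_score is not None:
        return score >= min_score
    return evalue <= max_evalue


def domain_marker(target_evalue, target_score, dom_c_evalue, dom_score,
                  incE=0.01, incT=None, incdomE=0.01, incdomT=None):
    """Return '!' for a significant domain, '?' for a questionable one."""
    # The target must first satisfy the per-target inclusion threshold.
    target_ok = meets(target_evalue, target_score, incE, incT)
    # Then the domain must satisfy the per-domain inclusion threshold.
    domain_ok = meets(dom_c_evalue, dom_score, incdomE, incdomT)
    return "!" if (target_ok and domain_ok) else "?"


if __name__ == "__main__":
    print(domain_marker(1e-20, 80.0, 1e-18, 75.0))
    print(domain_marker(1e-20, 80.0, 0.5, 8.0))
```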

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm, slowest but most accurate. There is also a
       bias filter step between MSV and Viterbi. Targets that pass all the
       steps in the acceleration pipeline are then subjected to
       postprocessing -- domain identification and scoring using the
       Forward/Backward algorithm.

       Essentially the only free parameters that control HMMER's heuristic
       filters are the P-value thresholds controlling the expected fraction of
       nonhomologous sequences that pass the filters. Setting the thresholds
       higher than the defaults will pass a higher proportion of nonhomologous
       sequences, increasing sensitivity at the expense of speed; conversely,
       setting lower P-value thresholds will pass a smaller proportion,
       decreasing sensitivity and increasing speed. Setting a filter's P-value
       threshold to 1.0 means it will pass all sequences, which effectively
       disables the filter.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Maximum sensitivity. Turn off all filters, including the bias
              filter, and run full Forward/Backward postprocessing on every
              target. This increases sensitivity slightly, at a large cost in
              speed.

       --F1 <x>
              First filter threshold; set the P-value threshold for the MSV
              filter step. The default is 0.02, meaning that roughly 2% of
              the highest scoring nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Second filter threshold; set the P-value threshold for the
              Viterbi filter step. The default is 0.001.

       --F3 <x>
              Third filter threshold; set the P-value threshold for the
              Forward filter step. The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
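
       To give a feel for what the filter thresholds mean in practice, the
       Python sketch below estimates how many nonhomologous targets would be
       expected to get past each filter for a database of a given size. It is
       a back-of-the-envelope calculation that treats each P-value threshold
       as a fraction of the whole database and ignores the bias filter; the
       function name is hypothetical.

```python
def expected_nonhomolog_passes(n_targets, f1=0.02, f2=0.001, f3=1e-5):
    """Rough expected counts of nonhomologous targets passing each filter.

    Each threshold is read as the expected fraction of nonhomologous
    sequences that pass that filter (defaults from --F1, --F2, --F3).
    A threshold of 1.0 lets everything through, disabling the filter.
    """
    return {
        "MSV": n_targets * f1,
        "Viterbi": n_targets * f2,
        "Forward": n_targets * f3,
    }


if __name__ == "__main__":
    for name, count in expected_nonhomolog_passes(1_000_000).items():
        print(name, round(count))
```

       With the default thresholds, only a tiny fraction of a large database
       is expected to reach the expensive Forward/Backward postprocessing,
       which is why disabling filters with --max costs so much time.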

OPTIONS CONTROLLING E-VALUE CALIBRATION

       Estimating the location parameters for the expected score distributions
       for MSV filter scores, Viterbi filter scores, and Forward scores
       requires three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EfL <n>
              Sets the sequence length in simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is >0, any stochastic simulations will be reproducible; the
              same command will give the same results. If <n> is 0, the
              random number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command. The
              default seed is 42.

       --qformat <s>
              Declare that the input <seqfile> is in format <s>. Accepted
              formats include fasta, embl, genbank, ddbj, uniprot, stockholm,
              pfam, a2m, and afa. The default is to autodetect the format of
              the file.

       --tformat <s>
              Declare that the input <seqdb> is in format <s>. Accepted
              formats include fasta, embl, genbank, ddbj, uniprot, stockholm,
              pfam, a2m, and afa. The default is to autodetect the format of
              the file.

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine - that is, it tries to maximize the use of your
              available processor cores. Setting <n> higher than the number of
              available cores is of little if any value, but you may want to
              set it to something less. You can also control this number by
              setting an environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off at compile-time for your site or machine for some
              reason.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
              optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun. (Only available if
              optional MPI support was enabled at compile-time.)
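
       The effect of the -Z option described above can be illustrated with a
       short Python sketch. It assumes the usual relationship that an E-value
       is the P-value of a score multiplied by the number of targets in the
       search, which this page implies but does not spell out. The function
       name and arguments are hypothetical.

```python
def per_sequence_evalue(pvalue, n_targets_seen, Z=None):
    """Scale a P-value into an E-value (illustrative sketch).

    ASSUMPTION: E-value = P-value * search space size. When Z is given
    (as with -Z), it replaces the number of targets actually seen.
    """
    search_space = n_targets_seen if Z is None else Z
    return pvalue * search_space


if __name__ == "__main__":
    # Same hit, same P-value: searching a 500,000-sequence subset,
    # then asserting the size of the full database with Z.
    print(per_sequence_evalue(1e-6, 500_000))
    print(per_sequence_evalue(1e-6, 500_000, Z=1_000_000))
```

       Fixing the search space size in this way keeps E-values comparable
       when a database is searched in pieces or when results from databases
       of different sizes are combined.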

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page ().

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER 3.1b2                      February 2015                       phmmer(1)