phmmer(1)                        HMMER Manual                        phmmer(1)

NAME

       phmmer - search protein sequence(s) against a protein sequence database

SYNOPSIS

       phmmer [options] seqfile seqdb

DESCRIPTION

       phmmer searches one or more query protein sequences against a protein
       sequence database. Each query sequence in seqfile is used in turn to
       search the target database of sequences in seqdb, and phmmer outputs a
       ranked list of the sequences with the most significant matches to
       that query.

       Either the query seqfile or the target seqdb may be '-' (a dash
       character), in which case the query sequences or the target database
       are read from a <stdin> pipe instead of from a file. Only one input
       source can come through <stdin>, not both. A further restriction is
       that if seqfile contains more than one query sequence, seqdb cannot
       come from <stdin>, because the streaming target database cannot be
       rewound to search it with another query.

       The output format is designed to be human-readable, but it is often so
       voluminous that reading it is impractical, and parsing it is a pain.
       The --tblout and --domtblout options save output in simple tabular
       formats that are concise and easier to parse. The -o option redirects
       the main output, including throwing it away in /dev/null.
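
       The input rules above can be checked before phmmer is launched. The
       following Python sketch is one possible way to do that; it is not part
       of HMMER. The function name and the n_queries argument (a count of
       query sequences that the caller is assumed to know already) are
       hypothetical. Only the options -o, --tblout, and --domtblout described
       on this page are used.

```python
def build_phmmer_command(seqfile, seqdb, n_queries=1,
                         tblout=None, domtblout=None, main_output=None):
    """Assemble a phmmer argument list, enforcing the <stdin> rules.

    n_queries is a hypothetical, caller-supplied count of the query
    sequences in seqfile; phmmer itself takes no such argument.
    """
    # Only one input source may come through <stdin>.
    if seqfile == "-" and seqdb == "-":
        raise ValueError("only one of seqfile and seqdb may be '-'")
    # A streamed target database cannot be rewound for a second query.
    if seqdb == "-" and n_queries > 1:
        raise ValueError("seqdb cannot be '-' with more than one query")

    cmd = ["phmmer"]
    if main_output is not None:
        cmd += ["-o", main_output]        # e.g. "/dev/null" to discard it
    if tblout is not None:
        cmd += ["--tblout", tblout]       # per-target table
    if domtblout is not None:
        cmd += ["--domtblout", domtblout] # per-domain table
    cmd += [seqfile, seqdb]
    return cmd
```

       The returned list could then be passed to a process launcher such as
       Python's subprocess.run(cmd, check=True), assuming phmmer is installed
       and on the search path.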

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of
              the default stdout.

       -A <f> Save a multiple alignment of all significant hits (those
              satisfying inclusion thresholds) to the file <f> in Stockholm
              format.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-target output, with one data line per homologous target
              sequence found.

       --domtblout <f>
              Save a simple tabular (space-delimited) file summarizing the
              per-domain output, with one data line per homologous domain
              detected in a target sequence.

       --acc  Use accessions instead of names in the main output, where
              available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can
              greatly reduce the output volume.

       --notextw
              Remove the limit on the length of each line in the main output.
              The default is a limit of 120 characters per line, which helps
              in displaying the output cleanly on terminals and in editors,
              but can truncate target description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per
              line. The default is 120.
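
       As an illustration of why the --tblout file is easier to parse than
       the main output, the Python sketch below reads its data lines. The
       column layout is not specified on this page; the sketch assumes that
       comment lines begin with '#', that the first 18 fields are
       space-delimited, and that target name, query name, full-sequence
       E-value, and bit score sit in fields 1, 3, 5, and 6, with a free-text
       description last. Check these assumptions against the '#' header
       lines of your own output before relying on them.

```python
def parse_tblout(lines):
    """Yield one dict per data line of a --tblout file.

    Assumed layout (not stated on this man page): 18 whitespace-delimited
    fields followed by a free-text description of the target.
    """
    for line in lines:
        # Skip blank lines and '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split at most 18 times so the description keeps its spaces.
        fields = line.split(None, 18)
        yield {
            "target": fields[0],
            "query": fields[2],
            "evalue": float(fields[4]),
            "score": float(fields[5]),
            "description": fields[18].strip() if len(fields) > 18 else "",
        }
```

       Typical use would be to open the file named by --tblout and pass the
       file object to parse_tblout, since iterating a file yields lines.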

OPTIONS CONTROLLING SCORING SYSTEM

       The probability model in phmmer is constructed by inferring residue
       probabilities from a standard 20x20 substitution score matrix, plus two
       additional parameters for position-independent gap open and gap extend
       probabilities.

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model
              to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.

       --mx <s>
              Obtain residue alignment probabilities from the built-in
              substitution matrix named <s>. Several standard matrices are
              built in, and do not need to be read from files. The matrix
              name <s> can be PAM30, PAM70, PAM120, PAM240, BLOSUM45,
              BLOSUM50, BLOSUM62, BLOSUM80, or BLOSUM90. Only one of the
              --mx and --mxfile options may be used.

       --mxfile mxfile
              Obtain residue alignment probabilities from the substitution
              matrix in file mxfile. The default score matrix is BLOSUM62
              (this matrix is internal to HMMER and does not have to be
              available as a file). The format of a substitution matrix
              mxfile is the standard format accepted by BLAST, FASTA, and
              other sequence analysis software. See
              ftp.ncbi.nlm.nih.gov/blast/matrices/ for example files. (The
              only exception: we require matrices to be square, so for DNA,
              use files like NCBI's NUC.4.4, not NUC.4.2.)
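
       The constraints stated for the scoring options can be collected into
       a small pre-flight check. The Python sketch below is illustrative
       only; the function name is hypothetical, and treating matrix names as
       exact, case-sensitive matches is an assumption, since this page does
       not say how phmmer compares them.

```python
BUILTIN_MATRICES = {
    "PAM30", "PAM70", "PAM120", "PAM240",
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
}


def check_scoring_options(popen=0.02, pextend=0.4, mx=None, mxfile=None):
    """Return a list of problems; an empty list means no rule is violated."""
    problems = []
    if not 0 <= popen < 0.5:
        problems.append("--popen must be >= 0 and < 0.5")
    if not 0 <= pextend < 1.0:
        problems.append("--pextend must be >= 0 and < 1.0")
    if mx is not None and mxfile is not None:
        problems.append("only one of --mx and --mxfile may be used")
    # Assumption: names are compared exactly as listed on this page.
    if mx is not None and mx not in BUILTIN_MATRICES:
        problems.append("unknown built-in matrix: " + mx)
    return problems
```

       Called with no arguments, the function checks the documented defaults
       (0.02 and 0.4) and reports no problems.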

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files
       (the main output, --tblout, and --domtblout). Sequence hits and domain
       hits are ranked by statistical significance (E-value), and output is
       generated in two sections called per-target and per-domain output. In
       per-target output, by default, all sequence hits with an E-value <= 10
       are reported. In the per-domain output, for each target that has passed
       per-target reporting thresholds, all domains satisfying per-domain
       reporting thresholds are reported. By default, these are domains with
       conditional E-values of <= 10. The following options allow you to
       change the default E-value reporting thresholds, or to use bit score
       thresholds instead.

       -E <x> In the per-target output, report target sequences with an
              E-value of <= <x>. The default is 10.0, meaning that on average,
              about 10 false positives will be reported per query, so you can
              see the top of the noise and decide for yourself if it's really
              noise.

       -T <x> Instead of thresholding per-target output on E-value, report
              target sequences with a bit score of >= <x>.

       --domE <x>
              In the per-domain output, for target sequences that have already
              satisfied the per-target reporting threshold, report individual
              domains with a conditional E-value of <= <x>. The default is
              10.0. A conditional E-value means the expected number of
              additional false positive domains in the smaller search space of
              those comparisons that already satisfied the per-target
              reporting threshold (and thus must have at least one homologous
              domain already).

       --domT <x>
              Instead of thresholding per-domain output on E-value, report
              domains with a bit score of >= <x>.

OPTIONS CONTROLLING INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds. They
       control which hits are included in any output multiple alignment (the
       -A option) and which domains are marked as significant ("!") as opposed
       to questionable ("?") in domain output.

       --incE <x>
              Use an E-value of <= <x> as the per-target inclusion threshold.
              The default is 0.01, meaning that on average, about 1 false
              positive would be expected in every 100 searches with different
              query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold,
              use a bit score of >= <x> as the per-target inclusion
              threshold. By default this option is unset.

       --incdomE <x>
              Use a conditional E-value of <= <x> as the per-domain inclusion
              threshold, in targets that have already satisfied the overall
              per-target inclusion threshold. The default is 0.01.

       --incdomT <x>
              Instead of using E-values, use a bit score of >= <x> as the
              per-domain inclusion threshold. By default this option is unset.
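
       The relationship between the per-domain reporting and inclusion
       thresholds can be sketched as a small decision function. This Python
       example is an illustration under stated assumptions, not HMMER's
       implementation: it uses only E-value thresholds with the documented
       defaults (--domE 10.0, --incdomE 0.01), and it assumes the target
       sequence has already satisfied the per-target reporting and inclusion
       thresholds. The function name is hypothetical.

```python
def domain_status(cond_evalue, domE=10.0, incdomE=0.01):
    """Classify one domain by its conditional E-value.

    Returns "!" if the domain satisfies the inclusion threshold,
    "?" if it is reported but not included, and None if it falls
    outside the reporting threshold and would not be listed at all.
    """
    if cond_evalue <= incdomE:
        return "!"   # significant: included
    if cond_evalue <= domE:
        return "?"   # questionable: reported only
    return None      # not reported
```

       Because inclusion thresholds are stricter than reporting thresholds,
       the inclusion test is applied first.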

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first
       filter is the fastest and most approximate; the last is the full
       Forward scoring algorithm, slowest but most accurate. There is also a
       bias filter step between MSV and Viterbi. Targets that pass all the
       steps in the acceleration pipeline are then subjected to
       postprocessing: domain identification and scoring using the
       Forward/Backward algorithm.

       Essentially the only free parameters that control HMMER's heuristic
       filters are the P-value thresholds controlling the expected fraction of
       nonhomologous sequences that pass the filters. Setting the thresholds
       higher than the defaults will pass a higher proportion of nonhomologous
       sequences, increasing sensitivity at the expense of speed; conversely,
       setting lower P-value thresholds will pass a smaller proportion,
       decreasing sensitivity and increasing speed. Setting a filter's P-value
       threshold to 1.0 means it will pass all sequences, which effectively
       disables the filter.

       Changing filter thresholds only removes or includes targets from
       consideration; changing filter thresholds does not alter bit scores,
       E-values, or alignments, all of which are determined solely in
       postprocessing.

       --max  Maximum sensitivity. Turn off all filters, including the bias
              filter, and run full Forward/Backward postprocessing on every
              target. This increases sensitivity slightly, at a large cost in
              speed.

       --F1 <x>
              First filter threshold; set the P-value threshold for the MSV
              filter step. The default is 0.02, meaning that roughly 2% of
              the highest scoring nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Second filter threshold; set the P-value threshold for the
              Viterbi filter step. The default is 0.001.

       --F3 <x>
              Third filter threshold; set the P-value threshold for the
              Forward filter step. The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat,
              but can come at a high cost in speed, especially if the query
              has biased residue composition (such as a repetitive sequence
              region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may
              pass the filter with biased queries, leading to slower than
              expected performance as the computationally intensive
              Forward/Backward algorithms shoulder an abnormally heavy load.
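
       To get a feel for what the default filter thresholds mean, the Python
       sketch below multiplies a database size by each P-value threshold. It
       is rough arithmetic under simplifying assumptions, not a model of the
       real pipeline: it treats each threshold as the fraction of all
       nonhomologous targets expected to survive that filter, and it ignores
       the bias filter entirely. The function name is hypothetical.

```python
def expected_false_passes(n_nonhomologous, F1=0.02, F2=0.001, F3=1e-5):
    """Rough expected counts of nonhomologous targets passing each filter.

    Assumption: each P-value threshold is read as the fraction of all
    nonhomologous targets that pass that step; the bias filter is ignored.
    """
    return {
        "MSV": n_nonhomologous * F1,
        "Viterbi": n_nonhomologous * F2,
        "Forward": n_nonhomologous * F3,
    }
```

       Under these assumptions, a database of one million nonhomologous
       targets would send on the order of 20,000 sequences past MSV, 1,000
       past Viterbi, and 10 into postprocessing, which is why raising a
       threshold costs speed.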

OPTIONS CONTROLLING E-VALUE CALIBRATION

       Estimating the location parameters for the expected score distributions
       for MSV filter scores, Viterbi filter scores, and Forward scores
       requires three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in the simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in the simulation that estimates
              the location parameter mu for MSV filter E-values. Default is
              200.

       --EvL <n>
              Sets the sequence length in the simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in the simulation that estimates
              the location parameter mu for Viterbi filter E-values. Default
              is 200.

       --EfL <n>
              Sets the sequence length in the simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in the simulation that estimates
              the location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
              for the purposes of per-sequence E-value calculations, rather
              than the actual number of targets seen.

       --domZ <x>
              Assert that the total number of targets in your searches is <x>,
              for the purposes of per-domain conditional E-value calculations,
              rather than the number of targets that passed the reporting
              thresholds.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is > 0, any stochastic simulations will be reproducible; the
              same command will give the same results. If <n> is 0, the
              random number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command. The
              default seed is 42.

       --qformat <s>
              Assert that input seqfile is in format <s>, bypassing format
              autodetection. Common choices for <s> include: fasta, embl,
              genbank. Alignment formats also work; common choices include:
              stockholm, a2m, afa, psiblast, clustal, phylip. phmmer always
              uses a single sequence query to start its search, so when the
              input seqfile is an alignment, phmmer reads it one unaligned
              query sequence at a time, not as an alignment. For more
              information, and for codes for some less common formats, see
              the main documentation. The string <s> is case-insensitive
              (fasta or FASTA both work).

       --tformat <s>
              Assert that target sequence database seqdb is in format <s>,
              bypassing format autodetection. See --qformat above for the
              list of accepted format codes for <s>.

       --cpu <n>
              Set the number of parallel worker threads to <n>. On multicore
              machines, the default is 2. You can also control this number by
              setting an environment variable, HMMER_NCPU. There is also a
              master thread, so the actual number of threads that HMMER spawns
              is <n>+1.

              This option is not available if HMMER was compiled with POSIX
              threads support turned off.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker(s) processes. Send SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
              optional MPI support was enabled at compile-time.)

       --mpi  Run under MPI control with master/worker parallelization (using
              mpirun, for example, or equivalent). Only available if optional
              MPI support was enabled at compile-time.
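
       The effect of the -Z option described above can be illustrated with a
       short calculation. The Python sketch below assumes that a per-sequence
       E-value is a per-comparison P-value multiplied by the number of
       targets, so that it scales linearly with the asserted database size.
       This page does not spell out that formula, so treat it as an
       assumption; the function name is hypothetical.

```python
def rescale_evalue(evalue, actual_z, asserted_z):
    """Estimate the E-value that would be reported under -Z <asserted_z>.

    Assumption: E-value = P-value * Z, so changing Z rescales the
    E-value by the ratio asserted_z / actual_z.
    """
    if actual_z <= 0 or asserted_z <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * asserted_z / actual_z
```

       For example, under this assumption a hit with an E-value of 0.001 in
       a search of 1,000 targets would have an E-value of about 1 if -Z
       asserted a database of 1,000,000 targets. This is useful when
       comparing results from searches against databases of different sizes.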

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org

HMMER 3.3.2                        Nov 2020                          phmmer(1)