hmmbuild(1)                      HMMER Manual                      hmmbuild(1)

NAME

       hmmbuild - construct profile HMM(s) from multiple sequence alignment(s)

SYNOPSIS

       hmmbuild [options] <hmmfile_out> <msafile>

DESCRIPTION

       For each multiple sequence alignment in <msafile>, build a profile HMM
       and save it to a new file <hmmfile_out>.

       <msafile> may be '-' (dash), which means reading this input from stdin
       rather than a file. To use '-', you must also specify the alignment
       file format with --informat <s>, as in --informat stockholm. (Because
       of a current limitation in our implementation, MSA file formats cannot
       be autodetected in a nonrewindable input stream.)

       <hmmfile_out> may not be '-' (stdout), because sending the HMM file to
       stdout would conflict with the other text output of the program.
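
       The two rules about '-' can be checked before the program is ever
       launched. The following is a minimal sketch, not part of HMMER: the
       helper name build_hmmbuild_argv and the file name globins.hmm are
       hypothetical, and only the --informat option documented on this page
       is used.

```python
def build_hmmbuild_argv(hmmfile_out, msafile, informat=None):
    """Return an argv list for hmmbuild, enforcing the '-' rules above.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    if hmmfile_out == "-":
        # The HMM file cannot go to stdout.
        raise ValueError("<hmmfile_out> may not be '-' (stdout)")
    if msafile == "-" and informat is None:
        # Formats cannot be autodetected on a nonrewindable stream.
        raise ValueError("reading <msafile> from stdin requires --informat")
    argv = ["hmmbuild"]
    if informat is not None:
        argv += ["--informat", informat]
    argv += [hmmfile_out, msafile]
    return argv


# Alignment on stdin, format declared explicitly.
print(build_hmmbuild_argv("globins.hmm", "-", informat="stockholm"))
```

       The resulting list could be handed to a process launcher with the
       alignment supplied on standard input.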

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

       -n <s> Name the new profile <s>. The default is to use the name of the
              alignment (if one is present in the msafile) or, failing that,
              the name of the hmmfile. If msafile contains more than one
              alignment, -n cannot be used, and every alignment must have a
              name annotated in the msafile (as in Stockholm #=GF ID
              annotation).

       -o <f> Direct the summary output to file <f>, rather than to stdout.

       -O <f> After each model is constructed, resave annotated, possibly
              modified source alignments to a file <f> in Stockholm format.
              The alignments are annotated with a reference annotation line
              indicating which columns were assigned as consensus, and
              sequences are annotated with the relative sequence weights they
              were assigned. Some residues of the alignment may have been
              shifted to accommodate restrictions of the Plan7 profile
              architecture, which disallows transitions between insert and
              delete states.

OPTIONS FOR SPECIFYING THE ALPHABET

       The alphabet type (amino, DNA, or RNA) is autodetected by default, by
       looking at the composition of the msafile. Autodetection is normally
       quite reliable, but occasionally alphabet type may be ambiguous and
       autodetection can fail (for instance, on tiny toy alignments of just a
       few residues). To avoid this, or to increase robustness in automated
       analysis pipelines, you may specify the alphabet type of msafile with
       these options.

       --amino
              Specify that all sequences in msafile are proteins.

       --dna  Specify that all sequences in msafile are DNAs.

       --rna  Specify that all sequences in msafile are RNAs.

OPTIONS CONTROLLING PROFILE CONSTRUCTION

       These options control how consensus columns are defined in an
       alignment.

       --fast Define consensus columns as those that have a fraction >=
              symfrac of residues as opposed to gaps. (See below for the
              --symfrac option.) This is the default.

       --hand Define consensus columns in the next profile using the reference
              annotation of the multiple alignment. This allows you to define
              any consensus columns you like.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a
              consensus column when using the --fast option. The default is
              0.5. The symbol fraction in each column is calculated after
              taking relative sequence weighting into account, and ignoring
              gap characters corresponding to ends of sequence fragments (as
              opposed to internal insertions/deletions). Setting this to 0.0
              means that every alignment column will be assigned as
              consensus, which may be useful in some cases. Setting it to 1.0
              means that only columns that include 0 gaps (internal
              insertions/deletions) will be assigned as consensus.

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). HMMER
              uses a simple rule to infer fragments: if the range of a
              sequence in the alignment (the number of alignment columns
              between the first and last positions of the sequence) is less
              than or equal to a fraction <x> times the alignment length in
              columns, then the sequence is handled as a fragment. The
              default is 0.5. Setting --fragthresh 0 will define no
              (nonempty) sequence as a fragment; you might want to do this if
              you know you've got a carefully curated alignment of
              full-length sequences. Setting --fragthresh 1 will define all
              sequences as fragments; you might want to do this if you know
              your alignment is entirely composed of fragments, such as
              translated short reads in metagenomic shotgun data.
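
       The --fragthresh and --symfrac rules described above can be sketched
       together as follows. This is an illustration of the rules as worded on
       this page, not HMMER's source code. The function names, the toy
       alignment, and the choice of '-' and '.' as gap characters are
       assumptions made for the example.

```python
GAPS = set("-.")  # assumed gap characters for this sketch


def seq_span(row):
    """Return (first, last) column indices holding a residue, or None."""
    idx = [i for i, ch in enumerate(row) if ch not in GAPS]
    return (idx[0], idx[-1]) if idx else None


def is_fragment(row, fragthresh=0.5):
    """Fragment rule: range <= fragthresh * alignment length."""
    span = seq_span(row)
    if span is None:
        return False  # empty rows are left out of this sketch
    first, last = span
    return (last - first + 1) <= fragthresh * len(row)


def consensus_columns(msa, weights, symfrac=0.5, fragthresh=0.5):
    """Mark each column True if its weighted residue fraction >= symfrac.

    Columns outside the span of a fragment are ignored for that fragment,
    so its terminal gaps are not counted as deletions.
    """
    marks = []
    for j in range(len(msa[0])):
        residues = 0.0
        total = 0.0
        for row, w in zip(msa, weights):
            span = seq_span(row)
            if span is None:
                continue
            if is_fragment(row, fragthresh) and not (span[0] <= j <= span[1]):
                continue  # terminal gap of a fragment: not counted
            total += w
            if row[j] not in GAPS:
                residues += w
        frac = residues / total if total > 0 else 0.0
        marks.append(frac >= symfrac)
    return marks


msa = [
    "AC-EFGHIKL",
    "AC-E-GHIKL",
    "--DEF-----",  # range 3 of 10 columns: a fragment at the default 0.5
]
weights = [1.0, 1.0, 1.0]
print(consensus_columns(msa, weights))                # default thresholds
print(consensus_columns(msa, weights, symfrac=1.0))   # no internal gaps allowed
```

       With the default thresholds only the third column, where two of three
       counted sequences have a gap, fails to become consensus. Raising
       symfrac to 1.0 also drops the fifth column, which has one internal gap.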

OPTIONS CONTROLLING RELATIVE WEIGHTS

       HMMER uses an ad hoc sequence weighting algorithm to downweight closely
       related sequences and upweight distantly related ones. This has the
       effect of making models less biased by uneven phylogenetic
       representation. For example, two identical sequences would typically
       each receive half the weight that one sequence would. These options
       control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al., J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci. 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default
              0.62; see --wid) and within each cluster of c sequences, each
              sequence gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering
              when using --wblosum. Invalid with any other weighting scheme.
              Default is 0.62.
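
       The --wblosum scheme described above (single-linkage clustering at an
       identity threshold, then weight 1/c within a cluster of c sequences)
       can be sketched as follows. This is an illustration rather than
       HMMER's implementation. In particular, the pairwise identity
       definition used here (identical aligned residues divided by the
       length of the shorter sequence) is an assumption, as are the function
       names and the toy alignment.

```python
GAPS = set("-.")  # assumed gap characters for this sketch


def pairwise_identity(a, b):
    """Identical aligned residues / residues in the shorter sequence.

    This definition is an assumption made for the sketch.
    """
    ident = sum(1 for x, y in zip(a, b) if x not in GAPS and x == y)
    len_a = sum(1 for x in a if x not in GAPS)
    len_b = sum(1 for y in b if y not in GAPS)
    shorter = min(len_a, len_b)
    return ident / shorter if shorter else 0.0


def blosum_weights(msa, wid=0.62):
    """Single-linkage cluster at identity >= wid; weight is 1/c per cluster."""
    n = len(msa)
    parent = list(range(n))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # One link above the threshold is enough to merge two clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(msa[i], msa[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    sizes = {r: roots.count(r) for r in set(roots)}
    return [1.0 / sizes[r] for r in roots]


msa = [
    "ACDEFGHIKL",
    "ACDEFGHIKL",  # identical to the first sequence
    "ACDEFGHIKV",  # 90% identical to the first two
    "MNPQRSTVWY",  # unrelated
]
print(blosum_weights(msa))            # three sequences share one cluster
print(blosum_weights(msa, wid=0.95))  # a stricter --wid splits the cluster
```

       At the default threshold the first three sequences form one cluster
       and receive weight 1/3 each, while the unrelated sequence keeps
       weight 1.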

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a
       total effective sequence number, eff_nseq. This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that. The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.

       --eent Adjust effective sequence number to achieve a specific relative
              entropy per position (see --ere). This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid). This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to
              <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.
              Requires --eent. Default depends on the sequence alphabet. For
              protein sequences, it is 0.59 bits/position; for nucleotide
              sequences, it is 0.45 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire model
              alignment, over its whole length. This has the effect of making
              short models have higher relative entropy per position than
              --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by
              single-linkage clustering with the --eclust option. The default
              is 0.62.
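
       Two pieces of the description above lend themselves to a short
       sketch: rescaling relative weights so they sum to eff_nseq, and the
       way --ere and --esigma interact for short models. The second function
       is only a simplified reading of the text (a per-position floor of
       --ere and a whole-model floor of --esigma spread over the model
       length). The two-parameter function HMMER actually uses may differ in
       detail, and the search that --eent performs to reach the target is
       not shown. All function names are hypothetical.

```python
def normalize_weights(weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(weights)
    return [w * eff_nseq / total for w in weights]


def target_relent_per_position(model_length, ere=0.59, esigma=45.0):
    """Simplified target in bits/position (an assumption, see lead-in).

    The per-position floor is ere; the whole-model floor of esigma bits
    becomes esigma / model_length per position, which dominates when the
    model is short.
    """
    return max(ere, esigma / model_length)


# Relative weights from some weighting scheme, rescaled to eff_nseq = 1.2.
scaled = normalize_weights([0.5, 0.5, 1.0], eff_nseq=1.2)
print(scaled, sum(scaled))

# A short model is pushed above the --ere default; a long one is not.
print(target_relent_per_position(50))    # 0.9
print(target_relent_per_position(200))   # 0.59
```

       Under this reading, with the protein defaults, models shorter than
       about 76 positions (45.0 / 0.59) would have their per-position target
       set by --esigma rather than by --ere.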

OPTIONS CONTROLLING PRIORS

       By default, weighted counts are converted to mean posterior probability
       parameter estimates using mixture Dirichlet priors. Default mixture
       Dirichlet prior parameters for protein models and for nucleic acid (RNA
       and DNA) models are built in. The following options allow you to
       override the default priors.

       --pnone
              Don't use any priors. Probability parameters will simply be the
              observed frequencies, after relative sequence weighting.

       --plaplace
              Use a Laplace +1 prior in place of the default mixture Dirichlet
              prior.
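
       The difference between --pnone and --plaplace can be seen on a single
       column of weighted counts. The sketch below is illustrative only: the
       function name and the toy DNA counts are hypothetical, and the
       default mixture Dirichlet prior is not reproduced here.

```python
def estimate_probs(counts, prior="laplace"):
    """Turn weighted counts into probabilities.

    counts must have one entry per alphabet symbol, including zeros.
    prior="none" gives observed frequencies; prior="laplace" adds one
    pseudocount to every symbol before normalizing.
    """
    total = sum(counts.values())
    k = len(counts)  # alphabet size
    if prior == "none":
        return {a: c / total for a, c in counts.items()}
    if prior == "laplace":
        return {a: (c + 1.0) / (total + k) for a, c in counts.items()}
    raise ValueError("prior must be 'none' or 'laplace'")


# Weighted residue counts observed in one consensus column (toy values).
counts = {"A": 3.0, "C": 1.0, "G": 0.0, "T": 0.0}
print(estimate_probs(counts, prior="none"))
# {'A': 0.75, 'C': 0.25, 'G': 0.0, 'T': 0.0}
print(estimate_probs(counts, prior="laplace"))
# {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}
```

       Without a prior, residues never seen in the column get probability
       zero; the +1 prior keeps every residue possible.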

OPTIONS CONTROLLING E-VALUE CALIBRATION

       The location parameters for the expected score distributions for MSV
       filter scores, Viterbi filter scores, and Forward scores are estimated
       from three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EfL <n>
              Sets the sequence length in simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default,
              HMMER sets this to the number of CPU cores it detects in your
              machine; that is, it tries to maximize the use of your
              available processor cores. Setting <n> higher than the number
              of available cores is of little if any value, but you may want
              to set it to something less. You can also control this number
              by setting an environment variable, HMMER_NCPU.

              This option is only available if HMMER was compiled with POSIX
              threads support. This is the default, but it may have been
              turned off for your site or machine for some reason.

       --informat <s>
              Declare that the input msafile is in format <s>. Currently the
              accepted multiple alignment sequence file formats include
              Stockholm, Aligned FASTA, Clustal, NCBI PSI-BLAST, PHYLIP,
              Selex, and UCSC SAM A2M. Default is to autodetect the format of
              the file.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is nonzero, any stochastic simulations will be
              reproducible; the same command will give the same results. If
              <n> is 0, the random number generator is seeded arbitrarily,
              and stochastic simulations will vary from run to run of the
              same command. The default seed is 42.

       --w_beta <x>
              Window length tail mass. The upper bound, W, on the length at
              which nhmmer expects to find an instance of the model is set
              such that the fraction of all sequences generated by the model
              with length >= W is less than <x>. The default is 1e-7.

       --w_length <n>
              Override the model instance length upper bound, W, which is
              otherwise controlled by --w_beta. It should be larger than the
              model length. The value of W is used deep in the acceleration
              pipeline, and modest changes are not expected to impact results
              (though larger values of W do lead to longer run time).

       --mpi  Run as a parallel MPI program. Each alignment is assigned to an
              MPI worker node for construction. (Therefore, the maximum
              parallelization cannot exceed the number of alignments in the
              input msafile.) This is useful when building large profile
              libraries. This option is only available if optional MPI
              capability was enabled at compile time.

       --stall
              For debugging MPI parallelization: arrest program execution
              immediately after start, and wait for a debugger to attach to
              the running process and release the arrest.

       --maxinsertlen <n>
              Restrict insert length parameterization such that the expected
              insert length at each position of the model is no more than
              <n>.
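
       The --w_beta rule earlier in this section picks W from the tail of the
       model's sequence length distribution. The sketch below shows that rule
       on a small, made-up distribution. The function name and the
       length_probs table are hypothetical, and how HMMER obtains the length
       distribution from a profile is not shown.

```python
def window_length(length_probs, beta=1e-7):
    """Smallest W with P(length >= W) < beta.

    length_probs[L] is the probability that a generated sequence has
    length L. The tail is accumulated from the long end to avoid
    subtracting nearly equal numbers.
    """
    tail = 0.0
    w = len(length_probs)  # beyond the table, the tail mass is 0
    for length in range(len(length_probs) - 1, -1, -1):
        if tail + length_probs[length] >= beta:
            break  # including this length would reach beta
        tail += length_probs[length]
        w = length
    return w


# Toy distribution over lengths 0..5 (values are invented).
length_probs = [0.0, 0.5, 0.3, 0.15, 0.04, 0.01]
print(window_length(length_probs, beta=0.1))   # 4
print(window_length(length_probs, beta=0.02))  # 5
```

       A smaller tail mass gives a larger W, which is consistent with the
       note under --w_length that larger values of W lead to longer run time.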

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page.

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER 3.1b2                      February 2015                     hmmbuild(1)