hmmbuild(1)

hmmbuild(1)                      HMMER Manual                      hmmbuild(1)

NAME

       hmmbuild - construct profiles from multiple sequence alignments

SYNOPSIS

       hmmbuild [options] hmmfile msafile

DESCRIPTION

       For each multiple sequence alignment in msafile, build a profile HMM
       and save it to a new file hmmfile.

       msafile may be '-' (dash), which means reading this input from stdin
       rather than a file.

       hmmfile may not be '-' (stdout), because sending the HMM file to stdout
       would conflict with the other text output of the program.
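
       As an illustration of the calling convention above, the following
       Python sketch assembles an hmmbuild command line and, optionally,
       pipes an alignment to it on stdin. The helper names hmmbuild_argv and
       build_from_text are hypothetical and not part of HMMER; only the
       argument order and the -n and -o options are taken from this page.
       Running build_from_text assumes that hmmbuild is installed and on
       PATH.

```python
import subprocess


def hmmbuild_argv(hmmfile, msafile, name=None, summary_file=None):
    """Build an argv list following 'hmmbuild [options] hmmfile msafile'.

    Hypothetical helper: only the argument order and the -n/-o options
    come from the man page.
    """
    if hmmfile == "-":
        # The man page states that hmmfile may not be '-' (stdout).
        raise ValueError("hmmfile may not be '-'")
    argv = ["hmmbuild"]
    if name is not None:
        argv += ["-n", name]          # name the new profile
    if summary_file is not None:
        argv += ["-o", summary_file]  # send summary output to a file
    argv += [hmmfile, msafile]
    return argv


def build_from_text(alignment_text, hmmfile):
    """Pipe alignment text to hmmbuild on stdin by passing '-' as msafile.

    Assumes hmmbuild is installed and on PATH; raises CalledProcessError
    if the program exits with a nonzero status.
    """
    return subprocess.run(
        hmmbuild_argv(hmmfile, "-"),
        input=alignment_text,
        text=True,
        check=True,
    )
```

       For example, hmmbuild_argv("globins.hmm", "globins.sto", name="globins")
       yields the list for "hmmbuild -n globins globins.hmm globins.sto";
       the file names here are placeholders.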

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

       -n <s> Name the new profile <s>. The default is to use the name of the
              alignment (if one is present in the msafile) or, failing that,
              the name of the hmmfile. If msafile contains more than one
              alignment, -n doesn't work, and every alignment must have a name
              annotated in the msafile (as in Stockholm #=GF ID annotation).

       -o <f> Direct the summary output to file <f>, rather than to stdout.

       -O <f> After each model is constructed, resave annotated, possibly
              modified source alignments to a file <f> in Stockholm format.
              The alignments are annotated with a reference annotation line
              indicating which columns were assigned as consensus, and
              sequences are annotated with what relative sequence weights were
              assigned. Some residues of the alignment may have been shifted
              to accommodate restrictions of the Plan7 profile architecture,
              which disallows transitions between insert and delete states.

OPTIONS FOR SPECIFYING THE ALPHABET

       --amino
              Assert that sequences in msafile are protein, bypassing alphabet
              autodetection.

       --dna  Assert that sequences in msafile are DNA, bypassing alphabet
              autodetection.

       --rna  Assert that sequences in msafile are RNA, bypassing alphabet
              autodetection.

OPTIONS CONTROLLING PROFILE CONSTRUCTION

       These options control how consensus columns are defined in an
       alignment.

       --fast Define consensus columns as those that have a fraction >=
              symfrac of residues as opposed to gaps. (See below for the
              --symfrac option.) This is the default.

       --hand Define consensus columns in the next profile using the reference
              annotation of the multiple alignment. This allows you to define
              any consensus columns you like.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a
              consensus column when using the --fast option. The default is
              0.5. The symbol fraction in each column is calculated after
              taking relative sequence weighting into account, and ignoring
              gap characters corresponding to ends of sequence fragments (as
              opposed to internal insertions/deletions). Setting this to 0.0
              means that every alignment column will be assigned as consensus,
              which may be useful in some cases. Setting it to 1.0 means that
              only columns that include 0 gaps (internal
              insertions/deletions) will be assigned as consensus.

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). HMMER
              uses a simple rule to infer fragments: if the range of a
              sequence in the alignment (the number of alignment columns
              between the first and last positions of the sequence) is less
              than or equal to a fraction <x> times the alignment length in
              columns, then the sequence is handled as a fragment. The default
              is 0.5. Setting --fragthresh 0 will define no (nonempty)
              sequence as a fragment; you might want to do this if you know
              you've got a carefully curated alignment of full-length
              sequences. Setting --fragthresh 1 will define all sequences as
              fragments; you might want to do this if you know your alignment
              is entirely composed of fragments, such as translated short
              reads in metagenomic shotgun data.
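
       The --fragthresh and --symfrac rules described above can be sketched
       in Python as follows. This is an illustration of the rules as worded
       on this page, not HMMER's implementation. It assumes that '-' and '.'
       are the gap characters, that the range of a sequence counts both its
       first and last residue columns, and that all rows have equal length.
       The function names are hypothetical.

```python
GAP_CHARS = set("-.")  # assumed gap symbols for this sketch


def residue_span(row):
    """Return (first, last) residue column indices, or None if all gaps."""
    cols = [i for i, ch in enumerate(row) if ch not in GAP_CHARS]
    return (cols[0], cols[-1]) if cols else None


def is_fragment(row, fragthresh=0.5):
    """Fragment rule: residue range <= fragthresh * alignment length."""
    span = residue_span(row)
    if span is None:
        return False  # empty rows are not classified in this sketch
    first, last = span
    return (last - first + 1) <= fragthresh * len(row)


def consensus_columns(rows, weights=None, symfrac=0.5, fragthresh=0.5):
    """Return one boolean per column: True if the column is consensus.

    The weighted residue fraction ignores terminal gaps of fragments,
    while internal gaps and gaps in full-length rows count against it.
    """
    weights = weights or [1.0] * len(rows)
    spans = [residue_span(r) for r in rows]
    frags = [is_fragment(r, fragthresh) for r in rows]
    consensus = []
    for col in range(len(rows[0])):
        residue_w = 0.0
        total_w = 0.0
        for row, w, span, frag in zip(rows, weights, spans, frags):
            if row[col] not in GAP_CHARS:
                residue_w += w
                total_w += w
            elif frag and (col < span[0] or col > span[1]):
                continue  # terminal gap of a fragment: ignored
            else:
                total_w += w  # gap counted as a deletion
        consensus.append(total_w > 0 and residue_w / total_w >= symfrac)
    return consensus
```

       With the rows "ACDEFGHI", "AC-EFGHI", and "---EF---", the third row
       is treated as a fragment at the default threshold, so its terminal
       gaps do not count against the columns they occupy.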

OPTIONS CONTROLLING RELATIVE WEIGHTS

       HMMER uses an ad hoc sequence weighting algorithm to downweight closely
       related sequences and upweight distantly related ones. This has the
       effect of making models less biased by uneven phylogenetic
       representation. For example, two identical sequences would typically
       each receive half the weight that one sequence would. These options
       control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al., J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci. 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid) and within each cluster of c sequences, each sequence
              gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering
              when using --wblosum. Invalid with any other weighting scheme.
              Default is 0.62.
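
       The --wblosum scheme described above can be sketched in Python as
       follows: cluster sequences by single linkage at an identity
       threshold, then give each member of a cluster of c sequences the
       weight 1/c. The pairwise identity definition used here (identical
       residues over columns where both rows have a residue) and the use of
       >= at the threshold are assumptions of this sketch; HMMER's own
       definitions may differ. The function names are hypothetical.

```python
def pairwise_identity(a, b, gaps="-."):
    """Fraction of identical residues over columns where both rows have one.

    This definition is an assumption made for the sketch.
    """
    both = [(x, y) for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not both:
        return 0.0
    return sum(x == y for x, y in both) / len(both)


def blosum_style_weights(rows, wid=0.62):
    """Single-linkage cluster rows at identity >= wid; weight each row 1/c."""
    parent = list(range(len(rows)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single linkage: one sufficiently similar pair joins two clusters.
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if pairwise_identity(rows[i], rows[j]) >= wid:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(rows)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(len(rows))]
```

       Because linkage is single, two sequences below the threshold with
       respect to each other still share a cluster when a third sequence is
       above the threshold with respect to both.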

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a
       total effective sequence number, eff_nseq. This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that. The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.

       --eent Adjust effective sequence number to achieve a specific relative
              entropy per position (see --ere). This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid). This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to
              <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.
              Requires --eent. Default depends on the sequence alphabet. For
              protein sequences, it is 0.59 bits/position; for nucleotide
              sequences, it is 0.45 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire model
              alignment, over its whole length. This has the effect of making
              short models have higher relative entropy per position than
              --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by
              single-linkage clustering with the --eclust option. The default
              is 0.62.
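
       The following Python sketch illustrates two ideas from this section.
       The first function rescales relative weights so that they sum to
       eff_nseq, as stated above. The second is only one plausible reading
       of how --ere and --esigma interact: it takes the larger of the
       per-position minimum and the whole-model minimum spread over the
       model length. That formula is an assumption made for illustration;
       this page does not give the actual two-parameter function, and
       HMMER's may differ. The function names are hypothetical.

```python
def normalize_weights(rel_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(rel_weights)
    return [w * eff_nseq / total for w in rel_weights]


def target_relent_per_position(model_length, ere=0.59, esigma=45.0):
    """Assumed target: max(ere, esigma / model_length), in bits/position.

    Illustrative only; not taken from the HMMER source.
    Defaults are the protein values quoted in the man page.
    """
    return max(ere, esigma / model_length)
```

       Under this assumed form and the protein defaults, a model of length
       50 would target 0.9 bits/position, while a model of length 200 would
       fall back to the --ere value of 0.59 bits/position.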

OPTIONS CONTROLLING PRIORS

       By default, weighted counts are converted to mean posterior probability
       parameter estimates using mixture Dirichlet priors. Default mixture
       Dirichlet prior parameters for protein models and for nucleic acid (RNA
       and DNA) models are built in. The following options allow you to
       override the default priors.

       --pnone
              Don't use any priors. Probability parameters will simply be the
              observed frequencies, after relative sequence weighting.

       --plaplace
              Use a Laplace +1 prior in place of the default mixture Dirichlet
              prior.
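
       The difference between --pnone and --plaplace can be illustrated on a
       single distribution of weighted counts. The Python sketch below
       assumes the counts dictionary has one key for every symbol of the
       alphabet; it does not cover how HMMER applies priors across emission
       and transition parameters. The function names are hypothetical.

```python
def observed_frequencies(counts):
    """No prior: each probability is count / total (total must be > 0)."""
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}


def laplace_estimates(counts):
    """Laplace +1 prior: add one pseudocount per symbol, then normalize."""
    total = sum(counts.values()) + len(counts)
    return {sym: (c + 1.0) / total for sym, c in counts.items()}
```

       For weighted counts A=3, C=1, G=0, T=0, the observed frequencies give
       G and T probability 0, whereas the Laplace estimates give them 1/8
       each, so unobserved symbols are never assigned zero probability.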

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING

       By default, if a query is a single sequence from a file in FASTA
       format, hmmbuild constructs a search model from that sequence and a
       standard 20x20 substitution matrix for residue probabilities, along
       with two additional parameters for position-independent gap open and
       gap extend probabilities. These options allow the default
       single-sequence scoring parameters to be changed, and for
       single-sequence scoring options to be applied to a single sequence
       coming from an aligned format.

       --singlemx
              If a single sequence query comes from a multiple sequence
              alignment file, such as in Stockholm format, the search model is
              by default constructed as is typically done for multiple
              sequence alignments. This option forces hmmbuild to use the
              single-sequence method with a substitution score matrix.

       --mx <s>
              Obtain residue alignment probabilities from the built-in
              substitution matrix named <s>. Several standard matrices are
              built in, and do not need to be read from files. The matrix name
              <s> can be PAM30, PAM70, PAM120, PAM240, BLOSUM45, BLOSUM50,
              BLOSUM62, BLOSUM80, BLOSUM90, or DNA1. Only one of the --mx and
              --mxfile options may be used.

       --mxfile <mxfile>
              Obtain residue alignment probabilities from the substitution
              matrix in file <mxfile>. The default score matrix is BLOSUM62
              for protein sequences, and DNA1 for nucleotide sequences (these
              matrices are internal to HMMER and do not need to be available
              as a file). The format of a substitution matrix <mxfile> is the
              standard format accepted by BLAST, FASTA, and other sequence
              analysis software. See ftp.ncbi.nlm.nih.gov/blast/matrices/ for
              example files. (The only exception: we require matrices to be
              square, so for DNA, use files like NCBI's NUC.4.4, not NUC.4.2.)

       --popen <x>
              Set the gap open probability for a single sequence query model
              to <x>. The default is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model
              to <x>. The default is 0.4. <x> must be >= 0 and < 1.0.

OPTIONS CONTROLLING E-VALUE CALIBRATION

       Estimating the location parameters for the expected score distributions
       for MSV filter scores, Viterbi filter scores, and Forward scores
       requires three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in the simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in the simulation that estimates
              the location parameter mu for MSV filter E-values. Default is
              200.

       --EvL <n>
              Sets the sequence length in the simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in the simulation that estimates
              the location parameter mu for Viterbi filter E-values. Default
              is 200.

       --EfL <n>
              Sets the sequence length in the simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in the simulation that estimates
              the location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS

       --cpu <n>
              Set the number of parallel worker threads to <n>. On multicore
              machines, the default is 2. You can also control this number by
              setting an environment variable, HMMER_NCPU. There is also a
              master thread, so the actual number of threads that HMMER spawns
              is <n>+1.

              This option is not available if HMMER was compiled with POSIX
              threads support turned off.

       --informat <s>
              Assert that input msafile is in alignment format <s>, bypassing
              format autodetection. Common choices for <s> include: stockholm,
              a2m, afa, psiblast, clustal, phylip. For more information, and
              for codes for some less common formats, see the main
              documentation. The string <s> is case-insensitive (a2m or A2M
              both work).

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is nonzero, any stochastic simulations will be reproducible;
              the same command will give the same results. If <n> is 0, the
              random number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command. The
              default seed is 42.

       --w_beta <x>
              Window length tail mass. The upper bound, W, on the length at
              which nhmmer expects to find an instance of the model is set
              such that the fraction of all sequences generated by the model
              with length >= W is less than <x>. The default is 1e-7.

       --w_length <n>
              Override the model instance length upper bound, W, which is
              otherwise controlled by --w_beta. It should be larger than the
              model length. The value of W is used deep in the acceleration
              pipeline, and modest changes are not expected to impact results
              (though larger values of W do lead to longer run time).

       --mpi  Run as a parallel MPI program. Each alignment is assigned to an
              MPI worker node for construction. (Therefore, the maximum
              parallelization cannot exceed the number of alignments in the
              input msafile.) This is useful when building large profile
              libraries. This option is only available if optional MPI
              capability was enabled at compile time.

       --stall
              For debugging MPI parallelization: arrest program execution
              immediately after start, and wait for a debugger to attach to
              the running process and release the arrest.

       --maxinsertlen <n>
              Restrict insert length parameterization such that the expected
              insert length at each position of the model is no more than <n>.

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org

HMMER 3.3.2                        Nov 2020                        hmmbuild(1)