hmmbuild(1)                      HMMER Manual                      hmmbuild(1)

NAME

       hmmbuild - construct profile HMM(s) from multiple sequence alignment(s)

SYNOPSIS

       hmmbuild [options] hmmfile msafile

DESCRIPTION

       Build a profile HMM for each multiple sequence alignment in msafile,
       and save it to a new file hmmfile.

OPTIONS

       -h     Help; print a brief reminder of command line usage and all
              available options.

       -n <s> Name the new profile <s>. The default is to use the name of the
              alignment (if one is present in the msafile) or, failing that,
              the name of the hmmfile. If msafile contains more than one
              alignment, -n cannot be used, and every alignment must have a
              name annotated in the msafile (as in Stockholm #=GF ID
              annotation).

       -o <f> Direct the summary output to file <f>, rather than to stdout.

       -O <f> After each model is constructed, resave annotated, possibly
              modified source alignments to a file <f> in Stockholm format.
              The alignments are annotated with a reference annotation line
              indicating which columns were assigned as consensus, and
              sequences are annotated with what relative sequence weights
              were assigned. Some residues of the alignment may have been
              shifted to accommodate restrictions of the Plan7 profile
              architecture, which disallows transitions between insert and
              delete states.

OPTIONS FOR SPECIFYING THE ALPHABET

       The alphabet type (amino, DNA, or RNA) is autodetected by default, by
       looking at the composition of the msafile. Autodetection is normally
       quite reliable, but occasionally alphabet type may be ambiguous and
       autodetection can fail (for instance, on tiny toy alignments of just a
       few residues). To avoid this, or to increase robustness in automated
       analysis pipelines, you may specify the alphabet type of msafile with
       these options.

       --amino
              Specify that all sequences in msafile are proteins.

       --dna  Specify that all sequences in msafile are DNAs.

       --rna  Specify that all sequences in msafile are RNAs.
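
       For automated pipelines, the command line can be assembled with the
       alphabet stated explicitly. The following is only a sketch: the
       helper name hmmbuild_command and the file names globins.hmm and
       globins.sto are hypothetical, and only flags documented in this page
       are used.

```python
import shlex

# Map a pipeline-level alphabet label to the documented hmmbuild flag.
ALPHABET_FLAGS = {"amino": "--amino", "dna": "--dna", "rna": "--rna"}


def hmmbuild_command(hmmfile, msafile, alphabet=None, name=None):
    """Return an argv list of the form: hmmbuild [options] hmmfile msafile.

    Hypothetical helper for illustration; it does not run hmmbuild.
    """
    cmd = ["hmmbuild"]
    if alphabet is not None:
        # Raises KeyError for anything other than amino, dna, or rna.
        cmd.append(ALPHABET_FLAGS[alphabet])
    if name is not None:
        # -n is only meaningful when msafile holds a single alignment.
        cmd += ["-n", name]
    cmd += [hmmfile, msafile]
    return cmd


# Hypothetical file names; prints: hmmbuild --amino globins.hmm globins.sto
print(shlex.join(hmmbuild_command("globins.hmm", "globins.sto", alphabet="amino")))
```

       The returned list can be passed to subprocess.run() unchanged, which
       avoids shell quoting problems with unusual file names.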

OPTIONS CONTROLLING PROFILE CONSTRUCTION

       These options control how consensus columns are defined in an
       alignment.

       --fast Define consensus columns as those that have a fraction >=
              symfrac of residues as opposed to gaps. (See below for the
              --symfrac option.) This is the default.

       --hand Define consensus columns using the reference annotation of the
              multiple alignment. This allows you to define any consensus
              columns you like.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a
              consensus column when using the default --fast construction
              option. The default for --symfrac is 0.5. The symbol fraction
              in each column is calculated after taking relative sequence
              weighting into account, and ignoring gap characters
              corresponding to ends of sequence fragments (as opposed to
              internal insertions/deletions). Setting this to 0.0 means that
              every alignment column will be assigned as consensus, which may
              be useful in some cases. Setting it to 1.0 means that only
              columns that have no gap characters at all will be assigned as
              consensus.

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). HMMER
              uses a simple rule to infer fragments: if the sequence length L
              is less than a fraction <x> times the mean sequence length of
              all the sequences in the alignment, then the sequence is handled
              as a fragment. The default is 0.5.
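
       The interaction of --symfrac and --fragthresh described above can be
       sketched as follows. This is an illustration of the stated rules, not
       HMMER's implementation. The function names, the treatment of '-' and
       '.' as gap characters, and the handling of columns with no counted
       symbols are assumptions.

```python
GAP_CHARS = set("-.")  # assumption: both characters denote gaps


def find_fragments(seqs, fragthresh=0.5):
    """Flag sequences whose residue count L is < fragthresh * mean length."""
    lengths = [sum(ch not in GAP_CHARS for ch in s) for s in seqs]
    mean_len = sum(lengths) / len(lengths)
    return [length < fragthresh * mean_len for length in lengths]


def consensus_columns(seqs, weights=None, symfrac=0.5, fragthresh=0.5):
    """Return one boolean per alignment column: True if consensus."""
    nseq, alen = len(seqs), len(seqs[0])
    weights = weights or [1.0] * nseq
    is_frag = find_fragments(seqs, fragthresh)

    # For fragments, only columns between the first and last residue count;
    # terminal gaps outside that span are ignored rather than treated as
    # deletions. Full-length sequences count in every column.
    spans = []
    for s, frag in zip(seqs, is_frag):
        if frag:
            pos = [i for i, ch in enumerate(s) if ch not in GAP_CHARS]
            spans.append((pos[0], pos[-1]) if pos else (1, 0))  # (1, 0): empty
        else:
            spans.append((0, alen - 1))

    result = []
    for col in range(alen):
        residue_w = total_w = 0.0
        for s, w, (first, last) in zip(seqs, weights, spans):
            if not first <= col <= last:
                continue  # ignored terminal gap of a fragment
            total_w += w
            if s[col] not in GAP_CHARS:
                residue_w += w
        # Assumption: a column with nothing counted is not consensus.
        result.append(total_w > 0 and residue_w / total_w >= symfrac)
    return result
```

       With the toy alignment ACDEFGHIKL, ACDE-GHIKL, ---EFG----, the third
       sequence is flagged as a fragment, so its terminal gaps do not lower
       the residue fraction of the flanking columns; only the column with
       the internal gap in the second sequence falls below 1.0.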

OPTIONS CONTROLLING RELATIVE WEIGHTS

       HMMER uses an ad hoc sequence weighting algorithm to downweight closely
       related sequences and upweight distantly related ones. This has the
       effect of making models less biased by uneven phylogenetic
       representation. For example, two identical sequences would typically
       each receive half the weight that one sequence would. These options
       control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al., J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci. 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid) and within each cluster of c sequences, each sequence
              gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering
              when using --wblosum. Invalid with any other weighting scheme.
              Default is 0.62.
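
       The --wblosum rule (single-linkage clusters, weight 1/c within a
       cluster of c sequences) can be sketched as below. This is not HMMER's
       code. The definition of pairwise identity (identical aligned residues
       divided by the shorter sequence's residue count) and the use of >=
       at the threshold are assumptions made for the example.

```python
GAP_CHARS = set("-.")  # assumption: both characters denote gaps


def pairwise_identity(a, b):
    """Identical aligned residue pairs / residue count of shorter sequence."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x not in GAP_CHARS)
    len_a = sum(ch not in GAP_CHARS for ch in a)
    len_b = sum(ch not in GAP_CHARS for ch in b)
    shorter = min(len_a, len_b)
    return matches / shorter if shorter else 0.0


def blosum_weights(seqs, wid=0.62):
    """Single-linkage cluster at identity >= wid; weight 1/c per cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single linkage: one sufficiently identical pair joins two clusters.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if pairwise_identity(seqs[i], seqs[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(seqs))]
    return [1.0 / roots.count(r) for r in roots]
```

       For example, two sequences that are 90% identical and one unrelated
       sequence yield weights 0.5, 0.5, and 1.0 at the default threshold.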

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a
       total effective sequence number, eff_nseq. This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that. The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.

       --eent Adjust effective sequence number to achieve a specific relative
              entropy per position (see --ere). This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid). This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to
              <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.
              Requires --eent. Default depends on the sequence alphabet; for
              protein sequences, it is 0.59 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire model
              alignment, over its whole length. This has the effect of making
              short models have higher relative entropy per position than
              --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by
              single-linkage clustering with the --eclust option. The default
              is 0.62.
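
       The normalization step described at the start of this section, in
       which relative weights are rescaled to sum to eff_nseq, amounts to
       the following. The function name is hypothetical, and how eff_nseq
       itself is chosen under --eent is not shown.

```python
def scale_to_eff_nseq(rel_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq.

    The ratios between sequences are preserved; only the total changes.
    """
    total = sum(rel_weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    return [w * eff_nseq / total for w in rel_weights]
```

       With relative weights 0.5, 0.5, 1.0, an eff_nseq of 3 (as with
       --enone on three sequences) gives 0.75, 0.75, 1.5, while an eff_nseq
       of 1.0 (as with --eset 1.0) gives 0.25, 0.25, 0.5.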

OPTIONS CONTROLLING E-VALUE CALIBRATION

       The location parameters for the expected score distributions for MSV
       filter scores, Viterbi filter scores, and Forward scores require three
       short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter mu for Viterbi filter E-values. Default is
              200.

       --EfL <n>
              Sets the sequence length in simulation that estimates the
              location parameter tau for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates the
              location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.
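
       To give a feel for what estimating a location parameter from N
       simulated scores involves, here is a sketch under assumptions that
       this page does not state: scores are taken to follow a Gumbel
       distribution with a known slope lambda, and mu is fitted by maximum
       likelihood. The scores are drawn directly from a Gumbel distribution
       as a stand-in for scoring random sequences against a model. All
       names are hypothetical.

```python
import math
import random


def fit_gumbel_mu(scores, lam):
    """ML estimate of the Gumbel location mu when the slope lam is known.

    mu = -(1/lam) * log( (1/N) * sum_i exp(-lam * x_i) ),
    evaluated relative to min(scores) for numerical stability.
    """
    n = len(scores)
    m = min(scores)
    s = sum(math.exp(-lam * (x - m)) for x in scores)
    return m - math.log(s / n) / lam


def simulate_gumbel_scores(mu, lam, n, seed=42):
    """Stand-in for a simulation: draw n Gumbel(mu, lam) scores."""
    rng = random.Random(seed)
    # Inverse-CDF sampling: x = mu - log(-log(U)) / lam, with U uniform.
    return [mu - math.log(-math.log(rng.random())) / lam for _ in range(n)]


# Assumed values for illustration: lam = log(2), true mu = -8.0, N = 200.
scores = simulate_gumbel_scores(mu=-8.0, lam=math.log(2), n=200)
mu_hat = fit_gumbel_mu(scores, math.log(2))
```

       Under these assumptions, increasing the number of simulated sequences
       tightens the estimate of mu at the cost of a longer calibration.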

OTHER OPTIONS

       --mpi  Run as a parallel MPI program. Each alignment is assigned to an
              MPI worker node for construction. (Therefore, the maximum
              parallelization cannot exceed the number of alignments in the
              input msafile.) This is useful when building large profile
              libraries. This option is only available if optional MPI
              capability was enabled at compile-time.

       --informat <s>
              Declare that the input msafile is in format <s>. Currently the
              accepted multiple alignment sequence file formats only include
              Stockholm and SELEX. Default is to autodetect the format of the
              file.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If
              <n> is nonzero, any stochastic simulations will be reproducible;
              the same command will give the same results. If <n> is 0, the
              random number generator is seeded arbitrarily, and stochastic
              simulations will vary from run to run of the same command. The
              default seed is 42.

       --laplace
              Experimental only: use a Laplace +1 prior in place of the
              default mixture Dirichlet prior.

       --stall
              For debugging MPI parallelization: arrest program execution
              immediately after start, and wait for a debugger to attach to
              the running process and release the arrest.
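
       The --seed convention in this section (nonzero is reproducible, 0 is
       arbitrary, default 42) can be mirrored in a pipeline script as
       follows. This sketch uses Python's own generator, not HMMER's, and
       the function name is hypothetical.

```python
import random


def make_rng(seed=42):
    """Return a generator following the --seed convention described above."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    if seed == 0:
        # Arbitrary seeding: results vary from run to run.
        return random.Random()
    # Nonzero seed: the same seed reproduces the same stream.
    return random.Random(seed)
```

       Two generators created with the same nonzero seed produce identical
       streams, which is what makes repeated runs comparable.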

COPYRIGHT

       @HMMER_COPYRIGHT@
       @HMMER_LICENSE@

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your HMMER source distribution, or see the HMMER
       web page (@HMMER_URL@).

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER @HMMER_VERSION@            @HMMER_DATE@                      hmmbuild(1)