hmmbuild(1) HMMER Manual hmmbuild(1)

hmmbuild - construct profile HMM(s) from multiple sequence alignment(s)

hmmbuild [options] hmmfile msafile

Build a profile HMM for each multiple sequence alignment in msafile,
and save it to a new file hmmfile.

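The synopsis above can be assembled programmatically when hmmbuild is driven from a pipeline. The following is a minimal sketch, not part of HMMER: the helper name hmmbuild_argv and the file names are hypothetical, and only options documented in this page (-n, -o, --amino, --dna, --rna) are used. It builds the argument list but does not run the program.

```python
from typing import List, Optional


def hmmbuild_argv(hmmfile: str,
                  msafile: str,
                  name: Optional[str] = None,
                  summary_file: Optional[str] = None,
                  alphabet: Optional[str] = None) -> List[str]:
    """Assemble 'hmmbuild [options] hmmfile msafile' as an argv list.

    Hypothetical helper for illustration; it only arranges arguments
    in the order given by the synopsis.
    """
    argv = ["hmmbuild"]
    if name is not None:
        argv += ["-n", name]            # name the new profile
    if summary_file is not None:
        argv += ["-o", summary_file]    # send summary output to a file
    if alphabet is not None:
        if alphabet not in ("amino", "dna", "rna"):
            raise ValueError("alphabet must be 'amino', 'dna', or 'rna'")
        argv.append("--" + alphabet)    # skip alphabet autodetection
    argv += [hmmfile, msafile]          # positional arguments come last
    return argv


if __name__ == "__main__":
    # File names below are placeholders, not files shipped with HMMER.
    print(" ".join(hmmbuild_argv("example.hmm", "example.sto",
                                 name="example", alphabet="amino")))
```

The resulting list can be passed to subprocess.run() as is, which avoids shell quoting problems with unusual file names.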
-h Help; print a brief reminder of command line usage and all
available options.

-n <s> Name the new profile <s>. The default is to use the name of the
alignment (if one is present in the msafile) or, failing that,
the name of the hmmfile. If msafile contains more than one
alignment, -n cannot be used, and every alignment must have a name
annotated in the msafile (as in Stockholm #=GF ID annotation).

-o <f> Direct the summary output to file <f>, rather than to stdout.

-O <f> After each model is constructed, resave annotated, possibly
modified source alignments to a file <f> in Stockholm format. The
alignments are annotated with a reference annotation line
indicating which columns were assigned as consensus, and sequences
are annotated with what relative sequence weights were assigned.
Some residues of the alignment may have been shifted to
accommodate restrictions of the Plan7 profile architecture, which
disallows transitions between insert and delete states.

The alphabet type (amino, DNA, or RNA) is autodetected by default, by
looking at the composition of the msafile. Autodetection is normally
quite reliable, but occasionally alphabet type may be ambiguous and
autodetection can fail (for instance, on tiny toy alignments of just a
few residues). To avoid this, or to increase robustness in automated
analysis pipelines, you may specify the alphabet type of msafile with
these options.

--amino
Specify that all sequences in msafile are proteins.

--dna Specify that all sequences in msafile are DNAs.

--rna Specify that all sequences in msafile are RNAs.

These options control how consensus columns are defined in an
alignment.

--fast Define consensus columns as those that have a fraction >=
symfrac of residues as opposed to gaps. (See below for the
--symfrac option.) This is the default.

--hand Define consensus columns in the next profile using the reference
annotation of the multiple alignment. This allows you to define any
consensus columns you like.

--symfrac <x>
Define the residue fraction threshold necessary to define a
consensus column when using the default --fast construction option.
The default for --symfrac is 0.5. The symbol fraction in each
column is calculated after taking relative sequence weighting
into account, and ignoring gap characters corresponding to ends
of sequence fragments (as opposed to internal
insertions/deletions). Setting this to 0.0 means that every alignment
column will be assigned as consensus, which may be useful in some
cases. Setting it to 1.0 means that only columns that have no
gap characters at all will be assigned as consensus.

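The --fast rule described above can be sketched as follows. This is an illustration, not HMMER's implementation: the function name consensus_columns and the gap character set are assumptions, and the special handling of fragment ends is left out (all sequences are treated as full-length).

```python
from typing import List, Sequence

GAP_CHARS = frozenset("-.")  # assumed gap symbols for this sketch


def consensus_columns(alignment: Sequence[str],
                      weights: Sequence[float],
                      symfrac: float = 0.5) -> List[bool]:
    """Mark a column as consensus when its weighted residue fraction
    is >= symfrac. Fragment ends are NOT treated specially here."""
    ncols = len(alignment[0])
    total_weight = sum(weights)
    marks = []
    for col in range(ncols):
        # Sum the weights of sequences that have a residue in this column.
        residue_weight = sum(w for seq, w in zip(alignment, weights)
                             if seq[col] not in GAP_CHARS)
        marks.append(total_weight > 0 and
                     residue_weight / total_weight >= symfrac)
    return marks


if __name__ == "__main__":
    toy = ["ACD-", "AC--", "A-DE"]   # toy alignment, three sequences
    uniform = [1.0, 1.0, 1.0]
    print(consensus_columns(toy, uniform))        # default symfrac 0.5
    print(consensus_columns(toy, uniform, 1.0))   # only gap-free columns
    print(consensus_columns(toy, uniform, 0.0))   # every column
```

With uniform weights, the last column of the toy alignment has a residue fraction of 1/3 and is not consensus at the default threshold; at 1.0 only the first, gap-free column qualifies.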
--fragthresh <x>
We only want to count terminal gaps as deletions if the aligned
sequence is known to be full-length, not if it is a fragment
(for instance, because only part of it was sequenced). HMMER
uses a simple rule to infer fragments: if the sequence length L
is less than a fraction <x> times the mean sequence length of
all the sequences in the alignment, then the sequence is handled
as a fragment. The default is 0.5.

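The fragment rule above amounts to a single comparison per sequence. The sketch below assumes that "sequence length" means the number of residues (non-gap characters) in each aligned sequence; the function name flag_fragments is hypothetical.

```python
from typing import List, Sequence

GAP_CHARS = frozenset("-.")  # assumed gap symbols for this sketch


def flag_fragments(alignment: Sequence[str],
                   fragthresh: float = 0.5) -> List[bool]:
    """Return True for each sequence whose residue count L satisfies
    L < fragthresh * (mean residue count over the alignment)."""
    lengths = [sum(1 for ch in seq if ch not in GAP_CHARS)
               for seq in alignment]
    mean_length = sum(lengths) / len(lengths)
    return [length < fragthresh * mean_length for length in lengths]


if __name__ == "__main__":
    toy = ["ACDEFGHIKL",
           "ACDEFGHIKL",
           "---EF-----"]   # 2 residues against a mean of about 7.3
    print(flag_fragments(toy))
```

In the toy alignment the third sequence has 2 residues, below 0.5 times the mean of 22/3, so it would be handled as a fragment and its terminal gaps would not count as deletions.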
HMMER uses an ad hoc sequence weighting algorithm to downweight closely
related sequences and upweight distantly related ones. This has the
effect of making models less biased by uneven phylogenetic
representation. For example, two identical sequences would typically each
receive half the weight that one sequence would. These options control
which algorithm gets used.

--wpb Use the Henikoff position-based sequence weighting scheme
[Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
the default.

--wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
[Gerstein et al., J. Mol. Biol. 235:1067, 1994].

--wblosum
Use the same clustering scheme that was used to weight data in
calculating BLOSUM substitution matrices [Henikoff and Henikoff,
Proc. Natl. Acad. Sci. 89:10915, 1992]. Sequences are
single-linkage clustered at an identity threshold (default 0.62; see
--wid) and within each cluster of c sequences, each sequence
gets relative weight 1/c.

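The --wblosum scheme described above can be sketched with a small union-find clustering. This is an illustration under stated assumptions, not HMMER's code: the function names are hypothetical, and pairwise identity is assumed to be the fraction of identical residues over columns where both sequences have a residue, which may differ from the definition HMMER uses.

```python
from typing import List, Sequence

GAP_CHARS = frozenset("-.")  # assumed gap symbols for this sketch


def pairwise_identity(a: str, b: str) -> float:
    """Assumed definition: identical residues / columns where both
    sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(alignment: Sequence[str], wid: float = 0.62) -> List[float]:
    """Single-linkage cluster at identity >= wid; every member of a
    cluster of c sequences gets relative weight 1/c."""
    n = len(alignment)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Single linkage: one pair above threshold is enough to merge clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    return [1.0 / roots.count(r) for r in roots]


if __name__ == "__main__":
    print(blosum_weights(["ACDEFG", "ACDEFG", "WYWYWY"]))
```

For the toy input, the two identical sequences fall into one cluster and each receives weight 0.5, while the unrelated third sequence keeps weight 1.0. This matches the general statement that two identical sequences each receive half the weight that one sequence would.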
--wnone
No relative weights. All sequences are assigned uniform weight.

--wid <x>
Sets the identity threshold used by single-linkage clustering
when using --wblosum. Invalid with any other weighting scheme.
Default is 0.62.

After relative weights are determined, they are normalized to sum to a
total effective sequence number, eff_nseq. This number may be the
actual number of sequences in the alignment, but it is almost always
smaller than that. The default entropy weighting method (--eent)
reduces the effective sequence number to reduce the information content
(relative entropy, or average expected score on true homologs) per
consensus position. The target relative entropy is controlled by a
two-parameter function, where the two parameters are settable with --ere
and --esigma.

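The normalization step above is a simple rescaling. The sketch below shows only that step, with a hypothetical function name and an eff_nseq value chosen for illustration; how eff_nseq itself is determined depends on the options that follow.

```python
from typing import List, Sequence


def scale_to_eff_nseq(rel_weights: Sequence[float],
                      eff_nseq: float) -> List[float]:
    """Rescale relative weights so that they sum to eff_nseq while
    keeping their ratios unchanged."""
    total = sum(rel_weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    return [w * eff_nseq / total for w in rel_weights]


if __name__ == "__main__":
    rel = [0.5, 0.5, 1.0]   # relative weights for three sequences
    scaled = scale_to_eff_nseq(rel, eff_nseq=1.2)   # illustrative value
    print(scaled, sum(scaled))
```

The ratios between sequences are preserved; only the total amount of evidence the alignment is allowed to contribute changes.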
--eent Adjust effective sequence number to achieve a specific relative
entropy per position (see --ere). This is the default.

--eclust
Set effective sequence number to the number of single-linkage
clusters at a specific identity threshold (see --eid). This
option is not recommended; it's for experiments evaluating how
much better --eent is.

--enone
Turn off effective sequence number determination and just use
the actual number of sequences. One reason you might want to do
this is to try to maximize the relative entropy/position of your
model, which may be useful for short models.

--eset <x>
Explicitly set the effective sequence number for all models to
<x>.

--ere <x>
Set the minimum relative entropy/position target to <x>.
Requires --eent. Default depends on the sequence alphabet; for
protein sequences, it is 0.59 bits/position.

--esigma <x>
Sets the minimum relative entropy contributed by an entire model
alignment, over its whole length. This has the effect of making
short models have higher relative entropy per position than
--ere alone would give. The default is 45.0 bits.

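The interplay of --ere and --esigma can be illustrated with a deliberately simplified model. The sketch below assumes that the whole-model minimum is spread evenly over the model length, so the per-position target is the larger of <ere> and <esigma> divided by the number of consensus positions. This is an assumption made for illustration only; it is not the exact two-parameter function HMMER uses, and the function name is hypothetical.

```python
def target_relent_per_position(model_length: int,
                               ere: float = 0.59,
                               esigma: float = 45.0) -> float:
    """Simplified per-position relative entropy target, in bits.

    ASSUMPTION: the --esigma total is divided evenly over model_length
    positions. HMMER's actual function differs in detail.
    """
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    return max(ere, esigma / model_length)


if __name__ == "__main__":
    for length in (30, 50, 100, 200):
        print(length, target_relent_per_position(length))
```

Under this simplification, a 50-position protein model gets a target of 0.9 bits/position because 45.0/50 exceeds 0.59, while a 200-position model stays at the --ere value of 0.59. The general effect matches the description above: short models are held to a higher relative entropy per position.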
--eid <x>
Sets the fractional pairwise identity cutoff used by
single-linkage clustering with the --eclust option. The default is
0.62.

The location parameters for the expected score distributions for MSV
filter scores, Viterbi filter scores, and Forward scores require three
short random sequence simulations.

--EmL <n>
Sets the sequence length in simulation that estimates the
location parameter mu for MSV filter E-values. Default is 200.

--EmN <n>
Sets the number of sequences in simulation that estimates the
location parameter mu for MSV filter E-values. Default is 200.

--EvL <n>
Sets the sequence length in simulation that estimates the
location parameter mu for Viterbi filter E-values. Default is 200.

--EvN <n>
Sets the number of sequences in simulation that estimates the
location parameter mu for Viterbi filter E-values. Default is
200.

231
232 --EfL <n>
233 Sets the sequence length in simulation that estimates the loca‐
234 tion parameter tau for Forward E-values. Default is 100.
235
236
237 --EfN <n>
238 Sets the number of sequences in simulation that estimates the
239 location parameter tau for Forward E-values. Default is 200.
240
241
242 --Eft <x>
243 Sets the tail mass fraction to fit in the simulation that esti‐
244 mates the location parameter tau for Forward evalues. Default is
245 0.04.
246
247
248
--mpi Run as a parallel MPI program. Each alignment is assigned to an
MPI worker node for construction. (Therefore, the maximum
parallelization cannot exceed the number of alignments in the input
msafile.) This is useful when building large profile libraries.
This option is only available if optional MPI capability was
enabled at compile-time.

--informat <s>
Declare that the input msafile is in format <s>. Currently, the
only accepted multiple sequence alignment file formats are
Stockholm and SELEX. The default is to autodetect the format of the
file.

--seed <n>
Seed the random number generator with <n>, an integer >= 0. If
<n> is nonzero, any stochastic simulations will be reproducible;
the same command will give the same results. If <n> is 0, the
random number generator is seeded arbitrarily, and stochastic
simulations will vary from run to run of the same command. The
default seed is 42.

--laplace
Experimental only: use a Laplace +1 prior in place of
the default mixture Dirichlet prior.

--stall
For debugging MPI parallelization: arrest program execution
immediately after start, and wait for a debugger to attach to
the running process and release the arrest.

See hmmer(1) for a master man page with a list of all the individual
man pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf); or see the HMMER web page
(@HMMER_URL@).

@HMMER_COPYRIGHT@
@HMMER_LICENSE@

For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER
web page (@HMMER_URL@).

Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org

HMMER @HMMER_VERSION@ @HMMER_DATE@ hmmbuild(1)