hmmbuild(1)                      HMMER Manual                      hmmbuild(1)

hmmbuild - construct profile HMM(s) from multiple sequence alignment(s)

hmmbuild [options] <hmmfile_out> <msafile>
11
12
13
For each multiple sequence alignment in <msafile>, build a profile HMM
and save it to a new file <hmmfile_out>.

<msafile> may be '-' (dash), which means reading this input from stdin
rather than a file. To use '-', you must also specify the alignment
file format with --informat <s>, as in --informat stockholm. (Because
of a current limitation in our implementation, MSA file formats cannot
be autodetected in a nonrewindable input stream.)

<hmmfile_out> may not be '-' (stdout), because sending the HMM file to
stdout would conflict with the other text output of the program.
29
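The two rules about '-' above can be checked before hmmbuild is ever launched. The sketch below is illustrative only: the helper name hmmbuild_argv and the file name globins.hmm are hypothetical, and the function merely assembles an argument list from the flags documented on this page.

```python
def hmmbuild_argv(hmmfile_out, msafile, informat=None):
    """Assemble an hmmbuild argument list (hypothetical helper, not part of HMMER)."""
    # <hmmfile_out> may not be '-': the HMM would collide with the text output.
    if hmmfile_out == "-":
        raise ValueError("<hmmfile_out> may not be '-'")
    # Reading the alignment from stdin requires an explicit format.
    if msafile == "-" and informat is None:
        raise ValueError("reading <msafile> from stdin requires --informat <s>")
    argv = ["hmmbuild"]
    if informat is not None:
        argv += ["--informat", informat]
    argv += [hmmfile_out, msafile]
    return argv


# 'globins.hmm' is a made-up output name used only for illustration.
print(hmmbuild_argv("globins.hmm", "-", informat="stockholm"))
# → ['hmmbuild', '--informat', 'stockholm', 'globins.hmm', '-']
```

The resulting list could be passed to a process launcher with the alignment piped to stdin; that step is left out here because it depends on the local installation.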
30
31
32
33
34
35
-h Help; print a brief reminder of command line usage and all
available options.

-n <s> Name the new profile <s>. The default is to use the name of the
alignment (if one is present in the msafile) or, failing that, the
name of the hmmfile. If msafile contains more than one alignment, -n
cannot be used, and every alignment must have a name annotated in the
msafile (as in Stockholm #=GF ID annotation).

-o <f> Direct the summary output to file <f>, rather than to stdout.

-O <f> After each model is constructed, resave annotated, possibly
modified source alignments to a file <f> in Stockholm format. The
alignments are annotated with a reference annotation line indicating
which columns were assigned as consensus, and sequences are annotated
with the relative sequence weights they were assigned. Some residues
of the alignment may have been shifted to accommodate restrictions of
the Plan7 profile architecture, which disallows transitions between
insert and delete states.
60
61
62
The alphabet type (amino, DNA, or RNA) is autodetected by default, by
looking at the composition of the msafile. Autodetection is normally
quite reliable, but occasionally the alphabet type may be ambiguous and
autodetection can fail (for instance, on tiny toy alignments of just a
few residues). To avoid this, or to increase robustness in automated
analysis pipelines, you may specify the alphabet type of msafile with
these options.

--amino
Specify that all sequences in msafile are proteins.

--dna Specify that all sequences in msafile are DNAs.

--rna Specify that all sequences in msafile are RNAs.
81
82
83
84
These options control how consensus columns are defined in an
alignment.

--fast Define consensus columns as those that have a fraction >=
symfrac of residues as opposed to gaps. (See below for the --symfrac
option.) This is the default.

--hand Define consensus columns in the next profile using the reference
annotation of the multiple alignment. This allows you to define any
consensus columns you like.

--symfrac <x>
Define the residue fraction threshold necessary to define a consensus
column when using the --fast option. The default is 0.5. The symbol
fraction in each column is calculated after taking relative sequence
weighting into account, and ignoring gap characters corresponding to
ends of sequence fragments (as opposed to internal
insertions/deletions). Setting this to 0.0 means that every alignment
column will be assigned as consensus, which may be useful in some
cases. Setting it to 1.0 means that only columns that include 0 gaps
(internal insertions/deletions) will be assigned as consensus.

--fragthresh <x>
We only want to count terminal gaps as deletions if the aligned
sequence is known to be full-length, not if it is a fragment (for
instance, because only part of it was sequenced). HMMER uses a simple
rule to infer fragments: if the range of a sequence in the alignment
(the number of alignment columns between the first and last positions
of the sequence) is less than or equal to a fraction <x> times the
alignment length in columns, then the sequence is handled as a
fragment. The default is 0.5. Setting --fragthresh 0 will define no
(nonempty) sequence as a fragment; you might want to do this if you
know you've got a carefully curated alignment of full-length
sequences. Setting --fragthresh 1 will define all sequences as
fragments; you might want to do this if you know your alignment is
entirely composed of fragments, such as translated short reads in
metagenomic shotgun data.
129
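To make the --fast / --symfrac rule concrete, here is a minimal sketch of the column test. It is a deliberate simplification and not HMMER's code: every sequence is given weight 1 (no relative weighting), fragment end gaps are not treated specially, and the function name and toy alignment are hypothetical.

```python
def consensus_columns(alignment, symfrac=0.5, gap_chars="-."):
    """Return 0-based indices of columns whose residue fraction is >= symfrac.

    Simplified: unweighted sequences, no special handling of fragment ends.
    """
    nseq = len(alignment)
    consensus = []
    for col in range(len(alignment[0])):
        residues = sum(1 for seq in alignment if seq[col] not in gap_chars)
        if residues / nseq >= symfrac:
            consensus.append(col)
    return consensus


# Toy alignment, made up for illustration.
msa = ["ACD-F",
       "AC--F",
       "A-D-F",
       "ACDEF"]
print(consensus_columns(msa))               # → [0, 1, 2, 4]
print(consensus_columns(msa, symfrac=1.0))  # → [0, 4]
print(consensus_columns(msa, symfrac=0.0))  # → [0, 1, 2, 3, 4]
```

The last two calls mirror the limiting cases described for --symfrac: 1.0 keeps only gap-free columns, and 0.0 keeps every column.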
130
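The fragment rule described under --fragthresh can be written in a few lines. This is a sketch under stated assumptions: the span is counted inclusively from the first to the last residue column, '-' and '.' are taken as gap characters, and the function name is hypothetical.

```python
def is_fragment(aligned_seq, fragthresh=0.5, gap_chars="-."):
    """Apply the rule: span <= fragthresh * alignment length means fragment."""
    cols = [i for i, ch in enumerate(aligned_seq) if ch not in gap_chars]
    # Assumption: the span counts columns from first to last residue, inclusive.
    span = cols[-1] - cols[0] + 1 if cols else 0
    return span <= fragthresh * len(aligned_seq)


print(is_fragment("--ACDE----"))                # → True  (span 4 of 10 columns)
print(is_fragment("ACDEFGHI--"))                # → False (span 8 of 10 columns)
print(is_fragment("--ACDE----", fragthresh=0))  # → False (no nonempty fragments)
print(is_fragment("ACDEFGHI--", fragthresh=1))  # → True  (everything is a fragment)
```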
131
HMMER uses an ad hoc sequence weighting algorithm to downweight closely
related sequences and upweight distantly related ones. This has the
effect of making models less biased by uneven phylogenetic
representation. For example, two identical sequences would typically
each receive half the weight that one sequence would. These options
control which algorithm gets used.

--wpb Use the Henikoff position-based sequence weighting scheme
[Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is the
default.

--wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
[Gerstein et al., J. Mol. Biol. 235:1067, 1994].

--wblosum
Use the same clustering scheme that was used to weight data in
calculating BLOSUM substitution matrices [Henikoff and Henikoff, Proc.
Natl. Acad. Sci. 89:10915, 1992]. Sequences are single-linkage
clustered at an identity threshold (default 0.62; see --wid) and,
within each cluster of c sequences, each sequence gets relative weight
1/c.

--wnone
No relative weights. All sequences are assigned uniform weight.

--wid <x>
Sets the identity threshold used by single-linkage clustering when
using --wblosum. Invalid with any other weighting scheme. Default is
0.62.
167
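As an illustration of the idea behind --wpb, the following sketch implements the textbook form of Henikoff position-based weights: in each column, a sequence receives 1/(k*n), where k is the number of distinct residues in the column and n is how many sequences share its residue. Gap handling and the normalization to sum to 1 are assumptions made for this example; HMMER's own implementation may differ in such details.

```python
from collections import Counter


def position_based_weights(alignment, gap_chars="-."):
    """Textbook Henikoff position-based weights, normalized to sum to 1."""
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(ch for ch in column if ch not in gap_chars)
        if not counts:
            continue  # all-gap column contributes nothing (assumption)
        ntypes = len(counts)
        for i, ch in enumerate(column):
            if ch not in gap_chars:
                weights[i] += 1.0 / (ntypes * counts[ch])
    total = sum(weights)
    if total == 0.0:
        raise ValueError("alignment contains no residues")
    return [w / total for w in weights]


# Two identical sequences and one different one (toy data).
print(position_based_weights(["AA", "AA", "CC"]))  # → [0.25, 0.25, 0.5]
```

The toy result matches the statement above: the two identical sequences each receive half the weight of the single distinct sequence.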
168
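The --wblosum scheme (single-linkage clustering, then weight 1/c inside a cluster of c sequences) can be sketched as follows. The pairwise identity definition used here (matches over columns where both sequences have a residue) and the use of >= against the threshold are assumptions for illustration; the function name is hypothetical.

```python
def blosum_weights(alignment, wid=0.62, gap_chars="-."):
    """Single-linkage cluster at identity >= wid; each member gets 1/cluster size."""
    n = len(alignment)

    def identity(a, b):
        # Assumed definition: identical pairs / columns where both have residues.
        pairs = [(x, y) for x, y in zip(a, b)
                 if x not in gap_chars and y not in gap_chars]
        if not pairs:
            return 0.0
        return sum(1 for x, y in pairs if x == y) / len(pairs)

    parent = list(range(n))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single linkage: any pair above the threshold joins two clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    return [1.0 / roots.count(r) for r in roots]


weights = blosum_weights(["ACDEF", "ACDEF", "ACDEY", "WWWWW"])
print(weights)  # first three share one cluster (1/3 each); the last gets 1.0
```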
169
170
171
After relative weights are determined, they are normalized to sum to a
total effective sequence number, eff_nseq. This number may be the
actual number of sequences in the alignment, but it is almost always
smaller than that. The default entropy weighting method (--eent)
reduces the effective sequence number to reduce the information content
(relative entropy, or average expected score on true homologs) per
consensus position. The target relative entropy is controlled by a
two-parameter function, where the two parameters are settable with
--ere and --esigma.

--eent Adjust effective sequence number to achieve a specific relative
entropy per position (see --ere). This is the default.

--eclust
Set effective sequence number to the number of single-linkage clusters
at a specific identity threshold (see --eid). This option is not
recommended; it's for experiments evaluating how much better --eent
is.

--enone
Turn off effective sequence number determination and just use the
actual number of sequences. One reason you might want to do this is to
try to maximize the relative entropy/position of your model, which may
be useful for short models.

--eset <x>
Explicitly set the effective sequence number for all models to <x>.

--ere <x>
Set the minimum relative entropy/position target to <x>. Requires
--eent. Default depends on the sequence alphabet. For protein
sequences, it is 0.59 bits/position; for nucleotide sequences, it is
0.45 bits/position.

--esigma <x>
Sets the minimum relative entropy contributed by an entire model
alignment, over its whole length. This has the effect of making short
models have higher relative entropy per position than --ere alone
would give. The default is 45.0 bits.

--eid <x>
Sets the fractional pairwise identity cutoff used by single-linkage
clustering with the --eclust option. The default is 0.62.
225
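This page does not spell out the two-parameter function behind --ere and --esigma. The sketch below assumes one plausible form, in which the per-position target for a model of length M is the larger of ere and (esigma + log2(M(M+1)/2)) / M. That formula is an assumption made for illustration, not something stated here, and the function name is hypothetical; the defaults shown are the protein values quoted above.

```python
import math


def target_relative_entropy(model_length, ere=0.59, esigma=45.0):
    """Per-position relative entropy target in bits (ASSUMED functional form)."""
    m = model_length
    # Assumed: spread a whole-model minimum (esigma, plus a length term)
    # over the M consensus positions.
    from_sigma = (esigma + math.log2(m * (m + 1) / 2.0)) / m
    # The per-position target never drops below --ere.
    return max(ere, from_sigma)


print(target_relative_entropy(200))  # long model: the --ere floor applies
print(target_relative_entropy(10))   # short model: well above the --ere floor
```

Whatever the exact form, the behavior to look for is the one described under --esigma: short models end up with a higher per-position target than --ere alone would give.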
226
227
By default, weighted counts are converted to mean posterior probability
parameter estimates using mixture Dirichlet priors. Default mixture
Dirichlet prior parameters for protein models and for nucleic acid (RNA
and DNA) models are built in. The following options allow you to
override the default priors.

--pnone
Don't use any priors. Probability parameters will simply be the
observed frequencies, after relative sequence weighting.

--plaplace
Use a Laplace +1 prior in place of the default mixture Dirichlet
prior.
244
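The difference between --pnone and --plaplace can be seen on a single distribution of weighted counts. The sketch below is a generic illustration of observed frequencies versus add-one (Laplace) estimates; the function names and the four-letter toy counts are hypothetical, and nothing here reproduces the default mixture Dirichlet prior.

```python
def estimate_pnone(counts):
    """Observed frequencies: count / total."""
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}


def estimate_plaplace(counts):
    """Laplace +1 estimate: (count + 1) / (total + alphabet size)."""
    total = sum(counts.values()) + len(counts)
    return {sym: (c + 1.0) / total for sym, c in counts.items()}


# Toy weighted counts for one DNA column (made up for illustration).
column_counts = {"A": 3.0, "C": 1.0, "G": 0.0, "T": 0.0}
print(estimate_pnone(column_counts))
# → {'A': 0.75, 'C': 0.25, 'G': 0.0, 'T': 0.0}
print(estimate_plaplace(column_counts))
# → {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}
```

Note how unobserved symbols get probability zero without a prior, but a small nonzero probability with the +1 prior.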
245
246
247
The location parameters for the expected score distributions for MSV
filter scores, Viterbi filter scores, and Forward scores require three
short random sequence simulations.

--EmL <n>
Sets the sequence length in simulation that estimates the location
parameter mu for MSV filter E-values. Default is 200.

--EmN <n>
Sets the number of sequences in simulation that estimates the location
parameter mu for MSV filter E-values. Default is 200.

--EvL <n>
Sets the sequence length in simulation that estimates the location
parameter mu for Viterbi filter E-values. Default is 200.

--EvN <n>
Sets the number of sequences in simulation that estimates the location
parameter mu for Viterbi filter E-values. Default is 200.

--EfL <n>
Sets the sequence length in simulation that estimates the location
parameter tau for Forward E-values. Default is 100.

--EfN <n>
Sets the number of sequences in simulation that estimates the location
parameter tau for Forward E-values. Default is 200.

--Eft <x>
Sets the tail mass fraction to fit in the simulation that estimates the
location parameter tau for Forward E-values. Default is 0.04.
289
290
291
--cpu <n>
Set the number of parallel worker threads to <n>. By default, HMMER
sets this to the number of CPU cores it detects in your machine; that
is, it tries to maximize the use of your available processor cores.
Setting <n> higher than the number of available cores is of little if
any value, but you may want to set it to something less. You can also
control this number by setting an environment variable, HMMER_NCPU.

This option is only available if HMMER was compiled with POSIX threads
support. This is the default, but it may have been turned off for your
site or machine for some reason.

--informat <s>
Declare that the input msafile is in format <s>. Currently the accepted
multiple alignment sequence file formats include Stockholm, Aligned
FASTA, Clustal, NCBI PSI-BLAST, PHYLIP, Selex, and UCSC SAM A2M.
Default is to autodetect the format of the file.

--seed <n>
Seed the random number generator with <n>, an integer >= 0. If <n> is
nonzero, any stochastic simulations will be reproducible; the same
command will give the same results. If <n> is 0, the random number
generator is seeded arbitrarily, and stochastic simulations will vary
from run to run of the same command. The default seed is 42.

--w_beta <x>
Window length tail mass. The upper bound, W, on the length at which
nhmmer expects to find an instance of the model is set such that the
fraction of all sequences generated by the model with length >= W is
less than <x>. The default is 1e-7.

--w_length <n>
Override the model instance length upper bound, W, which is otherwise
controlled by --w_beta. It should be larger than the model length. The
value of W is used deep in the acceleration pipeline, and modest
changes are not expected to impact results (though larger values of W
do lead to longer run time).

--mpi Run as a parallel MPI program. Each alignment is assigned to an
MPI worker node for construction. (Therefore, the maximum
parallelization cannot exceed the number of alignments in the input
msafile.) This is useful when building large profile libraries. This
option is only available if optional MPI capability was enabled at
compile-time.

--stall
For debugging MPI parallelization: arrest program execution
immediately after start, and wait for a debugger to attach to the
running process and release the arrest.

--maxinsertlen <n>
Restrict insert length parameterization such that the expected insert
length at each position of the model is no more than <n>.
363
364
365
366
367
See hmmer(1) for a master man page with a list of all the individual
man pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf); or see the HMMER web page ().
375
376
377
378
Copyright (C) 2015 Howard Hughes Medical Institute.
Freely distributed under the GNU General Public License (GPLv3).

For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER
web page ().
386
387
388
Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org
395
396
397
398
HMMER 3.1b2                      February 2015                      hmmbuild(1)