hmmbuild(1) HMMER Manual hmmbuild(1)

hmmbuild - construct profiles from multiple sequence alignments

hmmbuild [options] hmmfile msafile

For each multiple sequence alignment in msafile, build a profile HMM and
save it to a new file hmmfile.

msafile may be '-' (dash), which means reading this input from stdin
rather than a file.

hmmfile may not be '-' (stdout), because sending the HMM file to stdout
would conflict with the other text output of the program.

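The argument rules above can be sketched as a small wrapper that assembles a command line before running it. The helper name hmmbuild_argv and the file names are hypothetical and are not part of HMMER; only the argument order and the '-' rules come from the text above.

```python
from typing import List, Sequence


def hmmbuild_argv(hmmfile: str, msafile: str,
                  options: Sequence[str] = ()) -> List[str]:
    """Assemble an hmmbuild command line (hypothetical helper, not part of HMMER)."""
    if hmmfile == "-":
        # The profile cannot go to stdout: it would mix with the summary output.
        raise ValueError("hmmfile may not be '-'")
    # msafile may be '-', meaning the alignment is read from stdin.
    return ["hmmbuild", *options, hmmfile, msafile]


# Example with made-up file names: read the alignment from stdin.
print(hmmbuild_argv("out.hmm", "-", ["-o", "summary.txt"]))
```

The resulting list could be passed to subprocess.run() on a machine where hmmbuild is installed.
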
-h Help; print a brief reminder of command line usage and all
available options.

-n <s> Name the new profile <s>. The default is to use the name of the
alignment (if one is present in the msafile) or, failing that, the
name of the hmmfile. If msafile contains more than one alignment, -n
cannot be used, and every alignment must have a name annotated in the
msafile (as in Stockholm #=GF ID annotation).

-o <f> Direct the summary output to file <f>, rather than to stdout.

-O <f> After each model is constructed, resave annotated, possibly
modified source alignments to a file <f> in Stockholm format. The
alignments are annotated with a reference annotation line indicating
which columns were assigned as consensus, and sequences are annotated
with what relative sequence weights were assigned. Some residues of the
alignment may have been shifted to accommodate restrictions of the
Plan7 profile architecture, which disallows transitions between insert
and delete states.

--amino
Assert that sequences in msafile are protein, bypassing alphabet
autodetection.

--dna Assert that sequences in msafile are DNA, bypassing alphabet
autodetection.

--rna Assert that sequences in msafile are RNA, bypassing alphabet
autodetection.

These options control how consensus columns are defined in an alignment.

--fast Define consensus columns as those that have a fraction >= symfrac
of residues as opposed to gaps. (See below for the --symfrac option.)
This is the default.

--hand Define consensus columns in the next profile using the reference
annotation of the multiple alignment. This allows you to define any
consensus columns you like.

--symfrac <x>
Define the residue fraction threshold necessary to define a consensus
column when using the --fast option. The default is 0.5. The symbol
fraction in each column is calculated after taking relative sequence
weighting into account, and ignoring gap characters corresponding to
ends of sequence fragments (as opposed to internal
insertions/deletions). Setting this to 0.0 means that every alignment
column will be assigned as consensus, which may be useful in some cases.
Setting it to 1.0 means that only columns that include 0 gaps (internal
insertions/deletions) will be assigned as consensus.

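The --fast rule can be illustrated with a minimal sketch. It assumes gaps are written as '-' or '.', takes the relative weights as given, and does not reproduce the special handling of terminal gaps in fragments; the function name and toy alignment are made up for illustration and are not HMMER's implementation.

```python
from typing import List, Sequence

GAP_CHARS = set("-.")  # assumption: gaps are written as '-' or '.'


def consensus_columns(msa: Sequence[str], weights: Sequence[float],
                      symfrac: float = 0.5) -> List[bool]:
    """Flag columns whose weighted residue fraction is >= symfrac."""
    total_w = sum(weights)
    flags = []
    for col in range(len(msa[0])):
        # Weighted count of sequences that have a residue (not a gap) here.
        residue_w = sum(w for seq, w in zip(msa, weights)
                        if seq[col] not in GAP_CHARS)
        flags.append(total_w > 0 and residue_w / total_w >= symfrac)
    return flags


toy_msa = ["AC-T", "A--T", "ACGT"]      # hypothetical alignment
uniform = [1.0, 1.0, 1.0]
print(consensus_columns(toy_msa, uniform))        # [True, True, False, True]
print(consensus_columns(toy_msa, uniform, 1.0))   # [True, False, False, True]
```

With symfrac 0.0 every column of the toy alignment is flagged, matching the description above.
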
--fragthresh <x>
We only want to count terminal gaps as deletions if the aligned sequence
is known to be full-length, not if it is a fragment (for instance,
because only part of it was sequenced). HMMER uses a simple rule to
infer fragments: if the range of a sequence in the alignment (the number
of alignment columns between the first and last positions of the
sequence) is less than or equal to a fraction <x> times the alignment
length in columns, then the sequence is handled as a fragment. The
default is 0.5. Setting --fragthresh 0 will define no (nonempty)
sequence as a fragment; you might want to do this if you know you've got
a carefully curated alignment of full-length sequences. Setting
--fragthresh 1 will define all sequences as fragments; you might want to
do this if you know your alignment is entirely composed of fragments,
such as translated short reads in metagenomic shotgun data.

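The fragment rule can be written down in a few lines. This is a sketch of the rule as stated above, not HMMER's code: it assumes gaps are '-' or '.', counts the range inclusively from the first to the last residue, and treats an all-gap row as having range 0 (an assumption suggested by the "(nonempty)" qualifier).

```python
def is_fragment(aligned_seq: str, fragthresh: float = 0.5,
                gap_chars: str = "-.") -> bool:
    """Apply the range <= fragthresh * alignment_length rule to one aligned row."""
    positions = [i for i, c in enumerate(aligned_seq) if c not in gap_chars]
    # Range = columns spanned from first to last residue; 0 for an empty row.
    span = positions[-1] - positions[0] + 1 if positions else 0
    return span <= fragthresh * len(aligned_seq)


print(is_fragment("--ACGT----"))        # True: range 4 of 10 columns
print(is_fragment("ACGT--ACGT"))        # False: range 10 of 10 columns
print(is_fragment("--ACGT----", 0.0))   # False: no nonempty row is a fragment
print(is_fragment("ACGT--ACGT", 1.0))   # True: every row is a fragment
```
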
HMMER uses an ad hoc sequence weighting algorithm to downweight closely
related sequences and upweight distantly related ones. This has the
effect of making models less biased by uneven phylogenetic
representation. For example, two identical sequences would typically
each receive half the weight that one sequence would. These options
control which algorithm gets used.

--wpb Use the Henikoff position-based sequence weighting scheme
[Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is the
default.

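As an illustration of position-based weighting, the sketch below follows the textbook form of the Henikoff scheme: in each column, a sequence carrying residue r receives 1/(k * n_r), where k is the number of distinct residues in the column and n_r is the number of sequences sharing r. Skipping gap characters and rescaling the weights to sum to the number of sequences are assumptions of this sketch; HMMER's own implementation may differ in such details.

```python
from collections import Counter
from typing import List, Sequence


def position_based_weights(msa: Sequence[str],
                           gap_chars: str = "-.") -> List[float]:
    """Textbook Henikoff position-based weights, rescaled to sum to len(msa)."""
    nseq = len(msa)
    weights = [0.0] * nseq
    for col in range(len(msa[0])):
        column = [seq[col] for seq in msa]
        counts = Counter(c for c in column if c not in gap_chars)
        k = len(counts)  # number of distinct residues in this column
        for i, c in enumerate(column):
            if c in counts:  # gaps contribute nothing (assumption)
                weights[i] += 1.0 / (k * counts[c])
    total = sum(weights)
    if total == 0:
        return [1.0] * nseq
    return [w * nseq / total for w in weights]


# Two identical sequences plus one unrelated sequence (toy data).
print(position_based_weights(["ACDE", "ACDE", "GHIK"]))  # [0.75, 0.75, 1.5]
```

In the toy example each of the two identical sequences receives half the weight of the unrelated one, which is the behavior described in the paragraph above.
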
--wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
[Gerstein et al., J. Mol. Biol. 235:1067, 1994].

--wblosum
Use the same clustering scheme that was used to weight data in
calculating BLOSUM substitution matrices [Henikoff and Henikoff, Proc.
Natl. Acad. Sci. 89:10915, 1992]. Sequences are single-linkage
clustered at an identity threshold (default 0.62; see --wid) and within
each cluster of c sequences, each sequence gets relative weight 1/c.

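The 1/c weighting can be sketched with a small union-find clustering. The identity measure used here (matches divided by columns where both sequences have a residue) and the use of >= at the threshold are assumptions made for illustration; HMMER's definition of pairwise identity may differ.

```python
from collections import Counter
from typing import List, Sequence


def pairwise_identity(a: str, b: str, gap_chars: str = "-.") -> float:
    """Fraction of identical residues over columns where both rows have one."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(msa: Sequence[str], wid: float = 0.62) -> List[float]:
    """Single-linkage cluster at identity >= wid; each member gets 1/c."""
    n = len(msa)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(msa[i], msa[j]) >= wid:
                parent[find(i)] = find(j)  # link the two clusters
    roots = [find(i) for i in range(n)]
    sizes = Counter(roots)
    return [1.0 / sizes[r] for r in roots]


# Toy data: the first two rows are 90% identical, the third is unrelated.
print(blosum_weights(["ACDEFGHIKL", "ACDEFGHIKM", "WWWWWWWWWW"]))  # [0.5, 0.5, 1.0]
```
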
--wnone
No relative weights. All sequences are assigned uniform weight.

--wid <x>
Sets the identity threshold used by single-linkage clustering when using
--wblosum. Invalid with any other weighting scheme. Default is 0.62.

After relative weights are determined, they are normalized to sum to a
total effective sequence number, eff_nseq. This number may be the actual
number of sequences in the alignment, but it is almost always smaller
than that. The default entropy weighting method (--eent) reduces the
effective sequence number to reduce the information content (relative
entropy, or average expected score on true homologs) per consensus
position. The target relative entropy is controlled by a two-parameter
function, where the two parameters are settable with --ere and --esigma.

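The normalization step itself is simple rescaling, as the sketch below shows. The function name and the numbers are made up for illustration; how eff_nseq is chosen is a separate matter, governed by the options that follow.

```python
from typing import List, Sequence


def scale_to_eff_nseq(weights: Sequence[float], eff_nseq: float) -> List[float]:
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w * eff_nseq / total for w in weights]


# Three sequences with relative weights summing to 3, scaled to eff_nseq = 1.2.
scaled = scale_to_eff_nseq([0.75, 0.75, 1.5], 1.2)
print(scaled, sum(scaled))
```

The ratios between weights are unchanged; only their total shrinks from the actual sequence count to the effective one.
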
--eent Adjust effective sequence number to achieve a specific relative
entropy per position (see --ere). This is the default.

--eclust
Set effective sequence number to the number of single-linkage clusters
at a specific identity threshold (see --eid). This option is not
recommended; it's for experiments evaluating how much better --eent is.

--enone
Turn off effective sequence number determination and just use the actual
number of sequences. One reason you might want to do this is to try to
maximize the relative entropy/position of your model, which may be
useful for short models.

--eset <x>
Explicitly set the effective sequence number for all models to <x>.

--ere <x>
Set the minimum relative entropy/position target to <x>. Requires
--eent. Default depends on the sequence alphabet. For protein sequences,
it is 0.59 bits/position; for nucleotide sequences, it is 0.45
bits/position.

--esigma <x>
Sets the minimum relative entropy contributed by an entire model
alignment, over its whole length. This has the effect of making short
models have higher relative entropy per position than --ere alone would
give. The default is 45.0 bits.

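One way to read the interaction of --ere and --esigma is sketched below: the per-position target is the larger of the --ere floor and the whole-model minimum spread over the model length. This is a simplified reading of the two descriptions above, stated as an assumption; the exact two-parameter function used by HMMER is not given here and may include additional terms.

```python
def target_relent(model_length: int, ere: float = 0.59,
                  esigma: float = 45.0) -> float:
    """Simplified per-position target: max(ere, esigma / model_length).

    Assumption for illustration only; not necessarily HMMER's exact formula.
    """
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    return max(ere, esigma / model_length)


print(target_relent(30))    # 1.5: a short model is pushed above the --ere floor
print(target_relent(200))   # 0.59: a long model falls back to the --ere floor
```

Under this reading, short models receive a higher per-position target, which is the effect the --esigma description attributes to it.
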
--eid <x>
Sets the fractional pairwise identity cutoff used by single linkage
clustering with the --eclust option. The default is 0.62.

By default, weighted counts are converted to mean posterior probability
parameter estimates using mixture Dirichlet priors. Default mixture
Dirichlet prior parameters for protein models and for nucleic acid (RNA
and DNA) models are built in. The following options allow you to
override the default priors.

--pnone
Don't use any priors. Probability parameters will simply be the observed
frequencies, after relative sequence weighting.

--plaplace
Use a Laplace +1 prior in place of the default mixture Dirichlet prior.

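The difference between the two alternatives can be seen on a single column of weighted counts. The sketch below uses the standard definitions (observed frequencies, and add-one smoothing for the Laplace prior); the function names and the toy four-letter counts are illustrative and not taken from HMMER.

```python
from typing import List, Sequence


def frequency_estimate(counts: Sequence[float]) -> List[float]:
    """No prior: probabilities are the observed (weighted) frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def laplace_estimate(counts: Sequence[float]) -> List[float]:
    """Laplace +1 prior: add one pseudocount to every symbol."""
    total = sum(counts) + len(counts)
    return [(c + 1.0) / total for c in counts]


toy_counts = [3.0, 1.0, 0.0, 0.0]      # hypothetical weighted counts
print(frequency_estimate(toy_counts))  # [0.75, 0.25, 0.0, 0.0]
print(laplace_estimate(toy_counts))    # [0.5, 0.25, 0.125, 0.125]
```

Without a prior, unobserved symbols get probability zero; the +1 prior keeps every symbol possible.
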
By default, if a query is a single sequence from a file in FASTA format,
hmmbuild constructs a search model from that sequence and a standard
20x20 substitution matrix for residue probabilities, along with two
additional parameters for position-independent gap open and gap extend
probabilities. These options allow the default single-sequence scoring
parameters to be changed, and for single-sequence scoring options to be
applied to a single sequence coming from an aligned format.

--singlemx
If a single sequence query comes from a multiple sequence alignment
file, such as in Stockholm format, the search model is by default
constructed as is typically done for multiple sequence alignments. This
option forces hmmbuild to use the single-sequence method with
substitution score matrix.

--mx <s>
Obtain residue alignment probabilities from the built-in substitution
matrix named <s>. Several standard matrices are built-in, and do not
need to be read from files. The matrix name <s> can be PAM30, PAM70,
PAM120, PAM240, BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, or
DNA1. Only one of the --mx and --mxfile options may be used.

--mxfile <mxfile>
Obtain residue alignment probabilities from the substitution matrix in
file <mxfile>. The default score matrix is BLOSUM62 for protein
sequences, and DNA1 for nucleotide sequences (these matrices are
internal to HMMER and do not need to be available as a file). The format
of a substitution matrix <mxfile> is the standard format accepted by
BLAST, FASTA, and other sequence analysis software. See
ftp.ncbi.nlm.nih.gov/blast/matrices/ for example files. (The only
exception: we require matrices to be square, so for DNA, use files like
NCBI's NUC.4.4, not NUC.4.2.)

--popen <x>
Set the gap open probability for a single sequence query model to <x>.
The default is 0.02. <x> must be >= 0 and < 0.5.

--pextend <x>
Set the gap extend probability for a single sequence query model to <x>.
The default is 0.4. <x> must be >= 0 and < 1.0.

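The constraints stated for the single-sequence scoring options can be collected into a pre-flight check, sketched below. The function name is hypothetical, and treating matrix names as case-sensitive is an assumption; the matrix list and numeric ranges are taken from the option descriptions above.

```python
from typing import List, Optional

BUILTIN_MATRICES = {
    "PAM30", "PAM70", "PAM120", "PAM240",
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90", "DNA1",
}


def check_single_seq_options(mx: Optional[str] = None,
                             mxfile: Optional[str] = None,
                             popen: float = 0.02,
                             pextend: float = 0.4) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if mx is not None and mxfile is not None:
        problems.append("only one of --mx and --mxfile may be used")
    if mx is not None and mx not in BUILTIN_MATRICES:
        # Assumption: names are matched exactly as listed in the man page.
        problems.append(f"unknown built-in matrix: {mx}")
    if not 0 <= popen < 0.5:
        problems.append("--popen must be >= 0 and < 0.5")
    if not 0 <= pextend < 1.0:
        problems.append("--pextend must be >= 0 and < 1.0")
    return problems


print(check_single_seq_options())                       # []
print(check_single_seq_options(mx="PAM70", popen=0.5))  # one problem reported
```
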
The location parameters for the expected score distributions for MSV
filter scores, Viterbi filter scores, and Forward scores require three
short random sequence simulations.

--EmL <n>
Sets the sequence length in simulation that estimates the location
parameter mu for MSV filter E-values. Default is 200.

--EmN <n>
Sets the number of sequences in simulation that estimates the location
parameter mu for MSV filter E-values. Default is 200.

--EvL <n>
Sets the sequence length in simulation that estimates the location
parameter mu for Viterbi filter E-values. Default is 200.

--EvN <n>
Sets the number of sequences in simulation that estimates the location
parameter mu for Viterbi filter E-values. Default is 200.

--EfL <n>
Sets the sequence length in simulation that estimates the location
parameter tau for Forward E-values. Default is 100.

--EfN <n>
Sets the number of sequences in simulation that estimates the location
parameter tau for Forward E-values. Default is 200.

--Eft <x>
Sets the tail mass fraction to fit in the simulation that estimates the
location parameter tau for Forward E-values. Default is 0.04.

--cpu <n>
Set the number of parallel worker threads to <n>. On multicore machines,
the default is 2. You can also control this number by setting an
environment variable, HMMER_NCPU. There is also a master thread, so the
actual number of threads that HMMER spawns is <n>+1.

This option is not available if HMMER was compiled with POSIX threads
support turned off.

--informat <s>
Assert that input msafile is in alignment format <s>, bypassing format
autodetection. Common choices for <s> include: stockholm, a2m, afa,
psiblast, clustal, phylip. For more information, and for codes for some
less common formats, see the main documentation. The string <s> is
case-insensitive (a2m or A2M both work).

--seed <n>
Seed the random number generator with <n>, an integer >= 0. If <n> is
nonzero, any stochastic simulations will be reproducible; the same
command will give the same results. If <n> is 0, the random number
generator is seeded arbitrarily, and stochastic simulations will vary
from run to run of the same command. The default seed is 42.

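The seeding convention can be mimicked with Python's standard random module as an analogy. This sketch does not use HMMER's generator and will not reproduce its numbers; it only shows the same contract, in which a nonzero seed gives repeatable draws and 0 requests an arbitrary seed.

```python
import random


def make_rng(seed: int = 42) -> random.Random:
    """Mirror the --seed convention: 0 means 'seed arbitrarily' (analogy only)."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    # random.Random(None) seeds from the operating system or the clock.
    return random.Random(seed if seed != 0 else None)


# Same nonzero seed, same stream of numbers.
print(make_rng(42).random() == make_rng(42).random())  # True
```
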
--w_beta <x>
Window length tail mass. The upper bound, W, on the length at which
nhmmer expects to find an instance of the model is set such that the
fraction of all sequences generated by the model with length >= W is
less than <x>. The default is 1e-7.

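The definition of W can be made concrete with a toy length distribution. The sketch assumes the probabilities of each emitted length are already available as a list (how HMMER obtains them is not covered here) and returns the smallest W whose tail mass P(length >= W) is below the threshold; the function name and numbers are illustrative.

```python
from typing import Sequence


def window_length(length_probs: Sequence[float], beta: float = 1e-7) -> int:
    """Smallest W such that the sum of length_probs[L] for L >= W is < beta.

    length_probs[L] is the probability of emitting a sequence of length L.
    """
    tail = 0.0
    w = len(length_probs)  # beyond the table the tail mass is zero
    # Walk down from the longest length, accumulating the tail mass.
    for length in range(len(length_probs) - 1, -1, -1):
        if tail + length_probs[length] >= beta:
            break
        tail += length_probs[length]
        w = length
    return w


toy_probs = [0.0, 0.5, 0.3, 0.15, 0.05]   # hypothetical distribution
print(window_length(toy_probs, beta=0.1))  # 4: P(length >= 4) = 0.05 < 0.1
```
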
--w_length <n>
Override the model instance length upper bound, W, which is otherwise
controlled by --w_beta. It should be larger than the model length. The
value of W is used deep in the acceleration pipeline, and modest changes
are not expected to impact results (though larger values of W do lead to
longer run time).

--mpi Run as a parallel MPI program. Each alignment is assigned to an
MPI worker node for construction. (Therefore, the maximum
parallelization cannot exceed the number of alignments in the input
msafile.) This is useful when building large profile libraries. This
option is only available if optional MPI capability was enabled at
compile-time.

--stall
For debugging MPI parallelization: arrest program execution immediately
after start, and wait for a debugger to attach to the running process
and release the arrest.

--maxinsertlen <n>
Restrict insert length parameterization such that the expected insert
length at each position of the model is no more than <n>.

See hmmer(1) for a master man page with a list of all the individual man
pages for programs in the HMMER package.

For complete documentation, see the user guide that came with your HMMER
distribution (Userguide.pdf); or see the HMMER web page
(http://hmmer.org/).

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER web
page (http://hmmer.org/).

http://eddylab.org

HMMER 3.3.2 Nov 2020 hmmbuild(1)