hmmbuild(1) HMMER Manual hmmbuild(1)

NAME
hmmbuild - build a profile HMM from an alignment

SYNOPSIS
hmmbuild [options] hmmfile alignfile

DESCRIPTION
hmmbuild reads a multiple sequence alignment file alignfile, builds a
new profile HMM, and saves the HMM in hmmfile.

alignfile may be in ClustalW, GCG MSF, SELEX, Stockholm, or aligned
FASTA alignment format. The format is automatically detected.

By default, the model is configured to find one or more nonoverlapping
alignments to the complete model: multiple global alignments with
respect to the model, and local with respect to the sequence. This is
analogous to the behavior of the hmmls program of HMMER 1. To configure
the model for multiple local alignments with respect to the model
and local with respect to the sequence, a la the old program hmmfs, use
the -f (fragment) option. More rarely, you may want to configure the
model for a single global alignment (global with respect to both model
and sequence) using the -g option, or for a single local/local
alignment (a la standard Smith/Waterman, or the old hmmsw program)
using the -s option.

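The four configurations described above can be summarized as a small
lookup table. This is only an illustrative sketch: the names
ALIGNMENT_MODES and describe_mode are hypothetical and are not part of
HMMER; the table restates what this page says about the default and
the -f, -g, and -s options.

```python
# Hypothetical summary table; None stands for "no configuration option given".
ALIGNMENT_MODES = {
    None: {"hits": "multiple", "model": "global", "sequence": "local", "hmmer1": "hmmls"},
    "-f": {"hits": "multiple", "model": "local", "sequence": "local", "hmmer1": "hmmfs"},
    "-g": {"hits": "single", "model": "global", "sequence": "global", "hmmer1": "hmms"},
    "-s": {"hits": "single", "model": "local", "sequence": "local", "hmmer1": "hmmsw"},
}


def describe_mode(flag=None):
    """Return a one-line description of the configuration selected by flag."""
    m = ALIGNMENT_MODES[flag]
    return (f"{m['hits']} hits, {m['model']} to model, "
            f"{m['sequence']} to sequence (like {m['hmmer1']})")


print(describe_mode())      # the default configuration
print(describe_mode("-s"))  # single local/local alignment
```
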
OPTIONS
-f Configure the model for finding multiple domains per sequence,
where each domain can be a local (fragmentary) alignment. This
is analogous to the old hmmfs program of HMMER 1.

-g Configure the model for finding a single global alignment to a
target sequence, analogous to the old hmms program of HMMER 1.

-h Print brief help; includes version number and summary of all
options, including expert options.

-n <s> Name this HMM <s>. <s> can be any string of non-whitespace
characters (i.e. one "word"). There is no length limit (at
least not one imposed by HMMER; your shell will complain about
command line lengths first).

-o <f> Re-save the starting alignment to <f>, in Stockholm format. The
columns which were assigned to match states will be marked with
x's in an #=RF annotation line. If either the --hand or --fast
construction options were chosen, the alignment may have been
slightly altered to be compatible with Plan 7 transitions, so
saving the final alignment and comparing it to the starting
alignment lets you view these alterations. See the User's Guide
for more information on this arcane side effect.

-s Configure the model for finding a single local alignment per
target sequence. This is analogous to the standard Smith/Waterman
algorithm or the hmmsw program of HMMER 1.

-A Append this model to an existing hmmfile rather than creating
a new hmmfile. Useful for building HMM libraries (like Pfam).

-F Force overwriting of an existing hmmfile. Otherwise HMMER will
refuse to clobber your existing HMM files, for safety's sake.

EXPERT OPTIONS
--amino
Force the sequence alignment to be interpreted as amino acid
sequences. Normally HMMER autodetects whether the alignment is
protein or DNA, but sometimes alignments are so small that
autodetection is ambiguous. See --nucleic.

--archpri <x>
Set the "architecture prior" used by MAP architecture construction
to <x>, where <x> is a probability between 0 and 1. This
parameter governs a geometric prior distribution over model
lengths. As <x> increases, longer models are favored a priori.
As <x> decreases, it takes more residue conservation in a column
to make a column a "consensus" match column in the model
architecture. The 0.85 default has been chosen empirically as a
reasonable setting.

--binary
Write the HMM to hmmfile in HMMER binary format instead of
readable ASCII text.

--cfile <f>
Save the observed emission and transition counts to <f> after
the architecture has been determined (i.e. after residues/gaps
have been assigned to match, delete, and insert states). This
option is used in HMMER development for generating data files
useful for training new Dirichlet priors. The format of count
files is documented in the User's Guide.

--fast Quickly and heuristically determine the architecture of the
model by assigning all columns with more than a certain fraction
of gap characters to insert states. By default this fraction is
0.5, and it can be changed using the --gapmax option. The
default construction algorithm is a maximum a posteriori (MAP)
algorithm, which is slower.

--gapmax <x>
Controls the --fast model construction algorithm; it has no
effect unless --fast is in use. If a column has more than a
fraction <x> of gap symbols in it, it gets assigned to an insert
column. <x> is a frequency from 0 to 1, and by default is set
to 0.5. Higher values of <x> mean more columns get assigned to
consensus, and models get longer; smaller values of <x> mean
fewer columns get assigned to consensus, and models get shorter.

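The --fast rule and its --gapmax threshold can be sketched in a few
lines. This is a minimal illustration, not HMMER's implementation: the
function name fast_architecture and the set of gap symbols are
assumptions, and the small adjustments HMMER makes for Plan 7
compatibility are not modeled.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols; HMMER's own set may differ


def fast_architecture(alignment, gapmax=0.5):
    """Label each column 'M' (consensus/match) or 'I' (insert).

    A column becomes an insert column when its fraction of gap
    symbols is strictly greater than gapmax.
    """
    nseq = len(alignment)
    labels = []
    for column in zip(*alignment):
        gap_fraction = sum(ch in GAP_CHARS for ch in column) / nseq
        labels.append("I" if gap_fraction > gapmax else "M")
    return "".join(labels)


aln = ["AC-GT", "AC-GT", "A-TG-", "ACTG-"]
print(fast_architecture(aln))              # default threshold of 0.5
print(fast_architecture(aln, gapmax=0.4))  # stricter: fewer consensus columns
```

Lowering gapmax turns more columns into insert columns, so the model
gets shorter, which matches the behavior described for --gapmax.
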
--hand Specify the architecture of the model by hand: the alignment
file must be in SELEX or Stockholm format, and the reference
annotation line (#=RF in SELEX, #=GC RF in Stockholm) is used to
specify the architecture. Any column marked with a non-gap symbol
(such as an 'x', for instance) is assigned as a consensus
(match) column in the model.

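As an illustration of how a reference annotation line selects
consensus columns, the sketch below returns the indices of the columns
marked with a non-gap symbol. The function name and the exact set of
characters treated as gaps are assumptions made for this example.

```python
RF_GAP_CHARS = set(" -._~")  # assumed gap symbols in the RF line


def match_columns_from_rf(rf_line):
    """Return 0-based indices of columns marked as consensus (match)."""
    return [i for i, ch in enumerate(rf_line) if ch not in RF_GAP_CHARS]


print(match_columns_from_rf("xx..xxx.x"))
```
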
--idlevel <x>
Controls both the determination of effective sequence number and
the behavior of the --wblosum weighting option. The sequence
alignment is clustered by percent identity, and the number of
clusters at a cutoff threshold of <x> is used to determine the
effective sequence number. Higher values of <x> give more clusters
and higher effective sequence numbers; lower values of <x>
give fewer clusters and lower effective sequence numbers. <x>
is a fraction from 0 to 1, and by default is set to 0.62
(corresponding to the clustering level used in constructing the
BLOSUM62 substitution matrix).

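The clustering idea behind the effective sequence number can be
sketched as follows. Several details here are assumptions rather than
documented behavior: identity is computed over columns where both
sequences have a residue, clusters are formed by single linkage, and a
pair is joined when its identity is at least <x>. All function names
are hypothetical.

```python
def percent_identity(a, b, gaps="-._~"):
    """Fraction of identical residues over columns where neither is a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def cluster_by_identity(alignment, idlevel=0.62):
    """Single-linkage clusters of sequence indices (union-find)."""
    parent = list(range(len(alignment)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            if percent_identity(alignment[i], alignment[j]) >= idlevel:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(alignment)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def effective_sequence_number(alignment, idlevel=0.62):
    """Number of clusters at the given identity cutoff."""
    return len(cluster_by_identity(alignment, idlevel))


aln = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
print(effective_sequence_number(aln))        # first two sequences merge
print(effective_sequence_number(aln, 0.95))  # higher cutoff, more clusters
```
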
--informat <s>
Assert that the input alignfile is in format <s>; do not run
Babelfish format autodetection. This increases the reliability of
the program somewhat, because the Babelfish can make mistakes;
particularly recommended for unattended, high-throughput runs of
HMMER. Valid format strings include FASTA, GENBANK, EMBL, GCG,
PIR, STOCKHOLM, SELEX, MSF, CLUSTAL, and PHYLIP. See the User's
Guide for a complete list.

--noeff
Turn off the effective sequence number calculation, and use the
true number of sequences instead. This will usually reduce the
sensitivity of the final model (so don't do it without good
reason!).

--nucleic
Force the alignment to be interpreted as nucleic acid sequence,
either RNA or DNA. Normally HMMER autodetects whether the
alignment is protein or DNA, but sometimes alignments are so small
that autodetection is ambiguous. See --amino.

--null <f>
Read a null model from <f>. The default for protein is to use
average amino acid frequencies from Swissprot 34 and p1 =
350/351; for nucleic acid, the default is to use 0.25 for each
base and p1 = 1000/1001. For documentation of the format of the
null model file and further explanation of how the null model is
used, see the User's Guide.

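One way to read the default p1 values is as the self-transition
probability of a geometric length distribution. That interpretation is
an assumption made here for illustration; the User's Guide is the
authority on how the null model is actually used. Under the assumed
form P(L) = p1^L (1 - p1), the mean length is p1 / (1 - p1):

```python
from fractions import Fraction


def expected_null_length(p1):
    """Mean of the assumed distribution P(L) = p1**L * (1 - p1), L >= 0."""
    return p1 / (1 - p1)


protein_p1 = Fraction(350, 351)    # default for protein, from this page
nucleic_p1 = Fraction(1000, 1001)  # default for nucleic acid, from this page

print(expected_null_length(protein_p1))
print(expected_null_length(nucleic_p1))
```
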
--pam <f>
Apply a heuristic PAM- (substitution matrix-) based prior on
match emission probabilities instead of the default mixture
Dirichlet. The substitution matrix is read from <f>. See
--pamwgt.

The default Dirichlet state transition prior and insert emission
prior are unaffected. Therefore in principle you could combine
--prior with --pam but this isn't recommended, as it hasn't been
tested. (--pam itself hasn't been tested much!)

--pamwgt <x>
Controls the weight on a PAM-based prior. Only has effect if
--pam option is also in use. <x> is a positive real number,
20.0 by default. <x> is the number of "pseudocounts" contributed
by the heuristic prior. Very high values of <x> can force
a scoring system that is entirely driven by the substitution
matrix, making HMMER somewhat approximate Gribskov profiles.

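The effect of a pseudocount weight can be illustrated with the generic
pseudocount estimate below. This is a sketch under stated assumptions:
the function name is hypothetical, the prior distribution is simply
given as input (how HMMER derives it from the substitution matrix is
not shown), and a toy two-letter alphabet is used.

```python
def pseudocount_estimate(counts, prior_probs, weight=20.0):
    """Mix observed counts with `weight` pseudocounts spread per prior_probs."""
    total = sum(counts)
    return [(c + weight * q) / (total + weight)
            for c, q in zip(counts, prior_probs)]


counts = [8, 2]     # observed residue counts in one match column
prior = [0.5, 0.5]  # assumed prior distribution for that column

print(pseudocount_estimate(counts, prior))              # weight 20.0
print(pseudocount_estimate(counts, prior, weight=1e6))  # prior dominates
```

With a very large weight the estimate is driven almost entirely by the
prior, which is the behavior the --pamwgt description warns about.
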
--pbswitch <n>
For alignments with a very large number of sequences, the GSC,
BLOSUM, and Voronoi weighting schemes are slow; they're O(N^2)
for N sequences. Henikoff position-based weights (PB weights)
are more efficient. At or above a certain threshold sequence
number <n> hmmbuild will switch from GSC, BLOSUM, or Voronoi
weights to PB weights. To disable this switching behavior (at
the cost of compute time), set <n> to be something larger than
the number of sequences in your alignment. <n> is a positive
integer; the default is 1000.

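The switching rule can be written as a small decision function. The
function name and the scheme labels are hypothetical; only the rule
itself (switch at or above <n> sequences, and only for the GSC,
BLOSUM, and Voronoi schemes) comes from the description above.

```python
SLOW_SCHEMES = {"gsc", "blosum", "voronoi"}  # the O(N^2) schemes named above


def choose_weighting(nseq, requested="gsc", pbswitch=1000):
    """Return the weighting scheme that would be used for nseq sequences."""
    if requested in SLOW_SCHEMES and nseq >= pbswitch:
        return "pb"
    return requested


print(choose_weighting(999))                           # below the threshold
print(choose_weighting(1000))                          # at the threshold
print(choose_weighting(5000, "blosum", pbswitch=5001)) # switching disabled
```
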
--prior <f>
Read a Dirichlet prior from <f>, replacing the default mixture
Dirichlet. The format of prior files is documented in the
User's Guide, and an example is given in the Demos directory of
the HMMER distribution.

--swentry <x>
Controls the total probability that is distributed to local
entries into the model, versus starting at the beginning of the
model as in a global alignment. <x> is a probability from 0 to
1, and by default is set to 0.5. Higher values of <x> mean that
hits that are fragments on their left (N or 5'-terminal) side
will be penalized less, but complete global alignments will be
penalized more. Lower values of <x> mean that fragments on the
left will be penalized more, and global alignments on this side
will be favored. This option only affects the configurations
that allow local alignments, e.g. -s and -f; unless one of
these options is also activated, this option has no effect. You
have independent control over local/global alignment behavior
for the N/C (5'/3') termini of your target sequences using
--swentry and --swexit.

--swexit <x>
Controls the total probability that is distributed to local
exits from the model, versus ending an alignment at the end of
the model as in a global alignment. <x> is a probability from 0
to 1, and by default is set to 0.5. Higher values of <x> mean
that hits that are fragments on their right (C or 3'-terminal)
side will be penalized less, but complete global alignments will
be penalized more. Lower values of <x> mean that fragments on
the right will be penalized more, and global alignments on this
side will be favored. This option only affects the configurations
that allow local alignments, e.g. -s and -f; unless one
of these options is also activated, this option has no effect.
You have independent control over local/global alignment behavior
for the N/C (5'/3') termini of your target sequences using
--swentry and --swexit.

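To make the --swentry trade-off concrete, the sketch below splits the
entry probability between the first model position and all later
positions. It is a simplification under explicit assumptions: the
local-entry mass is spread uniformly over positions 2..M, delete-state
paths are ignored, and the function name is hypothetical. This page
only states the total probability given to local entries, not how it
is divided. The exit side controlled by --swexit is not modeled here.

```python
def local_entry_distribution(model_length, swentry=0.5):
    """Entry probabilities for positions 1..M under the stated assumptions."""
    if model_length < 2:
        return [1.0]  # nowhere else to enter
    shared = swentry / (model_length - 1)  # assumed uniform over positions 2..M
    return [1.0 - swentry] + [shared] * (model_length - 1)


print(local_entry_distribution(5))        # default <x> = 0.5
print(local_entry_distribution(5, 0.25))  # global starts favored more
```
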
--verbose
Print more possibly useful stuff, such as the individual scores
for each sequence in the alignment.

--wblosum
Use the BLOSUM filtering algorithm to weight the sequences,
instead of the default. Cluster the sequences at a given
percentage identity (see --idlevel); assign each cluster a total
weight of 1.0, distributed equally amongst the members of that
cluster.

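Given a set of clusters, the weight assignment described for --wblosum
is short to write down. In this sketch the clusters are supplied
directly as lists of sequence indices; the function name is
hypothetical, and how the clusters are formed is left to --idlevel.

```python
def blosum_weights(clusters, nseq):
    """Give each cluster a total weight of 1.0, shared equally by its members."""
    weights = [0.0] * nseq
    for members in clusters:
        for i in members:
            weights[i] = 1.0 / len(members)
    return weights


# Sequences 0 and 1 form one cluster; sequence 2 is alone.
print(blosum_weights([[0, 1], [2]], nseq=3))
```

The weights always sum to the number of clusters, so a large family of
near-identical sequences counts no more than a single divergent one.
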
--wgsc Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting
algorithm. This is already the default, so this option has no
effect (unless it follows another option in the --w family, in
which case it overrides it).

--wme Use the Krogh/Mitchison maximum entropy algorithm to "weight"
the sequences. This supersedes the Eddy/Mitchison/Durbin maximum
discrimination algorithm, which gives almost identical weights
but is less robust. ME weighting seems to give a marginal
increase in sensitivity over the default GSC weights, but takes
a fair amount of time.

--wnone
Turn off all sequence weighting.

--wpb Use the Henikoff position-based weighting scheme.

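For readers unfamiliar with position-based weights, here is a minimal
sketch of the scheme: in each column, a sequence earns 1 / (k * n),
where k is the number of distinct residues in the column and n is the
number of sequences sharing its residue. The gap handling and the
final normalization (weights summing to the number of sequences) are
assumptions for this example, and the function name is hypothetical.

```python
from collections import Counter


def position_based_weights(alignment, gaps="-._~"):
    """Henikoff-style position-based weights, normalized to sum to nseq."""
    nseq = len(alignment)
    weights = [0.0] * nseq
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch not in gaps)
        k = len(counts)  # number of distinct residues in this column
        for i, ch in enumerate(column):
            if ch in counts:  # gap symbols earn nothing (assumption)
                weights[i] += 1.0 / (k * counts[ch])
    total = sum(weights)
    if total == 0.0:
        return [1.0] * nseq  # all-gap alignment: fall back to uniform weights
    return [w * nseq / total for w in weights]


print(position_based_weights(["AAA", "AAA", "CCC"]))
```

Each column is processed once per sequence, so the cost grows linearly
with the number of sequences rather than quadratically.
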
--wvoronoi
Use the Sibbald/Argos Voronoi sequence weighting algorithm in
place of the default GSC weighting.

SEE ALSO
Master man page, with full list of and guide to the individual man
pages: see hmmer(1).

For complete documentation, see the user guide that came with the
distribution (Userguide.pdf); or see the HMMER web page,
http://hmmer.wustl.edu/.

COPYRIGHT
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
Freely distributed under the GNU General Public License (GPL).
See the file COPYING in your distribution for details on redistribution
conditions.

AUTHOR
Sean Eddy
HHMI/Dept. of Genetics
Washington Univ. School of Medicine
4566 Scott Ave.
St Louis, MO 63110 USA
http://www.genetics.wustl.edu/eddy/

HMMER 2.3.2 Oct 2003 hmmbuild(1)