hmmbuild(1)

hmmbuild(1)                      HMMER Manual                      hmmbuild(1)

NAME

       hmmbuild - build a profile HMM from an alignment

SYNOPSIS

       hmmbuild [options] hmmfile alignfile

DESCRIPTION

       hmmbuild reads a multiple sequence alignment file alignfile, builds a
       new profile HMM, and saves the HMM in hmmfile.


       alignfile may be in ClustalW, GCG MSF, SELEX, Stockholm, or aligned
       FASTA alignment format. The format is automatically detected.


       By default, the model is configured to find one or more nonoverlapping
       alignments to the complete model: multiple global alignments with
       respect to the model, and local with respect to the sequence. This is
       analogous to the behavior of the hmmls program of HMMER 1. To
       configure the model for multiple local alignments with respect to the
       model and local with respect to the sequence, a la the old program
       hmmfs, use the -f (fragment) option. More rarely, you may want to
       configure the model for a single global alignment (global with respect
       to both model and sequence), using the -g option; or to configure the
       model for a single local/local alignment (a la standard
       Smith/Waterman, or the old hmmsw program), using the -s option.

OPTIONS

       -f     Configure the model for finding multiple domains per sequence,
              where each domain can be a local (fragmentary) alignment. This
              is analogous to the old hmmfs program of HMMER 1.


       -g     Configure the model for finding a single global alignment to a
              target sequence, analogous to the old hmms program of HMMER 1.


       -h     Print brief help; includes version number and summary of all
              options, including expert options.


       -n <s> Name this HMM <s>. <s> can be any string of non-whitespace
              characters (e.g. one "word"). There is no length limit (at
              least not one imposed by HMMER; your shell will complain about
              command line lengths first).


       -o <f> Re-save the starting alignment to <f>, in Stockholm format. The
              columns which were assigned to match states will be marked with
              x's in an #=RF annotation line. If either the --hand or the
              --fast construction option was chosen, the alignment may have
              been slightly altered to be compatible with Plan 7 transitions,
              so saving the final alignment and comparing it to the starting
              alignment can let you view these alterations. See the User's
              Guide for more information on this arcane side effect.


       -s     Configure the model for finding a single local alignment per
              target sequence. This is analogous to the standard
              Smith/Waterman algorithm or the hmmsw program of HMMER 1.


       -A     Append this model to an existing hmmfile rather than creating
              hmmfile. Useful for building HMM libraries (like Pfam).


       -F     Force overwriting of an existing hmmfile. Otherwise HMMER will
              refuse to clobber your existing HMM files, for safety's sake.
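
       The options above can be combined on one command line. As a rough
       sketch (not part of HMMER), the Python helper below assembles an
       argument list from them. The helper name, the mode labels, and the
       file names in the usage line are hypothetical; only the flags
       themselves come from this page.

```python
import shlex

# Hypothetical mode labels mapped to the flags documented above.
# The default mode (hmmls-like) needs no flag at all.
MODE_FLAGS = {
    "multi_global": None,   # default: multiple hits, global to the model
    "multi_local": "-f",    # multiple hits, local to the model (hmmfs-like)
    "single_global": "-g",  # one hit, global to model and sequence
    "single_local": "-s",   # one hit, local/local (Smith/Waterman-like)
}


def build_hmmbuild_argv(hmmfile, alignfile, mode="multi_global",
                        name=None, append=False, force=False):
    """Return an argv list for hmmbuild; this does not run the program."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    argv = ["hmmbuild"]
    flag = MODE_FLAGS[mode]
    if flag is not None:
        argv.append(flag)
    if name is not None:
        # -n accepts any string of non-whitespace characters.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError("HMM name must be non-empty, without whitespace")
        argv += ["-n", name]
    if append:
        argv.append("-A")
    if force:
        argv.append("-F")
    # Per the SYNOPSIS, hmmfile comes before alignfile.
    argv += [hmmfile, alignfile]
    return argv


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(build_hmmbuild_argv("globin.hmm", "globins.sto",
                                         mode="multi_local", name="globin")))
```

       The list form can be handed to subprocess.run() without going through
       a shell, which sidesteps quoting problems in file names.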

EXPERT OPTIONS

       --amino
              Force the sequence alignment to be interpreted as amino acid
              sequences. Normally HMMER autodetects whether the alignment is
              protein or DNA, but sometimes alignments are so small that
              autodetection is ambiguous. See --nucleic.


       --archpri <x>
              Set the "architecture prior" used by MAP architecture
              construction to <x>, where <x> is a probability between 0 and
              1. This parameter governs a geometric prior distribution over
              model lengths. As <x> increases, longer models are favored a
              priori. As <x> decreases, it takes more residue conservation in
              a column to make a column a "consensus" match column in the
              model architecture. The 0.85 default has been chosen
              empirically as a reasonable setting.


       --binary
              Write the HMM to hmmfile in HMMER binary format instead of
              readable ASCII text.


       --cfile <f>
              Save the observed emission and transition counts to <f> after
              the architecture has been determined (i.e. after residues/gaps
              have been assigned to match, delete, and insert states). This
              option is used in HMMER development for generating data files
              useful for training new Dirichlet priors. The format of count
              files is documented in the User's Guide.


       --fast Quickly and heuristically determine the architecture of the
              model by assigning all columns with more than a certain
              fraction of gap characters to insert states. By default this
              fraction is 0.5, and it can be changed using the --gapmax
              option. The default construction algorithm is a maximum a
              posteriori (MAP) algorithm, which is slower.


       --gapmax <x>
              Controls the --fast model construction algorithm; it has no
              effect unless --fast is also used. If a column has more than a
              fraction <x> of gap symbols in it, it gets assigned to an
              insert column. <x> is a frequency from 0 to 1, and by default
              is set to 0.5. Higher values of <x> mean more columns get
              assigned to consensus, and models get longer; smaller values of
              <x> mean fewer columns get assigned to consensus, and models
              get shorter.


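              The --fast rule can be sketched in a few lines of Python. This
              is only an illustration of the rule as stated above, not
              HMMER's code: it assumes an unweighted gap fraction and a
              particular set of gap characters, and the function name is
              made up for the example.

```python
# Assumed gap symbols; the set HMMER actually recognizes may differ.
GAP_CHARS = set("-._~")


def fast_architecture(alignment, gapmax=0.5):
    """Label each alignment column 'M' (match/consensus) or 'I' (insert).

    A column becomes an insert column when its fraction of gap symbols
    is strictly greater than gapmax, following the --gapmax description.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    nseq = len(alignment)
    labels = []
    for col in range(ncols):
        gaps = sum(1 for row in alignment if row[col] in GAP_CHARS)
        labels.append("I" if gaps / nseq > gapmax else "M")
    return "".join(labels)


if __name__ == "__main__":
    toy = ["ACD-F", "AC--F", "A---F", "ACDEF"]
    print(fast_architecture(toy))              # default gapmax of 0.5
    print(fast_architecture(toy, gapmax=0.2))  # stricter: fewer match columns
```

              In the toy alignment, lowering gapmax from 0.5 to 0.2 turns
              two more columns into insert columns, so the model gets
              shorter.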
       --hand Specify the architecture of the model by hand: the alignment
              file must be in SELEX or Stockholm format, and the reference
              annotation line (#=RF in SELEX, #=GC RF in Stockholm) is used
              to specify the architecture. Any column marked with a non-gap
              symbol (such as an 'x', for instance) is assigned as a
              consensus (match) column in the model.


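              As an illustration of how a reference line picks out consensus
              columns, here is a small Python sketch. The function name and
              the set of characters treated as gaps are assumptions made for
              the example; this page only says that non-gap symbols mark
              match columns.

```python
# Assumed gap symbols for the reference line; HMMER's own set may differ.
RF_GAP_CHARS = set("-._~ ")


def match_columns_from_rf(rf_line):
    """Return the 0-based indices of columns marked as consensus (match).

    rf_line is the annotation text only, i.e. what follows '#=RF' or
    '#=GC RF' on the reference line, aligned to the sequence columns.
    """
    return [i for i, ch in enumerate(rf_line) if ch not in RF_GAP_CHARS]


if __name__ == "__main__":
    print(match_columns_from_rf("xx..x-x"))
```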
       --idlevel <x>
              Controls both the determination of effective sequence number
              and the behavior of the --wblosum weighting option. The
              sequence alignment is clustered by percent identity, and the
              number of clusters at a cutoff threshold of <x> is used to
              determine the effective sequence number. Higher values of <x>
              give more clusters and higher effective sequence numbers; lower
              values of <x> give fewer clusters and lower effective sequence
              numbers. <x> is a fraction from 0 to 1, and by default is set
              to 0.62 (corresponding to the clustering level used in
              constructing the BLOSUM62 substitution matrix).


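              The clustering idea can be sketched as follows. This is an
              illustration under stated assumptions, not HMMER's
              implementation: it assumes single-linkage clustering, an
              identity measure computed over columns where neither sequence
              has a gap, and a "greater than or equal" cutoff. All function
              names are invented for the example.

```python
def pairwise_identity(a, b, gaps="-."):
    """Fraction of identical residues over columns ungapped in both rows."""
    aligned = [(x, y) for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not aligned:
        return 0.0
    return sum(x == y for x, y in aligned) / len(aligned)


def cluster_by_identity(seqs, idlevel=0.62):
    """Single-linkage clusters of sequence indices at the given cutoff."""
    n = len(seqs)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(seqs[i], seqs[j]) >= idlevel:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def effective_sequence_number(seqs, idlevel=0.62):
    """Number of clusters at the cutoff, as described for --idlevel."""
    return len(cluster_by_identity(seqs, idlevel))


if __name__ == "__main__":
    toy = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHLLV", "WWWWWWWWWW"]
    print(effective_sequence_number(toy, 0.62))
    print(effective_sequence_number(toy, 0.85))
```

              Raising the cutoff splits the toy alignment into more
              clusters, which matches the behavior described above.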
       --informat <s>
              Assert that the input alignfile is in format <s>; do not run
              Babelfish format autodetection. This increases the reliability
              of the program somewhat, because the Babelfish can make
              mistakes; particularly recommended for unattended,
              high-throughput runs of HMMER. Valid format strings include
              FASTA, GENBANK, EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF, CLUSTAL,
              and PHYLIP. See the User's Guide for a complete list.


       --noeff
              Turn off the effective sequence number calculation, and use the
              true number of sequences instead. This will usually reduce the
              sensitivity of the final model (so don't do it without good
              reason!)


       --nucleic
              Force the alignment to be interpreted as nucleic acid sequence,
              either RNA or DNA. Normally HMMER autodetects whether the
              alignment is protein or DNA, but sometimes alignments are so
              small that autodetection is ambiguous. See --amino.


       --null <f>
              Read a null model from <f>. The default for protein is to use
              average amino acid frequencies from Swissprot 34 and p1 =
              350/351; for nucleic acid, the default is to use 0.25 for each
              base and p1 = 1000/1001. For documentation of the format of the
              null model file and further explanation of how the null model
              is used, see the User's Guide.


       --pam <f>
              Apply a heuristic PAM- (substitution matrix-) based prior on
              match emission probabilities instead of the default mixture
              Dirichlet. The substitution matrix is read from <f>. See
              --pamwgt.

              The default Dirichlet state transition prior and insert
              emission prior are unaffected. Therefore in principle you could
              combine --prior with --pam, but this isn't recommended, as it
              hasn't been tested. (--pam itself hasn't been tested much!)


       --pamwgt <x>
              Controls the weight on a PAM-based prior. Only has effect if
              the --pam option is also in use. <x> is a positive real number,
              20.0 by default. <x> is the number of "pseudocounts"
              contributed by the heuristic prior. Very high values of <x> can
              force a scoring system that is entirely driven by the
              substitution matrix, making HMMER somewhat approximate Gribskov
              profiles.


       --pbswitch <n>
              For alignments with a very large number of sequences, the GSC,
              BLOSUM, and Voronoi weighting schemes are slow; they're O(N^2)
              for N sequences. Henikoff position-based weights (PB weights)
              are more efficient. At or above a certain threshold sequence
              number <n>, hmmbuild will switch from GSC, BLOSUM, or Voronoi
              weights to PB weights. To disable this switching behavior (at
              the cost of compute time), set <n> to be something larger than
              the number of sequences in your alignment. <n> is a positive
              integer; the default is 1000.


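              The switching rule amounts to a single comparison. The Python
              sketch below restates it; the scheme labels and the function
              name are hypothetical and do not correspond to anything in
              HMMER's source.

```python
# Hypothetical labels for the schemes this page describes as O(N^2).
SLOW_SCHEMES = {"gsc", "blosum", "voronoi"}


def choose_weighting(requested, nseq, pbswitch=1000):
    """Return the weighting scheme that would be used for nseq sequences.

    At or above pbswitch sequences, the slow schemes are replaced by
    position-based ("pb") weights; other schemes are left alone.
    """
    if pbswitch < 1:
        raise ValueError("pbswitch must be a positive integer")
    if requested in SLOW_SCHEMES and nseq >= pbswitch:
        return "pb"
    return requested


if __name__ == "__main__":
    print(choose_weighting("gsc", 999))
    print(choose_weighting("gsc", 1000))
    # A threshold above the alignment size disables the switch.
    print(choose_weighting("voronoi", 5000, pbswitch=10000))
```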
       --prior <f>
              Read a Dirichlet prior from <f>, replacing the default mixture
              Dirichlet. The format of prior files is documented in the
              User's Guide, and an example is given in the Demos directory of
              the HMMER distribution.


       --swentry <x>
              Controls the total probability that is distributed to local
              entries into the model, versus starting at the beginning of the
              model as in a global alignment. <x> is a probability from 0 to
              1, and by default is set to 0.5. Higher values of <x> mean that
              hits that are fragments on their left (N or 5'-terminal) side
              will be penalized less, but complete global alignments will be
              penalized more. Lower values of <x> mean that fragments on the
              left will be penalized more, and global alignments on this side
              will be favored. This option only affects the configurations
              that allow local alignments, i.e. -s and -f; unless one of
              these options is also activated, this option has no effect. You
              have independent control over local/global alignment behavior
              for the N/C (5'/3') termini of your target sequences using
              --swentry and --swexit.


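              To picture what "total probability distributed to local
              entries" means, consider the Python sketch below. It assumes
              the probability <x> is spread uniformly over the internal
              match positions, which this page does not state, and it
              ignores every other transition in the model. The function name
              is invented for the example.

```python
def local_entry_probabilities(model_length, swentry=0.5):
    """Toy begin distribution over match positions 1..model_length.

    Position 1 (a global-style start) keeps 1 - swentry; the remaining
    swentry is split evenly over positions 2..model_length. The even
    split is an assumption made for illustration.
    """
    if model_length < 2:
        raise ValueError("need at least two match positions")
    if not 0.0 <= swentry <= 1.0:
        raise ValueError("swentry must be a probability between 0 and 1")
    internal = swentry / (model_length - 1)
    return [1.0 - swentry] + [internal] * (model_length - 1)


if __name__ == "__main__":
    for x in (0.1, 0.5, 0.9):
        print(x, local_entry_probabilities(5, x))
```

              With higher <x>, starting in the middle of the model gets
              cheaper and starting at position 1 gets more expensive, which
              is the trade-off described above.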
       --swexit <x>
              Controls the total probability that is distributed to local
              exits from the model, versus ending an alignment at the end of
              the model as in a global alignment. <x> is a probability from 0
              to 1, and by default is set to 0.5. Higher values of <x> mean
              that hits that are fragments on their right (C or 3'-terminal)
              side will be penalized less, but complete global alignments
              will be penalized more. Lower values of <x> mean that fragments
              on the right will be penalized more, and global alignments on
              this side will be favored. This option only affects the
              configurations that allow local alignments, i.e. -s and -f;
              unless one of these options is also activated, this option has
              no effect. You have independent control over local/global
              alignment behavior for the N/C (5'/3') termini of your target
              sequences using --swentry and --swexit.


       --verbose
              Print more possibly useful stuff, such as the individual scores
              for each sequence in the alignment.


       --wblosum
              Use the BLOSUM filtering algorithm to weight the sequences,
              instead of the default. Cluster the sequences at a given
              percentage identity (see --idlevel); assign each cluster a
              total weight of 1.0, distributed equally amongst the members of
              that cluster.


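              Given a set of clusters, the weighting step is simple
              arithmetic. The Python sketch below shows it for clusters
              supplied as lists of sequence indices; how the clusters are
              formed is left out, and the function name is made up for the
              example.

```python
def blosum_weights(clusters, nseq):
    """Give each cluster a total weight of 1.0, shared equally by members.

    clusters is a list of lists of 0-based sequence indices that together
    cover every index in range(nseq) exactly once.
    """
    weights = [None] * nseq
    for members in clusters:
        share = 1.0 / len(members)
        for i in members:
            if weights[i] is not None:
                raise ValueError(f"sequence {i} is in more than one cluster")
            weights[i] = share
    if any(w is None for w in weights):
        raise ValueError("every sequence must belong to a cluster")
    return weights


if __name__ == "__main__":
    # Three similar sequences share one unit of weight; the outlier keeps 1.0.
    print(blosum_weights([[0, 1, 2], [3]], 4))
```

              The weights always sum to the number of clusters, so a large
              family of near-identical sequences counts no more than a
              single divergent one.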
       --wgsc Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting
              algorithm. This is already the default, so this option has no
              effect (unless it follows another option in the --w family, in
              which case it overrides it).


       --wme  Use the Krogh/Mitchison maximum entropy algorithm to "weight"
              the sequences. This supersedes the Eddy/Mitchison/Durbin
              maximum discrimination algorithm, which gives almost identical
              weights but is less robust. ME weighting seems to give a
              marginal increase in sensitivity over the default GSC weights,
              but takes a fair amount of time.


       --wnone
              Turn off all sequence weighting.


       --wpb  Use the Henikoff position-based weighting scheme.


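              For readers unfamiliar with position-based weights, here is a
              sketch of the usual Henikoff formulation in Python: in each
              column, a sequence earns 1/(k*n), where k is the number of
              distinct residues in the column and n is the number of
              sequences sharing its residue. Skipping gaps and rescaling the
              weights to sum to the number of sequences are assumptions of
              this sketch, and the function name is invented; HMMER's own
              handling may differ.

```python
from collections import Counter


def position_based_weights(alignment, gaps="-."):
    """Henikoff-style position-based weights for an aligned sequence list."""
    if not alignment:
        raise ValueError("alignment is empty")
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    nseq = len(alignment)
    weights = [0.0] * nseq
    for col in range(ncols):
        # Count each residue type in this column, ignoring gap symbols.
        counts = Counter(row[col] for row in alignment if row[col] not in gaps)
        k = len(counts)
        for i, row in enumerate(alignment):
            residue = row[col]
            if residue in counts:
                weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    if total == 0.0:
        raise ValueError("alignment contains only gap symbols")
    # Assumed normalization: rescale so the weights sum to nseq.
    return [w * nseq / total for w in weights]


if __name__ == "__main__":
    # The third sequence is the odd one out in column 1, so it weighs more.
    print(position_based_weights(["AA", "AA", "CA"]))
```

              Each column is visited once per sequence, so the cost grows
              linearly with the number of sequences rather than
              quadratically.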
       --wvoronoi
              Use the Sibbald/Argos Voronoi sequence weighting algorithm in
              place of the default GSC weighting.

COPYRIGHT

       Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
       Freely distributed under the GNU General Public License (GPL).
       See the file COPYING in your distribution for details on redistribution
       conditions.

AUTHOR

       Sean Eddy
       HHMI/Dept. of Genetics
       Washington Univ. School of Medicine
       4566 Scott Ave.
       St Louis, MO 63110 USA
       http://www.genetics.wustl.edu/eddy/



HMMER 2.3.2                        Oct 2003                        hmmbuild(1)