esl-alimask(1)                    Easel Manual                    esl-alimask(1)

NAME
       esl-alimask - remove columns from a multiple sequence alignment

SYNOPSIS
       esl-alimask [options] msafile maskfile
              (remove columns based on a mask in an input file)

       esl-alimask -t [options] msafile coords
              (remove a contiguous set of columns at the start and end of an
              alignment)

       esl-alimask -g [options] msafile
              (remove columns based on their frequency of gaps)

       esl-alimask -p [options] msafile
              (remove columns based on their posterior probability annotation)

       esl-alimask --rf-is-mask [options] msafile
              (only remove columns that are gaps in the RF annotation)

       The -g and -p options may be used in combination.

DESCRIPTION
       esl-alimask reads a single input alignment, removes some columns from
       it (i.e. masks it), and outputs the masked alignment.

       esl-alimask can be run in several different modes.

       esl-alimask runs in "mask file mode" by default when two command-line
       arguments (msafile and maskfile) are supplied. In this mode, a
       bit-vector mask in the maskfile defines which columns to keep/remove.
       The mask is a string that may only contain the characters '0' and '1'.
       A '0' at position x of the mask indicates that column x is excluded by
       the mask and should be removed during masking. A '1' at position x of
       the mask indicates that column x is included by the mask and should not
       be removed during masking. All lines in the maskfile that begin with
       '#' are considered comment lines and are ignored. All non-whitespace
       characters in non-comment lines are considered to be part of the mask.
       The length of the mask must equal either the total number of columns in
       the (first) alignment in msafile, or the number of columns that are not
       gaps in the RF annotation of that alignment. The latter case is only
       valid if msafile is in Stockholm format and contains '#=GC RF'
       annotation. If the mask length is equal to the non-gap RF length, all
       gap RF columns will automatically be removed.

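       The mask file rules above can be sketched in Python. This is an
       illustration only, not Easel's implementation: the function names
       read_mask and apply_mask are hypothetical, the alignment is modeled as
       a plain list of equal-length strings, and only the case where the mask
       length equals the full alignment length is handled.

```python
def read_mask(text):
    """Collect the mask from maskfile text, skipping '#' comment lines."""
    chars = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # comment line, ignored
        chars.extend(line.split())  # keep only non-whitespace characters
    mask = "".join(chars)
    if set(mask) - {"0", "1"}:
        raise ValueError("mask may only contain '0' and '1'")
    return mask


def apply_mask(rows, mask):
    """Keep only the columns whose mask character is '1'."""
    for row in rows:
        if len(row) != len(mask):
            raise ValueError("mask length must equal the alignment length")
    keep = [i for i, c in enumerate(mask) if c == "1"]
    return ["".join(row[i] for i in keep) for row in rows]


# Hypothetical maskfile contents: one comment line, mask split over two lines.
mask = read_mask("# example mask\n11 01\n10\n")
print(mask)                                    # 110110
print(apply_mask(["ACGU-A", "AC-UGA"], mask))  # ['ACU-', 'ACUG']
```

       Columns 3 and 6 carry a '0' in the mask, so they are the ones dropped
       from both sequences.
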
       esl-alimask runs in "truncation mode" if the -t option is used along
       with two command line arguments (msafile and coords). In this mode, the
       alignment will be truncated by removing a contiguous set of columns
       from the beginning and end of the alignment. The second command line
       argument is the coords string, which specifies the range of columns to
       keep in the alignment; all columns outside of this range will be
       removed. The coords string consists of start and end coordinates
       separated by any nonnumeric, nonwhitespace character or characters you
       like; for example, 23..100, 23/100, or 23-100 all work. To keep all
       alignment columns from column 23 to the end of the alignment, you can
       omit the end; for example, 23: would work. If the --t-rf option is
       used in combination with -t, the coordinates in coords are interpreted
       as non-gap RF column coordinates. For example, with --t-rf, a coords
       string of 23-100 would remove all columns before the 23rd non-gap
       residue in the "#=GC RF" annotation and after the 100th non-gap RF
       residue.

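       One way to read the coords rules is the following Python sketch. It is
       not taken from the Easel source; parse_coords and truncate are
       hypothetical names, coordinates are assumed to be 1-based and
       inclusive, and the --t-rf interpretation is not modeled.

```python
import re


def parse_coords(coords, alen):
    """Return (start, end), 1-based inclusive, for an alignment of length alen."""
    # start, then one or more non-digit non-whitespace separators, then optional end
    m = re.fullmatch(r"(\d+)[^\d\s]+(\d*)", coords)
    if m is None:
        raise ValueError(f"cannot parse coords string: {coords!r}")
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else alen  # omitted end: keep to last column
    if not 1 <= start <= end <= alen:
        raise ValueError("coordinates out of range")
    return start, end


def truncate(rows, coords):
    """Keep only columns start..end of every aligned sequence."""
    start, end = parse_coords(coords, len(rows[0]))
    return [row[start - 1:end] for row in rows]


print(parse_coords("23..100", 120))  # (23, 100)
print(parse_coords("23:", 120))      # (23, 120)
print(truncate(["ABCDEFGH", "abcdefgh"], "3-5"))  # ['CDE', 'cde']
```
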
       esl-alimask runs in "RF mask" mode if the --rf-is-mask option is
       enabled. In this mode, the alignment must be in Stockholm format and
       contain '#=GC RF' annotation. esl-alimask will simply remove all columns
       that are gaps in the RF annotation.

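       The relationship between RF annotation and masks can be sketched as
       follows. This is an assumption-laden illustration rather than Easel
       code: the helper names are hypothetical, and the set of characters
       treated as RF gaps ('.', '-', '_') is assumed to match the gap
       characters this page lists for sequences.

```python
RF_GAP_CHARS = set(".-_")  # assumed gap characters for the RF line


def rf_is_mask(rf):
    """Full-length mask that keeps exactly the non-gap RF columns."""
    return "".join("0" if c in RF_GAP_CHARS else "1" for c in rf)


def expand_rf_mask(rf_mask, rf):
    """Expand a non-gap-RF-length mask to full length; gap RF columns get '0'."""
    nongap = sum(c not in RF_GAP_CHARS for c in rf)
    if len(rf_mask) != nongap:
        raise ValueError("mask length must equal the non-gap RF length")
    bits = iter(rf_mask)
    return "".join("0" if c in RF_GAP_CHARS else next(bits) for c in rf)


rf = "xx.xx..x"
print(rf_is_mask(rf))               # 11011001
print(expand_rf_mask("10110", rf))  # 10011000
```

       The second function mirrors the mask file case in which the mask
       length equals the non-gap RF length and all gap RF columns are removed
       automatically.
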
       esl-alimask runs in "gap frequency mode" if -g is enabled. In this
       mode, columns in which more than a fraction <x> of the aligned
       sequences have gaps will be removed. By default, <x> is 0.5, but this
       value can be changed with the --gapthresh <x> option. In this mode, if
       the alignment is in Stockholm format and has RF annotation, then all
       columns that are gaps in the RF annotation will automatically be
       removed, unless --keepins is enabled.

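       As a rough sketch of the gap frequency rule, assuming the gap
       characters '.', '-', and '_' listed under --gapthresh and ignoring the
       automatic removal of gap RF columns, a mask could be computed like
       this. The function name gap_mask is hypothetical and this is not the
       Easel implementation.

```python
GAP_CHARS = set(".-_")


def gap_mask(rows, gapthresh=0.5):
    """Return a '0'/'1' mask: '1' if the column's gap fraction is <= gapthresh."""
    nseq = len(rows)
    bits = []
    for col in zip(*rows):  # iterate over alignment columns
        gap_fraction = sum(c in GAP_CHARS for c in col) / nseq
        bits.append("1" if gap_fraction <= gapthresh else "0")
    return "".join(bits)


rows = ["AC-G", "A--G", "A.-G", "ACUG"]
print(gap_mask(rows))       # 1101
print(gap_mask(rows, 0.4))  # 1001
```

       Column 2 has a gap fraction of exactly 0.5, so it is kept at the
       default threshold and removed once the threshold drops below 0.5.
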
       esl-alimask runs in "posterior probability mode" if -p is enabled. In
       this mode, masking is based on posterior probability annotation, and
       the input alignment must be in Stockholm format and contain '#=GR PP'
       (posterior probability) annotation for all sequences. As a special
       case, if -p is used in combination with --ppcons, then the input
       alignment need not have '#=GR PP' annotation, but must contain '#=GC
       PP_cons' (posterior probability consensus) annotation.

       Characters in Stockholm alignment posterior probability annotation
       (both '#=GR PP' and '#=GC PP_cons') can have 12 possible values: the
       ten digits '0-9', '*', and '.'. If '.', the position must correspond to
       a gap in the sequence (for '#=GR PP') or in the RF annotation (for
       '#=GC PP_cons'). A value of '0' indicates a posterior probability of
       between 0.0 and 0.05, '1' indicates between 0.05 and 0.15, '2'
       indicates between 0.15 and 0.25 and so on up to '9' which indicates
       between 0.85 and 0.95. A value of '*' indicates a posterior probability
       of between 0.95 and 1.0. Higher posterior probabilities correspond to
       greater confidence that the aligned residue belongs where it appears in
       the alignment.

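       The character-to-range mapping just described can be written out as a
       small Python helper. The name pp_range is hypothetical, and returning
       None for the gap character '.' is a choice made for this sketch.

```python
def pp_range(ch):
    """Return the (low, high) posterior probability range for one PP character."""
    if ch == ".":
        return None  # gap position: no posterior probability
    if ch == "*":
        return (0.95, 1.0)
    if ch == "0":
        return (0.0, 0.05)
    if len(ch) == 1 and ch in "123456789":
        d = int(ch)
        # digit d covers d/10 - 0.05 up to d/10 + 0.05
        return (round(d / 10 - 0.05, 2), round(d / 10 + 0.05, 2))
    raise ValueError(f"not a valid PP character: {ch!r}")


for ch in "0129*.":
    print(ch, pp_range(ch))
```
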
       When -p is enabled with --ppcons <x>, columns which have a consensus
       posterior probability of less than <x> will be removed during masking,
       and all other columns will not be removed.

       When -p is enabled without --ppcons, the number of each possible PP
       value in each column is counted. If <x> fraction of the sequences that
       contain aligned residues (i.e. do not contain gaps) in a column have a
       posterior probability greater than or equal to <y>, then that column
       will not be removed during masking. All columns that do not meet this
       criterion will be removed. By default, the values of both <x> and <y>
       are 0.95, but they can be changed with the --pfract <x> and --pthresh
       <y> options, respectively.

       In posterior probability mode, all columns that have 0 residues (i.e.
       that are 100% gaps) will be automatically removed, unless the
       --pallgapok option is enabled, in which case such columns will not be
       removed.

       Importantly, during posterior probability masking, unless --pavg is
       used, PP annotation values are always considered to be the minimum
       numerical value in their corresponding range. For example, a PP '9'
       character is converted to a numerical posterior probability of 0.85. If
       --pavg is used, PP annotation values are considered to be the average
       numerical value in their range. For example, a PP '9' character is
       converted to a numerical posterior probability of 0.90.

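       The per-column decisions described above can be sketched in Python.
       This is an interpretation of the text, not Easel's code: pp_value,
       pp_mask, and pavg_mask are hypothetical names, the input is a list of
       '#=GR PP' strings (one per sequence), and pavg_mask follows the --pavg
       description in the option list, treating all-gap columns as removed.

```python
def pp_value(ch, use_avg=False):
    """Numeric PP for one character: range minimum, or range average if use_avg."""
    if ch == "*":
        low, high = 0.95, 1.0
    elif ch == "0":
        low, high = 0.0, 0.05
    else:
        d = int(ch)
        low, high = d / 10 - 0.05, d / 10 + 0.05
    return round((low + high) / 2 if use_avg else low, 3)


def pp_mask(pp_rows, pfract=0.95, pthresh=0.95, allgapok=False):
    """Keep a column if >= pfract of its non-gap residues have PP >= pthresh."""
    bits = []
    for col in zip(*pp_rows):
        values = [pp_value(c) for c in col if c != "."]  # '.' marks a gap
        if not values:  # column is 100% gaps
            bits.append("1" if allgapok else "0")
            continue
        frac = sum(v >= pthresh for v in values) / len(values)
        bits.append("1" if frac >= pfract else "0")
    return "".join(bits)


def pavg_mask(pp_rows, x):
    """Keep a column if the average PP of its non-gap residues is >= x."""
    bits = []
    for col in zip(*pp_rows):
        values = [pp_value(c, use_avg=True) for c in col if c != "."]
        keep = bool(values) and sum(values) / len(values) >= x
        bits.append("1" if keep else "0")
    return "".join(bits)


pp_rows = ["**9.", "*98.", "*.*."]
print(pp_mask(pp_rows))                            # 1000
print(pp_mask(pp_rows, pfract=0.5, pthresh=0.85))  # 1110
print(pp_mask(pp_rows, allgapok=True))             # 1001
print(pavg_mask(pp_rows, 0.9))                     # 1100
```

       With the default thresholds only the first column survives, because a
       '9' converts to 0.85 under the minimum-value rule and so falls below
       the 0.95 threshold.
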
       In posterior probability mode, if the alignment is in Stockholm format
       and has RF annotation, then all columns that are gaps in the RF
       annotation will automatically be removed, unless --keepins is enabled.

       A single run of esl-alimask can perform both gap frequency-based
       masking and posterior probability-based masking if both the -g and -p
       options are enabled. In this case, a gap frequency-based mask and a
       posterior probability-based mask are independently computed. These two
       masks are combined to create the final mask using a logical 'and'
       operation. Any column that is to be removed by either the gap or PP
       mask will be removed by the final mask.

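       The logical 'and' combination can be illustrated with a short Python
       sketch. The function name combine_masks is hypothetical; the masks are
       the '0'/'1' strings described on this page.

```python
def combine_masks(gap_mask, pp_mask):
    """Column is kept ('1') only if both masks keep it."""
    if len(gap_mask) != len(pp_mask):
        raise ValueError("masks must have the same length")
    return "".join(
        "1" if g == "1" and p == "1" else "0"
        for g, p in zip(gap_mask, pp_mask)
    )


print(combine_masks("1101", "1011"))  # 1001
```
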
       With the --small option, esl-alimask will operate in memory saving mode
       and the required RAM for the masking will be minimal (usually less than
       a megabyte) and independent of the alignment size. To use --small, the
       alignment alphabet must be specified with one of --amino, --dna, or
       --rna, and the alignment must be in Pfam format (non-interleaved, 1
       line/sequence Stockholm format). Pfam format is the default output
       format of INFERNAL's cmalign program. Without --small the required RAM
       will be equal to roughly the size of the first input alignment (the
       size of the alignment file itself if it only contains one alignment).

OUTPUT
       By default, esl-alimask will print only the masked alignment to stdout
       and then exit. If the -o <f> option is used, the alignment will be
       saved to file <f>, and information on the number of columns kept and
       removed will be printed to stdout. If -q is used in combination with
       -o, nothing is printed to stdout.

       The mask(s) computed by esl-alimask when the -t, -p, -g, or
       --rf-is-mask options are used can be saved to output files using the
       options --fmask-rf <f>, --fmask-all <f>, --gmask-rf <f>, --gmask-all
       <f>, --pmask-rf <f>, and --pmask-all <f>. In all cases, <f> will
       contain a single line, a bit vector of length <n>, where <n> is either
       the total number of columns in the alignment (for the options suffixed
       with 'all') or the number of non-gap columns in the RF annotation (for
       the options suffixed with 'rf'). The mask will be a string of '0' and
       '1' characters: a '0' at position x in the mask indicates column x was
       removed (excluded) by the mask, and a '1' at position x indicates
       column x was kept (included) by the mask. For the 'rf' suffixed
       options, the mask only applies to non-gap RF columns. The options
       beginning with 'f' will save the 'final' mask used to keep/remove
       columns from the alignment. The options beginning with 'g' save the
       masks based on gap frequency and require -g. The options beginning with
       'p' save the masks based on posterior probabilities and require -p.

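       The difference between the 'all' and 'rf' mask lengths can be sketched
       as a projection of a full-length mask onto the non-gap RF columns.
       This Python fragment is illustrative only: mask_all_to_rf is a
       hypothetical name, and the RF gap characters ('.', '-', '_') are an
       assumption.

```python
RF_GAP_CHARS = set(".-_")  # assumed gap characters for the RF line


def mask_all_to_rf(mask_all, rf):
    """Project a full-length mask onto the non-gap RF columns only."""
    if len(mask_all) != len(rf):
        raise ValueError("mask and RF annotation must have the same length")
    return "".join(m for m, r in zip(mask_all, rf) if r not in RF_GAP_CHARS)


rf = "xx.xx..x"
mask_all = "11010001"                # length 8: one bit per alignment column
print(mask_all_to_rf(mask_all, rf))  # 11101 (length 5: one bit per non-gap RF column)
```
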
OPTIONS
       -h     Print brief help; includes version number and summary of all
              options, including expert options.

       -o <f> Output the final, masked alignment to file <f> instead of to
              stdout. When this option is used, information about the number
              of columns kept/removed is printed to stdout.

       -q     Be quiet; do not print anything to stdout. This option can only
              be used in combination with the -o option.

       --small
              Operate in memory saving mode. Required RAM will be independent
              of the size of the input alignment to mask, instead of roughly
              the size of the input alignment. When enabled, the alignment
              must be in Pfam Stockholm (non-interleaved 1 line/seq) format
              (see esl-reformat) and the output alignment will be in Pfam
              format.

       --informat <s>
              Assert that input msafile is in alignment format <s>. Common
              choices for <s> include: stockholm, a2m, afa, psiblast, clustal,
              phylip. For more information, and for codes for some less
              common formats, see main documentation. The string <s> is
              case-insensitive (a2m or A2M both work). Default is stockholm
              format, unless --small is used, in which case pfam format
              (non-interleaved Stockholm) is assumed.

       --outformat <s>
              Write the output msafile in alignment format <s>. Common
              choices for <s> include: stockholm, a2m, afa, psiblast, clustal,
              phylip. The string <s> is case-insensitive (a2m or A2M both
              work). Default is stockholm, unless --small is enabled, in
              which case pfam (non-interleaved Stockholm) is the default
              output format.

       --fmask-rf <f>
              Save the non-gap RF-length final mask used to mask the alignment
              to file <f>. The input alignment must be in Stockholm format
              and contain '#=GC RF' annotation for this option to be valid.
              See the OUTPUT section above for more details on output mask
              files.

       --fmask-all <f>
              Save the full alignment-length final mask used to mask the
              alignment to file <f>. See the OUTPUT section above for more
              details on output mask files.

       --amino
              Specify that the input alignment is a protein alignment. By
              default, esl-alimask will try to autodetect the alphabet, but if
              the alignment is sufficiently small it may be ambiguous. This
              option defines the alphabet as protein. Importantly, if --small
              is enabled, the alphabet must be specified with one of --amino,
              --dna, or --rna.

       --dna  Specify that the input alignment is a DNA alignment.

       --rna  Specify that the input alignment is an RNA alignment.

       --t-rf With -t, specify that the start and end coordinates defined in
              the second command line argument coords correspond to non-gap RF
              coordinates. To use this option, the alignment must be in
              Stockholm format and have "#=GC RF" annotation. See the
              DESCRIPTION section for an example of using the --t-rf option.

       --t-rmins
              With -t, specify that all columns that are gaps in the reference
              (RF) annotation in between the specified start and end
              coordinates be removed. By default, these columns will be kept.
              To use this option, the alignment must be in Stockholm format
              and have "#=GC RF" annotation.

       --gapthresh <x>
              With -g, specify that a column is kept (included by mask) if no
              more than <x> fraction of sequences in the alignment have a gap
              ('.', '-', or '_') at that position. All other columns are
              removed (excluded by mask). By default, <x> is 0.5.

       --gmask-rf <f>
              Save the non-gap RF-length gap frequency-based mask used to mask
              the alignment to file <f>. The input alignment must be in
              Stockholm format and contain '#=GC RF' annotation for this
              option to be valid. See the OUTPUT section above for more
              details on output mask files.

       --gmask-all <f>
              Save the full alignment-length gap frequency-based mask used to
              mask the alignment to file <f>. See the OUTPUT section above
              for more details on output mask files.

       --pfract <x>
              With -p, specify that a column is kept (included by mask) if the
              fraction of sequences with a non-gap residue in that column with
              a posterior probability of at least <y> (from --pthresh <y>) is
              <x> or greater. All other columns are removed (excluded by
              mask). By default <x> is 0.95.

       --pthresh <y>
              With -p, specify that a column is kept (included by mask) if <x>
              (from --pfract <x>) fraction of sequences with a non-gap residue
              in that column have a posterior probability of at least <y>.
              All other columns are removed (excluded by mask). By default
              <y> is 0.95. See the DESCRIPTION section for more on posterior
              probability (PP) masking. Due to the granularity of the PP
              annotation, different <y> values within a range covered by a
              single PP character will have the same effect on masking. For
              example, using --pthresh 0.86 will have the same effect as using
              --pthresh 0.94.

       --pavg <x>
              With -p, specify that a column is kept (included by mask) if the
              average posterior probability of non-gap residues in that column
              is at least <x>. See the DESCRIPTION section for more on
              posterior probability (PP) masking.

       --ppcons <x>
              With -p, use the '#=GC PP_cons' annotation to define which
              columns to keep/remove. A column is kept (included by mask) if
              the PP_cons value for that column is <x> or greater. Otherwise
              it is removed.

       --pallgapok
              With -p, do not automatically remove any columns that are 100%
              gaps (i.e. contain 0 aligned residues). By default, such columns
              will be removed.

       --pmask-rf <f>
              Save the non-gap RF-length posterior probability-based mask used
              to mask the alignment to file <f>. The input alignment must be
              in Stockholm format and contain '#=GC RF' annotation for this
              option to be valid. See the OUTPUT section above for more
              details on output mask files.

       --pmask-all <f>
              Save the full alignment-length posterior probability-based mask
              used to mask the alignment to file <f>. See the OUTPUT section
              above for more details on output mask files.

       --keepins
              If -p and/or -g is enabled and the alignment is in Stockholm or
              Pfam format and has '#=GC RF' annotation, then allow columns
              that are gaps in the RF annotation to possibly be kept. By
              default, all gap RF columns would be removed automatically, but
              with this option enabled gap and non-gap RF columns are treated
              identically. To automatically remove all gap RF columns when
              using a maskfile, define the mask in maskfile as having length
              equal to the non-gap RF length in the alignment. To
              automatically remove all gap RF columns when using -t, use the
              --t-rmins option.

SEE ALSO
       http://bioeasel.org/

COPYRIGHT
       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

AUTHOR
       http://eddylab.org

Easel 0.48                          Nov 2020                      esl-alimask(1)