esl-alimask(1)                   Easel Manual                   esl-alimask(1)

NAME

       esl-alimask - remove columns from a multiple sequence alignment

SYNOPSIS

       esl-alimask [options] msafile maskfile
         (remove columns based on a mask in an input file)

       esl-alimask -t [options] msafile coords
         (remove a contiguous set of columns at the start and end of an alignment)

       esl-alimask -g [options] msafile
         (remove columns based on their frequency of gaps)

       esl-alimask -p [options] msafile
         (remove columns based on their posterior probability annotation)

       esl-alimask --rf-is-mask [options] msafile
         (only remove columns that are gaps in the RF annotation)

       The -g and -p options may be used in combination.

DESCRIPTION

       esl-alimask reads a single input alignment, removes some columns from
       it (i.e. masks it), and outputs the masked alignment.

       esl-alimask can be run in several different modes.

       esl-alimask runs in "mask file mode" by default when two command-line
       arguments (msafile and maskfile) are supplied. In this mode, a
       bit-vector mask in the maskfile defines which columns to keep/remove.
       The mask is a string that may only contain the characters '0' and
       '1'. A '0' at position x of the mask indicates that column x is
       excluded by the mask and should be removed during masking. A '1' at
       position x of the mask indicates that column x is included by the
       mask and should not be removed during masking. All lines in the
       maskfile that begin with '#' are considered comment lines and are
       ignored. All non-whitespace characters in non-comment lines are
       considered to be part of the mask. The length of the mask must equal
       either the total number of columns in the (first) alignment in
       msafile, or the number of columns that are not gaps in the RF
       annotation of that alignment. The latter case is only valid if
       msafile is in Stockholm format and contains '#=GC RF' annotation. If
       the mask length is equal to the non-gap RF length, all gap RF columns
       will automatically be removed.

       esl-alimask runs in "truncation mode" if the -t option is used along
       with two command-line arguments (msafile and coords). In this mode,
       the alignment will be truncated by removing a contiguous set of
       columns from the beginning and end of the alignment. The second
       command-line argument is the coords string, which specifies the range
       of columns to keep in the alignment; all columns outside of this
       range will be removed. The coords string consists of start and end
       coordinates separated by any nonnumeric, nonwhitespace character or
       characters you like; for example, 23..100, 23/100, or 23-100 all
       work. To keep all alignment columns beginning at 23 until the end of
       the alignment, you can omit the end; for example, 23: would work. If
       the --t-rf option is used in combination with -t, the coordinates in
       coords are interpreted as non-gap RF column coordinates. For example,
       with --t-rf, a coords string of 23-100 would remove all columns
       before the 23rd non-gap residue in the "#=GC RF" annotation and after
       the 100th non-gap RF residue.
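
       As a rough illustration of the coords syntax, the following Python
       sketch parses a coords string and truncates one aligned sequence.
       The names parse_coords and truncate are hypothetical, and the
       sketch assumes 1-based, inclusive column coordinates; it does not
       model --t-rf.

```python
import re


def parse_coords(coords, alen):
    """Return (start, end) from a coords string such as '23..100' or '23:'."""
    # start, then one or more nonnumeric, nonwhitespace separators, then optional end
    m = re.fullmatch(r"(\d+)[^\d\s]+(\d*)", coords)
    if m is None:
        raise ValueError("could not parse coords string: %r" % coords)
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else alen  # omitted end: keep to the last column
    if not 1 <= start <= end <= alen:
        raise ValueError("coordinates out of range")
    return start, end


def truncate(aligned_seq, coords):
    """Keep only columns start..end (assumed 1-based, inclusive)."""
    start, end = parse_coords(coords, len(aligned_seq))
    return aligned_seq[start - 1:end]


kept = truncate("ABCDEFGHIJ", "3-5")
```

       With this sketch, "3-5" keeps columns 3 through 5 of a ten-column
       sequence, and "3:" would keep columns 3 through 10.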

       esl-alimask runs in "RF mask" mode if the --rf-is-mask option is
       enabled. In this mode, the alignment must be in Stockholm format and
       contain '#=GC RF' annotation. esl-alimask will simply remove all
       columns that are gaps in the RF annotation.

       esl-alimask runs in "gap frequency mode" if -g is enabled. In this
       mode, columns in which more than a fraction <x> of the aligned
       sequences have gap residues will be removed. By default, <x> is 0.5,
       but this value can be changed with the --gapthresh <x> option. In
       this mode, if the alignment is in Stockholm format and has RF
       annotation, then all columns that are gaps in the RF annotation will
       automatically be removed, unless --keepins is enabled.
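
       The gap frequency rule can be sketched in Python as follows. This
       is an illustration only, not Easel's implementation: the function
       name gap_frequency_mask is hypothetical, the gap characters are
       taken from the --gapthresh description in OPTIONS, and the
       automatic removal of gap RF columns is not modeled.

```python
GAP_CHARS = set(".-_")  # gap characters named under --gapthresh


def gap_frequency_mask(seqs, gapthresh=0.5):
    """Return a '0'/'1' mask; '1' keeps a column, '0' removes it."""
    nseq = len(seqs)
    alen = len(seqs[0])
    mask = []
    for col in range(alen):
        ngap = sum(1 for s in seqs if s[col] in GAP_CHARS)
        # a column is removed only if MORE than gapthresh of the sequences are gaps
        mask.append("1" if ngap / nseq <= gapthresh else "0")
    return "".join(mask)


# Hypothetical four-sequence alignment.
seqs = ["AC-G", "A--G", "A--G", "ACU-"]
gmask = gap_frequency_mask(seqs)
```

       In this example the second column is exactly 50% gaps and is kept,
       while the third column is 75% gaps and is removed.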

       esl-alimask runs in "posterior probability mode" if -p is enabled. In
       this mode, masking is based on posterior probability annotation, and
       the input alignment must be in Stockholm format and contain '#=GR PP'
       (posterior probability) annotation for all sequences. As a special
       case, if -p is used in combination with --ppcons, then the input
       alignment need not have '#=GR PP' annotation, but must contain
       '#=GC PP_cons' (posterior probability consensus) annotation.

       Characters in Stockholm alignment posterior probability annotation
       (both '#=GR PP' and '#=GC PP_cons') can have 12 possible values: the
       ten digits '0-9', '*', and '.'. If '.', the position must correspond
       to a gap in the sequence (for '#=GR PP') or in the RF annotation (for
       '#=GC PP_cons'). A value of '0' indicates a posterior probability of
       between 0.0 and 0.05, '1' indicates between 0.05 and 0.15, '2'
       indicates between 0.15 and 0.25 and so on up to '9' which indicates
       between 0.85 and 0.95. A value of '*' indicates a posterior
       probability of between 0.95 and 1.0. Higher posterior probabilities
       correspond to greater confidence that the aligned residue belongs
       where it appears in the alignment.
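
       The character-to-range mapping described above can be written out
       as a small Python table. The names PP_RANGES and pp_range are
       hypothetical and exist only for this illustration.

```python
# Probability range (low, high) for each PP annotation character.
PP_RANGES = {"0": (0.0, 0.05), "*": (0.95, 1.0)}
for d in range(1, 10):
    # digit d covers d/10 - 0.05 up to d/10 + 0.05
    PP_RANGES[str(d)] = (round(d / 10 - 0.05, 2), round(d / 10 + 0.05, 2))


def pp_range(ch):
    """Return the (low, high) range for a PP character, or None for a gap '.'."""
    if ch == ".":
        return None
    return PP_RANGES[ch]
```

       For example, pp_range("9") gives the range 0.85 to 0.95, and
       pp_range(".") gives None because '.' marks a gap rather than a
       probability.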

       When -p is enabled with --ppcons <x>, columns which have a consensus
       posterior probability of less than <x> will be removed during
       masking, and all other columns will not be removed.

       When -p is enabled without --ppcons, the number of each possible PP
       value in each column is counted. If <x> fraction of the sequences
       that contain aligned residues (i.e. do not contain gaps) in a column
       have a posterior probability greater than or equal to <y>, then that
       column will not be removed during masking. All columns that do not
       meet this criterion will be removed. By default, the values of both
       <x> and <y> are 0.95, but they can be changed with the --pfract <x>
       and --pthresh <y> options, respectively.
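
       A minimal Python sketch of the -p criterion without --ppcons
       follows. It is an illustration under stated assumptions, not
       Easel's code: the name pp_column_mask is hypothetical, each PP
       character is converted to the minimum value of its range (the
       default behavior described in this section), columns with no
       residues are removed (the default without --pallgapok), and RF
       annotation is ignored.

```python
# Minimum probability of the range covered by each PP character.
PP_MIN = {"0": 0.0, "1": 0.05, "2": 0.15, "3": 0.25, "4": 0.35, "5": 0.45,
          "6": 0.55, "7": 0.65, "8": 0.75, "9": 0.85, "*": 0.95}


def pp_column_mask(pp_rows, pfract=0.95, pthresh=0.95):
    """Return a '0'/'1' mask from per-sequence PP strings ('.' marks a gap)."""
    alen = len(pp_rows[0])
    mask = []
    for col in range(alen):
        values = [PP_MIN[row[col]] for row in pp_rows if row[col] != "."]
        if not values:
            mask.append("0")  # column is 100% gaps
            continue
        nok = sum(1 for v in values if v >= pthresh)
        # keep the column if enough of its residues reach the PP threshold
        mask.append("1" if nok / len(values) >= pfract else "0")
    return "".join(mask)


# Hypothetical PP annotation for three sequences over four columns.
pp_rows = ["**9.", "**8.", "*.*."]
pmask = pp_column_mask(pp_rows)
```

       With the defaults, only the first two columns are kept: in the
       third column just one of three residues reaches 0.95, and the
       fourth column contains no residues.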

       In posterior probability mode, all columns that have 0 residues (i.e.
       that are 100% gaps) will be automatically removed, unless the
       --pallgapok option is enabled, in which case such columns will not be
       removed.

       Importantly, during posterior probability masking, unless --pavg is
       used, PP annotation values are always considered to be the minimum
       numerical value in their corresponding range. For example, a PP '9'
       character is converted to a numerical posterior probability of 0.85.
       If --pavg is used, PP annotation values are considered to be the
       average numerical value in their range. For example, a PP '9'
       character is converted to a numerical posterior probability of 0.90.

       In posterior probability mode, if the alignment is in Stockholm
       format and has RF annotation, then all columns that are gaps in the
       RF annotation will automatically be removed, unless --keepins is
       enabled.

       A single run of esl-alimask can perform both gap frequency-based
       masking and posterior probability-based masking if both the -g and -p
       options are enabled. In this case, a gap frequency-based mask and a
       posterior probability-based mask are independently computed. These
       two masks are combined to create the final mask using a logical 'and'
       operation. Any column that is to be removed by either the gap or PP
       mask will be removed by the final mask.
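
       The logical 'and' of two masks can be sketched in a few lines of
       Python. The function name combine_masks is hypothetical; both
       inputs are assumed to be '0'/'1' strings of equal length.

```python
def combine_masks(gap_mask, prob_mask):
    """Combine two '0'/'1' masks; a column is kept only if both masks keep it."""
    if len(gap_mask) != len(prob_mask):
        raise ValueError("masks must have the same length")
    return "".join(
        "1" if g == "1" and p == "1" else "0"
        for g, p in zip(gap_mask, prob_mask)
    )


final_mask = combine_masks("1101", "1011")
```

       Here the second column is removed by the PP mask and the third by
       the gap mask, so the final mask removes both.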

       With the --small option, esl-alimask will operate in memory saving
       mode and the required RAM for the masking will be minimal (usually
       less than 1 MB) and independent of the alignment size. To use
       --small, the alignment alphabet must be specified with either
       --amino, --dna, or --rna, and the alignment must be in Pfam format
       (non-interleaved, 1 line/sequence Stockholm format). Pfam format is
       the default output format of INFERNAL's cmalign program. Without
       --small the required RAM will be equal to roughly the size of the
       first input alignment (the size of the alignment file itself if it
       only contains one alignment).

OUTPUT

       By default, esl-alimask will print only the masked alignment to
       stdout and then exit. If the -o <f> option is used, the alignment
       will be saved to file <f>, and information on the number of columns
       kept and removed will be printed to stdout. If -q is used in
       combination with -o, nothing is printed to stdout.

       The mask(s) computed by esl-alimask when the -t, -p, -g, or
       --rf-is-mask options are used can be saved to output files using the
       options --fmask-rf <f>, --fmask-all <f>, --gmask-rf <f>, --gmask-all
       <f>, --pmask-rf <f>, and --pmask-all <f>. In all cases, <f> will
       contain a single line, a bit vector of length <n>, where <n> is
       either the total number of columns in the alignment (for the options
       suffixed with 'all') or the number of non-gap columns in the RF
       annotation (for the options suffixed with 'rf'). The mask will be a
       string of '0' and '1' characters: a '0' at position x in the mask
       indicates column x was removed (excluded) by the mask, and a '1' at
       position x indicates column x was kept (included) by the mask. For
       the 'rf' suffixed options, the mask only applies to non-gap RF
       columns. The options beginning with 'f' will save the 'final' mask
       used to keep/remove columns from the alignment. The options beginning
       with 'g' save the masks based on gap frequency and require -g. The
       options beginning with 'p' save the masks based on posterior
       probabilities and require -p.

OPTIONS

       -h     Print brief help; includes version number and summary of all
              options, including expert options.

       -o <f> Output the final, masked alignment to file <f> instead of to
              stdout. When this option is used, information about the number
              of columns kept/removed is printed to stdout.

       -q     Be quiet; do not print anything to stdout. This option can only
              be used in combination with the -o option.

       --small
              Operate in memory saving mode. Required RAM will be independent
              of the size of the input alignment to mask, instead of roughly
              the size of the input alignment. When enabled, the alignment
              must be in Pfam Stockholm (non-interleaved 1 line/seq) format
              (see esl-reformat) and the output alignment will be in Pfam
              format.

       --informat <s>
              Assert that input msafile is in alignment format <s>. Common
              choices for <s> include: stockholm, a2m, afa, psiblast, clustal,
              phylip. For more information, and for codes for some less
              common formats, see main documentation. The string <s> is
              case-insensitive (a2m or A2M both work). Default is stockholm
              format, unless --small is used, in which case pfam format
              (non-interleaved Stockholm) is assumed.

       --outformat <s>
              Write the output msafile in alignment format <s>. Common
              choices for <s> include: stockholm, a2m, afa, psiblast, clustal,
              phylip. The string <s> is case-insensitive (a2m or A2M both
              work). Default is stockholm, unless --small is enabled, in
              which case pfam (non-interleaved Stockholm) is the default
              output format.

       --fmask-rf <f>
              Save the non-gap RF-length final mask used to mask the alignment
              to file <f>. The input alignment must be in Stockholm format
              and contain '#=GC RF' annotation for this option to be valid.
              See the OUTPUT section above for more details on output mask
              files.

       --fmask-all <f>
              Save the full alignment-length final mask used to mask the
              alignment to file <f>. See the OUTPUT section above for more
              details on output mask files.

       --amino
              Specify that the input alignment is a protein alignment. By
              default, esl-alimask will try to autodetect the alphabet, but if
              the alignment is sufficiently small it may be ambiguous. This
              option defines the alphabet as protein. Importantly, if --small
              is enabled, the alphabet must be specified with either --amino,
              --dna, or --rna.

       --dna  Specify that the input alignment is a DNA alignment.

       --rna  Specify that the input alignment is an RNA alignment.

       --t-rf With -t, specify that the start and end coordinates defined in
              the second command-line argument coords correspond to non-gap RF
              coordinates. To use this option, the alignment must be in
              Stockholm format and have "#=GC RF" annotation. See the
              DESCRIPTION section for an example of using the --t-rf option.

       --t-rmins
              With -t, specify that all columns that are gaps in the reference
              (RF) annotation in between the specified start and end
              coordinates be removed. By default, these columns will be kept.
              To use this option, the alignment must be in Stockholm format
              and have "#=GC RF" annotation.

       --gapthresh <x>
              With -g, specify that a column is kept (included by mask) if no
              more than <x> fraction of sequences in the alignment have a gap
              ('.', '-', or '_') at that position. All other columns are
              removed (excluded by mask). By default, <x> is 0.5.

       --gmask-rf <f>
              Save the non-gap RF-length gap frequency-based mask used to mask
              the alignment to file <f>. The input alignment must be in
              Stockholm format and contain '#=GC RF' annotation for this
              option to be valid. See the OUTPUT section above for more
              details on output mask files.

       --gmask-all <f>
              Save the full alignment-length gap frequency-based mask used to
              mask the alignment to file <f>. See the OUTPUT section above
              for more details on output mask files.

       --pfract <x>
              With -p, specify that a column is kept (included by mask) if the
              fraction of sequences with a non-gap residue in that column with
              a posterior probability of at least <y> (from --pthresh <y>) is
              <x> or greater. All other columns are removed (excluded by
              mask). By default <x> is 0.95.

       --pthresh <y>
              With -p, specify that a column is kept (included by mask) if <x>
              (from --pfract <x>) fraction of sequences with a non-gap residue
              in that column have a posterior probability of at least <y>.
              All other columns are removed (excluded by mask). By default
              <y> is 0.95. See the DESCRIPTION section for more on posterior
              probability (PP) masking. Due to the granularity of the PP
              annotation, different <y> values within a range covered by a
              single PP character will have the same effect on masking. For
              example, using --pthresh 0.86 will have the same effect as using
              --pthresh 0.94.

       --pavg <x>
              With -p, specify that a column is kept (included by mask) if the
              average posterior probability of non-gap residues in that column
              is at least <x>. See the DESCRIPTION section for more on
              posterior probability (PP) masking.

       --ppcons <x>
              With -p, use the '#=GC PP_cons' annotation to define which
              columns to keep/remove. A column is kept (included by mask) if
              the PP_cons value for that column is <x> or greater. Otherwise
              it is removed.

       --pallgapok
              With -p, do not automatically remove any columns that are 100%
              gaps (i.e. contain 0 aligned residues). By default, such columns
              will be removed.

       --pmask-rf <f>
              Save the non-gap RF-length posterior probability-based mask used
              to mask the alignment to file <f>. The input alignment must be
              in Stockholm format and contain '#=GC RF' annotation for this
              option to be valid. See the OUTPUT section above for more
              details on output mask files.

       --pmask-all <f>
              Save the full alignment-length posterior probability-based mask
              used to mask the alignment to file <f>. See the OUTPUT section
              above for more details on output mask files.

       --keepins
              If -p and/or -g is enabled and the alignment is in Stockholm or
              Pfam format and has '#=GC RF' annotation, then allow columns
              that are gaps in the RF annotation to possibly be kept. By
              default, all gap RF columns would be removed automatically, but
              with this option enabled gap and non-gap RF columns are treated
              identically. To automatically remove all gap RF columns when
              using a maskfile, define the mask in maskfile as having length
              equal to the non-gap RF length in the alignment. To
              automatically remove all gap RF columns when using -t, use the
              --t-rmins option.

SEE ALSO

       http://bioeasel.org/

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

AUTHOR

       http://eddylab.org

Easel 0.48                         Nov 2020                     esl-alimask(1)