esl-alimanip(1)

esl-alimanip(1)                  Easel Manual                  esl-alimanip(1)

NAME

       esl-alimanip - manipulate a multiple sequence alignment

SYNOPSIS

       esl-alimanip [options] msafile

DESCRIPTION

       esl-alimanip can manipulate the multiple sequence alignment(s) in
       msafile in various ways. Options exist to remove specific sequences,
       reorder sequences, designate reference columns using Stockholm "#=GC
       RF" markup, and add annotation that numbers columns.

       The alignments can be of protein or DNA/RNA sequences. All alignments
       in the same msafile must be either protein or DNA/RNA. The alphabet
       will be autodetected unless one of the options --amino, --dna, or --rna
       is given.

OPTIONS

       -h     Print brief help; includes version number and summary of all
              options, including expert options.

       -o <f> Save the resulting, modified alignment in Stockholm format to a
              file <f>. The default is to write it to standard output.

       --informat <s>
              Assert that msafile is in alignment format <s>, bypassing format
              autodetection. Common choices for <s> include: stockholm, a2m,
              afa, psiblast, clustal, phylip. For more information, and for
              codes for some less common formats, see main documentation. The
              string <s> is case-insensitive (a2m or A2M both work).

       --outformat <s>
              Write the output in alignment format <s>. Common choices for
              <s> include: stockholm, a2m, afa, psiblast, clustal, phylip.
              The string <s> is case-insensitive (a2m or A2M both work).
              Default is stockholm.

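              As an illustration of -o and --outformat used together, the
              sketch below writes a tiny Stockholm alignment and, if
              esl-alimanip is on the PATH, converts it to aligned FASTA. The
              file names (example.sto, example.afa) and the sequences are
              invented for this example, and it assumes that -o honors
              --outformat. --rna is passed so that the result does not depend
              on alphabet autodetection for such a small input.

```shell
# Write a tiny Stockholm alignment (made-up names and sequences).
cat > example.sto <<'EOF'
# STOCKHOLM 1.0

seq1         ACGUACGUAC
seq2         ACGU--GUAC
seq3         AC------AC
//
EOF

# Convert it to aligned FASTA, but only if esl-alimanip is installed.
if command -v esl-alimanip >/dev/null 2>&1; then
    esl-alimanip --rna --informat stockholm --outformat afa \
        -o example.afa example.sto \
        || echo "esl-alimanip reported an error" >&2
else
    echo "esl-alimanip not found; skipping the conversion" >&2
fi
```
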
       --devhelp
              Print help, as with -h, but also include undocumented developer
              options. These options are not listed below, are under
              development or experimental, and are not guaranteed to even work
              correctly. Use developer options at your own risk. The only
              resources for understanding what they actually do are the brief
              one-line description printed when --devhelp is enabled, and the
              source code.

EXPERT OPTIONS

       --lnfract <x>
              Remove any sequences whose length is less than fraction <x> of
              the median sequence length in the alignment.

       --lxfract <x>
              Remove any sequences whose length is more than fraction <x> of
              the median sequence length in the alignment.

       --lmin <n>
              Remove any sequences with length less than <n> residues.

       --lmax <n>
              Remove any sequences with length more than <n> residues.

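              To see which sequences a length cutoff would affect, it can
              help to count residues per sequence first. The sketch below
              does that with awk on an invented alignment (lengths.sto),
              treating only '-' and '.' as gaps, which is an assumption of
              this example and not a statement about how esl-alimanip counts
              residues. It then runs --lmin with a hypothetical cutoff of 5
              if the program is installed.

```shell
# A small made-up alignment: seq3 has only 4 residues.
cat > lengths.sto <<'EOF'
# STOCKHOLM 1.0

seq1         ACGUACGUAC
seq2         ACGU--GUAC
seq3         AC------AC
//
EOF

# Count residues per sequence by deleting gap characters ('-' and '.').
awk '!/^#/ && !/^\/\// && NF == 2 { s = $2; gsub(/[-.]/, "", s); print $1, length(s) }' \
    lengths.sto > lengths.txt
cat lengths.txt

# Remove sequences shorter than the (hypothetical) cutoff.
min_len=5
if command -v esl-alimanip >/dev/null 2>&1; then
    esl-alimanip --rna --lmin "$min_len" -o lmin.sto lengths.sto \
        || echo "esl-alimanip reported an error" >&2
else
    echo "esl-alimanip not found; skipping" >&2
fi
```

              With this input, only seq3 falls below the cutoff of 5.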
       --rfnfract <x>
              Remove any sequences whose nongap RF length is less than
              fraction <x> of the nongap RF length of the alignment.

       --detrunc <n>
              Remove any sequences that have all gaps in the first <n> non-gap
              #=GC RF columns or the last <n> non-gap #=GC RF columns.

       --xambig <n>
              Remove any sequences that have more than <n> ambiguous
              (degenerate) residues.

       --seq-r <f>
              Remove any sequences with names listed in file <f>. Sequence
              names listed in <f> can be separated by tabs, new lines, or
              spaces. The alignment file must be in Stockholm format for this
              option to work.

       --seq-k <f>
              Keep only sequences with names listed in file <f>. Sequence
              names listed in <f> can be separated by tabs, new lines, or
              spaces. By default, the kept sequences remain in the order in
              which they appeared in msafile, but the order from <f> is used
              if the --k-reorder option is enabled. The alignment file must
              be in Stockholm format for this option to work.

       --small
              With --seq-k or --seq-r, operate in small memory mode. The
              alignment(s) will not be stored in memory, so --seq-k and
              --seq-r can work on very large alignments regardless of the
              amount of available RAM. The alignment file must be in Pfam
              format, and --informat pfam and one of --amino, --dna, or --rna
              must also be given.

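              The combination of options that --small requires can be
              sketched as follows. The alignment (small.sto, written here as
              a single block, as Pfam format expects) and the list file
              remove.list are invented for this example; a real use case
              would involve a far larger alignment. The output is taken from
              standard output, which is the documented default.

```shell
# A single-block (Pfam-style) Stockholm alignment with made-up sequences.
cat > small.sto <<'EOF'
# STOCKHOLM 1.0

seq1         ACGUACGUAC
seq2         ACGU--GUAC
seq3         AC------AC
//
EOF

# Names of sequences to remove, one per line.
printf 'seq2\n' > remove.list

# --small needs --informat pfam and an explicit alphabet option.
if command -v esl-alimanip >/dev/null 2>&1; then
    esl-alimanip --small --informat pfam --rna --seq-r remove.list \
        small.sto > small_out.sto \
        || echo "esl-alimanip reported an error" >&2
else
    echo "esl-alimanip not found; skipping" >&2
fi
```
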
       --k-reorder
              With --seq-k <f>, reorder the kept sequences in the output
              alignment to the order from the list file <f>.

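              A minimal sketch of --seq-k with --k-reorder, assuming an
              invented alignment (keep.sto) and list file (keep.list). The
              list names seq3 before seq1, so the kept sequences would be
              expected to appear in that order in the output.

```shell
# Made-up alignment with three sequences.
cat > keep.sto <<'EOF'
# STOCKHOLM 1.0

seq1         ACGUACGUAC
seq2         ACGU--GUAC
seq3         AC------AC
//
EOF

# Keep seq3 and seq1, in that order; seq2 is dropped.
printf 'seq3\nseq1\n' > keep.list

if command -v esl-alimanip >/dev/null 2>&1; then
    esl-alimanip --rna --seq-k keep.list --k-reorder -o kept.sto keep.sto \
        || echo "esl-alimanip reported an error" >&2
else
    echo "esl-alimanip not found; skipping" >&2
fi
```
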
       --seq-ins <n>
              Keep only sequences that have at least 1 inserted residue after
              nongap RF position <n>.

       --seq-ni <n>
              With --seq-ins, require at least <n> inserted residues in a
              sequence for it to be kept.

       --seq-xi <n>
              With --seq-ins, allow at most <n> inserted residues in a
              sequence for it to be kept.

       --trim <f>
              File <f> is an unaligned FASTA file containing truncated
              versions of each sequence in the msafile. Trim the sequences in
              the alignment to match their truncated versions in <f>. If the
              alignment output format is Stockholm (the default output
              format), all per-column (GC) and per-residue (GR) annotation
              will be removed from the alignment when --trim is used. However,
              if --t-keeprf is also used, the reference annotation (GC RF)
              will be kept.

       --t-keeprf
              Specify that the 'trimmed' alignment keep the original
              reference (GC RF) annotation. Only works in combination with
              --trim.

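              The inputs that --trim expects can be sketched as below. The
              alignment (full.sto) and the FASTA file (trunc.fa) are
              invented; each FASTA record is a contiguous piece of the
              corresponding unaligned sequence. --t-keeprf is added so that
              the #=GC RF line would be retained.

```shell
# Made-up alignment with reference (RF) annotation.
cat > full.sto <<'EOF'
# STOCKHOLM 1.0

seq1         ACGUACGUAC
seq2         ACGU--GUAC
seq3         AC------AC
#=GC RF      xxxx..xxxx
//
EOF

# Truncated, unaligned versions of every sequence in full.sto.
cat > trunc.fa <<'EOF'
>seq1
GUACGU
>seq2
GUGUA
>seq3
CA
EOF

if command -v esl-alimanip >/dev/null 2>&1; then
    esl-alimanip --rna --trim trunc.fa --t-keeprf -o trimmed.sto full.sto \
        || echo "esl-alimanip reported an error" >&2
else
    echo "esl-alimanip not found; skipping" >&2
fi
```
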
       --minpp <x>
              Replace all residues in the alignments for which the posterior
              probability annotation (#=GR PP) is less than <x> with gaps. The
              PP annotation for these residues is also converted to gaps. <x>
              must be greater than 0.0 and less than or equal to 0.95.

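              --minpp only has an effect when the alignment carries #=GR PP
              lines. The sketch below builds a small alignment with invented
              PP annotation (pp.sto) and applies a hypothetical threshold of
              0.5. The PP strings follow the single-character style written
              by HMMER and Infernal, but the specific values are made up.

```shell
# Made-up alignment with per-residue posterior probability (PP) lines.
cat > pp.sto <<'EOF'
# STOCKHOLM 1.0

seq1         ACGUACGUAC
#=GR seq1 PP ***9876789
seq2         ACGU--GUAC
#=GR seq2 PP 9873..3789
//
EOF

# Replace residues whose PP is below 0.5 with gaps.
if command -v esl-alimanip >/dev/null 2>&1; then
    esl-alimanip --rna --minpp 0.5 -o pp_masked.sto pp.sto \
        || echo "esl-alimanip reported an error" >&2
else
    echo "esl-alimanip not found; skipping" >&2
fi
```
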
       --tree <f>
              Reorder sequences by tree order. Perform single linkage
              clustering on the sequences in the alignment, based on pairwise
              sequence identity in the alignment, to define a 'tree' of the
              sequences. The sequences in the alignment are reordered
              according to the tree, which groups similar sequences together.
              The tree is output in Newick format to <f>.

       --reorder <f>
              Reorder sequences to the order listed in file <f>. Each
              sequence in the alignment must be listed in <f>. To keep and
              reorder only a subset of the sequences, use --seq-k with
              --k-reorder. The alignment file must be in Stockholm format for
              this option to work.

       --mask2rf <f>
              Read in the 'mask' file <f> and use it to define new #=GC RF
              annotation for the alignment. <f> must be a single line, with
              exactly <alen> or <rflen> characters: either the full alignment
              length or the number of nongap #=GC RF characters, respectively.
              Each character must be either a '1' or a '0'. The new #=GC RF
              markup will contain an 'x' for each column that is a '1' in the
              mask file, and a '.' for each column that is a '0'. If the mask
              is of length <rflen>, it is interpreted as applying only to the
              nongap RF characters in the existing RF annotation; all gap RF
              characters will remain gaps and nongap RF characters will be
              redefined as above.

       --m-keeprf
              With --mask2rf, do not overwrite existing nongap RF characters
              that are included by the input mask as 'x'; leave them as the
              characters they are.

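              The mask-to-RF mapping described for --mask2rf can be
              previewed before running the program. In the sketch below, the
              alignment (mask.sto) and mask (mask.txt) are invented, and the
              mask has one character per alignment column (the <alen> case).
              The tr command only mimics the documented '1' to 'x' and '0'
              to '.' mapping; it is not how esl-alimanip itself is
              implemented.

```shell
# Made-up alignment of length 10 with no existing RF line.
cat > mask.sto <<'EOF'
# STOCKHOLM 1.0

seq1         ACGUACGUAC
seq2         ACGU--GUAC
seq3         AC------AC
//
EOF

# One-line mask: '1' marks reference columns, '0' marks the rest.
printf '1111001111\n' > mask.txt

# Check that the mask has one character per alignment column.
alen=$(awk '$1 == "seq1" { print length($2) }' mask.sto)
mask_len=$(tr -d '\n' < mask.txt | wc -c | tr -d ' ')
echo "alignment length: $alen, mask length: $mask_len"

# Preview the RF line implied by the mask ('1' -> 'x', '0' -> '.').
tr '10' 'x.' < mask.txt > rf_preview.txt
cat rf_preview.txt

if command -v esl-alimanip >/dev/null 2>&1; then
    esl-alimanip --rna --mask2rf mask.txt -o masked.sto mask.sto \
        || echo "esl-alimanip reported an error" >&2
else
    echo "esl-alimanip not found; skipping" >&2
fi
```
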
       --num-all
              Add annotation to the alignment numbering all of the columns in
              the alignment.

       --num-rf
              Add annotation to the alignment numbering the non-gap (non '.')
              #=GC RF columns of the alignment.

       --rm-gc <s>
              Remove certain types of #=GC annotation from the alignment. <s>
              must be one of: RF, SS_cons, SA_cons, PP_cons.

       --sindi
              Annotate individual secondary structures for each sequence by
              imposing the consensus secondary structure defined by the #=GC
              SS_cons annotation.

       --post2pp
              Update Infernal's cmalign 0.72-1.0.2 posterior probability
              "POST" annotation to "PP" annotation, which is read by other
              miniapps, including esl-alimask and esl-alistat.

       --amino
              Assert that the msafile contains protein sequences.

       --dna  Assert that the msafile contains DNA sequences.

       --rna  Assert that the msafile contains RNA sequences.

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

AUTHOR

       http://eddylab.org

Easel 0.48                         Nov 2020                    esl-alimanip(1)