esl-alimanip(1)                 Easel Manual                 esl-alimanip(1)

NAME
esl-alimanip - manipulate a multiple sequence alignment

SYNOPSIS
esl-alimanip [options] msafile

DESCRIPTION
esl-alimanip can manipulate the multiple sequence alignment(s) in
msafile in various ways. Options exist to remove specific sequences,
reorder sequences, designate reference columns using Stockholm "#=GC
RF" markup, and add annotation that numbers columns.

The alignments can be of protein or DNA/RNA sequences. All alignments
in the same msafile must be either protein or DNA/RNA. The alphabet
will be autodetected unless one of the options --amino, --dna, or --rna
is given.

OPTIONS
-h Print brief help; includes version number and summary of all
options, including expert options.

-o <f> Save the resulting, modified alignment in Stockholm format to a
file <f>. The default is to write it to standard output.

--informat <s>
Assert that msafile is in alignment format <s>, bypassing format
autodetection. Common choices for <s> include: stockholm, a2m,
afa, psiblast, clustal, phylip. For more information, and for
codes for some less common formats, see the main documentation.
The string <s> is case-insensitive (a2m or A2M both work).

--outformat <s>
Write the output in alignment format <s>. Common choices for
<s> include: stockholm, a2m, afa, psiblast, clustal, phylip.
The string <s> is case-insensitive (a2m or A2M both work).
Default is stockholm.

--devhelp
Print help, as with -h, but also include undocumented developer
options. These options are not listed below, are under development
or experimental, and are not guaranteed to even work correctly.
Use developer options at your own risk. The only resources for
understanding what they actually do are the brief one-line
description printed when --devhelp is enabled, and the source code.

EXPERT OPTIONS
--lnfract <x>
Remove any sequences whose length is less than fraction <x> of the
median sequence length in the alignment.

--lxfract <x>
Remove any sequences whose length is more than fraction <x> of the
median sequence length in the alignment.

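The median-relative length filters above (--lnfract and --lxfract) can be illustrated with a small Python sketch. This is not Easel's implementation: the dict-of-strings alignment, the gap character set, and the function names are assumptions made for illustration, and Easel may compute the median differently for an even number of sequences.

```python
from statistics import median

GAP_CHARS = set("-._~")  # assumed gap/missing-data characters


def unaligned_length(aligned_seq):
    """Count residues in an aligned sequence, ignoring gap characters."""
    return sum(1 for c in aligned_seq if c not in GAP_CHARS)


def filter_by_median_fraction(msa, min_fract=None, max_fract=None):
    """Keep sequences whose length is within the given fractions of the median.

    min_fract plays the role of --lnfract <x>, max_fract that of --lxfract <x>.
    """
    lengths = {name: unaligned_length(seq) for name, seq in msa.items()}
    med = median(lengths.values())
    kept = {}
    for name, seq in msa.items():
        if min_fract is not None and lengths[name] < min_fract * med:
            continue  # too short relative to the median
        if max_fract is not None and lengths[name] > max_fract * med:
            continue  # too long relative to the median
        kept[name] = seq
    return kept


msa = {
    "seq1": "ACGUACGUAC",  # 10 residues
    "seq2": "ACGU--GUAC",  # 8 residues
    "seq3": "AC--------",  # 2 residues
    "seq4": "ACGUACG---",  # 7 residues
}
# Median length is 7.5, so min_fract=0.5 removes only seq3 (2 < 3.75).
print(list(filter_by_median_fraction(msa, min_fract=0.5)))
```

The absolute filters --lmin and --lmax follow the same pattern with the threshold given directly in residues instead of being derived from the median.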
--lmin <n>
Remove any sequences with length less than <n> residues.

--lmax <n>
Remove any sequences with length more than <n> residues.

--rfnfract <x>
Remove any sequences whose nongap RF length is less than fraction
<x> of the nongap RF length of the alignment.

--detrunc <n>
Remove any sequences that have all gaps in the first <n> non-gap
#=GC RF columns or the last <n> non-gap #=GC RF columns.

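The --detrunc rule above can be sketched as follows. This is an illustration under stated assumptions rather than Easel's code: the RF line and sequences are plain strings of equal length, the gap character set is assumed, and the function names are hypothetical.

```python
GAP_CHARS = set("-._~")  # assumed gap/missing-data characters


def is_truncated(aligned_seq, rf, n):
    """True if the sequence is all gaps in the first n or the last n nongap RF columns."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rf_cols = [i for i, c in enumerate(rf) if c not in GAP_CHARS]
    first, last = rf_cols[:n], rf_cols[-n:]

    def all_gaps(cols):
        return all(aligned_seq[i] in GAP_CHARS for i in cols)

    return all_gaps(first) or all_gaps(last)


def detrunc(msa, rf, n):
    """Drop sequences judged truncated at either end of the reference columns."""
    return {name: s for name, s in msa.items() if not is_truncated(s, rf, n)}


rf = "xx.xxxx.xx"
msa = {
    "full":    "ACgGUACaGU",
    "trunc5":  "--.GUAC.GU",  # gaps in the first two RF columns
    "trunc3":  "AC.GUAC.--",  # gaps in the last two RF columns
    "partial": "A-.GUAC.G-",  # one residue at each end, so it is kept
}
print(list(detrunc(msa, rf, 2)))  # -> ['full', 'partial']
```

Only the nongap RF columns are inspected, so residues that a sequence places in insert columns near either end do not rescue it from removal in this sketch.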
--xambig <n>
Remove any sequences that have more than <n> ambiguous (degenerate)
residues.

--seq-r <f>
Remove any sequences with names listed in file <f>. Sequence
names listed in <f> can be separated by tabs, newlines, or spaces.
The alignment file (msafile) must be in Stockholm format for this
option to work.

--seq-k <f>
Keep only sequences with names listed in file <f>. Sequence
names listed in <f> can be separated by tabs, newlines, or spaces.
By default, the kept sequences will remain in the original order
they appeared in msafile, but the order from <f> will be used if
the --k-reorder option is enabled. The alignment file (msafile)
must be in Stockholm format for this option to work.

--small
With --seq-k or --seq-r, operate in small memory mode. The
alignment(s) will not be stored in memory, so --seq-k and --seq-r
can work on very large alignments regardless of the amount of
available RAM. The alignment file must be in Pfam format, and both
--informat pfam and one of --amino, --dna, or --rna must be given.

--k-reorder
With --seq-k <f>, reorder the kept sequences in the output
alignment to the order from the list file <f>.

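The name-list options above (--seq-r, --seq-k, and --k-reorder) can be illustrated with a minimal sketch. The function names and the dict-based alignment are assumptions for illustration; in particular, raising an error for a listed name that is absent from the alignment is an assumed policy, not a documented one.

```python
def read_name_list(text):
    """Split a list file's contents on tabs, newlines, or spaces."""
    return text.split()


def keep_sequences(msa, names, reorder=False):
    """Mimic --seq-k; with reorder=True, follow the list order as with --k-reorder."""
    missing = set(names) - set(msa)
    if missing:
        raise KeyError("names not found in alignment: %s" % sorted(missing))
    if reorder:
        return {name: msa[name] for name in names}
    wanted = set(names)
    return {name: seq for name, seq in msa.items() if name in wanted}


def remove_sequences(msa, names):
    """Mimic --seq-r: drop every listed sequence, keep the rest in original order."""
    unwanted = set(names)
    return {name: seq for name, seq in msa.items() if name not in unwanted}


msa = {"a": "ACGU", "b": "AC-U", "c": "A-GU", "d": "ACG-"}
names = read_name_list("d\tb\na")
print(list(keep_sequences(msa, names)))                # -> ['a', 'b', 'd']
print(list(keep_sequences(msa, names, reorder=True)))  # -> ['d', 'b', 'a']
print(list(remove_sequences(msa, names)))              # -> ['c']
```

The sketch relies on Python dicts preserving insertion order, which stands in for the row order of the alignment.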
--seq-ins <n>
Keep only sequences that have at least 1 inserted residue after
nongap RF position <n>.

--seq-ni <n>
With --seq-ins, require at least <n> inserted residues in a
sequence for it to be kept.

--seq-xi <n>
With --seq-ins, allow at most <n> inserted residues in a sequence
for it to be kept.

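One way to read the insert-based selection above (--seq-ins with --seq-ni and --seq-xi) is sketched below. It assumes that "inserted residues after nongap RF position <n>" means residues in the gap RF columns between nongap RF positions n and n+1, counted from 1; that interpretation, the gap character set, and the function names are assumptions, not statements about Easel's code.

```python
GAP_CHARS = set("-._~")  # assumed gap/missing-data characters


def insert_count_after(aligned_seq, rf, n):
    """Count residues in gap-RF columns between nongap RF positions n and n+1 (1-based)."""
    rf_cols = [i for i, c in enumerate(rf) if c not in GAP_CHARS]
    if not 1 <= n <= len(rf_cols):
        raise ValueError("n must be between 1 and the nongap RF length")
    start = rf_cols[n - 1] + 1
    end = rf_cols[n] if n < len(rf_cols) else len(rf)
    return sum(1 for c in aligned_seq[start:end] if c not in GAP_CHARS)


def keep_by_inserts(msa, rf, n, min_ins=1, max_ins=None):
    """Keep sequences with min_ins..max_ins inserted residues after RF position n.

    min_ins plays the role of --seq-ni, max_ins that of --seq-xi.
    """
    kept = {}
    for name, seq in msa.items():
        k = insert_count_after(seq, rf, n)
        if k >= min_ins and (max_ins is None or k <= max_ins):
            kept[name] = seq
    return kept


rf = "xx...xx"
msa = {"s1": "AC...GU", "s2": "ACa..GU", "s3": "ACaugGU"}
print(list(keep_by_inserts(msa, rf, 2)))             # -> ['s2', 's3']
print(list(keep_by_inserts(msa, rf, 2, min_ins=2)))  # -> ['s3']
print(list(keep_by_inserts(msa, rf, 2, max_ins=1)))  # -> ['s2']
```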
--trim <f>
File <f> is an unaligned FASTA file containing truncated versions
of each sequence in the msafile. Trim the sequences in the
alignment to match their truncated versions in <f>. If the
alignment output format is Stockholm (the default output format),
all per-column (GC) and per-residue (GR) annotation will be
removed from the alignment when --trim is used. However, if
--t-keeprf is also used, the reference annotation (GC RF) will
be kept.

--t-keeprf
Specify that the 'trimmed' alignment maintain the original
reference (GC RF) annotation. Only works in combination with
--trim.

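The trimming step described for --trim can be pictured with the sketch below. It assumes the truncated version is a contiguous substring of the unaligned sequence and that the first match is the intended one; both points, along with the function name and the choice of '-' as the replacement gap, are assumptions for illustration.

```python
GAP_CHARS = set("-._~")  # assumed gap/missing-data characters


def trim_aligned(aligned_seq, truncated):
    """Replace residues outside the truncated subsequence with gaps."""
    residues = [(i, c) for i, c in enumerate(aligned_seq) if c not in GAP_CHARS]
    full = "".join(c for _, c in residues).upper()
    start = full.find(truncated.upper())
    if start < 0:
        raise ValueError("truncated sequence is not a subsequence of the aligned one")
    # Alignment columns holding the residues that survive the trim.
    keep = {residues[k][0] for k in range(start, start + len(truncated))}
    return "".join(
        c if (c in GAP_CHARS or i in keep) else "-"
        for i, c in enumerate(aligned_seq)
    )


print(trim_aligned("AC-GUAC.GU", "GUAC"))  # -> ---GUAC.--
```

The alignment length is unchanged: trimmed residues become gaps rather than being deleted, so the columns still line up across sequences.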
--minpp <x>
Replace all residues in the alignments for which the posterior
probability annotation (#=GR PP) is less than <x> with gaps. The
PP annotation for these residues is also converted to gaps. <x>
must be greater than 0.0 and less than or equal to 0.95.

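The --minpp masking above can be sketched as follows. The sketch assumes the usual HMMER/Infernal PP encoding, in which '0' through '9' and '*' denote probability bins and '*' covers 0.95 and above, and it further assumes that a residue is masked when the lower bound of its bin falls below <x>. The exact comparison Easel performs is not stated in this page, so treat that rule, the gap characters written, and the function names as assumptions.

```python
GAP_CHARS = set("-._~")  # assumed gap/missing-data characters


def pp_lower_bound(pp_char):
    """Lower bound of the probability bin for a PP character (assumed encoding)."""
    if pp_char == "*":
        return 0.95
    d = int(pp_char)
    return max(0, 2 * d - 1) / 20  # '0' -> 0.0, '1' -> 0.05, ..., '9' -> 0.85


def apply_minpp(aligned_seq, pp, min_pp):
    """Gap out residues whose PP bin lies below min_pp, along with their PP characters."""
    if not 0.0 < min_pp <= 0.95:
        raise ValueError("min_pp must be > 0.0 and <= 0.95")
    new_seq, new_pp = [], []
    for res, p in zip(aligned_seq, pp):
        if p not in GAP_CHARS and pp_lower_bound(p) < min_pp:
            new_seq.append("-")  # assumed gap character for sequence rows
            new_pp.append(".")   # assumed gap character for PP rows
        else:
            new_seq.append(res)
            new_pp.append(p)
    return "".join(new_seq), "".join(new_pp)


print(apply_minpp("ACGU-ACGU", "*952.36**", 0.5))  # -> ('AC----CGU', '*9....6**')
```

Under this reading, the upper limit of 0.95 on <x> follows naturally: '*' is the highest bin, so any larger threshold would mask every residue.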
--tree <f>
Reorder sequences by tree order. Perform single linkage clustering
on the sequences, using pairwise sequence identity computed from
the alignment, to define a 'tree' of the sequences. The sequences
in the alignment are reordered according to the tree, which groups
similar sequences together. The tree is output in Newick format
to <f>.

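The tree-ordering idea above can be illustrated with a naive single linkage clustering. This is a sketch under stated assumptions, not Easel's algorithm: the identity definition (matches over columns where both sequences have a residue), the tie-breaking, the omission of branch lengths, and all names are illustrative choices.

```python
GAP_CHARS = set("-._~")  # assumed gap/missing-data characters


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(x.upper() == y.upper() for x, y in pairs) / len(pairs)


def single_linkage_order(msa):
    """Return (leaf order, Newick topology) from naive single linkage clustering."""
    if not msa:
        raise ValueError("alignment is empty")
    # Each cluster is (leaf names in tree order, Newick substring).
    clusters = [([name], name) for name in msa]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster similarity is the best pairwise identity.
                sim = max(pairwise_identity(msa[a], msa[b])
                          for a in clusters[i][0] for b in clusters[j][0])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  "(%s,%s)" % (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    leaves, newick = clusters[0]
    return leaves, newick + ";"


msa = {"a": "ACGUACGU", "b": "UUUUGGGG", "c": "ACGUACGA", "d": "UUUUGGGC"}
order, newick = single_linkage_order(msa)
print(order)   # -> ['a', 'c', 'b', 'd']
print(newick)  # -> ((a,c),(b,d));
```

Reading the leaves of the final tree from left to right gives the new row order, which is how similar sequences end up adjacent in the output.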
--reorder <f>
Reorder sequences to the order listed in file <f>. Each sequence
in the alignment must be listed in <f>. To reorder only a subset
of the sequences into a subset alignment, use --seq-k with
--k-reorder instead. The alignment file (msafile) must be in
Stockholm format for this option to work.

--mask2rf <f>
Read in the 'mask' file <f> and use it to define new #=GC RF
annotation for the alignment. <f> must be a single line with
exactly <alen> or <rflen> characters: the full alignment length
or the number of nongap #=GC RF characters, respectively. Each
character must be either a '1' or a '0'. The new #=GC RF markup
will contain an 'x' for each column that is a '1' in the mask
file, and a '.' for each column that is a '0'. If the mask is of
length <rflen>, it is interpreted as applying only to the nongap
RF characters in the existing RF annotation; all gap RF characters
will remain gaps, and nongap RF characters will be redefined as
above.

--m-keeprf
With --mask2rf, do not overwrite existing nongap RF characters
that the input mask includes (marks as '1'); leave them as they
are instead of replacing them with 'x'.

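The mask handling described for --mask2rf and --m-keeprf can be sketched as below. The function name and arguments are hypothetical, the gap character set is assumed, and treating a mask whose length equals <alen> as a full-length mask when <alen> and <rflen> coincide is an illustrative choice.

```python
GAP_CHARS = set("-._~")  # assumed gap characters in RF annotation


def mask_to_rf(mask, alen, old_rf=None, keep_existing=False):
    """Build a new RF line from a 0/1 mask of length alen or rflen.

    keep_existing=True mimics --m-keeprf: included nongap RF characters are kept.
    """
    if set(mask) - {"0", "1"}:
        raise ValueError("mask may contain only '0' and '1'")
    if old_rf is not None and len(old_rf) != alen:
        raise ValueError("existing RF annotation must have length alen")
    if len(mask) == alen:
        cols = range(alen)  # mask applies to every column
        new_rf = ["."] * alen
    else:
        if old_rf is None:
            raise ValueError("an rflen-length mask needs existing RF annotation")
        cols = [i for i, c in enumerate(old_rf) if c not in GAP_CHARS]
        if len(mask) != len(cols):
            raise ValueError("mask length must equal alen or rflen")
        new_rf = list(old_rf)  # gap RF columns stay as they are
    for i, m in zip(cols, mask):
        if m == "0":
            new_rf[i] = "."
        elif keep_existing and old_rf is not None and old_rf[i] not in GAP_CHARS:
            new_rf[i] = old_rf[i]
        else:
            new_rf[i] = "x"
    return "".join(new_rf)


old_rf = "AC.GU.AC"  # alen = 8, rflen = 6
print(mask_to_rf("11011011", 8))                                  # -> xx.xx.xx
print(mask_to_rf("110011", 8, old_rf=old_rf))                     # -> xx....xx
print(mask_to_rf("110011", 8, old_rf=old_rf, keep_existing=True)) # -> AC....AC
```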
--num-all
Add annotation to the alignment numbering all of the columns in
the alignment.

--num-rf
Add annotation to the alignment numbering the non-gap (non '.')
#=GC RF columns of the alignment.

--rm-gc <s>
Remove certain types of #=GC annotation from the alignment. <s>
must be one of: RF, SS_cons, SA_cons, PP_cons.

--sindi
Annotate individual secondary structures for each sequence by
imposing the consensus secondary structure defined by the #=GC
SS_cons annotation.

--post2pp
Update Infernal's cmalign 0.72-1.0.2 posterior probability
"POST" annotation to "PP" annotation, which is read by other
miniapps, including esl-alimask and esl-alistat.

--amino
Assert that the msafile contains protein sequences.

--dna Assert that the msafile contains DNA sequences.

--rna Assert that the msafile contains RNA sequences.

SEE ALSO
http://bioeasel.org/

COPYRIGHT
Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

AUTHOR
http://eddylab.org

Easel 0.48                        Nov 2020                   esl-alimanip(1)