esl-alistat(1)                    Easel Manual                    esl-alistat(1)

esl-alistat - summarize a multiple sequence alignment file

esl-alistat [options] msafile

esl-alistat summarizes the contents of the multiple sequence alignment(s)
in msafile, reporting statistics such as the alignment name, format,
alignment length (number of aligned columns), number of sequences,
average pairwise % identity, and the mean, smallest, and largest raw
(unaligned) lengths of the sequences.

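As a rough illustration of the statistics listed above, the following Python sketch computes them for a toy alignment held in a dictionary. It is not the tool's implementation. The gap symbols, the helper names, and the definition of pairwise identity (identical aligned residue pairs divided by the shorter raw length) are all assumptions made for this example; the text above does not state the exact formula.

```python
from itertools import combinations

GAP_CHARS = set("-._~")  # assumed gap symbols, not taken from the manual


def raw_length(aligned):
    """Number of residues left after removing gap symbols."""
    return sum(1 for ch in aligned if ch not in GAP_CHARS)


def pairwise_identity(a, b):
    """Identical aligned pairs divided by the shorter raw length (assumed definition)."""
    idents = sum(
        1
        for x, y in zip(a, b)
        if x not in GAP_CHARS and y not in GAP_CHARS and x.upper() == y.upper()
    )
    shorter = min(raw_length(a), raw_length(b))
    return idents / shorter if shorter else 0.0


def summarize(msa):
    """msa maps sequence name -> aligned sequence; all rows have equal length."""
    seqs = list(msa.values())
    lengths = [raw_length(s) for s in seqs]
    pairs = list(combinations(seqs, 2))
    avg_id = sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    return {
        "nseq": len(seqs),
        "alen": len(seqs[0]) if seqs else 0,
        "mean_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
        "avg_pct_id": 100.0 * avg_id,
    }


# Hypothetical three-sequence RNA alignment.
toy = {"seq1": "ACGU-ACGU", "seq2": "ACGUAACG-", "seq3": "AC-U-ACGU"}
stats = summarize(toy)
print(stats)
```

The average is taken over all unordered sequence pairs, so the cost grows quadratically with the number of sequences.
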
If msafile is - (a single dash), multiple alignment input is read from
stdin.

The --list, --icinfo, --rinfo, --pcinfo, --psinfo, --cinfo, --bpinfo,
and --iinfo options dump various statistics on the alignment to
optional output files, as described for each of those options below.

The --small option allows summarizing alignments without storing them
in memory, which can be useful for large alignment files whose sizes
approach or exceed the amount of available RAM. When --small is used,
esl-alistat prints fewer statistics on the alignment, omitting data
on the smallest and largest sequences and the average identity of the
alignment. --small only works on Pfam-formatted alignments (a special
type of non-interleaved Stockholm alignment in which each sequence
occurs on a single line), and --informat pfam must be given with
--small. Further, when --small is used, the alphabet must be specified
with --amino, --dna, or --rna.

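The constraints on --small described above can be checked before the program is launched. The sketch below is a hypothetical wrapper, not part of Easel: the function name build_alistat_args and the file name big.sto are invented for illustration, and only flags named in this manual are used.

```python
def build_alistat_args(msafile, small=False, informat=None, alphabet=None):
    """Build an argument list for esl-alistat (hypothetical helper).

    Raises ValueError for combinations the manual text rules out.
    """
    if alphabet is not None and alphabet not in ("amino", "dna", "rna"):
        raise ValueError("alphabet must be 'amino', 'dna', or 'rna'")
    if small:
        # --small needs the Pfam format asserted; <s> is case-insensitive.
        if informat is None or informat.lower() != "pfam":
            raise ValueError("--small requires --informat pfam")
        # --small also needs an explicit alphabet.
        if alphabet is None:
            raise ValueError("--small requires one of --amino, --dna, --rna")
    args = ["esl-alistat"]
    if small:
        args.append("--small")
    if informat is not None:
        args += ["--informat", informat]
    if alphabet is not None:
        args.append("--" + alphabet)
    args.append(msafile)
    return args


print(build_alistat_args("big.sto", small=True, informat="pfam", alphabet="rna"))
```

The resulting list could be passed to subprocess.run on a system where esl-alistat is installed; passing "-" as msafile selects stdin as described above.
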
-h     Print brief help; includes version number and summary of all
       options, including expert options.

-1     Use a tabular output format with one line of statistics per
       alignment in msafile. This is most useful when msafile contains
       many different alignments (such as a Pfam database in Stockholm
       format).

--informat <s>
       Assert that input msafile is in alignment format <s>, bypassing
       format autodetection. Common choices for <s> include: stockholm,
       a2m, afa, psiblast, clustal, phylip. For more information, and
       for codes for some less common formats, see the main
       documentation. The string <s> is case-insensitive (a2m or A2M
       both work).

--amino
       Assert that the msafile contains protein sequences.

--dna  Assert that the msafile contains DNA sequences.

--rna  Assert that the msafile contains RNA sequences.

--small
       Operate in small memory mode for Pfam formatted alignments.
       --informat pfam and one of --amino, --dna, or --rna must be
       given as well.

--list <f>
       List the names of all sequences in all alignments in msafile to
       file <f>. Each sequence name is written on its own line.

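To show the kind of output --list describes (one sequence name per line, across all alignments), here is a minimal Python sketch that extracts names from Stockholm-formatted text. It assumes the usual Stockholm layout (markup lines start with "#", each alignment ends with "//", and sequence lines are "name sequence"). The function name and sample data are hypothetical, and this is not how esl-alistat itself is implemented.

```python
def list_sequence_names(stockholm_text):
    """Return sequence names in file order, once per alignment."""
    names = []
    seen = set()  # names already seen in the current alignment
    for line in stockholm_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or markup/comment line
        if line == "//":
            seen.clear()  # end of one alignment; the next starts fresh
            continue
        name = line.split()[0]
        if name not in seen:  # interleaved blocks repeat names
            seen.add(name)
            names.append(name)
    return names


sample = """# STOCKHOLM 1.0
seq1 ACGU
seq2 AC-U
//
# STOCKHOLM 1.0
seqA GGCC
seqB GG-C
//
"""
print("\n".join(list_sequence_names(sample)))
```
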
--icinfo <f>
       Dump the information content per position in tabular format to
       file <f>. Lines prefixed with "#" are comment lines that explain
       the meaning of each of the tab-delimited fields.

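The manual does not give the formula behind the per-position information content. One common textbook definition, used here purely as an assumption, is log2(K) minus the Shannon entropy of the observed residue frequencies in the column, where K is the alphabet size and gaps are ignored. The sketch below implements that assumed definition for an RNA alphabet; the tool may use a different background distribution or correction.

```python
import math
from collections import Counter


def column_information(column, alphabet="ACGU"):
    """Information content in bits under the assumed definition log2(K) - H."""
    counts = Counter(ch.upper() for ch in column if ch.upper() in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # column holds only gaps or non-alphabet symbols
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(alphabet)) - entropy


for col in ("AAAA", "AACC", "ACGU"):
    print(col, column_information(col))
```

Under this definition a fully conserved RNA column scores 2 bits and a column with all four residues at equal frequency scores 0 bits.
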
--rinfo <f>
       Dump information on the frequency of gaps versus nongap residues
       per position in tabular format to file <f>. Lines prefixed with
       "#" are comment lines that explain the meaning of each of the
       tab-delimited fields.

--pcinfo <f>
       Dump per-column information on posterior probabilities in
       tabular format to file <f>. Lines prefixed with "#" are comment
       lines that explain the meaning of each of the tab-delimited
       fields.

--psinfo <f>
       Dump per-sequence information on posterior probabilities in
       tabular format to file <f>. Lines prefixed with "#" are comment
       lines that explain the meaning of each of the tab-delimited
       fields.

--iinfo <f>
       Dump information on inserted residues in tabular format to file
       <f>. Insert columns of the alignment are those that are gaps in
       the reference (#=GC RF) annotation. This option only works if
       the input file is in Stockholm format with reference annotation.
       Lines prefixed with "#" are comment lines that explain the
       meaning of each of the tab-delimited fields.

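The definition of insert columns above (columns that are gaps in the #=GC RF line) can be sketched as follows. The example counts inserted residues after each nongap reference position. The gap symbol set, the function name, and the choice to index inserts by the preceding reference position are assumptions for illustration; the actual output fields are documented in the "#" comment lines of the file the tool writes.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols


def insert_counts(rf, seqs):
    """Map reference position -> number of inserted residues that follow it.

    Position 0 means inserts before the first nongap RF column.
    """
    counts = {}
    rfpos = 0  # number of nongap RF columns seen so far
    for col, rf_char in enumerate(rf):
        if rf_char in GAP_CHARS:
            # Insert column: count residues (nongap symbols) across sequences.
            n = sum(1 for s in seqs if s[col] not in GAP_CHARS)
            if n:
                counts[rfpos] = counts.get(rfpos, 0) + n
        else:
            rfpos += 1
    return counts


rf = "xx..x.x"                      # hypothetical #=GC RF line
seqs = ["ACggU.A", "AC.gUcA"]       # hypothetical aligned sequences
print(insert_counts(rf, seqs))
```
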
--cinfo <f>
       Dump per-column residue counts to file <f>. If used in
       combination with --noambig, ambiguous (degenerate) residues are
       ignored and not counted. Otherwise, they are marginalized. For
       example, in an RNA sequence file, an 'N' is counted as 0.25
       'A', 0.25 'C', 0.25 'G', and 0.25 'U'.

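The marginalization described above can be illustrated with a short Python sketch for RNA columns. The 'N' case follows the manual's example directly. Splitting the other IUPAC degenerate codes uniformly over the residues they represent is an assumption extended from that example, and the function and table names are hypothetical.

```python
# IUPAC degenerate RNA codes and the canonical residues each can stand for.
RNA_DEGENERACY = {
    "R": "AG", "Y": "CU", "M": "AC", "K": "GU", "S": "CG", "W": "AU",
    "H": "ACU", "B": "CGU", "V": "ACG", "D": "AGU", "N": "ACGU",
}


def column_residue_counts(column, noambig=False):
    """Count canonical residues in one column; spread degenerate codes evenly."""
    counts = {r: 0.0 for r in "ACGU"}
    for ch in column.upper():
        if ch in counts:
            counts[ch] += 1.0
        elif ch in RNA_DEGENERACY and not noambig:
            options = RNA_DEGENERACY[ch]
            for r in options:
                counts[r] += 1.0 / len(options)  # e.g. N adds 0.25 to each
        # gaps and any other symbols are not counted
    return counts


print(column_residue_counts("AAN-"))
print(column_residue_counts("AAN-", noambig=True))
```
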
--noambig
       With --cinfo, do not count ambiguous (degenerate) residues.

--bpinfo <f>
       Dump per-column basepair counts to file <f>. Counts appear for
       each basepair in the consensus secondary structure (annotated as
       "#=GC SS_cons"). Only basepairs from sequences for which both
       paired positions are canonical residues are counted. That is,
       any basepair that has a gap or an ambiguous (degenerate) residue
       at either position of the pair is ignored and not counted.

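A minimal sketch of the counting rule above: find the paired columns in a consensus structure line, then tally the residue pair from each sequence only when both positions hold canonical residues. The bracket characters accepted here are an assumption about the SS_cons notation, pseudoknot annotation is ignored, and the function names are hypothetical.

```python
from collections import Counter

OPEN_TO_CLOSE = {"<": ">", "(": ")", "[": "]", "{": "}"}  # assumed bracket pairs


def paired_columns(ss_cons):
    """Return sorted 0-based (i, j) column pairs from a nested structure string.

    An unbalanced string raises IndexError when a closer has no opener.
    """
    close_to_open = {c: o for o, c in OPEN_TO_CLOSE.items()}
    stacks = {o: [] for o in OPEN_TO_CLOSE}
    pairs = []
    for i, ch in enumerate(ss_cons):
        if ch in stacks:
            stacks[ch].append(i)
        elif ch in close_to_open:
            pairs.append((stacks[close_to_open[ch]].pop(), i))
    return sorted(pairs)


def basepair_counts(ss_cons, seqs, canonical="ACGU"):
    """Map 1-based (i, j) -> Counter of two-letter basepairs, canonical only."""
    result = {}
    for i, j in paired_columns(ss_cons):
        tally = Counter()
        for s in seqs:
            a, b = s[i].upper(), s[j].upper()
            if a in canonical and b in canonical:  # skip gaps and degenerate codes
                tally[a + b] += 1
        result[(i + 1, j + 1)] = tally
    return result


ss = "<<..>>"                               # hypothetical #=GC SS_cons line
seqs = ["GCAAGC", "GNAACC", "-CUUG-"]       # hypothetical aligned RNA sequences
print(basepair_counts(ss, seqs))
```

In the toy data, the second sequence contributes nothing to the inner pair because of its 'N', and the third contributes nothing to the outer pair because of its gaps.
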
--weight
       With --icinfo, --rinfo, --pcinfo, --iinfo, --cinfo, and
       --bpinfo, weight counts based on #=GS WT annotation in the input
       msafile. A residue or basepair from a sequence with a weight of
       <x> is counted as <x> counts. By default, raw, unweighted counts
       are reported, corresponding to each sequence having an equal
       weight of 1.

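The effect of sequence weights on counts can be sketched with per-column gap and nongap tallies, the kind of quantity --rinfo reports. In this hypothetical Python example each sequence contributes its weight instead of 1; the weights are supplied as a plain list rather than parsed from #=GS WT lines, and the gap symbols are an assumption.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols


def gap_nongap_counts(seqs, weights=None):
    """Return one (nongap, gap) tuple per column.

    With weights=None every sequence counts as 1, matching the default
    unweighted behavior described above.
    """
    if weights is None:
        weights = [1.0] * len(seqs)
    total = sum(weights)
    rows = []
    for col in range(len(seqs[0])):
        nongap = sum(w for s, w in zip(seqs, weights) if s[col] not in GAP_CHARS)
        rows.append((nongap, total - nongap))
    return rows


seqs = ["AC-U", "A--U", "ACGU"]     # hypothetical aligned sequences
weights = [0.5, 2.0, 1.0]           # hypothetical per-sequence weights
print(gap_nongap_counts(seqs))
print(gap_nongap_counts(seqs, weights))
```
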
http://bioeasel.org/

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

http://eddylab.org

Easel 0.48                        Nov 2020                        esl-alistat(1)