
esl-alistat(1)                   Easel Manual                   esl-alistat(1)

NAME

       esl-alistat - summarize a multiple sequence alignment file

SYNOPSIS

       esl-alistat [options] msafile

DESCRIPTION

       esl-alistat summarizes the contents of the multiple sequence
       alignment(s) in msafile, reporting the alignment name, format,
       alignment length (number of aligned columns), number of sequences,
       average pairwise % identity, and the mean, smallest, and largest raw
       (unaligned) lengths of the sequences.

       If msafile is - (a single dash), multiple alignment input is read from
       stdin.

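       As an illustration of the average pairwise % identity statistic, the
       following Python sketch averages identity over all pairs of aligned
       rows. It is not Easel's code: the function names are hypothetical, and
       the definition used here (identical aligned residue pairs divided by
       the shorter unaligned length) and the set of gap symbols are
       assumptions that may differ in detail from what esl-alistat computes.

```python
from itertools import combinations

GAP_CHARS = set("-._~")  # assumption: symbols treated as gaps


def pairwise_identity(a, b):
    """Identical aligned residue pairs divided by the shorter raw length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    idents = sum(1 for x, y in zip(a.upper(), b.upper())
                 if x == y and x not in GAP_CHARS)
    len_a = sum(1 for x in a if x not in GAP_CHARS)
    len_b = sum(1 for y in b if y not in GAP_CHARS)
    shorter = min(len_a, len_b)
    return idents / shorter if shorter else 0.0


def average_pairwise_identity(rows):
    """Mean of pairwise_identity over all unordered pairs of rows."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
msa = ["ACGU-ACGU", "ACGUAACGA", "AC-U-ACGU"]
print(f"average identity: {100 * average_pairwise_identity(msa):.1f}%")
```
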
       The --list, --icinfo, --rinfo, --pcinfo, --psinfo, --cinfo, --bpinfo,
       and --iinfo options dump various statistics on the alignment to
       optional output files, as described for each of those options below.

       The --small option summarizes alignments without storing them in
       memory, which can be useful for large alignment files whose size
       approaches or exceeds the amount of available RAM. When --small is
       used, esl-alistat prints fewer statistics on the alignment, omitting
       data on the smallest and largest sequences and the average identity of
       the alignment. --small only works on Pfam formatted alignments (a
       special type of non-interleaved Stockholm alignment in which each
       sequence occurs on a single line), and --informat pfam must be given
       with --small. Further, when --small is used, the alphabet must be
       specified with --amino, --dna, or --rna.

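       The requirements of --small can be captured in a small wrapper. The
       Python sketch below only assembles a command line from the options
       described above; the helper name small_mode_argv and the file name
       Pfam-A.full are hypothetical, and the command itself is not run.

```python
import shlex

ALPHABET_FLAGS = ("--amino", "--dna", "--rna")


def small_mode_argv(msafile, alphabet_flag):
    """Build an esl-alistat argv list that satisfies the --small rules."""
    if alphabet_flag not in ALPHABET_FLAGS:
        raise ValueError(
            f"--small needs one of {ALPHABET_FLAGS}, got {alphabet_flag!r}")
    # --small requires --informat pfam and an explicit alphabet.
    return ["esl-alistat", "--small", "--informat", "pfam",
            alphabet_flag, msafile]


# Hypothetical input file; pass the list to subprocess.run() to execute it.
print(shlex.join(small_mode_argv("Pfam-A.full", "--amino")))
```
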

OPTIONS

       -h     Print brief help; includes version number and summary of all
              options, including expert options.

       -1     Use a tabular output format with one line of statistics per
              alignment in msafile. This is most useful when msafile contains
              many different alignments (such as a Pfam database in Stockholm
              format).

EXPERT OPTIONS

       --informat <s>
              Assert that input msafile is in alignment format <s>, bypassing
              format autodetection. Common choices for <s> include: stockholm,
              a2m, afa, psiblast, clustal, phylip. For more information, and
              for codes for some less common formats, see the main
              documentation. The string <s> is case-insensitive (a2m or A2M
              both work).

       --amino
              Assert that the msafile contains protein sequences.

       --dna  Assert that the msafile contains DNA sequences.

       --rna  Assert that the msafile contains RNA sequences.

       --small
              Operate in small memory mode for Pfam formatted alignments.
              --informat pfam and one of --amino, --dna, or --rna must be
              given as well.

       --list <f>
              List the names of all sequences in all alignments in msafile to
              file <f>. Each sequence name is written on its own line.

       --icinfo <f>
              Dump the information content per position in tabular format to
              file <f>. Lines prefixed with "#" are comment lines, which
              explain the meanings of each of the tab-delimited fields.

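              To give a feel for what information content per position
              measures, the Python sketch below scores one alignment column
              as the alphabet's maximum entropy minus the column's observed
              entropy. The uniform background, the handling of gaps (they
              are skipped), and the function name are assumptions made for
              illustration; the exact calculation used for --icinfo is not
              described here and may differ.

```python
import math
from collections import Counter


def information_content(column, alphabet="ACGU"):
    """Bits of information in one column, assuming a uniform background."""
    # Keep only canonical residues; gaps and other symbols are skipped.
    residues = [ch for ch in column.upper() if ch in alphabet]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(residues).values())
    return math.log2(len(alphabet)) - entropy


# Toy columns: fully conserved, half-and-half, and fully mixed.
for col in ("AAAA", "AACC", "ACGU"):
    print(col, round(information_content(col), 3))
```
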
       --rinfo <f>
              Dump information on the frequency of gaps versus nongap residues
              per position in tabular format to file <f>. Lines prefixed with
              "#" are comment lines, which explain the meanings of each of the
              tab-delimited fields.

       --pcinfo <f>
              Dump per-column information on posterior probabilities in
              tabular format to file <f>. Lines prefixed with "#" are comment
              lines, which explain the meanings of each of the tab-delimited
              fields.

       --psinfo <f>
              Dump per-sequence information on posterior probabilities in
              tabular format to file <f>. Lines prefixed with "#" are comment
              lines, which explain the meanings of each of the tab-delimited
              fields.

       --iinfo <f>
              Dump information on inserted residues in tabular format to file
              <f>. Insert columns of the alignment are those that are gaps in
              the reference (#=GC RF) annotation. This option only works if
              the input file is in Stockholm format with reference annotation.
              Lines prefixed with "#" are comment lines, which explain the
              meanings of each of the tab-delimited fields.

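              The definition of an insert column can be sketched in a few
              lines of Python. The example finds the columns that are gaps
              in an RF line and counts the residues each sequence places in
              them. The gap symbol set and the function names are
              assumptions, and the result is not meant to reproduce the
              fields of the --iinfo output file.

```python
GAP_CHARS = set("-._~")  # assumption: symbols treated as gaps


def insert_columns(rf):
    """0-based indices of columns that are gaps in the RF annotation."""
    return [i for i, ch in enumerate(rf) if ch in GAP_CHARS]


def inserted_residues(rf, seqs):
    """Count, per sequence, the residues that fall in insert columns."""
    cols = insert_columns(rf)
    return {name: sum(1 for i in cols if row[i] not in GAP_CHARS)
            for name, row in seqs.items()}


# Toy data: columns 2 and 3 are inserts relative to the reference.
rf = "xx..xx"
seqs = {"seq1": "ACguGU", "seq2": "AC..GU", "seq3": "ACa.GU"}
print(insert_columns(rf))
print(inserted_residues(rf, seqs))
```
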
       --cinfo <f>
              Dump per-column residue counts to file <f>. If used in
              combination with --noambig, ambiguous (degenerate) residues
              will be ignored and not counted. Otherwise, they will be
              marginalized. For example, in an RNA sequence file, an 'N' will
              be counted as 0.25 'A', 0.25 'C', 0.25 'G', and 0.25 'U'.

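              The marginalization rule can be illustrated with a short Python
              sketch for one RNA column. It spreads each degenerate residue
              evenly over the bases it stands for, following the standard
              IUPAC nucleotide codes, and skips degenerate residues when
              noambig is set. The function name and the even split for codes
              other than 'N' are assumptions; only the 'N' case is stated
              above.

```python
# Standard IUPAC nucleotide codes, written for the RNA alphabet.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "M": "AC", "K": "GU", "S": "CG", "W": "AU",
    "H": "ACU", "B": "CGU", "V": "ACG", "D": "AGU", "N": "ACGU",
}


def column_counts(column, noambig=False):
    """Residue counts for one column, marginalizing degenerate residues."""
    counts = dict.fromkeys("ACGU", 0.0)
    for ch in column.upper():
        bases = IUPAC_RNA.get(ch)
        if bases is None:
            continue  # gap or unrecognized symbol
        if noambig and len(bases) > 1:
            continue  # mimic --noambig: drop degenerate residues
        for base in bases:
            counts[base] += 1.0 / len(bases)
    return counts


print(column_counts("AANC-"))
print(column_counts("AANC-", noambig=True))
```
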
       --noambig
              With --cinfo, do not count ambiguous (degenerate) residues.

       --bpinfo <f>
              Dump per-column basepair counts to file <f>. Counts appear for
              each basepair in the consensus secondary structure (annotated as
              "#=GC SS_cons"). Only basepairs from sequences for which both
              paired positions are canonical residues will be counted. That
              is, any basepair with a gap or an ambiguous (degenerate)
              residue at either position of the pair is ignored and not
              counted.

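              The counting rule for basepairs can be sketched in Python as
              follows. The example assumes that SS_cons marks pairs with
              matching brackets (<>, (), [], {}) and ignores every other
              symbol, including any pseudoknot annotation; the function
              names and the toy alignment are hypothetical, and the layout
              of the real output file is not reproduced.

```python
from collections import Counter

OPEN_TO_CLOSE = {"<": ">", "(": ")", "[": "]", "{": "}"}
CLOSE_TO_OPEN = {v: k for k, v in OPEN_TO_CLOSE.items()}
CANONICAL = set("ACGU")


def consensus_pairs(ss_cons):
    """Return sorted (i, j) column pairs from a bracketed SS_cons string."""
    stacks = {opener: [] for opener in OPEN_TO_CLOSE}
    pairs = []
    for j, ch in enumerate(ss_cons):
        if ch in OPEN_TO_CLOSE:
            stacks[ch].append(j)
        elif ch in CLOSE_TO_OPEN:
            pairs.append((stacks[CLOSE_TO_OPEN[ch]].pop(), j))
    return sorted(pairs)


def basepair_counts(ss_cons, rows):
    """Count residue pairs per consensus basepair, canonical residues only."""
    counts = {}
    for i, j in consensus_pairs(ss_cons):
        tally = Counter()
        for row in rows:
            left, right = row[i].upper(), row[j].upper()
            # Skip the pair if either side is a gap or a degenerate residue.
            if left in CANONICAL and right in CANONICAL:
                tally[left + right] += 1
        counts[(i, j)] = tally
    return counts


ss_cons = "<<..>>"
rows = ["GCAAGC", "GNAAGC", "AC-AGU", "G.AA.C"]
for pair, tally in basepair_counts(ss_cons, rows).items():
    print(pair, dict(tally))
```
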
       --weight
              With --icinfo, --rinfo, --pcinfo, --iinfo, --cinfo, and
              --bpinfo, weight counts based on #=GS WT annotation in the input
              msafile. A residue or basepair from a sequence with a weight of
              <x> will be considered <x> counts. By default, raw, unweighted
              counts are reported, which corresponds to each sequence having
              an equal weight of 1.
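
              The effect of weighting on counts can be shown with a minimal
              Python sketch for a single column. Each residue contributes
              its sequence's weight instead of 1; passing no weights
              reproduces the unweighted default. The function name, the gap
              symbol set, and the example weights are assumptions made for
              illustration.

```python
GAP_CHARS = set("-._~")  # assumption: symbols treated as gaps


def weighted_column_counts(column, weights=None):
    """Sum per-sequence weights for each residue seen in one column."""
    if weights is None:
        weights = [1.0] * len(column)  # default: every sequence counts as 1
    if len(weights) != len(column):
        raise ValueError("need exactly one weight per sequence")
    counts = {}
    for residue, weight in zip(column, weights):
        if residue in GAP_CHARS:
            continue
        counts[residue] = counts.get(residue, 0.0) + weight
    return counts


column = "AAG-"                   # one residue per sequence
weights = [0.5, 1.5, 2.0, 1.0]    # hypothetical #=GS WT values
print(weighted_column_counts(column))
print(weighted_column_counts(column, weights))
```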

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

AUTHOR

       http://eddylab.org

Easel 0.48                         Nov 2020                     esl-alistat(1)