esl-translate(1)                 Easel Manual                 esl-translate(1)

esl-translate - translate DNA sequence in six frames into individual ORFs

esl-translate [options] seqfile

Given a seqfile containing DNA or RNA sequences, esl-translate outputs
a six-frame translation of them as individual open reading frames in
FASTA format.

By default, only open reading frames greater than 20aa are reported.
This minimum ORF length can be changed with the -l option.

By default, no specific initiation codon is required, and any amino
acid can start an open reading frame. This is so esl-translate may be
used on sequence fragments, eukaryotic genes with introns, or other
cases where we do not want to assume that ORFs are complete coding
regions. This behavior can be changed. With the -m option, ORFs start
with an initiator AUG Met. With the -M option, ORFs start with any of
the initiation codons allowed by the genetic code. For example, the
"standard" code (NCBI transl_table 1) allows AUG, CUG, and UUG as
initiators. When -m or -M is used, an initiator is always translated to
Met (even if the initiator is something like UUG or CUG that doesn't
encode Met as an elongator).
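The three initiation behaviors described above can be illustrated with a minimal Python sketch. It is not Easel's implementation: the function name `orfs_in_frame`, the `mode` and `min_len` parameters, and the choice to report an ORF that runs off the end of the sequence are all assumptions made for illustration. It handles one forward frame, the standard code only, and unambiguous bases only.

```python
from itertools import product

# NCBI transl_table 1, with codons enumerated in T, C, A, G base order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AAS))
STANDARD_INITIATORS = {"ATG", "CTG", "TTG"}  # AUG, CUG, UUG


def orfs_in_frame(dna, mode="any", min_len=1):
    """Return the ORFs (as protein strings) found in one forward frame.

    mode="any": any non-stop codon may start an ORF (default behavior)
    mode="m":   an ORF must start at ATG (like -m)
    mode="M":   an ORF must start at ATG, CTG or TTG (like -M, standard code)
    min_len is a hypothetical inclusive length cutoff in amino acids.
    """
    dna = dna.upper().replace("U", "T")
    starts = {"any": None, "m": {"ATG"}, "M": STANDARD_INITIATORS}[mode]
    orfs, current = [], None
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        aa = TABLE[codon]
        if aa == "*":
            # A stop codon closes the ORF in progress, if there is one.
            if current is not None and len(current) >= min_len:
                orfs.append("".join(current))
            current = None
        elif current is not None:
            current.append(aa)      # elongation: normal translation
        elif starts is None:
            current = [aa]          # any amino acid may start an ORF
        elif codon in starts:
            current = ["M"]         # an initiator is always read as Met
    # Assumption: an ORF still open at the end of the sequence is reported.
    if current is not None and len(current) >= min_len:
        orfs.append("".join(current))
    return orfs
```

For the sequence GCU CUG GCU AUG GCU UAA, this sketch yields ALAMA by default, MA when AUG is required, and MAMA when any standard-code initiator is allowed, because the CUG initiator is read as Met.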

If seqfile is - (a single dash), input is read from the stdin pipe.
This (combined with the output being a standard FASTA file) allows
esl-translate to be used in command line incantations. If seqfile ends
in .gz, it is assumed to be a gzip-compressed file, and Easel will try
to read it as a stream from gunzip -c.

The output FASTA name/description line contains information about the
source and coordinates of each ORF. Each ORF is named orf1, etc., with
numbering starting from 1, in order of their start position on the top
strand followed by the bottom strand. The rest of the FASTA name/desc
line contains 4 additional fields, followed by the description of the
source sequence:

source=<s>
       <s> is the name of the source DNA/RNA sequence.

coords=start..end
       Coords, 1..L, for the translated ORF in a source DNA sequence of
       length L. If start is greater than end, the ORF is on the bottom
       (reverse complement) strand. The start is the first nucleotide
       of the first codon; the end is the last nucleotide of the last
       codon. The stop codon is not included in the coordinates (unlike
       in CDS annotation in GenBank, for example.)

length=<n>
       Length of the ORF in amino acids.

frame=<n>
       Which frame the ORF is in. Frames 1..3 are the top strand; 4..6
       are the bottom strand. Frame 1 starts at nucleotide 1. Frame 4
       starts at nucleotide L.
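As a rough illustration of how these fields might be consumed downstream, the following Python sketch parses one name/description line. The regular expression assumes the fields appear in the order listed above and are separated by whitespace, which is an assumption about the exact layout; `parse_orf_header` and the sample headers are hypothetical.

```python
import re

# Assumed layout: name, then source=, coords=, length=, frame=, then the
# description of the source sequence (possibly empty).
HEADER_RE = re.compile(
    r"^>?(?P<name>\S+)\s+source=(?P<source>\S+)\s+"
    r"coords=(?P<start>\d+)\.\.(?P<end>\d+)\s+"
    r"length=(?P<length>\d+)\s+frame=(?P<frame>\d+)\s*(?P<desc>.*)$"
)


def parse_orf_header(line):
    """Split an ORF name/description line into a dict of typed fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError("not an ORF header in the assumed layout: %r" % line)
    start, end = int(m["start"]), int(m["end"])
    return {
        "name": m["name"],
        "source": m["source"],
        "start": start,
        "end": end,
        "length": int(m["length"]),
        "frame": int(m["frame"]),
        "description": m["desc"],
        # start > end means the ORF lies on the bottom strand.
        "strand": "bottom" if start > end else "top",
        # Nucleotides covered; the stop codon is excluded from coords.
        "nt_span": abs(end - start) + 1,
    }
```

Because the coordinates run from the first nucleotide of the first codon to the last nucleotide of the last codon, `nt_span` should equal three times `length`; checking that is a cheap sanity test on parsed records.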

By default, the "standard" genetic code is used (NCBI transl_table 1).
Any NCBI genetic code transl_table can be selected with the -c option,
as follows:

 1   Standard
 2   Vertebrate mitochondrial
 3   Yeast mitochondrial
 4   Mold, protozoan, coelenterate mitochondrial; Mycoplasma/Spiroplasma
 5   Invertebrate mitochondrial
 6   Ciliate, dasycladacean, Hexamita nuclear
 9   Echinoderm and flatworm mitochondrial
10   Euplotid nuclear
11   Bacterial, archaeal; and plant plastid
12   Alternative yeast
13   Ascidian mitochondrial
14   Alternative flatworm mitochondrial
16   Chlorophycean mitochondrial
21   Trematode mitochondrial
22   Scenedesmus obliquus mitochondrial
23   Thraustochytrium mitochondrial
24   Pterobranchia mitochondrial
25   Candidate Division SR1 and Gracilibacteria

As of this writing, more information about the genetic codes in the
NCBI translation tables is at http://www.ncbi.nlm.nih.gov/Taxonomy/ at
a link titled Genetic codes.

DNA sequences may contain IUPAC degeneracy codes, such as N, R, Y, etc.
If all codons consistent with a degenerate codon translate to the same
amino acid (or to a stop), that translation is done; otherwise, the
codon is translated as X (even if one or more compatible codons are
stops). For example, in the standard code, UAR translates to * (stop),
GGN translates to G (glycine), NNN translates to X, and UGR translates
to X (it could be either a UGA stop or a UGG Trp).
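The degenerate-codon rule above can be sketched in a few lines of Python. This is an illustration of the stated rule under the standard code, not Easel's code; the names `expand` and `translate_degenerate` are hypothetical.

```python
from itertools import product

# NCBI transl_table 1, with codons enumerated in T, C, A, G base order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AAS))

# IUPAC nucleotide codes mapped to the bases they stand for (U read as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(codon):
    """List every unambiguous codon consistent with a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in codon.upper()))]


def translate_degenerate(codon):
    """Return the shared residue (or '*') if all expansions agree, else 'X'."""
    residues = {TABLE[c] for c in expand(codon)}
    return residues.pop() if len(residues) == 1 else "X"
```

With this sketch, `translate_degenerate("UAR")` gives `*`, `"GGN"` gives `G`, and both `"NNN"` and `"UGR"` give `X`, matching the examples in the text.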

Degenerate initiation codons are handled essentially the same. If all
codons consistent with the degenerate codon are legal initiators, then
the codon is allowed to initiate a new ORF. Stop codons are never a
legal initiator (not only with -m or -M but also with the default of
allowing any amino acid to initiate), so degenerate codons consistent
with a stop cannot be initiators. For example, NNN cannot initiate an
ORF, nor can UGR -- even though they translate to X. This means that we
don't translate long stretches of N's as long ORFs of X's, which is
probably a feature, given the prevalence of artificial runs of N's in
genome sequence assemblies.
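A small Python sketch of the initiator rule follows, again under the standard code and not taken from Easel. The function `can_initiate` and its `initiators` parameter are hypothetical: passing `None` stands for the default behavior, and passing a set of codons stands for -m or -M.

```python
from itertools import product

# NCBI transl_table 1, with codons enumerated in T, C, A, G base order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AAS))

# IUPAC nucleotide codes mapped to the bases they stand for (U read as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def can_initiate(codon, initiators=None):
    """Decide whether a (possibly degenerate) codon may start an ORF.

    initiators=None models the default: any amino acid may initiate.
    Otherwise initiators is the set of legal start codons, for example
    {"ATG"} for -m or {"ATG", "CTG", "TTG"} for -M in the standard code.
    """
    options = ["".join(p) for p in product(*(IUPAC[b] for b in codon.upper()))]
    # A codon consistent with any stop can never initiate.
    if any(TABLE[c] == "*" for c in options):
        return False
    if initiators is None:
        return True
    # Every consistent codon must be a legal initiator.
    return all(c in initiators for c in options)
```

Under these assumptions, `can_initiate("NNN")` and `can_initiate("UGR")` are both false, while a codon such as YUG (CUG or UUG) can initiate when the standard-code initiator set is supplied but not when only AUG is allowed.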

Degenerate DNA codons are not translated to degenerate amino acids
other than X, even when that is possible. For example, SAR and MUH are
decoded as X, not Z (Q|E) and J (I|L). The extra complexity needed for
a degenerate to degenerate translation doesn't seem worthwhile.

-h     Print brief help. Includes version number and summary of all
       options. Also includes a list of the available NCBI transl_tables
       and their numerical codes, for the -c option.

-c <id>
       Choose alternative genetic code <id> where <id> is the numerical
       code of one of the NCBI transl_tables.

-l <n> Set the minimum reported ORF length to <n> aa.

-m     Require ORFs to start with an initiator codon AUG (Met).

-M     Require ORFs to start with an initiator codon, as specified by
       the allowed initiator codons in the NCBI transl_table. In the
       default Standard code, AUG, CUG, and UUG are allowed as
       initiators. An initiation codon is always translated as Met, even
       if it does not normally encode Met as an elongator.

-W     Use a memory-efficient windowed sequence reader. The default is
       to read entire DNA sequences into memory, which may become
       memory limited for some very large eukaryotic chromosomes. The
       windowed reader cannot reverse complement a nonrewindable input
       stream, so either seqfile must be a file, or you must use
       --watson to limit translation to the top strand.

--informat <s>
       Assert that input seqfile is in format <s>, bypassing format
       autodetection. Common choices for <s> include: fasta, embl,
       genbank. Alignment formats also work; common choices include:
       stockholm, a2m, afa, psiblast, clustal, phylip. For more
       information, and for codes for some less common formats, see main
       documentation. The string <s> is case-insensitive (fasta or
       FASTA both work).

--watson
       Only translate the top strand.

--crick
       Only translate the bottom strand.
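For scripted use, the options above can be assembled into an argument list. The Python helper below is a hypothetical convenience wrapper, not part of Easel; it uses only the options documented on this page and places them before seqfile, as in the synopsis. It also rejects one combination the -W description rules out: a windowed read from stdin without --watson.

```python
def build_command(seqfile, code=None, min_len=None, start="any",
                  windowed=False, strand="both", informat=None):
    """Build an argv list for esl-translate from keyword choices.

    start:  "any" (default behavior), "m" (-m) or "M" (-M)
    strand: "both", "watson" (--watson) or "crick" (--crick)
    """
    if start not in ("any", "m", "M"):
        raise ValueError("start must be 'any', 'm' or 'M'")
    if strand not in ("both", "watson", "crick"):
        raise ValueError("strand must be 'both', 'watson' or 'crick'")
    # The windowed reader cannot reverse complement a nonrewindable
    # stream, so stdin input with -W needs --watson.
    if windowed and seqfile == "-" and strand != "watson":
        raise ValueError("-W on stdin requires strand='watson'")

    argv = ["esl-translate"]
    if code is not None:
        argv += ["-c", str(code)]
    if min_len is not None:
        argv += ["-l", str(min_len)]
    if start != "any":
        argv.append("-" + start)
    if windowed:
        argv.append("-W")
    if informat is not None:
        argv += ["--informat", informat]
    if strand != "both":
        argv.append("--" + strand)
    argv.append(seqfile)
    return argv
```

The returned list can be handed to `subprocess.run` on a system where esl-translate is installed; building a list rather than a shell string avoids quoting problems with file names.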

http://bioeasel.org/

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

http://eddylab.org

Easel 0.48                       Nov 2020                     esl-translate(1)