esl-translate(1)                 Easel Manual                 esl-translate(1)

NAME

       esl-translate - translate DNA sequence in six frames into individual
       ORFs

SYNOPSIS

       esl-translate [options] seqfile

DESCRIPTION

       Given a seqfile containing DNA or RNA sequences, esl-translate outputs
       a six-frame translation of them as individual open reading frames in
       FASTA format.

       By default, only open reading frames greater than 20 aa are reported.
       This minimum ORF length can be changed with the -l option.

       By default, no specific initiation codon is required, and any amino
       acid can start an open reading frame. This allows esl-translate to be
       used on sequence fragments, eukaryotic genes with introns, or other
       cases where ORFs should not be assumed to be complete coding regions.
       This behavior can be changed. With the -m option, ORFs start with an
       initiator AUG Met. With the -M option, ORFs start with any of the
       initiation codons allowed by the genetic code. For example, the
       "standard" code (NCBI transl_table 1) allows AUG, CUG, and UUG as
       initiators. When -m or -M is used, an initiator is always translated
       to Met (even if the initiator is something like UUG or CUG that
       doesn't encode Met as an elongator).

       If seqfile is - (a single dash), input is read from standard input.
       This (combined with the output being a standard FASTA file) allows
       esl-translate to be used in command-line pipelines. If seqfile ends
       in .gz, it is assumed to be a gzip-compressed file, and Easel will try
       to read it as a stream from gunzip -c.

OUTPUT FORMAT

       The output FASTA name/description line contains information about the
       source and coordinates of each ORF. ORFs are named orf1, orf2, and so
       on, numbered from 1 in order of their start position on the top
       strand, followed by the bottom strand. The rest of the FASTA name/desc
       line contains four additional fields, followed by the description of
       the source sequence:

       source=<s>
              <s> is the name of the source DNA/RNA sequence.

       coords=start..end
              Coords, 1..L, for the translated ORF in a source DNA sequence of
              length L. If start is greater than end, the ORF is on the bottom
              (reverse complement) strand. The start is the first nucleotide
              of the first codon; the end is the last nucleotide of the last
              codon. The stop codon is not included in the coordinates (unlike
              CDS annotation in GenBank, for example).

       length=<n>
              Length of the ORF in amino acids.

       frame=<n>
              Which frame the ORF is in. Frames 1..3 are the top strand; 4..6
              are the bottom strand. Frame 1 starts at nucleotide 1. Frame 4
              starts at nucleotide L.
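
       As an illustration of the fields above, the following Python sketch
       parses one name/description line. It is not part of Easel. It assumes
       the fields are whitespace-separated and appear in the order listed
       above; the names OrfHeader, parse_orf_header, is_bottom_strand, and
       dna_span are hypothetical, and the header lines in the usage note are
       made up for the example.

```python
import re
from typing import NamedTuple


class OrfHeader(NamedTuple):
    """Fields of one ORF name/description line (hypothetical container)."""
    name: str
    source: str
    start: int
    end: int
    length: int
    frame: int
    description: str


# Assumed layout (not confirmed against Easel's source):
# >name source=<s> coords=<start>..<end> length=<n> frame=<n> [description]
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+source=(?P<source>\S+)"
    r"\s+coords=(?P<start>\d+)\.\.(?P<end>\d+)"
    r"\s+length=(?P<length>\d+)\s+frame=(?P<frame>[1-6])"
    r"(?:\s+(?P<desc>.*))?$"
)


def parse_orf_header(line: str) -> OrfHeader:
    """Parse a FASTA header line into its documented fields."""
    m = HEADER_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError(f"not a recognized ORF header: {line!r}")
    return OrfHeader(
        name=m.group("name"),
        source=m.group("source"),
        start=int(m.group("start")),
        end=int(m.group("end")),
        length=int(m.group("length")),
        frame=int(m.group("frame")),
        description=m.group("desc") or "",
    )


def is_bottom_strand(h: OrfHeader) -> bool:
    """start > end means the ORF is on the reverse complement strand."""
    return h.start > h.end


def dna_span(h: OrfHeader) -> int:
    """Number of source nucleotides covered by the ORF (stop codon excluded)."""
    return abs(h.end - h.start) + 1
```

       For a made-up line such as ">orf1 source=chr1 coords=1..60 length=20
       frame=1", dna_span returns 60, which is three times the length field
       because the stop codon is excluded from the coordinates. A bottom
       strand ORF such as "coords=300..241" has start greater than end.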

ALTERNATIVE GENETIC CODES

       By default, the "standard" genetic code is used (NCBI transl_table 1).
       Any NCBI genetic code transl_table can be selected with the -c option,
       as follows:

       1      Standard
       2      Vertebrate mitochondrial
       3      Yeast mitochondrial
       4      Mold, protozoan, coelenterate mitochondrial;
              Mycoplasma/Spiroplasma
       5      Invertebrate mitochondrial
       6      Ciliate, dasycladacean, Hexamita nuclear
       9      Echinoderm and flatworm mitochondrial
       10     Euplotid nuclear
       11     Bacterial, archaeal, and plant plastid
       12     Alternative yeast
       13     Ascidian mitochondrial
       14     Alternative flatworm mitochondrial
       16     Chlorophycean mitochondrial
       21     Trematode mitochondrial
       22     Scenedesmus obliquus mitochondrial
       23     Thraustochytrium mitochondrial
       24     Pterobranchia mitochondrial
       25     Candidate Division SR1 and Gracilibacteria

       As of this writing, more information about the genetic codes in the
       NCBI translation tables is at http://www.ncbi.nlm.nih.gov/Taxonomy/ at
       a link titled Genetic codes.

IUPAC DEGENERACY CODES IN DNA

       DNA sequences may contain IUPAC degeneracy codes, such as N, R, Y, etc.
       If all codons consistent with a degenerate codon translate to the same
       amino acid (or to a stop), that translation is done; otherwise, the
       codon is translated as X (even if one or more compatible codons are
       stops). For example, in the standard code, UAR translates to * (stop),
       GGN translates to G (glycine), NNN translates to X, and UGR translates
       to X (it could be either a UGA stop or a UGG Trp).

       Degenerate initiation codons are handled in essentially the same way.
       If all codons consistent with the degenerate codon are legal
       initiators, then the codon is allowed to initiate a new ORF. Stop
       codons are never a legal initiator (not only with -m or -M but also
       with the default of allowing any amino acid to initiate), so
       degenerate codons consistent with a stop cannot be initiators. For
       example, NNN cannot initiate an ORF, nor can UGR, even though they
       translate to X. This means that long stretches of N's are not
       translated as long ORFs of X's, which is probably a feature, given the
       prevalence of artificial runs of N's in genome sequence assemblies.

       Degenerate DNA codons are not translated to degenerate amino acids
       other than X, even when that is possible. For example, SAR and MUH are
       decoded as X, not Z (Q|E) and J (I|L). The extra complexity needed for
       a degenerate-to-degenerate translation doesn't seem worthwhile.
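
       The rules in this section can be sketched in Python as follows. This
       is an illustration of the described behavior, not Easel's
       implementation, and it covers only the standard code (NCBI
       transl_table 1). The function names and the mode argument ("any" for
       the default, "m" for -m, "M" for -M) are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI transl_table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide codes; U is treated as T so RNA input also works.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Initiators of the standard code as listed in this manual: AUG, CUG, UUG.
STANDARD_INITIATORS = {"ATG", "CTG", "TTG"}


def expand(codon: str) -> list[str]:
    """List every unambiguous codon consistent with a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in codon.upper()))]


def translate_degenerate(codon: str) -> str:
    """Return the shared residue (or '*') if all expansions agree, else 'X'."""
    residues = {CODON_TABLE[c] for c in expand(codon)}
    return residues.pop() if len(residues) == 1 else "X"


def can_initiate(codon: str, mode: str = "any") -> bool:
    """Decide whether a (possibly degenerate) codon may start an ORF."""
    codons = expand(codon)
    # A codon consistent with any stop is never a legal initiator.
    if any(CODON_TABLE[c] == "*" for c in codons):
        return False
    if mode == "any":
        return True
    if mode == "m":
        allowed = {"ATG"}
    elif mode == "M":
        allowed = STANDARD_INITIATORS
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Every consistent codon must be a legal initiator.
    return all(c in allowed for c in codons)
```

       With these definitions, translate_degenerate gives * for UAR, G for
       GGN, and X for NNN, UGR, SAR, and MUH, matching the examples above.
       can_initiate rejects NNN and UGR in every mode because each is
       consistent with a stop codon.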

OPTIONS

       -h     Print brief help. Includes version number and summary of all
              options. Also includes a list of the available NCBI
              transl_tables and their numerical codes, for the -c option.

       -c <id>
              Choose alternative genetic code <id>, where <id> is the
              numerical code of one of the NCBI transl_tables.

       -l <n> Set the minimum reported ORF length to <n> aa.

       -m     Require ORFs to start with an initiator codon AUG (Met).

       -M     Require ORFs to start with an initiator codon, as specified by
              the allowed initiator codons in the NCBI transl_table. In the
              default Standard code, AUG, CUG, and UUG are allowed as
              initiators. An initiation codon is always translated as Met,
              even if it does not normally encode Met as an elongator.

       -W     Use a memory-efficient windowed sequence reader. The default is
              to read entire DNA sequences into memory, which may become
              memory-limited for some very large eukaryotic chromosomes. The
              windowed reader cannot reverse complement a nonrewindable input
              stream, so either seqfile must be a file, or you must use
              --watson to limit translation to the top strand.

       --informat <s>
              Assert that input seqfile is in format <s>, bypassing format
              autodetection. Common choices for <s> include: fasta, embl,
              genbank. Alignment formats also work; common choices include:
              stockholm, a2m, afa, psiblast, clustal, phylip. For more
              information, and for codes for some less common formats, see
              the main documentation. The string <s> is case-insensitive
              (fasta or FASTA both work).

       --watson
              Only translate the top strand.

       --crick
              Only translate the bottom strand.

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

AUTHOR

       http://eddylab.org

Easel 0.48                         Nov 2020                   esl-translate(1)