esl-sfetch(1)

esl-sfetch(1)                    Easel Manual                    esl-sfetch(1)

NAME

       esl-sfetch - retrieve (sub-)sequences from a sequence file

SYNOPSIS

       esl-sfetch [options] seqfile key
         (retrieve a single sequence by key)

       esl-sfetch -c from..to [options] seqfile key
         (retrieve a single subsequence by key and coords)

       esl-sfetch -f [options] seqfile keyfile
         (retrieve multiple sequences using a file of keys)

       esl-sfetch -Cf [options] seqfile subseq-coord-file
         (retrieve multiple subsequences using a file of keys and coords)

       esl-sfetch --index seqfile
         (index a sequence file for retrievals)

DESCRIPTION

       esl-sfetch retrieves one or more sequences or subsequences from
       seqfile.

       The seqfile must be indexed using esl-sfetch --index seqfile. This
       creates an SSI index file seqfile.ssi.

       To retrieve a single complete sequence, do esl-sfetch seqfile key,
       where key is the name or accession of the desired sequence.
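
As a minimal sketch of the index-then-fetch workflow described above, the following shell session builds a tiny FASTA file, indexes it, and fetches one sequence by name. The file name demo.fa and the sequence names seq1 and seq2 are made up for illustration, and the esl-sfetch calls are skipped if the program is not on your PATH.

```shell
# demo.fa, seq1, and seq2 are hypothetical names used only for this example.
printf '>seq1 first example\nACGTACGTACGTACGTACGT\n' >  demo.fa
printf '>seq2 second example\nTTTTGGGGCCCCAAAATTTT\n' >> demo.fa

if command -v esl-sfetch >/dev/null 2>&1; then
    rm -f demo.fa.ssi              # remove any stale index left by an earlier run
    esl-sfetch --index demo.fa     # creates the SSI index demo.fa.ssi
    esl-sfetch demo.fa seq2        # writes the complete seq2 record to stdout
else
    echo "esl-sfetch not found on PATH; skipping index and fetch" >&2
fi
```

The index only needs to be built once per sequence file; later fetches reuse demo.fa.ssi.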

       To retrieve a single subsequence rather than a complete sequence, use
       the -c start..end option to provide start and end coordinates. The
       start and end coordinates are provided as one string, separated by any
       nonnumeric, nonwhitespace character or characters you like; see the -c
       option below for more details.

       To retrieve more than one complete sequence at once, you may use the -f
       option, and the second command line argument will specify the name of a
       keyfile that contains a list of names or accessions, one per line; the
       first whitespace-delimited field on each line of this file is parsed as
       the name/accession.

       To retrieve more than one subsequence at once, use the -C option in
       addition to -f, and now the second argument is parsed as a list of
       subsequence coordinate lines. See the -C option below for more details,
       including the format of these lines.

       In DNA/RNA files, you may extract (sub-)sequences in reverse complement
       orientation in two different ways: either by providing a from
       coordinate that is greater than to, or by providing the -r option.

       When the -f option is used to do multiple (sub-)sequence retrieval, the
       file argument may be - (a single dash), in which case the list of
       names/accessions (or subsequence coordinate lines) is read from
       standard input. However, because a standard input stream can't be SSI
       indexed, (sub-)sequence retrieval from stdin may be slow.

OPTIONS

       -h     Print brief help; includes version number and summary of all
              options, including expert options.

       -c coords
              Retrieve a subsequence with start and end coordinates specified
              by the coords string. This string consists of start and end
              coordinates separated by any nonnumeric, nonwhitespace character
              or characters you like; for example, -c 23..100, -c 23/100, or
              -c 23-100 all work. To retrieve a suffix of a sequence, you can
              omit the end coordinate; for example, -c 23: would work. To
              specify reverse complement (for DNA/RNA sequence), you can
              specify from greater than to; for example, -c 100..23 retrieves
              the reverse complement strand from 100 to 23.
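
To make the coords convention concrete, here is a small shell sketch that splits a coords string the way the text above describes it: leading digits are the start, trailing digits are the end, and anything nonnumeric in between is the separator. The function name describe_coords is hypothetical, and this is an illustration of the convention only, not Easel's own parser.

```shell
# describe_coords is a hypothetical helper, not part of Easel.
describe_coords() {
    coords=$1
    start=${coords%%[!0-9]*}    # leading run of digits
    end=${coords##*[!0-9]}      # trailing run of digits; empty if the end is omitted
    if [ -z "$end" ]; then
        echo "from $start to end of sequence"
    elif [ "$start" -gt "$end" ]; then
        echo "from $start to $end (reverse complement)"
    else
        echo "from $start to $end"
    fi
}

describe_coords 23..100
describe_coords 23/100
describe_coords 23:
describe_coords 100..23
```

On the command line, the corresponding calls would look like esl-sfetch -c 23..100 seqfile key, with seqfile and key replaced by your own file and sequence name.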

       -f     Interpret the second argument as a keyfile instead of as just
              one key. The first whitespace-delimited field on each line of
              keyfile is interpreted as a name or accession to be fetched.
              Any other fields on a line after the first one are ignored.
              Blank lines and lines beginning with # are ignored. This option
              doesn't work with the --index option.
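
A keyfile following these rules might look like the sketch below. The file name keys.txt and the keys seq1 and seq2 are made up for illustration. The awk line is only a rough approximation of the rules stated above, so you can preview which keys a file would yield; it is not how esl-sfetch itself reads the file.

```shell
# keys.txt, seq1, and seq2 are hypothetical names used only for this example.
cat > keys.txt <<'EOF'
# sequences to fetch; this comment line is ignored
seq1   anything after the first field is ignored

seq2
EOF

# Preview the keys: first field of every non-blank line that does not start with #.
awk '!/^#/ && NF { print $1 }' keys.txt
```

The file would then be passed as the second argument, for example esl-sfetch -f seqfile keys.txt, or piped in with - as the keyfile argument.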

       -o <f> Output retrieved sequences to a file <f> instead of to stdout.

       -n <s> Rename the retrieved (sub-)sequence to <s>. Incompatible with -f.

       -r     Reverse complement the retrieved (sub-)sequence. Only accepted
              for DNA/RNA sequences.

       -C     Multiple subsequence retrieval mode; requires the -f option.
              Specifies that the second command line argument is to be parsed
              as a subsequence coordinate file, consisting of lines containing
              four whitespace-delimited fields: new_name, from, to,
              name/accession. For each such line, sequence name/accession is
              found, a subsequence from..to is extracted, and the subsequence
              is renamed new_name before being output. Any other fields after
              the first four are ignored. Blank lines and lines beginning with
              # are ignored.
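
As a sketch of the four-field layout described above, the following writes a small subsequence coordinate file and prints what each line asks for. The file name subseqs.txt and all names inside it are hypothetical, and the awk line is just a way to read the file back, not part of esl-sfetch.

```shell
# subseqs.txt and every name in it are hypothetical, for illustration only.
cat > subseqs.txt <<'EOF'
# new_name  from  to  name/accession
frag1       5     12  seq1
frag2       3     18  seq2
EOF

# Summarize each request: new_name <- name/accession[from..to]
awk '!/^#/ && NF >= 4 { printf "%s <- %s[%s..%s]\n", $1, $4, $2, $3 }' subseqs.txt
```

The file would be used as the second argument together with both options, for example esl-sfetch -Cf seqfile subseqs.txt.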

       -O     Output retrieved sequence to a file named key. This is a
              convenience for saving some typing: instead of
                % esl-sfetch -o SRPA_HUMAN swissprot SRPA_HUMAN
              you can just type
                % esl-sfetch -O swissprot SRPA_HUMAN
              The -O option only works if you're retrieving a single
              sequence; it is incompatible with -f.

       --index
              Instead of retrieving a key, the special command esl-sfetch
              --index seqfile produces an SSI index of the names and
              accessions of the sequences in the seqfile. Indexing should be
              done once on the seqfile to prepare it for all future fetches.

EXPERT OPTIONS

       --informat <s>
              Assert that seqfile is in format <s>, bypassing format
              autodetection. Common choices for <s> include: fasta, embl,
              genbank. Alignment formats also work; common choices include:
              stockholm, a2m, afa, psiblast, clustal, phylip. For more
              information, and for codes for some less common formats, see
              the main documentation. The string <s> is case-insensitive
              (fasta or FASTA both work).

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

AUTHOR

       http://eddylab.org

Easel 0.48                         Nov 2020                      esl-sfetch(1)