Ace::Sequence(3)      User Contributed Perl Documentation      Ace::Sequence(3)

NAME
Ace::Sequence - Examine ACeDB Sequence Objects

SYNOPSIS
    # open database connection and get an Ace::Object sequence
    use Ace::Sequence;

    $db  = Ace->connect(-host => 'stein.cshl.org', -port => 200005);
    $obj = $db->fetch(Predicted_gene => 'ZK154.3');

    # Wrap it in an Ace::Sequence object
    $seq = Ace::Sequence->new($obj);

    # Find all the exons
    @exons = $seq->features('exon');

    # Find all the exons predicted by various versions of "genefinder"
    @exons = $seq->features('exon:genefinder.*');

    # Iterate through the exons, printing their start, end and DNA
    for my $exon (@exons) {
        print join("\t", $exon->start, $exon->end, $exon->dna), "\n";
    }

    # Find the region 1000 bp upstream of the first exon
    $sub = Ace::Sequence->new(-source => $exons[0],
                              -offset => -1000, -length => 1000);

    # Find all features in that area
    @features = $sub->features;

    # Print its DNA
    print $sub->dna;

    # Create a new Sequence object from the first 500 kb of chromosome 1
    $seq = Ace::Sequence->new(-name   => 'CHROMOSOME_I', -db => $db,
                              -offset => 0, -length => 500_000);

    # Get the GFF dump as a text string
    $gff = $seq->gff;

    # Limit dump to Predicted_genes
    $gff_genes = $seq->gff(-features => 'Predicted_gene');

    # Return a GFF object (using optional GFF.pm module from Sanger)
    $gff_obj = $seq->GFF;

DESCRIPTION
Ace::Sequence, and its allied classes Ace::Sequence::Feature and
Ace::Sequence::FeatureList, provide a convenient interface to the ACeDB
Sequence classes and the GFF sequence feature file format.

Using this class, you can define a region of the genome by using a
landmark (sequenced clone, link, superlink, predicted gene), an offset
from that landmark, and a distance. Offsets and distances can be
positive or negative. The result is an Ace::Sequence object. Once a
region is defined, you may retrieve its DNA sequence, or query the
database for any features that may be contained within this region.
Features can be returned as objects (using the Ace::Sequence::Feature
class), as GFF text-only dumps, or in the form of the GFF class defined
by the Sanger Centre's GFF.pm module.

This class builds on top of Ace and Ace::Object. Please see their
manual pages before consulting this one.

    $seq = Ace::Sequence->new($object);

    $seq = Ace::Sequence->new(-source => $object,
                              -offset => $offset,
                              -length => $length,
                              -refseq => $reference_sequence);

    $seq = Ace::Sequence->new(-name   => $name,
                              -db     => $db,
                              -offset => $offset,
                              -length => $length,
                              -refseq => $reference_sequence);

In order to create an Ace::Sequence you will need an active Ace
database accessor. Sequence regions are defined using a "source"
sequence, an offset, and a length. Optionally, you may also provide a
"reference sequence" to establish the coordinate system for all
inquiries. Sequences may be generated from existing Ace::Object
sequence objects, from other Ace::Sequence and Ace::Sequence::Feature
objects, or from a sequence name and a database handle.

The class method named new() is the interface to these facilities. In
its simplest, one-argument form, you provide new() with a previously
created Ace::Object that points to a Sequence or sequence-like object
(the meaning of "sequence-like" is explained in more detail below).
The new() method will return an Ace::Sequence object extending from the
beginning of the object through to its natural end.

In the named-parameter form of new(), the following arguments are
recognized:

-source
    The sequence source. This must be an Ace::Object of the "Sequence"
    class, or be a sequence-like object containing the SMap tag (see
    below).

-offset
    An offset from the beginning of the source sequence. The retrieved
    Ace::Sequence will begin at this position. The offset can be any
    positive or negative integer. Offsets are 0-based.

-length
    The length of the sequence to return. Either a positive or
    negative integer can be specified. If a negative length is given,
    the returned sequence will be complemented relative to the source
    sequence.

-refseq
    The sequence to use to establish the coordinate system for the
    returned sequence. Normally the source sequence is used to
    establish the coordinate system, but this can be used to override
    that choice. You can provide either an Ace::Object or just a
    sequence name for this argument. The source and reference
    sequences must share a common ancestor, but do not have to be
    directly related. An attempt to use a disjunct reference sequence,
    such as one on a different chromosome, will fail.

-name
    As an alternative to using an Ace::Object with the -source
    argument, you may specify a source sequence using -name and -db.
    The Ace::Sequence module will use the provided database accessor
    to fetch a Sequence object with the specified name. new() will
    return undef if no Sequence by this name is known.

-db
    This argument is required if the source sequence is specified by
    name rather than by object reference.

If new() is successful, it will create an Ace::Sequence object and
return it. Otherwise it will return undef and leave a descriptive
message in Ace->error(). Certain programming errors, such as a failure
to provide required arguments, cause a fatal error.
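
The relationship between the 0-based -offset, the -length, and the 1-based start and end coordinates can be sketched in plain Perl. This is only an illustration of the arithmetic for positive lengths. segment_bounds() is a hypothetical helper, not part of Ace::Sequence, and negative lengths (reverse-strand segments) are deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Ace::Sequence): convert a 0-based offset
# and a positive length into 1-based, inclusive start and end coordinates.
sub segment_bounds {
    my ($offset, $length) = @_;
    die "this sketch only handles positive lengths\n" unless $length > 0;
    my $start = $offset + 1;          # 0-based offset -> 1-based start
    my $end   = $offset + $length;    # inclusive end
    return ($start, $end);
}

# The first 500 kb of a sequence, as in -offset => 0, -length => 500_000
my ($start, $end) = segment_bounds(0, 500_000);
print "$start-$end\n";    # prints "1-500000"
```

An offset of 99 with a length of 50 would give 100-149 by the same arithmetic.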

Reference Sequences and the Coordinate System

When retrieving information from an Ace::Sequence, the coordinate
system is based on the sequence segment selected at object creation
time. That is, the "+1" strand is the natural direction of the
Ace::Sequence object, and base pair 1 is its first base pair. This
behavior can be overridden by providing a reference sequence to the
new() method, in which case the orientation and position of the
reference sequence establishes the coordinate system for the object.

In addition to the reference sequence, there are two other sequences
used by Ace::Sequence for internal bookkeeping. The "source" sequence
corresponds to the smallest ACeDB sequence object that completely
encloses the selected sequence segment. The "parent" sequence is the
smallest ACeDB sequence object that contains the "source". The parent
is used to derive the length and orientation of source sequences that
are not directly associated with DNA objects.

In many cases, the source sequence will be identical to the sequence
initially passed to the new() method. However, there are exceptions to
this rule. One common exception occurs when the offset and/or length
cross the boundaries of the passed-in sequence. In this case, the
ACeDB database is searched for the smallest sequence that contains both
endpoints of the Ace::Sequence object.

The other common exception occurs in Ace 4.8, where there is support
for "sequence-like" objects that contain the "SMap" ("Sequence Map")
tag. The "SMap" tag provides genomic location information for
arbitrary objects, not just those descended from the Sequence class.
This allows ACeDB to perform genome map operations on objects that are
not directly related to sequences, such as genetic loci that have been
interpolated onto the physical map. When an "SMap"-containing object
is passed to the Ace::Sequence new() method, the module will again
choose the smallest ACeDB Sequence object that contains both endpoints
of the desired region.

If an Ace::Sequence object is used to create a new Ace::Sequence
object, then the original object's source is inherited.
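
The "smallest sequence that contains both endpoints" rule can be pictured with a small stand-alone sketch. It is not the module's actual implementation: smallest_enclosing() is a hypothetical helper, the candidate names and coordinates are made up, and all candidates are assumed to be expressed in one shared coordinate system.

```perl
use strict;
use warnings;

# Hypothetical helper: among candidates (hashes with name, start and end in
# a shared 1-based coordinate system), pick the shortest one that contains
# both endpoints of the requested segment.
sub smallest_enclosing {
    my ($seg_start, $seg_end, @candidates) = @_;
    my $best;
    for my $c (@candidates) {
        # skip candidates that do not contain both endpoints
        next unless $c->{start} <= $seg_start && $c->{end} >= $seg_end;
        my $len = $c->{end} - $c->{start} + 1;
        $best = $c
            if !defined $best || $len < $best->{end} - $best->{start} + 1;
    }
    return $best;    # undef if nothing encloses the segment
}

# Made-up candidates for illustration only.
my @candidates = (
    { name => 'chromosome', start => 1,     end => 100_000 },
    { name => 'link',       start => 1,     end => 20_000  },
    { name => 'cosmid',     start => 5_000, end => 9_000   },
);

# A segment that runs past the end of the cosmid falls back to the link.
my $source = smallest_enclosing(8_500, 9_500, @candidates);
print $source->{name}, "\n";    # prints "link"
```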

Once an Ace::Sequence object is created, you can query it using the
following methods:

asString()

    $name = $seq->asString;

Returns a human-readable identifier for the sequence in the form
Source/start-end, where "Source" is the name of the source sequence,
and "start" and "end" are the endpoints of the sequence relative to the
source (using 1-based indexing). This method is called automatically
when the Ace::Sequence is used in a string context.
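
One way a class can arrange for a method to be called automatically in string context is Perl's overload pragma. The sketch below uses a hypothetical stand-in class, My::Segment, with made-up values; it shows the general mechanism and the Source/start-end format, and is not a copy of how Ace::Sequence is implemented.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, not the real Ace::Sequence implementation.
package My::Segment;

# Call asString() whenever an object is used as a string.
use overload '""' => \&asString, fallback => 1;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Build the "Source/start-end" identifier.
sub asString {
    my $self = shift;
    return sprintf '%s/%d-%d', $self->{source}, $self->{start}, $self->{end};
}

package main;

# Made-up values for illustration.
my $seg = My::Segment->new(source => 'SEQ_A', start => 101, end => 600);
print "Segment: $seg\n";    # prints "Segment: SEQ_A/101-600"
```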

source_seq()

    $source = $seq->source_seq;

Return the source of the Ace::Sequence.

parent_seq()

    $parent = $seq->parent_seq;

Return the immediate ancestor of the sequence. The parent of the
top-most sequence (such as the CHROMOSOME link) is itself. This method
is used internally to ascertain the length of source sequences which
are not associated with a DNA object.

NOTE: this procedure is a trifle funky and cannot reliably be used to
traverse upwards to the top-most sequence. The reason for this is that
it will return an Ace::Sequence in some cases, and an Ace::Object in
others. Use get_parent() to traverse upwards through a uniform series
of Ace::Sequence objects.

refseq([$seq])

    $refseq = $seq->refseq;

Returns the reference sequence, if one is defined.

    $seq->refseq($new_ref);

Set the reference sequence. The reference sequence must share a common
ancestor with $seq.

start()

    $start = $seq->start;

Start of this sequence, relative to the source sequence, using 1-based
indexing.

end()

    $end = $seq->end;

End of this sequence, relative to the source sequence, using 1-based
indexing.

offset()

    $offset = $seq->offset;

Offset of the beginning of this sequence relative to the source
sequence, using 0-based indexing. The offset may be negative if the
beginning of the sequence is to the left of the beginning of the source
sequence.

length()

    $length = $seq->length;

The length of this sequence, in base pairs. The length may be negative
if the sequence's orientation is reversed relative to the source
sequence. Use abslength() to obtain the absolute value of the sequence
length.

abslength()

    $length = $seq->abslength;

Return the absolute value of the length of the sequence.

strand()

    $strand = $seq->strand;

Returns +1 for a sequence oriented in the natural direction of the
genomic reference sequence, or -1 otherwise.

reversed()

Returns true if the segment is reversed relative to the canonical
genomic direction. This is the same as $seq->strand < 0.

dna()

    $dna = $seq->dna;

Return the DNA corresponding to this sequence. If the sequence length
is negative, the reverse complement of the appropriate segment will be
returned.

ACeDB allows Sequences to exist without an associated DNA object (which
typically happens during intermediate stages of a sequencing project).
In such a case, the returned sequence will contain the correct number
of "-" characters.
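
For readers unfamiliar with the two behaviours described for dna(), here is a minimal plain-Perl sketch of a reverse complement and of a "-" placeholder string. revcomp() and placeholder_dna() are hypothetical helpers written for this illustration; they are not the functions the module uses internally.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Ace::Sequence.

# Reverse-complement a DNA string (A<->T, C<->G), preserving case.
# Characters other than A, C, G and T are left unchanged.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
    return $rc;
}

# Placeholder for a segment with no DNA object: one "-" per base pair.
sub placeholder_dna {
    my ($length) = @_;
    return '-' x abs($length);
}

my $rc = revcomp('AAGCT');
print "$rc\n";                       # prints "AGCTT"

my $blank = placeholder_dna(-8);
print "$blank\n";                    # prints "--------"
```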

name()

    $name = $seq->name;

Return the name of the source sequence as a string.

get_parent()

    $parent = $seq->get_parent;

Return the immediate ancestor of this Ace::Sequence (i.e., the sequence
that contains this one). The return value is a new Ace::Sequence, or
undef if no parent sequence exists.
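
Because get_parent() is described as returning either a new Ace::Sequence or undef, climbing to the top-most sequence can be written as a simple loop. The sketch below assumes only that behaviour. top_most() is a hypothetical helper, and My::MockSeq is a made-up stand-in class (with made-up names) so that the example runs without a database connection.

```perl
use strict;
use warnings;

# Hypothetical helper: walk up from a segment to its top-most ancestor.
# Works with any object whose get_parent() returns the enclosing object,
# or undef at the top.
sub top_most {
    my ($seq) = @_;
    while (defined(my $parent = $seq->get_parent)) {
        $seq = $parent;
    }
    return $seq;
}

# Minimal stand-in class so the sketch runs without a database;
# it only mimics get_parent() and name().
package My::MockSeq;
sub new        { my ($class, %args) = @_; return bless {%args}, $class }
sub get_parent { return $_[0]{parent} }
sub name       { return $_[0]{name} }

package main;

# Made-up three-level hierarchy for illustration.
my $chrom  = My::MockSeq->new(name => 'CHROMOSOME_A');
my $cosmid = My::MockSeq->new(name => 'COSMID_A',   parent => $chrom);
my $gene   = My::MockSeq->new(name => 'COSMID_A.1', parent => $cosmid);

my $top = top_most($gene);
print $top->name, "\n";    # prints "CHROMOSOME_A"
```

With a real Ace::Sequence the same loop would be called as top_most($seq).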

get_children()

    @children = $seq->get_children();

Returns all subsequences that exist as independent objects in the ACeDB
database. What exactly is returned is dependent on the data model. In
older ACeDB databases, the only subsequences are those under the
catchall Subsequence tag. In newer ACeDB databases, the objects
returned correspond to objects to the right of the S_Child subtag using
a tag[2] syntax, and may include Predicted_genes, Sequences, Links, or
other objects. The return value is a list of Ace::Sequence objects.

features()

    @features = $seq->features;
    @features = $seq->features('exon','intron','Predicted_gene');
    @features = $seq->features('exon:GeneFinder','Predicted_gene:hand.*');

features() returns an array of Ace::Sequence::Feature objects. If
called without arguments, features() returns all features that cross
the sequence region. You may also provide a filter list to select a
set of features by type and subtype. The format of each filter is:

    type:subtype

where type is the class of the feature (the "feature" field of the GFF
format), and subtype is a description of how the feature was derived
(the "source" field of the GFF format). Either of these fields can be
absent, and either can be a regular expression. More advanced
filtering is not supported by this method, but is provided by the
Sanger Centre's GFF module.

The order of the features in the returned list is not specified. To
obtain features sorted by position, use this idiom:

    @features = sort { $a->start <=> $b->start } $seq->features;
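
To make the type:subtype filter format concrete, the following plain-Perl sketch parses filters and applies them to simple hash records. It is an illustration under stated assumptions, not the module's own matching code: parse_filter() and feature_matches() are hypothetical, features are modelled as hashes with type and subtype keys, and each half of a filter is assumed to be an anchored, case-insensitive regular expression.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the "type:subtype" filter format.
# Assumption: each half is treated as an anchored, case-insensitive regex.

# Turn "type:subtype" into a pair of compiled patterns (undef = match anything).
sub parse_filter {
    my ($filter) = @_;
    my ($type, $subtype) = split /:/, $filter, 2;
    my $compile = sub {
        my ($part) = @_;
        # explicit undef so the hash below always gets exactly one value
        return undef unless defined($part) && length($part);
        return qr/^(?:$part)$/i;
    };
    return { type => $compile->($type), subtype => $compile->($subtype) };
}

# True if the feature (a hash with type and subtype keys) passes any filter;
# an empty filter list accepts everything.
sub feature_matches {
    my ($feature, @filters) = @_;
    return 1 unless @filters;
    for my $f (map { parse_filter($_) } @filters) {
        next if $f->{type}    && $feature->{type}    !~ $f->{type};
        next if $f->{subtype} && $feature->{subtype} !~ $f->{subtype};
        return 1;
    }
    return 0;
}

my $exon = { type => 'exon', subtype => 'GeneFinder' };

my $first = feature_matches($exon, 'exon:genefinder.*') ? 'match' : 'no match';
print "$first\n";     # prints "match"

my $second = feature_matches($exon, 'intron', ':hand.*') ? 'match' : 'no match';
print "$second\n";    # prints "no match"
```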

feature_list()

    my $list = $seq->feature_list();

This method returns a summary list of the features that cross the
sequence in the form of an Ace::Sequence::FeatureList object. From the
Ace::Sequence::FeatureList object you can obtain the list of feature
names and the number of each type. The feature list is obtained from
the ACeDB server with a single short transaction, and therefore has
much less overhead than features().

See Ace::Sequence::FeatureList for more details.

transcripts()

This returns a list of Ace::Sequence::Transcript objects, which are
specializations of Ace::Sequence::Feature. See
Ace::Sequence::Transcript for details.

clones()

This returns a list of Ace::Sequence::Feature objects containing
reconstructed clones. This is a nasty hack, because ACeDB currently
records clone ends, but not the clones themselves, meaning that we will
not always know both ends of the clone. In this case the missing end
has a synthetic position of -99,999,999 or +99,999,999. Sorry.
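
Code that consumes clone features may want to set aside clones whose missing end carries one of these synthetic positions. A minimal sketch, assuming only the sentinel values quoted above: has_both_ends() is a hypothetical helper, and the coordinates are made up. With real features you would pass $clone->start and $clone->end.

```perl
use strict;
use warnings;

# Magnitude of the synthetic position given to a missing clone end.
use constant MISSING_END => 99_999_999;

# Hypothetical helper: true only when neither coordinate is a placeholder.
sub has_both_ends {
    my ($start, $end) = @_;
    return (abs($start) != MISSING_END && abs($end) != MISSING_END) ? 1 : 0;
}

# Made-up [start, end] pairs for illustration.
my @clones   = ([1_200, 38_500], [-99_999_999, 41_000], [52_000, 99_999_999]);
my @complete = grep { has_both_ends(@$_) } @clones;

printf "%d of %d clones have both ends\n", scalar(@complete), scalar(@clones);
# prints "1 of 3 clones have both ends"
```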

gff()

    $gff = $seq->gff();
    $gff = $seq->gff(-abs      => 1,
                     -features => ['exon','intron:GeneFinder']);

This method returns a GFF file as a scalar. The following arguments
are optional:

-abs
    Ordinarily the feature entries in the GFF file will be returned in
    coordinates relative to the start of the Ace::Sequence object.
    Position 1 will be the start of the sequence object, and the "+"
    strand will be the sequence object's natural orientation. However,
    if a true value is provided to -abs, the coordinate system used
    will be relative to the start of the source sequence, i.e. the
    native ACeDB Sequence object (usually a cosmid sequence or a link).

    If a reference sequence was provided when the Ace::Sequence was
    created, it will be used by default to set the coordinate system.
    Relative coordinates can be reenabled by providing a false value to
    -abs.

    Ordinarily the coordinate system manipulations automatically "do
    what you want" and you will not need to adjust them. See also the
    absolute() method described below.

-features
    The -features argument filters the features according to a list of
    types and subtypes. The format is identical to the one described
    for the features() method. A single filter may be provided as a
    scalar string. Multiple filters may be passed as an array
    reference.

See also the GFF() method described next.
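
Since gff() returns the dump as a single scalar, a caller who does not want to install GFF.pm can split it into fields directly. The sketch below assumes the conventional tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame, and an optional group column). parse_gff() is a hypothetical helper, and the sample text is made up to stand in for the value returned by $seq->gff.

```perl
use strict;
use warnings;

# Hypothetical helper: split GFF text into one hash per feature line.
# The ninth (group) column is optional.
sub parse_gff {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my @f = split /\t/, $line, 9;
        next unless @f >= 8;               # not a complete feature line
        my %rec;
        @rec{qw(seqname source feature start end score strand frame)}
            = @f[0 .. 7];
        $rec{group} = $f[8] if defined $f[8];
        push @records, \%rec;
    }
    return @records;
}

# Made-up sample text standing in for the scalar returned by $seq->gff.
my $gff = join "\n",
    '# sample data for illustration',
    join("\t", 'SEQ_A', 'GeneFinder', 'exon',   100, 250, '.', '+', '.'),
    join("\t", 'SEQ_A', 'GeneFinder', 'intron', 251, 400, '.', '+', '.'),
    '';

my @exons = grep { $_->{feature} eq 'exon' } parse_gff($gff);
my $count = scalar @exons;
print "$count exon(s), first at $exons[0]{start}-$exons[0]{end}\n";
# prints "1 exon(s), first at 100-250"
```

The "feature" and "source" keys correspond to the type and subtype used by the features() filters.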

GFF()

    $gff_object = $seq->GFF;
    $gff_object = $seq->GFF(-abs      => 1,
                            -features => ['exon','intron:GeneFinder']);

The GFF() method takes the same arguments as gff() described above, but
it returns a GFF::GeneFeatureSet object from the GFF.pm module. If the
GFF module is not installed, this method will generate a fatal error.

absolute()

    $abs = $seq->absolute;
    $abs = $seq->absolute(1);

This method controls whether the coordinates of features are returned
in absolute or relative coordinates. "Absolute" coordinates are
relative to the underlying source or reference sequence. "Relative"
coordinates are relative to the Ace::Sequence object. By default,
coordinates are relative unless new() was provided with a reference
sequence. This default can be examined and changed using absolute().
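
The difference between the two coordinate systems comes down to a shift by the segment's offset. The sketch below shows that arithmetic for forward-strand segments only. rel_to_abs() and abs_to_rel() are hypothetical helpers, the offset value is made up, and reversed segments or explicit reference sequences are not covered.

```perl
use strict;
use warnings;

# Hypothetical helpers, forward-strand segments only.
# $offset is the segment's 0-based offset within its source sequence.
sub rel_to_abs { my ($rel, $offset) = @_; return $rel + $offset }
sub abs_to_rel { my ($abs, $offset) = @_; return $abs - $offset }

my $offset = 5_000;    # made-up: segment begins at source position 5001
my $abs    = rel_to_abs(1, $offset);
print "relative 1 is absolute $abs\n";    # prints "relative 1 is absolute 5001"
```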

automerge()

    $merge = $seq->automerge;
    $seq->automerge(0);

This method controls whether groups of features will automatically be
merged together by the features() call. If true (the default), then
the left and right ends of clones will be merged into "clone" features,
introns, exons and CDS entries will be merged into
Ace::Sequence::Transcript objects, and similarity entries will be
merged into Ace::Sequence::GappedAlignment objects.

db()

    $db = $seq->db;

Returns the Ace database accessor associated with this sequence.

SEE ALSO
Ace, Ace::Object, Ace::Sequence::Feature, Ace::Sequence::FeatureList,
GFF

AUTHOR
Lincoln Stein <lstein@cshl.org> with extensive help from Jean
Thierry-Mieg <mieg@kaa.crbm.cnrs-mop.fr>

Many thanks to David Block <dblock@gene.pbi.nrc.ca> for finding and
fixing the nasty off-by-one errors.

Copyright (c) 1999, Lincoln D. Stein

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself. See DISCLAIMER.txt for
disclaimers of warranty.

perl v5.8.8                       2001-02-20                  Ace::Sequence(3)