Boulder::Genbank(3)   User Contributed Perl Documentation   Boulder::Genbank(3)

NAME
    Boulder::Genbank - Fetch Genbank data records as parsed Boulder Stones

SYNOPSIS
    use Boulder::Genbank;

    # network access via Entrez
    $gb = Boulder::Genbank->newFh( qw(M57939 M28274 L36028) );

    while ($data = <$gb>) {
        print $data->Accession;

        @introns = $data->features->Intron;
        print "There are ",scalar(@introns)," introns.\n";
        $dna = $data->Sequence;
        print "The dna is ",length($dna)," bp long.\n";

        my @features = $data->features(-type=>[ qw(Exon Source Satellite) ],
                                       -pos=>[90,310] );
        foreach (@features) {
            print $_->Type,"\n";
            print $_->Position,"\n";
            print $_->Gene,"\n";
        }
    }

    # another syntax
    $gb = new Boulder::Genbank(-accessor=>'Entrez',
                               -fetch => [qw/M57939 M28274 L36028/]);

    # local access via Yank
    $gb = new Boulder::Genbank(-accessor=>'Yank',
                               -fetch=>[qw/M57939 M28274 L36028/]);
    while (my $s = $gb->get) {
        # etc.
    }

    # parse a file of Genbank records
    $gb = new Boulder::Genbank(-accessor=>'File',
                               -fetch => '/usr/local/db/gbpri3.seq');
    while (my $s = $gb->get) {
        # etc.
    }

    # parse flatfile records yourself
    open (GB,"/usr/local/db/gbpri3.seq") or die "Can't open file: $!";
    local $/ = "//\n";    # each Genbank record ends with a "//" line
    while (<GB>) {
        my $s = Boulder::Genbank->parse($_);
        # etc.
    }

DESCRIPTION
    Boulder::Genbank provides retrieval and parsing services for NCBI
    Genbank-format records. It returns Genbank entries in Stone format,
    allowing easy access to the various fields and values.
    Boulder::Genbank is a descendant of Boulder::Stream, and provides a
    stream-like interface to a series of Stone objects.
63
64 >> IMPORTANT NOTE <<
65
66 As of January 2002, NCBI has changed their Batch Entrez interface. I
67 have modified Boulder::Genbank so as to use a "demo" interface, which
68 fixes things, but this isn't guaranteed in the long run.
69
70 I have written to NCBI, and they may fix this -- or they may not.
71
72 >> IMPORTANT NOTE <<
73
74 Access to Genbank is provided by three different accessors, which
75 together give access to remote and local Genbank databases. When you
76 create a new Boulder::Genbank stream, you provide one of the three
77 accessors, along with accessor-specific parameters that control what
78 entries to fetch. The three accessors are:
79
80 Entrez
81 This provides access to NetEntrez, accessing the most recent
82 Genbank information directly from NCBI's Web site. The parameters
83 passed to this accessor are either a series of Genbank accession
84 numbers, or an Entrez query (see
85 http://www.ncbi.nlm.nih.gov/Entrez/linking.html). If you provide a
86 list of accession numbers, the stream will return a series of
        stones corresponding to the numbers. If you provide an Entrez
        query instead, the entries are returned in the order in which
        Entrez returns them.
90
91 File
92 This provides access to local Genbank entries by reading from a
93 flat file (typically one of the .seq files downloadable from NCBI's
94 Web site). The stream will return a Stone corresponding to each of
95 the entries in the file, starting from the top of the file and
96 working downward. The parameter in this case is the path to the
97 local file.
98
99 Yank
100 This provides access to local Genbank entries using Will Fitzhugh's
101 Yank program. Yank provides fast indexed access to a Genbank flat
102 file using the accession number as the key. The parameter passed
103 to the Yank accessor is a list of accession numbers. Stones will
104 be returned in the requested order. By default the yank binary
105 lives in /usr/local/bin/yank. To support other locations, you may
106 define the environment variable YANK to contain the full path.
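
The lookup rule for the yank binary described above can be sketched in a few lines of Perl. This is only an illustration of the documented behavior: yank_path() is a hypothetical helper, not a function provided by Boulder::Genbank, and the module's own lookup may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Boulder::Genbank): return the path
# named by the YANK environment variable, or the documented default.
sub yank_path {
    my $path = $ENV{YANK};
    return (defined $path && length $path) ? $path : '/usr/local/bin/yank';
}

print "yank binary: ", yank_path(), "\n";
```

Setting YANK in the environment before running the script changes the path that is printed.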
107
108 It is also possible to parse a single Genbank entry from a text string
109 stored in a scalar variable, returning a Stone object.
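
Before a single entry can be parsed from a string, a multi-entry flat file has to be cut into one string per entry. The sketch below shows one way to do that with the "//" record terminator. The two entries are abbreviated, made-up records, and split_entries() is a hypothetical helper rather than part of the module.

```perl
use strict;
use warnings;

# Two abbreviated, made-up flat-file entries; real records are much longer.
my $flatfile = <<'END';
LOCUS       TEST0001     12 bp    DNA
ACCESSION   X00001
ORIGIN
        1 aagcttttcc ag
//
LOCUS       TEST0002      8 bp    DNA
ACCESSION   X00002
ORIGIN
        1 ggcagtgc
//
END

# Hypothetical helper: split a string holding several entries into
# one string per entry, using the "//" line as the record separator.
sub split_entries {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "//\n";
    my @entries = grep { /\S/ } <$fh>;   # drop any empty trailing chunk
    close $fh;
    return @entries;
}

my @entries = split_entries($flatfile);
print scalar(@entries), " entries found\n";
```

Each element of @entries is a complete record ending in "//" and could then be handed to Boulder::Genbank->parse() as shown in the synopsis.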
110
111 Boulder::Genbank methods
112 This section lists the public methods that the Boulder::Genbank class
113 makes available.
114
115 new()
    # Network fetch via Entrez, with accession numbers
    $gb=new Boulder::Genbank(-accessor => 'Entrez',
                             -fetch    => [qw/M57939 M28274 L36028/]);

    # Same, but shorter and uses -> operator
    $gb = Boulder::Genbank->new(qw(M57939 M28274 L36028));

    # Network fetch via Entrez, with a query
    $query = 'Homo sapiens[Organism] AND EST[Keyword]';
    $gb=new Boulder::Genbank(-accessor => 'Entrez',
                             -fetch    => $query);

    # Local fetch via Yank, with accession numbers
    $gb=new Boulder::Genbank(-accessor => 'Yank',
                             -fetch    => [qw/M57939 M28274 L36028/]);

    # Local fetch via File
    $gb=new Boulder::Genbank(-accessor => 'File',
                             -fetch    => '/usr/local/genbank/gbpri3.seq');
137
138 The new() method creates a new Boulder::Genbank stream on the
139 accessor provided. The three possible accessors are Entrez, Yank
140 and File. If successful, the method returns the stream object.
141 Otherwise it returns undef.
142
143 new() takes the following arguments:
144
145 -accessor Name of the accessor to use
146 -fetch Parameters to pass to the accessor
147 -proxy Path to an HTTP proxy, used when using
148 the Entrez accessor over a firewall.
149
150 Specify the accessor to use with the -accessor argument. If not
151 specified, it defaults to Entrez.
152
153 -fetch is an accessor-specific argument. The possibilities are:
154
155 For Entrez, the -fetch argument may point to a scalar, in which
156 case it is interpreted as an Entrez query string. See
157 http://www.ncbi.nlm.nih.gov/Entrez/linking.html for a description
158 of the query syntax. Alternatively, -fetch may point to an array
159 reference, in which case it is interpreted as a list of accession
160 numbers to retrieve. If -fetch points to a hash, it is interpreted
161 as extended information. See "Extended Entrez Parameters" below.
162
163 For Yank, the -fetch argument must point to an array reference
164 containing the accession numbers to retrieve.
165
166 For File, the -fetch argument must point to a string-valued scalar,
167 which will be interpreted as the path to the file to read Genbank
168 entries from.
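
The way the Entrez accessor interprets -fetch depends on the kind of value it receives. The following sketch mirrors those documented rules with Perl's ref() function. classify_fetch() is a hypothetical helper written for this illustration; it is not the module's internal dispatch code.

```perl
use strict;
use warnings;

# Hypothetical helper: describe how a -fetch value would be interpreted
# for the Entrez accessor, following the rules listed above.
sub classify_fetch {
    my ($fetch) = @_;
    my $type = ref $fetch;
    return 'query string'        if $type eq '' && defined $fetch;
    return 'accession list'      if $type eq 'ARRAY';
    return 'extended parameters' if $type eq 'HASH';
    return 'unsupported';
}

print classify_fetch('Homo sapiens[Organism] AND EST[Keyword]'), "\n";
print classify_fetch([qw/M57939 M28274 L36028/]), "\n";
print classify_fetch({ -query => 'EST[Keyword]', -db => 'n' }), "\n";
```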
169
    For Entrez (and Entrez only) Boulder::Genbank allows you to use a
    shortcut syntax in which you provide new() with a list of accession
    numbers:
173
174 $gb = new Boulder::Genbank('M57939','M28274','L36028');
175
176 newFh()
    This works like new(), but returns a filehandle. Each Genbank
    record is then read from the filehandle with the <> operator:

        $fh = Boulder::Genbank->newFh('M57939','M28274','L36028');
        while ($record = <$fh>) {
            print $record->asString;
        }
184
185 get()
186 The get() method is inherited from Boulder::Stream, and simply
187 returns the next parsed Genbank Stone, or undef if there is nothing
188 more to fetch. It has the same semantics as the parent class,
189 including the ability to restrict access to certain top-level tags.
190
    The object returned is a Stone::GB_Sequence object, which is a
    descendant of Stone.
193
194 put()
195 The put() method is inherited from the parent Boulder::Stream
196 class, and will write the passed Stone to standard output in
197 Boulder format. This means that it is currently not possible to
198 write a Boulder::Genbank object back into Genbank flatfile form.
199
200 Extended Entrez Parameters
    The Entrez accessor recognizes extended parameters that let you
    customize the search. Instead of passing a query string scalar or a
    list of accession numbers as the -fetch argument, pass a hash
    reference. The hashref should contain one or more of the following
    keys:
206
207 -query
208 The Entrez query to process.
209
210 -accession
211 The list of accession numbers to fetch, as an array ref.
212
213 -db The database to search. This is a single-letter database code
214 selected from the following list:
215
216 m MEDLINE
217 p Protein
218 n Nucleotide
219 s Popset
220
221 -proxy
222 An HTTP proxy to use. For example:
223
            -proxy => 'http://www.firewall.com:9000'
225
226 If you think you need this, get the correct URL from your system
227 administrator.
228
    As an example, here's how to search for ESTs from Oryza sativa that
    have been entered or modified since 1999. The extended parameters
    are passed as a hash reference in the -fetch argument:

        my $gb = new Boulder::Genbank(
            -accessor => 'Entrez',
            -fetch    => { -query => 'Oryza sativa[Organism] AND EST[Keyword] AND 1999[MDAT]',
                           -db    => 'n'
                         });
236
238 Each record returned from the Boulder::Genbank stream defines a set of
239 methods that correspond to features and other fields in the Genbank
240 flat file record. Stone::GB_Sequence gives the full details, but they
241 are listed for reference here:
242
243 $length = $entry->length
244 Get the length of the sequence.
245
246 $start = $entry->start
247 Get the start position of the sequence, currently always "1".
248
249 $end = $entry->end
250 Get the end position of the sequence, currently always the same as the
251 length.
252
253 @feature_list = $entry->features(-pos=>[50,450],-type=>['CDS','Exon'])
254 features() will search the entry feature list for those features that
255 meet certain criteria. The criteria are specified using the -pos
256 and/or -type argument names, as shown below.
257
258 -pos
259 Provide a position or range of positions which the feature must
260 overlap. A single position is specified in this way:
261
            -pos => 1500;  # feature must overlap position 1500
263
264 or a range of positions in this way:
265
266 -pos => [1000,1500]; # 1000 to 1500 inclusive
267
268 If no criteria are provided, then features() returns all the
269 features, and is equivalent to calling the Features() accessor.
270
271 -type, -types
272 Filter the list of features by type or a set of types. Matches are
273 case-insensitive, so "exon", "Exon" and "EXON" are all equivalent.
274 You may call with a single type as in:
275
276 -type => 'Exon'
277
278 or with a list of types, as in
279
280 -types => ['Exon','CDS']
281
282 The names "-type" and "-types" can be used interchangeably.
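
The -pos and -type criteria can be illustrated without the module. The sketch below applies the same documented semantics (inclusive overlap, case-insensitive type matching) to plain hashes that stand in for feature Stones. simple_range() and filter_features() are hypothetical helpers, compound positions such as join(...) are deliberately not handled, and the real features() method may behave differently in edge cases.

```perl
use strict;
use warnings;

# Illustrative data: plain hashes standing in for feature Stones.
my @features = (
    { Type => 'Exon',       Position => '1012..1099' },
    { Type => 'Intron',     Position => '1100..1208' },
    { Type => 'Polya_site', Position => '1989'       },
);

# Hypothetical helper: parse "N" or "N..M" into (start, end).
# Returns an empty list for anything more complex.
sub simple_range {
    my ($pos) = @_;
    return ($1, defined $2 ? $2 : $1) if $pos =~ /^(\d+)(?:\.\.(\d+))?$/;
    return;
}

# Hypothetical filter mimicking the documented -pos and -type semantics.
sub filter_features {
    my ($list, %arg) = @_;
    my $types  = exists $arg{-type} ? $arg{-type} : $arg{-types};
    my @wanted = ref($types) ? @$types : defined($types) ? ($types) : ();
    my %want;
    $want{ lc $_ } = 1 for @wanted;        # type matching ignores case
    my ($lo, $hi) = ref($arg{-pos}) ? @{ $arg{-pos} } : ($arg{-pos}, $arg{-pos});
    return grep {
        my ($start, $end) = simple_range($_->{Position});
        my $type_ok = !%want || $want{ lc $_->{Type} };
        my $pos_ok  = !defined $lo
                   || (defined $start && $start <= $hi && $end >= $lo);
        $type_ok && $pos_ok;
    } @$list;
}

my @hits = filter_features(\@features,
                           -type => ['exon', 'INTRON'],
                           -pos  => [1090, 1105]);
print "$_->{Type} $_->{Position}\n" for @hits;
```

With the sample data, the exon and the intron both overlap the range 1090 to 1105, while the polyadenylation site is excluded by both criteria.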
283
284 $seqObj = $entry->bioSeq;
285 Returns a Bio::Seq object from the Bioperl project. Dies with an error
286 message unless the Bio::Seq module is installed.
287
289 The tags returned by the parsing operation are taken from the NCBI
290 ASN.1 schema. For consistency, they are normalized so that the initial
291 letter is capitalized, and all subsequent letters are lowercase. This
292 section contains an abbreviated list of the most useful/common tags.
293 See "The NCBI Data Model", by James Ostell and Jonathan Kans in
294 "Bioinformatics: A Practical Guide to the Analysis of Genes and
295 Proteins" (Eds. A. Baxevanis and F. Ouellette), pp 121-144 for the full
296 listing.
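
The normalization rule described above (initial letter capitalized, all subsequent letters lowercase) can be expressed with two Perl built-ins. normalize_tag() is a hypothetical helper for illustration only, not a function exported by the module.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the normalization rule:
# first letter upper case, all remaining letters lower case.
sub normalize_tag {
    my ($tag) = @_;
    return ucfirst lc $tag;
}

print join(', ', map { normalize_tag($_) } qw(CDS mRNA polyA_site db_xref)), "\n";
# prints "Cds, Mrna, Polya_site, Db_xref"
```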
297
298 Top-Level Tags
299 These are tags that appear at the top level of the parsed Genbank
300 entry.
301
302 Accession
303 The accession number of this entry. Because of the vagaries of the
304 Genbank data model, an entry may have multiple accession numbers
305 (e.g. after a merging operation). Accession may therefore be a
306 multi-valued tag.
307
308 Example:
309
310 my $accessionNo = $s->Accession;
311
312 Authors
313 The list of authors, as they appear on the AUTHORS line of the
314 Genbank record. No attempt is made to parse them into individual
315 authors.
316
317 Basecount
318 The nucleotide basecount for the entry. It is presented as a
319 Boulder Stone with keys "a", "c", "t" and "g". Example:
320
321 my $A = $s->Basecount->A;
322 my $C = $s->Basecount->C;
323 my $G = $s->Basecount->G;
324 my $T = $s->Basecount->T;
325 print "GC content is ",($G+$C)/($A+$C+$G+$T),"\n";
326
327 Blob
328 The entire flatfile record as an unparsed chunk of text (a "blob").
329 This is a handy way of reassembling the record for human
330 inspection.
331
332 Comment
333 The COMMENT line from the Genbank record.
334
335 Definition
336 The DEFINITION line from the Genbank record, unmodified.
337
338 Features
        The FEATURES table. This is a complex stone object with multiple
        subtags. See "The Features Tag" for details.
341
342 Journal
343 The JOURNAL line from the Genbank record, unmodified.
344
345 Keywords
346 The KEYWORDS line from the Genbank record, unmodified. No attempt
347 is made to parse the keywords into separate values.
348
349 Example:
350
            my $keywords = $s->Keywords;
352
353 Locus
354 The LOCUS line from the Genbank record. It is not further parsed.
355
356 Medline, Nid
357 References to other database accession numbers.
358
359 Organism
360 The taxonomic name of the organism from which this entry was
361 derived. This line is taken from the Genbank entry unmodified. See
362 the NCBI data model documentation for an explanation of their
363 taxonomic syntax.
364
365 Reference
366 The REFERENCE line from the Genbank entry. There are often
367 multiple Reference lines. Example:
368
369 my @references = $s->Reference;
370
371 Sequence
372 The DNA or RNA sequence of the entry. This is presented as a
373 single lower-case string, with all base numbers and formatting
374 characters removed.
375
376 Source
        The entry's SOURCE field, which often gives clues on how the
        sequencing was performed.
379
380 Title
381 The TITLE field from the paper describing this entry, if any.
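
Because the Sequence tag is a single lowercase string, base composition can also be computed directly from it rather than from the Basecount tag. The sketch below does this for a short string. base_counts() is a hypothetical helper, and the sample string is the first 20 bases of the example record shown at the end of this document.

```perl
use strict;
use warnings;

# Hypothetical helper: count a, c, g and t in a sequence string,
# ignoring any other characters.
sub base_counts {
    my ($dna) = @_;
    my %count;
    $count{$_} = 0 for qw(a c g t);
    $count{$_}++ for grep { exists $count{$_} } split //, lc $dna;
    return %count;
}

my $dna   = 'aagcttttccaggcagtgcg';   # first 20 bases of the example record
my %n     = base_counts($dna);
my $total = $n{a} + $n{c} + $n{g} + $n{t};
printf "GC content is %.2f\n", ($n{g} + $n{c}) / $total if $total;
```

In real use, $dna would come from $s->Sequence on a parsed entry.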
382
383 The Features Tag
384 The Features tag points to a Stone record that contains multiple
385 subtags. Each subtag is the name of a feature which points, in turn,
386 to a Stone that describes the feature's location and other attributes.
    The full list of features is beyond the scope of this document, but
    the following are the features that are most often seen:
389
390 Cds a CDS
391 Intron an intron
392 Exon an exon
393 Gene a gene
394 Mrna an mRNA
395 Polya_site a putative polyadenylation signal
396 Repeat_unit a repetitive region
397 Source More information about the organism and cell
398 type the sequence was derived from
399 Satellite a microsatellite (dinucleotide repeat)
400
401 Each feature will contain one or more of the following subtags:
402
    Db_xref
404 A cross-reference to another database in the form
405 DB_NAME:accession_number. See the NCBI Web site for a description
406 of these cross references.
407
408 Evidence
409 The evidence for this feature, either "experimental" or
410 "predicted".
411
412 Gene
413 If the feature involves a gene, this will be the gene's name (or
        one of its names). This subtag is often seen in Gene and Cds
        features.
416
417 Example:
418
            foreach ($s->Features->Cds) {
                my $gene     = $_->Gene;
                my $position = $_->Position;
                print "Gene $gene ($position)\n";
            }
424
425 Map If the feature is mapped, this provides a map position, usually as
426 a cytogenetic band.
427
428 Note
        A grab bag for various text notes.
430
431 Number
432 When multiple features of this type occur, this field is used to
433 number them. Ordinarily this field is not needed because
434 Boulder::Genbank preserves the order of features.
435
436 Organism
437 If the feature is Source, this provides the source organism.
438
439 Position
        The position of this feature, usually expressed as a range
        (1970..1975).
442
443 Product
444 The protein product of the feature, if applicable, as a text
445 string.
446
447 Translation
448 The protein translation of the feature, if applicable.
449
SEE ALSO
    Boulder, Boulder::Blast
452
AUTHOR
    Lincoln Stein <lstein@cshl.org>.
455
456 Copyright (c) 1997-2000 Lincoln D. Stein
457
458 This library is free software; you can redistribute it and/or modify it
459 under the same terms as Perl itself. See DISCLAIMER.txt for
460 disclaimers of warranty.
461
463 The following is an excerpt from a moderately complex Genbank Stone.
464 The Sequence line and several other long lines have been truncated for
465 readability.
466
467 Authors=Spritz,R.A., Strunk,K., Surowy,C.S.O., Hoch,S., Barton,D.E. and Francke,U.
468 Authors=Spritz,R.A., Strunk,K., Surowy,C.S. and Mohrenweiser,H.W.
469 Locus=HUMRNP7011 2155 bp DNA PRI 03-JUL-1991
470 Accession=M57939
471 Accession=J04772
472 Accession=M57733
473 Keywords=ribonucleoprotein antigen.
474 Sequence=aagcttttccaggcagtgcgagatagaggagcgcttgagaaggcaggttttgcagcagacggcagtgacagcccag...
475 Definition=Human small nuclear ribonucleoprotein (U1-70K) gene, exon 10 and 11.
476 Journal=Nucleic Acids Res. 15, 10373-10391 (1987)
477 Journal=Genomics 8, 371-379 (1990)
478 Nid=g337441
479 Medline=88096573
480 Medline=91065657
481 Features={
482 Polya_site={
483 Evidence=experimental
484 Position=1989
485 Gene=U1-70K
486 }
487 Polya_site={
488 Position=1990
489 Gene=U1-70K
490 }
491 Polya_site={
492 Evidence=experimental
493 Position=1992
494 Gene=U1-70K
495 }
496 Polya_site={
497 Evidence=experimental
498 Position=1998
499 Gene=U1-70K
500 }
501 Source={
502 Organism=Homo sapiens
503 Db_xref=taxon:9606
504 Position=1..2155
505 Map=19q13.3
506 }
507 Cds={
508 Codon_start=1
509 Product=ribonucleoprotein antigen
510 Db_xref=PID:g337445
511 Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
512 Gene=U1-70K
513 Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPRDAPPPTR...
514 }
515 Cds={
516 Codon_start=1
517 Product=ribonucleoprotein antigen
518 Db_xref=PID:g337444
519 Evidence=experimental
520 Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
521 Gene=U1-70K
522 Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPR...
523 }
524 Polya_signal={
525 Position=1970..1975
526 Note=putative
527 Gene=U1-70K
528 }
529 Intron={
530 Evidence=experimental
531 Position=1100..1208
532 Gene=U1-70K
533 }
534 Intron={
535 Number=10
536 Evidence=experimental
537 Position=1100..1181
538 Gene=U1-70K
539 }
540 Intron={
541 Number=9
542 Evidence=experimental
543 Position=order(M57937:702..921,1..1011)
544 Note=2.1 kb gap
545 Gene=U1-70K
546 }
547 Intron={
548 Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1208)
549 Gene=U1-70K
550 }
551 Intron={
552 Evidence=experimental
553 Position=order(M57935:284..406,M57936:1..284,M57937:1..599, <1..>1208)
554 Note=first gap-0.14 kb, second gap-0.62 kb
555 Gene=U1-70K
556 }
557 Intron={
558 Number=8
559 Evidence=experimental
560 Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1181)
561 Note=first gap-0.14 kb, second gap-0.62 kb
562 Gene=U1-70K
563 }
564 Exon={
565 Number=10
566 Evidence=experimental
567 Position=1012..1099
568 Gene=U1-70K
569 }
570 Exon={
571 Number=11
572 Evidence=experimental
573 Position=1182..(1989.1998)
574 Gene=U1-70K
575 }
576 Exon={
577 Evidence=experimental
578 Position=1209..(1989.1998)
579 Gene=U1-70K
580 }
581 Mrna={
582 Product=ribonucleoprotein antigen
583 Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
584 Gene=U1-70K
585 }
586 Mrna={
587 Product=ribonucleoprotein antigen
588 Citation=[2]
589 Evidence=experimental
590 Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
591 Gene=U1-70K
592 }
593 Gene={
594 Position=join(M57928:207..719,M57929:1..562,M57930:1..577, ...
595 Gene=U1-70K
596 }
597 }
598 Reference=1 (sites)
599 Reference=2 (bases 1 to 2155)
600 =
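
To show how the nesting in the record above can be read, here is a minimal sketch that loads "Tag=value" lines and nested "Tag={ ... }" blocks into a Perl hash of arrays. It assumes one tag per line and a lone "=" as the end-of-record marker, as in the excerpt. parse_boulder() is a hypothetical helper and is not the parser used by Boulder::Stream.

```perl
use strict;
use warnings;

# Minimal sketch: read Boulder-style "Tag=value" lines with nested
# "Tag={ ... }" blocks into a hash of arrays (tags may repeat).
sub parse_boulder {
    my ($text) = @_;
    my $root  = {};
    my @stack = ($root);
    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;
        last if $line eq '=';                 # a lone "=" ends the record
        next unless length $line;
        if ($line eq '}') {                   # close the current block
            pop @stack if @stack > 1;
        }
        elsif ($line =~ /^([^=]+)=\{$/) {     # open a nested block
            my $child = {};
            push @{ $stack[-1]{$1} }, $child;
            push @stack, $child;
        }
        elsif ($line =~ /^([^=]+)=(.*)$/) {   # ordinary tag=value pair
            push @{ $stack[-1]{$1} }, $2;
        }
    }
    return $root;
}

# A shortened version of the record shown above.
my $record = parse_boulder(<<'END');
Accession=M57939
Accession=J04772
Features={
  Polya_site={
    Evidence=experimental
    Position=1989
  }
  Source={
    Organism=Homo sapiens
    Position=1..2155
  }
}
Reference=1 (sites)
=
END

print "Accessions: @{ $record->{Accession} }\n";
print "Source organism: $record->{Features}[0]{Source}[0]{Organism}[0]\n";
```

Every tag maps to an array reference because tags such as Accession and Polya_site can occur more than once at the same level.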

perl v5.30.1                      2020-01-29                Boulder::Genbank(3)