Boulder::Genbank(3)   User Contributed Perl Documentation  Boulder::Genbank(3)

NAME

       Boulder::Genbank - Fetch Genbank data records as parsed Boulder Stones

SYNOPSIS

         use Boulder::Genbank;

         # network access via Entrez
         $gb = Boulder::Genbank->newFh( qw(M57939 M28274 L36028) );

         while ($data = <$gb>) {
             print $data->Accession;

             @introns = $data->features->Intron;
             print "There are ",scalar(@introns)," introns.\n";
             $dna = $data->Sequence;
             print "The dna is ",length($dna)," bp long.\n";

             my @features = $data->features(-type=>[ qw(Exon Source Satellite) ],
                                            -pos=>[90,310] );
             foreach (@features) {
                print $_->Type,"\n";
                print $_->Position,"\n";
                print $_->Gene,"\n";
             }
         }

         # another syntax
         $gb = new Boulder::Genbank(-accessor=>'Entrez',
                                    -fetch => [qw/M57939 M28274 L36028/]);

         # local access via Yank
         $gb = new Boulder::Genbank(-accessor=>'Yank',
                                    -fetch=>[qw/M57939 M28274 L36028/]);
         while (my $s = $gb->get) {
            # etc.
         }

         # parse a file of Genbank records
         $gb = new Boulder::Genbank(-accessor=>'File',
                                    -fetch => '/usr/local/db/gbpri3.seq');
         while (my $s = $gb->get) {
            # etc.
         }

         # parse flatfile records yourself
         open (GB,"/usr/local/db/gbpri3.seq") or die "Can't open file: $!";
         local $/ = "//\n";   # each record ends with a line containing only "//"
         while (<GB>) {
            my $s = Boulder::Genbank->parse($_);
            # etc.
         }

DESCRIPTION

       Boulder::Genbank provides retrieval and parsing services for NCBI
       Genbank-format records.  It returns Genbank entries in Stone format,
       allowing easy access to the various fields and values.
       Boulder::Genbank is a descendant of Boulder::Stream, and provides a
       stream-like interface to a series of Stone objects.

       >> IMPORTANT NOTE <<

       As of January 2002, NCBI has changed their Batch Entrez interface.  I
       have modified Boulder::Genbank so as to use a "demo" interface, which
       fixes things, but this isn't guaranteed in the long run.

       I have written to NCBI, and they may fix this -- or they may not.

       >> IMPORTANT NOTE <<

       Access to Genbank is provided by three different accessors, which
       together give access to remote and local Genbank databases.  When you
       create a new Boulder::Genbank stream, you provide one of the three
       accessors, along with accessor-specific parameters that control what
       entries to fetch.  The three accessors are:

       Entrez
           This provides access to NetEntrez, accessing the most recent
           Genbank information directly from NCBI's Web site.  The parameters
           passed to this accessor are either a series of Genbank accession
           numbers, or an Entrez query (see
           http://www.ncbi.nlm.nih.gov/Entrez/linking.html).  If you provide a
           list of accession numbers, the stream will return a series of
           stones corresponding to the numbers.  Otherwise, if you provide an
           Entrez query, the entries returned will be in the order returned by
           Entrez.

       File
           This provides access to local Genbank entries by reading from a
           flat file (typically one of the .seq files downloadable from NCBI's
           Web site).  The stream will return a Stone corresponding to each of
           the entries in the file, starting from the top of the file and
           working downward.  The parameter in this case is the path to the
           local file.

       Yank
           This provides access to local Genbank entries using Will Fitzhugh's
           Yank program.  Yank provides fast indexed access to a Genbank flat
           file using the accession number as the key.  The parameter passed
           to the Yank accessor is a list of accession numbers.  Stones will
           be returned in the requested order.  By default the yank binary
           lives in /usr/local/bin/yank.  To support other locations, you may
           define the environment variable YANK to contain the full path.

       It is also possible to parse a single Genbank entry from a text string
       stored in a scalar variable, returning a Stone object.
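
       To parse entries from a string, the text first has to be cut into
       individual records.  The sketch below shows one way to do that in
       plain Perl, using the same "//" record terminator as the SYNOPSIS.
       The two abbreviated records are made up for illustration, and the
       call to Boulder::Genbank->parse is only mentioned in a comment so
       that the example runs without the module installed.

```perl
use strict;
use warnings;

# Two abbreviated, made-up flatfile records; real entries carry many more lines.
my $flatfile = <<'END';
LOCUS       DEMO1
ACCESSION   X00001
//
LOCUS       DEMO2
ACCESSION   X00002
//
END

# Read from the string as if it were a file, one "//"-terminated record at a time.
open my $fh, '<', \$flatfile or die "Cannot open in-memory file: $!";
my @records;
{
    local $/ = "//\n";
    @records = <$fh>;
}
close $fh;

for my $record (@records) {
    my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;
    print "Record with accession $accession\n";
    # With the module installed, each chunk could be handed to
    # Boulder::Genbank->parse($record), as in the SYNOPSIS.
}
```

       Localizing $/ inside a block keeps the changed record separator from
       affecting any other input the program reads.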

   Boulder::Genbank methods
       This section lists the public methods that the Boulder::Genbank class
       makes available.

       new()
              # Network fetch via Entrez, with accession numbers
              $gb=new Boulder::Genbank(-accessor  =>  'Entrez',
                                       -fetch     =>  [qw/M57939 M28274 L36028/]);

              # Same, but shorter and uses -> operator
              $gb = Boulder::Genbank->new(qw(M57939 M28274 L36028));

              # Network fetch via Entrez, with a query
              $query = 'Homo sapiens[Organism] AND EST[Keyword]';
              $gb=new Boulder::Genbank(-accessor  =>  'Entrez',
                                       -fetch     =>  $query);

              # Local fetch via Yank, with accession numbers
              $gb=new Boulder::Genbank(-accessor  =>  'Yank',
                                       -fetch     =>  [qw/M57939 M28274 L36028/]);

              # Local fetch via File
              $gb=new Boulder::Genbank(-accessor  =>  'File',
                                       -fetch     =>  '/usr/local/genbank/gbpri3.seq');

           The new() method creates a new Boulder::Genbank stream on the
           accessor provided.  The three possible accessors are Entrez, Yank
           and File.  If successful, the method returns the stream object.
           Otherwise it returns undef.

           new() takes the following arguments:

                   -accessor       Name of the accessor to use
                   -fetch          Parameters to pass to the accessor
                   -proxy          URL of an HTTP proxy, needed when the
                                    Entrez accessor must cross a firewall.

           Specify the accessor to use with the -accessor argument.  If not
           specified, it defaults to Entrez.

           -fetch is an accessor-specific argument.  The possibilities are:

           For Entrez, the -fetch argument may point to a scalar, in which
           case it is interpreted as an Entrez query string.  See
           http://www.ncbi.nlm.nih.gov/Entrez/linking.html for a description
           of the query syntax.  Alternatively, -fetch may point to an array
           reference, in which case it is interpreted as a list of accession
           numbers to retrieve.  If -fetch points to a hash, it is interpreted
           as extended information.  See "Extended Entrez Parameters" below.

           For Yank, the -fetch argument must point to an array reference
           containing the accession numbers to retrieve.

           For File, the -fetch argument must point to a string-valued scalar,
           which will be interpreted as the path to the file to read Genbank
           entries from.
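
           The -fetch forms differ only in the kind of value passed.  As a
           rough illustration, the sketch below uses a hypothetical helper,
           classify_fetch (not part of Boulder::Genbank), to show how Perl's
           ref() tells a plain scalar, an array reference and a hash
           reference apart.  How the module dispatches internally is an
           assumption here.

```perl
use strict;
use warnings;

# Hypothetical helper: report how a -fetch value would be interpreted,
# following the rules described above. Not part of Boulder::Genbank.
sub classify_fetch {
    my ($fetch) = @_;
    my $type = ref $fetch;
    return 'Entrez query string or file path' if $type eq '';
    return 'list of accession numbers'        if $type eq 'ARRAY';
    return 'extended Entrez parameters'       if $type eq 'HASH';
    die "Unsupported -fetch value of type $type\n";
}

print classify_fetch('Homo sapiens[Organism] AND EST[Keyword]'), "\n";
print classify_fetch([qw(M57939 M28274 L36028)]), "\n";
print classify_fetch({ -query => 'EST[Keyword]', -db => 'n' }), "\n";
```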

           For Entrez (and Entrez only) Boulder::Genbank allows you to use a
           shortcut syntax in which you provide new() with a list of accession
           numbers:

             $gb = new Boulder::Genbank('M57939','M28274','L36028');

       newFh()
           This works like new(), but returns a filehandle.  To recover each
           Genbank record, read from the filehandle with the <> operator:

             $fh = Boulder::Genbank->newFh('M57939','M28274','L36028');
             while ($record = <$fh>) {
                print $record->asString;
             }

       get()
           The get() method is inherited from Boulder::Stream, and simply
           returns the next parsed Genbank Stone, or undef if there is nothing
           more to fetch.  It has the same semantics as the parent class,
           including the ability to restrict access to certain top-level tags.

           The object returned is a Stone::GB_Sequence object, which is a
           descendant of Stone.

       put()
           The put() method is inherited from the parent Boulder::Stream
           class, and will write the passed Stone to standard output in
           Boulder format.  This means that it is currently not possible to
           write a Boulder::Genbank object back into Genbank flatfile form.

   Extended Entrez Parameters
       The Entrez accessor recognizes extended parameters that let you
       customize the search.  Instead of passing a query string scalar or a
       list of accession numbers as the -fetch argument, pass a hash
       reference.  The hashref should contain one or more of the following
       keys:

       -query
           The Entrez query to process.

       -accession
           The list of accession numbers to fetch, as an array ref.

       -db The database to search.  This is a single-letter database code
           selected from the following list:

             m  MEDLINE
             p  Protein
             n  Nucleotide
             s  Popset

       -proxy
           An HTTP proxy to use.  For example:

              -proxy => 'http://www.firewall.com:9000'

           If you think you need this, get the correct URL from your system
           administrator.

       As an example, here's how to search for ESTs from Oryza sativa that
       have been entered or modified since 1999.

         my $gb = new Boulder::Genbank( -accessor => 'Entrez',
                                        -fetch    => {
                                            -query => 'Oryza sativa[Organism] AND EST[Keyword] AND 1999[MDAT]',
                                            -db    => 'n'
                                        });
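
       A malformed parameter hash is easier to catch before the request is
       sent.  The following sketch checks a hash against the keys and
       database codes listed above.  The helper check_entrez_params is
       hypothetical and not part of Boulder::Genbank, and the rule that
       either -query or -accession must be present is an assumption made
       for the example.

```perl
use strict;
use warnings;

# Database codes as listed above.
my %DB_NAME = (m => 'MEDLINE', p => 'Protein', n => 'Nucleotide', s => 'Popset');

# Hypothetical helper, not part of Boulder::Genbank: check an extended
# parameter hash before handing it to new() as the -fetch argument.
# Returns a list of problems, which is empty when the hash looks usable.
sub check_entrez_params {
    my ($params) = @_;
    my @problems;
    push @problems, 'need -query or -accession'
        unless exists $params->{-query} || exists $params->{-accession};
    push @problems, '-accession must be an array ref'
        if exists $params->{-accession} && ref $params->{-accession} ne 'ARRAY';
    push @problems, "unknown -db code '$params->{-db}'"
        if exists $params->{-db} && !exists $DB_NAME{ $params->{-db} };
    return @problems;
}

my @ok  = check_entrez_params({ -query => 'Oryza sativa[Organism] AND EST[Keyword]',
                                -db    => 'n' });
my @bad = check_entrez_params({ -accession => 'M57939', -db => 'x' });

print "first hash: ", scalar(@ok), " problem(s)\n";
print "second hash: $_\n" for @bad;
```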

METHODS DEFINED BY THE GENBANK STONE OBJECT

       Each record returned from the Boulder::Genbank stream defines a set of
       methods that correspond to features and other fields in the Genbank
       flat file record.  Stone::GB_Sequence gives the full details, but they
       are listed for reference here:

   $length = $entry->length
       Get the length of the sequence.

   $start = $entry->start
       Get the start position of the sequence, currently always "1".

   $end = $entry->end
       Get the end position of the sequence, currently always the same as the
       length.

   @feature_list = $entry->features(-pos=>[50,450],-type=>['CDS','Exon'])
       features() will search the entry feature list for those features that
       meet certain criteria.  The criteria are specified using the -pos
       and/or -type argument names, as shown below.

       -pos
           Provide a position or range of positions which the feature must
           overlap.  A single position is specified in this way:

              -pos => 1500;         # feature must overlap position 1500

           or a range of positions in this way:

              -pos => [1000,1500];  # 1000 to 1500 inclusive

           If no criteria are provided, then features() returns all the
           features, and is equivalent to calling the Features() accessor.

       -type, -types
           Filter the list of features by type or a set of types.  Matches are
           case-insensitive, so "exon", "Exon" and "EXON" are all equivalent.
           You may call with a single type as in:

              -type => 'Exon'

           or with a list of types, as in

              -types => ['Exon','CDS']

           The names "-type" and "-types" can be used interchangeably.
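
       The two criteria combine as an overlap test plus a case-insensitive
       type match.  The sketch below reproduces that logic on plain hashes
       rather than Stone objects, so it runs without the module.  The
       function filter_features and the hash layout are hypothetical, and
       the real features() method may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical, simplified features: plain hashes with a type and a start..end span.
my @features = (
    { type => 'Exon',   start => 1012, end => 1099 },
    { type => 'Intron', start => 1100, end => 1181 },
    { type => 'Exon',   start => 1182, end => 1989 },
);

# Keep features whose type matches one of @$types (case-insensitively) and
# whose span overlaps the inclusive range $from..$to.
sub filter_features {
    my ($features, $types, $from, $to) = @_;
    my %wanted = map { ( lc($_) => 1 ) } @$types;
    return grep {
        $wanted{ lc $_->{type} } && $_->{start} <= $to && $_->{end} >= $from
    } @$features;
}

my @hits = filter_features(\@features, ['EXON', 'cds'], 1000, 1500);
print "$_->{type} $_->{start}..$_->{end}\n" for @hits;
```

       Both exons are reported: the second one starts inside the requested
       range even though it ends beyond it, which is what "overlap" means
       here.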

   $seqObj = $entry->bioSeq;
       Returns a Bio::Seq object from the Bioperl project.  Dies with an error
       message unless the Bio::Seq module is installed.

OUTPUT TAGS

       The tags returned by the parsing operation are taken from the NCBI
       ASN.1 schema.  For consistency, they are normalized so that the initial
       letter is capitalized, and all subsequent letters are lowercase.  This
       section contains an abbreviated list of the most useful/common tags.
       See "The NCBI Data Model", by James Ostell and Jonathan Kans in
       "Bioinformatics: A Practical Guide to the Analysis of Genes and
       Proteins" (Eds. A. Baxevanis and F. Ouellette), pp 121-144 for the full
       listing.
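
       The capitalization rule is easy to reproduce when you need to guess
       the tag name for a flatfile keyword.  This is a minimal sketch of the
       rule as stated; normalize_tag is a hypothetical helper, and whether
       the module applies exactly this transformation is an assumption.

```perl
use strict;
use warnings;

# Capitalize the first letter and lowercase the rest, as described above.
sub normalize_tag {
    my ($tag) = @_;
    return ucfirst lc $tag;
}

print normalize_tag($_), "\n" for qw(CDS mRNA polyA_site db_xref);
```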

   Top-Level Tags
       These are tags that appear at the top level of the parsed Genbank
       entry.

       Accession
           The accession number of this entry.  Because of the vagaries of the
           Genbank data model, an entry may have multiple accession numbers
           (e.g. after a merging operation).  Accession may therefore be a
           multi-valued tag.

           Example:

                 my $accessionNo = $s->Accession;

       Authors
           The list of authors, as they appear on the AUTHORS line of the
           Genbank record.  No attempt is made to parse them into individual
           authors.

       Basecount
           The nucleotide basecount for the entry.  It is presented as a
           Boulder Stone with keys "a", "c", "t" and "g".  Example:

                my $A = $s->Basecount->A;
                my $C = $s->Basecount->C;
                my $G = $s->Basecount->G;
                my $T = $s->Basecount->T;
                print "GC content is ",($G+$C)/($A+$C+$G+$T),"\n";

       Blob
           The entire flatfile record as an unparsed chunk of text (a "blob").
           This is a handy way of reassembling the record for human
           inspection.

       Comment
           The COMMENT line from the Genbank record.

       Definition
           The DEFINITION line from the Genbank record, unmodified.

       Features
           The FEATURES table.  This is a complex stone object with multiple
           subtags.  See "The Features Tag" for details.

       Journal
           The JOURNAL line from the Genbank record, unmodified.

       Keywords
           The KEYWORDS line from the Genbank record, unmodified.  No attempt
           is made to parse the keywords into separate values.

           Example:

               my $keywords = $s->Keywords;

       Locus
           The LOCUS line from the Genbank record.  It is not further parsed.

       Medline, Nid
           References to other database accession numbers.

       Organism
           The taxonomic name of the organism from which this entry was
           derived. This line is taken from the Genbank entry unmodified.  See
           the NCBI data model documentation for an explanation of their
           taxonomic syntax.

       Reference
           The REFERENCE line from the Genbank entry.  There are often
           multiple Reference lines.  Example:

             my @references = $s->Reference;

       Sequence
           The DNA or RNA sequence of the entry.  This is presented as a
           single lower-case string, with all base numbers and formatting
           characters removed.

       Source
           The entry's SOURCE field, often giving clues on how the sequencing
           was performed.

       Title
           The TITLE field from the paper describing this entry, if any.
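
       To make the Sequence and Basecount descriptions above concrete, the
       sketch below cleans a made-up fragment laid out like a flatfile
       ORIGIN block and counts its bases.  It only illustrates the kind of
       cleanup described for the Sequence tag and is not the module's own
       parsing code.

```perl
use strict;
use warnings;

# A made-up fragment in the layout of a flatfile ORIGIN block.
my $origin = <<'END';
        1 AAGCTTTTCC aggcagtgcg agatagagga
       31 gcgcttgaga
END

# Drop base numbers and whitespace, then keep the lowercased bases.
(my $sequence = lc $origin) =~ s/[\d\s]+//g;

# Count each base, similar in spirit to the Basecount tag.
my %count;
$count{$_}++ for split //, $sequence;

print "$sequence\n";
print "length: ", length($sequence), "\n";
print "g: $count{g}, c: $count{c}\n";
```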

   The Features Tag
       The Features tag points to a Stone record that contains multiple
       subtags.  Each subtag is the name of a feature which points, in turn,
       to a Stone that describes the feature's location and other attributes.
       The full list of features is beyond the scope of this document, but
       the following are the features that are most often seen:

               Cds             a CDS
               Intron          an intron
               Exon            an exon
               Gene            a gene
               Mrna            an mRNA
               Polya_site      a putative polyadenylation signal
               Repeat_unit     a repetitive region
               Source          More information about the organism and cell
                               type the sequence was derived from
               Satellite       a microsatellite (dinucleotide repeat)

       Each feature will contain one or more of the following subtags:

       Db_xref
           A cross-reference to another database in the form
           DB_NAME:accession_number.  See the NCBI Web site for a description
           of these cross references.

       Evidence
           The evidence for this feature, either "experimental" or
           "predicted".

       Gene
           If the feature involves a gene, this will be the gene's name (or
           one of its names).  This subtag is often seen in Gene and Cds
           features.

           Example:

                   foreach ($s->Features->Cds) {
                      my $gene = $_->Gene;
                      my $position = $_->Position;
                      print "Gene $gene ($position)\n";
                   }

       Map If the feature is mapped, this provides a map position, usually as
           a cytogenetic band.

       Note
           A grab bag for various text notes.

       Number
           When multiple features of this type occur, this field is used to
           number them.  Ordinarily this field is not needed because
           Boulder::Genbank preserves the order of features.

       Organism
           If the feature is Source, this provides the source organism.

       Position
           The position of this feature, usually expressed as a range
           (1970..1975).

       Product
           The protein product of the feature, if applicable, as a text
           string.

       Translation
           The protein translation of the feature, if applicable.
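
       The Position subtag is returned as a string, so numeric work needs a
       small parsing step.  The sketch below handles only the two simplest
       forms, a single base and a plain range.  The helper simple_span is
       hypothetical, and compound locations such as join(...) or order(...)
       are deliberately left unparsed.

```perl
use strict;
use warnings;

# Hypothetical helper: return (start, end) for the two simplest position
# forms, a single base ("1989") or a plain range ("1970..1975").
# Anything else (join(...), order(...), fuzzy ends) yields an empty list.
sub simple_span {
    my ($position) = @_;
    return ($1, $1) if $position =~ /^(\d+)$/;
    return ($1, $2) if $position =~ /^(\d+)\.\.(\d+)$/;
    return;
}

for my $position ('1989', '1970..1975', '1182..(1989.1998)') {
    my @span = simple_span($position);
    print "$position: ",
          (@span ? "start $span[0], end $span[1]" : 'not a simple span'), "\n";
}
```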

SEE ALSO

       Boulder, Boulder::Blast

AUTHOR

       Lincoln Stein <lstein@cshl.org>.

       Copyright (c) 1997-2000 Lincoln D. Stein

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.  See DISCLAIMER.txt for
       disclaimers of warranty.

EXAMPLE GENBANK OBJECT

       The following is an excerpt from a moderately complex Genbank Stone.
       The Sequence line and several other long lines have been truncated for
       readability.

        Authors=Spritz,R.A., Strunk,K., Surowy,C.S.O., Hoch,S., Barton,D.E. and Francke,U.
        Authors=Spritz,R.A., Strunk,K., Surowy,C.S. and Mohrenweiser,H.W.
        Locus=HUMRNP7011   2155 bp    DNA             PRI       03-JUL-1991
        Accession=M57939
        Accession=J04772
        Accession=M57733
        Keywords=ribonucleoprotein antigen.
        Sequence=aagcttttccaggcagtgcgagatagaggagcgcttgagaaggcaggttttgcagcagacggcagtgacagcccag...
        Definition=Human small nuclear ribonucleoprotein (U1-70K) gene, exon 10 and 11.
        Journal=Nucleic Acids Res. 15, 10373-10391 (1987)
        Journal=Genomics 8, 371-379 (1990)
        Nid=g337441
        Medline=88096573
        Medline=91065657
        Features={
          Polya_site={
            Evidence=experimental
            Position=1989
            Gene=U1-70K
          }
          Polya_site={
            Position=1990
            Gene=U1-70K
          }
          Polya_site={
            Evidence=experimental
            Position=1992
            Gene=U1-70K
          }
          Polya_site={
            Evidence=experimental
            Position=1998
            Gene=U1-70K
          }
          Source={
            Organism=Homo sapiens
            Db_xref=taxon:9606
            Position=1..2155
            Map=19q13.3
          }
          Cds={
            Codon_start=1
            Product=ribonucleoprotein antigen
            Db_xref=PID:g337445
            Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
            Gene=U1-70K
            Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPRDAPPPTR...
          }
          Cds={
            Codon_start=1
            Product=ribonucleoprotein antigen
            Db_xref=PID:g337444
            Evidence=experimental
            Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
            Gene=U1-70K
            Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPR...
          }
          Polya_signal={
            Position=1970..1975
            Note=putative
            Gene=U1-70K
          }
          Intron={
            Evidence=experimental
            Position=1100..1208
            Gene=U1-70K
          }
          Intron={
            Number=10
            Evidence=experimental
            Position=1100..1181
            Gene=U1-70K
          }
          Intron={
            Number=9
            Evidence=experimental
            Position=order(M57937:702..921,1..1011)
            Note=2.1 kb gap
            Gene=U1-70K
          }
          Intron={
            Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1208)
            Gene=U1-70K
          }
          Intron={
            Evidence=experimental
            Position=order(M57935:284..406,M57936:1..284,M57937:1..599, <1..>1208)
            Note=first gap-0.14 kb, second gap-0.62 kb
            Gene=U1-70K
          }
          Intron={
            Number=8
            Evidence=experimental
            Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1181)
            Note=first gap-0.14 kb, second gap-0.62 kb
            Gene=U1-70K
          }
          Exon={
            Number=10
            Evidence=experimental
            Position=1012..1099
            Gene=U1-70K
          }
          Exon={
            Number=11
            Evidence=experimental
            Position=1182..(1989.1998)
            Gene=U1-70K
          }
          Exon={
            Evidence=experimental
            Position=1209..(1989.1998)
            Gene=U1-70K
          }
          Mrna={
            Product=ribonucleoprotein antigen
            Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
            Gene=U1-70K
          }
          Mrna={
            Product=ribonucleoprotein antigen
            Citation=[2]
            Evidence=experimental
            Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
            Gene=U1-70K
          }
          Gene={
            Position=join(M57928:207..719,M57929:1..562,M57930:1..577, ...
            Gene=U1-70K
          }
        }
        Reference=1  (sites)
        Reference=2  (bases 1 to 2155)
        =
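
       The layout above is regular enough to read with a few lines of Perl,
       which can help when inspecting saved output without the Boulder
       modules at hand.  The reader below is a hypothetical sketch based
       only on what the excerpt shows: Tag=Value lines, nested records
       opened with "Tag={" and closed with "}", and a lone "=" ending the
       record.  It ignores any escaping rules the real Boulder format may
       define.

```perl
use strict;
use warnings;

# Hypothetical reader for the Tag=Value layout shown above. Nested records
# open with "Tag={" and close with "}"; repeated tags accumulate in arrays.
sub read_record {
    my ($lines) = @_;
    my %record;
    while (defined(my $line = shift @$lines)) {
        $line =~ s/^\s+|\s+$//g;                  # trim indentation
        last if $line eq '}' || $line eq '=';     # end of this (sub)record
        my ($tag, $value) = split /=/, $line, 2;
        next unless defined $value;               # skip blank or odd lines
        $value = read_record($lines) if $value eq '{';
        push @{ $record{$tag} }, $value;
    }
    return \%record;
}

my @lines = split /\n/, <<'END';
Accession=M57939
Accession=J04772
Features={
  Exon={
    Number=10
    Position=1012..1099
  }
  Exon={
    Number=11
    Position=1182..(1989.1998)
  }
}
=
END

my $record = read_record(\@lines);
print "Accessions: @{ $record->{Accession} }\n";
for my $exon (@{ $record->{Features}[0]{Exon} }) {
    print "Exon $exon->{Number}[0] at $exon->{Position}[0]\n";
}
```

       Every tag maps to an array reference, even when it occurs once, so
       that multi-valued tags such as Accession and single-valued ones are
       accessed the same way.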

POD ERRORS

       Hey! The above document had some coding errors, which are explained
       below:

       Around line 342:
           You forgot a '=back' before '=head2'

       Around line 347:
           =back without =over

perl v5.34.0                      2021-07-22               Boulder::Genbank(3)