GO::AnnotationProvider::AnnotationParser(3)    User Contributed Perl Documentation    GO::AnnotationProvider::AnnotationParser(3)

6 GO::AnnotationProvider::AnnotationParser - parses a gene annotation
7 file
8
GO::AnnotationProvider::AnnotationParser - reads a Gene Ontology gene
associations file, and provides methods by which to retrieve the GO
annotations for an annotated entity. Note that it is case insensitive,
with some caveats - see documentation below.
14
    use GO::AnnotationProvider::AnnotationParser;

    my $annotationParser = GO::AnnotationProvider::AnnotationParser->new(annotationFile => "data/gene_association.sgd");

    my $geneName = "AAT2";

    # goIdsByName returns a reference to an array, so dereference it before joining
    print "GO associations for gene: ", join (" ", @{$annotationParser->goIdsByName(name   => $geneName,
                                                                                      aspect => 'P')}), "\n";

    print "Database ID for gene: ", $annotationParser->databaseIdByName($geneName), "\n";

    print "Database name: ", $annotationParser->databaseName(), "\n";

    print "Standard name for gene: ", $annotationParser->standardNameByName($geneName), "\n";

    my @geneNames = $annotationParser->allStandardNames();

    # print the first eleven standard names
    foreach my $i (0..10) {
        print "$geneNames[$i]\n";
    }
37
39 GO::AnnotationProvider::AnnotationParser is a concrete subclass of
40 GO::AnnotationProvider, and creates a data structure mapping gene names
41 to GO annotations by parsing a file of annotations provided by the Gene
42 Ontology Consortium.
43
This package provides object methods for retrieving GO annotations that
have been parsed from a 'gene associations' file, provided by the Gene
Ontology Consortium. The format for the file is:
47
48 Lines beginning with a '!' character are comment lines.
49
    Column  Cardinality  Contents
    ------  -----------  -------------------------------------------------------------
    0       1            Database abbreviation for the source of annotation (e.g. SGD)
    1       1            Database identifier of the annotated entity
    2       1            Standard name of the annotated entity
    3       0,1          NOT (if a gene is specifically NOT annotated to the term)
    4       1            GOID of the annotation
    5       1,n          Reference(s) for the annotation
    6       1            Evidence code for the annotation
    7       0,n          With or From (a bit mysterious)
    8       1            Aspect of the Annotation (C, F, P)
    9       0,1          Name of the product being annotated
    10      0,n          Alias(es) of the annotated product
    11      1            Type of annotated entity (one of gene, transcript, protein)
    12      1,2          Taxonomic id of the organism encoding and/or using the product
    13      1            Date of annotation YYYYMMDD
    14      1            Assigned_by : The database which made the annotation
67
Columns are separated by tabs. For those entries with a cardinality
greater than 1, multiple entries are pipe (|) delimited.
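
The column layout above can be illustrated with a small sketch that splits one line on tabs and then splits the multi-valued columns on pipes. This is not the module's own parsing code. The hash keys, the choice of which columns to treat as multi-valued, and the sample line are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Key names follow the table above; the names themselves are illustrative
# and are not taken from the module's source.
my @COLUMNS = qw(database databaseId standardName not goid references
                 evidence with aspect productName aliases type taxon
                 date assignedBy);

# Columns whose cardinality may exceed 1, and so may be pipe delimited
my %MULTI = map { $_ => 1 } qw(references with aliases taxon);

sub parse_association_line {
    my ($line) = @_;

    chomp $line;

    return if $line =~ /^!/;    # comment line

    # the -1 limit keeps trailing empty columns
    my @fields = split /\t/, $line, -1;

    my %record;

    for my $i (0 .. $#COLUMNS) {
        my $value = defined $fields[$i] ? $fields[$i] : '';

        $record{ $COLUMNS[$i] } = $MULTI{ $COLUMNS[$i] }
            ? [ split /\|/, $value ]
            : $value;
    }

    return \%record;
}

# A made-up line, for illustration only
my $line = join "\t", 'SGD', 'S0000001', 'AAT2', '', 'GO:0000001',
                      'REF:1|REF:2', 'IMP', '', 'P', 'hypothetical product',
                      'ALIAS1|ALIAS2', 'gene', 'taxon:0', '20080513', 'SGD';

my $record = parse_association_line($line);

print "$record->{standardName} aliases: @{ $record->{aliases} }\n";
```

Multi-valued columns always come back as array references, even when empty, so callers do not need to special-case a single value.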
70
71 Further details can be found at:
72
73 http://www.geneontology.org/doc/GO.annotation.html#file
74
75 The following assumptions about the file are made (and should be true):
76
77 1. All aliases appear for all entries of a given annotated product
78 2. The database identifiers are unique, in that two different
79 entities cannot have the same database id.
80
82 Also see the TODO list in the parent, GO::AnnotationProvider.
83
84 1. Add in methods that will allow retrieval of evidence codes with
85 the annotations for a particular entity.
86
87 2. Add in methods that return all the annotated entities for a
88 particular GOID.
89
90 3. Add in the ability to request only annotations either including
91 or excluding particular evidence codes. Such evidence codes
92 could be provided as an anonymous array as the value of a named
93 argument.
94
95 4. Same as number 3, except allow the retrieval of annotated
96 entities for a particular GOID, based on inclusion or exclusion
97 of certain evidence codes.
98
99 These first four items will require a reworking of how data are
100 stored on the backend, and thus the parsing code itself, though it
101 should not affect any of the already existing API.
102
5. Rather than 'use'ing Storable, 'require' it only at the point of
   use, so that AnnotationParser can still be used in the absence of
   Storable, just without those functions that need it.
107
6. Extend the ValidateFile class method to check that an entity is
   never annotated to the same node twice, with the same evidence and
   the same reference.
111
7. An additional checker, that uses an AnnotationProvider in
   conjunction with an OntologyProvider, would be useful, to check
   that some of the annotations themselves are valid, i.e. that no
   entities are annotated to the 'unknown' node in a particular
   aspect, and also to another node within that same aspect. Can
   annotations be redundant? That is, if an entity is annotated to a
   node, and an ancestor of the node, is that annotation redundant?
   Does it depend on the evidence codes and references? Or are such
   annotations reinforcing? These things are useful to consider when
   formulating the confidence which can be attributed to an
   annotation.
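
Item 5 in the list above describes loading Storable lazily. A minimal sketch of that pattern follows. The helper names are hypothetical and do not exist in the module, and `dclone` is used only as an example of a function that needs Storable.

```perl
use strict;
use warnings;

# Hypothetical helper: loads Storable on first use, and dies with a
# readable message if the module cannot be loaded.
sub require_storable {
    eval { require Storable; 1 }
        or die "This function needs the Storable module, which could not be loaded: $@";

    return 1;
}

# Hypothetical example of a function that needs Storable. Nothing is
# loaded until the function is actually called.
sub deep_copy {
    my ($data) = @_;

    require_storable();

    return Storable::dclone($data);
}
```

Code that never calls `deep_copy` runs without Storable being installed, which is the behaviour the TODO item asks for.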
123
125 Usage
126 This class method simply prints out a usage statement, along with an
127 error message, if one was passed in.
128
129 Usage :
130
131 GO::AnnotationProvider::AnnotationParser->Usage();
132
133 ValidateFile
134 This class method reads an annotation file, and returns a reference to
135 an array of errors that are present within the file. The errors are
136 simply strings, each beginning with "Line $lineNo : " where $lineNo is
137 the number of the line in the file where the error was found.
138
139 Usage:
140
141 my $errorsRef = GO::AnnotationProvider::AnnotationParser->ValidateFile(annotationFile => $file);
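
Because each error string starts with the documented "Line $lineNo : " prefix, the returned array can be regrouped by line number. The sketch below assumes only that prefix. The helper name `errors_by_line` is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Groups error strings by line number. Anything that does not carry the
# documented "Line $lineNo : " prefix is collected under the key 'unknown'.
sub errors_by_line {
    my ($errorsRef) = @_;

    my %byLine;

    foreach my $error (@{$errorsRef}) {
        if ($error =~ /^Line\s+(\d+)\s*:\s*(.*)$/s) {
            push @{ $byLine{$1} }, $2;
        }
        else {
            push @{ $byLine{unknown} }, $error;
        }
    }

    return \%byLine;
}
```

The reference returned by ValidateFile can be passed straight in, e.g. errors_by_line($errorsRef), to report all problems on a given line together.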
142
144 new
145 This is the constructor for an AnnotationParser object.
146
The constructor expects one of two arguments, either an 'annotationFile'
argument, or an 'objectFile' argument. When instantiated with an
annotationFile argument, it expects it to correspond to an annotation
file created by one of the GO consortium members, according to their
file format. When instantiated with an objectFile argument, it expects
to open a previously created annotationParser object that has been
serialized to disk (see the serializeToDisk method).
154
155 Usage:
156
157 my $annotationParser = GO::AnnotationProvider::AnnotationParser->new(annotationFile => $file);
158
159 my $annotationParser = GO::AnnotationProvider::AnnotationParser->new(objectFile => $file);
160
Because an annotated entity may be referred to by many names, some of
which are non-unique, there exists a set of methods for determining
whether a name is ambiguous, and to what database identifiers such
ambiguous names may refer.
167
Note that the AnnotationParser is now case insensitive, but with some
caveats. For instance, you can use 'cdc6' to retrieve data for CDC6.
However, if one gene has been referred to as abc1, and another
referred to as ABC1, then these are treated as different, and
unambiguous. However, the text 'Abc1' would be considered ambiguous,
because it could refer to either. On the other hand, if a single gene
is referred to as XYZ1 and xyz1, and no other genes have that name (in
any casing), then Xyz1 would still be considered unambiguous.
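
The casing rules above can be summarized in a short sketch. It is not the module's implementation. The two lookup hashes and the function names are assumptions chosen to reproduce the documented behaviour. An exact-case match that points at a single entity wins, and otherwise every casing of the name is considered.

```perl
use strict;
use warnings;

# %idsForName   : name exactly as seen in the file => { databaseId => 1, ... }
# %idsForUcName : upper-cased name                 => { databaseId => 1, ... }
my (%idsForName, %idsForUcName);

sub record_name {
    my ($name, $databaseId) = @_;

    $idsForName{$name}{$databaseId}      = 1;
    $idsForUcName{uc $name}{$databaseId} = 1;

    return;
}

# Returns the database ids that a name could refer to
sub candidate_ids {
    my ($name) = @_;

    # an exact-case match that identifies a single entity is unambiguous
    if (exists $idsForName{$name} && scalar(keys %{ $idsForName{$name} }) == 1) {
        return keys %{ $idsForName{$name} };
    }

    # otherwise fall back to every casing that has been seen
    return sort keys %{ $idsForUcName{uc $name} || {} };
}

sub name_is_ambiguous {
    my ($name) = @_;

    my @ids = candidate_ids($name);

    return @ids > 1 ? 1 : 0;
}
```

With abc1 and ABC1 recorded for two different entities, both exact spellings resolve to one id each, while 'Abc1' yields two candidates and is reported as ambiguous.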
176
177 nameIsAmbiguous
178 This public method returns a boolean to indicate whether a name is
179 ambiguous, i.e. whether the name might map to more than one entity (and
180 therefore more than one databaseId).
181
182 NB: API change:
183
nameIsAmbiguous is now case insensitive - that is, if a name is used
twice with different casing, it will be treated as ambiguous. Previous
versions would not have treated these as ambiguous. If, however, a name
is provided in a casing that was encountered only once, then it will be
treated as unambiguous. This is the price of wanting a case insensitive
annotation parser...
191
192 Usage:
193
    if ($annotationParser->nameIsAmbiguous($name)){

        # do something useful....or not....

    }
199
200 databaseIdsForAmbiguousName
201 This public method returns an array of database identifiers for an
202 ambiguous name. If the name is not ambiguous, an empty list will be
203 returned.
204
205 NB: API change:
206
databaseIdsForAmbiguousName is now case insensitive - that is, if a
name is used twice with different casing, it will be treated as
ambiguous. Previous versions would not have treated these as ambiguous.
However, if the name provided exactly matches the casing of a name that
appeared only once with that casing, then it is treated as unambiguous.
This is the price of wanting a case insensitive annotation parser...
214
215 Usage:
216
217 my @databaseIds = $annotationParser->databaseIdsForAmbiguousName($name);
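
Since the by-name lookups die on ambiguous names, callers typically check for ambiguity first. The sketch below combines nameIsAmbiguous, databaseIdsForAmbiguousName and databaseIdByName, all documented in this page, into one helper. The helper name `database_ids_for_name` is hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Returns every database id that $name could refer to, without dying on
# ambiguous names. $parser is expected to be an AnnotationParser object
# (or anything else providing the same three methods).
sub database_ids_for_name {
    my ($parser, $name) = @_;

    if ($parser->nameIsAmbiguous($name)) {
        return $parser->databaseIdsForAmbiguousName($name);
    }

    # databaseIdByName returns undef for a name that maps to nothing
    my $databaseId = $parser->databaseIdByName($name);

    return defined $databaseId ? ($databaseId) : ();
}
```

The result is always a list: empty for an unknown name, one id for an unambiguous name, and several ids for an ambiguous one.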
218
219 ambiguousNames
This method returns an array of the names in the annotation file that
have been deemed to be ambiguous.
222
223 Note - even though we have made the annotation parser case insensitive,
224 if something appeared in the annotations file as BLAH1 and blah1, we
225 would not deem either of these to be ambiguous. However, if it
226 appeared as blah1 twice, referring to two different genes, then blah1
227 would be ambiguous.
228
229 Usage:
230
231 my @ambiguousNames = $annotationParser->ambiguousNames();
232
234 goIdsByDatabaseId
235 This public method returns a reference to an array of GOIDs that are
236 associated with the supplied databaseId for a specific aspect. If no
237 annotations are associated with that databaseId in that aspect, then a
238 reference to an empty array will be returned. If the databaseId is not
239 recognized, then undef will be returned. In the case that a databaseId
240 is ambiguous (for instance the same databaseId exists but with
241 different casings) then if the supplied database id matches the exact
242 case of one of those supplied, then that is the one it will be treated
243 as. In the case where the databaseId matches none of the possibilities
244 by case, then a fatal error will occur, because the provided databaseId
245 was ambiguous.
246
247 Usage:
248
249 my $goidsRef = $annotationParser->goIdsByDatabaseId(databaseId => $databaseId,
250 aspect => <P|F|C>);
251
252 goIdsByStandardName
253 This public method returns a reference to an array of GOIDs that are
254 associated with the supplied standardName for a specific aspect. If no
255 annotations are associated with the entity with that standard name in
256 that aspect, then a reference to an empty list will be returned. If
257 the supplied name is not used as a standard name, then undef will be
258 returned. In the case that the supplied standardName is ambiguous (for
259 instance the same standardName exists but with different casings) then
260 if the supplied standardName matches the exact case of one of those
261 supplied, then that is the one it will be treated as. In the case
262 where the standardName matches none of the possibilities by case, then
263 a fatal error will occur, because the provided standardName was
264 ambiguous.
265
266 Usage:
267
268 my $goidsRef = $annotationParser->goIdsByStandardName(standardName =>$standardName,
269 aspect =><P|F|C>);
270
271 goIdsByName
272 This public method returns a reference to an array of GO IDs that are
273 associated with the supplied name for a specific aspect. If there are
274 no GO associations for the entity corresponding to the supplied name in
275 the provided aspect, then a reference to an empty list will be
276 returned. If the supplied name does not correspond to any entity, then
277 undef will be returned. Because the name can be any of the databaseId,
278 the standard name, or any of the aliases, it is possible that the name
might be ambiguous. Clients of this object should first test whether
the name they are using is ambiguous, using the nameIsAmbiguous()
method, and handle it accordingly. If an ambiguous name is supplied,
then the method will die.
283
284 NB: API change:
285
goIdsByName is now case insensitive - that is, if a name is used twice
with different casing, it will be treated as ambiguous. Previous
versions would not have treated these as ambiguous. This is the price
of wanting a case insensitive annotation parser. If a name is provided
that is ambiguous because of case, but it exactly matches the case of
one of the possible matches, it will be treated as unambiguous.
293
294 Usage:
295
296 my $goidsRef = $annotationParser->goIdsByName(name => $name,
297 aspect => <P|F|C>);
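
The goIdsBy* methods distinguish three outcomes: undef for an unrecognized name, a reference to an empty array for an entity with no annotations in the aspect, and a reference to a populated array otherwise. A small sketch of handling all three follows. The helper name `summarize_goids` and the message wording are assumptions.

```perl
use strict;
use warnings;

# Turns the three documented outcomes into a printable summary
sub summarize_goids {
    my ($name, $goidsRef) = @_;

    return "$name: not a recognized name"         if !defined $goidsRef;
    return "$name: no annotations in this aspect" if !@{$goidsRef};

    return "$name: " . join(" ", @{$goidsRef});
}
```

Checking for undef before dereferencing matters, because dereferencing undef as an array is a fatal error under 'use strict'.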
298
300 standardNameByDatabaseId
301 This method returns the standard name for a database id.
302
303 NB: API change
304
standardNameByDatabaseId is now case insensitive - that is, if a
databaseId is used twice (or more) with different casing, it will be
treated as ambiguous. Previous versions would not have treated these
as ambiguous. This is the price of wanting a case insensitive
annotation parser. If a databaseId is provided that is ambiguous
because of case, but it exactly matches the case of one of the
possible matches, it will be treated as unambiguous.
312
313 Usage:
314
315 my $standardName = $annotationParser->standardNameByDatabaseId($databaseId);
316
317 databaseIdByStandardName
318 This method returns the database id for a standard name.
319
320 NB: API change
321
databaseIdByStandardName is now case insensitive - that is, if a
standard name is used twice (or more) with different casing, it will
be treated as ambiguous. Previous versions would not have treated
these as ambiguous. This is the price of wanting a case insensitive
annotation parser. If a name is provided that is ambiguous because of
case, but it exactly matches the case of one of the possible matches,
it will be treated as unambiguous.
329
330 Usage:
331
332 my $databaseId = $annotationParser->databaseIdByStandardName($standardName);
333
334 databaseIdByName
335 This method returns the database id for any identifier for a gene (e.g.
336 by databaseId itself, by standard name, or by alias). If the used name
337 is ambiguous, then the program will die. Thus clients should call the
338 nameIsAmbiguous() method, prior to using this method. If the name does
339 not map to any databaseId, then undef will be returned.
340
341 NB: API change
342
databaseIdByName is now case insensitive - that is, if a name is used
twice with different casing, it will be treated as ambiguous. Previous
versions would not have treated these as ambiguous. This is the price
of wanting a case insensitive annotation parser. If a name is provided
that is ambiguous because of case, but it exactly matches the case of
one of the possible matches, it will be treated as unambiguous.
350
351 Usage:
352
353 my $databaseId = $annotationParser->databaseIdByName($name);
354
355 standardNameByName
This public method returns the standard name for the gene specified
by the given name. Because a name may be ambiguous, the
nameIsAmbiguous() method should be called first. If an ambiguous name
is supplied, then the method will die with an appropriate error
message. If the name does not map to a standard name, then undef will
be returned.
361
362 NB: API change
363
standardNameByName is now case insensitive - that is, if a name is
used twice with different casing, it will be treated as ambiguous.
Previous versions would not have treated these as ambiguous. This is
the price of wanting a case insensitive annotation parser.
369
370 Usage:
371
372 my $standardName = $annotationParser->standardNameByName($name);
373
375 nameIsStandardName
376 This method returns a boolean to indicate whether the supplied name is
377 used as a standard name.
378
379 NB : API change.
380
381 This is now case insensitive. If you provide abC1, and ABc1 is a
382 standard name, then it will return true.
383
384 Usage :
385
386 if ($annotationParser->nameIsStandardName($name)){
387
388 # do something
389
390 }
391
392 nameIsDatabaseId
393 This method returns a boolean to indicate whether the supplied name is
394 used as a database id.
395
396 NB : API change.
397
398 This is now case insensitive. If you provide abC1, and ABc1 is a
399 database id, then it will return true.
400
401 Usage :
402
403 if ($annotationParser->nameIsDatabaseId($name)){
404
405 # do something
406
407 }
408
409 nameIsAnnotated
410 This method returns a boolean to indicate whether the supplied name has
411 any annotations, either when considered as a databaseId, a
412 standardName, or an alias. If an aspect is also supplied, then it
413 indicates whether that name has any annotations in that aspect only.
414
415 NB: API change.
416
417 This is now case insensitive. If you provide abC1, and ABc1 has
418 annotation, then it will return true.
419
420 Usage :
421
422 if ($annotationParser->nameIsAnnotated(name => $name)){
423
424 # blah
425
426 }
427
428 or:
429
430 if ($annotationParser->nameIsAnnotated(name => $name,
431 aspect => $aspect)){
432
433 # blah
434
435 }
436
438 databaseName
439 This method returns the name of the annotating authority from the file
440 that was supplied to the constructor.
441
442 Usage :
443
444 my $databaseName = $annotationParser->databaseName();
445
446 numAnnotatedGenes
447 This method returns the number of entities in the annotation file that
448 have annotations in the supplied aspect. If no aspect is provided,
449 then it will return the number of genes with an annotation in at least
450 one aspect of GO.
451
452 Usage:
453
454 my $numAnnotatedGenes = $annotationParser->numAnnotatedGenes();
455
456 my $numAnnotatedGenes = $annotationParser->numAnnotatedGenes($aspect);
457
458 allDatabaseIds
This public method returns an array of all the database identifiers.
460
461 Usage:
462
463 my @databaseIds = $annotationParser->allDatabaseIds();
464
465 allStandardNames
466 This public method returns an array of all standard names.
467
468 Usage:
469
470 my @standardNames = $annotationParser->allStandardNames();
471
473 file
474 This method returns the name of the file that was used to instantiate
475 the object.
476
477 Usage:
478
479 my $file = $annotationParser->file;
480
481 serializeToDisk
482 This public method saves the current state of the Annotation Parser
483 Object to a file, using the Storable package. The data are saved in
484 network order for portability, just in case. The name of the object
485 file is returned. By default, the name of the original file will be
486 used to make the name of the object file (including the full path from
487 where the file came), or the client can instead supply their own
488 filename.
489
490 Usage:
491
492 my $fileName = $annotationParser->serializeToDisk;
493
494 my $fileName = $annotationParser->serializeToDisk(filename => $filename);
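
The description above says the object is saved with Storable in network order. In Storable that corresponds to nstore, with retrieve reading the file back. The sketch below shows such a round trip on a stand-in data structure. The hash contents are made up, and the assumption that the module calls nstore directly has not been checked against its source.

```perl
use strict;
use warnings;

use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Stand-in for parsed annotation data; the GO ids are placeholders and the
# real object's internals are not reproduced here.
my $data = { AAT2 => { P => ['GO:0000001', 'GO:0000002'] } };

# temporary file, removed automatically when the program exits
my (undef, $objectFile) = tempfile(SUFFIX => '.obj', UNLINK => 1);

nstore($data, $objectFile);              # written in network byte order

my $restored = retrieve($objectFile);    # works for nstore'd files too

print join(" ", @{ $restored->{AAT2}{P} }), "\n";
```

A file written this way is what the constructor's objectFile argument is described as expecting, so serializing once and reloading can avoid re-parsing a large annotation file.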
495
497 CVS info is listed here:
498
499 # $Author: sherlock $
500 # $Date: 2008/05/13 23:06:16 $
501 # $Log: AnnotationParser.pm,v $
502 # Revision 1.35 2008/05/13 23:06:16 sherlock
503 # updated to fix bug with querying with a name that was unambiguous when
504 # taking its casing into account.
505 #
506 # Revision 1.34 2007/03/18 03:09:05 sherlock
507 # couple of PerlCritic suggested improvements, and an extra check to
508 # make sure that the cardinality between standard names and database ids
509 # is 1:1
510 #
511 # Revision 1.33 2006/07/28 00:02:14 sherlock
512 # fixed a couple of typos
513 #
514 # Revision 1.32 2004/07/28 17:12:10 sherlock
515 # bumped version
516 #
517 # Revision 1.31 2004/07/28 17:03:49 sherlock
518 # fixed bugs when calling goidsByDatabaseId instead of goIdsByDatabaseId
519 # on lines 1592 and 1617 - thanks to lfriedl@cs.umass.edu for spotting this.
520 #
521 # Revision 1.30 2003/11/26 18:44:28 sherlock
522 # finished making all the changes that were required to make it case
523 # insensitive, and modified POD accordingly. It appears to all work as
524 # expected...
525 #
526 # Revision 1.29 2003/11/22 00:05:05 sherlock
527 # made a very large number of changes to make much of it
528 # case-insensitive, such that using CDC6 or cdc6 amounts to the same
529 # query, as long as both versions of that name don't exist in the
530 # annotations file. Still needs a little work to allow names that are
531 # potentially ambiguous to be not ambiguous, if their casing matches
532 # exactly one form of the name that has been seen. Have started to
533 # update test suite to check all the case insensitive stuff, but is not
534 # yet finished.
535 #
536 #
537
539 Elizabeth Boyle, ell@mit.edu
540
541 Gavin Sherlock, sherlock@genome.stanford.edu

perl v5.12.0    2008-05-13    GO::AnnotationProvider::AnnotationParser(3)