bp_genbank2gff3.pl(1)

BP_GENBANK2GFF3(1)    User Contributed Perl Documentation   BP_GENBANK2GFF3(1)

NAME

       genbank2gff3.pl -- GenBank->gbrowse-friendly GFF3

SYNOPSIS

         genbank2gff3.pl [options] filename(s)

         # process a directory containing GenBank flatfiles
         perl genbank2gff3.pl --dir path_to_files --zip

         # process a single file, ignore explicit exons and introns
         perl genbank2gff3.pl --filter exon --filter intron file.gbk.gz

         # process a list of files
         perl genbank2gff3.pl *gbk.gz

         # process data from URL, with Chado GFF model (-noCDS), and pipe to database loader
         curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
         | perl genbank2gff3.pl -noCDS -in stdin -out stdout \
         | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

           Options:
               --dir     -d  path to a directory of GenBank flatfiles
               --outdir  -o  location to write GFF files (can be 'stdout' or '-' for pipe)
               --zip     -z  compress GFF3 output files with gzip
               --summary -s  print a summary of the features in each contig
               --filter  -x  GenBank feature type(s) to ignore
               --split   -y  split output into separate GFF and FASTA files for
                             each GenBank record
               --nolump  -n  separate file for each reference sequence
                             (default is to lump all records together into one
                              output file for each input file)
               --ethresh -e  error threshold for unflattener;
                             set this high (>2) to ignore all unflattener errors
               --[no]CDS -c  keep CDS-exons, or convert to the alternate gene-RNA-protein-exon
                             model. --CDS is the default. Use --CDS to keep the default GFF gene
                             model; use --noCDS to convert to g-r-p-e.
               --format  -f  input format (SeqIO types): GenBank, Swiss or Uniprot, EMBL work
                             (GenBank is the default)
               --GFF_VERSION 3 is the default; 2, 2.5 and other Bio::Tools::GFF versions are available
               --quiet       don't report what is being processed
               --typesource  SO sequence type for source (e.g. chromosome; region; contig)
               --help    -h  display this message

DESCRIPTION

       This script uses Bio::SeqFeature::Tools::Unflattener and
       Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene
       containment hierarchies mapped for optimal display in gbrowse.

       The input files are assumed to be gzipped GenBank flatfiles for RefSeq
       contigs.  The files may contain multiple GenBank records.  Either a
       single file or an entire directory can be processed.  By default, the
       DNA sequence is embedded in the GFF, but it can be saved into a
       separate FASTA file with the --split (-y) option.

       If an input file contains multiple records, the default behaviour is to
       dump all GFF and sequence to a file of the same name (with .gff
       appended).  Using the 'nolump' option will create a separate file for
       each GenBank record.  Using the 'split' option will create separate GFF
       and FASTA files for each GenBank record.

   Notes
       'split' and 'nolump' produce many files

       In cases where the input files contain many GenBank records (for
       example, the chromosome files for the mouse genome build), a very large
       number of output files will be produced if the 'split' or 'nolump'
       options are selected.  If you have lists of more than 6000 files, use
       the --long_list option in bp_bulk_load_gff.pl or bp_fast_load_gff.pl to
       load the GFF and/or FASTA files.

       Designed for RefSeq

       This script is designed for RefSeq genomic sequence entries.  It may
       work for third-party annotations, but this has not been tested.  But
       see below: Uniprot/Swissprot works, as do EMBL and possibly
       EMBL/Ensembl, if you don't mind some gene model unflattener errors
       (dgg).

       G-R-P-E Gene Model

       Don Gilbert worked this over to produce GFF3 suited to loading into
       GMOD Chado databases.  Most of the changes, I believe, are suited for
       general use.  One main Chado-specific addition is the
         --[no]cds2protein  flag

       My favorite GFF is to set the above as ON by default (disable with
       --nocds2prot).  For general use it probably should be OFF, enabled
       with --cds2prot.

       This writes GFF with an alternate but useful gene model, instead of
       the consensus model for GFF3

         [ gene > mRNA > (exon,CDS,UTR) ]

       This alternate is

         gene > mRNA > polypeptide > exon

       which means the only feature with DNA bases is the exon.  The others
       specify only location ranges on a genome.  Exon of course is a child
       of mRNA and protein/peptide.

       The protein/polypeptide feature is an important one, having all the
       annotations of the GenBank CDS feature: protein ID, translation, GO
       terms, and Dbxrefs to other proteins.

       UTRs, introns and CDS-exons are all inferred from the primary exon
       bases inside/outside the appropriate higher feature ranges.  Other
       special gene model features remain the same.

       Several other improvements and bugfixes, minor but useful, are
       included:

         * IO pipes now work:
           curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...

         * GenBank main record fields are added to the source feature, e.g. organism, date,
           and the sourcetype, commonly chromosome for genomes, is used.

         * Gene model handling for ncRNA and pseudogenes is added.

         * GFF header is cleaner, more informative.
           --GFF_VERSION flag allows choice of v2 as well as default v3

         * GFF ##FASTA inclusion is improved, and
           CDS translation sequence is moved to FASTA records.

         * FT -> GFF attribute mapping is improved.

         * --format choice of SeqIO input formats (GenBank default).
           Uniprot/Swissprot and EMBL work and produce useful GFF.

         * SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions
             and more flexible usage.
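       The inference described under "G-R-P-E Gene Model" above (UTRs,
       introns and CDS-exons derived from exon bases inside or outside the
       polypeptide range) can be illustrated with a minimal sketch.  This is
       not the script's own code, which is Perl built on BioPerl feature
       objects; it is a Python illustration under stated assumptions.  The
       names infer_parts and InferredParts, the sample coordinates, and the
       strand handling are all hypothetical.  Coordinates are 1-based and
       inclusive, as in GFF3.

```python
from typing import List, NamedTuple, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (GFF3 convention)


class InferredParts(NamedTuple):
    """Hypothetical container for the derived ranges."""
    cds: List[Interval]
    five_prime_utr: List[Interval]
    three_prime_utr: List[Interval]
    introns: List[Interval]


def infer_parts(exons: List[Interval], protein: Interval,
                strand: str = "+") -> InferredParts:
    """Derive CDS, UTR and intron ranges from exons and a polypeptide range."""
    exons = sorted(exons)
    p_start, p_end = protein
    cds, left_utr, right_utr = [], [], []
    for start, end in exons:
        # Exon bases inside the polypeptide range are coding.
        lo, hi = max(start, p_start), min(end, p_end)
        if lo <= hi:
            cds.append((lo, hi))
        # Exon bases to the left or right of the polypeptide range are UTR.
        if start < p_start:
            left_utr.append((start, min(end, p_start - 1)))
        if end > p_end:
            right_utr.append((max(start, p_end + 1), end))
    # Introns are the gaps between consecutive exons.
    introns = [(a_end + 1, b_start - 1)
               for (_, a_end), (b_start, _) in zip(exons, exons[1:])
               if b_start - a_end > 1]
    # Assumption: on the minus strand the 5' UTR lies at the higher coordinates.
    if strand == "-":
        left_utr, right_utr = right_utr, left_utr
    return InferredParts(cds, left_utr, right_utr, introns)


parts = infer_parts([(100, 200), (300, 400), (500, 600)], protein=(150, 550))
print(parts.cds)              # [(150, 200), (300, 400), (500, 550)]
print(parts.five_prime_utr)   # [(100, 149)]
print(parts.three_prime_utr)  # [(551, 600)]
print(parts.introns)          # [(201, 299), (401, 499)]
```

       Only the exon and polypeptide ranges are inputs; everything else
       follows from interval arithmetic, which is why the alternate model
       needs to store bases on the exon alone.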

TODO

   Are these additions desired?
        * filter input records by taxon (e.g. keep only organism=xxx or taxa level = classYYY)
        * handle Entrezgene, other non-sequence SeqIO structures (really should change
           those parsers to produce consistent annotation tags).

   Related bugfixes/tests
       These items from Bioperl mail were tested (sample data generating
       errors), and found corrected:

        From: Ed Green <green <at> eva.mpg.de>
        Subject: genbank2gff3.pl on new human RefSeq
        Date: 2006-03-13 21:22:26 GMT
          -- unspecified errors (sample data works now).

        From: Eric Just <e-just <at> northwestern.edu>
        Subject: bp_genbank2gff3.pl
        Date: 2007-01-26 17:08:49 GMT
          -- bug fixed in genbank2gff3 for multi-record handling

       This error is for a /trans_splice gene that is hard to handle, and
       unflattener/genbank2gff3 doesn't handle it:

        From: Chad Matsalla <chad <at> dieselwurks.com>
        Subject: genbank2gff3.PLS and the unflatenner - Inconsistent   order?
        Date: 2005-07-15 19:51:48 GMT

AUTHOR

       Sheldon McKay (mckays@cshl.edu)

       Copyright (c) 2004 Cold Spring Harbor Laboratory.

   AUTHOR of hacks for GFF2Chado loading
       Don Gilbert (gilbertd@indiana.edu)

perl v5.12.0                      2010-04-29                BP_GENBANK2GFF3(1)