BP_GENBANK2GFF3(1) User Contributed Perl Documentation BP_GENBANK2GFF3(1)

NAME
genbank2gff3.pl -- Genbank->gbrowse-friendly GFF3

SYNOPSIS
genbank2gff3.pl [options] filename(s)

# process a directory containing GenBank flatfiles
perl genbank2gff3.pl --dir path_to_files --zip

# process a single file, ignore explicit exons and introns
perl genbank2gff3.pl --filter exon --filter intron file.gbk.gz

# process a list of files
perl genbank2gff3.pl *gbk.gz

# process data from URL, with Chado GFF model (-noCDS), and pipe to database loader
curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
| perl genbank2gff3.pl -noCDS -in stdin -out stdout \
| perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

Options:
    --dir         -d  path to a directory of GenBank flatfiles
    --outdir      -o  location to write GFF files (can be 'stdout' or '-' for pipe)
    --zip         -z  compress GFF3 output files with gzip
    --summary     -s  print a summary of the features in each contig
    --filter      -x  GenBank feature type(s) to ignore
    --split       -y  split output into separate GFF and FASTA files for
                      each GenBank record
    --nolump      -n  separate file for each reference sequence
                      (default is to lump all records together into one
                      output file for each input file)
    --ethresh     -e  error threshold for unflattener;
                      set this high (>2) to ignore all unflattener errors
    --[no]CDS     -c  keep CDS-exons, or convert to the alternate
                      gene-RNA-protein-exon model. --CDS is the default.
                      Use --CDS to keep the default GFF gene model;
                      use --noCDS to convert to g-r-p-e.
    --format      -f  input format (SeqIO types): GenBank, Swiss or Uniprot,
                      and EMBL work (GenBank is default)
    --GFF_VERSION     3 is default; 2, 2.5 and other Bio::Tools::GFF versions
                      are available
    --quiet           don't report what is being processed
    --typesource      SO sequence type for source (e.g. chromosome; region; contig)
    --help        -h  display this message

DESCRIPTION
This script uses Bio::SeqFeature::Tools::Unflattener and
Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene
containment hierarchies mapped for optimal display in gbrowse.

The input files are assumed to be gzipped GenBank flatfiles for RefSeq
contigs. The files may contain multiple GenBank records. Either a
single file or an entire directory can be processed. By default, the
DNA sequence is embedded in the GFF, but it can be saved into a separate
FASTA file with the --split (-y) option.

If an input file contains multiple records, the default behaviour is to
dump all GFF and sequence to a file of the same name (with .gff
appended). Using the 'nolump' option will create a separate file for
each GenBank record. Using the 'split' option will create separate GFF
and FASTA files for each GenBank record.

Notes
'split' and 'nolump' produce many files

In cases where the input files contain many GenBank records (for
example, the chromosome files for the mouse genome build), a very large
number of output files will be produced if the 'split' or 'nolump'
options are selected. If you end up with more than 6000 files, use the
--long_list option in bp_bulk_load_gff.pl or bp_fast_load_gff.pl to
load the GFF and/or FASTA files.
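
As a rough illustration of why these options multiply output, the
sketch below estimates the number of output files from the rules
described above and compares it with the 6000-file figure. The function
names and the mode strings are hypothetical and are not part of
genbank2gff3.pl. The counts assume one file per input file by default,
one per record with 'nolump', and two (GFF plus FASTA) per record with
'split'.

```python
# Hypothetical helper, not part of genbank2gff3.pl: estimate how many
# output files a run would produce under the rules described in this page.

LONG_LIST_THRESHOLD = 6000  # figure quoted in the note above


def count_output_files(records_per_input, mode="lump"):
    """Return the estimated number of output files.

    records_per_input: list with the number of GenBank records in each
                       input file.
    mode: "lump" (default behaviour), "nolump" or "split".
    """
    if mode == "lump":
        # one output file per input file
        return len(records_per_input)
    if mode == "nolump":
        # one output file per GenBank record
        return sum(records_per_input)
    if mode == "split":
        # one GFF file and one FASTA file per GenBank record
        return 2 * sum(records_per_input)
    raise ValueError("unknown mode: " + mode)


def needs_long_list(n_files):
    """True when the file list exceeds the 6000-file figure."""
    return n_files > LONG_LIST_THRESHOLD


# Example: 20 chromosome files holding 200 records each.
inputs = [200] * 20
for mode in ("lump", "nolump", "split"):
    n = count_output_files(inputs, mode)
    print(mode, n, needs_long_list(n))
```

With these assumed inputs only the 'split' run crosses the threshold
(8000 files), which is the situation where --long_list would apply.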

Designed for RefSeq

This script is designed for RefSeq genomic sequence entries. It may
work for third-party annotations, but this has not been tested. However,
as noted below, Uniprot/Swissprot input works, as do EMBL and possibly
EMBL/Ensembl, if you don't mind some gene model unflattener errors (dgg).

G-R-P-E Gene Model

Don Gilbert reworked this script to produce GFF3 suited to
loading into GMOD Chado databases. Most of the changes are, I believe,
suited for general use. The main Chado-specific addition is the
--[no]cds2protein flag.

My preference is to have this flag ON by default (disable with
--nocds2prot). For general use it should probably be OFF, enabled with
--cds2prot.

This writes GFF with an alternate but useful gene model, instead of
the consensus model for GFF3:

[ gene > mRNA > (exon,CDS,UTR) ]

The alternate model is

gene > mRNA > polypeptide > exon

This means that the only feature with DNA bases is the exon. The others
specify only location ranges on a genome. The exon is, of course, a child
of both the mRNA and the protein/peptide.
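
A minimal sketch of how this containment might look as GFF3 rows,
assuming ID/Parent attributes carry the hierarchy. The coordinates, the
IDs, the seqid 'chrX' and the source column 'example' are invented for
illustration. This page does not state exactly how the script writes
the exon's two parents; a comma-separated Parent list is simply what
GFF3 permits.

```python
# Illustrative only: builds GFF3 rows for a gene > mRNA > polypeptide > exon
# model. IDs, source name and coordinates are made up for this sketch.

def gff3_row(seqid, ftype, start, end, strand, attrs):
    """Format one tab-separated GFF3 row (score and phase left as '.')."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    return "\t".join(
        [seqid, "example", ftype, str(start), str(end), ".", strand, ".", attr_text]
    )


def grpe_rows(seqid, strand, gene, mrna, protein, exons):
    """Return rows for one gene; each range is a (start, end) tuple."""
    rows = [
        gff3_row(seqid, "gene", gene[0], gene[1], strand, {"ID": "gene1"}),
        gff3_row(seqid, "mRNA", mrna[0], mrna[1], strand,
                 {"ID": "mRNA1", "Parent": "gene1"}),
        gff3_row(seqid, "polypeptide", protein[0], protein[1], strand,
                 {"ID": "pep1", "Parent": "mRNA1"}),
    ]
    for number, (start, end) in enumerate(exons, start=1):
        # assumption: the exon lists both the mRNA and the polypeptide as parents
        rows.append(gff3_row(seqid, "exon", start, end, strand,
                             {"ID": f"exon{number}", "Parent": "mRNA1,pep1"}))
    return rows


for row in grpe_rows("chrX", "+", (100, 600), (100, 600), (150, 550),
                     [(100, 200), (300, 400), (500, 600)]):
    print(row)
```

Only the exon rows correspond to sequence; the gene, mRNA and
polypeptide rows are location ranges that the exons fall inside.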

The protein/polypeptide feature is an important one, carrying all the
annotations of the GenBank CDS feature: protein ID, translation, GO
terms, and Dbxrefs to other proteins.

UTRs, introns, and CDS-exons are all inferred from the primary exon bases
inside/outside the appropriate higher feature ranges. Other special gene
model features remain the same.
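
The inference described above can be sketched with plain interval
arithmetic. This is an illustration under stated assumptions, not the
script's own code: coordinates are 1-based and inclusive, the gene is on
the forward strand (on the reverse strand the 5' and 3' labels swap),
and the function name infer_parts is hypothetical.

```python
# Hypothetical sketch: derive CDS segments, UTRs and introns from exon
# ranges plus the polypeptide range. Forward strand, 1-based inclusive.

def infer_parts(exons, protein):
    """exons: list of (start, end); protein: (start, end) of the polypeptide."""
    exons = sorted(exons)
    p_start, p_end = protein
    cds, utr5, utr3 = [], [], []
    for start, end in exons:
        # exon bases inside the polypeptide range are coding
        lo, hi = max(start, p_start), min(end, p_end)
        if lo <= hi:
            cds.append((lo, hi))
        # exon bases before the polypeptide range are 5' UTR
        if start < p_start:
            utr5.append((start, min(end, p_start - 1)))
        # exon bases after the polypeptide range are 3' UTR
        if end > p_end:
            utr3.append((max(start, p_end + 1), end))
    # gaps between consecutive exons are introns
    introns = [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if next_start - prev_end > 1
    ]
    return {"CDS": cds, "five_prime_UTR": utr5,
            "three_prime_UTR": utr3, "intron": introns}


parts = infer_parts([(100, 200), (300, 400), (500, 600)], (150, 550))
print(parts["CDS"])              # [(150, 200), (300, 400), (500, 550)]
print(parts["five_prime_UTR"])   # [(100, 149)]
print(parts["three_prime_UTR"])  # [(551, 600)]
print(parts["intron"])           # [(201, 299), (401, 499)]
```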

Several other improvements and bugfixes, minor but useful, are included:

* IO pipes now work:
  curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...

* GenBank main record fields, e.g. organism and date, are added to the
  source feature, and the sourcetype, commonly chromosome for genomes, is used.

* Gene model handling for ncRNA and pseudogenes is added.

* GFF header is cleaner and more informative.
  The --GFF_VERSION flag allows a choice of v2 as well as the default v3.

* GFF ##FASTA inclusion is improved, and
  CDS translation sequence is moved to FASTA records.

* FT -> GFF attribute mapping is improved.

* --format allows a choice of SeqIO input formats (GenBank default).
  Uniprot/Swissprot and EMBL work and produce useful GFF.

* SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions
  and more flexible usage.

Are these additions desired?
* filter input records by taxon (e.g. keep only organism=xxx or taxa level = classYYY)
* handle Entrezgene, other non-sequence SeqIO structures (really should change
  those parsers to produce consistent annotation tags).

Related bugfixes/tests
These items from the Bioperl mailing list were tested (using sample data
that generated errors) and found to be corrected:

From: Ed Green <green <at> eva.mpg.de>
Subject: genbank2gff3.pl on new human RefSeq
Date: 2006-03-13 21:22:26 GMT
-- unspecified errors (sample data works now).

From: Eric Just <e-just <at> northwestern.edu>
Subject: bp_genbank2gff3.pl
Date: 2007-01-26 17:08:49 GMT
-- bug fixed in genbank2gff3 for multi-record handling

This error is for a /trans_splice gene that is hard to handle, and
unflattener/genbank2 doesn't

From: Chad Matsalla <chad <at> dieselwurks.com>
Subject: genbank2gff3.PLS and the unflatenner - Inconsistent order?
Date: 2005-07-15 19:51:48 GMT

AUTHOR
Sheldon McKay (mckays@cshl.edu)

Copyright (c) 2004 Cold Spring Harbor Laboratory.

AUTHOR of hacks for GFF2Chado loading
Don Gilbert (gilbertd@indiana.edu)

perl v5.12.0 2010-04-29 BP_GENBANK2GFF3(1)