Boulder::Blast(3)     User Contributed Perl Documentation    Boulder::Blast(3)



NAME
       Boulder::Blast - Parse and read BLAST files

SYNOPSIS
           use Boulder::Blast;

           # parse from a single file
           $blast = Boulder::Blast->parse('run3.blast');

           # parse and read a set of blast output files
           $stream = Boulder::Blast->new('run3.blast','run4.blast');
           while ($blast = $stream->get) {
              # do something with $blast object
           }

           # parse and read a whole directory of blast runs
           $stream = Boulder::Blast->new(<*.blast>);
           while ($blast = $stream->get) {
              # do something with $blast object
           }

           # parse and read from STDIN
           $stream = Boulder::Blast->new;
           while ($blast = $stream->get) {
              # do something with $blast object
           }

           # parse and read as a filehandle
           $stream = Boulder::Blast->newFh(<*.blast>);
           while ($blast = <$stream>) {
              # do something with $blast object
           }

           # once you have a $blast object, you can get info about it:
           $query = $blast->Blast_query;
           @hits  = $blast->Blast_hits;
           foreach $hit (@hits) {
              $hit_sequence = $hit->Name;    # get the ID
              $significance = $hit->Signif;  # get the significance
              @hsps = $hit->Hsps;            # list of HSPs
              foreach $hsp (@hsps) {
                 $query   = $hsp->Query;     # query sequence
                 $subject = $hsp->Subject;   # subject sequence
                 $signif  = $hsp->Signif;    # significance of HSP
              }
           }

DESCRIPTION
       The Boulder::Blast class parses the output of the Washington University
       (WU) or National Center for Biotechnology Information (NCBI) series of
       BLAST programs and turns it into Stone records. You may then use the
       standard Stone access methods to retrieve information about the BLAST
       run, or add the information to a Boulder stream.

       The parser works equally well on the contents of a static file, or on
       information read dynamically from a filehandle or pipe.

   parse() Method
           $stone = Boulder::Blast->parse($file_path);
           $stone = Boulder::Blast->parse($filehandle);

       The parse() method accepts a path to a file or a filehandle, parses its
       contents, and returns a Boulder Stone object. The file path may be
       absolute or relative to the current directory. The filehandle may be
       specified as an IO::File object, a FileHandle object, or a reference to
       a glob ("\*FILEHANDLE" notation). If you call parse() without any
       arguments, it will try to parse the contents of standard input.

   new() Method
           $stream = Boulder::Blast->new;
           $stream = Boulder::Blast->new($file [,@more_files]);
           $stream = Boulder::Blast->new(\*FILEHANDLE);

       If you wish, you may create the parser first with Boulder::Blast->new()
       and then invoke the parser object's parse() method as many times as you
       wish, producing a Stone object each time.
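
       As a rough illustration of working with a stream object, the sketch
       below collects the top hit for each query. It relies only on calls
       that appear in the SYNOPSIS (get(), Blast_query, Blast_hits, Name); the
       helper name best_hit_per_query is hypothetical and not part of
       Boulder::Blast. It assumes hits are returned most significant first.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Boulder::Blast.
# Maps each query identifier to the name of its first (top) hit.
# $stream is assumed to be an object returned by Boulder::Blast->new(...).
sub best_hit_per_query {
    my ($stream) = @_;
    my %best;
    while (my $blast = $stream->get) {
        # List assignment keeps only the first hit, if there is one.
        my ($top) = $blast->Blast_hits;
        $best{ $blast->Blast_query } = defined $top ? $top->Name : undef;
    }
    return \%best;
}

# Possible usage (file names are the placeholders from the SYNOPSIS):
#   my $best = best_hit_per_query(Boulder::Blast->new('run3.blast', 'run4.blast'));
#   print "$_ => ", $best->{$_} // 'no hits', "\n" for sort keys %$best;
```

       Queries with no hits are stored with an undefined value, so they can
       still be told apart from queries that never appeared in the stream.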

       The following tags are defined in the parsed Blast Stone object:

   Information about the program
       These top-level tags provide information about the version of the BLAST
       program itself.

       Blast_program
           The name of the algorithm used to run the analysis. Possible
           values include:

                   blastn
                   blastp
                   blastx
                   tblastn
                   tblastx
                   fasta3
                   fastx3
                   fasty3
                   tfasta3
                   tfastx3
                   tfasty3

       Blast_version
           This gives the version of the program in whatever form appears on
           the banner page, e.g. "2.0a19-WashU".

       Blast_program_date
           This gives the date at which the program was compiled, if and only
           if it appears on the banner page.

   Information about the run
       These top-level tags give information about the particular run, such as
       the parameters that were used for the algorithm.

       Blast_run_date
           This gives the date and time at which the similarity analysis was
           run, in the format "Fri Jul 6 09:32:36 1998".

       Blast_parms
           This points to a subrecord containing information about the
           algorithm's runtime parameters. The following subtags are used.
           Others may be added in the future:

                   Hspmax        the value of the -hspmax argument
                   Expectation   the value of E
                   Matrix        the matrix in use, e.g. BLOSUM62
                   Ctxfactor     the value of the -ctxfactor argument
                   Gapall        the value of the -gapall argument

   Information about the query sequence and subject database
       These top-level tags give information about the query sequence and the
       database that was searched.

       Blast_query
           The identifier for the search sequence, as defined by the FASTA
           format. This will be the first set of non-whitespace characters
           following the ">" character. In other words, the search sequence
           "name".

       Blast_query_length
           The length of the query sequence, in base pairs.

       Blast_db
           The Unix filesystem path to the subject database.

       Blast_db_title
           The title of the subject database.

   The search results: the Blast_hits tag.
       Each BLAST hit is represented by the tag Blast_hits. There may be
       zero, one, or many such tags. They are presented in order of
       decreasing significance, i.e. most significant hit first.

       Each Blast_hits tag is a Stone subrecord containing the following
       subtags:

       Name
           The name/identifier of the sequence that was hit.

       Length
           The total length of the sequence that was hit.

       Signif
           The significance of the hit. If there are multiple HSPs in the
           hit, this will be the most significant (smallest) value.

       Identity
           The percent identity of the hit. If there are multiple HSPs, this
           will be the one with the highest percent identity.

       Expect
           The expectation value for the hit. If there are multiple HSPs,
           this will be the lowest expectation value in the set.

       Hsps
           One or more sub-sub-tags, pointing to a nested record containing
           information about each high-scoring segment pair (HSP). See the
           next section for details.
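
       To show how the hit subtags might be used together, here is a small
       sketch that keeps only the hits whose Signif value is at or below a
       cutoff. The helper name significant_hits and the 1e-10 cutoff are
       illustrative assumptions; only Blast_hits, Signif and Name come from
       this manual page.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Boulder::Blast.
# Returns the hits whose Signif value is at or below $cutoff.
# $blast is assumed to be a parsed record, as returned by parse() or get().
sub significant_hits {
    my ($blast, $cutoff) = @_;
    # Signif values such as "3.5e-74" compare correctly as numbers.
    return grep { $_->Signif <= $cutoff } $blast->Blast_hits;
}

# Possible usage (file name and cutoff are placeholders):
#   my $blast = Boulder::Blast->parse('run3.blast');
#   print $_->Name, "\t", $_->Signif, "\n" for significant_hits($blast, 1e-10);
```

       Because smaller Signif values are more significant, the comparison
       keeps values that are less than or equal to the cutoff.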

   The Hsp records: the Hsps tag
       Each Blast_hits tag will have at least one, and possibly several, Hsps
       tags, each one corresponding to a high-scoring segment pair (HSP).
       These records contain detailed information about the hit, including the
       alignments. Tags are as follows:

       Signif
           The significance (P value) of this HSP.

       Bits
           The number of bits of significance.

       Expect
           Expectation value for this HSP.

       Identity
           Percent identity.

       Positives
           Percent positive matches.

       Score
           The Smith-Waterman alignment score.

       Orientation
           The word "plus" or "minus". This tag is only present for
           nucleotide searches, when the reverse complement match may be
           present.

       Strand
           Depending on algorithm used, indicates complementarity of match and
           possibly the reading frame. This is copied out of the blast
           report. Possibilities include:

                   "Plus / Minus"  "Plus / Plus"   -- blastn algorithm
                   "+1 / -2"       "+2 / -2"       -- blastx, tblastx

       Query_start
           Position at which the HSP starts in the query sequence (1-based
           indexing).

       Query_end
           Position at which the HSP stops in the query sequence.

       Subject_start
           Position at which the HSP starts in the subject (target) sequence.

       Subject_end
           Position at which the HSP stops in the subject (target) sequence.
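
       The start and end positions can be combined into a span length and a
       query coverage figure. The sketch below treats both ends as inclusive,
       which agrees with the blastp example at the end of this page
       (Query_start=100, Query_end=145 spans 46 residues). The helper names
       are hypothetical, and the use of abs() assumes that the start may be
       larger than the end for minus-strand matches.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Boulder::Blast.

# Number of positions covered by an HSP, counting both ends.
# abs() is an assumption: start may exceed end on minus-strand matches.
sub span_length {
    my ($start, $end) = @_;
    return abs($end - $start) + 1;
}

# Fraction of the query sequence covered by a single HSP.
sub query_coverage {
    my ($query_start, $query_end, $query_length) = @_;
    return span_length($query_start, $query_end) / $query_length;
}

# Values taken from the blastp example record:
# Query_start=100, Query_end=145, Blast_query_length=216.
printf "span=%d coverage=%.3f\n",
    span_length(100, 145), query_coverage(100, 145, 216);
# prints "span=46 coverage=0.213"
```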

       Query, Subject, Alignment
           These three tags contain strings which, together, create the gapped
           alignment of the query sequence with the subject sequence.

           For example, to print the alignment of the first HSP of the first
           match, you might say:

             $hsp = $blast->Blast_hits->Hsps;
             print join("\n",$hsp->Query,$hsp->Alignment,$hsp->Subject),"\n";

       See the bottom of this manual page for an example BLAST run.
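
       Long alignments are easier to read when wrapped. The following sketch
       breaks the Query, Alignment and Subject strings into blocks of a fixed
       width. The helper name format_alignment and the default width of 60
       columns are assumptions made for illustration; the function works on
       plain strings and does not depend on Boulder::Blast.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Boulder::Blast.
# Breaks the three alignment strings into blocks of $width columns.
sub format_alignment {
    my ($query, $alignment, $subject, $width) = @_;
    $width ||= 60;    # assumed default width
    my $columns = length($query);

    # Pad the shorter strings so substr() never reads past the end.
    for my $line ($alignment, $subject) {
        $line .= ' ' x ($columns - length($line)) if length($line) < $columns;
    }

    my @blocks;
    for (my $pos = 0; $pos < $columns; $pos += $width) {
        push @blocks, join("\n",
            map { substr($_, $pos, $width) } $query, $alignment, $subject);
    }
    return join("\n\n", @blocks) . "\n";
}

# Possible usage with an HSP record:
#   print format_alignment($hsp->Query, $hsp->Alignment, $hsp->Subject);
```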

       This module has been extensively tested with WUBLAST, but very little
       with NCBI BLAST. It probably will not work with PSI-BLAST or other
       variants.

       The author plans to adapt this module to parse other formats, as well
       as non-BLAST formats such as the output of Fastn.

SEE ALSO
       Boulder, Boulder::GenBank

AUTHOR
       Lincoln Stein <lstein@cshl.org>.

       Copyright (c) 1998-1999 Cold Spring Harbor Laboratory

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself. See DISCLAIMER.txt for
       disclaimers of warranty.

       This output was generated by the quickblast.pl program, which is
       located in the eg/ subdirectory of the Boulder distribution directory.
       It is a typical blastn (nucleotide->nucleotide) run; however, long lines
       (usually DNA sequences) have been truncated. Also note that per the
       Boulder protocol, the percent sign (%) is escaped in the usual way. It
       will be unescaped when reading the stream back in.

           Blast_run_date=Fri Nov 6 14:40:41 1998
           Blast_db_date=2:40 PM EST Nov 6, 1998
           Blast_parms={
             Hspmax=10
             Expectation=10
             Matrix=+5,-4
             Ctxfactor=2.00
           }
           Blast_program_date=05-Feb-1998
           Blast_db= /usr/tmp/quickblast18202aaaa
           Blast_version=2.0a19-WashU
           Blast_query=BCD207R
           Blast_db_title= test.fasta
           Blast_query_length=332
           Blast_program=blastn
           Blast_hits={
             Signif=3.5e-74
             Expect=3.5e-74,
             Name=BCD207R
             Identity=100%25
             Length=332
             Hsps={
               Subject=GTGCTTTCAAACATTGATGGATTCCTCCCCTTGACATATATATATACTTTGGGTTCCCGCAA...
               Signif=3.5e-74
               Length=332
               Bits=249.1
               Query_start=1
               Subject_end=332
               Query=GTGCTTTCAAACATTGATGGATTCCTCCCCTTGACATATATATATACTTTGGGTTCCCGCAA...
               Positives=100%25
               Expect=3.5e-74,
               Identity=100%25
               Query_end=332
               Orientation=plus
               Score=1660
               Strand=Plus / Plus
               Subject_start=1
               Alignment=||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
             }
           }
           =
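
       Values such as "100%25" in the record above show the escaped form of
       the percent sign. As a minimal sketch of the unescaping step, assuming
       that "the usual way" means URL-style %XX hexadecimal escapes (an
       assumption consistent with %25 standing for "%"), a value could be
       decoded as follows. The function name unescape_value is hypothetical;
       the Boulder modules perform this step themselves when reading a stream.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Boulder distribution.
# Decodes %XX hexadecimal escapes, assuming URL-style escaping.
sub unescape_value {
    my ($value) = @_;
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $value;
}

print unescape_value('100%25'), "\n";    # prints "100%"
```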

       Here is the output from a typical blastp (protein->protein) run. Long
       lines have again been truncated.

           Blast_run_date=Fri Nov 6 14:37:23 1998
           Blast_db_date=2:36 PM EST Nov 6, 1998
           Blast_parms={
             Hspmax=10
             Expectation=10
             Matrix=BLOSUM62
             Ctxfactor=1.00
           }
           Blast_program_date=05-Feb-1998
           Blast_db= /usr/tmp/quickblast18141aaaa
           Blast_version=2.0a19-WashU
           Blast_query=YAL004W
           Blast_db_title= elegans.fasta
           Blast_query_length=216
           Blast_program=blastp
           Blast_hits={
             Signif=0.95
             Expect=3.0,
             Name=C28H8.2
             Identity=30%25
             Length=51
             Hsps={
               Subject=HMTVEFHVTSQSW---FGFEDHFHMIIR-AVNDENVGWGVRYLSMAF
               Signif=0.95
               Length=46
               Bits=15.8
               Query_start=100
               Subject_end=49
               Query=HLTQD-HGGDLFWGKVLGFTLKFNLNLRLTVNIDQLEWEVLHVSLHF
               Positives=52%25
               Expect=3.0,
               Identity=30%25
               Query_end=145
               Orientation=plus
               Score=45
               Subject_start=7
               Alignment=H+T + H W GF F++ +R VN + + W V ++S+ F
             }
           }
           Blast_hits={
             Signif=0.99
             Expect=4.7,
             Name=ZK896.2
             Identity=24%25
             Length=340
             Hsps={
               Subject=FSGKFTTFVLNKDQATLRMSSAEKTAEWNTAFDSRRGFF----TSGNYGL...
               Signif=0.99
               Length=101
               Bits=22.9
               Query_start=110
               Subject_end=243
               Query=FWGKVLGFTL-KFNLNLRLTVNIDQLEWEVLHVSLHFWVVEVSTDQTLSVE...
               Positives=41%25
               Expect=4.7,
               Identity=24%25
               Query_end=210
               Orientation=plus
               Score=65
               Subject_start=146
               Alignment=F GK F L K LR++ EW S + T +...
             }
           }
           =



perl v5.12.0                      2002-02-04                 Boulder::Blast(3)