1bwa(1) Bioinformatics tools bwa(1)
2
3
4
6 bwa - Burrows-Wheeler Alignment Tool
7
9 bwa index -a bwtsw database.fasta
10
11 bwa aln database.fasta short_read.fastq > aln_sa.sai
12
13 bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam
14
15 bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq >
16 aln.sam
17
18 bwa bwasw database.fasta long_read.fastq > aln.sam
19
20
22 BWA is a fast light-weighted tool that aligns relatively short
23 sequences (queries) to a sequence database (targe), such as the human
24 reference genome. It implements two different algorithms, both based on
25 Burrows-Wheeler Transform (BWT). The first algorithm is designed for
26 short queries up to ~200bp with low error rate (<3%). It does gapped
27 global alignment w.r.t. queries, supports paired-end reads, and is one
28 of the fastest short read alignment algorithms to date while also vis‐
29 iting suboptimal hits. The second algorithm, BWA-SW, is designed for
30 long reads with more errors. It performs heuristic Smith-Waterman-like
31 alignment to find high-scoring local hits (and thus chimera). On low-
32 error short queries, BWA-SW is slower and less accurate than the first
33 algorithm, but on long queries, it is better.
34
35 For both algorithms, the database file in the FASTA format must be
36 first indexed with the `index' command, which typically takes a few
37 hours. The first algorithm is implemented via the `aln' command, which
38 finds the suffix array (SA) coordinates of good hits of each individual
39 read, and the `samse/sampe' command, which converts SA coordinates to
40 chromosomal coordinate and pairs reads (for `sampe'). The second algo‐
41 rithm is invoked by the `bwasw' command. It works for single-end reads
42 only.
43
44
46 index bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
47
48 Index database sequences in the FASTA format.
49
50 OPTIONS:
51
52 -c Build color-space index. The input fast should be in
53 nucleotide space.
54
55 -p STR Prefix of the output database [same as db filename]
56
57 -a STR Algorithm for constructing BWT index. Available
58 options are:
59
60 is IS linear-time algorithm for constructing suf‐
61 fix array. It requires 5.37N memory where N is
62 the size of the database. IS is moderately
63 fast, but does not work with database larger
64 than 2GB. IS is the default algorithm due to
65 its simplicity. The current codes for IS algo‐
66 rithm are reimplemented by Yuta Mori.
67
68 bwtsw Algorithm implemented in BWT-SW. This method
69 works with the whole human genome, but it does
70 not work with database smaller than 10MB and it
71 is usually slower than IS.
72
73
74 aln bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i
75 nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M
76 misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta>
77 <in.query.fq> > <out.sai>
78
79 Find the SA coordinates of the input reads. Maximum maxSeedDiff
80 differences are allowed in the first seedLen subsequence and
81 maximum maxDiff differences are allowed in the whole sequence.
82
83 OPTIONS:
84
85 -n NUM Maximum edit distance if the value is INT, or the
86 fraction of missing alignments given 2% uniform base
87 error rate if FLOAT. In the latter case, the maximum
88 edit distance is automatically chosen for different
89 read lengths. [0.04]
90
91 -o INT Maximum number of gap opens [1]
92
93 -e INT Maximum number of gap extensions, -1 for k-difference
94 mode (disallowing long gaps) [-1]
95
96 -d INT Disallow a long deletion within INT bp towards the
97 3'-end [16]
98
99 -i INT Disallow an indel within INT bp towards the ends [5]
100
101 -l INT Take the first INT subsequence as seed. If INT is
102 larger than the query sequence, seeding will be dis‐
103 abled. For long reads, this option is typically ranged
104 from 25 to 35 for `-k 2'. [inf]
105
106 -k INT Maximum edit distance in the seed [2]
107
108 -t INT Number of threads (multi-threading mode) [1]
109
110 -M INT Mismatch penalty. BWA will not search for suboptimal
111 hits with a score lower than (bestScore-misMsc). [3]
112
113 -O INT Gap open penalty [11]
114
115 -E INT Gap extension penalty [4]
116
117 -R INT Proceed with suboptimal alignments if there are no
118 more than INT equally best hits. This option only
119 affects paired-end mapping. Increasing this threshold
120 helps to improve the pairing accuracy at the cost of
121 speed, especially for short reads (~32bp).
122
123 -c Reverse query but not complement it, which is required
124 for alignment in the color space.
125
126 -N Disable iterative search. All hits with no more than
127 maxDiff differences will be found. This mode is much
128 slower than the default.
129
130 -q INT Parameter for read trimming. BWA trims a read down to
131 argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l
132 is the original read length. [0]
133
134 -I The input is in the Illumina 1.3+ read format (quality
135 equals ASCII-64).
136
137 -B INT Length of barcode starting from the 5'-end. When INT
138 is positive, the barcode of each read will be trimmed
139 before mapping and will be written at the BC SAM tag.
140 For paired-end reads, the barcode from both ends are
141 concatenated. [0]
142
143 -b Specify the input read sequence file is the BAM for‐
144 mat. For paired-end data, two ends in a pair must be
145 grouped together and options -1 or -2 are usually
146 applied to specify which end should be mapped. Typical
147 command lines for mapping pair-end data in the BAM
148 format are:
149
150 bwa aln ref.fa -b1 reads.bam > 1.sai
151 bwa aln ref.fa -b2 reads.bam > 2.sai
152 bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam >
153 aln.sam
154
155 -0 When -b is specified, only use single-end reads in
156 mapping.
157
158 -1 When -b is specified, only use the first read in a
159 read pair in mapping (skip single-end reads and the
160 second reads).
161
162 -2 When -b is specified, only use the second read in a
163 read pair in mapping.
164
165
166 samse bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
167
168 Generate alignments in the SAM format given single-end reads.
169 Repetitive hits will be randomly chosen.
170
171 OPTIONS:
172
173 -n INT Maximum number of alignments to output in the XA tag
174 for reads paired properly. If a read has more than INT
175 hits, the XA tag will not be written. [3]
176
177 -r STR Specify the read group in a format like
178 `@RG\tID:foo\tSM:bar'. [null]
179
180
181 sampe bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N max‐
182 HitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
183 > <out.sam>
184
185 Generate alignments in the SAM format given paired-end reads.
186 Repetitive read pairs will be placed randomly.
187
188 OPTIONS:
189
190 -a INT Maximum insert size for a read pair to be considered
191 being mapped properly. Since 0.4.5, this option is only
192 used when there are not enough good alignment to infer
193 the distribution of insert sizes. [500]
194
195 -o INT Maximum occurrences of a read for pairing. A read with
196 more occurrneces will be treated as a single-end read.
197 Reducing this parameter helps faster pairing. [100000]
198
199 -P Load the entire FM-index into memory to reduce disk
200 operations (base-space reads only). With this option, at
201 least 1.25N bytes of memory are required, where N is the
202 length of the genome.
203
204 -n INT Maximum number of alignments to output in the XA tag for
205 reads paired properly. If a read has more than INT hits,
206 the XA tag will not be written. [3]
207
208 -N INT Maximum number of alignments to output in the XA tag for
209 disconcordant read pairs (excluding singletons). If a
210 read has more than INT hits, the XA tag will not be
211 written. [10]
212
213 -r STR Specify the read group in a format like
214 `@RG\tID:foo\tSM:bar'. [null]
215
216
217 bwasw bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r
218 gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv]
219 [-z zBest] [-N nHspRev] [-c thresCoef] <in.db.fasta> <in.fq>
220
221 Align query sequences in the <in.fq> file.
222
223 OPTIONS:
224
225 -a INT Score of a match [1]
226
227 -b INT Mismatch penalty [3]
228
229 -q INT Gap open penalty [5]
230
231 -r INT Gap extension penalty. The penalty for a contiguous
232 gap of size k is q+k*r. [2]
233
234 -t INT Number of threads in the multi-threading mode [1]
235
236 -w INT Band width in the banded alignment [33]
237
238 -T INT Minimum score threshold divided by a [37]
239
240 -c FLOAT Coefficient for threshold adjustment according to
241 query length. Given an l-long query, the threshold for
242 a hit to be retained is a*max{T,c*log(l)}. [5.5]
243
244 -z INT Z-best heuristics. Higher -z increases accuracy at the
245 cost of speed. [1]
246
247 -s INT Maximum SA interval size for initiating a seed. Higher
248 -s increases accuracy at the cost of speed. [3]
249
250 -N INT Minimum number of seeds supporting the resultant
251 alignment to skip reverse alignment. [5]
252
253
255 The output of the `aln' command is binary and designed for BWA use
256 only. BWA outputs the final alignment in the SAM (Sequence Align‐
257 ment/Map) format. Each line consists of:
258
259
260 ┌────┬───────┬──────────────────────────────────────────────────────────┐
261 │Col │ Field │ Description │
262 ├────┼───────┼──────────────────────────────────────────────────────────┤
263 │ 1 │ QNAME │ Query (pair) NAME │
264 │ 2 │ FLAG │ bitwise FLAG │
265 │ 3 │ RNAME │ Reference sequence NAME │
266 │ 4 │ POS │ 1-based leftmost POSition/coordinate of clipped sequence │
267 │ 5 │ MAPQ │ MAPping Quality (Phred-scaled) │
268 │ 6 │ CIAGR │ extended CIGAR string │
269 │ 7 │ MRNM │ Mate Reference sequence NaMe (`=' if same as RNAME) │
270 │ 8 │ MPOS │ 1-based Mate POSistion │
271 │ 9 │ ISIZE │ Inferred insert SIZE │
272 │10 │ SEQ │ query SEQuence on the same strand as the reference │
273 │11 │ QUAL │ query QUALity (ASCII-33 gives the Phred base quality) │
274 │12 │ OPT │ variable OPTional fields in the format TAG:VTYPE:VALUE │
275 └────┴───────┴──────────────────────────────────────────────────────────┘
276
277 Each bit in the FLAG field is defined as:
278
279
280 ┌────┬────────┬───────────────────────────────────────┐
281 │Chr │ Flag │ Description │
282 ├────┼────────┼───────────────────────────────────────┤
283 │ p │ 0x0001 │ the read is paired in sequencing │
284 │ P │ 0x0002 │ the read is mapped in a proper pair │
285 │ u │ 0x0004 │ the query sequence itself is unmapped │
286 │ U │ 0x0008 │ the mate is unmapped │
287 │ r │ 0x0010 │ strand of the query (1 for reverse) │
288 │ R │ 0x0020 │ strand of the mate │
289 │ 1 │ 0x0040 │ the read is the first read in a pair │
290 │ 2 │ 0x0080 │ the read is the second read in a pair │
291 │ s │ 0x0100 │ the alignment is not primary │
292 │ f │ 0x0200 │ QC failure │
293 │ d │ 0x0400 │ optical or PCR duplicate │
294 └────┴────────┴───────────────────────────────────────┘
295
296 The Please check <http://samtools.sourceforge.net> for the format spec‐
297 ification and the tools for post-processing the alignment.
298
299 BWA generates the following optional fields. Tags starting with `X' are
300 specific to BWA.
301
302
303 ┌────┬────────────────────────────────────────────────┐
304 │Tag │ Meaning │
305 ├────┼────────────────────────────────────────────────┤
306 │NM │ Edit distance │
307 │MD │ Mismatching positions/bases │
308 │AS │ Alignment score │
309 │BC │ Barcode sequence │
310 ├────┼────────────────────────────────────────────────┤
311 │X0 │ Number of best hits │
312 │X1 │ Number of suboptimal hits found by BWA │
313 │XN │ Number of ambiguous bases in the referenece │
314 │XM │ Number of mismatches in the alignment │
315 │XO │ Number of gap opens │
316 │XG │ Number of gap extentions │
317 │XT │ Type: Unique/Repeat/N/Mate-sw │
318 │XA │ Alternative hits; format: (chr,pos,CIGAR,NM;)* │
319 ├────┼────────────────────────────────────────────────┤
320 │XS │ Suboptimal alignment score │
321 │XF │ Support from forward/reverse alignment │
322 │XE │ Number of supporting seeds │
323 └────┴────────────────────────────────────────────────┘
324
325 Note that XO and XG are generated by BWT search while the CIGAR string
326 by Smith-Waterman alignment. These two tags may be inconsistent with
327 the CIGAR string. This is not a bug.
328
329
331 Alignment Accuracy
332 When seeding is disabled, BWA guarantees to find an alignment contain‐
333 ing maximum maxDiff differences including maxGapO gap opens which do
334 not occur within nIndelEnd bp towards either end of the query. Longer
335 gaps may be found if maxGapE is positive, but it is not guaranteed to
336 find all hits. When seeding is enabled, BWA further requires that the
337 first seedLen subsequence contains no more than maxSeedDiff differ‐
338 ences.
339
340 When gapped alignment is disabled, BWA is expected to generate the same
341 alignment as Eland, the Illumina alignment program. However, as BWA
342 change `N' in the database sequence to random nucleotides, hits to
343 these random sequences will also be counted. As a consequence, BWA may
344 mark a unique hit as a repeat, if the random sequences happen to be
345 identical to the sequences which should be unqiue in the database. This
346 random behaviour will be avoided in future releases.
347
348 By default, if the best hit is no so repetitive (controlled by -R), BWA
349 also finds all hits contains one more mismatch; otherwise, BWA finds
350 all equally best hits only. Base quality is NOT considered in evaluat‐
351 ing hits. In paired-end alignment, BWA pairs all hits it found. It fur‐
352 ther performs Smith-Waterman alignment for unmapped reads with mates
353 mapped to rescue mapped mates, and for high-quality anomalous pairs to
354 fix potential alignment errors.
355
356
357 Estimating Insert Size Distribution
358 BWA estimates the insert size distribution per 256*1024 read pairs. It
359 first collects pairs of reads with both ends mapped with a single-end
360 quality 20 or higher and then calculates median (Q2), lower and higher
361 quartile (Q1 and Q3). It estimates the mean and the variance of the
362 insert size distribution from pairs whose insert sizes are within
363 interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair
364 considered to be properly paired (SAM flag 0x2) is calculated by solv‐
365 ing equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is
366 the standard error of the insert size distribution, L is the length of
367 the genome, p0 is prior of anomalous pair and Phi() is the standard
368 cumulative distribution function. For mapping Illumina short-insert
369 reads to the human genome, x is about 6-7 sigma away from the mean.
370 Quartiles, mean, variance and x will be printed to the standard error
371 output.
372
373
374 Memory Requirement
375 With bwtsw algorithm, 2.5GB memory is required for indexing the com‐
376 plete human genome sequences. For short reads, the `aln' command uses
377 ~2.3GB memory and the `sampe' command uses ~3.5GB.
378
379
380 Speed
381 Indexing the human genome sequences takes 3 hours with bwtsw algorithm.
382 Indexing smaller genomes with IS or divsufsort algorithms is several
383 times faster, but requires more memory.
384
385 Speed of alignment is largely determined by the error rate of the query
386 sequences (r). Firstly, BWA runs much faster for near perfect hits than
387 for hits with many differences, and it stops searching for a hit with
388 l+2 differences if a l-difference hit is found. This means BWA will be
389 very slow if r is high because in this case BWA has to visit hits with
390 many differences and looking for these hits is expensive. Secondly, the
391 alignment algorithm behind makes the speed sensitive to [k log(N)/m],
392 where k is the maximum allowed differences, N the size of database and
393 m the length of a query. In practice, we choose k w.r.t. r and there‐
394 fore r is the leading factor. I would not recommend to use BWA on data
395 with r>0.02.
396
397 Pairing is slower for shorter reads. This is mainly because shorter
398 reads have more spurious hits and converting SA coordinates to chromo‐
399 somal coordinates are very costly.
400
401 In a practical experiment, BWA is able to map 2 million 32bp reads to a
402 bacterial genome in several minutes, map the same amount of reads to
403 human X chromosome in 8-15 minutes and to the human genome in 15-25
404 minutes. This result implies that the speed of BWA is insensitive to
405 the size of database and therefore BWA is more efficient when the data‐
406 base is sufficiently large. On smaller genomes, hash based algorithms
407 are usually much faster.
408
409
411 Command `bwasw' is designed for long-read alignment. The algorithm
412 behind, BWA-SW, is similar to BWT-SW, but does not guarantee to find
413 all local hits due to the heuristic acceleration. It tends to be faster
414 and more accurate if the resultant alignment is supported by more
415 seeds, and therefore BWA-SW usually performs better on long queries
416 than on short ones.
417
418 On 350-1000bp reads, BWA-SW is several to tens of times faster than the
419 existing programs. Its accuracy is comparable to SSAHA2, more accurate
420 than BLAT. Like BLAT, BWA-SW also finds chimera which may pose a chal‐
421 lenge to SSAHA2. On 10-100kbp queries where chimera detection is impor‐
422 tant, BWA-SW is over 10X faster than BLAT while being more sensitive.
423
424 BWA-SW can also be used to align ~100bp reads, but it is slower than
425 the short-read algorithm. Its sensitivity and accuracy is lower than
426 SSAHA2 especially when the sequencing error rate is above 2%. This is
427 the trade-off of the 30X speed up in comparison to SSAHA2's -454 mode.
428
429
431 BWA website <http://bio-bwa.sourceforge.net>, Samtools website
432 <http://samtools.sourceforge.net>
433
434
436 Heng Li at the Sanger Institute wrote the key source codes and inte‐
437 grated the following codes for BWT construction: bwtsw
438 <http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at
439 the University of Hong Kong and IS
440 <http://yuta.256.googlepages.com/sais> originally proposed by Nong Ge
441 <http://www.cs.sysu.edu.cn/nong/> at the Sun Yat-Sen University and
442 implemented by Yuta Mori.
443
444
446 The full BWA package is distributed under GPLv3 as it uses source codes
447 from BWT-SW which is covered by GPL. Sorting, hash table, BWT and IS
448 libraries are distributed under the MIT license.
449
450 If you use the short-read alignment component, please cite the follow‐
451 ing paper:
452
453 Li H. and Durbin R. (2009) Fast and accurate short read alignment with
454 Burrows-Wheeler transform. Bioinformatics, 25, 1754-60. [PMID:
455 19451168]
456
457 If you use the long-read component (BWA-SW), please cite:
458
459 Li H. and Durbin R. (2010) Fast and accurate long-read alignment with
460 Burrows-Wheeler transform. Bioinformatics. [PMID: 20080505]
461
462
464 BWA is largely influenced by BWT-SW. It uses source codes from BWT-SW
465 and mimics its binary file formats; BWA-SW resembles BWT-SW in several
466 ways. The initial idea about BWT-based alignment also came from the
467 group who developed BWT-SW. At the same time, BWA is different enough
468 from BWT-SW. The short-read alignment algorithm bears no similarity to
469 Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW, it
470 introduces heuristics that can hardly be applied to the original algo‐
471 rithm. In all, BWA does not guarantee to find all local hits as what
472 BWT-SW is designed to do, but it is much faster than BWT-SW on both
473 short and long query sequences.
474
475 I started to write the first piece of codes on 24 May 2008 and got the
476 initial stable version on 02 June 2008. During this period, I was
477 acquainted that Professor Tak-Wah Lam, the first author of BWT-SW
478 paper, was collaborating with Beijing Genomics Institute on SOAP2, the
479 successor to SOAP (Short Oligonucleotide Analysis Package). SOAP2 has
480 come out in November 2008. According to the SourceForge download page,
481 the third BWT-based short read aligner, bowtie, was first released in
482 August 2008. At the time of writing this manual, at least three more
483 BWT-based short-read aligners are being implemented.
484
485 The BWA-SW algorithm is a new component of BWA. It was conceived in No‐
486 vember 2008 and implemented ten months later.
487
488
489
490bwa-0.5.9 24 January 2011 bwa(1)