bwa(1) - f14

1bwa(1)                       Bioinformatics tools                       bwa(1)
2
3
4

NAME

6       bwa - Burrows-Wheeler Alignment Tool
7

SYNOPSIS

9       bwa index -a bwtsw database.fasta
10
11       bwa aln database.fasta short_read.fastq > aln_sa.sai
12
13       bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam
14
15       bwa  sampe  database.fasta  aln_sa1.sai aln_sa2.sai read1.fq read2.fq >
16       aln.sam
17
18       bwa bwasw database.fasta long_read.fastq > aln.sam
19
20

DESCRIPTION

22       BWA  is  a  fast  light-weighted  tool  that  aligns  relatively  short
23       sequences  (queries)  to a sequence database (targe), such as the human
24       reference genome. It implements two different algorithms, both based on
25       Burrows-Wheeler  Transform  (BWT).  The first algorithm is designed for
26       short queries up to ~200bp with low error rate (<3%).  It  does  gapped
27       global  alignment w.r.t. queries, supports paired-end reads, and is one
28       of the fastest short read alignment algorithms to date while also  vis‐
29       iting  suboptimal  hits.  The second algorithm, BWA-SW, is designed for
30       long reads with more errors. It performs heuristic  Smith-Waterman-like
31       alignment  to  find high-scoring local hits (and thus chimera). On low-
32       error short queries, BWA-SW is slower and less accurate than the  first
33       algorithm, but on long queries, it is better.
34
35       For  both  algorithms,  the  database  file in the FASTA format must be
36       first indexed with the `index' command, which  typically  takes  a  few
37       hours.  The first algorithm is implemented via the `aln' command, which
38       finds the suffix array (SA) coordinates of good hits of each individual
39       read,  and  the `samse/sampe' command, which converts SA coordinates to
40       chromosomal coordinate and pairs reads (for `sampe'). The second  algo‐
41       rithm  is invoked by the `bwasw' command. It works for single-end reads
42       only.
43
44

COMMANDS AND OPTIONS

46       index  bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
47
48              Index database sequences in the FASTA format.
49
50              OPTIONS:
51
52              -c        Build color-space index. The input fast should  be  in
53                        nucleotide space.
54
55              -p STR    Prefix of the output database [same as db filename]
56
57              -a STR    Algorithm   for   constructing  BWT  index.  Available
58                        options are:
59
60                        is     IS linear-time algorithm for constructing  suf‐
61                               fix  array. It requires 5.37N memory where N is
62                               the size of  the  database.  IS  is  moderately
63                               fast,  but  does  not work with database larger
64                               than 2GB. IS is the default  algorithm  due  to
65                               its  simplicity. The current codes for IS algo‐
66                               rithm are reimplemented by Yuta Mori.
67
68                        bwtsw  Algorithm implemented in  BWT-SW.  This  method
69                               works  with the whole human genome, but it does
70                               not work with database smaller than 10MB and it
71                               is usually slower than IS.
72
73
74       aln    bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i
75              nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN]  [-M
76              misMsc]  [-O  gapOsc]  [-E  gapEsc]  [-q trimQual] <in.db.fasta>
77              <in.query.fq> > <out.sai>
78
79              Find the SA coordinates of the input reads. Maximum  maxSeedDiff
80              differences  are  allowed  in  the first seedLen subsequence and
81              maximum maxDiff differences are allowed in the whole sequence.
82
83              OPTIONS:
84
85              -n NUM    Maximum edit distance if the  value  is  INT,  or  the
86                        fraction  of  missing alignments given 2% uniform base
87                        error rate if FLOAT. In the latter case,  the  maximum
88                        edit  distance  is  automatically chosen for different
89                        read lengths. [0.04]
90
91              -o INT    Maximum number of gap opens [1]
92
93              -e INT    Maximum number of gap extensions, -1 for  k-difference
94                        mode (disallowing long gaps) [-1]
95
96              -d INT    Disallow  a  long  deletion  within INT bp towards the
97                        3'-end [16]
98
99              -i INT    Disallow an indel within INT bp towards the ends [5]
100
101              -l INT    Take the first INT subsequence  as  seed.  If  INT  is
102                        larger  than  the query sequence, seeding will be dis‐
103                        abled. For long reads, this option is typically ranged
104                        from 25 to 35 for `-k 2'. [inf]
105
106              -k INT    Maximum edit distance in the seed [2]
107
108              -t INT    Number of threads (multi-threading mode) [1]
109
110              -M INT    Mismatch  penalty.  BWA will not search for suboptimal
111                        hits with a score lower than (bestScore-misMsc). [3]
112
113              -O INT    Gap open penalty [11]
114
115              -E INT    Gap extension penalty [4]
116
117              -R INT    Proceed with suboptimal alignments  if  there  are  no
118                        more  than  INT  equally  best  hits. This option only
119                        affects paired-end mapping. Increasing this  threshold
120                        helps  to  improve the pairing accuracy at the cost of
121                        speed, especially for short reads (~32bp).
122
123              -c        Reverse query but not complement it, which is required
124                        for alignment in the color space.
125
126              -N        Disable  iterative  search. All hits with no more than
127                        maxDiff differences will be found. This mode  is  much
128                        slower than the default.
129
130              -q INT    Parameter  for read trimming. BWA trims a read down to
131                        argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT  where  l
132                        is the original read length. [0]
133
134              -I        The input is in the Illumina 1.3+ read format (quality
135                        equals ASCII-64).
136
137              -B INT    Length of barcode starting from the 5'-end.  When  INT
138                        is  positive, the barcode of each read will be trimmed
139                        before mapping and will be written at the BC SAM  tag.
140                        For  paired-end  reads, the barcode from both ends are
141                        concatenated. [0]
142
143              -b        Specify the input read sequence file is the  BAM  for‐
144                        mat.  For  paired-end data, two ends in a pair must be
145                        grouped together and options  -1  or  -2  are  usually
146                        applied to specify which end should be mapped. Typical
147                        command lines for mapping pair-end  data  in  the  BAM
148                        format are:
149
150                            bwa aln ref.fa -b1 reads.bam > 1.sai
151                            bwa aln ref.fa -b2 reads.bam > 2.sai
152                            bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam >
153                        aln.sam
154
155              -0        When -b is specified, only  use  single-end  reads  in
156                        mapping.
157
158              -1        When  -b  is  specified,  only use the first read in a
159                        read pair in mapping (skip single-end  reads  and  the
160                        second reads).
161
162              -2        When  -b  is  specified, only use the second read in a
163                        read pair in mapping.
164
165
166       samse  bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
167
168              Generate alignments in the SAM format  given  single-end  reads.
169              Repetitive hits will be randomly chosen.
170
171              OPTIONS:
172
173              -n INT    Maximum  number  of alignments to output in the XA tag
174                        for reads paired properly. If a read has more than INT
175                        hits, the XA tag will not be written. [3]
176
177              -r STR    Specify    the   read   group   in   a   format   like
178                        `@RG\tID:foo\tSM:bar'. [null]
179
180
181       sampe  bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N max‐
182              HitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
183              > <out.sam>
184
185              Generate alignments in the SAM format  given  paired-end  reads.
186              Repetitive read pairs will be placed randomly.
187
188              OPTIONS:
189
190              -a INT  Maximum  insert  size  for  a read pair to be considered
191                      being mapped properly. Since 0.4.5, this option is  only
192                      used  when  there are not enough good alignment to infer
193                      the distribution of insert sizes. [500]
194
195              -o INT  Maximum occurrences of a read for pairing. A  read  with
196                      more  occurrneces  will be treated as a single-end read.
197                      Reducing this parameter helps faster pairing. [100000]
198
199              -P      Load the entire FM-index  into  memory  to  reduce  disk
200                      operations (base-space reads only). With this option, at
201                      least 1.25N bytes of memory are required, where N is the
202                      length of the genome.
203
204              -n INT  Maximum number of alignments to output in the XA tag for
205                      reads paired properly. If a read has more than INT hits,
206                      the XA tag will not be written. [3]
207
208              -N INT  Maximum number of alignments to output in the XA tag for
209                      disconcordant read pairs (excluding  singletons).  If  a
210                      read  has  more  than  INT  hits, the XA tag will not be
211                      written. [10]
212
213              -r STR  Specify   the   read   group   in    a    format    like
214                      `@RG\tID:foo\tSM:bar'. [null]
215
216
217       bwasw  bwa  bwasw  [-a  matchScore]  [-b  mmPen]  [-q  gapOpenPen]  [-r
218              gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s  hspIntv]
219              [-z zBest] [-N nHspRev] [-c thresCoef] <in.db.fasta> <in.fq>
220
221              Align query sequences in the <in.fq> file.
222
223              OPTIONS:
224
225              -a INT    Score of a match [1]
226
227              -b INT    Mismatch penalty [3]
228
229              -q INT    Gap open penalty [5]
230
231              -r INT    Gap  extension  penalty.  The penalty for a contiguous
232                        gap of size k is q+k*r. [2]
233
234              -t INT    Number of threads in the multi-threading mode [1]
235
236              -w INT    Band width in the banded alignment [33]
237
238              -T INT    Minimum score threshold divided by a [37]
239
240              -c FLOAT  Coefficient  for  threshold  adjustment  according  to
241                        query length. Given an l-long query, the threshold for
242                        a hit to be retained is a*max{T,c*log(l)}. [5.5]
243
244              -z INT    Z-best heuristics. Higher -z increases accuracy at the
245                        cost of speed. [1]
246
247              -s INT    Maximum SA interval size for initiating a seed. Higher
248                        -s increases accuracy at the cost of speed. [3]
249
250              -N INT    Minimum  number  of  seeds  supporting  the  resultant
251                        alignment to skip reverse alignment. [5]
252
253

SAM ALIGNMENT FORMAT

255       The  output  of  the  `aln'  command is binary and designed for BWA use
256       only. BWA outputs the final  alignment  in  the  SAM  (Sequence  Align‐
257       ment/Map) format. Each line consists of:
258
259
260       ┌────┬───────┬──────────────────────────────────────────────────────────┐
261       │Col │ Field │                       Description                        │
262       ├────┼───────┼──────────────────────────────────────────────────────────┤
263       │ 1  │ QNAME │ Query (pair) NAME                                        │
264       │ 2  │ FLAG  │ bitwise FLAG                                             │
265       │ 3  │ RNAME │ Reference sequence NAME                                  │
266       │ 4  │ POS   │ 1-based leftmost POSition/coordinate of clipped sequence │
267       │ 5  │ MAPQ  │ MAPping Quality (Phred-scaled)                           │
268       │ 6  │ CIAGR │ extended CIGAR string                                    │
269       │ 7  │ MRNM  │ Mate Reference sequence NaMe (`=' if same as RNAME)      │
270       │ 8  │ MPOS  │ 1-based Mate POSistion                                   │
271       │ 9  │ ISIZE │ Inferred insert SIZE                                     │
272       │10  │ SEQ   │ query SEQuence on the same strand as the reference       │
273       │11  │ QUAL  │ query QUALity (ASCII-33 gives the Phred base quality)    │
274       │12  │ OPT   │ variable OPTional fields in the format TAG:VTYPE:VALUE   │
275       └────┴───────┴──────────────────────────────────────────────────────────┘
276
277       Each bit in the FLAG field is defined as:
278
279
280               ┌────┬────────┬───────────────────────────────────────┐
281               │Chr │  Flag  │              Description              │
282               ├────┼────────┼───────────────────────────────────────┤
283               │ p  │ 0x0001 │ the read is paired in sequencing      │
284               │ P  │ 0x0002 │ the read is mapped in a proper pair   │
285               │ u  │ 0x0004 │ the query sequence itself is unmapped │
286               │ U  │ 0x0008 │ the mate is unmapped                  │
287               │ r  │ 0x0010 │ strand of the query (1 for reverse)   │
288               │ R  │ 0x0020 │ strand of the mate                    │
289               │ 1  │ 0x0040 │ the read is the first read in a pair  │
290               │ 2  │ 0x0080 │ the read is the second read in a pair │
291               │ s  │ 0x0100 │ the alignment is not primary          │
292               │ f  │ 0x0200 │ QC failure                            │
293               │ d  │ 0x0400 │ optical or PCR duplicate              │
294               └────┴────────┴───────────────────────────────────────┘
295
296       The Please check <http://samtools.sourceforge.net> for the format spec‐
297       ification and the tools for post-processing the alignment.
298
299       BWA generates the following optional fields. Tags starting with `X' are
300       specific to BWA.
301
302
303               ┌────┬────────────────────────────────────────────────┐
304               │Tag │                    Meaning                     │
305               ├────┼────────────────────────────────────────────────┤
306               │NM  │ Edit distance                                  │
307               │MD  │ Mismatching positions/bases                    │
308               │AS  │ Alignment score                                │
309               │BC  │ Barcode sequence                               │
310               ├────┼────────────────────────────────────────────────┤
311               │X0  │ Number of best hits                            │
312               │X1  │ Number of suboptimal hits found by BWA         │
313               │XN  │ Number of ambiguous bases in the referenece    │
314               │XM  │ Number of mismatches in the alignment          │
315               │XO  │ Number of gap opens                            │
316               │XG  │ Number of gap extentions                       │
317               │XT  │ Type: Unique/Repeat/N/Mate-sw                  │
318               │XA  │ Alternative hits; format: (chr,pos,CIGAR,NM;)* │
319               ├────┼────────────────────────────────────────────────┤
320               │XS  │ Suboptimal alignment score                     │
321               │XF  │ Support from forward/reverse alignment         │
322               │XE  │ Number of supporting seeds                     │
323               └────┴────────────────────────────────────────────────┘
324
325       Note  that XO and XG are generated by BWT search while the CIGAR string
326       by Smith-Waterman alignment. These two tags may  be  inconsistent  with
327       the CIGAR string. This is not a bug.
328
329

NOTES ON SHORT-READ ALIGNMENT

331   Alignment Accuracy
332       When  seeding is disabled, BWA guarantees to find an alignment contain‐
333       ing maximum maxDiff differences including maxGapO gap  opens  which  do
334       not  occur  within nIndelEnd bp towards either end of the query. Longer
335       gaps may be found if maxGapE is positive, but it is not  guaranteed  to
336       find  all  hits. When seeding is enabled, BWA further requires that the
337       first seedLen subsequence contains no  more  than  maxSeedDiff  differ‐
338       ences.
339
340       When gapped alignment is disabled, BWA is expected to generate the same
341       alignment as Eland, the Illumina alignment  program.  However,  as  BWA
342       change  `N'  in  the  database  sequence to random nucleotides, hits to
343       these random sequences will also be counted. As a consequence, BWA  may
344       mark  a  unique  hit  as a repeat, if the random sequences happen to be
345       identical to the sequences which should be unqiue in the database. This
346       random behaviour will be avoided in future releases.
347
348       By default, if the best hit is no so repetitive (controlled by -R), BWA
349       also finds all hits contains one more mismatch;  otherwise,  BWA  finds
350       all  equally best hits only. Base quality is NOT considered in evaluat‐
351       ing hits. In paired-end alignment, BWA pairs all hits it found. It fur‐
352       ther  performs  Smith-Waterman  alignment for unmapped reads with mates
353       mapped to rescue mapped mates, and for high-quality anomalous pairs  to
354       fix potential alignment errors.
355
356
357   Estimating Insert Size Distribution
358       BWA  estimates the insert size distribution per 256*1024 read pairs. It
359       first collects pairs of reads with both ends mapped with  a  single-end
360       quality  20 or higher and then calculates median (Q2), lower and higher
361       quartile (Q1 and Q3). It estimates the mean and  the  variance  of  the
362       insert  size  distribution  from  pairs  whose  insert sizes are within
363       interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a  pair
364       considered  to be properly paired (SAM flag 0x2) is calculated by solv‐
365       ing equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean,  sigma  is
366       the  standard error of the insert size distribution, L is the length of
367       the genome, p0 is prior of anomalous pair and  Phi()  is  the  standard
368       cumulative  distribution  function.  For  mapping Illumina short-insert
369       reads to the human genome, x is about 6-7 sigma  away  from  the  mean.
370       Quartiles,  mean,  variance and x will be printed to the standard error
371       output.
372
373
374   Memory Requirement
375       With bwtsw algorithm, 2.5GB memory is required for  indexing  the  com‐
376       plete  human  genome sequences. For short reads, the `aln' command uses
377       ~2.3GB memory and the `sampe' command uses ~3.5GB.
378
379
380   Speed
381       Indexing the human genome sequences takes 3 hours with bwtsw algorithm.
382       Indexing  smaller  genomes  with IS or divsufsort algorithms is several
383       times faster, but requires more memory.
384
385       Speed of alignment is largely determined by the error rate of the query
386       sequences (r). Firstly, BWA runs much faster for near perfect hits than
387       for hits with many differences, and it stops searching for a  hit  with
388       l+2  differences if a l-difference hit is found. This means BWA will be
389       very slow if r is high because in this case BWA has to visit hits  with
390       many differences and looking for these hits is expensive. Secondly, the
391       alignment algorithm behind makes the speed sensitive to  [k  log(N)/m],
392       where  k is the maximum allowed differences, N the size of database and
393       m the length of a query. In practice, we choose k w.r.t. r  and  there‐
394       fore  r is the leading factor. I would not recommend to use BWA on data
395       with r>0.02.
396
397       Pairing is slower for shorter reads. This  is  mainly  because  shorter
398       reads  have more spurious hits and converting SA coordinates to chromo‐
399       somal coordinates are very costly.
400
401       In a practical experiment, BWA is able to map 2 million 32bp reads to a
402       bacterial  genome  in  several minutes, map the same amount of reads to
403       human X chromosome in 8-15 minutes and to the  human  genome  in  15-25
404       minutes.  This  result  implies that the speed of BWA is insensitive to
405       the size of database and therefore BWA is more efficient when the data‐
406       base  is  sufficiently large. On smaller genomes, hash based algorithms
407       are usually much faster.
408
409

NOTES ON LONG-READ ALIGNMENT

411       Command `bwasw' is designed  for  long-read  alignment.  The  algorithm
412       behind,  BWA-SW,  is  similar to BWT-SW, but does not guarantee to find
413       all local hits due to the heuristic acceleration. It tends to be faster
414       and  more  accurate  if  the  resultant  alignment is supported by more
415       seeds, and therefore BWA-SW usually performs  better  on  long  queries
416       than on short ones.
417
418       On 350-1000bp reads, BWA-SW is several to tens of times faster than the
419       existing programs. Its accuracy is comparable to SSAHA2, more  accurate
420       than  BLAT. Like BLAT, BWA-SW also finds chimera which may pose a chal‐
421       lenge to SSAHA2. On 10-100kbp queries where chimera detection is impor‐
422       tant, BWA-SW is over 10X faster than BLAT while being more sensitive.
423
424       BWA-SW  can  also  be used to align ~100bp reads, but it is slower than
425       the short-read algorithm. Its sensitivity and accuracy  is  lower  than
426       SSAHA2  especially  when the sequencing error rate is above 2%. This is
427       the trade-off of the 30X speed up in comparison to SSAHA2's -454 mode.
428
429

AUTHOR

436       Heng  Li  at  the Sanger Institute wrote the key source codes and inte‐
437       grated   the   following   codes   for    BWT    construction:    bwtsw
438       <http://i.cs.hku.hk/~ckwong3/bwtsw/>,  implemented by Chi-Kwong Wong at
439       the       University       of       Hong       Kong       and        IS
440       <http://yuta.256.googlepages.com/sais>  originally  proposed by Nong Ge
441       <http://www.cs.sysu.edu.cn/nong/> at the  Sun  Yat-Sen  University  and
442       implemented by Yuta Mori.
443
444

LICENSE AND CITATION

446       The full BWA package is distributed under GPLv3 as it uses source codes
447       from BWT-SW which is covered by GPL. Sorting, hash table,  BWT  and  IS
448       libraries are distributed under the MIT license.
449
450       If  you use the short-read alignment component, please cite the follow‐
451       ing paper:
452
453       Li H. and Durbin R. (2009) Fast and accurate short read alignment  with
454       Burrows-Wheeler   transform.   Bioinformatics,   25,   1754-60.  [PMID:
455       19451168]
456
457       If you use the long-read component (BWA-SW), please cite:
458
459       Li H. and Durbin R. (2010) Fast and accurate long-read  alignment  with
460       Burrows-Wheeler transform. Bioinformatics. [PMID: 20080505]
461
462

HISTORY

464       BWA  is  largely influenced by BWT-SW. It uses source codes from BWT-SW
465       and mimics its binary file formats; BWA-SW resembles BWT-SW in  several
466       ways.  The  initial  idea  about BWT-based alignment also came from the
467       group who developed BWT-SW. At the same time, BWA is  different  enough
468       from  BWT-SW. The short-read alignment algorithm bears no similarity to
469       Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW,  it
470       introduces  heuristics that can hardly be applied to the original algo‐
471       rithm. In all, BWA does not guarantee to find all local  hits  as  what
472       BWT-SW  is  designed  to  do, but it is much faster than BWT-SW on both
473       short and long query sequences.
474
475       I started to write the first piece of codes on 24 May 2008 and got  the
476       initial  stable  version  on  02  June  2008. During this period, I was
477       acquainted that Professor Tak-Wah  Lam,  the  first  author  of  BWT-SW
478       paper,  was collaborating with Beijing Genomics Institute on SOAP2, the
479       successor to SOAP (Short Oligonucleotide Analysis Package).  SOAP2  has
480       come  out in November 2008. According to the SourceForge download page,
481       the third BWT-based short read aligner, bowtie, was first  released  in
482       August  2008.  At  the time of writing this manual, at least three more
483       BWT-based short-read aligners are being implemented.
484
485       The BWA-SW algorithm is a new component of BWA. It was conceived in No‐
486       vember 2008 and implemented ten months later.
487
488
489
490bwa-0.5.9                       24 January 2011                         bwa(1)