bwa(1) - f30

1bwa(1)                       Bioinformatics tools                       bwa(1)
2
3
4

NAME

6       bwa - Burrows-Wheeler Alignment Tool
7

SYNOPSIS

9       bwa index ref.fa
10
11       bwa mem ref.fa reads.fq > aln-se.sam
12
13       bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
14
15       bwa aln ref.fa short_read.fq > aln_sa.sai
16
17       bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
18
19       bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
20
21       bwa bwasw ref.fa long_read.fq > aln.sam
22
23

DESCRIPTION

25       BWA is a software package for mapping low-divergent sequences against a
26       large reference genome, such as the human genome. It consists of  three
27       algorithms:  BWA-backtrack,  BWA-SW and BWA-MEM. The first algorithm is
28       designed for Illumina sequence reads up to 100bp, while  the  rest  two
29       for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share
30       similar features such as long-read support  and  split  alignment,  but
31       BWA-MEM, which is the latest, is generally recommended for high-quality
32       queries as it is faster and more accurate.   BWA-MEM  also  has  better
33       performance than BWA-backtrack for 70-100bp Illumina reads.
34
35       For  all  the algorithms, BWA first needs to construct the FM-index for
36       the reference genome (the  index  command).  Alignment  algorithms  are
37       invoked with different sub-commands: aln/samse/sampe for BWA-backtrack,
38       bwasw for BWA-SW and mem for the BWA-MEM algorithm.
39
40

COMMANDS AND OPTIONS

42       index  bwa index [-p prefix] [-a algoType] db.fa
43
44              Index database sequences in the FASTA format.
45
46              OPTIONS:
47
48              -p STR    Prefix of the output database [same as db filename]
49
50              -a STR    Algorithm for constructing BWT index.  BWA  implements
51                        three  algorithms  for BWT construction: is, bwtsw and
52                        rb2.  The first algorithm is a little faster for small
53                        database  but requires large RAM and does not work for
54                        databases with total length longer than 2GB. The  sec‐
55                        ond  algorithm is adapted from the BWT-SW source code.
56                        It in theory works with  database  with  trillions  of
57                        bases.  When  this option is not specified, the appro‐
58                        priate algorithm will be chosen automatically.
59
60
61       mem    bwa mem [-aCHjMpP] [-t nThreads] [-k minSeedLen] [-w  bandWidth]
62              [-d  zDropoff]  [-r seedSplitRatio] [-c maxOcc] [-D chainShadow]
63              [-m maxMateSW] [-W minSeedMatch] [-A matchScore] [-B  mmPenalty]
64              [-O  gapOpenPen]  [-E gapExtPen] [-L clipPen] [-U unpairPen] [-R
65              RGline]  [-H  HDlines]  [-v  verboseLevel]  db.prefix   reads.fq
66              [mates.fq]
67
68              Align  70bp-1Mbp  query  sequences  with  the BWA-MEM algorithm.
69              Briefly, the algorithm works by seeding alignments with  maximal
70              exact  matches  (MEMs) and then extending seeds with the affine-
71              gap Smith-Waterman algorithm (SW).
72
73              If mates.fq file is absent and option -p is not set,  this  com‐
74              mand regards input reads are single-end. If mates.fq is present,
75              this command assumes the i-th read in reads.fq and the i-th read
76              in  mates.fq  constitute a read pair. If -p is used, the command
77              assumes the 2i-th and the (2i+1)-th read in reads.fq  constitute
78              a read pair (such input file is said to be interleaved). In this
79              case, mates.fq is ignored. In the paired-end mode, the mem  com‐
80              mand will infer the read orientation and the insert size distri‐
81              bution from a batch of reads.
82
83              The BWA-MEM algorithm performs local alignment. It  may  produce
84              multiple  primary  alignments  for  different  part  of  a query
85              sequence. This is a crucial feature for long sequences. However,
86              some  tools  such  as Picard's markDuplicates does not work with
87              split alignments. One may consider to  use  option  -M  to  flag
88              shorter split hits as secondary.
89
90
91              ALGORITHM OPTIONS:
92
93              -t INT    Number of threads [1]
94
95              -k INT    Minimum  seed length. Matches shorter than INT will be
96                        missed. The alignment speed is usually insensitive  to
97                        this  value  unless it significantly deviates from 20.
98                        [19]
99
100              -w INT    Band width. Essentially, gaps longer than INT will not
101                        be  found.  Note  that  the maximum gap length is also
102                        affected by the scoring matrix and the hit length, not
103                        solely determined by this option. [100]
104
105              -d INT    Off-diagonal  X-dropoff  (Z-dropoff).  Stop  extension
106                        when the difference between the best and  the  current
107                        extension  score  is  above |i-j|*A+INT, where i and j
108                        are the current positions of the query and  reference,
109                        respectively,  and  A is the matching score. Z-dropoff
110                        is similar to BLAST's X-dropoff except that it doesn't
111                        penalize  gaps  in  one of the sequences in the align‐
112                        ment. Z-dropoff not only avoids unnecessary extension,
113                        but  also  reduces  poor alignments inside a long good
114                        alignment. [100]
115
116              -r FLOAT  Trigger  re-seeding  for  a  MEM  longer   than   min‐
117                        SeedLen*FLOAT.   This is a key heuristic parameter for
118                        tuning the  performance.  Larger  value  yields  fewer
119                        seeds, which leads to faster alignment speed but lower
120                        accuracy. [1.5]
121
122              -c INT    Discard a MEM if it has more than INT occurence in the
123                        genome. This is an insensitive parameter. [500]
124
125              -D FLOAT  Drop chains shorter than FLOAT fraction of the longest
126                        overlapping chain [0.5]
127
128              -m INT    Perform at most INT rounds of mate-SW [50]
129
130              -W INT    Drop a chain if  the  number  of  bases  in  seeds  is
131                        smaller  than  INT.  This option is primarily used for
132                        longer contigs/reads. When positive, it  also  affects
133                        seed filtering. [0]
134
135              -P        In  the  paired-end mode, perform SW to rescue missing
136                        hits only but do not try  to  find  hits  that  fit  a
137                        proper pair.
138
139
140              SCORING OPTIONS:
141
142              -A INT    Matching score. [1]
143
144              -B INT    Mismatch  penalty. The sequence error rate is approxi‐
145                        mately: {.75 * exp[-log(4) * B/A]}. [4]
146
147              -O INT[,INT]
148                        Gap open penalty. If two numbers  are  specified,  the
149                        first  is  the  penalty of openning a deletion and the
150                        second for openning an insertion. [6]
151
152              -E INT[,INT]
153                        Gap extension penalty. If two numbers  are  specified,
154                        the  first  is the penalty of extending a deletion and
155                        second for extending an insertion. A gap of  length  k
156                        costs  O  + k*E (i.e.  -O is for opening a zero-length
157                        gap). [1]
158
159              -L INT[,INT]
160                        Clipping penalty. When performing SW  extension,  BWA-
161                        MEM  keeps track of the best score reaching the end of
162                        query. If this score is larger than the best SW  score
163                        minus  the  clipping  penalty,  clipping  will  not be
164                        applied. Note that  in  this  case,  the  SAM  AS  tag
165                        reports  the  best  SW  score; clipping penalty is not
166                        deduced. If two numbers are provided, the first is for
167                        5'-end clipping and second for 3'-end clipping. [5]
168
169              -U INT    Penalty  for  an unpaired read pair. BWA-MEM scores an
170                        unpaired read pair  as  scoreRead1+scoreRead2-INT  and
171                        scores   a   paired  as  scoreRead1+scoreRead2-insert‐
172                        Penalty. It compares these  two  scores  to  determine
173                        whether  we should force pairing. A larger value leads
174                        to more aggressive read pair. [17]
175
176
177              INPUT/OUTPUT OPTIONS:
178
179              -p        Smart pairing. If two adjacent  reads  have  the  same
180                        name,  they  are  considered to form a read pair. This
181                        way, paired-end and single-end reads can be mixed in a
182                        single FASTA/Q stream.
183
184              -R STR    Complete  read  group header line. '\t' can be used in
185                        STR and will be converted to a TAB in the output  SAM.
186                        The  read  group  ID will be attached to every read in
187                        the  output.  An  example  is   '@RG\tID:foo\tSM:bar'.
188                        [null]
189
190              -H ARG    If  ARG  starts  with @, it is interpreted as a string
191                        and gets inserted into the output SAM  header;  other‐
192                        wise,  ARG  is  interpreted  as  a file with all lines
193                        starting with @ in the  file  inserted  into  the  SAM
194                        header. [null]
195
196              -T INT    Don't  output  alignment  with  score  lower than INT.
197                        This option affects output and occasionally  SAM  flag
198                        2. [30]
199
200              -j        Treat  ALT  contigs  as  part  of the primary assembly
201                        (i.e. ignore the db.prefix.alt file).
202
203              -h INT[,INT2]
204                        If a query has not  more  than  INT  hits  with  score
205                        higher  than  80%  of the best hit, output them all in
206                        the XA tag.  If INT2 is specified, BWA-MEM outputs  up
207                        to INT2 hits if the list contains a hit to an ALT con‐
208                        tig. [5,200]
209
210              -a        Output all found alignments for single-end or unpaired
211                        paired-end  reads. These alignments will be flagged as
212                        secondary alignments.
213
214              -C        Append append FASTA/Q  comment  to  SAM  output.  This
215                        option  can  be used to transfer read meta information
216                        (e.g. barcode)  to  the  SAM  output.  Note  that  the
217                        FASTA/Q  comment  (the  string  after  a  space in the
218                        header  line)  must  conform  the   SAM   spec   (e.g.
219                        BC:Z:CGTAC).  Malformated  comments  lead to incorrect
220                        SAM output.
221
222              -Y        Use soft clipping CIGAR  operation  for  supplementary
223                        alignments. By default, BWA-MEM uses soft clipping for
224                        the primary alignment and hard clipping for supplemen‐
225                        tary alignments.
226
227              -M        Mark  shorter split hits as secondary (for Picard com‐
228                        patibility).
229
230              -v INT    Control the verbose level of the output.  This  option
231                        has  not been fully supported throughout BWA. Ideally,
232                        a value 0 for disabling all the output  to  stderr;  1
233                        for outputting errors only; 2 for warnings and errors;
234                        3 for all normal messages; 4 or higher for  debugging.
235                        When this option takes value 4, the output is not SAM.
236                        [3]
237
238              -I FLOAT[,FLOAT[,INT[,INT]]]
239                        Specify the mean, standard deviation (10% of the  mean
240                        if  absent), max (4 sigma from the mean if absent) and
241                        min (4 sigma if absent) of the insert  size  distribu‐
242                        tion.  Only  applicable  to  the  FR  orientation.  By
243                        default, BWA-MEM infers these  numbers  and  the  pair
244                        orientations given enough reads. [inferred]
245
246
247
248       aln    bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i
249              nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN]  [-M
250              misMsc]  [-O  gapOsc]  [-E  gapEsc]  [-q trimQual] <in.db.fasta>
251              <in.query.fq> > <out.sai>
252
253              Find the SA coordinates of the input reads. Maximum  maxSeedDiff
254              differences  are  allowed  in  the first seedLen subsequence and
255              maximum maxDiff differences are allowed in the whole sequence.
256
257              OPTIONS:
258
259              -n NUM    Maximum edit distance if the  value  is  INT,  or  the
260                        fraction  of  missing alignments given 2% uniform base
261                        error rate if FLOAT. In the latter case,  the  maximum
262                        edit  distance  is  automatically chosen for different
263                        read lengths. [0.04]
264
265              -o INT    Maximum number of gap opens [1]
266
267              -e INT    Maximum number of gap extensions, -1 for  k-difference
268                        mode (disallowing long gaps) [-1]
269
270              -d INT    Disallow  a  long  deletion  within INT bp towards the
271                        3'-end [16]
272
273              -i INT    Disallow an indel within INT bp towards the ends [5]
274
275              -l INT    Take the first INT subsequence  as  seed.  If  INT  is
276                        larger  than  the query sequence, seeding will be dis‐
277                        abled. For long reads, this option is typically ranged
278                        from 25 to 35 for `-k 2'. [inf]
279
280              -k INT    Maximum edit distance in the seed [2]
281
282              -t INT    Number of threads (multi-threading mode) [1]
283
284              -M INT    Mismatch  penalty.  BWA will not search for suboptimal
285                        hits with a score lower than (bestScore-misMsc). [3]
286
287              -O INT    Gap open penalty [11]
288
289              -E INT    Gap extension penalty [4]
290
291              -R INT    Proceed with suboptimal alignments  if  there  are  no
292                        more  than  INT  equally  best  hits. This option only
293                        affects paired-end mapping. Increasing this  threshold
294                        helps  to  improve the pairing accuracy at the cost of
295                        speed, especially for short reads (~32bp).
296
297              -c        Reverse query but not complement it, which is required
298                        for  alignment  in  the  color  space. (Disabled since
299                        0.6.x)
300
301              -N        Disable iterative search. All hits with no  more  than
302                        maxDiff  differences  will be found. This mode is much
303                        slower than the default.
304
305              -q INT    Parameter for read trimming. BWA trims a read down  to
306                        argmax_x{\sum_{i=x+1}^l(INT-q_i)}  if  q_l<INT where l
307                        is the original read length. [0]
308
309              -I        The input is in the Illumina 1.3+ read format (quality
310                        equals ASCII-64).
311
312              -B INT    Length  of  barcode starting from the 5'-end. When INT
313                        is positive, the barcode of each read will be  trimmed
314                        before  mapping and will be written at the BC SAM tag.
315                        For paired-end reads, the barcode from both  ends  are
316                        concatenated. [0]
317
318              -b        Specify  the  input read sequence file is the BAM for‐
319                        mat. For paired-end data, two ends in a pair  must  be
320                        grouped  together  and  options  -1  or -2 are usually
321                        applied to specify which end should be mapped. Typical
322                        command  lines  for  mapping  pair-end data in the BAM
323                        format are:
324
325                            bwa aln ref.fa -b1 reads.bam > 1.sai
326                            bwa aln ref.fa -b2 reads.bam > 2.sai
327                            bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam >
328                        aln.sam
329
330              -0        When  -b  is  specified,  only use single-end reads in
331                        mapping.
332
333              -1        When -b is specified, only use the  first  read  in  a
334                        read  pair  in  mapping (skip single-end reads and the
335                        second reads).
336
337              -2        When -b is specified, only use the second  read  in  a
338                        read pair in mapping.
339
340
341       samse  bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
342
343              Generate  alignments  in  the SAM format given single-end reads.
344              Repetitive hits will be randomly chosen.
345
346              OPTIONS:
347
348              -n INT    Maximum number of alignments to output in the  XA  tag
349                        for reads paired properly. If a read has more than INT
350                        hits, the XA tag will not be written. [3]
351
352              -r STR    Specify   the   read   group   in   a   format    like
353                        `@RG\tID:foo\tSM:bar'. [null]
354
355
356       sampe  bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N max‐
357              HitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
358              > <out.sam>
359
360              Generate  alignments  in  the SAM format given paired-end reads.
361              Repetitive read pairs will be placed randomly.
362
363              OPTIONS:
364
365              -a INT  Maximum insert size for a read  pair  to  be  considered
366                      being  mapped properly. Since 0.4.5, this option is only
367                      used when there are not enough good alignment  to  infer
368                      the distribution of insert sizes. [500]
369
370              -o INT  Maximum  occurrences  of a read for pairing. A read with
371                      more occurrneces will be treated as a  single-end  read.
372                      Reducing this parameter helps faster pairing. [100000]
373
374              -P      Load  the  entire  FM-index  into  memory to reduce disk
375                      operations (base-space reads only). With this option, at
376                      least 1.25N bytes of memory are required, where N is the
377                      length of the genome.
378
379              -n INT  Maximum number of alignments to output in the XA tag for
380                      reads paired properly. If a read has more than INT hits,
381                      the XA tag will not be written. [3]
382
383              -N INT  Maximum number of alignments to output in the XA tag for
384                      disconcordant  read  pairs  (excluding singletons). If a
385                      read has more than INT hits, the  XA  tag  will  not  be
386                      written. [10]
387
388              -r STR  Specify    the    read    group   in   a   format   like
389                      `@RG\tID:foo\tSM:bar'. [null]
390
391
392       bwasw  bwa  bwasw  [-a  matchScore]  [-b  mmPen]  [-q  gapOpenPen]  [-r
393              gapExtPen]  [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv]
394              [-z zBest] [-N nHspRev]  [-c  thresCoef]  <in.db.fasta>  <in.fq>
395              [mate.fq]
396
397              Align  query  sequences  in  the  in.fq  file.  When  mate.fq is
398              present, perform paired-end alignment. The paired-end mode  only
399              works  for reads Illumina short-insert libraries. In the paired-
400              end mode, BWA-SW may still output split alignments but they  are
401              all  marked  as not properly paired; the mate positions will not
402              be written if the mate has multiple local hits.
403
404              OPTIONS:
405
406              -a INT    Score of a match [1]
407
408              -b INT    Mismatch penalty [3]
409
410              -q INT    Gap open penalty [5]
411
412              -r INT    Gap extension penalty. The penalty  for  a  contiguous
413                        gap of size k is q+k*r. [2]
414
415              -t INT    Number of threads in the multi-threading mode [1]
416
417              -w INT    Band width in the banded alignment [33]
418
419              -T INT    Minimum score threshold divided by a [37]
420
421              -c FLOAT  Coefficient  for  threshold  adjustment  according  to
422                        query length. Given an l-long query, the threshold for
423                        a hit to be retained is a*max{T,c*log(l)}. [5.5]
424
425              -z INT    Z-best heuristics. Higher -z increases accuracy at the
426                        cost of speed. [1]
427
428              -s INT    Maximum SA interval size for initiating a seed. Higher
429                        -s increases accuracy at the cost of speed. [3]
430
431              -N INT    Minimum  number  of  seeds  supporting  the  resultant
432                        alignment to skip reverse alignment. [5]
433
434

SAM ALIGNMENT FORMAT

436       The output of the `aln' command is binary  and  designed  for  BWA  use
437       only.  BWA  outputs  the  final  alignment  in the SAM (Sequence Align‐
438       ment/Map) format. Each line consists of:
439
440
441       ┌────┬───────┬──────────────────────────────────────────────────────────┐
442       │Col │ Field │                       Description                        │
443       ├────┼───────┼──────────────────────────────────────────────────────────┤
444       │ 1  │ QNAME │ Query (pair) NAME                                        │
445       │ 2  │ FLAG  │ bitwise FLAG                                             │
446       │ 3  │ RNAME │ Reference sequence NAME                                  │
447       │ 4  │ POS   │ 1-based leftmost POSition/coordinate of clipped sequence │
448       │ 5  │ MAPQ  │ MAPping Quality (Phred-scaled)                           │
449       │ 6  │ CIAGR │ extended CIGAR string                                    │
450       │ 7  │ MRNM  │ Mate Reference sequence NaMe (`=' if same as RNAME)      │
451       │ 8  │ MPOS  │ 1-based Mate POSistion                                   │
452       │ 9  │ ISIZE │ Inferred insert SIZE                                     │
453       │10  │ SEQ   │ query SEQuence on the same strand as the reference       │
454       │11  │ QUAL  │ query QUALity (ASCII-33 gives the Phred base quality)    │
455       │12  │ OPT   │ variable OPTional fields in the format TAG:VTYPE:VALUE   │
456       └────┴───────┴──────────────────────────────────────────────────────────┘
457
458       Each bit in the FLAG field is defined as:
459
460
461               ┌────┬────────┬───────────────────────────────────────┐
462               │Chr │  Flag  │              Description              │
463               ├────┼────────┼───────────────────────────────────────┤
464               │ p  │ 0x0001 │ the read is paired in sequencing      │
465               │ P  │ 0x0002 │ the read is mapped in a proper pair   │
466               │ u  │ 0x0004 │ the query sequence itself is unmapped │
467               │ U  │ 0x0008 │ the mate is unmapped                  │
468               │ r  │ 0x0010 │ strand of the query (1 for reverse)   │
469               │ R  │ 0x0020 │ strand of the mate                    │
470               │ 1  │ 0x0040 │ the read is the first read in a pair  │
471               │ 2  │ 0x0080 │ the read is the second read in a pair │
472               │ s  │ 0x0100 │ the alignment is not primary          │
473               │ f  │ 0x0200 │ QC failure                            │
474               │ d  │ 0x0400 │ optical or PCR duplicate              │
475               │ S  │ 0x0800 │ supplementary alignment               │
476               └────┴────────┴───────────────────────────────────────┘
477
478       The Please check <http://samtools.sourceforge.net> for the format spec‐
479       ification and the tools for post-processing the alignment.
480
481       BWA generates the following optional fields. Tags starting with `X' are
482       specific to BWA.
483
484
485              ┌────┬──────────────────────────────────────────────────┐
486              │Tag │                     Meaning                      │
487              ├────┼──────────────────────────────────────────────────┤
488              │NM  │ Edit distance                                    │
489              │MD  │ Mismatching positions/bases                      │
490              │AS  │ Alignment score                                  │
491              │BC  │ Barcode sequence                                 │
492              │SA  │ Supplementary alignments                         │
493              ├────┼──────────────────────────────────────────────────┤
494              │X0  │ Number of best hits                              │
495              │X1  │ Number of suboptimal hits found by BWA           │
496              │XN  │ Number of ambiguous bases in the referenece      │
497              │XM  │ Number of mismatches in the alignment            │
498              │XO  │ Number of gap opens                              │
499              │XG  │ Number of gap extentions                         │
500              │XT  │ Type: Unique/Repeat/N/Mate-sw                    │
501              │XA  │ Alternative hits; format: /(chr,pos,CIGAR,NM;)*/ │
502              ├────┼──────────────────────────────────────────────────┤
503              │XS  │ Suboptimal alignment score                       │
504              │XF  │ Support from forward/reverse alignment           │
505              │XE  │ Number of supporting seeds                       │
506              └────┴──────────────────────────────────────────────────┘
507
508       Note that XO and XG are generated by BWT search while the CIGAR  string
509       by  Smith-Waterman  alignment.  These two tags may be inconsistent with
510       the CIGAR string. This is not a bug.
511
512

NOTES ON SHORT-READ ALIGNMENT

514   Alignment Accuracy
515       When seeding is disabled, BWA guarantees to find an alignment  contain‐
516       ing  maximum  maxDiff  differences including maxGapO gap opens which do
517       not occur within nIndelEnd bp towards either end of the  query.  Longer
518       gaps  may  be found if maxGapE is positive, but it is not guaranteed to
519       find all hits. When seeding is enabled, BWA further requires  that  the
520       first  seedLen  subsequence  contains  no more than maxSeedDiff differ‐
521       ences.
522
523       When gapped alignment is disabled, BWA is expected to generate the same
524       alignment  as Eland version 1, the Illumina alignment program. However,
525       as BWA change `N' in the database sequence to random nucleotides,  hits
526       to  these  random sequences will also be counted. As a consequence, BWA
527       may mark a unique hit as a repeat, if the random sequences happen to be
528       identical to the sequences which should be unqiue in the database.
529
530       By  default,  if  the  best hit is not highly repetitive (controlled by
531       -R), BWA also finds all hits contains one more mismatch; otherwise, BWA
532       finds  all  equally  best  hits only. Base quality is NOT considered in
533       evaluating hits. In the paired-end mode, BWA pairs all hits  it  found.
534       It further performs Smith-Waterman alignment for unmapped reads to res‐
535       cue reads with a high erro rate, and for high-quality  anomalous  pairs
536       to fix potential alignment errors.
537
538
539   Estimating Insert Size Distribution
540       BWA  estimates the insert size distribution per 256*1024 read pairs. It
541       first collects pairs of reads with both ends mapped with  a  single-end
542       quality  20 or higher and then calculates median (Q2), lower and higher
543       quartile (Q1 and Q3). It estimates the mean and  the  variance  of  the
544       insert  size  distribution  from  pairs  whose  insert sizes are within
545       interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a  pair
546       considered  to be properly paired (SAM flag 0x2) is calculated by solv‐
547       ing equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean,  sigma  is
548       the  standard error of the insert size distribution, L is the length of
549       the genome, p0 is prior of anomalous pair and  Phi()  is  the  standard
550       cumulative  distribution  function.  For  mapping Illumina short-insert
551       reads to the human genome, x is about 6-7 sigma  away  from  the  mean.
552       Quartiles,  mean,  variance and x will be printed to the standard error
553       output.
554
555
556   Memory Requirement
557       With bwtsw algorithm, 5GB memory is required for indexing the  complete
558       human  genome  sequences.  For short reads, the aln command uses ~3.2GB
559       memory and the sampe command uses ~5.4GB.
560
561
562   Speed
563       Indexing the human genome sequences takes 3 hours with bwtsw algorithm.
564       Indexing  smaller  genomes  with  IS algorithms is faster, but requires
565       more memory.
566
567       The speed of alignment is largely determined by the error rate  of  the
568       query  sequences  (r).  Firstly,  BWA runs much faster for near perfect
569       hits than for hits with many differences, and it stops searching for  a
570       hit with l+2 differences if a l-difference hit is found. This means BWA
571       will be very slow if r is high because in this case BWA  has  to  visit
572       hits  with  many  differences  and looking for these hits is expensive.
573       Secondly, the alignment algorithm behind makes the speed  sensitive  to
574       [k log(N)/m], where k is the maximum allowed differences, N the size of
575       database and m the length of a query. In practice, we choose k w.r.t. r
576       and therefore r is the leading factor. I would not recommend to use BWA
577       on data with r>0.02.
578
579       Pairing is slower for shorter reads. This  is  mainly  because  shorter
580       reads  have more spurious hits and converting SA coordinates to chromo‐
581       somal coordinates are very costly.
582
583

CHANGES IN BWA-0.6

585       Since version 0.6, BWA has been able to work with  a  reference  genome
586       longer  than 4GB.  This feature makes it possible to integrate the for‐
587       ward and reverse complemented genome in one FM-index, which  speeds  up
588       both  BWA-short and BWA-SW. As a tradeoff, BWA uses more memory because
589       it has to keep all positions and ranks in 64-bit integers, twice larger
590       than 32-bit integers used in the previous versions.
591
592       The latest BWA-SW also works for paired-end reads longer than 100bp. In
593       comparison to BWA-short, BWA-SW tends to be more  accurate  for  highly
594       unique  reads  and  more  robust to relative long INDELs and structural
595       variants.  Nonetheless, BWA-short usually has higher power  to  distin‐
596       guish the optimal hit from many suboptimal hits. The choice of the map‐
597       ping algorithm may depend on the application.
598
599

AUTHOR

606       Heng  Li  at  the Sanger Institute wrote the key source codes and inte‐
607       grated   the   following   codes   for    BWT    construction:    bwtsw
608       <http://i.cs.hku.hk/~ckwong3/bwtsw/>,  implemented by Chi-Kwong Wong at
609       the       University       of       Hong       Kong       and        IS
610       <http://yuta.256.googlepages.com/sais>  originally  proposed by Nong Ge
611       <http://www.cs.sysu.edu.cn/nong/> at the  Sun  Yat-Sen  University  and
612       implemented by Yuta Mori.
613
614

LICENSE AND CITATION

616       The full BWA package is distributed under GPLv3 as it uses source codes
617       from BWT-SW which is covered by GPL. Sorting, hash table,  BWT  and  IS
618       libraries are distributed under the MIT license.
619
620       If  you  use  the  BWA-backtrack  algorithm,  please cite the following
621       paper:
622
623       Li H. and Durbin R. (2009) Fast and accurate short read alignment  with
624       Burrows-Wheeler   transform.   Bioinformatics,  25,  1754-1760.  [PMID:
625       19451168]
626
627       If you use the BWA-SW algorithm, please cite:
628
629       Li H. and Durbin R. (2010) Fast and accurate long-read  alignment  with
630       Burrows-Wheeler   transform.   Bioinformatics,   26,   589-595.  [PMID:
631       20080505]
632
633       If you use BWA-MEM or the fastmap component of BWA, please cite:
634
635       Li H. (2013) Aligning sequence reads, clone sequences and assembly con‐
636       tigs with BWA-MEM. arXiv:1303.3997v1 [q-bio.GN].
637
638       It  is  likely  that  the BWA-MEM manuscript will not appear in a peer-
639       reviewed journal.
640
641

HISTORY

643       BWA is largely influenced by BWT-SW. It uses source codes  from  BWT-SW
644       and  mimics its binary file formats; BWA-SW resembles BWT-SW in several
645       ways. The initial idea about BWT-based alignment  also  came  from  the
646       group  who  developed BWT-SW. At the same time, BWA is different enough
647       from BWT-SW. The short-read alignment algorithm bears no similarity  to
648       Smith-Waterman  algorithm any more. While BWA-SW learns from BWT-SW, it
649       introduces heuristics that can hardly be applied to the original  algo‐
650       rithm.  In  all,  BWA does not guarantee to find all local hits as what
651       BWT-SW is designed to do, but it is much faster  than  BWT-SW  on  both
652       short and long query sequences.
653
654       I  started to write the first piece of codes on 24 May 2008 and got the
655       initial stable version on 02 June  2008.  During  this  period,  I  was
656       acquainted  that  Professor  Tak-Wah  Lam,  the  first author of BWT-SW
657       paper, was collaborating with Beijing Genomics Institute on SOAP2,  the
658       successor  to  SOAP (Short Oligonucleotide Analysis Package). SOAP2 has
659       come out in November 2008. According to the SourceForge download  page,
660       the  third  BWT-based short read aligner, bowtie, was first released in
661       August 2008. At the time of writing this manual, at  least  three  more
662       BWT-based short-read aligners are being implemented.
663
664       The BWA-SW algorithm is a new component of BWA. It was conceived in No‐
665       vember 2008 and implemented ten months later.
666
667       The BWA-MEM algorithm is based on an  algorithm  finding  super-maximal
668       exact  matches (SMEMs), which was first published with the fermi assem‐
669       bler paper in 2012. I first implemented the basic SMEM algorithm in the
670       fastmap command for an experiment and then extended the basic algorithm
671       and added the extension part in Feburary 2013 to make BWA-MEM  a  fully
672       featured mapper.
673
674
675
676
677bwa-0.7.15-r1140                  31 May 2016                           bwa(1)