samtools(1)

1samtools(1)                  Bioinformatics tools                  samtools(1)
2
3
4

NAME

6       samtools - Utilities for the Sequence Alignment/Map (SAM) format
7
8       bcftools - Utilities for the Binary Call Format (BCF) and VCF
9

SYNOPSIS

11       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
12
13       samtools sort aln.bam aln.sorted
14
15       samtools index aln.sorted.bam
16
17       samtools idxstats aln.sorted.bam
18
19       samtools view aln.sorted.bam chr2:20,100,000-20,200,000
20
21       samtools merge out.bam in1.bam in2.bam in3.bam
22
23       samtools faidx ref.fasta
24
25       samtools pileup -vcf ref.fasta aln.sorted.bam
26
27       samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
28
29       samtools tview aln.sorted.bam ref.fasta
30
31       bcftools index in.bcf
32
33       bcftools view in.bcf chr2:100-200 > out.vcf
34
35       bcftools view -vc in.bcf > out.vcf 2> out.afs
36
37

DESCRIPTION

39       Samtools  is  a  set of utilities that manipulate alignments in the BAM
40       format. It imports from and exports to the SAM (Sequence Alignment/Map)
41       format,  does  sorting,  merging  and  indexing, and allows to retrieve
42       reads in any regions swiftly.
43
44       Samtools is designed to work on a stream. It regards an input file  `-'
45       as  the  standard  input (stdin) and an output file `-' as the standard
46       output (stdout). Several commands can thus be combined with Unix pipes.
47       Samtools always output warning and error messages to the standard error
48       output (stderr).
49
50       Samtools is also able to open a BAM (not SAM) file on a remote  FTP  or
51       HTTP  server  if  the  BAM file name starts with `ftp://' or `http://'.
52       Samtools checks the current working directory for the  index  file  and
53       will  download  the  index upon absence. Samtools does not retrieve the
54       entire alignment file unless it is asked to do so.
55
56

SAMTOOLS COMMANDS AND OPTIONS

58       view      samtools view  [-bchuHS]  [-t  in.refList]  [-o  output]  [-f
59                 reqFlag]  [-F  skipFlag]  [-q minMapQ] [-l library] [-r read‐
60                 Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
61
62                 Extract/print all or sub alignments in SAM or BAM format.  If
63                 no  region  is specified, all the alignments will be printed;
64                 otherwise only alignments overlapping the  specified  regions
65                 will  be  output. An alignment may be given multiple times if
66                 it is overlapping several regions. A region can be presented,
67                 for  example,  in  the  following  format:  `chr2' (the whole
68                 chr2), `chr2:1000000' (region starting from  1,000,000bp)  or
69                 `chr2:1,000,000-2,000,000'   (region  between  1,000,000  and
70                 2,000,000bp including the  end  points).  The  coordinate  is
71                 1-based.
72
73                 OPTIONS:
74
75                 -b      Output in the BAM format.
76
77                 -f INT  Only  output  alignments with all bits in INT present
78                         in the FLAG field. INT can be in hex in the format of
79                         /^0x[0-9A-F]+/ [0]
80
81                 -F INT  Skip alignments with bits present in INT [0]
82
83                 -h      Include the header in the output.
84
85                 -H      Output the header only.
86
87                 -l STR  Only output reads in library STR [null]
88
89                 -o FILE Output file [stdout]
90
91                 -q INT  Skip alignments with MAPQ smaller than INT [0]
92
93                 -r STR  Only output reads in read group STR [null]
94
95                 -R FILE Output reads in read groups listed in FILE [null]
96
97                 -S      Input  is in SAM. If @SQ header lines are absent, the
98                         `-t' option is required.
99
100                 -c      Instead of printing the alignments, only  count  them
101                         and  print the total number. All filter options, such
102                         as `-f', `-F' and `-q' , are taken into account.
103
104                 -t FILE This file is TAB-delimited. Each  line  must  contain
105                         the  reference  name and the length of the reference,
106                         one line  for  each  distinct  reference;  additional
107                         fields  are ignored. This file also defines the order
108                         of the reference sequences in  sorting.  If  you  run
109                         `samtools  faidx  <ref.fa>', the resultant index file
110                         <ref.fa>.fai can be used as this <in.ref_list> file.
111
112                 -u      Output uncompressed BAM. This option saves time spent
113                         on  compression/decomprssion  and  is  thus preferred
114                         when the output is piped to another samtools command.
115
116
117       tview     samtools tview <in.sorted.bam> [ref.fasta]
118
119                 Text alignment viewer (based on the ncurses library). In  the
120                 viewer,  press `?' for help and press `g' to check the align‐
121                 ment   start   from   a   region   in   the    format    like
122                 `chr10:10,000,000'  or  `=10,000,000'  when  viewing the same
123                 reference sequence.
124
125
126       mpileup   samtools mpileup [-EBug] [-C capQcoef] [-r  reg]  [-f  in.fa]
127                 [-l  list]  [-M  capMapQ]  [-Q  minBaseQ] [-q minMapQ] in.bam
128                 [in2.bam [...]]
129
130                 Generate BCF or pileup for one or multiple BAM files.  Align‐
131                 ment  records are grouped by sample identifiers in @RG header
132                 lines. If sample identifiers are absent, each input  file  is
133                 regarded as one sample.
134
135                 In the pileup format (without -uor-g), each line represents a
136                 genomic position, consisting of chromosome name,  coordinate,
137                 reference base, read bases, read qualities and alignment map‐
138                 ping  qualities.  Information  on  match,  mismatch,   indel,
139                 strand,  mapping  quality and start and end of a read are all
140                 encoded at the read base column. At this column, a dot stands
141                 for  a  match  to the reference base on the forward strand, a
142                 comma for a match on the reverse strand, a '>' or '<'  for  a
143                 reference  skip, `ACGTN' for a mismatch on the forward strand
144                 and `acgtn' for a mismatch on the reverse strand.  A  pattern
145                 `\+[0-9]+[ACGTNacgtn]+'   indicates  there  is  an  insertion
146                 between this reference position and the next reference  posi‐
147                 tion.  The length of the insertion is given by the integer in
148                 the pattern, followed by the inserted sequence. Similarly,  a
149                 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
150                 reference. The deleted bases will be presented as `*' in  the
151                 following  lines.  Also at the read base column, a symbol `^'
152                 marks the start of a read. The ASCII of the character follow‐
153                 ing  `^'  minus  33  gives  the mapping quality. A symbol `$'
154                 marks the end of a read segment.
155
156                 Input Options:
157
158                 -6        Assume the quality is in the Illumina  1.3+  encod‐
159                           ing.   -A Do not skip anomalous read pairs in vari‐
160                           ant calling.
161
162                 -B        Disable probabilistic realignment for the  computa‐
163                           tion  of  base  alignment quality (BAQ). BAQ is the
164                           Phred-scaled probability of a read base being  mis‐
165                           aligned.  Applying  this  option  greatly  helps to
166                           reduce false SNPs caused by misalignments.
167
168                 -b FILE   List of input BAM files, one file per line [null]
169
170                 -C INT    Coefficient for  downgrading  mapping  quality  for
171                           reads containing excessive mismatches. Given a read
172                           with a phred-scaled probability q of  being  gener‐
173                           ated  from  the  mapped  position,  the new mapping
174                           quality  is  about  sqrt((INT-q)/INT)*INT.  A  zero
175                           value  disables this functionality; if enabled, the
176                           recommended value for BWA is 50. [0]
177
178                 -d INT    At a position, read maximally INT reads  per  input
179                           BAM. [250]
180
181                 -E        Extended  BAQ computation. This option helps sensi‐
182                           tivity especially for MNPs,  but  may  hurt  speci‐
183                           ficity a little bit.
184
185                 -f FILE   The  faidx-indexed reference file in the FASTA for‐
186                           mat. The  file  can  be  optionally  compressed  by
187                           razip.  [null]
188
189                 -l FILE   BED  or  position  list  file  containing a list of
190                           regions or sites where pileup or BCF should be gen‐
191                           erated [null]
192
193                 -q INT    Minimum mapping quality for an alignment to be used
194                           [0]
195
196                 -Q INT    Minimum base quality for a base  to  be  considered
197                           [13]
198
199                 -r STR    Only generate pileup in region STR [all sites]
200
201                 Output Options:
202
203
204                 -D        Output per-sample read depth
205
206                 -g        Compute genotype likelihoods and output them in the
207                           binary call format (BCF).
208
209                 -S        Output per-sample Phred-scaled strand bias P-value
210
211                 -u        Similar to -g except  that  the  output  is  uncom‐
212                           pressed BCF, which is preferred for piping.
213
214
215                 Options for Genotype Likelihood Computation (for -g or -u):
216
217
218                 -e INT    Phred-scaled  gap extension sequencing error proba‐
219                           bility. Reducing INT leads to longer indels. [20]
220
221                 -h INT    Coefficient for modeling homopolymer errors.  Given
222                           an  l-long homopolymer run, the sequencing error of
223                           an indel of size s is modeled as INT*s/l.  [100]
224
225                 -I        Do not perform INDEL calling
226
227                 -L INT    Skip INDEL calling if the average per-sample  depth
228                           is above INT.  [250]
229
230                 -o INT    Phred-scaled gap open sequencing error probability.
231                           Reducing INT leads to more indel calls. [40]
232
233                 -P STR    Comma dilimited list of  platforms  (determined  by
234                           @RG-PL)  from  which indel candidates are obtained.
235                           It is recommended to collect indel candidates  from
236                           sequencing  technologies  that have low indel error
237                           rate such as ILLUMINA. [all]
238
239
240       reheader  samtools reheader <in.header.sam> <in.bam>
241
242                 Replace  the  header   in   in.bam   with   the   header   in
243                 in.header.sam.   This  command  is much faster than replacing
244                 the header with a BAM->SAM->BAM conversion.
245
246
247       cat       samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam>
248                 [ ... ]
249
250                 Concatenate  BAMs.  The sequence dictionary of each input BAM
251                 must be identical, although this command does not check this.
252                 This  command  uses a similar trick to reheader which enables
253                 fast BAM concatenation.
254
255
256       sort      samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
257
258                 Sort  alignments  by  leftmost  coordinates.  File  <out.pre‐
259                 fix>.bam will be created. This command may also create tempo‐
260                 rary files <out.prefix>.%d.bam when the whole alignment  can‐
261                 not be fitted into memory (controlled by option -m).
262
263                 OPTIONS:
264
265                 -o      Output the final alignment to the standard output.
266
267                 -n      Sort by read names rather than by chromosomal coordi‐
268                         nates
269
270                 -m INT  Approximately   the    maximum    required    memory.
271                         [500000000]
272
273
274       merge     samtools  merge  [-nur1f]  [-h  inh.sam]  [-R  reg] <out.bam>
275                 <in1.bam> <in2.bam> [...]
276
277                 Merge multiple sorted alignments.  The header reference lists
278                 of  all  the input BAM files, and the @SQ headers of inh.sam,
279                 if  any,  must  all  refer  to  the  same  set  of  reference
280                 sequences.   The header reference list and (unless overridden
281                 by -h) `@' headers of in1.bam will be copied to out.bam,  and
282                 the headers of other files will be ignored.
283
284                 OPTIONS:
285
286                 -1      Use zlib compression level 1 to comrpess the output
287
288                 -f      Force to overwrite the output file if present.
289
290                 -h FILE Use  the lines of FILE as `@' headers to be copied to
291                         out.bam, replacing any header lines that would other‐
292                         wise  be  copied  from in1.bam.  (FILE is actually in
293                         SAM format, though any alignment records it may  con‐
294                         tain are ignored.)
295
296                 -n      The  input alignments are sorted by read names rather
297                         than by chromosomal coordinates
298
299                 -R STR  Merge files in the specified region indicated by  STR
300                         [null]
301
302                 -r      Attach  an RG tag to each alignment. The tag value is
303                         inferred from file names.
304
305                 -u      Uncompressed BAM output
306
307
308       index     samtools index <aln.bam>
309
310                 Index sorted alignment for fast  random  access.  Index  file
311                 <aln.bam>.bai will be created.
312
313
314       idxstats  samtools idxstats <aln.bam>
315
316                 Retrieve and print stats in the index file. The output is TAB
317                 delimited with each line  consisting  of  reference  sequence
318                 name, sequence length, # mapped reads and # unmapped reads.
319
320
321       faidx     samtools faidx <ref.fasta> [region1 [...]]
322
323                 Index  reference sequence in the FASTA format or extract sub‐
324                 sequence from indexed reference sequence.  If  no  region  is
325                 specified,   faidx   will   index   the   file   and   create
326                 <ref.fasta>.fai on the disk. If regions are speficified,  the
327                 subsequences  will  be retrieved and printed to stdout in the
328                 FASTA format. The input file can be compressed  in  the  RAZF
329                 format.
330
331
332       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>
333
334                 Fill in mate coordinates, ISIZE and mate related flags from a
335                 name-sorted alignment.
336
337
338       rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>
339
340                 Remove potential PCR duplicates: if multiple read pairs  have
341                 identical  external  coordinates,  only  retain the pair with
342                 highest mapping quality.  In the paired-end mode,  this  com‐
343                 mand  ONLY  works  with  FR orientation and requires ISIZE is
344                 correctly set. It does not work for unpaired reads (e.g.  two
345                 ends mapped to different chromosomes or orphan reads).
346
347                 OPTIONS:
348
349                 -s      Remove  duplicate  for  single-end reads. By default,
350                         the command works for paired-end reads only.
351
352                 -S      Treat paired-end reads and single-end reads.
353
354
355       calmd     samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>
356
357                 Generate the MD tag. If the MD tag is already  present,  this
358                 command  will  give a warning if the MD tag generated is dif‐
359                 ferent from the existing tag. Output SAM by default.
360
361                 OPTIONS:
362
363                 -A      When used jointly with -r this option overwrites  the
364                         original base quality.
365
366                 -e      Convert  a  the  read base to = if it is identical to
367                         the aligned reference base.  Indel  caller  does  not
368                         support the = bases at the moment.
369
370                 -u      Output uncompressed BAM
371
372                 -b      Output compressed BAM
373
374                 -S      The input is SAM with header lines
375
376                 -C INT  Coefficient  to  cap mapping quality of poorly mapped
377                         reads. See the pileup command for details. [0]
378
379                 -r      Compute the BQ tag (without -A) or cap  base  quality
380                         by BAQ (with -A).
381
382                 -E      Extended  BAQ  calculation. This option trades speci‐
383                         ficity for sensitivity, though the effect is minor.
384
385
386       targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0]  [-1
387                 em1] [-2 em2] [-f ref] <in.bam>
388
389                 This  command identifies target regions by examining the con‐
390                 tinuity of read depth, computes haploid  consensus  sequences
391                 of targets and outputs a SAM with each sequence corresponding
392                 to a target. When option -f is in use, BAQ will  be  applied.
393                 This  command is only designed for cutting fosmid clones from
394                 fosmid pool sequencing [Ref. Kitzman et al. (2010)].
395
396
397       phase     samtools phase [-AF] [-k len] [-b  prefix]  [-q  minLOD]  [-Q
398                 minBaseQ] <in.bam>
399
400                 Call and phase heterozygous SNPs.  OPTIONS:
401
402                 -A      Drop reads with ambiguous phase.
403
404                 -b STR  Prefix  of  BAM  output.  When this option is in use,
405                         phase-0 reads will be saved  in  file  STR.0.bam  and
406                         phase-1 reads in STR.1.bam.  Phase unknown reads will
407                         be randomly  allocated  to  one  of  the  two  files.
408                         Chimeric  reads  with  switch errors will be saved in
409                         STR.chimeric.bam.  [null]
410
411                 -F      Do not attempt to fix chimeric reads.
412
413                 -k INT  Maximum length for local phasing. [13]
414
415                 -q INT  Minimum Phred-scaled LOD to call a heterozygote. [40]
416
417                 -Q INT  Minimum base quality to be used in het calling. [13]
418
419

BCFTOOLS COMMANDS AND OPTIONS

421       view      bcftools view [-AbFGNQSucgv] [-D seqDict] [-l  listLoci]  [-s
422                 listSample]  [-i  gapSNPratio] [-t mutRate] [-p varThres] [-P
423                 prior] [-1 nGroup1] [-d minFrac] [-U  nPerm]  [-X  permThres]
424                 [-T trioType] in.bcf [region]
425
426                 Convert  between  BCF  and  VCF,  call variant candidates and
427                 estimate allele frequencies.
428
429
430                 Input/Output Options:
431
432                 -A        Retain all possible alternate  alleles  at  variant
433                           sites.   By  default,  the  view  command  discards
434                           unlikely alleles.
435
436                 -b        Output in the BCF format. The default is VCF.
437
438                 -D FILE   Sequence dictionary (list of chromosome names)  for
439                           VCF->BCF conversion [null]
440
441                 -F        Indicate  PL is generated by r921 or before (order‐
442                           ing is different).
443
444                 -G        Suppress all individual genotype information.
445
446                 -l FILE   List of sites at which  information  are  outputted
447                           [all sites]
448
449                 -N        Skip sites where the REF field is not A/C/G/T
450
451                 -Q        Output the QCALL likelihood format
452
453                 -s FILE   List  of  samples  to  use. The first column in the
454                           input gives the sample names and the  second  gives
455                           the  ploidy, which can only be 1 or 2. When the 2nd
456                           column is absent, the sample ploidy is  assumed  to
457                           be  2.  In the output, the ordering of samples will
458                           be identical to the one in FILE.  [null]
459
460                 -S        The input is VCF instead of BCF.
461
462                 -u        Uncompressed BCF output (force -b).
463
464                 Consensus/Variant Calling Options:
465
466                 -c        Call variants using Bayesian inference. This option
467                           automatically invokes option -e.
468
469                 -d FLOAT  When  -v is in use, skip loci where the fraction of
470                           samples covered by reads is below FLOAT. [0]
471
472                 -e        Perform max-likelihood  inference  only,  including
473                           estimating   the  site  allele  frequency,  testing
474                           Hardy-Weinberg equlibrium and testing  associations
475                           with LRT.
476
477                 -g        Call  per-sample  genotypes at variant sites (force
478                           -c)
479
480                 -i FLOAT  Ratio of INDEL-to-SNP mutation rate [0.15]
481
482                 -p FLOAT  A  site  is  considered  to   be   a   variant   if
483                           P(ref|D)<FLOAT [0.5]
484
485                 -P STR    Prior  or initial allele frequency spectrum. If STR
486                           can be full, cond2, flat or the file consisting  of
487                           error output from a previous variant calling run.
488
489                 -t FLOAT  Scaled muttion rate for variant calling [0.001]
490
491                 -T STR    Enable  pair/trio calling. For trio calling, option
492                           -s is usually needed to be applied to configure the
493                           trio  members and their ordering.  In the file sup‐
494                           plied to the option -s, the first  sample  must  be
495                           the  child, the second the father and the third the
496                           mother.   The  valid  values  of  STR  are  `pair',
497                           `trioauto',  `trioxd'  and  `trioxs',  where `pair'
498                           calls differences between two  input  samples,  and
499                           `trioxd'  (`trioxs')  specifies  that  the input is
500                           from the X chromosome non-PAR regions and the child
501                           is a female (male). [null]
502
503                 -v        Output variant sites only (force -c)
504
505                 Contrast Calling and Association Test Options:
506
507                 -1 INT    Number  of group-1 samples. This option is used for
508                           dividing the samples into two groups  for  contrast
509                           SNP  calling or association test.  When this option
510                           is in use, the following  VCF  INFO  will  be  out‐
511                           putted: PC2, PCHI2 and QCHI2. [0]
512
513                 -U INT    Number of permutations for association test (effec‐
514                           tive only with -1) [0]
515
516                 -X FLOAT  Only  perform   permutations   for   P(chi^2)<FLOAT
517                           (effective only with -U) [0.01]
518
519
520       index     bcftools index in.bcf
521
522                 Index sorted BCF for random access.
523
524
525       cat       bcftools cat in1.bcf [in2.bcf [...]]]
526
527                 Concatenate  BCF  files.  The  input files are required to be
528                 sorted and have  identical  samples  appearing  in  the  same
529                 order.
530

SAM FORMAT

532       Sequence  Alignment/Map  (SAM)  format is TAB-delimited. Apart from the
533       header lines, which are started with the  `@'  symbol,  each  alignment
534       line consists of:
535
536
537       ┌────┬───────┬──────────────────────────────────────────────────────────┐
538       │Col │ Field │                       Description                        │
539       ├────┼───────┼──────────────────────────────────────────────────────────┤
540       │ 1  │ QNAME │ Query template/pair NAME                                 │
541       │ 2  │ FLAG  │ bitwise FLAG                                             │
542       │ 3  │ RNAME │ Reference sequence NAME                                  │
543       │ 4  │ POS   │ 1-based leftmost POSition/coordinate of clipped sequence │
544       │ 5  │ MAPQ  │ MAPping Quality (Phred-scaled)                           │
545       │ 6  │ CIAGR │ extended CIGAR string                                    │
546       │ 7  │ MRNM  │ Mate Reference sequence NaMe (`=' if same as RNAME)      │
547       │ 8  │ MPOS  │ 1-based Mate POSistion                                   │
548       │ 9  │ TLEN  │ inferred Template LENgth (insert size)                   │
549       │10  │ SEQ   │ query SEQuence on the same strand as the reference       │
550       │11  │ QUAL  │ query QUALity (ASCII-33 gives the Phred base quality)    │
551       │12+ │ OPT   │ variable OPTional fields in the format TAG:VTYPE:VALUE   │
552       └────┴───────┴──────────────────────────────────────────────────────────┘
553
554       Each bit in the FLAG field is defined as:
555
556
557          ┌───────┬─────┬──────────────────────────────────────────────────┐
558          │ Flag  │ Chr │                   Description                    │
559          ├───────┼─────┼──────────────────────────────────────────────────┤
560          │0x0001 │  p  │ the read is paired in sequencing                 │
561          │0x0002 │  P  │ the read is mapped in a proper pair              │
562          │0x0004 │  u  │ the query sequence itself is unmapped            │
563          │0x0008 │  U  │ the mate is unmapped                             │
564          │0x0010 │  r  │ strand of the query (1 for reverse)              │
565          │0x0020 │  R  │ strand of the mate                               │
566          │0x0040 │  1  │ the read is the first read in a pair             │
567          │0x0080 │  2  │ the read is the second read in a pair            │
568          │0x0100 │  s  │ the alignment is not primary                     │
569          │0x0200 │  f  │ the read fails platform/vendor quality checks    │
570          │0x0400 │  d  │ the read is either a PCR or an optical duplicate │
571          └───────┴─────┴──────────────────────────────────────────────────┘
572       where  the  second  column  gives the string representation of the FLAG
573       field.
574
575

VCF FORMAT

577       The Variant Call Format (VCF) is a TAB-delimited format with each  data
578       line consists of the following fields:
579
580    ┌────┬────────┬──────────────────────────────────────────────────────────────┐
581    │Col │ Field  │                         Description                          │
582    ├────┼────────┼──────────────────────────────────────────────────────────────┤
583    │ 1  │ CHROM  │ CHROMosome name                                              │
584    │ 2  │ POS    │ the left-most POSition of the variant                        │
585    │ 3  │ ID     │ unique variant IDentifier                                    │
586    │ 4  │ REF    │ the REFerence allele                                         │
587    │ 5  │ ALT    │ the ALTernate allele(s), separated by comma                  │
588    │ 6  │ QUAL   │ variant/reference QUALity                                    │
589    │ 7  │ FILTER │ FILTers applied                                              │
590    │ 8  │ INFO   │ INFOrmation related to the variant, separated by semi-colon  │
591    │ 9  │ FORMAT │ FORMAT of the genotype fields, separated by colon (optional) │
592    │10+ │ SAMPLE │ SAMPLE genotypes and per-sample information (optional)       │
593    └────┴────────┴──────────────────────────────────────────────────────────────┘
594
595       The following table gives the INFO tags used by samtools and bcftools.
596
597
598┌──────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────┐
599│ Tag  │  Format   │                                     Description                                      │
600├──────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────┤
601│AF1   │ double    │ Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele    │
602│DP    │ int       │ Raw read depth (without quality filtering)                                           │
603│DP4   │ int[4]    │ # high-quality reference forward bases, ref reverse, alternate for and alt rev bases │
604│FQ    │ int       │ Consensus quality. Positive: sample genotypes different; negative: otherwise         │
605│MQ    │ int       │ Root-Mean-Square mapping quality of covering reads                                   │
606│PC2   │ int[2]    │ Phred probability of AF in group1 samples being larger (,smaller) than in group2     │
607│PCHI2 │ double    │ Posterior weighted chi^2 P-value between group1 and group2 samples                   │
608│PV4   │ double[4] │ P-value for strand bias, baseQ bias, mapQ bias and tail distance bias                │
609│QCHI2 │ int       │ Phred-scaled PCHI2                                                                   │
610│RP    │ int       │ # permutations yielding a smaller PCHI2                                              │
611│CLR   │ int       │ Phred log ratio of genotype likelihoods with and without the trio/pair constraint    │
612│UGT   │ string    │ Most probable genotype configuration without the trio constraint                     │
613│CGT   │ string    │ Most probable configuration with the trio constraint                                 │
614└──────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────┘
615

EXAMPLES

617       o Import SAM to BAM when @SQ lines are present in the header:
618
619           samtools view -bS aln.sam > aln.bam
620
621         If @SQ lines are absent:
622
623           samtools faidx ref.fa
624           samtools view -bt ref.fa.fai aln.sam > aln.bam
625
626         where ref.fa.fai is generated automatically by the faidx command.
627
628
629       o Attach the RG tag while merging sorted alignments:
630
631           perl       -e       'print      "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu‐
632         mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
633           samtools merge -rh rg.txt merged.bam ga.bam 454.bam
634
635         The value in a RG tag is determined by the file name the read is com‐
636         ing  from. In this example, in the merged.bam, reads from ga.bam will
637         be attached RG:Z:ga,  while  reads  from  454.bam  will  be  attached
638         RG:Z:454.
639
640
641       o Call SNPs and short INDELs for one diploid individual:
642
643           samtools  mpileup  -ugf  ref.fa  aln.bam  | bcftools view -bvcg - >
644         var.raw.bcf
645           bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  100   >
646         var.flt.vcf
647
648         The  -D  option  of  varFilter controls the maximum read depth, which
649         should be adjusted to about twice the average read  depth.   One  may
650         consider  to  add -C50 to mpileup if mapping quality is overestimated
651         for reads containing excessive mismatches. Applying this option  usu‐
652         ally helps BWA-short but may not other mappers.
653
654
655       o Generate the consensus sequence for one diploid individual:
656
657           samtools  mpileup  -uf ref.fa aln.bam | bcftools view -cg - | vcfu‐
658         tils.pl vcf2fq > cns.fq
659
660
661       o Call somatic mutations from a pair of samples:
662
663           samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair -
664         > var.bcf
665
666         In  the  output INFO field, CLR gives the Phred-log ratio between the
667         likelihood by treating the two samples independently, and the likeli‐
668         hood  by  requiring the genotype to be identical.  This CLR is effec‐
669         tively a score measuring the confidence of somatic calls. The  higher
670         the better.
671
672
673       o Call de novo and somatic mutations from a family trio:
674
675           samtools  mpileup  -DSuf ref.fa aln.bam | bcftools view -bvcgT pair
676         -s samples.txt - > var.bcf
677
678         File samples.txt should consist of three lines specifying the  member
679         and  order  of  samples (in the order of child-father-mother).  Simi‐
680         larly, CLR gives the Phred-log likelihood ratio with and without  the
681         trio  constraint.   UGT  shows the most likely genotype configuration
682         without the trio constraint, and CGT gives the most  likely  genotype
683         configuration satisfying the trio constraint.
684
685
686       o Phase one individual:
687
688           samtools  calmd -AEur aln.bam ref.fa | samtools phase -b prefix - >
689         phase.out
690
691         The calmd command  is  used  to  reduce  false  heterozygotes  around
692         INDELs.
693
694
695       o Call SNPs and short indels for multiple diploid individuals:
696
697           samtools  mpileup  -P  ILLUMINA  -ugf  ref.fa *.bam | bcftools view
698         -bcvg - > var.raw.bcf
699           bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  2000  >
700         var.flt.vcf
701
702         Individuals  are identified from the SM tags in the @RG header lines.
703         Individuals can be pooled in one alignment file; one  individual  can
704         also  be  separated into multiple files. The -P option specifies that
705         indel candidates should be collected only from read groups  with  the
706         @RG-PL  tag  set to ILLUMINA.  Collecting indel candidates from reads
707         sequenced by an indel-prone technology may affect the performance  of
708         indel calling.
709
710
711       o Derive  the  allele  frequency spectrum (AFS) on a list of sites from
712         multiple individuals:
713
714           samtools mpileup -Igf ref.fa *.bam > all.bcf
715           bcftools view -bl sites.list all.bcf > sites.bcf
716           bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
717           bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
718           bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
719           ......
720
721         where sites.list contains the list of sites with each line consisting
722         of  the  reference sequence name and position. The following bcftools
723         commands estimate AFS by EM.
724
725
726       o Dump BAQ applied alignment for other SNP callers:
727
728           samtools calmd -bAr aln.bam > aln.baq.bam
729
730         It adds and corrects the NM and MD tags at the same time.  The  calmd
731         command  also comes with the -C option, the same as the one in pileup
732         and mpileup.  Apply if it helps.
733
734

LIMITATIONS

736       o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
737         bam_aux.c.
738
739       o Samtools  paired-end  rmdup  does  not  work for unpaired reads (e.g.
740         orphan reads or ends mapped to different chromosomes). If this  is  a
741         concern,  please  use  Picard's MarkDuplicate which correctly handles
742         these cases, although a little slower.
743
744

AUTHOR

746       Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
747       Handsaker from the Broad Institute implemented the BGZF library and Jue
748       Ruan from Beijing Genomics Institute wrote the RAZF library. John  Mar‐
749       shall and Petr Danecek contribute to the source code and various people
750       from the 1000 Genomes Project have contributed to the SAM format speci‐
751       fication.
752
753