samtools(1)

1samtools(1)                  Bioinformatics tools                  samtools(1)
2
3
4

NAME

6       samtools - Utilities for the Sequence Alignment/Map (SAM) format
7
8       bcftools - Utilities for the Binary Call Format (BCF) and VCF
9

SYNOPSIS

11       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
12
13       samtools sort aln.bam aln.sorted
14
15       samtools index aln.sorted.bam
16
17       samtools idxstats aln.sorted.bam
18
19       samtools view aln.sorted.bam chr2:20,100,000-20,200,000
20
21       samtools merge out.bam in1.bam in2.bam in3.bam
22
23       samtools faidx ref.fasta
24
25       samtools pileup -vcf ref.fasta aln.sorted.bam
26
27       samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
28
29       samtools tview aln.sorted.bam ref.fasta
30
31       bcftools index in.bcf
32
33       bcftools view in.bcf chr2:100-200 > out.vcf
34
35       bcftools view -Nvm0.99 in.bcf > out.vcf 2> out.afs
36
37

DESCRIPTION

39       Samtools  is  a  set of utilities that manipulate alignments in the BAM
40       format. It imports from and exports to the SAM (Sequence Alignment/Map)
41       format,  does  sorting,  merging  and  indexing, and allows to retrieve
42       reads in any regions swiftly.
43
44       Samtools is designed to work on a stream. It regards an input file  `-'
45       as  the  standard  input (stdin) and an output file `-' as the standard
46       output (stdout). Several commands can thus be combined with Unix pipes.
47       Samtools always output warning and error messages to the standard error
48       output (stderr).
49
50       Samtools is also able to open a BAM (not SAM) file on a remote  FTP  or
51       HTTP  server  if  the  BAM file name starts with `ftp://' or `http://'.
52       Samtools checks the current working directory for the  index  file  and
53       will  download  the  index upon absence. Samtools does not retrieve the
54       entire alignment file unless it is asked to do so.
55
56

SAMTOOLS COMMANDS AND OPTIONS

58       view      samtools view  [-bchuHS]  [-t  in.refList]  [-o  output]  [-f
59                 reqFlag]  [-F  skipFlag]  [-q minMapQ] [-l library] [-r read‐
60                 Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
61
62                 Extract/print all or sub alignments in SAM or BAM format.  If
63                 no  region  is specified, all the alignments will be printed;
64                 otherwise only alignments overlapping the  specified  regions
65                 will  be  output. An alignment may be given multiple times if
66                 it is overlapping several regions. A region can be presented,
67                 for  example,  in  the  following  format:  `chr2' (the whole
68                 chr2), `chr2:1000000' (region starting from  1,000,000bp)  or
69                 `chr2:1,000,000-2,000,000'   (region  between  1,000,000  and
70                 2,000,000bp including the  end  points).  The  coordinate  is
71                 1-based.
72
73                 OPTIONS:
74
75                 -b        Output in the BAM format.
76
77                 -f INT    Only output alignments with all bits in INT present
78                           in the FLAG field. INT can be in hex in the  format
79                           of /^0x[0-9A-F]+/ [0]
80
81                 -F INT    Skip alignments with bits present in INT [0]
82
83                 -h        Include the header in the output.
84
85                 -H        Output the header only.
86
87                 -l STR    Only output reads in library STR [null]
88
89                 -o FILE   Output file [stdout]
90
91                 -q INT    Skip alignments with MAPQ smaller than INT [0]
92
93                 -r STR    Only output reads in read group STR [null]
94
95                 -R FILE   Output reads in read groups listed in FILE [null]
96
97                 -s FLOAT  Fraction of templates/pairs to subsample; the inte‐
98                           ger part is treated as the seed for the random num‐
99                           ber generator [-1]
100
101                 -S        Input  is  in  SAM. If @SQ header lines are absent,
102                           the `-t' option is required.
103
104                 -c        Instead of printing the alignments, only count them
105                           and  print  the  total  number. All filter options,
106                           such as `-f',  `-F'  and  `-q'  ,  are  taken  into
107                           account.
108
109                 -t FILE   This  file is TAB-delimited. Each line must contain
110                           the reference name and the length of the reference,
111                           one  line  for  each distinct reference; additional
112                           fields are ignored.  This  file  also  defines  the
113                           order of the reference sequences in sorting. If you
114                           run `samtools faidx <ref.fa>', the resultant  index
115                           file <ref.fa>.fai can be used as this <in.ref_list>
116                           file.
117
118                 -u        Output uncompressed BAM.  This  option  saves  time
119                           spent  on compression/decomprssion and is thus pre‐
120                           ferred when the output is piped to another samtools
121                           command.
122
123
124       tview     samtools   tview   [-p   chr:pos]   [-s   STR]  [-d  display]
125                 <in.sorted.bam> [ref.fasta]
126
127                 Text alignment viewer (based on the ncurses library). In  the
128                 viewer,  press `?' for help and press `g' to check the align‐
129                 ment   start   from   a   region   in   the    format    like
130                 `chr10:10,000,000'  or  `=10,000,000'  when  viewing the same
131                 reference sequence.
132
133                 Options:
134
135                 -d display    Output as (H)tml or (C)urses or (T)ext
136
137                 -p chr:pos    Go directly to this position
138
139                 -s STR        Display only reads from  this  sample  or  read
140                               group
141
142
143       mpileup   samtools  mpileup  [-EBugp] [-C capQcoef] [-r reg] [-f in.fa]
144                 [-l list] [-M capMapQ]  [-Q  minBaseQ]  [-q  minMapQ]  in.bam
145                 [in2.bam [...]]
146
147                 Generate  BCF or pileup for one or multiple BAM files. Align‐
148                 ment records are grouped by sample identifiers in @RG  header
149                 lines.  If  sample identifiers are absent, each input file is
150                 regarded as one sample.
151
152                 In the pileup format (without -uor-g), each line represents a
153                 genomic  position, consisting of chromosome name, coordinate,
154                 reference base, read bases, read qualities and alignment map‐
155                 ping   qualities.  Information  on  match,  mismatch,  indel,
156                 strand, mapping quality and start and end of a read  are  all
157                 encoded at the read base column. At this column, a dot stands
158                 for a match to the reference base on the  forward  strand,  a
159                 comma  for  a match on the reverse strand, a '>' or '<' for a
160                 reference skip, `ACGTN' for a mismatch on the forward  strand
161                 and  `acgtn'  for a mismatch on the reverse strand. A pattern
162                 `\+[0-9]+[ACGTNacgtn]+'  indicates  there  is  an   insertion
163                 between  this reference position and the next reference posi‐
164                 tion. The length of the insertion is given by the integer  in
165                 the  pattern, followed by the inserted sequence. Similarly, a
166                 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
167                 reference.  The deleted bases will be presented as `*' in the
168                 following lines. Also at the read base column, a  symbol  `^'
169                 marks the start of a read. The ASCII of the character follow‐
170                 ing `^' minus 33 gives the  mapping  quality.  A  symbol  `$'
171                 marks the end of a read segment.
172
173                 Input Options:
174
175                 -6        Assume  the  quality is in the Illumina 1.3+ encod‐
176                           ing.  -A Do not skip anomalous read pairs in  vari‐
177                           ant calling.
178
179                 -B        Disable  probabilistic realignment for the computa‐
180                           tion of base alignment quality (BAQ).  BAQ  is  the
181                           Phred-scaled  probability of a read base being mis‐
182                           aligned. Applying  this  option  greatly  helps  to
183                           reduce false SNPs caused by misalignments.
184
185                 -b FILE   List of input BAM files, one file per line [null]
186
187                 -C INT    Coefficient  for  downgrading  mapping  quality for
188                           reads containing excessive mismatches. Given a read
189                           with  a  phred-scaled probability q of being gener‐
190                           ated from the  mapped  position,  the  new  mapping
191                           quality  is  about  sqrt((INT-q)/INT)*INT.  A  zero
192                           value disables this functionality; if enabled,  the
193                           recommended value for BWA is 50. [0]
194
195                 -d INT    At  a  position, read maximally INT reads per input
196                           BAM. [250]
197
198                 -E        Extended BAQ computation. This option helps  sensi‐
199                           tivity  especially  for  MNPs,  but may hurt speci‐
200                           ficity a little bit.
201
202                 -f FILE   The faidx-indexed reference file in the FASTA  for‐
203                           mat.  The  file  can  be  optionally  compressed by
204                           razip.  [null]
205
206                 -l FILE   BED or position list  file  containing  a  list  of
207                           regions or sites where pileup or BCF should be gen‐
208                           erated [null]
209
210                 -q INT    Minimum mapping quality for an alignment to be used
211                           [0]
212
213                 -Q INT    Minimum  base  quality  for a base to be considered
214                           [13]
215
216                 -r STR    Only generate pileup in region STR [all sites]
217
218                 Output Options:
219
220
221                 -D        Output per-sample read depth
222
223                 -g        Compute genotype likelihoods and output them in the
224                           binary call format (BCF).
225
226                 -S        Output per-sample Phred-scaled strand bias P-value
227
228                 -u        Similar  to  -g  except  that  the output is uncom‐
229                           pressed BCF, which is preferred for piping.
230
231
232                 Options for Genotype Likelihood Computation (for -g or -u):
233
234
235                 -e INT    Phred-scaled gap extension sequencing error  proba‐
236                           bility. Reducing INT leads to longer indels. [20]
237
238                 -h INT    Coefficient  for modeling homopolymer errors. Given
239                           an l-long homopolymer run, the sequencing error  of
240                           an indel of size s is modeled as INT*s/l.  [100]
241
242                 -I        Do not perform INDEL calling
243
244                 -L INT    Skip  INDEL calling if the average per-sample depth
245                           is above INT.  [250]
246
247                 -o INT    Phred-scaled gap open sequencing error probability.
248                           Reducing INT leads to more indel calls. [40]
249
250                 -p        Apply  -m  and -F thresholds per sample to increase
251                           sensitivity of calling.  By  default  both  options
252                           are applied to reads pooled from all samples.
253
254                 -P STR    Comma  dilimited  list  of platforms (determined by
255                           @RG-PL) from which indel candidates  are  obtained.
256                           It  is recommended to collect indel candidates from
257                           sequencing technologies that have low  indel  error
258                           rate such as ILLUMINA. [all]
259
260
261       reheader  samtools reheader <in.header.sam> <in.bam>
262
263                 Replace   the   header   in   in.bam   with   the  header  in
264                 in.header.sam.  This command is much  faster  than  replacing
265                 the header with a BAM->SAM->BAM conversion.
266
267
268       cat       samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam>
269                 [ ... ]
270
271                 Concatenate BAMs. The sequence dictionary of each  input  BAM
272                 must be identical, although this command does not check this.
273                 This command uses a similar trick to reheader  which  enables
274                 fast BAM concatenation.
275
276
277       sort      samtools sort [-nof] [-m maxMem] <in.bam> <out.prefix>
278
279                 Sort  alignments  by  leftmost  coordinates.  File  <out.pre‐
280                 fix>.bam will be created. This command may also create tempo‐
281                 rary  files <out.prefix>.%d.bam when the whole alignment can‐
282                 not be fitted into memory (controlled by option -m).
283
284                 OPTIONS:
285
286                 -o      Output the final alignment to the standard output.
287
288                 -n      Sort by read names rather than by chromosomal coordi‐
289                         nates
290
291                 -f      Use  <out.prefix>  as the full output path and do not
292                         append .bam suffix.
293
294                 -m INT  Approximately   the    maximum    required    memory.
295                         [500000000]
296
297
298       merge     samtools  merge  [-nur1f]  [-h  inh.sam]  [-R  reg] <out.bam>
299                 <in1.bam> <in2.bam> [...]
300
301                 Merge multiple sorted alignments.  The header reference lists
302                 of  all  the input BAM files, and the @SQ headers of inh.sam,
303                 if  any,  must  all  refer  to  the  same  set  of  reference
304                 sequences.   The header reference list and (unless overridden
305                 by -h) `@' headers of in1.bam will be copied to out.bam,  and
306                 the headers of other files will be ignored.
307
308                 OPTIONS:
309
310                 -1      Use zlib compression level 1 to comrpess the output
311
312                 -f      Force to overwrite the output file if present.
313
314                 -h FILE Use  the lines of FILE as `@' headers to be copied to
315                         out.bam, replacing any header lines that would other‐
316                         wise  be  copied  from in1.bam.  (FILE is actually in
317                         SAM format, though any alignment records it may  con‐
318                         tain are ignored.)
319
320                 -n      The  input alignments are sorted by read names rather
321                         than by chromosomal coordinates
322
323                 -R STR  Merge files in the specified region indicated by  STR
324                         [null]
325
326                 -r      Attach  an RG tag to each alignment. The tag value is
327                         inferred from file names.
328
329                 -u      Uncompressed BAM output
330
331
332       index     samtools index <aln.bam>
333
334                 Index sorted alignment for fast  random  access.  Index  file
335                 <aln.bam>.bai will be created.
336
337
338       idxstats  samtools idxstats <aln.bam>
339
340                 Retrieve and print stats in the index file. The output is TAB
341                 delimited with each line  consisting  of  reference  sequence
342                 name, sequence length, # mapped reads and # unmapped reads.
343
344
345       faidx     samtools faidx <ref.fasta> [region1 [...]]
346
347                 Index  reference sequence in the FASTA format or extract sub‐
348                 sequence from indexed reference sequence.  If  no  region  is
349                 specified,   faidx   will   index   the   file   and   create
350                 <ref.fasta>.fai on the disk. If regions are speficified,  the
351                 subsequences  will  be retrieved and printed to stdout in the
352                 FASTA format. The input file can be compressed  in  the  RAZF
353                 format.
354
355
356       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>
357
358                 Fill in mate coordinates, ISIZE and mate related flags from a
359                 name-sorted alignment.
360
361
362       rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>
363
364                 Remove potential PCR duplicates: if multiple read pairs  have
365                 identical  external  coordinates,  only  retain the pair with
366                 highest mapping quality.  In the paired-end mode,  this  com‐
367                 mand  ONLY  works  with  FR orientation and requires ISIZE is
368                 correctly set. It does not work for unpaired reads (e.g.  two
369                 ends mapped to different chromosomes or orphan reads).
370
371                 OPTIONS:
372
373                 -s      Remove  duplicate  for  single-end reads. By default,
374                         the command works for paired-end reads only.
375
376                 -S      Treat paired-end reads and single-end reads.
377
378
379       calmd     samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>
380
381                 Generate the MD tag. If the MD tag is already  present,  this
382                 command  will  give a warning if the MD tag generated is dif‐
383                 ferent from the existing tag. Output SAM by default.
384
385                 OPTIONS:
386
387                 -A      When used jointly with -r this option overwrites  the
388                         original base quality.
389
390                 -e      Convert  a  the  read base to = if it is identical to
391                         the aligned reference base.  Indel  caller  does  not
392                         support the = bases at the moment.
393
394                 -u      Output uncompressed BAM
395
396                 -b      Output compressed BAM
397
398                 -S      The input is SAM with header lines
399
400                 -C INT  Coefficient  to  cap mapping quality of poorly mapped
401                         reads. See the pileup command for details. [0]
402
403                 -r      Compute the BQ tag (without -A) or cap  base  quality
404                         by BAQ (with -A).
405
406                 -E      Extended  BAQ  calculation. This option trades speci‐
407                         ficity for sensitivity, though the effect is minor.
408
409
410       targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0]  [-1
411                 em1] [-2 em2] [-f ref] <in.bam>
412
413                 This  command identifies target regions by examining the con‐
414                 tinuity of read depth, computes haploid  consensus  sequences
415                 of targets and outputs a SAM with each sequence corresponding
416                 to a target. When option -f is in use, BAQ will  be  applied.
417                 This  command is only designed for cutting fosmid clones from
418                 fosmid pool sequencing [Ref. Kitzman et al. (2010)].
419
420
421       phase     samtools phase [-AF] [-k len] [-b  prefix]  [-q  minLOD]  [-Q
422                 minBaseQ] <in.bam>
423
424                 Call and phase heterozygous SNPs.  OPTIONS:
425
426                 -A      Drop reads with ambiguous phase.
427
428                 -b STR  Prefix  of  BAM  output.  When this option is in use,
429                         phase-0 reads will be saved  in  file  STR.0.bam  and
430                         phase-1 reads in STR.1.bam.  Phase unknown reads will
431                         be randomly  allocated  to  one  of  the  two  files.
432                         Chimeric  reads  with  switch errors will be saved in
433                         STR.chimeric.bam.  [null]
434
435                 -F      Do not attempt to fix chimeric reads.
436
437                 -k INT  Maximum length for local phasing. [13]
438
439                 -q INT  Minimum Phred-scaled LOD to call a heterozygote. [40]
440
441                 -Q INT  Minimum base quality to be used in het calling. [13]
442
443

BCFTOOLS COMMANDS AND OPTIONS

445       view      bcftools view [-AbFGNQSucgv] [-D seqDict] [-l  listLoci]  [-s
446                 listSample]  [-i  gapSNPratio] [-t mutRate] [-p varThres] [-m
447                 varThres] [-P prior] [-1 nGroup1] [-d minFrac] [-U nPerm] [-X
448                 permThres] [-T trioType] in.bcf [region]
449
450                 Convert  between  BCF  and  VCF,  call variant candidates and
451                 estimate allele frequencies.
452
453
454                 Input/Output Options:
455
456                 -A        Retain all possible alternate  alleles  at  variant
457                           sites.   By  default,  the  view  command  discards
458                           unlikely alleles.
459
460                 -b        Output in the BCF format. The default is VCF.
461
462                 -D FILE   Sequence dictionary (list of chromosome names)  for
463                           VCF->BCF conversion [null]
464
465                 -F        Indicate  PL is generated by r921 or before (order‐
466                           ing is different).
467
468                 -G        Suppress all individual genotype information.
469
470                 -l FILE   List of sites at which  information  are  outputted
471                           [all sites]
472
473                 -N        Skip sites where the REF field is not A/C/G/T
474
475                 -Q        Output the QCALL likelihood format
476
477                 -s FILE   List  of  samples  to  use. The first column in the
478                           input gives the sample names and the  second  gives
479                           the  ploidy, which can only be 1 or 2. When the 2nd
480                           column is absent, the sample ploidy is  assumed  to
481                           be  2.  In the output, the ordering of samples will
482                           be identical to the one in FILE.  [null]
483
484                 -S        The input is VCF instead of BCF.
485
486                 -u        Uncompressed BCF output (force -b).
487
488                 Consensus/Variant Calling Options:
489
490                 -c        Call variants using Bayesian inference. This option
491                           automatically invokes option -e.
492
493                 -d FLOAT  When  -v is in use, skip loci where the fraction of
494                           samples covered by reads is below FLOAT. [0]
495
496                 -e        Perform max-likelihood  inference  only,  including
497                           estimating   the  site  allele  frequency,  testing
498                           Hardy-Weinberg equlibrium and testing  associations
499                           with LRT.
500
501                 -g        Call  per-sample  genotypes at variant sites (force
502                           -c)
503
504                 -i FLOAT  Ratio of INDEL-to-SNP mutation rate [0.15]
505
506                 -m FLOAT  New model for improved multiallelic and  rare-vari‐
507                           ant  calling.  Another  ALT  allele  is accepted if
508                           P(chi^2) of LRT exceeds the  FLOAT  threshold.  The
509                           parameter seems robust and the actual value usually
510                           does not affect the results much; a good  value  to
511                           use  is  0.99.  This  is  the  recommended  calling
512                           method. [0]
513
514                 -p FLOAT  A  site  is  considered  to   be   a   variant   if
515                           P(ref|D)<FLOAT [0.5]
516
517                 -P STR    Prior  or initial allele frequency spectrum. If STR
518                           can be full, cond2, flat or the file consisting  of
519                           error output from a previous variant calling run.
520
521                 -t FLOAT  Scaled muttion rate for variant calling [0.001]
522
523                 -T STR    Enable  pair/trio calling. For trio calling, option
524                           -s is usually needed to be applied to configure the
525                           trio  members and their ordering.  In the file sup‐
526                           plied to the option -s, the first  sample  must  be
527                           the  child, the second the father and the third the
528                           mother.   The  valid  values  of  STR  are  `pair',
529                           `trioauto',  `trioxd'  and  `trioxs',  where `pair'
530                           calls differences between two  input  samples,  and
531                           `trioxd'  (`trioxs')  specifies  that  the input is
532                           from the X chromosome non-PAR regions and the child
533                           is a female (male). [null]
534
535                 -v        Output variant sites only (force -c)
536
537                 Contrast Calling and Association Test Options:
538
539                 -1 INT    Number  of group-1 samples. This option is used for
540                           dividing the samples into two groups  for  contrast
541                           SNP  calling or association test.  When this option
542                           is in use, the following  VCF  INFO  will  be  out‐
543                           putted: PC2, PCHI2 and QCHI2. [0]
544
545                 -U INT    Number of permutations for association test (effec‐
546                           tive only with -1) [0]
547
548                 -X FLOAT  Only  perform   permutations   for   P(chi^2)<FLOAT
549                           (effective only with -U) [0.01]
550
551
552       index     bcftools index in.bcf
553
554                 Index sorted BCF for random access.
555
556
557       cat       bcftools cat in1.bcf [in2.bcf [...]]]
558
559                 Concatenate  BCF  files.  The  input files are required to be
560                 sorted and have  identical  samples  appearing  in  the  same
561                 order.
562

SAM FORMAT

564       Sequence  Alignment/Map  (SAM)  format is TAB-delimited. Apart from the
565       header lines, which are started with the  `@'  symbol,  each  alignment
566       line consists of:
567
568
569       ┌────┬───────┬──────────────────────────────────────────────────────────┐
570       │Col │ Field │                       Description                        │
571       ├────┼───────┼──────────────────────────────────────────────────────────┤
572       │ 1  │ QNAME │ Query template/pair NAME                                 │
573       │ 2  │ FLAG  │ bitwise FLAG                                             │
574       │ 3  │ RNAME │ Reference sequence NAME                                  │
575       │ 4  │ POS   │ 1-based leftmost POSition/coordinate of clipped sequence │
576       │ 5  │ MAPQ  │ MAPping Quality (Phred-scaled)                           │
577       │ 6  │ CIAGR │ extended CIGAR string                                    │
578       │ 7  │ MRNM  │ Mate Reference sequence NaMe (`=' if same as RNAME)      │
579       │ 8  │ MPOS  │ 1-based Mate POSistion                                   │
580       │ 9  │ TLEN  │ inferred Template LENgth (insert size)                   │
581       │10  │ SEQ   │ query SEQuence on the same strand as the reference       │
582       │11  │ QUAL  │ query QUALity (ASCII-33 gives the Phred base quality)    │
583       │12+ │ OPT   │ variable OPTional fields in the format TAG:VTYPE:VALUE   │
584       └────┴───────┴──────────────────────────────────────────────────────────┘
585
586       Each bit in the FLAG field is defined as:
587
588
589          ┌───────┬─────┬──────────────────────────────────────────────────┐
590          │ Flag  │ Chr │                   Description                    │
591          ├───────┼─────┼──────────────────────────────────────────────────┤
592          │0x0001 │  p  │ the read is paired in sequencing                 │
593          │0x0002 │  P  │ the read is mapped in a proper pair              │
594          │0x0004 │  u  │ the query sequence itself is unmapped            │
595          │0x0008 │  U  │ the mate is unmapped                             │
596          │0x0010 │  r  │ strand of the query (1 for reverse)              │
597          │0x0020 │  R  │ strand of the mate                               │
598          │0x0040 │  1  │ the read is the first read in a pair             │
599          │0x0080 │  2  │ the read is the second read in a pair            │
600          │0x0100 │  s  │ the alignment is not primary                     │
601          │0x0200 │  f  │ the read fails platform/vendor quality checks    │
602          │0x0400 │  d  │ the read is either a PCR or an optical duplicate │
603          └───────┴─────┴──────────────────────────────────────────────────┘
604       where  the  second  column  gives the string representation of the FLAG
605       field.
606
607

VCF FORMAT

609       The Variant Call Format (VCF) is a TAB-delimited format with each  data
610       line consists of the following fields:
611
612    ┌────┬────────┬──────────────────────────────────────────────────────────────┐
613    │Col │ Field  │                         Description                          │
614    ├────┼────────┼──────────────────────────────────────────────────────────────┤
615    │ 1  │ CHROM  │ CHROMosome name                                              │
616    │ 2  │ POS    │ the left-most POSition of the variant                        │
617    │ 3  │ ID     │ unique variant IDentifier                                    │
618    │ 4  │ REF    │ the REFerence allele                                         │
619    │ 5  │ ALT    │ the ALTernate allele(s), separated by comma                  │
620    │ 6  │ QUAL   │ variant/reference QUALity                                    │
621    │ 7  │ FILTER │ FILTers applied                                              │
622    │ 8  │ INFO   │ INFOrmation related to the variant, separated by semi-colon  │
623    │ 9  │ FORMAT │ FORMAT of the genotype fields, separated by colon (optional) │
624    │10+ │ SAMPLE │ SAMPLE genotypes and per-sample information (optional)       │
625    └────┴────────┴──────────────────────────────────────────────────────────────┘
626
627       The following table gives the INFO tags used by samtools and bcftools.
628
629
630┌──────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
631│ Tag  │  Format   │                                            Description                                             │
632├──────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
633│AF1   │ double    │ Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele                  │
634│DP    │ int       │ Raw read depth (without quality filtering)                                                         │
635│DP4   │ int[4]    │ # high-quality reference forward bases, ref reverse, alternate for and alt rev bases               │
636│FQ    │ int       │ Consensus quality. Positive: sample genotypes different; negative: otherwise                       │
637│MQ    │ int       │ Root-Mean-Square mapping quality of covering reads                                                 │
638│PC2   │ int[2]    │ Phred probability of AF in group1 samples being larger (,smaller) than in group2                   │
639│PCHI2 │ double    │ Posterior weighted chi^2 P-value between group1 and group2 samples                                 │
640│PV4   │ double[4] │ P-value for strand bias, baseQ bias, mapQ bias and tail distance bias                              │
641│QCHI2 │ int       │ Phred-scaled PCHI2                                                                                 │
642│RP    │ int       │ # permutations yielding a smaller PCHI2                                                            │
643│CLR   │ int       │ Phred log ratio of genotype likelihoods with and without the trio/pair constraint                  │
644│UGT   │ string    │ Most probable genotype configuration without the trio constraint                                   │
645│CGT   │ string    │ Most probable configuration with the trio constraint                                               │
646│VDB   │ float     │ Tests variant positions within reads. Intended for filtering RNA-seq artifacts around splice sites │
647│RPB   │ float     │ Mann-Whitney rank-sum test for tail distance bias                                                  │
648│HWE   │ float     │ Hardy-Weinberg equilibrium test, Wigginton et al., PMID: 15789306                                  │
649└──────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘
650

EXAMPLES

652       o Import SAM to BAM when @SQ lines are present in the header:
653
654           samtools view -bS aln.sam > aln.bam
655
656         If @SQ lines are absent:
657
658           samtools faidx ref.fa
659           samtools view -bt ref.fa.fai aln.sam > aln.bam
660
661         where ref.fa.fai is generated automatically by the faidx command.
662
663
664       o Attach the RG tag while merging sorted alignments:
665
666           perl       -e       'print      "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu‐
667         mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
668           samtools merge -rh rg.txt merged.bam ga.bam 454.bam
669
670         The value in a RG tag is determined by the file name the read is com‐
671         ing  from. In this example, in the merged.bam, reads from ga.bam will
672         be attached RG:Z:ga,  while  reads  from  454.bam  will  be  attached
673         RG:Z:454.
674
675
676       o Call SNPs and short INDELs for one diploid individual:
677
678           samtools  mpileup  -ugf  ref.fa  aln.bam  | bcftools view -bvcg - >
679         var.raw.bcf
680           bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  100   >
681         var.flt.vcf
682
683         The  -D  option  of  varFilter controls the maximum read depth, which
684         should be adjusted to about twice the average read  depth.   One  may
685         consider  to  add -C50 to mpileup if mapping quality is overestimated
686         for reads containing excessive mismatches. Applying this option  usu‐
687         ally helps BWA-short but may not other mappers.
688
689
690       o Generate the consensus sequence for one diploid individual:
691
692           samtools  mpileup  -uf ref.fa aln.bam | bcftools view -cg - | vcfu‐
693         tils.pl vcf2fq > cns.fq
694
695
696       o Call somatic mutations from a pair of samples:
697
698           samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair -
699         > var.bcf
700
701         In  the  output INFO field, CLR gives the Phred-log ratio between the
702         likelihood by treating the two samples independently, and the likeli‐
703         hood  by  requiring the genotype to be identical.  This CLR is effec‐
704         tively a score measuring the confidence of somatic calls. The  higher
705         the better.
706
707
708       o Call de novo and somatic mutations from a family trio:
709
710           samtools  mpileup  -DSuf ref.fa aln.bam | bcftools view -bvcgT pair
711         -s samples.txt - > var.bcf
712
713         File samples.txt should consist of three lines specifying the  member
714         and  order  of  samples (in the order of child-father-mother).  Simi‐
715         larly, CLR gives the Phred-log likelihood ratio with and without  the
716         trio  constraint.   UGT  shows the most likely genotype configuration
717         without the trio constraint, and CGT gives the most  likely  genotype
718         configuration satisfying the trio constraint.
719
720
721       o Phase one individual:
722
723           samtools  calmd -AEur aln.bam ref.fa | samtools phase -b prefix - >
724         phase.out
725
726         The calmd command  is  used  to  reduce  false  heterozygotes  around
727         INDELs.
728
729
730       o Call SNPs and short indels for multiple diploid individuals:
731
732           samtools  mpileup  -P  ILLUMINA  -ugf  ref.fa *.bam | bcftools view
733         -bcvg - > var.raw.bcf
734           bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  2000  >
735         var.flt.vcf
736
737         Individuals  are identified from the SM tags in the @RG header lines.
738         Individuals can be pooled in one alignment file; one  individual  can
739         also  be  separated into multiple files. The -P option specifies that
740         indel candidates should be collected only from read groups  with  the
741         @RG-PL  tag  set to ILLUMINA.  Collecting indel candidates from reads
742         sequenced by an indel-prone technology may affect the performance  of
743         indel calling.
744
745         Note that there is a new calling model which can be invoked by
746
747             bcftools view -m0.99  ...
748
749         which fixes some severe limitations of the default method.
750
751         For filtering, best results seem to be achieved by first applying the
752         SnpGap filter and then applying some machine learning approach
753
754             vcf-annotate -f SnpGap=n
755             vcf filter ...
756
757         Both can be found in the vcftools and htslib package (links below).
758
759
760       o Derive the allele frequency spectrum (AFS) on a list  of  sites  from
761         multiple individuals:
762
763           samtools mpileup -Igf ref.fa *.bam > all.bcf
764           bcftools view -bl sites.list all.bcf > sites.bcf
765           bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
766           bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
767           bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
768           ......
769
770         where sites.list contains the list of sites with each line consisting
771         of the reference sequence name and position. The  following  bcftools
772         commands estimate AFS by EM.
773
774
775       o Dump BAQ applied alignment for other SNP callers:
776
777           samtools calmd -bAr aln.bam > aln.baq.bam
778
779         It  adds  and corrects the NM and MD tags at the same time. The calmd
780         command also comes with the -C option, the same as the one in  pileup
781         and mpileup.  Apply if it helps.
782
783

LIMITATIONS

785       o Unaligned   words  used  in  bam_import.c,  bam_endian.h,  bam.c  and
786         bam_aux.c.
787
788       o Samtools paired-end rmdup does not  work  for  unpaired  reads  (e.g.
789         orphan  reads  or ends mapped to different chromosomes). If this is a
790         concern, please use Picard's MarkDuplicate  which  correctly  handles
791         these cases, although a little slower.
792
793

AUTHOR

795       Heng  Li from the Sanger Institute wrote the C version of samtools. Bob
796       Handsaker from the Broad Institute implemented the BGZF library and Jue
797       Ruan  from Beijing Genomics Institute wrote the RAZF library. John Mar‐
798       shall and Petr Danecek contribute to the source code and various people
799       from the 1000 Genomes Project have contributed to the SAM format speci‐
800       fication.
801
802