1samtools(1) Bioinformatics tools samtools(1)
2
3
4
6 samtools - Utilities for the Sequence Alignment/Map (SAM) format
7
8 bcftools - Utilities for the Binary Call Format (BCF) and VCF
9
11 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
12
13 samtools sort aln.bam aln.sorted
14
15 samtools index aln.sorted.bam
16
17 samtools idxstats aln.sorted.bam
18
19 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
20
21 samtools merge out.bam in1.bam in2.bam in3.bam
22
23 samtools faidx ref.fasta
24
25 samtools pileup -vcf ref.fasta aln.sorted.bam
26
27 samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
28
29 samtools tview aln.sorted.bam ref.fasta
30
31 bcftools index in.bcf
32
33 bcftools view in.bcf chr2:100-200 > out.vcf
34
35 bcftools view -Nvm0.99 in.bcf > out.vcf 2> out.afs
36
37
39 Samtools is a set of utilities that manipulate alignments in the BAM
40 format. It imports from and exports to the SAM (Sequence Alignment/Map)
41 format, does sorting, merging and indexing, and allows to retrieve
42 reads in any regions swiftly.
43
44 Samtools is designed to work on a stream. It regards an input file `-'
45 as the standard input (stdin) and an output file `-' as the standard
46 output (stdout). Several commands can thus be combined with Unix pipes.
47 Samtools always output warning and error messages to the standard error
48 output (stderr).
49
50 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
51 HTTP server if the BAM file name starts with `ftp://' or `http://'.
52 Samtools checks the current working directory for the index file and
53 will download the index upon absence. Samtools does not retrieve the
54 entire alignment file unless it is asked to do so.
55
56
58 view samtools view [-bchuHS] [-t in.refList] [-o output] [-f
59 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read‐
60 Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
61
62 Extract/print all or sub alignments in SAM or BAM format. If
63 no region is specified, all the alignments will be printed;
64 otherwise only alignments overlapping the specified regions
65 will be output. An alignment may be given multiple times if
66 it is overlapping several regions. A region can be presented,
67 for example, in the following format: `chr2' (the whole
68 chr2), `chr2:1000000' (region starting from 1,000,000bp) or
69 `chr2:1,000,000-2,000,000' (region between 1,000,000 and
70 2,000,000bp including the end points). The coordinate is
71 1-based.
72
73 OPTIONS:
74
75 -b Output in the BAM format.
76
77 -f INT Only output alignments with all bits in INT present
78 in the FLAG field. INT can be in hex in the format
79 of /^0x[0-9A-F]+/ [0]
80
81 -F INT Skip alignments with bits present in INT [0]
82
83 -h Include the header in the output.
84
85 -H Output the header only.
86
87 -l STR Only output reads in library STR [null]
88
89 -o FILE Output file [stdout]
90
91 -q INT Skip alignments with MAPQ smaller than INT [0]
92
93 -r STR Only output reads in read group STR [null]
94
95 -R FILE Output reads in read groups listed in FILE [null]
96
97 -s FLOAT Fraction of templates/pairs to subsample; the inte‐
98 ger part is treated as the seed for the random num‐
99 ber generator [-1]
100
101 -S Input is in SAM. If @SQ header lines are absent,
102 the `-t' option is required.
103
104 -c Instead of printing the alignments, only count them
105 and print the total number. All filter options,
106 such as `-f', `-F' and `-q' , are taken into
107 account.
108
109 -t FILE This file is TAB-delimited. Each line must contain
110 the reference name and the length of the reference,
111 one line for each distinct reference; additional
112 fields are ignored. This file also defines the
113 order of the reference sequences in sorting. If you
114 run `samtools faidx <ref.fa>', the resultant index
115 file <ref.fa>.fai can be used as this <in.ref_list>
116 file.
117
118 -u Output uncompressed BAM. This option saves time
119 spent on compression/decomprssion and is thus pre‐
120 ferred when the output is piped to another samtools
121 command.
122
123
124 tview samtools tview [-p chr:pos] [-s STR] [-d display]
125 <in.sorted.bam> [ref.fasta]
126
127 Text alignment viewer (based on the ncurses library). In the
128 viewer, press `?' for help and press `g' to check the align‐
129 ment start from a region in the format like
130 `chr10:10,000,000' or `=10,000,000' when viewing the same
131 reference sequence.
132
133 Options:
134
135 -d display Output as (H)tml or (C)urses or (T)ext
136
137 -p chr:pos Go directly to this position
138
139 -s STR Display only reads from this sample or read
140 group
141
142
143 mpileup samtools mpileup [-EBugp] [-C capQcoef] [-r reg] [-f in.fa]
144 [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam
145 [in2.bam [...]]
146
147 Generate BCF or pileup for one or multiple BAM files. Align‐
148 ment records are grouped by sample identifiers in @RG header
149 lines. If sample identifiers are absent, each input file is
150 regarded as one sample.
151
152 In the pileup format (without -uor-g), each line represents a
153 genomic position, consisting of chromosome name, coordinate,
154 reference base, read bases, read qualities and alignment map‐
155 ping qualities. Information on match, mismatch, indel,
156 strand, mapping quality and start and end of a read are all
157 encoded at the read base column. At this column, a dot stands
158 for a match to the reference base on the forward strand, a
159 comma for a match on the reverse strand, a '>' or '<' for a
160 reference skip, `ACGTN' for a mismatch on the forward strand
161 and `acgtn' for a mismatch on the reverse strand. A pattern
162 `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
163 between this reference position and the next reference posi‐
164 tion. The length of the insertion is given by the integer in
165 the pattern, followed by the inserted sequence. Similarly, a
166 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
167 reference. The deleted bases will be presented as `*' in the
168 following lines. Also at the read base column, a symbol `^'
169 marks the start of a read. The ASCII of the character follow‐
170 ing `^' minus 33 gives the mapping quality. A symbol `$'
171 marks the end of a read segment.
172
173 Input Options:
174
175 -6 Assume the quality is in the Illumina 1.3+ encod‐
176 ing. -A Do not skip anomalous read pairs in vari‐
177 ant calling.
178
179 -B Disable probabilistic realignment for the computa‐
180 tion of base alignment quality (BAQ). BAQ is the
181 Phred-scaled probability of a read base being mis‐
182 aligned. Applying this option greatly helps to
183 reduce false SNPs caused by misalignments.
184
185 -b FILE List of input BAM files, one file per line [null]
186
187 -C INT Coefficient for downgrading mapping quality for
188 reads containing excessive mismatches. Given a read
189 with a phred-scaled probability q of being gener‐
190 ated from the mapped position, the new mapping
191 quality is about sqrt((INT-q)/INT)*INT. A zero
192 value disables this functionality; if enabled, the
193 recommended value for BWA is 50. [0]
194
195 -d INT At a position, read maximally INT reads per input
196 BAM. [250]
197
198 -E Extended BAQ computation. This option helps sensi‐
199 tivity especially for MNPs, but may hurt speci‐
200 ficity a little bit.
201
202 -f FILE The faidx-indexed reference file in the FASTA for‐
203 mat. The file can be optionally compressed by
204 razip. [null]
205
206 -l FILE BED or position list file containing a list of
207 regions or sites where pileup or BCF should be gen‐
208 erated [null]
209
210 -q INT Minimum mapping quality for an alignment to be used
211 [0]
212
213 -Q INT Minimum base quality for a base to be considered
214 [13]
215
216 -r STR Only generate pileup in region STR [all sites]
217
218 Output Options:
219
220
221 -D Output per-sample read depth
222
223 -g Compute genotype likelihoods and output them in the
224 binary call format (BCF).
225
226 -S Output per-sample Phred-scaled strand bias P-value
227
228 -u Similar to -g except that the output is uncom‐
229 pressed BCF, which is preferred for piping.
230
231
232 Options for Genotype Likelihood Computation (for -g or -u):
233
234
235 -e INT Phred-scaled gap extension sequencing error proba‐
236 bility. Reducing INT leads to longer indels. [20]
237
238 -h INT Coefficient for modeling homopolymer errors. Given
239 an l-long homopolymer run, the sequencing error of
240 an indel of size s is modeled as INT*s/l. [100]
241
242 -I Do not perform INDEL calling
243
244 -L INT Skip INDEL calling if the average per-sample depth
245 is above INT. [250]
246
247 -o INT Phred-scaled gap open sequencing error probability.
248 Reducing INT leads to more indel calls. [40]
249
250 -p Apply -m and -F thresholds per sample to increase
251 sensitivity of calling. By default both options
252 are applied to reads pooled from all samples.
253
254 -P STR Comma dilimited list of platforms (determined by
255 @RG-PL) from which indel candidates are obtained.
256 It is recommended to collect indel candidates from
257 sequencing technologies that have low indel error
258 rate such as ILLUMINA. [all]
259
260
261 reheader samtools reheader <in.header.sam> <in.bam>
262
263 Replace the header in in.bam with the header in
264 in.header.sam. This command is much faster than replacing
265 the header with a BAM->SAM->BAM conversion.
266
267
268 cat samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam>
269 [ ... ]
270
271 Concatenate BAMs. The sequence dictionary of each input BAM
272 must be identical, although this command does not check this.
273 This command uses a similar trick to reheader which enables
274 fast BAM concatenation.
275
276
277 sort samtools sort [-nof] [-m maxMem] <in.bam> <out.prefix>
278
279 Sort alignments by leftmost coordinates. File <out.pre‐
280 fix>.bam will be created. This command may also create tempo‐
281 rary files <out.prefix>.%d.bam when the whole alignment can‐
282 not be fitted into memory (controlled by option -m).
283
284 OPTIONS:
285
286 -o Output the final alignment to the standard output.
287
288 -n Sort by read names rather than by chromosomal coordi‐
289 nates
290
291 -f Use <out.prefix> as the full output path and do not
292 append .bam suffix.
293
294 -m INT Approximately the maximum required memory.
295 [500000000]
296
297
298 merge samtools merge [-nur1f] [-h inh.sam] [-R reg] <out.bam>
299 <in1.bam> <in2.bam> [...]
300
301 Merge multiple sorted alignments. The header reference lists
302 of all the input BAM files, and the @SQ headers of inh.sam,
303 if any, must all refer to the same set of reference
304 sequences. The header reference list and (unless overridden
305 by -h) `@' headers of in1.bam will be copied to out.bam, and
306 the headers of other files will be ignored.
307
308 OPTIONS:
309
310 -1 Use zlib compression level 1 to comrpess the output
311
312 -f Force to overwrite the output file if present.
313
314 -h FILE Use the lines of FILE as `@' headers to be copied to
315 out.bam, replacing any header lines that would other‐
316 wise be copied from in1.bam. (FILE is actually in
317 SAM format, though any alignment records it may con‐
318 tain are ignored.)
319
320 -n The input alignments are sorted by read names rather
321 than by chromosomal coordinates
322
323 -R STR Merge files in the specified region indicated by STR
324 [null]
325
326 -r Attach an RG tag to each alignment. The tag value is
327 inferred from file names.
328
329 -u Uncompressed BAM output
330
331
332 index samtools index <aln.bam>
333
334 Index sorted alignment for fast random access. Index file
335 <aln.bam>.bai will be created.
336
337
338 idxstats samtools idxstats <aln.bam>
339
340 Retrieve and print stats in the index file. The output is TAB
341 delimited with each line consisting of reference sequence
342 name, sequence length, # mapped reads and # unmapped reads.
343
344
345 faidx samtools faidx <ref.fasta> [region1 [...]]
346
347 Index reference sequence in the FASTA format or extract sub‐
348 sequence from indexed reference sequence. If no region is
349 specified, faidx will index the file and create
350 <ref.fasta>.fai on the disk. If regions are speficified, the
351 subsequences will be retrieved and printed to stdout in the
352 FASTA format. The input file can be compressed in the RAZF
353 format.
354
355
356 fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
357
358 Fill in mate coordinates, ISIZE and mate related flags from a
359 name-sorted alignment.
360
361
362 rmdup samtools rmdup [-sS] <input.srt.bam> <out.bam>
363
364 Remove potential PCR duplicates: if multiple read pairs have
365 identical external coordinates, only retain the pair with
366 highest mapping quality. In the paired-end mode, this com‐
367 mand ONLY works with FR orientation and requires ISIZE is
368 correctly set. It does not work for unpaired reads (e.g. two
369 ends mapped to different chromosomes or orphan reads).
370
371 OPTIONS:
372
373 -s Remove duplicate for single-end reads. By default,
374 the command works for paired-end reads only.
375
376 -S Treat paired-end reads and single-end reads.
377
378
379 calmd samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>
380
381 Generate the MD tag. If the MD tag is already present, this
382 command will give a warning if the MD tag generated is dif‐
383 ferent from the existing tag. Output SAM by default.
384
385 OPTIONS:
386
387 -A When used jointly with -r this option overwrites the
388 original base quality.
389
390 -e Convert a the read base to = if it is identical to
391 the aligned reference base. Indel caller does not
392 support the = bases at the moment.
393
394 -u Output uncompressed BAM
395
396 -b Output compressed BAM
397
398 -S The input is SAM with header lines
399
400 -C INT Coefficient to cap mapping quality of poorly mapped
401 reads. See the pileup command for details. [0]
402
403 -r Compute the BQ tag (without -A) or cap base quality
404 by BAQ (with -A).
405
406 -E Extended BAQ calculation. This option trades speci‐
407 ficity for sensitivity, though the effect is minor.
408
409
410 targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1
411 em1] [-2 em2] [-f ref] <in.bam>
412
413 This command identifies target regions by examining the con‐
414 tinuity of read depth, computes haploid consensus sequences
415 of targets and outputs a SAM with each sequence corresponding
416 to a target. When option -f is in use, BAQ will be applied.
417 This command is only designed for cutting fosmid clones from
418 fosmid pool sequencing [Ref. Kitzman et al. (2010)].
419
420
421 phase samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q
422 minBaseQ] <in.bam>
423
424 Call and phase heterozygous SNPs. OPTIONS:
425
426 -A Drop reads with ambiguous phase.
427
428 -b STR Prefix of BAM output. When this option is in use,
429 phase-0 reads will be saved in file STR.0.bam and
430 phase-1 reads in STR.1.bam. Phase unknown reads will
431 be randomly allocated to one of the two files.
432 Chimeric reads with switch errors will be saved in
433 STR.chimeric.bam. [null]
434
435 -F Do not attempt to fix chimeric reads.
436
437 -k INT Maximum length for local phasing. [13]
438
439 -q INT Minimum Phred-scaled LOD to call a heterozygote. [40]
440
441 -Q INT Minimum base quality to be used in het calling. [13]
442
443
445 view bcftools view [-AbFGNQSucgv] [-D seqDict] [-l listLoci] [-s
446 listSample] [-i gapSNPratio] [-t mutRate] [-p varThres] [-m
447 varThres] [-P prior] [-1 nGroup1] [-d minFrac] [-U nPerm] [-X
448 permThres] [-T trioType] in.bcf [region]
449
450 Convert between BCF and VCF, call variant candidates and
451 estimate allele frequencies.
452
453
454 Input/Output Options:
455
456 -A Retain all possible alternate alleles at variant
457 sites. By default, the view command discards
458 unlikely alleles.
459
460 -b Output in the BCF format. The default is VCF.
461
462 -D FILE Sequence dictionary (list of chromosome names) for
463 VCF->BCF conversion [null]
464
465 -F Indicate PL is generated by r921 or before (order‐
466 ing is different).
467
468 -G Suppress all individual genotype information.
469
470 -l FILE List of sites at which information are outputted
471 [all sites]
472
473 -N Skip sites where the REF field is not A/C/G/T
474
475 -Q Output the QCALL likelihood format
476
477 -s FILE List of samples to use. The first column in the
478 input gives the sample names and the second gives
479 the ploidy, which can only be 1 or 2. When the 2nd
480 column is absent, the sample ploidy is assumed to
481 be 2. In the output, the ordering of samples will
482 be identical to the one in FILE. [null]
483
484 -S The input is VCF instead of BCF.
485
486 -u Uncompressed BCF output (force -b).
487
488 Consensus/Variant Calling Options:
489
490 -c Call variants using Bayesian inference. This option
491 automatically invokes option -e.
492
493 -d FLOAT When -v is in use, skip loci where the fraction of
494 samples covered by reads is below FLOAT. [0]
495
496 -e Perform max-likelihood inference only, including
497 estimating the site allele frequency, testing
498 Hardy-Weinberg equlibrium and testing associations
499 with LRT.
500
501 -g Call per-sample genotypes at variant sites (force
502 -c)
503
504 -i FLOAT Ratio of INDEL-to-SNP mutation rate [0.15]
505
506 -m FLOAT New model for improved multiallelic and rare-vari‐
507 ant calling. Another ALT allele is accepted if
508 P(chi^2) of LRT exceeds the FLOAT threshold. The
509 parameter seems robust and the actual value usually
510 does not affect the results much; a good value to
511 use is 0.99. This is the recommended calling
512 method. [0]
513
514 -p FLOAT A site is considered to be a variant if
515 P(ref|D)<FLOAT [0.5]
516
517 -P STR Prior or initial allele frequency spectrum. If STR
518 can be full, cond2, flat or the file consisting of
519 error output from a previous variant calling run.
520
521 -t FLOAT Scaled muttion rate for variant calling [0.001]
522
523 -T STR Enable pair/trio calling. For trio calling, option
524 -s is usually needed to be applied to configure the
525 trio members and their ordering. In the file sup‐
526 plied to the option -s, the first sample must be
527 the child, the second the father and the third the
528 mother. The valid values of STR are `pair',
529 `trioauto', `trioxd' and `trioxs', where `pair'
530 calls differences between two input samples, and
531 `trioxd' (`trioxs') specifies that the input is
532 from the X chromosome non-PAR regions and the child
533 is a female (male). [null]
534
535 -v Output variant sites only (force -c)
536
537 Contrast Calling and Association Test Options:
538
539 -1 INT Number of group-1 samples. This option is used for
540 dividing the samples into two groups for contrast
541 SNP calling or association test. When this option
542 is in use, the following VCF INFO will be out‐
543 putted: PC2, PCHI2 and QCHI2. [0]
544
545 -U INT Number of permutations for association test (effec‐
546 tive only with -1) [0]
547
548 -X FLOAT Only perform permutations for P(chi^2)<FLOAT
549 (effective only with -U) [0.01]
550
551
552 index bcftools index in.bcf
553
554 Index sorted BCF for random access.
555
556
557 cat bcftools cat in1.bcf [in2.bcf [...]]]
558
559 Concatenate BCF files. The input files are required to be
560 sorted and have identical samples appearing in the same
561 order.
562
564 Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the
565 header lines, which are started with the `@' symbol, each alignment
566 line consists of:
567
568
569 ┌────┬───────┬──────────────────────────────────────────────────────────┐
570 │Col │ Field │ Description │
571 ├────┼───────┼──────────────────────────────────────────────────────────┤
572 │ 1 │ QNAME │ Query template/pair NAME │
573 │ 2 │ FLAG │ bitwise FLAG │
574 │ 3 │ RNAME │ Reference sequence NAME │
575 │ 4 │ POS │ 1-based leftmost POSition/coordinate of clipped sequence │
576 │ 5 │ MAPQ │ MAPping Quality (Phred-scaled) │
577 │ 6 │ CIAGR │ extended CIGAR string │
578 │ 7 │ MRNM │ Mate Reference sequence NaMe (`=' if same as RNAME) │
579 │ 8 │ MPOS │ 1-based Mate POSistion │
580 │ 9 │ TLEN │ inferred Template LENgth (insert size) │
581 │10 │ SEQ │ query SEQuence on the same strand as the reference │
582 │11 │ QUAL │ query QUALity (ASCII-33 gives the Phred base quality) │
583 │12+ │ OPT │ variable OPTional fields in the format TAG:VTYPE:VALUE │
584 └────┴───────┴──────────────────────────────────────────────────────────┘
585
586 Each bit in the FLAG field is defined as:
587
588
589 ┌───────┬─────┬──────────────────────────────────────────────────┐
590 │ Flag │ Chr │ Description │
591 ├───────┼─────┼──────────────────────────────────────────────────┤
592 │0x0001 │ p │ the read is paired in sequencing │
593 │0x0002 │ P │ the read is mapped in a proper pair │
594 │0x0004 │ u │ the query sequence itself is unmapped │
595 │0x0008 │ U │ the mate is unmapped │
596 │0x0010 │ r │ strand of the query (1 for reverse) │
597 │0x0020 │ R │ strand of the mate │
598 │0x0040 │ 1 │ the read is the first read in a pair │
599 │0x0080 │ 2 │ the read is the second read in a pair │
600 │0x0100 │ s │ the alignment is not primary │
601 │0x0200 │ f │ the read fails platform/vendor quality checks │
602 │0x0400 │ d │ the read is either a PCR or an optical duplicate │
603 └───────┴─────┴──────────────────────────────────────────────────┘
604 where the second column gives the string representation of the FLAG
605 field.
606
607
609 The Variant Call Format (VCF) is a TAB-delimited format with each data
610 line consists of the following fields:
611
612 ┌────┬────────┬──────────────────────────────────────────────────────────────┐
613 │Col │ Field │ Description │
614 ├────┼────────┼──────────────────────────────────────────────────────────────┤
615 │ 1 │ CHROM │ CHROMosome name │
616 │ 2 │ POS │ the left-most POSition of the variant │
617 │ 3 │ ID │ unique variant IDentifier │
618 │ 4 │ REF │ the REFerence allele │
619 │ 5 │ ALT │ the ALTernate allele(s), separated by comma │
620 │ 6 │ QUAL │ variant/reference QUALity │
621 │ 7 │ FILTER │ FILTers applied │
622 │ 8 │ INFO │ INFOrmation related to the variant, separated by semi-colon │
623 │ 9 │ FORMAT │ FORMAT of the genotype fields, separated by colon (optional) │
624 │10+ │ SAMPLE │ SAMPLE genotypes and per-sample information (optional) │
625 └────┴────────┴──────────────────────────────────────────────────────────────┘
626
627 The following table gives the INFO tags used by samtools and bcftools.
628
629
630┌──────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
631│ Tag │ Format │ Description │
632├──────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
633│AF1 │ double │ Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele │
634│DP │ int │ Raw read depth (without quality filtering) │
635│DP4 │ int[4] │ # high-quality reference forward bases, ref reverse, alternate for and alt rev bases │
636│FQ │ int │ Consensus quality. Positive: sample genotypes different; negative: otherwise │
637│MQ │ int │ Root-Mean-Square mapping quality of covering reads │
638│PC2 │ int[2] │ Phred probability of AF in group1 samples being larger (,smaller) than in group2 │
639│PCHI2 │ double │ Posterior weighted chi^2 P-value between group1 and group2 samples │
640│PV4 │ double[4] │ P-value for strand bias, baseQ bias, mapQ bias and tail distance bias │
641│QCHI2 │ int │ Phred-scaled PCHI2 │
642│RP │ int │ # permutations yielding a smaller PCHI2 │
643│CLR │ int │ Phred log ratio of genotype likelihoods with and without the trio/pair constraint │
644│UGT │ string │ Most probable genotype configuration without the trio constraint │
645│CGT │ string │ Most probable configuration with the trio constraint │
646│VDB │ float │ Tests variant positions within reads. Intended for filtering RNA-seq artifacts around splice sites │
647│RPB │ float │ Mann-Whitney rank-sum test for tail distance bias │
648│HWE │ float │ Hardy-Weinberg equilibrium test, Wigginton et al., PMID: 15789306 │
649└──────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘
650
652 o Import SAM to BAM when @SQ lines are present in the header:
653
654 samtools view -bS aln.sam > aln.bam
655
656 If @SQ lines are absent:
657
658 samtools faidx ref.fa
659 samtools view -bt ref.fa.fai aln.sam > aln.bam
660
661 where ref.fa.fai is generated automatically by the faidx command.
662
663
664 o Attach the RG tag while merging sorted alignments:
665
666 perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu‐
667 mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
668 samtools merge -rh rg.txt merged.bam ga.bam 454.bam
669
670 The value in a RG tag is determined by the file name the read is com‐
671 ing from. In this example, in the merged.bam, reads from ga.bam will
672 be attached RG:Z:ga, while reads from 454.bam will be attached
673 RG:Z:454.
674
675
676 o Call SNPs and short INDELs for one diploid individual:
677
678 samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - >
679 var.raw.bcf
680 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 >
681 var.flt.vcf
682
683 The -D option of varFilter controls the maximum read depth, which
684 should be adjusted to about twice the average read depth. One may
685 consider to add -C50 to mpileup if mapping quality is overestimated
686 for reads containing excessive mismatches. Applying this option usu‐
687 ally helps BWA-short but may not other mappers.
688
689
690 o Generate the consensus sequence for one diploid individual:
691
692 samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfu‐
693 tils.pl vcf2fq > cns.fq
694
695
696 o Call somatic mutations from a pair of samples:
697
698 samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair -
699 > var.bcf
700
701 In the output INFO field, CLR gives the Phred-log ratio between the
702 likelihood by treating the two samples independently, and the likeli‐
703 hood by requiring the genotype to be identical. This CLR is effec‐
704 tively a score measuring the confidence of somatic calls. The higher
705 the better.
706
707
708 o Call de novo and somatic mutations from a family trio:
709
710 samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair
711 -s samples.txt - > var.bcf
712
713 File samples.txt should consist of three lines specifying the member
714 and order of samples (in the order of child-father-mother). Simi‐
715 larly, CLR gives the Phred-log likelihood ratio with and without the
716 trio constraint. UGT shows the most likely genotype configuration
717 without the trio constraint, and CGT gives the most likely genotype
718 configuration satisfying the trio constraint.
719
720
721 o Phase one individual:
722
723 samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - >
724 phase.out
725
726 The calmd command is used to reduce false heterozygotes around
727 INDELs.
728
729
730 o Call SNPs and short indels for multiple diploid individuals:
731
732 samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view
733 -bcvg - > var.raw.bcf
734 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 >
735 var.flt.vcf
736
737 Individuals are identified from the SM tags in the @RG header lines.
738 Individuals can be pooled in one alignment file; one individual can
739 also be separated into multiple files. The -P option specifies that
740 indel candidates should be collected only from read groups with the
741 @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
742 sequenced by an indel-prone technology may affect the performance of
743 indel calling.
744
745 Note that there is a new calling model which can be invoked by
746
747 bcftools view -m0.99 ...
748
749 which fixes some severe limitations of the default method.
750
751 For filtering, best results seem to be achieved by first applying the
752 SnpGap filter and then applying some machine learning approach
753
754 vcf-annotate -f SnpGap=n
755 vcf filter ...
756
757 Both can be found in the vcftools and htslib package (links below).
758
759
760 o Derive the allele frequency spectrum (AFS) on a list of sites from
761 multiple individuals:
762
763 samtools mpileup -Igf ref.fa *.bam > all.bcf
764 bcftools view -bl sites.list all.bcf > sites.bcf
765 bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
766 bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
767 bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
768 ......
769
770 where sites.list contains the list of sites with each line consisting
771 of the reference sequence name and position. The following bcftools
772 commands estimate AFS by EM.
773
774
775 o Dump BAQ applied alignment for other SNP callers:
776
777 samtools calmd -bAr aln.bam > aln.baq.bam
778
779 It adds and corrects the NM and MD tags at the same time. The calmd
780 command also comes with the -C option, the same as the one in pileup
781 and mpileup. Apply if it helps.
782
783
785 o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
786 bam_aux.c.
787
788 o Samtools paired-end rmdup does not work for unpaired reads (e.g.
789 orphan reads or ends mapped to different chromosomes). If this is a
790 concern, please use Picard's MarkDuplicate which correctly handles
791 these cases, although a little slower.
792
793
795 Heng Li from the Sanger Institute wrote the C version of samtools. Bob
796 Handsaker from the Broad Institute implemented the BGZF library and Jue
797 Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar‐
798 shall and Petr Danecek contribute to the source code and various people
799 from the 1000 Genomes Project have contributed to the SAM format speci‐
800 fication.
801
802
804 Samtools website: <http://samtools.sourceforge.net>
805 Samtools latest source: <https://github.com/samtools/samtools>
806 VCFtools website with stable link to VCF specification:
807 <http://vcftools.sourceforge.net>
808 HTSlib website: <https://github.com/samtools/htslib>
809
810
811
812samtools-0.1.19 15 March 2013 samtools(1)