samtools(1) Bioinformatics tools samtools(1)
2
3
4
6 samtools - Utilities for the Sequence Alignment/Map (SAM) format
7
9 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
10
11 samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
12
13 samtools index aln.sorted.bam
14
15 samtools idxstats aln.sorted.bam
16
17 samtools flagstat aln.sorted.bam
18
19 samtools stats aln.sorted.bam
20
21 samtools bedcov aln.sorted.bam
22
23 samtools depth aln.sorted.bam
24
25 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
26
27 samtools merge out.bam in1.bam in2.bam in3.bam
28
29 samtools faidx ref.fasta
30
31 samtools fqidx ref.fastq
32
33 samtools tview aln.sorted.bam ref.fasta
34
35 samtools split merged.bam
36
37 samtools quickcheck in1.bam in2.cram
38
39 samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta
40
41 samtools fixmate in.namesorted.sam out.bam
42
43 samtools mpileup -C50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
44
45 samtools flags PAIRED,UNMAP,MUNMAP
46
47 samtools fastq input.bam > output.fastq
48
49 samtools fasta input.bam > output.fasta
50
samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o output.bam input.bam
53
54 samtools collate -o aln.name_collated.bam aln.sorted.bam
55
56 samtools depad input.bam
57
58 samtools markdup in.algnsorted.bam out.bam
59
60
Samtools is a set of utilities that manipulate alignments in the BAM
format. It imports from and exports to the SAM (Sequence Alignment/Map)
format, performs sorting, merging and indexing, and allows reads in any
region to be retrieved quickly.
66
67 Samtools is designed to work on a stream. It regards an input file `-'
68 as the standard input (stdin) and an output file `-' as the standard
69 output (stdout). Several commands can thus be combined with Unix pipes.
Samtools always outputs warning and error messages to the standard error
output (stderr).
72
73 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
74 HTTP server if the BAM file name starts with `ftp://' or `http://'.
75 Samtools checks the current working directory for the index file and
76 will download the index upon absence. Samtools does not retrieve the
77 entire alignment file unless it is asked to do so.
78
79
81 view samtools view [options] in.sam|in.bam|in.cram [region...]
82
83 With no options or regions specified, prints all alignments
84 in the specified input alignment file (in SAM, BAM, or CRAM
85 format) to standard output in SAM format (with no header).
86
87 You may specify one or more space-separated region specifica‐
88 tions after the input filename to restrict output to only
89 those alignments which overlap the specified region(s). Use
90 of region specifications requires a coordinate-sorted and
91 indexed input file (in BAM or CRAM format).
92
93 The -b, -C, -1, -u, -h, -H, and -c options change the output
94 format from the default of headerless SAM, and the -o and -U
95 options set the output file name(s).
96
97 The -t and -T options provide additional reference data. One
98 of these two options is required when SAM input does not con‐
99 tain @SQ headers, and the -T option is required whenever
100 writing CRAM output.
101
102 The -L, -M, -r, -R, -s, -q, -l, -m, -f, -F, and -G options
103 filter the alignments that will be included in the output to
104 only those alignments that match certain criteria.
105
106 The -x and -B options modify the data which is contained in
107 each alignment.
108
109 Finally, the -@ option can be used to allocate additional
110 threads to be used for compression, and the -? option
111 requests a long help message.
112
113
114 REGIONS:
115 Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and
116 all position coordinates are 1-based.
117
118 Important note: when multiple regions are given, some align‐
119 ments may be output multiple times if they overlap more than
120 one of the specified regions.
121
122 Examples of region specifications:
123
124 chr1 Output all alignments mapped to the reference
125 sequence named `chr1' (i.e. @SQ SN:chr1).
126
127 chr2:1000000
128 The region on chr2 beginning at base position
129 1,000,000 and ending at the end of the chromosome.
130
131 chr3:1000-2000
132 The 1001bp region on chr3 beginning at base posi‐
133 tion 1,000 and ending at base position 2,000
134 (including both end positions).
135
136 '*' Output the unmapped reads at the end of the file.
137 (This does not include any unmapped reads placed on
138 a reference sequence alongside their mapped mates.)
139
140 . Output all alignments. (Mostly unnecessary as not
141 specifying a region at all has the same effect.)
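
The region syntax above can be illustrated with a short Python sketch. This is not how samtools itself parses regions; `parse_region` and `region_length` are hypothetical helper names, and the sketch assumes that the reference name contains no `:' character. Thousands separators are stripped, since the synopsis uses regions such as chr2:20,100,000-20,200,000.

```python
from typing import Optional, Tuple


def parse_region(spec: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Split RNAME[:STARTPOS[-ENDPOS]] into (name, start, end).

    Coordinates are 1-based and inclusive; None means "not given".
    Assumption: the reference name itself contains no ':'.
    """
    name, colon, coords = spec.partition(":")
    if not colon:
        return name, None, None
    # Commas are allowed as thousands separators, e.g. chr3:1,000-2,000.
    start_text, dash, end_text = coords.replace(",", "").partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    return name, start, end


def region_length(start: int, end: int) -> int:
    """Number of bases in a 1-based region that includes both ends."""
    return end - start + 1


name, start, end = parse_region("chr3:1,000-2,000")
print(name, start, end, region_length(start, end))
```

Because both end positions are included, chr3:1000-2000 spans 1001 bases rather than 1000.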
142
143 OPTIONS:
144
145 -b Output in the BAM format.
146
147 -C Output in the CRAM format (requires -T).
148
149 -1 Enable fast BAM compression (implies -b).
150
151 -u Output uncompressed BAM. This option saves time
152 spent on compression/decompression and is thus pre‐
153 ferred when the output is piped to another samtools
154 command.
155
156 -h Include the header in the output.
157
158 -H Output the header only.
159
160 -c Instead of printing the alignments, only count them
161 and print the total number. All filter options,
162 such as -f, -F, and -q, are taken into account.
163
164 -? Output long help and exit immediately.
165
166 -o FILE Output to FILE [stdout].
167
168 -U FILE Write alignments that are not selected by the vari‐
169 ous filter options to FILE. When this option is
170 used, all alignments (or all alignments intersect‐
171 ing the regions specified) are written to either
172 the output file or this file, but never both.
173
174 -t FILE A tab-delimited FILE. Each line must contain the
175 reference name in the first column and the length
176 of the reference in the second column, with one
177 line for each distinct reference. Any additional
178 fields beyond the second column are ignored. This
179 file also defines the order of the reference
180 sequences in sorting. If you run: `samtools faidx
181 <ref.fa>', the resulting index file <ref.fa>.fai
182 can be used as this FILE.
183
184 -T FILE A FASTA format reference FILE, optionally com‐
185 pressed by bgzip and ideally indexed by samtools
186 faidx. If an index is not present, one will be
187 generated for you.
188
189 -L FILE Only output alignments overlapping the input BED
190 FILE [null].
191
192 -M Use the multi-region iterator on the union of the
193 BED file and command-line region arguments. This
194 avoids re-reading the same regions of files so can
195 sometimes be much faster. Note this also removes
196 duplicate sequences. Without this a sequence that
197 overlaps multiple regions specified on the command
198 line will be reported multiple times.
199
200 -r STR Output alignments in read group STR [null]. Note
201 that records with no RG tag will also be output
202 when using this option. This behaviour may change
203 in a future release.
204
205 -R FILE Output alignments in read groups listed in FILE
206 [null]. Note that records with no RG tag will also
207 be output when using this option. This behaviour
208 may change in a future release.
209
210 -q INT Skip alignments with MAPQ smaller than INT [0].
211
212 -l STR Only output alignments in library STR [null].
213
214 -m INT Only output alignments with number of CIGAR bases
215 consuming query sequence ≥ INT [0]
216
217 -f INT Only output alignments with all bits set in INT
218 present in the FLAG field. INT can be specified in
219 hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
220 in octal by beginning with `0' (i.e. /^0[0-7]+/)
221 [0].
222
223 -F INT Do not output alignments with any bits set in INT
224 present in the FLAG field. INT can be specified in
225 hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
226 in octal by beginning with `0' (i.e. /^0[0-7]+/)
227 [0].
228
-G INT Do not output alignments with all bits set in INT
present in the FLAG field. This is the complement of
-f: every alignment rejected by -f12 is accepted by
-G12 and vice versa. INT can be specified in hex by
beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal
by beginning with `0' (i.e. /^0[0-7]+/) [0].
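
The bit tests behind -f, -F and -G can be sketched in Python as follows. This only restates the descriptions above and is not taken from the samtools source; the function and parameter names are hypothetical.

```python
def parse_flag_int(text: str) -> int:
    """Parse INT as described above: `0x' prefix is hex, leading `0' is octal."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered, 8)
    return int(lowered)


def passes_flag_filters(flag: int, require_all: int = 0,
                        exclude_any: int = 0, exclude_all: int = 0) -> bool:
    """Return True if an alignment with this FLAG would be output.

    require_all ~ -f, exclude_any ~ -F, exclude_all ~ -G (names are illustrative).
    """
    if (flag & require_all) != require_all:   # -f: all listed bits must be set
        return False
    if flag & exclude_any:                    # -F: no listed bit may be set
        return False
    if exclude_all and (flag & exclude_all) == exclude_all:
        return False                          # -G: reject if all listed bits are set
    return True


# FLAG 77 = 0x1 + 0x4 + 0x8 + 0x40: paired, read and mate both unmapped, first in pair.
print(passes_flag_filters(77, require_all=parse_flag_int("12")))
print(passes_flag_filters(77, exclude_all=parse_flag_int("0xC")))
```

For any FLAG value, exactly one of `-f 12` and `-G 12` accepts the alignment, which is the sense in which -G is the opposite of -f.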
235
236 -x STR Read tag to exclude from output (repeatable) [null]
237
238 -B Collapse the backward CIGAR operation.
239
240 -s FLOAT Output only a proportion of the input alignments.
241 This subsampling acts in the same way on all of the
242 alignment records in the same template or read
243 pair, so it never keeps a read but not its mate.
244
245 The integer and fractional parts of the -s INT.FRAC
246 option are used separately: the part after the dec‐
247 imal point sets the fraction of templates/pairs to
248 be kept, while the integer part is used as a seed
249 that influences which subset of reads is kept.
250
251 When subsampling data that has previously been sub‐
252 sampled, be sure to use a different seed value from
253 those used previously; otherwise more reads will be
254 retained than expected.
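
The following Python sketch shows why the seed matters when subsampling twice. It assumes a keep/drop decision that depends only on the seed and the read name; the SHA-256 hash used here is a stand-in chosen for illustration and is not the hash samtools uses. `split_subsample_arg` and `keep_template` are hypothetical names.

```python
import hashlib


def split_subsample_arg(arg: str):
    """Split an INT.FRAC string into (seed, fraction), e.g. "42.1" -> (42, 0.1)."""
    seed_text, _, frac_text = arg.partition(".")
    seed = int(seed_text) if seed_text else 0
    fraction = float("0." + frac_text) if frac_text else 0.0
    return seed, fraction


def keep_template(qname: str, seed: int, fraction: float) -> bool:
    """Decide per read name, so a read and its mate always share the outcome.

    Assumption: a seeded hash of QNAME mapped onto [0, 1) stands in for
    whatever samtools does internally.
    """
    digest = hashlib.sha256(f"{seed}:{qname}".encode()).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < fraction


names = [f"read{i}" for i in range(1000)]          # made-up read names
seed, fraction = split_subsample_arg("7.5")
first_pass = [n for n in names if keep_template(n, seed, fraction)]

# Same seed again: every surviving read is kept a second time.
rerun_same_seed = [n for n in first_pass if keep_template(n, seed, fraction)]
# Different seed: roughly half of the survivors are dropped, as intended.
rerun_new_seed = [n for n in first_pass if keep_template(n, seed + 1, fraction)]
print(len(first_pass), len(rerun_same_seed), len(rerun_new_seed))
```

Under this model, repeating the same seed makes the second pass a no-op, which matches the warning above that more reads are retained than expected.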
255
256 -@ INT Number of BAM compression threads to use in addi‐
257 tion to main thread [0].
258
259 -S Ignored for compatibility with previous samtools
260 versions. Previously this option was required if
261 input was in SAM format, but now the correct format
262 is automatically detected by examining the first
263 few characters of input.
264
265
266 sort samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format]
267 [-n] [-t tag] [-T tmpprefix] [-@ threads]
268 [in.sam|in.bam|in.cram]
269
270 Sort alignments by leftmost coordinates, or by read name when
271 -n is used. An appropriate @HD-SO sort order header tag will
272 be added or an existing one updated if necessary.
273
274 The sorted output is written to standard output by default,
275 or to the specified file (out.bam) when -o is used. This
276 command will also create temporary files tmpprefix.%d.bam as
277 needed when the entire alignment data cannot fit into memory
278 (as controlled via the -m option).
279
280 Options:
281
282 -l INT Set the desired compression level for the final
283 output file, ranging from 0 (uncompressed) or 1
284 (fastest but minimal compression) to 9 (best com‐
285 pression but slowest to write), similarly to
286 gzip(1)'s compression level setting.
287
288 If -l is not used, the default compression level
289 will apply.
290
291 -m INT Approximately the maximum required memory per
292 thread, specified either in bytes or with a K, M,
293 or G suffix. [768 MiB]
294
295 To prevent sort from creating a huge number of
296 temporary files, it enforces a minimum value of 1M
297 for this setting.
298
299 -n Sort by read names (i.e., the QNAME field) rather
300 than by chromosomal coordinates.
301
-t TAG Sort first by the value in the alignment tag TAG,
then by position or name (if also using -n).

-o FILE Write the final sorted output to FILE, rather
than to standard output.
306
307 -O FORMAT Write the final output as sam, bam, or cram.
308
309 By default, samtools tries to select a format
310 based on the -o filename extension; if output is
311 to standard output or no format can be deduced,
312 bam is selected.
313
314 -T PREFIX Write temporary files to PREFIX.nnnn.bam, or if
315 the specified PREFIX is an existing directory, to
316 PREFIX/samtools.mmm.mmm.tmp.nnnn.bam, where mmm is
317 unique to this invocation of the sort command.
318
319 By default, any temporary files are written along‐
320 side the output file, as out.bam.tmp.nnnn.bam, or
321 if output is to standard output, in the current
322 directory as samtools.mmm.mmm.tmp.nnnn.bam.
323
324 -@ INT Set number of sorting and compression threads. By
325 default, operation is single-threaded.
326
327 Ordering Rules
328
329 The following rules are used for ordering records.
330
331 If option -t is in use, records are first sorted by the value
332 of the given alignment tag, and then by position or name (if
333 using -n). For example, “-t RG” will make read group the
334 primary sort key. The rules for ordering by tag are:
335
336
337 · Records that do not have the tag are sorted before ones
338 that do.
339
340 · If the types of the tags are different, they will be
341 sorted so that single character tags (type A) come before
342 array tags (type B), then string tags (types H and Z),
343 then numeric tags (types f and i).
344
345 · Numeric tags (types f and i) are compared by value. Note
346 that comparisons of floating-point values are subject to
347 issues of rounding and precision.
348
349 · String tags (types H and Z) are compared based on the
350 binary contents of the tag using the C strcmp(3) func‐
351 tion.
352
353 · Character tags (type A) are compared by binary character
354 value.
355
356 · No attempt is made to compare tags of other types —
357 notably type B array values will not be compared.
358
359 When the -n option is present, records are sorted by name.
360 Names are compared so as to give a “natural” ordering — i.e.
361 sections consisting of digits are compared numerically while
362 all other sections are compared based on their binary repre‐
363 sentation. This means “a1” will come before “b1” and “a9”
364 will come before “a10”. Records with the same name will be
365 ordered according to the values of the READ1 and READ2 flags
366 (see flags).
367
368 When the -n option is not present, reads are sorted by refer‐
369 ence (according to the order of the @SQ header records), then
370 by position in the reference, and then by the REVERSE flag.
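
The "natural" name ordering used with -n can be approximated in Python as below. This is a sketch of the rule as described, not the comparison function samtools uses, and it may differ on edge cases such as leading zeros; `natural_key` is a hypothetical name.

```python
import re


def natural_key(name: str):
    """Build a sort key in which runs of digits compare numerically.

    re.split with a capturing group alternates non-digit and digit pieces,
    so odd positions always hold digit runs and can be converted to int.
    """
    pieces = re.split(r"(\d+)", name)
    return [int(piece) if index % 2 else piece
            for index, piece in enumerate(pieces)]


names = ["a10", "b1", "a9", "a1"]
print(sorted(names))                    # plain string order puts "a10" before "a9"
print(sorted(names, key=natural_key))   # natural order puts "a9" before "a10"
```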
371
372 Note
373
374
375 Historically samtools sort also accepted a less flexible way
376 of specifying the final and temporary output filenames:
377
378 samtools sort [-f] [-o] in.bam out.prefix
379
380 This has now been removed. The previous out.prefix argument
381 (and -f option, if any) should be changed to an appropriate
382 combination of -T PREFIX and -o FILE. The previous -o option
383 should be removed, as output defaults to standard output.
384
385
386 index samtools index [-bc] [-m INT] aln.bam|aln.cram [out.index]
387
388 Index a coordinate-sorted BAM or CRAM file for fast random
389 access. (Note that this does not work with SAM files even if
390 they are bgzip compressed — to index such files, use tabix(1)
391 instead.)
392
393 This index is needed when region arguments are used to limit
394 samtools view and similar commands to particular regions of
395 interest.
396
397 If an output filename is given, the index file will be writ‐
398 ten to out.index. Otherwise, for a CRAM file aln.cram, index
399 file aln.cram.crai will be created; for a BAM file aln.bam,
400 either aln.bam.bai or aln.bam.csi will be created, depending
401 on the index format selected.
402
403 Options:
404
405 -b Create a BAI index. This is currently the default
406 when no format options are used.
407
408 -c Create a CSI index. By default, the minimum interval
409 size for the index is 2^14, which is the same as the
410 fixed value used by the BAI format.
411
412 -m INT Create a CSI index, with a minimum interval size of
413 2^INT.
414
415
416 idxstats samtools idxstats in.sam|in.bam|in.cram
417
418 Retrieve and print stats in the index file corresponding to
419 the input file. Before calling idxstats, the input BAM file
420 should be indexed by samtools index.
421
422 If run on a SAM or CRAM file or an unindexed BAM file, this
423 command will still produce the same summary statistics, but
424 does so by reading through the entire file. This is far
425 slower than using the BAM indices.
426
427 The output is TAB-delimited with each line consisting of ref‐
428 erence sequence name, sequence length, # mapped reads and #
429 unmapped reads. It is written to stdout.
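
As a small illustration of the four-column layout, the Python sketch below parses idxstats-style text. The sample rows are invented for the example, and `IdxstatsRow` and `parse_idxstats` are hypothetical names.

```python
from typing import List, NamedTuple


class IdxstatsRow(NamedTuple):
    name: str       # reference sequence name
    length: int     # sequence length
    mapped: int     # number of mapped reads
    unmapped: int   # number of unmapped reads


def parse_idxstats(text: str) -> List[IdxstatsRow]:
    """Parse TAB-delimited idxstats output into typed rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append(IdxstatsRow(name, int(length), int(mapped), int(unmapped)))
    return rows


# Invented sample data, not real samtools output.
SAMPLE = "chr1\t1000\t90\t2\nchr2\t500\t10\t0\n"

rows = parse_idxstats(SAMPLE)
print(sum(row.mapped for row in rows), "mapped reads in total")
```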
430
431
432 flagstat samtools flagstat in.sam|in.bam|in.cram
433
434 Does a full pass through the input file to calculate and
435 print statistics to stdout.
436
437 Provides counts for each of 13 categories based primarily on
438 bit flags in the FLAG field. Each category in the output is
439 broken down into QC pass and QC fail, which is presented as
440 "#PASS + #FAIL" followed by a description of the category.
441
442 The first row of output gives the total number of reads that
443 are QC pass and fail (according to flag bit 0x200). For exam‐
444 ple:
445
446 122 + 28 in total (QC-passed reads + QC-failed reads)
447
This indicates that there are a total of 150 reads in the input
file, 122 of which are marked as QC pass and 28 of which are
marked as "not passing quality controls".
451
452 Following this, additional categories are given for reads
453 which are:
454
455
456 secondary
457 0x100 bit set
458
459 supplementary
460 0x800 bit set
461
462 duplicates
463 0x400 bit set
464
465 mapped 0x4 bit not set
466
467 paired in sequencing
468 0x1 bit set
469
470 read1 both 0x1 and 0x40 bits set
471
472 read2 both 0x1 and 0x80 bits set
473
474 properly paired
475 both 0x1 and 0x2 bits set and 0x4 bit not set
476
477 with itself and mate mapped
478 0x1 bit set and neither 0x4 nor 0x8 bits set
479
480 singletons
481 both 0x1 and 0x8 bits set and bit 0x4 not set
482
483 And finally, two rows are given that additionally filter on
484 the reference name (RNAME), mate reference name (MRNM), and
485 mapping quality (MAPQ) fields:
486
487
488 with mate mapped to a different chr
489 0x1 bit set and neither 0x4 nor 0x8 bits set
490 and MRNM not equal to RNAME
491
492 with mate mapped to a different chr (mapQ>=5)
493 0x1 bit set and neither 0x4 nor 0x8 bits set
494 and MRNM not equal to RNAME and MAPQ >= 5
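
The category definitions above translate directly into bit tests. The Python sketch below follows the list as written here; it is not the samtools implementation, and `flagstat_categories` is a hypothetical name. The QC pass/fail split (bit 0x200) is left out for brevity.

```python
def flagstat_categories(flag: int, rname: str = "*", mrnm: str = "*",
                        mapq: int = 0) -> set:
    """Return the flagstat categories (as listed above) that a record falls into."""
    paired = bool(flag & 0x1)
    mapped = not flag & 0x4
    mate_mapped = not flag & 0x8
    both_mapped = paired and mapped and mate_mapped

    tests = {
        "secondary": bool(flag & 0x100),
        "supplementary": bool(flag & 0x800),
        "duplicates": bool(flag & 0x400),
        "mapped": mapped,
        "paired in sequencing": paired,
        "read1": paired and bool(flag & 0x40),
        "read2": paired and bool(flag & 0x80),
        "properly paired": paired and bool(flag & 0x2) and mapped,
        "with itself and mate mapped": both_mapped,
        "singletons": paired and not mate_mapped and mapped,
        "with mate mapped to a different chr": both_mapped and mrnm != rname,
        "with mate mapped to a different chr (mapQ>=5)":
            both_mapped and mrnm != rname and mapq >= 5,
    }
    return {label for label, hit in tests.items() if hit}


# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40: a properly paired first read.
print(sorted(flagstat_categories(99, rname="chr1", mrnm="chr1", mapq=60)))
```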
495
496
497 stats samtools stats [options] in.sam|in.bam|in.cram [region...]
498
samtools stats collects statistics from alignment files (SAM, BAM
or CRAM) and outputs them in a text format. The output can be
visualized graphically using plot-bamstats.
502
503 Options:
504
505 -c, --coverage MIN,MAX,STEP
506 Set coverage distribution to the specified range
507 (MIN, MAX, STEP all given as integers) [1,1000,1]
508
509 -d, --remove-dups
510 Exclude from statistics reads marked as duplicates
511
512 -f, --required-flag STR|INT
513 Required flag, 0 for unset. See also `samtools flags`
514 [0]
515
516 -F, --filtering-flag STR|INT
517 Filtering flag, 0 for unset. See also `samtools
518 flags` [0]
519
520 --GC-depth FLOAT
521 the size of GC-depth bins (decreasing bin size
522 increases memory requirement) [2e4]
523
524 -h, --help
525 This help message
526
527 -i, --insert-size INT
528 Maximum insert size [8000]
529
530 -I, --id STR
531 Include only listed read group or sample name []
532
533 -l, --read-length INT
534 Include in the statistics only reads with the given
535 read length []
536
537 -m, --most-inserts FLOAT
538 Report only the main part of inserts [0.99]
539
540 -P, --split-prefix STR
541 A path or string prefix to prepend to filenames out‐
542 put when creating categorised statistics files with
543 -S/--split. [input filename]
544
545 -q, --trim-quality INT
546 The BWA trimming parameter [0]
547
548 -r, --ref-seq FILE
549 Reference sequence (required for GC-depth and mis‐
550 matches-per-cycle calculation). []
551
552 -S, --split TAG
553 In addition to the complete statistics, also output
554 categorised statistics based on the tagged field TAG
555 (e.g., use --split RG to split into read groups).
556
557 Categorised statistics are written to files named
558 <prefix>_<value>.bamstat, where prefix is as given by
559 --split-prefix (or the input filename by default) and
560 value has been encountered as the specified tagged
561 field's value in one or more alignment records.
562
563 -t, --target-regions FILE
564 Do stats in these regions only. Tab-delimited file
565 chr,from,to, 1-based, inclusive. []
566
567 -x, --sparse
568 Suppress outputting IS rows where there are no inser‐
569 tions.
570
571
572 bedcov samtools bedcov [options] region.bed
573 in1.sam|in1.bam|in1.cram[...]
574
575 Reports the total read base count (i.e. the sum of per base
576 read depths) for each genomic region specified in the sup‐
577 plied BED file. The regions are output as they appear in the
578 BED file and are 0-based. Counts for each alignment file
579 supplied are reported in separate columns.
580
581 Options:
582
583 -Q INT Only count reads with mapping quality greater than INT
584
585 -j Do not include deletions (D) and ref skips (N) in bed‐
586 cov computation.
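
To make "the sum of per base read depths" concrete, the Python sketch below adds up depths over one BED interval. It assumes the usual BED convention of 0-based, half-open intervals, and the depth values are invented; `region_base_count` is a hypothetical name.

```python
def region_base_count(depth_by_pos: dict, start: int, end: int) -> int:
    """Sum per-base depths over the BED interval [start, end).

    depth_by_pos maps 0-based positions to read depth; positions that are
    absent are treated as having depth 0.
    """
    return sum(depth_by_pos.get(pos, 0) for pos in range(start, end))


# Invented depths at 0-based positions on a single reference sequence.
depths = {10: 2, 11: 3, 12: 1, 20: 5}
print(region_base_count(depths, 10, 13))
```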
587
588
589 depth samtools depth [options] [in1.sam|in1.bam|in1.cram
590 [in2.sam|in2.bam|in2.cram] [...]]
591
592 Computes the depth at each position or region.
593
594 Options:
595
596 -a Output all positions (including those with zero
597 depth)
598
599 -a -a, -aa
600 Output absolutely all positions, including unused
601 reference sequences. Note that when used in conjunc‐
602 tion with a BED file the -a option may sometimes
603 operate as if -aa was specified if the reference
604 sequence has coverage outside of the region specified
605 in the BED file.
606
607 -b FILE Compute depth at list of positions or regions in
608 specified BED FILE. []
609
610 -f FILE Use the BAM files specified in the FILE (a file of
611 filenames, one file per line) []
612
613 -l INT Ignore reads shorter than INT
614
615 -m, -d INT
616 Truncate reported depth at a maximum of INT reads.
617 [8000]. If 0, depth is set to the maximum integer
618 value, effectively removing any depth limit.
619
620 -q INT Only count reads with base quality greater than INT
621
622 -Q INT Only count reads with mapping quality greater than
623 INT
624
625 -r CHR:FROM-TO
626 Only report depth in specified region.
627
628
629 merge samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>]
630 <out.bam> <in1.bam> [<in2.bam> <in3.bam> ... <inN.bam>]
631
632 Merge multiple sorted alignment files, producing a single
633 sorted output file that contains all the input records and
634 maintains the existing sort order.
635
636 If -h is specified the @SQ headers of input files will be
637 merged into the specified header, otherwise they will be
638 merged into a composite header created from the input head‐
639 ers. If in the process of merging @SQ lines for coordinate
640 sorted input files, a conflict arises as to the order (for
641 example input1.bam has @SQ for a,b,c and input2.bam has
642 b,a,c) then the resulting output file will need to be re-
643 sorted back into coordinate order.
644
Unless the -c or -p flags are specified, when @RG and @PG records
are merged into the output header, any IDs that duplicate existing
IDs in the output header will have a suffix appended to
differentiate them from similar header records from other files,
and the read records will be updated to reflect this.
651
652 The ordering of the records in the input files must match the
653 usage of the -n and -t command-line options. If they do not,
654 the output order will be undefined. See sort for information
655 about record ordering.
656
657 OPTIONS:
658
659 -1 Use zlib compression level 1 to compress the output.
660
661 -b FILE List of input BAM files, one file per line.
662
-f Force overwriting of the output file if it is present.
664
665 -h FILE Use the lines of FILE as `@' headers to be copied to
666 out.bam, replacing any header lines that would other‐
667 wise be copied from in1.bam. (FILE is actually in
668 SAM format, though any alignment records it may con‐
669 tain are ignored.)
670
671 -n The input alignments are sorted by read names rather
672 than by chromosomal coordinates
673
674 -t TAG The input alignments have been sorted by the value of
675 TAG, then by either position or name (if -n is
676 given).
677
678 -R STR Merge files in the specified region indicated by STR
679 [null]
680
681 -r Attach an RG tag to each alignment. The tag value is
682 inferred from file names.
683
684 -u Uncompressed BAM output
685
686 -c When several input files contain @RG headers with the
687 same ID, emit only one of them (namely, the header
688 line from the first file we find that ID in) to the
689 merged output file. Combining these similar headers
690 is usually the right thing to do when the files being
691 merged originated from the same file.
692
693 Without -c, all @RG headers appear in the output
694 file, with random suffixes added to their IDs where
695 necessary to differentiate them.
696
697 -p Similarly, for each @PG ID in the set of files to
698 merge, use the @PG line of the first file we find
699 that ID in rather than adding a suffix to differenti‐
700 ate similar IDs.
701
702
703 faidx samtools faidx <ref.fasta> [region1 [...]]
704
Index a reference sequence in the FASTA format, or extract
subsequences from an indexed reference sequence. If no region is
specified, faidx will index the file and create
<ref.fasta>.fai on the disk. If regions are specified, the
subsequences will be retrieved and printed to stdout in the
FASTA format.
711
712 The input file can be compressed in the BGZF format.
713
714 The sequences in the input file should all have different
715 names. If they do not, indexing will emit a warning about
716 duplicate sequences and retrieval will only produce subse‐
717 quences from the first sequence with the duplicated name.
718
719 FASTQ files can be read and indexed by this command. Without
720 using --fastq any extracted subsequence will be in FASTA for‐
721 mat.
722
723 Options
724
725 -o, --output FILE
726 Write FASTA to file rather than to stdout.
727
728 -n, --length INT
729 Length of FASTA sequence line. [60]
730
-c, --continue
Continue working if a non-existent region is
requested.
734
735 -r, --region-file FILE
736 Read regions from a file. Format is chr:from-to, one
737 per line.
738
739 -f, --fastq
740 Read FASTQ files and output extracted sequences in
741 FASTQ format. Same as using samtools fqidx.
742
743 -i, --reverse-complement
744 Output the sequence as the reverse complement. When
745 this option is used, “/rc” will be appended to the
746 sequence names. To turn this off or change the
747 string appended, use the --mark-strand option.
748
749 --mark-strand TYPE
750 Append strand indicator to sequence name. TYPE can
751 be one of:
752
753 rc Append '/rc' when writing the reverse comple‐
754 ment. This is the default.
755
756 no Do not append anything.
757
758 sign Append '(+)' for forward strand or '(-)' for
759 reverse complement. This matches the output
760 of “bedtools getfasta -s”.
761
762 custom,<pos>,<neg>
763 Append string <pos> to names when writing the
764 forward strand and <neg> when writing the
765 reverse strand. Spaces are preserved, so it
766 is possible to move the indicator into the
767 comment part of the description line by
768 including a leading space in the strings <pos>
769 and <neg>.
770
771 -h, --help
772 Print help message and exit.
773
774
775 fqidx samtools fqidx <ref.fastq> [region1 [...]]
776
Index a reference sequence in the FASTQ format, or extract
subsequences from an indexed reference sequence. If no region is
specified, fqidx will index the file and create
<ref.fastq>.fai on the disk. If regions are specified, the
subsequences will be retrieved and printed to stdout in the
FASTQ format.
783
784 The input file can be compressed in the BGZF format.
785
786 The sequences in the input file should all have different
787 names. If they do not, indexing will emit a warning about
788 duplicate sequences and retrieval will only produce subse‐
789 quences from the first sequence with the duplicated name.
790
791 samtools fqidx should only be used on fastq files with a
792 small number of entries. Trying to use it on a file contain‐
793 ing millions of short sequencing reads will produce an index
794 that is almost as big as the original file, and searches
795 using the index will be very slow and use a lot of memory.
796
797 Options
798
799 -o, --output FILE
800 Write FASTQ to file rather than to stdout.
801
802 -n, --length INT
803 Length of FASTQ sequence line. [60]
804
-c, --continue
Continue working if a non-existent region is
requested.
808
809 -r, --region-file FILE
810 Read regions from a file. Format is chr:from-to, one
811 per line.
812
813 -i, --reverse-complement
814 Output the sequence as the reverse complement. When
815 this option is used, “/rc” will be appended to the
816 sequence names. To turn this off or change the
817 string appended, use the --mark-strand option.
818
819 --mark-strand TYPE
820 Append strand indicator to sequence name. TYPE can
821 be one of:
822
823 rc Append '/rc' when writing the reverse comple‐
824 ment. This is the default.
825
826 no Do not append anything.
827
828 sign Append '(+)' for forward strand or '(-)' for
829 reverse complement. This matches the output
830 of “bedtools getfasta -s”.
831
832 custom,<pos>,<neg>
833 Append string <pos> to names when writing the
834 forward strand and <neg> when writing the
835 reverse strand. Spaces are preserved, so it
836 is possible to move the indicator into the
837 comment part of the description line by
838 including a leading space in the strings <pos>
839 and <neg>.
840
841 -h, --help
842 Print help message and exit.
843
844
845 tview samtools tview [-p chr:pos] [-s STR] [-d display]
846 <in.sorted.bam> [ref.fasta]
847
Text alignment viewer (based on the ncurses library). In the
viewer, press `?' for help and press `g' to view the alignments
starting from a region given in a format like
`chr10:10,000,000', or `=10,000,000' when viewing the same
reference sequence.
853
854 Options:
855
856 -d display Output as (H)tml or (C)urses or (T)ext
857
858 -p chr:pos Go directly to this position
859
860 -s STR Display only alignments from this sample or
861 read group
862
863
864 split samtools split [options] merged.sam|merged.bam|merged.cram
865
866 Splits a file by read group.
867
868 Options:
869
870 -u FILE1 Put reads with no RG tag or an unrecognised RG
871 tag into FILE1
872
873 -u FILE1:FILE2
874 As above, but assigns an RG tag as given in the
875 header of FILE2
876
877 -f STRING Output filename format string (see below)
878 ["%*_%#.%."]
879
880 -v Verbose output
881
882 Format string expansions:
883
884 %% %
885 %* basename
886 %# @RG index
887 %! @RG ID
888 %. output format filename extension
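
The expansions above can be mimicked with a few lines of Python. This sketch is not samtools code: `expand_split_format` is a hypothetical name, and the basename, @RG index, @RG ID and extension are passed in by the caller rather than derived from a real file.

```python
import re


def expand_split_format(fmt: str, basename: str, rg_index: int,
                        rg_id: str, extension: str) -> str:
    """Expand the %-codes listed above in an output filename format string."""
    table = {
        "%": "%",             # %% -> literal percent sign
        "*": basename,        # %* -> basename
        "#": str(rg_index),   # %# -> @RG index
        "!": rg_id,           # %! -> @RG ID
        ".": extension,       # %. -> output format filename extension
    }

    def replace(match):
        code = match.group(1)
        if code not in table:
            raise ValueError(f"unknown format code %{code}")
        return table[code]

    return re.sub(r"%(.)", replace, fmt)


# The default format string with illustrative values.
print(expand_split_format("%*_%#.%.", "merged", 0, "grp1", "bam"))
```

Using `%!' instead of `%#' names each output file after the read group ID rather than its position in the header.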
889
890
891 quickcheck
892 samtools quickcheck [options] in.sam|in.bam|in.cram [ ... ]
893
Quickly check that input files appear to be intact. Checks
that the beginning of the file contains a valid header (all
formats) containing at least one target sequence, and then seeks
to the end of the file and checks that an end-of-file (EOF)
block is present and intact (BAM only).
899
900 Data in the middle of the file is not read since that would
901 be much more time consuming, so please note that this command
902 will not detect internal corruption, but is useful for test‐
903 ing that files are not truncated before performing more
904 intensive tasks on them.
905
906 This command will exit with a non-zero exit code if any input
907 files don't have a valid header or are missing an EOF block.
908 Otherwise it will exit successfully (with a zero exit code).
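
The end-of-file part of this check can be sketched in Python for BAM files. The 28-byte marker below is the empty BGZF block that the SAM/BAM specification defines as the end-of-file marker. The helper name `ends_with_bam_eof` is hypothetical, and the sketch does not attempt the header validation that quickcheck also performs.

```python
import os

# Empty BGZF block that terminates a complete BAM file (28 bytes).
BAM_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43"
    " 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def ends_with_bam_eof(path: str) -> bool:
    """Return True if the file's final bytes are the BGZF EOF marker.

    Only the tail of the file is read, so corruption elsewhere in the
    file goes unnoticed, as it does with quickcheck.
    """
    if os.path.getsize(path) < len(BAM_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BAM_EOF), os.SEEK_END)
        return handle.read() == BAM_EOF
```

A truncated download or an interrupted write typically loses this trailing block, which is why the check is a cheap first test before running longer jobs.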
909
910 Options:
911
912 -v Verbose output: will additionally print the names of
913 all input files that don't pass the check to stdout.
914 Multiple -v options will cause additional messages
915 regarding check results to be printed to stderr.
916
917 -q Quiet mode: disables warning messages on stderr about
918 files that fail. If both -q and -v options are used
919 then the appropriate level of -v takes precedence.
920
921
922 dict samtools dict <ref.fasta|ref.fasta.gz>
923
924 Create a sequence dictionary file from a fasta file.
925
926 OPTIONS:
927
928 -a, --assembly STR
929 Specify the assembly for the AS tag.
930
931 -H, --no-header
932 Do not print the @HD header line.
933
934 -o, --output FILE
935 Output to FILE [stdout].
936
937 -s, --species STR
938 Specify the species for the SP tag.
939
940 -u, --uri STR
941 Specify the URI for the UR tag. Defaults to the
942 absolute path of ref.fasta unless reading from
943 stdin.
944
945
946 fixmate samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam
947
948 Fill in mate coordinates, ISIZE and mate related flags from a
949 name-sorted alignment.
950
951 OPTIONS:
952
953 -r Remove secondary and unmapped reads.
954
955 -p Disable FR proper pair check.
956
957 -c Add template cigar ct tag.
958
959 -m Add ms (mate score) tags. These are used by
960 markdup to select the best reads to keep.
961
962 -O FORMAT Write the final output as sam, bam, or cram.
963
964 By default, samtools tries to select a format
965 based on the output filename extension; if output
966 is to standard output or no format can be deduced,
967 bam is selected.
968
969
970 mpileup samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l
971 list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
972
973 Generate pileup for one or multiple BAM files. Alignment
974 records are grouped by sample (SM) identifiers in @RG header
975 lines. If sample identifiers are absent, each input file is
976 regarded as one sample.
977
978 Samtools mpileup can still produce VCF and BCF output, but
979 this feature is deprecated and will be removed in a future
980 release. Please use bcftools mpileup for this instead.
981 (Documentation on the deprecated options has been removed
982 from this manual page, but older versions are available
983 online at <http://www.htslib.org/doc/>.)
984
985 In the pileup format (without -u or -g), each line represents
986 a genomic position, consisting of chromosome name, 1-based
987 coordinate, reference base, the number of reads covering the
988 site, read bases, base qualities and alignment mapping quali‐
989 ties. Information on match, mismatch, indel, strand, mapping
990 quality and start and end of a read are all encoded at the
991 read base column. At this column, a dot stands for a match to
992 the reference base on the forward strand, a comma for a match
993 on the reverse strand, a '>' or '<' for a reference skip,
994 `ACGTN' for a mismatch on the forward strand and `acgtn' for
995 a mismatch on the reverse strand. A pattern `\+[0-9]+[ACGT‐
996 Nacgtn]+' indicates there is an insertion between this refer‐
997 ence position and the next reference position. The length of
998 the insertion is given by the integer in the pattern, fol‐
999 lowed by the inserted sequence. Similarly, a pattern
1000 `-[0-9]+[ACGTNacgtn]+' represents a deletion from the refer‐
1001 ence. The deleted bases will be presented as `*' in the fol‐
1002 lowing lines. Also at the read base column, a symbol `^'
1003 marks the start of a read. The ASCII of the character follow‐
1004 ing `^' minus 33 gives the mapping quality. A symbol `$'
1005 marks the end of a read segment.
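
The read base column described above can be tokenised with a short parser. This is a minimal sketch based only on the encoding rules in the preceding paragraph; the function name and the event labels are hypothetical. Note that the length of an insertion or deletion must be taken from the integer, because the inserted or deleted sequence can be followed directly by another base character.

```python
import re

# Symbols that stand alone and map directly to one event.
SIMPLE_EVENTS = {
    ".": ("match", "+"),          # match on the forward strand
    ",": ("match", "-"),          # match on the reverse strand
    ">": ("ref_skip", ">"),
    "<": ("ref_skip", "<"),
    "*": ("deleted_base", None),  # base removed by an earlier deletion
    "$": ("read_end", None),
}


def parse_read_bases(bases):
    """Split the read base column of one pileup line into events."""
    events = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # ASCII code of the next character minus 33 is the mapping quality.
            events.append(("read_start", ord(bases[i + 1]) - 33))
            i += 2
        elif c in "+-":
            length_text = re.match(r"[0-9]+", bases[i + 1:]).group()
            start = i + 1 + len(length_text)
            end = start + int(length_text)
            kind = "insertion" if c == "+" else "deletion"
            events.append((kind, bases[start:end]))
            i = end
        elif c in SIMPLE_EVENTS:
            events.append(SIMPLE_EVENTS[c])
            i += 1
        elif c in "ACGTNacgtn":
            # Upper case: forward strand; lower case: reverse strand.
            events.append(("mismatch", c))
            i += 1
        else:
            raise ValueError(f"unexpected symbol {c!r} at offset {i}")
    return events
```

For example, the string `^I.+2AGt` yields a read start with mapping quality 40, a forward-strand match, an insertion of AG, and a reverse-strand mismatch t.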
1006
1007 Note that there are two orthogonal ways to specify locations
1008 in the input file: via -r region and -l file. The former
1009 uses (and requires) an index to do random access while the
1010 latter streams through the file contents filtering out the
1011 specified regions, requiring no index. The two may be used
1012 in conjunction. For example a BED file containing locations
1013 of genes in chromosome 20 could be specified using -r 20 -l
1014 chr20.bed, meaning that the index is used to find chromosome
1015 20 and then it is filtered for the regions listed in the bed
1016 file.
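
The file given to -l may use either BED coordinates (0-based, half-open) or position list coordinates (1-based), as described under the -l option. The following minimal sketch converts one line of either kind to a 1-based inclusive interval. The function name is hypothetical, and telling the two formats apart by column count is an assumption of this sketch rather than a statement of how samtools decides.

```python
def parse_positions_line(line):
    """Return (chrom, first, last) using 1-based inclusive coordinates."""
    fields = line.split()
    if len(fields) >= 3:
        # BED: 0-based start, exclusive end. Only the start shifts by one.
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        return chrom, start + 1, end
    # Position list: chromosome and a single 1-based position.
    chrom, pos = fields[0], int(fields[1])
    return chrom, pos, pos
```

The BED line `20 99 200` and the position list lines `20 100` through `20 200` therefore describe the same sites, which is why mixing the two styles in one file is error-prone.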
1017
1018 Input Options:
1019
1020 -6, --illumina1.3+
1021 Assume the quality is in the Illumina 1.3+ encod‐
1022 ing.
1023
1024 -A, --count-orphans
1025 Do not skip anomalous read pairs in variant call‐
1026 ing.
1027
1028 -b, --bam-list FILE
1029 List of input BAM files, one file per line [null]
1030
1031 -B, --no-BAQ
1032 Disable base alignment quality (BAQ) computation.
1033 See BAQ below.
1034
1035 -C, --adjust-MQ INT
1036 Coefficient for downgrading mapping quality for
1037 reads containing excessive mismatches. Given a read
1038 with a phred-scaled probability q of being gener‐
1039 ated from the mapped position, the new mapping
1040 quality is about sqrt((INT-q)/INT)*INT. A zero
1041 value disables this functionality; if enabled, the
1042 recommended value for BWA is 50. [0]
1043
1044 -d, --max-depth INT
1045 At a position, read at most INT reads per input
1046 file. Setting this limit reduces the amount of mem‐
1047 ory and time needed to process regions with very
1048 high coverage. Passing zero for this option sets
1049 it to the highest possible value, effectively
1050 removing the depth limit. [8000]
1051
1052 Note that up to release 1.8, samtools would enforce
1053 a minimum value for this option. This no longer
1054 happens and the limit is set exactly as specified.
1055
1056 -E, --redo-BAQ
1057 Recalculate BAQ on the fly, ignore existing BQ
1058 tags. See BAQ below.
1059
1060 -f, --fasta-ref FILE
1061 The faidx-indexed reference file in the FASTA for‐
1062 mat. The file can be optionally compressed by
1063 bgzip. [null]
1064
1065 Supplying a reference file will enable base align‐
1066 ment quality calculation for all reads aligned to a
1067 reference in the file. See BAQ below.
1068
1069 -G, --exclude-RG FILE
1070 Exclude reads from readgroups listed in FILE (one
1071 @RG-ID per line)
1072
1073 -l, --positions FILE
1074 BED or position list file containing a list of
1075 regions or sites where pileup or BCF should be gen‐
1076 erated. Position list files contain two columns
1077 (chromosome and position) and start counting from
1078 1. BED files contain at least 3 columns (chromo‐
1079 some, start and end position) and are 0-based half-
1080 open.
1081 While it is possible to mix both position-list and
1082 BED coordinates in the same file, this is strongly
1083 ill advised due to the differing coordinate sys‐
1084 tems. [null]
1085
1086 -q, --min-MQ INT
1087 Minimum mapping quality for an alignment to be used
1088 [0]
1089
1090 -Q, --min-BQ INT
1091 Minimum base quality for a base to be considered
1092 [13]
1093
1094 -r, --region STR
1095 Only generate pileup in region. Requires the BAM
1096 files to be indexed. If used in conjunction with
1097 -l then considers the intersection of the two
1098 requests. [all sites]
1099
1100 -R, --ignore-RG
1101 Ignore RG tags. Treat all reads in one BAM as one
1102 sample.
1103
1104 --rf, --incl-flags STR|INT
1105 Required flags: skip reads with mask bits unset
1106 [null]
1107
1108 --ff, --excl-flags STR|INT
1109 Filter flags: skip reads with mask bits set
1110 [UNMAP,SECONDARY,QCFAIL,DUP]
1111
1112 -x, --ignore-overlaps
1113 Disable read-pair overlap detection.
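
The formula quoted for -C above can be evaluated directly. This is a minimal sketch of the stated expression only: the function name is hypothetical, the manual describes the result as approximate, and the behaviour for q greater than INT is not documented here, so the sketch rejects such input rather than guessing.

```python
import math


def adjusted_mapq(q, coef):
    """Evaluate sqrt((coef - q) / coef) * coef for 0 <= q <= coef."""
    if coef <= 0:
        raise ValueError("a coefficient of 0 disables the adjustment")
    if not 0 <= q <= coef:
        raise ValueError("formula is only evaluated for 0 <= q <= coef")
    return math.sqrt((coef - q) / coef) * coef
```

With the value of 50 recommended for BWA, q = 0 gives 50 and q = 50 gives 0, so the result falls as q rises.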
1114
1115 Output Options:
1116
1117 -o, --output FILE
1118 Write pileup output to FILE, rather than the
1119 default of standard output.
1120
1121 (The same short option is used for both the depre‐
1122 cated --open-prob option and --output. If -o's
1123 argument contains any non-digit characters other
1124 than a leading + or - sign, it is interpreted as
1125 --output. Usually the filename extension will take
1126 care of this, but to write to an entirely numeric
1127 filename use -o ./123 or --output 123.)
1128
1129 -O, --output-BP
1130 Output base positions on reads.
1131
1132 -s, --output-MQ
1133 Output mapping quality.
1134
1135 --output-QNAME
1136 Output an extra column containing comma-separated
1137 read names.
1138
1139 -a Output all positions, including those with zero
1140 depth.
1141
1142 -a -a, -aa
1143 Output absolutely all positions, including unused
1144 reference sequences. Note that when used in con‐
1145 junction with a BED file the -a option may some‐
1146 times operate as if -aa was specified if the refer‐
1147 ence sequence has coverage outside of the region
1148 specified in the BED file.
1149
1150 BAQ (Base Alignment Quality)
1151
1152 BAQ is the Phred-scaled probability of a read base being mis‐
1153 aligned. It greatly helps to reduce false SNPs caused by
1154 misalignments. BAQ is calculated using the probabilistic
1155 realignment method described in the paper “Improving SNP dis‐
1156 covery by base alignment quality”, Heng Li, Bioinformatics,
1157 Volume 27, Issue 8 <https://doi.org/10.1093/bioinformat‐
1158 ics/btr076>
1159
1160 BAQ is turned on when a reference file is supplied using the
1161 -f option. To disable it, use the -B option.
1162
1163 It is possible to store pre-calculated BAQ values in a SAM
1164 BQ:Z tag. Samtools mpileup will use the precalculated values
1165 if it finds them. The -E option can be used to make it
1166 ignore the contents of the BQ:Z tag and force it to recalcu‐
1167 late the BAQ scores by making a new alignment.
1168
1169
1170 flags samtools flags INT|STR[,...]
1171
1172 Convert between textual and numeric flag representation.
1173
1174 FLAGS:
1175
1176 0x1 PAIRED paired-end (or multiple-segment) sequencing technology
1177 0x2 PROPER_PAIR each segment properly aligned according to the aligner
1178 0x4 UNMAP segment unmapped
1179 0x8 MUNMAP next segment in the template unmapped
1180 0x10 REVERSE SEQ is reverse complemented
1181 0x20 MREVERSE SEQ of the next segment in the template is reverse complemented
1182 0x40 READ1 the first segment in the template
1183 0x80 READ2 the last segment in the template
1184 0x100 SECONDARY secondary alignment
1185 0x200 QCFAIL not passing quality controls
1186 0x400 DUP PCR or optical duplicate
1187 0x800 SUPPLEMENTARY supplementary alignment
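
The conversion performed by this command can be reproduced from the table above. This is a minimal sketch with hypothetical function names; it accepts decimal and 0x-prefixed numbers, and does not attempt to cover every input form that samtools flags may accept.

```python
# Bit values and names copied from the FLAGS table.
FLAGS = [
    (0x1, "PAIRED"), (0x2, "PROPER_PAIR"), (0x4, "UNMAP"), (0x8, "MUNMAP"),
    (0x10, "REVERSE"), (0x20, "MREVERSE"), (0x40, "READ1"), (0x80, "READ2"),
    (0x100, "SECONDARY"), (0x200, "QCFAIL"), (0x400, "DUP"),
    (0x800, "SUPPLEMENTARY"),
]
NAME_TO_BIT = {name: bit for bit, name in FLAGS}


def flags_to_int(text):
    """Convert 'PAIRED,UNMAP', '13' or '0x900' to an integer."""
    value = 0
    for token in text.split(","):
        token = token.strip()
        if token.upper() in NAME_TO_BIT:
            value |= NAME_TO_BIT[token.upper()]
        else:
            value |= int(token, 0)  # base 0 handles decimal and 0x... input
    return value


def int_to_flags(value):
    """Convert an integer to a comma-separated list of flag names."""
    return ",".join(name for bit, name in FLAGS if value & bit)
```

For example, `flags_to_int("PAIRED,UNMAP,MUNMAP")` gives 13, and `int_to_flags(0x900)` gives SECONDARY,SUPPLEMENTARY.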
1188
1189
1190 fastq/a samtools fastq [options] in.bam
1191 samtools fasta [options] in.bam
1192
1193 Converts a BAM or CRAM into either FASTQ or FASTA format
1194 depending on the command invoked. The files will be automati‐
1195 cally compressed if the file names have a .gz or .bgzf exten‐
1196 sion.
1197
1198 The input to this program must be collated by name. Use sam‐
1199 tools collate or samtools sort -n to ensure this.
1200
1201 For each different QNAME, the input records are categorised
1202 according to the state of the READ1 and READ2 flag bits. The
1203 three categories used are:
1204
1205 1 : Only READ1 is set.
1206
1207 2 : Only READ2 is set.
1208
1209 0 : Either both READ1 and READ2 are set; or neither is set.
1210
1211 The exact meaning of these categories depends on the sequenc‐
1212 ing technology used. It is expected that ordinary single and
1213 paired-end sequencing reads will be in categories 1 and 2 (in
1214 the case of paired-end reads, one read of the pair will be in
1215 category 1, the other in category 2). Category 0 is essen‐
1216 tially a “catch-all” for reads that do not fit into a simple
1217 paired-end sequencing model.
1218
1219 For each category only one sequence will be written for a
1220 given QNAME. If more than one record is available for a
1221 given QNAME and category, the first in input file order that
1222 has quality values will be used. If none of the candidate
1223 records has quality values, then the first in input file
1224 order will be used instead.
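
The categorisation and the rule for choosing one record per QNAME and category can be sketched as follows. The function names and the (qname, flag, qual) tuple layout are hypothetical, and treating `*` or None as "no quality values" is an assumption of this sketch.

```python
READ1, READ2 = 0x40, 0x80


def read_category(flag):
    """Return 1, 2 or 0 according to the READ1 and READ2 flag bits."""
    first, last = bool(flag & READ1), bool(flag & READ2)
    if first and not last:
        return 1
    if last and not first:
        return 2
    return 0  # both set, or neither set


def choose_records(records):
    """Pick one (qname, flag, qual) record per QNAME and category.

    Prefers the first record in input order that has quality values and
    falls back to the first record seen when none has them.
    """
    chosen = {}
    for record in records:
        qname, flag, qual = record
        key = (qname, read_category(flag))
        current = chosen.get(key)
        if current is None:
            chosen[key] = record
        elif current[2] in (None, "*") and qual not in (None, "*"):
            chosen[key] = record
    return chosen
```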
1225
1226 Sequences will be written to standard output unless one of
1227 the -1,-2, or -0 options is used, in which case sequences for
1228 that category will be written to the specified file.
1229
1230 If a singleton file is specified using the -s option then
1231 only paired sequences will be output for categories 1 and 2;
1232 paired meaning that for a given QNAME there are sequences for
1233 both category 1 and 2. If there is a sequence for only one
1234 of categories 1 or 2 then it will be diverted into the speci‐
1235 fied singletons file. This can be used to prepare fastq
1236 files for programs that cannot handle a mixture of paired and
1237 singleton reads.
1238
1239 The -s option only affects category 1 and 2 records. The
1240 output for category 0 will be the same irrespective of the
1241 use of this option.
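
The effect of -s on where a sequence is written can be summarised in a small decision function. This is a minimal sketch of the rules in the two paragraphs above; the function name and the destination labels are hypothetical.

```python
def destination(category, categories_present, singletons_requested):
    """Decide where a sequence for one QNAME is written.

    category             -- category of this sequence (0, 1 or 2)
    categories_present   -- all categories seen for the same QNAME
    singletons_requested -- True when -s FILE was given
    """
    if category == 0:
        # Category 0 output is the same whether or not -s is used.
        return "category0"
    paired = {1, 2} <= set(categories_present)
    if singletons_requested and not paired:
        return "singletons"
    return f"category{category}"
```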
1242
1243 OPTIONS:
1244
1245 -n By default, either '/1' or '/2' is added to the end
1246 of read names where the corresponding READ1 or READ2
1247 FLAG bit is set. Using -n causes read names to be
1248 left as they are.
1249
1250 -N Always add either '/1' or '/2' to the end of read
1251 names even when put into different files.
1252
1253 -O Use quality values from OQ tags in preference to
1254 standard quality string if available.
1255
1256 -s FILE Write singleton reads to FILE.
1257
1258 -t Copy RG, BC and QT tags to the FASTQ header line, if
1259 they exist.
1260
1261 -T TAGLIST
1262 Specify a comma-separated list of tags to copy to the
1263 FASTQ header line, if they exist.
1264
1265 -1 FILE Write reads with the READ1 FLAG set (and READ2 not
1266 set) to FILE instead of outputting them. If the -s
1267 option is used, only paired reads will be written to
1268 this file.
1269
1270 -2 FILE Write reads with the READ2 FLAG set (and READ1 not
1271 set) to FILE instead of outputting them. If the -s
1272 option is used, only paired reads will be written to
1273 this file.
1274
1275 -0 FILE Write reads where the READ1 and READ2 FLAG bits are
1276 either both set or both unset to FILE instead of
1277 outputting them.
1278
1279 -f INT Only output alignments with all bits set in INT
1280 present in the FLAG field. INT can be specified in
1281 hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
1282 in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1283
1284 -F INT Do not output alignments with any bits set in INT
1285 present in the FLAG field. INT can be specified in
1286 hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
1287 in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1288
1289 -G INT Only EXCLUDE reads with all of the bits set in INT
1290 present in the FLAG field. INT can be specified in
1291 hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
1292 in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1293
1294 -i add Illumina Casava 1.8 format entry to header (eg
1295 1:N:0:ATCACG)
1296
1297 -c [0..9]
1298 set compression level when writing gz or bgzf fastq
1299 files.
1300
1301 --i1 FILE
1302 write first index reads to FILE
1303
1304 --i2 FILE
1305 write second index reads to FILE
1306
1307 --barcode-tag TAG
1308 aux tag to find index reads in [default: BC]
1309
1310 --quality-tag TAG
1311 aux tag to find index quality in [default: QT]
1312
1313 --index-format STR
1314 string to describe how to parse the barcode and qual‐
1315 ity tags. For example:
1316
1317
1318 i14i8 the first 14 characters are index 1, the next
1319 8 characters are index 2
1320
1321 n8i14 ignore the first 8 characters, and use the
1322 next 14 characters for index 1
1323
1324 If the tag contains a separator, then the
1325 numeric part can be replaced with '*' to mean
1326 'read until the separator or end of tag', for
1327 example:
1328
1329 n*i* ignore the left part of the tag until the
1330 separator, then use the second part
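
The --index-format strings shown above can be read as a sequence of (letter, length) steps applied to the barcode tag. The following is a minimal sketch of that interpretation. The function name is hypothetical, and the separator character (here `-` by default) is an assumption, since this page does not state which characters samtools treats as separators.

```python
import re


def split_barcode(tag_value, index_format, separator="-"):
    """Return the index sequences selected from tag_value.

    'i' steps are kept as index reads, 'n' steps are skipped. A count of
    '*' reads up to the next separator or the end of the tag.
    """
    indexes = []
    pos = 0
    for kind, count in re.findall(r"([in])(\d+|\*)", index_format):
        if count == "*":
            end = tag_value.find(separator, pos)
            if end == -1:
                end = len(tag_value)
            next_pos = min(end + 1, len(tag_value))  # step over separator
        else:
            end = pos + int(count)
            next_pos = end
        if kind == "i":
            indexes.append(tag_value[pos:end])
        pos = next_pos
    return indexes
```

With the tag value `ACGT-TTGA` and the format `n*i*`, the sketch skips ACGT and returns TTGA as the only index.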
1331
1332 EXAMPLES
1333
1334 Output paired reads to separate files, discarding singletons,
1335 supplementary and secondary reads. The resulting files can
1336 be used with, for example, the bwa aligner.
1337
1338 samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n -F 0x900 in.bam
1339
1340
1341 Output paired and singleton reads in a single file, discard‐
1342 ing supplementary and secondary reads. To get all of the
1343 reads in a single file, it is necessary to redirect the out‐
1344 put of samtools fastq. The output file is suitable for use
1345 with bwa mem -p which understands interleaved files contain‐
1346 ing a mixture of paired and singleton reads.
1347
1348 samtools fastq -0 /dev/null -F 0x900 in.bam > all_reads.fq
1349
1350
1351 Output paired reads in a single file, discarding supplemen‐
1352 tary and secondary reads. Save any singletons in a separate
1353 file. Append /1 and /2 to read names. This format is suit‐
1354 able for use by NextGenMap when using its -p and -q options.
1355 With this aligner, paired reads must be mapped separately to
1356 the singletons.
1357
1358 samtools fastq -0 /dev/null -s single.fq -N -F 0x900 in.bam > paired.fq
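
The -F 0x900 filter used in these examples, and the related -f and -G options, can be expressed as bit tests on the FLAG field. This is a minimal sketch with hypothetical names; skipping the -G test when its value is 0 is an assumption made so that the default of 0 excludes nothing.

```python
def keep_read(flag, require=0, exclude_any=0, exclude_all=0):
    """Return True if a read with this FLAG passes the filters.

    require     -- like -f: every bit must be set in flag
    exclude_any -- like -F: no bit may be set in flag
    exclude_all -- like -G: drop only if every bit is set in flag
    """
    if (flag & require) != require:
        return False
    if (flag & exclude_any) != 0:
        return False
    if exclude_all and (flag & exclude_all) == exclude_all:
        return False
    return True
```

With `exclude_any=0x900`, a read is dropped when it is secondary (0x100) or supplementary (0x800), or both.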
1359
1360
1361 BUGS
1362
1363
1364 o The way of specifying output files is far too complicated
1365 and easy to get wrong.
1366
1367
1368 o The default value for the -F option should really be 0x900
1369 so that secondary and supplementary reads are automatically
1370 excluded. The existing default of 0 is retained for rea‐
1371 sons of compatibility.
1372
1373
1374
1375 collate samtools collate [options] in.sam|in.bam|in.cram [<prefix>]
1376
1377 Shuffles and groups reads together by their names. A faster
1378 alternative to a full query name sort, collate ensures that
1379 reads of the same name are grouped together in contiguous
1380 groups, but doesn't make any guarantees about the order of
1381 read names between groups.
1382
1383 The output from this command should be suitable for any oper‐
1384 ation that requires all reads from the same template to be
1385 grouped together.
1386
1387 If present, <prefix> is used to name the temporary files that
1388 collate uses when sorting the data. If neither the '-O' nor
1389 '-o' options are used, <prefix> must be present and collate
1390 will use it to make an output file name by appending a suffix
1391 depending on the format written (.bam by default).
1392
1393 If either the -O or -o option is used, <prefix> is optional.
1394 If <prefix> is absent, collate will write the temporary files
1395 to a system-dependent location (/tmp on UNIX).
1396
1397 Using -f for fast mode will output only primary alignments
1398 that have either the READ1 or READ2 flags set (but not both).
1399 Any other alignment records will be filtered out. The colla‐
1400 tion will only work correctly if there are no more than two
1401 reads for any given QNAME after filtering.
1402
1403 Fast mode keeps a buffer of alignments in memory so that it
1404 can write out most pairs as soon as they are found instead of
1405 storing them in temporary files. This allows collate to
1406 avoid some work and so finish more quickly compared to the
1407 standard mode. The number of alignments held can be changed
1408 using -r; storing more alignments uses more memory but
1409 increases the number of pairs that can be written early.
1410
1411 While collate normally randomises the ordering of read pairs,
1412 fast mode does not. Position-dependent biases that would
1413 normally be broken up can remain in the fast collate output.
1414 It is therefore not a good idea to use fast mode when prepar‐
1415 ing data for programs that expect randomly ordered paired
1416 reads. For example using fast collate instead of the stan‐
1417 dard mode may lead to significantly different results from
1418 aligners that estimate library insert sizes on batches of
1419 reads.
1420
1421 Options:
1422
1423 -O Output to stdout. This option cannot be used with
1424 '-o'.
1425
1426 -o FILE Write output to FILE. This option cannot be used
1427 with '-O'.
1428
1429 -u Write uncompressed BAM output
1430
1431 -l INT Compression level. [1]
1432
1433 -n INT Number of temporary files to use. [64]
1434
1435 -f Fast mode (primary alignments only).
1436
1437 -r INT Number of reads to store in memory (for use with -f).
1438 [10000]
1439
1440
1441 reheader samtools reheader [-iP] in.header.sam in.bam
1442
1443 Replace the header in in.bam with the header in
1444 in.header.sam. This command is much faster than replacing
1445 the header with a BAM→SAM→BAM conversion.
1446
1447 By default this command outputs the BAM or CRAM file to stan‐
1448 dard output (stdout), but for CRAM format files it has the
1449 option to perform an in-place edit, both reading and writing
1450 to the same file. No validity checking is performed on the
1451 header, nor that it is suitable to use with the sequence data
1452 itself.
1453
1454 OPTIONS:
1455
1456 -P, --no-PG
1457 Do not generate an @PG header line.
1458
1459 -i, --in-place
1460 Perform the header edit in-place, if possible. This
1461 only works on CRAM files and only if there is suffi‐
1462 cient room to store the new header. The amount of
1463 space available will differ for each CRAM file.
1464
1465
1466 cat samtools cat [-b list] [-h header.sam] [-o out.bam] <in1.bam>
1467 <in2.bam> [ ... ]
1468
1469 Concatenate BAMs or CRAMs. Although this works on either BAM
1470 or CRAM, all input files must be the same format as each
1471 other. The sequence dictionary of each input file must be
1472 identical, although this command does not check this. This
1473 command uses a similar trick to reheader which enables fast
1474 BAM concatenation.
1475
1476 OPTIONS:
1477
1478 -b FOFN Read the list of input BAM or CRAM files from FOFN.
1479 These are concatenated prior to any files specified
1480 on the command line. Multiple -b FOFN options may be
1481 specified to concatenate multiple lists of BAM/CRAM
1482 files.
1483
1484 -h FILE Uses the SAM header from FILE. By default the header
1485 is taken from the first file to be concatenated.
1486
1487 -o FILE Write the concatenated output to FILE. By default
1488 this is sent to stdout.
1489
1490
1491 rmdup samtools rmdup [-sS] <input.srt.bam> <out.bam>
1492
1493 This command is obsolete. Use markdup instead.
1494
1495 Remove potential PCR duplicates: if multiple read pairs have
1496 identical external coordinates, only retain the pair with
1497 highest mapping quality. In the paired-end mode, this com‐
1498 mand ONLY works with FR orientation and requires ISIZE to be
1499 correctly set. It does not work for unpaired reads (e.g. two
1500 ends mapped to different chromosomes or orphan reads).
1501
1502 OPTIONS:
1503
1504 -s Remove duplicates for single-end reads. By default,
1505 the command works for paired-end reads only.
1506
1507 -S Treat paired-end reads as single-end reads.
1508
1509
1510 addreplacerg
1511 samtools addreplacerg [-r rg line | -R rg ID] [-m mode] [-l
1512 level] [-o out.bam] <input.bam>
1513
1514 Adds or replaces read group tags in a file.
1515
1516 OPTIONS:
1517
1518 -r STRING
1519 Allows you to specify a read group line to append to
1520 the header and applies it to the reads specified by
1521 the -m option. If repeated it automatically adds in
1522 tabs between invocations.
1523
1524 -R STRING
1525 Allows you to specify the read group ID of an exist‐
1526 ing @RG line and applies it to the reads specified.
1527
1528 -m MODE If you choose orphan_only then existing RG tags are
1529 not overwritten, if you choose overwrite_all, exist‐
1530 ing RG tags are overwritten. The default is over‐
1531 write_all.
1532
1533 -o STRING
1534 Write the final output to STRING. The default is to
1535 write to stdout.
1536
1537 By default, samtools tries to select a format based
1538 on the output filename extension; if output is to
1539 standard output or no format can be deduced, bam is
1540 selected.
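
The way repeated -r options are combined into a single read group line can be pictured with a short sketch. The function name is hypothetical, and the check for an ID field reflects the SAM specification's requirement that every @RG line has one; it is not a description of samtools' own validation.

```python
def build_rg_line(fragments):
    """Join the values of repeated -r options into one @RG header line."""
    fields = [fragment for fragment in fragments if fragment]
    if not any(field.startswith("ID:") for field in fields):
        raise ValueError("an @RG line needs an ID field")
    # Tabs are inserted between the fragments, as with repeated -r options.
    return "@RG\t" + "\t".join(fields)
```

For the options -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha', the sketch produces the fields ID:fish, LB:1334 and SM:alpha on one tab-separated @RG line.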
1541
1542
1543 calmd samtools calmd [-Eeubr] [-C capQcoef] <aln.bam> <ref.fasta>
1544
1545 Generate the MD tag. If the MD tag is already present, this
1546 command will give a warning if the MD tag generated is dif‐
1547 ferent from the existing tag. Output SAM by default.
1548
1549 Calmd can also read and write CRAM files although in most
1550 cases it is pointless as CRAM recalculates MD and NM tags on
1551 the fly. The one exception to this case is where both input
1552 and output CRAM files have been / are being created with the
1553 no_ref option.
1554
1555 OPTIONS:
1556
1557 -A When used jointly with -r this option overwrites the
1558 original base quality.
1559
1560 -e Convert the read base to = if it is identical to
1561 the aligned reference base. Indel caller does not
1562 support the = bases at the moment.
1563
1564 -u Output uncompressed BAM
1565
1566 -b Output compressed BAM
1567
1568 -C INT Coefficient to cap mapping quality of poorly mapped
1569 reads. See the mpileup command for details. [0]
1570
1571 -r Compute the BQ tag (without -A) or cap base quality
1572 by BAQ (with -A).
1573
1574 -E Extended BAQ calculation. This option trades speci‐
1575 ficity for sensitivity, though the effect is minor.
1576
1577
1578 targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1
1579 em1] [-2 em2] [-f ref] <in.bam>
1580
1581 This command identifies target regions by examining the con‐
1582 tinuity of read depth, computes haploid consensus sequences
1583 of targets and outputs a SAM with each sequence corresponding
1584 to a target. When option -f is in use, BAQ will be applied.
1585 This command is only designed for cutting fosmid clones from
1586 fosmid pool sequencing [Ref. Kitzman et al. (2010)].
1587
1588
1589 phase samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q
1590 minBaseQ] <in.bam>
1591
1592 Call and phase heterozygous SNPs.
1593
1594 OPTIONS:
1595
1596 -A Drop reads with ambiguous phase.
1597
1598 -b STR Prefix of BAM output. When this option is in use,
1599 phase-0 reads will be saved in file STR.0.bam and
1600 phase-1 reads in STR.1.bam. Phase unknown reads will
1601 be randomly allocated to one of the two files.
1602 Chimeric reads with switch errors will be saved in
1603 STR.chimeric.bam. [null]
1604
1605 -F Do not attempt to fix chimeric reads.
1606
1607 -k INT Maximum length for local phasing. [13]
1608
1609 -q INT Minimum Phred-scaled LOD to call a heterozygote. [40]
1610
1611 -Q INT Minimum base quality to be used in het calling. [13]
1612
1613
1614 depad samtools depad [-SsCu1] [-T ref.fa] [-o output] <in.bam>
1615
1616 Converts a BAM aligned against a padded reference to a BAM
1617 aligned against the depadded reference. The padded reference
1618 may contain verbatim "*" bases in it, but "*" bases are also
1619 counted in the reference numbering. This means that a
1620 sequence base-call aligned against a reference "*" is consid‐
1621 ered to be a cigar match ("M" or "X") operator (if the base-
1622 call is "A", "C", "G" or "T"). After depadding the reference
1623 "*" bases are deleted and such aligned sequence base-calls
1624 become insertions. Similar transformations apply for dele‐
1625 tions and padding cigar operations.
1626
1627 OPTIONS:
1628
1629 -S Ignored for compatibility with previous samtools ver‐
1630 sions. Previously this option was required if input
1631 was in SAM format, but now the correct format is auto‐
1632 matically detected by examining the first few charac‐
1633 ters of input.
1634
1635 -s Output in SAM format. The default is BAM.
1636
1637 -C Output in CRAM format. The default is BAM.
1638
1639 -u Do not compress the output. Applies to either BAM or
1640 CRAM output format.
1641
1642 -1 Enable fastest compression level. Only works for BAM
1643 or CRAM output.
1644
1645 -T FILE
1646 Provides the padded reference file. Note that without
1647 this the @SQ line lengths will be incorrect, so for
1648 most use cases this option will be considered as
1649 mandatory.
1650
1651 -o FILE
1652 Specifies the output filename. By default output is
1653 sent to stdout.
1654
1655
1656 markdup samtools markdup [-l length] [-r] [-s] [-T] [-S]
1657 in.algsort.bam out.bam
1658
1659 Mark duplicate alignments from a coordinate sorted file that
1660 has been run through fixmate with the -m option. This pro‐
1661 gram relies on the MC and ms tags that fixmate provides.
1662
1663
1664 -l INT Expected maximum read length of INT bases. [300]
1665
1666 -r Remove duplicate reads.
1667
1668 -s Print some basic stats.
1669
1670 -T PREFIX Write temporary files to PREFIX.sam‐
1671 tools.nnnn.mmmm.tmp
1672
1673 -S Mark supplementary reads of duplicates as dupli‐
1674 cates.
1675
1676
1677 EXAMPLE
1678
1679 # The first sort can be omitted if the file is already name ordered
1680 samtools sort -n -o namesort.bam example.bam
1681
1682 # Add ms and MC tags for markdup to use later
1683 samtools fixmate -m namesort.bam fixmate.bam
1684
1685 # Markdup needs position order
1686 samtools sort -o positionsort.bam fixmate.bam
1687
1688 # Finally mark duplicates
1689 samtools markdup positionsort.bam markdup.bam
1690
1691
1692 help, --help
1693 Display a brief usage message listing the samtools commands
1694 available. If the name of a command is also given, e.g.,
1695 samtools help view, the detailed usage message for that par‐
1696 ticular command is displayed.
1697
1698
1699 --version Display the version numbers and copyright information for
1700 samtools and the important libraries used by samtools.
1701
1702
1703 --version-only
1704 Display the full samtools version number in a machine-read‐
1705 able format.
1706
1708 Several long-options are shared between multiple samtools subcommands:
1709 --input-fmt, --input-fmt-option, --output-fmt, --output-fmt-option, and
1710 --reference. The input format is typically auto-detected so specifying
1711 the format is usually unnecessary and the option is included for com‐
1712 pleteness. Note that not all subcommands have all options. Consult
1713 the subcommand help for more details.
1714
1715 Format strings recognised are "sam", "bam" and "cram". They may be
1716 followed by a comma separated list of options as key or key=value. See
1717 below for examples.
1718
1719 The fmt-option arguments accept either a single option or option=value.
1720 Note that some options only work on some file formats and only on read
1721 or write streams. If value is unspecified for a boolean option, the
1722 value is assumed to be 1. The valid options are as follows.
1723
1724 level=INT
1725 Output only. Specifies the compression level from 1 to 9, or 0 for
1726 uncompressed.
1727
1728 nthreads=INT
1729 Specifies the number of threads to use during encoding and/or
1730 decoding. For BAM this will be encoding only. In CRAM the threads
1731 are dynamically shared between encoder and decoder.
1732
1733 reference=fasta_file
1734 Specifies a FASTA reference file for use in CRAM encoding or decod‐
1735 ing. It usually is not required for decoding except in the situa‐
1736 tion of the MD5 not being obtainable via the REF_PATH or REF_CACHE
1737 environment variables.
1738
1739 decode_md=0|1
1740 CRAM input only; defaults to 1 (on). CRAM does not typically store
1741 MD and NM tags, preferring to generate them on the fly. This
1742 option controls this behaviour. It can be particularly useful when
1743 combined with a file encoded using store_md=1 and store_nm=1.
1744
1745 store_md=0|1
1746 CRAM output only; defaults to 0 (off). CRAM normally only stores
1747 MD tags when the reference is unknown and lets the decoder generate
1748 these values on-the-fly (see decode_md).
1749
1750 store_nm=0|1
1751 CRAM output only; defaults to 0 (off). CRAM normally only stores
1752 NM tags when the reference is unknown and lets the decoder generate
1753 these values on-the-fly (see decode_md).
1754
1755 ignore_md5=0|1
1756 CRAM input only; defaults to 0 (off). When enabled, md5 checksum
1757 errors on the reference sequence and block checksum errors within
1758 CRAM are ignored. Use of this option is strongly discouraged.
1759
1760 required_fields=bit-field
1761 CRAM input only; specifies which SAM columns need to be populated.
1762 By default all fields are used. Limiting the decode to specific
1763 columns can have significant performance gains. The bit-field is a
1764 numerical value constructed from the following table.
1765
1766 0x1 SAM_QNAME
1767 0x2 SAM_FLAG
1768 0x4 SAM_RNAME
1769 0x8 SAM_POS
1770 0x10 SAM_MAPQ
1771 0x20 SAM_CIGAR
1772 0x40 SAM_RNEXT
1773 0x80 SAM_PNEXT
1774 0x100 SAM_TLEN
1775 0x200 SAM_SEQ
1776 0x400 SAM_QUAL
1777 0x800 SAM_AUX
1778 0x1000 SAM_RGAUX
1779
1780 name_prefix=string
1781 CRAM input only; defaults to output filename. Any sequences with
1782 auto-generated read names will use string as the name prefix.
1783
1784 multi_seq_per_slice=0|1
1785 CRAM output only; defaults to 0 (off). By default CRAM generates
1786 one container per reference sequence, except in the case of many
1787 small references (such as a fragmented assembly).
1788
1789 version=major.minor
1790 CRAM output only. Specifies the CRAM version number. Acceptable
1791 values are "2.1" and "3.0".
1792
1793 seqs_per_slice=INT
1794 CRAM output only; defaults to 10000.
1795
1796 slices_per_container=INT
1797 CRAM output only; defaults to 1. The effect of having multiple
1798 slices per container is to share the compression header block
1799 between multiple slices. This is unlikely to have any significant
1800 impact unless the number of sequences per slice is reduced.
1801 (Together these two options control the granularity of random
1802 access.)
1803
1804 embed_ref=0|1
1805 CRAM output only; defaults to 0 (off). If 1, this will store
1806 portions of the reference sequence in each slice, permitting
1807 decode without requiring an external copy of the reference
1808 sequence.
1809
1810 no_ref=0|1
1811 CRAM output only; defaults to 0 (off). If 1, sequences will be
1812 stored verbatim with no reference encoding. This can be useful if
1813 no reference is available for the file.
1814
1815 use_bzip2=0|1
1816 CRAM output only; defaults to 0 (off). Permits use of bzip2 in
1817 CRAM block compression.
1818
1819 use_lzma=0|1
1820 CRAM output only; defaults to 0 (off). Permits use of lzma in CRAM
1821 block compression.
1822
1823 lossy_names=0|1
1824 CRAM output only; defaults to 0 (off). If 1, templates with all
1825 members within the same CRAM slice will have their read names
1826 removed. New names will be automatically generated during decod‐
1827 ing. Also see the name_prefix option.
1828
1829 For example:
1830
1831 samtools view --input-fmt-option decode_md=0 \
1832 --output-fmt cram,version=3.0 --output-fmt-option embed_ref \
1833 --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
1834
1835
1837 The CRAM format requires use of a reference sequence for both reading
1838 and writing.
1839
1840 When reading a CRAM the @SQ headers are interrogated to identify the
1841 reference sequence MD5sum (M5: tag) and the local reference sequence
1842 filename (UR: tag). Note that http:// and ftp:// based URLs in the UR:
1843 field are not used, but local fasta filenames (with or without file://)
1844 can be used.
1845
1846 To create a CRAM the @SQ headers will also be read to identify the ref‐
1847 erence sequences, but M5: and UR: tags may not be present. In this case
1848 the -T and -t options of samtools view may be used to specify the fasta
1849 or fasta.fai filenames respectively (provided the .fasta.fai file is
1850 also backed up by a .fasta file).
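Whether a reference needs to be supplied with -T can be checked by looking for M5: tags on the @SQ lines. The helper below is a sketch with a hypothetical function name; it reads SAM header text on standard input (for example from samtools view -H) and prints how many @SQ lines carry no M5: tag. The header lines in the demonstration are made up.

```sh
# Hypothetical helper: count @SQ header lines that lack an M5: tag.
# Reads SAM header text on stdin; header fields are tab separated.
count_sq_without_m5() {
    awk -F '\t' '
        $1 == "@SQ" {
            found = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^M5:/) found = 1
            if (!found) missing++
        }
        END { print missing + 0 }
    '
}

# Made-up header: chr1 has an M5: tag, chr2 does not.
printf '@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:100\tM5:d41d8cd98f00b204e9800998ecf8427e\n@SQ\tSN:chr2\tLN:50\n' |
    count_sq_without_m5

# With a real file (aln.bam is a placeholder):
#   samtools view -H aln.bam | count_sq_without_m5
```

A non-zero count suggests that the reference should be given explicitly when writing CRAM.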
1851
1852 The search order to obtain a reference is:
1853
1854 1. Use any local file specified by the command line options (e.g. -T).
1855
1856 2. Look for MD5 via REF_CACHE environment variable.
1857
1858 3. Look for MD5 in each element of the REF_PATH environment variable.
1859
1860 4. Look for a local file listed in the UR: header tag.
1861
1863 HTS_PATH
1864 A colon-separated list of directories in which to search for
1865 HTSlib plugins. If $HTS_PATH starts or ends with a colon or
1866 contains a double colon (::), the built-in list of directories
1867 is searched at that point in the search.
1868
1869 If no HTS_PATH variable is defined, the built-in list of direc‐
1870 tories specified when HTSlib was built is used, which typically
1871 includes /usr/local/libexec/htslib and similar directories.
1872
1873
1874 REF_PATH
1875 A colon separated (semi-colon on Windows) list of locations in
1876 which to look for sequences identified by their MD5sums. This
1877 can be either a list of directories or URLs. Note that if a URL
1878 is included then the colon in http:// and ftp:// and the
1879 optional port number will be treated as part of the URL and not
1880 a PATH field separator. For URLs, the text %s will be replaced
1881 by the MD5sum being read.
1882
1883 If no REF_PATH has been specified it will default to
1884 http://www.ebi.ac.uk/ena/cram/md5/%s and if REF_CACHE is also
1885 unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s. If
1886 $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system tempo‐
1887 rary directory if no home directory is found) will be used simi‐
1888 larly.
1889
1890
1891 REF_CACHE
1892 This can be defined to a single directory housing a local cache
1893 of references. Upon downloading a reference it will be stored
1894 in the location pointed to by REF_CACHE. When reading a refer‐
1895 ence it will be looked for in this directory before searching
1896 REF_PATH. To avoid many files being stored in the same direc‐
1897 tory, a pathname may be constructed using %nums and %s notation,
1898 consuming num characters of the MD5sum. For example
1899 /local/ref_cache/%2s/%2s/%s will create 2 nested subdirectories
1900 with the filenames in the deepest directory being the last 28
1901 characters of the md5sum.
1902
1903 The REF_CACHE directory will be searched before attempting
1904 to load via the REF_PATH search list. If no REF_PATH is
1905 defined, both REF_PATH and REF_CACHE will be automatically set
1906 (see above), but if REF_PATH is defined and REF_CACHE is not,
1907 then no local cache is used.
1908
1909 To aid population of the REF_CACHE directory a script
1910 misc/seq_cache_populate.pl is provided in the Samtools distribu‐
1911 tion. This takes a fasta file or a directory of fasta files and
1912 generates the MD5sum named files.
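The settings described above might be combined as in the following sketch. The directory /some_dir/cache is a placeholder, and ref_cache_path is a hypothetical helper, not part of samtools: it only mimics how the %2s/%2s/%s pattern splits an MD5sum so that the resulting file name can be predicted.

```sh
# Placeholder cache root; replace with a real directory.
cache_root=/some_dir/cache

# Define both variables: with REF_PATH set and REF_CACHE unset,
# no local cache would be used.
REF_CACHE="$cache_root/%2s/%2s/%s"
REF_PATH="http://www.ebi.ac.uk/ena/cram/md5/%s"
export REF_CACHE REF_PATH

# Hypothetical helper: expand the %2s/%2s/%s layout for one MD5sum.
#   $1 = cache root, $2 = 32-character MD5sum
ref_cache_path() {
    first=$(printf '%s' "$2" | cut -c1-2)     # first 2 characters
    second=$(printf '%s' "$2" | cut -c3-4)    # next 2 characters
    rest=$(printf '%s' "$2" | cut -c5-)       # remaining 28 characters
    printf '%s/%s/%s/%s\n' "$1" "$first" "$second" "$rest"
}

# d41d8cd98f00b204e9800998ecf8427e is the MD5sum of empty input,
# used here only as a sample value.
ref_cache_path "$cache_root" d41d8cd98f00b204e9800998ecf8427e
# prints /some_dir/cache/d4/1d/8cd98f00b204e9800998ecf8427e
```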
1913
1915 o Import SAM to BAM when @SQ lines are present in the header:
1916
1917 samtools view -bS aln.sam > aln.bam
1918
1919 If @SQ lines are absent:
1920
1921 samtools faidx ref.fa
1922 samtools view -bt ref.fa.fai aln.sam > aln.bam
1923
1924 where ref.fa.fai is generated automatically by the faidx command.
1925
1926
1927 o Convert a BAM file to a CRAM file using a local reference sequence.
1928
1929 samtools view -C -T ref.fa aln.bam > aln.cram
1930
1931
1932 o Attach the RG tag while merging sorted alignments:
1933
1934 perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
1935 samtools merge -rh rg.txt merged.bam ga.bam 454.bam
1936
1937 The value of the RG tag is determined by the name of the file each
1938 read comes from. In this example, reads in merged.bam that came
1939 from ga.bam are tagged RG:Z:ga, while reads from 454.bam are tagged
1940 RG:Z:454.
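To make the file-name rule concrete, the sketch below derives one @RG header line per input file from its base name with the .bam suffix removed, matching the ga.bam and 454.bam example. rg_header_lines is a hypothetical helper, not a samtools command; the SM:hs value is copied from the example, and PL is omitted because it cannot be inferred from a file name.

```sh
# Hypothetical helper: print one @RG header line per BAM file name,
# using the base name without ".bam" as both ID and LB.
rg_header_lines() {
    for bam in "$@"; do
        id=$(basename "$bam" .bam)
        printf '@RG\tID:%s\tSM:hs\tLB:%s\n' "$id" "$id"
    done
}

# Redirect to a file (e.g. > rg.txt) to use the result with merge -rh.
rg_header_lines ga.bam 454.bam
```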
1941
1942
1943 o Convert a BAM file to a CRAM with NM and MD tags stored verbatim
1944 rather than calculating on the fly during CRAM decode, so that mixed
1945 data sets with MD/NM only on some records, or NM calculated using
1946 different definitions of mismatch, can be decoded without change.
1947 The second command demonstrates how to decode such a file. Setting
1948 decode_md=0 here turns off auto-generation of both MD and NM; the
1949 MD/NM tags are still emitted on records that had them stored
1950 verbatim.
1951
1952 samtools view -C --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o aln.cram aln.bam
1953 samtools view --input-fmt-option decode_md=0 -o aln.new.bam aln.cram
1954
1955
1956 o An alternative way of achieving the above is listing multiple options
1957 after the --output-fmt or -O option. The commands below are equiva‐
1958 lent to the two above.
1959
1960 samtools view -O cram,store_md=1,store_nm=1 -o aln.cram aln.bam
1961 samtools view --input-fmt cram,decode_md=0 -o aln.new.bam aln.cram
1962
1963
1964
1965 o Call SNPs and short INDELs:
1966
1967 samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
1968 bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf
1969
1970 The bcftools filter command marks low quality sites and sites with
1971 the read depth exceeding a limit, which should be adjusted to about
1972 twice the average read depth (bigger read depths usually indicate
1973 problematic regions which are often enriched for artefacts). Consider
1974 adding -C50 to mpileup if mapping quality is overestimated for reads
1975 containing excessive mismatches. Applying this option usually helps
1976 BWA-short but may not help other mappers.
1977
1978 Individuals are identified from the SM tags in the @RG header lines.
1979 Individuals can be pooled in one alignment file; one individual can
1980 also be separated into multiple files. The -P option specifies that
1981 indel candidates should be collected only from read groups with the
1982 @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
1983 sequenced by an indel-prone technology may affect the performance of
1984 indel calling.
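The depth limit used in the filter can be estimated rather than guessed. The sketch below assumes the three-column output of samtools depth (reference name, position, depth) on standard input and prints twice the mean depth, rounded down. depth_limit is a hypothetical helper name and the input records are made up. By default samtools depth skips positions with zero coverage, so the mean is taken over covered positions only.

```sh
# Hypothetical helper: read "name<TAB>pos<TAB>depth" records on stdin
# and print twice the mean depth as an integer.
depth_limit() {
    awk -F '\t' '{ sum += $3 } END { if (NR > 0) printf "%d\n", 2 * sum / NR }'
}

# Made-up records standing in for: samtools depth aln.bam
printf 'chr1\t1\t10\nchr1\t2\t20\nchr1\t3\t30\n' | depth_limit
```

For these three records the mean depth is 20, so the helper prints 40; that value would replace 100 in the DP>100 expression.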
1985
1986
1987 o Generate the consensus sequence for one diploid individual:
1988
1989 samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
1990
1991
1992 o Phase one individual:
1993
1994 samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
1995
1996 The calmd command is used to reduce false heterozygotes around
1997 INDELs.
1998
1999
2000
2001 o Dump BAQ applied alignment for other SNP callers:
2002
2003 samtools calmd -bAr aln.bam ref.fa > aln.baq.bam
2004
2005 It adds and corrects the NM and MD tags at the same time. The calmd
2006 command also comes with the -C option, the same as the one in pileup
2007 and mpileup. Apply if it helps.
2008
2009
2011 o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
2012 bam_aux.c.
2013
2014 o Samtools paired-end rmdup does not work for unpaired reads (e.g.
2015 orphan reads or ends mapped to different chromosomes). If this is a
2016 concern, please use Picard's MarkDuplicates, which correctly handles
2017 these cases, although it is a little slower.
2018
2019
2021 Heng Li from the Sanger Institute wrote the original C version of sam‐
2022 tools. Bob Handsaker from the Broad Institute implemented the BGZF
2023 library. James Bonfield from the Sanger Institute developed the CRAM
2024 implementation. John Marshall and Petr Danecek contribute to the
2025 source code and various people from the 1000 Genomes Project have con‐
2026 tributed to the SAM format specification.
2027
2028
2030 bcftools(1), sam(5), tabix(1)
2031
2032 Samtools website: <http://www.htslib.org/>
2033 File format specification of SAM/BAM,CRAM,VCF/BCF:
2034 <http://samtools.github.io/hts-specs>
2035 Samtools latest source: <https://github.com/samtools/samtools>
2036 HTSlib latest source: <https://github.com/samtools/htslib>
2037 Bcftools website: <http://samtools.github.io/bcftools>
2038
2039
2040
2041samtools-1.9 18 July 2018 samtools(1)