samtools-markdup(1)          Bioinformatics tools          samtools-markdup(1)

NAME
       samtools markdup - mark duplicate alignments in a coordinate sorted
       file

SYNOPSIS
       samtools markdup [-l length] [-r] [-s] [-T prefix] [-S] [-f file]
       [-d distance] [-c] [-t] [-m|--mode type] [--include-fails] [--no-PG]
       [-u] [--no-multi-dup] [-@ threads] in.algsort.bam out.bam

DESCRIPTION
       Mark duplicate alignments from a coordinate sorted file that has been
       run through samtools fixmate with the -m option.  This program relies
       on the MC and ms tags that fixmate provides.

OPTIONS
       -l INT    Expected maximum read length of INT bases. [300]

       -r        Remove duplicate reads.

       -s        Print some basic stats.  See STATISTICS.

       -T PREFIX Write temporary files to PREFIX.samtools.nnnn.mmmm.tmp

       -S        Mark supplementary reads of duplicates as duplicates.

       -f file   Write stats to named file.

       -d distance
                 The optical duplicate distance.  Suggested settings are 100
                 for HiSeq style platforms or about 2500 for NovaSeq ones.
                 The default of 0 disables the search for optical duplicates.
                 When set, duplicate reads are tagged with dt:Z:SQ for optical
                 duplicates and dt:Z:LB otherwise.  The distance calculation
                 depends on coordinate data embedded in the read names
                 produced by Illumina sequencing machines.  Optical duplicate
                 detection will not work on non-standard names.

       -c        Clear previous duplicate settings and tags.

       -t        Mark duplicates with the name of the original in a do tag.

       -m, --mode TYPE
                 Duplicate decision method for paired reads.  Values are t or
                 s.  Mode t measures positions based on template start/end
                 (default).  Mode s measures positions based on sequence
                 start.  While the two methods identify mostly the same reads
                 as duplicates, mode s tends to return more results.
                 Unpaired reads are treated identically by both modes.

       -u        Output uncompressed SAM, BAM or CRAM.

       --include-fails
                 Include reads that have failed quality checks (QC fail).

       --no-multi-dup
                 Stop checking duplicates of duplicates for correctness.
                 Reads are still marked as duplicates, but the further checks
                 that make sure all optical duplicates are found are not
                 carried out.  This also affects -t tagging, where reads may
                 be tagged with a better quality read but not necessarily the
                 best one.  Using this option can speed up duplicate marking
                 when there are a great many duplicates for each original
                 read.

       --no-PG   Do not add a PG line to the output file.

       -@, --threads INT
                 Number of input/output compression threads to use in
                 addition to the main thread [0].
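
The -d option depends on coordinates carried in Illumina read names. The sketch below illustrates the idea only; it is not the parser samtools uses. It assumes names of the form instrument:run:flowcell:lane:tile:x:y, reads the last four colon-separated fields, and treats two reads as optically close when they share lane and tile and differ by no more than the distance on each axis. The function name within_optical_distance is hypothetical.

```shell
# Hypothetical helper, not part of samtools.
# Assumes Illumina-style names ending in ...:lane:tile:x:y.
# Prints "yes", "no", or "unknown" (name does not have enough fields).
within_optical_distance() {
    # $1 = first read name, $2 = second read name, $3 = distance
    awk -v a="$1" -v b="$2" -v d="$3" 'BEGIN {
        na = split(a, fa, ":")
        nb = split(b, fb, ":")
        # Need at least lane, tile, x and y at the end of each name.
        if (na < 4 || nb < 4) { print "unknown"; exit }
        # Reads on different lanes or tiles are never compared by distance.
        if (fa[na-3] != fb[nb-3] || fa[na-2] != fb[nb-2]) { print "no"; exit }
        dx = fa[na-1] - fb[nb-1]; if (dx < 0) dx = -dx
        dy = fa[na]   - fb[nb];   if (dy < 0) dy = -dy
        # Assumption: a per-axis comparison rather than Euclidean distance.
        print ((dx <= d && dy <= d) ? "yes" : "no")
    }'
}
```

A name that does not follow this layout yields "unknown", which mirrors the note above that optical duplicate detection will not work on non-standard names.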

STATISTICS
       Entries are:
       COMMAND: the command line.
       READ: number of reads read in.
       WRITTEN: reads written out.
       EXCLUDED: reads ignored.  See below.
       EXAMINED: reads examined for duplication.
       PAIRED: reads that are part of a pair.
       SINGLE: reads that are not part of a pair.
       DUPLICATE PAIR: reads in a duplicate pair.
       DUPLICATE SINGLE: single read duplicates.
       DUPLICATE PAIR OPTICAL: optical duplicate paired reads.
       DUPLICATE SINGLE OPTICAL: optical duplicate single reads.
       DUPLICATE NON PRIMARY: supplementary/secondary duplicate reads.
       DUPLICATE NON PRIMARY OPTICAL: supplementary/secondary optical
       duplicate reads.
       DUPLICATE PRIMARY TOTAL: number of primary duplicate reads.
       DUPLICATE TOTAL: total number of duplicate reads.
       ESTIMATED LIBRARY SIZE: estimate of the number of unique fragments in
       the sequencing library.
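
The entries listed above can be read back from a stats file written with -f. The following sketch assumes the file holds one "NAME: value" entry per line, which is an assumption about the layout rather than a documented guarantee. The helper names markdup_stat and markdup_dup_fraction are hypothetical, and the choice of DUPLICATE PRIMARY TOTAL divided by EXAMINED as a duplication fraction is one reasonable summary, not an official metric.

```shell
# Hypothetical helper: print the value of one named entry from a markdup
# stats file. Assumes one "NAME: value" entry per line.
markdup_stat() {
    # $1 = stats file, $2 = entry name, e.g. "DUPLICATE TOTAL"
    awk -F': ' -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Hypothetical helper: fraction of examined reads marked as primary
# duplicates, printed to four decimal places ("NA" if nothing was examined).
markdup_dup_fraction() {
    examined=$(markdup_stat "$1" "EXAMINED")
    dups=$(markdup_stat "$1" "DUPLICATE PRIMARY TOTAL")
    awk -v d="$dups" -v e="$examined" \
        'BEGIN { if (e > 0) printf "%.4f\n", d / e; else print "NA" }'
}

# Example use (requires samtools and real data):
#   samtools markdup -s -f stats.txt positionsort.bam markdup.bam
#   markdup_dup_fraction stats.txt
```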

       Estimated library size makes various assumptions, e.g. that the
       library consists of unique fragments that are randomly selected (with
       replacement) with equal probability.  This is unlikely to be true in
       practice.  However, it can provide a useful guide to how many unique
       read pairs are likely to be available.  In particular, it can be used
       to determine how much more data might be obtained by further
       sequencing of the library.
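
To make the assumptions above concrete: if N fragments are drawn uniformly with replacement from a library of X unique fragments, the expected number of distinct fragments seen is C = X * (1 - exp(-N / X)). The sketch below solves that equation for X by bisection. Whether samtools computes its estimate in exactly this way is an assumption here; the function name estimate_library_size and the choice of inputs (fragments examined and unique fragments observed) are illustrative.

```shell
# Hypothetical helper: solve C = X * (1 - exp(-N / X)) for X by bisection.
# Prints the estimate rounded to an integer, or "NA" when no duplicates were
# seen (C >= N), in which case the model gives no finite estimate.
estimate_library_size() {
    # $1 = N, fragments examined; $2 = C, unique fragments observed
    awk -v n="$1" -v c="$2" '
    # Expected number of distinct fragments seen for a library of size x.
    function seen(x) { return x * (1 - exp(-n / x)) }
    BEGIN {
        if (c <= 0 || c >= n) { print "NA"; exit }
        # seen(c) < c always holds, so c is a valid lower bound.
        lo = c; hi = c
        # Grow the upper bound until it brackets the root (capped for safety).
        for (k = 0; k < 200 && seen(hi) < c; k++) hi *= 2
        for (i = 0; i < 100; i++) {
            mid = (lo + hi) / 2
            if (seen(mid) < c) lo = mid; else hi = mid
        }
        printf "%.0f\n", (lo + hi) / 2
    }'
}
```

A larger estimate relative to N suggests that further sequencing would still yield mostly new fragments; an estimate close to C suggests the library is nearly exhausted.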

       Excluded reads are those marked as secondary, supplementary or
       unmapped.  By default QC failed reads are also excluded, but they can
       be included with the --include-fails option.  Excluded reads are not
       used for calculating duplicates.  They can optionally be marked as
       duplicates if they have a primary that is also a duplicate.
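
The exclusion rules above map onto standard SAM FLAG bits: 0x4 (unmapped), 0x100 (secondary), 0x200 (QC fail) and 0x800 (supplementary). The sketch below classifies a single FLAG value accordingly. It is an illustration of the rules as described, not code taken from samtools, and the function name is_excluded_flag is hypothetical.

```shell
# Hypothetical helper: report whether a read with SAM FLAG value $1 would be
# excluded under the rules described above. Pass "yes" as $2 to mimic
# --include-fails.
is_excluded_flag() {
    flag=$1
    include_fails=${2:-no}
    # 0x4 unmapped, 0x100 secondary, 0x800 supplementary: always excluded.
    if [ $(( $flag & (0x4 | 0x100 | 0x800) )) -ne 0 ]; then
        echo excluded
    # 0x200 QC fail: excluded unless failed reads are explicitly included.
    elif [ "$include_fails" != yes ] && [ $(( $flag & 0x200 )) -ne 0 ]; then
        echo excluded
    else
        echo examined
    fi
}
```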

EXAMPLES
       This first collate command can be omitted if the file is already name
       ordered or collated:

         samtools collate -o namecollate.bam example.bam

       Add ms and MC tags for markdup to use later:

         samtools fixmate -m namecollate.bam fixmate.bam

       Markdup needs position order:

         samtools sort -o positionsort.bam fixmate.bam

       Finally mark duplicates:

         samtools markdup positionsort.bam markdup.bam

       Typically the fixmate step would be applied immediately after sequence
       alignment and the markdup step after sorting by chromosome and
       position.  Thus no additional sort steps are normally needed.
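
The four steps can also be chained through pipes so that no intermediate files are written. The following is a sketch under assumptions: the function name markdup_pipeline and its argument order are illustrative, and it assumes that your samtools version supports -O and -u for collate, -u for fixmate and sort, and "-" as a stand-in for standard input and output. Check the individual man pages before relying on it.

```shell
# Hypothetical wrapper: collate, fixmate, sort and markdup in one pipeline.
# Each intermediate stage writes uncompressed output (-u) to the next stage.
markdup_pipeline() {
    # $1 = input BAM (any order), $2 = output BAM,
    # $3 = additional threads per stage (optional, default 0)
    input=$1
    output=$2
    threads=${3:-0}
    samtools collate -@ "$threads" -O -u "$input" |
        samtools fixmate -@ "$threads" -m -u - - |
        samtools sort -@ "$threads" -u - |
        samtools markdup -@ "$threads" - "$output"
}

# Example use (requires samtools and real data):
#   markdup_pipeline example.bam markdup.bam 4
```

Writing uncompressed data between stages avoids compressing and decompressing the same records several times; only the final markdup output is compressed.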

AUTHOR
       Written by Andrew Whitwham from the Sanger Institute.

SEE ALSO
       samtools(1), samtools-sort(1), samtools-collate(1), samtools-fixmate(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.13                    7 July 2021               samtools-markdup(1)