samtools-markdup(1)          Bioinformatics tools          samtools-markdup(1)

NAME
       samtools markdup - mark duplicate alignments in a coordinate sorted
       file

SYNOPSIS
       samtools markdup [-l length] [-r] [-s] [-T] [-S] [-f file]
       [-d distance] [-c] [-t] [-m] [--mode] [--include-fails] [--no-PG] [-u]
       [--no-multi-dup] [--read-coords] [--coords-order] in.algsort.bam
       out.bam

DESCRIPTION
       Mark duplicate alignments from a coordinate sorted file that has been
       run through samtools fixmate with the -m option.  This program relies
       on the MC and ms tags that fixmate provides.
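
       A quick way to check this precondition before running markdup is to
       look for both tags in the input.  The following is a minimal sketch,
       not part of samtools: the function name has_fixmate_tags is
       hypothetical, and it reads SAM text on standard input.

```shell
# has_fixmate_tags: hypothetical helper, not a samtools command.
# Reads SAM records on stdin, reports how many carry both MC:Z: and ms:i:
# tags, and returns non-zero if none do.
has_fixmate_tags() {
    awk -F '\t' '
        /^@/ { next }                      # skip header lines
        {
            mc = 0; ms = 0
            for (i = 12; i <= NF; i++) {   # optional fields start at column 12
                if ($i ~ /^MC:Z:/) mc = 1
                if ($i ~ /^ms:i:/) ms = 1
            }
            if (mc && ms) tagged++
            total++
        }
        END {
            printf "%d of %d records carry both MC and ms\n", tagged, total
            exit (tagged > 0 ? 0 : 1)
        }'
}
```

       It could be used on a sample of the input, for example:
       samtools view positionsort.bam | head -n 10000 | has_fixmate_tags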

OPTIONS
       -l INT    Expected maximum read length of INT bases.  [300]

       -r        Remove duplicate reads.

       -s        Print some basic stats.  See STATISTICS.

       -T PREFIX Write temporary files to PREFIX.samtools.nnnn.mmmm.tmp

       -S        Mark supplementary reads of duplicates as duplicates.

       -f file   Write stats to named file.

       -d distance
                 The optical duplicate distance.  Suggested settings are 100
                 for HiSeq-style platforms or about 2500 for NovaSeq ones.
                 The default is 0, which disables the search for optical
                 duplicates.  When set, duplicate reads are tagged with
                 dt:Z:SQ for optical duplicates and dt:Z:LB otherwise.  The
                 distance calculation depends on coordinate data embedded in
                 the read names produced by Illumina sequencing machines.
                 Optical duplicate detection will not work on non-standard
                 names without the use of --read-coords.
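
       To illustrate how coordinates in a read name feed into -d, the sketch
       below pulls x and y out of two Illumina-style names and compares them
       against a distance.  It is not samtools code: the helper name, the
       read name layout and the rule itself (identical leading fields through
       tile, and both the x and y differences no larger than the distance)
       are assumptions made for illustration.

```shell
# is_optical_pair: hypothetical helper, not a samtools command.
# $1, $2: read names assumed to look like machine:run:flowcell:lane:tile:x:y
# $3:     distance, as would be given to -d
# Returns 0 if the two names pass the assumed rule, 1 if not, 2 on bad input.
is_optical_pair() {
    printf '%s\n%s\n' "$1" "$2" | awk -F ':' -v d="$3" '
        NF != 7 { bad = 1 }
        {
            prefix[NR] = $1 ":" $2 ":" $3 ":" $4 ":" $5   # everything up to tile
            x[NR] = $6
            y[NR] = $7
        }
        END {
            if (bad || NR != 2) exit 2
            dx = x[1] - x[2]; if (dx < 0) dx = -dx
            dy = y[1] - y[2]; if (dy < 0) dy = -dy
            exit ((prefix[1] == prefix[2] && dx <= d && dy <= d) ? 0 : 1)
        }'
}
```

       With a distance of 100, two names on the same tile that differ by 50
       in x and 40 in y would pass this check, while names on different
       tiles would not.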

       -c        Clear previous duplicate settings and tags.

       -t        Mark duplicates with the name of the original in a do tag.

       -m, --mode TYPE
                 Duplicate decision method for paired reads.  Values are t or
                 s.  Mode t measures positions based on template start/end
                 (default).  Mode s measures positions based on sequence
                 start.  While the two methods identify mostly the same reads
                 as duplicates, mode s tends to return more results.
                 Unpaired reads are treated identically by both modes.

       -u        Output uncompressed SAM, BAM or CRAM.

       --include-fails
                 Include reads that have failed quality checks.

       --no-multi-dup
                 Stop checking duplicates of duplicates for correctness.
                 Reads are still marked as duplicates, but the further checks
                 that make sure all optical duplicates are found are not
                 carried out.  This also affects -t tagging, where reads may
                 be tagged with a better quality read but not necessarily the
                 best one.  Using this option can speed up duplicate marking
                 when there are a great many duplicates for each original
                 read.

       --read-coords REGEX
                 This takes a POSIX regular expression that captures at least
                 x and y, to be used in optical duplicate marking.  It can
                 also capture another part of the read name to test for
                 equality, e.g. lane:tile elements.  The elements wanted are
                 captured with parentheses.  Examples are given below.

       --coords-order ORDER
                 The order of the elements captured in the regular expression.
                 Default is txy where t is a part of the read name selected
                 for string comparison and x/y the coordinates used for
                 optical duplicate detection.  Valid orders are: txy, tyx,
                 xyt, yxt, xty, ytx, xy and yx.

       --no-PG   Do not add a PG line to the output file.

       -@, --threads INT
                 Number of input/output compression threads to use in addition
                 to main thread [0].

STATISTICS
       Entries are:
       COMMAND: the command line.
       READ: number of reads read in.
       WRITTEN: reads written out.
       EXCLUDED: reads ignored.  See below.
       EXAMINED: reads examined for duplication.
       PAIRED: reads that are part of a pair.
       SINGLE: reads that are not part of a pair.
       DUPLICATE PAIR: reads in a duplicate pair.
       DUPLICATE SINGLE: single read duplicates.
       DUPLICATE PAIR OPTICAL: optical duplicate paired reads.
       DUPLICATE SINGLE OPTICAL: optical duplicate single reads.
       DUPLICATE NON PRIMARY: supplementary/secondary duplicate reads.
       DUPLICATE NON PRIMARY OPTICAL: supplementary/secondary optical
       duplicate reads.
       DUPLICATE PRIMARY TOTAL: number of primary duplicate reads.
       DUPLICATE TOTAL: total number of duplicate reads.
       ESTIMATED LIBRARY SIZE: estimate of the number of unique fragments in
       the sequencing library.
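
       These entries are easy to post-process.  The sketch below computes the
       fraction of examined reads that were marked as primary duplicates.  It
       assumes the stats are written one "NAME: value" entry per line; the
       helper name is hypothetical, and the ratio is just one plausible
       summary, not something markdup reports itself.

```shell
# primary_dup_fraction: hypothetical helper, not a samtools command.
# stdin: markdup stats text, assumed to hold one "NAME: value" entry per line.
# Prints DUPLICATE PRIMARY TOTAL / EXAMINED, or NA if EXAMINED is missing or 0.
primary_dup_fraction() {
    awk -F ': ' '
        $1 == "EXAMINED"                { examined = $2 }
        $1 == "DUPLICATE PRIMARY TOTAL" { dups = $2 }
        END {
            if (examined > 0) printf "%.4f\n", dups / examined
            else print "NA"
        }'
}
```

       For a stats file written with -f, this could be run as:
       primary_dup_fraction < stats.txt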

       Estimated library size makes various assumptions, e.g. that the
       library consists of unique fragments that are randomly selected (with
       replacement) with equal probability.  This is unlikely to be true in
       practice.  However, it can provide a useful guide to how many unique
       read pairs are likely to be available.  In particular, it can be used
       to determine how much more data might be obtained by further
       sequencing of the library.
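
       This page does not give the formula behind the estimate.  One common
       estimator under the assumptions just listed (the Lander-Waterman
       model, as used by Picard) solves c/x = 1 - exp(-n/x) for the library
       size x, where n is the number of read pairs examined and c the number
       of those that are unique.  The sketch below solves that equation by
       bisection.  Whether markdup uses exactly this calculation, and exactly
       which counts it feeds in, is an assumption here, and the function name
       is hypothetical.

```shell
# estimate_library_size: hypothetical helper, not a samtools command.
# $1: n, read pairs examined    $2: c, unique (non-duplicate) read pairs
# Prints the x that solves c/x = 1 - exp(-n/x), or NA when there are no
# duplicates to base an estimate on.
estimate_library_size() {
    awk -v n="$1" -v c="$2" '
        function f(x) { return c / x - 1 + exp(-n / x) }
        BEGIN {
            if (c <= 0 || c >= n) { print "NA"; exit }
            lo = 1; hi = 100
            # f is positive at x = c and negative for large x: widen the
            # bracket until it contains the sign change.
            while (f(hi * c) > 0) { lo = hi; hi *= 10 }
            for (i = 0; i < 60; i++) {
                mid = (lo + hi) / 2
                if (f(mid * c) > 0) lo = mid; else hi = mid
            }
            printf "%.0f\n", c * (lo + hi) / 2
        }'
}
```

       Comparing the result with the number of unique pairs already seen
       gives a rough idea of how much of the library remains unsampled.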

       Excluded reads are those marked as secondary, supplementary or
       unmapped.  By default QC failed reads are also excluded, but they can
       be included with --include-fails.  Excluded reads are not used for
       calculating duplicates.  They can optionally be marked as duplicates
       if they have a primary that is also a duplicate.

EXAMPLES
       This first collate command can be omitted if the file is already name
       ordered or collated:

           samtools collate -o namecollate.bam example.bam

       Add ms and MC tags for markdup to use later:

           samtools fixmate -m namecollate.bam fixmate.bam

       Markdup needs position order:

           samtools sort -o positionsort.bam fixmate.bam

       Finally mark duplicates:

           samtools markdup positionsort.bam markdup.bam

       Typically the fixmate step would be applied immediately after sequence
       alignment and the markdup step after sorting by chromosome and
       position.  Thus no additional sort steps are normally needed.
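
       When the intermediate files are not needed, the same four steps can be
       chained through pipes.  The sketch below wraps them in a shell
       function.  The function name and file names are placeholders, and it
       assumes a samtools version whose collate, fixmate and sort accept -u
       (uncompressed output) and "-" for the standard streams.

```shell
# markdup_pipeline: hypothetical wrapper around the four steps above.
# $1: input BAM in any order    $2: output BAM with duplicates marked
markdup_pipeline() {
    samtools collate -O -u "$1" |       # group reads by name, write to stdout
        samtools fixmate -m -u - - |    # add the ms and MC tags
        samtools sort -u - |            # back to position order
        samtools markdup - "$2"         # mark duplicates
}
```

       It would be called as: markdup_pipeline example.bam markdup.bam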

       To use the regex to obtain coordinates from reads, two or three values
       have to be captured.  To mimic the normal behaviour and match a read
       name of the format machine:run:flowcell:lane:tile:x:y use:

           --read-coords '([^:]+:[0-9]+:[^:]+:[0-9]+:[0-9]+):([0-9]+):([0-9]+)'
           --coords-order txy

       To match only the coordinates of x:y:randomstuff use:

           --read-coords '^([[:digit:]]+):([[:digit:]]+)'
           --coords-order xy

       Complex regular expressions may slow the program down, so it is best
       to keep them simple.
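
       Before handing a pattern to --read-coords it can help to see what it
       captures from a real read name.  The sketch below does this with
       sed -E, which uses POSIX extended regular expressions; it assumes
       markdup interprets the pattern the same way.  The helper name, the
       pattern and the read name are illustrative only.  The helper adds ^
       and $ itself so that sed rewrites the whole name, so the pattern is
       passed without anchors.

```shell
# show_captures: hypothetical helper, not a samtools command.
# $1: extended regular expression with three capture groups (t, x, y),
#     given without ^ or $ anchors and without "/" characters
# $2: read name to test
# Prints "t=... x=... y=..." on a match and nothing otherwise.
show_captures() {
    printf '%s\n' "$2" | sed -E -n "s/^${1}\$/t=\\1 x=\\2 y=\\3/p"
}

re='([^:]+:[0-9]+:[^:]+:[0-9]+:[0-9]+):([0-9]+):([0-9]+)'
show_captures "$re" 'M1:7:FC1:2:1101:1000:2000'
# prints: t=M1:7:FC1:2:1101 x=1000 y=2000
```

       The three printed values correspond to a --coords-order of txy.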

AUTHOR
       Written by Andrew Whitwham from the Sanger Institute.

SEE ALSO
       samtools(1), samtools-sort(1), samtools-collate(1), samtools-fixmate(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.15.1                  7 April 2022              samtools-markdup(1)