samtools-markdup(1)

samtools-markdup(1)          Bioinformatics tools          samtools-markdup(1)

NAME

       samtools markdup - mark duplicate alignments in a coordinate sorted file

SYNOPSIS

       samtools markdup [-l length] [-r] [-s] [-T] [-S] [-f file] [-d distance]
       [-c] [-t] [-m] [--mode] [--include-fails] [--no-PG] [-u]
       [--no-multi-dup] [--read-coords] [--coords-order] in.algsort.bam out.bam

DESCRIPTION

       Mark duplicate alignments from a coordinate sorted file that has been
       run through samtools fixmate with the -m option.  This program relies
       on the MC and ms tags that fixmate provides.

OPTIONS

       -l INT     Expected maximum read length of INT bases.  [300]

       -r         Remove duplicate reads.

       -s         Print some basic stats.  See STATISTICS.

       -T PREFIX  Write temporary files to PREFIX.samtools.nnnn.mmmm.tmp

       -S         Mark supplementary reads of duplicates as duplicates.

       -f file    Write stats to named file.

       -d distance
                  The optical duplicate distance.  Suggested settings are 100
                  for HiSeq style platforms or about 2500 for NovaSeq ones.
                  The default is 0, which disables the search for optical
                  duplicates.  When set, duplicate reads are tagged with
                  dt:Z:SQ for optical duplicates and dt:Z:LB otherwise.
                  Calculation of distance depends on coordinate data embedded
                  in the read names produced by the Illumina sequencing
                  machines.  Optical duplicate detection will not work on
                  non-standard names without the use of --read-coords.

       -c         Clear previous duplicate settings and tags.

       -t         Mark duplicates with the name of the original in a do tag.

       -m, --mode TYPE
                  Duplicate decision method for paired reads.  Values are t or
                  s.  Mode t measures positions based on template start/end
                  (default).  Mode s measures positions based on sequence
                  start.  While the two methods identify mostly the same reads
                  as duplicates, mode s tends to return more results.
                  Unpaired reads are treated identically by both modes.

       -u         Output uncompressed SAM, BAM or CRAM.

       --include-fails
                  Include reads that have failed quality checks (QC fail).

       --no-multi-dup
                  Stop checking duplicates of duplicates for correctness.
                  Reads are still marked as duplicates, but the further checks
                  that make sure all optical duplicates are found are not
                  carried out.  This also affects -t tagging, where reads may
                  be tagged with a better quality read but not necessarily the
                  best one.  Using this option can speed up duplicate marking
                  when there are a great many duplicates for each original
                  read.

       --read-coords REGEX
                  A POSIX regular expression that captures at least the x and
                  y coordinates to be used in optical duplicate marking.  It
                  can also capture another part of the read name to test for
                  equality, e.g. lane:tile elements.  The elements wanted are
                  captured with parentheses.  See EXAMPLES.

       --coords-order ORDER
                  The order of the elements captured in the regular
                  expression.  Default is txy, where t is a part of the read
                  name selected for string comparison and x/y are the
                  coordinates used for optical duplicate detection.  Valid
                  orders are: txy, tyx, xyt, yxt, xty, ytx, xy and yx.

       --no-PG    Do not add a PG line to the output file.

       -@, --threads INT
                  Number of input/output compression threads to use in
                  addition to main thread [0].
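
       As an illustration of what the -d distance is compared against, the
       sketch below checks two read names against a threshold.  It is not the
       samtools implementation: the helper name within_optical_distance, the
       sample read names, and the rule applied (same lane:tile, and both the
       x and y differences within the threshold) are assumptions made for
       the example.

```shell
# Hypothetical helper, not part of samtools.
# Assumes read names end in lane:tile:x:y and a per-axis distance threshold.
within_optical_distance() {
    dist=$1; name_a=$2; name_b=$3
    printf '%s\n%s\n' "$name_a" "$name_b" |
    awk -F: -v d="$dist" '
        # Keep lane:tile as a string and x, y as numbers for each name.
        { t[NR] = $(NF-3) ":" $(NF-2); x[NR] = $(NF-1); y[NR] = $NF }
        END {
            dx = x[1] - x[2]; if (dx < 0) dx = -dx
            dy = y[1] - y[2]; if (dy < 0) dy = -dy
            # Same lane:tile and both axes within the threshold.
            if (t[1] == t[2] && dx <= d && dy <= d) print "optical"
            else print "not optical"
        }'
}

# Made-up read names in machine:run:flowcell:lane:tile:x:y form.
within_optical_distance 100 'M1:7:FC1:2:1101:1500:2000' 'M1:7:FC1:2:1101:1550:2040'
```

       Reads on different tiles, or further apart than the threshold on
       either axis, are reported as "not optical" by this sketch.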

STATISTICS

       Entries are:
       COMMAND: the command line.
       READ: number of reads read in.
       WRITTEN: reads written out.
       EXCLUDED: reads ignored.  See below.
       EXAMINED: reads examined for duplication.
       PAIRED: reads that are part of a pair.
       SINGLE: reads that are not part of a pair.
       DUPLICATE PAIR: reads in a duplicate pair.
       DUPLICATE SINGLE: single read duplicates.
       DUPLICATE PAIR OPTICAL: optical duplicate paired reads.
       DUPLICATE SINGLE OPTICAL: optical duplicate single reads.
       DUPLICATE NON PRIMARY: supplementary/secondary duplicate reads.
       DUPLICATE NON PRIMARY OPTICAL: supplementary/secondary optical
       duplicate reads.
       DUPLICATE PRIMARY TOTAL: number of primary duplicate reads.
       DUPLICATE TOTAL: total number of duplicate reads.
       ESTIMATED LIBRARY SIZE: estimate of the number of unique fragments in
       the sequencing library.

       The estimated library size makes various assumptions, e.g. that the
       library consists of unique fragments that are randomly selected (with
       replacement) with equal probability.  This is unlikely to be true in
       practice.  However, it can provide a useful guide to how many unique
       read pairs are likely to be available.  In particular, it can be used
       to determine how much more data might be obtained by further
       sequencing of the library.
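
       The assumptions described here (random selection with replacement,
       equal probability) are those of the Lander-Waterman model, in which N
       examined read pairs drawn from a library of X unique fragments yield
       about C = X * (1 - exp(-N / X)) unique pairs.  Whether markdup uses
       exactly this model is an assumption; the sketch below only shows how
       such an estimate can be obtained by solving for X with bisection.
       The function name estimate_library_size and the input numbers are
       hypothetical.

```shell
# Hypothetical helper, not part of samtools.
# $1 = N, read pairs examined; $2 = C, unique (non-duplicate) read pairs.
estimate_library_size() {
    awk -v n="$1" -v c="$2" 'BEGIN {
        # No estimate is possible without duplicates or without unique pairs.
        if (c <= 0 || c >= n) { print "NA"; exit }
        lo = c; hi = c
        # Grow the upper bound until the model predicts at least c unique pairs.
        for (k = 0; k < 64 && hi * (1 - exp(-n / hi)) < c; k++) hi *= 2
        # Bisect: the predicted unique pair count increases with library size.
        for (i = 0; i < 100; i++) {
            mid = (lo + hi) / 2
            if (mid * (1 - exp(-n / mid)) < c) lo = mid; else hi = mid
        }
        printf "%.0f\n", (lo + hi) / 2
    }'
}

# Made-up input: 1000 pairs examined, 900 of them unique.
estimate_library_size 1000 900
```

       A library estimate that is only a little larger than the number of
       unique pairs already seen suggests that further sequencing would
       mostly yield more duplicates.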

       Excluded reads are those marked as secondary, supplementary or
       unmapped.  By default QC failed reads are also excluded, but they can
       be included with --include-fails.  Excluded reads are not used for
       calculating duplicates.  They can optionally be marked as duplicates if
       they have a primary that is also a duplicate.
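
       Because excluded reads are not examined, a duplicate rate is most
       naturally taken over the EXAMINED count.  The sketch below assumes
       the stats produced by -s or -f are laid out as one "NAME: value" line
       per entry; the helper name duplicate_rate and the sample numbers are
       hypothetical.

```shell
# Hypothetical helper: reads markdup-style stats on standard input and prints
# primary duplicates as a percentage of examined reads.
duplicate_rate() {
    awk -F': ' '
        $1 == "EXAMINED"                { examined = $2 }
        $1 == "DUPLICATE PRIMARY TOTAL" { dups = $2 }
        END { if (examined > 0) printf "%.2f\n", 100 * dups / examined }
    '
}

# Made-up stats in the assumed layout.
duplicate_rate <<'EOF'
COMMAND: samtools markdup -s -f stats.txt positionsort.bam markdup.bam
READ: 1000
WRITTEN: 1000
EXCLUDED: 40
EXAMINED: 960
DUPLICATE PRIMARY TOTAL: 96
EOF
```

       With a real run the same function could be fed the file named with
       -f, for example: duplicate_rate < stats.txt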

EXAMPLES

       This first collate command can be omitted if the file is already name
       ordered or collated:

           samtools collate -o namecollate.bam example.bam

       Add ms and MC tags for markdup to use later:

           samtools fixmate -m namecollate.bam fixmate.bam

       Markdup needs position order:

           samtools sort -o positionsort.bam fixmate.bam

       Finally mark duplicates:

           samtools markdup positionsort.bam markdup.bam

       Typically the fixmate step would be applied immediately after sequence
       alignment and the markdup step after sorting by chromosome and
       position.  Thus no additional sort steps are normally needed.

       To use the regex to obtain coordinates from reads, two or three values
       have to be captured.  To mimic the normal behaviour and match a read
       name of the format machine:run:flowcell:lane:tile:x:y use:

           --read-coords '(144:[0-9]+:[0-9]+:[0-9]+):([0-9]+):([0-9]+)'
           --coords-order txy

       To match only the coordinates of x:y:randomstuff use:

           --read-coords '^([[:digit:]]+):([[:digit:]]+)'
           --coords-order xy

       Complex regular expressions may slow the running of the program, so
       it is best to keep them simple.
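
       A --read-coords pattern can be tried on a few read names before a
       long run.  The sketch below uses sed -E, which accepts POSIX extended
       regular expressions; that markdup interprets the pattern in the same
       way is an assumption.  The pattern is the x:y one with a + quantifier
       so that multi-digit coordinates match; the helper name extract_xy and
       the read names are hypothetical.

```shell
# Pattern for names laid out as x:y:randomstuff (capture order: x, then y).
pattern='^([[:digit:]]+):([[:digit:]]+)'

# Hypothetical helper: prints "x y" when the name matches, nothing otherwise.
extract_xy() {
    printf '%s\n' "$1" | sed -n -E "s/${pattern}.*/\1 \2/p"
}

extract_xy '1234:5678:randomstuff'   # prints: 1234 5678
extract_xy 'noncoords:name'          # prints nothing, the pattern does not match
```

       Names that print nothing here would not supply coordinates to the
       optical duplicate check under the same pattern.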

AUTHOR

       Written by Andrew Whitwham from the Sanger Institute.