samtools-markdup(1)

samtools-markdup(1)          Bioinformatics tools          samtools-markdup(1)

NAME

       samtools markdup - mark duplicate alignments in a coordinate sorted
       file

SYNOPSIS

       samtools markdup [-l length] [-r] [-s] [-T prefix] [-S] [-f file]
       [-d distance] [-c] [-t] [-m|--mode type] [--include-fails] [--no-PG]
       [-u] [--no-multi-dup] [-@ threads] in.algsort.bam out.bam

DESCRIPTION

       Mark duplicate alignments from a coordinate sorted file that has been
       run through samtools fixmate with the -m option.  This program relies
       on the MC and ms tags that fixmate provides.
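
       Because markdup relies on the MC and ms tags, it can be useful to
       check that they are present before running it.  The following is a
       minimal sketch, not part of samtools: the helper name
       count_fixmate_tags is hypothetical, and it assumes SAM text on
       standard input with the tags written as MC:Z: and ms:i:.  Unpaired
       reads and reads whose mate is unmapped are not expected to carry
       both tags, so a count below the total is not necessarily an error.

```shell
# Count SAM records on stdin that carry both the MC:Z and ms:i tags.
# Header lines (starting with @) are skipped.  Prints "<tagged> <total>".
# count_fixmate_tags is a hypothetical helper, not a samtools command.
count_fixmate_tags() {
    awk -F '\t' '
        /^@/ { next }
        {
            total++
            has_mc = 0; has_ms = 0
            # Optional tags start at column 12 of a SAM record.
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^MC:Z:/) has_mc = 1
                if ($i ~ /^ms:i:/) has_ms = 1
            }
            if (has_mc && has_ms) tagged++
        }
        END { printf "%d %d\n", tagged, total }
    '
}

# Typical use (needs samtools and a real file, so it is left commented out):
#   samtools view positionsort.bam | head -n 1000 | count_fixmate_tags
```

       If the first number is zero, the file has most likely not been
       through samtools fixmate -m.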

OPTIONS

       -l INT     Expected maximum read length of INT bases.  [300]

       -r         Remove duplicate reads.

       -s         Print some basic stats.  See STATISTICS.

       -T PREFIX  Write temporary files to PREFIX.samtools.nnnn.mmmm.tmp

       -S         Mark supplementary reads of duplicates as duplicates.

       -f file    Write stats to the named file.

       -d distance
                  The optical duplicate distance.  Suggested settings are 100
                  for HiSeq style platforms or about 2500 for NovaSeq ones.
                  The default is 0, which disables the search for optical
                  duplicates.  When set, duplicate reads are tagged with
                  dt:Z:SQ for optical duplicates and dt:Z:LB otherwise.  The
                  distance calculation depends on coordinate data embedded in
                  the read names produced by Illumina sequencing machines.
                  Optical duplicate detection will not work on non-standard
                  names.

       -c         Clear previous duplicate settings and tags.

       -t         Mark duplicates with the name of the original in a do tag.

       -m, --mode TYPE
                  Duplicate decision method for paired reads.  Values are t or
                  s.  Mode t measures positions based on template start/end
                  (default).  Mode s measures positions based on sequence
                  start.  While the two methods identify mostly the same reads
                  as duplicates, mode s tends to return more results.
                  Unpaired reads are treated identically by both modes.

       -u         Output uncompressed SAM, BAM or CRAM.

       --include-fails
                  Include reads that failed quality checks (QC fail).

       --no-multi-dup
                  Stop checking duplicates of duplicates for correctness.
                  Reads are still marked as duplicates, but the further checks
                  that make sure all optical duplicates are found are not
                  carried out.  This also affects -t tagging, where reads may
                  be tagged with a better quality read but not necessarily the
                  best one.  Using this option can speed up duplicate marking
                  when there are a great many duplicates for each original
                  read.

       --no-PG    Do not add a PG line to the output file.

       -@, --threads INT
                  Number of input/output compression threads to use in
                  addition to the main thread [0].
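
       Since -d depends on coordinates embedded in the read names, a quick
       check of the name layout can show whether optical duplicate detection
       has anything to work with.  The sketch below is illustrative only and
       is not the parser markdup uses.  It assumes the common seven-field
       Illumina layout instrument:run:flowcell:lane:tile:x:y; the helper
       name illumina_coords is hypothetical, and markdup itself may accept
       other name variants.

```shell
# Print "tile x y" for a read name in the assumed seven-field layout
# instrument:run:flowcell:lane:tile:x:y; return non-zero for other names.
# illumina_coords is a hypothetical helper, not a samtools command.
illumina_coords() {
    printf '%s\n' "$1" | awk -F ':' '
        NF == 7 && $5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ && $7 ~ /^[0-9]+$/ {
            print $5, $6, $7
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example: a name with coordinates, then one without.
illumina_coords 'A00123:45:HXXXXDSXX:1:1101:2000:1000'
illumina_coords 'SRR000001.1' || echo 'no coordinates in read name'
```

       Names that fail a check like this, for example those rewritten by a
       sequence archive, are the non-standard names mentioned under -d.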

STATISTICS

       Entries are:
       COMMAND: the command line.
       READ: number of reads read in.
       WRITTEN: reads written out.
       EXCLUDED: reads ignored.  See below.
       EXAMINED: reads examined for duplication.
       PAIRED: reads that are part of a pair.
       SINGLE: reads that are not part of a pair.
       DUPLICATE PAIR: reads in a duplicate pair.
       DUPLICATE SINGLE: single read duplicates.
       DUPLICATE PAIR OPTICAL: optical duplicate paired reads.
       DUPLICATE SINGLE OPTICAL: optical duplicate single reads.
       DUPLICATE NON PRIMARY: supplementary/secondary duplicate reads.
       DUPLICATE NON PRIMARY OPTICAL: supplementary/secondary optical
       duplicate reads.
       DUPLICATE PRIMARY TOTAL: number of primary duplicate reads.
       DUPLICATE TOTAL: total number of duplicate reads.
       ESTIMATED LIBRARY SIZE: estimate of the number of unique fragments in
       the sequencing library.

       The estimated library size relies on several assumptions, e.g. that
       the library consists of unique fragments that are randomly selected
       (with replacement) with equal probability.  This is unlikely to be
       true in practice.  However, it can provide a useful guide to how many
       unique read pairs are likely to be available.  In particular, it can
       be used to determine how much more data might be obtained by further
       sequencing of the library.

       Excluded reads are those marked as secondary, supplementary or
       unmapped.  By default QC failed reads are also excluded, but they can
       be included with --include-fails.  Excluded reads are not used for
       calculating duplicates.  They can optionally be marked as duplicates
       if they have a primary that is also a duplicate (see -S).
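
       The entries above can be combined into summary figures.  The
       following sketch computes the fraction of examined reads that were
       marked as primary duplicates from a file written with -f.  It assumes
       the file holds one "NAME: value" entry per line using the entry names
       listed above; the helper name duplicate_fraction is hypothetical.

```shell
# Print DUPLICATE PRIMARY TOTAL / EXAMINED from a markdup stats file.
# Assumes one "NAME: value" entry per line (an assumption about the layout).
# duplicate_fraction is a hypothetical helper, not a samtools command.
duplicate_fraction() {
    awk -F ': *' '
        $1 == "EXAMINED"                { examined = $2 }
        $1 == "DUPLICATE PRIMARY TOTAL" { dups = $2 }
        END {
            if (examined > 0) {
                printf "%.4f\n", dups / examined
            } else {
                print "no EXAMINED count found" > "/dev/stderr"
                exit 1
            }
        }
    ' "$1"
}

# Typical use (needs a stats file from a real run, so it is commented out):
#   samtools markdup -s -f stats.txt positionsort.bam markdup.bam
#   duplicate_fraction stats.txt
```

       Primary duplicates are used rather than DUPLICATE TOTAL because
       non-primary reads are excluded from the EXAMINED count.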

EXAMPLES

       This first collate command can be omitted if the file is already name
       ordered or collated:

       samtools collate -o namecollate.bam example.bam

       Add ms and MC tags for markdup to use later:

       samtools fixmate -m namecollate.bam fixmate.bam

       Markdup needs position order:

       samtools sort -o positionsort.bam fixmate.bam

       Finally mark duplicates:

       samtools markdup positionsort.bam markdup.bam

       Typically the fixmate step would be applied immediately after sequence
       alignment and the markdup step after sorting by chromosome and
       position.  Thus no additional sort steps are normally needed.
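
       The four steps above can also be chained so that no intermediate
       files are written.  This is a sketch under stated assumptions rather
       than a recommended invocation: the function name markdup_pipeline is
       hypothetical, "-" is used for standard input and output at each
       stage, and collate -O is assumed to write to standard output.

```shell
# Run collate, fixmate, sort and markdup as one pipeline.
# $1 = input BAM, $2 = output BAM.
# markdup_pipeline is a hypothetical wrapper, not a samtools command.
markdup_pipeline() {
    samtools collate -O "$1" \
        | samtools fixmate -m - - \
        | samtools sort - \
        | samtools markdup - "$2"
}

# Usage (requires samtools on PATH):
#   markdup_pipeline example.bam markdup.bam
```

       The function returns the exit status of the final markdup stage only,
       so a failure in an earlier stage may need to be checked separately.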

AUTHOR

       Written by Andrew Whitwham from the Sanger Institute.