samtools-stats(1)            Bioinformatics tools            samtools-stats(1)

NAME
samtools stats - produces comprehensive statistics from an alignment file

SYNOPSIS
samtools stats [options] in.sam|in.bam|in.cram [region...]

DESCRIPTION
samtools stats collects statistics from SAM, BAM or CRAM files and outputs
them in a text format. The output can be visualized graphically using
plot-bamstats.

A summary of output sections is listed below, followed by more detailed
descriptions.

CHK    Checksum
SN     Summary numbers
FFQ    First fragment qualities
LFQ    Last fragment qualities
GCF    GC content of first fragments
GCL    GC content of last fragments
GCC    ACGT content per cycle
GCT    ACGT content per cycle, read oriented
FBC    ACGT content per cycle for first fragments only
FTC    ACGT raw counters for first fragments
LBC    ACGT content per cycle for last fragments only
LTC    ACGT raw counters for last fragments
BCC    ACGT content per cycle for BC barcode
CRC    ACGT content per cycle for CR barcode
OXC    ACGT content per cycle for OX barcode
RXC    ACGT content per cycle for RX barcode
QTQ    Quality distribution for BC barcode
CYQ    Quality distribution for CR barcode
BZQ    Quality distribution for OX barcode
QXQ    Quality distribution for RX barcode
IS     Insert sizes
RL     Read lengths
FRL    Read lengths for first fragments only
LRL    Read lengths for last fragments only
ID     Indel size distribution
IC     Indels per cycle
COV    Coverage (depth) distribution
GCD    GC-depth

Not all sections will be reported, as some depend on the data being
coordinate sorted while others are only present when specific barcode
tags are in use.

Some of the statistics are collected for “first” or “last” fragments.
Records are put into these categories using the PAIRED (0x1), READ1
(0x40) and READ2 (0x80) flag bits, as follows:

• Unpaired reads (i.e. PAIRED is not set) are all “first” fragments.
  For these records, the READ1 and READ2 flags are ignored.

• Reads where PAIRED and READ1 are set, and READ2 is not set are
  “first” fragments.

• Reads where PAIRED and READ2 are set, and READ1 is not set are
  “last” fragments.

• Reads where PAIRED is set and either both READ1 and READ2 are set
  or neither is set are not counted in either category.
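
The rules above can be sketched in a few lines of Python. This is only an illustration of the flag logic as described here, not code taken from samtools; the function name fragment_category is hypothetical.

```python
from typing import Optional

# Flag bits named in the text above.
PAIRED, READ1, READ2 = 0x1, 0x40, 0x80


def fragment_category(flag: int) -> Optional[str]:
    """Return "first", "last", or None for a SAM FLAG value."""
    if not flag & PAIRED:
        # Unpaired reads are always "first"; READ1/READ2 are ignored.
        return "first"
    read1 = bool(flag & READ1)
    read2 = bool(flag & READ2)
    if read1 and not read2:
        return "first"
    if read2 and not read1:
        return "last"
    # PAIRED with both or neither of READ1/READ2: counted in neither category.
    return None
```

For example, a flag of 0x41 is classified as "first", 0x81 as "last", and 0xC1 falls into neither category.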

Information on the meaning of the flags is given in the SAM specification
document <https://samtools.github.io/hts-specs/SAMv1.pdf>.

The CHK row contains distinct CRC32 checksums of read names, sequences
and quality values. The checksums are computed per alignment record
and summed, so the checksum does not change if the sort order of the
input file is changed.
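
The order independence follows from addition being commutative. The sketch below illustrates the idea with Python's zlib.crc32; the 32-bit wraparound and the byte strings used are assumptions made for the example, so the result is not expected to match the value printed by samtools.

```python
import zlib


def summed_crc32(values):
    """Sum per-record CRC32 values, wrapping at 32 bits (an assumption)."""
    total = 0
    for value in values:
        total = (total + zlib.crc32(value)) & 0xFFFFFFFF
    return total


# Hypothetical read names, once in file order and once reversed.
names = [b"read1", b"read2", b"read3"]
same = summed_crc32(names) == summed_crc32(list(reversed(names)))
```

Here same is True for any reordering of names, because each record contributes its own checksum to the sum regardless of position.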

The SN section contains a series of counts, percentages, and averages,
in a similar style to samtools flagstat, but more comprehensive.

    raw total sequences - total number of reads in a file, excluding
    supplementary and secondary reads. Same number reported by
    samtools view -c -F 0x900.

    filtered sequences - number of discarded reads when using -f or
    -F option.

    sequences - number of processed reads.

    is sorted - flag indicating whether the file is coordinate
    sorted (1) or not (0).

    1st fragments - number of first fragment reads (flags 0x01 not
    set; or flags 0x01 and 0x40 set, 0x80 not set).

    last fragments - number of last fragment reads (flags 0x01 and
    0x80 set, 0x40 not set).

    reads mapped - number of reads, paired or single, that are
    mapped (flag 0x4 or 0x8 not set).

    reads mapped and paired - number of mapped paired reads (flag
    0x1 is set and flags 0x4 and 0x8 are not set).

    reads unmapped - number of unmapped reads (flag 0x4 is set).

    reads properly paired - number of mapped paired reads with flag
    0x2 set.

    paired - number of paired reads, mapped or unmapped, that are
    neither secondary nor supplementary (flag 0x1 is set and flags
    0x100 (256) and 0x800 (2048) are not set).

    reads duplicated - number of duplicate reads (flag 0x400 (1024)
    is set).

    reads MQ0 - number of mapped reads with mapping quality 0.

    reads QC failed - number of reads that failed the quality checks
    (flag 0x200 (512) is set).

    non-primary alignments - number of secondary reads (flag 0x100
    (256) set).

    supplementary alignments - number of supplementary reads (flag
    0x800 (2048) set).

    total length - number of processed bases from reads that are
    neither secondary nor supplementary (flags 0x100 (256) and 0x800
    (2048) are not set).

    total first fragment length - number of processed bases that
    belong to first fragments.

    total last fragment length - number of processed bases that
    belong to last fragments.

    bases mapped - number of processed bases that belong to reads
    mapped.

    bases mapped (cigar) - number of mapped bases filtered by the
    CIGAR string corresponding to the read they belong to. Only
    alignment matches (M), insertions (I), sequence matches (=) and
    sequence mismatches (X) are counted.

    bases trimmed - number of bases trimmed by bwa that belong to
    non-secondary and non-supplementary reads. Enabled by -q option.

    bases duplicated - number of bases that belong to duplicated
    reads.

    mismatches - number of mismatched bases, as reported by the NM
    tag associated with a read, if present.

    error rate - ratio between mismatches and bases mapped (cigar).

    average length - ratio between total length and sequences.

    average first fragment length - ratio between total first
    fragment length and 1st fragments.

    average last fragment length - ratio between total last fragment
    length and last fragments.

    maximum length - length of the longest read (includes
    hard-clipped bases).

    maximum first fragment length - length of the longest first
    fragment read (includes hard-clipped bases).

    maximum last fragment length - length of the longest last
    fragment read (includes hard-clipped bases).

    average quality - ratio between the sum of base qualities and
    total length.

    insert size average - the average absolute template length for
    paired and mapped reads.

    insert size standard deviation - standard deviation for the
    average template length distribution.

    inward oriented pairs - number of paired reads with flag 0x40
    (64) set and flag 0x10 (16) not set or with flag 0x80 (128) set
    and flag 0x10 (16) set.

    outward oriented pairs - number of paired reads with flag 0x40
    (64) set and flag 0x10 (16) set or with flag 0x80 (128) set and
    flag 0x10 (16) not set.

    pairs with other orientation - number of paired reads that don't
    fall in any of the above two categories.

    pairs on different chromosomes - number of pairs where one read
    is on one chromosome and the pair read is on a different
    chromosome.

    percentage of properly paired reads - percentage of reads
    properly paired out of sequences.

    bases inside the target - number of bases inside the target
    region(s) (when a target file is specified with -t option).

    percentage of target genome with coverage > VAL - percentage of
    target bases with a coverage larger than VAL. By default, VAL is
    0, but a custom value can be supplied by the user with -g
    option.
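
Several SN values are defined as ratios of other SN values, so they can be recomputed from a parsed report. The Python sketch below assumes the SN rows are tab-separated, with the key "SN", the field name followed by a colon, the value, and an optional trailing comment; the sample numbers are invented for illustration and the helper name parse_summary_numbers is hypothetical.

```python
def parse_summary_numbers(text):
    """Collect SN rows into a dict mapping field name -> float value."""
    summary = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "SN":
            # Drop the trailing colon from the field name.
            summary[fields[1].rstrip(":")] = float(fields[2])
    return summary


# Invented sample rows in the assumed layout.
sample = (
    "SN\tsequences:\t200\n"
    "SN\ttotal length:\t20000\n"
    "SN\tbases mapped (cigar):\t19000\n"
    "SN\tmismatches:\t95\t# trailing comments are ignored\n"
)

sn = parse_summary_numbers(sample)
# Ratios as defined in the list above.
error_rate = sn["mismatches"] / sn["bases mapped (cigar)"]
average_length = sn["total length"] / sn["sequences"]
```

With these sample rows, error_rate is 95 / 19000 and average_length is 20000 / 200.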

The FFQ and LFQ sections report the quality distribution per first/last
fragment and per cycle number. They have one row per cycle (reported
as the first column after the FFQ/LFQ key) with remaining columns being
the observed integer counts per quality value, starting at quality 0 in
the left-most column and ending at the largest observed quality. Thus
each row forms its own quality distribution and any cycle-specific
quality artefacts can be observed.
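
Because each row is a histogram over quality values, a per-cycle mean quality can be derived from it. The following Python sketch assumes tab-separated rows laid out as described above (key, cycle, then counts starting at quality 0); the sample rows are invented and the function name is hypothetical.

```python
def mean_quality_per_cycle(text, key="FFQ"):
    """Return {cycle: mean quality} from FFQ (or LFQ) rows."""
    means = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] != key or len(fields) < 3:
            continue
        cycle = int(fields[1])
        # Column i after the cycle holds the count for quality value i.
        counts = [int(c) for c in fields[2:]]
        total = sum(counts)
        if total:
            means[cycle] = sum(q * n for q, n in enumerate(counts)) / total
    return means


# Invented sample: two cycles, qualities 0..3.
sample = "FFQ\t1\t0\t0\t10\t30\nFFQ\t2\t5\t5\t10\t20\n"
means = mean_quality_per_cycle(sample)
```

Passing key="LFQ" applies the same calculation to last fragments.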

GCF and GCL report the total GC content of each fragment, separated
into first and last fragments. The columns show the GC percentile
(between 0 and 100) and an integer count of fragments at that percentile.

GCC, FBC and LBC report the nucleotide content per cycle either
combined (GCC) or split into first (FBC) and last (LBC) fragments. The
columns are cycle number (integer), and percentage counts for A, C, G,
T, N and other (typically containing ambiguity codes) normalised
against the total counts of A, C, G and T only (excluding N and other).
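
One consequence of this normalisation is that the A, C, G and T percentages sum to 100, while N and other are reported on top of that. A minimal Python sketch of the arithmetic, using invented counts for a single cycle and a hypothetical function name:

```python
def cycle_percentages(counts):
    """Normalise per-cycle counts against the A+C+G+T total only."""
    acgt = sum(counts.get(base, 0) for base in "ACGT")
    if acgt == 0:
        return {}
    return {base: 100.0 * n / acgt for base, n in counts.items()}


# Invented raw counts for one cycle.
counts = {"A": 30, "C": 20, "G": 20, "T": 30, "N": 5, "other": 0}
pct = cycle_percentages(counts)
acgt_sum = sum(pct[base] for base in "ACGT")
```

In this example N is reported as 5%, even though the four standard bases already account for 100%.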

GCT offers a similar report to GCC, but whereas GCC counts nucleotides
as they appear in the SAM output (in reference orientation), GCT takes
into account whether a nucleotide belongs to a reverse complemented
read and counts it in the original read orientation. If there are no
reverse complemented reads in a file, the GCC and GCT reports will be
identical.
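
The difference between the two reports can be illustrated with a small Python sketch. It restores the original read orientation for records with the reverse-strand flag (0x10) before counting. How samtools numbers cycles internally is not described here, so the indexing from the start of the string is an assumption, and the records are invented.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
REVERSE = 0x10  # SAM flag: read is reverse complemented


def read_oriented(seq, flag):
    """Return the sequence in original read orientation."""
    if flag & REVERSE:
        return seq.translate(COMPLEMENT)[::-1]
    return seq


def per_cycle_counts(records, oriented):
    """Count bases per position for (seq, flag) pairs."""
    cycles = []
    for seq, flag in records:
        bases = read_oriented(seq, flag) if oriented else seq
        for i, base in enumerate(bases):
            if i == len(cycles):
                cycles.append({})
            cycles[i][base] = cycles[i].get(base, 0) + 1
    return cycles


# Invented records: the same stored sequence on forward and reverse strands.
records = [("AACG", 0), ("AACG", REVERSE)]
as_stored = per_cycle_counts(records, oriented=False)
as_read = per_cycle_counts(records, oriented=True)
```

With only forward-strand records, both calls return the same counts, mirroring the statement that GCC and GCT are then identical.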

FTC and LTC report the total numbers of nucleotides for first and last
fragments, respectively. The columns are the raw counters for A, C, G,
T and N bases.

BCC, CRC, OXC and RXC are the barcode equivalent of GCC, showing
nucleotide content for the barcode tags BC, CR, OX and RX respectively.
Their quality value distributions are in the QTQ, CYQ, BZQ and QXQ
sections, corresponding to the BC/QT, CR/CY, OX/BZ and RX/QX SAM format
sequence/quality tags. These quality value distributions follow the
same format used in the FFQ and LFQ sections. All these section names
are followed by a number (1 or 2), indicating that the stats figures
below them correspond to the first or second barcode (in the case of
dual indexing). Thus, these sections will appear as BCC1, CRC1, OXC1
and RXC1, accompanied by their quality counterparts QTQ1, CYQ1, BZQ1
and QXQ1. If a separator is present in the barcode sequence (usually a
hyphen), indicating dual indexing, then sections ending in "2" will
also be reported to show the second tag statistics (e.g. both BCC1 and
BCC2 are present).
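
The mapping from a barcode string to numbered sections can be pictured as a simple split on the separator. The Python sketch below assumes a hyphen separator, as mentioned above as the usual case; the function name and barcode values are hypothetical.

```python
def barcode_sections(barcode, prefix="BCC", separator="-"):
    """Map a barcode tag value to numbered section names."""
    # Split at most once: first barcode -> <prefix>1, second -> <prefix>2.
    parts = barcode.split(separator, 1)
    return {f"{prefix}{i}": part for i, part in enumerate(parts, start=1)}


single = barcode_sections("ACGTACGT")
dual = barcode_sections("ACGTACGT-TTGACCAA")
```

A single-index barcode yields only the section ending in 1, while a dual-index barcode yields both.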

IS reports insert size distributions with one row per size, reported in
the first column, with subsequent columns for the frequency of total
pairs, inward oriented pairs, outward oriented pairs and other
orientation pairs. The -i option specifies the maximum insert size
reported.
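
As an illustration of the column layout, the Python sketch below parses IS rows and computes a mean insert size weighted by the total-pairs column. It assumes tab-separated rows; the sample values and function names are invented, and the result is not claimed to reproduce the "insert size average" reported in the SN section.

```python
def parse_insert_sizes(text):
    """Parse IS rows into dicts keyed by the column meanings above."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "IS" and len(fields) >= 6:
            size, total, inward, outward, other = (int(f) for f in fields[1:6])
            rows.append({"size": size, "total": total, "inward": inward,
                         "outward": outward, "other": other})
    return rows


def weighted_mean_size(rows):
    """Mean insert size weighted by the total pair count per row."""
    pairs = sum(r["total"] for r in rows)
    return sum(r["size"] * r["total"] for r in rows) / pairs if pairs else None


# Invented sample rows.
sample = "IS\t100\t10\t8\t1\t1\nIS\t200\t30\t30\t0\t0\n"
rows = parse_insert_sizes(sample)
mean_size = weighted_mean_size(rows)
```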

RL reports the distribution for all read lengths, with one row per
observed length (up to the maximum specified by the -l option). Columns
are read length and frequency. FRL and LRL contain the same information
separated into first and last fragments.

ID reports the distribution of indel sizes, with one row per observed
size. The columns are size, frequency of insertions at that size and
frequency of deletions at that size.

IC reports the frequency of indels occurring per cycle, broken down by
both insertion / deletion and by first / last read. Note that for
multi-base indels this only counts the first base location. Columns are
cycle, number of insertions in first fragments, number of insertions in
last fragments, number of deletions in first fragments, and number of
deletions in last fragments.

COV reports a distribution of the alignment depth per covered reference
site. For example, an average depth of 50 would ideally result in a
normal distribution centred on 50, but the presence of repeats or
copy-number variation may reveal multiple peaks at approximate multiples
of 50. The first column is an inclusive coverage range in the form of
[min-max]. The next columns are a repeat of the maximum portion of the
depth range (now as a single integer) and the frequency with which that
depth range was observed. The minimum, maximum and range step size are
controlled by the -c option. Depths below the minimum and above the
maximum are reported with ranges [<min] and [max<] respectively.
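
To look for the peaks mentioned above, the COV rows can be read into (range, depth, frequency) tuples. The Python sketch below assumes tab-separated rows in the column order just described and uses invented sample values; the function name is hypothetical.

```python
def parse_coverage(text):
    """Parse COV rows into (range_label, depth, frequency) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "COV" and len(fields) >= 4:
            rows.append((fields[1], int(fields[2]), int(fields[3])))
    return rows


# Invented sample around an expected depth of 50.
sample = (
    "COV\t[49-49]\t49\t100\n"
    "COV\t[50-50]\t50\t400\n"
    "COV\t[51-51]\t51\t150\n"
)
rows = parse_coverage(sample)
# The most frequently observed depth range (the highest peak).
peak_label, peak_depth, peak_count = max(rows, key=lambda r: r[2])
```

For data with copy-number variation, sorting the rows by frequency instead of taking only the maximum would expose the secondary peaks.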

GCD reports the GC content of the reference data aligned against per
alignment record, with one row per observed GC percentage reported as
the first column and sorted on this column. The second column is a
total sequence percentile, as a running total (ending at 100%). The
first and second columns may be used to produce a simple distribution
of GC content. Subsequent columns list the coverage depth at 10th,
25th, 50th, 75th and 90th GC percentiles for this specific GC
percentage, revealing any GC bias in mapping. These columns are averaged
depths, so are floating point with no maximum value.

OPTIONS
-c, --coverage MIN,MAX,STEP
    Set coverage distribution to the specified range (MIN, MAX,
    STEP all given as integers) [1,1000,1]

-d, --remove-dups
    Exclude from statistics reads marked as duplicates

-f, --required-flag STR|INT
    Required flag, 0 for unset. See also `samtools flags` [0]

-F, --filtering-flag STR|INT
    Filtering flag, 0 for unset. See also `samtools flags` [0]

--GC-depth FLOAT
    The size of GC-depth bins (decreasing bin size increases memory
    requirement) [2e4]

-h, --help
    This help message

-i, --insert-size INT
    Maximum insert size [8000]

-I, --id STR
    Include only listed read group or sample name []

-l, --read-length INT
    Include in the statistics only reads with the given read length
    [-1]

-m, --most-inserts FLOAT
    Report only the main part of inserts [0.99]

-P, --split-prefix STR
    A path or string prefix to prepend to filenames output when
    creating categorised statistics files with -S/--split. [input
    filename]

-q, --trim-quality INT
    The BWA trimming parameter [0]

-r, --ref-seq FILE
    Reference sequence (required for GC-depth and
    mismatches-per-cycle calculation). []

-S, --split TAG
    In addition to the complete statistics, also output categorised
    statistics based on the tagged field TAG (e.g., use --split RG
    to split into read groups).

    Categorised statistics are written to files named
    <prefix>_<value>.bamstat, where prefix is as given by
    --split-prefix (or the input filename by default) and value has
    been encountered as the specified tagged field's value in one or
    more alignment records.

-t, --target-regions FILE
    Do stats in these regions only. Tab-delimited file chr,from,to,
    1-based, inclusive. []

-x, --sparse
    Suppress outputting IS rows where there are no insertions.

-p, --remove-overlaps
    Remove overlaps of paired-end reads from coverage and base
    count computations.

-g, --cov-threshold INT
    Only bases with coverage above this value will be included in
    the target percentage computation [0]

-X  If this option is set, it allows the user to specify customized
    index file location(s) if the data folder does not contain any
    index file. Example usage: samtools stats [options] -X
    /data_folder/data.bam /index_folder/data.bai chrM:1-10

-@, --threads INT
    Number of input/output compression threads to use in addition
    to main thread [0].
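
When driving samtools stats from a script, it can help to assemble the argument list programmatically. The Python sketch below builds such a list using only options documented above (-c, -d, -F, -r, -@) and the argument order from the synopsis; the function name, file names and region are hypothetical, and the sketch only constructs the command without running it.

```python
def build_stats_command(alignment, regions=(), coverage=None,
                        remove_dups=False, filtering_flag=None,
                        ref_seq=None, threads=0):
    """Build an argument list for `samtools stats` (not executed here)."""
    cmd = ["samtools", "stats"]
    if coverage is not None:
        low, high, step = coverage
        cmd += ["-c", f"{low},{high},{step}"]   # MIN,MAX,STEP
    if remove_dups:
        cmd.append("-d")
    if filtering_flag is not None:
        cmd += ["-F", str(filtering_flag)]
    if ref_seq:
        cmd += ["-r", ref_seq]
    if threads:
        cmd += ["-@", str(threads)]
    # Options first, then the input file, then optional regions.
    cmd.append(alignment)
    cmd.extend(regions)
    return cmd


cmd = build_stats_command("in.bam", regions=["chr1:1-1000"],
                          coverage=(1, 500, 1), remove_dups=True, threads=4)
```

The resulting list could then be passed to subprocess.run with capture_output=True and text=True to collect the report as a string, assuming samtools is installed and on the PATH.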

AUTHOR
Written by Petr Danecek with major modifications by Nicholas Clarke,
Martin Pollard, Josh Randall, and Valeriu Ohan, all from the Sanger
Institute.

SEE ALSO
samtools(1), samtools-flagstat(1), samtools-idxstats(1)

Samtools website: <http://www.htslib.org/>

samtools-1.13                    7 July 2021                 samtools-stats(1)