samtools-stats(1)            Bioinformatics tools            samtools-stats(1)

NAME

       samtools stats - produces comprehensive statistics from alignment file

SYNOPSIS

       samtools stats [options] in.sam|in.bam|in.cram [region...]

DESCRIPTION

       samtools stats collects statistics from SAM, BAM or CRAM files and
       outputs them in a text format.  The output can be visualized
       graphically using plot-bamstats.

       A summary of output sections is listed below, followed by more detailed
       descriptions.

       CHK   Checksum
       SN    Summary numbers
       FFQ   First fragment qualities
       LFQ   Last fragment qualities
       GCF   GC content of first fragments
       GCL   GC content of last fragments
       GCC   ACGT content per cycle
       GCT   ACGT content per cycle, read oriented
       FBC   ACGT content per cycle for first fragments only
       FTC   ACGT raw counters for first fragments
       LBC   ACGT content per cycle for last fragments only
       LTC   ACGT raw counters for last fragments
       BCC   ACGT content per cycle for BC barcode
       CRC   ACGT content per cycle for CR barcode
       OXC   ACGT content per cycle for OX barcode
       RXC   ACGT content per cycle for RX barcode
       QTQ   Quality distribution for BC barcode
       CYQ   Quality distribution for CR barcode
       BZQ   Quality distribution for OX barcode
       QXQ   Quality distribution for RX barcode
       IS    Insert sizes
       RL    Read lengths
       FRL   Read lengths for first fragments only
       LRL   Read lengths for last fragments only
       ID    Indel size distribution
       IC    Indels per cycle
       COV   Coverage (depth) distribution
       GCD   GC-depth

       Not all sections will be reported, as some depend on the data being
       coordinate sorted while others are only present when specific barcode
       tags are in use.

       Some of the statistics are collected for “first” or “last” fragments.
       Records are put into these categories using the PAIRED (0x1), READ1
       (0x40) and READ2 (0x80) flag bits, as follows:

       •   Unpaired reads (i.e. PAIRED is not set) are all “first” fragments.
           For these records, the READ1 and READ2 flags are ignored.

       •   Reads where PAIRED and READ1 are set, and READ2 is not set are
           “first” fragments.

       •   Reads where PAIRED and READ2 are set, and READ1 is not set are
           “last” fragments.

       •   Reads where PAIRED is set and either both READ1 and READ2 are set
           or neither is set are not counted in either category.
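
       The four rules above can be sketched as a small Python function.  This
       is an illustration only, not code taken from samtools; the function
       name fragment_category is hypothetical.

```python
from typing import Optional

# Flag bits as named in the text above.
PAIRED = 0x1
READ1 = 0x40
READ2 = 0x80


def fragment_category(flag: int) -> Optional[str]:
    """Return "first", "last", or None (counted in neither category)."""
    if not flag & PAIRED:
        # Unpaired reads are always "first"; READ1/READ2 are ignored.
        return "first"
    read1 = bool(flag & READ1)
    read2 = bool(flag & READ2)
    if read1 and not read2:
        return "first"
    if read2 and not read1:
        return "last"
    # Both or neither of READ1/READ2 set on a paired read.
    return None
```

       For example, a record with flag 0x41 falls into the "first" category
       and one with flag 0x81 into "last".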

       Information on the meaning of the flags is given in the SAM
       specification document <https://samtools.github.io/hts-specs/SAMv1.pdf>.

       The CHK row contains distinct CRC32 checksums of read names, sequences
       and quality values.  The checksums are computed per alignment record
       and summed, meaning the checksum does not change if the sort order of
       the input file is changed.
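
       The reason a summed checksum is independent of record order can be
       shown with a short Python sketch.  It uses zlib.crc32 and wraps the
       sum to 32 bits; both the helper name summed_crc32 and the wrap-around
       are assumptions made for illustration and may differ from what
       samtools actually does.

```python
import zlib


def summed_crc32(values):
    """Sum the CRC32 of each value, wrapping to 32 bits (an assumption)."""
    total = 0
    for value in values:
        total = (total + zlib.crc32(value)) & 0xFFFFFFFF
    return total


# Made-up read names; addition is commutative, so order does not matter.
names = [b"read_a", b"read_b", b"read_c"]
same = summed_crc32(names) == summed_crc32(list(reversed(names)))
```

       Here same is True for any ordering of the records, which is the
       property the CHK row relies on.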

       The SN section contains a series of counts, percentages, and averages,
       in a similar style to samtools flagstat, but more comprehensive.

              raw total sequences - total number of reads in a file, excluding
              supplementary and secondary reads.  Same number reported by
              samtools view -c -F 0x900.

              filtered sequences - number of discarded reads when using -f or
              -F option.

              sequences - number of processed reads.

              is sorted - flag indicating whether the file is coordinate
              sorted (1) or not (0).

              1st fragments - number of first fragment reads (flags 0x01 not
              set; or flags 0x01 and 0x40 set, 0x80 not set).

              last fragments - number of last fragment reads (flags 0x01 and
              0x80 set, 0x40 not set).

              reads mapped - number of reads, paired or single, that are
              mapped (flag 0x4 or 0x8 not set).

              reads mapped and paired - number of mapped paired reads (flag
              0x1 is set and flags 0x4 and 0x8 are not set).

              reads unmapped - number of unmapped reads (flag 0x4 is set).

              reads properly paired - number of mapped paired reads with flag
              0x2 set.

              paired - number of paired reads, mapped or unmapped, that are
              neither secondary nor supplementary (flag 0x1 is set and flags
              0x100 (256) and 0x800 (2048) are not set).

              reads duplicated - number of duplicate reads (flag 0x400 (1024)
              is set).

              reads MQ0 - number of mapped reads with mapping quality 0.

              reads QC failed - number of reads that failed the quality checks
              (flag 0x200 (512) is set).

              non-primary alignments - number of secondary reads (flag 0x100
              (256) set).

              supplementary alignments - number of supplementary reads (flag
              0x800 (2048) set).

              total length - number of processed bases from reads that are
              neither secondary nor supplementary (flags 0x100 (256) and 0x800
              (2048) are not set).

              total first fragment length - number of processed bases that
              belong to first fragments.

              total last fragment length - number of processed bases that
              belong to last fragments.

              bases mapped - number of processed bases that belong to reads
              mapped.

              bases mapped (cigar) - number of mapped bases filtered by the
              CIGAR string corresponding to the read they belong to.  Only
              alignment matches (M), inserts (I), sequence matches (=) and
              sequence mismatches (X) are counted.

              bases trimmed - number of bases trimmed by bwa, that belong to
              non secondary and non supplementary reads.  Enabled by -q option.

              bases duplicated - number of bases that belong to reads
              duplicated.

              mismatches - number of mismatched bases, as reported by the NM
              tag associated with a read, if present.

              error rate - ratio between mismatches and bases mapped (cigar).

              average length - ratio between total length and sequences.

              average first fragment length - ratio between total first
              fragment length and 1st fragments.

              average last fragment length - ratio between total last fragment
              length and last fragments.

              maximum length - length of the longest read (includes
              hard-clipped bases).

              maximum first fragment length - length of the longest first
              fragment read (includes hard-clipped bases).

              maximum last fragment length - length of the longest last
              fragment read (includes hard-clipped bases).

              average quality - ratio between the sum of base qualities and
              total length.

              insert size average - the average absolute template length for
              paired and mapped reads.

              insert size standard deviation - standard deviation for the
              average template length distribution.

              inward oriented pairs - number of paired reads with flag 0x40
              (64) set and flag 0x10 (16) not set or with flag 0x80 (128) set
              and flag 0x10 (16) set.

              outward oriented pairs - number of paired reads with flag 0x40
              (64) set and flag 0x10 (16) set or with flag 0x80 (128) set and
              flag 0x10 (16) not set.

              pairs with other orientation - number of paired reads that do
              not fall into either of the two categories above.

              pairs on different chromosomes - number of pairs where one read
              is on one chromosome and its mate is on a different chromosome.

              percentage of properly paired reads - percentage of reads
              properly paired out of sequences.

              bases inside the target - number of bases inside the target
              region(s) (when a target file is specified with -t option).

              percentage of target genome with coverage > VAL - percentage of
              target bases with a coverage larger than VAL.  By default, VAL is
              0, but a custom value can be supplied by the user with -g
              option.
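
       Several SN values are defined as ratios of other SN values, so they
       can be recomputed from a parsed report.  The Python sketch below
       assumes that each SN row is tab-separated in the form
       "SN<TAB>name:<TAB>value", optionally followed by a comment field; that
       layout is not specified on this page, and the helper name parse_sn and
       the example numbers are made up for illustration.

```python
def parse_sn(lines):
    """Collect SN rows into a dict of name -> float value."""
    summary = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # skip other sections and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        name = fields[1].rstrip(":")
        summary[name] = float(fields[2])
    return summary


# Made-up rows in the assumed layout, not real samtools output.
example = [
    "SN\tsequences:\t200\n",
    "SN\ttotal length:\t20000\n",
    "SN\tbases mapped (cigar):\t19000\t# a trailing comment\n",
    "SN\tmismatches:\t95\n",
]
sn = parse_sn(example)

# Ratios as defined in the list above.
average_length = sn["total length"] / sn["sequences"]
error_rate = sn["mismatches"] / sn["bases mapped (cigar)"]
```

       With these example numbers, average_length is 100 bases and error_rate
       is 0.005.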

       The FFQ and LFQ sections report the quality distribution per first/last
       fragment and per cycle number.  They have one row per cycle (reported
       as the first column after the FFQ/LFQ key) with remaining columns being
       the observed integer counts per quality value, starting at quality 0 in
       the left-most column and ending at the largest observed quality.  Thus
       each row forms its own quality distribution and any cycle specific
       quality artefacts can be observed.
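
       Because each FFQ/LFQ row is a histogram indexed by quality value, a
       per-cycle mean quality can be derived from it.  The sketch below
       assumes tab-separated rows; the function name mean_quality and the
       example row are hypothetical.

```python
def mean_quality(row):
    """Return (cycle, mean quality) for one FFQ or LFQ row."""
    fields = row.rstrip("\n").split("\t")
    cycle = int(fields[1])
    # Column i after the cycle holds the count for quality value i.
    counts = [int(value) for value in fields[2:]]
    total = sum(counts)
    if total == 0:
        return cycle, None
    weighted = sum(quality * count for quality, count in enumerate(counts))
    return cycle, weighted / total


# Made-up row: cycle 1, ten bases at quality 2 and thirty at quality 3.
cycle, mean = mean_quality("FFQ\t1\t0\t0\t10\t30\n")
```

       For this row the mean is (2*10 + 3*30) / 40 = 2.75.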

       GCF and GCL report the total GC content of each fragment, separated
       into first and last fragments.  The columns show the GC percentile
       (between 0 and 100) and an integer count of fragments at that
       percentile.

       GCC, FBC and LBC report the nucleotide content per cycle either
       combined (GCC) or split into first (FBC) and last (LBC) fragments.  The
       columns are cycle number (integer), and percentage counts for A, C, G,
       T, N and other (typically containing ambiguity codes) normalised
       against the total counts of A, C, G and T only (excluding N and other).
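
       The normalisation described above means that the N and "other"
       columns are expressed relative to the A, C, G and T total, so the six
       percentages in a row can add up to more than 100.  A minimal Python
       sketch, with a hypothetical function name and made-up counts:

```python
def content_percentages(counts):
    """Express every count as a percentage of the A+C+G+T total."""
    denominator = sum(counts.get(base, 0) for base in "ACGT")
    if denominator == 0:
        return {}
    return {key: 100.0 * value / denominator for key, value in counts.items()}


# Made-up raw counts for a single cycle.
raw = {"A": 25, "C": 25, "G": 25, "T": 25, "N": 10, "other": 0}
percentages = content_percentages(raw)
```

       Here A, C, G and T are each 25.0 and N is 10.0, because N is divided
       by the 100 A/C/G/T bases rather than by all 110 bases.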

       GCT offers a similar report to GCC, but whereas GCC counts nucleotides
       as they appear in the SAM output (in reference orientation), GCT takes
       into account whether a nucleotide belongs to a reverse complemented
       read and counts it in the original read orientation.  If there are no
       reverse complemented reads in a file, the GCC and GCT reports will be
       identical.
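
       The difference between the two orientations can be illustrated in
       Python.  The sketch assumes the SAM reverse-strand flag bit 0x10 marks
       reverse complemented reads; the function name original_orientation is
       hypothetical and the sequence is made up.

```python
REVERSE = 0x10  # SAM flag bit: sequence is stored reverse complemented
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def original_orientation(seq, flag):
    """Return the sequence as it was read by the sequencer."""
    if flag & REVERSE:
        # Complement each base, then reverse the order of the cycles.
        return seq.translate(COMPLEMENT)[::-1]
    return seq


stored = "AACG"                                    # as stored in SAM (GCC)
as_read = original_orientation(stored, REVERSE)    # as sequenced (GCT)
```

       For this read, GCC would count an A in the first cycle while GCT
       would count a C, since as_read is "CGTT".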

       FTC and LTC report the total numbers of nucleotides for first and last
       fragments, respectively.  The columns are the raw counters for A, C, G,
       T and N bases.

       BCC, CRC, OXC and RXC are the barcode equivalent of GCC, showing
       nucleotide content for the barcode tags BC, CR, OX and RX respectively.
       Their quality value distributions are in the QTQ, CYQ, BZQ and QXQ
       sections, corresponding to the BC/QT, CR/CY, OX/BZ and RX/QX SAM format
       sequence/quality tags.  These quality value distributions follow the
       same format used in the FFQ and LFQ sections.  All these section names
       are followed by a number (1 or 2), indicating that the stats figures
       below them correspond to the first or second barcode (in the case of
       dual indexing).  Thus, these sections will appear as BCC1, CRC1, OXC1
       and RXC1, accompanied by their quality correspondents QTQ1, CYQ1, BZQ1
       and QXQ1.  If a separator is present in the barcode sequence (usually a
       hyphen), indicating dual indexing, then sections ending in "2" will
       also be reported to show the second tag statistics (e.g. both BCC1 and
       BCC2 are present).
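
       The naming rule for barcode sections can be sketched as follows.  The
       hyphen is taken as the separator because the text calls it the usual
       one; the function name barcode_sections is hypothetical and the
       barcodes are made up.

```python
def barcode_sections(section, barcode, separator="-"):
    """Map a barcode tag value onto the numbered section names."""
    first, found, second = barcode.partition(separator)
    sections = {section + "1": first}
    if found:
        # A separator indicates dual indexing, so a "2" section appears.
        sections[section + "2"] = second
    return sections


single = barcode_sections("BCC", "ACGTAC")
dual = barcode_sections("BCC", "ACGTAC-TTGACA")
```

       A single-index barcode yields only BCC1, while the dual-index barcode
       yields BCC1 for ACGTAC and BCC2 for TTGACA.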

       IS reports insert size distributions with one row per size, reported in
       the first column, with subsequent columns for the frequency of total
       pairs, inward oriented pairs, outward oriented pairs and other
       orientation pairs.  The -i option specifies the maximum insert size
       reported.

       RL reports the distribution for all read lengths, with one row per
       observed length (up to the maximum specified by the -l option).  Columns
       are read length and frequency.  FRL and LRL contain the same
       information separated into first and last fragments.

       ID reports the distribution of indel sizes, with one row per observed
       size.  The columns are size, frequency of insertions at that size and
       frequency of deletions at that size.

       IC reports the frequency of indels occurring per cycle, broken down by
       both insertion / deletion and by first / last read.  Note for
       multi-base indels this only counts the first base location.  Columns are
       cycle, number of insertions in first fragments, number of insertions in
       last fragments, number of deletions in first fragments, and number of
       deletions in last fragments.

       COV reports a distribution of the alignment depth per covered reference
       site.  For example an average depth of 50 would ideally result in a
       normal distribution centred on 50, but the presence of repeats or
       copy-number variation may reveal multiple peaks at approximate multiples
       of 50.  The first column is an inclusive coverage range in the form of
       [min-max].  The next columns are a repeat of the maximum portion of the
       depth range (now as a single integer) and the frequency that depth
       range was observed.  The minimum, maximum and range step size are
       controlled by the -c option.  Depths below the minimum and above the
       maximum are reported with ranges [<min] and [max<].
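
       A COV row as described above can be split into its range label, depth
       and frequency.  The Python sketch below assumes tab-separated rows
       beginning with the COV key; the function name parse_cov_row and the
       example rows are made up, and the value in the depth column of the
       overflow row is a guess rather than documented behaviour.

```python
import re

# Matches an inclusive range label such as "[1-1]" or "[10-19]".
RANGE = re.compile(r"^\[(\d+)-(\d+)\]$")


def parse_cov_row(row):
    """Split one COV row into label, bounds, depth and frequency."""
    fields = row.rstrip("\n").split("\t")
    label = fields[1]
    match = RANGE.match(label)
    # Open-ended labels such as "[<min]" or "[max<]" have no bounds.
    bounds = (int(match.group(1)), int(match.group(2))) if match else None
    return {
        "label": label,
        "bounds": bounds,
        "depth": int(fields[2]),
        "count": int(fields[3]),
    }


# Made-up rows, not real samtools output.
in_range = parse_cov_row("COV\t[1-1]\t1\t500\n")
overflow = parse_cov_row("COV\t[1000<]\t1000\t4\n")
```

       The first row parses to bounds (1, 1) with a count of 500; the
       overflow row keeps its label but has no bounds.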

       GCD reports the GC content of the reference data aligned against per
       alignment record, with one row per observed GC percentage reported as
       the first column and sorted on this column.  The second column is a
       total sequence percentile, as a running total (ending at 100%).  The
       first and second columns may be used to produce a simple distribution
       of GC content.  Subsequent columns list the coverage depth at 10th,
       25th, 50th, 75th and 90th GC percentiles for this specific GC
       percentage, revealing any GC bias in mapping.  These columns are
       averaged depths, so are floating point with no maximum value.

OPTIONS

       -c, --coverage MIN,MAX,STEP
               Set coverage distribution to the specified range (MIN, MAX,
               STEP all given as integers) [1,1000,1]

       -d, --remove-dups
               Exclude from statistics reads marked as duplicates

       -f, --required-flag STR|INT
               Required flag, 0 for unset. See also `samtools flags` [0]

       -F, --filtering-flag STR|INT
               Filtering flag, 0 for unset. See also `samtools flags` [0]

       --GC-depth FLOAT
               the size of GC-depth bins (decreasing bin size increases memory
               requirement) [2e4]

       -h, --help
               This help message

       -i, --insert-size INT
               Maximum insert size [8000]

       -I, --id STR
               Include only listed read group or sample name []

       -l, --read-length INT
               Include in the statistics only reads with the given read length
               [-1]

       -m, --most-inserts FLOAT
               Report only the main part of inserts [0.99]

       -P, --split-prefix STR
               A path or string prefix to prepend to filenames output when
               creating categorised statistics files with -S/--split.  [input
               filename]

       -q, --trim-quality INT
               The BWA trimming parameter [0]

       -r, --ref-seq FILE
               Reference sequence (required for GC-depth and
               mismatches-per-cycle calculation).  []

       -S, --split TAG
               In addition to the complete statistics, also output categorised
               statistics based on the tagged field TAG (e.g., use --split RG
               to split into read groups).

               Categorised statistics are written to files named
               <prefix>_<value>.bamstat, where prefix is as given by
               --split-prefix (or the input filename by default) and value has
               been encountered as the specified tagged field's value in one or
               more alignment records.
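
               The naming pattern can be written out as a one-line Python
               helper.  The function name split_stats_filename and the
               example prefix and tag value are hypothetical.

```python
def split_stats_filename(prefix, value):
    """Build a name following the <prefix>_<value>.bamstat pattern."""
    return "{}_{}.bamstat".format(prefix, value)


# With no --split-prefix the prefix is the input filename (per the text).
default_name = split_stats_filename("in.bam", "grp1")
custom_name = split_stats_filename("out/sample", "grp1")
```

               Under this pattern, splitting in.bam on a tag whose value is
               grp1 would be expected to produce in.bam_grp1.bamstat.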

       -t, --target-regions FILE
               Do stats in these regions only. Tab-delimited file chr,from,to,
               1-based, inclusive.  []
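
               Regions kept in BED form (0-based start, exclusive end) need
               their start shifted by one to fit the 1-based, inclusive
               convention above.  A minimal Python sketch, with a
               hypothetical function name and made-up coordinates:

```python
def bed_to_target_line(chrom, bed_start, bed_end):
    """Convert a BED interval to a chr/from/to line, 1-based inclusive."""
    # BED [start, end) covers the same bases as 1-based [start + 1, end].
    return "{}\t{}\t{}".format(chrom, bed_start + 1, bed_end)


# The first 100 bases of chr1 are 0..100 in BED and 1..100 here.
line = bed_to_target_line("chr1", 0, 100)
```

               The resulting line is "chr1", "1" and "100" separated by tabs.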

       -x, --sparse
               Suppress outputting IS rows where there are no insertions.

       -p, --remove-overlaps
               Remove overlaps of paired-end reads from coverage and base
               count computations.

       -g, --cov-threshold INT
               Only bases with coverage above this value will be included in
               the target percentage computation [0]

       -X      If this option is set, it allows the user to specify customized
               index file location(s) if the data folder does not contain
               any index file.  Example usage: samtools stats [options] -X
               /data_folder/data.bam /index_folder/data.bai chrM:1-10

       -@, --threads INT
               Number of input/output compression threads to use in addition
               to main thread [0].

AUTHOR

       Written by Petr Danecek with major modifications by Nicholas Clarke,
       Martin Pollard, Josh Randall, and Valeriu Ohan, all from the Sanger
       Institute.

SEE ALSO

       samtools(1), samtools-flagstat(1), samtools-idxstats(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.13                     7 July 2021                samtools-stats(1)