samtools-ampliconstats(1)

samtools-ampliconstats(1)    Bioinformatics tools    samtools-ampliconstats(1)

NAME

       samtools ampliconstats - produces statistics from amplicon sequencing
       alignment files

SYNOPSIS

       samtools ampliconstats [options] primers.bed in.sam|in.bam|in.cram...

DESCRIPTION

       samtools ampliconstats collects statistics from one or more input
       alignment files and produces tables in text format.  The output can be
       visualized graphically using plot-ampliconstats.

       The alignment files should have previously been clipped of primer
       sequence, for example by "samtools ampliconclip", and the sites of
       these primers should be specified as a bed file in the arguments.  Each
       amplicon must be present in the bed file with one or more LEFT primers
       (direction "+") followed by one or more RIGHT primers (direction "-").
       For example:

         MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
         MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
         MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
         MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
         MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
         MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -

       Ampliconstats will identify which read belongs to which amplicon.  For
       purposes of computing coverage statistics for amplicons with multiple
       primer choices, only the innermost primer locations are used.

       A summary of output sections is listed below, followed by more detailed
       descriptions.

       SS                     Amplicon and file counts.  Always comes first
       AMPLICON               Amplicon primer locations
       FSS                    File specific: summary stats
       FRPERC                 File specific: read percentage distribution between amplicons
       FDEPTH                 File specific: average read depth per amplicon
       FVDEPTH                File specific: average read depth per amplicon, full length only
       FREADS                 File specific: numbers of reads per amplicon
       FPCOV                  File specific: percent coverage per amplicon
       FTCOORD                File specific: template start,end coordinate frequencies per amplicon
       FAMP                   File specific: amplicon correct / double / treble length counts
       FDP_ALL                File specific: template depth per reference base, all templates
       FDP_VALID              File specific: template depth per reference base, valid templates only
       CSS                    Combined: summary stats
       CRPERC                 Combined: read percentage distribution between amplicons
       CDEPTH                 Combined: average read depth per amplicon
       CVDEPTH                Combined: average read depth per amplicon, full length only
       CREADS                 Combined: numbers of reads per amplicon
       CPCOV                  Combined: percent coverage per amplicon
       CTCOORD                Combined: template coordinates per amplicon
       CAMP                   Combined: amplicon correct / double / treble length counts
       CDP_ALL                Combined: template depth per reference base, all templates
       CDP_VALID              Combined: template depth per reference base, valid templates only

       File specific sections start with both the section key and the filename
       basename (minus directory and .sam, .bam or .cram suffix).

       Note that the file specific sections are interleaved, ordered first by
       file and secondly by the file specific stats.  To collate them
       together, use "grep" to pull out all data of a specific type.

       The combined sections (C*) follow the same format as the file specific
       sections, with a different key.  For simplicity of parsing they also
       have a filename column which is filled out with "COMBINED".  These rows
       contain stats aggregated across all input files.
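
       As an alternative to "grep", the interleaved rows can be collated by
       section key in a few lines of Python.  This is a minimal sketch: the
       sample rows and their values are invented for illustration, and it
       assumes only that every row is tab separated with the section key in
       the first column and the file name (or "COMBINED") in the second.

```python
from collections import defaultdict

# Hypothetical excerpt of ampliconstats output.  The values are made up;
# only the leading "key<TAB>name" layout follows the description above.
SAMPLE = (
    "FREADS\tsampleA\t120\t98\n"
    "FDEPTH\tsampleA\t35.2\t28.9\n"
    "FREADS\tsampleB\t110\t101\n"
    "CREADS\tCOMBINED\t230\t199\n"
)


def collate(text):
    """Group tab-separated rows by their section key (first column)."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        key, *fields = line.split("\t")
        sections[key].append(fields)
    return sections


sections = collate(SAMPLE)
for name, *counts in sections["FREADS"]:
    print(name, counts)
```

       Because the combined rows carry "COMBINED" in the file name column, the
       same loop can read a C* section without special handling.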

SS / AMPLICON

       This section is reported once per file and includes summary information
       to be utilised for scaling of plots, for example the total number of
       amplicons and files present, tool version number, and command line
       arguments.  The second column is the filename or "COMBINED".  This is
       followed by the reference name (unless single-ref mode is enabled), and
       the summary statistic name and value.

       The AMPLICON section is a reformatting of the input BED file.  Each
       line consists of the reference name (unless single-ref mode is
       enabled), the amplicon number and the start-end coordinates of the left
       and right primers.  Where multiple primers are available these are
       comma separated, for example 10-30,15-40 in the left primer column
       indicates that two primers have been multiplexed together, covering
       genome coordinates 10-30 and 15-40 inclusive.
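
       A primer column of this form can be split into coordinate pairs as
       sketched below.  The function name parse_primers is hypothetical; the
       only format assumed is the comma separated list of start-end ranges
       described above.

```python
def parse_primers(column):
    """Split a primer column such as "10-30,15-40" into (start, end) pairs."""
    pairs = []
    for item in column.split(","):
        start, end = item.split("-")
        pairs.append((int(start), int(end)))
    return pairs


left = parse_primers("10-30,15-40")
print(left)  # [(10, 30), (15, 40)]
```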

CSS SECTION

       This section consists of summary counts for the entire set of input
       files.  These may be useful for automatic scaling of plots.

       Number of amplicons   Total number of amplicons listed in primers.bed
       Number of files       Total number of SAM, BAM or CRAM files
       End of summary        Always the last item.  Marker for end of CSS block.

FSS SECTION

       This lists summary statistics specific to an individual input file.
       The values reported are:

       raw total sequences   Total number of sequences found in the file
       filtered sequences    Number of sequences filtered with -F option
       failed primer match   Number of sequences that did not correspond to
                             a known primer location
       matching sequences    Number of sequences allocated to an amplicon

FREADS / CREADS SECTION

       For each amplicon, this simply reports the count of reads that have
       been assigned to it.  A read is assigned to an amplicon if the start
       and/or end of the read is within a specified number of bases of the
       primer sites listed in the bed file.  This distance is controlled via
       the -m option.
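
       The assignment rule can be pictured with the sketch below.  It is an
       illustration of the rule as worded above, not the tool's actual code:
       the function name, the dictionary layout, the margin of 5 and the read
       coordinates are all assumptions, and coordinate-system details (0-based
       versus 1-based) are ignored.  The primer sites are taken from the BED
       example in DESCRIPTION, using the end of each LEFT primer and the start
       of each RIGHT primer.

```python
# Expected read start (end of a LEFT primer) and read end (start of a RIGHT
# primer) for each amplicon, taken from the BED example in DESCRIPTION.
AMPLICONS = {
    7: {"left": [1897, 1890], "right": [2247, 2242]},
    8: {"left": [2205], "right": [2568]},
}


def assign_amplicon(start, end, amplicons, margin):
    """Return the first amplicon whose primer sites are within `margin`
    bases of the read start and/or end, or None if there is no match."""
    for number, sites in amplicons.items():
        near_left = any(abs(start - p) <= margin for p in sites["left"])
        near_right = any(abs(end - p) <= margin for p in sites["right"])
        if near_left or near_right:
            return number
    return None


print(assign_amplicon(1899, 2100, AMPLICONS, margin=5))  # 7
print(assign_amplicon(2300, 2566, AMPLICONS, margin=5))  # 8
print(assign_amplicon(2000, 2100, AMPLICONS, margin=5))  # None
```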

FRPERC / CRPERC SECTION

       For each amplicon, this lists what percentage of reads were assigned to
       this amplicon out of the total number of assigned reads.  This may be
       used to diagnose how uniform this distribution is.

       Note this is a pure read count and has no relation to amplicon size.
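
       The percentage is a plain ratio of read counts, as in this sketch.  The
       counts are invented and the function name is hypothetical.

```python
def read_percentages(counts):
    """Percentage of assigned reads per amplicon, from per-amplicon counts."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # no assigned reads at all
    return [100.0 * c / total for c in counts]


print(read_percentages([120, 98, 182]))  # [30.0, 24.5, 45.5]
```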

FDEPTH / CDEPTH / FVDEPTH / CVDEPTH SECTION

       Using the reads assigned to each amplicon and their start / end
       locations on that reference, computed using the POS and CIGAR fields,
       we compute the total number of bases aligned to this amplicon and the
       corresponding average depth.  The VDEPTH variants are filtered to only
       include templates with end-to-end coverage across the amplicon.  These
       can be considered to be "valid" or "usable" templates and give an
       indication of the minimum depth for the amplicon rather than the
       average depth.

       To compute the depth, the length of the amplicon is computed using the
       innermost set of primers, if multiple choices are listed in the bed
       file.
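
       The arithmetic can be sketched as follows.  This assumes that the
       "innermost" primers are the LEFT primer with the largest end and the
       RIGHT primer with the smallest start, and that the amplicon length is
       the distance between those two coordinates; both are readings of the
       text above, not confirmed details of the implementation.  The primer
       coordinates come from the BED example in DESCRIPTION and the base count
       is invented.

```python
def innermost_span(left_primers, right_primers):
    """Return (start, end) of the region between the innermost primers.

    Assumption: innermost means the largest LEFT primer end and the
    smallest RIGHT primer start.
    """
    inner_start = max(end for _start, end in left_primers)
    inner_end = min(start for start, _end in right_primers)
    return inner_start, inner_end


def average_depth(total_aligned_bases, left_primers, right_primers):
    """Average depth = aligned bases / amplicon length (innermost primers)."""
    inner_start, inner_end = innermost_span(left_primers, right_primers)
    return total_aligned_bases / (inner_end - inner_start)


# Amplicon 7 from the BED example: two LEFT and two RIGHT primer choices.
left = [(1875, 1897), (1868, 1890)]
right = [(2247, 2269), (2242, 2264)]

print(innermost_span(left, right))        # (1897, 2242)
print(average_depth(34500, left, right))  # 100.0
```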

FPCOV / CPCOV SECTION

       Similar to the FDEPTH section, this is a binary status of covered or
       not covered per position in each amplicon.  This is then expressed as a
       percentage by dividing by the amplicon length, which is computed using
       the innermost set of primers covering this amplicon.

       The minimum depth necessary to constitute a position as being "covered"
       is specifiable using the -d option.
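
       A minimal sketch of the calculation, assuming a list holding the depth
       at each position of one amplicon.  The depths, the function name and
       the default threshold of 1 are illustrative only; the threshold plays
       the role of the -d option.

```python
def percent_coverage(depths, min_depth=1):
    """Percentage of positions whose depth is at least `min_depth`."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


depths = [0, 3, 5, 12, 0, 7, 9, 1]  # hypothetical per-position depths
print(percent_coverage(depths))               # 75.0
print(percent_coverage(depths, min_depth=5))  # 50.0
```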

FTCOORD / CTCOORD / FAMP / CAMP SECTION

       It is possible for an amplicon to be produced using incorrect primers,
       giving rise to extra-long amplicons (typically double or treble
       length).

       The FTCOORD field holds a distribution of observed template coordinates
       from the input data.  Each row consists of the file name, the amplicon
       number in question, and tab separated tuples of start, end, frequency
       and status (0 for OK, 1 for skipping amplicon, 2 for unknown location).
       Each template is only counted for one amplicon, so if the read-pairs
       span amplicons the count will show up in the left-most amplicon
       covered.

       The COORD data may indicate which primers are being utilised if there
       are alternates available for a given amplicon.

       For COORD lines, amplicon number 0 holds the frequency data for reads
       that have not been assigned to any amplicon.  That is, they may lie
       within an amplicon, but they do not start or end at a known primer
       location.  It is not recorded for BED files containing multiple
       references.

       The FAMP / CAMP section is a simple count per amplicon of the number of
       templates coming from this amplicon.  Templates are counted once per
       amplicon and, like the FTCOORD field, if a read-pair spans amplicons it
       is only counted in the left-most amplicon.  Each line consists of the
       file name, amplicon number and 3 counts: the number of templates with
       both ends within this amplicon, the number of templates with the
       rightmost end in another amplicon, and the number of templates where
       the other end has failed to be assigned to an amplicon.

       Note FAMP / CAMP amplicon number 0 is the summation of data for all
       amplicons (1 onwards).
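
       The FTCOORD / CTCOORD tuples described above can be unpacked as
       sketched below.  The sample row is invented, and the sketch assumes
       that the row begins with the section key and that the four values
       inside each tuple are comma separated; check both against real output
       before relying on them.

```python
from typing import NamedTuple

# Status codes as listed in the FTCOORD description.
STATUS = {0: "OK", 1: "skipping amplicon", 2: "unknown location"}


class TemplateCoord(NamedTuple):
    start: int
    end: int
    frequency: int
    status: int


def parse_tcoord(line):
    """Parse one FTCOORD/CTCOORD row into (key, name, amplicon, tuples)."""
    key, name, amplicon, *tuples = line.rstrip("\n").split("\t")
    coords = [TemplateCoord(*map(int, t.split(","))) for t in tuples]
    return key, name, int(amplicon), coords


# Hypothetical row: key, file name, amplicon number, then two tuples.
ROW = "FTCOORD\tsampleA\t7\t1897,2242,850,0\t1897,2568,12,1"

key, name, amplicon, coords = parse_tcoord(ROW)
for c in coords:
    print(amplicon, c.start, c.end, c.frequency, STATUS[c.status])
```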

FDP_ALL / CDP_ALL / FDP_VALID / CDP_VALID SECTION

       These are for depth plots per base rather than per amplicon.  They
       distinguish between all reads in all templates, and only reads in
       templates considered to be "valid".  Such templates have both reads (if
       paired) matching known primer locations from the same amplicon and have
       full length coverage across the entire amplicon.

       FDP_VALID can be considered to be the minimum template depth across the
       amplicon.

       The difference between the VALID and ALL plots represents additional
       data that for some reason may not be suitable for producing a
       consensus.  For example an amplicon that skips a primer, pairing
       10_LEFT with 12_RIGHT, will have coverage for the first half of
       amplicon 10 and the last half of amplicon 12.  Counting the number of
       reads or bases alone in the amplicon does not reveal the potential for
       non-uniformity of coverage.

       The lines start with the type keyword, file / sample name, reference
       name (unless single-ref mode is enabled), followed by a variable number
       of tab separated tuples consisting of depth,length.  The length field
       is a basic form of run-length encoding where all depth values within a
       specified fraction of each other (e.g. >= (1-fract)*midpoint and <=
       (1+fract)*midpoint) are combined into a single run.  This fraction is
       controlled via the -D option.
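
       The run-length encoding can be expanded back into one depth per base as
       sketched below.  The tuples are invented and the function names are
       hypothetical.  The second helper restates the tolerance test quoted
       above; how the tool chooses the midpoint of a run is not described
       here, so it is left as a parameter.

```python
def decode_runs(tuples):
    """Expand "depth,length" tuples into a list with one depth per base."""
    depths = []
    for item in tuples:
        depth, length = (int(x) for x in item.split(","))
        depths.extend([depth] * length)
    return depths


def within_run(depth, midpoint, fract):
    """True if depth lies within +/- fract of midpoint (the -D tolerance)."""
    return (1 - fract) * midpoint <= depth <= (1 + fract) * midpoint


# Hypothetical tuples as they would appear after the leading columns.
print(decode_runs(["12,3", "15,2"]))  # [12, 12, 12, 15, 15]
print(within_run(100.5, 100, 0.01))   # True
print(within_run(102, 100, 0.01))     # False
```

       Because similar depths are merged into a run, the decoded values are
       approximations unless the data was produced with -D 0.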

OPTIONS

       -f, --required-flag INT|STR
               Only output alignments with all bits set in INT present in the
               FLAG field.  INT can be specified in hex by beginning with `0x'
               (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
               /^0[0-7]+/) [0], or in string form by specifying a
               comma-separated list of keywords as listed by the "samtools
               flags" subcommand.

       -F, --filter-flag INT|STR
               Do not output alignments with any bits set in INT present in
               the FLAG field.  INT can be specified in hex by beginning with
               `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
               (i.e. /^0[0-7]+/) [0], or in string form by specifying a
               comma-separated list of keywords as listed by the "samtools
               flags" subcommand.

       -a, --max-amplicons INT
               Specify the maximum number of amplicons permitted.

       -b, --tcoord-bin INT
               Bin the template start,end positions into multiples of INT
               prior to counting their frequency and reporting in the FTCOORD
               / CTCOORD lines.  This may be useful for technologies with
               higher error rates where the alignment ends will vary slightly.
               Defaults to 1, which is equivalent to no binning.

       -c, --tcoord-min-count INT
               In the FTCOORD and CTCOORD lines, only record template
               start,end coordinate combinations if they occur at least INT
               times.

       -d, --min-depth INT
               Specifies the minimum base depth to consider a reference
               position to be covered, for purposes of the FPCOV and CPCOV
               sections.

       -D, --depth-bin FRACTION
               Controls the merging of neighbouring similar depths for the
               FDP_ALL and FDP_VALID plots.  The default FRACTION is 0.01,
               meaning depths within +/- 1% of a mid point will be aggregated
               together as a run of the same value.  This merging is useful to
               reduce the file size.  Use -D 0 to record every depth.

       -l, --max-amplicon-length INT
               Specifies the maximum length of any individual amplicon.

       -m, --pos-margin INT
               Reads are compared against the primer start and end locations
               specified in the BED file.  An aligned sequence should start
               precisely at these locations, but sequencing errors may cause
               the primer clipping to be a few bases out or for the alignment
               to add a few extra bases of soft clip.  This option specifies
               the margin of error permitted when matching a read to an
               amplicon number.

       -o FILE
               Output stats to FILE.  The default is to write to stdout.

       -s, --use-sample-name
               Instead of using the basename component of the input path
               names, use the SM field from the first @RG header line.

       -S, --single-ref
               Force the output format to match the older single-reference
               style used in Samtools 1.12 and earlier.  This removes the
               reference names from the SS, AMPLICON, DP_ALL and DP_VALID
               sections.  It cannot be enabled if the input BED file has more
               than one reference present.  Note that plot-ampliconstats can
               process both output styles.

       -t, --tlen-adjust INT
               Adjust the TLEN field by +/- INT to compensate for primer
               clipping.  This defaults to zero, but if the primers have been
               clipped and the TLEN field has not been updated using samtools
               fixmate then the template length will be wrong by the sum of
               the forward and reverse primer lengths.

               This adjustment does not have to be precise as the --pos-margin
               option permits some leeway.  Hence if required, it should be
               set to approximately double the average primer length.

       -@ INT  Number of BAM/CRAM (de)compression threads to use in addition
               to main thread [0].
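
       The numeric forms accepted by -f and -F, and the way the two masks
       combine, can be sketched as follows.  The function names are
       hypothetical and the keyword (string) form is deliberately not handled.
       The sketch only restates the rules given above: -f requires all of its
       bits to be set, and -F rejects a read if any of its bits are set.

```python
def parse_flag_int(text):
    """Parse a FLAG mask given in hex (0x...), octal (0...) or decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if text.startswith("0") and len(text) > 1:
        return int(text, 8)
    return int(text, 10)


def keep_alignment(flag, required=0, filtered=0):
    """Apply -f (all bits must be set) and -F (no bit may be set)."""
    has_required = (flag & required) == required
    has_filtered = (flag & filtered) != 0
    return has_required and not has_filtered


required = parse_flag_int("0x2")    # example mask for -f
filtered = parse_flag_int("0x904")  # example mask for -F

print(keep_alignment(99, required, filtered))    # True
print(keep_alignment(2147, required, filtered))  # False: a filtered bit is set
print(keep_alignment(97, required, filtered))    # False: required bit missing
```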

EXAMPLE

       To run ampliconstats on a directory full of CRAM files and then produce
       a series of PNG images named "mydata*.png":

               samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
               plot-ampliconstats -size 1200,900 mydata astats

AUTHOR

       Written by James Bonfield from the Sanger Institute.