samtools-ampliconstats(1)     Bioinformatics tools     samtools-ampliconstats(1)

NAME
       samtools ampliconstats - produces statistics from amplicon sequencing
       alignment file

SYNOPSIS
       samtools ampliconstats [options] primers.bed in.sam|in.bam|in.cram...

DESCRIPTION
       samtools ampliconstats collects statistics from one or more input
       alignment files and produces tables in text format. The output can be
       visualized graphically using plot-ampliconstats.

       The alignment files should have previously been clipped of primer
       sequence, for example by "samtools ampliconclip", and the sites of
       these primers should be specified as a BED file in the arguments.
       Each amplicon must be present in the BED file with one or more LEFT
       primers (direction "+") followed by one or more RIGHT primers
       (direction "-"). For example:

           MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
           MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
           MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
           MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
           MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
           MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -

       Ampliconstats will identify which read belongs to which amplicon. For
       purposes of computing coverage statistics for amplicons with multiple
       primer choices, only the innermost primer locations are used.

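       The grouping of primers into amplicons and the choice of innermost
       primers can be sketched as follows. This is a minimal illustration,
       not the samtools implementation: it assumes a new amplicon starts
       whenever a "+" primer follows a "-" primer, and the helper names
       (parse_bed, group_amplicons, innermost) are hypothetical.

```python
from collections import namedtuple

Primer = namedtuple("Primer", "ref start end name strand")

BED_TEXT = """\
MN908947.3 1875 1897 nCoV-2019_7_LEFT 60 +
MN908947.3 1868 1890 nCoV-2019_7_LEFT_alt0 60 +
MN908947.3 2247 2269 nCoV-2019_7_RIGHT 60 -
MN908947.3 2242 2264 nCoV-2019_7_RIGHT_alt5 60 -
MN908947.3 2181 2205 nCoV-2019_8_LEFT 60 +
MN908947.3 2568 2592 nCoV-2019_8_RIGHT 60 -
"""

def parse_bed(text):
    """Parse whitespace-separated BED lines into Primer records."""
    primers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        ref, start, end, name, _score, strand = fields[:6]
        primers.append(Primer(ref, int(start), int(end), name, strand))
    return primers

def group_amplicons(primers):
    """Group primers: a run of '+' primers followed by a run of '-' primers."""
    amplicons = []
    current = None
    prev_strand = None
    for p in primers:
        starts_new = (
            current is None
            or p.ref != current["ref"]
            or (p.strand == "+" and prev_strand == "-")
        )
        if starts_new:
            current = {"ref": p.ref, "left": [], "right": []}
            amplicons.append(current)
        current["left" if p.strand == "+" else "right"].append(p)
        prev_strand = p.strand
    return amplicons

def innermost(amplicon):
    """Return (end of innermost LEFT primer, start of innermost RIGHT primer)."""
    left_end = max(p.end for p in amplicon["left"])
    right_start = min(p.start for p in amplicon["right"])
    return left_end, right_start

amplicons = group_amplicons(parse_bed(BED_TEXT))
for number, amp in enumerate(amplicons, start=1):
    print(number, innermost(amp))
```

       For the example BED file this yields two amplicons, with the first
       bounded by nCoV-2019_7_LEFT and nCoV-2019_7_RIGHT_alt5.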
       A summary of output sections is listed below, followed by more detailed
       descriptions.

       SS         Amplicon and file counts. Always comes first.
       AMPLICON   Amplicon primer locations
       FSS        File specific: summary stats
       FRPERC     File specific: read percentage distribution between amplicons
       FDEPTH     File specific: average read depth per amplicon
       FVDEPTH    File specific: average read depth per amplicon, full length only
       FREADS     File specific: numbers of reads per amplicon
       FPCOV      File specific: percent coverage per amplicon
       FTCOORD    File specific: template start,end coordinate frequencies per amplicon
       FAMP       File specific: amplicon correct / double / treble length counts
       FDP_ALL    File specific: template depth per reference base, all templates
       FDP_VALID  File specific: template depth per reference base,
                  valid templates only
       CSS        Combined summary stats
       CRPERC     Combined: read percentage distribution between amplicons
       CDEPTH     Combined: average read depth per amplicon
       CVDEPTH    Combined: average read depth per amplicon, full length only
       CREADS     Combined: numbers of reads per amplicon
       CPCOV      Combined: percent coverage per amplicon
       CTCOORD    Combined: template coordinates per amplicon
       CAMP       Combined: amplicon correct / double / treble length counts
       CDP_ALL    Combined: template depth per reference base, all templates
       CDP_VALID  Combined: template depth per reference base,
                  valid templates only

       File specific sections start with both the section key and the filename
       basename (minus directory and .sam, .bam or .cram suffix).

       Note that the file specific sections are interleaved, ordered first by
       file and secondly by the file specific stats. To collate them
       together, use "grep" to pull out all data of a specific type.

       The combined sections (C*) follow the same format as the file specific
       sections, with a different key. For simplicity of parsing they also
       have a filename column which is filled out with "COMBINED". These rows
       contain stats aggregated across all input files.

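       As an alternative to "grep", the interleaved output can be collated by
       section key in a few lines of Python. This is a sketch that assumes
       tab-separated columns with the key first and the file name second, and
       that lines starting with "#" are comments; the sample rows and the
       function name collate are made up for illustration.

```python
from collections import defaultdict

def collate(lines):
    """Group output rows by their section key (first column)."""
    by_key = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks a comment line
        fields = line.split("\t")
        by_key[fields[0]].append(fields[1:])
    return by_key

# Hypothetical rows, interleaved by file as described above.
sample = [
    "FREADS\tsampleA\t10\t20",
    "FDEPTH\tsampleA\t1.5\t2.0",
    "FREADS\tsampleB\t5\t7",
    "CREADS\tCOMBINED\t15\t27",
]

sections = collate(sample)
for row in sections["FREADS"]:
    print(row)
```

       Because the combined rows carry "COMBINED" in the file name column,
       the same function handles the C* sections without special cases.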
SS / AMPLICON
       The SS section is output once per file and includes summary
       information to be used for scaling of plots, for example the total
       number of amplicons and files present, tool version number, and
       command line arguments. The second column is the filename or
       "COMBINED". This is followed by the reference name (unless single-ref
       mode is enabled), and the summary statistic name and value.

       The AMPLICON section is a reformatting of the input BED file. Each
       line consists of the reference name (unless single-ref mode is
       enabled), the amplicon number and the start-end coordinates of the
       left and right primers. Where multiple primers are available these
       are comma separated, for example 10-30,15-40 in the left primer column
       indicates two primers have been multiplexed together, covering genome
       coordinates 10-30 inclusive and 15-40 inclusive.

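       A primer column from an AMPLICON line can be split into coordinate
       pairs as sketched below. The function name parse_primer_column is
       hypothetical, and the code assumes exactly the "start-end,start-end"
       layout described above.

```python
def parse_primer_column(column):
    """Split a column such as '10-30,15-40' into (start, end) pairs."""
    ranges = []
    for item in column.split(","):
        start, end = item.split("-")
        ranges.append((int(start), int(end)))
    return ranges

print(parse_primer_column("10-30,15-40"))
```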
CSS SECTION
       This section consists of summary counts for the entire set of input
       files. These may be useful for automatic scaling of plots.

       Number of amplicons   Total number of amplicons listed in primers.bed
       Number of files       Total number of SAM, BAM or CRAM files
       End of summary        Always the last item. Marker for end of CSS block.

FSS SECTION
       This lists summary statistics specific to an individual input file.
       The values reported are:

       raw total sequences   Total number of sequences found in the file
       filtered sequences    Number of sequences filtered with -F option
       failed primer match   Number of sequences that did not correspond to
                             a known primer location
       matching sequences    Number of sequences allocated to an amplicon

FREADS / CREADS SECTION
       For each amplicon, this simply reports the count of reads that have
       been assigned to it. A read is assigned to an amplicon if the start
       and/or end of the read is within a specified number of bases of the
       primer sites listed in the BED file. This distance is controlled via
       the -m option.

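       The margin test can be pictured with the sketch below. It is only an
       illustration of the rule as worded above, not the samtools code: the
       function assign_amplicon, the representation of each amplicon as lists
       of expected read start and end sites, and the coordinates used are all
       assumptions.

```python
def assign_amplicon(read_start, read_end, amplicons, margin):
    """Return the number of the first amplicon whose primer sites lie within
    `margin` bases of the read start or read end, or None if none match.

    amplicons: list of (number, left_sites, right_sites), where left_sites
    are positions a read is expected to start at and right_sites are
    positions it is expected to end at (hypothetical representation).
    """
    for number, left_sites, right_sites in amplicons:
        start_ok = any(abs(read_start - s) <= margin for s in left_sites)
        end_ok = any(abs(read_end - s) <= margin for s in right_sites)
        if start_ok or end_ok:
            return number
    return None

# Illustrative sites only.
amplicons = [(7, [1897], [2242]), (8, [2205], [2568])]
print(assign_amplicon(1899, 2100, amplicons, margin=5))
```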
FRPERC / CRPERC SECTION
       For each amplicon, this lists what percentage of reads were assigned to
       this amplicon out of the total number of assigned reads. This may be
       used to diagnose how uniform this distribution is.

       Note this is a pure read count and has no relation to amplicon size.

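       The percentage is a plain ratio of read counts, which can be written
       as the following sketch. The function name read_percentages and the
       counts are illustrative only.

```python
def read_percentages(counts):
    """Percentage of assigned reads falling in each amplicon."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # no assigned reads at all
    return [100.0 * c / total for c in counts]

print(read_percentages([50, 30, 20]))
```

       A perfectly uniform distribution over N amplicons would give 100/N
       percent for each, regardless of amplicon length.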
FDEPTH / CDEPTH / FVDEPTH / CVDEPTH SECTION
       Using the reads assigned to each amplicon and their start / end
       locations on that reference, computed using the POS and CIGAR fields,
       we compute the total number of bases aligned to this amplicon and the
       corresponding average depth. The VDEPTH variants are filtered to only
       include templates with end-to-end coverage across the amplicon. These
       can be considered to be "valid" or "usable" templates and give an
       indication of the minimum depth for the amplicon rather than the
       average depth.

       To compute the depth, the length of the amplicon is computed using the
       innermost set of primers, if multiple choices are listed in the BED
       file.

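       The two depth figures can be illustrated as below. This sketch assumes
       half-open (start, end) reference intervals, clips each interval to the
       amplicon bounds, and treats each interval as one template; these
       choices and the function names are assumptions, not a description of
       the samtools internals.

```python
def average_depth(intervals, amp_start, amp_end):
    """Total aligned bases within the amplicon divided by its length."""
    length = amp_end - amp_start
    bases = 0
    for start, end in intervals:
        overlap = min(end, amp_end) - max(start, amp_start)
        if overlap > 0:
            bases += overlap
    return bases / length

def valid_depth(intervals, amp_start, amp_end):
    """As average_depth, but only for intervals spanning the whole amplicon."""
    full = [(s, e) for s, e in intervals if s <= amp_start and e >= amp_end]
    return average_depth(full, amp_start, amp_end)

intervals = [(0, 100), (50, 100), (0, 50)]
print(average_depth(intervals, 0, 100), valid_depth(intervals, 0, 100))
```

       Here only the first interval covers the amplicon end to end, so the
       valid depth is lower than the average depth.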
FPCOV / CPCOV SECTION
       Similar to the FDEPTH section, this is a binary status of covered or
       not covered per position in each amplicon. This is then expressed as a
       percentage by dividing by the amplicon length, which is computed using
       the innermost set of primers covering this amplicon.

       The minimum depth required for a position to count as "covered" can be
       specified using the -d option.

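       In code, the calculation amounts to counting positions at or above the
       threshold. The sketch below takes a list of per-position depths for
       one amplicon; the function name percent_coverage is hypothetical and
       min_depth stands in for the value given to -d.

```python
def percent_coverage(depths, min_depth):
    """Percentage of positions whose depth is at least min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)

depths = [0, 1, 2, 3]
print(percent_coverage(depths, min_depth=1))
print(percent_coverage(depths, min_depth=2))
```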
FTCOORD / CTCOORD / FAMP / CAMP SECTION
       It is possible for an amplicon to be produced using incorrect primers,
       giving rise to extra-long amplicons (typically double or treble
       length).

       The FTCOORD field holds a distribution of observed template coordinates
       from the input data. Each row consists of the file name, the amplicon
       number in question, and tab separated tuples of start, end, frequency
       and status (0 for OK, 1 for skipping amplicon, 2 for unknown location).
       Each template is only counted for one amplicon, so if the read-pairs
       span amplicons the count will show up in the left-most amplicon
       covered.

       The COORD data may indicate which primers are being utilised if there
       are alternates available for a given amplicon.

       For COORD lines, amplicon number 0 holds the frequency data for reads
       that have not been assigned to any amplicon. That is, they may lie
       within an amplicon, but they do not start or end at a known primer
       location. It is not recorded for BED files containing multiple
       references.

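       A COORD row could be unpacked as sketched below. The exact tuple
       syntax is not spelled out above, so this assumes each tab-separated
       tuple is written as "start,end,frequency,status" and that the section
       key precedes the file name; the sample line and the function name
       parse_tcoord are made up.

```python
STATUS = {0: "OK", 1: "skipping amplicon", 2: "unknown location"}

def parse_tcoord(line):
    """Split a COORD row into key, file name, amplicon number and tuples."""
    fields = line.rstrip("\n").split("\t")
    key, name, amplicon = fields[0], fields[1], int(fields[2])
    tuples = []
    for item in fields[3:]:
        start, end, freq, status = (int(x) for x in item.split(","))
        tuples.append((start, end, freq, STATUS.get(status, "?")))
    return key, name, amplicon, tuples

# Hypothetical row: two observed template coordinate pairs for amplicon 7.
line = "FTCOORD\tsampleA\t7\t1898,2241,120,0\t1898,2567,3,1"
print(parse_tcoord(line))
```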
       The FAMP / CAMP section is a simple count per amplicon of the number of
       templates coming from this amplicon. Templates are counted once per
       amplicon and, as with the FTCOORD field, if a read-pair spans
       amplicons it is only counted in the left-most amplicon. Each line
       consists of the file name, amplicon number and 3 counts: the number of
       templates with both ends within this amplicon, the number of templates
       with the rightmost end in another amplicon, and the number of
       templates where the other end has failed to be assigned to an
       amplicon.

       Note FAMP / CAMP amplicon number 0 is the summation of data for all
       amplicons (1 onwards).

FDP_ALL / CDP_ALL / FDP_VALID / CDP_VALID SECTION
       These are for depth plots per base rather than per amplicon. They
       distinguish between all reads in all templates, and only reads in
       templates considered to be "valid". Such templates have both reads (if
       paired) matching known primer locations from the same amplicon and
       have full length coverage across the entire amplicon.

       FDP_VALID can be considered to be the minimum template depth across
       the amplicon.

       The difference between the VALID and ALL plots represents additional
       data that for some reason may not be suitable for producing a
       consensus. For example an amplicon that skips a primer, pairing
       10_LEFT with 12_RIGHT, will have coverage for the first half of
       amplicon 10 and the last half of amplicon 12. Counting the number of
       reads or bases alone in the amplicon does not reveal the potential for
       non-uniformity of coverage.

       The lines start with the type keyword, file / sample name, reference
       name (unless single-ref mode is enabled), followed by a variable number
       of tab separated tuples consisting of depth,length. The length field
       is a basic form of run-length encoding where all depth values within a
       specified fraction of each other (e.g. >= (1-fract)*midpoint and <=
       (1+fract)*midpoint) are combined into a single run. This fraction is
       controlled via the -D option.

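       To recover a per-base depth profile from the depth,length tuples, each
       run can be expanded as in this sketch. It assumes integer depth and
       length values; the function name expand_runs and the sample tuples are
       hypothetical. Note that when merging is active the expanded values are
       approximations of the original depths.

```python
def expand_runs(tuples):
    """Expand 'depth,length' runs into one depth value per reference base."""
    depths = []
    for item in tuples:
        depth, length = (int(x) for x in item.split(","))
        depths.extend([depth] * length)
    return depths

print(expand_runs(["10,3", "0,2"]))
```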
OPTIONS
       -f, --required-flag INT|STR
              Only output alignments with all bits set in INT present in the
              FLAG field. INT can be specified in hex by beginning with `0x'
              (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
              /^0[0-7]+/) [0], or in string form by specifying a
              comma-separated list of keywords as listed by the "samtools
              flags" subcommand.

       -F, --filter-flag INT|STR
              Do not output alignments with any bits set in INT present in
              the FLAG field. INT can be specified in hex by beginning with
              `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
              (i.e. /^0[0-7]+/) [0], or in string form by specifying a
              comma-separated list of keywords as listed by the "samtools
              flags" subcommand.

       -a, --max-amplicons INT
              Specify the maximum number of amplicons permitted.

       -b, --tcoord-bin INT
              Bin the template start,end positions into multiples of INT
              prior to counting their frequency and reporting in the FTCOORD
              / CTCOORD lines. This may be useful for technologies with
              higher error rates where the alignment ends will vary slightly.
              Defaults to 1, which is equivalent to no binning.

       -c, --tcoord-min-count INT
              In the FTCOORD and CTCOORD lines, only record template
              start,end coordinate combinations if they occur at least INT
              times.

       -d, --min-depth INT
              Specifies the minimum base depth to consider a reference
              position to be covered, for purposes of the FPCOV and CPCOV
              sections.

       -D, --depth-bin FRACTION
              Controls the merging of neighbouring similar depths for the
              FDP_ALL and FDP_VALID plots. The default FRACTION is 0.01,
              meaning depths within +/- 1% of a mid point will be aggregated
              together as a run of the same value. This merging is useful to
              reduce the file size. Use -D 0 to record every depth.

       -l, --max-amplicon-length INT
              Specifies the maximum length of any individual amplicon.

       -m, --pos-margin INT
              Reads are compared against the primer start and end locations
              specified in the BED file. An aligned sequence should start
              precisely at these locations, but sequencing errors may cause
              the primer clipping to be a few bases out or for the alignment
              to add a few extra bases of soft clip. This option specifies
              the margin of error permitted when matching a read to an
              amplicon number.

       -o FILE
              Output stats to FILE. The default is to write to stdout.

       -s, --use-sample-name
              Instead of using the basename component of the input path
              names, use the SM field from the first @RG header line.

       -S, --single-ref
              Force the output format to match the older single-reference
              style used in Samtools 1.12 and earlier. This removes the
              reference names from the SS, AMPLICON, DP_ALL and DP_VALID
              sections. It cannot be enabled if the input BED file has more
              than one reference present. Note that plot-ampliconstats can
              process both output styles.

       -t, --tlen-adjust INT
              Adjust the TLEN field by +/- INT to compensate for primer
              clipping. This defaults to zero, but if the primers have been
              clipped and the TLEN field has not been updated using samtools
              fixmate then the template length will be wrong by the sum of
              the forward and reverse primer lengths.

              This adjustment does not have to be precise as the --pos-margin
              field permits some leeway. Hence if required, it should be set
              to approximately double the average primer length.

       -@ INT  Number of BAM/CRAM (de)compression threads to use in addition
               to main thread [0].

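       The numeric forms accepted by -f and -F, and the way the two masks
       combine, can be illustrated with the sketch below. It covers only the
       integer forms (hex, octal, decimal) and leaves out the keyword form;
       the function names parse_flag_int and passes are hypothetical and the
       code is not taken from samtools.

```python
def parse_flag_int(text):
    """Parse a flag given as hex (0x...), octal (0...) or decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)

def passes(flag, required=0, filtered=0):
    """True if all `required` bits are set and no `filtered` bit is set."""
    return (flag & required) == required and (flag & filtered) == 0

# FLAG 99 has bits 0x1, 0x2, 0x20 and 0x40 set.
print(passes(99, required=parse_flag_int("0x1"), filtered=parse_flag_int("4")))
```

       With both masks left at 0, every alignment passes, matching the [0]
       defaults shown above.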
EXAMPLE
       To run ampliconstats on a directory full of CRAM files and then produce
       a series of PNG images named "mydata*.png":

           samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
           plot-ampliconstats -size 1200,900 mydata astats

AUTHOR
       Written by James Bonfield from the Sanger Institute.

SEE ALSO
       samtools(1), samtools-ampliconclip(1), samtools-stats(1),
       samtools-flags(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.13                   7 July 2021          samtools-ampliconstats(1)