samtools-ampliconstats(1) Bioinformatics tools samtools-ampliconstats(1)
2
NAME
samtools ampliconstats - produces statistics from amplicon sequencing
alignment files
8
SYNOPSIS
samtools ampliconstats [options] primers.bed in.sam|in.bam|in.cram...
11
DESCRIPTION
samtools ampliconstats collects statistics from one or more input
alignment files and produces tables in text format. The output can be
visualized graphically using plot-ampliconstats.

The alignment files should previously have been clipped of primer
sequence, for example by "samtools ampliconclip", and the sites of these
primers should be specified as a BED file in the arguments. Each
amplicon must be present in the BED file with one or more LEFT primers
(direction "+") followed by one or more RIGHT primers (direction "-").
For example:

    MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
    MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
    MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
    MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
    MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
    MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -

Ampliconstats will identify which read belongs to which amplicon. For
purposes of computing coverage statistics for amplicons with multiple
primer choices, only the innermost primer locations are used.
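As a minimal sketch of the innermost-primer rule (not the samtools
implementation), the Python below takes the BED rows of a single amplicon
and returns the interval between the innermost LEFT and RIGHT primers.
The function name innermost_interval is hypothetical, and treating the
BED columns as 0-based, half-open coordinates is an assumption.

```python
def innermost_interval(rows):
    """rows: (chrom, start, end, name, score, strand) tuples for ONE amplicon.

    Assumption: "+" rows are LEFT primers and "-" rows are RIGHT primers.
    """
    left_ends = [end for (_, _, end, _, _, strand) in rows if strand == "+"]
    right_starts = [start for (_, start, _, _, _, strand) in rows if strand == "-"]
    # Innermost = the LEFT primer ending furthest right and the
    # RIGHT primer starting furthest left.
    return max(left_ends), min(right_starts)


# Amplicon 7 from the BED example above.
amplicon7 = [
    ("MN908947.3", 1875, 1897, "nCoV-2019_7_LEFT", 60, "+"),
    ("MN908947.3", 1868, 1890, "nCoV-2019_7_LEFT_alt0", 60, "+"),
    ("MN908947.3", 2247, 2269, "nCoV-2019_7_RIGHT", 60, "-"),
    ("MN908947.3", 2242, 2264, "nCoV-2019_7_RIGHT_alt5", 60, "-"),
]
inner_start, inner_end = innermost_interval(amplicon7)
```

For amplicon 7 this gives the interval 1897 to 2242, bounded by
nCoV-2019_7_LEFT on the left and nCoV-2019_7_RIGHT_alt5 on the right.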
36
A summary of output sections is listed below, followed by more detailed
descriptions.

SS         Amplicon and file counts. Always comes first
AMPLICON   Amplicon primer locations
FSS        File specific: summary stats
FRPERC     File specific: read percentage distribution between amplicons
FDEPTH     File specific: average read depth per amplicon
FVDEPTH    File specific: average read depth per amplicon, full length
           only
FREADS     File specific: numbers of reads per amplicon
FPCOV      File specific: percent coverage per amplicon
FTCOORD    File specific: template start,end coordinate frequencies per
           amplicon
FAMP       File specific: amplicon correct / double / treble length
           counts
FDP_ALL    File specific: template depth per reference base, all
           templates
FDP_VALID  File specific: template depth per reference base, valid
           templates only
CSS        Combined summary stats
CRPERC     Combined: read percentage distribution between amplicons
CDEPTH     Combined: average read depth per amplicon
CVDEPTH    Combined: average read depth per amplicon, full length only
CREADS     Combined: numbers of reads per amplicon
CPCOV      Combined: percent coverage per amplicon
CTCOORD    Combined: template coordinates per amplicon
CAMP       Combined: amplicon correct / double / treble length counts
CDP_ALL    Combined: template depth per reference base, all templates
CDP_VALID  Combined: template depth per reference base, valid templates
           only

File specific sections start with both the section key and the filename
basename (minus directory and .sam, .bam or .cram suffix).

Note that the file specific sections are interleaved, ordered first by
file and secondly by the file specific stats. To collate them together,
use "grep" to pull out all data of a specific type.
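The same collation can be done while parsing the output in a script. The
sketch below is a hypothetical helper (collate is not part of samtools)
that keeps only the rows whose first tab-separated column matches a
section key. The sample rows are invented for illustration; only their
layout (key, file name, then values) follows the description above.

```python
def collate(lines, key):
    """Return the tab-split fields of every line whose first column is key."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == key:
            rows.append(fields)
    return rows


# Made-up, interleaved output for two input files.
example_output = [
    "FREADS\tsampleA\t120\t98\n",
    "FDEPTH\tsampleA\t35.2\t28.9\n",
    "FREADS\tsampleB\t110\t101\n",
    "FDEPTH\tsampleB\t33.0\t30.1\n",
]
freads = collate(example_output, "FREADS")
```

Matching on the whole first column, rather than on a substring, avoids
picking up a key that merely occurs inside a file name.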
76
The combined sections (C*) follow the same format as the file specific
sections, with a different key. For simplicity of parsing they also
have a filename column which is filled out with "COMBINED". These rows
contain stats aggregated across all input files.
81
SS / AMPLICON
This section appears once per file and includes summary information to
be utilised for scaling of plots, for example the total number of
amplicons and files present, tool version number, and command line
arguments. The second column is the filename or "COMBINED". This is
followed by the reference name (unless single-ref mode is enabled), and
the summary statistic name and value.

The AMPLICON section is a reformatting of the input BED file. Each
line consists of the reference name (unless single-ref mode is enabled),
the amplicon number and the start-end coordinates of the left and right
primers. Where multiple primers are available these are comma
separated, for example 10-30,15-40 in the left primer column indicates
two primers have been multiplexed together covering genome coordinates
10-30 inclusive and 15-40 inclusive.
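A primer column of this form can be split into coordinate pairs with a
few lines of Python. This is only a sketch; parse_primer_column is a
hypothetical helper name.

```python
def parse_primer_column(field):
    """Turn "10-30,15-40" into [(10, 30), (15, 40)]."""
    primers = []
    for span in field.split(","):
        start, end = span.split("-")
        primers.append((int(start), int(end)))
    return primers


left_primers = parse_primer_column("10-30,15-40")
```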
98
CSS SECTION
This section consists of summary counts for the entire set of input
files. These may be useful for automatic scaling of plots.

Number of amplicons   Total number of amplicons listed in primers.bed
Number of files       Total number of SAM, BAM or CRAM files
End of summary        Always the last item. Marker for end of CSS block.
109
FSS SECTION
This lists summary statistics specific to an individual input file.
The values reported are:

raw total sequences   Total number of sequences found in the file
filtered sequences    Number of sequences filtered with -F option
failed primer match   Number of sequences that did not correspond to
                      a known primer location
matching sequences    Number of sequences allocated to an amplicon
122
FREADS / CREADS SECTION
For each amplicon, this simply reports the count of reads that have
been assigned to it. A read is assigned to an amplicon if the start
and/or end of the read is within a specified number of bases of the
primer sites listed in the BED file. This distance is controlled via
the -m option.
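The matching rule can be pictured with the following sketch. It is an
illustration under stated assumptions, not the samtools algorithm: the
function assign_amplicon and its data layout are hypothetical, each
amplicon is reduced to the positions where a primer-clipped read is
expected to start or end, and 0 is returned for an unassigned read.

```python
def assign_amplicon(read_start, read_end, amplicons, margin):
    """Return the first amplicon whose expected start or end is within
    `margin` bases of the read's start or end, or 0 if none matches.

    amplicons: {number: (expected_starts, expected_ends)}
    """
    for number in sorted(amplicons):
        expected_starts, expected_ends = amplicons[number]
        start_ok = any(abs(read_start - s) <= margin for s in expected_starts)
        end_ok = any(abs(read_end - e) <= margin for e in expected_ends)
        if start_ok or end_ok:  # "start and/or end"
            return number
    return 0


# Hypothetical expected positions for two amplicons.
amplicons = {7: ([1897], [2242]), 8: ([2205], [2568])}
near_left_of_7 = assign_amplicon(1899, 2100, amplicons, margin=5)
near_both_of_8 = assign_amplicon(2210, 2566, amplicons, margin=5)
unassigned = assign_amplicon(2000, 2100, amplicons, margin=5)
```

A larger margin tolerates more imprecise clipping but increases the
chance of a read matching a neighbouring primer site.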
130
FRPERC / CRPERC SECTION
For each amplicon, this lists what percentage of reads were assigned to
this amplicon out of the total number of assigned reads. This may be
used to diagnose how uniform this distribution is.

Note this is a pure read count and has no relation to amplicon size.
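In other words, each percentage is the amplicon's read count divided by
the sum of all assigned read counts. A small sketch with invented counts
(read_percentages is a hypothetical name):

```python
def read_percentages(counts):
    """Percentage of assigned reads falling in each amplicon."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [100.0 * c / total for c in counts]


# Invented read counts for three amplicons.
percentages = read_percentages([50, 30, 20])
```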
138
FDEPTH / CDEPTH / FVDEPTH / CVDEPTH SECTION
Using the reads assigned to each amplicon and their start / end
locations on the reference, computed using the POS and CIGAR fields, we
compute the total number of bases aligned to this amplicon and the
corresponding average depth. The VDEPTH variants are filtered to only
include templates with end-to-end coverage across the amplicon. These
can be considered to be "valid" or "usable" templates and give an
indication of the minimum depth for the amplicon rather than the average
depth.

To compute the depth, the length of the amplicon is computed using the
innermost set of primers if multiple choices are listed in the BED
file.
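The arithmetic amounts to dividing the aligned bases by the amplicon
length. The sketch below assumes the innermost interval is given as
0-based, half-open coordinates, so the length is simply end minus start;
average_depth is a hypothetical name and the base count is invented.

```python
def average_depth(aligned_bases, inner_start, inner_end):
    """Average depth = bases aligned to the amplicon / amplicon length."""
    length = inner_end - inner_start  # assumption: half-open interval
    return aligned_bases / length


# 34500 aligned bases over a 345 bp amplicon (1897 to 2242).
depth = average_depth(34500, 1897, 2242)
```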
153
FPCOV / CPCOV SECTION
Similar to the FDEPTH section, but this uses a binary status of covered
or not covered for each position in the amplicon. The number of covered
positions is then expressed as a percentage of the amplicon length,
which is computed using the innermost set of primers covering this
amplicon.

The minimum depth required for a position to count as "covered" can be
specified using the -d option.
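A sketch of that calculation, given a list of per-position depths for
one amplicon (percent_coverage is a hypothetical name and the depths are
invented):

```python
def percent_coverage(depths, min_depth):
    """Percentage of positions whose depth is at least min_depth."""
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


# Four positions, two of which reach a depth of 4 or more.
coverage = percent_coverage([0, 5, 10, 3], min_depth=4)
```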
163
FTCOORD / CTCOORD / FAMP / CAMP SECTION
It is possible for an amplicon to be produced using incorrect primers,
giving rise to extra-long amplicons (typically double or treble
length).

The FTCOORD field holds a distribution of observed template coordinates
from the input data. Each row consists of the file name, the amplicon
number in question, and tab separated tuples of start, end, frequency
and status (0 for OK, 1 for skipping amplicon, 2 for unknown location).
Each template is only counted for one amplicon, so if the read-pairs
span amplicons the count will show up in the left-most amplicon
covered.

The COORD data may indicate which primers are being utilised if there
are alternates available for a given amplicon.

For COORD lines, amplicon number 0 holds the frequency data for reads
that have not been assigned to any amplicon. That is, they may lie
within an amplicon, but they do not start or end at a known primer
location. It is not recorded for BED files containing multiple
references.

The FAMP / CAMP section is a simple count per amplicon of the number of
templates coming from this amplicon. Templates are counted once per
amplicon and, like the FTCOORD field, if a read-pair spans amplicons it
is only counted in the left-most amplicon. Each line consists of the
file name, amplicon number and 3 counts for the number of templates
with both ends within this amplicon, the number of templates with the
rightmost end in another amplicon, and the number of templates where
the other end has failed to be assigned to an amplicon.

Note FAMP / CAMP amplicon number 0 is the summation of data for all
amplicons (1 onwards).
199
FDP_ALL / FDP_VALID / CDP_ALL / CDP_VALID SECTION
These are for depth plots per base rather than per amplicon. They
distinguish between all reads in all templates, and only reads in
templates considered to be "valid". Such templates have both reads (if
paired) matching known primer locations from the same amplicon and have
full length coverage across the entire amplicon.

FDP_VALID can be considered to be the minimum template depth across the
amplicon.

The difference between the VALID and ALL plots represents additional
data that for some reason may not be suitable for producing a
consensus. For example an amplicon that skips a primer, pairing 10_LEFT
with 12_RIGHT, will have coverage for the first half of amplicon 10 and
the last half of amplicon 12. Counting the number of reads or bases
alone in the amplicon does not reveal the potential for non-uniformity
of coverage.

The lines start with the type keyword, file / sample name, reference
name (unless single-ref mode is enabled), followed by a variable number
of tab separated tuples consisting of depth,length. The length field
is a basic form of run-length encoding where all depth values within a
specified fraction of each other (e.g. >= (1-fract)*midpoint and <=
(1+fract)*midpoint) are combined into a single run. This fraction is
controlled via the -D option.
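To make the encoding concrete, the sketch below expands a list of
depth,length tuples back into one depth per reference base, and shows
the range test quoted above. Both helper names are hypothetical, the
tuples are invented, and integer depth values are assumed. How samtools
chooses the midpoint of a run is not described here, so only the test
itself is shown.

```python
def expand_runs(fields):
    """Expand ["0,3", "12,2"] into one depth value per base."""
    depths = []
    for field in fields:
        depth, length = (int(x) for x in field.split(","))
        depths.extend([depth] * length)
    return depths


def in_same_run(depth, midpoint, fract):
    """True if depth lies within +/- fract of midpoint."""
    return (1 - fract) * midpoint <= depth <= (1 + fract) * midpoint


per_base = expand_runs(["0,3", "12,2"])
```

Because similar depths are merged, the expanded values are approximate
unless the data was produced with -D 0.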
226
OPTIONS
-f, --required-flag INT|STR
    Only include alignments with all bits set in INT present in the
    FLAG field. INT can be specified in hex by beginning with `0x'
    (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
    /^0[0-7]+/) [0], or in string form by specifying a comma-separated
    list of keywords as listed by the "samtools flags" subcommand.

-F, --filter-flag INT|STR
    Do not include alignments with any bits set in INT present in the
    FLAG field. INT can be specified in hex by beginning with `0x'
    (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
    /^0[0-7]+/) [0], or in string form by specifying a comma-separated
    list of keywords as listed by the "samtools flags" subcommand.
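The numeric forms and the two bit tests can be sketched as follows. This
is an illustration only: parse_flag_int and keep_alignment are
hypothetical names, and the keyword string form is deliberately not
handled.

```python
def parse_flag_int(text):
    """Parse a FLAG value given in hex (0x...), octal (leading 0) or decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text)


def keep_alignment(flag, required=0, filtered=0):
    """-f: every required bit must be set; -F: no filtered bit may be set."""
    return (flag & required) == required and (flag & filtered) == 0


required = parse_flag_int("0x1")    # paired
filtered = parse_flag_int("0x900")  # secondary or supplementary
kept = keep_alignment(99, required, filtered)
dropped = keep_alignment(355, required, filtered)  # 355 has bit 0x100 set
```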
245
-a, --max-amplicons INT
    Specify the maximum number of amplicons permitted.

-b, --tcoord-bin INT
    Bin the template start,end positions into multiples of INT prior
    to counting their frequency and reporting in the FTCOORD / CTCOORD
    lines. This may be useful for technologies with higher error rates
    where the alignment ends will vary slightly. Defaults to 1, which
    is equivalent to no binning.

-c, --tcoord-min-count INT
    In the FTCOORD and CTCOORD lines, only record template start,end
    coordinate combinations if they occur at least INT times.
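The combined effect of -b and -c can be sketched as below. The function
name tcoord_frequencies is hypothetical, the template coordinates are
invented, and rounding each position down to a multiple of the bin size
is an assumption about how the binning is done.

```python
from collections import Counter


def tcoord_frequencies(templates, bin_size, min_count):
    """Count binned (start, end) pairs, keeping those seen >= min_count times."""
    counts = Counter(
        (start // bin_size * bin_size, end // bin_size * bin_size)
        for start, end in templates
    )
    return {coords: n for coords, n in counts.items() if n >= min_count}


# Three invented templates; the first two differ by only a few bases.
templates = [(101, 498), (103, 499), (110, 500)]
frequencies = tcoord_frequencies(templates, bin_size=5, min_count=2)
```

With a bin size of 1 the first two templates would be counted
separately and neither would reach the minimum count of 2.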
263
-d, --min-depth INT
    Specifies the minimum base depth to consider a reference position
    to be covered, for purposes of the FPCOV and CPCOV sections.

-D, --depth-bin FRACTION
    Controls the merging of neighbouring similar depths for the
    FDP_ALL and FDP_VALID plots. The default FRACTION is 0.01,
    meaning depths within +/- 1% of a mid point will be aggregated
    together as a run of the same value. This merging is useful to
    reduce the file size. Use -D 0 to record every depth.

-l, --max-amplicon-length INT
    Specifies the maximum length of any individual amplicon.

-m, --pos-margin INT
    Reads are compared against the primer start and end locations
    specified in the BED file. An aligned sequence should start
    precisely at these locations, but sequencing errors may cause
    the primer clipping to be a few bases out or for the alignment
    to add a few extra bases of soft clip. This option specifies
    the margin of error permitted when matching a read to an
    amplicon number.

-o FILE
    Output stats to FILE. The default is to write to stdout.

-s, --use-sample-name
    Instead of using the basename component of the input path
    names, use the SM field from the first @RG header line.

-S, --single-ref
    Force the output format to match the older single-reference
    style used in Samtools 1.12 and earlier. This removes the
    reference names from the SS, AMPLICON, DP_ALL and DP_VALID
    sections. It cannot be enabled if the input BED file has more
    than one reference present. Note that plot-ampliconstats can
    process both output styles.

-t, --tlen-adjust INT
    Adjust the TLEN field by +/- INT to compensate for primer
    clipping. This defaults to zero, but if the primers have been
    clipped and the TLEN field has not been updated using samtools
    fixmate then the template length will be wrong by the sum of
    the forward and reverse primer lengths.

    This adjustment does not have to be precise as the --pos-margin
    option permits some leeway. Hence if required, it should be set
    to approximately double the average primer length.
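A rough value for -t can be derived from the primer BED file by doubling
the mean primer length. The sketch below uses the six primers from the
BED example in the description; suggested_tlen_adjust is a hypothetical
helper and the result is only a starting point.

```python
def suggested_tlen_adjust(primers):
    """Return roughly double the average primer length, rounded."""
    lengths = [end - start for start, end in primers]
    return round(2 * sum(lengths) / len(lengths))


# (start, end) columns of the six primers in the BED example.
primers = [
    (1875, 1897), (1868, 1890), (2247, 2269),
    (2242, 2264), (2181, 2205), (2568, 2592),
]
adjust = suggested_tlen_adjust(primers)
```

For these primers, which are 22 to 24 bases long, the suggested
adjustment is 45.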
321
-@ INT
    Number of BAM/CRAM (de)compression threads to use in addition
    to main thread [0].
325
EXAMPLE
To run ampliconstats on a directory full of CRAM files and then produce
a series of PNG images named "mydata*.png":

    samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
    plot-ampliconstats -size 1200,900 mydata astats
334
AUTHOR
Written by James Bonfield from the Sanger Institute.
339
SEE ALSO
samtools(1), samtools-ampliconclip(1), samtools-stats(1),
samtools-flags(1)

Samtools website: <http://www.htslib.org/>
346
samtools-1.15.1 7 April 2022 samtools-ampliconstats(1)