samtools-ampliconstats(1)

samtools-ampliconstats(1)    Bioinformatics tools    samtools-ampliconstats(1)

NAME

       samtools ampliconstats - produces statistics from amplicon sequencing
       alignment files

SYNOPSIS

       samtools ampliconstats [options] primers.bed in.sam|in.bam|in.cram...

DESCRIPTION

       samtools ampliconstats collects statistics from one or more input
       alignment files and produces tables in text format.  The output can be
       visualized graphically using plot-ampliconstats.

       The alignment files should have previously been clipped of primer
       sequence, for example by "samtools ampliconclip", and the sites of
       these primers should be specified as a bed file in the arguments.
       Each amplicon must be present in the bed file with one or more LEFT
       primers (direction "+") followed by one or more RIGHT primers
       (direction "-").  For example:

         MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
         MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
         MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
         MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
         MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
         MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -

       Ampliconstats will identify which read belongs to which amplicon.  For
       purposes of computing coverage statistics for amplicons with multiple
       primer choices, only the innermost primer locations are used.

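       As an illustration of "innermost primer locations", the following
       Python sketch reads the example BED lines above and reports, per
       amplicon, the largest LEFT primer end and the smallest RIGHT primer
       start.  It is not the samtools implementation: grouping primers by
       the text before _LEFT / _RIGHT in the name, and the helper name
       innermost_regions, are assumptions made for this example only.

```python
from collections import defaultdict

BED_TEXT = """\
MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -
"""


def innermost_regions(bed_text):
    """Return {(ref, amplicon_name): (left_end, right_start)}."""
    left_ends = defaultdict(list)     # end coordinates of "+" primers
    right_starts = defaultdict(list)  # start coordinates of "-" primers
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        ref, start, end, name, _score, strand = fields[:6]
        # Assumption: the amplicon is identified by the primer name up to
        # the _LEFT / _RIGHT token (hypothetical rule for this sketch).
        amplicon = name.split("_LEFT")[0].split("_RIGHT")[0]
        if strand == "+":
            left_ends[(ref, amplicon)].append(int(end))
        else:
            right_starts[(ref, amplicon)].append(int(start))
    # Innermost: the right-most LEFT primer end and left-most RIGHT start.
    return {
        key: (max(left_ends[key]), min(right_starts[key]))
        for key in left_ends
        if key in right_starts
    }


regions = innermost_regions(BED_TEXT)
```

       For amplicon 7 this selects the interval between nCoV-2019_7_LEFT
       (ending at 1897) and nCoV-2019_7_RIGHT_alt5 (starting at 2242).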
       A summary of output sections is listed below, followed by more detailed
       descriptions.

       SS          Amplicon and file counts.  Always comes first
       AMPLICON    Amplicon primer locations
       FSS         File specific: summary stats
       FRPERC      File specific: read percentage distribution between
                   amplicons
       FDEPTH      File specific: average read depth per amplicon
       FVDEPTH     File specific: average read depth per amplicon, full length
                   only
       FREADS      File specific: numbers of reads per amplicon
       FPCOV       File specific: percent coverage per amplicon
       FTCOORD     File specific: template start,end coordinate frequencies
                   per amplicon
       FAMP        File specific: amplicon correct / double / treble length
                   counts
       FDP_ALL     File specific: template depth per reference base, all
                   templates
       FDP_VALID   File specific: template depth per reference base, valid
                   templates only
       CSS         Combined summary stats
       CRPERC      Combined: read percentage distribution between amplicons
       CDEPTH      Combined: average read depth per amplicon
       CVDEPTH     Combined: average read depth per amplicon, full length only
       CREADS      Combined: numbers of reads per amplicon
       CPCOV       Combined: percent coverage per amplicon
       CTCOORD     Combined: template coordinates per amplicon
       CAMP        Combined: amplicon correct / double / treble length counts
       CDP_ALL     Combined: template depth per reference base, all templates
       CDP_VALID   Combined: template depth per reference base, valid
                   templates only

       File specific sections start with both the section key and the filename
       basename (minus directory and .sam, .bam or .cram suffix).

       Note that the file specific sections are interleaved, ordered first by
       file and secondly by the file specific stats.  To collate them
       together, use "grep" to pull out all data of a specific type.

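       The same collation can be done in a script.  The Python sketch below
       keeps only the rows whose first column equals a given section key.
       It assumes tab-separated output with the key in the first column;
       the function name collate and the sample rows are invented for
       illustration and are not real ampliconstats output.

```python
def collate(stats_text, key):
    """Return the rows (as lists of columns) belonging to one section key."""
    rows = []
    for line in stats_text.splitlines():
        columns = line.split("\t")
        # Compare the whole first column so that, for example, "FDP_ALL"
        # does not also pick up a different key sharing the same prefix.
        if columns[0] == key:
            rows.append(columns)
    return rows


# Invented sample rows, for illustration only.
SAMPLE = (
    "FREADS\tsampleA\t10\t20\n"
    "FDEPTH\tsampleA\t1.5\t2.5\n"
    "FREADS\tsampleB\t30\t40\n"
)

freads_rows = collate(SAMPLE, "FREADS")
```

       Here freads_rows holds the two FREADS rows, one per input file, in
       the order they appeared.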
       The combined sections (C*) follow the same format as the file specific
       sections, with a different key.  For simplicity of parsing they also
       have a filename column which is filled out with "COMBINED".  These rows
       contain stats aggregated across all input files.

SS / AMPLICON

       The SS section is output once per file and includes summary information
       to be utilised for scaling of plots, for example the total number of
       amplicons and files present, tool version number, and command line
       arguments.  The second column is the filename or "COMBINED".  This is
       followed by the reference name (unless single-ref mode is enabled), and
       the summary statistic name and value.

       The AMPLICON section is a reformatting of the input BED file.  Each
       line consists of the reference name (unless single-ref mode is
       enabled), the amplicon number and the start-end coordinates of the left
       and right primers.  Where multiple primers are available these are
       comma separated, for example 10-30,15-40 in the left primer column
       indicates two primers have been multiplexed together covering genome
       coordinates 10-30 inclusive and 15-40 inclusive.
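       A primer column in this comma-separated start-end form can be split
       with a few lines of Python.  This is a minimal sketch; the function
       name parse_primer_column is hypothetical.

```python
def parse_primer_column(column):
    """Split e.g. "10-30,15-40" into [(10, 30), (15, 40)]."""
    primers = []
    for item in column.split(","):
        start, end = item.split("-")
        primers.append((int(start), int(end)))
    return primers


left_primers = parse_primer_column("10-30,15-40")
```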

CSS SECTION

       This section consists of summary counts for the entire set of input
       files.  These may be useful for automatic scaling of plots.

       Number of amplicons   Total number of amplicons listed in primers.bed
       Number of files       Total number of SAM, BAM or CRAM files
       End of summary        Always the last item.  Marker for end of CSS block.

FSS SECTION

       This lists summary statistics specific to an individual input file.
       The values reported are:

       raw total sequences   Total number of sequences found in the file
       filtered sequences    Number of sequences filtered with -F option
       failed primer match   Number of sequences that did not correspond to
                             a known primer location
       matching sequences    Number of sequences allocated to an amplicon

FREADS / CREADS SECTION

       For each amplicon, this simply reports the count of reads that have
       been assigned to it.  A read is assigned to an amplicon if the start
       and/or end of the read is within a specified number of bases of the
       primer sites listed in the bed file.  This distance is controlled via
       the -m option.
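       The matching rule can be pictured with the following Python sketch.
       It is a simplified illustration, not the samtools algorithm: it
       assumes one (left primer end, right primer start) pair per amplicon,
       takes the first amplicon that matches, and returns None for an
       unassigned read.  The names assign_amplicon and AMPLICONS are
       hypothetical; the coordinates come from the BED example in the
       DESCRIPTION.

```python
# (amplicon number, left primer end, right primer start); assumed layout.
AMPLICONS = [
    (7, 1897, 2242),
    (8, 2205, 2568),
]


def assign_amplicon(read_start, read_end, amplicons, margin):
    """Return the first amplicon whose primer site is within `margin`
    bases of the read start and/or the read end, else None."""
    for number, left_end, right_start in amplicons:
        start_ok = abs(read_start - left_end) <= margin
        end_ok = abs(read_end - right_start) <= margin
        if start_ok or end_ok:
            return number
    return None


near_start = assign_amplicon(1899, 2100, AMPLICONS, margin=5)
near_end = assign_amplicon(2300, 2566, AMPLICONS, margin=5)
unmatched = assign_amplicon(2000, 2100, AMPLICONS, margin=5)
```

       A larger margin tolerates more variation in clipping at the cost of
       a greater chance of matching the wrong primer site.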

FRPERC / CRPERC SECTION

       For each amplicon, this lists what percentage of reads were assigned to
       this amplicon out of the total number of assigned reads.  This may be
       used to diagnose how uniform this distribution is.

       Note this is a pure read count and has no relation to amplicon size.
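       The calculation amounts to dividing each amplicon's read count by
       the total of all assigned reads.  A small Python sketch, using
       invented counts and a hypothetical function name:

```python
def read_percentages(counts):
    """Map {amplicon: read count} to {amplicon: percentage of all reads}."""
    total = sum(counts.values())
    if total == 0:
        # No assigned reads at all: report zero everywhere.
        return {amplicon: 0.0 for amplicon in counts}
    return {amplicon: 100.0 * n / total for amplicon, n in counts.items()}


# Invented read counts for three amplicons.
percentages = read_percentages({1: 50, 2: 30, 3: 20})
```

       A perfectly uniform pool of three amplicons would give about 33.3%
       each, so the spread of these values indicates how uneven the
       amplification was.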

FDEPTH / CDEPTH / FVDEPTH / CVDEPTH SECTION

       Using the reads assigned to each amplicon and their start / end
       locations on that reference, computed using the POS and CIGAR fields,
       we compute the total number of bases aligned to this amplicon and the
       corresponding average depth.  The VDEPTH variants are filtered to only
       include templates with end-to-end coverage across the amplicon.  These
       can be considered to be "valid" or "usable" templates and give an
       indication of the minimum depth for the amplicon rather than the
       average depth.

       To compute the depth, the length of the amplicon is computed using the
       innermost set of primers, if multiple choices are listed in the bed
       file.
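       In other words, average depth is the number of aligned bases falling
       inside the amplicon divided by the amplicon length.  The Python
       sketch below illustrates this with half-open (start, end) intervals;
       the interval convention, the function name and the sample spans are
       assumptions for the example, and the full_length_only switch merely
       mimics the idea behind the VDEPTH variants.

```python
def average_depth(spans, amp_start, amp_end, full_length_only=False):
    """Average depth over [amp_start, amp_end) from aligned spans."""
    length = amp_end - amp_start
    total_bases = 0
    for start, end in spans:
        if full_length_only and not (start <= amp_start and end >= amp_end):
            continue  # keep only spans covering the amplicon end to end
        overlap = min(end, amp_end) - max(start, amp_start)
        if overlap > 0:
            total_bases += overlap
    return total_bases / length


# Invented spans over an amplicon covering positions 100..199.
SPANS = [(100, 200), (100, 150), (150, 200)]
depth_all = average_depth(SPANS, 100, 200)
depth_valid = average_depth(SPANS, 100, 200, full_length_only=True)
```

       With these spans the average depth is 2.0, while only one span is
       full length, giving 1.0 for the filtered figure.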

FPCOV / CPCOV SECTION

       Similar to the FDEPTH section, this is a binary status of covered or
       not covered per position in each amplicon.  This is then expressed as a
       percentage by dividing by the amplicon length, which is computed using
       the innermost set of primers covering this amplicon.

       The minimum depth necessary to constitute a position as being "covered"
       is specifiable using the -d option.
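       As a sketch of the calculation, the Python function below counts the
       positions whose depth reaches a threshold and divides by the number
       of positions.  The per-position depth list is invented, and the
       function name percent_coverage is hypothetical.

```python
def percent_coverage(depths, min_depth):
    """Percentage of positions whose depth is at least min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


# Invented per-position depths for a ten-base amplicon.
DEPTHS = [0, 0, 3, 5, 1, 2, 0, 4, 4, 4]
cov_min1 = percent_coverage(DEPTHS, min_depth=1)
cov_min3 = percent_coverage(DEPTHS, min_depth=3)
```

       Raising the threshold from 1 to 3 lowers the reported coverage of
       this example from 70% to 50%.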

FTCOORD / CTCOORD / FAMP / CAMP SECTION

       It is possible for an amplicon to be produced using incorrect primers,
       giving rise to extra-long amplicons (typically double or treble
       length).

       The FTCOORD field holds a distribution of observed template coordinates
       from the input data.  Each row consists of the file name, the amplicon
       number in question, and tab separated tuples of start, end, frequency
       and status (0 for OK, 1 for skipping amplicon, 2 for unknown location).
       Each template is only counted for one amplicon, so if the read-pairs
       span amplicons the count will show up in the left-most amplicon
       covered.

       The COORD data may indicate which primers are being utilised if there
       are alternates available for a given amplicon.

       For COORD lines, amplicon number 0 holds the frequency data for reads
       that have not been assigned to any amplicon.  That is, they may lie
       within an amplicon, but they do not start or end at a known primer
       location.  It is not recorded for BED files containing multiple
       references.

       The FAMP / CAMP section is a simple count per amplicon of the number of
       templates coming from this amplicon.  Templates are counted once per
       amplicon and, like the FTCOORD field, if a read-pair spans amplicons it
       is only counted in the left-most amplicon.  Each line consists of the
       file name, amplicon number and 3 counts: the number of templates with
       both ends within this amplicon, the number of templates with the
       rightmost end in another amplicon, and the number of templates where
       the other end has failed to be assigned to an amplicon.

       Note FAMP / CAMP amplicon number 0 is the summation of data for all
       amplicons (1 onwards).
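       A COORD row could be unpacked along the following lines.  This
       Python sketch assumes the row is tab separated, begins with the
       section key, file name and amplicon number, and that the four values
       inside each tuple are comma separated; check these assumptions
       against real output before relying on them.  The sample row is
       invented and parse_tcoord is a hypothetical name.

```python
STATUS = {0: "OK", 1: "skipping amplicon", 2: "unknown location"}


def parse_tcoord(line):
    """Return (key, file name, amplicon number, list of tuple dicts)."""
    fields = line.rstrip("\n").split("\t")
    key, name, amplicon = fields[0], fields[1], int(fields[2])
    tuples = []
    for item in fields[3:]:
        # Assumed tuple layout: start,end,frequency,status
        start, end, freq, status = (int(value) for value in item.split(","))
        tuples.append(
            {"start": start, "end": end, "freq": freq,
             "status": STATUS.get(status, "unrecognised")}
        )
    return key, name, amplicon, tuples


# Invented example row, for illustration only.
ROW = "FTCOORD\tsampleA\t7\t1897,2242,120,0\t1890,2242,15,0\t1897,2592,3,1"
key, name, amplicon, coords = parse_tcoord(ROW)
```

       Sorting the resulting tuples by frequency shows which primer
       combination dominates for that amplicon.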

FDP_ALL / CDP_ALL / FDP_VALID / CDP_VALID SECTION

       These are for depth plots per base rather than per amplicon.  They
       distinguish between all reads in all templates, and only reads in
       templates considered to be "valid".  Such templates have both reads (if
       paired) matching known primer locations from the same amplicon and have
       full length coverage across the entire amplicon.

       FDP_VALID can be considered to be the minimum template depth across the
       amplicon.

       The difference between the VALID and ALL plots represents additional
       data that for some reason may not be suitable for producing a
       consensus.  For example an amplicon that skips a primer, pairing
       10_LEFT with 12_RIGHT, will have coverage for the first half of
       amplicon 10 and the last half of amplicon 12.  Counting the number of
       reads or bases alone in the amplicon does not reveal the potential for
       non-uniformity of coverage.

       The lines start with the type keyword, file / sample name, reference
       name (unless single-ref mode is enabled), followed by a variable number
       of tab separated tuples consisting of depth,length.  The length field
       is a basic form of run-length encoding where all depth values within a
       specified fraction of each other (e.g. >= (1-fract)*midpoint and <=
       (1+fract)*midpoint) are combined into a single run.  This fraction is
       controlled via the -D option.
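       The Python sketch below shows one way such a tolerance-based
       run-length encoding could work, together with the matching decoder.
       It is an illustration of the idea only: taking the midpoint as the
       mean of the lowest and highest depth in the run, extending runs
       greedily, and reporting the integer midpoint as the run depth are
       all assumptions, and the function names are hypothetical.

```python
def rle_encode(depths, fract):
    """Greedily merge neighbouring depths into (depth, length) runs."""
    runs = []
    i = 0
    while i < len(depths):
        low = high = depths[i]
        j = i + 1
        while j < len(depths):
            new_low = min(low, depths[j])
            new_high = max(high, depths[j])
            midpoint = (new_low + new_high) / 2
            # Stop the run once any member would fall outside
            # [(1-fract)*midpoint, (1+fract)*midpoint].
            if new_low < (1 - fract) * midpoint or new_high > (1 + fract) * midpoint:
                break
            low, high = new_low, new_high
            j += 1
        runs.append(((low + high) // 2, j - i))
        i = j
    return runs


def rle_decode(runs):
    """Expand (depth, length) runs back to one depth per position."""
    return [depth for depth, length in runs for _ in range(length)]


DEPTHS = [100, 100, 101, 100, 200, 200]
merged = rle_encode(DEPTHS, 0.01)  # depths within about 1% are merged
exact = rle_encode(DEPTHS, 0.0)    # only identical depths are merged
```

       With a fraction of 0.01 the six positions collapse into two runs,
       whereas a fraction of 0 preserves every change in depth, so decoding
       reproduces the original values exactly.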

OPTIONS

       -f, --required-flag INT|STR
               Only output alignments with all bits set in INT present in the
               FLAG field.  INT can be specified in hex by beginning with `0x'
               (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
               /^0[0-7]+/) [0], or in string form by specifying a
               comma-separated list of keywords as listed by the "samtools
               flags" subcommand.

       -F, --filter-flag INT|STR
               Do not output alignments with any bits set in INT present in
               the FLAG field.  INT can be specified in hex by beginning with
               `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
               (i.e. /^0[0-7]+/) [0], or in string form by specifying a
               comma-separated list of keywords as listed by the "samtools
               flags" subcommand.

       -a, --max-amplicons INT
               Specify the maximum number of amplicons permitted.

       -b, --tcoord-bin INT
               Bin the template start,end positions into multiples of INT
               prior to counting their frequency and reporting in the FTCOORD
               / CTCOORD lines.  This may be useful for technologies with
               higher error rates where the alignment ends will vary
               slightly.  Defaults to 1, which is equivalent to no binning.

       -c, --tcoord-min-count INT
               In the FTCOORD and CTCOORD lines, only record template
               start,end coordinate combinations if they occur at least INT
               times.

       -d, --min-depth INT
               Specifies the minimum base depth to consider a reference
               position to be covered, for purposes of the FPCOV and CPCOV
               sections.

       -D, --depth-bin FRACTION
               Controls the merging of neighbouring similar depths for the
               FDP_ALL and FDP_VALID plots.  The default FRACTION is 0.01,
               meaning depths within +/- 1% of a mid point will be aggregated
               together as a run of the same value.  This merging is useful to
               reduce the file size.  Use -D 0 to record every depth.

       -l, --max-amplicon-length INT
               Specifies the maximum length of any individual amplicon.

       -m, --pos-margin INT
               Reads are compared against the primer start and end locations
               specified in the BED file.  An aligned sequence should start
               precisely at these locations, but sequencing errors may cause
               the primer clipping to be a few bases out or for the alignment
               to add a few extra bases of soft clip.  This option specifies
               the margin of error permitted when matching a read to an
               amplicon number.

       -o  FILE
               Output stats to FILE.  The default is to write to stdout.

       -s, --use-sample-name
               Instead of using the basename component of the input path
               names, use the SM field from the first @RG header line.

       -S, --single-ref
               Force the output format to match the older single-reference
               style used in Samtools 1.12 and earlier.  This removes the
               reference names from the SS, AMPLICON, DP_ALL and DP_VALID
               sections.  It cannot be enabled if the input BED file has more
               than one reference present.  Note that plot-ampliconstats can
               process both output styles.

       -t, --tlen-adjust INT
               Adjust the TLEN field by +/- INT to compensate for primer
               clipping.  This defaults to zero, but if the primers have been
               clipped and the TLEN field has not been updated using samtools
               fixmate then the template length will be wrong by the sum of
               the forward and reverse primer lengths.

               This adjustment does not have to be precise as the --pos-margin
               option permits some leeway.  Hence if required, it should be
               set to approximately double the average primer length.

       -@ INT  Number of BAM/CRAM (de)compression threads to use in addition
               to main thread [0].
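       The numeric forms accepted by -f and -F, and the "all bits" versus
       "any bits" tests they describe, can be illustrated in Python.  This
       is a sketch of the documented semantics rather than samtools code;
       the keyword (string) form is omitted, and both function names are
       hypothetical.

```python
def parse_flag_int(text):
    """Parse a flag value given in hex (0x...), octal (0...) or decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text)


def passes_filters(flag, required=0, filtered=0):
    """Keep a record only if all `required` bits are set (-f) and
    none of the `filtered` bits are set (-F)."""
    has_all_required = (flag & required) == required
    has_no_filtered = (flag & filtered) == 0
    return has_all_required and has_no_filtered


mask = parse_flag_int("0x900")
kept = passes_filters(99, required=0x1, filtered=mask)
dropped = passes_filters(2048 | 16, required=0, filtered=mask)
```

       With both masks left at 0, every record passes, which matches the
       [0] defaults shown above.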

EXAMPLE

       To run ampliconstats on a directory full of CRAM files and then produce
       a series of PNG images named "mydata*.png":

         samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
         plot-ampliconstats -size 1200,900 mydata astats

AUTHOR

       Written by James Bonfield from the Sanger Institute.