1samtools-consensus(1)        Bioinformatics tools        samtools-consensus(1)
2
3
4

NAME

6       samtools consensus - produces produce a consensus FASTA/FASTQ/PILEUP
7

SYNOPSIS

9       samtools  consensus  [-saAMq] [-r region] [-f format] [-l line-len] [-d
10       min-depth] [-C cutoff] [-c call-fract] [-H het-fract] in.bam
11
12

DESCRIPTION

14       Generate consensus from a SAM, BAM or CRAM file based on  the  contents
15       of  the  alignment  records.  The consensus is written either as FASTA,
16       FASTQ, or a pileup oriented format.  This is selected using the -f FOR‐
17       MAT option.
18
19       The  default  output  for  FASTA and FASTQ formats include one base per
20       non-gap consensus.  Hence insertions with respect to the aligned refer‐
21       ence  will  be  included  and deletions removed.  This behaviour can be
22       controlled with the --show-ins and --show-del options.  This  could  be
23       used  to  compute  a new reference from sequences assemblies to realign
24       against.
25
26       The pileup-style format strictly adheres to one row per consensus loca‐
27       tion,  differing  from  the one row per reference based used in the re‐
28       lated "samtools mpileup" command.  This means the base  quality  values
29       for inserted columns are reported.  The base quality value of gaps (ei‐
30       ther within an insertion or otherwise) are determined as the average of
31       the  surrounding  non-gap  bases.   The columns shown are the reference
32       name, position, nth base at that position (zero if not  an  insertion),
33       consensus call, consensus confidence, sequences and quality values.
34
35       Two  consensus  calling algorithms are offered.  The default computes a
36       heterozygous consensus in a Bayesian manner, derived  from  the  "Gap5"
37       consensus  algorithm.  Quality values are also tweaked to take into ac‐
38       count other nearby low quality values.  This can also be disabled,  us‐
39       ing the --no-adj-qual option.
40
41       This method also utilises the mapping qualities, unless the --no-use-MQ
42       option is used.  Mapping qualities are also auto-scaled  to  take  into
43       account  the  local reference variation by processing the MD:Z tag, un‐
44       less --no-adj-MQ is used.  Mapping qualities can be  capped  between  a
45       minimum  (--low-MQ)  and maximum (--high-MQ), although the defaults are
46       liberal and trust the data to be true.  Finally an overall scale on the
47       resulting  mapping  quality  can be supplied (--scale-MQ, defaulting to
48       1.0).  This has the effect of favouring more calls with a higher  false
49       positive  rate  (values  greater  than 1.0) or being more cautious with
50       higher false negative rates and lower false positive (values less  than
51       1.0).
52
53       The second method is a simple frequency counting algorithm, summing ei‐
54       ther +1 for each base type or +qual if the --use-qual option is  speci‐
55       fied.  This is enabled with the --mode simple option.
56
57       The  summed  share of a specific base type is then compared against the
58       total possible and if this is above the --call-fract fraction parameter
59       then  the  most likely base type is called, or "N" otherwise (or absent
60       if it is a gap).  The --ambig option permits  generation  of  ambiguity
61       codes  instead of "N", provided the minimum fraction of the second most
62       common base type to the most common is above the --het-fract fraction.
63
64

OPTIONS

66       General options that apply to both algorithms:
67
68
69       -r REG, --region REG
70                 Limit the query to region REG.  This requires an index.
71
72       -f FMT, --format FMT
73                 Produce format FMT, with "fastq",  "fasta"  and  "pileup"  as
74                 permitted options.
75
76       -l N, --line-len N
77                 Sets  the maximum line length of line-wrapped fasta and fastq
78                 formats to N.
79
80       -o FILE, --output FILE
81                 Output consensus to FILE instead of stdout.
82
83       -m STR, --mode STR
84                 Select the consensus algorithm.   Valid  modes  are  "simple"
85                 frequency  counting  and  the  "bayesian" (Gap5) method, with
86                 Bayesian being the default.  (Note case does not  matter,  so
87                 "Bayesian" is accepted too.)
88
89       -a        Outputs  all bases, from start to end of reference, even when
90                 the aligned data does not extend to the ends.  This  is  most
91                 useful for construction of a full length reference sequence.
92
93
94       --rf, --incl-flags STR|INT
95                 Only  include reads with at least one FLAG bit set.  Defaults
96                 to zero, which filters no reads.
97
98
99       --ff, --excl-flags STR|INT
100                 Exclude reads with any FLAG bit set.  Defaults to "UNMAP,SEC‐
101                 ONDARY,QCFAIL,DUP".
102
103
104       --min-MQ INT
105                 Filters out reads with a mapping quality below INT.  This de‐
106                 faults to zero.
107
108
109       --show-del yes/no
110                 Whether to show deletions as "*" (no) or  to  omit  from  the
111                 output (yes).  Defaults to no.
112
113
114       --show-ins yes/no
115                 Whether  to  show  insertions  in the consensus.  Defaults to
116                 yes.
117
118
119       -A, --ambig
120                 Enables IUPAC ambiguity codes in the consensus output.  With‐
121                 out this the output will be limited to A, C, G, T, N and *.
122
123
124       The  following  options  apply  only  to the simple (default) consensus
125       mode:
126
127
128       -q, --use-qual
129                 For the simple consensus algorithm, this enables use of  base
130                 quality  values.   Instead  of  summing 1 per base called, it
131                 sums the base quality instead.  These sums are also  used  in
132                 the  --call-fract  and  --het-fract  parameters too.  Quality
133                 values are always used for the "Gap5"  consensus  method  and
134                 this  option  has  no affect.  Note currently  quality values
135                 only affect SNPs and not inserted sequences, which still  get
136                 scores with a fixed +1 per base type occurrence.
137
138
139       -d D, --min-depth D
140                 The  minimum  depth  required to make a call.  Defaults to 1.
141                 Failing this depth check will produce consensus "N",  or  ab‐
142                 sent if it is an insertion.
143
144
145       -H H, --het-fract H
146                 For  consensus columns containing multiple base types, if the
147                 second most frequent type is at least H fraction of the  most
148                 common type then a heterozygous base type will be reported in
149                 the consensus.  Otherwise the most common base is used,  pro‐
150                 vided  it  meets  the --call-fract parameter (otherwise "N").
151                 The fractions computed may be modified by the use of  quality
152                 values  if the -q option is enabled.  Note although IUPAC has
153                 ambiguity codes for A,C,G,T vs any other A,C,G,T it does  not
154                 have  codes  for  A,C,G,T  vs  gap (such as in a heterozygous
155                 deletion).  Given the lack  of  any  official  code,  we  use
156                 lower-case letter to symbolise a half-present base type.
157
158
159       -c C, --call-fract C
160                 Only  used  for  the  simple consensus algorithm.  Require at
161                 least C fraction of bases agreeing with the most likely  con‐
162                 sensus  call  to omit that base type.  This defaults to 0.75.
163                 Failing this check will output "N".
164
165
166
167       The following options apply only to Bayesian consensus mode enabled
168       with the -5 option.
169
170
171       -5        Enable Bayesian consensus algorithm.
172
173
174       -C C, --cutoff C
175                 Only used for the Gap5 consensus mode, which produces a Phred
176                 style  score for the final consensus quality.  If this is be‐
177                 low C then the consensus is called as "N".
178
179
180       --use-MQ, --no-use-MQ
181                 Enable or disable the use of mapping qualities.  Defaults  to
182                 on.
183
184
185       --adj-MQ, --no-adj-MQ
186                 If mapping qualities are used, this controls whether they are
187                 scaled by the local number of mismatches  to  the  reference.
188                 The  reference  is  unknown by this tool, so this data is ob‐
189                 tained from  the  MD:Z  auxiliary  tag  (or  ignored  if  not
190                 present).  Defaults to on.
191
192
193       --NM-halo INT
194                 Specifies  the  distance  either  side of the base call being
195                 considered for computing the number of local mismatches.
196
197
198       --low-MQ MIN, --high-MQ MAX
199                 Specifies a minimum and maximum value of the mapping quality.
200                 These  are not filters and instead simply put upper and lower
201                 caps on the values.  The defaults are 0 and 60.
202
203
204       --scale-MQ FLOAT
205                 This is a general  multiplicative   mapping  quality  scaling
206                 factor.  The effect is to globally raise or lower the quality
207                 values used in the consensus  algorithm.   Defaults  to  1.0,
208                 which leaves the values unchanged.
209
210
211       --P-het FLOAT
212                 Controls  the likelihood of any position being a heterozygous
213                 site.  Defaults to 1e-4.  Smaller numbers makes the algorithm
214                 more  likely  to  call  a pure base type.  Note the algorithm
215                 will always compute the probability of the base being homozy‐
216                 gous  vs  heterozygous, irrespective of whether the output is
217                 reported as ambiguous (it will be "N" if deemed  to  be  het‐
218                 erozygous without --ambig mode enabled).
219
220

EXAMPLES

222       -      Create a modified FASTA reference that has a 1:1 coordinate cor‐
223              respondence with the original reference used in alignment.
224
225                samtools consensus -a --show-ins no in.bam -o ref.fa
226
227
228
229       -      Create a FASTQ file for the contigs with aligned data, including
230              insertions.
231
232                samtools consensus -f fastq in.bam -o cons.fq
233
234
235

AUTHOR

237       Written by James Bonfield from the Sanger Institute.
238
239

SEE ALSO

241       samtools(1), samtools-mpileup(1),
242
243       Samtools website: <http://www.htslib.org/>
244
245
246
247samtools-1.15.1                  7 April 2022            samtools-consensus(1)
Impressum