1samtools-consensus(1) Bioinformatics tools samtools-consensus(1)
2
3
4
6 samtools consensus - produces produce a consensus FASTA/FASTQ/PILEUP
7
9 samtools consensus [-saAMq] [-r region] [-f format] [-l line-len] [-d
10 min-depth] [-C cutoff] [-c call-fract] [-H het-fract] in.bam
11
12
14 Generate consensus from a SAM, BAM or CRAM file based on the contents
15 of the alignment records. The consensus is written either as FASTA,
16 FASTQ, or a pileup oriented format. This is selected using the -f FOR‐
17 MAT option.
18
19 The default output for FASTA and FASTQ formats include one base per
20 non-gap consensus. Hence insertions with respect to the aligned refer‐
21 ence will be included and deletions removed. This behaviour can be
22 controlled with the --show-ins and --show-del options. This could be
23 used to compute a new reference from sequences assemblies to realign
24 against.
25
26 The pileup-style format strictly adheres to one row per consensus loca‐
27 tion, differing from the one row per reference based used in the re‐
28 lated "samtools mpileup" command. This means the base quality values
29 for inserted columns are reported. The base quality value of gaps (ei‐
30 ther within an insertion or otherwise) are determined as the average of
31 the surrounding non-gap bases. The columns shown are the reference
32 name, position, nth base at that position (zero if not an insertion),
33 consensus call, consensus confidence, sequences and quality values.
34
35 Two consensus calling algorithms are offered. The default computes a
36 heterozygous consensus in a Bayesian manner, derived from the "Gap5"
37 consensus algorithm. Quality values are also tweaked to take into ac‐
38 count other nearby low quality values. This can also be disabled, us‐
39 ing the --no-adj-qual option.
40
41 This method also utilises the mapping qualities, unless the --no-use-MQ
42 option is used. Mapping qualities are also auto-scaled to take into
43 account the local reference variation by processing the MD:Z tag, un‐
44 less --no-adj-MQ is used. Mapping qualities can be capped between a
45 minimum (--low-MQ) and maximum (--high-MQ), although the defaults are
46 liberal and trust the data to be true. Finally an overall scale on the
47 resulting mapping quality can be supplied (--scale-MQ, defaulting to
48 1.0). This has the effect of favouring more calls with a higher false
49 positive rate (values greater than 1.0) or being more cautious with
50 higher false negative rates and lower false positive (values less than
51 1.0).
52
53 The second method is a simple frequency counting algorithm, summing ei‐
54 ther +1 for each base type or +qual if the --use-qual option is speci‐
55 fied. This is enabled with the --mode simple option.
56
57 The summed share of a specific base type is then compared against the
58 total possible and if this is above the --call-fract fraction parameter
59 then the most likely base type is called, or "N" otherwise (or absent
60 if it is a gap). The --ambig option permits generation of ambiguity
61 codes instead of "N", provided the minimum fraction of the second most
62 common base type to the most common is above the --het-fract fraction.
63
64
66 General options that apply to both algorithms:
67
68
69 -r REG, --region REG
70 Limit the query to region REG. This requires an index.
71
72 -f FMT, --format FMT
73 Produce format FMT, with "fastq", "fasta" and "pileup" as
74 permitted options.
75
76 -l N, --line-len N
77 Sets the maximum line length of line-wrapped fasta and fastq
78 formats to N.
79
80 -o FILE, --output FILE
81 Output consensus to FILE instead of stdout.
82
83 -m STR, --mode STR
84 Select the consensus algorithm. Valid modes are "simple"
85 frequency counting and the "bayesian" (Gap5) method, with
86 Bayesian being the default. (Note case does not matter, so
87 "Bayesian" is accepted too.)
88
89 -a Outputs all bases, from start to end of reference, even when
90 the aligned data does not extend to the ends. This is most
91 useful for construction of a full length reference sequence.
92
93
94 --rf, --incl-flags STR|INT
95 Only include reads with at least one FLAG bit set. Defaults
96 to zero, which filters no reads.
97
98
99 --ff, --excl-flags STR|INT
100 Exclude reads with any FLAG bit set. Defaults to "UNMAP,SEC‐
101 ONDARY,QCFAIL,DUP".
102
103
104 --min-MQ INT
105 Filters out reads with a mapping quality below INT. This de‐
106 faults to zero.
107
108
109 --show-del yes/no
110 Whether to show deletions as "*" (no) or to omit from the
111 output (yes). Defaults to no.
112
113
114 --show-ins yes/no
115 Whether to show insertions in the consensus. Defaults to
116 yes.
117
118
119 -A, --ambig
120 Enables IUPAC ambiguity codes in the consensus output. With‐
121 out this the output will be limited to A, C, G, T, N and *.
122
123
124 The following options apply only to the simple (default) consensus
125 mode:
126
127
128 -q, --use-qual
129 For the simple consensus algorithm, this enables use of base
130 quality values. Instead of summing 1 per base called, it
131 sums the base quality instead. These sums are also used in
132 the --call-fract and --het-fract parameters too. Quality
133 values are always used for the "Gap5" consensus method and
134 this option has no affect. Note currently quality values
135 only affect SNPs and not inserted sequences, which still get
136 scores with a fixed +1 per base type occurrence.
137
138
139 -d D, --min-depth D
140 The minimum depth required to make a call. Defaults to 1.
141 Failing this depth check will produce consensus "N", or ab‐
142 sent if it is an insertion.
143
144
145 -H H, --het-fract H
146 For consensus columns containing multiple base types, if the
147 second most frequent type is at least H fraction of the most
148 common type then a heterozygous base type will be reported in
149 the consensus. Otherwise the most common base is used, pro‐
150 vided it meets the --call-fract parameter (otherwise "N").
151 The fractions computed may be modified by the use of quality
152 values if the -q option is enabled. Note although IUPAC has
153 ambiguity codes for A,C,G,T vs any other A,C,G,T it does not
154 have codes for A,C,G,T vs gap (such as in a heterozygous
155 deletion). Given the lack of any official code, we use
156 lower-case letter to symbolise a half-present base type.
157
158
159 -c C, --call-fract C
160 Only used for the simple consensus algorithm. Require at
161 least C fraction of bases agreeing with the most likely con‐
162 sensus call to omit that base type. This defaults to 0.75.
163 Failing this check will output "N".
164
165
166
167 The following options apply only to Bayesian consensus mode enabled
168 with the -5 option.
169
170
171 -5 Enable Bayesian consensus algorithm.
172
173
174 -C C, --cutoff C
175 Only used for the Gap5 consensus mode, which produces a Phred
176 style score for the final consensus quality. If this is be‐
177 low C then the consensus is called as "N".
178
179
180 --use-MQ, --no-use-MQ
181 Enable or disable the use of mapping qualities. Defaults to
182 on.
183
184
185 --adj-MQ, --no-adj-MQ
186 If mapping qualities are used, this controls whether they are
187 scaled by the local number of mismatches to the reference.
188 The reference is unknown by this tool, so this data is ob‐
189 tained from the MD:Z auxiliary tag (or ignored if not
190 present). Defaults to on.
191
192
193 --NM-halo INT
194 Specifies the distance either side of the base call being
195 considered for computing the number of local mismatches.
196
197
198 --low-MQ MIN, --high-MQ MAX
199 Specifies a minimum and maximum value of the mapping quality.
200 These are not filters and instead simply put upper and lower
201 caps on the values. The defaults are 0 and 60.
202
203
204 --scale-MQ FLOAT
205 This is a general multiplicative mapping quality scaling
206 factor. The effect is to globally raise or lower the quality
207 values used in the consensus algorithm. Defaults to 1.0,
208 which leaves the values unchanged.
209
210
211 --P-het FLOAT
212 Controls the likelihood of any position being a heterozygous
213 site. Defaults to 1e-4. Smaller numbers makes the algorithm
214 more likely to call a pure base type. Note the algorithm
215 will always compute the probability of the base being homozy‐
216 gous vs heterozygous, irrespective of whether the output is
217 reported as ambiguous (it will be "N" if deemed to be het‐
218 erozygous without --ambig mode enabled).
219
220
222 - Create a modified FASTA reference that has a 1:1 coordinate cor‐
223 respondence with the original reference used in alignment.
224
225 samtools consensus -a --show-ins no in.bam -o ref.fa
226
227
228
229 - Create a FASTQ file for the contigs with aligned data, including
230 insertions.
231
232 samtools consensus -f fastq in.bam -o cons.fq
233
234
235
237 Written by James Bonfield from the Sanger Institute.
238
239
241 samtools(1), samtools-mpileup(1),
242
243 Samtools website: <http://www.htslib.org/>
244
245
246
247samtools-1.15.1 7 April 2022 samtools-consensus(1)