samtools-mpileup(1)          Bioinformatics tools          samtools-mpileup(1)

NAME
samtools mpileup - produces "pileup" textual format from an alignment

SYNOPSIS
samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list]
    [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

DESCRIPTION
Generate text pileup output for one or multiple BAM files. Each input
file produces a separate group of pileup columns in the output.

Samtools mpileup can still produce VCF and BCF output (with -g or -u),
but this feature is deprecated and will be removed in a future release.
Please use bcftools mpileup for this instead. (Documentation on the
deprecated options has been removed from this manual page, but older
versions are available online at <http://www.htslib.org/doc/>.)

Note that there are two orthogonal ways to specify locations in the
input file: via -r region and via -l file. The former uses (and
requires) an index to do random access, while the latter streams
through the file contents and keeps only the specified regions,
requiring no index. The two may be used in conjunction. For example, a
BED file containing locations of genes in chromosome 20 could be
specified using -r 20 -l chr20.bed, meaning that the index is used to
find chromosome 20, which is then filtered for the regions listed in
the BED file.


Pileup Format
Pileup format consists of TAB-separated lines, with each line
representing the pileup of reads at a single genomic position.

Several columns contain numeric quality values encoded as individual
ASCII characters. Each character can range from “!” to “~” and is
decoded by taking its ASCII value and subtracting 33; e.g., “A”
encodes the numeric value 32.
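
The decoding rule above can be sketched in a few lines of Python. This
is an illustration only: the function name decode_quality is
hypothetical and not part of samtools.

```python
def decode_quality(column: str) -> list[int]:
    """Decode a pileup quality column: one value per character (ASCII - 33)."""
    values = []
    for ch in column:
        # The format only allows printable characters from "!" to "~".
        if not "!" <= ch <= "~":
            raise ValueError(f"character {ch!r} is outside the '!'..'~' range")
        values.append(ord(ch) - 33)
    return values


print(decode_quality("A!~"))  # [32, 0, 93]
```

The same function applies to the base quality column and, when
-s/--output-MQ is used, to the mapping quality column.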

The first three columns give the position and reference:

○ Chromosome name.

○ 1-based position on the chromosome.

○ Reference base at this position (this will be “N” on all lines if
  -f/--fasta-ref has not been used).

The remaining columns show the pileup data, and are repeated for each
input BAM file specified:

○ Number of reads covering this position.

○ Read bases. This encodes information on matches, mismatches, indels,
  strand, mapping quality, and starts and ends of reads.

  For each read covering the position, this column contains:

  • If this is the first position covered by the read, a “^” character
    followed by the alignment's mapping quality encoded as an ASCII
    character.

  • A single character indicating the read base and the strand to which
    the read has been mapped:

    Forward   Reverse   Meaning
    ─────────────────────────────────────────────────────────────
    . dot     , comma   Base matches the reference base
    ACGTN     acgtn     Base is a mismatch to the reference base
    >         <         Reference skip (due to CIGAR “N”)
    *         */#       Deletion of the reference base (CIGAR “D”)

    Deleted bases are shown as “*” on both strands unless --reverse-del
    is used, in which case they are shown as “#” on the reverse strand.

  • If there is an insertion after this read base, text matching
    “\+[0-9]+[ACGTNacgtn*#]+”: a “+” character followed by an integer
    giving the length of the insertion and then the inserted sequence.
    Pads are shown as “*” unless --reverse-del is used, in which case
    pads on the reverse strand will be shown as “#”.

  • If there is a deletion after this read base, text matching
    “-[0-9]+[ACGTNacgtn]+”: a “-” character followed by an integer
    giving the length of the deletion and then the deleted reference
    bases. (Subsequent pileup lines will contain “*” for this read
    indicating the deleted bases.)

  • If this is the last position covered by the read, a “$” character.

○ Base qualities, encoded as ASCII characters.

○ Alignment mapping qualities, encoded as ASCII characters. (Column
  only present when -s/--output-MQ is used.)

○ Comma-separated 1-based positions within the alignments, e.g., 5
  indicates that it is the fifth base of the corresponding read that is
  mapped to this genomic position. (Column only present when
  -O/--output-BP is used.)

○ Additional comma-separated read field columns, as selected via
  --output-extra. The fields selected appear in the same order as in
  SAM: QNAME, FLAG, RNAME, POS, MAPQ (displayed numerically), RNEXT,
  PNEXT.

○ Additional read tag field columns, as selected via --output-extra.
  These columns are formatted as determined by --output-sep and
  --output-empty (comma-separated by default), and appear in the same
  order as the tags are given in --output-extra.
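
The read bases column described above can be split into per-read
records with a small hand-written scanner. The following Python sketch
is illustrative only: parse_read_bases and the dictionary keys are
hypothetical names, and the input is assumed to come from a line with a
non-zero read count. The indel length is used to take exactly that many
characters, because the character after an indel sequence already
belongs to the next read.

```python
BASE_CHARS = set(".,ACGTNacgtn<>*#")
DIGITS = "0123456789"


def parse_read_bases(column):
    """Split a read-bases column into one dict per read (illustrative sketch)."""
    reads = []
    start_mapq = None  # set when "^" announces the first position of a read
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            if i + 1 >= len(column):
                raise ValueError("'^' must be followed by a mapping quality character")
            start_mapq = ord(column[i + 1]) - 33
            i += 2
        elif ch in BASE_CHARS:
            reads.append({"base": ch, "start_mapq": start_mapq,
                          "insertion": None, "deletion": None, "last": False})
            start_mapq = None
            i += 1
        elif ch in "+-" and reads:
            j = i + 1
            while j < len(column) and column[j] in DIGITS:
                j += 1
            if j == i + 1:
                raise ValueError(f"missing indel length at offset {i}")
            length = int(column[i + 1:j])
            # Take exactly `length` characters; the rest belongs to later reads.
            sequence = column[j:j + length]
            if len(sequence) != length:
                raise ValueError(f"truncated indel sequence at offset {i}")
            reads[-1]["insertion" if ch == "+" else "deletion"] = sequence
            i = j + length
        elif ch == "$" and reads:
            reads[-1]["last"] = True
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return reads


for read in parse_read_bases("^I.,+2agA-1Ct$"):
    print(read)
```

For the made-up column "^I.,+2agA-1Ct$" this yields four reads: a
forward match that starts here with mapping quality 40, a reverse match
followed by the insertion "ag", a forward mismatch "A" followed by a
deletion of "C", and a reverse mismatch "t" that ends at this position.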

OPTIONS
-6, --illumina1.3+
    Assume the quality is in the Illumina 1.3+ encoding.

-A, --count-orphans
    Do not skip anomalous read pairs in variant calling. Anomalous
    read pairs are those marked in the FLAG field as paired in
    sequencing but without the properly-paired flag set.

-b, --bam-list FILE
    List of input BAM files, one file per line [null]

-B, --no-BAQ
    Disable base alignment quality (BAQ) computation. See BAQ below.

-C, --adjust-MQ INT
    Coefficient for downgrading mapping quality for reads containing
    excessive mismatches. Given a read with a phred-scaled probability
    q of being generated from the mapped position, the new mapping
    quality is about sqrt((INT-q)/INT)*INT. A zero value disables this
    functionality; if enabled, the recommended value for BWA is 50. [0]
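
As a worked illustration of the approximate formula given for -C, the
Python sketch below evaluates sqrt((INT-q)/INT)*INT directly. It is not
the samtools implementation: the name adjusted_mapq is hypothetical, and
raising an error when q exceeds INT is an assumption of this sketch,
made only because the expression has no real value there.

```python
import math


def adjusted_mapq(q: float, coef: int) -> float:
    """Evaluate the approximate formula sqrt((INT - q) / INT) * INT."""
    if coef == 0:
        raise ValueError("a coefficient of 0 disables the adjustment")
    if q > coef:
        # Assumption of this sketch: reject inputs the formula cannot handle.
        raise ValueError("the formula has no real value when q exceeds INT")
    return math.sqrt((coef - q) / coef) * coef


# With the value recommended for BWA (50) and q = 18:
# sqrt(32 / 50) * 50 = 0.8 * 50
print(round(adjusted_mapq(18, 50), 1))  # 40.0
```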

-d, --max-depth INT
    At a position, read at most INT reads per input file. Setting
    this limit reduces the amount of memory and time needed to process
    regions with very high coverage. Passing zero for this option sets
    it to the highest possible value, effectively removing the depth
    limit. [8000]

    Note that up to release 1.8, samtools would enforce a minimum
    value for this option. This no longer happens and the limit is set
    exactly as specified.

-E, --redo-BAQ
    Recalculate BAQ on the fly, ignoring existing BQ tags. See BAQ
    below.

-f, --fasta-ref FILE
    The faidx-indexed reference file in the FASTA format. The file can
    be optionally compressed by bgzip. [null]

    Supplying a reference file will enable base alignment quality
    calculation for all reads aligned to a reference in the file. See
    BAQ below.

-G, --exclude-RG FILE
    Exclude reads from read groups listed in FILE (one @RG-ID per
    line)

-l, --positions FILE
    BED or position list file containing a list of regions or sites
    where pileup or BCF should be generated. Position list files
    contain two columns (chromosome and position) and start counting
    from 1. BED files contain at least 3 columns (chromosome, start
    and end position) and are 0-based half-open. While it is possible
    to mix both position-list and BED coordinates in the same file,
    this is strongly discouraged due to the differing coordinate
    systems. [null]
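
The difference between the two coordinate systems accepted by -l can be
made concrete with a short Python sketch. The function name
bed_to_positions is hypothetical; it simply lists the 1-based positions
(as used in position list files and in pileup output) that a 0-based
half-open BED interval covers.

```python
def bed_to_positions(start: int, end: int) -> range:
    """1-based positions covered by a 0-based half-open BED interval."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # The BED start is shifted by one; the BED end already equals the
    # last covered 1-based position because it is exclusive.
    return range(start + 1, end + 1)


print(list(bed_to_positions(99, 102)))  # [100, 101, 102]
```

A BED line "chr20 99 102" therefore selects the same sites as three
position list lines for chr20 at 100, 101 and 102.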

-q, --min-MQ INT
    Minimum mapping quality for an alignment to be used [0]

-Q, --min-BQ INT
    Minimum base quality for a base to be considered [13]

-r, --region STR
    Only generate pileup in region STR. Requires the BAM files to be
    indexed. If used in conjunction with -l then considers the
    intersection of the two requests. [all sites]

-R, --ignore-RG
    Ignore RG tags. Treat all reads in one BAM as one sample.

--rf, --incl-flags STR|INT
    Required flags: include reads with any of the mask bits set [null]

--ff, --excl-flags STR|INT
    Filter flags: skip reads with any of the mask bits set
    [UNMAP,SECONDARY,QCFAIL,DUP]
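
The two flag masks can be pictured as bitwise tests on the SAM FLAG
field. The Python sketch below is an illustration, not samtools code:
read_passes is a hypothetical name, the bit values are those defined by
the SAM specification, and treating an empty --rf mask as "no
requirement" is an assumption of this sketch.

```python
# Bit values of the named flags, as defined by the SAM specification.
UNMAP, SECONDARY, QCFAIL, DUP = 0x4, 0x100, 0x200, 0x400

# Default --ff mask listed above.
DEFAULT_EXCL = UNMAP | SECONDARY | QCFAIL | DUP


def read_passes(flag: int, incl: int = 0, excl: int = DEFAULT_EXCL) -> bool:
    """Return True if a read with this FLAG survives both masks."""
    # --rf: keep reads with any of the required bits set
    # (assumption: an empty mask imposes no requirement).
    if incl and not (flag & incl):
        return False
    # --ff: skip reads with any of the filter bits set.
    return not (flag & excl)


print(DEFAULT_EXCL)       # 1796
print(read_passes(99))    # True: none of the default filter bits is set
print(read_passes(1123))  # False: 1123 = 99 + 0x400, marked as a duplicate
```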

-x, --ignore-overlaps
    Disable read-pair overlap detection.

-X  Include customized index files as part of the arguments: the
    index files are listed after the input BAM files. See the EXAMPLES
    section for sample usage.


Output Options:

-o, --output FILE
    Write pileup output to FILE, rather than the default of standard
    output.

    (The same short option is used for both the deprecated --open-prob
    option and --output. If -o's argument contains any non-digit
    characters other than a leading + or - sign, it is interpreted as
    --output. Usually the filename extension will take care of this,
    but to write to an entirely numeric filename use -o ./123 or
    --output 123.)

-O, --output-BP
    Output base positions on reads.

-s, --output-MQ
    Output mapping qualities encoded as ASCII characters.

--output-QNAME
    Output an extra column containing comma-separated read names.
    Equivalent to --output-extra QNAME.

--output-extra STR
    Output extra columns containing comma-separated values of read
    fields or read tags. The names of the selected fields have to be
    provided as they are described in the SAM Specification (page 6)
    and will be output by the mpileup command in the same order as in
    the document (i.e. QNAME, FLAG, RNAME, ...). The names are case
    sensitive. Currently, only the following fields are supported:

    QNAME, FLAG, RNAME, POS, MAPQ, RNEXT, PNEXT

    Anything that is not on this list is treated as a potential tag,
    although only two-character tags are accepted. In the mpileup
    output, tag columns are displayed in the order they were provided
    by the user on the command line. Field and tag names have to be
    provided in a comma-separated string to the mpileup command. E.g.

    samtools mpileup --output-extra FLAG,QNAME,RG,NM in.bam

    will display four extra columns in the mpileup output, the first
    being a list of comma-separated read names, followed by a list of
    flag values, a list of RG tag values and a list of NM tag values.
    Field values are always displayed before tag values.
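
The ordering rule for --output-extra (fields first, in SAM order, then
tags in the order given) can be sketched as follows. This Python
function is illustrative only; extra_column_order is a hypothetical
name and the sketch does not reproduce samtools' own argument parsing.

```python
# Supported read fields, in the order used by the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "RNEXT", "PNEXT"]


def extra_column_order(spec: str) -> list[str]:
    """Return the extra output columns in the order described above."""
    names = spec.split(",")
    # Fields are emitted in SAM order, whatever order the user gave.
    fields = [name for name in SAM_FIELDS if name in names]
    tags = []
    for name in names:
        if name in SAM_FIELDS:
            continue
        # Anything else is a potential tag; only two-character tags are accepted.
        if len(name) != 2:
            raise ValueError(f"{name!r} is neither a supported field nor a two-character tag")
        tags.append(name)  # tags keep the user's order
    return fields + tags


print(extra_column_order("FLAG,QNAME,RG,NM"))  # ['QNAME', 'FLAG', 'RG', 'NM']
```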

--output-sep CHAR
    Specify a different separator character for tag value lists, when
    those values might contain one or more commas (,), which is the
    default list separator. This option only affects columns for
    two-letter tags like NM; standard fields like FLAG or QNAME will
    always be separated by commas.

--output-empty CHAR
    Specify a different 'no value' character for tag list entries
    corresponding to reads that don't have a tag requested with the
    --output-extra option. The default is *.

    This option only applies to rows that have at least one read in
    the pileup, and only to columns for two-letter tags. Columns for
    empty rows will always be printed as *.

--reverse-del
    Mark the deletions on the reverse strand with the character #,
    instead of the usual *.

-a  Output all positions, including those with zero depth.

-a -a, -aa
    Output absolutely all positions, including unused reference
    sequences. Note that when used in conjunction with a BED file the
    -a option may sometimes operate as if -aa was specified if the
    reference sequence has coverage outside of the region specified in
    the BED file.

BAQ (Base Alignment Quality)

BAQ is the Phred-scaled probability of a read base being misaligned.
It greatly helps to reduce false SNPs caused by misalignments. BAQ is
calculated using the probabilistic realignment method described in the
paper “Improving SNP discovery by base alignment quality”, Heng Li,
Bioinformatics, Volume 27, Issue 8
<https://doi.org/10.1093/bioinformatics/btr076>.

BAQ is turned on when a reference file is supplied using the -f option.
To disable it, use the -B option.

It is possible to store precalculated BAQ values in a SAM BQ:Z tag.
Samtools mpileup will use the precalculated values if it finds them.
The -E option can be used to make it ignore the contents of the BQ:Z
tag and force it to recalculate the BAQ scores by making a new
alignment.

EXAMPLES
○ Call SNPs and short INDELs:

    samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
    bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf

  The bcftools filter command marks low quality sites and sites with
  the read depth exceeding a limit, which should be adjusted to about
  twice the average read depth (bigger read depths usually indicate
  problematic regions which are often enriched for artefacts). Consider
  adding -C50 to mpileup if mapping quality is overestimated for reads
  containing excessive mismatches. Applying this option usually helps
  BWA-short but may not help other mappers. Note that -u is one of the
  deprecated VCF/BCF output options; bcftools mpileup is the
  recommended replacement.

  Individuals are identified from the SM tags in the @RG header lines.
  Individuals can be pooled in one alignment file; one individual can
  also be separated into multiple files. The -P option specifies that
  indel candidates should be collected only from read groups with the
  @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
  sequenced by an indel-prone technology may affect the performance of
  indel calling.


○ Generate the consensus sequence for one diploid individual:

    samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq


○ Include customized index files as part of the arguments:

    samtools mpileup [options] -X /data_folder/in1.bam [/data_folder/in2.bam [...]] /index_folder/index1.bai [/index_folder/index2.bai [...]]


○ Phase one individual:

    samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out

  The calmd command is used to reduce false heterozygotes around
  INDELs.

AUTHOR
Written by Heng Li from the Sanger Institute.

SEE ALSO
samtools(1), samtools-depth(1), samtools-sort(1), bcftools(1)

Samtools website: <http://www.htslib.org/>

samtools-1.13                    7 July 2021                    samtools-mpileup(1)