samtools-mpileup(1) Bioinformatics tools samtools-mpileup(1)
2
3
4
NAME
samtools mpileup - produces "pileup" textual format from an alignment
7
SYNOPSIS
samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list]
[-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
11
12
DESCRIPTION
Generate text pileup output for one or multiple BAM files. Each input
file produces a separate group of pileup columns in the output.

Note that there are two orthogonal ways to specify locations in the
input file: via -r region and -l file. The former uses (and requires)
an index to do random access, while the latter streams through the file
contents keeping only the specified regions, requiring no index. The
two may be used in conjunction. For example, a BED file containing
locations of genes in chromosome 20 could be specified using -r 20 -l
chr20.bed, meaning that the index is used to find chromosome 20 and
then it is filtered for the regions listed in the BED file.
25
26
27 Pileup Format
28 Pileup format consists of TAB-separated lines, with each line repre‐
29 senting the pileup of reads at a single genomic position.
30
31 Several columns contain numeric quality values encoded as individual
32 ASCII characters. Each character can range from “!” to “~” and is de‐
33 coded by taking its ASCII value and subtracting 33; e.g., “A” encodes
34 the numeric value 32.
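
The decoding rule above can be sketched in a few lines of Python. This is
an illustration only; the function name decode_qualities is hypothetical
and is not part of samtools.

```python
from typing import List


def decode_qualities(encoded: str) -> List[int]:
    """Decode a pileup quality string: ASCII value minus 33 per character."""
    values = []
    for ch in encoded:
        # Valid characters run from "!" (value 0) to "~" (value 93).
        if not "!" <= ch <= "~":
            raise ValueError(f"character {ch!r} is outside the '!'..'~' range")
        values.append(ord(ch) - 33)
    return values


print(decode_qualities("!A~"))  # [0, 32, 93]
```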
35
36 The first three columns give the position and reference:
37
38 ○ Chromosome name.
39
40 ○ 1-based position on the chromosome.
41
42 ○ Reference base at this position (this will be “N” on all lines if
43 -f/--fasta-ref has not been used).
44
45 The remaining columns show the pileup data, and are repeated for each
46 input BAM file specified:
47
48 ○ Number of reads covering this position.
49
50 ○ Read bases. This encodes information on matches, mismatches, indels,
51 strand, mapping quality, and starts and ends of reads.
52
53 For each read covering the position, this column contains:
54
55 • If this is the first position covered by the read, a “^” character
56 followed by the alignment's mapping quality encoded as an ASCII
57 character.
58
59 • A single character indicating the read base and the strand to which
60 the read has been mapped:
61
62 Forward Reverse Meaning
63 ───────────────────────────────────────────────────────────────
64 . dot , comma Base matches the reference base
65 ACGTN acgtn Base is a mismatch to the reference base
66
67 > < Reference skip (due to CIGAR “N”)
68 * */# Deletion of the reference base (CIGAR “D”)
69
70 Deleted bases are shown as “*” on both strands unless --reverse-del
71 is used, in which case they are shown as “#” on the reverse strand.
72
73 • If there is an insertion after this read base, text matching
74 “\+[0-9]+[ACGTNacgtn*#]+”: a “+” character followed by an integer
75 giving the length of the insertion and then the inserted sequence.
76 Pads are shown as “*” unless --reverse-del is used, in which case
77 pads on the reverse strand will be shown as “#”.
78
79 • If there is a deletion after this read base, text matching
80 “-[0-9]+[ACGTNacgtn]+”: a “-” character followed by the deleted
81 reference bases represented similarly. (Subsequent pileup lines
82 will contain “*” for this read indicating the deleted bases.)
83
84 • If this is the last position covered by the read, a “$” character.
85
86 ○ Base qualities, encoded as ASCII characters.
87
88 ○ Alignment mapping qualities, encoded as ASCII characters. (Column
89 only present when -s/--output-MQ is used.)
90
91 ○ Comma-separated 1-based positions within the alignments, in the ori‐
92 entation shown in the input file. E.g., 5 indicates that it is the
93 fifth base of the corresponding read that is mapped to this genomic
94 position. (Column only present when -O/--output-BP is used.)
95
96 ○ Additional comma-separated read field columns, as selected via --out‐
97 put-extra. The fields selected appear in the same order as in SAM:
98 QNAME, FLAG, RNAME, POS, MAPQ (displayed numerically), RNEXT, PNEXT.
99
100 ○ Comma-separated 1-based positions within the alignments, in 5' to 3'
101 orientation. E.g., 5 indicates that it is the fifth base of the cor‐
102 responding read as produced by the sequencing instrument, that is
103 mapped to this genomic position. (Column only present when --output-
104 BP-5 is used.)
105
106
107 ○ Additional read tag field columns, as selected via --output-extra.
108 These columns are formatted as determined by --output-sep and --out‐
109 put-empty (comma-separated by default), and appear in the same order
110 as the tags are given in --output-extra.
111
Any output column that would be empty, for example because a tag is not
present or the filtered sequence depth is zero, is reported as "*".
This ensures a consistent number of columns across all reported
positions.
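
As a rough illustration of the read bases column described above, the
following Python sketch splits a column into one record per read. It
assumes default output (no --output-mods, --no-output-ins or
--no-output-del) and a non-empty column; check the depth column first,
because an empty column is also reported as "*". The function name and
the record keys are hypothetical.

```python
from typing import Dict, List

BASE_CHARS = set(".,ACGTNacgtn><*#")


def parse_read_bases(column: str) -> List[Dict[str, object]]:
    """Split a default-format read bases column into one record per read."""
    reads = []
    i = 0
    while i < len(column):
        read = {"base": None, "start_mapq": None, "insertion": None,
                "deletion": None, "is_end": False}
        if column[i] == "^":
            # "^" is followed by the mapping quality as one ASCII character.
            read["start_mapq"] = ord(column[i + 1]) - 33
            i += 2
        if i >= len(column) or column[i] not in BASE_CHARS:
            raise ValueError(f"unexpected character at offset {i}")
        read["base"] = column[i]
        i += 1
        # Optional "+N<seq>" insertion and/or "-N<seq>" deletion markup.
        while i < len(column) and column[i] in "+-":
            kind = "insertion" if column[i] == "+" else "deletion"
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            read[kind] = column[j:j + length]
            i = j + length
        if i < len(column) and column[i] == "$":
            read["is_end"] = True
            i += 1
        reads.append(read)
    return reads


for read in parse_read_bases("^I.,+2ACa-1N$*"):
    print(read)
```

The length after "+" or "-" has to be read before the sequence can be
skipped, which is why a counting loop is used here rather than a single
regular expression.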
116
117
OPTIONS
-6, --illumina1.3+
120 Assume the quality is in the Illumina 1.3+ encoding.
121
122 -A, --count-orphans
123 Do not skip anomalous read pairs in variant calling. Anoma‐
124 lous read pairs are those marked in the FLAG field as paired
125 in sequencing but without the properly-paired flag set.
126
127 -b, --bam-list FILE
128 List of input BAM files, one file per line [null]
129
130 -B, --no-BAQ
131 Disable base alignment quality (BAQ) computation. See BAQ
132 below.
133
134 -C, --adjust-MQ INT
135 Coefficient for downgrading mapping quality for reads con‐
136 taining excessive mismatches. Given a read with a phred-
137 scaled probability q of being generated from the mapped posi‐
138 tion, the new mapping quality is about sqrt((INT-q)/INT)*INT.
139 A zero value disables this functionality; if enabled, the
140 recommended value for BWA is 50. [0]
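
A small Python sketch of the approximation quoted for -C. It evaluates
only the formula as written; samtools itself may round or clamp the
result differently, and the handling of values outside the range 0..INT
shown here is an assumption. The function name is hypothetical.

```python
import math


def adjusted_mapq(q: float, coef: int) -> float:
    """Evaluate sqrt((coef - q) / coef) * coef, the approximation given for -C."""
    if coef <= 0:
        # A coefficient of 0 disables the adjustment, so the formula does not apply.
        raise ValueError("a coefficient of 0 disables the adjustment")
    if not 0 <= q <= coef:
        # Assumption: the text does not say what happens outside this range.
        raise ValueError("q must lie between 0 and the coefficient")
    return math.sqrt((coef - q) / coef) * coef


print(round(adjusted_mapq(32, 50), 3))  # 30.0
```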
141
142 -d, --max-depth INT
At a position, read at most INT reads per input file. Setting
this limit reduces the amount of memory and time needed to
process regions with very high coverage. Passing zero for this
option sets it to the highest possible value, effectively
removing the depth limit. [8000]
148
149 Note that up to release 1.8, samtools would enforce a minimum
150 value for this option. This no longer happens and the limit
151 is set exactly as specified.
152
153 -E, --redo-BAQ
Recalculate BAQ on the fly, ignoring existing BQ tags. See BAQ
below.
156
157 -f, --fasta-ref FILE
158 The faidx-indexed reference file in the FASTA format. The
159 file can be optionally compressed by bgzip. [null]
160
161 Supplying a reference file will enable base alignment quality
162 calculation for all reads aligned to a reference in the file.
163 See BAQ below.
164
165 -G, --exclude-RG FILE
166 Exclude reads from read groups listed in FILE (one @RG-ID per
167 line)
168
169 -l, --positions FILE
BED or position list file containing a list of regions or
sites where pileup or BCF should be generated. Position list
files contain two columns (chromosome and position) and start
counting from 1. BED files contain at least 3 columns
(chromosome, start and end position) and are 0-based half-open.
While it is possible to mix both position-list and BED
coordinates in the same file, this is strongly discouraged due
to the differing coordinate systems. [null]
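
The two coordinate systems accepted by -l can be compared with a short
Python sketch that converts either kind of line to 1-based inclusive
coordinates. It assumes whitespace-separated columns and does not handle
comment or header lines; the function name is hypothetical.

```python
from typing import Tuple


def parse_positions_line(line: str) -> Tuple[str, int, int]:
    """Return (chromosome, first, last) as 1-based inclusive coordinates."""
    cols = line.split()
    if len(cols) == 2:
        # Position list: chromosome and a single 1-based position.
        pos = int(cols[1])
        return cols[0], pos, pos
    if len(cols) >= 3:
        # BED: 0-based half-open, so only the start needs shifting by one.
        start, end = int(cols[1]), int(cols[2])
        return cols[0], start + 1, end
    raise ValueError("expected at least two columns")


print(parse_positions_line("20\t100"))      # ('20', 100, 100)
print(parse_positions_line("20\t99\t100"))  # ('20', 100, 100)
```

Both example lines describe the same single site, which shows why mixing
the two styles in one file is easy to get wrong.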
178
179 -q, --min-MQ INT
180 Minimum mapping quality for an alignment to be used [0]
181
182 -Q, --min-BQ INT
183 Minimum base quality for a base to be considered. [13]
184
Note that base quality 0 is used as a filtering mechanism for
overlap removal. Hence using --min-BQ 0 will disable the
overlap removal code and act as if the --ignore-overlaps
option has been set.
189
190 -r, --region STR
Only generate pileup in region STR. Requires the BAM files to be
indexed. If used in conjunction with -l then considers the
intersection of the two requests. [all sites]
194
195 -R, --ignore-RG
196 Ignore RG tags. Treat all reads in one BAM as one sample.
197
198 --rf, --incl-flags STR|INT
199 Required flags: include reads with any of the mask bits set
200 [null]
201
202 --ff, --excl-flags STR|INT
203 Filter flags: skip reads with any of the mask bits set [UN‐
204 MAP,SECONDARY,QCFAIL,DUP]
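
The effect of the two flag masks can be sketched in Python as follows.
This is an illustration of the rules as worded above, not the samtools
implementation. The flag names follow samtools flags, and the helper
names parse_mask and keep_read are hypothetical.

```python
FLAG_BITS = {
    "PAIRED": 0x1, "PROPER_PAIR": 0x2, "UNMAP": 0x4, "MUNMAP": 0x8,
    "REVERSE": 0x10, "MREVERSE": 0x20, "READ1": 0x40, "READ2": 0x80,
    "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}


def parse_mask(value) -> int:
    """Accept an integer, a numeric string, or comma-separated flag names."""
    if isinstance(value, int):
        return value
    try:
        return int(value, 0)
    except ValueError:
        mask = 0
        for name in value.split(","):
            mask |= FLAG_BITS[name.strip().upper()]
        return mask


DEFAULT_EXCL = parse_mask("UNMAP,SECONDARY,QCFAIL,DUP")


def keep_read(flag: int, incl: int = 0, excl: int = DEFAULT_EXCL) -> bool:
    """Apply --rf (incl) and --ff (excl) style masks to one FLAG value."""
    if incl and not flag & incl:
        return False  # none of the required bits is set
    if flag & excl:
        return False  # at least one filtered bit is set
    return True


print(DEFAULT_EXCL)     # 1796
print(keep_read(99))    # True
print(keep_read(1123))  # False: the DUP bit (0x400) is set
```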
205
206 -x, --ignore-overlaps
207 Disable read-pair overlap detection.
208
-X Include a customized index file as part of the arguments. See
the EXAMPLES section for sample usage.
211
212
213 Output Options:
214
215 -o, --output FILE
216 Write pileup output to FILE, rather than the default of stan‐
217 dard output.
218
219
220 -O, --output-BP
221 Output base positions on reads in orientation listed in the
222 SAM file (left to right).
223
224 --output-BP-5
225 Output base positions on reads in their original 5' to 3'
226 orientation.
227
228 -s, --output-MQ
229 Output mapping qualities encoded as ASCII characters.
230
231 --output-QNAME
232 Output an extra column containing comma-separated read names.
233 Equivalent to --output-extra QNAME.
234
235 --output-extra STR
Output extra columns containing comma-separated values of
read fields or read tags. The names of the selected fields
have to be provided as they are described in the SAM
Specification (page 6) and will be output by the mpileup command
in the same order as in the document (i.e. QNAME, FLAG,
RNAME, ...). The names are case sensitive. Currently, only the
following fields are supported:
243
244 QNAME, FLAG, RNAME, POS, MAPQ, RNEXT, PNEXT
245
Anything that is not on this list is treated as a potential
tag, although only two-character tags are accepted. In the
mpileup output, tag columns are displayed in the order they
were provided by the user on the command line. Field and tag
names have to be provided in a comma-separated string to the
mpileup command. Tags of type B (byte array) are not
supported. An absent or unsupported tag will be listed as
"*". E.g.
254
255 samtools mpileup --output-extra FLAG,QNAME,RG,NM in.bam
256
257 will display four extra columns in the mpileup output, the
258 first being a list of comma-separated read names, followed by
259 a list of flag values, a list of RG tag values and a list of
260 NM tag values. Field values are always displayed before tag
261 values.
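
The column ordering rule for --output-extra (fields first, in SAM order,
then tags in the order given) can be sketched in Python. The function
name extra_column_order is hypothetical and the code only mirrors the
rule as stated here.

```python
from typing import List

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "RNEXT", "PNEXT"]


def extra_column_order(spec: str) -> List[str]:
    """Predict the order of the extra columns for an --output-extra string."""
    names = spec.split(",")
    # Supported fields always come first, in SAM order.
    fields = [name for name in SAM_FIELDS if name in names]
    # Everything else is a potential tag, kept in the order given.
    tags = [name for name in names if name not in SAM_FIELDS]
    for tag in tags:
        if len(tag) != 2:
            raise ValueError(f"{tag!r} is neither a supported field nor a two-character tag")
    return fields + tags


print(extra_column_order("FLAG,QNAME,RG,NM"))  # ['QNAME', 'FLAG', 'RG', 'NM']
```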
262
263 --output-sep CHAR
264 Specify a different separator character for tag value lists,
265 when those values might contain one or more commas (,), which
266 is the default list separator. This option only affects col‐
267 umns for two-letter tags like NM; standard fields like FLAG
268 or QNAME will always be separated by commas.
269
270 --output-empty CHAR
271 Specify a different 'no value' character for tag list entries
272 corresponding to reads that don't have a tag requested with
273 the --output-extra option. The default is *.
274
275 This option only applies to rows that have at least one read
276 in the pileup, and only to columns for two-letter tags. Col‐
277 umns for empty rows will always be printed as *.
278
279
280 -M, --output-mods
281 Adds base modification markup into the sequence column. This
282 uses the Mm and Ml auxiliary tags (or their uppercase equiva‐
283 lents). Any base in the sequence output may be followed by a
284 series of strand code quality strings enclosed within square
285 brackets where strand is "+" or "-", code is a single charac‐
286 ter (such as "m" or "h") or a ChEBI numeric in parentheses,
287 and quality is an optional numeric quality value. For exam‐
288 ple a "C" base with possible 5mC and 5hmC base modification
289 may be reported as "C[+m179+h40]".
290
Quality values are from 0 to 255 inclusive, representing a
linear scale of probability 0.0 to 1.0 in increments of
1/256. If quality values are absent (no Ml tag) these are
omitted, giving an example string of "C[+m+h]".
295
296 Note the base modifications may be identified on the reverse
297 strand, either due to the native ability for this detection
298 by the sequencing instrument or by the sequence subsequently
299 being reverse complemented. This can lead to modification
300 codes, such as "m" meaning 5mC, being shown for their comple‐
301 mentary bases, such as "G[-m50]".
302
303 When --output-mods is selected base modifications can appear
304 on any base in the sequence output, including during inser‐
305 tions. This may make parsing the string more complex, so
306 also see the --no-output-ins-mods and --no-output-ins options
307 to simplify this process.
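
A possible way to parse one base with its modification markup, shown as
a Python sketch. It handles a single token such as "C[+m179+h40]" and
assumes the token has already been isolated from the rest of the
sequence column; the function name and regular expressions are
illustrative only.

```python
import re
from typing import List, Optional, Tuple

# strand, then a single-letter code or a ChEBI number in parentheses,
# then an optional numeric quality.
MOD_RE = re.compile(r"([+-])([A-Za-z]|\([0-9]+\))([0-9]*)")
TOKEN_RE = re.compile(r"(.)\[((?:[+-](?:[A-Za-z]|\([0-9]+\))[0-9]*)+)\]")

Mod = Tuple[str, str, Optional[int]]


def parse_base_with_mods(token: str) -> Tuple[str, List[Mod]]:
    """Split e.g. 'C[+m179+h40]' into the base and its modifications."""
    match = TOKEN_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a base with modification markup: {token!r}")
    base, markup = match.groups()
    mods = []
    for strand, code, quality in MOD_RE.findall(markup):
        # Quality is omitted when the read has no Ml tag.
        mods.append((strand, code, int(quality) if quality else None))
    return base, mods


print(parse_base_with_mods("C[+m179+h40]"))
print(parse_base_with_mods("C[+m+h]"))
```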
308
309
310 --no-output-ins
311 Do not output the inserted bases in the sequence column.
312 Usually this is reported as "+length sequence", but with this
313 option it becomes simply "+length". For example an insertion
314 of AGT in a pileup column changes from "CCC+3AGTGCC" to
315 "CCC+3GCC".
316
317 Specifying this option twice also removes the "+length" por‐
318 tion, changing the example above to "CCCGCC".
319
320 The purpose of this change is to simplify parsing using basic
321 regular expressions, which traditionally cannot perform
322 counting operations. It is particularly beneficial when used
323 in conjunction with --output-mods as the syntax of the in‐
324 serted sequence is adjusted to also report possible base mod‐
325 ifications, but see also --no-output-ins-mods as an alterna‐
326 tive.
327
328
329 --no-output-ins-mods
330 Outputs the inserted bases in the sequence, but excluding any
331 base modifications. This only affects output when --output-
332 mods is also used.
333
334
335 --no-output-del
Do not output deleted reference bases in the sequence column.
Normally this is reported as "-length sequence", but with
this option it becomes simply "-length". For example a
deletion of 3 unknown bases (due to no reference being
specified) would normally be seen in a column as e.g.
"CCC-3NNNGCC", but will be reported as "CCC-3GCC" with this
option.
343
344 Specifying this option twice also removes the "-length" por‐
345 tion, changing the example above to "CCCGCC".
346
347 The purpose of this change is to simplify parsing using basic
348 regular expressions, which traditionally cannot perform
349 counting operations. See also --no-output-ins.
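
The parsing difference can be seen in a short Python sketch. With the
default output, the indel length must be read before the sequence can be
skipped; with --no-output-ins and --no-output-del a plain regular
expression is enough. The sketch assumes the column contains no "^"
markup (for example because --no-output-ends was used), since the
mapping quality character could otherwise be mistaken for indel markup.
The function name is hypothetical.

```python
import re


def strip_indels_counting(bases: str) -> str:
    """Remove '+N<seq>' and '-N<seq>' markup from a default-format column."""
    out = []
    i = 0
    while i < len(bases):
        if bases[i] in "+-":
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            # Skip exactly as many characters as the length announces.
            i = j + int(bases[i + 1:j])
        else:
            out.append(bases[i])
            i += 1
    return "".join(out)


# Default output needs the counting loop.
print(strip_indels_counting("CCC+3AGTGCC"))  # CCCGCC
print(strip_indels_counting("CCC-3NNNGCC"))  # CCCGCC

# With the sequence already omitted, a basic regular expression suffices.
print(re.sub(r"[+-][0-9]+", "", "CCC+3GCC"))  # CCCGCC
```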
350
351
352 --no-output-ends
353 Removes the “^” (with mapping quality) and “$” markup from
354 the sequence column.
355
356
357 --reverse-del
358 Mark the deletions on the reverse strand with the character
359 #, instead of the usual *.
360
361 -a Output all positions, including those with zero depth.
362
363 -a -a, -aa
364 Output absolutely all positions, including unused reference
365 sequences. Note that when used in conjunction with a BED
366 file the -a option may sometimes operate as if -aa was speci‐
367 fied if the reference sequence has coverage outside of the
368 region specified in the BED file.
369
370 BAQ (Base Alignment Quality)
371
372 BAQ is the Phred-scaled probability of a read base being misaligned.
373 It greatly helps to reduce false SNPs caused by misalignments. BAQ is
374 calculated using the probabilistic realignment method described in the
paper “Improving SNP discovery by base alignment quality”, Heng Li,
Bioinformatics, Volume 27, Issue 8
<https://doi.org/10.1093/bioinformatics/btr076>.
378
379 BAQ is turned on when a reference file is supplied using the -f option.
380 To disable it, use the -B option.
381
382 It is possible to store precalculated BAQ values in a SAM BQ:Z tag.
383 Samtools mpileup will use the precalculated values if it finds them.
384 The -E option can be used to make it ignore the contents of the BQ:Z
385 tag and force it to recalculate the BAQ scores by making a new align‐
386 ment.
387
388
AUTHOR
Written by Heng Li from the Sanger Institute.
391
392
SEE ALSO
samtools(1), samtools-depth(1), samtools-sort(1), bcftools(1)
395
396 Samtools website: <http://www.htslib.org/>
397
398
399
samtools-1.15.1 7 April 2022 samtools-mpileup(1)