samtools-mpileup(1)          Bioinformatics tools          samtools-mpileup(1)

NAME

       samtools mpileup - produces "pileup" textual format from an alignment

SYNOPSIS

       samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q
       minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

DESCRIPTION

       Generate text pileup output for one or multiple BAM files. Each input
       file produces a separate group of pileup columns in the output.

       Samtools mpileup can still produce VCF and BCF output (with -g or -u),
       but this feature is deprecated and will be removed in a future release.
       Please use bcftools mpileup for this instead. (Documentation on the
       deprecated options has been removed from this manual page, but older
       versions are available online at <http://www.htslib.org/doc/>.)

       Note that there are two orthogonal ways to specify locations in the
       input file: via -r region and -l file. The former uses (and requires)
       an index to do random access, while the latter streams through the file
       contents, keeping only the specified regions, and requires no index.
       The two may be used in conjunction. For example, a BED file containing
       locations of genes in chromosome 20 could be specified using -r 20 -l
       chr20.bed, meaning that the index is used to find chromosome 20 and
       then the result is filtered for the regions listed in the BED file.

   Pileup Format
       Pileup format consists of TAB-separated lines, with each line
       representing the pileup of reads at a single genomic position.

       Several columns contain numeric quality values encoded as individual
       ASCII characters. Each character can range from “!” to “~” and is
       decoded by taking its ASCII value and subtracting 33; e.g., “A” encodes
       the numeric value 32.
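
       The decoding rule above can be sketched in a few lines of Python. The
       function name decode_quality is hypothetical and is not part of
       samtools; it only restates the "subtract 33" rule for one character.

```python
def decode_quality(ch):
    """Return the numeric quality encoded by a single ASCII character."""
    # Valid characters run from "!" (value 0) to "~" (value 93).
    if len(ch) != 1 or not ("!" <= ch <= "~"):
        raise ValueError("expected one character between '!' and '~'")
    return ord(ch) - 33


def decode_quality_column(column):
    """Decode a whole quality column, one value per read."""
    return [decode_quality(ch) for ch in column]
```

       For instance, decode_quality("A") gives 32, matching the example in
       the text.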

       The first three columns give the position and reference:

       ○ Chromosome name.

       ○ 1-based position on the chromosome.

       ○ Reference base at this position (this will be “N” on all lines if
         -f/--fasta-ref has not been used).

       The remaining columns show the pileup data, and are repeated for each
       input BAM file specified:

       ○ Number of reads covering this position.

       ○ Read bases. This encodes information on matches, mismatches, indels,
         strand, mapping quality, and starts and ends of reads.

         For each read covering the position, this column contains:

         • If this is the first position covered by the read, a “^” character
           followed by the alignment's mapping quality encoded as an ASCII
           character.

         • A single character indicating the read base and the strand to which
           the read has been mapped:

           Forward   Reverse                    Meaning
           ───────────────────────────────────────────────────────────────
            . dot    , comma   Base matches the reference base
            ACGTN     acgtn    Base is a mismatch to the reference base
              >         <      Reference skip (due to CIGAR “N”)
              *        */#     Deletion of the reference base (CIGAR “D”)

           Deleted bases are shown as “*” on both strands unless --reverse-del
           is used, in which case they are shown as “#” on the reverse strand.

         • If there is an insertion after this read base, text matching
           “\+[0-9]+[ACGTNacgtn*#]+”: a “+” character followed by an integer
           giving the length of the insertion and then the inserted sequence.
           Pads are shown as “*” unless --reverse-del is used, in which case
           pads on the reverse strand will be shown as “#”.

         • If there is a deletion after this read base, text matching
           “-[0-9]+[ACGTNacgtn]+”: a “-” character followed by the deleted
           reference bases represented similarly. (Subsequent pileup lines
           will contain “*” for this read indicating the deleted bases.)

         • If this is the last position covered by the read, a “$” character.

       ○ Base qualities, encoded as ASCII characters.

       ○ Alignment mapping qualities, encoded as ASCII characters. (Column
         only present when -s/--output-MQ is used.)

       ○ Comma-separated 1-based positions within the alignments, e.g., 5
         indicates that it is the fifth base of the corresponding read that is
         mapped to this genomic position. (Column only present when
         -O/--output-BP is used.)

       ○ Additional comma-separated read field columns, as selected via
         --output-extra. The fields selected appear in the same order as in
         SAM: QNAME, FLAG, RNAME, POS, MAPQ (displayed numerically), RNEXT,
         PNEXT.

       ○ Additional read tag field columns, as selected via --output-extra.
         These columns are formatted as determined by --output-sep and
         --output-empty (comma-separated by default), and appear in the same
         order as the tags are given in --output-extra.
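
       As an illustration of the read-bases grammar described above, the
       following Python sketch splits that column into its elements. The
       function tokenize_read_bases is hypothetical and is not part of
       samtools; it assumes well-formed input. The indel length is read
       first so that exactly that many sequence characters are consumed,
       which a plain regular expression cannot guarantee when the next
       read's base follows immediately.

```python
import re

# Matches the sign and length prefix of an insertion or deletion.
_INDEL_PREFIX = re.compile(r"[+-]([0-9]+)")


def tokenize_read_bases(bases):
    """Split a pileup read-bases string into its individual elements."""
    tokens = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Read start: "^" plus one mapping-quality character.
            tokens.append(bases[i:i + 2])
            i += 2
        elif ch in "+-":
            # Indel: sign, decimal length, then that many sequence characters.
            match = _INDEL_PREFIX.match(bases, i)
            if match is None:
                raise ValueError("indel marker without a length at %d" % i)
            end = match.end() + int(match.group(1))
            tokens.append(bases[i:end])
            i = end
        else:
            # A base, deletion placeholder, reference skip, or "$" read end.
            tokens.append(ch)
            i += 1
    return tokens
```

       With the made-up input "^I.,+2AG.$-1c*" this yields the elements
       "^I", ".", ",", "+2AG", ".", "$", "-1c" and "*".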

OPTIONS

       -6, --illumina1.3+
                 Assume the quality is in the Illumina 1.3+ encoding.

       -A, --count-orphans
                 Do not skip anomalous read pairs in variant calling.
                 Anomalous read pairs are those marked in the FLAG field as
                 paired in sequencing but without the properly-paired flag
                 set.

       -b, --bam-list FILE
                 List of input BAM files, one file per line [null]

       -B, --no-BAQ
                 Disable base alignment quality (BAQ) computation. See BAQ
                 below.

       -C, --adjust-MQ INT
                 Coefficient for downgrading mapping quality for reads
                 containing excessive mismatches. Given a read with a
                 phred-scaled probability q of being generated from the mapped
                 position, the new mapping quality is about
                 sqrt((INT-q)/INT)*INT. A zero value disables this
                 functionality; if enabled, the recommended value for BWA is
                 50. [0]
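
       A small worked sketch of the approximate formula quoted for -C, in
       Python. The function adjusted_mapq is hypothetical; how samtools
       derives q from the mismatches is not described here, and the
       behaviour for q at or above INT is an assumption made only so that
       the square root stays defined.

```python
import math


def adjusted_mapq(q, coef):
    """Approximate -C adjustment: sqrt((coef - q) / coef) * coef."""
    if coef == 0:
        # A zero coefficient disables the adjustment, as stated in the text.
        return None
    if q >= coef:
        # Assumption of this sketch: clamp to zero instead of taking the
        # square root of a negative number.
        return 0.0
    return math.sqrt((coef - q) / coef) * coef
```

       With the recommended BWA value of 50 and q = 32, the result is about
       30, since sqrt(18/50) = 0.6.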

       -d, --max-depth INT
                 At a position, read at most INT reads per input file.
                 Setting this limit reduces the amount of memory and time
                 needed to process regions with very high coverage. Passing
                 zero for this option sets it to the highest possible value,
                 effectively removing the depth limit. [8000]

                 Note that up to release 1.8, samtools would enforce a minimum
                 value for this option. This no longer happens and the limit
                 is set exactly as specified.

       -E, --redo-BAQ
                 Recalculate BAQ on the fly, ignore existing BQ tags. See BAQ
                 below.

       -f, --fasta-ref FILE
                 The faidx-indexed reference file in the FASTA format. The
                 file can be optionally compressed by bgzip. [null]

                 Supplying a reference file will enable base alignment quality
                 calculation for all reads aligned to a reference in the file.
                 See BAQ below.

       -G, --exclude-RG FILE
                 Exclude reads from read groups listed in FILE (one @RG-ID per
                 line)

       -l, --positions FILE
                 BED or position list file containing a list of regions or
                 sites where pileup or BCF should be generated. Position list
                 files contain two columns (chromosome and position) and start
                 counting from 1. BED files contain at least 3 columns
                 (chromosome, start and end position) and are 0-based
                 half-open. While it is possible to mix both position-list and
                 BED coordinates in the same file, this is strongly
                 discouraged due to the differing coordinate systems. [null]
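
       The difference between the two coordinate systems accepted by -l can
       be made concrete with a short Python sketch. The function
       bed_to_one_based is hypothetical and only illustrates the conversion;
       it is not how samtools reads the file.

```python
def bed_to_one_based(start, end):
    """1-based positions covered by a 0-based half-open BED interval."""
    if start < 0 or end < start:
        raise ValueError("invalid BED interval")
    # BED start is 0-based and included; BED end is 0-based and excluded.
    return range(start + 1, end + 1)
```

       A BED line "chr20 99 102" therefore covers the same sites as the three
       position-list lines "chr20 100", "chr20 101" and "chr20 102".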

       -q, --min-MQ INT
                 Minimum mapping quality for an alignment to be used [0]

       -Q, --min-BQ INT
                 Minimum base quality for a base to be considered [13]

       -r, --region STR
                 Only generate pileup in region STR. Requires the BAM files to
                 be indexed. If used in conjunction with -l then considers the
                 intersection of the two requests. [all sites]

       -R, --ignore-RG
                 Ignore RG tags. Treat all reads in one BAM as one sample.

       --rf, --incl-flags STR|INT
                 Required flags: include reads with any of the mask bits set
                 [null]

       --ff, --excl-flags STR|INT
                 Filter flags: skip reads with any of the mask bits set
                 [UNMAP,SECONDARY,QCFAIL,DUP]
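
       The two flag masks can be read as the following Python sketch, which
       restates the wording of --rf and --ff above. The function
       passes_flag_filters is hypothetical; the bit values are those of the
       SAM FLAG field, and treating an unset --rf mask as "no requirement"
       is an assumption based on its [null] default.

```python
# SAM FLAG bits named in the default --ff mask.
UNMAP = 0x4
SECONDARY = 0x100
QCFAIL = 0x200
DUP = 0x400
DEFAULT_EXCL = UNMAP | SECONDARY | QCFAIL | DUP


def passes_flag_filters(flag, incl=0, excl=DEFAULT_EXCL):
    """Return True if a read with this FLAG value would be kept."""
    # --rf: when a mask is given, keep only reads with any of its bits set.
    if incl and not (flag & incl):
        return False
    # --ff: skip reads with any of the mask bits set.
    return not (flag & excl)
```

       For example, a read with FLAG 99 passes the defaults, while the same
       read marked as a duplicate (FLAG 1123) is skipped.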

       -x, --ignore-overlaps
                 Disable read-pair overlap detection.

       -X        Include customized index file as a part of arguments. See
                 the EXAMPLES section for sample usage.

       Output Options:

       -o, --output FILE
                 Write pileup output to FILE, rather than the default of
                 standard output.

                 (The same short option is used for both the deprecated
                 --open-prob option and --output. If -o's argument contains
                 any non-digit characters other than a leading + or - sign, it
                 is interpreted as --output. Usually the filename extension
                 will take care of this, but to write to an entirely numeric
                 filename use -o ./123 or --output 123.)
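
       The rule for telling the two meanings of -o apart can be sketched as
       below. The function interpret_short_o is hypothetical and only
       restates the sentence above; how samtools itself treats edge cases
       such as an empty argument or a bare sign is not covered here.

```python
import re

# An optional leading sign followed only by digits.
_NUMERIC_ARG = re.compile(r"[+-]?[0-9]+")


def interpret_short_o(arg):
    """Guess which long option a -o argument stands for."""
    if _NUMERIC_ARG.fullmatch(arg):
        return "open-prob"
    return "output"
```

       Under this reading "123" is taken as --open-prob, whereas "./123" and
       "out.pileup" are taken as --output.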

       -O, --output-BP
                 Output base positions on reads.

       -s, --output-MQ
                 Output mapping qualities encoded as ASCII characters.

       --output-QNAME
                 Output an extra column containing comma-separated read names.
                 Equivalent to --output-extra QNAME.

       --output-extra STR
                 Output extra columns containing comma-separated values of
                 read fields or read tags. The names of the selected fields
                 have to be provided as they are described in the SAM
                 Specification (page 6) and will be output by the mpileup
                 command in the same order as in the document (i.e. QNAME,
                 FLAG, RNAME, ...). The names are case sensitive. Currently,
                 only the following fields are supported:

                 QNAME, FLAG, RNAME, POS, MAPQ, RNEXT, PNEXT

                 Anything that is not on this list is treated as a potential
                 tag, although only two-character tags are accepted. In the
                 mpileup output, tag columns are displayed in the order they
                 were provided by the user on the command line. Field and tag
                 names have to be provided in a comma-separated string to the
                 mpileup command. E.g.

                 samtools mpileup --output-extra FLAG,QNAME,RG,NM in.bam

                 will display four extra columns in the mpileup output, the
                 first being a list of comma-separated read names, followed by
                 a list of flag values, a list of RG tag values and a list of
                 NM tag values. Field values are always displayed before tag
                 values.
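
       The column ordering described for --output-extra can be sketched in
       Python as follows. The function extra_column_order is hypothetical
       and only models the ordering rule; dropping repeated names is an
       assumption, since the text does not say how duplicates are handled.

```python
# Supported read fields, in SAM order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "RNEXT", "PNEXT"]


def extra_column_order(spec):
    """Return the output column order for an --output-extra argument."""
    names = spec.split(",")
    # Fields come first, always in SAM order regardless of the request order.
    fields = [name for name in SAM_FIELDS if name in names]
    # Tags follow, in the order the user gave them.
    tags = []
    for name in names:
        if name in SAM_FIELDS:
            continue
        if len(name) != 2:
            raise ValueError("not a field or two-character tag: %r" % name)
        if name not in tags:  # assumption: keep the first occurrence only
            tags.append(name)
    return fields + tags
```

       For the command shown above, "FLAG,QNAME,RG,NM" gives the order QNAME,
       FLAG, RG, NM.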

       --output-sep CHAR
                 Specify a different separator character for tag value lists,
                 when those values might contain one or more commas (,), which
                 is the default list separator. This option only affects
                 columns for two-letter tags like NM; standard fields like
                 FLAG or QNAME will always be separated by commas.

       --output-empty CHAR
                 Specify a different 'no value' character for tag list entries
                 corresponding to reads that don't have a tag requested with
                 the --output-extra option. The default is *.

                 This option only applies to rows that have at least one read
                 in the pileup, and only to columns for two-letter tags.
                 Columns for empty rows will always be printed as *.

       --reverse-del
                 Mark the deletions on the reverse strand with the character
                 #, instead of the usual *.

       -a        Output all positions, including those with zero depth.

       -a -a, -aa
                 Output absolutely all positions, including unused reference
                 sequences. Note that when used in conjunction with a BED
                 file the -a option may sometimes operate as if -aa was
                 specified if the reference sequence has coverage outside of
                 the region specified in the BED file.

       BAQ (Base Alignment Quality)

       BAQ is the Phred-scaled probability of a read base being misaligned.
       It greatly helps to reduce false SNPs caused by misalignments. BAQ is
       calculated using the probabilistic realignment method described in the
       paper “Improving SNP discovery by base alignment quality”, Heng Li,
       Bioinformatics, Volume 27, Issue 8
       <https://doi.org/10.1093/bioinformatics/btr076>

       BAQ is turned on when a reference file is supplied using the -f option.
       To disable it, use the -B option.

       It is possible to store precalculated BAQ values in a SAM BQ:Z tag.
       Samtools mpileup will use the precalculated values if it finds them.
       The -E option can be used to make it ignore the contents of the BQ:Z
       tag and force it to recalculate the BAQ scores by making a new
       alignment.

EXAMPLES

       o Call SNPs and short INDELs:

           samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
           bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf

         Note that -u is one of the deprecated options mentioned in
         DESCRIPTION; bcftools mpileup is the recommended replacement.

         The bcftools filter command marks low quality sites and sites with
         the read depth exceeding a limit, which should be adjusted to about
         twice the average read depth (bigger read depths usually indicate
         problematic regions which are often enriched for artefacts). Consider
         adding -C50 to mpileup if mapping quality is overestimated for reads
         containing excessive mismatches. Applying this option usually helps
         BWA-short but may not help other mappers.

         Individuals are identified from the SM tags in the @RG header lines.
         Individuals can be pooled in one alignment file; one individual can
         also be separated into multiple files. The -P option specifies that
         indel candidates should be collected only from read groups with the
         @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
         sequenced by an indel-prone technology may affect the performance of
         indel calling.

       o Generate the consensus sequence for one diploid individual:

           samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq

       o Include customized index file as a part of arguments:

           samtools mpileup [options] -X /data_folder/in1.bam [/data_folder/in2.bam [...]] /index_folder/index1.bai [/index_folder/index2.bai [...]]

       o Phase one individual:

           samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out

         The calmd command is used to reduce false heterozygotes around
         INDELs.

AUTHOR

       Written by Heng Li from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-depth(1), samtools-sort(1), bcftools(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.13                     7 July 2021              samtools-mpileup(1)