samtools-mpileup(1)          Bioinformatics tools          samtools-mpileup(1)

NAME

       samtools mpileup - produces "pileup" textual format from an alignment

SYNOPSIS

       samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q
       minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

DESCRIPTION

       Generate text pileup output for one or multiple BAM files.  Each input
       file produces a separate group of pileup columns in the output.

       Note that there are two orthogonal ways to specify locations in the
       input file: via -r region and -l file.  The former uses (and requires)
       an index to do random access, while the latter streams through the file
       contents, keeping only the specified regions, and requires no index.
       The two may be used in conjunction.  For example, a BED file containing
       locations of genes in chromosome 20 could be specified using -r 20 -l
       chr20.bed, meaning that the index is used to find chromosome 20 and
       then it is filtered for the regions listed in the BED file.

   Pileup Format
       Pileup format consists of TAB-separated lines, with each line
       representing the pileup of reads at a single genomic position.

       Several columns contain numeric quality values encoded as individual
       ASCII characters.  Each character can range from “!” to “~” and is
       decoded by taking its ASCII value and subtracting 33; e.g., “A” encodes
       the numeric value 32.

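       The decoding rule above can be sketched in a few lines of Python.
       This is an illustration only; the function names decode_quality and
       decode_qualities are hypothetical and not part of samtools.

```python
def decode_quality(ch: str) -> int:
    """Decode one ASCII-encoded quality character (ASCII value minus 33)."""
    if len(ch) != 1 or not "!" <= ch <= "~":
        raise ValueError(f"not a valid quality character: {ch!r}")
    return ord(ch) - 33


def decode_qualities(column: str) -> list[int]:
    """Decode a whole quality column, one value per character."""
    return [decode_quality(ch) for ch in column]


print(decode_quality("A"))       # 32, matching the example in the text
print(decode_qualities("!I~"))   # [0, 40, 93]
```
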
       The first three columns give the position and reference:

       ○ Chromosome name.

       ○ 1-based position on the chromosome.

       ○ Reference base at this position (this will be “N” on all lines if
         -f/--fasta-ref has not been used).

       The remaining columns show the pileup data, and are repeated for each
       input BAM file specified:

       ○ Number of reads covering this position.

       ○ Read bases.  This encodes information on matches, mismatches, indels,
         strand, mapping quality, and starts and ends of reads.

         For each read covering the position, this column contains:

         • If this is the first position covered by the read, a “^” character
           followed by the alignment's mapping quality encoded as an ASCII
           character.

         • A single character indicating the read base and the strand to which
           the read has been mapped:

           Forward   Reverse                    Meaning
           ───────────────────────────────────────────────────────────────
            . dot    , comma   Base matches the reference base
            ACGTN     acgtn    Base is a mismatch to the reference base
              >         <      Reference skip (due to CIGAR “N”)
              *        */#     Deletion of the reference base (CIGAR “D”)

           Deleted bases are shown as “*” on both strands unless --reverse-del
           is used, in which case they are shown as “#” on the reverse strand.

         • If there is an insertion after this read base, text matching
           “\+[0-9]+[ACGTNacgtn*#]+”: a “+” character followed by an integer
           giving the length of the insertion and then the inserted sequence.
           Pads are shown as “*” unless --reverse-del is used, in which case
           pads on the reverse strand will be shown as “#”.

         • If there is a deletion after this read base, text matching
           “-[0-9]+[ACGTNacgtn]+”: a “-” character followed by the deleted
           reference bases represented similarly.  (Subsequent pileup lines
           will contain “*” for this read indicating the deleted bases.)

         • If this is the last position covered by the read, a “$” character.

       ○ Base qualities, encoded as ASCII characters.

       ○ Alignment mapping qualities, encoded as ASCII characters.  (Column
         only present when -s/--output-MQ is used.)

       ○ Comma-separated 1-based positions within the alignments, in the
         orientation shown in the input file.  E.g., 5 indicates that it is the
         fifth base of the corresponding read that is mapped to this genomic
         position.  (Column only present when -O/--output-BP is used.)

       ○ Additional comma-separated read field columns, as selected via
         --output-extra.  The fields selected appear in the same order as in
         SAM: QNAME, FLAG, RNAME, POS, MAPQ (displayed numerically), RNEXT,
         PNEXT.

       ○ Comma-separated 1-based positions within the alignments, in 5' to 3'
         orientation.  E.g., 5 indicates that it is the fifth base of the
         corresponding read, as produced by the sequencing instrument, that is
         mapped to this genomic position.  (Column only present when
         --output-BP-5 is used.)

       ○ Additional read tag field columns, as selected via --output-extra.
         These columns are formatted as determined by --output-sep and
         --output-empty (comma-separated by default), and appear in the same
         order as the tags are given in --output-extra.

         Any output column that would be empty, for example because a tag is
         not present or the filtered sequence depth is zero, is reported as
         "*".  This ensures a consistent number of columns across all reported
         positions.

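       As an illustration of the read bases column described above, the
       following Python sketch splits the column into one record per read.
       It assumes default output (no --output-mods and none of the
       --no-output-* options), follows the element order given in the list
       above, and uses a hypothetical function name, parse_read_bases.
       It is not how samtools itself processes the data.

```python
import re

_LENGTH = re.compile(r"[0-9]+")


def parse_read_bases(bases: str) -> list[dict]:
    """Split a read-bases column into one record per read."""
    reads = []
    i = 0
    while i < len(bases):
        rec = {"mapq": None, "base": None, "ins": None, "del": None, "end": False}
        if bases[i] == "^":
            # "^" is followed by one character holding mapping quality + 33.
            rec["mapq"] = ord(bases[i + 1]) - 33
            i += 2
        rec["base"] = bases[i]
        i += 1
        # "+" or "-", a decimal length, then exactly that many characters.
        while i < len(bases) and bases[i] in "+-":
            m = _LENGTH.match(bases, i + 1)
            if m is None:
                raise ValueError(f"indel without a length at offset {i}")
            length = int(m.group())
            key = "ins" if bases[i] == "+" else "del"
            rec[key] = bases[m.end():m.end() + length]
            i = m.end() + length
        if i < len(bases) and bases[i] == "$":
            rec["end"] = True
            i += 1
        reads.append(rec)
    return reads


reads = parse_read_bases("^I.,+2ag*T$")
print(len(reads))                    # 4 reads, one record each
print([r["base"] for r in reads])    # ['.', ',', '*', 'T']
```

       The indel length has to be read before the sequence can be consumed,
       which is why a counting loop is used here rather than a single
       regular expression.
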

OPTIONS

       -6, --illumina1.3+
                 Assume the quality is in the Illumina 1.3+ encoding.

       -A, --count-orphans
                 Do not skip anomalous read pairs in variant calling.
                 Anomalous read pairs are those marked in the FLAG field as
                 paired in sequencing but without the properly-paired flag set.

       -b, --bam-list FILE
                 List of input BAM files, one file per line [null]

       -B, --no-BAQ
                 Disable base alignment quality (BAQ) computation.  See BAQ
                 below.

       -C, --adjust-MQ INT
                 Coefficient for downgrading mapping quality for reads
                 containing excessive mismatches.  Given a read with a
                 phred-scaled probability q of being generated from the mapped
                 position, the new mapping quality is about
                 sqrt((INT-q)/INT)*INT.  A zero value disables this
                 functionality; if enabled, the recommended value for BWA is
                 50. [0]

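                 The approximate formula quoted for -C can be evaluated
                 directly.  The sketch below only computes the expression
                 as written; the function name approx_adjusted_mapq is
                 hypothetical, and rejecting q outside 0..INT is an
                 assumption made here because the square root is undefined
                 for q greater than INT.  How samtools treats such reads
                 is not covered by the text above.

```python
import math


def approx_adjusted_mapq(q: float, coef: int) -> float:
    """Evaluate sqrt((INT-q)/INT)*INT for a coefficient INT > 0."""
    if coef <= 0:
        raise ValueError("a zero coefficient disables the adjustment")
    if not 0 <= q <= coef:
        # Assumption of this sketch: the formula is only applied for 0 <= q <= INT.
        raise ValueError("formula is undefined for q outside 0..INT")
    return math.sqrt((coef - q) / coef) * coef


print(round(approx_adjusted_mapq(0, 50), 3))    # 50.0
print(round(approx_adjusted_mapq(32, 50), 3))   # 30.0
```
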
       -d, --max-depth INT
                 At a position, read at most INT reads per input file.
                 Setting this limit reduces the amount of memory and time
                 needed to process regions with very high coverage.  Passing
                 zero for this option sets it to the highest possible value,
                 effectively removing the depth limit. [8000]

                 Note that up to release 1.8, samtools would enforce a minimum
                 value for this option.  This no longer happens and the limit
                 is set exactly as specified.

       -E, --redo-BAQ
                 Recalculate BAQ on the fly, ignoring existing BQ tags.  See
                 BAQ below.

       -f, --fasta-ref FILE
                 The faidx-indexed reference file in FASTA format.  The file
                 can optionally be compressed with bgzip. [null]

                 Supplying a reference file will enable base alignment quality
                 calculation for all reads aligned to a reference in the file.
                 See BAQ below.

       -G, --exclude-RG FILE
                 Exclude reads from read groups listed in FILE (one @RG-ID per
                 line).

       -l, --positions FILE
                 BED or position list file containing a list of regions or
                 sites where pileup or BCF should be generated.  Position list
                 files contain two columns (chromosome and position) and start
                 counting from 1.  BED files contain at least 3 columns
                 (chromosome, start and end position) and are 0-based
                 half-open.  While it is possible to mix both position-list and
                 BED coordinates in the same file, this is strongly discouraged
                 due to the differing coordinate systems. [null]

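                 The difference between the two coordinate systems can be
                 made concrete with a short Python sketch.  The helper
                 names bed_to_one_based and position_in_bed_interval are
                 hypothetical, and requiring start < end is a
                 simplification made for this example.

```python
def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open BED interval to 1-based inclusive."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def position_in_bed_interval(pos: int, start: int, end: int) -> bool:
    """True if a 1-based position falls inside a 0-based half-open interval."""
    return start < pos <= end


print(bed_to_one_based(0, 100))               # (1, 100)
print(position_in_bed_interval(100, 0, 100))  # True
print(position_in_bed_interval(101, 0, 100))  # False
```

                 The BED line "chr20 0 100" therefore covers the same sites
                 as the position-list lines "chr20 1" through "chr20 100".
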
       -q, --min-MQ INT
                 Minimum mapping quality for an alignment to be used. [0]

       -Q, --min-BQ INT
                 Minimum base quality for a base to be considered. [13]

                 Note that base quality 0 is used as a filtering mechanism for
                 overlap removal.  Hence using --min-BQ 0 will disable the
                 overlap removal code and act as if the --ignore-overlaps
                 option has been set.

       -r, --region STR
                 Only generate pileup in region STR.  Requires the BAM files
                 to be indexed.  If used in conjunction with -l, the
                 intersection of the two requests is considered. [all sites]

       -R, --ignore-RG
                 Ignore RG tags.  Treat all reads in one BAM as one sample.

       --rf, --incl-flags STR|INT
                 Required flags: include reads with any of the mask bits set
                 [null]

       --ff, --excl-flags STR|INT
                 Filter flags: skip reads with any of the mask bits set
                 [UNMAP,SECONDARY,QCFAIL,DUP]

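                 The two flag filters can be read as simple bit-mask tests.
                 The Python sketch below illustrates that reading.  The
                 numeric bit values are those defined by the SAM
                 specification rather than by this page, the function name
                 passes_flag_filters is hypothetical, and treating an
                 include mask of 0 as "no requirement" is an assumption
                 standing in for the [null] default.

```python
# FLAG bit values as defined in the SAM specification.
FLAG_BITS = {"UNMAP": 0x4, "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400}

# Default exclusion mask: UNMAP,SECONDARY,QCFAIL,DUP
DEFAULT_EXCL = sum(FLAG_BITS.values())


def passes_flag_filters(flag: int, incl: int = 0, excl: int = DEFAULT_EXCL) -> bool:
    """Apply the include and exclude masks to one alignment FLAG value."""
    if incl and not (flag & incl):
        return False  # none of the required bits is set
    if flag & excl:
        return False  # at least one filtered bit is set
    return True


print(DEFAULT_EXCL)                          # 1796
print(passes_flag_filters(99))               # True
print(passes_flag_filters(99 | 0x400))       # False (marked as duplicate)
print(passes_flag_filters(83, incl=0x10))    # True (reverse-strand bit set)
```
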
       -x, --ignore-overlaps
                 Disable read-pair overlap detection.

       -X        Include customized index file as a part of arguments.  See
                 the EXAMPLES section for sample usage.

       Output Options:

       -o, --output FILE
                 Write pileup output to FILE, rather than the default of
                 standard output.

       -O, --output-BP
                 Output base positions on reads in the orientation listed in
                 the SAM file (left to right).

       --output-BP-5
                 Output base positions on reads in their original 5' to 3'
                 orientation.

       -s, --output-MQ
                 Output mapping qualities encoded as ASCII characters.

       --output-QNAME
                 Output an extra column containing comma-separated read names.
                 Equivalent to --output-extra QNAME.

       --output-extra STR
                 Output extra columns containing comma-separated values of
                 read fields or read tags.  The names of the selected fields
                 have to be provided as they are described in the SAM
                 Specification (page 6) and will be output by the mpileup
                 command in the same order as in the document (i.e. QNAME,
                 FLAG, RNAME, ...).  The names are case sensitive.  Currently,
                 only the following fields are supported:

                 QNAME, FLAG, RNAME, POS, MAPQ, RNEXT, PNEXT

                 Anything that is not on this list is treated as a potential
                 tag, although only two-character tags are accepted.  In the
                 mpileup output, tag columns are displayed in the order they
                 were provided by the user on the command line.  Field and tag
                 names have to be provided in a comma-separated string to the
                 mpileup command.  Tags of type B (byte array) are not
                 supported.  An absent or unsupported tag will be listed as
                 "*".  E.g.

                 samtools mpileup --output-extra FLAG,QNAME,RG,NM in.bam

                 will display four extra columns in the mpileup output, the
                 first being a list of comma-separated read names, followed by
                 a list of flag values, a list of RG tag values and a list of
                 NM tag values.  Field values are always displayed before tag
                 values.

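                 The column ordering rule for --output-extra can be
                 sketched as follows.  This Python example only models the
                 ordering described above (fields first, in SAM order,
                 then tags in the order given); the function name
                 extra_column_order is hypothetical, and repeated names
                 are not considered.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "RNEXT", "PNEXT"]


def extra_column_order(spec: str) -> list[str]:
    """Return the extra columns in the order they would be displayed."""
    names = spec.split(",")
    # Supported fields are emitted in SAM order, whatever order was requested.
    fields = [name for name in SAM_FIELDS if name in names]
    tags = []
    for name in names:
        if name in SAM_FIELDS:
            continue
        if len(name) != 2:
            raise ValueError(f"only two-character tags are accepted: {name!r}")
        tags.append(name)  # tags keep the order given on the command line
    return fields + tags


print(extra_column_order("FLAG,QNAME,RG,NM"))   # ['QNAME', 'FLAG', 'RG', 'NM']
```
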
       --output-sep CHAR
                 Specify a different separator character for tag value lists,
                 when those values might contain one or more commas (,), which
                 is the default list separator.  This option only affects
                 columns for two-letter tags like NM; standard fields like FLAG
                 or QNAME will always be separated by commas.

       --output-empty CHAR
                 Specify a different 'no value' character for tag list entries
                 corresponding to reads that don't have a tag requested with
                 the --output-extra option.  The default is *.

                 This option only applies to rows that have at least one read
                 in the pileup, and only to columns for two-letter tags.
                 Columns for empty rows will always be printed as *.

       -M, --output-mods
                 Adds base modification markup into the sequence column.  This
                 uses the Mm and Ml auxiliary tags (or their uppercase
                 equivalents).  Any base in the sequence output may be followed
                 by a series of strand code quality strings enclosed within
                 square brackets, where strand is "+" or "-", code is a single
                 character (such as "m" or "h") or a ChEBI numeric in
                 parentheses, and quality is an optional numeric quality value.
                 For example, a "C" base with possible 5mC and 5hmC base
                 modifications may be reported as "C[+m179+h40]".

                 Quality values are from 0 to 255 inclusive, representing a
                 linear scale of probability 0.0 to 1.0 in increments of
                 1/256.  If quality values are absent (no Ml tag) these are
                 omitted, giving an example string of "C[+m+h]".

                 Note that base modifications may be identified on the reverse
                 strand, either due to the native ability for this detection
                 by the sequencing instrument or by the sequence subsequently
                 being reverse complemented.  This can lead to modification
                 codes, such as "m" meaning 5mC, being shown for their
                 complementary bases, such as "G[-m50]".

                 When --output-mods is selected, base modifications can appear
                 on any base in the sequence output, including during
                 insertions.  This may make parsing the string more complex, so
                 also see the --no-output-ins-mods and --no-output-ins options
                 to simplify this process.

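                 The bracketed markup can be picked apart with a small
                 regular expression.  The Python sketch below parses the
                 text found between the square brackets.  It assumes that
                 single-character codes are letters, and it reads the
                 1/256 scale as mapping a quality q to the probability
                 range q/256 to (q+1)/256, which is an interpretation
                 rather than something stated on this page.  The function
                 names are hypothetical.

```python
import re

# strand, then a one-letter code or a ChEBI number in parentheses,
# then an optional decimal quality value.
_MOD = re.compile(r"([+-])([A-Za-z]|\([0-9]+\))([0-9]*)")


def parse_mods(markup: str) -> list[tuple]:
    """Parse the text inside [...] into (strand, code, quality) tuples."""
    mods = []
    pos = 0
    while pos < len(markup):
        m = _MOD.match(markup, pos)
        if m is None:
            raise ValueError(f"unexpected markup at offset {pos}: {markup!r}")
        strand, code, quality = m.groups()
        if code.startswith("("):
            code = int(code[1:-1])  # ChEBI numeric code
        mods.append((strand, code, int(quality) if quality else None))
        pos = m.end()
    return mods


def quality_to_probability_range(q: int) -> tuple[float, float]:
    """Assumed reading of the linear scale: q covers [q/256, (q+1)/256)."""
    return q / 256, (q + 1) / 256


print(parse_mods("+m179+h40"))   # [('+', 'm', 179), ('+', 'h', 40)]
print(parse_mods("+m+h"))        # [('+', 'm', None), ('+', 'h', None)]
```
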
       --no-output-ins
                 Do not output the inserted bases in the sequence column.
                 Usually this is reported as "+length sequence", but with this
                 option it becomes simply "+length".  For example, an insertion
                 of AGT in a pileup column changes from "CCC+3AGTGCC" to
                 "CCC+3GCC".

                 Specifying this option twice also removes the "+length"
                 portion, changing the example above to "CCCGCC".

                 The purpose of this change is to simplify parsing using basic
                 regular expressions, which traditionally cannot perform
                 counting operations.  It is particularly beneficial when used
                 in conjunction with --output-mods, as the syntax of the
                 inserted sequence is adjusted to also report possible base
                 modifications, but see also --no-output-ins-mods as an
                 alternative.

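                 The parsing difference can be illustrated in Python.  In
                 the default output the insertion length must be read and
                 counted off, whereas with --no-output-ins a single
                 substitution is enough.  The function names are
                 hypothetical, and the sketch assumes the input contains
                 no "^" markers, whose following quality character could
                 itself be "+".

```python
import re

_LENGTH = re.compile(r"[0-9]+")


def strip_insertions_default(bases: str) -> str:
    """Remove "+<length><sequence>" from default output (needs counting)."""
    out = []
    i = 0
    while i < len(bases):
        if bases[i] == "+":
            m = _LENGTH.match(bases, i + 1)
            if m is None:
                raise ValueError(f"insertion without a length at offset {i}")
            i = m.end() + int(m.group())  # skip the counted inserted bases
        else:
            out.append(bases[i])
            i += 1
    return "".join(out)


def strip_insertions_no_output_ins(bases: str) -> str:
    """Remove "+<length>" from --no-output-ins output (plain regex)."""
    return re.sub(r"\+[0-9]+", "", bases)


print(strip_insertions_default("CCC+3AGTGCC"))      # CCCGCC
print(strip_insertions_no_output_ins("CCC+3GCC"))   # CCCGCC
```
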
       --no-output-ins-mods
                 Outputs the inserted bases in the sequence, but excluding any
                 base modifications.  This only affects output when
                 --output-mods is also used.

       --no-output-del
                 Do not output deleted reference bases in the sequence column.
                 Normally this is reported as "-length sequence", but with
                 this option it becomes simply "-length".  For example, a
                 deletion of 3 unknown bases (due to no reference being
                 specified) would normally be seen in a column as e.g.
                 "CCC-3NNNGCC", but will be reported as "CCC-3GCC" with this
                 option.

                 Specifying this option twice also removes the "-length"
                 portion, changing the example above to "CCCGCC".

                 The purpose of this change is to simplify parsing using basic
                 regular expressions, which traditionally cannot perform
                 counting operations.  See also --no-output-ins.

       --no-output-ends
                 Removes the “^” (with mapping quality) and “$” markup from
                 the sequence column.

       --reverse-del
                 Mark the deletions on the reverse strand with the character
                 #, instead of the usual *.

       -a        Output all positions, including those with zero depth.

       -a -a, -aa
                 Output absolutely all positions, including unused reference
                 sequences.  Note that when used in conjunction with a BED
                 file the -a option may sometimes operate as if -aa was
                 specified if the reference sequence has coverage outside of
                 the region specified in the BED file.

       BAQ (Base Alignment Quality)

       BAQ is the Phred-scaled probability of a read base being misaligned.
       It greatly helps to reduce false SNPs caused by misalignments.  BAQ is
       calculated using the probabilistic realignment method described in the
       paper “Improving SNP discovery by base alignment quality”, Heng Li,
       Bioinformatics, Volume 27, Issue 8
       <https://doi.org/10.1093/bioinformatics/btr076>.

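       For readers unfamiliar with the Phred scale, the Python sketch below
       converts between a Phred-scaled value and the probability it stands
       for, using the standard definition Q = -10 * log10(P).  It does not
       compute BAQ itself, and the function names are hypothetical.

```python
import math


def phred_to_probability(q: float) -> float:
    """Probability represented by a Phred-scaled value Q."""
    return 10 ** (-q / 10)


def probability_to_phred(p: float) -> float:
    """Phred-scaled value for a probability 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


print(f"{phred_to_probability(30):.4f}")   # 0.0010
print(f"{probability_to_phred(0.01):.1f}") # 20.0
```
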
       BAQ is turned on when a reference file is supplied using the -f option.
       To disable it, use the -B option.

       It is possible to store precalculated BAQ values in a SAM BQ:Z tag.
       Samtools mpileup will use the precalculated values if it finds them.
       The -E option can be used to make it ignore the contents of the BQ:Z
       tag and force it to recalculate the BAQ scores by making a new
       alignment.

AUTHOR

       Written by Heng Li from the Sanger Institute.
