samtools(1)                  Bioinformatics tools                  samtools(1)

NAME

       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS

       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam

       samtools index aln.sorted.bam

       samtools idxstats aln.sorted.bam

       samtools flagstat aln.sorted.bam

       samtools stats aln.sorted.bam

       samtools bedcov aln.sorted.bam

       samtools depth aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools fqidx ref.fastq

       samtools tview aln.sorted.bam ref.fasta

       samtools split merged.bam

       samtools quickcheck in1.bam in2.cram

       samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta

       samtools fixmate in.namesorted.sam out.bam

       samtools mpileup -C50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

       samtools flags PAIRED,UNMAP,MUNMAP

       samtools fastq input.bam > output.fastq

       samtools fasta input.bam > output.fasta

       samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o output.bam input.bam

       samtools collate -o aln.name_collated.bam aln.sorted.bam

       samtools depad input.bam

       samtools markdup in.algnsorted.bam out.bam

DESCRIPTION

       Samtools is a set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format, does sorting, merging and indexing, and allows reads in any
       region to be retrieved swiftly.

       Samtools is designed to work on a stream. It regards an input file `-'
       as the standard input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always outputs warning and error messages to the standard
       error output (stderr).

       Samtools is also able to open a BAM (not SAM) file on a remote FTP or
       HTTP server if the BAM file name starts with `ftp://' or `http://'.
       Samtools checks the current working directory for the index file and
       downloads the index if it is absent. Samtools does not retrieve the
       entire alignment file unless it is asked to do so.

COMMANDS AND OPTIONS

       view      samtools view [options] in.sam|in.bam|in.cram [region...]

                 With no options or regions specified, prints all alignments
                 in the specified input alignment file (in SAM, BAM, or CRAM
                 format) to standard output in SAM format (with no header).

                 You may specify one or more space-separated region
                 specifications after the input filename to restrict output to
                 only those alignments which overlap the specified region(s).
                 Use of region specifications requires a coordinate-sorted and
                 indexed input file (in BAM or CRAM format).

                 The -b, -C, -1, -u, -h, -H, and -c options change the output
                 format from the default of headerless SAM, and the -o and -U
                 options set the output file name(s).

                 The -t and -T options provide additional reference data. One
                 of these two options is required when SAM input does not
                 contain @SQ headers, and the -T option is required whenever
                 writing CRAM output.

                 The -L, -M, -r, -R, -s, -q, -l, -m, -f, -F, and -G options
                 filter the alignments that will be included in the output to
                 only those alignments that match certain criteria.

                 The -x and -B options modify the data which is contained in
                 each alignment.

                 Finally, the -@ option can be used to allocate additional
                 threads to be used for compression, and the -? option
                 requests a long help message.


       REGIONS:
                 Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and
                 all position coordinates are 1-based.

                 Important note: when multiple regions are given, some
                 alignments may be output multiple times if they overlap more
                 than one of the specified regions.

                 Examples of region specifications:

                 chr1      Output all alignments mapped to the reference
                           sequence named `chr1' (i.e. @SQ SN:chr1).

                 chr2:1000000
                           The region on chr2 beginning at base position
                           1,000,000 and ending at the end of the chromosome.

                 chr3:1000-2000
                           The 1001bp region on chr3 beginning at base
                           position 1,000 and ending at base position 2,000
                           (including both end positions).

                 '*'       Output the unmapped reads at the end of the file.
                           (This does not include any unmapped reads placed on
                           a reference sequence alongside their mapped mates.)

                 .         Output all alignments. (Mostly unnecessary as not
                           specifying a region at all has the same effect.)
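
                 The region syntax above can be illustrated with a small
                 Python sketch. It is not samtools' own parser: the names
                 parse_region and overlaps are hypothetical, and the sketch
                 assumes that RNAME contains no `:' and that commas appear
                 only as digit separators. The special forms '*' and . are
                 not treated specially here.

```python
import re

# Hypothetical helpers for illustration; not how samtools itself parses regions.
_REGION = re.compile(r"([^:]+)(?::([0-9,]+)(?:-([0-9,]+))?)?")


def parse_region(spec):
    """Split RNAME[:STARTPOS[-ENDPOS]] into (rname, start, end).

    Coordinates are 1-based and inclusive; a missing one is returned as None.
    Assumption: RNAME contains no ':' and commas are only digit separators.
    """
    match = _REGION.fullmatch(spec)
    if match is None:
        raise ValueError(f"cannot parse region {spec!r}")
    rname, start, end = match.groups()

    def to_int(text):
        return int(text.replace(",", "")) if text else None

    return rname, to_int(start), to_int(end)


def overlaps(aln_start, aln_end, start, end):
    """True if the 1-based inclusive span [aln_start, aln_end] touches the region."""
    if start is not None and aln_end < start:
        return False
    if end is not None and aln_start > end:
        return False
    return True


rname, start, end = parse_region("chr3:1000-2000")
print(rname, start, end)   # chr3 1000 2000
print(end - start + 1)     # 1001, because both end positions are included
```

                 An alignment spanning positions 1990-2010 overlaps both
                 chr3:1000-2000 and chr3:2005-3000, which is why it may be
                 output once for each region when both are given.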

       OPTIONS:

                 -b        Output in the BAM format.

                 -C        Output in the CRAM format (requires -T).

                 -1        Enable fast BAM compression (implies -b).

                 -u        Output uncompressed BAM. This option saves time
                           spent on compression/decompression and is thus
                           preferred when the output is piped to another
                           samtools command.

                 -h        Include the header in the output.

                 -H        Output the header only.

                 -c        Instead of printing the alignments, only count them
                           and print the total number. All filter options,
                           such as -f, -F, and -q, are taken into account.

                 -?        Output long help and exit immediately.

                 -o FILE   Output to FILE [stdout].

                 -U FILE   Write alignments that are not selected by the
                           various filter options to FILE. When this option is
                           used, all alignments (or all alignments
                           intersecting the regions specified) are written to
                           either the output file or this file, but never both.

                 -t FILE   A tab-delimited FILE. Each line must contain the
                           reference name in the first column and the length
                           of the reference in the second column, with one
                           line for each distinct reference. Any additional
                           fields beyond the second column are ignored. This
                           file also defines the order of the reference
                           sequences in sorting. If you run: `samtools faidx
                           <ref.fa>', the resulting index file <ref.fa>.fai
                           can be used as this FILE.

                 -T FILE   A FASTA format reference FILE, optionally
                           compressed by bgzip and ideally indexed by samtools
                           faidx. If an index is not present, one will be
                           generated for you.

                 -L FILE   Only output alignments overlapping the input BED
                           FILE [null].

                 -M        Use the multi-region iterator on the union of the
                           BED file and command-line region arguments. This
                           avoids re-reading the same regions of files so can
                           sometimes be much faster. Note this also removes
                           duplicate sequences. Without this a sequence that
                           overlaps multiple regions specified on the command
                           line will be reported multiple times.

                 -r STR    Output alignments in read group STR [null]. Note
                           that records with no RG tag will also be output
                           when using this option. This behaviour may change
                           in a future release.

                 -R FILE   Output alignments in read groups listed in FILE
                           [null]. Note that records with no RG tag will also
                           be output when using this option. This behaviour
                           may change in a future release.

                 -q INT    Skip alignments with MAPQ smaller than INT [0].

                 -l STR    Only output alignments in library STR [null].

                 -m INT    Only output alignments with number of CIGAR bases
                           consuming query sequence ≥ INT [0].

                 -f INT    Only output alignments with all bits set in INT
                           present in the FLAG field. INT can be specified in
                           hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
                           in octal by beginning with `0' (i.e. /^0[0-7]+/)
                           [0].

                 -F INT    Do not output alignments with any bits set in INT
                           present in the FLAG field. INT can be specified in
                           hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
                           in octal by beginning with `0' (i.e. /^0[0-7]+/)
                           [0].

                 -G INT    Do not output alignments with all bits set in INT
                           present in the FLAG field. This is the opposite of
                           -f: the alignments kept by -f12 and those kept by
                           -G12 are complementary and together make up the
                           whole input. INT can be specified in hex by
                           beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in
                           octal by beginning with `0' (i.e. /^0[0-7]+/) [0].

                 -x STR    Read tag to exclude from output (repeatable) [null].

                 -B        Collapse the backward CIGAR operation.

                 -s FLOAT  Output only a proportion of the input alignments.
                           This subsampling acts in the same way on all of the
                           alignment records in the same template or read
                           pair, so it never keeps a read but not its mate.

                           The integer and fractional parts of the -s INT.FRAC
                           option are used separately: the part after the
                           decimal point sets the fraction of templates/pairs
                           to be kept, while the integer part is used as a
                           seed that influences which subset of reads is kept.

                           When subsampling data that has previously been
                           subsampled, be sure to use a different seed value
                           from those used previously; otherwise more reads
                           will be retained than expected.

                 -@ INT    Number of BAM compression threads to use in
                           addition to main thread [0].

                 -S        Ignored for compatibility with previous samtools
                           versions. Previously this option was required if
                           input was in SAM format, but now the correct format
                           is automatically detected by examining the first
                           few characters of input.
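
                 The FLAG filters -f, -F and -G described above can be
                 sketched in Python as follows. This is an illustration of
                 the documented semantics only, not samtools source code; the
                 function names parse_flag_int and keep_alignment are
                 hypothetical.

```python
def parse_flag_int(text):
    """Parse INT as decimal, hex (leading 0x) or octal (leading 0)."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered, 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered, 8)
    return int(lowered, 10)


def keep_alignment(flag, f=0, F=0, G=0):
    """Apply the -f, -F and -G tests to one FLAG value."""
    if (flag & f) != f:          # -f: every bit of f must be set
        return False
    if flag & F:                 # -F: no bit of F may be set
        return False
    if G and (flag & G) == G:    # -G: reject when all bits of G are set
        return False
    return True


# 12 is 0x4 | 0x8 (read unmapped, mate unmapped).
print(parse_flag_int("0xC"), parse_flag_int("014"), parse_flag_int("12"))  # 12 12 12
print(keep_alignment(77, f=12))   # True: 77 has both 0x4 and 0x8 set
print(keep_alignment(77, G=12))   # False: -G12 rejects exactly those reads
```

                 For every FLAG value, exactly one of -f12 and -G12 keeps the
                 alignment, so the two outputs partition the input.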

       sort      samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format]
                 [-n] [-t tag] [-T tmpprefix] [-@ threads]
                 [in.sam|in.bam|in.cram]

                 Sort alignments by leftmost coordinates, or by read name when
                 -n is used. An appropriate @HD-SO sort order header tag will
                 be added or an existing one updated if necessary.

                 The sorted output is written to standard output by default,
                 or to the specified file (out.bam) when -o is used. This
                 command will also create temporary files tmpprefix.%d.bam as
                 needed when the entire alignment data cannot fit into memory
                 (as controlled via the -m option).

                 Options:

                 -l INT     Set the desired compression level for the final
                            output file, ranging from 0 (uncompressed) or 1
                            (fastest but minimal compression) to 9 (best
                            compression but slowest to write), similarly to
                            gzip(1)'s compression level setting.

                            If -l is not used, the default compression level
                            will apply.

                 -m INT     Approximately the maximum required memory per
                            thread, specified either in bytes or with a K, M,
                            or G suffix. [768 MiB]

                            To prevent sort from creating a huge number of
                            temporary files, it enforces a minimum value of 1M
                            for this setting.

                 -n         Sort by read names (i.e., the QNAME field) rather
                            than by chromosomal coordinates.

                 -t TAG     Sort first by the value in the alignment tag TAG,
                            then by position or name (if also using -n).

                 -o FILE    Write the final sorted output to FILE, rather
                            than to standard output.

                 -O FORMAT  Write the final output as sam, bam, or cram.

                            By default, samtools tries to select a format
                            based on the -o filename extension; if output is
                            to standard output or no format can be deduced,
                            bam is selected.

                 -T PREFIX  Write temporary files to PREFIX.nnnn.bam, or if
                            the specified PREFIX is an existing directory, to
                            PREFIX/samtools.mmm.mmm.tmp.nnnn.bam, where mmm is
                            unique to this invocation of the sort command.

                            By default, any temporary files are written
                            alongside the output file, as out.bam.tmp.nnnn.bam,
                            or if output is to standard output, in the current
                            directory as samtools.mmm.mmm.tmp.nnnn.bam.

                 -@ INT     Set number of sorting and compression threads. By
                            default, operation is single-threaded.
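
                 As a rough illustration of how a -m value might be
                 interpreted, the Python sketch below converts it to bytes.
                 Two points are assumptions rather than statements from this
                 page: that K, M and G are 1024-based multiples, and that a
                 value below the 1M minimum is rejected (the text above only
                 says the minimum is enforced). The name parse_max_mem is
                 hypothetical.

```python
MIN_MEM = 1 << 20          # the 1M minimum mentioned above
DEFAULT_MEM = 768 << 20    # the [768 MiB] default

# Assumption: K, M and G are binary (1024-based) multiples.
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_max_mem(text=None):
    """Return the per-thread memory limit in bytes for a -m style value."""
    if text is None:
        return DEFAULT_MEM
    suffix = text[-1:].upper()
    if suffix in _UNITS:
        value = int(text[:-1]) * _UNITS[suffix]
    else:
        value = int(text)
    if value < MIN_MEM:
        # Assumption: this sketch rejects small values outright.
        raise ValueError(f"-m {text} is below the 1M minimum")
    return value


print(parse_max_mem())        # 805306368
print(parse_max_mem("2G"))    # 2147483648
```

                 Because the limit applies per thread, the total memory used
                 grows with the -@ setting.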

                 Ordering Rules

                 The following rules are used for ordering records.

                 If option -t is in use, records are first sorted by the value
                 of the given alignment tag, and then by position or name (if
                 using -n). For example, “-t RG” will make read group the
                 primary sort key. The rules for ordering by tag are:

                 ·   Records that do not have the tag are sorted before ones
                     that do.

                 ·   If the types of the tags are different, they will be
                     sorted so that single character tags (type A) come before
                     array tags (type B), then string tags (types H and Z),
                     then numeric tags (types f and i).

                 ·   Numeric tags (types f and i) are compared by value. Note
                     that comparisons of floating-point values are subject to
                     issues of rounding and precision.

                 ·   String tags (types H and Z) are compared based on the
                     binary contents of the tag using the C strcmp(3)
                     function.

                 ·   Character tags (type A) are compared by binary character
                     value.

                 ·   No attempt is made to compare tags of other types;
                     notably, type B array values will not be compared.

                 When the -n option is present, records are sorted by name.
                 Names are compared so as to give a “natural” ordering, i.e.
                 sections consisting of digits are compared numerically while
                 all other sections are compared based on their binary
                 representation. This means “a1” will come before “b1” and
                 “a9” will come before “a10”. Records with the same name will
                 be ordered according to the values of the READ1 and READ2
                 flags (see flags).
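
                 The “natural” name ordering can be approximated in Python
                 with the sort key below. It is a sketch, not the comparison
                 routine samtools uses. It makes two assumptions: where a
                 digit section meets a non-digit section at the same
                 position, the digit section sorts first, and READ1 sorts
                 before READ2 for equal names. The names natural_key and
                 name_sort_key are hypothetical.

```python
import re

_SECTION = re.compile(r"([0-9]+)|([^0-9]+)")


def natural_key(name):
    """Digit sections compare numerically, all other sections as text."""
    key = []
    for match in _SECTION.finditer(name):
        digits, text = match.groups()
        if digits is not None:
            key.append((0, int(digits), ""))   # assumption: digits sort first
        else:
            key.append((1, 0, text))
    return key


def name_sort_key(name, flag):
    """Order by name, then by the READ1 (0x40) and READ2 (0x80) bits."""
    return natural_key(name), flag & 0xC0      # assumption: READ1 before READ2


print(sorted(["a10", "a9", "b1", "a1"], key=natural_key))
# ['a1', 'a9', 'a10', 'b1']
```

                 A plain string sort would instead place “a10” before “a9”,
                 which is the difference the text above describes.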

                 When the -n option is not present, reads are sorted by
                 reference (according to the order of the @SQ header records),
                 then by position in the reference, and then by the REVERSE
                 flag.

                 Note

                 Historically samtools sort also accepted a less flexible way
                 of specifying the final and temporary output filenames:

                        samtools sort [-f] [-o] in.bam out.prefix

                 This has now been removed. The previous out.prefix argument
                 (and -f option, if any) should be changed to an appropriate
                 combination of -T PREFIX and -o FILE. The previous -o option
                 should be removed, as output defaults to standard output.


       index     samtools index [-bc] [-m INT] aln.bam|aln.cram [out.index]

                 Index a coordinate-sorted BAM or CRAM file for fast random
                 access. (Note that this does not work with SAM files even if
                 they are bgzip compressed; to index such files, use tabix(1)
                 instead.)

                 This index is needed when region arguments are used to limit
                 samtools view and similar commands to particular regions of
                 interest.

                 If an output filename is given, the index file will be
                 written to out.index. Otherwise, for a CRAM file aln.cram,
                 index file aln.cram.crai will be created; for a BAM file
                 aln.bam, either aln.bam.bai or aln.bam.csi will be created,
                 depending on the index format selected.

                 Options:

                 -b      Create a BAI index. This is currently the default
                         when no format options are used.

                 -c      Create a CSI index. By default, the minimum interval
                         size for the index is 2^14, which is the same as the
                         fixed value used by the BAI format.

                 -m INT  Create a CSI index, with a minimum interval size of
                         2^INT.


       idxstats  samtools idxstats in.sam|in.bam|in.cram

                 Retrieve and print stats in the index file corresponding to
                 the input file. Before calling idxstats, the input BAM file
                 should be indexed by samtools index.

                 If run on a SAM or CRAM file or an unindexed BAM file, this
                 command will still produce the same summary statistics, but
                 does so by reading through the entire file. This is far
                 slower than using the BAM indices.

                 The output is TAB-delimited with each line consisting of
                 reference sequence name, sequence length, # mapped reads and
                 # unmapped reads. It is written to stdout.
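
                 The four-column output described above is easy to consume
                 from a script. The Python sketch below parses it into
                 tuples; the function name parse_idxstats and the sample
                 values are made up for illustration.

```python
def parse_idxstats(text):
    """Parse idxstats output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


# Made-up sample in the documented column order.
sample = "chr1\t1000\t50\t2\nchr2\t500\t10\t0\n"
rows = parse_idxstats(sample)
print(sum(mapped for _, _, mapped, _ in rows))   # 60
```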


       flagstat  samtools flagstat in.sam|in.bam|in.cram

                 Does a full pass through the input file to calculate and
                 print statistics to stdout.

                 Provides counts for each of 13 categories based primarily on
                 bit flags in the FLAG field. Each category in the output is
                 broken down into QC pass and QC fail, which is presented as
                 "#PASS + #FAIL" followed by a description of the category.

                 The first row of output gives the total number of reads that
                 are QC pass and fail (according to flag bit 0x200). For
                 example:

                   122 + 28 in total (QC-passed reads + QC-failed reads)

                 This indicates that there are a total of 150 reads in the
                 input file, 122 of which are marked as QC pass and 28 of
                 which are marked as "not passing quality controls".

                 Following this, additional categories are given for reads
                 which are:

                         secondary
                                0x100 bit set

                         supplementary
                                0x800 bit set

                         duplicates
                                0x400 bit set

                         mapped 0x4 bit not set

                         paired in sequencing
                                0x1 bit set

                         read1  both 0x1 and 0x40 bits set

                         read2  both 0x1 and 0x80 bits set

                         properly paired
                                both 0x1 and 0x2 bits set and 0x4 bit not set

                         with itself and mate mapped
                                0x1 bit set and neither 0x4 nor 0x8 bits set

                         singletons
                                both 0x1 and 0x8 bits set and bit 0x4 not set

                 And finally, two rows are given that additionally filter on
                 the reference name (RNAME), mate reference name (MRNM), and
                 mapping quality (MAPQ) fields:

                         with mate mapped to a different chr
                                0x1 bit set and neither 0x4 nor 0x8 bits set
                                and MRNM not equal to RNAME

                         with mate mapped to a different chr (mapQ>=5)
                                0x1 bit set and neither 0x4 nor 0x8 bits set
                                and MRNM not equal to RNAME and MAPQ >= 5
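
                 The category tests listed above translate directly into bit
                 operations. The Python sketch below applies them literally
                 to one record and tallies QC-pass and QC-fail counts. It is
                 an illustration of the table, not the flagstat
                 implementation, and it assumes MRNM has already been
                 resolved to a reference name (not `='). The function names
                 are hypothetical.

```python
from collections import Counter


def flagstat_categories(flag, rname=None, mrnm=None, mapq=0):
    """Return the set of categories from the table above that a record hits."""
    paired = bool(flag & 0x1)
    mapped = not (flag & 0x4)
    mate_mapped = not (flag & 0x8)
    both_mapped = paired and mapped and mate_mapped
    diff_chr = both_mapped and mrnm != rname  # assumes mrnm is a real name
    tests = [
        ("secondary", bool(flag & 0x100)),
        ("supplementary", bool(flag & 0x800)),
        ("duplicates", bool(flag & 0x400)),
        ("mapped", mapped),
        ("paired in sequencing", paired),
        ("read1", paired and bool(flag & 0x40)),
        ("read2", paired and bool(flag & 0x80)),
        ("properly paired", paired and bool(flag & 0x2) and mapped),
        ("with itself and mate mapped", both_mapped),
        ("singletons", paired and not mate_mapped and mapped),
        ("with mate mapped to a different chr", diff_chr),
        ("with mate mapped to a different chr (mapQ>=5)",
         diff_chr and mapq >= 5),
    ]
    return {name for name, hit in tests if hit}


def flagstat(records):
    """Tally (flag, rname, mrnm, mapq) records into QC-pass and QC-fail counts."""
    passed, failed = Counter(), Counter()
    for flag, rname, mrnm, mapq in records:
        target = failed if flag & 0x200 else passed   # 0x200: QC fail
        target["in total"] += 1
        for name in flagstat_categories(flag, rname, mrnm, mapq):
            target[name] += 1
    return passed, failed


passed, failed = flagstat([(99, "chr1", "chr1", 60), (99 | 0x200, "chr1", "chr2", 3)])
print(passed["in total"], "+", failed["in total"], "in total")   # 1 + 1 in total
```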


       stats     samtools stats [options] in.sam|in.bam|in.cram [region...]

                 samtools stats collects statistics from BAM files and outputs
                 them in a text format. The output can be visualized
                 graphically using plot-bamstats.

                 Options:

                 -c, --coverage MIN,MAX,STEP
                         Set coverage distribution to the specified range
                         (MIN, MAX, STEP all given as integers) [1,1000,1]

                 -d, --remove-dups
                         Exclude from statistics reads marked as duplicates

                 -f, --required-flag STR|INT
                         Required flag, 0 for unset. See also `samtools flags`
                         [0]

                 -F, --filtering-flag STR|INT
                         Filtering flag, 0 for unset. See also `samtools
                         flags` [0]

                 --GC-depth FLOAT
                         The size of GC-depth bins (decreasing bin size
                         increases memory requirement) [2e4]

                 -h, --help
                         This help message

                 -i, --insert-size INT
                         Maximum insert size [8000]

                 -I, --id STR
                         Include only listed read group or sample name []

                 -l, --read-length INT
                         Include in the statistics only reads with the given
                         read length []

                 -m, --most-inserts FLOAT
                         Report only the main part of inserts [0.99]

                 -P, --split-prefix STR
                         A path or string prefix to prepend to filenames
                         output when creating categorised statistics files
                         with -S/--split. [input filename]

                 -q, --trim-quality INT
                         The BWA trimming parameter [0]

                 -r, --ref-seq FILE
                         Reference sequence (required for GC-depth and
                         mismatches-per-cycle calculation). []

                 -S, --split TAG
                         In addition to the complete statistics, also output
                         categorised statistics based on the tagged field TAG
                         (e.g., use --split RG to split into read groups).

                         Categorised statistics are written to files named
                         <prefix>_<value>.bamstat, where prefix is as given by
                         --split-prefix (or the input filename by default) and
                         value has been encountered as the specified tagged
                         field's value in one or more alignment records.

                 -t, --target-regions FILE
                         Do stats in these regions only. Tab-delimited file
                         chr,from,to, 1-based, inclusive. []

                 -x, --sparse
                         Suppress outputting IS rows where there are no
                         insertions.


       bedcov    samtools bedcov [options] region.bed
                 in1.sam|in1.bam|in1.cram[...]

                 Reports the total read base count (i.e. the sum of per base
                 read depths) for each genomic region specified in the
                 supplied BED file. The regions are output as they appear in
                 the BED file and are 0-based. Counts for each alignment file
                 supplied are reported in separate columns.
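
                 The "sum of per base read depths" can be computed without
                 building a depth array, because each alignment contributes
                 the length of its overlap with the region. The Python sketch
                 below shows that arithmetic under simplifying assumptions:
                 regions and alignments are 0-based half-open intervals in
                 the usual BED convention, each alignment covers its span
                 contiguously, and no quality filtering is applied. The name
                 bedcov_counts is hypothetical.

```python
def bedcov_counts(regions, alignments):
    """Sum per-base depth over each (chrom, start, end) region.

    Both inputs use 0-based, half-open intervals (an assumption taken from
    the BED convention). Alignments are treated as contiguous spans.
    """
    results = []
    for chrom, start, end in regions:
        total = 0
        for aln_chrom, aln_start, aln_end in alignments:
            if aln_chrom == chrom:
                # Overlap length, or 0 when the intervals do not intersect.
                total += max(0, min(end, aln_end) - max(start, aln_start))
        results.append((chrom, start, end, total))
    return results


regions = [("chr1", 0, 10)]
alignments = [("chr1", 5, 15), ("chr1", 8, 9), ("chr2", 0, 10)]
print(bedcov_counts(regions, alignments))   # [('chr1', 0, 10, 6)]
```

                 Here positions 5 to 9 have depth 1 from the first alignment
                 and position 8 gains one more from the second, giving 6.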

                 Options:

                 -Q INT Only count reads with mapping quality greater than INT

                 -j     Do not include deletions (D) and ref skips (N) in
                        bedcov computation.


       depth     samtools depth [options] [in1.sam|in1.bam|in1.cram
                 [in2.sam|in2.bam|in2.cram] [...]]

                 Computes the depth at each position or region.

                 Options:

                 -a      Output all positions (including those with zero
                         depth)

                 -a -a, -aa
                         Output absolutely all positions, including unused
                         reference sequences. Note that when used in
                         conjunction with a BED file the -a option may
                         sometimes operate as if -aa was specified if the
                         reference sequence has coverage outside of the region
                         specified in the BED file.

                 -b FILE Compute depth at list of positions or regions in
                         specified BED FILE. []

                 -f FILE Use the BAM files specified in the FILE (a file of
                         filenames, one file per line) []

                 -l INT  Ignore reads shorter than INT

                 -m, -d INT
                         Truncate reported depth at a maximum of INT reads
                         [8000]. If 0, depth is set to the maximum integer
                         value, effectively removing any depth limit.

                 -q INT  Only count reads with base quality greater than INT

                 -Q INT  Only count reads with mapping quality greater than
                         INT

                 -r CHR:FROM-TO
                         Only report depth in specified region.


       merge     samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>]
                 <out.bam> <in1.bam> [<in2.bam> <in3.bam> ... <inN.bam>]

                 Merge multiple sorted alignment files, producing a single
                 sorted output file that contains all the input records and
                 maintains the existing sort order.
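
                 Conceptually this is a k-way merge of already sorted
                 streams. The Python sketch below shows the idea with
                 heapq.merge on small in-memory lists; it is not how samtools
                 is implemented, and the record layout (reference index,
                 position, name) and the function name merge_sorted are
                 assumptions made for the example.

```python
import heapq


def coordinate_key(record):
    """Sort key for (ref_index, pos, name) records: reference, then position."""
    ref_index, pos, _name = record
    return ref_index, pos


def merge_sorted(inputs, key=coordinate_key):
    """Merge several lists that are each already sorted by key."""
    return list(heapq.merge(*inputs, key=key))


in1 = [(0, 100, "a"), (1, 5, "c")]
in2 = [(0, 50, "b"), (1, 1, "d")]
print(merge_sorted([in1, in2]))
# [(0, 50, 'b'), (0, 100, 'a'), (1, 1, 'd'), (1, 5, 'c')]
```

                 A merge of this kind only reads the head of each input, so
                 it yields sorted output only when every input is itself
                 sorted by the same key.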
635
636                 If -h is specified the @SQ headers of  input  files  will  be
637                 merged  into  the  specified  header,  otherwise they will be
638                 merged into a composite header created from the  input  head‐
639                 ers.   If  in the process of merging @SQ lines for coordinate
640                 sorted input files, a conflict arises as to  the  order  (for
641                 example  input1.bam  has  @SQ  for  a,b,c  and input2.bam has
642                 b,a,c) then the resulting output file will  need  to  be  re-
643                 sorted back into coordinate order.
644
645                 Unless the -c or -p flags are specified then when merging @RG
646                 and @PG records into the output header then any IDs found  to
647                 be  duplicates of existing IDs in the output header will have
648                 a suffix appended to them to differentiate them from  similar
649                 header  records from other files and the read records will be
650                 updated to reflect this.
651
652                 The ordering of the records in the input files must match the
653                 usage of the -n and -t command-line options.  If they do not,
654                 the output order will be undefined.  See sort for information
655                 about record ordering.
656
657                 OPTIONS:
658
659                 -1      Use zlib compression level 1 to compress the output.
660
661                 -b FILE List of input BAM files, one file per line.
662
663                 -f      Overwrite the output file if it already exists.
664
665                 -h FILE Use  the lines of FILE as `@' headers to be copied to
666                         out.bam, replacing any header lines that would other‐
667                         wise  be  copied  from in1.bam.  (FILE is actually in
668                         SAM format, though any alignment records it may  con‐
669                         tain are ignored.)
670
671                 -n      The  input alignments are sorted by read names rather
672                         than by chromosomal coordinates
673
674                 -t TAG  The input alignments have been sorted by the value of
675                         TAG,  then  by  either  position  or  name  (if -n is
676                         given).
677
678                 -R STR  Merge files in the specified region indicated by  STR
679                         [null]
680
681                 -r      Attach  an RG tag to each alignment. The tag value is
682                         inferred from file names.
683
684                 -u      Uncompressed BAM output
685
686                 -c      When several input files contain @RG headers with the
687                         same  ID,  emit  only one of them (namely, the header
688                         line from the first file we find that ID in)  to  the
689                         merged  output file.  Combining these similar headers
690                         is usually the right thing to do when the files being
691                         merged originated from the same file.
692
693                         Without  -c,  all  @RG  headers  appear in the output
694                         file, with random suffixes added to their  IDs  where
695                         necessary to differentiate them.
696
697                 -p      Similarly,  for  each  @PG  ID in the set of files to
698                         merge, use the @PG line of the  first  file  we  find
699                         that ID in rather than adding a suffix to differenti‐
700                         ate similar IDs.
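
                    A minimal sketch of the ID handling described for -c
                    and -p follows.  It is an assumption-laden illustration:
                    the function name, the "-" separator and the 8-character
                    hexadecimal suffix are hypothetical and are not taken
                    from samtools.

```python
import secrets


def merge_ids(per_file_ids, combine=False):
    """Return (output_ids, renames) for header IDs from several files.

    per_file_ids: one list of ID strings per input file.
    combine=True mimics the idea behind -c / -p: the first header seen
    for an ID is kept and later duplicates reuse it.
    """
    output_ids = []
    renames = []  # one {old_id: new_id} mapping per input file
    for ids in per_file_ids:
        mapping = {}
        for old_id in ids:
            if old_id not in output_ids:
                output_ids.append(old_id)
                mapping[old_id] = old_id
            elif combine:
                mapping[old_id] = old_id  # reuse the header already emitted
            else:
                new_id = old_id
                while new_id in output_ids:
                    # Hypothetical suffix scheme, for illustration only.
                    new_id = f"{old_id}-{secrets.token_hex(4)}"
                output_ids.append(new_id)
                mapping[old_id] = new_id
        renames.append(mapping)
    return output_ids, renames


ids_default, renames_default = merge_ids([["grp1"], ["grp1"]])
ids_combined, _ = merge_ids([["grp1"], ["grp1"]], combine=True)
```

                    The per-file rename mappings stand in for the step in
                    which read records are updated to use the new IDs.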
701
702
703       faidx     samtools faidx <ref.fasta> [region1 [...]]
704
705                 Index reference sequence in the FASTA format or extract  sub‐
706                 sequence  from  indexed  reference  sequence. If no region is
707                 specified,   faidx   will   index   the   file   and   create
708                 <ref.fasta>.fai  on  the  disk. If regions are specified, the
709                 subsequences will be retrieved and printed to stdout  in  the
710                 FASTA format.
711
712                 The input file can be compressed in the BGZF format.
713
714                 The  sequences  in  the  input file should all have different
715                 names.  If they do not, indexing will emit  a  warning  about
716                 duplicate  sequences  and  retrieval will only produce subse‐
717                 quences from the first sequence with the duplicated name.
718
719                 FASTQ files can be read and indexed by this command.  Without
720                 using --fastq any extracted subsequence will be in FASTA for‐
721                 mat.
722
723                 Options
724
725                 -o, --output FILE
726                         Write FASTA to file rather than to stdout.
727
728                 -n, --length INT
729                         Length of FASTA sequence line.  [60]
730
731                 -c, --continue
732                         Continue working if a non-existent region is
733                         requested.
734
735                 -r, --region-file FILE
736                         Read  regions from a file. Format is chr:from-to, one
737                         per line.
738
739                 -f, --fastq
740                         Read FASTQ files and output  extracted  sequences  in
741                         FASTQ format.  Same as using samtools fqidx.
742
743                 -i, --reverse-complement
744                         Output  the sequence as the reverse complement.  When
745                         this option is used, “/rc” will be  appended  to  the
746                         sequence  names.   To  turn  this  off  or change the
747                         string appended, use the --mark-strand option.
748
749                 --mark-strand TYPE
750                         Append strand indicator to sequence name.   TYPE  can
751                         be one of:
752
753                         rc     Append  '/rc' when writing the reverse comple‐
754                                ment.  This is the default.
755
756                         no     Do not append anything.
757
758                         sign   Append '(+)' for forward strand or  '(-)'  for
759                                reverse  complement.   This matches the output
760                                of “bedtools getfasta -s”.
761
762                         custom,<pos>,<neg>
763                                Append string <pos> to names when writing  the
764                                forward  strand  and  <neg>  when  writing the
765                                reverse strand.  Spaces are preserved,  so  it
766                                is  possible  to  move  the indicator into the
767                                comment  part  of  the  description  line   by
768                                including a leading space in the strings <pos>
769                                and <neg>.
770
771                 -h, --help
772                         Print help message and exit.
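
                    The naming rules for -i and --mark-strand can be
                    sketched as follows.  This is an illustration rather
                    than the samtools code; the helper names are
                    hypothetical, and only A, C, G, T and N are
                    complemented here (other characters pass through
                    unchanged).

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_suffix(mark, reverse):
    """Return the name suffix for a --mark-strand TYPE value."""
    if mark == "rc":
        return "/rc" if reverse else ""
    if mark == "no":
        return ""
    if mark == "sign":
        return "(-)" if reverse else "(+)"
    if mark.startswith("custom,"):
        # custom,<pos>,<neg>: spaces inside the strings are kept as given.
        _, pos, neg = mark.split(",", 2)
        return neg if reverse else pos
    raise ValueError(f"unknown mark-strand type: {mark}")


def fasta_record(name, seq, reverse=False, mark="rc"):
    """Format one FASTA record, optionally reverse-complemented."""
    if reverse:
        seq = reverse_complement(seq)
    return f">{name}{strand_suffix(mark, reverse)}\n{seq}"


print(fasta_record("chr1:1-4", "AACG", reverse=True))
```

                    With mark set to "custom, fwd, rev" the leading spaces
                    place the indicator in the comment part of the
                    description line.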
773
774
775       fqidx     samtools fqidx <ref.fastq> [region1 [...]]
776
777                 Index reference sequence in the FASTQ format or extract  sub‐
778                 sequence  from  indexed  reference  sequence. If no region is
779                 specified,   fqidx   will   index   the   file   and   create
780                 <ref.fastq>.fai  on  the  disk. If regions are specified, the
781                 subsequences will be retrieved and printed to stdout  in  the
782                 FASTQ format.
783
784                 The input file can be compressed in the BGZF format.
785
786                 The  sequences  in  the  input file should all have different
787                 names.  If they do not, indexing will emit  a  warning  about
788                 duplicate  sequences  and  retrieval will only produce subse‐
789                 quences from the first sequence with the duplicated name.
790
791                 samtools fqidx should only be used  on  fastq  files  with  a
792                 small number of entries.  Trying to use it on a file contain‐
793                 ing millions of short sequencing reads will produce an  index
794                 that  is  almost  as  big  as the original file, and searches
795                 using the index will be very slow and use a lot of memory.
796
797                 Options
798
799                 -o, --output FILE
800                         Write FASTQ to file rather than to stdout.
801
802                 -n, --length INT
803                         Length of FASTQ sequence line.  [60]
804
805                 -c, --continue
806                         Continue working if a non-existent region is
807                         requested.
808
809                 -r, --region-file FILE
810                         Read  regions from a file. Format is chr:from-to, one
811                         per line.
812
813                 -i, --reverse-complement
814                         Output the sequence as the reverse complement.   When
815                         this  option  is  used, “/rc” will be appended to the
816                         sequence names.  To  turn  this  off  or  change  the
817                         string appended, use the --mark-strand option.
818
819                 --mark-strand TYPE
820                         Append  strand  indicator to sequence name.  TYPE can
821                         be one of:
822
823                         rc     Append '/rc' when writing the reverse  comple‐
824                                ment.  This is the default.
825
826                         no     Do not append anything.
827
828                         sign   Append  '(+)'  for forward strand or '(-)' for
829                                reverse complement.  This matches  the  output
830                                of “bedtools getfasta -s”.
831
832                         custom,<pos>,<neg>
833                                Append  string <pos> to names when writing the
834                                forward strand  and  <neg>  when  writing  the
835                                reverse  strand.   Spaces are preserved, so it
836                                is possible to move  the  indicator  into  the
837                                comment   part  of  the  description  line  by
838                                including a leading space in the strings <pos>
839                                and <neg>.
840
841                 -h, --help
842                         Print help message and exit.
843
844
845       tview     samtools   tview   [-p   chr:pos]   [-s   STR]  [-d  display]
846                 <in.sorted.bam> [ref.fasta]
847
848                 Text alignment viewer (based on the ncurses library).  In the
849                 viewer, press `?' for help and press `g' to jump to an
850                 alignment position given in a format such as
851                 `chr10:10,000,000', or `=10,000,000' to stay on the same
852                 reference sequence.
853
854                 Options:
855
856                 -d display    Output as (H)tml or (C)urses or (T)ext
857
858                 -p chr:pos    Go directly to this position
859
860                 -s STR        Display only alignments  from  this  sample  or
861                               read group
862
863
864       split     samtools split [options] merged.sam|merged.bam|merged.cram
865
866                 Splits a file by read group.
867
868                 Options:
869
870                 -u FILE1      Put  reads with no RG tag or an unrecognised RG
871                               tag into FILE1
872
873                 -u FILE1:FILE2
874                               As above, but assigns an RG tag as given in the
875                               header of FILE2
876
877                 -f STRING     Output   filename  format  string  (see  below)
878                               ["%*_%#.%."]
879
880                 -v            Verbose output
881
882                 Format string expansions:
883
884                             %%   %
885                             %*   basename
886                             %#   @RG index
887                             %!   @RG ID
888                             %.   output format filename extension
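
                    The expansions above can be sketched as a small
                    function.  The function and its parameter names are
                    hypothetical; only the escape codes come from the
                    table.

```python
def expand_format(fmt, basename, rg_index, rg_id, extension):
    """Expand the %-escapes of an output filename format string."""
    table = {
        "%": "%",
        "*": basename,
        "#": str(rg_index),
        "!": rg_id,
        ".": extension,
    }
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code not in table:
                raise ValueError(f"unknown expansion %{code}")
            out.append(table[code])
            i += 2  # consume the '%' and its code character
        else:
            out.append(ch)
            i += 1
    return "".join(out)


default_name = expand_format("%*_%#.%.", "merged", 0, "grpA", "bam")
by_id_name = expand_format("%*_%!.%.", "merged", 0, "grpA", "bam")
print(default_name, by_id_name)
```

                    With the default format, the first read group of
                    merged.bam would therefore be written to a file named
                    after the basename and the @RG index.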
889
890
891       quickcheck
892                 samtools quickcheck [options] in.sam|in.bam|in.cram [ ... ]
893
894                 Quickly check that input files appear to be intact.  Checks
895                 that the beginning of the file contains a valid header (all
896                 formats) containing at least one target sequence, then seeks
897                 to the end of the file and checks that an end-of-file (EOF)
898                 block is present and intact (BAM only).
899
900                 Data in the middle of the file is not read since  that  would
901                 be much more time consuming, so please note that this command
902                 will not detect internal corruption, but is useful for  test‐
903                 ing  that  files  are  not  truncated  before performing more
904                 intensive tasks on them.
905
906                 This command will exit with a non-zero exit code if any input
907                 files  don't have a valid header or are missing an EOF block.
908                 Otherwise it will exit successfully (with a zero exit code).
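
                    The end-of-file part of the check can be illustrated
                    for BAM as below.  This sketch assumes the 28-byte
                    empty BGZF block defined as the EOF marker in the SAM
                    format specification; it does not validate the header
                    and is not the quickcheck implementation.

```python
import os
import tempfile

# The 28-byte empty BGZF block that terminates a well-formed BAM file.
BAM_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bam_eof(path):
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BAM_EOF):
        return False
    with open(path, "rb") as fh:
        # Seek straight to the tail; the middle of the file is never read.
        fh.seek(-len(BAM_EOF), os.SEEK_END)
        return fh.read() == BAM_EOF


# Demonstration on two throwaway files (not real BAM data).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bam")
    bad = os.path.join(tmp, "bad.bam")
    with open(good, "wb") as fh:
        fh.write(b"payload" + BAM_EOF)
    with open(bad, "wb") as fh:
        fh.write(b"payload")
    intact, truncated = has_bam_eof(good), has_bam_eof(bad)
print(intact, truncated)
```

                    Because only the tail is inspected, a file with
                    corrupted data in the middle would still pass, as noted
                    above.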
909
910                 Options:
911
912                 -v      Verbose output: will additionally print the names  of
913                         all  input files that don't pass the check to stdout.
914                         Multiple -v options will  cause  additional  messages
915                         regarding check results to be printed to stderr.
916
917                 -q      Quiet mode: disables warning messages on stderr about
918                         files that fail.  If both -q and -v options are  used
919                         then the appropriate level of -v takes precedence.
920
921
922       dict      samtools dict <ref.fasta|ref.fasta.gz>
923
924                 Create a sequence dictionary file from a fasta file.
925
926                 OPTIONS:
927
928                 -a, --assembly STR
929                            Specify the assembly for the AS tag.
930
931                 -H, --no-header
932                            Do not print the @HD header line.
933
934                 -o, --output FILE
935                            Output to FILE [stdout].
936
937                 -s, --species STR
938                            Specify the species for the SP tag.
939
940                 -u, --uri STR
941                            Specify  the  URI  for the UR tag. Defaults to the
942                            absolute path of  ref.fasta  unless  reading  from
943                            stdin.
944
945
946       fixmate   samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam
947
948                 Fill in mate coordinates, ISIZE and mate related flags from a
949                 name-sorted alignment.
950
951                 OPTIONS:
952
953                 -r         Remove secondary and unmapped reads.
954
955                 -p         Disable FR proper pair check.
956
957                 -c         Add template cigar ct tag.
958
959                 -m         Add ms (mate  score)  tags.   These  are  used  by
960                            markdup to select the best reads to keep.
961
962                 -O FORMAT  Write the final output as sam, bam, or cram.
963
964                            By  default,  samtools  tries  to  select a format
965                            based on the output filename extension; if  output
966                            is to standard output or no format can be deduced,
967                            bam is selected.
968
969
970       mpileup   samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa]  [-l
971                 list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
972
973                 Generate  pileup  for  one  or  multiple BAM files. Alignment
974                 records are grouped by sample (SM) identifiers in @RG  header
975                 lines.  If  sample identifiers are absent, each input file is
976                 regarded as one sample.
977
978                 Samtools mpileup can still produce VCF and  BCF  output,  but
979                 this  feature  is  deprecated and will be removed in a future
980                 release.  Please  use  bcftools  mpileup  for  this  instead.
981                 (Documentation  on  the  deprecated  options has been removed
982                 from this manual  page,  but  older  versions  are  available
983                 online at <http://www.htslib.org/doc/>.)
984
985                 In the pileup format (without -u or -g), each line represents
986                 a genomic position, consisting of  chromosome  name,  1-based
987                 coordinate,  reference base, the number of reads covering the
988                 site, read bases, base qualities and alignment mapping quali‐
989                 ties.  Information on match, mismatch, indel, strand, mapping
990                 quality and start and end of a read are all  encoded  at  the
991                 read base column. At this column, a dot stands for a match to
992                 the reference base on the forward strand, a comma for a match
993                 on  the  reverse  strand,  a '>' or '<' for a reference skip,
994                 `ACGTN' for a mismatch on the forward strand and `acgtn'  for
995                 a  mismatch  on the reverse strand. A pattern `\+[0-9]+[ACGT‐
996                 Nacgtn]+' indicates there is an insertion between this refer‐
997                 ence  position and the next reference position. The length of
998                 the insertion is given by the integer in  the  pattern,  fol‐
999                 lowed   by   the  inserted  sequence.  Similarly,  a  pattern
1000                 `-[0-9]+[ACGTNacgtn]+' represents a deletion from the  refer‐
1001                 ence.  The deleted bases will be presented as `*' in the fol‐
1002                 lowing lines. Also at the read  base  column,  a  symbol  `^'
1003                 marks the start of a read. The ASCII of the character follow‐
1004                 ing `^' minus 33 gives the  mapping  quality.  A  symbol  `$'
1005                 marks the end of a read segment.
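
                    The encoding of the read base column can be made
                    concrete with a small tally function.  This is a sketch
                    based only on the description above; the function name
                    and the returned dictionary keys are hypothetical.

```python
import re


def summarize_read_bases(column):
    """Tally the events encoded in one pileup read-base column."""
    counts = {"match": 0, "mismatch": 0, "ins": 0, "del": 0,
              "skip": 0, "deleted": 0, "starts": 0, "ends": 0}
    mapqs = []
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            # The next character encodes mapping quality as ASCII minus 33.
            mapqs.append(ord(column[i + 1]) - 33)
            counts["starts"] += 1
            i += 2
        elif ch == "$":
            counts["ends"] += 1
            i += 1
        elif ch in "+-":
            m = re.match(r"\d+", column[i + 1:])
            if m is None:
                raise ValueError(f"missing indel length at offset {i}")
            counts["ins" if ch == "+" else "del"] += 1
            # Skip the sign, the length digits and the indel bases.
            i += 1 + len(m.group()) + int(m.group())
        elif ch in ".,":
            counts["match"] += 1
            i += 1
        elif ch in "<>":
            counts["skip"] += 1
            i += 1
        elif ch == "*":
            counts["deleted"] += 1
            i += 1
        elif ch in "ACGTNacgtn":
            counts["mismatch"] += 1
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return counts, mapqs


counts, mapqs = summarize_read_bases("^I.,+2AGT$-1c*>")
print(counts, mapqs)
```

                    In the example string, `^I' marks a read start with
                    mapping quality 40 (ASCII 73 minus 33), `+2AG' is a
                    two-base insertion and `-1c' a one-base deletion.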
1006
1007                 Note  that there are two orthogonal ways to specify locations
1008                 in the input file; via -r region and  -l  file.   The  former
1009                 uses  (and  requires)  an index to do random access while the
1010                 latter streams through the file contents  filtering  out  the
1011                 specified  regions,  requiring no index.  The two may be used
1012                 in conjunction.  For example a BED file containing  locations
1013                 of  genes  in chromosome 20 could be specified using -r 20 -l
1014                 chr20.bed, meaning that the index is used to find  chromosome
1015                 20  and then it is filtered for the regions listed in the bed
1016                 file.
1017
1018                 Input Options:
1019
1020                 -6, --illumina1.3+
1021                           Assume the quality is in the Illumina  1.3+  encod‐
1022                           ing.
1023
1024                 -A, --count-orphans
1025                           Do  not  skip anomalous read pairs in variant call‐
1026                           ing.
1027
1028                 -b, --bam-list FILE
1029                           List of input BAM files, one file per line [null]
1030
1031                 -B, --no-BAQ
1032                           Disable base alignment quality  (BAQ)  computation.
1033                           See BAQ below.
1034
1035                 -C, --adjust-MQ INT
1036                           Coefficient  for  downgrading  mapping  quality for
1037                           reads containing excessive mismatches. Given a read
1038                           with  a  phred-scaled probability q of being gener‐
1039                           ated from the  mapped  position,  the  new  mapping
1040                           quality  is  about  sqrt((INT-q)/INT)*INT.  A  zero
1041                           value disables this functionality; if enabled,  the
1042                           recommended value for BWA is 50. [0]
1043
1044                 -d, --max-depth INT
1045                           At a position, read at most INT reads per input
1046                           file.  Setting this limit reduces the amount of
1047                           memory and time needed to process regions with very
1048                           high coverage.  Passing zero for this option sets
1049                           it to the highest possible value, effectively
1050                           removing the depth limit.  [8000]
1051
1052                           Note that up to release 1.8, samtools would enforce
1053                           a  minimum  value  for this option.  This no longer
1054                           happens and the limit is set exactly as specified.
1055
1056                 -E, --redo-BAQ
1057                           Recalculate BAQ on  the  fly,  ignore  existing  BQ
1058                           tags.  See BAQ below.
1059
1060                 -f, --fasta-ref FILE
1061                           The  faidx-indexed reference file in the FASTA for‐
1062                           mat. The  file  can  be  optionally  compressed  by
1063                           bgzip.  [null]
1064
1065                           Supplying  a reference file will enable base align‐
1066                           ment quality calculation for all reads aligned to a
1067                           reference in the file.  See BAQ below.
1068
1069                 -G, --exclude-RG FILE
1070                           Exclude  reads  from readgroups listed in FILE (one
1071                           @RG-ID per line)
1072
1073                 -l, --positions FILE
1074                           BED or position list  file  containing  a  list  of
1075                           regions or sites where pileup or BCF should be gen‐
1076                           erated. Position list  files  contain  two  columns
1077                           (chromosome  and  position) and start counting from
1078                           1.  BED files contain at least 3  columns  (chromo‐
1079                           some, start and end position) and are 0-based half-
1080                           open.
1081                           While it is possible to mix both position-list  and
1082                           BED  coordinates in the same file, this is strongly
1083                           ill advised due to the  differing  coordinate  sys‐
1084                           tems. [null]
1085
1086                 -q, --min-MQ INT
1087                           Minimum mapping quality for an alignment to be used
1088                           [0]
1089
1090                 -Q, --min-BQ INT
1091                           Minimum base quality for a base  to  be  considered
1092                           [13]
1093
1094                 -r, --region STR
1095                           Only generate pileup in region.  Requires the BAM
1096                           files to be indexed.  If used in conjunction with
1097                           -l, the intersection of the two requests is
1098                           considered.  [all sites]
1099
1100                 -R, --ignore-RG
1101                           Ignore RG tags. Treat all reads in one BAM  as  one
1102                           sample.
1103
1104                 --rf, --incl-flags STR|INT
1105                           Required  flags:  skip  reads  with mask bits unset
1106                           [null]
1107
1108                 --ff, --excl-flags STR|INT
1109                           Filter  flags:  skip  reads  with  mask  bits   set
1110                           [UNMAP,SECONDARY,QCFAIL,DUP]
1111
1112                 -x, --ignore-overlaps
1113                           Disable read-pair overlap detection.
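
                    The two coordinate systems accepted by -l differ by
                    one at the start position.  The sketch below converts
                    both to 1-based inclusive intervals.  Choosing the
                    format by column count alone is a simplifying
                    assumption of this example, not a description of how
                    samtools decides.

```python
def parse_positions_line(line):
    """Return (chrom, start, end) as a 1-based inclusive interval."""
    fields = line.split()
    if len(fields) >= 3:
        # BED: 0-based, half-open [start, end), so only start shifts by one.
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        return chrom, start + 1, end
    if len(fields) == 2:
        # Position list: a single position, already counted from 1.
        chrom, pos = fields[0], int(fields[1])
        return chrom, pos, pos
    raise ValueError(f"cannot parse line: {line!r}")


bed_interval = parse_positions_line("chr20\t99\t200")
site_interval = parse_positions_line("chr20\t100")
print(bed_interval, site_interval)
```

                    The BED line `chr20 99 200' and the position-list line
                    `chr20 100' start at the same base, which is why mixing
                    the two styles in one file is easy to get wrong.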
1114
1115                 Output Options:
1116
1117                 -o, --output FILE
1118                           Write  pileup  output  to  FILE,  rather  than  the
1119                           default of standard output.
1120
1121                           (The same short option is used for both the
1122                           deprecated --open-prob option and --output.  If -o's
1123                           argument contains any non-digit characters other
1124                           than a leading + or - sign, it is interpreted as
1125                           --output.  Usually the filename extension will take
1126                           care of this, but to write to an entirely numeric
1127                           filename use -o ./123 or --output 123.)
1128
1129                 -O, --output-BP
1130                           Output base positions on reads.
1131
1132                 -s, --output-MQ
1133                           Output mapping quality.
1134
1135                 --output-QNAME
1136                           Output an extra column  containing  comma-separated
1137                           read names.
1138
1139                 -a        Output  all  positions,  including  those with zero
1140                           depth.
1141
1142                 -a -a, -aa
1143                           Output absolutely all positions,  including  unused
1144                           reference  sequences.   Note that when used in con‐
1145                           junction with a BED file the -a  option  may  some‐
1146                           times operate as if -aa was specified if the refer‐
1147                           ence sequence has coverage outside  of  the  region
1148                           specified in the BED file.
1149
1150                 BAQ (Base Alignment Quality)
1151
1152                 BAQ is the Phred-scaled probability of a read base being mis‐
1153                 aligned.  It greatly helps to reduce  false  SNPs  caused  by
1154                 misalignments.   BAQ  is  calculated  using the probabilistic
1155                 realignment method described in the paper “Improving SNP dis‐
1156                 covery  by  base alignment quality”, Heng Li, Bioinformatics,
1157                 Volume  27,  Issue  8   <https://doi.org/10.1093/bioinformat‐
1158                 ics/btr076>
1159
1160                 BAQ  is turned on when a reference file is supplied using the
1161                 -f option.  To disable it, use the -B option.
1162
1163                 It is possible to store pre-calculated BAQ values  in  a  SAM
1164                 BQ:Z tag.  Samtools mpileup will use the precalculated values
1165                 if it finds them.  The -E option  can  be  used  to  make  it
1166                 ignore  the contents of the BQ:Z tag and force it to recalcu‐
1167                 late the BAQ scores by making a new alignment.
1168
1169
1170       flags     samtools flags INT|STR[,...]
1171
1172                 Convert between textual and numeric flag representation.
1173
1174                 FLAGS:
1175
1176                   0x1   PAIRED          paired-end (or multiple-segment) sequencing technology
1177                   0x2   PROPER_PAIR     each segment properly aligned according to the aligner
1178                   0x4   UNMAP           segment unmapped
1179                   0x8   MUNMAP          next segment in the template unmapped
1180                  0x10   REVERSE         SEQ is reverse complemented
1181                  0x20   MREVERSE        SEQ of the next segment in the template is reverse complemented
1182                  0x40   READ1           the first segment in the template
1183                  0x80   READ2           the last segment in the template
1184                 0x100   SECONDARY       secondary alignment
1185                 0x200   QCFAIL          not passing quality controls
1186                 0x400   DUP             PCR or optical duplicate
1187                 0x800   SUPPLEMENTARY   supplementary alignment
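
                    The conversion is a plain bit-mask lookup over the
                    table above.  The following sketch is an illustration
                    with hypothetical function names, not the samtools
                    source.

```python
FLAG_BITS = [
    (0x1, "PAIRED"), (0x2, "PROPER_PAIR"), (0x4, "UNMAP"), (0x8, "MUNMAP"),
    (0x10, "REVERSE"), (0x20, "MREVERSE"), (0x40, "READ1"), (0x80, "READ2"),
    (0x100, "SECONDARY"), (0x200, "QCFAIL"), (0x400, "DUP"),
    (0x800, "SUPPLEMENTARY"),
]


def flags_to_names(value):
    """Return the comma-separated names of the bits set in value."""
    return ",".join(name for bit, name in FLAG_BITS if value & bit)


def names_to_flags(text):
    """Return the numeric flag for a comma-separated list of names."""
    lookup = {name: bit for bit, name in FLAG_BITS}
    value = 0
    for name in text.split(","):
        value |= lookup[name.strip().upper()]
    return value


numeric = names_to_flags("PAIRED,UNMAP,MUNMAP")
textual = flags_to_names(99)
print(numeric, textual)
```

                    Here PAIRED,UNMAP,MUNMAP combine to 0x1 + 0x4 + 0x8,
                    and 99 decomposes into 0x1 + 0x2 + 0x20 + 0x40.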
1188
1189
1190       fastq/a   samtools fastq [options] in.bam
1191                 samtools fasta [options] in.bam
1192
1193                 Converts a BAM or CRAM into  either  FASTQ  or  FASTA  format
1194                 depending on the command invoked. The files will be automati‐
1195                 cally compressed if the file names have a .gz or .bgzf exten‐
1196                 sion.
1197
1198                 The input to this program must be collated by name.  Use sam‐
1199                 tools collate or samtools sort -n to ensure this.
1200
1201                 For each different QNAME, the input records  are  categorised
1202                 according to the state of the READ1 and READ2 flag bits.  The
1203                 three categories used are:
1204
1205                 1 : Only READ1 is set.
1206
1207                 2 : Only READ2 is set.
1208
1209                 0 : Either both READ1 and READ2 are set; or neither is set.
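
                    The three categories follow directly from the READ1
                    (0x40) and READ2 (0x80) flag bits.  A minimal sketch,
                    with a hypothetical function name:

```python
READ1 = 0x40
READ2 = 0x80


def read_category(flag):
    """Return 1, 2 or 0 according to the READ1 and READ2 flag bits."""
    is_read1 = bool(flag & READ1)
    is_read2 = bool(flag & READ2)
    if is_read1 and not is_read2:
        return 1
    if is_read2 and not is_read1:
        return 2
    # Both bits set, or neither bit set.
    return 0


categories = [read_category(f) for f in (99, 147, 0, 0xC0)]
print(categories)
```

                    Flags 99 and 147 are typical values for the two reads
                    of a properly paired template, so they fall into
                    categories 1 and 2 respectively.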
1210
1211                 The exact meaning of these categories depends on the sequenc‐
1212                 ing technology used.  It is expected that ordinary single and
1213                 paired-end sequencing reads will be in categories 1 and 2 (in
1214                 the case of paired-end reads, one read of the pair will be in
1215                 category 1, the other in category 2).  Category 0  is  essen‐
1216                 tially  a “catch-all” for reads that do not fit into a simple
1217                 paired-end sequencing model.
1218
1219                 For each category only one sequence will  be  written  for  a
1220                 given  QNAME.   If  more  than  one record is available for a
1221                 given QNAME and category, the first in input file order  that
1222                 has  quality  values  will be used.  If none of the candidate
1223                 records has quality values, then  the  first  in  input  file
1224                 order will be used instead.
1225
1226                 Sequences will be written to standard output unless one of
1227                 the -1, -2, or -0 options is used, in which case sequences
1228                 for that category will be written to the specified file.
1229
1230                 If  a  singleton  file  is specified using the -s option then
1231                 only paired sequences will be output for categories 1 and  2;
1232                 paired meaning that for a given QNAME there are sequences for
1233                 both category 1 and 2.  If there is a sequence for  only  one
1234                 of categories 1 or 2 then it will be diverted into the speci‐
1235                 fied singletons file.  This can  be  used  to  prepare  fastq
1236                 files for programs that cannot handle a mixture of paired and
1237                 singleton reads.
1238
1239                 The -s option only affects category 1  and  2  records.   The
1240                 output  for  category  0 will be the same irrespective of the
1241                 use of this option.
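
                    The effect of -s on where each sequence ends up can be
                    sketched as a routing table.  The function and the
                    stream labels "0", "1", "2" and "s" are hypothetical
                    and stand for the destinations of the three categories
                    and the singleton file.

```python
def route_reads(categories_by_qname, singleton_file=False):
    """Map each (qname, category) pair to an output stream label.

    categories_by_qname: {qname: set of categories seen for that name}.
    """
    routes = {}
    for qname, cats in categories_by_qname.items():
        # A name is "paired" when both category 1 and 2 are present.
        paired = 1 in cats and 2 in cats
        for cat in cats:
            if cat == 0:
                routes[(qname, cat)] = "0"  # never affected by -s
            elif singleton_file and not paired:
                routes[(qname, cat)] = "s"  # diverted to the singleton file
            else:
                routes[(qname, cat)] = str(cat)
    return routes


reads = {"a": {1, 2}, "b": {1}, "c": {0}}
with_s = route_reads(reads, singleton_file=True)
without_s = route_reads(reads)
print(with_s)
```

                    Only the unpaired read "b" changes destination when a
                    singleton file is requested; the category 0 read "c" is
                    routed the same way in both cases.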
1242
1243                 OPTIONS:
1244
1245                 -n      By default, either '/1' or '/2' is added to  the  end
1246                         of  read names where the corresponding READ1 or READ2
1247                         FLAG bit is set.  Using -n causes read  names  to  be
1248                         left as they are.
1249
1250                 -N      Always  add  either  '/1'  or '/2' to the end of read
1251                         names even when put into different files.
1252
1253                 -O      Use quality values from  OQ  tags  in  preference  to
1254                         standard quality string if available.
1255
1256                 -s FILE Write singleton reads to FILE.
1257
1258                 -t      Copy  RG, BC and QT tags to the FASTQ header line, if
1259                         they exist.
1260
1261                 -T TAGLIST
1262                         Specify a comma-separated list of tags to copy to the
1263                         FASTQ header line, if they exist.
1264
1265                 -1 FILE Write  reads  with  the READ1 FLAG set (and READ2 not
1266                         set) to FILE instead of outputting them.  If  the  -s
1267                         option  is used, only paired reads will be written to
1268                         this file.
1269
1270                 -2 FILE Write reads with the READ2 FLAG set  (and  READ1  not
1271                         set)  to  FILE instead of outputting them.  If the -s
1272                         option is used, only paired reads will be written  to
1273                         this file.
1274
1275                 -0 FILE Write reads where the READ1 and READ2 FLAG bits are
1276                         either both set or both unset to FILE instead of
1277                         outputting them.
1278
1279                 -f INT  Only  output  alignments  with  all  bits  set in INT
1280                         present in the FLAG field.  INT can be  specified  in
1281                         hex  by  beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
1282                         in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1283
1284                 -F INT  Do not output alignments with any  bits  set  in  INT
1285                         present  in  the FLAG field.  INT can be specified in
1286                         hex by beginning with `0x' (i.e.  /^0x[0-9A-F]+/)  or
1287                         in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1288
1289                 -G INT  Only  EXCLUDE  reads  with all of the bits set in INT
1290                         present in the FLAG field.  INT can be  specified  in
1291                         hex  by  beginning with `0x' (i.e. /^0x[0-9A-F]+/) or
1292                         in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1293
1294                 -i      add Illumina Casava 1.8 format entry  to  header  (eg
1295                         1:N:0:ATCACG)
1296
1297                 -c [0..9]
1298                         set  compression  level when writing gz or bgzf fastq
1299                         files.
1300
1301                 --i1 FILE
1302                         write first index reads to FILE
1303
1304                 --i2 FILE
1305                         write second index reads to FILE
1306
1307                 --barcode-tag TAG
1308                         aux tag to find index reads in [default: BC]
1309
1310                 --quality-tag TAG
1311                         aux tag to find index quality in [default: QT]
1312
1313                 --index-format STR
1314                         string to describe how to parse the barcode and qual‐
1315                         ity tags. For example:
1316
1317
1318                         i14i8   the first 14 characters are index 1, the next
1319                                 8 characters are index 2
1320
1321                         n8i14   ignore the first 8 characters,  and  use  the
1322                                 next 14 characters for index 1
1323
1324                                 If  the  tag  contains  a separator, then the
1325                                 numeric part can be replaced with '*' to mean
1326                                 'read until the separator or end of tag', for
1327                                 example:
1328
1329                         n*i*    ignore the left part of  the  tag  until  the
1330                                 separator, then use the second part
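
                 As an illustration only, the fixed-width forms of
                 --index-format can be mimicked in a few lines of POSIX
                 shell.  The helper name split_index is hypothetical and is
                 not part of samtools, the barcode strings are made up, and
                 the sketch handles only i<N> and n<N> with positive N, not
                 the '*' form.

```shell
# split_index FORMAT TAG
# Hypothetical helper, not part of samtools: walk FORMAT left to right,
# consuming N characters of TAG per i<N> or n<N> element, and print one
# line for each i<N> element.  The '*' form is not handled.
split_index() {
    fmt=$1
    tag=$2
    while [ -n "$fmt" ]; do
        case $fmt in
            [in][1-9]*) ;;
            *) echo "unsupported format: $fmt" >&2; return 1 ;;
        esac
        kind=${fmt%"${fmt#?}"}      # first character: i or n
        rest=${fmt#?}
        len=${rest%%[!0-9]*}        # leading digits give the field width
        fmt=${rest#"$len"}          # remainder of the format string
        field=$(printf '%s' "$tag" | cut -c1-"$len")
        tag=$(printf '%s' "$tag" | cut -c"$((len + 1))"-)
        if [ "$kind" = i ]; then
            printf '%s\n' "$field"
        fi
    done
}

# Made-up barcodes: 14 + 8 characters, then 8 ignored + 14 characters.
split_index i14i8 AAAAAAAAAAAAAACCCCCCCC
split_index n8i14 NNNNNNNNAAAAAAAAAAAAAA
```

                 The sketch prints each index field on its own line; samtools
                 itself writes index reads to the files given by --i1 and
                 --i2.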
1331
1332                 EXAMPLES
1333
1334                 Output paired reads to separate files, discarding singletons,
1335                 supplementary and secondary reads.  The resulting  files  can
1336                 be used with, for example, the bwa aligner.
1337
1338                     samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n -F 0x900 in.bam
1339
1340
1341                 Output  paired and singleton reads in a single file, discard‐
1342                 ing supplementary and secondary reads.  To  get  all  of  the
1343                 reads  in a single file, it is necessary to redirect the out‐
1344                 put of samtools fastq.  The output file is suitable  for  use
1345                 with  bwa mem -p which understands interleaved files contain‐
1346                 ing a mixture of paired and singleton reads.
1347
1348                     samtools fastq -0 /dev/null -F 0x900 in.bam > all_reads.fq
1349
1350
1351                 Output paired reads in a single file,  discarding  supplemen‐
1352                 tary  and secondary reads.  Save any singletons in a separate
1353                 file.  Append /1 and /2 to read names.  This format is  suit‐
1354                 able  for use by NextGenMap when using its -p and -q options.
1355                 With this aligner, paired reads must be mapped separately  to
1356                 the singletons.
1357
1358                     samtools fastq -0 /dev/null -s single.fq -N -F 0x900 in.bam > paired.fq
1359
1360
1361                 BUGS
1362
1363
1364                 o The way of specifying output files is far too complicated
1365                   and easy to get wrong.
1366
1367
1368                 o The default value for the -F option should really be  0x900
1369                   so that secondary and supplementary reads are automatically
1370                   excluded.  The existing default of 0 is retained  for  rea‐
1371                   sons of compatibility.
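
                 The 0x900 mask used with -F above is the bitwise OR of two
                 FLAG bits defined by the SAM specification: 0x100 (secondary
                 alignment) and 0x800 (supplementary alignment).  A minimal
                 shell sketch of building such a mask follows; the function
                 name flag_mask is an assumption made for illustration and is
                 not a samtools command.

```shell
# FLAG bit values as defined by the SAM specification.
SECONDARY=0x100
SUPPLEMENTARY=0x800

# flag_mask BIT...
# Hypothetical helper: OR the given bit values together and print the
# result in hex, ready to pass to -f, -F or -G.
flag_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | $bit ))
    done
    printf '0x%x\n' "$mask"
}

flag_mask "$SECONDARY" "$SUPPLEMENTARY"
```

                 The printed value can be passed directly, for example as
                 -F 0x900 in the fastq examples above.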
1372
1373
1374
1375       collate   samtools collate [options] in.sam|in.bam|in.cram [<prefix>]
1376
1377                 Shuffles  and groups reads together by their names.  A faster
1378                 alternative to a full query name sort, collate  ensures  that
1379                 reads  of  the  same  name are grouped together in contiguous
1380                 groups, but doesn't make any guarantees about  the  order  of
1381                 read names between groups.
1382
1383                 The output from this command should be suitable for any oper‐
1384                 ation that requires all reads from the same  template  to  be
1385                 grouped together.
1386
1387                 If present, <prefix> is used to name the temporary files that
1388                 collate uses when sorting the data.  If neither the '-O'  nor
1389                 '-o'  options  are used, <prefix> must be present and collate
1390                 will use it to make an output file name by appending a suffix
1391                 depending on the format written (.bam by default).
1392
1393                 If  either the -O or -o option is used, <prefix> is optional.
1394                 If <prefix> is absent, collate will write the temporary files
1395                 to a system-dependent location (/tmp on UNIX).
1396
1397                 Using  -f  for  fast mode will output only primary alignments
1398                 that have either the READ1 or READ2 flags set (but not both).
1399                 Any other alignment records will be filtered out.  The colla‐
1400                 tion will only work correctly if there are no more  than  two
1401                 reads for any given QNAME after filtering.
1402
1403                 Fast  mode  keeps a buffer of alignments in memory so that it
1404                 can write out most pairs as soon as they are found instead of
1405                 storing  them  in  temporary  files.   This allows collate to
1406                 avoid some work and so finish more quickly  compared  to  the
1407                 standard  mode.  The number of alignments held can be changed
1408                 using -r; storing more alignments uses more memory but
1409                 increases the number of pairs that can be written early.
1410
1411                 While collate normally randomises the ordering of read pairs,
1412                 fast mode does not.   Position-dependent  biases  that  would
1413                 normally  be broken up can remain in the fast collate output.
1414                 It is therefore not a good idea to use fast mode when prepar‐
1415                 ing  data  for  programs  that expect randomly ordered paired
1416                 reads.  For example, using fast collate instead of the
1417                 standard mode may lead to significantly different results
1418                 from aligners that estimate library insert sizes on batches
1419                 of reads.
1420
1421                 Options:
1422
1423                 -O      Output  to  stdout.   This option cannot be used with
1424                         '-o'.
1425
1426                 -o FILE Write output to FILE.  This  option  cannot  be  used
1427                         with '-O'.
1428
1429                 -u      Write uncompressed BAM output
1430
1431                 -l INT  Compression level.  [1]
1432
1433                 -n INT  Number of temporary files to use.  [64]
1434
1435                 -f      Fast mode (primary alignments only).
1436
1437                 -r INT  Number of reads to store in memory (for use with -f).
1438                         [10000]
1439
1440
1441       reheader  samtools reheader [-iP] in.header.sam in.bam
1442
1443                 Replace  the  header   in   in.bam   with   the   header   in
1444                 in.header.sam.   This  command  is much faster than replacing
1445                 the header with a BAM→SAM→BAM conversion.
1446
1447                 By default this command outputs the BAM or CRAM file to stan‐
1448                 dard  output  (stdout),  but for CRAM format files it has the
1449                 option to perform an in-place edit, both reading and  writing
1450                 to the same file.  No validity checking is performed on the
1451                 header, and no check is made that it is suitable for use with
1452                 the sequence data itself.
1453
1454                 OPTIONS:
1455
1456                 -P, --no-PG
1457                         Do not generate an @PG header line.
1458
1459                 -i, --in-place
1460                         Perform  the header edit in-place, if possible.  This
1461                         only works on CRAM files and only if there is  suffi‐
1462                         cient  room  to  store the new header.  The amount of
1463                         space available will differ for each CRAM file.
1464
1465
1466       cat       samtools cat [-b list] [-h header.sam] [-o out.bam] <in1.bam>
1467                 <in2.bam> [ ... ]
1468
1469                 Concatenate  BAMs or CRAMs. Although this works on either BAM
1470                 or CRAM, all input files must be  the  same  format  as  each
1471                 other.  The  sequence  dictionary  of each input file must be
1472                 identical, although this command does not  check  this.  This
1473                 command  uses  a similar trick to reheader which enables fast
1474                 BAM concatenation.
1475
1476                 OPTIONS:
1477
1478                 -b FOFN Read the list of input BAM or CRAM files  from  FOFN.
1479                         These  are  concatenated prior to any files specified
1480                         on the command line.  Multiple -b FOFN options may be
1481                         specified  to  concatenate multiple lists of BAM/CRAM
1482                         files.
1483
1484                 -h FILE Uses the SAM header from FILE.  By default the header
1485                         is taken from the first file to be concatenated.
1486
1487                 -o FILE Write  the  concatenated  output to FILE.  By default
1488                         this is sent to stdout.
1489
1490
1491       rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>
1492
1493                 This command is obsolete. Use markdup instead.
1494
1495                 Remove potential PCR duplicates: if multiple read pairs have
1496                 identical external coordinates, only the pair with the
1497                 highest mapping quality is retained.  In paired-end mode,
1498                 this command ONLY works with FR orientation and requires that
1499                 ISIZE is correctly set.  It does not work for unpaired reads
1500                 (e.g. two ends mapped to different chromosomes or orphan reads).
1501
1502                 OPTIONS:
1503
1504                 -s      Remove  duplicates  for single-end reads. By default,
1505                         the command works for paired-end reads only.
1506
1507                 -S      Treat paired-end reads as single-end reads.
1508
1509
1510       addreplacerg
1511                 samtools addreplacerg [-r rg line | -R rg ID] [-m  mode]  [-l
1512                 level] [-o out.bam] <input.bam>
1513
1514                 Adds or replaces read group tags in a file.
1515
1516                 OPTIONS:
1517
1518                 -r STRING
1519                         Allows  you to specify a read group line to append to
1520                         the header and applies it to the reads  specified  by
1521                         the  -m  option. If repeated it automatically adds in
1522                         tabs between invocations.
1523
1524                 -R STRING
1525                         Allows you to specify the read group ID of an  exist‐
1526                         ing @RG line and applies it to the reads specified.
1527
1528                 -m MODE If MODE is orphan_only, existing RG tags are not
1529                         overwritten; if MODE is overwrite_all, existing RG
1530                         tags are overwritten.  The default is
1531                         overwrite_all.
1532
1533                 -o STRING
1534                         Write the final output to STRING. The default  is  to
1535                         write to stdout.
1536
1537                         By  default,  samtools tries to select a format based
1538                         on the output filename extension;  if  output  is  to
1539                         standard  output  or no format can be deduced, bam is
1540                         selected.
1541
1542
1543       calmd     samtools calmd [-Eeubr] [-C capQcoef] <aln.bam> <ref.fasta>
1544
1545                 Generate the MD tag. If the MD tag is already  present,  this
1546                 command  will  give a warning if the MD tag generated is dif‐
1547                 ferent from the existing tag. Output SAM by default.
1548
1549                 Calmd can also read and write CRAM  files  although  in  most
1550                 cases  it is pointless as CRAM recalculates MD and NM tags on
1551                 the fly.  The one exception to this case is where both  input
1552                 and  output CRAM files have been / are being created with the
1553                 no_ref option.
1554
1555                 OPTIONS:
1556
1557                 -A      When used jointly with -r this option overwrites  the
1558                         original base quality.
1559
1560                 -e      Convert the read base to = if it is identical to the
1561                         aligned reference base.  The indel caller does not
1562                         support = bases at the moment.
1563
1564                 -u      Output uncompressed BAM
1565
1566                 -b      Output compressed BAM
1567
1568                 -C INT  Coefficient  to  cap mapping quality of poorly mapped
1569                         reads. See the pileup command for details. [0]
1570
1571                 -r      Compute the BQ tag (without -A) or cap  base  quality
1572                         by BAQ (with -A).
1573
1574                 -E      Extended  BAQ  calculation. This option trades speci‐
1575                         ficity for sensitivity, though the effect is minor.
1576
1577
1578       targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0]  [-1
1579                 em1] [-2 em2] [-f ref] <in.bam>
1580
1581                 This  command identifies target regions by examining the con‐
1582                 tinuity of read depth, computes haploid  consensus  sequences
1583                 of targets and outputs a SAM with each sequence corresponding
1584                 to a target. When option -f is in use, BAQ will  be  applied.
1585                 This  command is only designed for cutting fosmid clones from
1586                 fosmid pool sequencing [Ref. Kitzman et al. (2010)].
1587
1588
1589       phase     samtools phase [-AF] [-k len] [-b  prefix]  [-q  minLOD]  [-Q
1590                 minBaseQ] <in.bam>
1591
1592                 Call and phase heterozygous SNPs.
1593
1594                 OPTIONS:
1595
1596                 -A      Drop reads with ambiguous phase.
1597
1598                 -b STR  Prefix  of  BAM  output.  When this option is in use,
1599                         phase-0 reads will be saved  in  file  STR.0.bam  and
1600                         phase-1 reads in STR.1.bam.  Phase unknown reads will
1601                         be randomly  allocated  to  one  of  the  two  files.
1602                         Chimeric  reads  with  switch errors will be saved in
1603                         STR.chimeric.bam.  [null]
1604
1605                 -F      Do not attempt to fix chimeric reads.
1606
1607                 -k INT  Maximum length for local phasing. [13]
1608
1609                 -q INT  Minimum Phred-scaled LOD to call a heterozygote. [40]
1610
1611                 -Q INT  Minimum base quality to be used in het calling. [13]
1612
1613
1614       depad     samtools depad [-SsCu1] [-T ref.fa] [-o output] <in.bam>
1615
1616                 Converts a BAM aligned against a padded reference  to  a  BAM
1617                 aligned against the depadded reference.  The padded reference
1618                 may contain verbatim "*" bases in it, but "*" bases are  also
1619                 counted  in  the  reference  numbering.   This  means  that a
1620                 sequence base-call aligned against a reference "*" is consid‐
1621                 ered  to be a cigar match ("M" or "X") operator (if the base-
1622                 call is "A", "C", "G" or "T").  After depadding the reference
1623                 "*"  bases  are  deleted and such aligned sequence base-calls
1624                 become insertions.  Similar transformations apply for
1625                 deletions and padding cigar operations.
1626
1627                 OPTIONS:
1628
1629                 -S     Ignored  for compatibility with previous samtools ver‐
1630                        sions.  Previously this option was required  if  input
1631                        was in SAM format, but now the correct format is auto‐
1632                        matically detected by examining the first few  charac‐
1633                        ters of input.
1634
1635                 -s     Output in SAM format.  The default is BAM.
1636
1637                 -C     Output in CRAM format.  The default is BAM.
1638
1639                 -u     Do  not compress the output.  Applies to either BAM or
1640                        CRAM output format.
1641
1642                 -1     Enable fastest compression level.  Only works for  BAM
1643                        or CRAM output.
1644
1645                 -T FILE
1646                        Provides the padded reference file.  Note that without
1647                        this the @SQ line lengths will be  incorrect,  so  for
1648                        most  use  cases  this  option  will  be considered as
1649                        mandatory.
1650
1651                 -o FILE
1652                        Specifies the output filename.  By default  output  is
1653                        sent to stdout.
1654
1655
1656       markdup   samtools markdup [-l length] [-r] [-s] [-T] [-S]
1657                 in.algsort.bam out.bam
1658
1659                 Mark duplicate alignments from a coordinate sorted file  that
1660                 has  been  run through fixmate with the -m option.  This pro‐
1661                 gram relies on the MC and ms tags that fixmate provides.
1662
1663
1664                 -l INT     Expected maximum read length of INT bases.  [300]
1665
1666                 -r         Remove duplicate reads.
1667
1668                 -s         Print some basic stats.
1669
1670                 -T PREFIX  Write    temporary    files     to     PREFIX.sam‐
1671                            tools.nnnn.mmmm.tmp
1672
1673                 -S         Mark  supplementary  reads of duplicates as dupli‐
1674                            cates.
1675
1676
1677           EXAMPLE
1678
1679           # The first sort can be omitted if the file is already name ordered
1680           samtools sort -n -o namesort.bam example.bam
1681
1682           # Add ms and MC tags for markdup to use later
1683           samtools fixmate -m namesort.bam fixmate.bam
1684
1685           # Markdup needs position order
1686           samtools sort -o positionsort.bam fixmate.bam
1687
1688           # Finally mark duplicates
1689           samtools markdup positionsort.bam markdup.bam
1690
1691
1692       help, --help
1693                 Display a brief usage message listing the  samtools  commands
1694                 available.   If  the  name  of a command is also given, e.g.,
1695                 samtools help view, the detailed usage message for that  par‐
1696                 ticular command is displayed.
1697
1698
1699       --version Display  the  version  numbers  and copyright information for
1700                 samtools and the important libraries used by samtools.
1701
1702
1703       --version-only
1704                 Display the full samtools version number in  a  machine-read‐
1705                 able format.
1706

GLOBAL OPTIONS

1708       Several  long-options are shared between multiple samtools subcommands:
1709       --input-fmt, --input-fmt-option, --output-fmt, --output-fmt-option, and
1710       --reference.  The input format is typically auto-detected so specifying
1711       the format is usually unnecessary and the option is included  for  com‐
1712       pleteness.   Note  that  not all subcommands have all options.  Consult
1713       the subcommand help for more details.
1714
1715       Format strings recognised are "sam", "bam" and  "cram".   They  may  be
1716       followed  by a comma separated list of options as key or key=value. See
1717       below for examples.
1718
1719       The fmt-option arguments accept either a single option or option=value.
1720       Note  that some options only work on some file formats and only on read
1721       or write streams.  If value is unspecified for a  boolean  option,  the
1722       value is assumed to be 1.  The valid options are as follows.
1723
1724       level=INT
1725           Output  only. Specifies the compression level from 1 to 9, or 0 for
1726           uncompressed.
1727
1728       nthreads=INT
1729           Specifies the number of  threads  to  use  during  encoding  and/or
1730           decoding.  For BAM this will be encoding only.  In CRAM the threads
1731           are dynamically shared between encoder and decoder.
1732
1733       reference=fasta_file
1734           Specifies a FASTA reference file for use in CRAM encoding or decod‐
1735           ing.   It usually is not required for decoding except in the situa‐
1736           tion of the MD5 not being obtainable via the REF_PATH or  REF_CACHE
1737           environment variables.
1738
1739       decode_md=0|1
1740           CRAM input only; defaults to 1 (on).  CRAM does not typically store
1741           MD and NM tags, preferring to  generate  them  on  the  fly.   This
1742           option controls this behaviour.  It can be particularly useful when
1743           combined with a file encoded using store_md=1 and store_nm=1.
1744
1745       store_md=0|1
1746           CRAM output only; defaults to 0 (off).  CRAM normally only stores
1747           MD tags when the reference is unknown and lets the decoder generate
1748           these values on the fly (see decode_md).
1749
1750       store_nm=0|1
1751           CRAM output only; defaults to 0 (off).  CRAM normally only stores
1752           NM tags when the reference is unknown and lets the decoder generate
1753           these values on the fly (see decode_md).
1754
1755       ignore_md5=0|1
1756           CRAM input only; defaults to 0 (off).  When enabled,  md5  checksum
1757           errors  on  the reference sequence and block checksum errors within
1758           CRAM are ignored.  Use of this option is strongly discouraged.
1759
1760       required_fields=bit-field
1761           CRAM input only; specifies which SAM columns need to be  populated.
1762           By  default  all  fields are used.  Limiting the decode to specific
1763           columns can have significant performance gains.  The bit-field is a
1764           numerical value constructed from the following table.
1765
1766                                      0x1   SAM_QNAME
1767                                      0x2   SAM_FLAG
1768                                      0x4   SAM_RNAME
1769                                      0x8   SAM_POS
1770                                     0x10   SAM_MAPQ
1771                                     0x20   SAM_CIGAR
1772                                     0x40   SAM_RNEXT
1773                                     0x80   SAM_PNEXT
1774                                    0x100   SAM_TLEN
1775                                    0x200   SAM_SEQ
1776                                    0x400   SAM_QUAL
1777                                    0x800   SAM_AUX
1778                                   0x1000   SAM_RGAUX
1779
1780       name_prefix=string
1781           CRAM  input  only; defaults to output filename.  Any sequences with
1782           auto-generated read names will use string as the name prefix.
1783
1784       multi_seq_per_slice=0|1
1785           CRAM output only; defaults to 0 (off).  By default  CRAM  generates
1786           one  container  per  reference sequence, except in the case of many
1787           small references (such as a fragmented assembly).
1788
1789       version=major.minor
1790           CRAM output only.  Specifies the CRAM version  number.   Acceptable
1791           values are "2.1" and "3.0".
1792
1793       seqs_per_slice=INT
1794           CRAM output only; defaults to 10000.
1795
1796       slices_per_container=INT
1797           CRAM  output  only;  defaults  to 1.  The effect of having multiple
1798           slices per container is  to  share  the  compression  header  block
1799           between  multiple slices.  This is unlikely to have any significant
1800           impact unless  the  number  of  sequences  per  slice  is  reduced.
1801           (Together  these  two  options  control  the  granularity of random
1802           access.)
1803
1804       embed_ref=0|1
1805           CRAM output only; defaults to 0 (off).  If 1, this will store  por‐
1806           tions  of  the  reference sequence in each slice, permitting decode
1807           without requiring an external copy of the reference
1808           sequence.
1809
1810       no_ref=0|1
1811           CRAM  output  only;  defaults  to 0 (off).  If 1, sequences will be
1812           stored verbatim with no reference encoding.  This can be useful  if
1813           no reference is available for the file.
1814
1815       use_bzip2=0|1
1816           CRAM  output  only;  defaults  to 0 (off).  Permits use of bzip2 in
1817           CRAM block compression.
1818
1819       use_lzma=0|1
1820           CRAM output only; defaults to 0 (off).  Permits use of lzma in CRAM
1821           block compression.
1822
1823       lossy_names=0|1
1824           CRAM  output  only;  defaults to 0 (off).  If 1, templates with all
1825           members within the same CRAM  slice  will  have  their  read  names
1826           removed.   New  names will be automatically generated during decod‐
1827           ing.  Also see the name_prefix option.
1828
1829       For example:
1830
1831           samtools view --input-fmt-option decode_md=0
1832               --output-fmt cram,version=3.0 --output-fmt-option embed_ref
1833               --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
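
       The required_fields bit-field described above is the bitwise OR of
       the values in its table.  The following shell sketch builds the value
       from column names; the function name required_fields_mask is
       hypothetical, and the result is printed in hex on the assumption that
       the option accepts a hex literal as well as a decimal number.

```shell
# required_fields_mask NAME...
# Hypothetical helper: map SAM column names to the bit values listed in
# the required_fields table and OR them together.
required_fields_mask() {
    bits=0
    for name in "$@"; do
        case $name in
            QNAME) v=0x1 ;;
            FLAG)  v=0x2 ;;
            RNAME) v=0x4 ;;
            POS)   v=0x8 ;;
            MAPQ)  v=0x10 ;;
            CIGAR) v=0x20 ;;
            RNEXT) v=0x40 ;;
            PNEXT) v=0x80 ;;
            TLEN)  v=0x100 ;;
            SEQ)   v=0x200 ;;
            QUAL)  v=0x400 ;;
            AUX)   v=0x800 ;;
            RGAUX) v=0x1000 ;;
            *) echo "unknown field: $name" >&2; return 1 ;;
        esac
        bits=$(( bits | $v ))
    done
    printf '0x%x\n' "$bits"
}

# Columns needed to recover read names, flags, bases and qualities.
required_fields_mask QNAME FLAG SEQ QUAL
```

       The printed value would then be supplied as
       --input-fmt-option required_fields=VALUE when reading a CRAM file.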
1834
1835

REFERENCE SEQUENCES

1837       The CRAM format requires use of a reference sequence for  both  reading
1838       and writing.
1839
1840       When  reading  a  CRAM the @SQ headers are interrogated to identify the
1841       reference sequence MD5sum (M5: tag) and the  local  reference  sequence
1842       filename (UR: tag).  Note that http:// and ftp:// based URLs in the UR:
1843       field are not used, but local fasta filenames (with or without file://)
1844       can be used.
1845
1846       To create a CRAM the @SQ headers will also be read to identify the ref‐
1847       erence sequences, but M5: and UR: tags may not be present. In this case
1848       the -T and -t options of samtools view may be used to specify the fasta
1849       or fasta.fai filenames respectively (provided the  .fasta.fai  file  is
1850       also backed up by a .fasta file).
1851
1852       The search order to obtain a reference is:
1853
1854       1. Use any local file specified by the command line options (eg -T).
1855
1856       2. Look for MD5 via REF_CACHE environment variable.
1857
1858       3. Look for MD5 in each element of the REF_PATH environment variable.
1859
1860       4. Look for a local file listed in the UR: header tag.
1861

ENVIRONMENT VARIABLES

1863       HTS_PATH
1864              A  colon-separated  list  of  directories in which to search for
1865              HTSlib plugins.  If $HTS_PATH starts or ends  with  a  colon  or
1866              contains  a  double colon (::), the built-in list of directories
1867              is searched at that point in the search.
1868
1869              If no HTS_PATH variable is defined, the built-in list of  direc‐
1870              tories  specified when HTSlib was built is used, which typically
1871              includes /usr/local/libexec/htslib and similar directories.
1872
1873
1874       REF_PATH
1875              A colon separated (semi-colon on Windows) list of  locations  in
1876              which  to  look for sequences identified by their MD5sums.  This
1877              can be either a list of directories or URLs. Note that if a  URL
1878              is  included  then  the  colon  in  http://  and  ftp:// and the
1879              optional port number will be treated as part of the URL and  not
1880              a  PATH field separator.  For URLs, the text %s will be replaced
1881              by the MD5sum being read.
1882
1883              If  no  REF_PATH  has  been  specified  it   will   default   to
1884              http://www.ebi.ac.uk/ena/cram/md5/%s  and  if  REF_CACHE is also
1885              unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s.  If
1886              $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system tempo‐
1887              rary directory if no home directory is found) will be used simi‐
1888              larly.
1889
1890
1891       REF_CACHE
1892              This  can be defined to a single directory housing a local cache
1893              of references.  Upon downloading a reference it will  be  stored
1894              in  the location pointed to by REF_CACHE.  When reading a refer‐
1895              ence it will be looked for in this  directory  before  searching
1896              REF_PATH.   To  avoid many files being stored in the same direc‐
1897              tory, a pathname may be constructed using %nums and %s notation,
1898              consuming   num   characters   of   the   MD5sum.   For  example
1899              /local/ref_cache/%2s/%2s/%s will create 2 nested  subdirectories
1900              with  the  filenames  in the deepest directory being the last 28
1901              characters of the md5sum.
1902
1903              The REF_CACHE directory will be searched before attempting
1904              to load via the REF_PATH search list.  If no REF_PATH is
1905              defined, both REF_PATH and REF_CACHE will be automatically set
1906              (see above), but if REF_PATH is defined and REF_CACHE is not,
1907              then no local cache is used.
1908
1909              To  aid  population  of  the  REF_CACHE   directory   a   script
1910              misc/seq_cache_populate.pl is provided in the Samtools distribu‐
1911              tion. This takes a fasta file or a directory of fasta files  and
1912              generates the MD5sum named files.
1913

EXAMPLES

       o Import SAM to BAM when @SQ lines are present in the header:

           samtools view -bS aln.sam > aln.bam

         If @SQ lines are absent:

           samtools faidx ref.fa
           samtools view -bt ref.fa.fai aln.sam > aln.bam

         where ref.fa.fai is generated automatically by the faidx command.
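
         One way to choose between the two forms is to check the SAM text for
         @SQ header lines first. The sketch below assumes an uncompressed SAM
         file; has_sq_lines and import_sam are hypothetical helper names, and
         the file names are those of the example above.

```shell
# Hypothetical helper: succeed if the SAM text on stdin contains at least
# one @SQ header line.
has_sq_lines() {
    grep -q '^@SQ'
}

# Pick the import command based on whether aln.sam carries @SQ lines.
import_sam() {
    if has_sq_lines < aln.sam; then
        samtools view -bS aln.sam > aln.bam
    else
        samtools faidx ref.fa
        samtools view -bt ref.fa.fai aln.sam > aln.bam
    fi
}
```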


       o Convert a BAM file to a CRAM file using a local reference sequence:

           samtools view -C -T ref.fa aln.bam > aln.cram


       o Attach the RG tag while merging sorted alignments:

           perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
           samtools merge -rh rg.txt merged.bam ga.bam 454.bam

         The value in an RG tag is determined by the name of the file the
         read comes from. In this example, in merged.bam, reads from ga.bam
         are tagged RG:Z:ga, while reads from 454.bam are tagged RG:Z:454.
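
         The mapping from file name to RG value can be approximated in shell
         as below. The helper name rg_from_filename is hypothetical, and the
         rule it applies (drop the directory part and a trailing .bam) is an
         assumption consistent with the example above rather than a
         documented guarantee.

```shell
# Hypothetical helper: derive the RG value from an input file name by
# dropping the directory part and a trailing ".bam" (assumed behaviour).
rg_from_filename() {
    basename "$1" .bam
}
```

         With this sketch, rg_from_filename ga.bam prints ga, matching the
         ID used in rg.txt.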


       o Convert a BAM file to a CRAM file with NM and MD tags stored
         verbatim rather than calculated on the fly during CRAM decode, so
         that mixed data sets with MD/NM only on some records, or NM
         calculated using different definitions of mismatch, can be decoded
         without change. The second command demonstrates how to decode such
         a file. The request not to decode MD here turns off auto-generation
         of both MD and NM; the MD/NM tags are still emitted on records that
         had them stored verbatim.

           samtools view -C --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o aln.cram aln.bam
           samtools view --input-fmt-option decode_md=0 -o aln.new.bam aln.cram


       o An alternative way of achieving the above is to list multiple
         options after the --output-fmt or -O option. The commands below
         are equivalent to the two above.

           samtools view -O cram,store_md=1,store_nm=1 -o aln.cram aln.bam
           samtools view --input-fmt cram,decode_md=0 -o aln.new.bam aln.cram


       o Call SNPs and short INDELs:

           samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
           bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf

         The bcftools filter command marks low-quality sites and sites where
         the read depth exceeds a limit, which should be adjusted to about
         twice the average read depth (larger read depths usually indicate
         problematic regions, which are often enriched for artefacts).
         Consider adding -C50 to mpileup if mapping quality is overestimated
         for reads containing excessive mismatches. Applying this option
         usually helps with BWA-short alignments but may not help with other
         mappers.
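
         A depth limit of about twice the average can be estimated from the
         three-column output of samtools depth (reference name, position,
         depth). The helper name depth_limit below is hypothetical. Note
         that samtools depth omits zero-coverage positions by default, so
         this sketch averages over covered positions only.

```shell
# Hypothetical helper: read "samtools depth" output (name, position, depth)
# on stdin and print twice the mean depth, truncated to an integer.
depth_limit() {
    awk '{ sum += $3; n++ } END { if (n > 0) printf "%d\n", 2 * sum / n }'
}
```

         It could be used as samtools depth aln.bam | depth_limit, with the
         printed value substituted for 100 in the DP>100 filter expression.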

         Individuals are identified from the SM tags in the @RG header lines.
         Individuals can be pooled in one alignment file; one individual can
         also be split across multiple files. The mpileup -P option specifies
         that indel candidates should be collected only from read groups with
         the @RG PL tag set to the given platform, for example ILLUMINA.
         Collecting indel candidates from reads sequenced by an indel-prone
         technology may degrade the performance of indel calling.


       o Generate the consensus sequence for one diploid individual:

           samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq


       o Phase one individual:

           samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out

         The calmd command is used to reduce false heterozygotes around
         INDELs.


       o Dump BAQ-applied alignments for other SNP callers:

           samtools calmd -bAr aln.bam ref.fa > aln.baq.bam

         This adds and corrects the NM and MD tags at the same time. The
         calmd command also comes with the -C option, the same as the one in
         pileup and mpileup. Apply it if it helps.

LIMITATIONS

       o Unaligned words are used in bam_import.c, bam_endian.h, bam.c and
         bam_aux.c.

       o Samtools paired-end rmdup does not work for unpaired reads (e.g.
         orphan reads or ends mapped to different chromosomes). If this is a
         concern, please use Picard's MarkDuplicates, which handles these
         cases correctly, although it is a little slower.

AUTHOR

       Heng Li from the Sanger Institute wrote the original C version of
       samtools. Bob Handsaker from the Broad Institute implemented the BGZF
       library. James Bonfield from the Sanger Institute developed the CRAM
       implementation. John Marshall and Petr Danecek contribute to the
       source code, and various people from the 1000 Genomes Project have
       contributed to the SAM format specification.