samtools(1)                  Bioinformatics tools                  samtools(1)

NAME

       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS

       samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha'
       -o output.bam input.bam

       samtools ampliconclip -b bed.file input.bam

       samtools ampliconstats primers.bed in.bam

       samtools bedcov region.bed aln.sorted.bam

       samtools calmd in.sorted.bam ref.fasta

       samtools cat out.bam in1.bam in2.bam in3.bam

       samtools collate -o aln.name_collated.bam aln.sorted.bam

       samtools consensus -o out.fasta in.bam

       samtools coverage aln.sorted.bam

       samtools depad input.bam

       samtools depth aln.sorted.bam

       samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta

       samtools faidx ref.fasta

       samtools fasta input.bam > output.fasta

       samtools fastq input.bam > output.fastq

       samtools fixmate in.namesorted.sam out.bam

       samtools flags PAIRED,UNMAP,MUNMAP

       samtools flagstat aln.sorted.bam

       samtools fqidx ref.fastq

       samtools head in.bam

       samtools idxstats aln.sorted.bam

       samtools import input.fastq > output.bam

       samtools index aln.sorted.bam

       samtools markdup in.algnsorted.bam out.bam

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools mpileup -C50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

       samtools phase input.bam

       samtools quickcheck in1.bam in2.cram

       samtools reheader in.header.sam in.bam > out.bam

       samtools samples input.bam

       samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam

       samtools split merged.bam

       samtools stats aln.sorted.bam

       samtools targetcut input.bam

       samtools tview aln.sorted.bam ref.fasta

       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

DESCRIPTION

       Samtools is a set of utilities that manipulate alignments in the SAM
       (Sequence Alignment/Map), BAM, and CRAM formats. It converts between
       the formats, does sorting, merging and indexing, and can quickly
       retrieve reads from any region.

       Samtools is designed to work on a stream. It regards an input file `-'
       as the standard input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always outputs warning and error messages to the standard
       error output (stderr).

       Samtools is also able to open files on remote FTP or HTTP(S) servers if
       the file name starts with `ftp://', `http://', etc. Samtools checks
       the current working directory for the index file and will download the
       index if it is absent. Samtools does not retrieve the entire alignment
       file unless it is asked to do so.

       If an index is needed, samtools looks for the index suffix appended to
       the filename, and if that isn't found it tries again without the
       filename suffix (for example in.bam.bai followed by in.bai). However,
       if an index is in a completely different location or has a different
       name, both the main data filename and index filename can be pasted
       together with ##idx##. For example, /data/in.bam##idx##/indices/in.bam.bai
       may be used to explicitly indicate where the data and index files
       reside.
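
       As a minimal sketch of the ##idx## notation, the shell fragment below
       joins a data path and an index path into a single argument. The
       variable names are illustrative, the paths are the example paths from
       the paragraph above, and the region chr1:1-1000 in the comment is
       hypothetical.

```shell
data=/data/in.bam
index=/indices/in.bam.bai

# Paste the two names together with the literal separator ##idx##.
combined="${data}##idx##${index}"
echo "$combined"

# The combined name can then be passed where an indexed input is expected,
# for example (not run here):
#   samtools view "$combined" chr1:1-1000
```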

COMMANDS

       Each command has its own man page which can be viewed using e.g. man
       samtools-view or with a recent GNU man using man samtools view. Below
       is a brief summary of the syntax and a description of each sub-command.

       Options common to all sub-commands are documented below in the GLOBAL
       COMMAND OPTIONS section.


       view      samtools view [options] in.sam|in.bam|in.cram [region...]

                 With no options or regions specified, prints all alignments
                 in the specified input alignment file (in SAM, BAM, or CRAM
                 format) to standard output in SAM format (with no header by
                 default).

                 You may specify one or more space-separated region
                 specifications after the input filename to restrict output to
                 only those alignments which overlap the specified region(s).
                 Use of region specifications requires a coordinate-sorted and
                 indexed input file.

                 Options exist to change the output format from SAM to BAM or
                 CRAM, so this command also acts as a file format conversion
                 utility.


       tview     samtools tview [-p chr:pos] [-s STR] [-d display]
                 <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In the
                 viewer, press `?' for help and press `g' to view the
                 alignment starting from a region given in a format like
                 `chr10:10,000,000', or `=10,000,000' when viewing the same
                 reference sequence.


       quickcheck
                 samtools quickcheck [options] in.sam|in.bam|in.cram [ ... ]

                 Quickly check that input files appear to be intact. Checks
                 that the beginning of the file contains a valid header (all
                 formats) containing at least one target sequence, and then
                 seeks to the end of the file and checks that an end-of-file
                 (EOF) block is present and intact (BAM only).

                 Data in the middle of the file is not read since that would
                 be much more time consuming, so please note that this command
                 will not detect internal corruption, but is useful for
                 testing that files are not truncated before performing more
                 intensive tasks on them.

                 This command will exit with a non-zero exit code if any input
                 files don't have a valid header or are missing an EOF block.
                 Otherwise it will exit successfully (with a zero exit code).
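
                 Because quickcheck reports problems through its exit code, a
                 script can use it as a guard before heavier processing. The
                 following is a sketch under that assumption; the function
                 name check_inputs and the file names in the usage comment
                 are hypothetical and not part of samtools.

```shell
# Return 0 if every file passes quickcheck, 1 otherwise.
check_inputs() {
    if samtools quickcheck "$@"; then
        echo "ok: $*"
    else
        echo "quickcheck failed for at least one of: $*" >&2
        return 1
    fi
}

# Example use (not run here): only sort when the input looks intact.
#   check_inputs aln.bam && samtools sort -o aln.sorted.bam aln.bam
```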


       head      samtools head [options] in.sam|in.bam|in.cram

                 Prints the input file's headers and optionally also its first
                 few alignment records. This command always displays the
                 headers as they are in the file, never adding an extra @PG
                 header itself.


       index     samtools index [-bc] [-m INT] aln.sam.gz|aln.bam|aln.cram
                 [out.index]

                 Index a coordinate-sorted SAM, BAM or CRAM file for fast
                 random access. Note for SAM this only works if the file has
                 been BGZF compressed first.

                 This index is needed when region arguments are used to limit
                 samtools view and similar commands to particular regions of
                 interest.

                 If an output filename is given, the index file will be
                 written to out.index. Otherwise, for a CRAM file aln.cram,
                 index file aln.cram.crai will be created; for a BAM or SAM
                 file aln.bam, either aln.bam.bai or aln.bam.csi will be
                 created, depending on the index format selected.


       sort      samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format]
                 [-n] [-t tag] [-T tmpprefix] [-@ threads]
                 [in.sam|in.bam|in.cram]

                 Sort alignments by leftmost coordinates, or by read name when
                 -n is used. An appropriate @HD-SO sort order header tag will
                 be added or an existing one updated if necessary.

                 The sorted output is written to standard output by default,
                 or to the specified file (out.bam) when -o is used. This
                 command will also create temporary files tmpprefix.%d.bam as
                 needed when the entire alignment data cannot fit into memory
                 (as controlled via the -m option).

                 Consider using samtools collate instead if you need name
                 collated data without a full lexicographical sort.


       collate   samtools collate [options] in.sam|in.bam|in.cram [<prefix>]

                 Shuffles and groups reads together by their names. A faster
                 alternative to a full query name sort, collate ensures that
                 reads of the same name are grouped together in contiguous
                 groups, but doesn't make any guarantees about the order of
                 read names between groups.

                 The output from this command should be suitable for any
                 operation that requires all reads from the same template to
                 be grouped together.


       idxstats  samtools idxstats in.sam|in.bam|in.cram

                 Retrieve and print stats in the index file corresponding to
                 the input file. Before calling idxstats, the input BAM file
                 should be indexed by samtools index.

                 If run on a SAM or CRAM file or an unindexed BAM file, this
                 command will still produce the same summary statistics, but
                 does so by reading through the entire file. This is far
                 slower than using the BAM indices.

                 The output is TAB-delimited with each line consisting of
                 reference sequence name, sequence length, # mapped reads and
                 # unmapped reads. It is written to stdout.
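
                 Since the output is a four-column TAB-delimited table, it is
                 easy to post-process with standard tools. The sketch below
                 sums the third column (mapped reads) with awk; the function
                 name sum_mapped is hypothetical, and the input file name in
                 the usage comment is only an example.

```shell
# Read idxstats-style lines on stdin and print the total of column 3,
# the number of mapped reads per reference sequence.
sum_mapped() {
    awk -F '\t' '{ total += $3 } END { print total + 0 }'
}

# Example use (not run here):
#   samtools idxstats aln.sorted.bam | sum_mapped
```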


       flagstat  samtools flagstat in.sam|in.bam|in.cram

                 Does a full pass through the input file to calculate and
                 print statistics to stdout.

                 Provides counts for each of 13 categories based primarily on
                 bit flags in the FLAG field. Each category in the output is
                 broken down into QC pass and QC fail, which is presented as
                 "#PASS + #FAIL" followed by a description of the category.


       flags     samtools flags INT|STR[,...]

                 Convert between textual and numeric flag representation.

                 FLAGS:

                   0x1   PAIRED          paired-end (or multiple-segment) sequencing technology
                   0x2   PROPER_PAIR     each segment properly aligned according to the aligner
                   0x4   UNMAP           segment unmapped
                   0x8   MUNMAP          next segment in the template unmapped
                  0x10   REVERSE         SEQ is reverse complemented
                  0x20   MREVERSE        SEQ of the next segment in the template is reverse complemented
                  0x40   READ1           the first segment in the template
                  0x80   READ2           the last segment in the template
                 0x100   SECONDARY       secondary alignment
                 0x200   QCFAIL          not passing quality controls
                 0x400   DUP             PCR or optical duplicate
                 0x800   SUPPLEMENTARY   supplementary alignment
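
                 The bit values in the table combine by bitwise OR. The sketch
                 below redoes that arithmetic in plain POSIX shell purely to
                 illustrate the encoding; it is not how samtools flags is
                 implemented, and the function name decode_flag is
                 hypothetical.

```shell
# Combine PAIRED, UNMAP and MUNMAP (values from the table above).
echo $(( 0x1 | 0x4 | 0x8 ))    # prints 13

# Print the comma-separated names of all bits set in a numeric flag.
decode_flag() {
    flag=$1
    names=""
    for pair in 0x1:PAIRED 0x2:PROPER_PAIR 0x4:UNMAP 0x8:MUNMAP \
                0x10:REVERSE 0x20:MREVERSE 0x40:READ1 0x80:READ2 \
                0x100:SECONDARY 0x200:QCFAIL 0x400:DUP 0x800:SUPPLEMENTARY
    do
        bit=${pair%%:*}     # text before the colon, e.g. 0x10
        name=${pair#*:}     # text after the colon, e.g. REVERSE
        if [ $(( $flag & $bit )) -ne 0 ]; then
            names="${names:+$names,}$name"
        fi
    done
    echo "$names"
}

decode_flag 13    # prints PAIRED,UNMAP,MUNMAP
```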


       stats     samtools stats [options] in.sam|in.bam|in.cram [region...]

                 samtools stats collects statistics from BAM files and outputs
                 in a text format. The output can be visualized graphically
                 using plot-bamstats.


       bedcov    samtools bedcov [options] region.bed
                 in1.sam|in1.bam|in1.cram[...]

                 Reports the total read base count (i.e. the sum of per base
                 read depths) for each genomic region specified in the
                 supplied BED file. The regions are output as they appear in
                 the BED file and are 0-based. Counts for each alignment file
                 supplied are reported in separate columns.


       depth     samtools depth [options] [in1.sam|in1.bam|in1.cram
                 [in2.sam|in2.bam|in2.cram] [...]]

                 Computes the read depth at each position or region.


       ampliconstats
                 samtools ampliconstats [options] primers.bed
                 in.sam|in.bam|in.cram[...]

                 samtools ampliconstats collects statistics from one or more
                 input alignment files and produces tables in text format.
                 The output can be visualized graphically using
                 plot-ampliconstats.

                 The alignment files should have previously been clipped of
                 primer sequence, for example by samtools ampliconclip, and
                 the sites of these primers should be specified as a bed file
                 in the arguments.


       mpileup   samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l
                 list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

                 Generate textual pileup for one or multiple BAM files. For
                 VCF and BCF output, please use the bcftools mpileup command
                 instead. Alignment records are grouped by sample (SM)
                 identifiers in @RG header lines. If sample identifiers are
                 absent, each input file is regarded as one sample.

                 See the samtools-mpileup man page for a description of the
                 pileup format and options.


       consensus samtools consensus [options] in.bam

                 Generate consensus from a SAM, BAM or CRAM file based on the
                 contents of the alignment records. The consensus is written
                 either as FASTA, FASTQ, or a pileup oriented format.

                 The default output for FASTA and FASTQ formats includes one
                 base per non-gap consensus. Hence insertions with respect to
                 the aligned reference will be included and deletions removed.
                 This behaviour can be adjusted.

                 Two consensus calling algorithms are offered. The default
                 computes a heterozygous consensus in a Bayesian manner,
                 derived from the "Gap5" consensus algorithm. A simpler base
                 frequency counting method is also available.


       coverage  samtools coverage [options] [in1.sam|in1.bam|in1.cram
                 [in2.sam|in2.bam|in2.cram] [...]]

                 Produces a histogram or table of coverage per chromosome.


       merge     samtools merge [-nur1f] [-h inh.sam] [-t tag] [-R reg] [-b
                 list] out.bam in1.bam [in2.bam in3.bam ... inN.bam]

                 Merge multiple sorted alignment files, producing a single
                 sorted output file that contains all the input records and
                 maintains the existing sort order.

                 If -h is specified the @SQ headers of input files will be
                 merged into the specified header, otherwise they will be
                 merged into a composite header created from the input
                 headers. If the @SQ headers differ in order this may require
                 the output file to be re-sorted after merge.

                 The ordering of the records in the input files must match the
                 usage of the -n and -t command-line options. If they do not,
                 the output order will be undefined. See sort for information
                 about record ordering.


       split     samtools split [options] merged.sam|merged.bam|merged.cram

                 Splits a file by read group, producing one or more output
                 files matching a common prefix (by default based on the input
                 filename) each containing one read-group.


       cat       samtools cat [-b list] [-h header.sam] [-o out.bam] in1.bam
                 in2.bam [ ... ]

                 Concatenate BAMs or CRAMs. Although this works on either BAM
                 or CRAM, all input files must be the same format as each
                 other. The sequence dictionary of each input file must be
                 identical, although this command does not check this. This
                 command uses a similar trick to reheader which enables fast
                 BAM concatenation.


       import    samtools import [options] in.fastq [ ... ]

                 Converts one or more FASTQ files to unaligned SAM, BAM or
                 CRAM. These formats offer a richer capability of tracking
                 sample meta-data via the SAM header and per-read meta-data
                 via the auxiliary tags. The fastq command may be used to
                 reverse this conversion.


       fastq/a   samtools fastq [options] in.bam
                 samtools fasta [options] in.bam

                 Converts a BAM or CRAM into either FASTQ or FASTA format
                 depending on the command invoked. The files will be
                 automatically compressed if the file names have a .gz or
                 .bgzf extension.

                 The input to this program must be collated by name. Use
                 samtools collate or samtools sort -n to ensure this.
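
                 One way to satisfy the collation requirement without an
                 intermediate file is to pipe collate into fastq. This is a
                 sketch only: the function name bam_to_fastq is hypothetical,
                 and the collate options -u (uncompressed output) and -O
                 (write to stdout) come from the samtools-collate man page
                 rather than from this summary, so check them against the
                 installed version.

```shell
# Collate reads by name and convert them to FASTQ on standard output.
# $1 is the input alignment file; `-' makes fastq read from the pipe.
bam_to_fastq() {
    samtools collate -u -O "$1" | samtools fastq -
}

# Example use (not run here):
#   bam_to_fastq aln.sorted.bam > reads.fastq
```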


       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index reference sequence in the FASTA format or extract
                 subsequence from indexed reference sequence. If no region is
                 specified, faidx will index the file and create
                 <ref.fasta>.fai on the disk. If regions are specified, the
                 subsequences will be retrieved and printed to stdout in the
                 FASTA format.

                 The input file can be compressed in the BGZF format.

                 FASTQ files can be read and indexed by this command. Without
                 using --fastq any extracted subsequence will be in FASTA
                 format.


       fqidx     samtools fqidx <ref.fastq> [region1 [...]]

                 Index reference sequence in the FASTQ format or extract
                 subsequence from indexed reference sequence. If no region is
                 specified, fqidx will index the file and create
                 <ref.fastq>.fai on the disk. If regions are specified, the
                 subsequences will be retrieved and printed to stdout in the
                 FASTQ format.

                 The input file can be compressed in the BGZF format.

                 samtools fqidx should only be used on fastq files with a
                 small number of entries. Trying to use it on a file
                 containing millions of short sequencing reads will produce an
                 index that is almost as big as the original file, and
                 searches using the index will be very slow and use a lot of
                 memory.


       dict      samtools dict ref.fasta|ref.fasta.gz

                 Create a sequence dictionary file from a fasta file.


       calmd     samtools calmd [-Eeubr] [-C capQcoef] aln.bam ref.fasta

                 Generate the MD tag. If the MD tag is already present, this
                 command will give a warning if the MD tag generated is
                 different from the existing tag. Output SAM by default.

                 Calmd can also read and write CRAM files although in most
                 cases it is pointless as CRAM recalculates MD and NM tags on
                 the fly. The one exception to this case is where both input
                 and output CRAM files have been / are being created with the
                 no_ref option.


       fixmate   samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam

                 Fill in mate coordinates, ISIZE and mate related flags from a
                 name-sorted alignment.


       markdup   samtools markdup [-l length] [-r] [-s] [-T] [-S]
                 in.algsort.bam out.bam

                 Mark duplicate alignments from a coordinate sorted file that
                 has been run through samtools fixmate with the -m option.
                 This program relies on the MC and ms tags that fixmate
                 provides.
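
                 Taken together, the fixmate and markdup entries imply a
                 four-step workflow: name sort, fixmate -m, coordinate sort,
                 markdup. The sketch below chains those steps using only
                 options described on this page; the function name
                 mark_duplicates and the intermediate file names are
                 hypothetical.

```shell
# Usage: mark_duplicates input.bam output.bam [tmp_prefix]
mark_duplicates() {
    input=$1
    output=$2
    prefix=${3:-markdup_tmp}

    # 1. Name-sort so that fixmate sees the reads of a template together.
    samtools sort -n -o "$prefix.namesorted.bam" "$input" || return 1
    # 2. Fill in mate fields; -m is the option markdup depends on.
    samtools fixmate -m "$prefix.namesorted.bam" "$prefix.fixmate.bam" || return 1
    # 3. Coordinate-sort, since markdup expects coordinate order.
    samtools sort -o "$prefix.possorted.bam" "$prefix.fixmate.bam" || return 1
    # 4. Mark the duplicates.
    samtools markdup "$prefix.possorted.bam" "$output"
}

# Example use (not run here):
#   mark_duplicates aln.bam aln.markdup.bam /tmp/aln
```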


       rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>

                 This command is obsolete. Use markdup instead.


       addreplacerg
                 samtools addreplacerg [-r rg-line | -R rg-ID] [-m mode] [-l
                 level] [-o out.bam] in.bam

                 Adds or replaces read group tags in a file.


       reheader  samtools reheader [-iP] in.header.sam in.bam

                 Replace the header in in.bam with the header in
                 in.header.sam. This command is much faster than replacing
                 the header with a BAM→SAM→BAM conversion.

                 By default this command outputs the BAM or CRAM file to
                 standard output (stdout), but for CRAM format files it has
                 the option to perform an in-place edit, both reading and
                 writing to the same file. No validity checking is performed
                 on the header, nor that it is suitable to use with the
                 sequence data itself.


       targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1
                 em1] [-2 em2] [-f ref] in.bam

                 This command identifies target regions by examining the
                 continuity of read depth, computes haploid consensus
                 sequences of targets and outputs a SAM with each sequence
                 corresponding to a target. When option -f is in use, BAQ will
                 be applied. This command is only designed for cutting fosmid
                 clones from fosmid pool sequencing [Ref. Kitzman et al.
                 (2010)].


       phase     samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q
                 minBaseQ] in.bam

                 Call and phase heterozygous SNPs.


       depad     samtools depad [-SsCu1] [-T ref.fa] [-o output] in.bam

                 Converts a BAM aligned against a padded reference to a BAM
                 aligned against the depadded reference. The padded reference
                 may contain verbatim "*" bases in it, but "*" bases are also
                 counted in the reference numbering. This means that a
                 sequence base-call aligned against a reference "*" is
                 considered to be a cigar match ("M" or "X") operator (if the
                 base-call is "A", "C", "G" or "T"). After depadding the
                 reference "*" bases are deleted and such aligned sequence
                 base-calls become insertions. Similar transformations apply
                 for deletions and padding cigar operations.


       ampliconclip
                 samtools ampliconclip [-o out.file] [-f stat.file]
                 [--soft-clip] [--hard-clip] [--both-ends] [--strand]
                 [--clipped] [--fail] [--no-PG] -b bed.file in.file

                 Clip reads in a SAM compatible file based on data from a BED
                 file.


       samples   samtools samples [-o out.file] [-i] [-T TAG] [-f refs.fasta]
                 [-F refs_list] [-X]

                 Prints the samples from alignment files.

SAMTOOLS OPTIONS

       These are options that are passed after the samtools command, before
       any sub-command is specified.


       help, --help
              Display a brief usage message listing the samtools commands
              available. If the name of a command is also given, e.g.,
              samtools help view, the detailed usage message for that
              particular command is displayed.


       --version
              Display the version numbers and copyright information for
              samtools and the important libraries used by samtools.


       --version-only
              Display the full samtools version number in a machine-readable
              format.

GLOBAL COMMAND OPTIONS

       Several long-options are shared between multiple samtools sub-commands:
       --input-fmt, --input-fmt-option, --output-fmt, --output-fmt-option,
       --reference, --write-index, and --verbosity. The input format is
       typically auto-detected so specifying the format is usually unnecessary
       and the option is included for completeness. Note that not all
       subcommands have all options. Consult the subcommand help for more
       details.

       Format strings recognised are "sam", "sam.gz", "bam" and "cram". They
       may be followed by a comma separated list of options as key or
       key=value. See below for examples.

       The fmt-option arguments accept either a single option or option=value.
       Note that some options only work on some file formats and only on read
       or write streams. If value is unspecified for a boolean option, the
       value is assumed to be 1. The valid options are as follows.

       level=INT
           Output only. Specifies the compression level from 1 to 9, or 0 for
           uncompressed. If the output format is SAM, this also enables BGZF
           compression, otherwise SAM defaults to uncompressed.

       nthreads=INT
           Specifies the number of threads to use during encoding and/or
           decoding. For BAM this will be encoding only. In CRAM the threads
           are dynamically shared between encoder and decoder.

       filter=STRING
           Apply filter STRING to all incoming records, rejecting any that do
           not satisfy the expression. See the FILTER EXPRESSIONS section
           below for specifics.

       reference=fasta_file
           Specifies a FASTA reference file for use in CRAM encoding or
           decoding. It usually is not required for decoding except in the
           situation of the MD5 not being obtainable via the REF_PATH or
           REF_CACHE environment variables.

       decode_md=0|1
           CRAM input only; defaults to 1 (on). CRAM does not typically store
           MD and NM tags, preferring to generate them on the fly. When this
           option is 0, missing MD and NM tags will not be generated. It can
           be particularly useful when combined with a file encoded using
           store_md=1 and store_nm=1.

       store_md=0|1
           CRAM output only; defaults to 0 (off). CRAM normally only stores
           MD tags when the reference is unknown and lets the decoder generate
           these values on-the-fly (see decode_md).

       store_nm=0|1
           CRAM output only; defaults to 0 (off). CRAM normally only stores
           NM tags when the reference is unknown and lets the decoder generate
           these values on-the-fly (see decode_md).

       ignore_md5=0|1
           CRAM input only; defaults to 0 (off). When enabled, md5 checksum
           errors on the reference sequence and block checksum errors within
           CRAM are ignored. Use of this option is strongly discouraged.

       required_fields=bit-field
           CRAM input only; specifies which SAM columns need to be populated.
           By default all fields are used. Limiting the decode to specific
           columns can have significant performance gains. The bit-field is a
           numerical value constructed from the following table.

              0x1   SAM_QNAME
              0x2   SAM_FLAG
              0x4   SAM_RNAME
              0x8   SAM_POS
             0x10   SAM_MAPQ
             0x20   SAM_CIGAR
             0x40   SAM_RNEXT
             0x80   SAM_PNEXT
            0x100   SAM_TLEN
            0x200   SAM_SEQ
            0x400   SAM_QUAL
            0x800   SAM_AUX
           0x1000   SAM_RGAUX
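
           As a worked example of building the bit-field, the shell sketch
           below ORs together the values for the first four columns. The
           variable names simply mirror the table and the file name in.cram
           in the comment is hypothetical; the value is printed in decimal.

```shell
# Bit values copied from the table above.
SAM_QNAME=0x1
SAM_FLAG=0x2
SAM_RNAME=0x4
SAM_POS=0x8

# Request only QNAME, FLAG, RNAME and POS.
required=$(( $SAM_QNAME | $SAM_FLAG | $SAM_RNAME | $SAM_POS ))
echo "$required"    # prints 15

# The value could then be supplied as a format option, for example
# (not run here):
#   samtools view --input-fmt-option required_fields=$required in.cram
```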

       name_prefix=string
           CRAM input only; defaults to output filename. Any sequences with
           auto-generated read names will use string as the name prefix.

       multi_seq_per_slice=0|1
           CRAM output only; defaults to 0 (off). By default CRAM generates
           one container per reference sequence, except in the case of many
           small references (such as a fragmented assembly).

       version=major.minor
           CRAM output only. Specifies the CRAM version number. Acceptable
           values are "2.1" and "3.0".

       seqs_per_slice=INT
           CRAM output only; defaults to 10000.

       slices_per_container=INT
           CRAM output only; defaults to 1. The effect of having multiple
           slices per container is to share the compression header block
           between multiple slices. This is unlikely to have any significant
           impact unless the number of sequences per slice is reduced.
           (Together these two options control the granularity of random
           access.)

       embed_ref=0|1
           CRAM output only; defaults to 0 (off). If 1, this will store
           portions of the reference sequence in each slice, permitting decode
           without requiring an external copy of the reference sequence.

       no_ref=0|1
           CRAM output only; defaults to 0 (off). If 1, sequences will be
           stored verbatim with no reference encoding. This can be useful if
           no reference is available for the file.

       use_bzip2=0|1
           CRAM output only; defaults to 0 (off). Permits use of bzip2 in
           CRAM block compression.

       use_lzma=0|1
           CRAM output only; defaults to 0 (off). Permits use of lzma in CRAM
           block compression.

       lossy_names=0|1
           CRAM output only; defaults to 0 (off). If 1, templates with all
           members within the same CRAM slice will have their read names
           removed. New names will be automatically generated during decoding.
           Also see the name_prefix option.
693
694       For example:
695
           samtools view --input-fmt-option decode_md=0 \
               --output-fmt cram,version=3.0 --output-fmt-option embed_ref \
               --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
699
700
       The --write-index option enables automatic index creation while writing
       out BAM, CRAM or bgzf SAM files.  Note that to get compressed SAM as
       the output format you must explicitly request a compression level;
       otherwise all SAM files are written uncompressed.  By default SAM and
       BAM will use CSI indices while CRAM will use CRAI indices.  If you need
       to create BAI indices, note that it is possible to specify the name of
       the index being written to, and hence the format, by using the
       filename##idx##indexname notation.
709
710       For example: to convert a BAM to a compressed SAM with CSI indexing:
711
712           samtools view -h -O sam,level=6 --write-index in.bam -o out.sam.gz
713
714
715       To convert a SAM to a compressed BAM using BAI indexing:
716
717           samtools view --write-index in.sam -o out.bam##idx##out.bam.bai
718
719
720       The --verbosity INT option sets the verbosity level  for  samtools  and
721       HTSlib.  The default is 3 (HTS_LOG_WARNING); 2 reduces warning messages
722       and 0 or 1 also reduces some error messages, while values greater  than
723       3  produce  increasing  numbers of additional warnings and logging mes‐
724       sages.
725
726

REFERENCE SEQUENCES

728       The CRAM format requires use of a reference sequence for  both  reading
729       and writing.
730
731       When  reading  a  CRAM the @SQ headers are interrogated to identify the
732       reference sequence MD5sum (M5: tag) and the  local  reference  sequence
733       filename (UR: tag).  Note that http:// and ftp:// based URLs in the UR:
734       field are not used, but local fasta filenames (with or without file://)
735       can be used.
736
737       To create a CRAM the @SQ headers will also be read to identify the ref‐
738       erence sequences, but M5: and UR: tags may not be present. In this case
739       the -T and -t options of samtools view may be used to specify the fasta
740       or fasta.fai filenames respectively (provided the  .fasta.fai  file  is
741       also backed up by a .fasta file).
742
743       The search order to obtain a reference is:
744
       1. Use any local file specified by the command line options (e.g. -T).
746
747       2. Look for MD5 via REF_CACHE environment variable.
748
749       3. Look for MD5 in each element of the REF_PATH environment variable.
750
751       4. Look for a local file listed in the UR: header tag.
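
       The search order above can be sketched as a small shell function.
       This is only an illustration under simplifying assumptions:
       locate_ref is a hypothetical helper (not part of samtools or HTSlib),
       REF_CACHE and REF_PATH are treated as plain directories holding files
       named by MD5sum, and URLs and %s patterns are ignored.

```shell
# Hypothetical helper illustrating the documented search order.
# Usage: locate_ref MD5 [FILE_GIVEN_WITH_-T] [FILE_FROM_UR_TAG]
# Simplification: REF_CACHE and REF_PATH entries are plain directories.
locate_ref() {
    md5=$1
    cli_file=${2:-}
    ur_file=${3:-}

    # 1. A local file named on the command line (e.g. -T) comes first.
    if [ -n "$cli_file" ] && [ -f "$cli_file" ]; then
        printf '%s\n' "$cli_file"
        return 0
    fi

    # 2. Look for the MD5 in REF_CACHE.
    if [ -n "${REF_CACHE:-}" ] && [ -f "$REF_CACHE/$md5" ]; then
        printf '%s\n' "$REF_CACHE/$md5"
        return 0
    fi

    # 3. Look for the MD5 in each colon-separated element of REF_PATH.
    old_ifs=$IFS
    IFS=:
    for dir in ${REF_PATH:-}; do
        if [ -n "$dir" ] && [ -f "$dir/$md5" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$md5"
            return 0
        fi
    done
    IFS=$old_ifs

    # 4. Fall back to a local file listed in the UR: header tag.
    if [ -n "$ur_file" ] && [ -f "$ur_file" ]; then
        printf '%s\n' "$ur_file"
        return 0
    fi

    return 1
}
```

       The function prints the first location that holds the reference and
       returns non-zero if none of the four steps finds one.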
752
753

FILTER EXPRESSIONS

       Filter expressions provide on-the-fly checking of incoming SAM, BAM or
       CRAM records, discarding records that do not match the specified
       expression.
758
759       The  language  used is primarily C style, but with a few differences in
760       the precedence rules for bit operators and the inclusion of regular ex‐
761       pression matching.
762
763       The operator precedence, from strongest binding to weakest, is:
764
765
       Grouping:       (, )             E.g. "(1+2)*3"
       Values:         literals, vars   Numbers, strings and variables
       Unary ops:      +, -, !, ~       E.g. -10, +10, !10 (not), ~5 (bit not)
       Math ops:       *, /, %          Multiply, division and (integer) modulo
       Math ops:       +, -             Addition / subtraction
       Bit-wise:       &                Integer AND
       Bit-wise:       ^                Integer XOR
       Bit-wise:       |                Integer OR
       Conditionals:   >, >=, <, <=
       Equality:       ==, !=, =~, !~   =~ and !~ match regular expressions
       Boolean:        &&, ||           Logical AND / OR
777
       Expressions are computed using floating point mathematics, so "10 / 4"
       evaluates to 2.5 rather than 2.  Numbers may be written as integers in
       decimal or "0x" plus hexadecimal, and as floating point with or without
       exponents.  However, operations that require integers first do an
       implicit type conversion, so "7.9 % 5" is 2 and "7.9 & 4.1" is
       equivalent to "7 & 4", which is 4.  Strings are always specified using
       double quotes.  To get a double quote in a string, use backslash.
       Similarly a double backslash is used to get a literal backslash.  For
       example ab\"c\\d is the string ab"c\d.
787
788       Comparison  operators  are  evaluated as a match being 1 and a mismatch
789       being 0, thus "(2 > 1) + (3 < 5)" evaluates as 2.
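
       The numeric rules above can be mimicked with ordinary shell tools as a
       rough sanity check.  This sketch only mirrors the arithmetic described
       here using awk and POSIX shell arithmetic; it does not invoke the
       samtools expression evaluator, and the variable names are purely
       illustrative.

```shell
# Floating point division: "10 / 4" is 2.5, not 2.
div=$(awk 'BEGIN { print 10 / 4 }')

# Integer-only operators truncate their operands first:
# "7.9 % 5" is treated as 7 % 5, and "7.9 & 4.1" as 7 & 4.
mod=$(awk 'BEGIN { print int(7.9) % int(5) }')
and=$(( 7 & 4 ))

# A comparison yields 1 for a match and 0 for a mismatch.
cmp=$(( (2 > 1) + (3 < 5) ))

echo "div=$div mod=$mod and=$and cmp=$cmp"
```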
790
791       The variables are where the file format specifics are accessed from the
792       expression.   The  variables  correspond  to SAM fields, for example to
793       find paired alignments with high mapping quality and a very  large  in‐
794       sert  size, we may use the expression "mapq >= 30 && (tlen >= 100000 ||
795       tlen <= -100000)".  Valid variable names and their data types are:
796
797
798       endpos               int            Alignment end position (1-based)
799       flag                 int            Combined FLAG field
800       flag.paired          int            Single bit, 0 or 1
801       flag.proper_pair     int            Single bit, 0 or 2
803       flag.unmap           int            Single bit, 0 or 4
804       flag.munmap          int            Single bit, 0 or 8
805       flag.reverse         int            Single bit, 0 or 16
806       flag.mreverse        int            Single bit, 0 or 32
807       flag.read1           int            Single bit, 0 or 64
808       flag.read2           int            Single bit, 0 or 128
809       flag.secondary       int            Single bit, 0 or 256
810       flag.qcfail          int            Single bit, 0 or 512
811       flag.dup             int            Single bit, 0 or 1024
812       flag.supplementary   int            Single bit, 0 or 2048
813       library              string         Library (LB header via RG)
814       mapq                 int            Mapping quality
815       mpos                 int            Synonym for pnext
816       mrefid               int            Mate reference number (0 based)
817       mrname               string         Synonym for rnext
818       ncigar               int            Number of cigar operations
819       pnext                int            Mate's alignment position (1-based)
820       pos                  int            Alignment position (1-based)
821       qlen                 int            Alignment length: no. query bases
822       qname                string         Query name
823       qual                 string         Quality values (raw, 0 based)
824       refid                int            Integer reference number (0 based)
825       rlen                 int            Alignment length: no. reference bases
826       rname                string         Reference name
827       rnext                string         Mate's reference name
828       seq                  string         Sequence
829       tlen                 int            Template length (insert size)
830       [XX]                 int / string   XX tag value
831
832       Flags are returned either as the whole flag value or by checking for  a
833       single bit.  Hence the filter expression flag.dup is equivalent to flag
834       & 1024.
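
       As a worked example of the single-bit variables, the sketch below
       decodes a FLAG value with shell arithmetic.  The value 1107 is just an
       assumed example (paired, proper pair, reverse strand, first in pair,
       duplicate); the arithmetic mirrors what the flag.* variables are
       described as returning but does not call samtools.

```shell
flag=1107   # assumed example: 1 + 2 + 16 + 64 + 1024

# Each flag.* variable is the FLAG masked with one bit, so it is
# either 0 or the value of that bit (not simply 0 or 1).
dup=$(( flag & 1024 ))       # like flag.dup     -> 1024
reverse=$(( flag & 16 ))     # like flag.reverse -> 16
unmap=$(( flag & 4 ))        # like flag.unmap   -> 0

echo "dup=$dup reverse=$reverse unmap=$unmap"
```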
835
836       "qlen" and "rlen" are measured using the CIGAR string to count the num‐
837       ber  of query (sequence) and reference bases consumed.  Note "qlen" may
838       not exactly match the length of the "seq" field if the sequence is "*".
839
840       "endpos" is the (1-based inclusive) position of  the  rightmost  mapped
841       base  of  the  read, as measured using the CIGAR string, and for mapped
842       reads is equivalent to "pos+rlen-1". For unmapped reads, it is the same
843       as "pos".
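
       The relationship between "qlen", "rlen" and "endpos" can be
       illustrated with a small awk-based helper.  This is a sketch, not
       samtools code: cigar_lengths is a hypothetical function, it assumes
       the usual SAM convention that M, I, S, = and X consume query bases
       while M, D, N, = and X consume reference bases, and it treats a CIGAR
       that consumes no reference bases as the unmapped case.

```shell
# Hypothetical helper: cigar_lengths POS CIGAR
# Prints "qlen rlen endpos" for a 1-based POS and a CIGAR string.
cigar_lengths() {
    awk -v pos="$1" -v cigar="$2" 'BEGIN {
        qlen = 0; rlen = 0
        # Peel one "<count><op>" element off the front at a time.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (index("MIS=X", op)) qlen += n   # consumes query bases
            if (index("MDN=X", op)) rlen += n   # consumes reference bases
            cigar = substr(cigar, RLENGTH + 1)
        }
        # Mapped reads: pos+rlen-1; otherwise the same as pos.
        endpos = (rlen > 0) ? pos + rlen - 1 : pos
        print qlen, rlen, endpos
    }'
}

cigar_lengths 100 5S20M2I10M3D8M   # prints: 45 41 140
```

       Here the soft clip and the insertion add to "qlen" only, while the
       deletion adds to "rlen" only, so the two lengths differ.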
844
       Reference names may be matched either by their string forms ("rname"
       and "mrname") or as the Nth @SQ line (counting from zero) as stored in
       BAM using "refid" and "mrefid" respectively.
848
849       Auxiliary tags are described in square brackets and these expand to ei‐
850       ther integer or string as defined by the  tag  itself  (XX:Z:string  or
851       XX:i:int).   For  example  [NM]>=10  can be used to look for alignments
852       with many mismatches and [RG]=~"grp[ABC]-" will  match  the  read-group
853       string.
854
855       If no comparison is used with an auxiliary tag it is taken simply to be
856       a test for the existence of that tag.  So "[NM]" will return any record
857       containing an NM tag, even if that tag is zero (NM:i:0).
858
859       If you need to check specifically for a non-zero value then use [NM] &&
860       [NM]!=0.
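
       When such expressions are passed on a command line, the double quotes,
       square brackets and && must be protected from the shell, which single
       quotes do.  The sketch below assumes the -e option of samtools view
       for supplying a filter expression and a hypothetical input file
       in.bam; the samtools calls are skipped if either is missing.

```shell
# Single quotes keep the shell from interpreting ", [ ] and &&.
has_nm='[NM]'                     # NM tag present, even if NM:i:0
nm_nonzero='[NM] && [NM]!=0'      # NM tag present and non-zero
rg_match='[RG]=~"grp[ABC]-"'      # regular expression match on RG
big_insert='mapq >= 30 && (tlen >= 100000 || tlen <= -100000)'

in_bam=in.bam                     # hypothetical input file

if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ]; then
    for expr in "$has_nm" "$nm_nonzero" "$rg_match" "$big_insert"; do
        # -c counts matching records instead of printing them.
        printf '%s\t' "$expr"
        samtools view -c -e "$expr" "$in_bam"
    done
fi
```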
861
862       Some simple functions are available to operate on strings.  These treat
863       the strings as arrays of bytes, permitting their length, minimum, maxi‐
864       mum and average values to be computed.
865
866
867       length   Length of the string (excluding nul char)
868       min      Minimum byte value in the string
869       max      Maximum byte value in the string
870       avg      Average byte value in the string
871
       Note that "avg" is a floating point value and it may be NAN for empty
       strings.  This means that "avg(qual)" does not produce an error for
       records that have both seq and qual of "*".  A NAN value fails every
       conditional check, so e.g. "avg(qual) > 20" works and will not report
       these records.
877
878

ENVIRONMENT VARIABLES

880       HTS_PATH
881              A colon-separated list of directories in which to search for HT‐
882              Slib  plugins.  If $HTS_PATH starts or ends with a colon or con‐
883              tains a double colon (::), the built-in list of  directories  is
884              searched at that point in the search.
885
886              If  no HTS_PATH variable is defined, the built-in list of direc‐
887              tories specified when HTSlib was built is used, which  typically
888              includes /usr/local/libexec/htslib and similar directories.
889
890
891       REF_PATH
892              A  colon  separated (semi-colon on Windows) list of locations in
893              which to look for sequences identified by their  MD5sums.   This
894              can  be either a list of directories or URLs. Note that if a URL
895              is included then the colon in http:// and  ftp://  and  the  op‐
896              tional  port number will be treated as part of the URL and not a
897              PATH field separator.  For URLs, the text %s will be replaced by
898              the MD5sum being read.
899
900              If   no   REF_PATH   has  been  specified  it  will  default  to
901              http://www.ebi.ac.uk/ena/cram/md5/%s and if  REF_CACHE  is  also
902              unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s.  If
903              $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system tempo‐
904              rary directory if no home directory is found) will be used simi‐
905              larly.
906
907
908       REF_CACHE
              This can be set to a single location housing a local cache of
              references.  Upon downloading a reference it will be stored in
              the location pointed to by REF_CACHE.  REF_CACHE will be
              searched before attempting to load via the REF_PATH search list.
              If no REF_PATH is defined, both REF_PATH and REF_CACHE will be
              automatically set (see above), but if REF_PATH is defined and
              REF_CACHE is not, then no local cache is used.
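
              A minimal sketch of setting both variables explicitly, so that
              references fetched by MD5sum are also kept in a local cache.
              The directory /local/ref_cache is only an assumed location; the
              URL is the default REF_PATH value quoted above.

```shell
# Assumed cache location; any writable directory could be used.
cache_root=/local/ref_cache

# Cache layout: two nested subdirectories of 2 MD5 characters each.
export REF_CACHE="$cache_root/%2s/%2s/%s"

# Where to fetch references that are not yet in the cache.
export REF_PATH="http://www.ebi.ac.uk/ena/cram/md5/%s"

echo "REF_CACHE=$REF_CACHE"
echo "REF_PATH=$REF_PATH"
```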
916
917              To  avoid  many  files  being  stored  in  the  same  directory,
918              REF_CACHE may be defined as a pattern using %nums to consume num
919              characters of the MD5sum and %s to consume all remaining charac‐
920              ters.   If  REF_CACHE  lacks %s then it will get an implicit /%s
921              appended.
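
              The pattern expansion can be illustrated with a short helper.
              This is a sketch of the rule as described here, not HTSlib's
              own code: expand_ref_cache is a hypothetical function, and the
              MD5sum in the example is a made-up value.

```shell
# Hypothetical helper: expand_ref_cache PATTERN MD5
# %Ns consumes N characters of the MD5sum, %s consumes the rest.
expand_ref_cache() {
    awk -v pat="$1" -v md5="$2" 'BEGIN {
        # A pattern lacking %s gets an implicit /%s appended.
        if (pat !~ /%s/) pat = pat "/%s"
        out = ""
        while (match(pat, /%[0-9]*s/)) {
            out = out substr(pat, 1, RSTART - 1)
            n = substr(pat, RSTART + 1, RLENGTH - 2)
            if (n == "") {
                out = out md5                    # %s: all remaining
                md5 = ""
            } else {
                out = out substr(md5, 1, n + 0)  # %Ns: next N chars
                md5 = substr(md5, n + 1)
            }
            pat = substr(pat, RSTART + RLENGTH)
        }
        print out pat
    }'
}

expand_ref_cache '/local/ref_cache/%2s/%2s/%s' \
    0123456789abcdef0123456789abcdef
# prints: /local/ref_cache/01/23/456789abcdef0123456789abcdef
```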
922
923              To  aid  population  of  the  REF_CACHE   directory   a   script
924              misc/seq_cache_populate.pl is provided in the Samtools distribu‐
925              tion. This takes a fasta file or a directory of fasta files  and
926              generates the MD5sum named files.
927
              For example if you use
              seq_cache_populate.pl -subdirs 2 -root /local/ref_cache
              to create 2 nested subdirectories (the default), each consuming
              2 characters of the MD5sum, then REF_CACHE must be set to
              /local/ref_cache/%2s/%2s/%s.
932

EXAMPLES

934       o Import SAM to BAM when @SQ lines are present in the header:
935
936           samtools view -b aln.sam > aln.bam
937
938         If @SQ lines are absent:
939
940           samtools faidx ref.fa
941           samtools view -bt ref.fa.fai aln.sam > aln.bam
942
943         where ref.fa.fai is generated automatically by the faidx command.
944
945
946       o Convert a BAM file to a CRAM file using a local reference sequence.
947
948           samtools view -C -T ref.fa aln.bam > aln.cram
949
950
951

AUTHOR

953       Heng Li from the Sanger Institute wrote the original C version of  sam‐
954       tools.  Bob Handsaker from the Broad Institute implemented the BGZF li‐
955       brary.  Petr Danecek and Heng  Li  wrote  the  VCF/BCF  implementation.
956       James Bonfield from the Sanger Institute developed the CRAM implementa‐
957       tion.  Other large code contributions have been made by John  Marshall,
958       Rob  Davies,  Martin  Pollard, Andrew Whitwham, Valeriu Ohan (all while
959       primarily at the Sanger Institute), with  numerous  other  smaller  but
960       valuable  contributions.   See the per-command manual pages for further
961       authorship.
962
963

SEE ALSO

965       samtools-addreplacerg(1), samtools-ampliconclip(1),  samtools-amplicon‐
966       stats(1),  samtools-bedcov(1), samtools-calmd(1), samtools-cat(1), sam‐
967       tools-collate(1),  samtools-consensus(1),  samtools-coverage(1),   sam‐
968       tools-depad(1), samtools-depth(1), samtools-dict(1), samtools-faidx(1),
969       samtools-fasta(1),  samtools-fastq(1),  samtools-fixmate(1),  samtools-
970       flags(1),  samtools-flagstat(1),  samtools-fqidx(1),  samtools-head(1),
971       samtools-idxstats(1), samtools-import(1), samtools-index(1),  samtools-
972       markdup(1),  samtools-merge(1), samtools-mpileup(1), samtools-phase(1),
973       samtools-quickcheck(1), samtools-reheader(1),  samtools-rmdup(1),  sam‐
974       tools-sort(1),  samtools-split(1),  samtools-stats(1), samtools-target‐
975       cut(1),  samtools-tview(1),  samtools-view(1),   bcftools(1),   sam(5),
976       tabix(1)
977
978       Samtools website: <http://www.htslib.org/>
       File format specification of SAM/BAM, CRAM, VCF/BCF:
       <http://samtools.github.io/hts-specs>
981       Samtools latest source: <https://github.com/samtools/samtools>
982       HTSlib latest source: <https://github.com/samtools/htslib>
983       Bcftools website: <http://samtools.github.io/bcftools>
984
985
986
samtools-1.15.1                  7 April 2022                      samtools(1)