1samtools(1)                  Bioinformatics tools                  samtools(1)
2
3
4

NAME

6       samtools - Utilities for the Sequence Alignment/Map (SAM) format
7

SYNOPSIS

9       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
10
11       samtools tview aln.sorted.bam ref.fasta
12
13       samtools quickcheck in1.bam in2.cram
14
15       samtools index aln.sorted.bam
16
17       samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
18
19       samtools collate -o aln.name_collated.bam aln.sorted.bam
20
21       samtools idxstats aln.sorted.bam
22
23       samtools flagstat aln.sorted.bam
24
25       samtools flags PAIRED,UNMAP,MUNMAP
26
27       samtools stats aln.sorted.bam
28
29       samtools bedcov aln.sorted.bam
30
31       samtools depth aln.sorted.bam
32
33       samtools ampliconstats primers.bed in.bam
34
35       samtools mpileup -C50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
36
37       samtools coverage aln.sorted.bam
38
39       samtools merge out.bam in1.bam in2.bam in3.bam
40
41       samtools split merged.bam
42
43       samtools cat out.bam in1.bam in2.bam in3.bam
44
45       samtools import input.fastq > output.bam
46
47       samtools fastq input.bam > output.fastq
48
49       samtools fasta input.bam > output.fasta
50
51       samtools faidx ref.fasta
52
53       samtools fqidx ref.fastq
54
55       samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta
56
57       samtools calmd in.sorted.bam ref.fasta
58
59       samtools fixmate in.namesorted.sam out.bam
60
61       samtools markdup in.algnsorted.bam out.bam
62
63       samtools  addreplacerg  -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o out‐
64       put.bam input.bam
65
66       samtools reheader in.header.sam in.bam > out.bam
67
68       samtools targetcut input.bam
69
70       samtools phase input.bam
71
72       samtools depad input.bam
73
74       samtools ampliconclip -b bed.file input.bam
75
76

DESCRIPTION

78       Samtools is a set of utilities that manipulate alignments  in  the  SAM
79       (Sequence  Alignment/Map),  BAM, and CRAM formats.  It converts between
80       the formats, does sorting, merging and indexing, and can retrieve reads
81       in any regions swiftly.
82
83       Samtools  is designed to work on a stream. It regards an input file `-'
84       as the standard input (stdin) and an output file `-'  as  the  standard
85       output (stdout). Several commands can thus be combined with Unix pipes.
86       Samtools always output warning and error messages to the standard error
87       output (stderr).
88
89       Samtools is also able to open files on remote FTP or HTTP(S) servers if
90       the file name starts with `ftp://', `http://',  etc.   Samtools  checks
91       the  current working directory for the index file and will download the
92       index upon absence. Samtools does not  retrieve  the  entire  alignment
93       file unless it is asked to do so.
94
95       If  an index is needed, samtools looks for the index suffix appended to
96       the filename, and if that isn't found it tries again without the  file‐
97       name suffix (for example in.bam.bai followed by in.bai).  However if an
98       index is in a completely different location or has  a  different  name,
99       both  the  main data filename and index filename can be pasted together
100       with ##idx##.  For example  /data/in.bam##idx##/indices/in.bam.bai  may
101       be used to explicitly indicate where the data and index files reside.
102
103

COMMANDS

105       Each  command  has  its own man page which can be viewed using e.g. man
106       samtools-view or with a recent GNU man using man samtools view.   Below
107       we have a brief summary of syntax and sub-command description.
108
109       Options  common  to all sub-commands are documented below in the GLOBAL
110       COMMAND OPTIONS section.
111
112
113       view      samtools view [options] in.sam|in.bam|in.cram [region...]
114
115                 With no options or regions specified, prints  all  alignments
116                 in  the  specified input alignment file (in SAM, BAM, or CRAM
117                 format) to standard output in SAM format (with no  header  by
118                 default).
119
120                 You may specify one or more space-separated region specifica‐
121                 tions after the input filename to  restrict  output  to  only
122                 those  alignments  which overlap the specified region(s). Use
123                 of region specifications requires a coordinate-sorted and in‐
124                 dexed input file.
125
126                 Options  exist to change the output format from SAM to BAM or
127                 CRAM, so this command also acts as a file  format  conversion
128                 utility.
129
130
131       tview     samtools   tview   [-p   chr:pos]   [-s   STR]  [-d  display]
132                 <in.sorted.bam> [ref.fasta]
133
134                 Text alignment viewer (based on the ncurses library). In  the
135                 viewer,  press `?' for help and press `g' to check the align‐
136                 ment   start   from   a   region   in   the    format    like
137                 `chr10:10,000,000'  or  `=10,000,000'  when  viewing the same
138                 reference sequence.
139
140
141       quickcheck
142                 samtools quickcheck [options] in.sam|in.bam|in.cram [ ... ]
143
144                 Quickly check that input files appear to  be  intact.  Checks
145                 that  beginning of the file contains a valid header (all for‐
146                 mats) containing at least one target sequence and then  seeks
147                 to  the  end of the file and checks that an end-of-file (EOF)
148                 is present and intact (BAM only).
149
150                 Data in the middle of the file is not read since  that  would
151                 be much more time consuming, so please note that this command
152                 will not detect internal corruption, but is useful for  test‐
153                 ing  that  files are not truncated before performing more in‐
154                 tensive tasks on them.
155
156                 This command will exit with a non-zero exit code if any input
157                 files  don't have a valid header or are missing an EOF block.
158                 Otherwise it will exit successfully (with a zero exit code).
159
160
161       index     samtools index  [-bc]  [-m  INT]  aln.sam.gz|aln.bam|aln.cram
162                 [out.index]
163
164                 Index a coordinate-sorted SAM, BAM or CRAM file for fast ran‐
165                 dom access.  Note for SAM this only works  if  the  file  has
166                 been BGZF compressed first.
167
168                 This  index is needed when region arguments are used to limit
169                 samtools view and similar commands to particular  regions  of
170                 interest.
171
172                 If  an output filename is given, the index file will be writ‐
173                 ten to out.index.  Otherwise, for a CRAM file aln.cram, index
174                 file  aln.cram.crai  will  be  created; for a BAM or SAM file
175                 aln.bam, either aln.bam.bai or aln.bam.csi will  be  created,
176                 depending on the index format selected.
177
178
179       sort      samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format]
180                 [-n] [-t tag] [-T tmpprefix] [-@ threads]
181                 [in.sam|in.bam|in.cram]
182
183                 Sort alignments by leftmost coordinates, or by read name when
184                 -n is used.  An appropriate @HD-SO sort order header tag will
185                 be added or an existing one updated if necessary.
186
187                 The  sorted  output is written to standard output by default,
188                 or to the specified file (out.bam) when  -o  is  used.   This
189                 command  will also create temporary files tmpprefix.%d.bam as
190                 needed when the entire alignment data cannot fit into  memory
191                 (as controlled via the -m option).
192
193                 Consider using samtools collate instead if you need name col‐
194                 lated data without a full lexicographical sort.
195
196
197       collate   samtools collate [options] in.sam|in.bam|in.cram [<prefix>]
198
199                 Shuffles and groups reads together by their names.  A  faster
200                 alternative  to  a full query name sort, collate ensures that
201                 reads of the same name are  grouped  together  in  contiguous
202                 groups,  but  doesn't  make any guarantees about the order of
203                 read names between groups.
204
205                 The output from this command should be suitable for any oper‐
206                 ation  that  requires  all reads from the same template to be
207                 grouped together.
208
209
210       idxstats  samtools idxstats in.sam|in.bam|in.cram
211
212                 Retrieve and print stats in the index file  corresponding  to
213                 the  input file.  Before calling idxstats, the input BAM file
214                 should be indexed by samtools index.
215
216                 If run on a SAM or CRAM file or an unindexed BAM  file,  this
217                 command  will  still produce the same summary statistics, but
218                 does so by reading through the  entire  file.   This  is  far
219                 slower than using the BAM indices.
220
221                 The output is TAB-delimited with each line consisting of ref‐
222                 erence sequence name, sequence length, # mapped reads  and  #
223                 unmapped reads. It is written to stdout.
224
225
226       flagstat  samtools flagstat in.sam|in.bam|in.cram
227
228                 Does  a  full  pass  through  the input file to calculate and
229                 print statistics to stdout.
230
231                 Provides counts for each of 13 categories based primarily  on
232                 bit  flags  in the FLAG field. Each category in the output is
233                 broken down into QC pass and QC fail, which is  presented  as
234                 "#PASS + #FAIL" followed by a description of the category.
235
236
237       flags     samtools flags INT|STR[,...]
238
239                 Convert between textual and numeric flag representation.
240
241                 FLAGS:
242
243                   0x1   PAIRED          paired-end (or multiple-segment) sequencing technology
244                   0x2   PROPER_PAIR     each segment properly aligned according to the aligner
245                   0x4   UNMAP           segment unmapped
246                   0x8   MUNMAP          next segment in the template unmapped
247                  0x10   REVERSE         SEQ is reverse complemented
248                  0x20   MREVERSE        SEQ of the next segment in the template is reverse complemented
249                  0x40   READ1           the first segment in the template
250                  0x80   READ2           the last segment in the template
251                 0x100   SECONDARY       secondary alignment
252                 0x200   QCFAIL          not passing quality controls
253                 0x400   DUP             PCR or optical duplicate
254                 0x800   SUPPLEMENTARY   supplementary alignment
255
256
257       stats     samtools stats [options] in.sam|in.bam|in.cram [region...]
258
259                 samtools stats collects statistics from BAM files and outputs
260                 in a text format.  The output can be  visualized  graphically
261                 using plot-bamstats.
262
263
264
265       bedcov    samtools          bedcov         [options]         region.bed
266                 in1.sam|in1.bam|in1.cram[...]
267
268                 Reports the total read base count (i.e. the sum of  per  base
269                 read  depths)  for  each genomic region specified in the sup‐
270                 plied BED file. The regions are output as they appear in  the
271                 BED  file  and  are  0-based.  Counts for each alignment file
272                 supplied are reported in separate columns.
273
274
275       depth     samtools    depth     [options]     [in1.sam|in1.bam|in1.cram
276                 [in2.sam|in2.bam|in2.cram] [...]]
277
278                 Computes the read depth at each position or region.
279
280
281       ampliconstats
282                 samtools       ampliconstats       [options]      primers.bed
283                 in.sam|in.bam|in.cram[...]
284
285                 samtools ampliconstats collects statistics from one  or  more
286                 input  alignment  files  and  produces tables in text format.
287                 The output can be visualized graphically using plot-amplicon‐
288                 stats.
289
290                 The  alignment  files  should have previously been clipped of
291                 primer sequence, for example by samtools ampliconclip and the
292                 sites  of  these primers should be specified as a bed file in
293                 the arguments.
294
295
296       mpileup   samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa]  [-l
297                 list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
298
299                 Generate  textual  pileup for one or multiple BAM files.  For
300                 VCF and BCF output, please use the bcftools  mpileup  command
301                 instead.   Alignment records are grouped by sample (SM) iden‐
302                 tifiers in @RG header lines.  If sample identifiers  are  ab‐
303                 sent, each input file is regarded as one sample.
304
305                 See  the  samtools-mpileup  man page for a description of the
306                 pileup format and options.
307
308
309       coverage  samtools   coverage    [options]    [in1.sam|in1.bam|in1.cram
310                 [in2.sam|in2.bam|in2.cram] [...]]
311
312                 Produces a histogram or table of coverage per chromosome.
313
314
315       merge     samtools  merge  [-nur1f]  [-h inh.sam] [-t tag] [-R reg] [-b
316                 list] out.bam in1.bam [in2.bam in3.bam ... inN.bam]
317
318                 Merge multiple sorted alignment  files,  producing  a  single
319                 sorted  output  file  that contains all the input records and
320                 maintains the existing sort order.
321
322                 If -h is specified the @SQ headers of  input  files  will  be
323                 merged  into  the  specified  header,  otherwise they will be
324                 merged into a composite header created from the  input  head‐
325                 ers.  If the @SQ headers differ in order this may require the
326                 output file to be re-sorted after merge.
327
328                 The ordering of the records in the input files must match the
329                 usage of the -n and -t command-line options.  If they do not,
330                 the output order will be undefined.  See sort for information
331                 about record ordering.
332
333
334       split     samtools split [options] merged.sam|merged.bam|merged.cram
335
336                 Splits  a  file  by  read group, producing one or more output
337                 files matching a common prefix (by default based on the input
338                 filename) each containing one read-group.
339
340
341       cat       samtools  cat  [-b list] [-h header.sam] [-o out.bam] in1.bam
342                 in2.bam [ ... ]
343
344                 Concatenate BAMs or CRAMs. Although this works on either  BAM
345                 or  CRAM,  all  input  files  must be the same format as each
346                 other. The sequence dictionary of each  input  file  must  be
347                 identical,  although  this  command does not check this. This
348                 command uses a similar trick to reheader which  enables  fast
349                 BAM concatenation.
350
351
352       import    samtools import [options] in.fastq [ ... ]
353
354                 Converts  one  or  more  FASTQ files to unaligned SAM, BAM or
355                 CRAM.  These formats offer a richer  capability  of  tracking
356                 sample  meta-data  via  the SAM header and per-read meta-data
357                 via the auxiliary tags.  The fastq command may be used to re‐
358                 verse this conversion.
359
360
361       fastq/a   samtools fastq [options] in.bam
362                 samtools fasta [options] in.bam
363
364                 Converts  a BAM or CRAM into either FASTQ or FASTA format de‐
365                 pending on the command invoked. The files will  be  automati‐
366                 cally compressed if the file names have a .gz or .bgzf exten‐
367                 sion.
368
369                 The input to this program must be collated by name.  Use sam‐
370                 tools collate or samtools sort -n to ensure this.
371
372
373       faidx     samtools faidx <ref.fasta> [region1 [...]]
374
375                 Index  reference sequence in the FASTA format or extract sub‐
376                 sequence from indexed reference sequence.  If  no  region  is
377                 specified,   faidx   will   index   the   file   and   create
378                 <ref.fasta>.fai on the disk. If regions  are  specified,  the
379                 subsequences  will  be retrieved and printed to stdout in the
380                 FASTA format.
381
382                 The input file can be compressed in the BGZF format.
383
384                 FASTQ files can be read and indexed by this command.  Without
385                 using --fastq any extracted subsequence will be in FASTA for‐
386                 mat.
387
388
389       fqidx     samtools fqidx <ref.fastq> [region1 [...]]
390
391                 Index reference sequence in the FASTQ format or extract  sub‐
392                 sequence  from  indexed  reference  sequence. If no region is
393                 specified,   fqidx   will   index   the   file   and   create
394                 <ref.fastq>.fai  on  the  disk. If regions are specified, the
395                 subsequences will be retrieved and printed to stdout  in  the
396                 FASTQ format.
397
398                 The input file can be compressed in the BGZF format.
399
400                 samtools  fqidx  should  only  be  used on fastq files with a
401                 small number of entries.  Trying to use it on a file contain‐
402                 ing  millions of short sequencing reads will produce an index
403                 that is almost as big as the original file, and searches  us‐
404                 ing the index will be very slow and use a lot of memory.
405
406
407       dict      samtools dict ref.fasta|ref.fasta.gz
408
409                 Create a sequence dictionary file from a fasta file.
410
411
412       calmd     samtools calmd [-Eeubr] [-C capQcoef] aln.bam ref.fasta
413
414                 Generate  the  MD tag. If the MD tag is already present, this
415                 command will give a warning if the MD tag generated  is  dif‐
416                 ferent from the existing tag. Output SAM by default.
417
418                 Calmd  can  also  read  and write CRAM files although in most
419                 cases it is pointless as CRAM recalculates MD and NM tags  on
420                 the  fly.  The one exception to this case is where both input
421                 and output CRAM files have been / are being created with  the
422                 no_ref option.
423
424
425       fixmate   samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam
426
427                 Fill in mate coordinates, ISIZE and mate related flags from a
428                 name-sorted alignment.
429
430
431       markdup   samtools markdup [-l length] [-r] [-s] [-T] [-S] in.al‐
432                 gsort.bam out.bam
433
434                 Mark  duplicate alignments from a coordinate sorted file that
435                 has been run through samtools fixmate  with  the  -m  option.
436                 This  program  relies on the MC and ms tags that fixmate pro‐
437                 vides.
438
439
440       rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>
441
442                 This command is obsolete. Use markdup instead.
443
444
445       addreplacerg
446                 samtools addreplacerg [-r rg-line | -R rg-ID] [-m  mode]  [-l
447                 level] [-o out.bam] in.bam
448
449                 Adds or replaces read group tags in a file.
450
451
452       reheader  samtools reheader [-iP] in.header.sam in.bam
453
454                 Replace   the   header   in   in.bam   with   the  header  in
455                 in.header.sam.  This command is much  faster  than  replacing
456                 the header with a BAM→SAM→BAM conversion.
457
458                 By default this command outputs the BAM or CRAM file to stan‐
459                 dard output (stdout), but for CRAM format files  it  has  the
460                 option  to perform an in-place edit, both reading and writing
461                 to the same file.  No validity checking is performed  on  the
462                 header, nor that it is suitable to use with the sequence data
463                 itself.
464
465
466       targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0]  [-1
467                 em1] [-2 em2] [-f ref] in.bam
468
469                 This  command identifies target regions by examining the con‐
470                 tinuity of read depth, computes haploid  consensus  sequences
471                 of targets and outputs a SAM with each sequence corresponding
472                 to a target. When option -f is in use, BAQ will  be  applied.
473                 This  command is only designed for cutting fosmid clones from
474                 fosmid pool sequencing [Ref. Kitzman et al. (2010)].
475
476
477       phase     samtools phase [-AF] [-k len] [-b  prefix]  [-q  minLOD]  [-Q
478                 minBaseQ] in.bam
479
480                 Call and phase heterozygous SNPs.
481
482
483       depad     samtools depad [-SsCu1] [-T ref.fa] [-o output] in.bam
484
485                 Converts  a  BAM  aligned against a padded reference to a BAM
486                 aligned against the depadded reference.  The padded reference
487                 may  contain verbatim "*" bases in it, but "*" bases are also
488                 counted in the reference numbering.  This means  that  a  se‐
489                 quence  base-call  aligned against a reference "*" is consid‐
490                 ered to be a cigar match ("M" or "X") operator (if the  base-
491                 call is "A", "C", "G" or "T").  After depadding the reference
492                 "*" bases are deleted and such  aligned  sequence  base-calls
493                 become insertions.  Similarly transformations apply for dele‐
494                 tions and padding cigar operations.
495
496
497       ampliconclip
498                 samtools ampliconclip [-o out.file] [-f  stat.file]  [--soft-
499                 clip]   [--hard-clip]  [--both-ends]  [--strand]  [--clipped]
500                 [--fail] [--no-PG] -b bed.file in.file
501
502                 Clip reads in a SAM compatible file based on data from a  BED
503                 file.
504
505

SAMTOOLS OPTIONS

507       These  are  options  that are passed after the samtools command, before
508       any sub-command is specified.
509
510
511       help, --help
512              Display a brief usage  message  listing  the  samtools  commands
513              available.   If  the name of a command is also given, e.g., sam‐
514              tools help view, the detailed usage message for that  particular
515              command is displayed.
516
517
518       --version
519              Display  the  version numbers and copyright information for sam‐
520              tools and the important libraries used by samtools.
521
522
523       --version-only
524              Display the full samtools version number in  a  machine-readable
525              format.
526

GLOBAL COMMAND OPTIONS

528       Several long-options are shared between multiple samtools sub-commands:
529       --input-fmt,  --input-fmt-option,  --output-fmt,   --output-fmt-option,
530       --reference, --write-index, and --verbosity.  The input format is typi‐
531       cally auto-detected so specifying the format is usually unnecessary and
532       the option is included for completeness.  Note that not all subcommands
533       have all options.  Consult the subcommand help for more details.
534
535       Format strings recognised are "sam", "sam.gz", "bam" and "cram".   They
536       may  be  followed  by  a  comma  separated  list  of  options as key or
537       key=value. See below for examples.
538
539       The fmt-option arguments accept either a single option or option=value.
540       Note  that some options only work on some file formats and only on read
541       or write streams.  If value is unspecified for a  boolean  option,  the
542       value is assumed to be 1.  The valid options are as follows.
543
544       level=INT
545           Output  only. Specifies the compression level from 1 to 9, or 0 for
546           uncompressed.  If the output format is SAM, this also enables  BGZF
547           compression, otherwise SAM defaults to uncompressed.
548
549       nthreads=INT
550           Specifies  the  number of threads to use during encoding and/or de‐
551           coding.  For BAM this will be encoding only.  In CRAM  the  threads
552           are dynamically shared between encoder and decoder.
553
554       filter=STRING
555           Apply  filter STRING to all incoming records, rejecting any that do
556           not satisfy the expression.  See the FILTER EXPRESSIONS section be‐
557           low for specifics.
558
559       reference=fasta_file
560           Specifies a FASTA reference file for use in CRAM encoding or decod‐
561           ing.  It usually is not required for decoding except in the  situa‐
562           tion  of the MD5 not being obtainable via the REF_PATH or REF_CACHE
563           environment variables.
564
565       decode_md=0|1
566           CRAM input only; defaults to 1 (on).  CRAM does not typically store
567           MD  and NM tags, preferring to generate them on the fly.  When this
568           option is 0 missing MD, NM tags will not be generated.  It  can  be
569           particularly  useful  when  combined  with  a  file  encoded  using
570           store_md=1 and store_nm=1.
571
572       store_md=0|1
573           CRAM output only; defaults to 0 (off).  CRAM normally  only  stores
574           MD tags when the reference is unknown and lets the decoder generate
575           these values on-the-fly (see decode_md).
576
577       store_nm=0|1
578           CRAM output only; defaults to 0 (off).  CRAM normally  only  stores
579           NM tags when the reference is unknown and lets the decoder generate
580           these values on-the-fly (see decode_md).
581
582       ignore_md5=0|1
583           CRAM input only; defaults to 0 (off).  When enabled,  md5  checksum
584           errors  on  the reference sequence and block checksum errors within
585           CRAM are ignored.  Use of this option is strongly discouraged.
586
587       required_fields=bit-field
588           CRAM input only; specifies which SAM columns need to be  populated.
589           By  default  all  fields are used.  Limiting the decode to specific
590           columns can have significant performance gains.  The bit-field is a
591           numerical value constructed from the following table.
592
593              0x1   SAM_QNAME
594              0x2   SAM_FLAG
595              0x4   SAM_RNAME
596
597              0x8   SAM_POS
598             0x10   SAM_MAPQ
599             0x20   SAM_CIGAR
600             0x40   SAM_RNEXT
601             0x80   SAM_PNEXT
602            0x100   SAM_TLEN
603            0x200   SAM_SEQ
604            0x400   SAM_QUAL
605            0x800   SAM_AUX
606           0x1000   SAM_RGAUX
607
608       name_prefix=string
609           CRAM  input  only; defaults to output filename.  Any sequences with
610           auto-generated read names will use string as the name prefix.
611
612       multi_seq_per_slice=0|1
613           CRAM output only; defaults to 0 (off).  By default  CRAM  generates
614           one  container  per  reference sequence, except in the case of many
615           small references (such as a fragmented assembly).
616
617       version=major.minor
618           CRAM output only.  Specifies the CRAM version  number.   Acceptable
619           values are "2.1" and "3.0".
620
621       seqs_per_slice=INT
622           CRAM output only; defaults to 10000.
623
624       slices_per_container=INT
625           CRAM  output  only;  defaults  to 1.  The effect of having multiple
626           slices per container is to share the compression header  block  be‐
627           tween  multiple  slices.   This is unlikely to have any significant
628           impact unless the number of sequences per slice is  reduced.   (To‐
629           gether these two options control the granularity of random access.)
630
631       embed_ref=0|1
632           CRAM  output only; defaults to 0 (off).  If 1, this will store por‐
633           tions of the reference sequence in each  slice,  permitting  decode
634           without  having  requiring  an  external  copy of the reference se‐
635           quence.
636
637       no_ref=0|1
638           CRAM output only; defaults to 0 (off).  If  1,  sequences  will  be
639           stored  verbatim with no reference encoding.  This can be useful if
640           no reference is available for the file.
641
642       use_bzip2=0|1
643           CRAM output only; defaults to 0 (off).  Permits  use  of  bzip2  in
644           CRAM block compression.
645
646       use_lzma=0|1
647           CRAM output only; defaults to 0 (off).  Permits use of lzma in CRAM
648           block compression.
649
650       lossy_names=0|1
651           CRAM output only; defaults to 0 (off).  If 1,  templates  with  all
652           members  within  the same CRAM slice will have their read names re‐
653           moved.  New names will be automatically generated during  decoding.
654           Also see the name_prefix option.
655
656       For example:
657
658           samtools view --input-fmt-option decode_md=0
659               --output-fmt cram,version=3.0 --output-fmt-option embed_ref
660               --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
661
662
663       The --write-index option enables automatic index creation while writing
664       out BAM, CRAM or bgzf SAM files.  Note to get  compressed  SAM  as  the
665       output  format you need to manually request a compression level, other‐
666       wise all SAM files are uncompressed.  By default SAM and BAM  will  use
667       CSI  indices  while  CRAM will use CRAI indices.  If you need to create
668       BAI indices note that it is possible to specify the name of  the  index
669       being written to, and hence the format, by using the filename##idx##in‐
670       dexname notation.
671
672       For example: to convert a BAM to a compressed SAM with CSI indexing:
673
674           samtools view -h -O sam,level=6 --write-index in.bam -o out.sam.gz
675
676
677       To convert a SAM to a compressed BAM using BAI indexing:
678
679           samtools view --write-index in.sam -o out.bam##idx##out.bam.bai
680
681
682       The --verbosity INT option sets the verbosity level  for  samtools  and
683       HTSlib.  The default is 3 (HTS_LOG_WARNING); 2 reduces warning messages
684       and 0 or 1 also reduces some error messages, while values greater  than
685       3  produce  increasing  numbers of additional warnings and logging mes‐
686       sages.
687
688

REFERENCE SEQUENCES

690       The CRAM format requires use of a reference sequence for  both  reading
691       and writing.
692
693       When  reading  a  CRAM the @SQ headers are interrogated to identify the
694       reference sequence MD5sum (M5: tag) and the  local  reference  sequence
695       filename (UR: tag).  Note that http:// and ftp:// based URLs in the UR:
696       field are not used, but local fasta filenames (with or without file://)
697       can be used.
698
699       To create a CRAM the @SQ headers will also be read to identify the ref‐
700       erence sequences, but M5: and UR: tags may not be present. In this case
701       the -T and -t options of samtools view may be used to specify the fasta
702       or fasta.fai filenames respectively (provided the  .fasta.fai  file  is
703       also backed up by a .fasta file).
704
705       The search order to obtain a reference is:
706
707       1. Use any local file specified by the command line options (eg -T).
708
709       2. Look for MD5 via REF_CACHE environment variable.
710
711       3. Look for MD5 in each element of the REF_PATH environment variable.
712
713       4. Look for a local file listed in the UR: header tag.
714
715

FILTER EXPRESSIONS

717       Filter  expressions are used as an on-the-fly checking of incoming SAM,
718       BAM or CRAM records, discarding records that do not match the specified
719       expression.
720
721       The  language  used is primarily C style, but with a few differences in
722       the precedence rules for bit operators and the inclusion of regular ex‐
723       pression matching.
724
725       The operator precedence, from strongest binding to weakest, is:
726
727
728       Grouping        (, )             E.g. "(1+2)*3"
729       Values:         literals, vars   Numbers, strings and variables
730
731       Unary ops:      +, -, !, ~       E.g. -10 +10, !10 (not), ~5 (bit not)
732       Math ops:       *, /, %          Multiply, division and (integer) modulo
733       Math ops:       +, -             Addition / subtraction
734       Bit-wise:       &                Integer AND
735       Bit-wise        ^                Integer XOR
736       Bit-wise        |                Integer OR
737       Conditionals:   >, >=, <, <=
738       Equality:       ==, !=, =~, !~   =~ and !~ match regular expressions
739       Boolean:        &&, ||           Logical AND / OR
740
741       Expressions  are computed using floating point mathematics, so "10 / 4"
742       evaluates to 2.5 rather than 2.  They may be  written  as  integers  in
743       decimal  or  "0x"  plus hexadecimal, and floating point with or without
744       exponents.However operations that require integers first do an implicit
745       type  conversion, so "7.9 % 5" is 2 and "7.9 & 4.1" is equivalent to "7
746       & 4", which is 4.  Strings are always specified  using  double  quotes.
747       To  get  a double quote in a string, use backslash.  Similarly a double
748       backslash is used to get a literal backslash.  For example ab\"c\\d  is
749       the string ab"c\d.
750
751       Comparison  operators  are  evaluated as a match being 1 and a mismatch
752       being 0, thus "(2 > 1) + (3 < 5)" evaluates as 2.
753
754       The variables are where the file format specifics are accessed from the
755       expression.   The  variables  correspond  to SAM fields, for example to
756       find paired alignments with high mapping quality and a very  large  in‐
757       sert  size, we may use the expression "mapq >= 30 && (tlen >= 100000 ||
758       tlen <= -100000)".  Valid variable names and their data types are:
759
760
761       flag                 int            Combined FLAG field
762       flag.paired          int            Single bit, 0 or 1
763       flag.proper_pair     int            Single bit, 0 or 2
764       flag.unmap           int            Single bit, 0 or 4
765       flag.munmap          int            Single bit, 0 or 8
766       flag.reverse         int            Single bit, 0 or 16
767       flag.mreverse        int            Single bit, 0 or 32
768       flag.read1           int            Single bit, 0 or 64
769       flag.read2           int            Single bit, 0 or 128
770       flag.secondary       int            Single bit, 0 or 256
771       flag.qcfail          int            Single bit, 0 or 512
772       flag.dup             int            Single bit, 0 or 1024
773       flag.supplementary   int            Single bit, 0 or 2048
774       library              string         Library (LB header via RG)
775       mapq                 int            Mapping quality
776       mpos                 int            Synonym for pnext
777       mrefid               int            Mate reference number (0 based)
778       mrname               string         Synonym for rnext
779       ncigar               int            Number of cigar operations
780       pnext                int            Mate's alignment position (1-based)
781       pos                  int            Alignment position (1-based)
782       qlen                 int            Alignment length: no. query bases
783       qname                string         Query name
784       qual                 string         Quality values (raw, 0 based)
785       refid                int            Integer reference number (0 based)
786       rlen                 int            Alignment length: no. reference bases
787       rname                string         Reference name
788       rnext                string         Mate's reference name
789       seq                  string         Sequence
790       tlen                 int            Template length (insert size)
791       [XX]                 int / string   XX tag value
792
793       Flags are returned either as the whole flag value or by checking for  a
794       single bit.  Hence the filter expression flag.dup is equivalent to flag
795       & 1024.
796
797       "qlen" and "rlen" are measured using the CIGAR string to count the num‐
798       ber  of query (sequence) and reference bases consumed.  Note "qlen" may
799       not exactly match the length of the "seq" field if the sequence is "*".
800
801       Reference names may be matched either by their  string  forms  ("rname"
802       and  "mrname") or as the Nth @SQ line (counting from zero) as stored in
803       BAM using "tid" and "mtid" respectively.
804
805       Auxiliary tags are described in square brackets and these expand to ei‐
806       ther  integer  or  string  as defined by the tag itself (XX:Z:string or
807       XX:i:int).  For example [NM]>=10 can be used  to  look  for  alignments
808       with  many  mismatches  and [RG]=~"grp[ABC]-" will match the read-group
809       string.
810
811       If no comparison is used with an auxiliary tag it is taken simply to be
812       a test for the existence of that tag.  So "[NM]" will return any record
813       containing an NM tag, even if that tag is zero (NM:i:0).
814
815       If you need to check specifically for a non-zero value then use [NM] &&
816       [NM]!=0.
817
818       Some simple functions are available to operate on strings.  These treat
819       the strings as arrays of bytes, permitting their length, minimum, maxi‐
820       mum and average values to be computed.
821
822
823       length   Length of the string (excluding nul char)
824       min      Minimum byte value in the string
825       max      Maximum byte value in the string
826       avg      Average byte value in the string
827
828       Note  that  "avg" is a floating point value and it may be NAN for empty
829       strings.  This means that "avg(qual)" does not  produce  an  error  for
830       records  that  have both seq and qual of "*".  This value will fail any
831       conditional checks, so e.g. "avg(qual) > 20" works and will not  report
832       these records.
833
834

ENVIRONMENT VARIABLES

836       HTS_PATH
837              A colon-separated list of directories in which to search for HT‐
838              Slib plugins.  If $HTS_PATH starts or ends with a colon or  con‐
839              tains  a  double colon (::), the built-in list of directories is
840              searched at that point in the search.
841
842              If no HTS_PATH variable is defined, the built-in list of  direc‐
843              tories  specified when HTSlib was built is used, which typically
844              includes /usr/local/libexec/htslib and similar directories.
845
846
847       REF_PATH
848              A colon separated (semi-colon on Windows) list of  locations  in
849              which  to  look for sequences identified by their MD5sums.  This
850              can be either a list of directories or URLs. Note that if a  URL
851              is  included  then  the  colon in http:// and ftp:// and the op‐
852              tional port number will be treated as part of the URL and not  a
853              PATH field separator.  For URLs, the text %s will be replaced by
854              the MD5sum being read.
855
856              If  no  REF_PATH  has  been  specified  it   will   default   to
857              http://www.ebi.ac.uk/ena/cram/md5/%s  and  if  REF_CACHE is also
858              unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s.  If
859              $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system tempo‐
860              rary directory if no home directory is found) will be used simi‐
861              larly.
862
863
864       REF_CACHE
865              This  can  be defined to a single location housing a local cache
866              of references.  Upon downloading a reference it will  be  stored
867              in  the  location  pointed  to  by REF_CACHE.  REF_CACHE will be
868              searched before attempting to load via the REF_PATH search list.
869              If  no  REF_PATH is defined, both REF_PATH and REF_CACHE will be
870              automatically set (see above), but if REF_PATH  is  defined  and
871              REF_CACHE not then no local cache is used.
872
873              To  avoid  many  files  being  stored  in  the  same  directory,
874              REF_CACHE may be defined as a pattern using %nums to consume num
875              characters of the MD5sum and %s to consume all remaining charac‐
876              ters.  If REF_CACHE lacks %s then it will get  an  implicit  /%s
877              appended.
878
879              To   aid   population   of  the  REF_CACHE  directory  a  script
880              misc/seq_cache_populate.pl is provided in the Samtools distribu‐
881              tion.  This takes a fasta file or a directory of fasta files and
882              generates the MD5sum named files.
883
884              For example if you use seq_cache_populate -subdirs 2 -root  /lo‐
885              cal/ref_cache  to  create 2 nested subdirectories (the default),
886              each consuming 2 characters of the MD5sum, then  REF_CACHE  must
887              be set to /local/ref_cache/%2s/%2s/%s.
888

EXAMPLES

890       o Import SAM to BAM when @SQ lines are present in the header:
891
892           samtools view -b aln.sam > aln.bam
893
894         If @SQ lines are absent:
895
896           samtools faidx ref.fa
897           samtools view -bt ref.fa.fai aln.sam > aln.bam
898
899         where ref.fa.fai is generated automatically by the faidx command.
900
901
902       o Convert a BAM file to a CRAM file using a local reference sequence.
903
904           samtools view -C -T ref.fa aln.bam > aln.cram
905
906
907

LIMITATIONS

909       o Unaligned words used in bam_endian.h, bam.c and bam_aux.c.
910
911

AUTHOR

913       Heng  Li from the Sanger Institute wrote the original C version of sam‐
914       tools.  Bob Handsaker from the Broad Institute implemented the BGZF li‐
915       brary.   Petr  Danecek  and  Heng  Li wrote the VCF/BCF implementation.
916       James Bonfield from the Sanger Institute developed the CRAM implementa‐
917       tion.   Other large code contributions have been made by John Marshall,
918       Rob Davies, Martin Pollard, Andrew Whitwham, Valeriu  Ohan  (all  while
919       primarily  at  the  Sanger  Institute), with numerous other smaller but
920       valuable contributions.  See the per-command manual pages  for  further
921       authorship.
922
923

SEE ALSO

925       samtools-addreplacerg(1),  samtools-ampliconclip(1), samtools-amplicon‐
926       stats(1), samtools-bedcov(1), samtools-calmd(1), samtools-cat(1),  sam‐
927       tools-collate(1),  samtools-coverage(1),  samtools-depad(1),  samtools-
928       depth(1), samtools-dict(1), samtools-faidx(1), samtools-fasta(1),  sam‐
929       tools-fastq(1),  samtools-fixmate(1), samtools-flags(1), samtools-flag‐
930       stat(1), samtools-fqidx(1),  samtools-idxstats(1),  samtools-import(1),
931       samtools-index(1),  samtools-markdup(1),  samtools-merge(1),  samtools-
932       mpileup(1),  samtools-phase(1),  samtools-quickcheck(1),   samtools-re‐
933       header(1), samtools-rmdup(1), samtools-sort(1), samtools-split(1), sam‐
934       tools-stats(1),  samtools-targetcut(1),  samtools-tview(1),   samtools-
935       view(1), bcftools(1), sam(5), tabix(1)
936
937       Samtools website: <http://www.htslib.org/>
938       File   format   specification   of  SAM/BAM,CRAM,VCF/BCF:  <http://sam
939       tools.github.io/hts-specs>
940       Samtools latest source: <https://github.com/samtools/samtools>
941       HTSlib latest source: <https://github.com/samtools/htslib>
942       Bcftools website: <http://samtools.github.io/bcftools>
943
944
945
946samtools-1.13                     7 July 2021                      samtools(1)
Impressum