1samtools(1) Bioinformatics tools samtools(1)
2
3
4
6 samtools - Utilities for the Sequence Alignment/Map (SAM) format
7
9 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
10
11 samtools tview aln.sorted.bam ref.fasta
12
13 samtools quickcheck in1.bam in2.cram
14
15 samtools index aln.sorted.bam
16
17 samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
18
19 samtools collate -o aln.name_collated.bam aln.sorted.bam
20
21 samtools idxstats aln.sorted.bam
22
23 samtools flagstat aln.sorted.bam
24
25 samtools flags PAIRED,UNMAP,MUNMAP
26
27 samtools stats aln.sorted.bam
28
29 samtools bedcov aln.sorted.bam
30
31 samtools depth aln.sorted.bam
32
33 samtools ampliconstats primers.bed in.bam
34
35 samtools mpileup -C50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
36
37 samtools coverage aln.sorted.bam
38
39 samtools merge out.bam in1.bam in2.bam in3.bam
40
41 samtools split merged.bam
42
43 samtools cat out.bam in1.bam in2.bam in3.bam
44
45 samtools import input.fastq > output.bam
46
47 samtools fastq input.bam > output.fastq
48
49 samtools fasta input.bam > output.fasta
50
51 samtools faidx ref.fasta
52
53 samtools fqidx ref.fastq
54
55 samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta
56
57 samtools calmd in.sorted.bam ref.fasta
58
59 samtools fixmate in.namesorted.sam out.bam
60
61 samtools markdup in.algnsorted.bam out.bam
62
63 samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o out‐
64 put.bam input.bam
65
66 samtools reheader in.header.sam in.bam > out.bam
67
68 samtools targetcut input.bam
69
70 samtools phase input.bam
71
72 samtools depad input.bam
73
74 samtools ampliconclip -b bed.file input.bam
75
76
78 Samtools is a set of utilities that manipulate alignments in the SAM
79 (Sequence Alignment/Map), BAM, and CRAM formats. It converts between
80 the formats, does sorting, merging and indexing, and can retrieve reads
81 in any regions swiftly.
82
83 Samtools is designed to work on a stream. It regards an input file `-'
84 as the standard input (stdin) and an output file `-' as the standard
85 output (stdout). Several commands can thus be combined with Unix pipes.
86 Samtools always output warning and error messages to the standard error
87 output (stderr).
88
89 Samtools is also able to open files on remote FTP or HTTP(S) servers if
90 the file name starts with `ftp://', `http://', etc. Samtools checks
91 the current working directory for the index file and will download the
92 index upon absence. Samtools does not retrieve the entire alignment
93 file unless it is asked to do so.
94
95 If an index is needed, samtools looks for the index suffix appended to
96 the filename, and if that isn't found it tries again without the file‐
97 name suffix (for example in.bam.bai followed by in.bai). However if an
98 index is in a completely different location or has a different name,
99 both the main data filename and index filename can be pasted together
100 with ##idx##. For example /data/in.bam##idx##/indices/in.bam.bai may
101 be used to explicitly indicate where the data and index files reside.
102
103
105 Each command has its own man page which can be viewed using e.g. man
106 samtools-view or with a recent GNU man using man samtools view. Below
107 we have a brief summary of syntax and sub-command description.
108
109 Options common to all sub-commands are documented below in the GLOBAL
110 COMMAND OPTIONS section.
111
112
113 view samtools view [options] in.sam|in.bam|in.cram [region...]
114
115 With no options or regions specified, prints all alignments
116 in the specified input alignment file (in SAM, BAM, or CRAM
117 format) to standard output in SAM format (with no header by
118 default).
119
120 You may specify one or more space-separated region specifica‐
121 tions after the input filename to restrict output to only
122 those alignments which overlap the specified region(s). Use
123 of region specifications requires a coordinate-sorted and in‐
124 dexed input file.
125
126 Options exist to change the output format from SAM to BAM or
127 CRAM, so this command also acts as a file format conversion
128 utility.
129
130
131 tview samtools tview [-p chr:pos] [-s STR] [-d display]
132 <in.sorted.bam> [ref.fasta]
133
134 Text alignment viewer (based on the ncurses library). In the
135 viewer, press `?' for help and press `g' to check the align‐
136 ment start from a region in the format like
137 `chr10:10,000,000' or `=10,000,000' when viewing the same
138 reference sequence.
139
140
141 quickcheck
142 samtools quickcheck [options] in.sam|in.bam|in.cram [ ... ]
143
144 Quickly check that input files appear to be intact. Checks
145 that beginning of the file contains a valid header (all for‐
146 mats) containing at least one target sequence and then seeks
147 to the end of the file and checks that an end-of-file (EOF)
148 is present and intact (BAM only).
149
150 Data in the middle of the file is not read since that would
151 be much more time consuming, so please note that this command
152 will not detect internal corruption, but is useful for test‐
153 ing that files are not truncated before performing more in‐
154 tensive tasks on them.
155
156 This command will exit with a non-zero exit code if any input
157 files don't have a valid header or are missing an EOF block.
158 Otherwise it will exit successfully (with a zero exit code).
159
160
161 index samtools index [-bc] [-m INT] aln.sam.gz|aln.bam|aln.cram
162 [out.index]
163
164 Index a coordinate-sorted SAM, BAM or CRAM file for fast ran‐
165 dom access. Note for SAM this only works if the file has
166 been BGZF compressed first.
167
168 This index is needed when region arguments are used to limit
169 samtools view and similar commands to particular regions of
170 interest.
171
172 If an output filename is given, the index file will be writ‐
173 ten to out.index. Otherwise, for a CRAM file aln.cram, index
174 file aln.cram.crai will be created; for a BAM or SAM file
175 aln.bam, either aln.bam.bai or aln.bam.csi will be created,
176 depending on the index format selected.
177
178
179 sort samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format]
180 [-n] [-t tag] [-T tmpprefix] [-@ threads]
181 [in.sam|in.bam|in.cram]
182
183 Sort alignments by leftmost coordinates, or by read name when
184 -n is used. An appropriate @HD-SO sort order header tag will
185 be added or an existing one updated if necessary.
186
187 The sorted output is written to standard output by default,
188 or to the specified file (out.bam) when -o is used. This
189 command will also create temporary files tmpprefix.%d.bam as
190 needed when the entire alignment data cannot fit into memory
191 (as controlled via the -m option).
192
193 Consider using samtools collate instead if you need name col‐
194 lated data without a full lexicographical sort.
195
196
197 collate samtools collate [options] in.sam|in.bam|in.cram [<prefix>]
198
199 Shuffles and groups reads together by their names. A faster
200 alternative to a full query name sort, collate ensures that
201 reads of the same name are grouped together in contiguous
202 groups, but doesn't make any guarantees about the order of
203 read names between groups.
204
205 The output from this command should be suitable for any oper‐
206 ation that requires all reads from the same template to be
207 grouped together.
208
209
210 idxstats samtools idxstats in.sam|in.bam|in.cram
211
212 Retrieve and print stats in the index file corresponding to
213 the input file. Before calling idxstats, the input BAM file
214 should be indexed by samtools index.
215
216 If run on a SAM or CRAM file or an unindexed BAM file, this
217 command will still produce the same summary statistics, but
218 does so by reading through the entire file. This is far
219 slower than using the BAM indices.
220
221 The output is TAB-delimited with each line consisting of ref‐
222 erence sequence name, sequence length, # mapped reads and #
223 unmapped reads. It is written to stdout.
224
225
226 flagstat samtools flagstat in.sam|in.bam|in.cram
227
228 Does a full pass through the input file to calculate and
229 print statistics to stdout.
230
231 Provides counts for each of 13 categories based primarily on
232 bit flags in the FLAG field. Each category in the output is
233 broken down into QC pass and QC fail, which is presented as
234 "#PASS + #FAIL" followed by a description of the category.
235
236
237 flags samtools flags INT|STR[,...]
238
239 Convert between textual and numeric flag representation.
240
241 FLAGS:
242
243 0x1 PAIRED paired-end (or multiple-segment) sequencing technology
244 0x2 PROPER_PAIR each segment properly aligned according to the aligner
245 0x4 UNMAP segment unmapped
246 0x8 MUNMAP next segment in the template unmapped
247 0x10 REVERSE SEQ is reverse complemented
248 0x20 MREVERSE SEQ of the next segment in the template is reverse complemented
249 0x40 READ1 the first segment in the template
250 0x80 READ2 the last segment in the template
251 0x100 SECONDARY secondary alignment
252 0x200 QCFAIL not passing quality controls
253 0x400 DUP PCR or optical duplicate
254 0x800 SUPPLEMENTARY supplementary alignment
255
256
257 stats samtools stats [options] in.sam|in.bam|in.cram [region...]
258
259 samtools stats collects statistics from BAM files and outputs
260 in a text format. The output can be visualized graphically
261 using plot-bamstats.
262
263
264
265 bedcov samtools bedcov [options] region.bed
266 in1.sam|in1.bam|in1.cram[...]
267
268 Reports the total read base count (i.e. the sum of per base
269 read depths) for each genomic region specified in the sup‐
270 plied BED file. The regions are output as they appear in the
271 BED file and are 0-based. Counts for each alignment file
272 supplied are reported in separate columns.
273
274
275 depth samtools depth [options] [in1.sam|in1.bam|in1.cram
276 [in2.sam|in2.bam|in2.cram] [...]]
277
278 Computes the read depth at each position or region.
279
280
281 ampliconstats
282 samtools ampliconstats [options] primers.bed
283 in.sam|in.bam|in.cram[...]
284
285 samtools ampliconstats collects statistics from one or more
286 input alignment files and produces tables in text format.
287 The output can be visualized graphically using plot-amplicon‐
288 stats.
289
290 The alignment files should have previously been clipped of
291 primer sequence, for example by samtools ampliconclip and the
292 sites of these primers should be specified as a bed file in
293 the arguments.
294
295
296 mpileup samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l
297 list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
298
299 Generate textual pileup for one or multiple BAM files. For
300 VCF and BCF output, please use the bcftools mpileup command
301 instead. Alignment records are grouped by sample (SM) iden‐
302 tifiers in @RG header lines. If sample identifiers are ab‐
303 sent, each input file is regarded as one sample.
304
305 See the samtools-mpileup man page for a description of the
306 pileup format and options.
307
308
309 coverage samtools coverage [options] [in1.sam|in1.bam|in1.cram
310 [in2.sam|in2.bam|in2.cram] [...]]
311
312 Produces a histogram or table of coverage per chromosome.
313
314
315 merge samtools merge [-nur1f] [-h inh.sam] [-t tag] [-R reg] [-b
316 list] out.bam in1.bam [in2.bam in3.bam ... inN.bam]
317
318 Merge multiple sorted alignment files, producing a single
319 sorted output file that contains all the input records and
320 maintains the existing sort order.
321
322 If -h is specified the @SQ headers of input files will be
323 merged into the specified header, otherwise they will be
324 merged into a composite header created from the input head‐
325 ers. If the @SQ headers differ in order this may require the
326 output file to be re-sorted after merge.
327
328 The ordering of the records in the input files must match the
329 usage of the -n and -t command-line options. If they do not,
330 the output order will be undefined. See sort for information
331 about record ordering.
332
333
334 split samtools split [options] merged.sam|merged.bam|merged.cram
335
336 Splits a file by read group, producing one or more output
337 files matching a common prefix (by default based on the input
338 filename) each containing one read-group.
339
340
341 cat samtools cat [-b list] [-h header.sam] [-o out.bam] in1.bam
342 in2.bam [ ... ]
343
344 Concatenate BAMs or CRAMs. Although this works on either BAM
345 or CRAM, all input files must be the same format as each
346 other. The sequence dictionary of each input file must be
347 identical, although this command does not check this. This
348 command uses a similar trick to reheader which enables fast
349 BAM concatenation.
350
351
352 import samtools import [options] in.fastq [ ... ]
353
354 Converts one or more FASTQ files to unaligned SAM, BAM or
355 CRAM. These formats offer a richer capability of tracking
356 sample meta-data via the SAM header and per-read meta-data
357 via the auxiliary tags. The fastq command may be used to re‐
358 verse this conversion.
359
360
361 fastq/a samtools fastq [options] in.bam
362 samtools fasta [options] in.bam
363
364 Converts a BAM or CRAM into either FASTQ or FASTA format de‐
365 pending on the command invoked. The files will be automati‐
366 cally compressed if the file names have a .gz or .bgzf exten‐
367 sion.
368
369 The input to this program must be collated by name. Use sam‐
370 tools collate or samtools sort -n to ensure this.
371
372
373 faidx samtools faidx <ref.fasta> [region1 [...]]
374
375 Index reference sequence in the FASTA format or extract sub‐
376 sequence from indexed reference sequence. If no region is
377 specified, faidx will index the file and create
378 <ref.fasta>.fai on the disk. If regions are specified, the
379 subsequences will be retrieved and printed to stdout in the
380 FASTA format.
381
382 The input file can be compressed in the BGZF format.
383
384 FASTQ files can be read and indexed by this command. Without
385 using --fastq any extracted subsequence will be in FASTA for‐
386 mat.
387
388
389 fqidx samtools fqidx <ref.fastq> [region1 [...]]
390
391 Index reference sequence in the FASTQ format or extract sub‐
392 sequence from indexed reference sequence. If no region is
393 specified, fqidx will index the file and create
394 <ref.fastq>.fai on the disk. If regions are specified, the
395 subsequences will be retrieved and printed to stdout in the
396 FASTQ format.
397
398 The input file can be compressed in the BGZF format.
399
400 samtools fqidx should only be used on fastq files with a
401 small number of entries. Trying to use it on a file contain‐
402 ing millions of short sequencing reads will produce an index
403 that is almost as big as the original file, and searches us‐
404 ing the index will be very slow and use a lot of memory.
405
406
407 dict samtools dict ref.fasta|ref.fasta.gz
408
409 Create a sequence dictionary file from a fasta file.
410
411
412 calmd samtools calmd [-Eeubr] [-C capQcoef] aln.bam ref.fasta
413
414 Generate the MD tag. If the MD tag is already present, this
415 command will give a warning if the MD tag generated is dif‐
416 ferent from the existing tag. Output SAM by default.
417
418 Calmd can also read and write CRAM files although in most
419 cases it is pointless as CRAM recalculates MD and NM tags on
420 the fly. The one exception to this case is where both input
421 and output CRAM files have been / are being created with the
422 no_ref option.
423
424
425 fixmate samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam
426
427 Fill in mate coordinates, ISIZE and mate related flags from a
428 name-sorted alignment.
429
430
431 markdup samtools markdup [-l length] [-r] [-s] [-T] [-S] in.al‐
432 gsort.bam out.bam
433
434 Mark duplicate alignments from a coordinate sorted file that
435 has been run through samtools fixmate with the -m option.
436 This program relies on the MC and ms tags that fixmate pro‐
437 vides.
438
439
440 rmdup samtools rmdup [-sS] <input.srt.bam> <out.bam>
441
442 This command is obsolete. Use markdup instead.
443
444
445 addreplacerg
446 samtools addreplacerg [-r rg-line | -R rg-ID] [-m mode] [-l
447 level] [-o out.bam] in.bam
448
449 Adds or replaces read group tags in a file.
450
451
452 reheader samtools reheader [-iP] in.header.sam in.bam
453
454 Replace the header in in.bam with the header in
455 in.header.sam. This command is much faster than replacing
456 the header with a BAM→SAM→BAM conversion.
457
458 By default this command outputs the BAM or CRAM file to stan‐
459 dard output (stdout), but for CRAM format files it has the
460 option to perform an in-place edit, both reading and writing
461 to the same file. No validity checking is performed on the
462 header, nor that it is suitable to use with the sequence data
463 itself.
464
465
466 targetcut samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1
467 em1] [-2 em2] [-f ref] in.bam
468
469 This command identifies target regions by examining the con‐
470 tinuity of read depth, computes haploid consensus sequences
471 of targets and outputs a SAM with each sequence corresponding
472 to a target. When option -f is in use, BAQ will be applied.
473 This command is only designed for cutting fosmid clones from
474 fosmid pool sequencing [Ref. Kitzman et al. (2010)].
475
476
477 phase samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q
478 minBaseQ] in.bam
479
480 Call and phase heterozygous SNPs.
481
482
483 depad samtools depad [-SsCu1] [-T ref.fa] [-o output] in.bam
484
485 Converts a BAM aligned against a padded reference to a BAM
486 aligned against the depadded reference. The padded reference
487 may contain verbatim "*" bases in it, but "*" bases are also
488 counted in the reference numbering. This means that a se‐
489 quence base-call aligned against a reference "*" is consid‐
490 ered to be a cigar match ("M" or "X") operator (if the base-
491 call is "A", "C", "G" or "T"). After depadding the reference
492 "*" bases are deleted and such aligned sequence base-calls
493 become insertions. Similarly transformations apply for dele‐
494 tions and padding cigar operations.
495
496
497 ampliconclip
498 samtools ampliconclip [-o out.file] [-f stat.file] [--soft-
499 clip] [--hard-clip] [--both-ends] [--strand] [--clipped]
500 [--fail] [--no-PG] -b bed.file in.file
501
502 Clip reads in a SAM compatible file based on data from a BED
503 file.
504
505
507 These are options that are passed after the samtools command, before
508 any sub-command is specified.
509
510
511 help, --help
512 Display a brief usage message listing the samtools commands
513 available. If the name of a command is also given, e.g., sam‐
514 tools help view, the detailed usage message for that particular
515 command is displayed.
516
517
518 --version
519 Display the version numbers and copyright information for sam‐
520 tools and the important libraries used by samtools.
521
522
523 --version-only
524 Display the full samtools version number in a machine-readable
525 format.
526
528 Several long-options are shared between multiple samtools sub-commands:
529 --input-fmt, --input-fmt-option, --output-fmt, --output-fmt-option,
530 --reference, --write-index, and --verbosity. The input format is typi‐
531 cally auto-detected so specifying the format is usually unnecessary and
532 the option is included for completeness. Note that not all subcommands
533 have all options. Consult the subcommand help for more details.
534
535 Format strings recognised are "sam", "sam.gz", "bam" and "cram". They
536 may be followed by a comma separated list of options as key or
537 key=value. See below for examples.
538
539 The fmt-option arguments accept either a single option or option=value.
540 Note that some options only work on some file formats and only on read
541 or write streams. If value is unspecified for a boolean option, the
542 value is assumed to be 1. The valid options are as follows.
543
544 level=INT
545 Output only. Specifies the compression level from 1 to 9, or 0 for
546 uncompressed. If the output format is SAM, this also enables BGZF
547 compression, otherwise SAM defaults to uncompressed.
548
549 nthreads=INT
550 Specifies the number of threads to use during encoding and/or de‐
551 coding. For BAM this will be encoding only. In CRAM the threads
552 are dynamically shared between encoder and decoder.
553
554 filter=STRING
555 Apply filter STRING to all incoming records, rejecting any that do
556 not satisfy the expression. See the FILTER EXPRESSIONS section be‐
557 low for specifics.
558
559 reference=fasta_file
560 Specifies a FASTA reference file for use in CRAM encoding or decod‐
561 ing. It usually is not required for decoding except in the situa‐
562 tion of the MD5 not being obtainable via the REF_PATH or REF_CACHE
563 environment variables.
564
565 decode_md=0|1
566 CRAM input only; defaults to 1 (on). CRAM does not typically store
567 MD and NM tags, preferring to generate them on the fly. When this
568 option is 0 missing MD, NM tags will not be generated. It can be
569 particularly useful when combined with a file encoded using
570 store_md=1 and store_nm=1.
571
572 store_md=0|1
573 CRAM output only; defaults to 0 (off). CRAM normally only stores
574 MD tags when the reference is unknown and lets the decoder generate
575 these values on-the-fly (see decode_md).
576
577 store_nm=0|1
578 CRAM output only; defaults to 0 (off). CRAM normally only stores
579 NM tags when the reference is unknown and lets the decoder generate
580 these values on-the-fly (see decode_md).
581
582 ignore_md5=0|1
583 CRAM input only; defaults to 0 (off). When enabled, md5 checksum
584 errors on the reference sequence and block checksum errors within
585 CRAM are ignored. Use of this option is strongly discouraged.
586
587 required_fields=bit-field
588 CRAM input only; specifies which SAM columns need to be populated.
589 By default all fields are used. Limiting the decode to specific
590 columns can have significant performance gains. The bit-field is a
591 numerical value constructed from the following table.
592
593 0x1 SAM_QNAME
594 0x2 SAM_FLAG
595 0x4 SAM_RNAME
596
597 0x8 SAM_POS
598 0x10 SAM_MAPQ
599 0x20 SAM_CIGAR
600 0x40 SAM_RNEXT
601 0x80 SAM_PNEXT
602 0x100 SAM_TLEN
603 0x200 SAM_SEQ
604 0x400 SAM_QUAL
605 0x800 SAM_AUX
606 0x1000 SAM_RGAUX
607
608 name_prefix=string
609 CRAM input only; defaults to output filename. Any sequences with
610 auto-generated read names will use string as the name prefix.
611
612 multi_seq_per_slice=0|1
613 CRAM output only; defaults to 0 (off). By default CRAM generates
614 one container per reference sequence, except in the case of many
615 small references (such as a fragmented assembly).
616
617 version=major.minor
618 CRAM output only. Specifies the CRAM version number. Acceptable
619 values are "2.1" and "3.0".
620
621 seqs_per_slice=INT
622 CRAM output only; defaults to 10000.
623
624 slices_per_container=INT
625 CRAM output only; defaults to 1. The effect of having multiple
626 slices per container is to share the compression header block be‐
627 tween multiple slices. This is unlikely to have any significant
628 impact unless the number of sequences per slice is reduced. (To‐
629 gether these two options control the granularity of random access.)
630
631 embed_ref=0|1
632 CRAM output only; defaults to 0 (off). If 1, this will store por‐
633 tions of the reference sequence in each slice, permitting decode
634 without having requiring an external copy of the reference se‐
635 quence.
636
637 no_ref=0|1
638 CRAM output only; defaults to 0 (off). If 1, sequences will be
639 stored verbatim with no reference encoding. This can be useful if
640 no reference is available for the file.
641
642 use_bzip2=0|1
643 CRAM output only; defaults to 0 (off). Permits use of bzip2 in
644 CRAM block compression.
645
646 use_lzma=0|1
647 CRAM output only; defaults to 0 (off). Permits use of lzma in CRAM
648 block compression.
649
650 lossy_names=0|1
651 CRAM output only; defaults to 0 (off). If 1, templates with all
652 members within the same CRAM slice will have their read names re‐
653 moved. New names will be automatically generated during decoding.
654 Also see the name_prefix option.
655
656 For example:
657
658 samtools view --input-fmt-option decode_md=0
659 --output-fmt cram,version=3.0 --output-fmt-option embed_ref
660 --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
661
662
663 The --write-index option enables automatic index creation while writing
664 out BAM, CRAM or bgzf SAM files. Note to get compressed SAM as the
665 output format you need to manually request a compression level, other‐
666 wise all SAM files are uncompressed. By default SAM and BAM will use
667 CSI indices while CRAM will use CRAI indices. If you need to create
668 BAI indices note that it is possible to specify the name of the index
669 being written to, and hence the format, by using the filename##idx##in‐
670 dexname notation.
671
672 For example: to convert a BAM to a compressed SAM with CSI indexing:
673
674 samtools view -h -O sam,level=6 --write-index in.bam -o out.sam.gz
675
676
677 To convert a SAM to a compressed BAM using BAI indexing:
678
679 samtools view --write-index in.sam -o out.bam##idx##out.bam.bai
680
681
682 The --verbosity INT option sets the verbosity level for samtools and
683 HTSlib. The default is 3 (HTS_LOG_WARNING); 2 reduces warning messages
684 and 0 or 1 also reduces some error messages, while values greater than
685 3 produce increasing numbers of additional warnings and logging mes‐
686 sages.
687
688
690 The CRAM format requires use of a reference sequence for both reading
691 and writing.
692
693 When reading a CRAM the @SQ headers are interrogated to identify the
694 reference sequence MD5sum (M5: tag) and the local reference sequence
695 filename (UR: tag). Note that http:// and ftp:// based URLs in the UR:
696 field are not used, but local fasta filenames (with or without file://)
697 can be used.
698
699 To create a CRAM the @SQ headers will also be read to identify the ref‐
700 erence sequences, but M5: and UR: tags may not be present. In this case
701 the -T and -t options of samtools view may be used to specify the fasta
702 or fasta.fai filenames respectively (provided the .fasta.fai file is
703 also backed up by a .fasta file).
704
705 The search order to obtain a reference is:
706
707 1. Use any local file specified by the command line options (eg -T).
708
709 2. Look for MD5 via REF_CACHE environment variable.
710
711 3. Look for MD5 in each element of the REF_PATH environment variable.
712
713 4. Look for a local file listed in the UR: header tag.
714
715
717 Filter expressions are used as an on-the-fly checking of incoming SAM,
718 BAM or CRAM records, discarding records that do not match the specified
719 expression.
720
721 The language used is primarily C style, but with a few differences in
722 the precedence rules for bit operators and the inclusion of regular ex‐
723 pression matching.
724
725 The operator precedence, from strongest binding to weakest, is:
726
727
728 Grouping (, ) E.g. "(1+2)*3"
729 Values: literals, vars Numbers, strings and variables
730
731 Unary ops: +, -, !, ~ E.g. -10 +10, !10 (not), ~5 (bit not)
732 Math ops: *, /, % Multiply, division and (integer) modulo
733 Math ops: +, - Addition / subtraction
734 Bit-wise: & Integer AND
735 Bit-wise ^ Integer XOR
736 Bit-wise | Integer OR
737 Conditionals: >, >=, <, <=
738 Equality: ==, !=, =~, !~ =~ and !~ match regular expressions
739 Boolean: &&, || Logical AND / OR
740
741 Expressions are computed using floating point mathematics, so "10 / 4"
742 evaluates to 2.5 rather than 2. They may be written as integers in
743 decimal or "0x" plus hexadecimal, and floating point with or without
744 exponents.However operations that require integers first do an implicit
745 type conversion, so "7.9 % 5" is 2 and "7.9 & 4.1" is equivalent to "7
746 & 4", which is 4. Strings are always specified using double quotes.
747 To get a double quote in a string, use backslash. Similarly a double
748 backslash is used to get a literal backslash. For example ab\"c\\d is
749 the string ab"c\d.
750
751 Comparison operators are evaluated as a match being 1 and a mismatch
752 being 0, thus "(2 > 1) + (3 < 5)" evaluates as 2.
753
754 The variables are where the file format specifics are accessed from the
755 expression. The variables correspond to SAM fields, for example to
756 find paired alignments with high mapping quality and a very large in‐
757 sert size, we may use the expression "mapq >= 30 && (tlen >= 100000 ||
758 tlen <= -100000)". Valid variable names and their data types are:
759
760
761 flag int Combined FLAG field
762 flag.paired int Single bit, 0 or 1
763 flag.proper_pair int Single bit, 0 or 2
764 flag.unmap int Single bit, 0 or 4
765 flag.munmap int Single bit, 0 or 8
766 flag.reverse int Single bit, 0 or 16
767 flag.mreverse int Single bit, 0 or 32
768 flag.read1 int Single bit, 0 or 64
769 flag.read2 int Single bit, 0 or 128
770 flag.secondary int Single bit, 0 or 256
771 flag.qcfail int Single bit, 0 or 512
772 flag.dup int Single bit, 0 or 1024
773 flag.supplementary int Single bit, 0 or 2048
774 library string Library (LB header via RG)
775 mapq int Mapping quality
776 mpos int Synonym for pnext
777 mrefid int Mate reference number (0 based)
778 mrname string Synonym for rnext
779 ncigar int Number of cigar operations
780 pnext int Mate's alignment position (1-based)
781 pos int Alignment position (1-based)
782 qlen int Alignment length: no. query bases
783 qname string Query name
784 qual string Quality values (raw, 0 based)
785 refid int Integer reference number (0 based)
786 rlen int Alignment length: no. reference bases
787 rname string Reference name
788 rnext string Mate's reference name
789 seq string Sequence
790 tlen int Template length (insert size)
791 [XX] int / string XX tag value
792
793 Flags are returned either as the whole flag value or by checking for a
794 single bit. Hence the filter expression flag.dup is equivalent to flag
795 & 1024.
796
797 "qlen" and "rlen" are measured using the CIGAR string to count the num‐
798 ber of query (sequence) and reference bases consumed. Note "qlen" may
799 not exactly match the length of the "seq" field if the sequence is "*".
800
801 Reference names may be matched either by their string forms ("rname"
802 and "mrname") or as the Nth @SQ line (counting from zero) as stored in
803 BAM using "tid" and "mtid" respectively.
804
805 Auxiliary tags are described in square brackets and these expand to ei‐
806 ther integer or string as defined by the tag itself (XX:Z:string or
807 XX:i:int). For example [NM]>=10 can be used to look for alignments
808 with many mismatches and [RG]=~"grp[ABC]-" will match the read-group
809 string.
810
811 If no comparison is used with an auxiliary tag it is taken simply to be
812 a test for the existence of that tag. So "[NM]" will return any record
813 containing an NM tag, even if that tag is zero (NM:i:0).
814
815 If you need to check specifically for a non-zero value then use [NM] &&
816 [NM]!=0.
817
818 Some simple functions are available to operate on strings. These treat
819 the strings as arrays of bytes, permitting their length, minimum, maxi‐
820 mum and average values to be computed.
821
822
823 length Length of the string (excluding nul char)
824 min Minimum byte value in the string
825 max Maximum byte value in the string
826 avg Average byte value in the string
827
828 Note that "avg" is a floating point value and it may be NAN for empty
829 strings. This means that "avg(qual)" does not produce an error for
830 records that have both seq and qual of "*". This value will fail any
831 conditional checks, so e.g. "avg(qual) > 20" works and will not report
832 these records.
833
834
836 HTS_PATH
837 A colon-separated list of directories in which to search for HT‐
838 Slib plugins. If $HTS_PATH starts or ends with a colon or con‐
839 tains a double colon (::), the built-in list of directories is
840 searched at that point in the search.
841
842 If no HTS_PATH variable is defined, the built-in list of direc‐
843 tories specified when HTSlib was built is used, which typically
844 includes /usr/local/libexec/htslib and similar directories.
845
846
847 REF_PATH
848 A colon separated (semi-colon on Windows) list of locations in
849 which to look for sequences identified by their MD5sums. This
850 can be either a list of directories or URLs. Note that if a URL
851 is included then the colon in http:// and ftp:// and the op‐
852 tional port number will be treated as part of the URL and not a
853 PATH field separator. For URLs, the text %s will be replaced by
854 the MD5sum being read.
855
856 If no REF_PATH has been specified it will default to
857 http://www.ebi.ac.uk/ena/cram/md5/%s and if REF_CACHE is also
858 unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s. If
859 $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system tempo‐
860 rary directory if no home directory is found) will be used simi‐
861 larly.
862
863
864 REF_CACHE
865 This can be defined to a single location housing a local cache
866 of references. Upon downloading a reference it will be stored
867 in the location pointed to by REF_CACHE. REF_CACHE will be
868 searched before attempting to load via the REF_PATH search list.
869 If no REF_PATH is defined, both REF_PATH and REF_CACHE will be
870 automatically set (see above), but if REF_PATH is defined and
871 REF_CACHE not then no local cache is used.
872
873 To avoid many files being stored in the same directory,
874 REF_CACHE may be defined as a pattern using %nums to consume num
875 characters of the MD5sum and %s to consume all remaining charac‐
876 ters. If REF_CACHE lacks %s then it will get an implicit /%s
877 appended.
878
879 To aid population of the REF_CACHE directory a script
880 misc/seq_cache_populate.pl is provided in the Samtools distribu‐
881 tion. This takes a fasta file or a directory of fasta files and
882 generates the MD5sum named files.
883
884 For example if you use seq_cache_populate -subdirs 2 -root /lo‐
885 cal/ref_cache to create 2 nested subdirectories (the default),
886 each consuming 2 characters of the MD5sum, then REF_CACHE must
887 be set to /local/ref_cache/%2s/%2s/%s.
888
890 o Import SAM to BAM when @SQ lines are present in the header:
891
892 samtools view -b aln.sam > aln.bam
893
894 If @SQ lines are absent:
895
896 samtools faidx ref.fa
897 samtools view -bt ref.fa.fai aln.sam > aln.bam
898
899 where ref.fa.fai is generated automatically by the faidx command.
900
901
902 o Convert a BAM file to a CRAM file using a local reference sequence.
903
904 samtools view -C -T ref.fa aln.bam > aln.cram
905
906
907
909 o Unaligned words used in bam_endian.h, bam.c and bam_aux.c.
910
911
913 Heng Li from the Sanger Institute wrote the original C version of sam‐
914 tools. Bob Handsaker from the Broad Institute implemented the BGZF li‐
915 brary. Petr Danecek and Heng Li wrote the VCF/BCF implementation.
916 James Bonfield from the Sanger Institute developed the CRAM implementa‐
917 tion. Other large code contributions have been made by John Marshall,
918 Rob Davies, Martin Pollard, Andrew Whitwham, Valeriu Ohan (all while
919 primarily at the Sanger Institute), with numerous other smaller but
920 valuable contributions. See the per-command manual pages for further
921 authorship.
922
923
925 samtools-addreplacerg(1), samtools-ampliconclip(1), samtools-amplicon‐
926 stats(1), samtools-bedcov(1), samtools-calmd(1), samtools-cat(1), sam‐
927 tools-collate(1), samtools-coverage(1), samtools-depad(1), samtools-
928 depth(1), samtools-dict(1), samtools-faidx(1), samtools-fasta(1), sam‐
929 tools-fastq(1), samtools-fixmate(1), samtools-flags(1), samtools-flag‐
930 stat(1), samtools-fqidx(1), samtools-idxstats(1), samtools-import(1),
931 samtools-index(1), samtools-markdup(1), samtools-merge(1), samtools-
932 mpileup(1), samtools-phase(1), samtools-quickcheck(1), samtools-re‐
933 header(1), samtools-rmdup(1), samtools-sort(1), samtools-split(1), sam‐
934 tools-stats(1), samtools-targetcut(1), samtools-tview(1), samtools-
935 view(1), bcftools(1), sam(5), tabix(1)
936
937 Samtools website: <http://www.htslib.org/>
938 File format specification of SAM/BAM,CRAM,VCF/BCF: <http://sam‐
939 tools.github.io/hts-specs>
940 Samtools latest source: <https://github.com/samtools/samtools>
941 HTSlib latest source: <https://github.com/samtools/htslib>
942 Bcftools website: <http://samtools.github.io/bcftools>
943
944
945
946samtools-1.13 7 July 2021 samtools(1)