BCFTOOLS(1) BCFTOOLS(1)

6 bcftools - utilities for variant calling and manipulating VCFs and
7 BCFs.
8
10 bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]
11
13 BCFtools is a set of utilities that manipulate variant calls in the
14 Variant Call Format (VCF) and its binary counterpart BCF. All commands
15 work transparently with both VCFs and BCFs, both uncompressed and
16 BGZF-compressed.
17
18 Most commands accept VCF, bgzipped VCF and BCF with filetype detected
19 automatically even when streaming from a pipe. Indexed VCF and BCF will
20 work in all situations. Un-indexed VCF and BCF and streams will work in
21 most, but not all situations. In general, whenever multiple VCFs are
22 read simultaneously, they must be indexed and therefore also
23 compressed.
24
25 BCFtools is designed to work on a stream. It regards an input file "-"
26 as the standard input (stdin) and outputs to the standard output
27 (stdout). Several commands can thus be combined with Unix pipes.
28
29 VERSION
30 This manual page was last updated 2018-07-18 and refers to bcftools git
31 version 1.9.
32
33 BCF1
34 The BCF1 format output by versions of samtools <= 0.1.19 is not
35 compatible with this version of bcftools. To read BCF1 files one can
36 use the view command from old versions of bcftools packaged with
37 samtools versions <= 0.1.19 to convert to VCF, which can then be read
38 by this version of bcftools.
39
40 samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view
41
42 VARIANT CALLING
43 See bcftools call for variant calling from the output of the samtools
44 mpileup command. In versions of samtools <= 0.1.19 calling was done
45 with bcftools view. Users are now required to choose between the old
46 samtools calling model (-c/--consensus-caller) and the new multiallelic
47 calling model (-m/--multiallelic-caller). The multiallelic calling
48 model is recommended for most tasks.
49
51 For a full list of available commands, run bcftools without arguments.
52 For a full list of available options, run bcftools COMMAND without
53 arguments.
54
55 · annotate .. edit VCF files, add or remove annotations
56
57 · call .. SNP/indel calling (former "view")
58
59 · cnv .. Copy Number Variation caller
60
61 · concat .. concatenate VCF/BCF files from the same set of samples
62
63 · consensus .. create consensus sequence by applying VCF variants
64
65 · convert .. convert VCF/BCF to other formats and back
66
67 · csq .. haplotype aware consequence caller
68
69 · filter .. filter VCF/BCF files using fixed thresholds
70
71 · gtcheck .. check sample concordance, detect sample swaps and
72 contamination
73
74 · index .. index VCF/BCF
75
76 · isec .. intersections of VCF/BCF files
77
· merge .. merge VCF/BCF files from non-overlapping sample
sets
80
81 · mpileup .. multi-way pileup producing genotype likelihoods
82
83 · norm .. normalize indels
84
85 · plugin .. run user-defined plugin
86
87 · polysomy .. detect contaminations and whole-chromosome aberrations
88
89 · query .. transform VCF/BCF into user-defined formats
90
91 · reheader .. modify VCF/BCF header, change sample names
92
93 · roh .. identify runs of homo/auto-zygosity
94
95 · sort .. sort VCF/BCF files
96
97 · stats .. produce VCF/BCF stats (former vcfcheck)
98
99 · view .. subset, filter and convert VCF and BCF files
100
102 Some helper scripts are bundled with the bcftools code.
103
104 · plot-vcfstats .. plots the output of stats
105
107 Common Options
108 The following options are common to many bcftools commands. See usage
109 for specific commands to see if they apply.
110
111 FILE
Files can be either VCF or BCF, uncompressed or BGZF-compressed. The
file "-" is interpreted as standard input. Some tools may require
tabix- or CSI-indexed files.
115
116 -c, --collapse snps|indels|both|all|some|none|id
117 Controls how to treat records with duplicate positions and defines
118 compatible records across multiple input files. Here by
119 "compatible" we mean records which should be considered as
120 identical by the tools. For example, when performing line
121 intersections, the desire may be to consider as identical all sites
122 with matching positions (bcftools isec -c all), or only sites with
123 matching variant type (bcftools isec -c snps -c indels), or only
124 sites with all alleles identical (bcftools isec -c none).
125
126 none
127 only records with identical REF and ALT alleles are compatible
128
129 some
130 only records where some subset of ALT alleles match are
131 compatible
132
133 all
134 all records are compatible, regardless of whether the ALT
135 alleles match or not. In the case of records with the same
136 position, only the first will be considered and appear on
137 output.
138
139 snps
140 any SNP records are compatible, regardless of whether the ALT
141 alleles match or not. For duplicate positions, only the first
142 SNP record will be considered and appear on output.
143
144 indels
145 all indel records are compatible, regardless of whether the REF
146 and ALT alleles match or not. For duplicate positions, only the
147 first indel record will be considered and appear on output.
148
149 both
150 abbreviation of "-c indels -c snps"
151
152 id
153 only records with identical ID column are compatible. Supported
154 by bcftools merge only.
155
156 -f, --apply-filters LIST
157 Skip sites where FILTER column does not contain any of the strings
158 listed in LIST. For example, to include only sites which have no
159 filters set, use -f .,PASS.
160
161 --no-version
162 Do not append version and command line information to the output
163 VCF header.
164
165 -o, --output FILE
166 When output consists of a single stream, write it to FILE rather
167 than to standard output, where it is written by default.
168
169 -O, --output-type b|u|z|v
170 Output compressed BCF (b), uncompressed BCF (u), compressed VCF
171 (z), uncompressed VCF (v). Use the -Ou option when piping between
172 bcftools subcommands to speed up performance by removing
173 unnecessary compression/decompression and VCF←→BCF conversion.
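
A minimal sketch of such a pipeline, assuming a hypothetical input
file.vcf.gz. The bcftools calls are skipped when either the program or
the file is not available, so the block is safe to run as is:

```shell
in=file.vcf.gz    # hypothetical input file
if command -v bcftools >/dev/null 2>&1 && [ -f "$in" ]; then
    # Uncompressed BCF (-Ou) between the two commands avoids needless
    # compression and VCF<->BCF conversion; only the last step writes text.
    bcftools view -Ou -f .,PASS "$in" | bcftools query -f '%CHROM\t%POS\n' > sites.txt
    status=ran
else
    status=skipped
fi
echo "$status"
```

Only the final command in the chain needs a human-readable or compressed
output type.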
174
175 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
176 Comma-separated list of regions, see also -R, --regions-file. Note
177 that -r cannot be used in combination with -R.
178
179 -R, --regions-file FILE
180 Regions can be specified either on command line or in a VCF, BED,
181 or tab-delimited file (the default). The columns of the
182 tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where
183 positions are 1-based and inclusive. The columns of the
184 tab-delimited BED file are also CHROM, POS and POS_TO (trailing
185 columns are ignored), but coordinates are 0-based, half-open. To
186 indicate that a file be treated as BED rather than the 1-based
187 tab-delimited file, the file must have the ".bed" or ".bed.gz"
188 suffix (case-insensitive). Uncompressed files are stored in memory,
189 while bgzip-compressed and tabix-indexed region files are streamed.
190 Note that sequence names must match exactly, "chr20" is not the
191 same as "20". Also note that chromosome ordering in FILE will be
192 respected, the VCF will be processed in the order in which
193 chromosomes first appear in FILE. However, within chromosomes, the
194 VCF will always be processed in ascending genomic coordinate order
195 no matter what order they appear in FILE. Note that overlapping
196 regions in FILE can result in duplicated out of order positions in
197 the output. This option requires indexed VCF/BCF files. Note that
198 -R cannot be used in combination with -r.
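
The two coordinate conventions differ only in the start position: a
0-based, half-open BED interval covers the same bases as the 1-based,
inclusive interval start+1..end. A minimal sketch of the conversion with
awk, using the hypothetical file names regions.bed and regions.tab:

```shell
# Hypothetical BED file: 0-based, half-open intervals
printf '20\t999\t2000\n20\t4999\t5000\n' > regions.bed

# Same regions as CHROM, POS, POS_TO with 1-based, inclusive coordinates:
# only the start shifts by one, the end is unchanged.
awk 'BEGIN { OFS = "\t" } { print $1, $2 + 1, $3 }' regions.bed > regions.tab

cat regions.tab
```

Either file should describe the same sites when given to -R, provided the
file suffix matches the convention used inside the file.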
199
200 -s, --samples [^]LIST
201 Comma-separated list of samples to include or exclude if prefixed
202 with "^". The sample order is updated to reflect that given on the
203 command line. Note that in general tags such as INFO/AC, INFO/AN,
204 etc are not updated to correspond to the subset samples. bcftools
205 view is the exception where some tags will be updated (unless the
206 -I, --no-update option is used; see bcftools view documentation).
207 To use updated tags for the subset in another command one can pipe
208 from view into that command. For example:
209
bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f '%INFO/AC\t%INFO/AN\n'
211
212 -S, --samples-file FILE
213 File of sample names to include or exclude if prefixed with "^".
214 One sample per line. See also the note above for the -s, --samples
215 option. The sample order is updated to reflect that given in the
input file. The command bcftools call accepts an optional second
column indicating ploidy (0, 1 or 2) or sex (as defined by
--ploidy, for example "F" or "M"), and can also parse PED files. If
the second column is not present, the sex "F" is assumed. With
bcftools call -C trio, a PED file is expected. File format examples:
221
222 sample1 1
223 sample2 2
224 sample3 2
225
226 or
227
228 sample1 M
229 sample2 F
230 sample3 F
231
232 or a .ped file (here is shown a minimum working example, the first column is
233 ignored and the last indicates sex: 1=male, 2=female)
234
235 ignored daughterA fatherA motherA 2
236 ignored sonB fatherB motherB 1
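
The two-column form and the PED form carry the same information. As a
rough sketch, a sample/sex list can be derived from the minimal PED
layout shown above with awk. The file names trio.ped and samples.txt are
hypothetical, and the codes "M" and "F" are assumed to match those used
by the ploidy definition:

```shell
# Minimal PED-style input as in the example above (last column: 1=male, 2=female)
printf 'ignored daughterA fatherA motherA 2\nignored sonB fatherB motherB 1\n' > trio.ped

# Two-column samples file: sample name, then sex code "M" or "F".
# Rows with any other sex code are skipped rather than guessed.
awk '$5 == 1 { print $2, "M" } $5 == 2 { print $2, "F" }' trio.ped > samples.txt

cat samples.txt
```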
237
238 -t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]
Similar to -r, --regions, but the next position is accessed by
240 streaming the whole VCF/BCF rather than using the tbi/csi index.
241 Both -r and -t options can be applied simultaneously: -r uses the
242 index to jump to a region and -t discards positions which are not
243 in the targets. Unlike -r, targets can be prefixed with "^" to
244 request logical complement. For example, "^X,Y,MT" indicates that
245 sequences X, Y and MT should be skipped. Yet another difference
246 between the two is that -r checks both start and end positions of
247 indels, whereas -t checks start positions only. Note that -t cannot
248 be used in combination with -T.
249
250 -T, --targets-file [^]FILE
Same as -t, --targets, but reads regions from a file. Note that -T
cannot be used in combination with -t.
253
With the call -C alleles command, the third column of the targets file
must be a comma-separated list of alleles, starting with the
reference allele. Note that the file must be compressed and indexed.
Such a file can be easily created from a VCF using:
258
259 bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz
260
261 --threads INT
262 Number of output compression threads to use in addition to main
263 thread. Only used when --output-type is b or z. Default: 0.
264
265 bcftools annotate [OPTIONS] FILE
266 Add or remove annotations.
267
268 -a, --annotations file
269 Bgzip-compressed and tabix-indexed file with annotations. The file
270 can be VCF, BED, or a tab-delimited file with mandatory columns
271 CHROM, POS (or, alternatively, FROM and TO), optional columns REF
272 and ALT, and arbitrary number of annotation columns. BED files are
273 expected to have the ".bed" or ".bed.gz" suffix (case-insensitive),
274 otherwise a tab-delimited file is assumed. Note that in case of
275 tab-delimited file, the coordinates POS, FROM and TO are one-based
276 and inclusive. When REF and ALT are present, only matching VCF
277 records will be annotated. When multiple ALT alleles are present in
278 the annotation file (given as comma-separated list of alleles), at
279 least one must match one of the alleles in the corresponding VCF
280 record. Similarly, at least one alternate allele from a
281 multi-allelic VCF record must be present in the annotation file.
282 Note that flag types, such as "INFO/FLAG", can be annotated by
283 including a field with the value "1" to set the flag, "0" to remove
284 it, or "." to keep existing flags. See also -c, --columns and -h,
285 --header-lines.
286
287 # Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG
288 1 752566 SomeString 5
289 1 798959 SomeOtherString 6
290 # etc.
291
292 --collapse snps|indels|both|all|some|none
293 Controls how to match records from the annotation file to the
294 target VCF. Effective only when -a is a VCF or BCF. See Common
295 Options for more.
296
297 -c, --columns list
298 Comma-separated list of columns or tags to carry over from the
299 annotation file (see also -a, --annotations). If the annotation
300 file is not a VCF/BCF, list describes the columns of the annotation
301 file and must include CHROM, POS (or, alternatively, FROM and TO),
302 and optionally REF and ALT. Unused columns which should be ignored
303 can be indicated by "-".
304
305 If the annotation file is a VCF/BCF, only the edited columns/tags
306 must be present and their order does not matter. The columns ID,
307 QUAL, FILTER, INFO and FORMAT can be edited, where INFO tags can be
written either as "INFO/TAG" or simply "TAG", and FORMAT tags can be
309 written as "FORMAT/TAG" or "FMT/TAG". The imported VCF annotations
310 can be renamed as "DST_TAG:=SRC_TAG" or "FMT/DST_TAG:=FMT/SRC_TAG".
311
312 To carry over all INFO annotations, use "INFO". To add all INFO
313 annotations except "TAG", use "^INFO/TAG". By default, existing
314 values are replaced.
315
316 To add annotations without overwriting existing values (that is, to
317 add missing tags or add values to existing tags with missing
318 values), use "+TAG" instead of "TAG". To append to existing values
319 (rather than replacing or leaving untouched), use "=TAG" (instead
320 of "TAG" or "+TAG"). To replace only existing values without
321 modifying missing annotations, use "-TAG".
322
323 If the annotation file is not a VCF/BCF, all new annotations must
324 be defined via -h, --header-lines.
325
326 -e, --exclude EXPRESSION
327 exclude sites for which EXPRESSION is true. For valid expressions
328 see EXPRESSIONS.
329
330 -h, --header-lines file
331 Lines to append to the VCF header, see also -c, --columns and -a,
332 --annotations. For example:
333
334 ##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
335 ##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
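
The header file and the annotation table have to agree with the -c list.
A minimal sketch, assuming the hypothetical file names annots.tab and
annots.hdr, that writes both files and checks that every row has as many
fields as the column list names:

```shell
# Hypothetical annotation table: CHROM, POS, STRING_TAG, NUMERIC_TAG (tab-separated)
printf '1\t752566\tSomeString\t5\n1\t798959\tSomeOtherString\t6\n' > annots.tab

# Header lines describing the two new INFO tags
cat > annots.hdr <<'EOF'
##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
EOF

# Sanity check: count rows whose field count differs from the -c list
cols=CHROM,POS,STRING_TAG,NUMERIC_TAG
n=$(printf '%s\n' "$cols" | awk -F, '{ print NF }')
bad=$(awk -F'\t' -v n="$n" 'NF != n { c++ } END { print c + 0 }' annots.tab)
echo "columns=$n malformed_rows=$bad"

# With bgzip, tabix and bcftools installed, and an input file.vcf at hand:
#   bgzip annots.tab && tabix -s1 -b2 -e2 annots.tab.gz
#   bcftools annotate -a annots.tab.gz -h annots.hdr -c "$cols" file.vcf
```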
336
337 -I, --set-id [+]FORMAT
338 assign ID on the fly. The format is the same as in the query
339 command (see below). By default all existing IDs are replaced. If
340 the format string is preceded by "+", only missing IDs will be set.
341 For example, one can use
342
343 bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf
344
345 -i, --include EXPRESSION
346 include only sites for which EXPRESSION is true. For valid
347 expressions see EXPRESSIONS.
348
349 -k, --keep-sites
keep sites which do not pass -i and -e expressions instead of
discarding them
352
353 -m, --mark-sites TAG
354 annotate sites which are present ("+") or absent ("-") in the -a
355 file with a new INFO/TAG flag
356
357 --no-version
358 see Common Options
359
360 -o, --output FILE
361 see Common Options
362
363 -O, --output-type b|u|z|v
364 see Common Options
365
366 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
367 see Common Options
368
369 -R, --regions-file file
370 see Common Options
371
372 --rename-chrs file
373 rename chromosomes according to the map in file, with "old_name
374 new_name\n" pairs separated by whitespaces, each on a separate
375 line.
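
A minimal sketch of building such a map, here one that strips a "chr"
prefix from the autosomes and from X and Y. The file name chr_map.txt and
the choice of chromosomes are assumptions made for illustration:

```shell
# One "old_name new_name" pair per line: "chr1 1", ..., "chr22 22", "chrX X", "chrY Y"
for c in $(seq 1 22) X Y; do
    printf 'chr%s %s\n' "$c" "$c"
done > chr_map.txt

head -n 2 chr_map.txt

# With bcftools installed and a hypothetical input.vcf.gz:
#   bcftools annotate --rename-chrs chr_map.txt -Oz -o renamed.vcf.gz input.vcf.gz
```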
376
377 -s, --samples [^]LIST
378 subset of samples to annotate, see also Common Options
379
380 -S, --samples-file FILE
381 subset of samples to annotate. If the samples are named differently
382 in the target VCF and the -a, --annotations VCF, the name mapping
383 can be given as "src_name dst_name\n", separated by whitespaces,
384 each pair on a separate line.
385
386 --threads INT
387 see Common Options
388
389 -x, --remove list
390 List of annotations to remove. Use "FILTER" to remove all filters
391 or "FILTER/SomeFilter" to remove a specific filter. Similarly,
392 "INFO" can be used to remove all INFO tags and "FORMAT" to remove
393 all FORMAT tags except GT. To remove all INFO tags except "FOO" and
394 "BAR", use "^INFO/FOO,INFO/BAR" (and similarly for FORMAT and
395 FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".
396
397 Examples:
398
399 # Remove three fields
400 bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz
401
402 # Remove all INFO fields and all FORMAT fields except for GT and PL
403 bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf
404
405 # Add ID, QUAL and INFO/TAG, not replacing TAG if already present
406 bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf
407
408 # Carry over all INFO and FORMAT annotations except FORMAT/GT
409 bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
410
411 # Annotate from a tab-delimited file with six columns (the fifth is ignored),
412 # first indexing with tabix. The coordinates are 1-based.
413 tabix -s1 -b2 -e2 annots.tab.gz
414 bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf
415
416 # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
417 tabix -s1 -b2 -e3 annots.tab.gz
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
419
420 # Annotate from a bed file (0-based coordinates, half-closed, half-open intervals)
421 bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
422
423 bcftools call [OPTIONS] FILE
424 This command replaces the former bcftools view caller. Some of the
425 original functionality has been temporarily lost in the process of
transition under htslib, but will be added back by popular demand. The
427 original calling model can be invoked with the -c option.
428
429 File format options:
430 --no-version
431 see Common Options
432
433 -o, --output FILE
434 see Common Options
435
436 -O, --output-type b|u|z|v
437 see Common Options
438
439 --ploidy ASSEMBLY[?]
440 predefined ploidy, use list (or any other unused word) to print
441 a list of all predefined assemblies. Append a question mark to
442 print the actual definition. See also --ploidy-file.
443
444 --ploidy-file FILE
445 ploidy definition given as a space/tab-delimited list of CHROM,
446 FROM, TO, SEX, PLOIDY. The SEX codes are arbitrary and
447 correspond to the ones used by --samples-file. The default
448 ploidy can be given using the starred records (see below),
449 unlisted regions have ploidy 2. The default ploidy definition
450 is
451
452 X 1 60000 M 1
453 X 2699521 154931043 M 1
454 Y 1 59373566 M 1
455 Y 1 59373566 F 0
456 MT 1 16569 M 1
457 MT 1 16569 F 1
458 * * * M 2
459 * * * F 2
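
One way to read the table: an explicit region applies first, then the
starred default for the given sex, and 2 otherwise. The lookup below is an
illustrative sketch of that reading only, not the code bcftools uses. The
file name ploidy.txt and the helper ploidy_at are hypothetical:

```shell
# The default definition shown above, saved to a hypothetical file
cat > ploidy.txt <<'EOF'
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
* * * M 2
* * * F 2
EOF

# Illustrative lookup: explicit region, else starred default, else 2
ploidy_at() {   # usage: ploidy_at CHROM POS SEX
    awk -v c="$1" -v p="$2" -v s="$3" '
        $4 != s { next }                      # other sex: ignore the record
        $1 == "*" { dflt = $5; next }         # starred default for this sex
        $1 == c && p >= $2 && p <= $3 { hit = $5; found = 1 }
        END { if (found) print hit; else if (dflt != "") print dflt; else print 2 }
    ' ploidy.txt
}

ploidy_at X 30000 M      # male, inside the first X region
ploidy_at X 1000000 M    # male, between the two listed X regions
ploidy_at Y 100 F        # female, Y chromosome
```

A file of this shape can be passed to --ploidy-file, with the SEX codes
matching those in the samples file.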
460
461 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
462 see Common Options
463
464 -R, --regions-file file
465 see Common Options
466
467 -s, --samples LIST
468 see Common Options
469
470 -S, --samples-file FILE
471 see Common Options
472
473 -t, --targets LIST
474 see Common Options
475
476 -T, --targets-file FILE
477 see Common Options
478
479 --threads INT
480 see Common Options
481
482 Input/output options:
483 -A, --keep-alts
484 output all alternate alleles present in the alignments even if
485 they do not appear in any of the genotypes
486
487 -f, --format-fields list
488 comma-separated list of FORMAT fields to output for each
489 sample. Currently GQ and GP fields are supported. For
490 convenience, the fields can be given as lower case letters.
491
492 -F, --prior-freqs AN,AC
493 take advantage of prior knowledge of population allele
494 frequencies. The workflow looks like this:
495
# Extract AN,AC values from an existing VCF, such as 1000Genomes
497 bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz
498
499 # If the tags AN,AC are not already present, use the +fill-AN-AC plugin
500 bcftools +fill-AN-AC 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz
501 tabix -s1 -b2 -e2 AFs.tab.gz
502
503 # Create a VCF header description, here we name the tags REF_AN,REF_AC
504 cat AFs.hdr
505 ##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes">
506 ##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele">
507
508 # Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies
509 bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]
510
511 -g, --gvcf INT
512 output also gVCF blocks of homozygous REF calls. The parameter
513 INT is the minimum per-sample depth required to include a site
514 in the non-variant block.
515
516 -i, --insert-missed INT
517 output also sites missed by mpileup but present in -T,
518 --targets-file.
519
520 -M, --keep-masked-ref
521 output sites where REF allele is N
522
523 -V, --skip-variants snps|indels
524 skip indel/SNP sites
525
526 -v, --variants-only
527 output variant sites only
528
529 Consensus/variant calling options:
530 -c, --consensus-caller
531 the original samtools/bcftools calling method (conflicts with
532 -m)
533
534 -C, --constrain alleles|trio
535
536 alleles
537 call genotypes given alleles. See also -T, --targets-file.
538
539 trio
540 call genotypes given the father-mother-child constraint.
541 See also -s, --samples and -n, --novel-rate.
542
543 -m, --multiallelic-caller
alternative model for multiallelic and rare-variant calling
545 designed to overcome known limitations in -c calling model
546 (conflicts with -c)
547
548 -n, --novel-rate float[,...]
549 likelihood of novel mutation for constrained -C trio calling.
550 The trio genotype calling maximizes likelihood of a particular
551 combination of genotypes for father, mother and the child
552 P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) *
553 (1-Pn). By providing three values, the mutation rate Pn is set
554 explicitly for SNPs, deletions and insertions, respectively. If
555 two values are given, the first is interpreted as the mutation
556 rate of SNPs and the second is used to calculate the mutation
557 rate of indels according to their length as
558 Pn=float*exp(-a-b*len), where a=22.8689, b=0.2994 for
559 insertions and a=21.9313, b=0.2856 for deletions
560 [pubmed:23975140]. If only one value is given, the same
561 mutation rate Pn is used for SNPs and indels.
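
The length dependence of the indel rate can be checked by evaluating the
formula directly. A minimal sketch with awk, using the constants quoted
above; the helper name indel_rate and the scaling value 1 passed as float
are hypothetical and chosen only for illustration:

```shell
# Pn = float * exp(-a - b*len), with a and b depending on the indel type
indel_rate() {   # usage: indel_rate FLOAT ins|del LEN
    awk -v f="$1" -v type="$2" -v len="$3" 'BEGIN {
        if (type == "ins")      { a = 22.8689; b = 0.2994 }
        else if (type == "del") { a = 21.9313; b = 0.2856 }
        else                    { exit 1 }
        printf "%.6e\n", f * exp(-a - b * len)
    }'
}

indel_rate 1 ins 1     # 1 bp insertion
indel_rate 1 ins 10    # 10 bp insertion: lower rate than the 1 bp case
indel_rate 1 del 1     # 1 bp deletion
```

Under this formula each additional base multiplies the rate by exp(-b),
so longer indels are treated as progressively less likely to be novel.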
562
563 -p, --pval-threshold float
564 with -c, accept variant if P(ref|D) < float.
565
566 -P, --prior float
567 expected substitution rate, or 0 to disable the prior. Only
568 with -m.
569
570 -t, --targets file|chr|chr:pos|chr:from-to|chr:from-[,...]
571 see Common Options
572
573 -X, --chromosome-X
574 haploid output for male samples (requires PED file with -s)
575
576 -Y, --chromosome-Y
577 haploid output for males and skips females (requires PED file
578 with -s)
579
580 bcftools cnv [OPTIONS] FILE
Copy number variation caller, requires a VCF annotated with
Illumina’s B-allele frequency (BAF) and Log R Ratio intensity (LRR)
values. The HMM considers the following copy number states: CN 2
584 (normal), 1 (single-copy loss), 0 (complete loss), 3 (single-copy
585 gain).
586
587 General Options:
588 -c, --control-sample string
589 optional control sample name. If given, pairwise calling is
590 performed and the -P option can be used
591
592 -f, --AF-file file
593 read allele frequencies from a tab-delimited file with the
594 columns CHR,POS,REF,ALT,AF
595
596 -o, --output-dir path
597 output directory
598
599 -p, --plot-threshold float
600 call matplotlib to produce plots for chromosomes with quality
601 at least float, useful for visual inspection of the calls. With
602 -p 0, plots for all chromosomes will be generated. If not
603 given, a matplotlib script will be created but not called.
604
605 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
606 see Common Options
607
608 -R, --regions-file file
609 see Common Options
610
611 -s, --query-sample string
query sample name
613
614 -t, --targets LIST
615 see Common Options
616
617 -T, --targets-file FILE
618 see Common Options
619
620 HMM Options:
621 -a, --aberrant float[,float]
622 fraction of aberrant cells in query and control. The hallmark
623 of duplications and contaminations is the BAF value of
624 heterozygous markers which is dependent on the fraction of
625 aberrant cells. Sensitivity to smaller fractions of cells can
626 be increased by setting -a to a lower value. Note however, that
627 this comes at the cost of increased false discovery rate.
628
629 -b, --BAF-weight float
630 relative contribution from BAF
631
632 -d, --BAF-dev float[,float]
633 expected BAF deviation in query and control, i.e. the noise
634 observed in the data.
635
636 -e, --err-prob float
637 uniform error probability
638
639 -l, --LRR-weight float
relative contribution from LRR. With noisy data, this option
can have a big effect on the number of calls produced. In truly
642 random noise (such as in simulated data), the value should be
643 set high (1.0), but in the presence of systematic noise when
644 LRR are not informative, lower values result in cleaner calls
645 (0.2).
646
647 -L, --LRR-smooth-win int
648 reduce LRR noise by applying moving average given this window
649 size
650
651 -O, --optimize float
652 iteratively estimate the fraction of aberrant cells, down to
653 the given fraction. Lowering this value from the default 1.0 to
654 say, 0.3, can help discover more events but also increases
655 noise
656
657 -P, --same-prob float
658 the prior probability of the query and the control sample being
659 the same. Setting to 0 calls both independently, setting to 1
660 forces the same copy number state in both.
661
662 -x, --xy-prob float
663 the HMM probability of transition to another copy number state.
Increasing this value leads to smaller and more frequent
665 calls.
666
667 bcftools concat [OPTIONS] FILE1 FILE2 [...]
668 Concatenate or combine VCF/BCF files. All source files must have the
669 same sample columns appearing in the same order. Can be used, for
670 example, to concatenate chromosome VCFs into one VCF, or combine a SNP
671 VCF and an indel VCF into one. The input files must be sorted by chr
672 and position. The files must be given in the correct order to produce
673 sorted VCF on output unless the -a, --allow-overlaps option is
674 specified. With the --naive option, the files are concatenated without
675 being recompressed, which is very fast but dangerous if the BCF headers
676 differ.
677
678 -a, --allow-overlaps
679 First coordinate of the next file can precede last record of the
680 current file.
681
682 -c, --compact-PS
683 Do not output PS tag at each site, only at the start of a new phase
684 set block.
685
686 -d, --rm-dups snps|indels|both|all|none
687 Output duplicate records of specified type present in multiple
688 files only once. Requires -a, --allow-overlaps.
689
690 -D, --remove-duplicates
691 Alias for -d none
692
693 -f, --file-list FILE
694 Read file names from FILE, one file name per line.
695
696 -l, --ligate
697 Ligate phased VCFs by matching phase at overlapping haplotypes
698
699 --no-version
700 see Common Options
701
702 -n, --naive
703 Concatenate VCF or BCF files without recompression. This is very
704 fast but requires that all files are of the same type (all VCF or
705 all BCF) and have the same headers. This is because all tags and
706 chromosome names in the BCF body rely on the implicit order of the
707 contig and tag definitions in the header. Currently no sanity
708 checks are in place. Dangerous, use with caution.
709
710 -o, --output FILE
711 see Common Options
712
713 -O, --output-type b|u|z|v
714 see Common Options
715
716 -q, --min-PQ INT
717 Break phase set if phasing quality is lower than INT
718
719 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
720 see Common Options. Requires -a, --allow-overlaps.
721
722 -R, --regions-file FILE
723 see Common Options. Requires -a, --allow-overlaps.
724
725 --threads INT
726 see Common Options
727
728 bcftools consensus [OPTIONS] FILE
729 Create consensus sequence by applying VCF variants to a reference fasta
730 file. By default, the program will apply all ALT variants to the
731 reference fasta to obtain the consensus sequence. Using the --sample
732 (and, optionally, --haplotype) option will apply genotype (haplotype)
733 calls from FORMAT/GT. Note that the program does not act as a primitive
734 variant caller and ignores allelic depth information, such as INFO/AD
735 or FORMAT/AD. For that, consider using the setGT plugin.
736
737 -c, --chain FILE
738 write a chain file for liftover
739
740 -e, --exclude EXPRESSION
741 exclude sites for which EXPRESSION is true. For valid expressions
742 see EXPRESSIONS.
743
744 -f, --fasta-ref FILE
745 reference sequence in fasta format
746
747 -H, --haplotype 1|2|R|A|LR|LA|SR|SA
748 choose which allele from the FORMAT/GT field to use (the codes are
749 case-insensitive):
750
751 1
752 the first allele
753
754 2
755 the second allele
756
757 R
758 the REF allele (in heterozygous genotypes)
759
760 A
761 the ALT allele (in heterozygous genotypes)
762
763 LR, LA
764 the longer allele. If both have the same length, use the REF
765 allele (LR), or the ALT allele (LA)
766
767 SR, SA
768 the shorter allele. If both have the same length, use the REF
769 allele (SR), or the ALT allele (SA)
770
This option requires -s, unless exactly one sample is present in the VCF
772
773 -i, --include EXPRESSION
774 include only sites for which EXPRESSION is true. For valid
775 expressions see EXPRESSIONS.
776
777 -I, --iupac-codes
778 output variants in the form of IUPAC ambiguity codes
779
780 -m, --mask FILE
781 BED file or TAB file with regions to be replaced with N. See
782 discussion of --regions-file in Common Options for file format
783 details.
784
785 -M, --missing CHAR
786 instead of skipping the missing genotypes, output the character
787 CHAR (e.g. "?")
788
789 -o, --output FILE
790 write output to a file
791
792 -s, --sample NAME
793 apply variants of the given sample
794
795 Examples:
796
# Apply variants present in sample "NA001", output IUPAC codes for hets
bcftools consensus -I -s NA001 -f in.fa in.vcf.gz > out.fa
799
800 # Create consensus for one region. The fasta header lines are then expected
801 # in the form ">chr:from-to".
802 samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa
803
804 bcftools convert [OPTIONS] FILE
805 VCF input options:
806 -e, --exclude EXPRESSION
807 exclude sites for which EXPRESSION is true. For valid
808 expressions see EXPRESSIONS.
809
810 -i, --include EXPRESSION
811 include only sites for which EXPRESSION is true. For valid
812 expressions see EXPRESSIONS.
813
814 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
815 see Common Options
816
817 -R, --regions-file FILE
818 see Common Options
819
820 -s, --samples LIST
821 see Common Options
822
823 -S, --samples-file FILE
824 see Common Options
825
826 -t, --targets LIST
827 see Common Options
828
829 -T, --targets-file FILE
830 see Common Options
831
832 VCF output options:
833 --no-version
834 see Common Options
835
836 -o, --output FILE
837 see Common Options
838
839 -O, --output-type b|u|z|v
840 see Common Options
841
842 --threads INT
843 see Common Options
844
845 GEN/SAMPLE conversion:
846 -G, --gensample2vcf prefix or gen-file,sample-file
847 convert IMPUTE2 output to VCF. The second column must be of the
848 form "CHROM:POS_REF_ALT" to detect possible strand swaps;
849 IMPUTE2 leaves the first one empty ("--") when sites from
850 reference panel are filled in. See also -g below.
851
852 -g, --gensample prefix or gen-file,sample-file
853 convert from VCF to gen/sample format used by IMPUTE2 and
854 SHAPEIT. The columns of .gen file format are ID1,ID2,POS,A,B
855 followed by three genotype probabilities P(AA), P(AB), P(BB)
856 for each sample. In order to prevent strand swaps, the program
857 uses IDs of the form "CHROM:POS_REF_ALT". For example:
858
859 .gen
860 ----
861 1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0
862 1:111494194_C_T 1:111494194_C_T 111494194 C T 0 1 0 0 0 1
863
864 .sample
865 -------
866 ID_1 ID_2 missing
867 0 0 0
868 sample1 sample1 0
869 sample2 sample2 0
870
871 --tag STRING
872 tag to take values for .gen file: GT,PL,GL,GP
873
874 --chrom
875 output chromosome in the first column instead of
876 CHROM:POS_REF_ALT
877
878 --sex FILE
879 output sex column in the sample file. The FILE format is
880
881 MaleSample M
882 FemaleSample F
883
884 --vcf-ids
885 output VCF IDs in the second column instead of
886 CHROM:POS_REF_ALT
887
888 gVCF conversion:
889 --gvcf2vcf
890 convert gVCF to VCF, expanding REF blocks into sites. Note that
891 the -i and -e options work differently with this switch. In
892 this situation the filtering expressions define which sites
893 should be expanded and which sites should be left unmodified,
894 but all sites are printed on output. In order to drop sites,
895 stream first through bcftools view.
896
897 -f, --fasta-ref file
898 reference sequence in fasta format. Must be indexed with
899 samtools faidx
900
901 HAP/SAMPLE conversion:
902 --hapsample2vcf prefix or hap-file,sample-file
903 convert from hap/sample format to VCF. The columns of .hap file
904 are similar to .gen file above, but there are only two
905 haplotype columns per sample. Note that the first column of the
906 .hap file is expected to be in the form
907 "CHR:POS_REF_ALT(_END)?", with the _END being optional for
908 defining the INFO/END tag when ALT is a symbolic allele, for
909 example:
910
911 .hap
912 ----
913 1:111485207_G_A rsID1 111485207 G A 0 1 0 0
914 1:111494194_C_T rsID2 111494194 C T 0 1 0 0
915 1:111495231_A_<DEL>_111495784 rsID3 111495231 A <DEL> 0 0 1 0
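
The identifier layout described above can be illustrated with a short Python sketch. It is not part of bcftools; the function name parse_variant_id and the returned dictionary keys are hypothetical, and the pattern assumes that REF contains no underscore and that a trailing all-digit field is the END coordinate.

```python
import re

# Pattern for identifiers of the form CHR:POS_REF_ALT(_END)?
# Assumption: REF has no underscore; a trailing "_<digits>" is the END value.
_ID_RE = re.compile(r"([^:]+):(\d+)_([^_]+)_(.+?)(?:_(\d+))?")


def parse_variant_id(variant_id):
    """Split a CHR:POS_REF_ALT(_END)? identifier into its components."""
    match = _ID_RE.fullmatch(variant_id)
    if match is None:
        raise ValueError(f"unexpected identifier: {variant_id!r}")
    chrom, pos, ref, alt, end = match.groups()
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        # END is only present for symbolic alleles
        "end": int(end) if end is not None else None,
    }
```

Applied to the first .hap row above, the sketch yields chrom "1", pos 111485207, ref "G", alt "A" and no END; for the symbolic-allele row it additionally returns end 111495784.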
916
917 --hapsample prefix or hap-file,sample-file
918 convert from VCF to hap/sample format used by IMPUTE2 and
919 SHAPEIT. The columns of .hap file begin with
920 ID,RSID,POS,REF,ALT. In order to prevent strand swaps, the
921 program uses IDs of the form "CHROM:POS_REF_ALT".
922
923 --haploid2diploid
924 with -h option converts haploid genotypes to homozygous diploid
925 genotypes. For example, the program will print 0 0 instead of
926 the default 0 -. This is useful for programs which do not
927 handle haploid genotypes correctly.
928
929 --sex FILE
930 output sex column in the sample file. The FILE format is
931
932 MaleSample M
933 FemaleSample F
934
935 --vcf-ids
936 output VCF IDs instead of "CHROM:POS_REF_ALT" IDs
937
938 HAP/LEGEND/SAMPLE conversion:
939 -H, --haplegendsample2vcf prefix or
940 hap-file,legend-file,sample-file
941 convert from hap/legend/sample format used by IMPUTE2 to VCF,
942 see also -h, --haplegendsample below.
943
944 -h, --haplegendsample prefix or hap-file,legend-file,sample-file
945 convert from VCF to hap/legend/sample format used by IMPUTE2
946 and SHAPEIT. The columns of .legend file are ID,POS,REF,ALT. In
947 order to prevent strand swaps, the program uses IDs of the form
948 "CHROM:POS_REF_ALT". The .sample file is quite basic at the
949 moment with columns for population, group and sex expected to
950 be edited by the user. For example:
951
952 .hap
953 -----
954 0 1 0 0 1 0
955 0 1 0 0 0 1
956
957 .legend
958 -------
959 id position a0 a1
960 1:111485207_G_A 111485207 G A
961 1:111494194_C_T 111494194 C T
962
963 .sample
964 -------
965 sample population group sex
966 sample1 sample1 sample1 2
967 sample2 sample2 sample2 2
968
969 --haploid2diploid
970 with -h option converts haploid genotypes to homozygous diploid
971 genotypes. For example, the program will print 0 0 instead of
972 the default 0 -. This is useful for programs which do not
973 handle haploid genotypes correctly.
974
975 --sex FILE
976 output sex column in the sample file. The FILE format is
977
978 MaleSample M
979 FemaleSample F
980
981 --vcf-ids
982 output VCF IDs instead of "CHROM:POS_REF_ALT" IDs
983
984 TSV conversion:
985 --tsv2vcf file
986 convert from TSV (tab-separated values) format (such as
987 generated by 23andMe) to VCF. The input file fields can be tab-
988 or space-delimited.
989
990 -c, --columns list
991 comma-separated list of fields in the input file. In the
992 current version, the fields CHROM, POS, ID, and AA are expected
993 and can appear in arbitrary order, columns which should be
994 ignored in the input file can be indicated by "-". The AA field
995 lists alleles on the forward reference strand, for example "CC"
996 or "CT" for diploid genotypes or "C" for haploid genotypes (sex
997 chromosomes). Insertions and deletions are not supported yet,
998 missing data can be indicated with "--".
999
1000 -f, --fasta-ref file
1001 reference sequence in fasta format. Must be indexed with
1002 samtools faidx
1003
1004 -s, --samples LIST
1005 list of sample names. See Common Options
1006
1007 -S, --samples-file FILE
1008 file of sample names. See Common Options
1009
1010 Example:
1011
1012 # Convert 23andme results into VCF
1013 bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -Oz -o out.vcf.gz
1014
1015 bcftools csq [OPTIONS] FILE
1016 Haplotype aware consequence predictor which correctly handles combined
1017 variants such as MNPs split over multiple VCF records, SNPs separated
1018 by an intron (but adjacent in the spliced transcript) or nearby
1019 frame-shifting indels which in combination in fact are not
1020 frame-shifting.
1021
1022 The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tags
1023 (configurable with the -c option). The latter is a bitmask of indexes
1024 to INFO/BCSQ, with interleaved haplotypes. See the usage examples below
1025 for using the %TBCSQ converter in query for extracting a more human
1026 readable form from this bitmask. The construction of the bitmask limits
1027 the number of consequences that can be referenced in the FORMAT/BCSQ
1028 tags. By default this is 16, but if more are required, see the --ncsq
1029 option.
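
As a rough illustration of the bitmask described above, the following Python sketch decodes one FORMAT/BCSQ integer into per-haplotype lists of INFO/BCSQ indexes. It is not the bcftools implementation: the function name is hypothetical, and the bit layout (bit 2*i+h flags consequence i on haplotype h) is an assumption about what "interleaved haplotypes" means. Under that assumption a 32-bit integer holds 16 consequences for a diploid sample, which matches the stated default.

```python
def decode_bcsq_mask(mask, n_haplotypes=2):
    """Return, per haplotype, the INFO/BCSQ indexes flagged in one bitmask.

    Assumed layout: bit (n_haplotypes * i + h) refers to consequence i
    on haplotype h.
    """
    mask &= 0xFFFFFFFF  # treat the VCF integer as an unsigned 32-bit value
    per_haplotype = [[] for _ in range(n_haplotypes)]
    for bit in range(32):
        if (mask >> bit) & 1:
            per_haplotype[bit % n_haplotypes].append(bit // n_haplotypes)
    return per_haplotype
```

For example, under this layout the value 6 (binary 110) refers to consequence 1 on the first haplotype and consequence 0 on the second. In practice the %TBCSQ converter of bcftools query does this translation.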
1030
1031 The program requires on input a VCF/BCF file, the reference genome in
1032 fasta format (--fasta-ref) and genomic features in the GFF3 format
1033 downloadable from the Ensembl website (--gff-annot), and outputs an
1034 annotated VCF/BCF file. Currently, only Ensembl GFF3 files are
1035 supported.
1036
1037 By default, the input VCF should be phased. If phase is unknown, or
1038 only partially known, the --phase option can be used to indicate how to
1039 handle unphased data. Alternatively, haplotype aware calling can be
1040 turned off with the --local-csq option.
1041
1042 If conflicting (overlapping) variants within one haplotype are
1043 detected, a warning will be emitted and predictions will be based on
1044 only the first variant in the analysis.
1045
1046 Symbolic alleles are not supported. They will remain unannotated in the
1047 output VCF and are ignored for the prediction analysis.
1048
1049 -c, --custom-tag STRING
1050 use this custom tag to store consequences rather than the default
1051 BCSQ tag
1052
1053 -e, --exclude EXPRESSION
1054 exclude sites for which EXPRESSION is true. For valid expressions
1055 see EXPRESSIONS.
1056
1057 -f, --fasta-ref FILE
1058 reference sequence in fasta format (required)
1059
1060 --force
1061 run even if some sanity checks fail. Currently the option allows
1062 skipping transcripts in malformed GFFs with incorrect phase
1063
1064 -g, --gff-annot FILE
1065 GFF3 annotation file (required), such as
1066 ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens. An example of
1067 a minimal working GFF file:
1068
1069 # The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
1070 # looks up their parent transcript (determined from the "Parent=transcript:" attribute),
1071 # the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
1072 # (the most interesting is "protein_coding").
1073 #
1074 # Attributes required for
1075 # gene lines:
1076 # - ID=gene:<gene_id>
1077 # - biotype=<biotype>
1078 # - Name=<gene_name> [optional]
1079 #
1080 # transcript lines:
1081 # - ID=transcript:<transcript_id>
1082 # - Parent=gene:<gene_id>
1083 # - biotype=<biotype>
1084 #
1085 # other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
1086 # - Parent=transcript:<transcript_id>
1087 #
1088 # Supported biotypes:
1089 # - see the function gff_parse_biotype() in bcftools/csq.c
1090
1091 1 ignored_field gene 21 2148 . - . ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1092 1 ignored_field transcript 21 2148 . - . ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1093 1 ignored_field three_prime_UTR 21 2054 . - . Parent=transcript:TranscriptId
1094 1 ignored_field exon 21 2148 . - . Parent=transcript:TranscriptId
1095 1 ignored_field CDS 21 2148 . - 1 Parent=transcript:TranscriptId
1096 1 ignored_field five_prime_UTR 210 2148 . - . Parent=transcript:TranscriptId
1097
1098 -i, --include EXPRESSION
1099 include only sites for which EXPRESSION is true. For valid
1100 expressions see EXPRESSIONS.
1101
1102 -l, --local-csq
1103 switch off haplotype-aware calling, run localized predictions
1104 considering only one VCF record at a time
1105
1106 -n, --ncsq INT
1107 maximum number of consequences to consider per site. The INFO/BCSQ
1108 column includes all consequences, but only the first INT will be
1109 referenced by the FORMAT/BCSQ fields. The default value is 16 which
1110 corresponds to one integer per diploid sample. Note that increasing
1111 the value leads to increased memory and is rarely necessary.
1112
1113 -o, --output FILE
1114 see Common Options
1115
1116 -O, --output-type b|t|u|z|v
1117 see Common Options. In addition, a custom tab-delimited plain text
1118 output can be printed (t).
1119
1120 -p, --phase a|m|r|R|s
1121 how to handle unphased heterozygous genotypes:
1122
1123 a
1124 take GTs as is, create haplotypes regardless of phase (0/1 →
1125 0|1)
1126
1127 m
1128 merge all GTs into a single haplotype (0/1 → 1, 1/2 → 1)
1129
1130 r
1131 require phased GTs, throw an error on unphased heterozygous GTs
1132
1133 R
1134 create non-reference haplotypes if possible (0/1 → 1|1, 1/2 →
1135 1|2)
1136
1137 s
1138 skip unphased heterozygous GTs
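
The modes listed above can be summarised with a small Python sketch. It is modelled only on the examples given in the list and is not the bcftools code; the function name is hypothetical, and the choice of the first non-reference allele in mode "m" is an assumption.

```python
def resolve_unphased(gt, mode):
    """Turn an unphased heterozygous GT such as "0/1" into haplotype alleles.

    Returns a list with one allele per haplotype, or None when the
    genotype is skipped.
    """
    alleles = gt.split("/")
    non_ref = [a for a in alleles if a != "0"]
    if mode == "a":  # take GTs as is: 0/1 -> 0|1
        return alleles
    if mode == "m":  # single merged haplotype: 0/1 -> 1, 1/2 -> 1
        return non_ref[:1]  # assumption: the first non-reference allele wins
    if mode == "r":  # phased GTs are required
        raise ValueError(f"unphased heterozygous genotype: {gt}")
    if mode == "R":  # non-reference haplotypes: 0/1 -> 1|1, 1/2 -> 1|2
        return [a if a != "0" else non_ref[0] for a in alleles]
    if mode == "s":  # skip the genotype
        return None
    raise ValueError(f"unknown mode: {mode}")
```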
1139
1140 -q, --quiet
1141 suppress warning messages
1142
1143 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1144 see Common Options
1145
1146 -R, --regions-file FILE
1147 see Common Options
1148
1149 -s, --samples LIST
1150 samples to include or "-" to apply all variants and ignore samples
1151
1152 -S, --samples-file FILE
1153 see Common Options
1154
1155 -t, --targets LIST
1156 see Common Options
1157
1158 -T, --targets-file FILE
1159 see Common Options
1160
1161 Examples:
1162
1163 # Basic usage
1164 bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf
1165
1166 # Extract the translated haplotype consequences. The following TBCSQ variations
1167 # are recognised:
1168 # %TBCSQ .. print consequences in all haplotypes in separate columns
1169 # %TBCSQ{0} .. print the first haplotype only
1170 # %TBCSQ{1} .. print the second haplotype only
1171 # %TBCSQ{*} .. print a list of unique consequences present in either haplotype
1172 bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf
1173
1174 Examples of BCSQ annotation:
1175
1176 # Two separate VCF records at positions 2:122106101 and 2:122106102
1177 # change the same codon. This UV-induced C>T dinucleotide mutation
1178 # has been annotated fully at the position 2:122106101 with
1179 # - consequence type
1180 # - gene name
1181 # - ensembl transcript ID
1182 # - coding strand (+ fwd, - rev)
1183 # - amino acid position (in the coding strand orientation)
1184 # - list of corresponding VCF variants
1185 # The annotation at the second position gives the position of the full
1186 # annotation
1187 BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
1188 BCSQ=@122106101
1189
1190 # A frame-restoring combination of two frameshift insertions C>CG and T>TGG
1191 BCSQ=@46115084
1192 BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG
1193
1194 # Stop gained variant
1195 BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T
1196
1197 # The consequence type of a variant downstream from a stop is prefixed with *
1198 BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C
1199
1200 bcftools filter [OPTIONS] FILE
1201 Apply fixed-threshold filters.
1202
1203 -e, --exclude EXPRESSION
1204 exclude sites for which EXPRESSION is true. For valid expressions
1205 see EXPRESSIONS.
1206
1207 -g, --SnpGap INT
1208 filter SNPs within INT base pairs of an indel. The following
1209 example demonstrates the logic of --SnpGap 3 applied on a deletion
1210 and an insertion:
1211
1212 The SNPs at positions 1 and 7 are filtered, positions 0 and 8 are not:
1213 0123456789
1214 ref .G.GT..G..
1215 del .A.G-..A..
1216 Here the positions 1 and 6 are filtered, 0 and 7 are not:
1217 0123-456789
1218 ref .G.G-..G..
1219 ins .A.GT..A..
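
The two diagrams above can be reproduced with the following Python sketch. It only mirrors the documented example and makes no claim about the actual implementation; the function name is hypothetical, positions use the same 0-based numbering as the diagrams, and indels are assumed to be left-anchored as in VCF (REF=GT, ALT=G for the deletion; REF=G, ALT=GT for the insertion, both at position 3).

```python
def snpgap_window(pos, ref, alt, gap):
    """Inclusive range of SNP positions filtered around a left-anchored indel."""
    if len(ref) > len(alt):
        # Deletion: distances are measured from the deleted bases themselves.
        first_deleted = pos + len(alt)
        last_deleted = pos + len(ref) - 1
        return first_deleted - gap, last_deleted + gap
    if len(alt) > len(ref):
        # Insertion: distances are measured from the junction between the
        # two reference bases flanking the inserted sequence.
        left_flank = pos + len(ref) - 1
        return left_flank - gap + 1, left_flank + gap
    raise ValueError("REF and ALT have equal length: not an indel")
```

With gap 3 the deletion gives the window 1..7 and the insertion gives 1..6, in agreement with the positions marked as filtered above.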
1220
1221 -G, --IndelGap INT
1222 filter clusters of indels separated by INT or fewer base pairs
1223 allowing only one to pass. The following example demonstrates the
1224 logic of --IndelGap 2 applied on a deletion and an insertion:
1225
1226 The second indel is filtered:
1227 012345678901
1228 ref .GT.GT..GT..
1229 del .G-.G-..G-..
1230 And similarly here, the second is filtered:
1231 01 23 456 78
1232 ref .A-.A-..A-..
1233 ins .AT.AT..AT..
1234
1235 -i, --include EXPRESSION
1236 include only sites for which EXPRESSION is true. For valid
1237 expressions see EXPRESSIONS.
1238
1239 -m, --mode [+x]
1240 define behaviour at sites with existing FILTER annotations. The
1241 default mode replaces existing filters of failed sites with a new
1242 FILTER string while leaving sites which pass untouched when
1243 non-empty and setting to "PASS" when the FILTER string is absent.
1244 The "+" mode appends new FILTER strings of failed sites instead of
1245 replacing them. The "x" mode resets filters of sites which pass to
1246 "PASS". Modes "+" and "x" can both be set.
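
The three modes can be summarised with a minimal Python sketch of how the FILTER value of a single site might be updated. This is an interpretation of the description above rather than the bcftools code; the function and parameter names are hypothetical, and dropping an existing "PASS" when appending in "+" mode is an assumption.

```python
def update_filter(existing, failed, new_filter, mode=""):
    """Return the new FILTER value (a list of names) for one site.

    existing   -- current FILTER entries; an empty list stands for "."
    failed     -- True when the site fails the filtering expression
    new_filter -- filter name, e.g. the STRING given to -s/--soft-filter
    mode       -- "", "+", "x" or "+x"
    """
    if failed:
        # Default mode replaces old filters; "+" keeps them and appends.
        kept = [f for f in existing if f != "PASS"] if "+" in mode else []
        if new_filter not in kept:
            kept.append(new_filter)
        return kept
    if "x" in mode or not existing:
        # "x" resets passing sites; an absent FILTER always becomes PASS.
        return ["PASS"]
    return list(existing)  # passing site with existing filters: untouched
```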
1247
1248 --no-version
1249 see Common Options
1250
1251 -o, --output FILE
1252 see Common Options
1253
1254 -O, --output-type b|u|z|v
1255 see Common Options
1256
1257 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1258 see Common Options
1259
1260 -R, --regions-file file
1261 see Common Options
1262
1263 -s, --soft-filter STRING|+
1264 annotate FILTER column with STRING or, with +, a unique filter name
1265 generated by the program ("Filter%d").
1266
1267 -S, --set-GTs .|0
1268 set genotypes of failed samples to missing value (.) or reference
1269 allele (0)
1270
1271 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1272 see Common Options
1273
1274 -T, --targets-file file
1275 see Common Options
1276
1277 --threads INT
1278 see Common Options
1279
1280 bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz
1281 Checks sample identity. The program can operate in two modes. If the -g
1282 option is given, the identity of the -s sample from query.vcf.gz is
1283 checked against the samples in the -g file. Without the -g option,
1284 multi-sample cross-check of samples in query.vcf.gz is performed.
1285
1286 -a, --all-sites
1287 output for all sites
1288
1289 -c, --cluster FLOAT,FLOAT
1290 min inter- and max intra-sample error [0.23,-0.3]
1291
1292 The first "min" argument controls the typical error rate in multiplexed
1293 runs ("lanelets") from the same sample. Lanelets with error rate less
1294 than this will always be considered as coming from the same sample.
1295 The second "max" argument is the reverse: lanelets with error rate
1296 greater than the absolute value of this parameter will always be
1297 considered as different samples. When the value is negative, the cutoff
1298 may be heuristically lowered by the clustering engine. If positive, the
1299 value is interpreted as a fixed cutoff.
1300
1301 -g, --genotypes genotypes.vcf.gz
1302 reference genotypes to compare against
1303
1304 -G, --GTs-only INT
1305 use genotypes (GT) instead of genotype likelihoods (PL). When set
1306 to 1, reported discordance is the number of non-matching GTs,
1307 otherwise the number INT is interpreted as phred-scaled likelihood
1308 of unobserved genotypes.
1309
1310 -H, --homs-only
1311 consider only genotypes which are homozygous in both genotypes and
1312 query VCF. This may be useful with low coverage data.
1313
1314 -p, --plot PREFIX
1315 produce plots
1316
1317 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1318 see Common Options
1319
1320 -R, --regions-file file
1321 see Common Options
1322
1323 -s, --query-sample STRING
1324 query sample in query.vcf.gz. By default, the first sample is
1325 checked.
1326
1327 -S, --target-sample STRING
1328 target sample in the -g file, used only for plotting, not for
1329 analysis
1330
1331 -t, --targets file
1332 see Common Options
1333
1334 -T, --targets-file file
1335 see Common Options
1336
1337 Output files format:
1338 CN, Discordance
1339 Pairwise discordance for all sample pairs is calculated as
1340
1341 \sum_s { min_G { PL_a(G) + PL_b(G) } },
1342
1343 where the sum runs over all sites s and G is the most
1344 likely genotype shared by both samples a and b. When PL field
1345 is not present, a constant value 99 is used for the unseen
1346 genotypes. With -G, the value 1 can be used instead; the
1347 discordance value then gives exactly the number of differing
1348 genotypes.
1349
1350 ERR, error rate
1351 Pairwise error rate calculated as number of differences divided
1352 by the total number of comparisons.
1353
1354 CLUSTER, TH, DOT
1355 In presence of multiple samples, related samples and outliers
1356 can be identified by clustering samples by error rate. A simple
1357 hierarchical clustering based on minimization of standard
1358 deviation is used. This is useful to detect sample swaps, for
1359 example in situations where one sample has been sequenced in
1360 multiple runs.
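
The discordance and error-rate definitions above can be written out as a short Python sketch. It is a direct transcription of the formula under simplifying assumptions (every site lists PL values for the same ordered set of genotypes); the function names are hypothetical and not part of bcftools.

```python
def discordance(pls_a, pls_b):
    """Sum over sites of min_G { PL_a(G) + PL_b(G) }.

    pls_a, pls_b -- one list of PL values per site, indexed by genotype G.
    """
    return sum(
        min(a + b for a, b in zip(site_a, site_b))
        for site_a, site_b in zip(pls_a, pls_b)
    )


def gt_to_pl(gt_index, n_genotypes, unseen=99):
    """Pseudo-PLs for a hard call: 0 for the called genotype, else `unseen`."""
    return [0 if g == gt_index else unseen for g in range(n_genotypes)]


def error_rate(n_differences, n_comparisons):
    """Number of differences divided by the total number of comparisons."""
    return n_differences / n_comparisons
```

When the pseudo-PLs are built with unseen=1, as with -G 1, each site contributes 0 if both samples carry the same genotype and 1 otherwise, so the sum equals the number of differing genotypes.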
1361
1362 bcftools index [OPTIONS] in.bcf|in.vcf.gz
1363 Creates index for bgzip compressed VCF/BCF files for random access. CSI
1364 (coordinate-sorted index) is created by default. The CSI format
1365 supports indexing of chromosomes up to length 2^31. TBI (tabix index)
1366 index files, which support chromosome lengths up to 2^29, can be
1367 created by using the -t/--tbi option or using the tabix program
1368 packaged with htslib. When loading an index file, bcftools will try the
1369 CSI first and then the TBI.
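
The TBI limit quoted above follows from the binning scheme: the smallest bin spans 2^min_shift bases and each additional level multiplies the span by 8. The Python sketch below illustrates the arithmetic, assuming the fixed tabix parameters min_shift=14 and depth=5; the function names are hypothetical, and the 2^31 figure for CSI is not derived here.

```python
TBI_MIN_SHIFT = 14  # smallest bin covers 2^14 bases
TBI_DEPTH = 5       # number of levels above the smallest bins


def max_indexable_length(min_shift, depth):
    """Longest contig covered by a binning index with the given parameters."""
    return 1 << (min_shift + 3 * depth)


def csi_depth_for(contig_length, min_shift=14):
    """Smallest depth whose top-level bin spans the whole contig."""
    depth, span = 0, 1 << min_shift
    while contig_length > span:
        depth += 1
        span <<= 3  # each level is eight times wider than the previous one
    return depth
```

With the tabix parameters, max_indexable_length returns 2^29; a contig one base longer needs depth 6, which only CSI can express.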
1370
1371 Indexing options:
1372 -c, --csi
1373 generate CSI-format index for VCF/BCF files [default]
1374
1375 -f, --force
1376 overwrite index if it already exists
1377
1378 -m, --min-shift INT
1379 set minimal interval size for CSI indices to 2^INT; default: 14
1380
1381 -o, --output-file FILE
1382 output file name. If not set, then the index will be created
1383 using the input file name plus a .csi or .tbi extension
1384
1385 -t, --tbi
1386 generate TBI-format index for VCF files
1387
1388 --threads INT
1389 see Common Options
1390
1391 Stats options:
1392 -n, --nrecords
1393 print the number of records based on the CSI or TBI index files
1394
1395 -s, --stats
1396 Print per contig stats based on the CSI or TBI index files.
1397 Output format is three tab-delimited columns listing the contig
1398 name, contig length (. if unknown) and number of records for
1399 the contig. Contigs with zero records are not printed.
1400
1401 bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz [...]
1402 Creates intersections, unions and complements of VCF files. Depending
1403 on the options, the program can output records from one (or more) files
1404 which have (or do not have) corresponding records with the same
1405 position in the other files.
1406
1407 -c, --collapse snps|indels|both|all|some|none
1408 see Common Options
1409
1410 -C, --complement
1411 output positions present only in the first file but missing in the
1412 others
1413
1414 -e, --exclude -|EXPRESSION
1415 exclude sites for which EXPRESSION is true. If -e (or -i) appears
1416 only once, the same filtering expression will be applied to all
1417 input files. Otherwise, -e or -i must be given for each input file.
1418 To indicate that no filtering should be performed on a file, use
1419 "-" in place of EXPRESSION, as shown in the example below. For
1420 valid expressions see EXPRESSIONS.
1421
1422 -f, --apply-filters LIST
1423 see Common Options
1424
1425 -i, --include EXPRESSION
1426 include only sites for which EXPRESSION is true. See discussion of
1427 -e, --exclude above.
1428
1429 -n, --nfiles [+-=]INT|~BITMAP
1430 output positions present in this many (=), this many or more (+),
1431 this many or fewer (-), or the exact same (~) files
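
The selection rule can be sketched in Python as follows. This is an illustration of the option syntax only; the function name is hypothetical, a bare INT is assumed to behave like "=INT", and the first character of the bitmap is assumed to correspond to the first input file.

```python
def nfiles_match(spec, present):
    """Decide whether a site passes an -n/--nfiles specification.

    spec    -- e.g. "+2", "-1", "=2" or "~1100"
    present -- one bool per input file, True where the site occurs
    """
    present = list(present)
    if spec.startswith("~"):
        bitmap = spec[1:]
        if len(bitmap) != len(present):
            raise ValueError("bitmap length must equal the number of files")
        return [c == "1" for c in bitmap] == present
    op = spec[0] if spec[0] in "+-=" else "="
    n = int(spec.lstrip("+-="))
    count = sum(present)
    if op == "+":
        return count >= n  # this many or more
    if op == "-":
        return count <= n  # this many or fewer
    return count == n      # exactly this many
```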
1432
1433 -o, --output FILE
1434 see Common Options. When several files are being output, their
1435 names are controlled via -p instead.
1436
1437 -O, --output-type b|u|z|v
1438 see Common Options
1439
1440 -p, --prefix DIR
1441 if given, subset each of the input files accordingly. See also -w.
1442
1443 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1444 see Common Options
1445
1446 -R, --regions-file file
1447 see Common Options
1448
1449 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1450 see Common Options
1451
1452 -T, --targets-file file
1453 see Common Options
1454
1455 -w, --write LIST
1456 list of input files to output given as 1-based indices. With -p and
1457 no -w, all files are written.
1458
1459 Examples:
1460 Create intersection and complements of two sets saving the output
1461 in dir/*
1462
1463 bcftools isec -p dir A.vcf.gz B.vcf.gz
1464
1465 Filter sites in A (require INFO/MAF>=0.01) and B (require
1466 INFO/dbSNP) but not in C, and create an intersection, including
1467 only sites which appear in at least two of the files after filters
1468 have been applied
1469
1470 bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e- A.vcf.gz B.vcf.gz C.vcf.gz -n +2 -p dir
1471
1472 Extract and write records from A shared by both A and B using exact
1473 allele match
1474
1475 bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
1476
1477 Extract records private to A or B comparing by position only
1478
1479 bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
1480
1481 Print a list of records which are present in A and B but not in C
1482 and D
1483
1484 bcftools isec -n~1100 -c all A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz
1485
1486 bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz [...]
1487 Merge multiple VCF/BCF files from non-overlapping sample sets to create
1488 one multi-sample file. For example, when merging file A.vcf.gz
1489 containing samples S1, S2 and S3 and file B.vcf.gz containing samples
1490 S3 and S4, the output file will contain five samples named S1, S2, S3,
1491 2:S3 and S4.
1492
1493 Note that it is the responsibility of the user to ensure that the sample
1494 names are unique across all files. If they are not, the program will
1495 exit with an error unless the option --force-samples is given. The
1496 sample names can be also given explicitly using the --print-header and
1497 --use-header options.
1498
1499 Note that only records from different files can be merged, never from
1500 the same file. For "vertical" merge take a look at bcftools concat or
1501 bcftools norm -m instead.
1502
1503 --force-samples
1504 if the merged files contain duplicate sample names, proceed
1505 anyway. Duplicate sample names will be resolved by prepending index
1506 of the file as it appeared on the command line to the conflicting
1507 sample name (see 2:S3 in the above example).
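
The renaming rule can be illustrated with a few lines of Python. This is a sketch of the behaviour described above, not the bcftools code; the function name is hypothetical and file indexes are assumed to be 1-based, as the 2:S3 example suggests.

```python
def merged_sample_names(files):
    """Resolve duplicate sample names the way --force-samples is described.

    files -- one list of sample names per input file, in command-line order
    """
    seen, merged = set(), []
    for index, samples in enumerate(files, start=1):
        for name in samples:
            if name in seen:
                # Conflict: prepend the index of the file on the command line.
                name = f"{index}:{name}"
            seen.add(name)
            merged.append(name)
    return merged
```

For the files A.vcf.gz (S1, S2, S3) and B.vcf.gz (S3, S4) from the example above, the sketch returns S1, S2, S3, 2:S3 and S4.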
1508
1509 --print-header
1510 print only merged header and exit
1511
1512 --use-header FILE
1513 use the VCF header in the provided text FILE
1514
1515 -0, --missing-to-ref
1516 assume genotypes at missing sites are 0/0
1517
1518 -f, --apply-filters LIST
1519 see Common Options
1520
1521 -F, --filter-logic x|+
1522 Set the output record to PASS if any of the inputs is PASS (x), or
1523 apply all filters (+), which is the default.
1524
1525 -g, --gvcf -|FILE
1526 merge gVCF blocks, INFO/END tag is expected. If the reference fasta
1527 file FILE is not given and the dash (-) is given, unknown reference
1528 bases generated at gVCF block splits will be substituted with N’s.
1529 The --gvcf option uses the following default INFO rules: -i
1530 QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max.
1531
1532 -i, --info-rules -|TAG:METHOD[,...]
1533 Rules for merging INFO fields (scalars or vectors) or - to disable
1534 the default rules. METHOD is one of sum, avg, min, max, join.
1535 Default is DP:sum,DP4:sum if these fields exist in the input files.
1536 Fields with no specified rule will take the value from the first
1537 input file. The merged QUAL value is currently set to the maximum.
1538 This behaviour is not user controllable at the moment.
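
To make the rule syntax concrete, here is a minimal Python sketch that applies TAG:METHOD rules to scalar INFO values. It is an illustration only: the function names are hypothetical, vector-valued tags such as DP4 are not handled, and a tag without a rule is taken from the first file in which it appears.

```python
METHODS = {
    "sum": sum,
    "min": min,
    "max": max,
    "avg": lambda values: sum(values) / len(values),
    "join": lambda values: ",".join(str(v) for v in values),
}


def parse_rules(spec):
    """Turn "DP:sum,MQ:avg" into {"DP": "sum", "MQ": "avg"}."""
    return dict(item.split(":") for item in spec.split(","))


def merge_info(records, rules):
    """Merge scalar INFO values; records holds one dict per input file."""
    merged = {}
    for record in records:
        for tag in record:
            if tag in merged:
                continue
            values = [r[tag] for r in records if tag in r]
            method = rules.get(tag)
            # Without a rule, keep the value from the first file.
            merged[tag] = METHODS[method](values) if method else values[0]
    return merged
```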
1539
1540 -l, --file-list FILE
1541 Read file names from FILE, one file name per line.
1542
1543 -m, --merge snps|indels|both|all|none|id
1544 The option controls what types of multiallelic records can be
1545 created:
1546
1547 -m none .. no new multiallelics, output multiple records instead
1548 -m snps .. allow multiallelic SNP records
1549 -m indels .. allow multiallelic indel records
1550 -m both .. both SNP and indel records can be multiallelic
1551 -m all .. SNP records can be merged with indel records
1552 -m id .. merge by ID
1553
1554 --no-version
1555 see Common Options
1556
1557 -o, --output FILE
1558 see Common Options
1559
1560 -O, --output-type b|u|z|v
1561 see Common Options
1562
1563 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1564 see Common Options
1565
1566 -R, --regions-file file
1567 see Common Options
1568
1569 --threads INT
1570 see Common Options
1571
1572 bcftools mpileup [OPTIONS] -f ref.fa in.bam [in2.bam [...]]
1573 Generate VCF or BCF containing genotype likelihoods for one or multiple
1574 alignment (BAM or CRAM) files. This is based on the original samtools
1575 mpileup command (with the -v or -g options) producing genotype
1576 likelihoods in VCF or BCF format, but not the textual pileup output.
1577 The mpileup command was transferred to bcftools in order to avoid
1578 errors resulting from use of incompatible versions of samtools and
1579 bcftools when used in the mpileup+bcftools call pipeline.
1580
1581 Individuals are identified from the SM tags in the @RG header lines.
1582 Multiple individuals can be pooled in one alignment file, and one
1583 individual can be split across multiple files. If sample identifiers
1584 are absent, each input file is regarded as one sample.
1585
1586 Note that there are two orthogonal ways to specify locations in the
1587 input file; via -r region and -t positions. The former uses (and
1588 requires) an index to do random access while the latter streams through
1589 the file contents filtering out the specified regions, requiring no
1590 index. The two may be used in conjunction. For example a BED file
1591 containing locations of genes in chromosome 20 could be specified using
1592 -r 20 -t chr20.bed, meaning that the index is used to find chromosome
1593 20 and then it is filtered for the regions listed in the BED file. Also
1594 note that the -r option can be much slower than -t with many regions
1595 and can require more memory when multiple regions and many alignment
1596 files are processed.
1597
1598 Input options
1599 -6, --illumina1.3+
1600 Assume the quality is in the Illumina 1.3+ encoding.
1601
1602 -A, --count-orphans
1603 Do not skip anomalous read pairs in variant calling.
1604
1605 -b, --bam-list FILE
1606 List of input alignment files, one file per line [null]
1607
1608 -B, --no-BAQ
1609 Disable probabilistic realignment for the computation of base
1610 alignment quality (BAQ). BAQ is the Phred-scaled probability of
1611 a read base being misaligned. Applying BAQ (the default) greatly
1612 helps to reduce false SNPs caused by misalignments.
1613
1614 -C, --adjust-MQ INT
1615 Coefficient for downgrading mapping quality for reads
1616 containing excessive mismatches. Given a read with a
1617 phred-scaled probability q of being generated from the mapped
1618 position, the new mapping quality is about
1619 sqrt((INT-q)/INT)*INT. A zero value disables this
1620 functionality; if enabled, the recommended value for BWA is 50.
1621 [0]
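
The approximate formula above can be evaluated with a few lines of Python. This is only the expression from the text, not the full mapping-quality recalculation; the function name is hypothetical and the sketch assumes 0 <= q <= INT.

```python
import math


def adjusted_mapq(q, coeff):
    """Evaluate sqrt((INT - q) / INT) * INT for a coefficient INT > 0."""
    if coeff <= 0:
        raise ValueError("a zero coefficient disables the adjustment")
    if not 0 <= q <= coeff:
        raise ValueError("the sketch assumes 0 <= q <= INT")
    return math.sqrt((coeff - q) / coeff) * coeff
```

For instance, with the coefficient 50 recommended for BWA, q=18 gives sqrt(32/50)*50, which is about 40.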
1622
1623 -d, --max-depth INT
1624 At a position, read at most INT reads per input file. Note
1625 that bcftools has a minimum value of 8000/n where n is the
1626 number of input files given to mpileup. This means the default
1627 is highly likely to be increased. Once above the cross-sample
1628 minimum of 8000 the -d parameter will have an effect. [250]
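
One way to read the rule above is that the per-file limit is the larger of INT and 8000/n. The Python sketch below encodes that reading; it is an interpretation of the text, not the actual implementation, and the function name and the use of integer division are assumptions.

```python
def effective_max_depth(max_depth=250, n_files=1, floor_total=8000):
    """Per-file read limit after applying the 8000/n minimum."""
    return max(max_depth, floor_total // n_files)
```

Under this reading the default of 250 only takes effect once more than 32 files are given, since 8000/32 is exactly 250.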
1629
1630 -E, --redo-BAQ
1631 Recalculate BAQ on the fly, ignore existing BQ tags
1632
1633 -f, --fasta-ref FILE
1634 The faidx-indexed reference file in the FASTA format. The file
1635 can be optionally compressed by bgzip. Reference is required by
1636 default unless the --no-reference option is set [null]
1637
1638 --no-reference
1639 Do not require the --fasta-ref option.
1640
1641 -G, --read-groups FILE
1642 list of read groups to include or exclude if prefixed with "^".
1643 One read group per line. This file can also be used to assign
1644 new sample names to read groups by giving the new sample name
1645 as a second white-space-separated field, like this:
1646 "read_group_id new_sample_name". If the read group name is not
1647 unique, also the bam file name can be included: "read_group_id
1648 file_name sample_name". If all reads from the alignment file
1649 should be treated as a single sample, the asterisk symbol can
1650 be used: "* file_name sample_name". Alignments without a read
1651 group ID can be matched with "?". NOTE: The meaning of
1652 bcftools mpileup -G is the opposite of samtools mpileup -G.
1653
1654 RG_ID_1
1655 RG_ID_2 SAMPLE_A
1656 RG_ID_3 SAMPLE_A
1657 RG_ID_4 SAMPLE_B
1658 RG_ID_5 FILE_1.bam SAMPLE_A
1659 RG_ID_6 FILE_2.bam SAMPLE_A
1660 * FILE_3.bam SAMPLE_C
1661 ? FILE_3.bam SAMPLE_D
1662
1663 -q, --min-MQ INT
1664 Minimum mapping quality for an alignment to be used [0]
1665
1666 -Q, --min-BQ INT
1667 Minimum base quality for a base to be considered [13]
1668
1669 -r, --regions CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,...]
1670 Only generate mpileup output in given regions. Requires the
1671 alignment files to be indexed. If used in conjunction with -t
1672 then considers the intersection; see Common Options
1673
1674 -R, --regions-file FILE
1675 As for -r, --regions, but regions read from FILE; see Common
1676 Options
1677
1678 --ignore-RG
1679 Ignore RG tags. Treat all reads in one alignment file as one
1680 sample.
1681
1682 --rf, --incl-flags STR|INT
1683 Required flags: skip reads with mask bits unset [null]
1684
1685 --ff, --excl-flags STR|INT
1686 Filter flags: skip reads with mask bits set
1687 [UNMAP,SECONDARY,QCFAIL,DUP]
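
The effect of the two masks can be sketched in Python. The flag values are the standard SAM FLAG bits; the function names are hypothetical, and the sketch assumes that a read satisfies --incl-flags when at least one of the required bits is set.

```python
# Standard SAM FLAG bits by name
FLAGS = {
    "PAIRED": 0x1, "PROPER_PAIR": 0x2, "UNMAP": 0x4, "MUNMAP": 0x8,
    "REVERSE": 0x10, "MREVERSE": 0x20, "READ1": 0x40, "READ2": 0x80,
    "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}


def parse_flags(value):
    """Accept an integer, a digit string or comma-separated flag names."""
    if isinstance(value, int):
        return value
    if value.isdigit():
        return int(value)
    mask = 0
    for name in value.split(","):
        mask |= FLAGS[name]
    return mask


DEFAULT_EXCL = parse_flags("UNMAP,SECONDARY,QCFAIL,DUP")


def keep_read(flag, incl=0, excl=DEFAULT_EXCL):
    """True when a read with this FLAG passes both masks."""
    if incl and not (flag & incl):
        return False  # none of the required bits is set (assumed semantics)
    if flag & excl:
        return False  # at least one of the filtered bits is set
    return True
```

The default exclusion mask corresponds to the integer 1796.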
1688
1689 -s, --samples LIST
1690 list of sample names. See Common Options
1691
1692 -S, --samples-file FILE
1693 file of sample names to include or exclude if prefixed with
1694 "^". One sample per line. This file can also be used to rename
1695 samples by giving the new sample name as a second
1696 white-space-separated column, like this: "old_name new_name".
1697 If a sample name contains spaces, the spaces can be escaped
1698 using the backslash character, for example "Not\ a\ good\
1699 sample\ name".
1700
1701 -t, --targets LIST
1702 see Common Options
1703
1704 -T, --targets-file FILE
1705 see Common Options
1706
1707 -x, --ignore-overlaps
1708 Disable read-pair overlap detection.
1709
1710 Output options
1711 -a, --annotate LIST
1712 Comma-separated list of FORMAT and INFO tags to output
1713 (case-insensitive; the "FORMAT/" prefix is optional; use "?"
1714 to list available annotations on the command line) [null]:
1715
1716 *FORMAT/AD* .. Allelic depth (Number=R,Type=Integer)
1717 *FORMAT/ADF* .. Allelic depths on the forward strand (Number=R,Type=Integer)
1718 *FORMAT/ADR* .. Allelic depths on the reverse strand (Number=R,Type=Integer)
1719 *FORMAT/DP* .. Number of high-quality bases (Number=1,Type=Integer)
1720 *FORMAT/SP* .. Phred-scaled strand bias P-value (Number=1,Type=Integer)
1721
1722 *INFO/AD* .. Total allelic depth (Number=R,Type=Integer)
1723 *INFO/ADF* .. Total allelic depths on the forward strand (Number=R,Type=Integer)
1724 *INFO/ADR* .. Total allelic depths on the reverse strand (Number=R,Type=Integer)
1725
1726 *FORMAT/DV* .. Deprecated in favor of FORMAT/AD;
1727 Number of high-quality non-reference bases (Number=1,Type=Integer)
1728 *FORMAT/DP4* .. Deprecated in favor of FORMAT/ADF and FORMAT/ADR;
1729 Number of high-quality ref-forward, ref-reverse,
1730 alt-forward and alt-reverse bases (Number=4,Type=Integer)
1731 *FORMAT/DPR* .. Deprecated in favor of FORMAT/AD;
1732 Number of high-quality bases for each observed allele (Number=R,Type=Integer)
1733 *INFO/DPR* .. Deprecated in favor of INFO/AD;
1734 Number of high-quality bases for each observed allele (Number=R,Type=Integer)
1735
1736 -g, --gvcf INT[,...]
1737 output gVCF blocks of homozygous REF calls, with depth (DP)
1738 ranges specified by the list of integers. For example, passing
1739 5,15 will group sites into two types of gVCF blocks, the first
1740 with minimum per-sample DP from the interval [5,15) and the
1741 latter with minimum depth 15 or more. In this example, sites
1742 with minimum per-sample depth less than 5 will be printed as
1743 separate records, outside of gVCF blocks.
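The grouping rule described for -g can be sketched in Python as follows. This is only an illustration of the depth intervals, not bcftools code: the function name gvcf_block_index is hypothetical, and the thresholds are assumed to be given in increasing order.

```python
from bisect import bisect_right

def gvcf_block_index(min_dp, thresholds):
    """Return the 0-based index of the gVCF block type for a site whose
    minimum per-sample depth is min_dp, or None when the depth is below
    the first threshold (such sites are printed as separate records)."""
    idx = bisect_right(thresholds, min_dp) - 1
    return idx if idx >= 0 else None

# With -g 5,15: depth 4 is outside any block, depths 5..14 fall into the
# first block type and depths of 15 or more into the second.
for dp in (4, 5, 14, 15, 40):
    print(dp, gvcf_block_index(dp, [5, 15]))
```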
1744
1745 --no-version
1746 see Common Options
1747
1748 -o, --output FILE
1749 Write output to FILE, rather than the default of standard
1750 output. (The same short option is used for both --open-prob and
1751 --output. If -o's argument contains any non-digit characters
1752 other than a leading + or - sign, it is interpreted as
1753 --output. Usually the filename extension will take care of
1754 this, but to write to an entirely numeric filename use -o ./123
1755 or --output 123.)
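The rule for telling the two meanings of -o apart can be written as a single pattern match. The sketch below restates the rule given above in Python; the function name classify_o_argument is hypothetical and the code is not taken from bcftools.

```python
import re

# An optional + or - sign followed only by the digits 0 to 9.
_NUMERIC = re.compile(r"[+-]?[0-9]+")

def classify_o_argument(arg):
    """Return 'open-prob' if arg looks like a signed integer,
    otherwise 'output' (arg is then treated as a file name)."""
    return "open-prob" if _NUMERIC.fullmatch(arg) else "output"

for arg in ("40", "+40", "123", "./123", "calls.vcf.gz"):
    print(arg, classify_o_argument(arg))
```

This is why -o ./123 reaches --output: the leading "./" contains non-digit characters.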
1756
1757 -O, --output-type b|u|z|v
1758 see Common Options
1759
1760 --threads INT
1761 see Common Options
1762
1763 Options for SNP/INDEL genotype likelihood computation
1764 -e, --ext-prob INT
1765 Phred-scaled gap extension sequencing error probability.
1766 Reducing INT leads to longer indels [20]
1767
1768 -F, --gap-frac FLOAT
1769 Minimum fraction of gapped reads [0.002]
1770
1771 -h, --tandem-qual INT
1772 Coefficient for modeling homopolymer errors. Given an l-long
1773 homopolymer run, the sequencing error of an indel of size s is
1774 modeled as INT*s/l [100]
1775
1776 -I, --skip-indels
1777 Do not perform INDEL calling
1778
1779 -L, --max-idepth INT
1780 Skip INDEL calling if the average per-sample depth is above INT
1781 [250]
1782
1783 -m, --min-ireads INT
1784 Minimum number of gapped reads for indel candidates [1]
1785
1786 -o, --open-prob INT
1787 Phred-scaled gap open sequencing error probability. Reducing
1788 INT leads to more indel calls. (The same short option is used
1789 for both --open-prob and --output. When -o’s argument contains
1790 only an optional + or - sign followed by the digits 0 to 9, it
1791 is interpreted as --open-prob.) [40]
1792
1793 -p, --per-sample-mF
1794 Apply -m and -F thresholds per sample to increase sensitivity
1795 of calling. By default both options are applied to reads pooled
1796 from all samples.
1797
1798 -P, --platforms STR
1799 Comma-delimited list of platforms (determined by @RG-PL) from
1800 which indel candidates are obtained. It is recommended to
1801 collect indel candidates from sequencing technologies that have
1802 low indel error rate such as ILLUMINA [all]
1803
1804 Examples:
1805 Call SNPs and short INDELs, then mark low quality sites and sites
1806 with the read depth exceeding a limit. (The read depth should be
1807 adjusted to about twice the average read depth as higher read
1808 depths usually indicate problematic regions which are often
1809 enriched for artefacts.) Consider adding -C50 to mpileup if
1810 mapping quality is overestimated for reads containing excessive
1811 mismatches. Applying this option usually helps for BWA-backtrack
1812 alignments, but may not help for other aligners.
1813
1814 bcftools mpileup -Ou -f ref.fa aln.bam | \
1815 bcftools call -Ou -mv | \
1816 bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
1817
1818 bcftools norm [OPTIONS] file.vcf.gz
1819 Left-align and normalize indels, check if REF alleles match the
1820 reference, split multiallelic sites into multiple rows; recover
1821 multiallelics from multiple rows. Left-alignment and normalization will
1822 only be applied if the --fasta-ref option is supplied.
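To make "left-align and normalize" concrete, here is a minimal Python sketch of a commonly described normalization procedure: trim shared trailing bases, extend to the left when an allele becomes empty, then trim shared leading bases. It is an illustration under stated assumptions, not the bcftools implementation. The function name left_align is hypothetical, ref_seq is assumed to hold the whole chromosome sequence in memory, pos is 1-based, and every ALT is assumed to differ from REF.

```python
def left_align(ref_seq, pos, ref, alts):
    """Return (pos, ref, alts) shifted as far left as possible and
    trimmed to the shortest representation."""
    alleles = [ref] + list(alts)
    changed = True
    while changed:
        changed = False
        # Drop a trailing base shared by all alleles.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if not all(alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]
            changed = True
    # Drop shared leading bases while every allele keeps at least one base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]

# A CA deletion reported in the middle of a CA repeat moves to its start.
print(left_align("GGGCACACACAGGG", 8, "CAC", ["C"]))
```

For the example above the deletion reported at position 8 as CAC>C is moved to position 3 as GCA>G.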
1823
1824 -c, --check-ref e|w|x|s
1825 what to do when incorrect or missing REF allele is encountered:
1826 exit (e), warn (w), exclude (x), or set/fix (s) bad sites. The w
1827 option can be combined with x and s. Note that s can swap alleles
1828 and will update genotypes (GT) and AC counts, but will not attempt
1829 to fix PL or other fields.
1830
1831 -d, --rm-dup snps|indels|both|all|none
1832 If a record is present multiple times, output only the first
1833 instance, see --collapse in Common Options.
1834
1835 -D, --remove-duplicates
1836 If a record is present in multiple files, output only the first
1837 instance. Alias for -d none, deprecated.
1838
1839 -f, --fasta-ref FILE
1840 reference sequence. Supplying this option will turn on
1841 left-alignment and normalization, however, see also the
1842 --do-not-normalize option below.
1843
1844 -m, --multiallelics -|+[snps|indels|both|any]
1845 split multiallelic sites into biallelic records (-) or join
1846 biallelic sites into multiallelic records (+). An optional type
1847 string can follow which controls variant types which should be
1848 split or merged together: If only SNP records should be split or
1849 merged, specify snps; if both SNPs and indels should be merged
1850 separately into two records, specify both; if SNPs and indels
1851 should be merged into a single record, specify any.
1852
1853 --no-version
1854 see Common Options
1855
1856 -N, --do-not-normalize
1857 do not normalize indels. The -c s option can be used to fix or set
1858 the REF allele from the reference given with -f; with -N, supplying
1859 -f does not turn on indel normalisation as it normally would
1860
1861 -o, --output FILE
1862 see Common Options
1863
1864 -O, --output-type b|u|z|v
1865 see Common Options
1866
1867 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1868 see Common Options
1869
1870 -R, --regions-file file
1871 see Common Options
1872
1873 -s, --strict-filter
1874 when merging (-m+), merged site is PASS only if all sites being
1875 merged PASS
1876
1877 -t, --targets LIST
1878 see Common Options
1879
1880 -T, --targets-file FILE
1881 see Common Options
1882
1883 --threads INT
1884 see Common Options
1885
1886 -w, --site-win INT
1887 maximum distance between two records to consider when locally
1888 sorting variants which changed position during the realignment
1889
1890 bcftools [plugin NAME|+NAME] [OPTIONS] FILE -- [PLUGIN OPTIONS]
1891 A common framework for various utilities. The plugins can be used the
1892 same way as normal commands, except that their name is prefixed with "+".
1893 Most plugins accept two types of parameters: general options shared by
1894 all plugins, followed by a separator ("--") and a list of plugin-specific
1895 options. There are some exceptions to this rule: some plugins do not
1896 accept the common options and implement their own parameters. Therefore
1897 please pay attention to the usage examples that each plugin comes with.
1898
1899 VCF input options:
1900 -e, --exclude EXPRESSION
1901 exclude sites for which EXPRESSION is true. For valid
1902 expressions see EXPRESSIONS.
1903
1904 -i, --include EXPRESSION
1905 include only sites for which EXPRESSION is true. For valid
1906 expressions see EXPRESSIONS.
1907
1908 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1909 see Common Options
1910
1911 -R, --regions-file file
1912 see Common Options
1913
1914 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1915 see Common Options
1916
1917 -T, --targets-file file
1918 see Common Options
1919
1920 VCF output options:
1921 --no-version
1922 see Common Options
1923
1924 -o, --output FILE
1925 see Common Options
1926
1927 -O, --output-type b|u|z|v
1928 see Common Options
1929
1930 --threads INT
1931 see Common Options
1932
1933 Plugin options:
1934 -h, --help
1935 list plugin’s options
1936
1937 -l, --list-plugins
1938 List all available plugins.
1939
1940 By default, appropriate system directories are searched for
1941 installed plugins. You can override this by setting the
1942 BCFTOOLS_PLUGINS environment variable to a colon-separated list
1943 of directories to search. If BCFTOOLS_PLUGINS begins with a
1944 colon, ends with a colon, or contains adjacent colons, the
1945 system directories are also searched at that position in the
1946 list of directories.
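The search-path rule for BCFTOOLS_PLUGINS can be sketched as below. This is an illustration of the rule as worded above, not bcftools code: the function name plugin_search_dirs and the system_dirs argument are hypothetical, and treating an unset or empty variable as "system directories only" as well as dropping repeated directories are assumptions.

```python
def plugin_search_dirs(env_value, system_dirs):
    """Expand a colon-separated BCFTOOLS_PLUGINS value into an ordered
    list of directories. An empty entry (leading, trailing or adjacent
    colons) stands for the system directories at that position."""
    if not env_value:
        return list(system_dirs)
    dirs = []
    for entry in env_value.split(":"):
        if entry:
            dirs.append(entry)
        else:
            dirs.extend(system_dirs)
    # Keep the first occurrence of each directory, preserving order.
    return list(dict.fromkeys(dirs))

print(plugin_search_dirs("/home/me/plugins:", ["/usr/libexec/bcftools"]))
```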
1947
1948 -v, --verbose
1949 print debugging information to debug plugin failure
1950
1951 -V, --version
1952 print version string and exit
1953
1954 List of plugins coming with the distribution:
1955 GTisec
1956 count genotype intersections across all possible sample subsets
1957 in a vcf file
1958
1959 GTsubset
1960 output only sites where the requested samples all exclusively
1961 share a genotype
1962
1963 ad-bias
1964 find positions with wildly varying ALT allele frequency (Fisher
1965 test on FMT/AD)
1966
1967 af-dist
1968 collect AF deviation stats and GT probability distribution
1969 given AF and assuming HWE
1970
1971 check-ploidy
1972 check if ploidy of samples is consistent for all sites
1973
1974 check-sparsity
1975 print samples without genotypes in a region or chromosome
1976
1977 color-chrs
1978 color shared chromosomal segments, requires trio VCF with
1979 phased GTs
1980
1981 counts
1982 a minimal plugin which counts number of SNPs, Indels, and total
1983 number of sites.
1984
1985 dosage
1986 print genotype dosage. By default the plugin searches for PL,
1987 GL and GT, in that order.
1988
1989 fill-AN-AC
1990 fill INFO fields AN and AC.
1991
1992 fill-from-fasta
1993 fill INFO or REF field based on values in a fasta file
1994
1995 fill-tags
1996 set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, HWE, MAF, NS
1997
1998 fix-ploidy
1999 sets correct ploidy
2000
2001 fixref
2002 determine and fix strand orientation
2003
2004 frameshifts
2005 annotate frameshift indels
2006
2007 guess-ploidy
2008 determine sample sex by checking genotype likelihoods (GL,PL)
2009 or genotypes (GT) in the non-PAR region of chrX.
2010
2011 impute-info
2012 add imputation information metrics to the INFO field based on
2013 selected FORMAT tags
2014
2015 isecGT
2016 compare two files and set non-identical genotypes to missing
2017
2018 mendelian
2019 count Mendelian consistent / inconsistent genotypes.
2020
2021 missing2ref
2022 sets missing genotypes ("./.") to ref allele ("0/0" or "0|0")
2023
2024 prune
2025 prune sites by missingness or linkage disequilibrium
2026
2027 setGT
2028 general tool to set genotypes according to rules requested by
2029 the user
2030
2031 tag2tag
2032 convert between similar tags, such as GL and GP
2033
2034 trio-switch-rate
2035 calculate phase switch rate in trio samples, children samples
2036 must have phased GTs.
2037
2038 Examples:
2039 # List options common to all plugins
2040 bcftools plugin
2041
2042 # List available plugins
2043 bcftools plugin -l
2044
2045 # Run a plugin
2046 bcftools plugin counts in.vcf
2047
2048 # Run a plugin using the abbreviated "+" notation
2049 bcftools +counts in.vcf
2050
2051 # The input VCF can be streamed just like in other commands
2052 cat in.vcf | bcftools +counts
2053
2054 # Print usage information of plugin "dosage"
2055 bcftools +dosage -h
2056
2057 # Replace missing genotypes with 0/0
2058 bcftools +missing2ref in.vcf
2059
2060 # Replace missing genotypes with 0|0
2061 bcftools +missing2ref in.vcf -- -p
2062
2063 Plugins troubleshooting:
2064 Things to check if your plugin does not show up in the bcftools
2065 plugin -l output:
2066
2067 · Run with the -v option for verbose output: bcftools plugin -lv
2068
2069 · Does the environment variable BCFTOOLS_PLUGINS include the
2070 correct path?
2071
2072 Plugins API:
2073 // Short description used by 'bcftools plugin -l'
2074 const char *about(void);
2075
2076 // Longer description used by 'bcftools +name -h'
2077 const char *usage(void);
2078
2079 // Called once at startup, allows initialization of local variables.
2080 // Return 1 to suppress normal VCF/BCF header output, -1 on critical
2081 // errors, 0 otherwise.
2082 int init(int argc, char **argv, bcf_hdr_t *in_hdr, bcf_hdr_t *out_hdr);
2083
2084 // Called for each VCF record, return NULL to suppress the output
2085 bcf1_t *process(bcf1_t *rec);
2086
2087 // Called after all lines have been processed to clean up
2088 void destroy(void);
2089
2090 bcftools polysomy [OPTIONS] file.vcf.gz
2091 Detect the number of chromosomal copies in VCFs annotated with
2092 Illumina's B-allele frequency (BAF) values. Note that this command is
2093 not compiled in by default, see the section Optional Compilation with
2094 GSL in the INSTALL file for help.
2095
2096 General options:
2097 -o, --output-dir path
2098 output directory
2099
2100 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2101 see Common Options
2102
2103 -R, --regions-file file
2104 see Common Options
2105
2106 -s, --sample string
2107 sample name
2108
2109 -t, --targets LIST
2110 see Common Options
2111
2112 -T, --targets-file FILE
2113 see Common Options
2114
2115 -v, --verbose
2116 verbose debugging output which gives hints about the thresholds
2117 and decisions made by the program. Note that the exact output
2118 can change between versions.
2119
2120 Algorithm options:
2121 -b, --peak-size float
2122 the minimum peak size considered a good match; a float from the
2123 interval [0,1], where larger is stricter
2124
2125 -c, --cn-penalty float
2126 a penalty for increasing copy number state. How this works:
2127 multiple peaks are always a better fit than a single peak,
2128 therefore the program prefers a single peak (normal copy
2129 number) unless the absolute deviation of the multiple peaks fit
2130 is significantly smaller. Here the meaning of "significant" is
2131 given by the float from the interval [0,1] where larger is
2132 stricter.
2133
2134 -f, --fit-th float
2135 threshold for goodness of fit (normalized absolute deviation),
2136 smaller is stricter
2137
2138 -i, --include-aa
2139 include also the AA peak in CN2 and CN3 evaluation. This
2140 usually requires increasing -f.
2141
2142 -m, --min-fraction float
2143 minimum distinguishable fraction of aberrant cells.
2144 Experience shows that estimates of 20% and more are
2145 trustworthy.
2146
2147 -p, --peak-symmetry float
2148 a heuristic to filter failed fits where the expected peak
2149 symmetry is violated. The float is from the interval [0,1] and
2150 larger is stricter
2151
2152 bcftools query [OPTIONS] file.vcf.gz [file.vcf.gz [...]]
2153 Extracts fields from VCF or BCF files and outputs them in user-defined
2154 format.
2155
2156 -e, --exclude EXPRESSION
2157 exclude sites for which EXPRESSION is true. For valid expressions
2158 see EXPRESSIONS.
2159
2160 -f, --format FORMAT
2161 learn by example, see below
2162
2163 -H, --print-header
2164 print header
2165
2166 -i, --include EXPRESSION
2167 include only sites for which EXPRESSION is true. For valid
2168 expressions see EXPRESSIONS.
2169
2170 -l, --list-samples
2171 list sample names and exit
2172
2173 -o, --output FILE
2174 see Common Options
2175
2176 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2177 see Common Options
2178
2179 -R, --regions-file file
2180 see Common Options
2181
2182 -s, --samples LIST
2183 see Common Options
2184
2185 -S, --samples-file FILE
2186 see Common Options
2187
2188 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2189 see Common Options
2190
2191 -T, --targets-file file
2192 see Common Options
2193
2194 -u, --allow-undef-tags
2195 do not throw an error if there are undefined tags in the format
2196 string, print "." instead
2197
2198 -v, --vcf-list FILE
2199 process multiple VCFs listed in the file
2200
2201 Format:
2202 %CHROM The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER)
2203 %INFO/TAG Any tag in the INFO column
2204 %TYPE Variant type (REF, SNP, MNP, INDEL, BND, OTHER)
2205 %MASK Indicates presence of the site in other files (with multiple files)
2206 %TAG{INT} Curly brackets to subscript vectors (0-based)
2207 %FIRST_ALT Alias for %ALT{0}
2208 [] Format fields must be enclosed in brackets to loop over all samples
2209 %GT Genotype (e.g. 0/1)
2210 %TBCSQ Translated FORMAT/BCSQ. See the csq command above for explanation and examples.
2211 %TGT Translated genotype (e.g. C/A)
2212 %IUPACGT Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A)
2213 %LINE Prints the whole line
2214 %SAMPLE Sample name
2215 %POS0 POS in 0-based coordinates
2216 %END End position of the REF allele
2217 %END0 End position of the REF allele in 0-based coordinates
2218 \n new line
2219 \t tab character
2220
2221 Everything else is printed verbatim.
2222
2223 Examples:
2224 # Print chromosome, position, ref allele and the first alternate allele
2225 bcftools query -f '%CHROM %POS %REF %ALT{0}\n' file.vcf.gz
2226
2227 # Similar to above, but use tabs instead of spaces, add sample name and genotype
2228 bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
2229
2230 # Print FORMAT/GQ fields followed by FORMAT/GT fields
2231 bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf
2232
2233 # Make a BED file: chr, pos (0-based), end pos (1-based), id
2234 bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
2235
2236 # Print only samples with alternate (non-reference) genotypes
2237 bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i'GT="alt"' file.bcf
2238
2239 # Print all samples at sites with at least one alternate genotype
2240 bcftools view -i'GT="alt"' file.bcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
2241
2242 bcftools reheader [OPTIONS] file.vcf.gz
2243 Modify header of VCF/BCF files, change sample names.
2244
2245 -h, --header FILE
2246 new VCF header
2247
2248 -o, --output FILE
2249 see Common Options
2250
2251 -s, --samples FILE
2252 new sample names, one name per line, in the same order as they
2253 appear in the VCF file. Alternatively, only samples which need to
2254 be renamed can be listed as "old_name new_name\n" pairs separated
2255 by whitespaces, each on a separate line. If a sample name contains
2256 spaces, the spaces can be escaped using the backslash character,
2257 for example "Not\ a\ good\ sample\ name".
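The escaping convention for sample names can be illustrated with a small parser. This is a sketch of the convention as described above, not the parser used by bcftools; the function name parse_sample_line is hypothetical.

```python
import re

# A token is a run of escaped characters (backslash + any character)
# or characters that are neither whitespace nor a backslash.
_TOKEN = re.compile(r"(?:\\.|[^\s\\])+")

def parse_sample_line(line):
    """Split a line on unescaped whitespace and remove the escaping
    backslashes, returning one or two sample names."""
    return [re.sub(r"\\(.)", r"\1", tok) for tok in _TOKEN.findall(line)]

print(parse_sample_line(r"Not\ a\ good\ sample\ name better_name"))
```

A line with a single name yields a one-element list, and an "old_name new_name" pair yields two elements.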
2258
2259 bcftools roh [OPTIONS] file.vcf.gz
2260 A program for detecting runs of homo/autozygosity. Only bi-allelic
2261 sites are considered.
2262
2263 The HMM model:
2264 Notation:
2265 D = Data, AZ = autozygosity, HW = Hardy-Weinberg (non-autozygosity),
2266 f = non-ref allele frequency
2267
2268 Emission probabilities:
2269 oAZ = P_i(D|AZ) = (1-f)*P(D|RR) + f*P(D|AA)
2270 oHW = P_i(D|HW) = (1-f)^2 * P(D|RR) + f^2 * P(D|AA) + 2*f*(1-f)*P(D|RA)
2271
2272 Transition probabilities:
2273 tAZ = P(AZ|HW) .. from HW to AZ, the -a parameter
2274 tHW = P(HW|AZ) .. from AZ to HW, the -H parameter
2275
2276 ci = P_i(C) .. probability of cross-over at site i, from genetic map
2277 AZi = P_i(AZ) .. probability of site i being AZ/non-AZ, scaled so that AZi+HWi = 1
2278 HWi = P_i(HW)
2279
2280 P_{i+1}(AZ) = oAZ * max[(1 - tAZ * ci) * AZ{i-1} , tAZ * ci * (1-AZ{i-1})]
2281 P_{i+1}(HW) = oHW * max[(1 - tHW * ci) * (1-AZ{i-1}) , tHW * ci * AZ{i-1}]
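The two emission probabilities defined above can be evaluated directly. The following Python sketch only transcribes those two formulas; the function name emission_probs is hypothetical, and the genotype likelihoods P(D|RR), P(D|RA) and P(D|AA) are assumed to be supplied as plain probabilities.

```python
def emission_probs(f, p_rr, p_ra, p_aa):
    """Return (oAZ, oHW) for one site.

    f    : non-reference allele frequency
    p_rr : P(D|RR), p_ra : P(D|RA), p_aa : P(D|AA)
    """
    o_az = (1 - f) * p_rr + f * p_aa
    o_hw = (1 - f) ** 2 * p_rr + f ** 2 * p_aa + 2 * f * (1 - f) * p_ra
    return o_az, o_hw

# Data that can only be explained by a heterozygous genotype have zero
# probability under autozygosity.
print(emission_probs(0.5, 0.0, 1.0, 0.0))
```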
2282
2283 General Options:
2284 --AF-dflt FLOAT
2285 in case allele frequency is not known, use the FLOAT. By
2286 default, sites where allele frequency cannot be determined, or
2287 is 0, are skipped.
2288
2289 --AF-tag TAG
2290 use the specified INFO tag TAG as an allele frequency estimate
2291 instead of the default AC and AN tags. Sites which do not have
2292 TAG will be skipped.
2293
2294 --AF-file FILE
2295 Read allele frequencies from a tab-delimited file containing
2296 the columns: CHROM\tPOS\tREF,ALT\tAF. The file can be
2297 compressed with bgzip and indexed with tabix -s1 -b2 -e2. Sites
2298 which are not present in the FILE or have different reference
2299 or alternate allele will be skipped. Note that such a file can
2300 be easily created from a VCF using:
2301
2302 bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%INFO/TAG\n' file.vcf | bgzip -c > freqs.tab.gz
2303
2304 -b, --buffer-size INT[,INT]
2305 when the entire many-sample file cannot fit into memory, a
2306 sliding buffer approach can be used. The first value is the
2307 number of sites to keep in memory. If negative, it is
2308 interpreted as the maximum memory to use, in MB. The second,
2309 optional, value sets the number of overlapping sites. The
2310 default overlap is set to roughly 1% of the buffer size.
2311
2312 -e, --estimate-AF [TAG],FILE
2313 estimate the allele frequency by recalculating INFO/AC and
2314 INFO/AN on the fly, using the specified TAG which can be either
2315 FORMAT/GT ("GT") or FORMAT/PL ("PL"). If TAG is not given, "GT"
2316 is assumed. Either all samples ("-") or samples listed in FILE
2317 will be included. For example, use "PL,-" to estimate AF from
2318 FORMAT/PL of all samples. If neither -e nor the other --AF-...
2319 options are given, the allele frequency is estimated from AC
2320 and AN counts which are already present in the INFO field.
2321
2322 --exclude EXPRESSION
2323 exclude sites for which EXPRESSION is true. For valid
2324 expressions see EXPRESSIONS.
2325
2326 -G, --GTs-only FLOAT
2327 use genotypes (FORMAT/GT fields) ignoring genotype likelihoods
2328 (FORMAT/PL), setting PL of unseen genotypes to FLOAT. Safe
2329 value to use is 30 to account for GT errors.
2330
2331 --include EXPRESSION
2332 include only sites for which EXPRESSION is true. For valid
2333 expressions see EXPRESSIONS.
2334
2335 -I, --skip-indels
2336 skip indels as their genotypes are usually enriched for errors
2337
2338 -m, --genetic-map FILE
2339 genetic map in the format required also by IMPUTE2. Only the
2340 first and third column are used (position and Genetic_Map(cM)).
2341 The FILE can be a single file or a file mask, where the string "{CHROM}" is replaced with the chromosome name.
2342
2343 -M, --rec-rate FLOAT
2344 constant recombination rate per bp. In combination with
2345 --genetic-map, the --rec-rate parameter is interpreted
2346 differently, as FLOAT-fold increase of transition
2347 probabilities, which allows the model to become more sensitive
2348 yet still account for recombination hotspots. Note that also
2349 the range of the values is therefore different in both cases:
2350 normally the parameter will be in the range (1e-3,1e-9) but
2351 with --genetic-map it will be in the range (10,1000).
2352
2353 -o, --output FILE
2354 Write output to the FILE, by default the output is printed on
2355 stdout
2356
2357 -O, --output-type s|r[z]
2358 Generate per-site output (s) or per-region output (r). By
2359 default both types are printed and the output is uncompressed.
2360 Add z for a compressed output.
2361
2362 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2363 see Common Options
2364
2365 -R, --regions-file file
2366 see Common Options
2367
2368 -s, --samples LIST
2369 see Common Options
2370
2371 -S, --samples-file FILE
2372 see Common Options
2373
2374 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2375 see Common Options
2376
2377 -T, --targets-file file
2378 see Common Options
2379
2380 HMM Options:
2381 -a, --hw-to-az FLOAT
2382 P(AZ|HW) transition probability from HW (Hardy-Weinberg) to AZ
2383 (autozygous) state
2384
2385 -H, --az-to-hw FLOAT
2386 P(HW|AZ) transition probability from AZ to HW state
2387
2388 -V, --viterbi-training FLOAT
2389 estimate HMM parameters using Baum-Welch algorithm, using the
2390 convergence threshold FLOAT, e.g. 1e-10 (experimental)
2391
2392 bcftools sort [OPTIONS] file.bcf
2393 -m, --max-mem FLOAT[kMG]
2394 Maximum memory to use. Approximate, affects the number of temporary
2395 files written to the disk. Note that if the command fails at this
2396 step because of too many open files, your system limit on the
2397 number of open files ("ulimit") may need to be increased.
2398
2399 -o, --output FILE
2400 see Common Options
2401
2402 -O, --output-type b|u|z|v
2403 see Common Options
2404
2405 -T, --temp-dir DIR
2406 Use this directory to store temporary files
2407
2408 bcftools stats [OPTIONS] A.vcf.gz [B.vcf.gz]
2409 Parses VCF or BCF and produces text file stats which are suitable for
2410 machine processing and can be plotted using plot-vcfstats. When two
2411 files are given, the program generates separate stats for intersection
2412 and the complements. By default only sites are compared; -s/-S must be
2413 given to include also sample columns. When one VCF file is specified on
2414 the command line, then stats by non-reference allele frequency, depth
2415 distribution, stats by quality and per-sample counts, singleton stats,
2416 etc. are printed. When two VCF files are given, then stats such as
2417 concordance (Genotype concordance by non-reference allele frequency,
2418 Genotype concordance by sample, Non-Reference Discordance) and
2419 correlation are also printed. Per-site discordance (PSD) is also
2420 printed in --verbose mode.
2421
2422 --af-bins LIST|FILE
2423 comma separated list of allele frequency bins (e.g. 0.1,0.5,1) or a
2424 file listing the allele frequency bins one per line (e.g.
2425 0.1\n0.5\n1)
2426
2427 --af-tag TAG
2428 allele frequency INFO tag to use for binning. By default the allele
2429 frequency is estimated from AC/AN, if available, or directly from
2430 the genotypes (GT) if not.
2431
2432 -1, --1st-allele-only
2433 consider only the 1st alternate allele at multiallelic sites
2434
2435 -c, --collapse snps|indels|both|all|some|none
2436 see Common Options
2437
2438 -d, --depth INT,INT,INT
2439 ranges of depth distribution: min, max, and size of the bin
2440
2441 --debug
2442 produce verbose per-site and per-sample output
2443
2444 -e, --exclude EXPRESSION
2445 exclude sites for which EXPRESSION is true. For valid expressions
2446 see EXPRESSIONS.
2447
2448 -E, --exons file.gz
2449 tab-delimited file with exons for indel frameshifts statistics. The
2450 columns of the file are CHR, FROM, TO, with 1-based, inclusive,
2451 positions. The file is BGZF-compressed and indexed with tabix
2452
2453 tabix -s1 -b2 -e3 file.gz
2454
2455 -f, --apply-filters LIST
2456 see Common Options
2457
2458 -F, --fasta-ref ref.fa
2459 faidx indexed reference sequence file to determine INDEL context
2460
2461 -i, --include EXPRESSION
2462 include only sites for which EXPRESSION is true. For valid
2463 expressions see EXPRESSIONS.
2464
2465 -I, --split-by-ID
2466 collect stats separately for sites which have the ID column set
2467 ("known sites") or which do not have the ID column set ("novel
2468 sites").
2469
2470 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2471 see Common Options
2472
2473 -R, --regions-file file
2474 see Common Options
2475
2476 -s, --samples LIST
2477 see Common Options
2478
2479 -S, --samples-file FILE
2480 see Common Options
2481
2482 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2483 see Common Options
2484
2485 -T, --targets-file file
2486 see Common Options
2487
2488 -u, --user-tstv <TAG[:min:max:n]>
2489 collect Ts/Tv stats for any tag using the given binning [0:1:100]
2490
2491 -v, --verbose
2492 produce verbose per-site and per-sample output
2493
2494 bcftools view [OPTIONS] file.vcf.gz [REGION [...]]
2495 View, subset and filter VCF or BCF files by position and filtering
2496 expression. Convert between VCF and BCF. Former bcftools subset.
2497
2498 Output options
2499 -G, --drop-genotypes
2500 drop individual genotype information (after subsetting if -s
2501 option is set)
2502
2503 -h, --header-only
2504 output the VCF header only
2505
2506 -H, --no-header
2507 suppress the header in VCF output
2508
2509 -l, --compression-level [0-9]
2510 compression level. 0 stands for uncompressed, 1 for best speed
2511 and 9 for best compression.
2512
2513 --no-version
2514 see Common Options
2515
2516 -O, --output-type b|u|z|v
2517 see Common Options
2518
2519 -o, --output-file FILE
2520 output file name. If not present, the default is to print to standard output (stdout).
2521
2522 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2523 see Common Options
2524
2525 -R, --regions-file file
2526 see Common Options
2527
2528 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2529 see Common Options
2530
2531 -T, --targets-file file
2532 see Common Options
2533
2534 --threads INT
2535 see Common Options
2536
2537 Subset options:
2538 -a, --trim-alt-alleles
2539 trim alternate alleles not seen in subset. Type A, G and R INFO
2540 and FORMAT fields will also be trimmed
2541
2542 --force-samples
2543 only warn about unknown subset samples
2544
2545 -I, --no-update
2546 do not (re)calculate INFO fields for the subset (currently
2547 INFO/AC and INFO/AN)
2548
2549 -s, --samples LIST
2550 see Common Options
2551
2552 -S, --samples-file FILE
2553 see Common Options
2554
2555 Filter options:
2556 Note that filter options below dealing with counting the number of
2557 alleles will, for speed, first check for the values of AC and AN in
2558 the INFO column to avoid parsing all the genotype (FORMAT/GT)
2559 fields in the VCF. This means that a filter like --min-af 0.1 will
2560 be based on AC/AN, where AC and AN come from either INFO/AC and
2561 INFO/AN if available or FORMAT/GT if not. It will not filter on
2562 another field like INFO/AF. The --include and --exclude filter
2563 expressions should instead be used to explicitly filter based on
2564 fields in the INFO column, e.g. --exclude AF<0.1.
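When INFO/AC and INFO/AN are absent, the counts have to be derived from the genotypes. The Python sketch below shows one way such counts and the non-reference frequency AC/AN could be computed from FORMAT/GT strings. It is an illustration, not bcftools code: the function names are hypothetical, and missing alleles (".") are assumed not to contribute to AN.

```python
import re

def count_alleles(genotypes):
    """Return (ac, an): ac maps each ALT allele index to its count and
    an is the number of called alleles across all genotypes."""
    ac = {}
    an = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing allele, not counted
            an += 1
            idx = int(allele)
            if idx > 0:
                ac[idx] = ac.get(idx, 0) + 1
    return ac, an

def nref_frequency(genotypes):
    """Non-reference allele frequency, or None if nothing is called."""
    ac, an = count_alleles(genotypes)
    return sum(ac.values()) / an if an else None

print(nref_frequency(["0/1", "1|1", "./.", "0/0"]))
```

A site would pass --min-af 0.1 under this sketch when the returned value is at least 0.1, regardless of what INFO/AF says.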
2565
2566 -c, --min-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
2567 minimum allele count (INFO/AC) of sites to be printed.
2568 Specifying the type of allele is optional and can be set to
2569 non-reference (nref, the default), 1st alternate (alt1), the
2570 least frequent (minor), the most frequent (major) or sum of all
2571 but the most frequent (nonmajor) alleles.
2572
2573 -C, --max-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
2574 maximum allele count (INFO/AC) of sites to be printed.
2575 Specifying the type of allele is optional and can be set to
2576 non-reference (nref, the default), 1st alternate (alt1), the
2577 least frequent (minor), the most frequent (major) or sum of all
2578 but the most frequent (nonmajor) alleles.
2579
2580 -e, --exclude EXPRESSION
2581 exclude sites for which EXPRESSION is true. For valid
2582 expressions see EXPRESSIONS.
2583
2584 -f, --apply-filters LIST
2585 see Common Options
2586
2587 -g, --genotype [^][hom|het|miss]
2588 include only sites with one or more homozygous (hom),
2589 heterozygous (het) or missing (miss) genotypes. When prefixed
2590 with ^, the logic is reversed; thus ^het excludes sites with
2591 heterozygous genotypes.
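The selection logic of -g, including the ^ prefix, can be sketched as follows. This is an illustration only: the function names are hypothetical, and the handling of partially missing genotypes (treated as missing) and of haploid genotypes (treated as homozygous) is an assumption that may differ from bcftools.

```python
import re

def genotype_class(gt):
    """Classify a GT string as 'hom', 'het' or 'miss'."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "miss"
    return "hom" if len(set(alleles)) == 1 else "het"

def site_selected(genotypes, spec):
    """Apply a -g style selector such as 'het' or '^het' to one site."""
    negate = spec.startswith("^")
    wanted = spec.lstrip("^")
    present = any(genotype_class(gt) == wanted for gt in genotypes)
    return not present if negate else present

print(site_selected(["0/0", "0/1"], "het"), site_selected(["0/0", "0/1"], "^het"))
```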
2592
2593 -i, --include EXPRESSION
2594 include sites for which EXPRESSION is true. For valid
2595 expressions see EXPRESSIONS.
2596
2597 -k, --known
2598 print known sites only (ID column is not ".")
2599
2600 -m, --min-alleles INT
2601 print sites with at least INT alleles listed in REF and ALT
2602 columns
2603
2604 -M, --max-alleles INT
2605 print sites with at most INT alleles listed in REF and ALT
2606 columns. Use -m2 -M2 -v snps to only view biallelic SNPs.
2607
2608 -n, --novel
2609 print novel sites only (ID column is ".")
2610
2611 -p, --phased
2612 print sites where all samples are phased. Haploid genotypes are
2613 considered phased. Missing genotypes are considered unphased unless
2614 the phased bit is set.
2615
2616 -P, --exclude-phased
2617 exclude sites where all samples are phased
2618
2619 -q, --min-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
2620 minimum allele frequency (INFO/AC / INFO/AN) of sites to be
2621 printed. Specifying the type of allele is optional and can be
2622 set to non-reference (nref, the default), 1st alternate (alt1),
2623 the least frequent (minor), the most frequent (major) or sum of
2624 all but the most frequent (nonmajor) alleles.
2625
2626 -Q, --max-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
2627 maximum allele frequency (INFO/AC / INFO/AN) of sites to be
2628 printed. Specifying the type of allele is optional and can be
2629 set to non-reference (nref, the default), 1st alternate (alt1),
2630 the least frequent (minor), the most frequent (major) or sum of
2631 all but the most frequent (nonmajor) alleles.
2632
2633 -u, --uncalled
2634 print sites without a called genotype
2635
2636 -U, --exclude-uncalled
2637 exclude sites without a called genotype
2638
2639 -v, --types snps|indels|mnps|other
2640 comma-separated list of variant types to select. A site is
2641 selected if any of the ALT alleles is of the type requested.
2642 Types are determined by comparing the REF and ALT alleles in
2643 the VCF record, not from INFO tags like INFO/INDEL or INFO/VT.
2644 Use --include to select based on INFO tags.
2645
2646 -V, --exclude-types snps|indels|mnps|ref|bnd|other
2647 comma-separated list of variant types to exclude. A site is
2648 excluded if any of the ALT alleles is of the type requested.
2649 Types are determined by comparing the REF and ALT alleles in
2650 the VCF record, not from INFO tags like INFO/INDEL or INFO/VT.
2651 Use --exclude to exclude based on INFO tags.
2652
2653 -x, --private
2654 print sites where only the subset samples carry a
2655 non-reference allele. Requires --samples or --samples-file.
2656
2657 -X, --exclude-private
2658 exclude sites where only the subset samples carry a
2659 non-reference allele
2660
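The allele-type suffixes accepted by --min-ac, --max-ac, --min-af and --max-af can be illustrated with a small Python sketch. It is not the bcftools implementation. It assumes that the reference allele takes part in the minor/major/nonmajor selection and that its count is INFO/AN minus the sum of INFO/AC. The function name allele_stat is hypothetical.

```python
def allele_stat(an, ac, allele_type="nref"):
    """Return the allele count that a threshold would be compared against.

    an -- total number of alleles in called genotypes (INFO/AN)
    ac -- list of ALT allele counts (INFO/AC), one per ALT allele
    Assumption: the REF count is an - sum(ac) and REF competes with the
    ALT alleles for the minor/major/nonmajor selection.
    """
    counts = [an - sum(ac)] + list(ac)   # counts[0] is the REF allele
    if allele_type == "nref":            # all non-reference alleles together
        return sum(counts[1:])
    if allele_type == "alt1":            # first alternate allele only
        return counts[1]
    if allele_type == "minor":           # least frequent allele
        return min(counts)
    if allele_type == "major":           # most frequent allele
        return max(counts)
    if allele_type == "nonmajor":        # everything except the most frequent
        return sum(counts) - max(counts)
    raise ValueError("unknown allele type: " + allele_type)


# A site with AN=100 and AC=30,5 (two ALT alleles)
for kind in ("nref", "alt1", "minor", "major", "nonmajor"):
    count = allele_stat(100, [30, 5], kind)
    print(kind, count, count / 100)      # count and the matching frequency
```

Under these assumptions, a threshold such as -c 10:minor would be compared against 5 at this site, while -c 10 (nref) would be compared against 35.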
2661 bcftools help [COMMAND] | bcftools --help [COMMAND]
2662 Display a brief usage message listing the bcftools commands available.
2663 If the name of a command is also given, e.g., bcftools help view, the
2664 detailed usage message for that particular command is displayed.
2665
2666 bcftools [--version|-v]
2667 Display the version numbers and copyright information for bcftools and
2668 the important libraries used by bcftools.
2669
2670 bcftools [--version-only]
2671 Display the full bcftools version number in a machine-readable format.
2672
EXPRESSIONS
2674 These filtering expressions are accepted by most of the commands.
2675
2676 Valid expressions may contain:
2677
2678 · numerical constants, string constants, file names
2679
2680 1, 1.0, 1e-4
2681 "String"
2682 @file_name
2683
2684 · arithmetic operators
2685
2686 +,*,-,/
2687
2688 · comparison operators
2689
2690 == (same as =), >, >=, <=, <, !=
2691
2692 · regex operators "~" and its negation "!~". The expressions are case
2693 sensitive unless "/i" is added.
2694
2695 INFO/HAYSTACK ~ "needle"
2696 INFO/HAYSTACK ~ "NEEDless/i"
2697
2698 · parentheses
2699
2700 (, )
2701
2702 · logical operators
2703
2704 && (same as &), ||, |
2705
2706 · INFO tags, FORMAT tags, column names
2707
2708 INFO/DP or DP
2709 FORMAT/DV, FMT/DV, or DV
2710 FILTER, QUAL, ID, POS, REF, ALT[0]
2711
2712 · 1 (or 0) to test the presence (or absence) of a flag
2713
2714 FlagA=1 && FlagB=0
2715
2716 · "." to test missing values
2717
2718 DP=".", DP!=".", ALT="."
2719
2720 · missing genotypes can be matched regardless of phase and ploidy
2721 (".|.", "./.", ".") using these expressions
2722
2723 GT~"\.", GT!~"\."
2724
2725 · missing genotypes can be matched including the phase and ploidy
2726 (".|.", "./.", ".") using these expressions
2727
2728 GT=".|.", GT="./.", GT="."
2729
2730 · sample genotype: reference (haploid or diploid), alternate (hom or
2731 het, haploid or diploid), missing genotype, homozygous,
2732 heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het,
2733 alt-alt het, haploid ref, haploid alt (case-insensitive)
2734
2735 GT="ref"
2736 GT="alt"
2737 GT="mis"
2738 GT="hom"
2739 GT="het"
2740 GT="hap"
2741 GT="RR"
2742 GT="AA"
2743 GT="RA" or GT="AR"
2744 GT="Aa" or GT="aA"
2745 GT="R"
2746 GT="A"
2747
2748 · TYPE for variant type in REF,ALT columns
2749 (indel,snp,mnp,ref,bnd,other). Use the regex operator "~" to
2750 require at least one allele of the given type or the equal sign "="
2751 to require that all alleles are of the given type. Compare
2752
2753 TYPE="snp"
2754 TYPE~"snp"
2755 TYPE!="snp"
2756 TYPE!~"snp"
2757
2758 · array subscripts (0-based), "*" for any element, "-" to indicate a
2759 range. Note that for querying FORMAT vectors, the colon ":" can be
2760 used to select a sample and an element of the vector, as shown in
2761 the examples below
2762
2763 INFO/AF[0] > 0.3 .. first AF value bigger than 0.3
2764 FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30
2765 FORMAT/AD[0:1] .. first sample, second AD value
2766 FORMAT/AD[1:0] .. second sample, first AD value
2767 DP4[*] == 0 .. any DP4 value
2768 FORMAT/DP[0] > 30 .. DP of the first sample bigger than 30
2769 FORMAT/DP[1-3] > 10 .. samples 2-4
2770 FORMAT/DP[1-] < 7 .. all samples but the first
2771 FORMAT/DP[0,2-4] > 20 .. samples 1, 3-5
2772 FORMAT/AD[0:1] .. first sample, second AD field
2773 FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
2774 FORMAT/AD[*:1] or AD[:1] .. any sample, second AD field
2775 (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
2776 CSQ[*] ~ "missense_variant.*deleterious"
2777
2778 · with many samples it can be more practical to provide a file with
2779 sample names, one sample name per line
2780
2781 GT[@samples.txt]="het" & binom(AD)<0.01
2782
2783 · function on FORMAT tags (over samples) and INFO tags (over vector
2784 fields)
2785
2786 MAX, MIN, AVG, SUM, STRLEN, ABS, COUNT
2787
2788 · two-tailed binomial test. Note that for N=0 the test evaluates to a
2789 missing value and when FORMAT/GT is used to determine the vector
2790 indices, it evaluates to 1 for homozygous genotypes.
2791
2792 binom(FMT/AD) .. GT can be used to determine the correct index
2793 binom(AD[0],AD[1]) .. or the fields can be given explicitly
2794
2795 · variables calculated on the fly if not present: number of alternate
2796 alleles; number of samples; count of alternate alleles; minor
2797 allele count (similar to AC, but the corresponding frequency never
2798 exceeds 0.5); frequency of alternate alleles (AF=AC/AN); frequency
2799 of minor alleles (MAF=MAC/AN); number of alleles in called
2800 genotypes; number of samples with missing genotype; fraction of
2801 samples with missing genotype
2802
2803 N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING
2804
2805 · the number (N_PASS) or fraction (F_PASS) of samples which pass the
2806 expression
2807
2808 N_PASS(GQ>90 & GT!="mis") > 90
2809 F_PASS(GQ>90 & GT!="mis") > 0.9
2810
2811 · custom perl filtering. Note that this feature is not compiled in by
2812 default, see the section Optional Compilation with Perl in the
2813 INSTALL file for help and misc/demo-flt.pl for a working example.
2814 The demo defines the perl subroutine "severity" which can be
2815 invoked from the command line as follows:
2816
2817 perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
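
The array subscripts described above ("*", single indices, comma-separated lists and "-" ranges, all 0-based) can be read as follows. This Python sketch only mirrors the documented examples; the function name select_indices is hypothetical and the code is not taken from bcftools.

```python
def select_indices(subscript, n):
    """Expand a subscript such as "0,2-4", "1-" or "*" into 0-based indices.

    subscript -- the text between the brackets (for one dimension)
    n         -- number of available elements (samples or vector fields)
    """
    if subscript in ("*", ""):           # "*" or an empty subscript: any element
        return list(range(n))
    picked = []
    for part in subscript.split(","):
        if "-" in part:
            start, _, end = part.partition("-")
            last = int(end) if end else n - 1   # open range "1-" runs to the end
            picked.extend(range(int(start), last + 1))
        else:
            picked.append(int(part))
    return [i for i in picked if i < n]  # ignore indices past the last element


print(select_indices("1-3", 5))    # samples 2-4
print(select_indices("1-", 4))     # all samples but the first
print(select_indices("0,2-4", 6))  # samples 1, 3-5
```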
2818
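The two-tailed binomial test mentioned above can be sketched in Python as shown below. This is one common definition (twice the smaller tail, capped at 1, with success probability 0.5) and is an assumption; the numerical method used by bcftools is not described in this manual. The lookup of vector indices from FORMAT/GT is not modelled. The function name binom_two_sided is hypothetical.

```python
from math import comb


def binom_two_sided(n_a, n_b, p=0.5):
    """Two-tailed binomial test for observing n_a and n_b reads of two alleles.

    Returns None for N=0, mirroring the "missing value" case in the text.
    """
    n = n_a + n_b
    if n == 0:
        return None                      # no observations: result is missing

    def pmf(k):
        # probability of exactly k successes in n trials
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    lower = sum(pmf(k) for k in range(0, n_a + 1))   # P(X <= n_a)
    upper = sum(pmf(k) for k in range(n_a, n + 1))   # P(X >= n_a)
    return min(1.0, 2 * min(lower, upper))


print(binom_two_sided(5, 5))    # balanced allelic depths
print(binom_two_sided(0, 10))   # strongly skewed allelic depths
```

A filter such as binom(AD[0],AD[1])<0.01 would, under this definition, flag the second case but not the first.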
2819 Notes:
2820
2821 · String comparisons and regular expressions are case-insensitive
2822
2823 · Variables and function names are case-insensitive, but not tag
2824 names. For example, "qual" can be used instead of "QUAL",
2825 "strlen()" instead of "STRLEN()", but not "dp" instead of "DP".
2826
2827 · When querying multiple values, all elements are tested and the OR
2828 logic is used on the result. For example, when querying
2829 "TAG=1,2,3,4", it will be evaluated as follows:
2830
2831 -i 'TAG[*]=1' .. true, the record will be printed
2832 -i 'TAG[*]!=1' .. true
2833 -e 'TAG[*]=1' .. false, the record will be discarded
2834 -e 'TAG[*]!=1' .. false
2835 -i 'TAG[0]=1' .. true
2836 -i 'TAG[0]!=1' .. false
2837 -e 'TAG[0]=1' .. false
2838 -e 'TAG[0]!=1' .. true
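
The OR logic in the table above can be reproduced with a short Python sketch. It is only a model of the documented behaviour for "TAG=1,2,3,4"; the helper names matches and is_printed are hypothetical.

```python
def matches(values, op, target, index=None):
    """Evaluate TAG[*] (index=None) or TAG[index] against target with OR logic."""
    selected = values if index is None else [values[index]]
    if op == "=":
        return any(v == target for v in selected)
    if op == "!=":
        return any(v != target for v in selected)
    raise ValueError("unsupported operator: " + op)


def is_printed(mode, values, op, target, index=None):
    """-i keeps records where the expression is true, -e discards them."""
    hit = matches(values, op, target, index)
    return hit if mode == "-i" else not hit


tag = [1, 2, 3, 4]
for mode in ("-i", "-e"):
    for index in (None, 0):
        for op in ("=", "!="):
            subscript = "*" if index is None else str(index)
            print(mode, "TAG[%s]%s1" % (subscript, op),
                  is_printed(mode, tag, op, 1, index))
```

Note that with TAG[*] both "=1" and "!=1" are true at the same time, because at least one element satisfies each comparison. This is why -e 'TAG[*]!=1' discards the record even though one of its values equals 1.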
2839
2840 Examples:
2841
2842 MIN(DV)>5
2843
2844 MIN(DV/DP)>0.3
2845
2846 MIN(DP)>10 & MIN(DV)>3
2847
2848 FMT/DP>10 & FMT/GQ>10 .. both conditions must be satisfied within one sample
2849
2850 FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
2851
2852 QUAL>10 | FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
2853
2854 QUAL>10 || FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
2855
2856 TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
2857
2858 COUNT(GT="hom")=0
2859
2860 MIN(DP)>35 && AVG(GQ)>50
2861
2862 ID=@file .. selects lines with ID present in the file
2863
2864 ID!=@~/file .. skip lines with ID present in the ~/file
2865
2866 MAF[0]<0.05 .. select rare variants at 5% cutoff
2867
2868 POS>=100 .. restrict your range query, e.g. 20:100-200 to strictly sites with POS in that range.
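
The difference between "&" and "&&" in the examples above can be modelled in Python. This sketch only covers the site-level true/false outcome for two per-sample FORMAT tags and leaves out the sample selection described for "|" and "||". The function names are hypothetical.

```python
def within_sample_and(dp, gq, min_dp=10, min_gq=10):
    """Model of FMT/DP>10 & FMT/GQ>10: one sample must satisfy both."""
    return any(d > min_dp and g > min_gq for d, g in zip(dp, gq))


def across_samples_and(dp, gq, min_dp=10, min_gq=10):
    """Model of FMT/DP>10 && FMT/GQ>10: different samples may satisfy each."""
    return any(d > min_dp for d in dp) and any(g > min_gq for g in gq)


# Two samples: the first has high DP but low GQ, the second the opposite
dp = [12, 5]
gq = [5, 20]
print(within_sample_and(dp, gq))    # no single sample passes both tests
print(across_samples_and(dp, gq))   # each test passes in some sample
```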
2869
2870 Shell expansion:
2871
2872 Note that expressions must often be quoted because some characters have
2873 special meaning in the shell. In the following example the expression
2874 is enclosed in single quotes, so the whole expression is passed to the
2875 program as intended:
2876
2877 bcftools view -i '%ID!="." & MAF[0]<0.01'
2878
2879 Please refer to the documentation of your shell for details.
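
When a command line is assembled by a script rather than typed by hand, the quoting can be delegated to a library. The Python sketch below uses the standard shlex module. The input file name in.vcf.gz is a placeholder, and the command is only built and printed, not executed.

```python
import shlex

# The expression from the example above, exactly as bcftools should receive it
expression = '%ID!="." & MAF[0]<0.01'

# Passing arguments as a list avoids shell interpretation altogether,
# e.g. with subprocess.run(argv) and no shell=True
argv = ["bcftools", "view", "-i", expression, "in.vcf.gz"]

# If a single shell string is needed, shlex.join quotes each argument
command_line = shlex.join(argv)
print(command_line)

# Splitting the string again with shell rules recovers the original arguments
assert shlex.split(command_line) == argv
```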
2880
2882 plot-vcfstats [OPTIONS] file.vchk [...]
2883 Script for processing output of bcftools stats. It can merge results
2884 from multiple outputs (useful when running the stats for each
2885 chromosome separately), plot graphs and create a PDF presentation.
2886
2887 -m, --merge
2888 Merge vcfstats files to STDOUT, skip plotting.
2889
2890 -p, --prefix DIR
2891 The output directory. This directory will be created if it does not
2892 exist.
2893
2894 -P, --no-PDF
2895 Skip the PDF creation step.
2896
2897 -r, --rasterize
2898 Rasterize PDF images for faster rendering.
2899
2900 -s, --sample-names
2901 Use sample names for xticks rather than numeric IDs.
2902
2903 -t, --title STRING
2904 Identify files by these titles in plots. The option can be given
2905 multiple times, for each ID in the bcftools stats output. If not
2906 present, the script will use abbreviated source file names for the
2907 titles.
2908
2909 -T, --main-title STRING
2910 Main title for the PDF.
2911
2913 HTSlib was designed with BCF format in mind. When parsing VCF files,
2914 all records are internally converted into BCF representation. Simple
2915 operations, like removing a single column from a VCF file, can
2916 therefore be done much faster with standard UNIX commands, such as awk
2917 or cut. It is therefore recommended to use BCF as the input/output
2918 format whenever possible to avoid the large overhead of the
2919 VCF → BCF → VCF conversion.
2920
2922 Please report any bugs you encounter on the GitHub website:
2923 http://github.com/samtools/bcftools
2924
2926 Heng Li from the Sanger Institute wrote the original C version of
2927 htslib, samtools and bcftools. Bob Handsaker from the Broad Institute
2928 implemented the BGZF library. Petr Danecek, Shane McCarthy and John
2929 Marshall are maintaining and further developing bcftools. Many other
2930 people contributed to the program and to the file format
2931 specifications, both directly and indirectly by providing patches,
2932 testing and reporting bugs. We thank them all.
2933
2935 BCFtools GitHub website: http://github.com/samtools/bcftools
2936
2937 Samtools GitHub website: http://github.com/samtools/samtools
2938
2939 HTSlib GitHub website: http://github.com/samtools/htslib
2940
2941 File format specifications: http://samtools.github.io/hts-specs
2942
2943 BCFtools documentation: http://samtools.github.io/bcftools
2944
2945 BCFtools wiki page: https://github.com/samtools/bcftools/wiki
2946
2948 The MIT/Expat License or GPL License, see the LICENSE document for
2949 details. Copyright (c) Genome Research Ltd.
2950
2951
2952
2953 2018-07-18 BCFTOOLS(1)