bcftools(1)

BCFTOOLS(1)                                                        BCFTOOLS(1)

NAME

       bcftools - utilities for variant calling and manipulating VCFs and
       BCFs.

SYNOPSIS

       bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]

DESCRIPTION

       BCFtools is a set of utilities that manipulate variant calls in the
       Variant Call Format (VCF) and its binary counterpart BCF. All commands
       work transparently with both VCFs and BCFs, both uncompressed and
       BGZF-compressed.

       Most commands accept VCF, bgzipped VCF and BCF with filetype detected
       automatically even when streaming from a pipe. Indexed VCF and BCF will
       work in all situations. Un-indexed VCF and BCF and streams will work in
       most, but not all situations. In general, whenever multiple VCFs are
       read simultaneously, they must be indexed and therefore also
       compressed.

       BCFtools is designed to work on a stream. It regards an input file "-"
       as the standard input (stdin) and outputs to the standard output
       (stdout). Several commands can thus be combined with Unix pipes.

   VERSION
       This manual page was last updated 2018-07-18 and refers to bcftools git
       version 1.9.

   BCF1
       The BCF1 format output by versions of samtools <= 0.1.19 is not
       compatible with this version of bcftools. To read BCF1 files one can
       use the view command from old versions of bcftools packaged with
       samtools versions <= 0.1.19 to convert to VCF, which can then be read
       by this version of bcftools.

               samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view

   VARIANT CALLING
       See bcftools call for variant calling from the output of the samtools
       mpileup command. In versions of samtools <= 0.1.19 calling was done
       with bcftools view. Users are now required to choose between the old
       samtools calling model (-c/--consensus-caller) and the new multiallelic
       calling model (-m/--multiallelic-caller). The multiallelic calling
       model is recommended for most tasks.

LIST OF COMMANDS

       For a full list of available commands, run bcftools without arguments.
       For a full list of available options, run bcftools COMMAND without
       arguments.

       ·    annotate .. edit VCF files, add or remove annotations

       ·    call .. SNP/indel calling (former "view")

       ·    cnv .. Copy Number Variation caller

       ·    concat .. concatenate VCF/BCF files from the same set of samples

       ·    consensus .. create consensus sequence by applying VCF variants

       ·    convert .. convert VCF/BCF to other formats and back

       ·    csq .. haplotype aware consequence caller

       ·    filter .. filter VCF/BCF files using fixed thresholds

       ·    gtcheck .. check sample concordance, detect sample swaps and
           contamination

       ·    index .. index VCF/BCF

       ·    isec .. intersections of VCF/BCF files

       ·    merge .. merge VCF/BCF files from non-overlapping sample sets

       ·    mpileup .. multi-way pileup producing genotype likelihoods

       ·    norm .. normalize indels

       ·    plugin .. run user-defined plugin

       ·    polysomy .. detect contaminations and whole-chromosome aberrations

       ·    query .. transform VCF/BCF into user-defined formats

       ·    reheader .. modify VCF/BCF header, change sample names

       ·    roh .. identify runs of homo/auto-zygosity

       ·    sort .. sort VCF/BCF files

       ·    stats .. produce VCF/BCF stats (former vcfcheck)

       ·    view .. subset, filter and convert VCF and BCF files

LIST OF SCRIPTS

       Some helper scripts are bundled with the bcftools code.

       ·    plot-vcfstats .. plots the output of stats

COMMANDS AND OPTIONS

   Common Options
       The following options are common to many bcftools commands. See usage
       for specific commands to see if they apply.

       FILE
           Files can be either VCF or BCF, uncompressed or BGZF-compressed. The
           file "-" is interpreted as standard input. Some tools may require
           tabix- or CSI-indexed files.

       -c, --collapse snps|indels|both|all|some|none|id
           Controls how to treat records with duplicate positions and defines
           compatible records across multiple input files. Here by
           "compatible" we mean records which should be considered as
           identical by the tools. For example, when performing line
           intersections, the desire may be to consider as identical all sites
           with matching positions (bcftools isec -c all), or only sites with
           matching variant type (bcftools isec -c snps -c indels), or only
           sites with all alleles identical (bcftools isec -c none).

           none
               only records with identical REF and ALT alleles are compatible

           some
               only records where some subset of ALT alleles match are
               compatible

           all
               all records are compatible, regardless of whether the ALT
               alleles match or not. In the case of records with the same
               position, only the first will be considered and appear on
               output.

           snps
               any SNP records are compatible, regardless of whether the ALT
               alleles match or not. For duplicate positions, only the first
               SNP record will be considered and appear on output.

           indels
               all indel records are compatible, regardless of whether the REF
               and ALT alleles match or not. For duplicate positions, only the
               first indel record will be considered and appear on output.

           both
               abbreviation of "-c indels -c snps"

           id
               only records with identical ID column are compatible. Supported
               by bcftools merge only.

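           The compatibility rules above can be sketched in Python. This is
           an illustrative model only, not the bcftools implementation: the
           helper names are hypothetical, records are reduced to (REF, ALT
           list) pairs assumed to share CHROM and POS, the SNP/indel
           classification is deliberately crude, records not covered by the
           chosen mode are assumed to require an exact match, and the
           merge-only "id" mode is left out.

```python
def variant_type(ref, alts):
    """Crude classification (assumption): one-base alleles only means SNP."""
    return "snp" if len(ref) == 1 and all(len(a) == 1 for a in alts) else "indel"


def compatible(rec_a, rec_b, mode):
    """Decide whether two records at the same CHROM/POS count as identical.

    Each record is a (REF, [ALT, ...]) pair; mode is one of
    none, some, all, snps, indels, both.
    """
    (ref_a, alts_a), (ref_b, alts_b) = rec_a, rec_b
    exact = ref_a == ref_b and alts_a == alts_b
    if mode == "none":
        return exact
    if mode == "some":
        # Assumed reading: same REF and at least one shared ALT allele
        return ref_a == ref_b and bool(set(alts_a) & set(alts_b))
    if mode == "all":
        return True
    # snps / indels / both: records of a collapsed type match each other
    types = {variant_type(ref_a, alts_a), variant_type(ref_b, alts_b)}
    collapsed = {"snps": {"snp"}, "indels": {"indel"}, "both": {"snp", "indel"}}[mode]
    return exact or (len(types) == 1 and types <= collapsed)


print(compatible(("A", ["C"]), ("A", ["T"]), "none"))  # different ALT alleles
print(compatible(("A", ["C"]), ("A", ["T"]), "snps"))  # both are SNPs
```

           Under these assumptions, two SNPs with different ALT alleles are
           distinct with -c none but treated as the same site with -c snps.
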
       -f, --apply-filters LIST
           Skip sites where FILTER column does not contain any of the strings
           listed in LIST. For example, to include only sites which have no
           filters set, use -f .,PASS.

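           A minimal Python sketch of this selection rule, assuming the
           FILTER column holds either ".", "PASS" or a semicolon-separated
           list of filter names as in the VCF specification. The function
           name is hypothetical and the code only models the described
           behaviour.

```python
def passes_apply_filters(filter_column, wanted):
    """Keep a site if its FILTER column contains any string listed in wanted."""
    present = filter_column.split(";")
    return any(name in wanted for name in present)


# Equivalent in spirit to "-f .,PASS": keep sites with no filters set
wanted = {".", "PASS"}
for value in [".", "PASS", "LowQual", "LowQual;q10"]:
    print(value, passes_apply_filters(value, wanted))
```
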
       --no-version
           Do not append version and command line information to the output
           VCF header.

       -o, --output FILE
           When output consists of a single stream, write it to FILE rather
           than to standard output, where it is written by default.

       -O, --output-type b|u|z|v
           Output compressed BCF (b), uncompressed BCF (u), compressed VCF
           (z), uncompressed VCF (v). Use the -Ou option when piping between
           bcftools subcommands to speed up performance by removing
           unnecessary compression/decompression and VCF←→BCF conversion.

       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
           Comma-separated list of regions, see also -R, --regions-file. Note
           that -r cannot be used in combination with -R.

       -R, --regions-file FILE
           Regions can be specified either on the command line or in a VCF,
           BED, or tab-delimited file (the default). The columns of the
           tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where
           positions are 1-based and inclusive. The columns of the
           tab-delimited BED file are also CHROM, POS and POS_TO (trailing
           columns are ignored), but coordinates are 0-based, half-open. To
           indicate that a file be treated as BED rather than the 1-based
           tab-delimited file, the file must have the ".bed" or ".bed.gz"
           suffix (case-insensitive). Uncompressed files are stored in memory,
           while bgzip-compressed and tabix-indexed region files are streamed.
           Note that sequence names must match exactly, "chr20" is not the
           same as "20". Also note that chromosome ordering in FILE will be
           respected, the VCF will be processed in the order in which
           chromosomes first appear in FILE. However, within chromosomes, the
           VCF will always be processed in ascending genomic coordinate order
           no matter what order they appear in FILE. Note that overlapping
           regions in FILE can result in duplicated out of order positions in
           the output. This option requires indexed VCF/BCF files. Note that
           -R cannot be used in combination with -r.

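           The two coordinate conventions are easy to confuse, so the
           following Python sketch shows how one line of a regions file
           might be normalised to 1-based inclusive coordinates depending on
           the file suffix. The function names are hypothetical, and
           treating a missing POS_TO as a single position is an assumption.

```python
def bed_to_one_based(chrom, start, end):
    """BED (0-based, half-open) -> 1-based inclusive CHROM, POS, POS_TO."""
    return chrom, start + 1, end


def region_columns(path, fields):
    """Interpret the columns of one regions-file line according to the suffix."""
    chrom, beg = fields[0], int(fields[1])
    # Suffix check is case-insensitive, as described above
    if path.lower().endswith((".bed", ".bed.gz")):
        return bed_to_one_based(chrom, beg, int(fields[2]))
    # Tab-delimited default: already 1-based inclusive, POS_TO is optional
    end = int(fields[2]) if len(fields) > 2 else beg
    return chrom, beg, end


print(region_columns("targets.bed", ["20", "99", "200"]))
print(region_columns("targets.tsv", ["20", "100", "200"]))
```

           Both calls describe the same interval: bases 100 to 200 of
           sequence "20".
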
       -s, --samples [^]LIST
           Comma-separated list of samples to include or exclude if prefixed
           with "^". The sample order is updated to reflect that given on the
           command line. Note that in general tags such as INFO/AC, INFO/AN,
           etc. are not updated to correspond to the subset samples. bcftools
           view is the exception where some tags will be updated (unless the
           -I, --no-update option is used; see bcftools view documentation).
           To use updated tags for the subset in another command one can pipe
           from view into that command. For example:

               bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f '%INFO/AC\t%INFO/AN\n'

       -S, --samples-file FILE
           File of sample names to include or exclude if prefixed with "^".
           One sample per line. See also the note above for the -s, --samples
           option. The sample order is updated to reflect that given in the
           input file. The command bcftools call accepts an optional second
           column indicating ploidy (0, 1 or 2) or sex (as defined by
           --ploidy, for example "F" or "M"), and can also parse PED files. If
           the second column is not present, the sex "F" is assumed. With
           bcftools call -C trio, a PED file is expected. File format
           examples:

               sample1    1
               sample2    2
               sample3    2

             or

               sample1    M
               sample2    F
               sample3    F

             or a .ped file (a minimal working example is shown here, the first column is
             ignored and the last indicates sex: 1=male, 2=female)

               ignored daughterA fatherA motherA 2
               ignored sonB fatherB motherB 1

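           The file layouts above could be read along the following lines.
           This Python sketch is only a model of the documented formats: the
           function names are hypothetical, and mapping the PED sex codes 1
           and 2 to "M" and "F" is a naming choice made for the example.

```python
def parse_samples_line(line):
    """Two-column form: sample name plus optional ploidy or sex code."""
    fields = line.split()
    name = fields[0]
    second = fields[1] if len(fields) > 1 else "F"  # sex "F" assumed if absent
    # 0, 1 or 2 is read as an explicit ploidy, anything else as a sex code
    return (name, int(second)) if second in {"0", "1", "2"} else (name, second)


def parse_ped_line(line):
    """Minimal PED form: ignored column, child, father, mother, sex (1 or 2)."""
    _, child, father, mother, sex = line.split()
    return {"child": child, "father": father, "mother": mother,
            "sex": {"1": "M", "2": "F"}[sex]}


print(parse_samples_line("sample1    1"))
print(parse_samples_line("sample3"))
print(parse_ped_line("ignored sonB fatherB motherB 1"))
```
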
       -t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]
           Similar to -r, --regions, but the next position is accessed by
           streaming the whole VCF/BCF rather than using the tbi/csi index.
           Both -r and -t options can be applied simultaneously: -r uses the
           index to jump to a region and -t discards positions which are not
           in the targets. Unlike -r, targets can be prefixed with "^" to
           request logical complement. For example, "^X,Y,MT" indicates that
           sequences X, Y and MT should be skipped. Yet another difference
           between the two is that -r checks both start and end positions of
           indels, whereas -t checks start positions only. Note that -t cannot
           be used in combination with -T.

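           The difference in how indels are matched can be illustrated with
           a small Python model. It assumes that a record covers the
           positions POS to POS+len(REF)-1 and that -r amounts to an overlap
           test on that span. The function names are hypothetical and the
           code is not taken from bcftools.

```python
def record_span(pos, ref):
    """1-based inclusive span covered by the REF allele of a record."""
    return pos, pos + len(ref) - 1


def hit_by_regions(pos, ref, beg, end):
    """-r style: match if the record span overlaps the region [beg, end]."""
    rec_beg, rec_end = record_span(pos, ref)
    return rec_beg <= end and rec_end >= beg


def hit_by_targets(pos, ref, beg, end):
    """-t style: only the start position of the record is checked."""
    return beg <= pos <= end


# A deletion starting at 98 with REF "ACGT" reaches into the region 100-200
print(hit_by_regions(98, "ACGT", 100, 200))
print(hit_by_targets(98, "ACGT", 100, 200))
```

           In this model the deletion is reported for the region but not for
           the target, because its start position lies before 100.
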
       -T, --targets-file [^]FILE
           Same as -t, --targets, but reads regions from a file. Note that -T
           cannot be used in combination with -t.

           With the call -C alleles command, the third column of the targets
           file must be a comma-separated list of alleles, starting with the
           reference allele. Note that the file must be compressed and
           indexed. Such a file can be easily created from a VCF using:

               bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz

       --threads INT
           Number of output compression threads to use in addition to main
           thread. Only used when --output-type is b or z. Default: 0.

   bcftools annotate [OPTIONS] FILE
       Add or remove annotations.

       -a, --annotations file
           Bgzip-compressed and tabix-indexed file with annotations. The file
           can be VCF, BED, or a tab-delimited file with mandatory columns
           CHROM, POS (or, alternatively, FROM and TO), optional columns REF
           and ALT, and arbitrary number of annotation columns. BED files are
           expected to have the ".bed" or ".bed.gz" suffix (case-insensitive),
           otherwise a tab-delimited file is assumed. Note that in case of
           tab-delimited file, the coordinates POS, FROM and TO are one-based
           and inclusive. When REF and ALT are present, only matching VCF
           records will be annotated. When multiple ALT alleles are present in
           the annotation file (given as comma-separated list of alleles), at
           least one must match one of the alleles in the corresponding VCF
           record. Similarly, at least one alternate allele from a
           multi-allelic VCF record must be present in the annotation file.
           Note that flag types, such as "INFO/FLAG", can be annotated by
           including a field with the value "1" to set the flag, "0" to remove
           it, or "." to keep existing flags. See also -c, --columns and -h,
           --header-lines.

               # Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG
               1  752566  SomeString      5
               1  798959  SomeOtherString 6
               # etc.

       --collapse snps|indels|both|all|some|none
           Controls how to match records from the annotation file to the
           target VCF. Effective only when -a is a VCF or BCF. See Common
           Options for more.

       -c, --columns list
           Comma-separated list of columns or tags to carry over from the
           annotation file (see also -a, --annotations). If the annotation
           file is not a VCF/BCF, list describes the columns of the annotation
           file and must include CHROM, POS (or, alternatively, FROM and TO),
           and optionally REF and ALT. Unused columns which should be ignored
           can be indicated by "-".

           If the annotation file is a VCF/BCF, only the edited columns/tags
           must be present and their order does not matter. The columns ID,
           QUAL, FILTER, INFO and FORMAT can be edited, where INFO tags can be
           written either as "INFO/TAG" or simply "TAG", and FORMAT tags can
           be written as "FORMAT/TAG" or "FMT/TAG". The imported VCF
           annotations can be renamed as "DST_TAG:=SRC_TAG" or
           "FMT/DST_TAG:=FMT/SRC_TAG".

           To carry over all INFO annotations, use "INFO". To add all INFO
           annotations except "TAG", use "^INFO/TAG". By default, existing
           values are replaced.

           To add annotations without overwriting existing values (that is, to
           add missing tags or add values to existing tags with missing
           values), use "+TAG" instead of "TAG". To append to existing values
           (rather than replacing or leaving untouched), use "=TAG" (instead
           of "TAG" or "+TAG"). To replace only existing values without
           modifying missing annotations, use "-TAG".

           If the annotation file is not a VCF/BCF, all new annotations must
           be defined via -h, --header-lines.

       -e, --exclude EXPRESSION
           exclude sites for which EXPRESSION is true. For valid expressions
           see EXPRESSIONS.

       -h, --header-lines file
           Lines to append to the VCF header, see also -c, --columns and -a,
           --annotations. For example:

               ##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
               ##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">

       -I, --set-id [+]FORMAT
           assign ID on the fly. The format is the same as in the query
           command (see below). By default all existing IDs are replaced. If
           the format string is preceded by "+", only missing IDs will be set.
           For example, one can use

               bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf

       -i, --include EXPRESSION
           include only sites for which EXPRESSION is true. For valid
           expressions see EXPRESSIONS.

       -k, --keep-sites
           keep sites which do not pass -i and -e expressions instead of
           discarding them

       -m, --mark-sites TAG
           annotate sites which are present ("+") or absent ("-") in the -a
           file with a new INFO/TAG flag

       --no-version
           see Common Options

       -o, --output FILE
           see Common Options

       -O, --output-type b|u|z|v
           see Common Options

       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
           see Common Options

       -R, --regions-file file
           see Common Options

       --rename-chrs file
           rename chromosomes according to the map in file, with "old_name
           new_name\n" pairs separated by whitespace, each on a separate
           line.

       -s, --samples [^]LIST
           subset of samples to annotate, see also Common Options

       -S, --samples-file FILE
           subset of samples to annotate. If the samples are named differently
           in the target VCF and the -a, --annotations VCF, the name mapping
           can be given as "src_name dst_name\n", separated by whitespace,
           each pair on a separate line.

       --threads INT
           see Common Options

       -x, --remove list
           List of annotations to remove. Use "FILTER" to remove all filters
           or "FILTER/SomeFilter" to remove a specific filter. Similarly,
           "INFO" can be used to remove all INFO tags and "FORMAT" to remove
           all FORMAT tags except GT. To remove all INFO tags except "FOO" and
           "BAR", use "^INFO/FOO,INFO/BAR" (and similarly for FORMAT and
           FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".

       Examples:

               # Remove three fields
               bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz

               # Remove all INFO fields and all FORMAT fields except for GT and PL
               bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf

               # Add ID, QUAL and INFO/TAG, not replacing TAG if already present
               bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf

               # Carry over all INFO and FORMAT annotations except FORMAT/GT
               bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf

               # Annotate from a tab-delimited file with six columns (the fifth is ignored),
               # first indexing with tabix. The coordinates are 1-based.
               tabix -s1 -b2 -e2 annots.tab.gz
               bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf

               # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
               tabix -s1 -b2 -e3 annots.tab.gz
               bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf

               # Annotate from a bed file (0-based coordinates, half-open intervals)
               bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf

   bcftools call [OPTIONS] FILE
       This command replaces the former bcftools view caller. Some of the
       original functionality has been temporarily lost in the process of
       transition under htslib, but will be added back on popular demand. The
       original calling model can be invoked with the -c option.

       File format options:
           --no-version
               see Common Options

           -o, --output FILE
               see Common Options

           -O, --output-type b|u|z|v
               see Common Options

           --ploidy ASSEMBLY[?]
               predefined ploidy, use list (or any other unused word) to print
               a list of all predefined assemblies. Append a question mark to
               print the actual definition. See also --ploidy-file.

           --ploidy-file FILE
               ploidy definition given as a space/tab-delimited list of CHROM,
               FROM, TO, SEX, PLOIDY. The SEX codes are arbitrary and
               correspond to the ones used by --samples-file. The default
               ploidy can be given using the starred records (see below),
               unlisted regions have ploidy 2. The default ploidy definition
               is

                   X 1 60000 M 1
                   X 2699521 154931043 M 1
                   Y 1 59373566 M 1
                   Y 1 59373566 F 0
                   MT 1 16569 M 1
                   MT 1 16569 F 1
                   *  * *     M 2
                   *  * *     F 2

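               How such a definition resolves to a ploidy for a given
               position and sex can be sketched in Python. This is an
               illustrative reading of the format, not the bcftools parser:
               the function names are hypothetical, listed regions are
               assumed to take precedence over the starred defaults, and a
               sex without a starred record falls back to ploidy 2.

```python
DEFAULT_PLOIDY = """\
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
*  * *     M 2
*  * *     F 2
"""


def parse_ploidy(text):
    """Split a ploidy definition into explicit regions and starred defaults."""
    regions, defaults = [], {}
    for line in text.splitlines():
        chrom, beg, end, sex, ploidy = line.split()
        if chrom == "*":
            defaults[sex] = int(ploidy)
        else:
            regions.append((chrom, int(beg), int(end), sex, int(ploidy)))
    return regions, defaults


def ploidy_at(regions, defaults, chrom, pos, sex):
    """Ploidy for one position: first matching region, else the default."""
    for r_chrom, beg, end, r_sex, ploidy in regions:
        if r_chrom == chrom and r_sex == sex and beg <= pos <= end:
            return ploidy
    return defaults.get(sex, 2)


regions, defaults = parse_ploidy(DEFAULT_PLOIDY)
print(ploidy_at(regions, defaults, "X", 30000, "M"))    # inside first X block
print(ploidy_at(regions, defaults, "X", 1000000, "M"))  # between the X blocks
print(ploidy_at(regions, defaults, "Y", 100, "F"))
```
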
           -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
               see Common Options

           -R, --regions-file file
               see Common Options

           -s, --samples LIST
               see Common Options

           -S, --samples-file FILE
               see Common Options

           -t, --targets LIST
               see Common Options

           -T, --targets-file FILE
               see Common Options

           --threads INT
               see Common Options

       Input/output options:
           -A, --keep-alts
               output all alternate alleles present in the alignments even if
               they do not appear in any of the genotypes

           -f, --format-fields list
               comma-separated list of FORMAT fields to output for each
               sample. Currently GQ and GP fields are supported. For
               convenience, the fields can be given as lower case letters.

           -F, --prior-freqs AN,AC
               take advantage of prior knowledge of population allele
               frequencies. The workflow looks like this:

                   # Extract AN,AC values from an existing VCF, such as 1000Genomes
                   bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz

                   # If the tags AN,AC are not already present, use the +fill-AN-AC plugin
                   bcftools +fill-AN-AC 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz
                   tabix -s1 -b2 -e2 AFs.tab.gz

                   # Create a VCF header description, here we name the tags REF_AN,REF_AC
                   cat AFs.hdr
                   ##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes">
                   ##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele">

                   # Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies
                   bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]

           -g, --gvcf INT
               output also gVCF blocks of homozygous REF calls. The parameter
               INT is the minimum per-sample depth required to include a site
               in the non-variant block.

           -i, --insert-missed INT
               output also sites missed by mpileup but present in -T,
               --targets-file.

           -M, --keep-masked-ref
               output sites where REF allele is N

           -V, --skip-variants snps|indels
               skip indel/SNP sites

           -v, --variants-only
               output variant sites only

       Consensus/variant calling options:
           -c, --consensus-caller
               the original samtools/bcftools calling method (conflicts with
               -m)

           -C, --constrain alleles|trio

               alleles
                   call genotypes given alleles. See also -T, --targets-file.

               trio
                   call genotypes given the father-mother-child constraint.
                   See also -s, --samples and -n, --novel-rate.

           -m, --multiallelic-caller
               alternative model for multiallelic and rare-variant calling
               designed to overcome known limitations in -c calling model
               (conflicts with -c)

           -n, --novel-rate float[,...]
               likelihood of novel mutation for constrained -C trio calling.
               The trio genotype calling maximizes likelihood of a particular
               combination of genotypes for father, mother and the child
               P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) *
               (1-Pn). By providing three values, the mutation rate Pn is set
               explicitly for SNPs, deletions and insertions, respectively. If
               two values are given, the first is interpreted as the mutation
               rate of SNPs and the second is used to calculate the mutation
               rate of indels according to their length as
               Pn=float*exp(-a-b*len), where a=22.8689, b=0.2994 for
               insertions and a=21.9313, b=0.2856 for deletions
               [pubmed:23975140]. If only one value is given, the same
               mutation rate Pn is used for SNPs and indels.

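               The arithmetic described for -n can be written out as a short
               Python sketch. The function names and the example rates are
               hypothetical; only the formulas and the constants a and b are
               taken from the text above.

```python
import math

# (a, b) constants from the description above
INSERTION = (22.8689, 0.2994)
DELETION = (21.9313, 0.2856)


def indel_novel_rate(rate, length, is_insertion):
    """Length-dependent indel rate: Pn = rate * exp(-a - b * len)."""
    a, b = INSERTION if is_insertion else DELETION
    return rate * math.exp(-a - b * length)


def novel_rates(values, indel_len, is_insertion):
    """Return (Pn for SNPs, Pn for this indel) given 1, 2 or 3 values."""
    if len(values) == 1:
        return values[0], values[0]
    if len(values) == 2:
        return values[0], indel_novel_rate(values[1], indel_len, is_insertion)
    snp, deletion, insertion = values  # order: SNPs, deletions, insertions
    return snp, insertion if is_insertion else deletion


def trio_likelihood(p_unconstrained, p_constrained, pn):
    """P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) * (1 - Pn)."""
    return p_unconstrained * pn + p_constrained * (1 - pn)


# Hypothetical rates, chosen only to exercise the three forms
print(novel_rates([1e-8], 3, True))
print(novel_rates([1e-8, 1.0], 3, True))
print(novel_rates([1e-8, 1e-9, 2e-9], 3, True))
```

               With two values, longer indels receive a smaller Pn because
               the exponent decreases linearly with the indel length.
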
           -p, --pval-threshold float
               with -c, accept variant if P(ref|D) < float.

           -P, --prior float
               expected substitution rate, or 0 to disable the prior. Only
               with -m.

           -t, --targets file|chr|chr:pos|chr:from-to|chr:from-[,...]
               see Common Options

           -X, --chromosome-X
               haploid output for male samples (requires PED file with -s)

           -Y, --chromosome-Y
               haploid output for males and skips females (requires PED file
               with -s)

   bcftools cnv [OPTIONS] FILE
       Copy number variation caller, requires a VCF annotated with Illumina’s
       B-allele frequency (BAF) and Log R Ratio intensity (LRR) values. The
       HMM considers the following copy number states: CN 2 (normal), 1
       (single-copy loss), 0 (complete loss), 3 (single-copy gain).

       General Options:
           -c, --control-sample string
               optional control sample name. If given, pairwise calling is
               performed and the -P option can be used

           -f, --AF-file file
               read allele frequencies from a tab-delimited file with the
               columns CHR,POS,REF,ALT,AF

           -o, --output-dir path
               output directory

           -p, --plot-threshold float
               call matplotlib to produce plots for chromosomes with quality
               at least float, useful for visual inspection of the calls. With
               -p 0, plots for all chromosomes will be generated. If not
               given, a matplotlib script will be created but not called.

           -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
               see Common Options

           -R, --regions-file file
               see Common Options

           -s, --query-sample string
               query sample name

           -t, --targets LIST
               see Common Options

           -T, --targets-file FILE
               see Common Options

       HMM Options:
           -a, --aberrant float[,float]
               fraction of aberrant cells in query and control. The hallmark
               of duplications and contaminations is the BAF value of
               heterozygous markers which is dependent on the fraction of
               aberrant cells. Sensitivity to smaller fractions of cells can
               be increased by setting -a to a lower value. Note however, that
               this comes at the cost of increased false discovery rate.

           -b, --BAF-weight float
               relative contribution from BAF

           -d, --BAF-dev float[,float]
               expected BAF deviation in query and control, i.e. the noise
               observed in the data.

           -e, --err-prob float
               uniform error probability

           -l, --LRR-weight float
               relative contribution from LRR. With noisy data, this option
               can have a big effect on the number of calls produced. In truly
               random noise (such as in simulated data), the value should be
               set high (1.0), but in the presence of systematic noise when
               LRR are not informative, lower values result in cleaner calls
               (0.2).

           -L, --LRR-smooth-win int
               reduce LRR noise by applying moving average given this window
               size

           -O, --optimize float
               iteratively estimate the fraction of aberrant cells, down to
               the given fraction. Lowering this value from the default 1.0
               to, say, 0.3 can help discover more events but also increases
               noise

           -P, --same-prob float
               the prior probability of the query and the control sample being
               the same. Setting to 0 calls both independently, setting to 1
               forces the same copy number state in both.

           -x, --xy-prob float
               the HMM probability of transition to another copy number state.
               Increasing this value leads to smaller and more frequent
               calls.

   bcftools concat [OPTIONS] FILE1 FILE2 [...]
       Concatenate or combine VCF/BCF files. All source files must have the
       same sample columns appearing in the same order. Can be used, for
       example, to concatenate chromosome VCFs into one VCF, or combine a SNP
       VCF and an indel VCF into one. The input files must be sorted by chr
       and position. The files must be given in the correct order to produce
       sorted VCF on output unless the -a, --allow-overlaps option is
       specified. With the --naive option, the files are concatenated without
       being recompressed, which is very fast but dangerous if the BCF headers
       differ.

       -a, --allow-overlaps
           First coordinate of the next file can precede last record of the
           current file.

       -c, --compact-PS
           Do not output PS tag at each site, only at the start of a new phase
           set block.

       -d, --rm-dups snps|indels|both|all|none
           Output duplicate records of specified type present in multiple
           files only once. Requires -a, --allow-overlaps.

       -D, --remove-duplicates
           Alias for -d none

       -f, --file-list FILE
           Read file names from FILE, one file name per line.

       -l, --ligate
           Ligate phased VCFs by matching phase at overlapping haplotypes

       --no-version
           see Common Options

       -n, --naive
           Concatenate VCF or BCF files without recompression. This is very
           fast but requires that all files are of the same type (all VCF or
           all BCF) and have the same headers. This is because all tags and
           chromosome names in the BCF body rely on the implicit order of the
           contig and tag definitions in the header. Currently no sanity
           checks are in place. Dangerous, use with caution.

       -o, --output FILE
           see Common Options

       -O, --output-type b|u|z|v
           see Common Options

       -q, --min-PQ INT
           Break phase set if phasing quality is lower than INT

       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
           see Common Options. Requires -a, --allow-overlaps.

       -R, --regions-file FILE
           see Common Options. Requires -a, --allow-overlaps.

       --threads INT
           see Common Options

728   bcftools consensus [OPTIONS] FILE
729       Create consensus sequence by applying VCF variants to a reference fasta
730       file. By default, the program will apply all ALT variants to the
731       reference fasta to obtain the consensus sequence. Using the --sample
732       (and, optionally, --haplotype) option will apply genotype (haplotype)
733       calls from FORMAT/GT. Note that the program does not act as a primitive
734       variant caller and ignores allelic depth information, such as INFO/AD
735       or FORMAT/AD. For that, consider using the setGT plugin.
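
       The default behaviour (apply every ALT variant to the reference) can
       be illustrated with a short Python sketch. It is not the program's
       implementation: the function name apply_variants is hypothetical, and
       the sketch assumes variants are sorted, non-overlapping and have a
       single ALT allele.

```python
def apply_variants(ref_seq, variants):
    """Apply (pos, ref, alt) variants to ref_seq; pos is 1-based as in VCF.

    Assumes the variants are sorted by position and do not overlap.
    """
    out = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in variants:
        start = pos - 1
        if start < cursor:
            raise ValueError(f"overlapping variant at position {pos}")
        if ref_seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        out.append(ref_seq[cursor:start])  # unchanged bases before the variant
        out.append(alt)                    # ALT replaces REF
        cursor = start + len(ref)
    out.append(ref_seq[cursor:])           # remainder of the reference
    return "".join(out)


# A SNP, a 1-bp deletion and a 2-bp insertion
print(apply_variants("ACGTACGT", [(2, "C", "T"), (4, "TA", "T"), (7, "G", "GGG")]))
# ATGTCGGGT
```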
736
737       -c, --chain FILE
738           write a chain file for liftover
739
740       -e, --exclude EXPRESSION
741           exclude sites for which EXPRESSION is true. For valid expressions
742           see EXPRESSIONS.
743
744       -f, --fasta-ref FILE
745           reference sequence in fasta format
746
747       -H, --haplotype 1|2|R|A|LR|LA|SR|SA
748           choose which allele from the FORMAT/GT field to use (the codes are
749           case-insensitive):
750
751           1
752               the first allele
753
754           2
755               the second allele
756
757           R
758               the REF allele (in heterozygous genotypes)
759
760           A
761               the ALT allele (in heterozygous genotypes)
762
763           LR, LA
764               the longer allele. If both have the same length, use the REF
765               allele (LR), or the ALT allele (LA)
766
767           SR, SA
768               the shorter allele. If both have the same length, use the REF
769               allele (SR), or the ALT allele (SA)
770
771           This option requires -s, unless exactly one sample is present in the VCF.
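
       The haplotype codes can be summarised with the following Python
       sketch. It is only a reading of the table above, restricted to
       biallelic sites (REF plus one ALT); pick_allele is a hypothetical
       helper, and returning the only available allele for homozygous
       genotypes is an assumption.

```python
def pick_allele(ref, alt, gt, code):
    """Pick the allele to apply at a biallelic site.

    gt is a pair of allele indexes (0 = REF, 1 = ALT) and code is one of the
    --haplotype values. Multiallelic sites are not handled in this sketch.
    """
    alleles = [ref, alt]
    first, second = alleles[gt[0]], alleles[gt[1]]
    code = code.upper()  # the codes are case-insensitive
    if code == "1":
        return first
    if code == "2":
        return second
    if gt[0] == gt[1]:
        return first  # homozygous genotype: nothing to choose (assumption)
    if code == "R":
        return ref
    if code == "A":
        return alt
    if code in ("LR", "LA", "SR", "SA"):
        if len(ref) == len(alt):
            # Equal length: the second letter decides between REF and ALT
            return ref if code[1] == "R" else alt
        longer, shorter = (ref, alt) if len(ref) > len(alt) else (alt, ref)
        return longer if code[0] == "L" else shorter
    raise ValueError(f"unknown haplotype code: {code}")


print(pick_allele("A", "AT", (0, 1), "LR"))  # AT
print(pick_allele("A", "G", (0, 1), "sr"))   # A
```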
772
773       -i, --include EXPRESSION
774           include only sites for which EXPRESSION is true. For valid
775           expressions see EXPRESSIONS.
776
777       -I, --iupac-codes
778           output variants in the form of IUPAC ambiguity codes
779
780       -m, --mask FILE
781           BED file or TAB file with regions to be replaced with N. See
782           discussion of --regions-file in Common Options for file format
783           details.
784
785       -M, --missing CHAR
786           instead of skipping the missing genotypes, output the character
787           CHAR (e.g. "?")
788
789       -o, --output FILE
790           write output to a file
791
792       -s, --sample NAME
793           apply variants of the given sample
794
795       Examples:
796
797               # Apply variants present in sample "NA001", output IUPAC codes for hets
798               bcftools consensus -I -s NA001 -f in.fa in.vcf.gz > out.fa
799
800               # Create consensus for one region. The fasta header lines are then expected
801               # in the form ">chr:from-to".
802               samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa
803
804   bcftools convert [OPTIONS] FILE
805       VCF input options:
806           -e, --exclude EXPRESSION
807               exclude sites for which EXPRESSION is true. For valid
808               expressions see EXPRESSIONS.
809
810           -i, --include EXPRESSION
811               include only sites for which EXPRESSION is true. For valid
812               expressions see EXPRESSIONS.
813
814           -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
815               see Common Options
816
817           -R, --regions-file FILE
818               see Common Options
819
820           -s, --samples LIST
821               see Common Options
822
823           -S, --samples-file FILE
824               see Common Options
825
826           -t, --targets LIST
827               see Common Options
828
829           -T, --targets-file FILE
830               see Common Options
831
832       VCF output options:
833           --no-version
834               see Common Options
835
836           -o, --output FILE
837               see Common Options
838
839           -O, --output-type b|u|z|v
840               see Common Options
841
842           --threads INT
843               see Common Options
844
845       GEN/SAMPLE conversion:
846           -G, --gensample2vcf prefix or gen-file,sample-file
847               convert IMPUTE2 output to VCF. The second column must be of the
848               form "CHROM:POS_REF_ALT" to detect possible strand swaps;
849               IMPUTE2 leaves the first one empty ("--") when sites from
850               reference panel are filled in. See also -g below.
851
852           -g, --gensample prefix or gen-file,sample-file
853               convert from VCF to gen/sample format used by IMPUTE2 and
854               SHAPEIT. The columns of .gen file format are ID1,ID2,POS,A,B
855               followed by three genotype probabilities P(AA), P(AB), P(BB)
856               for each sample. In order to prevent strand swaps, the program
857               uses IDs of the form "CHROM:POS_REF_ALT". For example:
858
859                 .gen
860                 ----
861                 1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0
862                 1:111494194_C_T 1:111494194_C_T 111494194 C T 0 1 0 0 0 1
863
864                 .sample
865                 -------
866                 ID_1 ID_2 missing
867                 0 0 0
868                 sample1 sample1 0
869                 sample2 sample2 0
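
       As an illustration of the .gen layout, the Python sketch below builds
       one row from hard genotype calls. It assumes biallelic sites and
       certain genotypes (probability 1 for the called genotype); gen_line is
       a hypothetical helper, not part of bcftools.

```python
def gen_line(chrom, pos, ref, alt, genotypes):
    """Format one .gen row from hard genotype calls.

    genotypes holds the ALT allele count per sample (0, 1 or 2), which maps
    to the probability triplet P(AA), P(AB), P(BB).
    """
    site_id = f"{chrom}:{pos}_{ref}_{alt}"  # ID form used to detect strand swaps
    triplets = {0: "1 0 0", 1: "0 1 0", 2: "0 0 1"}
    probs = " ".join(triplets[g] for g in genotypes)
    return f"{site_id} {site_id} {pos} {ref} {alt} {probs}"


print(gen_line("1", 111485207, "G", "A", [1, 1]))
# 1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0
```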
870
871           --tag STRING
872               tag to take values for .gen file: GT,PL,GL,GP
873
874           --chrom
875               output chromosome in the first column instead of
876               CHROM:POS_REF_ALT
877
878           --sex FILE
879               output sex column in the sample file. The FILE format is
880
881                   MaleSample    M
882                   FemaleSample  F
883
884           --vcf-ids
885               output VCF IDs in the second column instead of
886               CHROM:POS_REF_ALT
887
888       gVCF conversion:
889           --gvcf2vcf
890               convert gVCF to VCF, expanding REF blocks into sites. Note that
891               the -i and -e options work differently with this switch. In
892               this situation the filtering expressions define which sites
893               should be expanded and which sites should be left unmodified,
894               but all sites are printed on output. In order to drop sites,
895               stream first through bcftools view.
896
897           -f, --fasta-ref file
898               reference sequence in fasta format. Must be indexed with
899               samtools faidx
900
901       HAP/SAMPLE conversion:
902           --hapsample2vcf prefix or hap-file,sample-file
903               convert from hap/sample format to VCF. The columns of .hap file
904               are similar to .gen file above, but there are only two
905               haplotype columns per sample. Note that the first column of the
906               .hap file is expected to be in the form
907               "CHR:POS_REF_ALT(_END)?", with the _END being optional for
908               defining the INFO/END tag when ALT is a symbolic allele, for
909               example:
910
911                 .hap
912                 ----
913                 1:111485207_G_A rsID1 111485207 G A 0 1 0 0
914                 1:111494194_C_T rsID2 111494194 C T 0 1 0 0
915                 1:111495231_A_<DEL>_111495784 rsID3 111495231 A <DEL> 0 0 1 0
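
       The identifier form "CHR:POS_REF_ALT(_END)?" can be parsed with a
       regular expression, as in the Python sketch below. The pattern is an
       assumption that fits the examples above; it does not handle
       chromosome names containing ":" or alleles containing "_".

```python
import re

# CHR:POS_REF_ALT with an optional trailing _END
HAP_ID = re.compile(
    r"^(?P<chrom>[^:]+):(?P<pos>\d+)_(?P<ref>[^_]+)_(?P<alt>[^_]+)(?:_(?P<end>\d+))?$"
)


def parse_hap_id(text):
    """Split a .hap identifier into its components; end is None when absent."""
    m = HAP_ID.match(text)
    if m is None:
        raise ValueError(f"not a CHR:POS_REF_ALT(_END)? identifier: {text}")
    end = m.group("end")
    return {
        "chrom": m.group("chrom"),
        "pos": int(m.group("pos")),
        "ref": m.group("ref"),
        "alt": m.group("alt"),
        "end": int(end) if end is not None else None,
    }


print(parse_hap_id("1:111495231_A_<DEL>_111495784"))
```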
916
917           --hapsample prefix or hap-file,sample-file
918               convert from VCF to hap/sample format used by IMPUTE2 and
919               SHAPEIT. The columns of .hap file begin with
920               ID,RSID,POS,REF,ALT. In order to prevent strand swaps, the
921               program uses IDs of the form "CHROM:POS_REF_ALT".
922
923           --haploid2diploid
924               with -h option converts haploid genotypes to homozygous diploid
925               genotypes. For example, the program will print 0 0 instead of
926               the default 0 -. This is useful for programs which do not
927               handle haploid genotypes correctly.
928
929           --sex FILE
930               output sex column in the sample file. The FILE format is
931
932                   MaleSample    M
933                   FemaleSample  F
934
935           --vcf-ids
936               output VCF IDs instead of "CHROM:POS_REF_ALT" IDs
937
938       HAP/LEGEND/SAMPLE conversion:
939           -H, --haplegendsample2vcf prefix or
940           hap-file,legend-file,sample-file
941               convert from hap/legend/sample format used by IMPUTE2 to VCF,
942               see also -h, --haplegendsample below.
943
944           -h, --haplegendsample prefix or hap-file,legend-file,sample-file
945               convert from VCF to hap/legend/sample format used by IMPUTE2
946               and SHAPEIT. The columns of .legend file are ID,POS,REF,ALT. In
947               order to prevent strand swaps, the program uses IDs of the form
948               "CHROM:POS_REF_ALT". The .sample file is quite basic at the
949               moment with columns for population, group and sex expected to
950               be edited by the user. For example:
951
952                 .hap
953                 -----
954                 0 1 0 0 1 0
955                 0 1 0 0 0 1
956
957                 .legend
958                 -------
959                 id position a0 a1
960                 1:111485207_G_A 111485207 G A
961                 1:111494194_C_T 111494194 C T
962
963                 .sample
964                 -------
965                 sample population group sex
966                 sample1 sample1 sample1 2
967                 sample2 sample2 sample2 2
968
969           --haploid2diploid
970               with -h option converts haploid genotypes to homozygous diploid
971               genotypes. For example, the program will print 0 0 instead of
972               the default 0 -. This is useful for programs which do not
973               handle haploid genotypes correctly.
974
975           --sex FILE
976               output sex column in the sample file. The FILE format is
977
978                   MaleSample    M
979                   FemaleSample  F
980
981           --vcf-ids
982               output VCF IDs instead of "CHROM:POS_REF_ALT" IDs
983
984       TSV conversion:
985           --tsv2vcf file
986               convert from TSV (tab-separated values) format (such as
987               generated by 23andMe) to VCF. The input file fields can be tab-
988               or space-delimited.
989
990           -c, --columns list
991               comma-separated list of fields in the input file. In the
992               current version, the fields CHROM, POS, ID, and AA are expected
993               and can appear in arbitrary order. Columns which should be
994               ignored in the input file can be indicated by "-". The AA field
995               lists alleles on the forward reference strand, for example "CC"
996               or "CT" for diploid genotypes or "C" for haploid genotypes (sex
997               chromosomes). Insertions and deletions are not supported yet;
998               missing data can be indicated with "--".
999
1000           -f, --fasta-ref file
1001               reference sequence in fasta format. Must be indexed with
1002               samtools faidx
1003
1004           -s, --samples LIST
1005               list of sample names. See Common Options
1006
1007           -S, --samples-file FILE
1008               file of sample names. See Common Options
1009
1010           Example:
1011
1012               # Convert 23andme results into VCF
1013               bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -Oz -o out.vcf.gz
1014
1015   bcftools csq [OPTIONS] FILE
1016       Haplotype-aware consequence predictor which correctly handles combined
1017       variants such as MNPs split over multiple VCF records, SNPs separated
1018       by an intron (but adjacent in the spliced transcript) or nearby
1019       frame-shifting indels which in combination in fact are not
1020       frame-shifting.
1021
1022       The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tags
1023       (configurable with the -c option). The latter is a bitmask of indexes
1024       to INFO/BCSQ, with interleaved haplotypes. See the usage examples below
1025       for using the %TBCSQ converter in query for extracting a more human
1026       readable form from this bitmask. The construction of the bitmask limits
1027       the number of consequences that can be referenced in the FORMAT/BCSQ
1028       tags. By default this is 16, but if more are required, see the --ncsq
1029       option.
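
       The limit of 16 follows from packing two haplotypes per consequence
       into one 32-bit integer (16 x 2 = 32 bits). The Python sketch below
       decodes such a bitmask. The bit layout (bit 2*i for the first
       haplotype and bit 2*i+1 for the second haplotype of the i-th
       consequence) is an assumption inferred from the phrase "interleaved
       haplotypes"; in practice, use the %TBCSQ converter in query instead.

```python
def decode_bcsq(mask, ncsq=16):
    """Return (haplotype, consequence index) pairs set in a FORMAT/BCSQ bitmask.

    Assumed layout: bit 2*i belongs to the first haplotype and bit 2*i + 1 to
    the second haplotype of the i-th consequence listed in INFO/BCSQ.
    """
    hits = []
    for i in range(ncsq):
        for hap in (0, 1):
            if mask & (1 << (2 * i + hap)):
                hits.append((hap + 1, i))
    return hits


# Bits 0 and 3 set: first consequence on haplotype 1, second on haplotype 2
print(decode_bcsq(0b1001))
# [(1, 0), (2, 1)]
```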
1030
1031       The program requires on input a VCF/BCF file, the reference genome in
1032       fasta format (--fasta-ref) and genomic features in the GFF3 format
1033       downloadable from the Ensembl website (--gff-annot), and outputs an
1034       annotated VCF/BCF file. Currently, only Ensembl GFF3 files are
1035       supported.
1036
1037       By default, the input VCF should be phased. If phase is unknown, or
1038       only partially known, the --phase option can be used to indicate how to
1039       handle unphased data. Alternatively, haplotype-aware calling can be
1040       turned off with the --local-csq option.
1041
1042       If conflicting (overlapping) variants within one haplotype are
1043       detected, a warning will be emitted and predictions will be based on
1044       only the first variant in the analysis.
1045
1046       Symbolic alleles are not supported. They will remain unannotated in the
1047       output VCF and are ignored for the prediction analysis.
1048
1049       -c, --custom-tag STRING
1050           use this custom tag to store consequences rather than the default
1051           BCSQ tag
1052
1053       -e, --exclude EXPRESSION
1054           exclude sites for which EXPRESSION is true. For valid expressions
1055           see EXPRESSIONS.
1056
1057       -f, --fasta-ref FILE
1058           reference sequence in fasta format (required)
1059
1060       --force
1061           run even if some sanity checks fail. Currently the option allows
1062           skipping transcripts in malformed GFFs with incorrect phase
1063
1064       -g, --gff-annot FILE
1065           GFF3 annotation file (required), such as
1066           ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens. An example of
1067           a minimal working GFF file:
1068
1069               # The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
1070               # looks up their parent transcript (determined from the "Parent=transcript:" attribute),
1071               # the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
1072               # (the most interesting is "protein_coding").
1073               #
1074               # Attributes required for
1075               #   gene lines:
1076               #   - ID=gene:<gene_id>
1077               #   - biotype=<biotype>
1078               #   - Name=<gene_name>      [optional]
1079               #
1080               #   transcript lines:
1081               #   - ID=transcript:<transcript_id>
1082               #   - Parent=gene:<gene_id>
1083               #   - biotype=<biotype>
1084               #
1085               #   other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
1086               #   - Parent=transcript:<transcript_id>
1087               #
1088               # Supported biotypes:
1089               #   - see the function gff_parse_biotype() in bcftools/csq.c
1090
1091               1   ignored_field  gene            21  2148  .   -   .   ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1092               1   ignored_field  transcript      21  2148  .   -   .   ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1093               1   ignored_field  three_prime_UTR 21  2054  .   -   .   Parent=transcript:TranscriptId
1094               1   ignored_field  exon            21  2148  .   -   .   Parent=transcript:TranscriptId
1095               1   ignored_field  CDS             21  2148  .   -   1   Parent=transcript:TranscriptId
1096               1   ignored_field  five_prime_UTR  210 2148  .   -   .   Parent=transcript:TranscriptId
1097
1098       -i, --include EXPRESSION
1099           include only sites for which EXPRESSION is true. For valid
1100           expressions see EXPRESSIONS.
1101
1102       -l, --local-csq
1103           switch off haplotype-aware calling, run localized predictions
1104           considering only one VCF record at a time
1105
1106       -n, --ncsq INT
1107           maximum number of consequences to consider per site. The INFO/BCSQ
1108           column includes all consequences, but only the first INT will be
1109           referenced by the FORMAT/BCSQ fields. The default value is 16 which
1110           corresponds to one integer per diploid sample. Note that increasing
1111           the value leads to increased memory and is rarely necessary.
1112
1113       -o, --output FILE
1114           see Common Options
1115
1116       -O, --output-type b|t|u|z|v
1117           see Common Options. In addition, a custom tab-delimited plain text
1118           output can be printed (t).
1119
1120       -p, --phase a|m|r|R|s
1121           how to handle unphased heterozygous genotypes:
1122
1123           a
1124               take GTs as is, create haplotypes regardless of phase (0/1 →
1125               0|1)
1126
1127           m
1128               merge all GTs into a single haplotype (0/1 → 1, 1/2 → 1)
1129
1130           r
1131               require phased GTs, throw an error on unphased heterozygous GTs
1132
1133           R
1134               create non-reference haplotypes if possible (0/1 → 1|1, 1/2 →
1135               1|2)
1136
1137           s
1138               skip unphased heterozygous GTs
1139
1140       -q, --quiet
1141           suppress warning messages
1142
1143       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1144           see Common Options
1145
1146       -R, --regions-file FILE
1147           see Common Options
1148
1149       -s, --samples LIST
1150           samples to include or "-" to apply all variants and ignore samples
1151
1152       -S, --samples-file FILE
1153           see Common Options
1154
1155       -t, --targets LIST
1156           see Common Options
1157
1158       -T, --targets-file FILE
1159           see Common Options
1160
1161       Examples:
1162
1163               # Basic usage
1164               bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf
1165
1166               # Extract the translated haplotype consequences. The following TBCSQ variations
1167               # are recognised:
1168               #   %TBCSQ    .. print consequences in all haplotypes in separate columns
1169               #   %TBCSQ{0} .. print the first haplotype only
1170               #   %TBCSQ{1} .. print the second haplotype only
1171               #   %TBCSQ{*} .. print a list of unique consequences present in either haplotype
1172               bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf
1173
1174       Examples of BCSQ annotation:
1175
1176               # Two separate VCF records at positions 2:122106101 and 2:122106102
1177               # change the same codon. This UV-induced C>T dinucleotide mutation
1178               # has been annotated fully at the position 2:122106101 with
1179               #   - consequence type
1180               #   - gene name
1181               #   - ensembl transcript ID
1182               #   - coding strand (+ fwd, - rev)
1183               #   - amino acid position (in the coding strand orientation)
1184               #   - list of corresponding VCF variants
1185               # The annotation at the second position gives the position of the full
1186               # annotation
1187               BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
1188               BCSQ=@122106101
1189
1190               # A frame-restoring combination of two frameshift insertions C>CG and T>TGG
1191               BCSQ=@46115084
1192               BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG
1193
1194               # Stop gained variant
1195               BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T
1196
1197               # The consequence type of a variant downstream from a stop is prefixed with *
1198               BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C
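
       The annotation strings above can be split into fields with a few
       lines of Python. This is a sketch based only on the six fields listed
       in the examples; parse_bcsq and its field names are hypothetical, and
       other bcftools versions may emit a different set of fields.

```python
def parse_bcsq(value):
    """Split one BCSQ entry into named fields, following the examples above."""
    if value.startswith("@"):
        # Pointer to the position that carries the full annotation
        return {"see_position": int(value[1:])}
    names = ["consequence", "gene", "transcript", "strand",
             "amino_acid_change", "dna_change"]
    record = dict(zip(names, value.split("|")))
    # A leading "*" marks a variant downstream from a stop
    record["downstream_of_stop"] = record["consequence"].startswith("*")
    record["consequence"] = record["consequence"].lstrip("*")
    return record


print(parse_bcsq("stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T"))
print(parse_bcsq("@122106101"))
```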
1199
1200   bcftools filter [OPTIONS] FILE
1201       Apply fixed-threshold filters.
1202
1203       -e, --exclude EXPRESSION
1204           exclude sites for which EXPRESSION is true. For valid expressions
1205           see EXPRESSIONS.
1206
1207       -g, --SnpGap INT
1208           filter SNPs within INT base pairs of an indel. The following
1209           example demonstrates the logic of --SnpGap 3 applied on a deletion
1210           and an insertion:
1211
1212           The SNPs at positions 1 and 7 are filtered, positions 0 and 8 are not:
1213                    0123456789
1214               ref  .G.GT..G..
1215               del  .A.G-..A..
1216           Here the positions 1 and 6 are filtered, 0 and 7 are not:
1217                    0123-456789
1218               ref  .G.G-..G..
1219               ins  .A.GT..A..
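
       The two diagrams can be reproduced with a small Python sketch that
       measures distance in alignment columns. This is one reading of the
       diagrams, not the program's implementation; filtered_positions is a
       hypothetical helper and it assumes a single-digit coordinate ruler.

```python
def filtered_positions(ruler, ref, alt, snp_positions, gap):
    """Return the SNP positions lying within `gap` columns of an indel.

    ruler, ref and alt are the aligned rows of the diagram; "-" in ref or alt
    marks an indel column, "-" in the ruler marks an inserted column that has
    no reference coordinate.
    """
    indel_cols = [i for i, (r, a) in enumerate(zip(ref, alt)) if "-" in (r, a)]
    col_of = {ch: i for i, ch in enumerate(ruler) if ch != "-"}
    hits = []
    for pos in snp_positions:
        col = col_of[str(pos)]
        if any(abs(col - c) <= gap for c in indel_cols):
            hits.append(pos)
    return hits


# Deletion example with --SnpGap 3
print(filtered_positions("0123456789", ".G.GT..G..", ".A.G-..A..", [0, 1, 7, 8], 3))
# [1, 7]

# Insertion example with --SnpGap 3
print(filtered_positions("0123-456789", ".G.G-..G..", ".A.GT..A..", [0, 1, 6, 7], 3))
# [1, 6]
```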
1220
1221       -G, --IndelGap INT
1222           filter clusters of indels separated by INT or fewer base pairs
1223           allowing only one to pass. The following example demonstrates the
1224           logic of --IndelGap 2 applied on a deletion and an insertion:
1225
1226           The second indel is filtered:
1227                    012345678901
1228               ref  .GT.GT..GT..
1229               del  .G-.G-..G-..
1230           And similarly here, the second is filtered:
1231                    01 23 456 78
1232               ref  .A-.A-..A-..
1233               ins  .AT.AT..AT..
1234
1235       -i, --include EXPRESSION
1236           include only sites for which EXPRESSION is true. For valid
1237           expressions see EXPRESSIONS.
1238
1239       -m, --mode [+x]
1240           define behaviour at sites with existing FILTER annotations. The
1241           default mode replaces existing filters of failed sites with a new
1242           FILTER string while leaving sites which pass untouched when
1243           non-empty and setting to "PASS" when the FILTER string is absent.
1244           The "+" mode appends new FILTER strings of failed sites instead of
1245           replacing them. The "x" mode resets filters of sites which pass to
1246           "PASS". Modes "+" and "x" can both be set.
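
       The modes can be summarised as a small decision function. The Python
       sketch below is a reading of the paragraph above, not the program's
       code: update_filter is hypothetical, "." stands for an absent FILTER,
       and appending to an empty or "PASS" value is assumed to yield just
       the new filter name.

```python
def update_filter(existing, failed, new_filter, mode=""):
    """Return the FILTER value for one site.

    existing is the current FILTER string ("." when absent), failed says
    whether the site failed the expression, mode is "", "+", "x" or "+x".
    """
    if failed:
        if "+" in mode and existing not in (".", "PASS"):
            return existing + ";" + new_filter  # append to existing filters
        return new_filter                       # default: replace
    if "x" in mode or existing == ".":
        return "PASS"                           # reset, or fill in when absent
    return existing                             # passing site left untouched


print(update_filter("q10", True, "LowQual"))        # LowQual
print(update_filter("q10", True, "LowQual", "+"))   # q10;LowQual
print(update_filter("q10", False, "LowQual", "x"))  # PASS
```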
1247
1248       --no-version
1249           see Common Options
1250
1251       -o, --output FILE
1252           see Common Options
1253
1254       -O, --output-type b|u|z|v
1255           see Common Options
1256
1257       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1258           see Common Options
1259
1260       -R, --regions-file file
1261           see Common Options
1262
1263       -s, --soft-filter STRING|+
1264           annotate FILTER column with STRING or, with +, a unique filter name
1265           generated by the program ("Filter%d").
1266
1267       -S, --set-GTs .|0
1268           set genotypes of failed samples to missing value (.) or reference
1269           allele (0)
1270
1271       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1272           see Common Options
1273
1274       -T, --targets-file file
1275           see Common Options
1276
1277       --threads INT
1278           see Common Options
1279
1280   bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz
1281       Checks sample identity. The program can operate in two modes. If the -g
1282       option is given, the identity of the -s sample from query.vcf.gz is
1283       checked against the samples in the -g file. Without the -g option,
1284       multi-sample cross-check of samples in query.vcf.gz is performed.
1285
1286       -a, --all-sites
1287           output for all sites
1288
1289       -c, --cluster FLOAT,FLOAT
1290           min inter- and max intra-sample error [0.23,-0.3]
1291
1292               The first "min" argument controls the typical error rate in multiplexed
1293               runs ("lanelets") from the same sample. Lanelets with error rate less
1294               than this will always be considered as coming from the same sample.
1295               The second "max" argument is the reverse: lanelets with error rate
1296               greater than the absolute value of this parameter will always be
1297               considered as different samples. When the value is negative, the cutoff
1298               may be heuristically lowered by the clustering engine. If positive, the
1299               value is interpreted as a fixed cutoff.
1300
1301       -g, --genotypes genotypes.vcf.gz
1302           reference genotypes to compare against
1303
1304       -G, --GTs-only INT
1305           use genotypes (GT) instead of genotype likelihoods (PL). When set
1306           to 1, reported discordance is the number of non-matching GTs,
1307           otherwise the number INT is interpreted as phred-scaled likelihood
1308           of unobserved genotypes.
1309
1310       -H, --homs-only
1311           consider only genotypes which are homozygous in both genotypes and
1312           query VCF. This may be useful with low coverage data.
1313
1314       -p, --plot PREFIX
1315           produce plots
1316
1317       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1318           see Common Options
1319
1320       -R, --regions-file file
1321           see Common Options
1322
1323       -s, --query-sample STRING
1324           query sample in query.vcf.gz. By default, the first sample is
1325           checked.
1326
1327       -S, --target-sample STRING
1328           target sample in the -g file, used only for plotting, not for
1329           analysis
1330
1331       -t, --targets file
1332           see Common Options
1333
1334       -T, --targets-file file
1335           see Common Options
1336
1337       Output files format:
1338           CN, Discordance
1339               Pairwise discordance for all sample pairs is calculated as
1340
1341                       \sum_s { min_G { PL_a(G) + PL_b(G) } },
1342
1343               where the sum runs over all sites s and G is the most
1344               likely genotype shared by both samples a and b. When PL field
1345               is not present, a constant value 99 is used for the unseen
1346               genotypes. With -G, the value 1 can be used instead; the
1347               discordance value then gives exactly the number of differing
1348               genotypes.
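
       The discordance formula can be written out in Python as below. The
       sketch assumes that both samples list their PL values in the same
       genotype order at every site; the function name and the example
       values are made up for illustration.

```python
def discordance(pls_a, pls_b):
    """Sum over sites of the minimum over genotypes G of PL_a(G) + PL_b(G).

    pls_a and pls_b are per-site lists of phred-scaled likelihoods, one value
    per genotype, in the same genotype order for both samples.
    """
    total = 0
    for site_a, site_b in zip(pls_a, pls_b):
        total += min(a + b for a, b in zip(site_a, site_b))
    return total


# Two sites, three genotypes each (for example 0/0, 0/1, 1/1)
sample_a = [[0, 30, 60], [40, 0, 50]]
sample_b = [[0, 25, 70], [0, 35, 80]]
print(discordance(sample_a, sample_b))
# 35
```

       With hard genotypes scored as 0 for the called genotype and 1 for all
       others, the same function counts the number of differing genotypes.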
1349
1350           ERR, error rate
1351               Pairwise error rate calculated as number of differences divided
1352               by the total number of comparisons.
1353
1354           CLUSTER, TH, DOT
1355               In presence of multiple samples, related samples and outliers
1356               can be identified by clustering samples by error rate. A simple
1357               hierarchical clustering based on minimization of standard
1358               deviation is used. This is useful to detect sample swaps, for
1359               example in situations where one sample has been sequenced in
1360               multiple runs.
1361
1362   bcftools index [OPTIONS] in.bcf|in.vcf.gz
1363       Creates index for bgzip compressed VCF/BCF files for random access. CSI
1364       (coordinate-sorted index) is created by default. The CSI format
1365       supports indexing of chromosomes up to length 2^31. TBI (tabix index)
1366       files, which support chromosome lengths up to 2^29, can be
1367       created by using the -t/--tbi option or using the tabix program
1368       packaged with htslib. When loading an index file, bcftools will try the
1369       CSI first and then the TBI.
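
       The length limits translate into a simple check on the longest
       contig. The Python sketch below uses only the two limits stated
       above; usable_index_formats is a hypothetical helper and the contig
       lengths are arbitrary.

```python
TBI_MAX = 2 ** 29  # largest chromosome length supported by TBI, per the text above
CSI_MAX = 2 ** 31  # largest chromosome length supported by CSI, per the text above


def usable_index_formats(contig_lengths):
    """Return the index formats whose limit covers the longest contig."""
    longest = max(contig_lengths)
    formats = []
    if longest <= CSI_MAX:
        formats.append("csi")
    if longest <= TBI_MAX:
        formats.append("tbi")
    return formats


print(usable_index_formats([250_000_000, 60_000_000]))  # ['csi', 'tbi']
print(usable_index_formats([600_000_000]))              # ['csi']
```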
1370
1371       Indexing options:
1372           -c, --csi
1373               generate CSI-format index for VCF/BCF files [default]
1374
1375           -f, --force
1376               overwrite index if it already exists
1377
1378           -m, --min-shift INT
1379               set minimal interval size for CSI indices to 2^INT; default: 14
1380
1381           -o, --output-file FILE
1382               output file name. If not set, then the index will be created
1383               using the input file name plus a .csi or .tbi extension
1384
1385           -t, --tbi
1386               generate TBI-format index for VCF files
1387
1388           --threads INT
1389               see Common Options
1390
1391       Stats options:
1392           -n, --nrecords
1393               print the number of records based on the CSI or TBI index files
1394
1395           -s, --stats
1396               Print per contig stats based on the CSI or TBI index files.
1397               Output format is three tab-delimited columns listing the contig
1398               name, contig length (. if unknown) and number of records for
1399               the contig. Contigs with zero records are not printed.
1400
1401   bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz [...]
1402       Creates intersections, unions and complements of VCF files. Depending
1403       on the options, the program can output records from one (or more) files
1404       which have (or do not have) corresponding records with the same
1405       position in the other files.
1406
1407       -c, --collapse snps|indels|both|all|some|none
1408           see Common Options
1409
1410       -C, --complement
1411           output positions present only in the first file but missing in the
1412           others
1413
1414       -e, --exclude -|EXPRESSION
1415           exclude sites for which EXPRESSION is true. If -e (or -i) appears
1416           only once, the same filtering expression will be applied to all
1417           input files. Otherwise, -e or -i must be given for each input file.
1418           To indicate that no filtering should be performed on a file, use
1419           "-" in place of EXPRESSION, as shown in the example below. For
1420           valid expressions see EXPRESSIONS.
1421
1422       -f, --apply-filters LIST
1423           see Common Options
1424
1425       -i, --include EXPRESSION
1426           include only sites for which EXPRESSION is true. See discussion of
1427           -e, --exclude above.
1428
1429       -n, --nfiles [+-=]INT|~BITMAP
1430           output positions present in this many files (=), this many or more
1431           (+), this many or fewer (-), or exactly the files set in BITMAP (~)
1432
1433       -o, --output FILE
1434           see Common Options. When several files are being output, their
1435           names are controlled via -p instead.
1436
1437       -O, --output-type b|u|z|v
1438           see Common Options
1439
1440       -p, --prefix DIR
1441           if given, subset each of the input files accordingly. See also -w.
1442
1443       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1444           see Common Options
1445
1446       -R, --regions-file file
1447           see Common Options
1448
1449       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1450           see Common Options
1451
1452       -T, --targets-file file
1453           see Common Options
1454
1455       -w, --write LIST
1456           list of input files to output given as 1-based indices. With -p and
1457           no -w, all files are written.
1458
1459       Examples:
1460           Create intersection and complements of two sets saving the output
1461           in dir/*
1462
1463                   bcftools isec -p dir A.vcf.gz B.vcf.gz
1464
1465           Filter sites in A (require INFO/MAF>=0.01) and B (require
1466           INFO/dbSNP) but not in C, and create an intersection, including
1467           only sites which appear in at least two of the files after filters
1468           have been applied
1469
1470                   bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e- A.vcf.gz B.vcf.gz C.vcf.gz -n +2 -p dir
1471
1472           Extract and write records from A shared by both A and B using exact
1473           allele match
1474
1475                   bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
1476
1477           Extract records private to A or B comparing by position only
1478
1479                   bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
1480
1481           Print a list of records which are present in A and B but not in C
1482           and D
1483
1484                   bcftools isec -n~1100 -c all A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz
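
               The ~BITMAP form used in the last example can be read one
               character per input file, in command-line order. The sketch
               below spells that reading out; bitmap_files is a hypothetical
               helper, and the check that the bitmap length equals the
               number of files is an assumption of the sketch.

```shell
# bitmap_files BITMAP FILE...
# Print, for each input file, whether a record must be present (1) or
# absent (0) to match "-n~BITMAP". Hypothetical helper, not part of bcftools.
bitmap_files() {
    bitmap=$1
    shift
    # Assumption of this sketch: one bitmap character per input file.
    if [ "${#bitmap}" -ne "$#" ]; then
        echo "bitmap length must equal the number of files" >&2
        return 1
    fi
    for f in "$@"; do
        case $bitmap in
            1*) echo "$f present" ;;
            0*) echo "$f absent" ;;
            *)  echo "invalid bitmap character" >&2; return 1 ;;
        esac
        bitmap=${bitmap#?}   # move on to the next character
    done
}
```

               For instance, "bitmap_files 1100 A.vcf.gz B.vcf.gz C.vcf.gz
               D.vcf.gz" reports A and B as present and C and D as absent,
               matching the description of the example above.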
1485
1486   bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz [...]
1487       Merge multiple VCF/BCF files from non-overlapping sample sets to create
1488       one multi-sample file. For example, when merging file A.vcf.gz
1489       containing samples S1, S2 and S3 and file B.vcf.gz containing samples
1490       S3 and S4, the output file will contain five samples named S1, S2, S3,
1491       2:S3 and S4.
1492
1493       Note that it is the responsibility of the user to ensure that the sample
1494       names are unique across all files. If they are not, the program will
1495       exit with an error unless the option --force-samples is given. The
1496       sample names can also be given explicitly using the --print-header and
1497       --use-header options.
1498
1499       Note that only records from different files can be merged, never from
1500       the same file. For "vertical" merge take a look at bcftools concat or
1501       bcftools norm -m instead.
1502
1503       --force-samples
1504           if the merged files contain duplicate sample names, proceed
1505           anyway. Duplicate sample names will be resolved by prepending index
1506           of the file as it appeared on the command line to the conflicting
1507           sample name (see 2:S3 in the above example).
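
               The renaming rule can be previewed before merging. This is a
               minimal sketch, assuming input lines of the form
               "FILE_INDEX<TAB>SAMPLE" listed in command-line order;
               merged_sample_names is a hypothetical helper and only models
               the rule as described above.

```shell
# merged_sample_names: read "FILE_INDEX<TAB>SAMPLE" lines from stdin and
# print the names expected in the merged header under --force-samples.
# Hypothetical helper modelling the documented rule; not part of bcftools.
merged_sample_names() {
    awk -F '\t' '{
        name = $2
        # A name seen in an earlier file gets the file index prepended.
        if (name in seen) name = $1 ":" name
        seen[name] = 1
        print name
    }'
}
```

               With samples S1, S2, S3 from file 1 and S3, S4 from file 2,
               the helper prints S1, S2, S3, 2:S3 and S4. The per-file
               sample lists could be obtained with bcftools query -l.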
1508
1509       --print-header
1510           print only merged header and exit
1511
1512       --use-header FILE
1513           use the VCF header in the provided text FILE
1514
1515       -0, --missing-to-ref
1516           assume genotypes at missing sites are 0/0
1517
1518       -f, --apply-filters LIST
1519           see Common Options
1520
1521       -F, --filter-logic x|+
1522           Set the output record to PASS if any of the inputs is PASS (x), or
1523           apply all filters (+), which is the default.
1524
1525       -g, --gvcf -|FILE
1526           merge gVCF blocks, INFO/END tag is expected. If the reference fasta
1527           file FILE is not given and the dash (-) is given, unknown reference
1528           bases generated at gVCF block splits will be substituted with N’s.
1529           The --gvcf option uses the following default INFO rules: -i
1530           QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max.
1531
1532       -i, --info-rules -|TAG:METHOD[,...]
1533           Rules for merging INFO fields (scalars or vectors) or - to disable
1534           the default rules.  METHOD is one of sum, avg, min, max, join.
1535           Default is DP:sum,DP4:sum if these fields exist in the input files.
1536           Fields with no specified rule will take the value from the first
1537           input file. The merged QUAL value is currently set to the maximum.
1538           This behaviour is not user controllable at the moment.
1539
1540       -l, --file-list FILE
1541           Read file names from FILE, one file name per line.
1542
1543       -m, --merge snps|indels|both|all|none|id
1544           The option controls what types of multiallelic records can be
1545           created:
1546
1547           -m none   ..  no new multiallelics, output multiple records instead
1548           -m snps   ..  allow multiallelic SNP records
1549           -m indels ..  allow multiallelic indel records
1550           -m both   ..  both SNP and indel records can be multiallelic
1551           -m all    ..  SNP records can be merged with indel records
1552           -m id     ..  merge by ID
1553
1554       --no-version
1555           see Common Options
1556
1557       -o, --output FILE
1558           see Common Options
1559
1560       -O, --output-type b|u|z|v
1561           see Common Options
1562
1563       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1564           see Common Options
1565
1566       -R, --regions-file file
1567           see Common Options
1568
1569       --threads INT
1570           see Common Options
1571
1572   bcftools mpileup [OPTIONS] -f ref.fa in.bam [in2.bam [...]]
1573       Generate VCF or BCF containing genotype likelihoods for one or multiple
1574       alignment (BAM or CRAM) files. This is based on the original samtools
1575       mpileup command (with the -v or -g options) producing genotype
1576       likelihoods in VCF or BCF format, but not the textual pileup output.
1577       The mpileup command was transferred to bcftools in order to avoid
1578       errors resulting from use of incompatible versions of samtools and
1579       bcftools when used in the mpileup + bcftools call pipeline.
1580
1581       Individuals are identified from the SM tags in the @RG header lines.
1582       Multiple individuals can be pooled in one alignment file, and one
1583       individual can be separated into multiple files. If sample identifiers
1584       are absent, each input file is regarded as one sample.
1585
1586       Note that there are two orthogonal ways to specify locations in the
1587       input file; via -r region and -t positions. The former uses (and
1588       requires) an index to do random access while the latter streams through
1589       the file contents, keeping only the specified regions, and requires no
1590       index. The two may be used in conjunction. For example a BED file
1591       containing locations of genes in chromosome 20 could be specified using
1592       -r 20 -t chr20.bed, meaning that the index is used to find chromosome
1593       20 and then it is filtered for the regions listed in the BED file. Also
1594       note that the -r option can be much slower than -t with many regions
1595       and can require more memory when multiple regions and many alignment
1596       files are processed.
1597
1598       Input options
1599           -6, --illumina1.3+
1600               Assume the quality is in the Illumina 1.3+ encoding.
1601
1602           -A, --count-orphans
1603               Do not skip anomalous read pairs in variant calling.
1604
1605           -b, --bam-list FILE
1606               List of input alignment files, one file per line [null]
1607
1608           -B, --no-BAQ
1609               Disable probabilistic realignment for the computation of base
1610               alignment quality (BAQ). BAQ is the Phred-scaled probability of
1611               a read base being misaligned. BAQ greatly helps to reduce
1612               false SNPs caused by misalignments; this option turns it off.
1613
1614           -C, --adjust-MQ INT
1615               Coefficient for downgrading mapping quality for reads
1616               containing excessive mismatches. Given a read with a
1617               phred-scaled probability q of being generated from the mapped
1618               position, the new mapping quality is about
1619               sqrt((INT-q)/INT)*INT. A zero value disables this
1620               functionality; if enabled, the recommended value for BWA is 50.
1621               [0]
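
                   The formula can be evaluated directly to get a feel for
                   the effect of INT. The sketch below is only arithmetic
                   on the expression quoted above; adjusted_mq is a
                   hypothetical helper, and reporting "undefined" when q
                   exceeds INT is a choice of the sketch, not documented
                   bcftools behaviour.

```shell
# adjusted_mq INT Q
# Evaluate sqrt((INT-q)/INT)*INT from the -C description.
# Hypothetical helper for illustration; not part of bcftools.
adjusted_mq() {
    awk -v c="$1" -v q="$2" 'BEGIN {
        # The expression has no real value when INT <= 0 or q > INT.
        if (c <= 0 || q > c) { print "undefined"; exit 1 }
        printf "%.2f\n", sqrt((c - q) / c) * c
    }'
}
```

                   For example, "adjusted_mq 50 32" evaluates
                   sqrt(18/50)*50 and prints 30.00.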
1622
1623           -d, --max-depth INT
1624               At a position, read maximally INT reads per input file. Note
1625               that bcftools has a minimum value of 8000/n where n is the
1626               number of input files given to mpileup. This means the default
1627               is highly likely to be increased. Once above the cross-sample
1628               minimum of 8000 the -d parameter will have an effect. [250]
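
                   The interaction between -d and the 8000/n minimum can
                   be worked through numerically. The sketch below assumes
                   integer division and that the larger of the two values
                   is used; effective_max_depth is a hypothetical helper.

```shell
# effective_max_depth D N
# Per-file depth limit for -d D with N input files, assuming the limit is
# raised to 8000/N (integer division) whenever D is smaller than that.
# Hypothetical helper for illustration; not part of bcftools.
effective_max_depth() {
    d=$1
    n=$2
    floor=$(( 8000 / n ))
    if [ "$d" -lt "$floor" ]; then
        echo "$floor"
    else
        echo "$d"
    fi
}
```

                   Under these assumptions the default of 250 becomes
                   8000 with a single input file, and is used unchanged
                   once 32 or more files are given.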
1629
1630           -E, --redo-BAQ
1631               Recalculate BAQ on the fly, ignore existing BQ tags
1632
1633           -f, --fasta-ref FILE
1634               The faidx-indexed reference file in the FASTA format. The file
1635               can be optionally compressed by bgzip. Reference is required by
1636               default unless the --no-reference option is set [null]
1637
1638           --no-reference
1639               Do not require the --fasta-ref option.
1640
1641           -G, --read-groups FILE
1642               list of read groups to include or exclude if prefixed with "^".
1643               One read group per line. This file can also be used to assign
1644               new sample names to read groups by giving the new sample name
1645               as a second white-space-separated field, like this:
1646               "read_group_id new_sample_name". If the read group name is not
1647               unique, also the bam file name can be included: "read_group_id
1648               file_name sample_name". If all reads from the alignment file
1649               should be treated as a single sample, the asterisk symbol can
1650               be used: "* file_name sample_name". Alignments without a read
1651               group ID can be matched with "?".  NOTE: The meaning of
1652               bcftools mpileup -G is the opposite of samtools mpileup -G.
1653
1654                   RG_ID_1
1655                   RG_ID_2  SAMPLE_A
1656                   RG_ID_3  SAMPLE_A
1657                   RG_ID_4  SAMPLE_B
1658                   RG_ID_5  FILE_1.bam  SAMPLE_A
1659                   RG_ID_6  FILE_2.bam  SAMPLE_A
1660                   *        FILE_3.bam  SAMPLE_C
1661                   ?        FILE_3.bam  SAMPLE_D
1662
1663           -q, --min-MQ INT
1664               Minimum mapping quality for an alignment to be used [0]
1665
1666           -Q, --min-BQ INT
1667               Minimum base quality for a base to be considered [13]
1668
1669           -r, --regions CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,...]
1670               Only generate mpileup output in given regions. Requires the
1671               alignment files to be indexed. If used in conjunction with -t
1672               then considers the intersection; see Common Options
1673
1674           -R, --regions-file FILE
1675               As for -r, --regions, but regions read from FILE; see Common
1676               Options
1677
1678           --ignore-RG
1679               Ignore RG tags. Treat all reads in one alignment file as one
1680               sample.
1681
1682           --rf, --incl-flags STR|INT
1683               Required flags: skip reads with mask bits unset [null]
1684
1685           --ff, --excl-flags STR|INT
1686               Filter flags: skip reads with mask bits set
1687               [UNMAP,SECONDARY,QCFAIL,DUP]
1688
1689           -s, --samples LIST
1690               list of sample names. See Common Options
1691
1692           -S, --samples-file FILE
1693               file of sample names to include or exclude if prefixed with
1694               "^". One sample per line. This file can also be used to rename
1695               samples by giving the new sample name as a second
1696               white-space-separated column, like this: "old_name new_name".
1697               If a sample name contains spaces, the spaces can be escaped
1698               using the backslash character, for example "Not\ a\ good\
1699               sample\ name".
1700
1701           -t, --targets LIST
1702               see Common Options
1703
1704           -T, --targets-file FILE
1705               see Common Options
1706
1707           -x, --ignore-overlaps
1708               Disable read-pair overlap detection.
1709
1710       Output options
1711           -a, --annotate LIST
1712               Comma-separated list of FORMAT and INFO tags to output.
1713               (case-insensitive, the "FORMAT/" prefix is optional, and use
1714               "?" to list available annotations on the command line) [null]:
1715
1716                   FORMAT/AD .. Allelic depth (Number=R,Type=Integer)
1717                   FORMAT/ADF .. Allelic depths on the forward strand (Number=R,Type=Integer)
1718                   FORMAT/ADR .. Allelic depths on the reverse strand (Number=R,Type=Integer)
1719                   FORMAT/DP .. Number of high-quality bases (Number=1,Type=Integer)
1720                   FORMAT/SP .. Phred-scaled strand bias P-value (Number=1,Type=Integer)
1721
1722                   INFO/AD .. Total allelic depth (Number=R,Type=Integer)
1723                   INFO/ADF .. Total allelic depths on the forward strand (Number=R,Type=Integer)
1724                   INFO/ADR .. Total allelic depths on the reverse strand (Number=R,Type=Integer)
1725
1726                   FORMAT/DV .. Deprecated in favor of FORMAT/AD;
1727                           Number of high-quality non-reference bases (Number=1,Type=Integer)
1728                   FORMAT/DP4 .. Deprecated in favor of FORMAT/ADF and FORMAT/ADR;
1729                           Number of high-quality ref-forward, ref-reverse,
1730                           alt-forward and alt-reverse bases (Number=4,Type=Integer)
1731                   FORMAT/DPR .. Deprecated in favor of FORMAT/AD;
1732                           Number of high-quality bases for each observed allele (Number=R,Type=Integer)
1733                   INFO/DPR .. Deprecated in favor of INFO/AD;
1734                           Number of high-quality bases for each observed allele (Number=R,Type=Integer)
1735
1736           -g, --gvcf INT[,...]
1737               output gVCF blocks of homozygous REF calls, with depth (DP)
1738               ranges specified by the list of integers. For example, passing
1739               5,15 will group sites into two types of gVCF blocks, the first
1740               with minimum per-sample DP from the interval [5,15) and the
1741               latter with minimum depth 15 or more. In this example, sites
1742               with minimum per-sample depth less than 5 will be printed as
1743               separate records, outside of gVCF blocks.
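
                   The grouping rule can be illustrated with a small
                   classifier. The sketch assumes the bounds are given in
                   ascending order; gvcf_block is a hypothetical helper
                   that only mirrors the description above.

```shell
# gvcf_block MIN_DP BOUNDS
# Report which gVCF block type a site with the given minimum per-sample
# depth falls into, for a comma-separated ascending BOUNDS list as passed
# to -g. Hypothetical helper for illustration; not part of bcftools.
gvcf_block() {
    awk -v dp="$1" -v list="$2" 'BEGIN {
        n = split(list, b, ",")
        # Below the first bound: printed outside of gVCF blocks.
        if (dp < b[1]) { print "separate record"; exit }
        for (i = n; i >= 1; i--) {
            if (dp >= b[i]) {
                if (i == n) print "[" b[i] ",inf)"
                else        print "[" b[i] "," b[i + 1] ")"
                exit
            }
        }
    }'
}
```

                   With the list 5,15, a depth of 3 gives "separate
                   record", 14 gives "[5,15)" and 15 gives "[15,inf)".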
1744
1745           --no-version
1746               see Common Options
1747
1748           -o, --output FILE
1749               Write output to FILE, rather than the default of standard
1750               output. (The same short option is used for both --open-prob and
1751               --output. If -o's argument contains any non-digit characters
1752               other than a leading + or - sign, it is interpreted as
1753               --output. Usually the filename extension will take care of
1754               this, but to write to an entirely numeric filename use -o ./123
1755               or --output 123.)
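
                   The rule for telling the two meanings of -o apart can
                   be expressed as a pattern match. The sketch below
                   follows the wording above; classify_o_arg is a
                   hypothetical helper, and treating an empty argument or
                   a bare sign as --output is an assumption.

```shell
# classify_o_arg ARG
# Print which long option "-o ARG" would be read as, following the rule
# that an optional sign followed only by digits means --open-prob.
# Hypothetical helper for illustration; not part of bcftools.
classify_o_arg() {
    arg=$1
    digits=${arg#[+-]}   # drop one optional leading + or - sign
    case $digits in
        # Assumption: nothing left after the sign counts as a file name.
        ''|*[!0-9]*) echo "--output" ;;
        *)           echo "--open-prob" ;;
    esac
}
```

                   Here "classify_o_arg 123" prints --open-prob, while
                   "classify_o_arg ./123" and "classify_o_arg out.bcf"
                   print --output.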
1756
1757           -O, --output-type b|u|z|v
1758               see Common Options
1759
1760           --threads INT
1761               see Common Options
1762
1763       Options for SNP/INDEL genotype likelihood computation
1764           -e, --ext-prob INT
1765               Phred-scaled gap extension sequencing error probability.
1766               Reducing INT leads to longer indels [20]
1767
1768           -F, --gap-frac FLOAT
1769               Minimum fraction of gapped reads [0.002]
1770
1771           -h, --tandem-qual INT
1772               Coefficient for modeling homopolymer errors. Given an l-long
1773               homopolymer run, the sequencing error of an indel of size s is
1774               modeled as INT*s/l [100]
1775
1776           -I, --skip-indels
1777               Do not perform INDEL calling
1778
1779           -L, --max-idepth INT
1780               Skip INDEL calling if the average per-sample depth is above INT
1781               [250]
1782
1783           -m, --min-ireads INT
1784               Minimum number of gapped reads for indel candidates [1]
1785
1786           -o, --open-prob INT
1787               Phred-scaled gap open sequencing error probability. Reducing
1788               INT leads to more indel calls. (The same short option is used
1789               for both --open-prob and --output. When -o’s argument contains
1790               only an optional + or - sign followed by the digits 0 to 9, it
1791               is interpreted as --open-prob.) [40]
1792
1793           -p, --per-sample-mF
1794               Apply -m and -F thresholds per sample to increase sensitivity
1795               of calling. By default both options are applied to reads pooled
1796               from all samples.
1797
1798           -P, --platforms STR
1799               Comma-delimited list of platforms (determined by @RG-PL) from
1800               which indel candidates are obtained. It is recommended to
1801               collect indel candidates from sequencing technologies that have
1802               low indel error rate such as ILLUMINA [all]
1803
1804       Examples:
1805           Call SNPs and short INDELs, then mark low quality sites and sites
1806           with the read depth exceeding a limit. (The read depth should be
1807           adjusted to about twice the average read depth as higher read
1808           depths usually indicate problematic regions which are often
1809           enriched for artefacts.) Consider adding -C50 to mpileup if
1810           mapping quality is overestimated for reads containing excessive
1811           mismatches. Applying this option usually helps for BWA-backtrack
1812           alignments, but may not help for other aligners.
1813
1814                   bcftools mpileup -Ou -f ref.fa aln.bam | \
1815                   bcftools call -Ou -mv | \
1816                   bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
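
               The depth limit in the filter above can be derived from the
               average read depth instead of being typed by hand. This is a
               sketch under the twice-the-average rule of thumb stated
               above, with an integer average depth; depth_filter_expr and
               call_and_filter are hypothetical helper names.

```shell
# depth_filter_expr AVG_DEPTH
# Build the exclusion expression used above, with the depth limit set to
# twice the (integer) average read depth. Hypothetical helper.
depth_filter_expr() {
    avg_depth=$1
    max_dp=$(( 2 * avg_depth ))
    printf "%%QUAL<20 || DP>%d\n" "$max_dp"
}

# call_and_filter REF BAM AVG_DEPTH
# Same pipeline as the example above, writing VCF to stdout.
call_and_filter() {
    bcftools mpileup -Ou -f "$1" "$2" |
        bcftools call -Ou -mv |
        bcftools filter -s LowQual -e "$(depth_filter_expr "$3")"
}
```

               An average depth of 50 reproduces the expression shown in
               the example, so "call_and_filter ref.fa aln.bam 50 >
               var.flt.vcf" would run the same commands.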
1817
1818   bcftools norm [OPTIONS] file.vcf.gz
1819       Left-align and normalize indels, check if REF alleles match the
1820       reference, split multiallelic sites into multiple rows; recover
1821       multiallelics from multiple rows. Left-alignment and normalization will
1822       only be applied if the --fasta-ref option is supplied.
1823
1824       -c, --check-ref e|w|x|s
1825           what to do when incorrect or missing REF allele is encountered:
1826           exit (e), warn (w), exclude (x), or set/fix (s) bad sites. The w
1827           option can be combined with x and s. Note that s can swap alleles
1828           and will update genotypes (GT) and AC counts, but will not attempt
1829           to fix PL or other fields.
1830
1831       -d, --rm-dup snps|indels|both|all|none
1832           If a record is present multiple times, output only the first
1833           instance, see --collapse in Common Options.
1834
1835       -D, --remove-duplicates
1836           If a record is present multiple times, output only the first
1837           instance. Alias for -d none, deprecated.
1838
1839       -f, --fasta-ref FILE
1840           reference sequence. Supplying this option will turn on
1841           left-alignment and normalization, however, see also the
1842           --do-not-normalize option below.
1843
1844       -m, --multiallelics -|+[snps|indels|both|any]
1845           split multiallelic sites into biallelic records (-) or join
1846           biallelic sites into multiallelic records (+). An optional type
1847           string can follow which controls variant types which should be
1848           split or merged together: If only SNP records should be split or
1849           merged, specify snps; if both SNPs and indels should be merged
1850           separately into two records, specify both; if SNPs and indels
1851           should be merged into a single record, specify any.
1852
1853       --no-version
1854           see Common Options
1855
1856       -N, --do-not-normalize
1857           do not turn on indel normalisation, which the -f option normally
1858           implies. This is useful with -c s, which fixes or sets the REF
1859           allele from the reference given by -f
1860
1861       -o, --output FILE
1862           see Common Options
1863
1864       -O, --output-type b|u|z|v
1865           see Common Options
1866
1867       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1868           see Common Options
1869
1870       -R, --regions-file file
1871           see Common Options
1872
1873       -s, --strict-filter
1874           when merging (-m+), merged site is PASS only if all sites being
1875           merged PASS
1876
1877       -t, --targets LIST
1878           see Common Options
1879
1880       -T, --targets-file FILE
1881           see Common Options
1882
1883       --threads INT
1884           see Common Options
1885
1886       -w, --site-win INT
1887           maximum distance between two records to consider when locally
1888           sorting variants which changed position during the realignment
1889
1890   bcftools [plugin NAME|+NAME] [OPTIONS] FILE -- [PLUGIN OPTIONS]
1891       A common framework for various utilities. The plugins can be used the
1892       same way as normal commands, only their name is prefixed with "+". Most
1893       plugins accept two types of parameters: general options shared by all
1894       plugins followed by a separator, and a list of plugin-specific options.
1895       There are some exceptions to this rule: some plugins do not accept the
1896       common options and implement their own parameters. Therefore please pay
1897       attention to the usage examples that each plugin comes with.
1898
1899       VCF input options:
1900           -e, --exclude EXPRESSION
1901               exclude sites for which EXPRESSION is true. For valid
1902               expressions see EXPRESSIONS.
1903
1904           -i, --include EXPRESSION
1905               include only sites for which EXPRESSION is true. For valid
1906               expressions see EXPRESSIONS.
1907
1908           -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1909               see Common Options
1910
1911           -R, --regions-file file
1912               see Common Options
1913
1914           -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1915               see Common Options
1916
1917           -T, --targets-file file
1918               see Common Options
1919
1920       VCF output options:
1921           --no-version
1922               see Common Options
1923
1924           -o, --output FILE
1925               see Common Options
1926
1927           -O, --output-type b|u|z|v
1928               see Common Options
1929
1930           --threads INT
1931               see Common Options
1932
1933       Plugin options:
1934           -h, --help
1935               list plugin’s options
1936
1937           -l, --list-plugins
1938               List all available plugins.
1939
1940               By default, appropriate system directories are searched for
1941               installed plugins. You can override this by setting the
1942               BCFTOOLS_PLUGINS environment variable to a colon-separated list
1943               of directories to search. If BCFTOOLS_PLUGINS begins with a
1944               colon, ends with a colon, or contains adjacent colons, the
1945               system directories are also searched at that position in the
1946               list of directories.
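
                   The colon rules can be made concrete by expanding a
                   BCFTOOLS_PLUGINS value into the order in which
                   locations would be searched. The sketch only models the
                   description above; plugin_search_order is a
                   hypothetical helper and "<system directories>" is a
                   placeholder, not a real path.

```shell
# plugin_search_order VALUE
# Split a BCFTOOLS_PLUGINS-style value on colons and print one search
# location per line. An empty element (leading, trailing or adjacent
# colons) stands for the system directories.
# Hypothetical helper for illustration; not part of bcftools.
plugin_search_order() {
    awk -v path="$1" 'BEGIN {
        n = split(path, dir, ":")
        for (i = 1; i <= n; i++)
            print (dir[i] == "" ? "<system directories>" : dir[i])
    }'
}
```

                   For example, the value "/opt/plugins:" lists
                   /opt/plugins first and the system directories second.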
1947
1948           -v, --verbose
1949               print debugging information to debug plugin failure
1950
1951           -V, --version
1952               print version string and exit
1953
1954       List of plugins coming with the distribution:
1955           GTisec
1956               count genotype intersections across all possible sample subsets
1957               in a vcf file
1958
1959           GTsubset
1960               output only sites where the requested samples all exclusively
1961               share a genotype
1962
1963           ad-bias
1964               find positions with wildly varying ALT allele frequency (Fisher
1965               test on FMT/AD)
1966
1967           af-dist
1968               collect AF deviation stats and GT probability distribution
1969               given AF and assuming HWE
1970
1971           check-ploidy
1972               check if ploidy of samples is consistent for all sites
1973
1974           check-sparsity
1975               print samples without genotypes in a region or chromosome
1976
1977           color-chrs
1978               color shared chromosomal segments, requires trio VCF with
1979               phased GTs
1980
1981           counts
1982               a minimal plugin which counts number of SNPs, Indels, and total
1983               number of sites.
1984
1985           dosage
1986               print genotype dosage. By default the plugin searches for PL,
1987               GL and GT, in that order.
1988
1989           fill-AN-AC
1990               fill INFO fields AN and AC.
1991
1992           fill-from-fasta
1993               fill INFO or REF field based on values in a fasta file
1994
1995           fill-tags
1996               set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, HWE, MAF, NS
1997
1998           fix-ploidy
1999               sets correct ploidy
2000
2001           fixref
2002               determine and fix strand orientation
2003
2004           frameshifts
2005               annotate frameshift indels
2006
2007           guess-ploidy
2008               determine sample sex by checking genotype likelihoods (GL,PL)
2009               or genotypes (GT) in the non-PAR region of chrX.
2010
2011           impute-info
2012               add imputation information metrics to the INFO field based on
2013               selected FORMAT tags
2014
2015           isecGT
2016               compare two files and set non-identical genotypes to missing
2017
2018           mendelian
2019               count Mendelian consistent / inconsistent genotypes.
2020
2021           missing2ref
2022               sets missing genotypes ("./.") to ref allele ("0/0" or "0|0")
2023
2024           prune
2025               prune sites by missingness or linkage disequilibrium
2026
2027           setGT
2028               general tool to set genotypes according to rules requested by
2029               the user
2030
2031           tag2tag
2032               convert between similar tags, such as GL and GP
2033
2034           trio-switch-rate
2035               calculate phase switch rate in trio samples, children samples
2036               must have phased GTs.
2037
2038       Examples:
2039               # List options common to all plugins
2040               bcftools plugin
2041
2042               # List available plugins
2043               bcftools plugin -l
2044
2045               # Run a plugin
2046               bcftools plugin counts in.vcf
2047
2048               # Run a plugin using the abbreviated "+" notation
2049               bcftools +counts in.vcf
2050
2051               # The input VCF can be streamed just like in other commands
2052               cat in.vcf | bcftools +counts
2053
2054               # Print usage information of plugin "dosage"
2055               bcftools +dosage -h
2056
2057               # Replace missing genotypes with 0/0
2058               bcftools +missing2ref in.vcf
2059
2060               # Replace missing genotypes with 0|0
2061               bcftools +missing2ref in.vcf -- -p
2062
2063       Plugins troubleshooting:
2064           Things to check if your plugin does not show up in the bcftools
2065           plugin -l output:
2066
2067           ·   Run with the -v option for verbose output: bcftools plugin -lv
2068
2069           ·   Does the environment variable BCFTOOLS_PLUGINS include the
2070               correct path?
2071
2072       Plugins API:
2073               // Short description used by 'bcftools plugin -l'
2074               const char *about(void);
2075
2076               // Longer description used by 'bcftools +name -h'
2077               const char *usage(void);
2078
2079               // Called once at startup, allows initialization of local variables.
2080               // Return 1 to suppress normal VCF/BCF header output, -1 on critical
2081               // errors, 0 otherwise.
2082               int init(int argc, char **argv, bcf_hdr_t *in_hdr, bcf_hdr_t *out_hdr);
2083
2084               // Called for each VCF record, return NULL to suppress the output
2085               bcf1_t *process(bcf1_t *rec);
2086
2087               // Called after all lines have been processed to clean up
2088               void destroy(void);
2089
2090   bcftools polysomy [OPTIONS] file.vcf.gz
2091       Detect number of chromosomal copies in VCFs annotated with
2092       Illumina’s B-allele frequency (BAF) values. Note that this command is
2093       not compiled in by default, see the section Optional Compilation with
2094       GSL in the INSTALL file for help.
2095
2096       General options:
2097           -o, --output-dir path
2098               output directory
2099
2100           -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2101               see Common Options
2102
2103           -R, --regions-file file
2104               see Common Options
2105
2106           -s, --sample string
2107               sample name
2108
2109           -t, --targets LIST
2110               see Common Options
2111
2112           -T, --targets-file FILE
2113               see Common Options
2114
2115           -v, --verbose
2116               verbose debugging output which gives hints about the thresholds
2117               and decisions made by the program. Note that the exact output
2118               can change between versions.
2119
2120       Algorithm options:
2121           -b, --peak-size float
2122               the minimum peak size considered as a good match can be from
2123               the interval [0,1] where larger is stricter
2124
2125           -c, --cn-penalty float
2126               a penalty for increasing copy number state. How this works:
2127               multiple peaks are always a better fit than a single peak,
2128               therefore the program prefers a single peak (normal copy
2129               number) unless the absolute deviation of the multiple peaks fit
2130               is significantly smaller. Here the meaning of "significant" is
2131               given by the float from the interval [0,1] where larger is
2132               stricter.
2133
2134           -f, --fit-th float
2135               threshold for goodness of fit (normalized absolute deviation),
2136               smaller is stricter
2137
2138           -i, --include-aa
2139               include also the AA peak in CN2 and CN3 evaluation. This
2140               usually requires increasing -f.
2141
2142           -m, --min-fraction float
2143               minimum distinguishable fraction of aberrant cells.
2144               Experience shows that estimates of 20% and more are
2145               trustworthy.
2146
2147           -p, --peak-symmetry float
2148               a heuristic to filter failed fits where the expected peak
2149               symmetry is violated. The float is from the interval [0,1] and
2150               larger is stricter
2151
2152   bcftools query [OPTIONS] file.vcf.gz [file.vcf.gz [...]]
2153       Extracts fields from VCF or BCF files and outputs them in user-defined
2154       format.
2155
2156       -e, --exclude EXPRESSION
2157           exclude sites for which EXPRESSION is true. For valid expressions
2158           see EXPRESSIONS.
2159
2160       -f, --format FORMAT
2161           learn by example, see below
2162
2163       -H, --print-header
2164           print header
2165
2166       -i, --include EXPRESSION
2167           include only sites for which EXPRESSION is true. For valid
2168           expressions see EXPRESSIONS.
2169
2170       -l, --list-samples
2171           list sample names and exit
2172
2173       -o, --output FILE
2174           see Common Options
2175
2176       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2177           see Common Options
2178
2179       -R, --regions-file file
2180           see Common Options
2181
2182       -s, --samples LIST
2183           see Common Options
2184
2185       -S, --samples-file FILE
2186           see Common Options
2187
2188       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2189           see Common Options
2190
2191       -T, --targets-file file
2192           see Common Options
2193
2194       -u, --allow-undef-tags
2195           do not throw an error if there are undefined tags in the format
2196           string, print "." instead
2197
2198       -v, --vcf-list FILE
2199           process multiple VCFs listed in the file
2200
2201       Format:
2202               %CHROM          The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER)
2203               %INFO/TAG       Any tag in the INFO column
2204               %TYPE           Variant type (REF, SNP, MNP, INDEL, BND, OTHER)
2205               %MASK           Indicates presence of the site in other files (with multiple files)
2206               %TAG{INT}       Curly brackets to subscript vectors (0-based)
2207               %FIRST_ALT      Alias for %ALT{0}
2208               []              Format fields must be enclosed in brackets to loop over all samples
2209               %GT             Genotype (e.g. 0/1)
2210               %TBCSQ          Translated FORMAT/BCSQ. See the csq command above for explanation and examples.
2211               %TGT            Translated genotype (e.g. C/A)
2212               %IUPACGT        Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A)
2213               %LINE           Prints the whole line
2214               %SAMPLE         Sample name
2215               %POS0           POS in 0-based coordinates
2216               %END            End position of the REF allele
2217               %END0           End position of the REF allele in 0-based coordinates
2218               \n              new line
2219               \t              tab character
2220
2221               Everything else is printed verbatim.
2222
2223       Examples:
2224               # Print chromosome, position, ref allele and the first alternate allele
2225               bcftools query -f '%CHROM  %POS  %REF  %ALT{0}\n' file.vcf.gz
2226
2227               # Similar to above, but use tabs instead of spaces, add sample name and genotype
2228               bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
2229
2230               # Print FORMAT/GQ fields followed by FORMAT/GT fields
2231               bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf
2232
2233               # Make a BED file: chr, pos (0-based), end pos (1-based), id
2234               bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
2235
2236               # Print only samples with alternate (non-reference) genotypes
2237               bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i'GT="alt"' file.bcf
2238
2239               # Print all samples at sites with at least one alternate genotype
2240               bcftools view -i'GT="alt"' file.bcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
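
The tab-separated output of the second example above (CHROM, POS, REF, ALT, then one SAMPLE=GT pair per sample) is straightforward to post-process. The following is a minimal Python sketch that assumes exactly that format string; the function name and the sample record are hypothetical and not part of bcftools.

```python
def parse_query_line(line):
    """Parse one line produced by
    bcftools query -f '%CHROM\\t%POS\\t%REF\\t%ALT[\\t%SAMPLE=%GT]\\n'
    into a dictionary. Assumes no field contains a tab."""
    chrom, pos, ref, alt, *pairs = line.rstrip("\n").split("\t")
    # Split on the last "=" so that sample names containing "=" survive.
    genotypes = dict(pair.rsplit("=", 1) for pair in pairs)
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),   # ALT is a comma-separated list
        "gt": genotypes,
    }


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    record = parse_query_line("20\t14370\tG\tA\tNA00001=0|0\tNA00002=1|0\n")
    print(record["pos"], record["gt"]["NA00002"])
```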
2241
2242   bcftools reheader [OPTIONS] file.vcf.gz
2243       Modify header of VCF/BCF files, change sample names.
2244
2245       -h, --header FILE
2246           new VCF header
2247
2248       -o, --output FILE
2249           see Common Options
2250
2251       -s, --samples FILE
2252           new sample names, one name per line, in the same order as they
2253           appear in the VCF file. Alternatively, only samples which need to
2254           be renamed can be listed as "old_name new_name\n" pairs separated
2255           by whitespaces, each on a separate line. If a sample name contains
2256           spaces, the spaces can be escaped using the backslash character,
2257           for example "Not\ a\ good\ sample\ name".
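
To illustrate the two accepted line forms of the --samples file (a new name alone, or an "old_name new_name" pair with backslash-escaped spaces), here is a minimal Python sketch. It only mirrors the description above; it is not how bcftools itself parses the file, and the function name is hypothetical.

```python
import re


def parse_samples_line(line):
    """Return (old_name, new_name); old_name is None for the positional form."""
    # Split on whitespace that is not preceded by a backslash,
    # then turn each escaped space "\ " back into a plain space.
    raw_fields = re.split(r"(?<!\\)\s+", line.strip())
    fields = [f.replace("\\ ", " ") for f in raw_fields if f]
    if len(fields) == 1:
        return (None, fields[0])       # new name only, matched by position
    if len(fields) == 2:
        return (fields[0], fields[1])  # explicit "old_name new_name" pair
    raise ValueError("expected one or two fields: %r" % line)


if __name__ == "__main__":
    print(parse_samples_line("Not\\ a\\ good\\ sample\\ name better_name"))
```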
2258
2259   bcftools roh [OPTIONS] file.vcf.gz
2260       A program for detecting runs of homo/autozygosity. Only bi-allelic
2261       sites are considered.
2262
2263       The HMM model:
2264               Notation:
2265                 D  = Data, AZ = autozygosity, HW = Hardy-Weinberg (non-autozygosity),
2266                 f  = non-ref allele frequency
2267
2268               Emission probabilities:
2269                 oAZ = P_i(D|AZ) = (1-f)*P(D|RR) + f*P(D|AA)
2270                 oHW = P_i(D|HW) = (1-f)^2 * P(D|RR) + f^2 * P(D|AA) + 2*f*(1-f)*P(D|RA)
2271
2272               Transition probabilities:
2273                 tAZ = P(AZ|HW)  .. from HW to AZ, the -a parameter
2274                 tHW = P(HW|AZ)  .. from AZ to HW, the -H parameter
2275
2276                 ci  = P_i(C)  .. probability of cross-over at site i, from genetic map
2277                 AZi = P_i(AZ) .. probability of site i being AZ/non-AZ, scaled so that AZi+HWi = 1
2278                 HWi = P_i(HW)
2279
2280                 P_{i+1}(AZ) = oAZ * max[(1 - tAZ * ci) * AZ{i-1} , tAZ * ci * (1-AZ{i-1})]
2281                 P_{i+1}(HW) = oHW * max[(1 - tHW * ci) * (1-AZ{i-1}) , tHW * ci * AZ{i-1}]
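
The emission probabilities and the max-based update above can be written out as a short Python sketch. This is an illustration of the formulas as printed, not the bcftools implementation: the function names are hypothetical, the genotype likelihoods P(D|RR), P(D|RA), P(D|AA) are assumed to be given as plain probabilities, and the numeric values in the usage example are arbitrary.

```python
def emission(f, p_rr, p_ra, p_aa):
    """Return (oAZ, oHW) for one site.

    f     non-ref allele frequency
    p_*   genotype likelihoods P(D|RR), P(D|RA), P(D|AA)
    """
    o_az = (1 - f) * p_rr + f * p_aa
    o_hw = (1 - f) ** 2 * p_rr + f ** 2 * p_aa + 2 * f * (1 - f) * p_ra
    return o_az, o_hw


def update(az_prev, o_az, o_hw, t_az, t_hw, ci):
    """One update step, given the AZ probability of the previous site.

    Returns (AZ, HW) scaled so that AZ + HW = 1.
    """
    hw_prev = 1 - az_prev
    az = o_az * max((1 - t_az * ci) * az_prev, t_az * ci * hw_prev)
    hw = o_hw * max((1 - t_hw * ci) * hw_prev, t_hw * ci * az_prev)
    total = az + hw
    return az / total, hw / total


if __name__ == "__main__":
    # A site whose data fit only the RR genotype, with f = 0.5.
    o_az, o_hw = emission(0.5, 1.0, 0.0, 0.0)
    print(update(0.5, o_az, o_hw, t_az=1e-7, t_hw=1e-8, ci=1.0))
```

With equal prior weight on both states, a homozygous site shifts probability towards AZ because oAZ exceeds oHW whenever 0 < f < 1.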
2282
2283       General Options:
2284           --AF-dflt FLOAT
2285               in case allele frequency is not known, use the FLOAT. By
2286               default, sites where allele frequency cannot be determined, or
2287               is 0, are skipped.
2288
2289           --AF-tag TAG
2290               use the specified INFO tag TAG as an allele frequency estimate
2291               instead of the default AC and AN tags. Sites which do not have
2292               TAG will be skipped.
2293
2294           --AF-file FILE
2295               Read allele frequencies from a tab-delimited file containing
2296               the columns: CHROM\tPOS\tREF,ALT\tAF. The file can be
2297               compressed with bgzip and indexed with tabix -s1 -b2 -e2. Sites
2298               which are not present in the FILE or have different reference
2299               or alternate allele will be skipped. Note that such a file can
2300               be easily created from a VCF using:
2301
2302                   bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%INFO/TAG\n' file.vcf | bgzip -c > freqs.tab.gz
2303
2304           -b, --buffer-size INT[,INT]
2305               when the entire many-sample file cannot fit into memory, a
2306               sliding buffer approach can be used. The first value is the
2307               number of sites to keep in memory. If negative, it is
2308               interpreted as the maximum memory to use, in MB. The second,
2309               optional, value sets the number of overlapping sites. The
2310               default overlap is set to roughly 1% of the buffer size.
2311
2312           -e, --estimate-AF [TAG],FILE
2313               estimate the allele frequency by recalculating INFO/AC and
2314               INFO/AN on the fly, using the specified TAG which can be either
2315               FORMAT/GT ("GT") or FORMAT/PL ("PL"). If TAG is not given, "GT"
2316               is assumed. Either all samples ("-") or samples listed in FILE
2317               will be included. For example, use "PL,-" to estimate AF from
2318               FORMAT/PL of all samples. If neither -e nor the other --AF-...
2319               options are given, the allele frequency is estimated from AC
2320               and AN counts which are already present in the INFO field.
2321
2322           --exclude EXPRESSION
2323               exclude sites for which EXPRESSION is true. For valid
2324               expressions see EXPRESSIONS.
2325
2326           -G, --GTs-only FLOAT
2327               use genotypes (FORMAT/GT fields) ignoring genotype likelihoods
2328               (FORMAT/PL), setting PL of unseen genotypes to FLOAT. Safe
2329               value to use is 30 to account for GT errors.
2330
2331           --include EXPRESSION
2332               include only sites for which EXPRESSION is true. For valid
2333               expressions see EXPRESSIONS.
2334
2335           -I, --skip-indels
2336               skip indels as their genotypes are usually enriched for errors
2337
2338           -m, --genetic-map FILE
2339               genetic map in the format required also by IMPUTE2. Only the
2340               first and third column are used (position and Genetic_Map(cM)).
2341               The FILE can be a single file or a file mask, where the string "{CHROM}" is replaced with the chromosome name.
2342
2343           -M, --rec-rate FLOAT
2344               constant recombination rate per bp. In combination with
2345               --genetic-map, the --rec-rate parameter is interpreted
2346               differently, as FLOAT-fold increase of transition
2347               probabilities, which allows the model to become more sensitive
2348               yet still account for recombination hotspots. Note that also
2349               the range of the values is therefore different in both cases:
2350               normally the parameter will be in the range (1e-3,1e-9) but
2351               with --genetic-map it will be in the range (10,1000).
2352
2353           -o, --output FILE
2354               Write output to the FILE, by default the output is printed on
2355               stdout
2356
2357           -O, --output-type s|r[z]
2358               Generate per-site output (s) or per-region output (r). By
2359               default both types are printed and the output is uncompressed.
2360               Add z for a compressed output.
2361
2362           -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2363               see Common Options
2364
2365           -R, --regions-file file
2366               see Common Options
2367
2368           -s, --samples LIST
2369               see Common Options
2370
2371           -S, --samples-file FILE
2372               see Common Options
2373
2374           -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2375               see Common Options
2376
2377           -T, --targets-file file
2378               see Common Options
2379
2380       HMM Options:
2381           -a, --hw-to-az FLOAT
2382               P(AZ|HW) transition probability from HW (Hardy-Weinberg) to AZ
2383               (autozygous) state
2384
2385           -H, --az-to-hw FLOAT
2386               P(HW|AZ) transition probability from AZ to HW state
2387
2388           -V, --viterbi-training FLOAT
2389               estimate HMM parameters using Baum-Welch algorithm, using the
2390               convergence threshold FLOAT, e.g. 1e-10 (experimental)
2391
2392   bcftools sort [OPTIONS] file.bcf
2393       -m, --max-mem FLOAT[kMG]
2394           Maximum memory to use. Approximate, affects the number of temporary
2395           files written to the disk. Note that if the command fails at this
2396           step because of too many open files, your system limit on the
2397           number of open files ("ulimit") may need to be increased.
2398
2399       -o, --output FILE
2400           see Common Options
2401
2402       -O, --output-type b|u|z|v
2403           see Common Options
2404
2405       -T, --temp-dir DIR
2406           Use this directory to store temporary files
2407
2408   bcftools stats [OPTIONS] A.vcf.gz [B.vcf.gz]
2409       Parses VCF or BCF and produces text file stats which is suitable for
2410       machine processing and can be plotted using plot-vcfstats. When two
2411       files are given, the program generates separate stats for intersection
2412       and the complements. By default only sites are compared; -s/-S must be
2413       given to also include sample columns. When one VCF file is specified on
2414       the command line, then stats by non-reference allele frequency, depth
2415       distribution, stats by quality and per-sample counts, singleton stats,
2416       etc. are printed. When two VCF files are given, then stats such as
2417       concordance (Genotype concordance by non-reference allele frequency,
2418       Genotype concordance by sample, Non-Reference Discordance) and
2419       correlation are also printed. Per-site discordance (PSD) is also
2420       printed in --verbose mode.
2421
2422       --af-bins LIST|FILE
2423           comma separated list of allele frequency bins (e.g. 0.1,0.5,1) or a
2424           file listing the allele frequency bins one per line (e.g.
2425           0.1\n0.5\n1)
2426
2427       --af-tag TAG
2428           allele frequency INFO tag to use for binning. By default the allele
2429           frequency is estimated from AC/AN, if available, or directly from
2430           the genotypes (GT) if not.
2431
2432       -1, --1st-allele-only
2433           consider only the 1st alternate allele at multiallelic sites
2434
2435       -c, --collapse snps|indels|both|all|some|none
2436           see Common Options
2437
2438       -d, --depth INT,INT,INT
2439           ranges of depth distribution: min, max, and size of the bin
2440
2441       --debug
2442           produce verbose per-site and per-sample output
2443
2444       -e, --exclude EXPRESSION
2445           exclude sites for which EXPRESSION is true. For valid expressions
2446           see EXPRESSIONS.
2447
2448       -E, --exons file.gz
2449           tab-delimited file with exons for indel frameshifts statistics. The
2450           columns of the file are CHR, FROM, TO, with 1-based, inclusive,
2451           positions. The file is BGZF-compressed and indexed with tabix
2452
2453               tabix -s1 -b2 -e3 file.gz
2454
2455       -f, --apply-filters LIST
2456           see Common Options
2457
2458       -F, --fasta-ref ref.fa
2459           faidx indexed reference sequence file to determine INDEL context
2460
2461       -i, --include EXPRESSION
2462           include only sites for which EXPRESSION is true. For valid
2463           expressions see EXPRESSIONS.
2464
2465       -I, --split-by-ID
2466           collect stats separately for sites which have the ID column set
2467           ("known sites") or which do not have the ID column set ("novel
2468           sites").
2469
2470       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2471           see Common Options
2472
2473       -R, --regions-file file
2474           see Common Options
2475
2476       -s, --samples LIST
2477           see Common Options
2478
2479       -S, --samples-file FILE
2480           see Common Options
2481
2482       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2483           see Common Options
2484
2485       -T, --targets-file file
2486           see Common Options
2487
2488       -u, --user-tstv <TAG[:min:max:n]>
2489           collect Ts/Tv stats for any tag using the given binning [0:1:100]
2490
2491       -v, --verbose
2492           produce verbose per-site and per-sample output
2493
2494   bcftools view [OPTIONS] file.vcf.gz [REGION [...]]
2495       View, subset and filter VCF or BCF files by position and filtering
2496       expression. Convert between VCF and BCF. Former bcftools subset.
2497
2498       Output options
2499           -G, --drop-genotypes
2500               drop individual genotype information (after subsetting if -s
2501               option is set)
2502
2503           -h, --header-only
2504               output the VCF header only
2505
2506           -H, --no-header
2507               suppress the header in VCF output
2508
2509           -l, --compression-level [0-9]
2510               compression level. 0 stands for uncompressed, 1 for best speed
2511               and 9 for best compression.
2512
2513           --no-version
2514               see Common Options
2515
2516           -O, --output-type b|u|z|v
2517               see Common Options
2518
2519           -o, --output-file FILE
2520               output file name; defaults to standard output (stdout) if not given
2521
2522           -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2523               see Common Options
2524
2525           -R, --regions-file file
2526               see Common Options
2527
2528           -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2529               see Common Options
2530
2531           -T, --targets-file file
2532               see Common Options
2533
2534           --threads INT
2535               see Common Options
2536
2537       Subset options:
2538           -a, --trim-alt-alleles
2539               trim alternate alleles not seen in subset. Type A, G and R INFO
2540               and FORMAT fields will also be trimmed
2541
2542           --force-samples
2543               only warn about unknown subset samples
2544
2545           -I, --no-update
2546               do not (re)calculate INFO fields for the subset (currently
2547               INFO/AC and INFO/AN)
2548
2549           -s, --samples LIST
2550               see Common Options
2551
2552           -S, --samples-file FILE
2553               see Common Options
2554
2555       Filter options:
2556           Note that filter options below dealing with counting the number of
2557           alleles will, for speed, first check for the values of AC and AN in
2558           the INFO column to avoid parsing all the genotype (FORMAT/GT)
2559           fields in the VCF. This means that a filter like --min-af 0.1 will
2560           be based on ‘AC/AN’ where AC and AN come from either INFO/AC and
2561           INFO/AN if available or FORMAT/GT if not. It will not filter on
2562           another field like INFO/AF. The --include and --exclude filter
2563           expressions should instead be used to explicitly filter based on
2564           fields in the INFO column, e.g. --exclude AF<0.1.
2565
2566           -c, --min-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
2567               minimum allele count (INFO/AC) of sites to be printed.
2568               Specifying the type of allele is optional and can be set to
2569               non-reference (nref, the default), 1st alternate (alt1), the
2570               least frequent (minor), the most frequent (major) or sum of all
2571               but the most frequent (nonmajor) alleles.
2572
2573           -C, --max-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
2574               maximum allele count (INFO/AC) of sites to be printed.
2575               Specifying the type of allele is optional and can be set to
2576               non-reference (nref, the default), 1st alternate (alt1), the
2577               least frequent (minor), the most frequent (major) or sum of all
2578               but the most frequent (nonmajor) alleles.
2579
2580           -e, --exclude EXPRESSION
2581               exclude sites for which EXPRESSION is true. For valid
2582               expressions see EXPRESSIONS.
2583
2584           -f, --apply-filters LIST
2585               see Common Options
2586
2587           -g, --genotype [^][hom|het|miss]
2588               include only sites with one or more homozygous (hom),
2589               heterozygous (het) or missing (miss) genotypes. When prefixed
2590               with ^, the logic is reversed; thus ^het excludes sites with
2591               heterozygous genotypes.
2592
2593           -i, --include EXPRESSION
2594               include sites for which EXPRESSION is true. For valid
2595               expressions see EXPRESSIONS.
2596
2597           -k, --known
2598               print known sites only (ID column is not ".")
2599
2600           -m, --min-alleles INT
2601               print sites with at least INT alleles listed in REF and ALT
2602               columns
2603
2604           -M, --max-alleles INT
2605               print sites with at most INT alleles listed in REF and ALT
2606               columns. Use -m2 -M2 -v snps to only view biallelic SNPs.
2607
2608           -n, --novel
2609               print novel sites only (ID column is ".")
2610
2611           -p, --phased
2612               print sites where all samples are phased. Haploid genotypes are
2613               considered phased. Missing genotypes considered unphased unless
2614               the phased bit is set.
2615
2616           -P, --exclude-phased
2617               exclude sites where all samples are phased
2618
2619           -q, --min-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
2620               minimum allele frequency (INFO/AC / INFO/AN) of sites to be
2621               printed. Specifying the type of allele is optional and can be
2622               set to non-reference (nref, the default), 1st alternate (alt1),
2623               the least frequent (minor), the most frequent (major) or sum of
2624               all but the most frequent (nonmajor) alleles.
2625
2626           -Q, --max-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
2627               maximum allele frequency (INFO/AC / INFO/AN) of sites to be
2628               printed. Specifying the type of allele is optional and can be
2629               set to non-reference (nref, the default), 1st alternate (alt1),
2630               the least frequent (minor), the most frequent (major) or sum of
2631               all but the most frequent (nonmajor) alleles.
2632
2633           -u, --uncalled
2634               print sites without a called genotype
2635
2636           -U, --exclude-uncalled
2637               exclude sites without a called genotype
2638
2639           -v, --types snps|indels|mnps|other
2640               comma-separated list of variant types to select. Site is
2641               selected if any of the ALT alleles is of the type requested.
2642               Types are determined by comparing the REF and ALT alleles in
2643               the VCF record not INFO tags like INFO/INDEL or INFO/VT. Use
2644               --include to select based on INFO tags.
2645
2646           -V, --exclude-types snps|indels|mnps|ref|bnd|other
2647               comma-separated list of variant types to exclude. Site is
2648               excluded if any of the ALT alleles is of the type requested.
2649               Types are determined by comparing the REF and ALT alleles in
2650               the VCF record not INFO tags like INFO/INDEL or INFO/VT. Use
2651               --exclude to exclude based on INFO tags.
2652
2653           -x, --private
2654               print sites where only the subset samples carry a
2655               non-reference allele. Requires --samples or --samples-file.
2656
2657           -X, --exclude-private
2658               exclude sites where only the subset samples carry a
2659               non-reference allele
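
The allele types accepted by --min-ac, --max-ac, --min-af and --max-af (nref, alt1, minor, major, nonmajor) can be illustrated with a small Python sketch. It assumes per-allele counts are already available with the REF count first, and that "minor" and "major" are taken over all alleles including REF; both the function names and that reading are assumptions of this sketch rather than a statement about the bcftools source.

```python
def allele_count(counts, kind="nref"):
    """counts[0] is the REF allele count, counts[1:] are the ALT counts."""
    if kind == "nref":
        return sum(counts[1:])            # all non-reference alleles
    if kind == "alt1":
        return counts[1]                  # first alternate allele
    if kind == "minor":
        return min(counts)                # least frequent allele
    if kind == "major":
        return max(counts)                # most frequent allele
    if kind == "nonmajor":
        return sum(counts) - max(counts)  # all but the most frequent
    raise ValueError("unknown allele type: %s" % kind)


def passes_min_af(counts, threshold, kind="nref"):
    """Mimic a --min-af style test: frequency = count / AN."""
    an = sum(counts)
    return an > 0 and allele_count(counts, kind) / an >= threshold


if __name__ == "__main__":
    counts = [6, 3, 1]  # REF, ALT1, ALT2 (hypothetical site, AN = 10)
    for kind in ("nref", "alt1", "minor", "major", "nonmajor"):
        print(kind, allele_count(counts, kind))
```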
2660
2661   bcftools help [COMMAND] | bcftools --help [COMMAND]
2662       Display a brief usage message listing the bcftools commands available.
2663       If the name of a command is also given, e.g., bcftools help view, the
2664       detailed usage message for that particular command is displayed.
2665
2666   bcftools [--version|-v]
2667       Display the version numbers and copyright information for bcftools and
2668       the important libraries used by bcftools.
2669
2670   bcftools [--version-only]
2671       Display the full bcftools version number in a machine-readable format.
2672

EXPRESSIONS

2674       These filtering expressions are accepted by most of the commands.
2675
2676       Valid expressions may contain:
2677
2678       ·   numerical constants, string constants, file names
2679
2680               1, 1.0, 1e-4
2681               "String"
2682               @file_name
2683
2684       ·   arithmetic operators
2685
2686               +,*,-,/
2687
2688       ·   comparison operators
2689
2690               == (same as =), >, >=, <=, <, !=
2691
2692       ·   regex operators "~" and its negation "!~". The expressions are case
2693           sensitive unless "/i" is added.
2694
2695               INFO/HAYSTACK ~ "needle"
2696               INFO/HAYSTACK ~ "NEEDless/i"
2697
2698       ·   parentheses
2699
2700               (, )
2701
2702       ·   logical operators
2703
2704               && (same as &), ||,  |
2705
2706       ·   INFO tags, FORMAT tags, column names
2707
2708               INFO/DP or DP
2709               FORMAT/DV, FMT/DV, or DV
2710               FILTER, QUAL, ID, POS, REF, ALT[0]
2711
2712       ·   1 (or 0) to test the presence (or absence) of a flag
2713
2714               FlagA=1 && FlagB=0
2715
2716       ·   "." to test missing values
2717
2718               DP=".", DP!=".", ALT="."
2719
2720       ·   missing genotypes can be matched regardless of phase and ploidy
2721           (".|.", "./.", ".") using these expressions
2722
2723               GT~"\.", GT!~"\."
2724
2725       ·   missing genotypes can be matched including the phase and ploidy
2726           (".|.", "./.", ".") using these expressions
2727
2728               GT=".|.", GT="./.", GT="."
2729
2730       ·   sample genotype: reference (haploid or diploid), alternate (hom or
2731           het, haploid or diploid), missing genotype, homozygous,
2732           heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het,
2733           alt-alt het, haploid ref, haploid alt (case-insensitive)
2734
2735               GT="ref"
2736               GT="alt"
2737               GT="mis"
2738               GT="hom"
2739               GT="het"
2740               GT="hap"
2741               GT="RR"
2742               GT="AA"
2743               GT="RA" or GT="AR"
2744               GT="Aa" or GT="aA"
2745               GT="R"
2746               GT="A"
2747
2748       ·   TYPE for variant type in REF,ALT columns
2749           (indel,snp,mnp,ref,bnd,other). Use the regex operator "~" to
2750           require at least one allele of the given type or the equal sign "="
2751           to require that all alleles are of the given type. Compare
2752
2753               TYPE="snp"
2754               TYPE~"snp"
2755               TYPE!="snp"
2756               TYPE!~"snp"
2757
2758       ·   array subscripts (0-based), "*" for any element, "-" to indicate a
2759           range. Note that for querying FORMAT vectors, the colon ":" can be
2760           used to select a sample and an element of the vector, as shown in
2761           the examples below
2762
2763               INFO/AF[0] > 0.3             .. first AF value bigger than 0.3
2764               FORMAT/AD[0:0] > 30          .. first AD value of the first sample bigger than 30
2765               FORMAT/AD[0:1]               .. first sample, second AD value
2766               FORMAT/AD[1:0]               .. second sample, first AD value
2767               DP4[*] == 0                  .. any DP4 value
2768               FORMAT/DP[0]   > 30          .. DP of the first sample bigger than 30
2769               FORMAT/DP[1-3] > 10          .. samples 2-4
2770               FORMAT/DP[1-]  < 7           .. all samples but the first
2771               FORMAT/DP[0,2-4] > 20        .. samples 1, 3-5
2772               FORMAT/AD[0:1]               .. first sample, second AD field
2773               FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
2774               FORMAT/AD[*:1] or AD[:1]        .. any sample, second AD field
2775               (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
2776               CSQ[*] ~ "missense_variant.*deleterious"
2777
2778       ·   with many samples it can be more practical to provide a file with
2779           sample names, one sample name per line
2780
2781               GT[@samples.txt]="het" & binom(AD)<0.01
2782
2783       ·   function on FORMAT tags (over samples) and INFO tags (over vector
2784           fields)
2785
2786               MAX, MIN, AVG, SUM, STRLEN, ABS, COUNT
2787
2788       ·   two-tailed binomial test. Note that for N=0 the test evaluates to a
2789           missing value and when FORMAT/GT is used to determine the vector
2790           indices, it evaluates to 1 for homozygous genotypes.
2791
2792               binom(FMT/AD)                .. GT can be used to determine the correct index
2793               binom(AD[0],AD[1])           .. or the fields can be given explicitly
2794
2795       ·   variables calculated on the fly if not present: number of alternate
2796           alleles; number of samples; count of alternate alleles; minor
2797           allele count (similar to AC but is always smaller than 0.5);
2798           frequency of alternate alleles (AF=AC/AN); frequency of minor
2799           alleles (MAF=MAC/AN); number of alleles in called genotypes; number
2800           of samples with missing genotype; fraction of samples with missing
2801           genotype;
2802
2803               N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING
2804
2805       ·   the number (N_PASS) or fraction (F_PASS) of samples which pass the
2806           expression
2807
2808               N_PASS(GQ>90 & GT!="mis") > 90
2809               F_PASS(GQ>90 & GT!="mis") > 0.9
2810
2811       ·   custom perl filtering. Note that this command is not compiled in by
2812           default, see the section Optional Compilation with Perl in the
2813           INSTALL file for help and misc/demo-flt.pl for a working example.
2814           The demo defines the perl subroutine "severity" which can be
2815           invoked from the command line as follows:
2816
2817               perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
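
As an illustration of the variables calculated on the fly (AC, AN, AF, MAC, MAF, N_MISSING, F_MISSING), the following Python sketch derives them from the genotype strings of a single biallelic site. The function name is hypothetical. Two simplifications are assumed: MAC is taken as min(AC, AN-AC), and a sample counts as missing if any of its alleles is ".".

```python
import re


def site_variables(genotypes):
    """genotypes: GT strings of one biallelic site, e.g. ["0/1", "1|1", "./."]."""
    ac = an = n_missing = 0
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)
        if "." in alleles:
            n_missing += 1          # assumption: any missing allele counts
        for allele in alleles:
            if allele == ".":
                continue
            an += 1                 # alleles in called genotypes
            if allele != "0":
                ac += 1             # alternate alleles
    n_samples = len(genotypes)
    mac = min(ac, an - ac)
    return {
        "N_SAMPLES": n_samples,
        "AC": ac,
        "AN": an,
        "MAC": mac,
        "AF": ac / an if an else None,
        "MAF": mac / an if an else None,
        "N_MISSING": n_missing,
        "F_MISSING": n_missing / n_samples if n_samples else None,
    }


if __name__ == "__main__":
    print(site_variables(["0/1", "1|1", "./.", "0/0"]))
```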
2818
2819       Notes:
2820
2821       ·   String comparisons and regular expressions are case-insensitive
2822
2823       ·   Variables and function names are case-insensitive, but not tag
2824           names. For example, "qual" can be used instead of "QUAL",
2825           "strlen()" instead of "STRLEN()" , but not "dp" instead of "DP".
2826
2827       ·   When querying multiple values, all elements are tested and the OR
2828           logic is used on the result. For example, when querying
2829           "TAG=1,2,3,4", it will be evaluated as follows:
2830
2831               -i 'TAG[*]=1'   .. true, the record will be printed
2832               -i 'TAG[*]!=1'  .. true
2833               -e 'TAG[*]=1'   .. false, the record will be discarded
2834               -e 'TAG[*]!=1'  .. false
2835               -i 'TAG[0]=1'   .. true
2836               -i 'TAG[0]!=1'  .. false
2837               -e 'TAG[0]=1'   .. false
2838               -e 'TAG[0]!=1'  .. true
2839
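       The OR logic described in the last note can be modelled with a few
       lines of Python. This is a sketch of the semantics only, not of the
       bcftools implementation; the helper names expression_is_true and
       record_is_printed are hypothetical. An expression over TAG[*] is
       true if any element satisfies the comparison, -i prints the record
       when the expression is true, and -e prints it when the expression
       is false.

```python
import operator

# The two comparison operators used in the table above.
OPS = {"=": operator.eq, "!=": operator.ne}


def expression_is_true(values, op, target, index=None):
    """Evaluate TAG[*] (index=None) or TAG[index] against target.

    With TAG[*], every element is tested and the results are OR-ed.
    """
    selected = values if index is None else [values[index]]
    return any(OPS[op](v, target) for v in selected)


def record_is_printed(mode, values, op, target, index=None):
    """-i keeps records where the expression is true, -e where it is false."""
    hit = expression_is_true(values, op, target, index)
    return hit if mode == "-i" else not hit


tag = [1, 2, 3, 4]
for index in (None, 0):
    for mode in ("-i", "-e"):
        for op in ("=", "!="):
            subscript = "*" if index is None else index
            kept = record_is_printed(mode, tag, op, 1, index)
            print(f"{mode} 'TAG[{subscript}]{op}1' .. {kept}")
```

       Note that with TAG[*] both "=" and "!=" are true for this record,
       which is why -e discards it in both cases.
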
2840       Examples:
2841
2842           MIN(DV)>5
2843
2844           MIN(DV/DP)>0.3
2845
2846           MIN(DP)>10 & MIN(DV)>3
2847
2848           FMT/DP>10  & FMT/GQ>10 .. both conditions must be satisfied within one sample
2849
2850           FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
2851
2852           QUAL>10 |  FMT/GQ>10   .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
2853
2854           QUAL>10 || FMT/GQ>10   .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
2855
2856           TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
2857
2858           COUNT(GT="hom")=0
2859
2860           MIN(DP)>35 && AVG(GQ)>50
2861
2862           ID=@file       .. selects lines with ID present in the file
2863
2864           ID!=@~/file    .. skip lines with ID present in the ~/file
2865
2866           MAF[0]<0.05    .. select rare variants at 5% cutoff
2867
2868           POS>=100   .. restricts a range query, e.g. 20:100-200, strictly to sites with POS in that range
2869
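       The difference between "&" and "&&" in the FMT/DP and FMT/GQ
       examples above can be illustrated with a small Python model. It is
       a sketch under simplifying assumptions: per-sample values are plain
       lists, the function names are hypothetical, and only the site-level
       true/false outcome is modelled, not which samples are selected.

```python
def within_one_sample(dp, gq, dp_min=10, gq_min=10):
    """Model of FMT/DP>10 & FMT/GQ>10: one sample must pass both tests."""
    return any(d > dp_min and g > gq_min for d, g in zip(dp, gq))


def across_samples(dp, gq, dp_min=10, gq_min=10):
    """Model of FMT/DP>10 && FMT/GQ>10: tests may pass in different samples."""
    return any(d > dp_min for d in dp) and any(g > gq_min for g in gq)


# Two samples: the first has high depth but low GQ, the second the opposite.
dp = [20, 5]
gq = [3, 40]
print(within_one_sample(dp, gq))  # False: no single sample passes both
print(across_samples(dp, gq))     # True: each test passes in some sample
```
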
2870       Shell expansion:
2871
2872       Note that expressions must often be quoted because some characters have
2873       special meaning in the shell. Enclosing the expression in single
2874       quotes ensures that the whole expression is passed to the program as
2875       intended:
2876
2877           bcftools view -i '%ID!="." & MAF[0]<0.01'
2878
2879       Please refer to the documentation of your shell for details.
2880
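       The effect of the single quotes can be checked without running
       bcftools. The sketch below uses shlex.split from the Python
       standard library, which approximates POSIX shell word splitting; it
       is only an approximation, because a real shell would additionally
       treat the unquoted "&" as a background operator and "<" as a
       redirection.

```python
import shlex

# The command from the example above, with and without single quotes.
quoted = """bcftools view -i '%ID!="." & MAF[0]<0.01'"""
unquoted = """bcftools view -i %ID!="." & MAF[0]<0.01"""

quoted_argv = shlex.split(quoted)
unquoted_argv = shlex.split(unquoted)

# Quoted: the expression arrives as one argument, inner double quotes intact.
print(quoted_argv)
# Unquoted: the expression is broken into several words and the double
# quotes around "." are removed before the program ever sees them.
print(unquoted_argv)
```

       When bcftools is launched from a script rather than typed at a
       prompt, passing the arguments as a list (as in quoted_argv) avoids
       shell interpretation altogether.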

SCRIPTS AND OPTIONS

2882   plot-vcfstats [OPTIONS] file.vchk [...]
2883       Script for processing output of bcftools stats. It can merge results
2884       from multiple outputs (useful when running the stats for each
2885       chromosome separately), plot graphs and create a PDF presentation.
2886
2887       -m, --merge
2888           Merge vcfstats files to STDOUT, skip plotting.
2889
2890       -p, --prefix DIR
2891           The output directory. This directory will be created if it does not
2892           exist.
2893
2894       -P, --no-PDF
2895           Skip the PDF creation step.
2896
2897       -r, --rasterize
2898           Rasterize PDF images for faster rendering.
2899
2900       -s, --sample-names
2901           Use sample names for xticks rather than numeric IDs.
2902
2903       -t, --title STRING
2904           Identify files by these titles in plots. The option can be given
2905           multiple times, for each ID in the bcftools stats output. If not
2906           present, the script will use abbreviated source file names for the
2907           titles.
2908
2909       -T, --main-title STRING
2910           Main title for the PDF.
2911
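       As a usage illustration, the options above can be assembled into a
       command line from a script. The sketch below only builds and prints
       the command, it does not run it. The helper name
       plot_vcfstats_argv, the file names and the titles are hypothetical;
       only the option names are taken from the list above, and treating
       the output directory as mandatory is a choice of this sketch.

```python
import shlex


def plot_vcfstats_argv(vchk_files, prefix, titles=(), main_title=None,
                       no_pdf=False):
    """Build an argument list for plot-vcfstats from documented options."""
    argv = ["plot-vcfstats", "--prefix", prefix]
    # --title may be repeated; titles are passed in the order given.
    for title in titles:
        argv += ["--title", title]
    if main_title is not None:
        argv += ["--main-title", main_title]
    if no_pdf:
        argv.append("--no-PDF")
    # Input files follow the options, as in the synopsis.
    argv += list(vchk_files)
    return argv


argv = plot_vcfstats_argv(
    ["chr1.vchk", "chr2.vchk"],      # hypothetical bcftools stats outputs
    prefix="plots",                  # hypothetical output directory
    titles=["chr1", "chr2"],
    main_title="Per-chromosome stats",
    no_pdf=True,
)
# shlex.join quotes arguments containing spaces for display.
print(shlex.join(argv))
```
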

PERFORMANCE

2913       HTSlib was designed with the BCF format in mind. When parsing VCF
2914       files, all records are internally converted into the BCF
2915       representation. Simple operations, like removing a single column from
2916       a VCF file, can therefore be done much faster with standard UNIX
2917       commands, such as awk or cut. It is recommended to use BCF as the
2918       input/output format whenever possible to avoid the large overhead of
2919       the VCF → BCF → VCF conversion.
2920

BUGS

2922       Please report any bugs you encounter on the GitHub website:
2923       http://github.com/samtools/bcftools
2924

AUTHORS

2926       Heng Li from the Sanger Institute wrote the original C version of
2927       htslib, samtools and bcftools. Bob Handsaker from the Broad Institute
2928       implemented the BGZF library. Petr Danecek, Shane McCarthy and John
2929       Marshall are maintaining and further developing bcftools. Many other
2930       people contributed to the program and to the file format
2931       specifications, both directly and indirectly by providing patches,
2932       testing and reporting bugs. We thank them all.
2933

RESOURCES

2935       BCFtools GitHub website: http://github.com/samtools/bcftools
2936
2937       Samtools GitHub website: http://github.com/samtools/samtools
2938
2939       HTSlib GitHub website: http://github.com/samtools/htslib
2940
2941       File format specifications: http://samtools.github.io/hts-specs
2942
2943       BCFtools documentation: http://samtools.github.io/bcftools
2944
2945       BCFtools wiki page: https://github.com/samtools/bcftools/wiki
2946

COPYING

2948       The MIT/Expat License or GPL License, see the LICENSE document for
2949       details. Copyright (c) Genome Research Ltd.
2950
2951
2952
2953                                  2018-07-18                       BCFTOOLS(1)