bcftools(1)

BCFTOOLS(1)                                                        BCFTOOLS(1)

NAME

       bcftools - utilities for variant calling and manipulating VCFs and
       BCFs.

SYNOPSIS

       bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]

DESCRIPTION

       BCFtools is a set of utilities that manipulate variant calls in the
       Variant Call Format (VCF) and its binary counterpart BCF. All commands
       work transparently with both VCFs and BCFs, both uncompressed and
       BGZF-compressed.

       Most commands accept VCF, bgzipped VCF and BCF with filetype detected
       automatically even when streaming from a pipe. Indexed VCF and BCF will
       work in all situations. Un-indexed VCF and BCF and streams will work in
       most, but not all situations. In general, whenever multiple VCFs are
       read simultaneously, they must be indexed and therefore also
       compressed. (Note that files with non-standard index names can be
       accessed as e.g. "bcftools view -r X:2928329
       file.vcf.gz##idx##non-standard-index-name".)

       BCFtools is designed to work on a stream. It regards an input file "-"
       as the standard input (stdin) and outputs to the standard output
       (stdout). Several commands can thus be combined with Unix pipes.

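       The streaming behaviour described above can be sketched as follows.
       The file name tiny.vcf and its single record are made up for
       illustration, and the pipeline only runs when bcftools is found on
       PATH. The first command writes uncompressed BCF with -Ou and the
       second reads it from standard input via "-".

```shell
# Write a minimal, hypothetical single-sample VCF (tab-separated columns).
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=1,length=1000>\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
  printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n'
} > tiny.vcf

# Chain two subcommands through a pipe, only if bcftools is installed.
if command -v bcftools >/dev/null 2>&1; then
  bcftools view -Ou tiny.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' -
fi
```
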
   VERSION
       This manual page was last updated 2022-04-07 and refers to bcftools git
       version 1.15.1.

   BCF1
       The BCF1 format output by versions of samtools <= 0.1.19 is not
       compatible with this version of bcftools. To read BCF1 files one can
       use the view command from old versions of bcftools packaged with
       samtools versions <= 0.1.19 to convert to VCF, which can then be read
       by this version of bcftools.

               samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view

   VARIANT CALLING
       See bcftools call for variant calling from the output of the samtools
       mpileup command. In versions of samtools <= 0.1.19 calling was done
       with bcftools view. Users are now required to choose between the old
       samtools calling model (-c/--consensus-caller) and the new multiallelic
       calling model (-m/--multiallelic-caller). The multiallelic calling
       model is recommended for most tasks.

LIST OF COMMANDS

       For a full list of available commands, run bcftools without arguments.
       For a full list of available options, run bcftools COMMAND without
       arguments.

       •   annotate  .. edit VCF files, add or remove annotations

       •   call      .. SNP/indel calling (former "view")

       •   cnv       .. Copy Number Variation caller

       •   concat    .. concatenate VCF/BCF files from the same set of samples

       •   consensus .. create consensus sequence by applying VCF variants

       •   convert   .. convert VCF/BCF to other formats and back

       •   csq       .. haplotype aware consequence caller

       •   filter    .. filter VCF/BCF files using fixed thresholds

       •   gtcheck   .. check sample concordance, detect sample swaps and
           contamination

       •   head      .. view VCF/BCF file headers

       •   index     .. index VCF/BCF

       •   isec      .. intersections of VCF/BCF files

       •   merge     .. merge VCF/BCF files from non-overlapping sample
           sets

       •   mpileup   .. multi-way pileup producing genotype likelihoods

       •   norm      .. normalize indels

       •   plugin    .. run user-defined plugin

       •   polysomy  .. detect contaminations and whole-chromosome aberrations

       •   query     .. transform VCF/BCF into user-defined formats

       •   reheader  .. modify VCF/BCF header, change sample names

       •   roh       .. identify runs of homo/auto-zygosity

       •   sort      .. sort VCF/BCF files

       •   stats     .. produce VCF/BCF stats (former vcfcheck)

       •   view      .. subset, filter and convert VCF and BCF files

LIST OF SCRIPTS

       Some helper scripts are bundled with the bcftools code.

       •   plot-vcfstats  .. plots the output of stats

COMMANDS AND OPTIONS

   Common Options
       The following options are common to many bcftools commands. See usage
       for specific commands to see if they apply.

       FILE
           Files can be either VCF or BCF, uncompressed or BGZF-compressed.
           The file "-" is interpreted as standard input. Some tools may
           require tabix- or CSI-indexed files.

       -c, --collapse snps|indels|both|all|some|none|id
           Controls how to treat records with duplicate positions and defines
           compatible records across multiple input files. Here by
           "compatible" we mean records which should be considered as
           identical by the tools. For example, when performing line
           intersections, the desire may be to consider as identical all sites
           with matching positions (bcftools isec -c all), or only sites with
           matching variant type (bcftools isec -c snps -c indels), or only
           sites with all alleles identical (bcftools isec -c none).

           none
               only records with identical REF and ALT alleles are compatible

           some
               only records where some subset of ALT alleles match are
               compatible

           all
               all records are compatible, regardless of whether the ALT
               alleles match or not. In the case of records with the same
               position, only the first will be considered and appear on
               output.

           snps
               any SNP records are compatible, regardless of whether the ALT
               alleles match or not. For duplicate positions, only the first
               SNP record will be considered and appear on output.

           indels
               all indel records are compatible, regardless of whether the
               REF and ALT alleles match or not. For duplicate positions, only
               the first indel record will be considered and appear on output.

           both
               abbreviation of "-c indels -c snps"

           id
               only records with identical ID column are compatible. Supported
               by bcftools merge only.

       -f, --apply-filters LIST
           Skip sites where the FILTER column does not contain any of the
           strings listed in LIST. For example, to include only sites which
           have no filters set, use -f .,PASS.

       --no-version
           Do not append version and command line information to the output
           VCF header.

       -o, --output FILE
           When output consists of a single stream, write it to FILE rather
           than to standard output, where it is written by default. The file
           type is determined automatically from the file name suffix and in
           case a conflicting -O option is given, the file name suffix takes
           precedence.

       -O, --output-type b|u|z|v[0-9]
           Output compressed BCF (b), uncompressed BCF (u), compressed VCF
           (z), uncompressed VCF (v). Use the -Ou option when piping between
           bcftools subcommands to speed up performance by removing
           unnecessary compression/decompression and VCF↔BCF conversion. The
           compression level of the compressed formats (b and z) can be set
           by appending a number between 0 and 9.

       -r, --regions chr|chr:pos|chr:beg-end|chr:beg-[,...]
           Comma-separated list of regions, see also -R, --regions-file.
           Overlapping records are matched even when the starting coordinate
           is outside of the region, unlike the -t/-T options where only the
           POS coordinate is checked. Note that -r cannot be used in
           combination with -R.

       -R, --regions-file FILE
           Regions can be specified either on command line or in a VCF, BED,
           or tab-delimited file (the default). The columns of the
           tab-delimited file can contain either positions (two-column format)
           or intervals (three-column format): CHROM, POS, and, optionally,
           END, where positions are 1-based and inclusive. The columns of the
           tab-delimited BED file are also CHROM, POS and END (trailing
           columns are ignored), but coordinates are 0-based, half-open. To
           indicate that a file be treated as BED rather than the 1-based
           tab-delimited file, the file must have the ".bed" or ".bed.gz"
           suffix (case-insensitive). Uncompressed files are stored in memory,
           while bgzip-compressed and tabix-indexed region files are streamed.
           Note that sequence names must match exactly: "chr20" is not the
           same as "20". Also note that chromosome ordering in FILE will be
           respected: the VCF will be processed in the order in which
           chromosomes first appear in FILE. However, within chromosomes, the
           VCF will always be processed in ascending genomic coordinate order
           no matter what order they appear in FILE. Note that overlapping
           regions in FILE can result in duplicated, out-of-order positions in
           the output. This option requires indexed VCF/BCF files. Note that
           -R cannot be used in combination with -r.

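           To illustrate the two coordinate conventions, the sketch below
           writes a small 1-based, inclusive tab-delimited regions file and
           derives the equivalent 0-based, half-open BED file with awk. The
           file names and coordinates are hypothetical.

```shell
# Tab-delimited regions: CHROM, POS, END (1-based, inclusive).
printf '20\t1000\t2000\n20\t5000\t5000\n' > regions.tsv

# Same intervals as BED: the start shifts down by one, the end is unchanged.
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3 }' regions.tsv > regions.bed

cat regions.bed
```

           Either file could then be given to -R. As noted above, it is the
           ".bed" suffix that selects the BED interpretation.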
       --regions-overlap pos|record|variant|0|1|2
           This option controls how overlapping records are determined: set to
           pos or 0 if the VCF record has to have POS inside a region (this
           corresponds to the default behavior of -t/-T); set to record or 1
           if also overlapping records with POS outside a region should be
           included (this is the default behavior of -r/-R, and includes
           indels with POS at the end of a region, which are technically
           outside the region); or set to variant or 2 to include only true
           overlapping variation (compare the full VCF representation "TA>T-"
           vs the true sequence variation "A>-").

       -s, --samples [^]LIST
           Comma-separated list of samples to include or exclude if prefixed
           with "^". (Note that when multiple samples are to be excluded, the
           "^" prefix is still present only once, e.g. "^SAMPLE1,SAMPLE2".)
           The sample order is updated to reflect that given on the command
           line. Note that in general tags such as INFO/AC, INFO/AN, etc. are
           not updated to correspond to the subset samples. bcftools view is
           the exception where some tags will be updated (unless the -I,
           --no-update option is used; see bcftools view documentation). To
           use updated tags for the subset in another command one can pipe
           from view into that command. For example:

               bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f '%INFO/AC\t%INFO/AN\n'

       -S, --samples-file [^]FILE
           File of sample names to include or exclude if prefixed with "^".
           One sample per line. See also the note above for the -s, --samples
           option. The sample order is updated to reflect that given in the
           input file. The command bcftools call accepts an optional second
           column indicating ploidy (0, 1 or 2) or sex (as defined by
           --ploidy, for example "F" or "M"), for example:

               sample1    1
               sample2    2
               sample3    2

           or

               sample1    M
               sample2    F
               sample3    F

           If the second column is not present, the sex "F" is assumed. With
           bcftools call -C trio, a PED file is expected. The program ignores
           the first column and the last indicates sex (1=male, 2=female),
           for example:

               ignored_column  daughterA fatherA  motherA  2
               ignored_column  sonB      fatherB  motherB  1

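           As a rough sketch of how the two layouts relate, the commands
           below derive a two-column samples file (name and sex) from a
           PED-like file. The file names are hypothetical, the "M" and "F"
           codes assume a ploidy definition that uses those labels, and the
           tab separator in the output is an assumption made here for
           readability.

```shell
# Hypothetical PED-like file: ignored column, child, father, mother, child's sex.
printf 'fam1\tdaughterA\tfatherA\tmotherA\t2\n'  > trios.ped
printf 'fam2\tsonB\tfatherB\tmotherB\t1\n'      >> trios.ped

# Emit "sample<TAB>sex" for each child, mapping 1 -> M and anything else -> F.
awk 'BEGIN { FS = OFS = "\t" } { print $2, ($5 == 1 ? "M" : "F") }' trios.ped > samples.txt

cat samples.txt
```
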
       -t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]
           Similar to -r, --regions, but the next position is accessed by
           streaming the whole VCF/BCF rather than using the tbi/csi index.
           Both -r and -t options can be applied simultaneously: -r uses the
           index to jump to a region and -t discards positions which are
           not in the targets. Unlike -r, targets can be prefixed with "^" to
           request logical complement. For example, "^X,Y,MT" indicates that
           sequences X, Y and MT should be skipped. Yet another difference
           between the -t/-T and -r/-R is that -r/-R checks for proper
           overlaps and considers both POS and the end position of an indel,
           while -t/-T considers the POS coordinate only (by default; see also
           --regions-overlap and --targets-overlap). Note that -t cannot be
           used in combination with -T.

       -T, --targets-file [^]FILE
           Same as -t, --targets, but reads regions from a file. Note that -T
           cannot be used in combination with -t.

           With the call -C alleles command, the third column of the targets
           file must be a comma-separated list of alleles, starting with the
           reference allele. Note that the file must be compressed and
           indexed. Such a file can be easily created from a VCF using:

               bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz

       --targets-overlap pos|record|variant|0|1|2
           Same as --regions-overlap but for -t/-T.

       --threads INT
           Use multithreading with INT worker threads. The option is currently
           used only for the compression of the output stream, only when
           --output-type is b or z. Default: 0.

   bcftools annotate [OPTIONS] FILE
       Add or remove annotations.

       -a, --annotations file
           Bgzip-compressed and tabix-indexed file with annotations. The file
           can be VCF, BED, or a tab-delimited file with mandatory columns
           CHROM, POS (or, alternatively, FROM and TO), optional columns REF
           and ALT, and an arbitrary number of annotation columns. BED files
           are expected to have the ".bed" or ".bed.gz" suffix
           (case-insensitive), otherwise a tab-delimited file is assumed. Note
           that in the case of a tab-delimited file, the coordinates POS, FROM
           and TO are one-based and inclusive. When REF and ALT are present,
           only matching VCF records will be annotated. If the END coordinate
           is present in the annotation file and given on command line as
           "-c ~INFO/END", then VCF records will be matched also by the
           INFO/END coordinate. If ID is present in the annotation file and
           given as "-c ~ID", then VCF records will be matched also by the ID
           column.

           When multiple ALT alleles are present in the annotation file (given
           as comma-separated list of alleles), at least one must match one of
           the alleles in the corresponding VCF record. Similarly, at least
           one alternate allele from a multi-allelic VCF record must be
           present in the annotation file.

           Missing values can be added by providing "." in place of the actual
           value and using the missing value modifier with -c, such as ".TAG".

           Note that flag types, such as "INFO/FLAG", can be annotated by
           including a field with the value "1" to set the flag, "0" to remove
           it, or "." to keep existing flags. See also -c, --columns and -h,
           --header-lines.

               # Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG
               1  752566  SomeString      5
               1  798959  SomeOtherString 6

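           A minimal sketch of preparing such an annotation file follows.
           The file names annots.tab, annots.hdr and input.vcf are
           hypothetical, and the compress, index and annotate steps are only
           attempted when bgzip, tabix, bcftools and the input file are all
           available.

```shell
# Tab-delimited annotations: CHROM, POS, STRING_TAG, NUMERIC_TAG (1-based POS).
printf '1\t752566\tSomeString\t5\n1\t798959\tSomeOtherString\t6\n' > annots.tab

# Header lines describing the two new INFO tags, for use with -h.
cat > annots.hdr <<'EOF'
##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
EOF

# Compress, index and annotate only when the tools and an input VCF exist.
if command -v bcftools >/dev/null 2>&1 && command -v bgzip >/dev/null 2>&1 \
   && command -v tabix >/dev/null 2>&1 && [ -f input.vcf ]; then
  bgzip -c annots.tab > annots.tab.gz
  tabix -s1 -b2 -e2 annots.tab.gz
  bcftools annotate -a annots.tab.gz -h annots.hdr \
    -c CHROM,POS,STRING_TAG,NUMERIC_TAG input.vcf
fi
```
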
       -c, --columns list
           Comma-separated list of columns or tags to carry over from the
           annotation file (see also -a, --annotations). If the annotation
           file is not a VCF/BCF, list describes the columns of the annotation
           file and must include CHROM, POS (or, alternatively, FROM and TO),
           and optionally REF and ALT. Unused columns which should be ignored
           can be indicated by "-".

           If the annotation file is a VCF/BCF, only the edited columns/tags
           must be present and their order does not matter. The columns ID,
           QUAL, FILTER, INFO and FORMAT can be edited, where INFO tags can be
           written either as "INFO/TAG" or simply "TAG", and FORMAT tags can
           be written as "FORMAT/TAG" or "FMT/TAG". The imported VCF
           annotations can be renamed as "DST_TAG:=SRC_TAG" or
           "FMT/DST_TAG:=FMT/SRC_TAG".

           To carry over all INFO annotations, use "INFO". To add all INFO
           annotations except "TAG", use "^INFO/TAG". By default, existing
           values are replaced.

           By default, existing tags are overwritten unless the source value
           is a missing value (i.e. "."). If missing values should also be
           carried over (and overwrite existing tags), use ".TAG" instead of
           "TAG". To add annotations without overwriting existing values (that
           is, to add tags that are absent or to add values to existing tags
           with missing values), use "+TAG" instead of "TAG". These can be
           combined, for example ".+TAG" can be used to add TAG even if the
           source value is missing but only if TAG does not exist in the
           target file; existing tags will not be overwritten. To append to
           existing values (rather than replacing or leaving untouched), use
           "=TAG" (instead of "TAG" or "+TAG"). To replace only existing
           values without modifying missing annotations, use "-TAG". To match
           the record also by ID or INFO/END, in addition to REF and ALT, use
           "~ID" or "~INFO/END". If position needs to be replaced, mark the
           column with the new position as "~POS".

           If the annotation file is not a VCF/BCF, all new annotations must
           be defined via -h, --header-lines.

           See also the -l, --merge-logic option.

       -C, --columns-file file
           Read the list of columns from a file (normally given via the -c,
           --columns option). Use "-" to skip a column of the annotation file.
           One column name per row, an additional space- or tab-separated
           field can be present to indicate the merge logic (normally given
           via the -l, --merge-logic option). This is useful when many
           annotations are added at once.

       -e, --exclude EXPRESSION
           exclude sites for which EXPRESSION is true. For valid expressions
           see EXPRESSIONS.

       --force
           continue even when parsing errors, such as undefined tags, are
           encountered. Note this can be an unsafe operation and can result in
           corrupted BCF files. If this option is used, make sure to sanity
           check the result thoroughly.

       -h, --header-lines file
           Lines to append to the VCF header, see also -c, --columns and -a,
           --annotations. For example:

               ##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
               ##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">

       -I, --set-id [+]FORMAT
           assign ID on the fly. The format is the same as in the query
           command (see below). By default all existing IDs are replaced. If
           the format string is preceded by "+", only missing IDs will be set.
           For example, one can use

               bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf

       -i, --include EXPRESSION
           include only sites for which EXPRESSION is true. For valid
           expressions see EXPRESSIONS.

       -k, --keep-sites
           keep sites which do not pass -i and -e expressions instead of
           discarding them

       -l, --merge-logic tag:first|append|append-missing|unique|sum|avg|min|max[,...]
           When multiple regions overlap a single record, this option defines
           how to treat multiple annotation values when setting tag in the
           destination file: use the first encountered value ignoring the rest
           (first); append allowing duplicates (append); append even if the
           appended value is missing, i.e. is a dot (append-missing); append
           discarding duplicate values (unique); sum the values (sum, numeric
           fields only); average the values (avg); use the minimum value (min)
           or the maximum (max).

           Note that this option is intended for use with BED or TAB-delimited
           annotation files only. Moreover, it is effective only when either
           REF and ALT or BEG and END --columns are present.

           Multiple rules can be given either as a comma-separated list or by
           giving the option multiple times. This is an experimental feature.

       -m, --mark-sites TAG
           annotate sites which are present ("+") or absent ("-") in the -a
           file with a new INFO/TAG flag

       --min-overlap ANN:VCF
           minimum overlap required as a fraction of the variant in the
           annotation -a file (ANN), in the target VCF file (:VCF), or both
           for reciprocal overlap (ANN:VCF). By default overlaps of arbitrary
           length are sufficient. The option can be used only with the
           tab-delimited annotation -a file and with BEG and END columns
           present.

       --no-version
           see Common Options

       -o, --output FILE
           see Common Options

       -O, --output-type b|u|z|v[0-9]
           see Common Options

       --pair-logic snps|indels|both|all|some|exact
           Controls how to match records from the annotation file to the
           target VCF. Effective only when -a is a VCF or BCF. The option
           replaces the former unintuitive --collapse. See Common Options for
           more.

       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
           see Common Options

       -R, --regions-file file
           see Common Options

       --regions-overlap 0|1|2
           see Common Options

       --rename-annots file
           rename annotations according to the map in file, with "old_name
           new_name\n" pairs separated by whitespace, each on a separate
           line. The old name must be prefixed with the annotation type: INFO,
           FORMAT, or FILTER.

       --rename-chrs file
           rename chromosomes according to the map in file, with "old_name
           new_name\n" pairs separated by whitespace, each on a separate
           line.

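           For instance, a map that adds a "chr" prefix could be generated
           as sketched below. The file names chr_map.txt and input.vcf are
           hypothetical, as is the choice of chromosomes, and the rename is
           only attempted when bcftools and the input file are present.

```shell
# Build "old_name new_name" pairs that add a "chr" prefix to 1-22, X and Y.
for c in $(seq 1 22) X Y; do
  printf '%s chr%s\n' "$c" "$c"
done > chr_map.txt

head -n 3 chr_map.txt

# Apply the map only if bcftools and a (hypothetical) input file are present.
if command -v bcftools >/dev/null 2>&1 && [ -f input.vcf ]; then
  bcftools annotate --rename-chrs chr_map.txt input.vcf
fi
```
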
       -s, --samples [^]LIST
           subset of samples to annotate, see also Common Options

       -S, --samples-file FILE
           subset of samples to annotate. If the samples are named differently
           in the target VCF and the -a, --annotations VCF, the name mapping
           can be given as "src_name dst_name\n", separated by whitespace,
           each pair on a separate line.

       --single-overlaps
           use this option to keep memory requirements low with very large
           annotation files. Note, however, that this comes at a cost: only
           single overlapping intervals are considered in this mode. This was
           the default mode until the commit af6f0c9 (Feb 24 2019).

       --threads INT
           see Common Options

       -x, --remove list
           List of annotations to remove. Use "FILTER" to remove all filters
           or "FILTER/SomeFilter" to remove a specific filter. Similarly,
           "INFO" can be used to remove all INFO tags and "FORMAT" to remove
           all FORMAT tags except GT. To remove all INFO tags except "FOO" and
           "BAR", use "^INFO/FOO,INFO/BAR" (and similarly for FORMAT and
           FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".

       Examples:

               # Remove three fields
               bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz

               # Remove all INFO fields and all FORMAT fields except for GT and PL
               bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf

               # Add ID, QUAL and INFO/TAG, not replacing TAG if already present
               bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf

               # Carry over all INFO and FORMAT annotations except FORMAT/GT
               bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf

               # Annotate from a tab-delimited file with six columns (the fifth is ignored),
               # first indexing with tabix. The coordinates are 1-based.
               tabix -s1 -b2 -e2 annots.tab.gz
               bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf

               # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
               tabix -s1 -b2 -e3 annots.tab.gz
               bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf

               # Annotate from a bed file (0-based coordinates, half-closed, half-open intervals)
               bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf

               # Transfer the INFO/END tag, matching by POS,REF,ALT and ID. This example assumes
               # that INFO/END is already present in the VCF header.
               bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,INFO/END input.vcf

               # For more examples see http://samtools.github.io/bcftools/howtos/annotate.html

   bcftools call [OPTIONS] FILE
       This command replaces the former bcftools view caller. Some of the
       original functionality has been temporarily lost in the process of
       transition under htslib <http://github.com/samtools/htslib>, but will
       be added back on popular demand. The original calling model can be
       invoked with the -c option.

   File format options:
       --no-version
           see Common Options

       -o, --output FILE
           see Common Options

       -O, --output-type b|u|z|v[0-9]
           see Common Options

       --ploidy ASSEMBLY[?]
           predefined ploidy, use list (or any other unused word) to print a
           list of all predefined assemblies. Append a question mark to print
           the actual definition. See also --ploidy-file.

       --ploidy-file FILE
           ploidy definition given as a space/tab-delimited list of CHROM,
           FROM, TO, SEX, PLOIDY. The SEX codes are arbitrary and correspond
           to the ones used by --samples-file. The default ploidy can be given
           using the starred records (see below), unlisted regions have ploidy
           2. The default ploidy definition is

               X 1 60000 M 1
               X 2699521 154931043 M 1
               Y 1 59373566 M 1
               Y 1 59373566 F 0
               MT 1 16569 M 1
               MT 1 16569 F 1
               *  * *     M 2
               *  * *     F 2

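           Because the SEX codes must agree between the ploidy file and the
           samples file, a quick consistency check can be useful. The sketch
           below is one possible way to do it. The file names ploidy.txt and
           call_samples.txt and the sample names are hypothetical, and the
           ploidy file is an abbreviated copy of the default shown above.

```shell
# Hypothetical ploidy definition: CHROM, FROM, TO, SEX, PLOIDY.
cat > ploidy.txt <<'EOF'
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
*  * *     M 2
*  * *     F 2
EOF

# Matching samples file: sample name and one of the SEX codes used above.
printf 'sample1 M\nsample2 F\n' > call_samples.txt

# Collect the sex codes from ploidy.txt, then flag any sample whose code is unknown.
awk 'BEGIN { bad = 0 }
     NR == FNR { sex[$4] = 1; next }
     !($2 in sex) { print "undefined sex code: " $2; bad = 1 }
     END { exit bad }' ploidy.txt call_samples.txt && echo "sex codes OK"
```

           The two files could then be passed to bcftools call as
           --ploidy-file ploidy.txt and -S call_samples.txt.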
       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
           see Common Options

       -R, --regions-file file
           see Common Options

       --regions-overlap 0|1|2
           see Common Options

       -s, --samples LIST
           see Common Options

       -S, --samples-file FILE
           see Common Options

       -t, --targets LIST
           see Common Options

       -T, --targets-file FILE
           see Common Options

       --targets-overlap 0|1|2
           see Common Options

       --threads INT
           see Common Options

600   Input/output options:
601       -A, --keep-alts
602           output all alternate alleles present in the alignments even if they
603           do not appear in any of the genotypes
604
605       -f, --format-fields list
606           comma-separated list of FORMAT fields to output for each sample.
607           Currently GQ and GP fields are supported. For convenience, the
608           fields can be given as lower case letters. Prefixed with "^"
609           indicates a request for tag removal of auxiliary tags useful only
610           for calling.
611
612       -F, --prior-freqs AN,AC
613           take advantage of prior knowledge of population allele frequencies.
614           The workflow looks like this:
615
616               # Extract AN,AC values from an existing VCF, such 1000Genomes
617               bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz
618
619               # If the tags AN,AC are not already present, use the +fill-tags plugin
620               bcftools +fill-tags 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz
621               tabix -s1 -b2 -e2 AFs.tab.gz
622
623               # Create a VCF header description, here we name the tags REF_AN,REF_AC
624               cat AFs.hdr
625               ##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes">
626               ##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele">
627
628               # Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies
629               bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]
630
631       -G, --group-samples FILE|-
632           by default, all samples are assumed to come from a single
633           population. This option allows to group samples into populations
634           and apply the HWE assumption within but not across the populations.
635           FILE is a tab-delimited text file with sample names in the first
636           column and group names in the second column. If - is given instead,
637           no HWE assumption is made at all and single-sample calling is
638           performed. (Note that in low coverage data this inflates the rate
639           of false positives.) The -G option requires the presence of
640           per-sample FORMAT/QS or FORMAT/AD tag generated with bcftools
641           mpileup -a QS (or -a AD).
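
A group file for -G could be prepared as in the following sketch. The sample names, group names and the file names in the final comment (groups.txt, ref.fa, in.bam) are hypothetical; only the two-column, tab-delimited layout comes from the description above.

```shell
# Column 1 = sample name, column 2 = group name, separated by a tab.
printf 'sampleA\tpop1\nsampleB\tpop1\nsampleC\tpop2\n' > groups.txt

# Verify that every line has exactly two tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 2 { n++ } END { print n + 0 }' groups.txt)
echo "malformed lines: $bad_lines"

# The file would then be passed to the caller, for example:
#   bcftools mpileup -a AD -f ref.fa in.bam -Ou | bcftools call -mv -G groups.txt -Oz -o calls.vcf.gz
```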
642
643       -g, --gvcf INT
644           output also gVCF blocks of homozygous REF calls. The parameter INT
645           is the minimum per-sample depth required to include a site in the
646           non-variant block.
647
648       -i, --insert-missed INT
649           output also sites missed by mpileup but present in -T,
650           --targets-file.
651
652       -M, --keep-masked-ref
653           output sites where REF allele is N
654
655       -V, --skip-variants snps|indels
656           skip indel/SNP sites
657
658       -v, --variants-only
659           output variant sites only
660
661   Consensus/variant calling options:
662       -c, --consensus-caller
663           the original samtools/bcftools calling method (conflicts with -m)
664
665       -C, --constrain alleles|trio
666
667           alleles
668               call genotypes given alleles. See also -T, --targets-file.
669
670           trio
671               call genotypes given the father-mother-child constraint. See
672               also -s, --samples and -n, --novel-rate.
673
674       -m, --multiallelic-caller
675           alternative model for multiallelic and rare-variant calling
676           designed to overcome known limitations in -c calling model
677           (conflicts with -c)
678
679       -n, --novel-rate float[,...]
680           likelihood of novel mutation for constrained -C trio calling. The
681           trio genotype calling maximizes likelihood of a particular
682           combination of genotypes for father, mother and the child
683           P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) * (1-Pn).
684           By providing three values, the mutation rate Pn is set explicitly
685           for SNPs, deletions and insertions, respectively. If two values are
686           given, the first is interpreted as the mutation rate of SNPs and
687           the second is used to calculate the mutation rate of indels
688           according to their length as Pn=float*exp(-a-b*len), where
689           a=22.8689, b=0.2994 for insertions and a=21.9313, b=0.2856 for
690           deletions [pubmed:23975140]. If only one value is given, the same
691           mutation rate Pn is used for SNPs and indels.
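
The length-dependent indel rate can be evaluated directly from the formula above. This is only an illustration of the arithmetic, not the bcftools implementation; the value assigned to rate is a hypothetical second argument of -n.

```shell
# Hypothetical value for the second number passed to -n (an assumption).
rate=1e-9

# Pn = rate * exp(-a - b*len), with the constants quoted in the text above.
indel_pn() {   # usage: indel_pn ins|del LENGTH
    awk -v type="$1" -v len="$2" -v rate="$rate" 'BEGIN {
        if (type == "ins") { a = 22.8689; b = 0.2994 }
        else               { a = 21.9313; b = 0.2856 }
        printf "%.6e\n", rate * exp(-a - b * len)
    }'
}

# Longer indels receive a smaller mutation rate.
for len in 1 5 10; do
    echo "len=$len insertion=$(indel_pn ins "$len") deletion=$(indel_pn del "$len")"
done
```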
692
693       -p, --pval-threshold float
694           with -c, accept variant if P(ref|D) < float.
695
696       -P, --prior float
697           expected substitution rate, or 0 to disable the prior. Only with
698           -m.
699
700       -t, --targets file|chr|chr:pos|chr:from-to|chr:from-[,...]
701           see Common Options
702
703       -X, --chromosome-X
704           haploid output for male samples (requires PED file with -s)
705
706       -Y, --chromosome-Y
707           haploid output for males and skips females (requires PED file with
708           -s)
709
710   bcftools cnv [OPTIONS] FILE
711       Copy number variation caller, requires a VCF annotated with the
712       Illumina’s B-allele frequency (BAF) and Log R Ratio intensity (LRR)
713       values. The HMM considers the following copy number states: CN 2
714       (normal), 1 (single-copy loss), 0 (complete loss), 3 (single-copy
715       gain).
716
717   General Options:
718       -c, --control-sample string
719           optional control sample name. If given, pairwise calling is
720           performed and the -P option can be used.
721
722       -f, --AF-file file
723           read allele frequencies from a tab-delimited file with the columns
724           CHR,POS,REF,ALT,AF
725
726       -o, --output-dir path
727           output directory
728
729       -p, --plot-threshold float
730           call matplotlib to produce plots for chromosomes with quality at
731           least float, useful for visual inspection of the calls. With -p 0,
732           plots for all chromosomes will be generated. If not given, a
733           matplotlib script will be created but not called.
734
735       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
736           see Common Options
737
738       -R, --regions-file file
739           see Common Options
740
741       --regions-overlap 0|1|2
742           see Common Options
743
744       -s, --query-sample string
745           query sample name
746
747       -t, --targets LIST
748           see Common Options
749
750       -T, --targets-file FILE
751           see Common Options
752
753       --targets-overlap 0|1|2
754           see Common Options
755
756   HMM Options:
757       -a, --aberrant float[,float]
758           fraction of aberrant cells in query and control. The hallmark of
759           duplications and contaminations is the BAF value of heterozygous
760           markers which is dependent on the fraction of aberrant cells.
761           Sensitivity to smaller fractions of cells can be increased by
762           setting -a to a lower value. Note however, that this comes at the
763           cost of increased false discovery rate.
764
765       -b, --BAF-weight float
766           relative contribution from BAF
767
768       -d, --BAF-dev float[,float]
769           expected BAF deviation in query and control, i.e. the noise
770           observed in the data.
771
772       -e, --err-prob float
773           uniform error probability
774
775       -l, --LRR-weight float
776           relative contribution from LRR. With noisy data, this option can
777           have a big effect on the number of calls produced. In truly random
778           noise (such as in simulated data), the value should be set high
779           (1.0), but in the presence of systematic noise when LRR are not
780           informative, lower values result in cleaner calls (0.2).
781
782       -L, --LRR-smooth-win int
783           reduce LRR noise by applying moving average given this window size
784
785       -O, --optimize float
786           iteratively estimate the fraction of aberrant cells, down to the
787           given fraction. Lowering this value from the default 1.0 to, say,
788           0.3 can help discover more events but also increases noise.
789
790       -P, --same-prob float
791           the prior probability of the query and the control sample being the
792           same. Setting to 0 calls both independently, setting to 1 forces
793           the same copy number state in both.
794
795       -x, --xy-prob float
796           the HMM probability of transition to another copy number state.
797           Increasing this value leads to smaller and more frequent calls.
798
799   bcftools concat [OPTIONS] FILE1 FILE2 [...]
800       Concatenate or combine VCF/BCF files. All source files must have the
801       same sample columns appearing in the same order. Can be used, for
802       example, to concatenate chromosome VCFs into one VCF, or combine a SNP
803       VCF and an indel VCF into one. The input files must be sorted by chr
804       and position. The files must be given in the correct order to produce
805       sorted VCF on output unless the -a, --allow-overlaps option is
806       specified. With the --naive option, the files are concatenated without
807       being recompressed, which is very fast.
808
809       -a, --allow-overlaps
810           First coordinate of the next file can precede last record of the
811           current file.
812
813       -c, --compact-PS
814           Do not output PS tag at each site, only at the start of a new phase
815           set block.
816
817       -d, --rm-dups snps|indels|both|all|exact
818           Output duplicate records of specified type present in multiple
819           files only once. Note that records duplicated within one file are
820           not removed with this option, for that use bcftools norm -d
821           instead.
822           In other words, the default behavior of the program is similar to
823           unix "cat" in that when two files contain a record with the same
824           position, that position will appear twice on output. With -d, every
825           line that finds a matching record in another file will be printed
826           only once.
827           Requires -a, --allow-overlaps.
828
829       -D, --remove-duplicates
830           Alias for -d exact
831
832       -f, --file-list FILE
833           Read file names from FILE, one file name per line.
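
Because the order of the input files determines the order of the output, the list for -f can be generated rather than typed. A minimal sketch, assuming hypothetical per-chromosome files named chrN.vcf.gz that should appear in the order 1..22, X, Y:

```shell
# Build the list of files in the order they should be concatenated.
: > files.txt
for chr in $(seq 1 22) X Y; do
    echo "chr${chr}.vcf.gz" >> files.txt
done
echo "files listed: $(wc -l < files.txt)"

# With all listed files present, the list would be used as:
#   bcftools concat -f files.txt -Oz -o all.vcf.gz
```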
834
835       -l, --ligate
836           Ligate phased VCFs by matching phase at overlapping haplotypes.
837           Note that the option is intended for VCFs with perfect overlap,
838           sites in overlapping regions present in one but missing in the
839           other are dropped.
840
841       --ligate-force
842           Keep all sites and ligate even non-overlapping chunks and chunks
843           with imperfect overlap
844
845       --ligate-warn
846           Drop sites in imperfect overlaps
847
848       --no-version
849           see Common Options
850
851       -n, --naive
852           Concatenate VCF or BCF files without recompression. This is very
853           fast but requires that all files are of the same type (all VCF or
854           all BCF) and have the same headers. This is because all tags and
855           chromosome names in the BCF body rely on the order of the contig
856           and tag definitions in the header. A header compatibility check is
857           performed and the program throws an error if it is not safe to use
858           the option.
859
860       --naive-force
861           Same as --naive, but header compatibility is not checked.
862           Dangerous, use with caution.
863
864       -o, --output FILE
865           see Common Options
866
867       -O, --output-type b|u|z|v[0-9]
868           see Common Options
869
870       -q, --min-PQ INT
871           Break phase set if phasing quality is lower than INT
872
873       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
874           see Common Options. Requires -a, --allow-overlaps.
875
876       -R, --regions-file FILE
877           see Common Options. Requires -a, --allow-overlaps.
878
879       --regions-overlap 0|1|2
880           see Common Options
881
882       --threads INT
883           see Common Options
884
885   bcftools consensus [OPTIONS] FILE
886       Create consensus sequence by applying VCF variants to a reference fasta
887       file. By default, the program will apply all ALT variants to the
888       reference fasta to obtain the consensus sequence. Using the --sample
889       (and, optionally, --haplotype) option will apply genotype (haplotype)
890       calls from FORMAT/GT. Note that the program does not act as a primitive
891       variant caller and ignores allelic depth information, such as INFO/AD
892       or FORMAT/AD. For that, consider using the setGT plugin.
893
894       -a, --absent CHAR
895           replace positions absent from VCF with CHAR
896
897       -c, --chain FILE
898           write a chain file for liftover
899
900       -e, --exclude EXPRESSION
901           exclude sites for which EXPRESSION is true. For valid expressions
902           see EXPRESSIONS.
903
904       -f, --fasta-ref FILE
905           reference sequence in fasta format
906
907       -H, --haplotype 1|2|R|A|I|LR|LA|SR|SA|1pIu|2pIu
908           choose which allele from the FORMAT/GT field to use (the codes are
909           case-insensitive):
910
911           1
912               the first allele, regardless of phasing
913
914           2
915               the second allele, regardless of phasing
916
917           R
918               the REF allele (in heterozygous genotypes)
919
920           A
921               the ALT allele (in heterozygous genotypes)
922
923           I
924               IUPAC code for all genotypes
925
926           LR, LA
927               the longer allele. If both have the same length, use the REF
928               allele (LR), or the ALT allele (LA)
929
930           SR, SA
931               the shorter allele. If both have the same length, use the REF
932               allele (SR), or the ALT allele (SA)
933
934           1pIu, 2pIu
935               first/second allele for phased genotypes and IUPAC code for
936               unphased genotypes
937
938           This option requires -s, unless exactly one sample is present in the VCF.
939
940       -i, --include EXPRESSION
941           include only sites for which EXPRESSION is true. For valid
942           expressions see EXPRESSIONS.
943
944       -I, --iupac-codes
945           output variants in the form of IUPAC ambiguity codes
946
947       --mark-del CHAR
948           instead of removing sequence, insert CHAR for deletions
949
950       --mark-ins uc|lc
951           highlight inserted sequence in uppercase (uc) or lowercase (lc),
952           leaving the rest of the sequence as is
953
954       --mark-snv uc|lc
955           highlight substitutions in uppercase (uc) or lowercase (lc),
956           leaving the rest of the sequence as is
957
958       -m, --mask FILE
959           BED file or TAB file with regions to be replaced with N (the
960           default) or as specified by the next --mask-with option. See
961           discussion of --regions-file in Common Options for file format
962           details.
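
When preparing a mask it is easy to mix up coordinate systems. The sketch below assumes the usual conventions (tab-delimited region files are 1-based and inclusive, BED files are 0-based and half-open); check the --regions-file discussion in Common Options for the authoritative rules. File names and coordinates are hypothetical.

```shell
# Hypothetical 1-based, inclusive regions (chromosome, first base, last base).
printf '1\t101\t200\n2\t5001\t5100\n' > mask.tab

# BED uses 0-based, half-open intervals: subtract 1 from the start only.
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3 }' mask.tab > mask.bed
cat mask.bed

# The mask could then be given to -m, for example:
#   bcftools consensus -f ref.fa -m mask.bed in.vcf.gz -o out.fa
```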
963
964       --mask-with CHAR|lc|uc
965           replace sequence from --mask with CHAR, skipping overlapping
966           variants, or change to lowercase (lc) or uppercase (uc)
967
968       -M, --missing CHAR
969           instead of skipping the missing genotypes, output the character
970           CHAR (e.g. "?")
971
972       -o, --output FILE
973           write output to a file
974
975       -s, --sample NAME
976           apply variants of the given sample
977
978       Examples:
979
980               # Apply variants present in sample "NA001", output IUPAC codes for hets
981               bcftools consensus -I -s NA001 -f in.fa in.vcf.gz > out.fa
982
983               # Create consensus for one region. The fasta header lines are then expected
984               # in the form ">chr:from-to".
985               samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa
986
987   bcftools convert [OPTIONS] FILE
988   VCF input options:
989       -e, --exclude EXPRESSION
990           exclude sites for which EXPRESSION is true. For valid expressions
991           see EXPRESSIONS.
992
993       -i, --include EXPRESSION
994           include only sites for which EXPRESSION is true. For valid
995           expressions see EXPRESSIONS.
996
997       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
998           see Common Options
999
1000       -R, --regions-file FILE
1001           see Common Options
1002
1003       --regions-overlap 0|1|2
1004           see Common Options
1005
1006       -s, --samples LIST
1007           see Common Options
1008
1009       -S, --samples-file FILE
1010           see Common Options
1011
1012       -t, --targets LIST
1013           see Common Options
1014
1015       -T, --targets-file FILE
1016           see Common Options
1017
1018       --targets-overlap 0|1|2
1019           see Common Options
1020
1021   VCF output options:
1022       --no-version
1023           see Common Options
1024
1025       -o, --output FILE
1026           see Common Options
1027
1028       -O, --output-type b|u|z|v[0-9]
1029           see Common Options
1030
1031       --threads INT
1032           see Common Options
1033
1034   GEN/SAMPLE conversion:
1035       -G, --gensample2vcf prefix or gen-file,sample-file
1036           convert IMPUTE2 output to VCF. One of the ID columns ("SNP ID" or
1037           "rsID" in https://www.cog-genomics.org/plink/2.0/formats#gen) must
1038           be of the form "CHROM:POS_REF_ALT" to detect possible strand swaps.
1039           When the --vcf-ids option is given, the other column (autodetected)
1040           is used to fill the ID column of the VCF.
1041           See also -g and --3N6 options.
1042
1043       -g, --gensample prefix or gen-file,sample-file
1044           convert from VCF to gen/sample format used by IMPUTE2 and SHAPEIT.
1045           The columns of .gen file format are ID1,ID2,POS,A,B followed by
1046           three genotype probabilities P(AA), P(AB), P(BB) for each sample.
1047           In order to prevent strand swaps, the program uses IDs of the form
1048           "CHROM:POS_REF_ALT". When the --vcf-ids option is given, the second
1049           column is set to match the ID column of the VCF.
1050           See also -G and --3N6 options.
1051           The .gen and .sample file formats are:
1052
1053             .gen (with --3N6 --vcf-ids)
1054             ---------------------------
1055             chr1 1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1 0
1056             chr1 1:111494194_C_T rsID2 111494194 C T 0 1 0 0 0 1
1057
1058             .gen (with --vcf-ids)
1059             ---------------------------
1060             1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1 0
1061             1:111494194_C_T rsID2 111494194 C T 0 1 0 0 0 1
1062
1063             .gen (the default)
1064             ------------------------------
1065             1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0
1066             1:111494194_C_T 1:111494194_C_T 111494194 C T 0 1 0 0 0 1
1067
1068             .sample
1069             -------
1070             ID_1 ID_2 missing
1071             0 0 0
1072             sample1 sample1 0
1073             sample2 sample2 0
1074
1075       --3N6
1076           Expect/Create files in the 3*N+6 column format. This is the new
1077           .gen file format with the first column containing the chromosome
1078           name, see https://www.cog-genomics.org/plink/2.0/formats#gen
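
The 3*N+6 rule gives a quick way to sanity-check a .gen file before conversion. The following sketch is an illustrative check only (bcftools does its own parsing); it reuses the two records from the --3N6 --vcf-ids example and a hypothetical file name.

```shell
# Two records for two samples, copied from the --3N6 --vcf-ids example.
cat > example.gen <<'EOF'
chr1 1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1 0
chr1 1:111494194_C_T rsID2 111494194 C T 0 1 0 0 0 1
EOF

# Infer the sample count from the column count: NF = 3*N + 6.
# Prints -1 if any line does not fit the 3*N+6 layout.
n_samples=$(awk '{ if ((NF - 6) % 3 != 0 || NF < 9) bad = 1; n = (NF - 6) / 3 }
                 END { print (bad ? -1 : n) }' example.gen)
echo "samples: $n_samples"
```

For the older layout without the leading chromosome column, the same check would use 3*N+5.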
1079
1080       --tag STRING
1081           tag to take values for .gen file: GT,PL,GL,GP
1082
1083       --sex FILE
1084           output sex column in the sample file. The FILE format is
1085
1086               MaleSample    M
1087               FemaleSample  F
1088
1089       --vcf-ids
1090           output VCF IDs in the second column instead of CHROM:POS_REF_ALT
1091
1092   gVCF conversion:
1093       --gvcf2vcf
1094           convert gVCF to VCF, expanding REF blocks into sites. Note that the
1095           -i and -e options work differently with this switch. In this
1096           situation the filtering expressions define which sites should be
1097           expanded and which sites should be left unmodified, but all sites
1098           are printed on output. In order to drop sites, stream first through
1099           bcftools view.
1100
1101       -f, --fasta-ref file
1102           reference sequence in fasta format. Must be indexed with samtools
1103           faidx
1104
1105   HAP/SAMPLE conversion:
1106       --hapsample2vcf prefix or hap-file,sample-file
1107           convert from hap/sample format to VCF. The columns of .hap file are
1108           similar to .gen file above, but there are only two haplotype
1109           columns per sample. Note that the first or the second column of the
1110           .hap file is expected to be in the form "CHR:POS_REF_ALT[_END]",
1111           with the _END being optional for defining the INFO/END tag when ALT
1112           is a symbolic allele. For example:
1113
1114             .hap (with --vcf-ids)
1115             ---------------------
1116             1:111485207_G_A rsID1 111485207 G A 0 1 0 0
1117             1:111495231_A_<DEL>_111495784 rsID3 111495231 A <DEL> 0 0 1 0
1118
1119             .hap (the default)
1120             ------------------
1121             1 1:111485207_G_A 111485207 G A 0 1 0 0
1122             1 1:111495231_A_<DEL>_111495784 111495231 A <DEL> 0 0 1 0
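
The "CHR:POS_REF_ALT[_END]" identifiers can be taken apart with standard tools, which helps when checking input files by hand. A minimal sketch, assuming CHR contains no ":" and REF/ALT contain no "_"; the helper name parse_id is hypothetical:

```shell
# Split an ID of the form CHR:POS_REF_ALT[_END] into its parts.
parse_id() {
    echo "$1" | awk '{
        split($0, cp, ":")               # cp[1] = CHR, cp[2] = POS_REF_ALT[_END]
        n = split(cp[2], f, "_")         # f[1..n] = POS, REF, ALT and optionally END
        last = (n >= 4) ? f[4] : "."     # "." marks a missing END
        print cp[1], f[1], f[2], f[3], last
    }'
}

parse_id '1:111485207_G_A'
parse_id '1:111495231_A_<DEL>_111495784'
```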
1123
1124       --hapsample prefix or hap-file,sample-file
1125           convert from VCF to hap/sample format used by IMPUTE2 and SHAPEIT.
1126           The columns of .hap file begin with ID,RSID,POS,REF,ALT. In order
1127           to prevent strand swaps, the program uses IDs of the form
1128           "CHROM:POS_REF_ALT".
1129
1130       --haploid2diploid
1131           with -h option converts haploid genotypes to homozygous diploid
1132           genotypes. For example, the program will print 0 0 instead of the
1133           default 0 -. This is useful for programs which do not handle
1134           haploid genotypes correctly.
1135
1136       --sex FILE
1137           output sex column in the sample file. The FILE format is
1138
1139               MaleSample    M
1140               FemaleSample  F
1141
1142       --vcf-ids
1143           the second column of the .hap file holds the VCF ids, the first
1144           column is of the form "CHR:POS_REF_ALT[_END]". Without the option,
1145           the format follows
1146           https://www.cog-genomics.org/plink/2.0/formats#haps with ids (the
1147           second column) of the form "CHR:POS_REF_ALT[_END]"
1148
1149   HAP/LEGEND/SAMPLE conversion:
1150       -H, --haplegendsample2vcf prefix or hap-file,legend-file,sample-file
1151           convert from hap/legend/sample format used by IMPUTE2 to VCF. See
1152           also -h, --haplegendsample below.
1153
1154       -h, --haplegendsample prefix or hap-file,legend-file,sample-file
1155           convert from VCF to hap/legend/sample format used by IMPUTE2 and
1156           SHAPEIT. The columns of the .legend file are ID,POS,REF,ALT. In order to
1157           prevent strand swaps, the program uses IDs of the form
1158           "CHROM:POS_REF_ALT". The .sample file is quite basic at the moment
1159           with columns for population, group and sex expected to be edited by
1160           the user. For example:
1161
1162             .hap
1163             -----
1164             0 1 0 0 1 0
1165             0 1 0 0 0 1
1166
1167             .legend
1168             -------
1169             id position a0 a1
1170             1:111485207_G_A 111485207 G A
1171             1:111494194_C_T 111494194 C T
1172
1173             .sample
1174             -------
1175             sample population group sex
1176             sample1 sample1 sample1 2
1177             sample2 sample2 sample2 2
1178
1179       --haploid2diploid
1180           with -h option converts haploid genotypes to homozygous diploid
1181           genotypes. For example, the program will print 0 0 instead of the
1182           default 0 -. This is useful for programs which do not handle
1183           haploid genotypes correctly.
1184
1185       --sex FILE
1186           output sex column in the sample file. The FILE format is
1187
1188               MaleSample    M
1189               FemaleSample  F
1190
1191       --vcf-ids
1192           output VCF IDs instead of "CHROM:POS_REF_ALT". Note that this
1193           option can be used with --haplegendsample but not with
1194           --haplegendsample2vcf.
1195
1196   TSV conversion:
1197       --tsv2vcf file
1198           convert from TSV (tab-separated values) format (such as generated
1199           by 23andMe) to VCF. The input file fields can be tab- or space-
1200           delimited
1201
1202       -c, --columns list
1203           comma-separated list of fields in the input file. In the current
1204           version, the fields CHROM, POS, ID, and AA are expected and can
1205           appear in arbitrary order, columns which should be ignored in the
1206           input file can be indicated by "-". The AA field lists alleles on
1207           the forward reference strand, for example "CC" or "CT" for diploid
1208           genotypes or "C" for haploid genotypes (sex chromosomes).
1209           Insertions and deletions are not supported yet, missing data can be
1210           indicated with "--".
1211
1212       -f, --fasta-ref file
1213           reference sequence in fasta format. Must be indexed with samtools
1214           faidx
1215
1216       -s, --samples LIST
1217           list of sample names. See Common Options
1218
1219       -S, --samples-file FILE
1220           file of sample names. See Common Options
1221
1222       Example:
1223
1224           # Convert 23andme results into VCF
1225           bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -Oz -o out.vcf.gz
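
Before running the conversion it can help to confirm that the genotype column only holds values the AA field supports. The sketch below uses a hypothetical four-column input matching -c ID,CHROM,POS,AA; the check is an illustration and not something bcftools requires.

```shell
# Hypothetical input: ID, CHROM, POS, AA (tab-delimited).
printf 'rs0001\t1\t1000\tCC\nrs0002\t1\t2000\tCT\nrs0003\tX\t3000\tC\nrs0004\t2\t4000\t--\n' > example.txt

# Count rows whose AA field is neither one or two of A,C,G,T
# nor the missing-data marker "--".
bad_rows=$(awk -F'\t' '$4 !~ /^([ACGT][ACGT]?|--)$/ { n++ } END { print n + 0 }' example.txt)
echo "rows with unsupported genotypes: $bad_rows"
```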
1226
1227   bcftools csq [OPTIONS] FILE
1228       Haplotype aware consequence predictor which correctly handles combined
1229       variants such as MNPs split over multiple VCF records, SNPs separated
1230       by an intron (but adjacent in the spliced transcript) or nearby
1231       frame-shifting indels which in combination in fact are not
1232       frame-shifting.
1233
1234       The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tags
1235       (configurable with the -c option). The latter is a bitmask of indexes
1236       to INFO/BCSQ, with interleaved haplotypes. See the usage examples below
1237       for using the %TBCSQ converter in query for extracting a more human
1238       readable form from this bitmask. The construction of the bitmask limits
1239       the number of consequences that can be referenced per sample in the
1240       FORMAT/BCSQ tags. By default this is 15, but if more are required, see
1241       the --ncsq option.
1242
1243       The program requires on input a VCF/BCF file, the reference genome in
1244       fasta format (--fasta-ref) and genomic features in the GFF3 format
1245       downloadable from the Ensembl website (--gff-annot), and outputs an
1246       annotated VCF/BCF file. Currently, only Ensembl GFF3 files are
1247       supported.
1248
1249       By default, the input VCF should be phased. If phase is unknown, or
1250       only partially known, the --phase option can be used to indicate how to
1251       handle unphased data. Alternatively, haplotype aware calling can be
1252       turned off with the --local-csq option.
1253
1254       If conflicting (overlapping) variants within one haplotype are
1255       detected, a warning will be emitted and predictions will be based on
1256       only the first variant in the analysis.
1257
1258       Symbolic alleles are not supported. They will remain unannotated in the
1259       output VCF and are ignored for the prediction analysis.
1260
1261       -c, --custom-tag STRING
1262           use this custom tag to store consequences rather than the default
1263           BCSQ tag
1264
1265       -B, --trim-protein-seq INT
1266           abbreviate protein-changing predictions to a maximum of INT
1267           amino acids. For example, instead of writing the whole modified
1268           protein sequence with potentially hundreds of amino acids, with -B 1
1269           only an abbreviated version such as 25E..329>25G..94 will be
1270           written.
1271
1272       -e, --exclude EXPRESSION
1273           exclude sites for which EXPRESSION is true. For valid expressions
1274           see EXPRESSIONS.
1275
1276       -f, --fasta-ref FILE
1277           reference sequence in fasta format (required)
1278
1279       --force
1280           run even if some sanity checks fail. Currently the option allows
1281           skipping transcripts in malformed GFFs with incorrect phase.
1282
1283       -g, --gff-annot FILE
1284           GFF3 annotation file (required), such as
1285           ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens. An example of
1286           a minimal working GFF file:
1287
1288               # The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
1289               # looks up their parent transcript (determined from the "Parent=transcript:" attribute),
1290               # the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
1291               # (the most interesting is "protein_coding").
1292               #
1293               # Attributes required for
1294               #   gene lines:
1295               #   - ID=gene:<gene_id>
1296               #   - biotype=<biotype>
1297               #   - Name=<gene_name>      [optional]
1298               #
1299               #   transcript lines:
1300               #   - ID=transcript:<transcript_id>
1301               #   - Parent=gene:<gene_id>
1302               #   - biotype=<biotype>
1303               #
1304               #   other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
1305               #   - Parent=transcript:<transcript_id>
1306               #
1307               # Supported biotypes:
1308               #   - see the function gff_parse_biotype() in bcftools/csq.c
1309
1310               1   ignored_field  gene            21  2148  . -   . ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1311               1   ignored_field  transcript      21  2148  . -   . ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1312               1   ignored_field  three_prime_UTR 21  2054  . -   . Parent=transcript:TranscriptId
1313               1   ignored_field  exon            21  2148  . -   . Parent=transcript:TranscriptId
1314               1   ignored_field  CDS             21  2148  . -   1   Parent=transcript:TranscriptId
1315               1   ignored_field  five_prime_UTR  210 2148  . -   . Parent=transcript:TranscriptId
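
The attribute requirements listed in the comments can be checked with a short awk script before running csq. This is an illustrative pre-check, not the parser used by bcftools; it writes a tab-separated subset of the minimal GFF to a hypothetical file minimal.gff.

```shell
# A subset of the minimal GFF, written with real tab separators.
{
printf '1\tignored_field\tgene\t21\t2148\t.\t-\t.\tID=gene:GeneId;biotype=protein_coding;Name=GeneName\n'
printf '1\tignored_field\ttranscript\t21\t2148\t.\t-\t.\tID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding\n'
printf '1\tignored_field\texon\t21\t2148\t.\t-\t.\tParent=transcript:TranscriptId\n'
printf '1\tignored_field\tCDS\t21\t2148\t.\t-\t1\tParent=transcript:TranscriptId\n'
} > minimal.gff

# Count feature lines that lack one of the required attributes.
missing=$(awk -F'\t' '
    $3 == "gene"       && ($9 !~ /ID=gene:/ || $9 !~ /biotype=/) { n++ }
    $3 == "transcript" && ($9 !~ /ID=transcript:/ || $9 !~ /Parent=gene:/ || $9 !~ /biotype=/) { n++ }
    ($3 == "CDS" || $3 == "exon" || $3 ~ /_prime_UTR$/) && $9 !~ /Parent=transcript:/ { n++ }
    END { print n + 0 }' minimal.gff)
echo "lines with missing attributes: $missing"
```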
1316
1317       -i, --include EXPRESSION
1318           include only sites for which EXPRESSION is true. For valid
1319           expressions see EXPRESSIONS.
1320
1321       -l, --local-csq
1322           switch off haplotype-aware calling, run localized predictions
1323           considering only one VCF record at a time
1324
1325       -n, --ncsq INT
1326           maximum number of per-haplotype consequences to consider for each
1327           site. The INFO/BCSQ column includes all consequences, but only the
1328           first INT will be referenced by the FORMAT/BCSQ fields. The default
1329           value is 15 which corresponds to one 32-bit integer per diploid
1330           sample, after accounting for values reserved by the BCF
1331           specification. Note that increasing the value leads to increased
1332           size of the output BCF.
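
With two interleaved haplotypes, the default of 15 consequences needs 2 x 15 = 30 bits, which is why it fits in one 32-bit integer. The sketch below decodes such a value under an ASSUMED bit layout (bit 2*i for the first haplotype and bit 2*i+1 for the second, where i is the 0-based index into INFO/BCSQ); the layout and the helper name decode_bcsq are assumptions, and in practice the %TBCSQ converter of bcftools query should be preferred.

```shell
# Decode a FORMAT/BCSQ value under the assumed interleaved layout.
decode_bcsq() {   # usage: decode_bcsq MASK
    mask=$1
    bit=0
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            # Even bits -> haplotype 1, odd bits -> haplotype 2.
            echo "haplotype $((bit % 2 + 1)) consequence $((bit / 2 + 1))"
        fi
        mask=$((mask >> 1))
        bit=$((bit + 1))
    done
}

decode_bcsq 5    # binary 101: bits 0 and 2 are set
```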
1333
1334       --no-version
1335           see Common Options
1336
1337       -o, --output FILE
1338           see Common Options
1339
1340       -O, --output-type t|b|u|z|v[0-9]
1341           see Common Options. In addition, a custom tab-delimited plain text
1342           output can be printed (t).
1343
1344       -p, --phase a|m|r|R|s
1345           how to handle unphased heterozygous genotypes:
1346
1347           a
1348               take GTs as is, create haplotypes regardless of phase (0/1 →
1349               0|1)
1350
1351           m
1352               merge all GTs into a single haplotype (0/1 → 1, 1/2 → 1)
1353
1354           r
1355               require phased GTs, throw an error on unphased heterozygous GTs
1356
1357           R
1358               create non-reference haplotypes if possible (0/1 → 1|1, 1/2 →
1359               1|2)
1360
1361           s
1362               skip unphased heterozygous GTs
1363
1364       -q, --quiet
1365           suppress warning messages
1366
1367       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1368           see Common Options
1369
1370       -R, --regions-file FILE
1371           see Common Options
1372
1373       --regions-overlap 0|1|2
1374           see Common Options
1375
1376       -s, --samples LIST
1377           samples to include or "-" to apply all variants and ignore samples
1378
1379       -S, --samples-file FILE
1380           see Common Options
1381
1382       -t, --targets LIST
1383           see Common Options
1384
1385       -T, --targets-file FILE
1386           see Common Options
1387
1388       --targets-overlap 0|1|2
1389           see Common Options
1390
1391       Examples:
1392
1393               # Basic usage
1394               bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf
1395
1396               # Extract the translated haplotype consequences. The following TBCSQ variations
1397               # are recognised:
1398               #   %TBCSQ    .. print consequences in all haplotypes in separate columns
1399               #   %TBCSQ{0} .. print the first haplotype only
1400               #   %TBCSQ{1} .. print the second haplotype only
1401               #   %TBCSQ{*} .. print a list of unique consequences present in either haplotype
1402               bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf
1403
1404       Examples of BCSQ annotation:
1405
1406               # Two separate VCF records at positions 2:122106101 and 2:122106102
1407               # change the same codon. This UV-induced C>T dinucleotide mutation
1408               # has been annotated fully at the position 2:122106101 with
1409               #   - consequence type
1410               #   - gene name
1411               #   - ensembl transcript ID
1412               #   - coding strand (+ fwd, - rev)
1413               #   - amino acid position (in the coding strand orientation)
1414               #   - list of corresponding VCF variants
1415               # The annotation at the second position gives the position of the full
1416               # annotation
1417               BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
1418               BCSQ=@122106101
1419
1420               # A frame-restoring combination of two frameshift insertions C>CG and T>TGG
1421               BCSQ=@46115084
1422               BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG
1423
1424               # Stop gained variant
1425               BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T
1426
1427               # The consequence type of a variant downstream from a stop is prefixed with *
1428               BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C
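
The annotation strings above can be split into their documented parts with a few lines of Python. This is only a sketch: it assumes exactly the six pipe-separated fields listed in the first example, in that order (other bcftools versions may emit additional fields), and it expects the value without the leading "BCSQ=". The names parse_bcsq and Bcsq are hypothetical and not part of bcftools.

```python
from typing import NamedTuple


class Bcsq(NamedTuple):
    """One BCSQ value, split according to the field list above."""
    consequence: str
    gene: str
    transcript: str
    strand: str
    amino_acid_change: str
    variants: list            # VCF variants contributing to the consequence
    downstream_of_stop: bool  # True when the consequence was prefixed with *


def parse_bcsq(value: str):
    """Return a Bcsq tuple, or an int position for '@POS' back-references."""
    if value.startswith("@"):
        # The full annotation is stored at another position
        return int(value[1:])
    fields = value.split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {value!r}")
    csq, gene, transcript, strand, aa_change, dna_change = fields
    return Bcsq(
        consequence=csq.lstrip("*"),
        gene=gene,
        transcript=transcript,
        strand=strand,
        amino_acid_change=aa_change,
        variants=dna_change.split("+"),
        downstream_of_stop=csq.startswith("*"),
    )
```

For the dinucleotide example, parse_bcsq returns a record whose variants list holds both contributing VCF variants, while the "@122106101" value is returned as the integer position of the full annotation.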
1429
1430       Supported consequence types
1431
1432           3_prime_utr
1433           5_prime_utr
1434           coding_sequence
1435           feature_elongation
1436           frameshift
1437           inframe_altering
1438           inframe_deletion
1439           inframe_insertion
1440           intergenic
1441           intron
1442           missense
1443           non_coding
1444           splice_acceptor
1445           splice_donor
1446           splice_region
1447           start_lost
1448           start_retained
1449           stop_gained
1450           stop_lost
1451           stop_retained
1452           synonymous
1453
1454       See also
1455       https://ensembl.org/info/genome/variation/prediction/predicted_data.html
1456
1457   bcftools filter [OPTIONS] FILE
1458       Apply fixed-threshold filters.
1459
1460       -e, --exclude EXPRESSION
1461           exclude sites for which EXPRESSION is true. For valid expressions
1462           see EXPRESSIONS.
1463
1464       -g, --SnpGap INT[:'indel',mnp,bnd,other,overlap]
1465           filter SNPs within INT base pairs of an indel or other
1466           variant type. The following example demonstrates the logic of
1467           --SnpGap 3 applied on a deletion and an insertion:
1468
1469           The SNPs at positions 1 and 7 are filtered, positions 0 and 8 are not:
1470                    0123456789
1471               ref  .G.GT..G..
1472               del  .A.G-..A..
1473           Here the positions 1 and 6 are filtered, 0 and 7 are not:
1474                    0123-456789
1475               ref  .G.G-..G..
1476               ins  .A.GT..A..
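
Both diagrams are consistent with counting the distance in diagram columns, where the deleted or inserted base occupies a column of its own. The Python sketch below encodes that reading. It is an interpretation of the two diagrams, not the bcftools implementation, and filtered_snps is a hypothetical helper name.

```python
def filtered_snps(labels: str, ref: str, alt: str, gap: int) -> list:
    """Return the position labels of SNPs within `gap` columns of an indel.

    The three strings are the rows of one diagram above: the position
    labels, the reference row and the variant row.
    """
    cols = list(zip(labels, ref, alt))
    # Columns where either row has '-' mark a deleted or inserted base
    indel_cols = [i for i, (_, r, a) in enumerate(cols) if "-" in (r, a)]
    hits = []
    for i, (label, r, a) in enumerate(cols):
        is_snp = r.isalpha() and a.isalpha() and r != a
        if is_snp and any(abs(i - c) <= gap for c in indel_cols):
            hits.append(label)
    return hits
```

Called with the rows of the deletion diagram and gap=3 it reports the positions labelled 1 and 7; with the rows of the insertion diagram it reports 1 and 6.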
1477
1478       -G, --IndelGap INT
1479           filter clusters of indels separated by INT or fewer base pairs
1480           allowing only one to pass. The following example demonstrates the
1481           logic of --IndelGap 2 applied on a deletion and an insertion:
1482
1483           The second indel is filtered:
1484                    012345678901
1485               ref  .GT.GT..GT..
1486               del  .G-.G-..G-..
1487           And similarly here, the second is filtered:
1488                    01 23 456 78
1489               ref  .A-.A-..A-..
1490               ins  .AT.AT..AT..
1491
1492       -i, --include EXPRESSION
1493           include only sites for which EXPRESSION is true. For valid
1494           expressions see EXPRESSIONS.
1495
1496       --mask [^]REGION
1497           Soft filter regions, prepend "^" to negate. Requires -s,
1498           --soft-filter.
1499
1500       -M, --mask-file [^]FILE
1501           Soft filter regions listed in a file, "^" to negate. Requires -s,
1502           --soft-filter.
1503
1504       --mask-overlap 0|1|2
1505           Same as --regions-overlap but for --mask/--mask-file. See Common
1506           Options. [1]
1507
1508       -m, --mode [+x]
1509           define behaviour at sites with existing FILTER annotations. The
1510           default mode replaces existing filters of failed sites with a new
1511           FILTER string while leaving sites which pass untouched when
1512           non-empty and setting to "PASS" when the FILTER string is absent.
1513           The "+" mode appends new FILTER strings of failed sites instead of
1514           replacing them. The "x" mode resets filters of sites which pass to
1515           "PASS". Modes "+" and "x" can both be set.
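
The three behaviours can be summarised as a small decision function. The Python sketch below follows the wording of this option only. How "+" treats an existing PASS value or a duplicate name is an assumption made for the example, and update_filter is a hypothetical name.

```python
def update_filter(existing: list, failed: bool, name: str, mode: str = "") -> list:
    """Return the new FILTER values of one site.

    existing -- current FILTER values, [] when the column is absent (".")
    failed   -- True when the site fails the filtering expression
    name     -- the new FILTER string
    mode     -- "", "+", "x" or "+x"
    """
    if failed:
        if "+" in mode:
            # Append; dropping PASS and duplicates is an assumption
            kept = [f for f in existing if f not in ("PASS", name)]
            return kept + [name]
        # Default: replace whatever was there
        return [name]
    if "x" in mode or not existing:
        # Passing site: reset to PASS ("x"), or fill in an absent FILTER
        return ["PASS"]
    # Passing site with a non-empty FILTER is left untouched
    return list(existing)
```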
1516
1517       --no-version
1518           see Common Options
1519
1520       -o, --output FILE
1521           see Common Options
1522
1523       -O, --output-type b|u|z|v[0-9]
1524           see Common Options
1525
1526       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1527           see Common Options
1528
1529       -R, --regions-file file
1530           see Common Options
1531
1532       --regions-overlap 0|1|2
1533           see Common Options
1534
1535       -s, --soft-filter STRING|+
1536           annotate FILTER column with STRING or, with +, a unique filter name
1537           generated by the program ("Filter%d").
1538
1539       -S, --set-GTs .|0
1540           set genotypes of failed samples to missing value (.) or reference
1541           allele (0)
1542
1543       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1544           see Common Options
1545
1546       -T, --targets-file file
1547           see Common Options
1548
1549       --targets-overlap 0|1|2
1550           see Common Options
1551
1552       --threads INT
1553           see Common Options
1554
1555   bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz
1556       Checks sample identity. The program can operate in two modes. If the -g
1557       option is given, the identity of samples from query.vcf.gz is checked
1558       against the samples in the -g file. Without the -g option, multi-sample
1559       cross-check of samples in query.vcf.gz is performed.
1560
1561       --distinctive-sites NUM[,MEM[,DIR]]
1562           Find sites that can distinguish between at least NUM sample pairs.
1563           If the number is smaller than or equal to 1, it is interpreted as the
1564           fraction of pairs. The optional MEM string sets the maximum memory
1565           used for in-memory sorting and DIR is the temporary directory for
1566           external sorting. This option requires also --pairs to be given.
1567
1568       --dry-run
1569           Stop after first record to estimate required time.
1570
1571       -e, --error-probability INT
1572           Interpret genotypes and genotype likelihoods probabilistically. The
1573           value of INT represents genotype quality when GT tag is used (e.g.
1574           Q=30 represents one error in 1,000 genotypes and Q=40 one error in
1575           10,000 genotypes) and is ignored when PL tag is used (in that case
1576           an arbitrary non-zero integer can be provided). See also the -u,
1577           --use option below. If set to 0, the discordance equals the
1578           number of mismatching genotypes when GT vs GT is compared. Note
1579           that the values with and without -e are not comparable, only values
1580           generated with -e 0 correspond to mismatching genotypes. If
1581           performance is an issue, set to 0 for faster run but less accurate
1582           results.
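
The quality values quoted above follow the usual Phred scale, where the error probability is 10^(-Q/10). A short Python sketch of the conversion, with hypothetical function names:

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality into an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred-scaled quality."""
    return -10 * math.log10(p)
```

With these definitions Q=30 corresponds to a probability of about 0.001 (one error in 1,000) and Q=40 to about 0.0001 (one error in 10,000).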
1583
1584       -g, --genotypes FILE
1585           VCF/BCF file with reference genotypes to compare against
1586
1587       -H, --homs-only
1588           Homozygous genotypes only, useful with low coverage data (requires
1589           -g, --genotypes)
1590
1591       --n-matches INT
1592           Print only top INT matches for each sample, 0 for unlimited. Use
1593           negative value to sort by HWE probability rather than the number of
1594           discordant sites. Note that average score is used to determine the
1595           top matches, not absolute values.
1596
1597       --no-HWE-prob
1598           Disable calculation of HWE probability to reduce memory
1599           requirements with comparisons between very large number of sample
1600           pairs.
1601
1602       -p, --pairs LIST
1603           A comma-separated list of sample pairs to compare. When the -g
1604           option is given, the first sample must be from the query file, the
1605           second from the -g file, third from the query file etc
1606           (qry,gt[,qry,gt..]). Without the -g option, the pairs are created
1607           the same way but both samples are from the query file
1608           (qry,qry[,qry,qry..])
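
The flat list is read two names at a time. A minimal Python sketch of that pairing, using made-up sample names and a hypothetical parse_pairs helper:

```python
def parse_pairs(spec: str) -> list:
    """Split 'a,b,c,d' into [('a', 'b'), ('c', 'd')].

    With -g the first name of each pair is a query sample and the second
    a -g sample; without -g both names come from the query file.
    """
    names = spec.split(",")
    if len(names) % 2:
        raise ValueError("expected an even number of sample names")
    return list(zip(names[0::2], names[1::2]))
```

For example, the list "q1,g1,q1,g2" describes the two comparisons q1 against g1 and q1 against g2.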
1609
1610       -P, --pairs-file FILE
1611           A file with tab-delimited sample pairs to compare. The first sample
1612           in the pair must come from the query file, the second from the
1613           genotypes file when -g is given
1614
1615       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1616           Restrict to comma-separated list of regions, see Common Options
1617
1618       -R, --regions-file FILE
1619           Restrict to regions listed in a file, see Common Options
1620
1621       --regions-overlap 0|1|2
1622           see Common Options
1623
1624       -s, --samples [qry|gt]:LIST
1625           List of query samples or -g samples. If neither -s nor -S are
1626           given, all possible sample pair combinations are compared
1627
1628       -S, --samples-file [qry|gt]:FILE
1629           File with the query or -g samples to compare. If neither -s nor -S
1630           are given, all possible sample pair combinations are compared
1631
1632       -t, --targets file
1633           see Common Options
1634
1635       -T, --targets-file file
1636           see Common Options
1637
1638       --targets-overlap 0|1|2
1639           see Common Options
1640
1641       -u, --use TAG1[,TAG2]
1642           specifies which tag to use in the query file (TAG1) and the -g
1643           (TAG2) file. By default, the PL tag is used in the query file and
1644           GT in the -g file when available.
1645
1646       Examples:
1647
1648              # Check discordance of all samples from B against all samples in A
1649              bcftools gtcheck -g A.bcf B.bcf
1650
1651              # Limit comparisons to the given list of samples
1652              bcftools gtcheck -s gt:a1,a2,a3 -s qry:b1,b2 -g A.bcf B.bcf
1653
1654              # Compare only two pairs a1,b1 and a1,b2
1655              bcftools gtcheck -p a1,b1,a1,b2 -g A.bcf B.bcf
1656
1657   bcftools head [OPTIONS] [FILE]
1658       By default, prints all headers from the specified input file to
1659       standard output in VCF format. The input file may be in VCF or BCF
1660       format; if no FILE is specified, standard input will be read. With
1661       appropriate options, only some of the headers and/or additionally some
1662       of the variant records will be printed.
1663
1664       The bcftools head command outputs VCF headers almost exactly as they
1665       appear in the input file: it may add a ##FILTER=<ID=PASS> header if not
1666       already present, but it never adds version or command line information
1667       itself.
1668
1669   Options:
1670       -h, --header INT
1671           Display only the first INT header lines. By default, all header
1672           lines are displayed.
1673
1674       -n, --records INT
1675           Also display the first INT variant records. By default, no variant
1676           records are displayed.
1677
1678   bcftools index [OPTIONS] in.bcf|in.vcf.gz
1679       Creates index for bgzip compressed VCF/BCF files for random access. CSI
1680       (coordinate-sorted index) is created by default. The CSI format
1681       supports indexing of chromosomes up to length 2^31. TBI (tabix index)
1682       index files, which support chromosome lengths up to 2^29, can be
1683       created by using the -t/--tbi option or using the tabix program
1684       packaged with htslib. When loading an index file, bcftools will try the
1685       CSI first and then the TBI.
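
To make the two limits concrete, the Python sketch below evaluates them and reports which index formats can cover a contig of a given length. It relies only on the figures stated above; usable_index_formats is a hypothetical helper, not a bcftools function.

```python
CSI_MAX_LENGTH = 2 ** 31  # 2,147,483,648 bp
TBI_MAX_LENGTH = 2 ** 29  # 536,870,912 bp


def usable_index_formats(contig_length: int) -> list:
    """Return the index formats whose stated limit covers the contig."""
    formats = []
    if contig_length <= CSI_MAX_LENGTH:
        formats.append("csi")
    if contig_length <= TBI_MAX_LENGTH:
        formats.append("tbi")
    return formats
```

A 600 Mbp chromosome, for instance, exceeds the TBI limit, so only the default CSI index applies.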
1686
1687   Indexing options:
1688       -c, --csi
1689           generate CSI-format index for VCF/BCF files [default]
1690
1691       -f, --force
1692           overwrite index if it already exists
1693
1694       -m, --min-shift INT
1695           set minimal interval size for CSI indices to 2^INT; default: 14
1696
1697       -o, --output FILE
1698           output file name. If not set, then the index will be created using
1699           the input file name plus a .csi or .tbi extension
1700
1701       -t, --tbi
1702           generate TBI-format index for VCF files
1703
1704       --threads INT
1705           see Common Options
1706
1707   Stats options:
1708       -n, --nrecords
1709           print the number of records based on the CSI or TBI index files
1710
1711       -s, --stats
1712           Print per contig stats based on the CSI or TBI index files. Output
1713           format is three tab-delimited columns listing the contig name,
1714           contig length (. if unknown) and number of records for the contig.
1715           Contigs with zero records are not printed.
1716
1717   bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz [...]
1718       Creates intersections, unions and complements of VCF files. Depending
1719       on the options, the program can output records from one (or more) files
1720       which have (or do not have) corresponding records with the same
1721       position in the other files.
1722
1723       -c, --collapse snps|indels|both|all|some|none
1724           see Common Options
1725
1726       -C, --complement
1727           output positions present only in the first file but missing in the
1728           others
1729
1730       -e, --exclude -|EXPRESSION
1731           exclude sites for which EXPRESSION is true. If -e (or -i) appears
1732           only once, the same filtering expression will be applied to all
1733           input files. Otherwise, -e or -i must be given for each input file.
1734           To indicate that no filtering should be performed on a file, use
1735           "-" in place of EXPRESSION, as shown in the example below. For
1736           valid expressions see EXPRESSIONS.
1737
1738       -f, --apply-filters LIST
1739           see Common Options
1740
1741       -i, --include EXPRESSION
1742           include only sites for which EXPRESSION is true. See discussion of
1743           -e, --exclude above.
1744
1745       -n, --nfiles [+-=]INT|~BITMAP
1746           output positions present in this many (=), this many or more (+),
1747           this many or fewer (-), or the exact same (~) files
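
The four forms can be read as a predicate over the set of files in which a position is present. The Python sketch below illustrates this reading of the option text; matches_nfiles is a hypothetical name, and the bitmap is assumed to have one character per input file in command-line order.

```python
def matches_nfiles(present: list, spec: str) -> bool:
    """Decide whether a site satisfies an -n specification.

    present -- one bool per input file, in command-line order
    spec    -- '=INT', '+INT', '-INT' or '~BITMAP'
    """
    if spec.startswith("~"):
        bitmap = spec[1:]
        if len(bitmap) != len(present):
            raise ValueError("bitmap length must equal the number of files")
        # Present exactly in the files marked with 1
        return [c == "1" for c in bitmap] == [bool(p) for p in present]
    op, n = spec[0], int(spec[1:])
    count = sum(bool(p) for p in present)
    if op == "=":
        return count == n
    if op == "+":
        return count >= n
    if op == "-":
        return count <= n
    raise ValueError(f"unrecognised specification: {spec!r}")
```

For four input files, "~1100" accepts only sites present in the first two files and absent from the last two.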
1748
1749       -o, --output FILE
1750           see Common Options. When several files are being output, their
1751           names are controlled via -p instead.
1752
1753       -O, --output-type b|u|z|v[0-9]
1754           see Common Options
1755
1756       -p, --prefix DIR
1757           if given, subset each of the input files accordingly. See also -w.
1758
1759       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1760           see Common Options
1761
1762       -R, --regions-file file
1763           see Common Options
1764
1765       --regions-overlap 0|1|2
1766           see Common Options
1767
1768       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1769           see Common Options
1770
1771       -T, --targets-file file
1772           see Common Options
1773
1774       --targets-overlap 0|1|2
1775           see Common Options
1776
1777       -w, --write LIST
1778           list of input files to output given as 1-based indices. With -p and
1779           no -w, all files are written.
1780
1781   Examples:
1782       Create intersection and complements of two sets saving the output in
1783       dir/*
1784
1785               bcftools isec -p dir A.vcf.gz B.vcf.gz
1786
1787       Filter sites in A (require INFO/MAF>=0.01) and B (require INFO/dbSNP)
1788       but not in C, and create an intersection, including only sites which
1789       appear in at least two of the files after filters have been applied
1790
1791               bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e- A.vcf.gz B.vcf.gz C.vcf.gz -n +2 -p dir
1792
1793       Extract and write records from A shared by both A and B using exact
1794       allele match
1795
1796               bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
1797
1798       Extract records private to A or B comparing by position only
1799
1800               bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
1801
1802       Print a list of records which are present in A and B but not in C and D
1803
1804               bcftools isec -n~1100 -c all A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz
1805
1806   bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz [...]
1807       Merge multiple VCF/BCF files from non-overlapping sample sets to create
1808       one multi-sample file. For example, when merging file A.vcf.gz
1809       containing samples S1, S2 and S3 and file B.vcf.gz containing samples
1810       S3 and S4, the output file will contain five samples named S1, S2, S3,
1811       2:S3 and S4.
1812
1813       Note that it is the responsibility of the user to ensure that the sample
1814       names are unique across all files. If they are not, the program will
1815       exit with an error unless the option --force-samples is given. The
1816       sample names can be also given explicitly using the --print-header and
1817       --use-header options.
1818
1819       Note that only records from different files can be merged, never from
1820       the same file. For "vertical" merge take a look at bcftools concat or
1821       bcftools norm -m instead.
1822
1823       --force-samples
1824           if the merged files contain duplicate sample names, proceed
1825           anyway. Duplicate sample names will be resolved by prepending the
1826           index of the file as it appeared on the command line to the
1827           conflicting sample name (see 2:S3 in the above example).
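
The renaming rule can be sketched in Python as follows. It reproduces the S1, S2, S3, 2:S3, S4 example from the description. The function name is hypothetical and the code is an illustration of the documented rule, not the bcftools source.

```python
def merged_sample_names(files: list) -> list:
    """Resolve duplicate sample names the way --force-samples is described.

    files -- one list of sample names per input file, in command-line order
    """
    seen = set()
    merged = []
    for index, samples in enumerate(files, start=1):
        for name in samples:
            if name in seen:
                # Prepend the 1-based index of the file on the command line
                name = f"{index}:{name}"
            seen.add(name)
            merged.append(name)
    return merged
```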
1828
1829       --print-header
1830           print only merged header and exit
1831
1832       --use-header FILE
1833           use the VCF header in the provided text FILE
1834
1835       -0, --missing-to-ref
1836           assume genotypes at missing sites are 0/0
1837
1838       -f, --apply-filters LIST
1839           see Common Options
1840
1841       -F, --filter-logic x|+
1842           Set the output record to PASS if any of the inputs is PASS (x), or
1843           apply all filters (+), which is the default.
1844
1845       -g, --gvcf -|FILE
1846           merge gVCF blocks, INFO/END tag is expected. If the reference fasta
1847           file FILE is not given and the dash (-) is given, unknown reference
1848           bases generated at gVCF block splits will be substituted with N’s.
1849           The --gvcf option uses the following default INFO rules: -i
1850           QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max.
1851
1852       -i, --info-rules -|TAG:METHOD[,...]
1853           Rules for merging INFO fields (scalars or vectors) or - to disable
1854           the default rules. METHOD is one of sum, avg, min, max, join.
1855           Default is DP:sum,DP4:sum if these fields exist in the input files.
1856           Fields with no specified rule will take the value from the first
1857           input file. The merged QUAL value is currently set to the maximum.
1858           This behaviour is not user controllable at the moment.
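
As an illustration of the TAG:METHOD syntax, the Python sketch below parses a rule string and applies a method to the scalar values of one tag collected from the input files. It is a simplification: vectors are not handled, and the behaviour shown for join (values kept in input order) is an assumption. All names are hypothetical.

```python
METHODS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
    "join": list,  # assumption: keep every value, in input-file order
}


def parse_info_rules(spec: str) -> dict:
    """Turn 'DP:sum,DP4:sum' into {'DP': 'sum', 'DP4': 'sum'}."""
    rules = {}
    for item in spec.split(","):
        tag, method = item.split(":")
        if method not in METHODS:
            raise ValueError(f"unknown method {method!r} for tag {tag}")
        rules[tag] = method
    return rules


def merge_info(values: list, method: str):
    """Combine the scalar values of one tag from all input files."""
    return METHODS[method](values)
```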
1859
1860       -l, --file-list FILE
1861           Read file names from FILE, one file name per line.
1862
1863       -L, --local-alleles INT
1864           Sites with many alternate alleles can require extremely large
1865           storage space which can exceed the 2GB size limit representable by
1866           BCF. This is caused by Number=G tags (such as FORMAT/PL) which
1867           store a value for each combination of reference and alternate
1868           alleles. The -L, --local-alleles option allows replacing such tags
1869           with a localized tag (FORMAT/LPL) which only includes a subset of
1870           alternate alleles relevant for that sample. A new FORMAT/LAA tag is
1871           added which lists 1-based indices of the alternate alleles relevant
1872           (local) for the current sample. The number INT gives the maximum
1873           number of alternate alleles that can be included in the PL tag. The
1874           default value is 0 which disables the feature and outputs values
1875           for all alternate alleles.
1876
1877       -m, --merge snps|indels|both|all|none|id
1878           The option controls what types of multiallelic records can be
1879           created:
1880
1881           -m none   .. no new multiallelics, output multiple records instead
1882           -m snps   .. allow multiallelic SNP records
1883           -m indels .. allow multiallelic indel records
1884           -m both   .. both SNP and indel records can be multiallelic
1885           -m all    .. SNP records can be merged with indel records
1886           -m id     .. merge by ID
1887
1888       --no-index
1889           the option allows merging files without indexing them first. In
1890           order for this option to work, the user must ensure that the input
1891           files have chromosomes in the same order and consistent with the
1892           order of sequences in the VCF header.
1893
1894       --no-version
1895           see Common Options
1896
1897       -o, --output FILE
1898           see Common Options
1899
1900       -O, --output-type b|u|z|v[0-9]
1901           see Common Options
1902
1903       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1904           see Common Options
1905
1906       -R, --regions-file file
1907           see Common Options
1908
1909       --regions-overlap 0|1|2
1910           see Common Options
1911
1912       --threads INT
1913           see Common Options
1914
1915   bcftools mpileup [OPTIONS] -f ref.fa in.bam [in2.bam [...]]
1916       Generate VCF or BCF containing genotype likelihoods for one or multiple
1917       alignment (BAM or CRAM) files. This is based on the original samtools
1918       mpileup command (with the -v or -g options) producing genotype
1919       likelihoods in VCF or BCF format, but not the textual pileup output.
1920       The mpileup command was transferred to bcftools in order to avoid
1921       errors resulting from use of incompatible versions of samtools and
1922       bcftools when used in the mpileup+bcftools call pipeline.
1923
1924       Individuals are identified from the SM tags in the @RG header lines.
1925       Multiple individuals can be pooled in one alignment file, also one
1926       individual can be separated into multiple files. If sample identifiers
1927       are absent, each input file is regarded as one sample.
1928
1929       Note that there are two orthogonal ways to specify locations in the
1930       input file; via -r region and -t positions. The former uses (and
1931       requires) an index to do random access while the latter streams through
1932       the file contents filtering out the specified regions, requiring no
1933       index. The two may be used in conjunction. For example a BED file
1934       containing locations of genes in chromosome 20 could be specified using
1935       -r 20 -t chr20.bed, meaning that the index is used to find chromosome
1936       20 and then it is filtered for the regions listed in the BED file. Also
1937       note that the -r option can be much slower than -t with many regions
1938       and can require more memory when multiple regions and many alignment
1939       files are processed.
1940
1941   Input options
1942       -6, --illumina1.3+
1943           Assume the quality is in the Illumina 1.3+ encoding.
1944
1945       -A, --count-orphans
1946           Do not skip anomalous read pairs in variant calling.
1947
1948       -b, --bam-list FILE
1949           List of input alignment files, one file per line [null]
1950
1951       -B, --no-BAQ
1952           Disable probabilistic realignment for the computation of base
1953           alignment quality (BAQ). BAQ is the Phred-scaled probability of a
1954           read base being misaligned. Applying this option greatly helps to
1955           reduce false SNPs caused by misalignments.
1956
1957       -C, --adjust-MQ INT
1958           Coefficient for downgrading mapping quality for reads containing
1959           excessive mismatches. Given a read with a phred-scaled probability
1960           q of being generated from the mapped position, the new mapping
1961           quality is about sqrt((INT-q)/INT)*INT. A zero value (the default)
1962           disables this functionality.
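
The approximate formula can be evaluated directly. The Python sketch below implements only the expression quoted above. Rejecting values of q greater than INT, where the expression is undefined, is a choice made for this example and says nothing about how bcftools handles that case. The function name is hypothetical.

```python
import math


def adjusted_mapping_quality(q: float, coefficient: int) -> float:
    """Evaluate sqrt((INT - q) / INT) * INT for a positive coefficient INT."""
    if coefficient <= 0:
        raise ValueError("a zero coefficient disables the adjustment")
    if q > coefficient:
        raise ValueError("the expression is undefined for q > INT")
    return math.sqrt((coefficient - q) / coefficient) * coefficient
```

With INT=50, a read with q=18 gets a new mapping quality of about 40, and a read with q=0 gets 50.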
1963
1964       -D, --full-BAQ
1965           Run the BAQ algorithm on all reads, not just those in problematic
1966           regions. This matches the behaviour for Bcftools 1.12 and earlier.
1967
1968           By default mpileup uses heuristics to decide when to apply the BAQ
1969           algorithm. Most sequences will not be BAQ adjusted, giving a CPU
1970           time closer to --no-BAQ, but it will still be applied in regions
1971           with suspected problematic alignments. This has been tested to work
1972           well on single sample data with even allele frequency, but the
1973           reliability is unknown for multi-sample calling and for low allele
1974           frequency variants so full BAQ is still recommended in those
1975           scenarios.
1976
1977       -d, --max-depth INT
1978           At a position, read maximally INT reads per input file. Note that
1979           the original samtools mpileup command had a minimum value of 8000/n
1980           where n was the number of input files given to mpileup. This means
1981           that in samtools mpileup the default was highly likely to be
1982           increased and the -d parameter would have an effect only once above
1983           the cross-sample minimum of 8000. This behavior was problematic
1984           when working with a combination of single- and multi-sample bams,
1985           therefore in bcftools mpileup the user is given the full control
1986           (and responsibility), and an informative message is printed instead
1987           [250]
1988
1989       -E, --redo-BAQ
1990           Recalculate BAQ on the fly, ignore existing BQ tags
1991
1992       -f, --fasta-ref FILE
1993           The faidx-indexed reference file in the FASTA format. The file can
1994           be optionally compressed by bgzip. Reference is required by default
1995           unless the --no-reference option is set [null]
1996
1997       --no-reference
1998           Do not require the --fasta-ref option.
1999
2000       -G, --read-groups FILE
2001           list of read groups to include or exclude if prefixed with "^". One
2002           read group per line. This file can also be used to assign new
2003           sample names to read groups by giving the new sample name as a
2004           second white-space-separated field, like this: "read_group_id
2005           new_sample_name". If the read group name is not unique, also the
2006           bam file name can be included: "read_group_id file_name
2007           sample_name". If all reads from the alignment file should be
2008           treated as a single sample, the asterisk symbol can be used: "*
2009           file_name sample_name". Alignments without a read group ID can be
2010           matched with "?". NOTE: The meaning of bcftools mpileup -G is the
2011           opposite of samtools mpileup -G.
2012
2013               RG_ID_1
2014               RG_ID_2  SAMPLE_A
2015               RG_ID_3  SAMPLE_A
2016               RG_ID_4  SAMPLE_B
2017               RG_ID_5  FILE_1.bam  SAMPLE_A
2018               RG_ID_6  FILE_2.bam  SAMPLE_A
2019               *        FILE_3.bam  SAMPLE_C
2020               ?        FILE_3.bam  SAMPLE_D
2021
2022       -q, --min-MQ INT
2023           Minimum mapping quality for an alignment to be used [0]
2024
2025       -Q, --min-BQ INT
2026           Minimum base quality for a base to be considered [13]
2027
2028       --max-BQ INT
2029           Caps the base quality to a maximum value [60]. This can be
2030           particularly useful on technologies that produce overly optimistic
2031           high qualities, leading to too many false positives or incorrect
2032           genotype assignments.
2033
2034       -r, --regions CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,...]
2035           Only generate mpileup output in given regions. Requires the
2036           alignment files to be indexed. If used in conjunction with -l then
2037           considers the intersection; see Common Options
2038
2039       -R, --regions-file FILE
2040           As for -r, --regions, but regions read from FILE; see Common
2041           Options
2042
2043       --regions-overlap 0|1|2
2044           see Common Options
2045
2046       --ignore-RG
2047           Ignore RG tags. Treat all reads in one alignment file as one
2048           sample.
2049
2050       --ls, --skip-all-set
2051           Skip reads with all of the FLAG bits set [null]
2052
2053       --ns, --skip-any-set
2054           Skip reads with any of the FLAG bits set. This option replaces and
2055           is synonymous to the deprecated --ff, --excl-flags
2056           [UNMAP,SECONDARY,QCFAIL,DUP]
2057
2058       --lu, --skip-all-unset
2059           Skip reads with all of the FLAG bits unset. This option replaces
2060           and is synonymous to the deprecated --rf, --incl-flags [null]
2061
2062       --nu, --skip-any-unset
2063           Skip reads with any of the FLAG bits unset [null]
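
The four FLAG tests are plain bit-mask comparisons. The Python sketch below spells them out using the FLAG bit values from the SAM specification. Treating an unset ([null]) mask as "test disabled" is an assumption, and skip_read is a hypothetical name.

```python
# FLAG bits from the SAM specification
UNMAP, SECONDARY, QCFAIL, DUP = 0x4, 0x100, 0x200, 0x400


def skip_read(flag: int, ls: int = 0, ns: int = UNMAP | SECONDARY | QCFAIL | DUP,
              lu: int = 0, nu: int = 0) -> bool:
    """Return True when a read with this FLAG would be skipped.

    ls -- skip if all of these bits are set
    ns -- skip if any of these bits is set (default as listed above)
    lu -- skip if all of these bits are unset
    nu -- skip if any of these bits is unset
    A mask of 0 disables the corresponding test (assumption).
    """
    if ls and (flag & ls) == ls:
        return True
    if ns and (flag & ns) != 0:
        return True
    if lu and (flag & lu) == 0:
        return True
    if nu and (flag & nu) != nu:
        return True
    return False
```

With the defaults, a duplicate read (FLAG 0x400) is skipped while a properly paired first-in-pair read (FLAG 99) is kept.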
2064
2065       -s, --samples LIST
2066           list of sample names. See Common Options
2067
2068       -S, --samples-file FILE
2069           file of sample names to include or exclude if prefixed with "^".
2070           One sample per line. This file can also be used to rename samples
2071           by giving the new sample name as a second white-space-separated
2072           column, like this: "old_name new_name". If a sample name contains
2073           spaces, the spaces can be escaped using the backslash character,
2074           for example "Not\ a\ good\ sample\ name".
2075
2076       -t, --targets LIST
2077           see Common Options
2078
2079       -T, --targets-file FILE
2080           see Common Options
2081
2082       --targets-overlap 0|1|2
2083           see Common Options
2084
2085       -x, --ignore-overlaps
2086           Disable read-pair overlap detection.
2087
2088       --seed INT
2089           Set the random number seed used when sub-sampling deep regions [0].
2090
2091   Output options
2092       -a, --annotate LIST
2093           Comma-separated list of FORMAT and INFO tags to output.
2094           (case-insensitive, the "FORMAT/" prefix is optional, and use "?" to
2095           list available annotations on the command line) [null]:
2096
2097           FORMAT/AD   .. Allelic depth (Number=R,Type=Integer)
2098           FORMAT/ADF  .. Allelic depths on the forward strand (Number=R,Type=Integer)
2099           FORMAT/ADR  .. Allelic depths on the reverse strand (Number=R,Type=Integer)
2100           FORMAT/DP   .. Number of high-quality bases (Number=1,Type=Integer)
2101           FORMAT/SP   .. Phred-scaled strand bias P-value (Number=1,Type=Integer)
2102           FORMAT/SCR  .. Number of soft-clipped reads (Number=1,Type=Integer)
2103
2104           INFO/AD     .. Total allelic depth (Number=R,Type=Integer)
2105           INFO/ADF    .. Total allelic depths on the forward strand (Number=R,Type=Integer)
2106           INFO/ADR    .. Total allelic depths on the reverse strand (Number=R,Type=Integer)
2107           INFO/SCR    .. Number of soft-clipped reads (Number=1,Type=Integer)
2108
2109           FORMAT/DV   .. Deprecated in favor of FORMAT/AD; Number of high-quality non-reference bases, (Number=1,Type=Integer)
2110           FORMAT/DP4  .. Deprecated in favor of FORMAT/ADF and FORMAT/ADR; Number of high-quality ref-forward, ref-reverse,
2111                          alt-forward and alt-reverse bases (Number=4,Type=Integer)
2112           FORMAT/DPR  .. Deprecated in favor of FORMAT/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
2113           INFO/DPR    .. Deprecated in favor of INFO/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
2114
2115       -g, --gvcf INT[,...]
2116           output gVCF blocks of homozygous REF calls, with depth (DP) ranges
2117           specified by the list of integers. For example, passing 5,15 will
2118           group sites into two types of gVCF blocks, the first with minimum
2119           per-sample DP from the interval [5,15) and the latter with minimum
2120           depth 15 or more. In this example, sites with minimum per-sample
2121           depth less than 5 will be printed as separate records, outside of
2122           gVCF blocks.
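
The grouping rule in the 5,15 example can be sketched as a small shell function. This is an illustration of the description above rather than the actual implementation: the function name is hypothetical, and the thresholds are assumed to be given in ascending order.

```shell
# Print which gVCF block class a site falls into, given its minimum
# per-sample depth ($1) and the comma-separated --gvcf list ($2).
gvcf_block() {
    dp=$1
    lower=""
    upper=""
    old_ifs=$IFS
    IFS=,
    for t in $2; do
        if [ "$dp" -ge "$t" ]; then
            lower=$t          # highest threshold reached so far
            upper=""
        elif [ -z "$upper" ]; then
            upper=$t          # first threshold above the depth
        fi
    done
    IFS=$old_ifs
    if [ -z "$lower" ]; then
        echo "separate record"
    elif [ -z "$upper" ]; then
        echo "block [$lower,inf)"
    else
        echo "block [$lower,$upper)"
    fi
}

gvcf_block 3 5,15
gvcf_block 7 5,15
gvcf_block 20 5,15
```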
2123
2124       --no-version
2125           see Common Options
2126
2127       -o, --output FILE
2128           Write output to FILE, rather than the default of standard output.
2129           (The same short option is used for both --open-prob and --output.
2130           If -o's argument contains any non-digit characters other than a
2131           leading + or - sign, it is interpreted as --output. Usually the
2132           filename extension will take care of this, but to write to an
2133           entirely numeric filename use -o ./123 or --output 123.)
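
The rule for telling the two meanings of -o apart can be sketched in shell as follows. The function name is hypothetical, and the treatment of a bare sign with no digits (classified here as --output) is an assumption that the text above does not settle.

```shell
# Classify an argument given to -o according to the rule described above:
# an optional leading + or - followed only by digits means --open-prob,
# anything else means --output.
classify_o_arg() {
    rest=${1#[+-]}                       # drop one optional leading sign
    case $rest in
        ''|*[!0-9]*) echo "output" ;;    # empty, or contains a non-digit
        *)           echo "open-prob" ;;
    esac
}

classify_o_arg 25         # digits only
classify_o_arg out.vcf    # has non-digit characters
classify_o_arg ./123      # the documented way to write to a numeric filename
```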
2134
2135       -O, --output-type b|u|z|v[0-9]
2136           see Common Options
2137
2138       --threads INT
2139           see Common Options
2140
2141       -U, --mwu-u
2142           Use the previous Mann-Whitney U test score from version 1.12 and
2143           earlier. This is a probability score, but importantly it folds
2144           probabilities above or below the desired score into the same P. The
2145           new Mann-Whitney U test score is a "Z score", expressing the score
2146           as the number of standard deviations away from the mean (with zero
2147           matching the mean). It keeps both positive and negative
2148           values. This can be important for some tests where errors are
2149           asymmetric.
2150
2151               This option changes the INFO field names produced back to the ones
2152               used by the earlier Bcftools releases. For example BQBZ becomes
2153               BQB.
2154
2155   Options for SNP/INDEL genotype likelihood computation
2156       -X, --config STR
2157           Specify a platform specific configuration profile. The profile
2158           should be one of 1.12, illumina, ont or pacbio-ccs. Settings
2159           applied are as follows:
2160
2161               1.12           -Q13 -h100 -m1
2162               illumina       [ default values ]
2163               ont            -B -Q5 --max-BQ 30 -I
2164               pacbio-ccs     -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
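
The profile table can be read as a lookup from profile name to option string. The sketch below mirrors the table; the function name is hypothetical. Whether passing the listed options by hand behaves identically to -X is an assumption.

```shell
# Print the settings listed above for a given -X/--config profile.
profile_opts() {
    case $1 in
        1.12)       echo "-Q13 -h100 -m1" ;;
        illumina)   echo "" ;;   # default values, nothing to add
        ont)        echo "-B -Q5 --max-BQ 30 -I" ;;
        pacbio-ccs) echo "-D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999" ;;
        *)          echo "unknown profile: $1" >&2; return 1 ;;
    esac
}

profile_opts ont
profile_opts pacbio-ccs
```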
2165
2166       --ar, --ambig-reads drop|incAD|incAD0
2167           What to do with ambiguous indel reads that do not span an entire
2168           short tandem repeat region: discard ambiguous reads from calling
2169           and do not increment high-quality AD depth counters (drop), exclude
2170           from calling but increment AD counters proportionally (incAD),
2171           exclude from calling and increment the first value of the AD
2172           counter (incAD0) [drop]
2173
2174       -e, --ext-prob INT
2175           Phred-scaled gap extension sequencing error probability. Reducing
2176           INT leads to longer indels [20]
2177
2178       -F, --gap-frac FLOAT
2179           Minimum fraction of gapped reads [0.002]
2180
2181       -h, --tandem-qual INT
2182           Coefficient for modeling homopolymer errors. Given an l-long
2183           homopolymer run, the sequencing error of an indel of size s is
2184           modeled as INT*s/l [500]. Increasing this informs the caller that
2185           indels in long homopolymers are more likely genuine and less likely
2186           to be sequencing artifacts. Hence increasing tandem-qual gives
2187           higher recall and lower precision. Bcftools 1.12 and earlier had a
2188           default of 100, which was tuned around more error-prone
2189           instruments. Note that changing this may have a minor impact on SNP
2190           calling too. For maximum SNP calling accuracy, it may be preferable
2191           to lower this again, although this will adversely affect
2192           indels.
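
As a worked example of the INT*s/l expression, the sketch below evaluates it for the current default (500) and the older default (100). The function name is hypothetical, and shell integer arithmetic truncates, which the real implementation may not do.

```shell
# Evaluate INT*s/l for a coefficient ($1), indel size s ($2) and
# homopolymer run length l ($3), using integer arithmetic.
homopolymer_indel_term() {
    echo $(( $1 * $2 / $3 ))
}

homopolymer_indel_term 500 1 10   # current default coefficient
homopolymer_indel_term 100 1 10   # default in 1.12 and earlier
```

For a 1 bp indel in a 10 bp homopolymer, the term is five times larger with the current default than with the old one, in line with the statement that a higher coefficient makes such indels look more genuine.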
2193
2194       --indel-bias FLOAT
2195           Skews the indel scores up or down, trading recall (low
2196           false-negative) vs precision (low false-positive) [1.0]. In
2197           Bcftools 1.12 and earlier this parameter didn’t exist, but had an
2198           implied value of 1.0. If you are planning to do heavy filtering of
2199           variants, selecting the best quality ones only (favouring precision
2200           over recall), it is advisable to set this lower (such as 0.75).
2201           Higher depth samples, or cases where you favour recall over
2202           precision, may work better with a higher value such as 2.0.
2203
2204       --indel-size INT
2205           Indel window size to use when assessing the quality of candidate
2206           indels. Note that although the window size approximately
2207           corresponds to the maximum indel size considered, it is not an
2208           exact threshold [110]
2209
2210       -I, --skip-indels
2211           Do not perform INDEL calling
2212
2213       -L, --max-idepth INT
2214           Skip INDEL calling if the average per-sample depth is above INT
2215           [250]
2216
2217       -m, --min-ireads INT
2218           Minimum number of gapped reads for indel candidates [1]
2219
2220       -M, --max-read-len INT
2221           The maximum read length permitted by the BAQ algorithm [500].
2222           Variants are still called on longer reads, but they will not be
2223           passed through the BAQ method. This limit exists to prevent
2224           excessively long BAQ times and high memory usage. Note that if
2225           partial BAQ is enabled with -D, raising this parameter will likely
2226           not have a significant CPU cost.
2227
2228       -o, --open-prob INT
2229           Phred-scaled gap open sequencing error probability. Reducing INT
2230           leads to more indel calls. (The same short option is used for both
2231           --open-prob and --output. When -o’s argument contains only an
2232           optional + or - sign followed by the digits 0 to 9, it is
2233           interpreted as --open-prob.) [40]
2234
2235       -p, --per-sample-mF
2236           Apply -m and -F thresholds per sample to increase sensitivity of
2237           calling. By default both options are applied to reads pooled from
2238           all samples.
2239
2240       -P, --platforms STR
2241           Comma-delimited list of platforms (determined by @RG-PL) from
2242           which indel candidates are obtained. It is recommended to collect
2243           indel candidates from sequencing technologies that have low indel
2244           error rate such as ILLUMINA [all]
2245
2246   Examples:
2247       Call SNPs and short INDELs, then mark low quality sites and sites with
2248       the read depth exceeding a limit. (The read depth should be adjusted to
2249       about twice the average read depth as higher read depths usually
2250       indicate problematic regions which are often enriched for artefacts.)
2251       Consider adding -C50 to mpileup if mapping quality is
2252       overestimated for reads containing excessive mismatches. Applying
2253       this option usually helps for BWA-backtrack alignments, but may not
2254       help for other aligners.
2255
2256               bcftools mpileup -Ou -f ref.fa aln.bam | \
2257               bcftools call -Ou -mv | \
2258               bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
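
The depth limit in the filter expression follows the "about twice the average read depth" guideline. A small helper can derive the expression from the average depth. This is a sketch with a hypothetical function name, and the average depth is assumed to be known already as an integer.

```shell
# Build the exclusion expression used above from the average read depth ($1)
# and an optional minimum QUAL ($2, default 20).
depth_filter_expr() {
    echo "%QUAL<${2:-20} || DP>$(( 2 * $1 ))"
}

depth_filter_expr 50
# Possible use in the pipeline above:
#   bcftools filter -s LowQual -e "$(depth_filter_expr 50)" > var.flt.vcf
```

With an average depth of 50 this reproduces the expression from the example, '%QUAL<20 || DP>100'.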
2259
2260   bcftools norm [OPTIONS] file.vcf.gz
2261       Left-align and normalize indels, check if REF alleles match the
2262       reference, split multiallelic sites into multiple rows; recover
2263       multiallelics from multiple rows. Left-alignment and normalization will
2264       only be applied if the --fasta-ref option is supplied.
2265
2266       -a, --atomize
2267           Decompose complex variants, e.g. split MNVs into consecutive SNVs.
2268           See also --atom-overlaps and --old-rec-tag.
2269
2270       --atom-overlaps .|*
2271           Alleles missing because of an overlapping variant can be set either
2272           to missing (.) or to the star allele (*), as recommended by the VCF
2273           specification. IMPORTANT: Note that the asterisk is expanded by the
2274           shell and must be put in quotes or escaped by a backslash:
2275
2276               # Before atomization:
2277               100  CC  C,GG   1/2
2278
2279               # After:
2280               #   bcftools norm -a --atom-overlaps .
2281               100  C   G      ./1
2282               100  CC  C      1/.
2283               101  C   G      ./1
2284
2285               # After:
2286               #   bcftools norm -a --atom-overlaps '*'
2287               #   bcftools norm -a --atom-overlaps \*
2288               100  C   G,*    2/1
2289               100  CC  C,*    1/2
2290               101  C   G,*    2/1
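
The reason for quoting the asterisk can be seen without running bcftools at all. The sketch below creates two hypothetical files in a scratch directory and shows what the shell passes on in each case.

```shell
# Create a scratch directory containing two files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.vcf" "$demo_dir/b.vcf"

unquoted=$(cd "$demo_dir" && echo *)     # the shell expands * to file names
quoted=$(cd "$demo_dir" && echo '*')     # single quotes keep * literal
escaped=$(cd "$demo_dir" && echo \*)     # a backslash also keeps * literal

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "escaped:  $escaped"

rm -r "$demo_dir"
```

Only the quoted and escaped forms hand a literal * to the command; the unquoted form is replaced by the names of the files in the current directory.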
2291
2292       -c, --check-ref e|w|x|s
2293           what to do when incorrect or missing REF allele is encountered:
2294           exit (e), warn (w), exclude (x), or set/fix (s) bad sites. The w
2295           option can be combined with x and s. Note that s can swap alleles
2296           and will update genotypes (GT) and AC counts, but will not attempt
2297           to fix PL or other fields. Also note, and this cannot be stressed
2298           enough, that s will NOT fix strand issues in your VCF, do NOT use
2299           it for that purpose!!! (Instead see
2300           http://samtools.github.io/bcftools/howtos/plugin.af-dist.html and
2301           http://samtools.github.io/bcftools/howtos/plugin.fixref.html.)
2302
2303       -d, --rm-dup snps|indels|both|all|exact
2304           If a record is present multiple times, output only the first
2305           instance. See also --collapse in Common Options.
2306
2307       -D, --remove-duplicates
2308           If a record is present in multiple files, output only the first
2309           instance. Alias for -d none, deprecated.
2310
2311       -f, --fasta-ref FILE
2312           reference sequence. Supplying this option will turn on
2313           left-alignment and normalization, however, see also the
2314           --do-not-normalize option below.
2315
2316       --force
2317           try to proceed with -m- even if malformed tags with incorrect
2318           number of fields are encountered, discarding such tags.
2319           (Experimental, use at your own risk.)
2320
2321       --keep-sum TAG[,...]
2322           keep vector sum constant when splitting multiallelic sites. Only AD
2323           tag is currently supported. See also
2324           https://github.com/samtools/bcftools/issues/360
2325
2326       -m, --multiallelics -|+[snps|indels|both|any]
2327           split multiallelic sites into biallelic records (-) or join
2328           biallelic sites into multiallelic records (+). An optional type
2329           string can follow which controls variant types which should be
2330           split or merged together: If only SNP records should be split or
2331           merged, specify snps; if both SNPs and indels should be merged
2332           separately into two records, specify both; if SNPs and indels
2333           should be merged into a single record, specify any.
2334
2335       --no-version
2336           see Common Options
2337
2338       -N, --do-not-normalize
2339           do not normalize indels. The -c s option can be used to fix or set
2340           the REF allele from the reference given with -f; with -N, supplying
2341           -f does not turn on indel normalisation as it normally would.
2342
2343       --old-rec-tag STR
2344           Add INFO/STR annotation with the original record. The format of the
2345           annotation is CHROM|POS|REF|ALT|USED_ALT_IDX.
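
Because the annotation is a pipe-delimited string, it can be taken apart with the shell's read builtin. In the sketch below the annotation value is made up to match the documented layout and does not come from real output. The variable names are hypothetical.

```shell
# Hypothetical annotation value following CHROM|POS|REF|ALT|USED_ALT_IDX
old_rec='1|100|CC|C,GG|2'

# Split on "|" into the five documented fields.
IFS='|' read -r chrom pos ref alt used_alt_idx <<EOF
$old_rec
EOF

echo "original record: $chrom:$pos REF=$ref ALT=$alt (ALT index used: $used_alt_idx)"
```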
2346
2347       -o, --output FILE
2348           see Common Options
2349
2350       -O, --output-type b|u|z|v[0-9]
2351           see Common Options
2352
2353       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2354           see Common Options
2355
2356       -R, --regions-file file
2357           see Common Options
2358
2359       --regions-overlap 0|1|2
2360           see Common Options
2361
2362       -s, --strict-filter
2363           when merging (-m+), merged site is PASS only if all sites being
2364           merged PASS
2365
2366       -t, --targets LIST
2367           see Common Options
2368
2369       -T, --targets-file FILE
2370           see Common Options
2371
2372       --targets-overlap 0|1|2
2373           see Common Options
2374
2375       --threads INT
2376           see Common Options
2377
2378       -w, --site-win INT
2379           maximum distance between two records to consider when locally
2380           sorting variants which changed position during the realignment
2381
2382   bcftools [plugin NAME|+NAME] [OPTIONS] FILE -- [PLUGIN OPTIONS]
2383       A common framework for various utilities. The plugins can be used the
2384       same way as normal commands, except that their name is prefixed with "+".
2385       Most plugins accept two types of parameters: general options shared by
2386       all plugins followed by a separator (--), and a list of plugin-specific
2387       options. There are some exceptions to this rule: some plugins do not
2388       accept the common options and implement their own parameters. Therefore
2389       please pay attention to the usage examples that each plugin comes with.
2390
2391   VCF input options:
2392       -e, --exclude EXPRESSION
2393           exclude sites for which EXPRESSION is true. For valid expressions
2394           see EXPRESSIONS.
2395
2396       -i, --include EXPRESSION
2397           include only sites for which EXPRESSION is true. For valid
2398           expressions see EXPRESSIONS.
2399
2400       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2401           see Common Options
2402
2403       -R, --regions-file file
2404           see Common Options
2405
2406       --regions-overlap 0|1|2
2407           see Common Options
2408
2409       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2410           see Common Options
2411
2412       -T, --targets-file file
2413           see Common Options
2414
2415       --targets-overlap 0|1|2
2416           see Common Options
2417
2418   VCF output options:
2419       --no-version
2420           see Common Options
2421
2422       -o, --output FILE
2423           see Common Options
2424
2425       -O, --output-type b|u|z|v[0-9]
2426           see Common Options
2427
2428       --threads INT
2429           see Common Options
2430
2431   Plugin options:
2432       -h, --help
2433           list plugin’s options
2434
2435       -l, --list-plugins
2436           List all available plugins.
2437
2438           By default, appropriate system directories are searched for
2439           installed plugins. You can override this by setting the
2440           BCFTOOLS_PLUGINS environment variable to a colon-separated list of
2441           directories to search. If BCFTOOLS_PLUGINS begins with a colon,
2442           ends with a colon, or contains adjacent colons, the system
2443           directories are also searched at that position in the list of
2444           directories.
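
The colon rules for BCFTOOLS_PLUGINS can be sketched as a function that lists the search order, with a placeholder wherever the system directories would be inserted. The function name and the placeholder text are hypothetical, and the sketch only models a variable that is set and non-empty.

```shell
# Print the plugin search order implied by a BCFTOOLS_PLUGINS value ($1).
# An empty entry (leading, trailing or adjacent colons) stands for the
# system directories.
expand_plugin_path() {
    list="$1:"            # terminate every entry, including a trailing empty one
    old_ifs=$IFS
    IFS=:
    set -f                # no pathname expansion while splitting
    for dir in $list; do
        if [ -z "$dir" ]; then
            echo "<system directories>"
        else
            echo "$dir"
        fi
    done
    set +f
    IFS=$old_ifs
}

expand_plugin_path "/opt/my-plugins:"     # own directory first, then system
expand_plugin_path ":/opt/my-plugins"     # system first, then own directory
```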
2445
2446       -v, --verbose
2447           print debugging information to debug plugin failure
2448
2449       -V, --version
2450           print version string and exit
2451
2452   List of plugins coming with the distribution:
2453       ad-bias
2454           find positions with wildly varying ALT allele frequency (Fisher
2455           test on FMT/AD)
2456
2457       add-variantkey
2458           add VariantKey INFO fields VKX and RSX
2459
2460       af-dist
2461           collect AF deviation stats and GT probability distribution given AF
2462           and assuming HWE
2463
2464       allele-length
2465           count the frequency of the length of REF, ALT and REF+ALT
2466
2467       check-ploidy
2468           check if ploidy of samples is consistent for all sites
2469
2470       check-sparsity
2471           print samples without genotypes in a region or chromosome
2472
2473       color-chrs
2474           color shared chromosomal segments, requires trio VCF with phased
2475           GTs
2476
2477       contrast
2478           runs a basic association test, per-site or in a region, and checks
2479           for novel alleles and genotypes in two groups of samples. Adds the
2480           following INFO annotations:
2481
2482           •   PASSOC  .. Fisher’s exact test probability of genotypic
2483               association (REF vs non-REF allele)
2484
2485           •   FASSOC  .. proportion of non-REF allele in controls and cases
2486
2487           •   NASSOC  .. number of control-ref, control-alt, case-ref and
2488               case-alt alleles
2489
2490           •   NOVELAL .. lists samples with a novel allele not observed in
2491               the control group
2492
2493           •   NOVELGT .. lists samples with a novel genotype not observed in
2494               the control group
2495
2496       counts
2497           a minimal plugin which counts number of SNPs, Indels, and total
2498           number of sites.
2499
2500       dosage
2501           print genotype dosage. By default the plugin searches for PL, GL
2502           and GT, in that order.
2503
2504       fill-from-fasta
2505           fill INFO or REF field based on values in a fasta file
2506
2507       fill-tags
2508           set various INFO tags. The list of tags supported in this version:
2509
2510           •   INFO/AC         Number:A  Type:Integer  .. Allele count in
2511               genotypes
2512
2513           •   INFO/AC_Hom     Number:A  Type:Integer  .. Allele counts in
2514               homozygous genotypes
2515
2516           •   INFO/AC_Het     Number:A  Type:Integer  .. Allele counts in
2517               heterozygous genotypes
2518
2519           •   INFO/AC_Hemi    Number:A  Type:Integer  .. Allele counts in
2520               hemizygous genotypes
2521
2522           •   INFO/AF         Number:A  Type:Float    .. Allele frequency
2523
2524           •   INFO/AN         Number:1  Type:Integer  .. Total number of
2525               alleles in called genotypes
2526
2527           •   INFO/ExcHet     Number:A  Type:Float    .. Test excess
2528               heterozygosity; 1=good, 0=bad
2529
2530           •   INFO/END        Number:1  Type:Integer  .. End position of the
2531               variant
2532
2533           •   INFO/F_MISSING  Number:1  Type:Float    .. Fraction of missing
2534               genotypes
2535
2536           •   INFO/HWE        Number:A  Type:Float    .. HWE test
2537               (PMID:15789306); 1=good, 0=bad
2538
2539           •   INFO/MAF        Number:A  Type:Float    .. Minor Allele
2540               frequency
2541
2542           •   INFO/NS         Number:1  Type:Integer  .. Number of samples
2543               with data
2544
2545           •   INFO/TYPE       Number:.  Type:String   .. The record type
2546               (REF,SNP,MNP,INDEL,etc)
2547
2548           •   FORMAT/VAF      Number:A  Type:Float    .. The fraction of
2549               reads with the alternate allele, requires FORMAT/AD or ADF+ADR
2550
2551           •   FORMAT/VAF1     Number:1  Type:Float    .. The same as
2552               FORMAT/VAF but for all alternate alleles cumulatively
2553
2554           •   TAG=func(TAG)   Number:1  Type:Integer  .. Experimental support
2555               for user-defined expressions such as "DP=sum(DP)"
2556
2557       fix-ploidy
2558           sets correct ploidy
2559
2560       fixref
2561           determine and fix strand orientation
2562
2563       frameshifts
2564           annotate frameshift indels
2565
2566       GTisec
2567           count genotype intersections across all possible sample subsets in
2568           a vcf file
2569
2570       GTsubset
2571           output only sites where the requested samples all exclusively share
2572           a genotype
2573
2574       guess-ploidy
2575           determine sample sex by checking genotype likelihoods (GL,PL) or
2576           genotypes (GT) in the non-PAR region of chrX.
2577
2578       gvcfz
2579           compress gVCF file by resizing non-variant blocks according to
2580           specified criteria
2581
2582       impute-info
2583           add imputation information metrics to the INFO field based on
2584           selected FORMAT tags
2585
2586       indel-stats
2587           calculates per-sample or de novo indels stats. The usage and format
2588           is similar to smpl-stats and trio-stats
2589
2590       isecGT
2591           compare two files and set non-identical genotypes to missing
2592
2593       mendelian
2594           count Mendelian consistent / inconsistent genotypes.
2595
2596       missing2ref
2597           sets missing genotypes ("./.") to ref allele ("0/0" or "0|0")
2598
2599       parental-origin
2600           determine parental origin of a CNV region
2601
2602       prune
2603           prune sites by missingness, allele frequency or linkage
2604           disequilibrium. Alternatively, annotate sites with r2, Lewontin’s
2605           D' (PMID:19433632), Ragsdale’s D (PMID:31697386).
2606
2607       remove-overlaps
2608           remove overlapping variants and duplicate sites
2609
2610       scatter
2611           intended as an inverse to bcftools concat, scatter VCF by chunks or
2612           regions, creating multiple VCFs.
2613
2614       setGT
2615           general tool to set genotypes according to rules requested by the
2616           user
2617
2618       smpl-stats
2619           calculates basic per-sample stats. The usage and format is similar
2620           to indel-stats and trio-stats.
2621
2622       split
2623           split VCF by sample, creating single- or multi-sample VCFs
2624
2625       split-vep
2626           extract fields from structured annotations such as INFO/CSQ created
2627           by bcftools/csq or VEP. These can be added as a new INFO field to
2628           the VCF or in a custom text format. See
2629           http://samtools.github.io/bcftools/howtos/plugin.split-vep.html for
2630           more.
2631
2632       tag2tag
2633           Convert between similar tags, such as GL,PL,GP or QR,QA,QS.
2634
2635       trio-dnm2
2636           screen variants for possible de-novo mutations in trios
2637
2638       trio-stats
2639           calculate transmission rate in trio children. The usage and format
2640           is similar to indel-stats and smpl-stats.
2641
2642       trio-switch-rate
2643           calculate phase switch rate in trio samples, children samples must
2644           have phased GTs
2645
2646       variantkey-hex
2647           generate unsorted VariantKey-RSid index files in hexadecimal format
2648
2649   Examples:
2650           # List options common to all plugins
2651           bcftools plugin
2652
2653           # List available plugins
2654           bcftools plugin -l
2655
2656           # Run a plugin
2657           bcftools plugin counts in.vcf
2658
2659           # Run a plugin using the abbreviated "+" notation
2660           bcftools +counts in.vcf
2661
2662           # Run a plugin from an explicit location
2663           bcftools +/path/to/counts.so in.vcf
2664
2665           # The input VCF can be streamed just like in other commands
2666           cat in.vcf | bcftools +counts
2667
2668           # Print usage information of plugin "dosage"
2669           bcftools +dosage -h
2670
2671           # Replace missing genotypes with 0/0
2672           bcftools +missing2ref in.vcf
2673
2674           # Replace missing genotypes with 0|0
2675           bcftools +missing2ref in.vcf -- -p
2676
2677   Plugins troubleshooting:
2678       Things to check if your plugin does not show up in the bcftools plugin
2679       -l output:
2680
2681       •   Run with the -v option for verbose output: bcftools plugin -lv
2682
2683       •   Does the environment variable BCFTOOLS_PLUGINS include the correct
2684           path?
2685
2686   Plugins API:
2687           // Short description used by 'bcftools plugin -l'
2688           const char *about(void);
2689
2690           // Longer description used by 'bcftools +name -h'
2691           const char *usage(void);
2692
2693           // Called once at startup, allows initialization of local variables.
2694           // Return 1 to suppress normal VCF/BCF header output, -1 on critical
2695           // errors, 0 otherwise.
2696           int init(int argc, char **argv, bcf_hdr_t *in_hdr, bcf_hdr_t *out_hdr);
2697
2698           // Called for each VCF record, return NULL to suppress the output
2699           bcf1_t *process(bcf1_t *rec);
2700
2701           // Called after all lines have been processed to clean up
2702           void destroy(void);
2703
2704   bcftools polysomy [OPTIONS] file.vcf.gz
2705       Detect number of chromosomal copies in VCFs annotated with
2706       Illumina’s B-allele frequency (BAF) values. Note that this command is
2707       not compiled in by default, see the section Optional Compilation with
2708       GSL in the INSTALL file for help.
2709
2710   General options:
2711       -o, --output-dir path
2712           output directory
2713
2714       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2715           see Common Options
2716
2717       -R, --regions-file file
2718           see Common Options
2719
2720       --regions-overlap 0|1|2
2721           see Common Options
2722
2723       -s, --sample string
2724           sample name
2725
2726       -t, --targets LIST
2727           see Common Options
2728
2729       -T, --targets-file FILE
2730           see Common Options
2731
2732       --targets-overlap 0|1|2
2733           see Common Options
2734
2735       -v, --verbose
2736           verbose debugging output which gives hints about the thresholds and
2737           decisions made by the program. Note that the exact output can
2738           change between versions.
2739
2740   Algorithm options:
2741       -b, --peak-size float
2742           the minimum peak size considered a good match; a value from the
2743           interval [0,1], where larger is stricter
2744
2745       -c, --cn-penalty float
2746           a penalty for increasing copy number state. How this works:
2747           multiple peaks are always a better fit than a single peak,
2748           therefore the program prefers a single peak (normal copy number)
2749           unless the absolute deviation of the multiple peaks fit is
2750           significantly smaller. Here the meaning of "significant" is given
2751           by the float from the interval [0,1] where larger is stricter.
2752
2753       -f, --fit-th float
2754           threshold for goodness of fit (normalized absolute deviation),
2755           smaller is stricter
2756
2757       -i, --include-aa
2758           include also the AA peak in CN2 and CN3 evaluation. This usually
2759           requires increasing -f.
2760
2761       -m, --min-fraction float
2762           minimum distinguishable fraction of aberrant cells. Experience
2763           shows that estimates of 20% and more are trustworthy.
2764
2765       -p, --peak-symmetry float
2766           a heuristic to filter failed fits where the expected peak symmetry
2767           is violated. The float is from the interval [0,1] and larger is
2768           stricter
2769
2770   bcftools query [OPTIONS] file.vcf.gz [file.vcf.gz [...]]
2771       Extracts fields from VCF or BCF files and outputs them in user-defined
2772       format.
2773
2774       -e, --exclude EXPRESSION
2775           exclude sites for which EXPRESSION is true. For valid expressions
2776           see EXPRESSIONS.
2777
2778       --force-samples
2779           continue even when some samples requested via -s/-S do not exist
2780
2781       -f, --format FORMAT
2782           learn by example, see below
2783
2784       -H, --print-header
2785           print header
2786
2787       -i, --include EXPRESSION
2788           include only sites for which EXPRESSION is true. For valid
2789           expressions see EXPRESSIONS.
2790
2791       -l, --list-samples
2792           list sample names and exit
2793
2794       -o, --output FILE
2795           see Common Options
2796
2797       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2798           see Common Options
2799
2800       -R, --regions-file file
2801           see Common Options
2802
2803       --regions-overlap 0|1|2
2804           see Common Options
2805
2806       -s, --samples LIST
2807           see Common Options
2808
2809       -S, --samples-file FILE
2810           see Common Options
2811
2812       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2813           see Common Options
2814
2815       -T, --targets-file file
2816           see Common Options
2817
2818       --targets-overlap 0|1|2
2819           see Common Options
2820
2821       -u, --allow-undef-tags
2822           do not throw an error if there are undefined tags in the format
2823           string, print "." instead
2824
2825       -v, --vcf-list FILE
2826           process multiple VCFs listed in the file
2827
2828   Format:
2829           %CHROM          The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER)
2830           %END            End position of the REF allele
2831           %END0           End position of the REF allele in 0-based coordinates
2832           %FIRST_ALT      Alias for %ALT{0}
2833           %FORMAT         Prints all FORMAT fields or a subset of samples with -s or -S
2834           %GT             Genotype (e.g. 0/1)
2835           %INFO           Prints the whole INFO column
2836           %INFO/TAG       Any tag in the INFO column
2837           %IUPACGT        Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A)
2838           %LINE           Prints the whole line
2839           %MASK           Indicates presence of the site in other files (with multiple files)
2840           %N_PASS(expr)   Number of samples that pass the filtering expression (see EXPRESSIONS)
2841           %POS0           POS in 0-based coordinates
2842           %PBINOM(TAG)    Calculate phred-scaled binomial probability, the allele index is determined from GT
2843           %SAMPLE         Sample name
2844           %TAG{INT}       Curly brackets to print a subfield (e.g. INFO/TAG{1}, the indexes are 0-based)
2845           %TBCSQ          Translated FORMAT/BCSQ. See the csq command above for explanation and examples.
2846           %TGT            Translated genotype (e.g. C/A)
2847           %TYPE           Variant type (REF, SNP, MNP, INDEL, BND, OTHER)
2848           []              Format fields must be enclosed in brackets to loop over all samples
2849           \n              new line
2850           \t              tab character
2851
2852           Everything else is printed verbatim.
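
               The relation between %POS0 and %END can be sketched in a few
               lines of Python. This is only an illustration: the helper name
               bed_interval is hypothetical, and the sketch assumes that the
               record carries no INFO/END tag, so the end position follows
               from the length of the REF allele alone.

```python
def bed_interval(pos, ref):
    """Return (start, end) of a BED interval from a 1-based VCF POS and REF.

    Assumption: no INFO/END tag, so the end depends only on len(ref).
    """
    start0 = pos - 1            # what %POS0 prints: POS in 0-based coordinates
    end1 = pos + len(ref) - 1   # what %END prints: 1-based last base of REF
    return start0, end1
```

               A 0-based start combined with a 1-based inclusive end is the
               half-open interval convention used by BED files.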
2853
2854   Examples:
2855           # Print chromosome, position, ref allele and the first alternate allele
2856           bcftools query -f '%CHROM  %POS  %REF  %ALT{0}\n' file.vcf.gz
2857
2858           # Similar to above, but use tabs instead of spaces, add sample name and genotype
2859           bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
2860
2861           # Print FORMAT/GQ fields followed by FORMAT/GT fields
2862           bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf
2863
2864           # Make a BED file: chr, pos (0-based), end pos (1-based), id
2865           bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
2866
2867           # Print only samples with alternate (non-reference) genotypes
2868           bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i'GT="alt"' file.bcf
2869
2870           # Print all samples at sites with at least one alternate genotype
2871           bcftools view -i'GT="alt"' file.bcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
2872
2873           # Print phred-scaled binomial probability from FORMAT/AD tag for all heterozygous genotypes
2874           bcftools query -i'GT="het"' -f'[%CHROM:%POS %SAMPLE %GT %PBINOM(AD)\n]' file.vcf
2875
2876           # Print the second value of AC field if bigger than 10. Note the (unfortunate) difference in
2877           # index subscript notation: formatting expressions (-f) uses "{}" while filtering expressions
2878           # (-i) use "[]". This is for historic reasons and backward-compatibility.
2879           bcftools query -f '%AC{1}\n' -i 'AC[1]>10' file.vcf.gz
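
               To give a feel for the numbers printed by %PBINOM(AD) in the
               heterozygous example above, the following Python sketch
               computes a two-sided binomial p-value for a pair of allele
               depths and converts it to the phred scale. It assumes a success
               probability of 0.5 and the "double the smaller tail" definition
               of a two-sided test; the function names are hypothetical and
               the exact numerical method used by bcftools may differ.

```python
from math import comb, log10


def binom_two_sided(n_a, n_b):
    """Two-sided binomial p-value for counts (n_a, n_b) with p = 0.5.

    Sketch only: doubles the smaller tail and caps the result at 1.
    Returns None when there are no observations (N = 0).
    """
    n = n_a + n_b
    if n == 0:
        return None
    k = min(n_a, n_b)
    # Probability of k or fewer observations of the rarer allele
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


def phred(prob):
    """Convert a probability to the phred scale: -10 * log10(prob)."""
    return -10.0 * log10(prob)
```

               Balanced depths such as 5,5 give a p-value of 1, while a
               strongly skewed pair such as 0,10 gives a small p-value and
               therefore a large phred-scaled score.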
2880
2881   bcftools reheader [OPTIONS] file.vcf.gz
2882       Modify header of VCF/BCF files, change sample names.
2883
2884       -f, --fai FILE
2885           add to the header contig names and their lengths from the provided
2886           fasta index file (.fai). Lengths of existing contig lines will be
2887           updated and contig lines not present in the fai file will be
2888           removed
2889
2890       -h, --header FILE
2891           new VCF header
2892
2893       -o, --output FILE
2894           see Common Options
2895
2896       -s, --samples FILE
2897           new sample names, one name per line, in the same order as they
2898           appear in the VCF file. Alternatively, only samples which need to
2899           be renamed can be listed as "old_name new_name\n" pairs separated
2900           by whitespaces, each on a separate line. If a sample name contains
2901           spaces, the spaces can be escaped using the backslash character,
2902           for example "Not\ a\ good\ sample\ name".
2903
2904       -T, --temp-prefix PATH
2905           template for temporary file names, used with -f
2906
2907       --threads INT
2908           see Common Options
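
           The line format accepted by -s, --samples can be illustrated with
           a small parser. This is a hedged sketch, not the parser used by
           bcftools: the function name parse_rename_line is hypothetical, and
           it only assumes what is stated above, namely that whitespace
           separates fields unless it is escaped with a backslash.

```python
import re


def parse_rename_line(line):
    """Split one line of a sample-renaming file into its fields.

    Whitespace separates fields unless it is preceded by a backslash,
    in which case it is kept as part of the sample name.
    """
    fields = re.split(r"(?<!\\)\s+", line.strip())
    # Drop the escaping backslash, keep the whitespace character itself
    return [re.sub(r"\\(\s)", r"\1", field) for field in fields]
```

           A line yielding one field is a new name applied by position; a
           line yielding two fields is an "old_name new_name" pair.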
2909
2910   bcftools roh [OPTIONS] file.vcf.gz
2911       A program for detecting runs of homo/autozygosity. Only bi-allelic
2912       sites are considered.
2913
2914   The HMM model:
2915           Notation:
2916             D  = Data, AZ = autozygosity, HW = Hardy-Weinberg (non-autozygosity),
2917             f  = non-ref allele frequency
2918
2919           Emission probabilities:
2920             oAZ = P_i(D|AZ) = (1-f)*P(D|RR) + f*P(D|AA)
2921             oHW = P_i(D|HW) = (1-f)^2 * P(D|RR) + f^2 * P(D|AA) + 2*f*(1-f)*P(D|RA)
2922
2923           Transition probabilities:
2924             tAZ = P(AZ|HW)  .. from HW to AZ, the -a parameter
2925             tHW = P(HW|AZ)  .. from AZ to HW, the -H parameter
2926
2927             ci  = P_i(C)  .. probability of cross-over at site i, from genetic map
2928             AZi = P_i(AZ) .. probability of site i being AZ/non-AZ, scaled so that AZi+HWi = 1
2929             HWi = P_i(HW)
2930
2931             P_i(AZ) = oAZ * max[(1 - tAZ * ci) * AZ{i-1} , tAZ * ci * (1-AZ{i-1})]
2932             P_i(HW) = oHW * max[(1 - tHW * ci) * (1-AZ{i-1}) , tHW * ci * AZ{i-1}]
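
           The emission probabilities and a single update of the recursion
           above can be written out as follows. This is a minimal sketch that
           transcribes the formulas exactly as given in this section; the
           function and argument names are hypothetical, and it is not the
           bcftools implementation.

```python
def emissions(f, p_rr, p_ra, p_aa):
    """Return (oAZ, oHW) for one site.

    f is the non-ref allele frequency; p_rr, p_ra and p_aa are the
    probabilities of the data given the RR, RA and AA genotypes.
    """
    o_az = (1 - f) * p_rr + f * p_aa
    o_hw = (1 - f) ** 2 * p_rr + f**2 * p_aa + 2 * f * (1 - f) * p_ra
    return o_az, o_hw


def viterbi_step(az_prev, o_az, o_hw, t_az, t_hw, c):
    """Advance the state probabilities by one site.

    az_prev is the scaled AZ probability at the previous site, c is the
    cross-over probability at the current site. Returns (AZ, HW),
    rescaled so that AZ + HW = 1.
    """
    hw_prev = 1 - az_prev
    p_az = o_az * max((1 - t_az * c) * az_prev, t_az * c * hw_prev)
    p_hw = o_hw * max((1 - t_hw * c) * hw_prev, t_hw * c * az_prev)
    total = p_az + p_hw  # assumed non-zero in this sketch
    return p_az / total, p_hw / total
```

           For example, data that can only be explained by a heterozygous
           genotype give oAZ = 0, so the step assigns all probability to the
           HW state regardless of the transition parameters.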
2933
2934   General Options:
2935       --AF-dflt FLOAT
2936           in case allele frequency is not known, use the FLOAT. By default,
2937           sites where allele frequency cannot be determined, or is 0, are
2938           skipped.
2939
2940       --AF-tag TAG
2941           use the specified INFO tag TAG as an allele frequency estimate
2942           instead of the default AC and AN tags. Sites which do not have TAG
2943           will be skipped.
2944
2945       --AF-file FILE
2946           Read allele frequencies from a tab-delimited file containing the
2947           columns: CHROM\tPOS\tREF,ALT\tAF. The file can be compressed with
2948           bgzip and indexed with tabix -s1 -b2 -e2. Sites which are not
2949           present in the FILE or have different reference or alternate allele
2950           will be skipped. Note that such a file can be easily created from a
2951           VCF using:
2952
2953               bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%INFO/TAG\n' file.vcf | bgzip -c > freqs.tab.gz
2954
2955       -b, --buffer-size INT[,INT]
2956           when the entire many-sample file cannot fit into memory, a sliding
2957           buffer approach can be used. The first value is the number of sites
2958           to keep in memory. If negative, it is interpreted as the maximum
2959           memory to use, in MB. The second, optional, value sets the number
2960           of overlapping sites. The default overlap is set to roughly 1% of
2961           the buffer size.
2962
2963       -e, --estimate-AF [TAG],FILE
2964           estimate the allele frequency by recalculating INFO/AC and INFO/AN
2965           on the fly, using the specified TAG which can be either FORMAT/GT
2966           ("GT") or FORMAT/PL ("PL"). If TAG is not given, "GT" is assumed.
2967           Either all samples ("-") or samples listed in FILE will be
2968           included. For example, use "PL,-" to estimate AF from FORMAT/PL of
2969           all samples. If neither -e nor the other --AF-... options are
2970           given, the allele frequency is estimated from AC and AN counts
2971           which are already present in the INFO field.
2972
2973       --exclude EXPRESSION
2974           exclude sites for which EXPRESSION is true. For valid expressions
2975           see EXPRESSIONS.
2976
2977       -G, --GTs-only FLOAT
2978           use genotypes (FORMAT/GT fields) ignoring genotype likelihoods
2979           (FORMAT/PL), setting PL of unseen genotypes to FLOAT. Safe value to
2980           use is 30 to account for GT errors.
2981
2982       --include EXPRESSION
2983           include only sites for which EXPRESSION is true. For valid
2984           expressions see EXPRESSIONS.
2985
2986       -I, --skip-indels
2987           skip indels as their genotypes are usually enriched for errors
2988
2989       -m, --genetic-map FILE
2990           genetic map in the format required also by IMPUTE2. Only the first
2991           and third column are used (position and Genetic_Map(cM)). The FILE
2992           can be a single file or a file mask, where string "{CHROM}" is
2993           replaced with chromosome name.
2994
2995       -M, --rec-rate FLOAT
2996           constant recombination rate per bp. In combination with
2997           --genetic-map, the --rec-rate parameter is interpreted differently,
2998           as FLOAT-fold increase of transition probabilities, which allows
2999           the model to become more sensitive yet still account for
3000           recombination hotspots. Note that the typical range of values
3001           therefore differs between the two cases: normally the parameter
3002           will be in the range (1e-9,1e-3), but with --genetic-map it will
3003           be in the range (10,1000).
3004
3005       -o, --output FILE
3006           Write output to the FILE, by default the output is printed on
3007           stdout
3008
3009       -O, --output-type s|r[z]
3010           Generate per-site output (s) or per-region output (r). By default
3011           both types are printed and the output is uncompressed. Add z for a
3012           compressed output.
3013
3014       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
3015           see Common Options
3016
3017       -R, --regions-file file
3018           see Common Options
3019
3020       --regions-overlap 0|1|2
3021           see Common Options
3022
3023       -s, --samples LIST
3024           see Common Options
3025
3026       -S, --samples-file FILE
3027           see Common Options
3028
3029       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
3030           see Common Options
3031
3032       -T, --targets-file file
3033           see Common Options
3034
3035       --targets-overlap 0|1|2
3036           see Common Options
3037
3038   HMM Options:
3039       -a, --hw-to-az FLOAT
3040           P(AZ|HW) transition probability from HW (Hardy-Weinberg) to AZ
3041           (autozygous) state
3042
3043       -H, --az-to-hw FLOAT
3044           P(HW|AZ) transition probability from AZ to HW state
3045
3046       -V, --viterbi-training FLOAT
3047           estimate HMM parameters using Baum-Welch algorithm, using the
3048           convergence threshold FLOAT, e.g. 1e-10 (experimental)
3049
3050   bcftools sort [OPTIONS] file.bcf
3051       -m, --max-mem FLOAT[kMG]
3052           Maximum memory to use. Approximate, affects the number of temporary
3053           files written to the disk. Note that if the command fails at this
3054           step because of too many open files, your system limit on the
3055           number of open files ("ulimit") may need to be increased.
3056
3057       -o, --output FILE
3058           see Common Options
3059
3060       -O, --output-type b|u|z|v[0-9]
3061           see Common Options
3062
3063       -T, --temp-dir DIR
3064           Use this directory to store temporary files
3065
3066   bcftools stats [OPTIONS] A.vcf.gz [B.vcf.gz]
3067       Parses VCF or BCF and produces text file stats which is suitable for
3068       machine processing and can be plotted using plot-vcfstats. When two
3069       files are given, the program generates separate stats for intersection
3070       and the complements. By default only sites are compared; -s/-S must be
3071       given to also include sample columns. When one VCF file is specified on
3072       the command line, then stats by non-reference allele frequency, depth
3073       distribution, stats by quality and per-sample counts, singleton stats,
3074       etc. are printed. When two VCF files are given, then stats such as
3075       concordance (Genotype concordance by non-reference allele frequency,
3076       Genotype concordance by sample, Non-Reference Discordance) and
3077       correlation are also printed. Per-site discordance (PSD) is also
3078       printed in --verbose mode.
3079
3080       --af-bins LIST|FILE
3081           comma separated list of allele frequency bins (e.g. 0.1,0.5,1) or a
3082           file listing the allele frequency bins one per line (e.g.
3083           0.1\n0.5\n1)
3084
3085       --af-tag TAG
3086           allele frequency INFO tag to use for binning. By default the allele
3087           frequency is estimated from AC/AN, if available, or directly from
3088           the genotypes (GT) if not.
3089
3090       -1, --1st-allele-only
3091           consider only the 1st alternate allele at multiallelic sites
3092
3093       -c, --collapse snps|indels|both|all|some|none
3094           see Common Options
3095
3096       -d, --depth INT,INT,INT
3097           ranges of depth distribution: min, max, and size of the bin
3098
3099       --debug
3100           produce verbose per-site and per-sample output
3101
3102       -e, --exclude EXPRESSION
3103           exclude sites for which EXPRESSION is true. For valid expressions
3104           see EXPRESSIONS.
3105
3106       -E, --exons file.gz
3107           tab-delimited file with exons for indel frameshifts statistics. The
3108           columns of the file are CHR, FROM, TO, with 1-based, inclusive,
3109           positions. The file is BGZF-compressed and indexed with tabix
3110
3111               tabix -s1 -b2 -e3 file.gz
3112
3113       -f, --apply-filters LIST
3114           see Common Options
3115
3116       -F, --fasta-ref ref.fa
3117           faidx indexed reference sequence file to determine INDEL context
3118
3119       -i, --include EXPRESSION
3120           include only sites for which EXPRESSION is true. For valid
3121           expressions see EXPRESSIONS.
3122
3123       -I, --split-by-ID
3124           collect stats separately for sites which have the ID column set
3125           ("known sites") or which do not have the ID column set ("novel
3126           sites").
3127
3128       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
3129           see Common Options
3130
3131       -R, --regions-file file
3132           see Common Options
3133
3134       --regions-overlap 0|1|2
3135           see Common Options
3136
3137       -s, --samples LIST
3138           see Common Options
3139
3140       -S, --samples-file FILE
3141           see Common Options
3142
3143       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
3144           see Common Options
3145
3146       -T, --targets-file file
3147           see Common Options
3148
3149       --targets-overlap 0|1|2
3150           see Common Options
3151
3152       -u, --user-tstv <TAG[:min:max:n]>
3153           collect Ts/Tv stats for any tag using the given binning [0:1:100]
3154
3155       -v, --verbose
3156           produce verbose per-site and per-sample output
3157
3158   bcftools view [OPTIONS] file.vcf.gz [REGION [...]]
3159       View, subset and filter VCF or BCF files by position and filtering
3160       expression. Convert between VCF and BCF. Former bcftools subset.
3161
3162   Output options
3163       -G, --drop-genotypes
3164           drop individual genotype information (after subsetting if -s option
3165           is set)
3166
3167       -h, --header-only
3168           output the VCF header only (see also bcftools head)
3169
3170       -H, --no-header
3171           suppress the header in VCF output
3172
3173       --with-header
3174           output both VCF header and records (this is the default, but the
3175           option is useful for explicitness or to reset the effects of -h or
3176           -H)
3177
3178       -l, --compression-level [0-9]
3179           compression level. 0 stands for uncompressed, 1 for best speed and
3180           9 for best compression.
3181
3182       --no-version
3183           see Common Options
3184
3185       -O, --output-type b|u|z|v[0-9]
3186           see Common Options
3187
3188       -o, --output FILE
3189           output file name. If not present, output is printed to stdout.
3190
3191       -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
3192           see Common Options
3193
3194       -R, --regions-file file
3195           see Common Options
3196
3197       --regions-overlap 0|1|2
3198           see Common Options
3199
3200       -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
3201           see Common Options
3202
3203       -T, --targets-file file
3204           see Common Options
3205
3206       --targets-overlap 0|1|2
3207           see Common Options
3208
3209       --threads INT
3210           see Common Options
3211
3212   Subset options:
3213       -a, --trim-alt-alleles
3214           remove alleles not seen in the genotype fields from the ALT column.
3215           Note that if no alternate allele remains after trimming, the record
3216           itself is not removed but ALT is set to ".". If the option -s or -S
3217           is given, removes alleles not seen in the subset. INFO and FORMAT
3218           tags declared as Type=A, G or R will be trimmed as well.
3219
3220       --force-samples
3221           only warn about unknown subset samples
3222
3223       -I, --no-update
3224           do not (re)calculate INFO fields for the subset (currently INFO/AC
3225           and INFO/AN)
3226
3227       -s, --samples LIST
3228           see Common Options. Note that it is possible to create multiple
3229           subsets simultaneously using the split plugin.
3230
3231       -S, --samples-file FILE
3232           see Common Options. Note that it is possible to create multiple
3233           subsets simultaneously using the split plugin.
3234
3235   Filter options:
3236       Note that filter options below dealing with counting the number of
3237       alleles will, for speed, first check for the values of AC and AN in the
3238       INFO column to avoid parsing all the genotype (FORMAT/GT) fields in the
3239       VCF. This means that a filter like --min-af 0.1 will be calculated from
3240       INFO/AC and INFO/AN when available or FORMAT/GT otherwise. However, it
3241       will not attempt to use any other existing field, like INFO/AF for
3242       example. For that, use --exclude 'AF<0.1' instead.
3243
3244       Also note that one must be careful when sample subsetting and filtering
3245       is performed in a single command because the order of internal
3246       operations can influence the result. For example, the -i/-e filtering
3247       is performed before sample removal, but the -P filtering is performed
3248       after, and some are inherently ambiguous, for example allele counts can
3249       be taken from the INFO column when present but calculated on the fly
3250       when absent. Therefore it is strongly recommended to spell out the
3251       required order explicitly by separating such commands into two steps.
3252       (Make sure to use the -O u option when piping!)
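
           The allele types accepted by the count and frequency filters of
           this section (nref, alt1, minor, major, nonmajor) can be made
           concrete with a short sketch working from INFO/AC and INFO/AN.
           The code follows a literal reading of the option descriptions;
           the function names are hypothetical and edge cases, such as ties
           between alleles, may be handled differently by bcftools.

```python
def allele_count(ac, an, kind="nref"):
    """Pick an allele count from INFO/AC (one value per ALT) and INFO/AN."""
    counts = [an - sum(ac)] + list(ac)  # REF count first, then each ALT
    if kind == "nref":
        return sum(ac)                  # all non-reference alleles
    if kind == "alt1":
        return ac[0]                    # first alternate allele only
    if kind == "minor":
        return min(counts)              # the least frequent allele
    if kind == "major":
        return max(counts)              # the most frequent allele
    if kind == "nonmajor":
        return an - max(counts)         # everything but the most frequent
    raise ValueError(f"unknown allele type: {kind}")


def allele_freq(ac, an, kind="nref"):
    """Frequency counterpart of allele_count: the chosen count over AN."""
    return allele_count(ac, an, kind) / an
```

           With AC=3,1 and AN=10, the reference allele is seen 6 times, so
           nref is 4, alt1 is 3, minor is 1, major is 6 and nonmajor is 4.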
3253
3254       -c, --min-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
3255           minimum allele count (INFO/AC) of sites to be printed. Specifying
3256           the type of allele is optional and can be set to non-reference
3257           (nref, the default), 1st alternate  (alt1), the least frequent
3258           (minor), the most frequent (major) or sum of all but the most
3259           frequent (nonmajor) alleles.
3260
3261       -C, --max-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
3262           maximum allele count (INFO/AC) of sites to be printed. Specifying
3263           the type of allele is optional and can be set to non-reference
3264           (nref, the default), 1st alternate  (alt1), the least frequent
3265           (minor), the most frequent (major) or sum of all but the most
3266           frequent (nonmajor) alleles.
3267
3268       -e, --exclude EXPRESSION
3269           exclude sites for which EXPRESSION is true. For valid expressions
3270           see EXPRESSIONS.
3271
3272       -f, --apply-filters LIST
3273           see Common Options
3274
3275       -g, --genotype [^][hom|het|miss]
3276           include only sites with one or more homozygous (hom), heterozygous
3277           (het) or missing (miss) genotypes. When prefixed with ^, the logic
3278           is reversed; thus ^het excludes sites with heterozygous genotypes.
3279
3280       -i, --include EXPRESSION
3281           include sites for which EXPRESSION is true. For valid expressions
3282           see EXPRESSIONS.
3283
3284       -k, --known
3285           print known sites only (ID column is not ".")
3286
3287       -m, --min-alleles INT
3288           print sites with at least INT alleles listed in REF and ALT columns
3289
3290       -M, --max-alleles INT
3291           print sites with at most INT alleles listed in REF and ALT columns.
3292           Use -m2 -M2 -v snps to only view biallelic SNPs.
3293
3294       -n, --novel
3295           print novel sites only (ID column is ".")
3296
3297       -p, --phased
3298           print sites where all samples are phased. Haploid genotypes are
3299           considered phased. Missing genotypes are considered unphased
3300           unless the phased bit is set.
3301
3302       -P, --exclude-phased
3303           exclude sites where all samples are phased
3304
3305       -q, --min-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
3306           minimum allele frequency (INFO/AC / INFO/AN) of sites to be
3307           printed. Specifying the type of allele is optional and can be set
3308           to non-reference (nref, the default), 1st alternate  (alt1), the
3309           least frequent (minor), the most frequent (major) or sum of all but
3310           the most frequent (nonmajor) alleles.
3311
3312       -Q, --max-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
3313           maximum allele frequency (INFO/AC / INFO/AN) of sites to be
3314           printed. Specifying the type of allele is optional and can be set
3315           to non-reference (nref, the default), 1st alternate  (alt1), the
3316           least frequent (minor), the most frequent (major) or sum of all but
3317           the most frequent (nonmajor) alleles.
3318
3319       -u, --uncalled
3320           print sites without a called genotype
3321
3322       -U, --exclude-uncalled
3323           exclude sites without a called genotype
3324
3325       -v, --types snps|indels|mnps|other
3326           comma-separated list of variant types to select. Site is selected
3327           if any of the ALT alleles is of the type requested. Types are
3328           determined by comparing the REF and ALT alleles in the VCF record,
3329           not from INFO tags like INFO/INDEL or INFO/VT. Use --include to select
3330           based on INFO tags.
3331
3332       -V, --exclude-types snps|indels|mnps|ref|bnd|other
3333           comma-separated list of variant types to exclude. Site is excluded
3334           if any of the ALT alleles is of the type requested. Types are
3335           determined by comparing the REF and ALT alleles in the VCF record,
3336           not from INFO tags like INFO/INDEL or INFO/VT. Use --exclude to exclude
3337           based on INFO tags.
3338
3339       -x, --private
3340           print sites where only the subset samples carry a non-reference
3341           allele. Requires --samples or --samples-file.
3342
3343       -X, --exclude-private
3344           exclude sites where only the subset samples carry a non-reference
3345           allele
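
           The rule used by -v, --types ("Site is selected if any of the ALT
           alleles is of the type requested") can be sketched as below. The
           classification is deliberately simplified and the function names
           are hypothetical; the real type detection covers more cases than
           this illustration does.

```python
def classify_alt(ref, alt):
    """Classify one ALT allele by comparing it with REF (simplified)."""
    if alt in (".", ref):
        return "ref"
    if "[" in alt or "]" in alt:
        return "bnd"                    # breakend notation
    if alt == "*" or alt.startswith("<"):
        return "other"                  # symbolic or overlapping allele
    if len(ref) != len(alt):
        return "indel"
    mismatches = sum(r != a for r, a in zip(ref, alt))
    return "snp" if mismatches == 1 else "mnp"


def site_selected(ref, alts, wanted):
    """Keep the site if any ALT allele has one of the wanted types."""
    return any(classify_alt(ref, alt) in wanted for alt in alts)
```

           Because a single matching ALT allele is enough, a multiallelic
           site with one SNP allele and one indel allele is selected by both
           -v snps and -v indels.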
3346
3347   bcftools help [COMMAND] | bcftools --help [COMMAND]
3348       Display  a  brief usage message listing the bcftools commands
3349       available. If the name of a command is also given, e.g., bcftools help
3350       view, the detailed usage message for that particular command is
3351       displayed.
3352
3353   bcftools [--version|-v]
3354       Display the version numbers and copyright information for bcftools and
3355       the important libraries used by bcftools.
3356
3357   bcftools [--version-only]
3358       Display the full bcftools version number in a machine-readable format.
3359

EXPRESSIONS

3361       These filtering expressions are accepted by most of the commands.
3362
3363       Valid expressions may contain:
3364
3365       •   numerical constants, string constants, file names (this is
3366           currently supported only to filter by the ID column)
3367
3368               1, 1.0, 1e-4
3369               "String"
3370               @file_name
3371
3372       •   arithmetic operators
3373
3374               +,*,-,/
3375
3376       •   comparison operators
3377
3378               == (same as =), >, >=, <=, <, !=
3379
3380       •   regex operators "~" and its negation "!~". The expressions are
3381           case sensitive unless "/i" is added.
3382
3383               INFO/HAYSTACK ~ "needle"
3384               INFO/HAYSTACK ~ "NEEDless/i"
3385
3386       •   parentheses
3387
3388               (, )
3389
3390       •   logical operators. See also the examples below and the filtering
3391           tutorial <http://samtools.github.io/bcftools/howtos/filtering.html>
3392           about the distinction between "&&" vs "&" and "||" vs "|".
3393
3394               &&,  &, ||,  |
3395
3396       •   INFO tags, FORMAT tags, column names
3397
3398               INFO/DP or DP
3399               FORMAT/DV, FMT/DV, or DV
3400               FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]
3401
3402       •   starting with 1.11, the FILTER column can be queried as follows:
3403
3404               FILTER="PASS"
3405               FILTER="A"          .. exact match, for example "A;B" does not pass
3406               FILTER!="A"         .. exact match, for example "A;B" does pass
3407               FILTER~"A"          .. both "A" and "A;B" pass
3408               FILTER!~"A"         .. neither "A" nor "A;B" pass
3409
3410       •   1 (or 0) to test the presence (or absence) of a flag
3411
3412               FlagA=1 && FlagB=0
3413
3414       •   "." to test missing values
3415
3416               DP=".", DP!=".", ALT="."
3417
3418       •   missing genotypes can be matched regardless of phase and ploidy
3419           (".|.", "./.", ".", "0|.") using these expressions
3420
3421               GT="mis", GT~"\.", GT!~"\."
3422
3423       •   missing genotypes can be matched including the phase and ploidy
3424           (".|.", "./.", ".") using these expressions
3425
3426               GT=".|.", GT="./.", GT="."
3427
3428       •   sample genotype: reference (haploid or diploid), alternate (hom or
3429           het, haploid or diploid), missing genotype, homozygous,
3430           heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het,
3431           alt-alt het, haploid ref, haploid alt (case-insensitive)
3432
3433               GT="ref"
3434               GT="alt"
3435               GT="mis"
3436               GT="hom"
3437               GT="het"
3438               GT="hap"
3439               GT="RR"
3440               GT="AA"
3441               GT="RA" or GT="AR"
3442               GT="Aa" or GT="aA"
3443               GT="R"
3444               GT="A"
3445
3446       •   TYPE for variant type in REF,ALT columns
3447           (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "~"
3448           to require at least one allele of the given type or the equal sign
3449           "=" to require that all alleles are of the given type. Compare
3450
3451               TYPE="snp"
3452               TYPE~"snp"
3453               TYPE!="snp"
3454               TYPE!~"snp"
3455
3456       •   array subscripts (0-based), "*" for any element, "-" to indicate a
3457           range. Note that for querying FORMAT vectors, the colon ":" can be
3458           used to select a sample and an element of the vector, as shown in
3459           the examples below
3460
3461               INFO/AF[0] > 0.3             .. first AF value bigger than 0.3
3462               FORMAT/AD[0:0] > 30          .. first AD value of the first sample bigger than 30
3463               FORMAT/AD[0:1]               .. first sample, second AD value
3464               FORMAT/AD[1:0]               .. second sample, first AD value
3465               DP4[*] == 0                  .. any DP4 value
3466               FORMAT/DP[0]   > 30          .. DP of the first sample bigger than 30
3467               FORMAT/DP[1-3] > 10          .. samples 2-4
3468               FORMAT/DP[1-]  < 7           .. all samples but the first
3469               FORMAT/DP[0,2-4] > 20        .. samples 1, 3-5
3470               FORMAT/AD[0:1]               .. first sample, second AD field
3471               FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
3472               FORMAT/AD[*:1] or AD[:1]        .. any sample, second AD field
3473               (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
3474               CSQ[*] ~ "missense_variant.*deleterious"
3475
3476       •   with many samples it can be more practical to provide a file with
3477           sample names, one sample name per line
3478
3479               GT[@samples.txt]="het" & binom(AD)<0.01
3480
3481       •   function on FORMAT tags (over samples) and INFO tags (over vector
3482           fields): maximum; minimum; arithmetic mean (AVG is synonymous with
3483           MEAN); median; standard deviation from mean; sum; string length;
3484           absolute value; number of elements:
3485
3486               MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT
3487
3488           Note that functions above evaluate to a single value across all
3489           samples and are intended to select sites, not samples, even when
3490           applied on FORMAT tags. However, when prefixed with SMPL_ (or "s"
3491           for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector
3492           of per-sample values when applied on FORMAT tags:
3493
3494               SMPL_MAX, SMPL_MIN, SMPL_AVG, SMPL_MEAN, SMPL_MEDIAN, SMPL_STDEV, SMPL_SUM,
3495               sMAX, sMIN, sAVG, sMEAN, sMEDIAN, sSTDEV, sSUM
3496
3497       •   two-tailed binomial test. Note that for N=0 the test evaluates to a
3498           missing value and when FORMAT/GT is used to determine the vector
3499           indices, it evaluates to 1 for homozygous genotypes.
3500
3501               binom(FMT/AD)                .. GT can be used to determine the correct index
3502               binom(AD[0],AD[1])           .. or the fields can be given explicitly
3503               phred(binom())               .. the same as binom but phred-scaled
3504
3505       •   variables calculated on the fly if not present: number of alternate
3506           alleles; number of samples; count of alternate alleles; minor
3507           allele count (similar to AC, but MAC/AN never exceeds 0.5);
3508           frequency of alternate alleles (AF=AC/AN); frequency of minor
3509           alleles (MAF=MAC/AN); number of alleles in called genotypes; number
3510           of samples with missing genotype; fraction of samples with missing
3511           genotype; indel length (deletions negative, insertions positive)
3512
3513               N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
3514
3515       •   the number (N_PASS) or fraction (F_PASS) of samples which pass the
3516           expression
3517
3518               N_PASS(GQ>90 & GT!="mis") > 90
3519               F_PASS(GQ>90 & GT!="mis") > 0.9
3520
3521       •   custom perl filtering. Note that this command is not compiled in by
3522           default, see the section Optional Compilation with Perl in the
3523           INSTALL file for help and misc/demo-flt.pl for a working example.
3524           The demo defines the perl subroutine "severity" which can be
3525           invoked from the command line as follows:
3526
3527               perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
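
           Several of the variables calculated on the fly (AN, AC, AF, MAC,
           MAF, N_MISSING, F_MISSING) are simple functions of the genotypes.
           The sketch below shows the arithmetic for a bi-allelic site. It
           rests on two assumptions made for the illustration only: a sample
           counts as missing if any of its alleles is missing, and AN counts
           only the called alleles. The function name is hypothetical.

```python
def site_summary(genotypes):
    """Summarise a bi-allelic site.

    genotypes holds one tuple of allele indices per sample, with None
    marking a missing allele, e.g. (0, 1) or (None, None).
    """
    called = [a for gt in genotypes for a in gt if a is not None]
    an = len(called)                       # alleles in called genotypes
    ac = sum(1 for a in called if a > 0)   # alternate allele count
    mac = min(ac, an - ac)                 # count of the rarer allele
    n_missing = sum(1 for gt in genotypes if any(a is None for a in gt))
    return {
        "AN": an,
        "AC": ac,
        "MAC": mac,
        "AF": ac / an if an else None,
        "MAF": mac / an if an else None,
        "N_MISSING": n_missing,
        "F_MISSING": n_missing / len(genotypes) if genotypes else None,
    }
```

           For the genotypes 1/1, 1/1, 0/1 and ./. this gives AN=6, AC=5 and
           MAC=1, so AF is 5/6 while MAF is 1/6, and F_MISSING is 0.25.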

       Notes:

       •   String comparisons and regular expressions are case-insensitive.

       •   A comma in a string is interpreted as a separator, and when
           multiple values are compared, OR logic is used. Consequently, the
           first two of the following expressions are equivalent, but the
           third is not:

               -i 'TAG="hello,world"'
               -i 'TAG="hello" || TAG="world"'
               -i 'TAG="hello" && TAG="world"'

       •   Variable and function names are case-insensitive, but tag names
           are not. For example, "qual" can be used instead of "QUAL" and
           "strlen()" instead of "STRLEN()", but not "dp" instead of "DP".

       •   When querying multiple values, all elements are tested and the OR
           logic is used on the result. For example, when querying
           "TAG=1,2,3,4", it will be evaluated as follows:

               -i 'TAG[*]=1'   .. true, the record will be printed
               -i 'TAG[*]!=1'  .. true
               -e 'TAG[*]=1'   .. false, the record will be discarded
               -e 'TAG[*]!=1'  .. false
               -i 'TAG[0]=1'   .. true
               -i 'TAG[0]!=1'  .. false
               -e 'TAG[0]=1'   .. false
               -e 'TAG[0]!=1'  .. true
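
       The OR logic described in the notes above can be sketched in Python as
       follows. This is a model of the documented behaviour only, not the
       bcftools implementation; the helper names string_equal, any_element
       and is_printed are hypothetical.

```python
import operator


def string_equal(tag_values, query):
    """TAG="a,b": case-insensitive; the comma separates alternatives (OR)."""
    alternatives = {q.lower() for q in query.split(",")}
    return any(v.lower() in alternatives for v in tag_values)


def any_element(values, op, rhs, index=None):
    """TAG[*] (index=None) tests every element and ORs the results;
    TAG[i] tests only element i."""
    selected = values if index is None else [values[index]]
    return any(op(v, rhs) for v in selected)


def is_printed(option, expression_result):
    """-i prints records where the expression is true, -e where it is false."""
    return expression_result if option == "-i" else not expression_result


# The three string expressions from the notes, for a tag holding "hello".
tag = ["hello"]
comma_form = string_equal(tag, "hello,world")
or_form = string_equal(tag, "hello") or string_equal(tag, "world")
and_form = string_equal(tag, "hello") and string_equal(tag, "world")
print(comma_form, or_form, and_form)

# The TAG=1,2,3,4 table from the notes, row by row.
TAG = [1, 2, 3, 4]
rows = []
for index, label in ((None, "TAG[*]"), (0, "TAG[0]")):
    for option in ("-i", "-e"):
        for op, symbol in ((operator.eq, "="), (operator.ne, "!=")):
            result = is_printed(option, any_element(TAG, op, 1, index))
            rows.append(result)
            print(f"{option} '{label}{symbol}1'  .. {str(result).lower()}")
```

       The printed rows follow the order of the table above, which makes it
       easy to compare the model with the documented results.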

       Examples:

           MIN(DV)>5       .. selects the whole site, evaluates min across all values and samples

           SMPL_MIN(DV)>5  .. selects matching samples, evaluates within samples

           MIN(DV/DP)>0.3

           MIN(DP)>10 & MIN(DV)>3

           FMT/DP>10  & FMT/GQ>10 .. both conditions must be satisfied within one sample

           FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples

           QUAL>10 |  FMT/GQ>10   .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10

           QUAL>10 || FMT/GQ>10   .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites

           TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)

           COUNT(GT="hom")=0      .. no homozygous genotypes at the site

           AVG(GQ)>50             .. average (arithmetic mean) of genotype qualities greater than 50

           ID=@file       .. selects lines with ID present in the file

           ID!=@~/file    .. skips lines with ID present in the file ~/file

           MAF[0]<0.05    .. selects rare variants at 5% cutoff

           POS>=100   .. restricts your range query, e.g. 20:100-200, to strictly sites with POS in that range.
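
       The difference between site-level and per-sample evaluation in the
       examples above can be sketched in Python. The sketch only models the
       behaviour described in the comments next to the examples; the function
       names and the sample values are hypothetical.

```python
def min_over_site(dv_per_sample):
    """MIN(DV): minimum across all values and all samples (site-level)."""
    return min(v for values in dv_per_sample.values() for v in values)


def samples_with_min_above(dv_per_sample, threshold):
    """SMPL_MIN(DV)>threshold: evaluated within each sample separately."""
    return [name for name, values in dv_per_sample.items() if min(values) > threshold]


def and_within_sample(samples):
    """FMT/DP>10 & FMT/GQ>10: one sample must satisfy both conditions."""
    return any(s["DP"] > 10 and s["GQ"] > 10 for s in samples)


def and_across_samples(samples):
    """FMT/DP>10 && FMT/GQ>10: the conditions may be met by different samples."""
    return any(s["DP"] > 10 for s in samples) and any(s["GQ"] > 10 for s in samples)


# Hypothetical per-sample values.
dv = {"S1": [8, 9], "S2": [3, 12], "S3": [6, 7]}
samples = [{"DP": 20, "GQ": 5}, {"DP": 3, "GQ": 50}]

site_selected = min_over_site(dv) > 5
selected_samples = samples_with_min_above(dv, 5)
print(site_selected, selected_samples)
print(and_within_sample(samples), and_across_samples(samples))
```

       Here the site-level test fails because one value in S2 is below the
       threshold, while the per-sample test still selects S1 and S3.
       Likewise, no single sample has both DP>10 and GQ>10, so the "&" form
       is false although the "&&" form is true.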

       Shell expansion:

       Note that expressions must often be quoted because some characters have
       special meaning in the shell. In the following example the expression
       is enclosed in single quotes, which ensures that the whole expression
       is passed to the program as intended:

           bcftools view -i '%ID!="." & MAF[0]<0.01'

       Please refer to the documentation of your shell for details.
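
       The effect of quoting can be previewed with Python's standard shlex
       module, which applies POSIX-style word splitting and quote removal.
       This is only an approximation of a real shell: shlex does not treat
       "&" or "<" as operators, whereas a shell would additionally interpret
       them as backgrounding and redirection.

```python
import shlex

quoted = "bcftools view -i '%ID!=\".\" & MAF[0]<0.01'"
unquoted = 'bcftools view -i %ID!="." & MAF[0]<0.01'

quoted_args = shlex.split(quoted)      # the expression stays one argument
unquoted_args = shlex.split(unquoted)  # the expression falls apart

print(quoted_args)
print(unquoted_args)
```

       With single quotes the program receives the expression as a single
       argument with its inner double quotes intact; without them the
       expression is split into several words and the double quotes are
       removed.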

SCRIPTS AND OPTIONS

   plot-vcfstats [OPTIONS] file.vchk [...]
       Script for processing the output of bcftools stats. It can merge
       results from multiple outputs (useful when running the stats for each
       chromosome separately), plot graphs and create a PDF presentation.

       -m, --merge
           Merge vcfstats files to STDOUT, skip plotting.

       -p, --prefix DIR
           The output directory. This directory will be created if it does not
           exist.

       -P, --no-PDF
           Skip the PDF creation step.

       -r, --rasterize
           Rasterize PDF images for faster rendering. This is the default and
           the opposite of -v, --vectors.

       -s, --sample-names
           Use sample names for xticks rather than numeric IDs.

       -t, --title STRING
           Identify files by these titles in plots. The option can be given
           multiple times, for each ID in the bcftools stats output. If not
           present, the script will use abbreviated source file names for the
           titles.

       -v, --vectors
           Generate vector graphics for PDF images, the opposite of -r,
           --rasterize.

       -T, --main-title STRING
           Main title for the PDF.

       Example:

           # Generate the stats (file.vcf.gz is a placeholder for the input)
           bcftools stats -s - file.vcf.gz > file.vchk

           # Plot the stats
           plot-vcfstats -p outdir file.vchk

           # The final look can be customized by editing the generated
           # 'outdir/plot.py' script and re-running it manually
           cd outdir && python plot.py && pdflatex summary.tex

PERFORMANCE

       HTSlib was designed with the BCF format in mind. When parsing VCF
       files, all records are internally converted into the BCF
       representation. Simple operations, such as removing a single column
       from a VCF file, can therefore be done much faster with standard UNIX
       commands, such as awk or cut. For the same reason, it is recommended
       to use BCF as the input/output format whenever possible to avoid the
       large overhead of the VCF → BCF → VCF conversion.
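
       As an illustration of why such text-level operations are cheap, the
       following Python sketch removes one column from an uncompressed VCF by
       plain string splitting, without parsing any record into BCF. It
       mirrors what a cut-style command does; the function name drop_column
       and the tiny VCF are hypothetical.

```python
def drop_column(vcf_text, column):
    """Remove one tab-separated column (0-based) from the header line and
    from every record; '##' meta-information lines are copied unchanged."""
    out = []
    for line in vcf_text.splitlines():
        if line.startswith("##"):
            out.append(line)
        else:
            fields = line.split("\t")
            del fields[column]
            out.append("\t".join(fields))
    return "\n".join(out) + "\n"


# A tiny hypothetical VCF with two samples.
vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "20\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n"
)

result = drop_column(vcf, 10)   # drops the column of sample S2
print(result, end="")
```

       The records are never interpreted, only split on tabs, which is why
       this kind of edit avoids the conversion overhead described above.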

BUGS

       Please report any bugs you encounter on the GitHub website:
       http://github.com/samtools/bcftools

AUTHORS

       Heng Li from the Sanger Institute wrote the original C version of
       htslib, samtools and bcftools. Bob Handsaker from the Broad Institute
       implemented the BGZF library. Petr Danecek, Shane McCarthy and John
       Marshall are maintaining and further developing bcftools. Many other
       people contributed to the program and to the file format
       specifications, both directly and indirectly by providing patches,
       testing and reporting bugs. We thank them all.

RESOURCES

       BCFtools GitHub website: http://github.com/samtools/bcftools

       Samtools GitHub website: http://github.com/samtools/samtools

       HTSlib GitHub website: http://github.com/samtools/htslib

       File format specifications: http://samtools.github.io/hts-specs

       BCFtools documentation: http://samtools.github.io/bcftools

       BCFtools wiki page: https://github.com/samtools/bcftools/wiki

COPYING

       The MIT/Expat License or GPL License, see the LICENSE document for
       details. Copyright (c) Genome Research Ltd.



                                  2022-04-07                       BCFTOOLS(1)