BCFTOOLS(1) BCFTOOLS(1)

6 bcftools - utilities for variant calling and manipulating VCFs and
7 BCFs.
8
10 bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]
11
13 BCFtools is a set of utilities that manipulate variant calls in the
14 Variant Call Format (VCF) and its binary counterpart BCF. All commands
15 work transparently with both VCFs and BCFs, both uncompressed and
16 BGZF-compressed.
17
18 Most commands accept VCF, bgzipped VCF and BCF with filetype detected
19 automatically even when streaming from a pipe. Indexed VCF and BCF will
20 work in all situations. Un-indexed VCF and BCF and streams will work in
21 most, but not all situations. In general, whenever multiple VCFs are
22 read simultaneously, they must be indexed and therefore also
23 compressed. (Note that files with non-standard index names can be
24 accessed as e.g. "bcftools view -r X:2928329
25 file.vcf.gz##idx##non-standard-index-name".)
26
27 BCFtools is designed to work on a stream. It regards an input file "-"
28 as the standard input (stdin) and outputs to the standard output
29 (stdout). Several commands can thus be combined with Unix pipes.
30
31 VERSION
32 This manual page was last updated 2021-07-07 and refers to bcftools git
33 version 1.13.
34
35 BCF1
36 The BCF1 format output by versions of samtools <= 0.1.19 is not
37 compatible with this version of bcftools. To read BCF1 files one can
38 use the view command from old versions of bcftools packaged with
39 samtools versions <= 0.1.19 to convert to VCF, which can then be read
40 by this version of bcftools.
41
42 samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view
43
44 VARIANT CALLING
45 See bcftools call for variant calling from the output of the samtools
46 mpileup command. In versions of samtools <= 0.1.19 calling was done
47 with bcftools view. Users are now required to choose between the old
48 samtools calling model (-c/--consensus-caller) and the new multiallelic
49 calling model (-m/--multiallelic-caller). The multiallelic calling
50 model is recommended for most tasks.
51
53 For a full list of available commands, run bcftools without arguments.
54 For a full list of available options, run bcftools COMMAND without
55 arguments.
56
57 • annotate .. edit VCF files, add or remove annotations
58
59 • call .. SNP/indel calling (former "view")
60
61 • cnv .. Copy Number Variation caller
62
63 • concat .. concatenate VCF/BCF files from the same set of samples
64
65 • consensus .. create consensus sequence by applying VCF variants
66
67 • convert .. convert VCF/BCF to other formats and back
68
69 • csq .. haplotype aware consequence caller
70
71 • filter .. filter VCF/BCF files using fixed thresholds
72
73 • gtcheck .. check sample concordance, detect sample swaps and
74 contamination
75
76 • index .. index VCF/BCF
77
78 • isec .. intersections of VCF/BCF files
79
• merge .. merge VCF/BCF files from non-overlapping sample
sets
82
83 • mpileup .. multi-way pileup producing genotype likelihoods
84
85 • norm .. normalize indels
86
87 • plugin .. run user-defined plugin
88
89 • polysomy .. detect contaminations and whole-chromosome
90 aberrations
91
92 • query .. transform VCF/BCF into user-defined formats
93
94 • reheader .. modify VCF/BCF header, change sample names
95
96 • roh .. identify runs of homo/auto-zygosity
97
98 • sort .. sort VCF/BCF files
99
100 • stats .. produce VCF/BCF stats (former vcfcheck)
101
102 • view .. subset, filter and convert VCF and BCF files
103
105 Some helper scripts are bundled with the bcftools code.
106
107 • plot-vcfstats .. plots the output of stats
108
110 Common Options
111 The following options are common to many bcftools commands. See usage
112 for specific commands to see if they apply.
113
114 FILE
Files can be either VCF or BCF, uncompressed or BGZF-compressed. The
file "-" is interpreted as standard input. Some tools may require
tabix- or CSI-indexed files.
118
119 -c, --collapse snps|indels|both|all|some|none|id
120 Controls how to treat records with duplicate positions and defines
121 compatible records across multiple input files. Here by
122 "compatible" we mean records which should be considered as
123 identical by the tools. For example, when performing line
124 intersections, the desire may be to consider as identical all sites
125 with matching positions (bcftools isec -c all), or only sites with
126 matching variant type (bcftools isec -c snps -c indels), or only
127 sites with all alleles identical (bcftools isec -c none).
128
129 none
130 only records with identical REF and ALT alleles are compatible
131
132 some
133 only records where some subset of ALT alleles match are
134 compatible
135
136 all
137 all records are compatible, regardless of whether the ALT
138 alleles match or not. In the case of records with the same
139 position, only the first will be considered and appear on
140 output.
141
142 snps
143 any SNP records are compatible, regardless of whether the ALT
144 alleles match or not. For duplicate positions, only the first
145 SNP record will be considered and appear on output.
146
147 indels
148 all indel records are compatible, regardless of whether the
149 REF and ALT alleles match or not. For duplicate positions, only
150 the first indel record will be considered and appear on output.
151
152 both
153 abbreviation of "-c indels -c snps"
154
155 id
156 only records with identical ID column are compatible. Supported
157 by bcftools merge only.
158
159 -f, --apply-filters LIST
160 Skip sites where FILTER column does not contain any of the strings
161 listed in LIST. For example, to include only sites which have no
162 filters set, use -f .,PASS.
163
164 --no-version
165 Do not append version and command line information to the output
166 VCF header.
167
168 -o, --output FILE
169 When output consists of a single stream, write it to FILE rather
170 than to standard output, where it is written by default.
171
172 -O, --output-type b|u|z|v
173 Output compressed BCF (b), uncompressed BCF (u), compressed VCF
174 (z), uncompressed VCF (v). Use the -Ou option when piping between
175 bcftools subcommands to speed up performance by removing
176 unnecessary compression/decompression and VCF←→BCF conversion.
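
As an illustration of the output-type codes, the sketch below picks the -O value from an output file name and shows where -Ou fits in a pipeline. The helper name output_type_for and all file names are hypothetical and are not part of bcftools. The pipeline is shown only as a comment.

```shell
#!/bin/sh
# Hypothetical helper (not part of bcftools): map an output file name to
# the -O code described above.
output_type_for() {
    case "$1" in
        *.vcf.gz) echo z ;;   # compressed VCF
        *.vcf)    echo v ;;   # uncompressed VCF
        *.bcf)    echo b ;;   # compressed BCF
        *)        echo u ;;   # anything else, e.g. a pipe: uncompressed BCF
    esac
}

out=calls.vcf.gz              # made-up output name
otype=$(output_type_for "$out")
echo "final stage would use: -O$otype -o $out"

# Intermediate stages pass uncompressed BCF (-Ou); only the last stage
# writes the requested type:
#   bcftools view -Ou -s sample1,sample2 file.vcf \
#     | bcftools view -f .,PASS -O"$otype" -o "$out"
```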
177
178 -r, --regions chr|chr:pos|chr:beg-end|chr:beg-[,...]
179 Comma-separated list of regions, see also -R, --regions-file.
180 Overlapping records are matched even when the starting coordinate
181 is outside of the region, unlike the -t/-T options where only the
182 POS coordinate is checked. Note that -r cannot be used in
183 combination with -R.
184
185 -R, --regions-file FILE
186 Regions can be specified either on command line or in a VCF, BED,
187 or tab-delimited file (the default). The columns of the
188 tab-delimited file can contain either positions (two-column format)
189 or intervals (three-column format): CHROM, POS, and, optionally,
190 END, where positions are 1-based and inclusive. The columns of the
191 tab-delimited BED file are also CHROM, POS and END (trailing
192 columns are ignored), but coordinates are 0-based, half-open. To
193 indicate that a file be treated as BED rather than the 1-based
194 tab-delimited file, the file must have the ".bed" or ".bed.gz"
195 suffix (case-insensitive). Uncompressed files are stored in memory,
196 while bgzip-compressed and tabix-indexed region files are streamed.
197 Note that sequence names must match exactly, "chr20" is not the
198 same as "20". Also note that chromosome ordering in FILE will be
199 respected, the VCF will be processed in the order in which
200 chromosomes first appear in FILE. However, within chromosomes, the
201 VCF will always be processed in ascending genomic coordinate order
202 no matter what order they appear in FILE. Note that overlapping
203 regions in FILE can result in duplicated out of order positions in
204 the output. This option requires indexed VCF/BCF files. Note that
205 -R cannot be used in combination with -r.
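
The two coordinate conventions described above differ only in the start column. The sketch below converts a 1-based, inclusive tab-delimited region file into the equivalent 0-based, half-open BED file. The file names and coordinates are invented for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# 1-based, inclusive regions: CHROM, POS, END (made-up coordinates)
printf '20\t1000\t2000\n20\t5001\t5001\n' > "$workdir/regions.tsv"

# BED is 0-based and half-open: subtract 1 from the start, keep the end.
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3 }' \
    "$workdir/regions.tsv" > "$workdir/regions.bed"

cat "$workdir/regions.bed"

# Either file selects the same bases; the ".bed" suffix is what tells
# bcftools to read the second one as BED:
#   bcftools view -R regions.tsv file.vcf.gz
#   bcftools view -R regions.bed file.vcf.gz
```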
206
207 -s, --samples [^]LIST
208 Comma-separated list of samples to include or exclude if prefixed
209 with "^". The sample order is updated to reflect that given on
210 the command line. Note that in general tags such as INFO/AC,
211 INFO/AN, etc are not updated to correspond to the subset samples.
212 bcftools view is the exception where some tags will be updated
213 (unless the -I, --no-update option is used; see bcftools view
214 documentation). To use updated tags for the subset in another
215 command one can pipe from view into that command. For example:
216
bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f '%INFO/AC\t%INFO/AN\n'
218
219 -S, --samples-file FILE
220 File of sample names to include or exclude if prefixed with
221 "^". One sample per line. See also the note above for the -s,
222 --samples option. The sample order is updated to reflect that given
223 in the input file. The command bcftools call accepts an optional
224 second column indicating ploidy (0, 1 or 2) or sex (as defined by
225 --ploidy, for example "F" or "M"), for example:
226
227 sample1 1
228 sample2 2
229 sample3 2
230
231 or
232
233 sample1 M
234 sample2 F
235 sample3 F
236
If the second column is not present, the sex "F" is assumed. With
bcftools call -C trio, a PED file is expected. The program ignores the
first column, and the last column indicates sex (1=male, 2=female), for
example:
241
242 ignored_column daughterA fatherA motherA 2
243 ignored_column sonB fatherB motherB 1
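
A samples file with a sex column can be written with a single printf. The sketch below is illustrative only: the sample names, the input file calls.raw.bcf and the choice of the GRCh37 predefined ploidy are assumptions. The bcftools step runs only when both the program and the input file are present.

```shell
#!/bin/sh
workdir=$(mktemp -d)
samples="$workdir/samples.txt"

# One sample per line: name, then sex code (made-up samples).
printf '%s %s\n' sample1 M sample2 F sample3 F > "$samples"
cat "$samples"

# Assumed input: calls.raw.bcf produced earlier by bcftools mpileup.
if command -v bcftools >/dev/null 2>&1 && [ -f calls.raw.bcf ]; then
    bcftools call -m --ploidy GRCh37 -S "$samples" -Ov -o calls.vcf calls.raw.bcf
fi
```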
244
245 -t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]
Similar to -r, --regions, but the next position is accessed by
247 streaming the whole VCF/BCF rather than using the tbi/csi index.
248 Both -r and -t options can be applied simultaneously: -r uses the
249 index to jump to a region and -t discards positions which are
250 not in the targets. Unlike -r, targets can be prefixed with "^"
251 to request logical complement. For example, "^X,Y,MT" indicates
252 that sequences X, Y and MT should be skipped. Yet another
253 difference between the -t/-T and -r/-R is that -r/-R checks for
254 proper overlaps and considers both POS and the end position of an
255 indel, while -t/-T considers the POS coordinate only. Note that -t
256 cannot be used in combination with -T.
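
The difference between the overlap test of -r/-R and the POS-only test of -t/-T can be emulated on a few toy records. The awk code below is not bcftools code. It assumes that a record spans POS to POS+length(REF)-1, and the records are invented.

```shell
#!/bin/sh
# Toy records: CHROM POS REF ALT. The deletion at POS 100 spans 100-104.
records='20 100 ACGTA A
20 103 G T
20 111 C G'
beg=102
end=110

# -r/-R style: keep a record if its span overlaps [beg, end].
by_region=$(printf '%s\n' "$records" | awk -v b="$beg" -v e="$end" '
    { rec_end = $2 + length($3) - 1
      if ($2 <= e && rec_end >= b) print $2 }' | tr '\n' ' ')

# -t/-T style: keep a record only if POS itself lies in [beg, end].
by_target=$(printf '%s\n' "$records" | awk -v b="$beg" -v e="$end" '
    $2 >= b && $2 <= e { print $2 }' | tr '\n' ' ')

echo "overlap test (-r style):  $by_region"
echo "POS-only test (-t style): $by_target"
```

The deletion starting at position 100 is kept by the overlap test but dropped by the POS-only test, because its start lies before the requested interval.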
257
258 -T, --targets-file [^]FILE
Same as -t, --targets, but reads regions from a file. Note that -T
260 cannot be used in combination with -t.
261
With the call -C alleles command, the third column of the targets
file must be a comma-separated list of alleles, starting with the
reference allele. Note that the file must be compressed and
indexed. Such a file can be easily created from a VCF using:
266
267 bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz
268
269 --threads INT
270 Use multithreading with INT worker threads. The option is currently
271 used only for the compression of the output stream, only when
272 --output-type is b or z. Default: 0.
273
274 bcftools annotate [OPTIONS] FILE
275 Add or remove annotations.
276
277 -a, --annotations file
278 Bgzip-compressed and tabix-indexed file with annotations. The file
279 can be VCF, BED, or a tab-delimited file with mandatory columns
280 CHROM, POS (or, alternatively, FROM and TO), optional columns REF
281 and ALT, and arbitrary number of annotation columns. BED files are
282 expected to have the ".bed" or ".bed.gz" suffix (case-insensitive),
283 otherwise a tab-delimited file is assumed. Note that in case of
284 tab-delimited file, the coordinates POS, FROM and TO are one-based
285 and inclusive. When REF and ALT are present, only matching VCF
286 records will be annotated. When multiple ALT alleles are present in
287 the annotation file (given as comma-separated list of alleles), at
288 least one must match one of the alleles in the corresponding VCF
289 record. Similarly, at least one alternate allele from a
290 multi-allelic VCF record must be present in the annotation file.
291 Missing values can be added by providing "." in place of actual
292 value. Note that flag types, such as "INFO/FLAG", can be annotated
293 by including a field with the value "1" to set the flag, "0" to
294 remove it, or "." to keep existing flags. See also -c, --columns
295 and -h, --header-lines.
296
297 # Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG
298 1 752566 SomeString 5
299 1 798959 SomeOtherString 6
300
301 --collapse snps|indels|both|all|some|none
302 Controls how to match records from the annotation file to the
303 target VCF. Effective only when -a is a VCF or BCF. See Common
304 Options for more.
305
306 -c, --columns list
Comma-separated list of columns or tags to carry over from the
annotation file (see also -a, --annotations). If the annotation
file is not a VCF/BCF, list describes the columns of the annotation
file and must include CHROM, POS (or, alternatively, FROM and TO),
and optionally REF and ALT. Unused columns which should be ignored
can be indicated by "-".

If the annotation file is a VCF/BCF, only the edited columns/tags
must be present and their order does not matter. The columns ID,
QUAL, FILTER, INFO and FORMAT can be edited, where INFO tags can be
written both as "INFO/TAG" or simply "TAG", and FORMAT tags can be
written as "FORMAT/TAG" or "FMT/TAG". The imported VCF annotations
can be renamed as "DST_TAG:=SRC_TAG" or "FMT/DST_TAG:=FMT/SRC_TAG".

To carry over all INFO annotations, use "INFO". To add all INFO
annotations except "TAG", use "^INFO/TAG". By default, existing
values are replaced.

To add annotations without overwriting existing values (that is, to
add missing tags or add values to existing tags with missing
values), use "+TAG" instead of "TAG". To append to existing values
(rather than replacing or leaving untouched), use "=TAG" (instead
of "TAG" or "+TAG"). To replace only existing values without
modifying missing annotations, use "-TAG". To match the record also
by ID, in addition to REF and ALT, use "~ID".

If the annotation file is not a VCF/BCF, all new annotations must
be defined via -h, --header-lines.

See also the -l, --merge-logic option.
330
331 -C, --columns-file file
332 Read the list of columns from a file (normally given via the -c,
333 --columns option). "-" to skip a column of the annotation file. One
334 column name per row, an additional space- or tab-separated field
335 can be present to indicate the merge logic (normally given via the
336 -l, --merge-logic option). This is useful when many annotations are
337 added at once.
338
339 -e, --exclude EXPRESSION
340 exclude sites for which EXPRESSION is true. For valid expressions
341 see EXPRESSIONS.
342
343 --force
344 continue even when parsing errors, such as undefined tags, are
345 encountered. Note this can be an unsafe operation and can result in
346 corrupted BCF files. If this option is used, make sure to sanity
347 check the result thoroughly.
348
349 -h, --header-lines file
350 Lines to append to the VCF header, see also -c, --columns and -a,
351 --annotations. For example:
352
353 ##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
354 ##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
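
The annotation file and the header file from the examples above can be prepared as follows. This is a sketch: the target file name file.vcf is an assumption, and the indexing and annotation steps run only when bgzip, tabix, bcftools and the target file are all available.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Tab-delimited annotations: CHROM, POS, STRING_TAG, NUMERIC_TAG
printf '1\t752566\tSomeString\t5\n1\t798959\tSomeOtherString\t6\n' \
    > "$workdir/annots.tab"

# Header lines defining the two new INFO tags
cat > "$workdir/annots.hdr" <<'EOF'
##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
EOF

# Compress, index and annotate (file.vcf is a hypothetical target file).
if command -v bcftools >/dev/null 2>&1 && command -v bgzip >/dev/null 2>&1 \
   && command -v tabix >/dev/null 2>&1 && [ -f file.vcf ]; then
    bgzip -c "$workdir/annots.tab" > "$workdir/annots.tab.gz"
    tabix -s1 -b2 -e2 "$workdir/annots.tab.gz"
    bcftools annotate -a "$workdir/annots.tab.gz" -h "$workdir/annots.hdr" \
        -c CHROM,POS,STRING_TAG,NUMERIC_TAG file.vcf
fi
```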
355
356 -I, --set-id [+]FORMAT
357 assign ID on the fly. The format is the same as in the query
358 command (see below). By default all existing IDs are replaced. If
359 the format string is preceded by "+", only missing IDs will be set.
360 For example, one can use
361
362 bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf
363
364 -i, --include EXPRESSION
365 include only sites for which EXPRESSION is true. For valid
366 expressions see EXPRESSIONS.
367
368 -k, --keep-sites
369 keep sites which do not pass -i and -e expressions instead of
370 discarding them
371
372 -l, --merge-logic
373 tag:first|append|append-missing|unique|sum|avg|min|max[,...]
When multiple regions overlap a single record, this option defines
how to treat multiple annotation values when setting tag in the
destination file: use the first encountered value ignoring the rest
(first); append allowing duplicates (append); append even if the
appended value is missing, i.e. is a dot (append-missing); append
discarding duplicate values (unique); sum the values (sum, numeric
fields only); average the values (avg); use the minimum value (min)
or the maximum (max).

Note that this option is intended for use with BED or TAB-delimited
annotation files only. Moreover, it is effective only when either
REF and ALT or BEG and END --columns are present.

Multiple rules can be given either as a comma-separated list or by
giving the option multiple times. This is an experimental feature.
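
The effect of the individual merge rules can be pictured with a small awk emulation. This is not bcftools code: the helper name merge_values and the input values are made up. It only shows how several numeric values arriving for one record would be combined under each rule (append-missing is left out).

```shell
#!/bin/sh
# Hypothetical helper: combine the given numeric values under one rule.
merge_values() {
    rule=$1
    shift
    printf '%s\n' "$@" | awk -v rule="$rule" '
        {
            n++
            if (n == 1) { first = $1; min = $1; max = $1 }
            if ($1 + 0 < min + 0) min = $1
            if ($1 + 0 > max + 0) max = $1
            sum += $1
            appended = (n == 1) ? $1 : (appended "," $1)
            if (!($1 in seen)) {
                seen[$1] = 1
                uniq = (uniq == "") ? $1 : (uniq "," $1)
            }
        }
        END {
            if (rule == "first")       print first
            else if (rule == "append") print appended
            else if (rule == "unique") print uniq
            else if (rule == "sum")    print sum
            else if (rule == "avg")    print sum / n
            else if (rule == "min")    print min
            else if (rule == "max")    print max
        }'
}

# Four overlapping annotation values for the same record (made up).
for rule in first append unique sum avg min max; do
    printf '%s: %s\n' "$rule" "$(merge_values "$rule" 4 6 4 6)"
done
```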
387
388 -m, --mark-sites TAG
389 annotate sites which are present ("+") or absent ("-") in the -a
390 file with a new INFO/TAG flag
391
392 --no-version
393 see Common Options
394
395 -o, --output FILE
396 see Common Options
397
398 -O, --output-type b|u|z|v
399 see Common Options
400
401 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
402 see Common Options
403
404 -R, --regions-file file
405 see Common Options
406
407 --rename-annots file
rename annotations according to the map in file, with "old_name
new_name\n" pairs separated by whitespace, each on a separate
line. The old name must be prefixed with the annotation type: INFO,
FORMAT, or FILTER.
412
413 --rename-chrs file
rename chromosomes according to the map in file, with "old_name
new_name\n" pairs separated by whitespace, each on a separate
line.
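
A chromosome map in this format can be generated with a short loop. The sketch below assumes the common case of adding a "chr" prefix to chromosomes 1-22, X and Y. The file names are hypothetical, and the bcftools step runs only if the program and the input file exist.

```shell
#!/bin/sh
workdir=$(mktemp -d)
map="$workdir/chr_map.txt"

# One "old_name new_name" pair per line: 1 -> chr1, ..., Y -> chrY
for c in $(seq 1 22) X Y; do
    printf '%s chr%s\n' "$c" "$c"
done > "$map"

head -n 3 "$map"

# input.vcf.gz is an assumed input file.
if command -v bcftools >/dev/null 2>&1 && [ -f input.vcf.gz ]; then
    bcftools annotate --rename-chrs "$map" -Oz -o renamed.vcf.gz input.vcf.gz
fi
```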
417
418 -s, --samples [^]LIST
419 subset of samples to annotate, see also Common Options
420
421 -S, --samples-file FILE
subset of samples to annotate. If the samples are named differently
in the target VCF and the -a, --annotations VCF, the name mapping
can be given as "src_name dst_name\n", separated by whitespace,
each pair on a separate line.
426
427 --single-overlaps
428 use this option to keep memory requirements low with very large
429 annotation files. Note, however, that this comes at a cost, only
430 single overlapping intervals are considered in this mode. This was
431 the default mode until the commit af6f0c9 (Feb 24 2019).
432
433 --threads INT
434 see Common Options
435
436 -x, --remove list
437 List of annotations to remove. Use "FILTER" to remove all filters
438 or "FILTER/SomeFilter" to remove a specific filter. Similarly,
439 "INFO" can be used to remove all INFO tags and "FORMAT" to remove
440 all FORMAT tags except GT. To remove all INFO tags except "FOO" and
441 "BAR", use "^INFO/FOO,INFO/BAR" (and similarly for FORMAT and
442 FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".
443
444 Examples:
445
446 # Remove three fields
447 bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz
448
449 # Remove all INFO fields and all FORMAT fields except for GT and PL
450 bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf
451
452 # Add ID, QUAL and INFO/TAG, not replacing TAG if already present
453 bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf
454
455 # Carry over all INFO and FORMAT annotations except FORMAT/GT
456 bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
457
458 # Annotate from a tab-delimited file with six columns (the fifth is ignored),
459 # first indexing with tabix. The coordinates are 1-based.
460 tabix -s1 -b2 -e2 annots.tab.gz
461 bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf
462
463 # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
464 tabix -s1 -b2 -e3 annots.tab.gz
465 bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
466
467 # Annotate from a bed file (0-based coordinates, half-closed, half-open intervals)
468 bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
469
470 # Transfer the INFO/END tag, matching by POS,REF,ALT and ID. This example assumes
471 # that INFO/END is already present in the VCF header.
472 bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,INFO/END input.vcf
473
474 # For more examples see http://samtools.github.io/bcftools/howtos/annotate.html
475
476 bcftools call [OPTIONS] FILE
477 This command replaces the former bcftools view caller. Some of the
478 original functionality has been temporarily lost in the process of
479 transition under htslib <http://github.com/samtools/htslib>, but will
480 be added back on popular demand. The original calling model can be
481 invoked with the -c option.
482
483 File format options:
484 --no-version
485 see Common Options
486
487 -o, --output FILE
488 see Common Options
489
490 -O, --output-type b|u|z|v
491 see Common Options
492
493 --ploidy ASSEMBLY[?]
494 predefined ploidy, use list (or any other unused word) to print a
495 list of all predefined assemblies. Append a question mark to print
496 the actual definition. See also --ploidy-file.
497
498 --ploidy-file FILE
499 ploidy definition given as a space/tab-delimited list of CHROM,
500 FROM, TO, SEX, PLOIDY. The SEX codes are arbitrary and correspond
501 to the ones used by --samples-file. The default ploidy can be given
502 using the starred records (see below), unlisted regions have ploidy
503 2. The default ploidy definition is
504
505 X 1 60000 M 1
506 X 2699521 154931043 M 1
507 Y 1 59373566 M 1
508 Y 1 59373566 F 0
509 MT 1 16569 M 1
510 MT 1 16569 F 1
511 * * * M 2
512 * * * F 2
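
To show how such a table is read, the sketch below writes the default definition to a file and looks up the ploidy for a given chromosome, position and sex. The awk lookup is an illustration of the rules stated above, not the bcftools implementation. The helper name lookup_ploidy is hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)
ploidy_file="$workdir/ploidy.txt"

# CHROM, FROM, TO, SEX, PLOIDY (the default definition listed above)
cat > "$ploidy_file" <<'EOF'
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
* * * M 2
* * * F 2
EOF

# Hypothetical helper: an explicit region wins, then the starred default
# for that sex, then ploidy 2 for anything unlisted.
lookup_ploidy() {
    awk -v chr="$1" -v pos="$2" -v sex="$3" '
        $4 != sex { next }
        $1 == "*" { dflt = $5; next }
        $1 == chr && pos + 0 >= $2 + 0 && pos + 0 <= $3 + 0 { hit = $5 }
        END {
            if (hit != "")       print hit
            else if (dflt != "") print dflt
            else                 print 2
        }' "$ploidy_file"
}

echo "X:5000000 M -> $(lookup_ploidy X 5000000 M)"
echo "X:100000  M -> $(lookup_ploidy X 100000 M)"
echo "Y:1000    F -> $(lookup_ploidy Y 1000 F)"

# The same file could then be passed to the caller, for example:
#   bcftools call -m --ploidy-file ploidy.txt -S samples.txt in.bcf
```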
513
514 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
515 see Common Options
516
517 -R, --regions-file file
518 see Common Options
519
520 -s, --samples LIST
521 see Common Options
522
523 -S, --samples-file FILE
524 see Common Options
525
526 -t, --targets LIST
527 see Common Options
528
529 -T, --targets-file FILE
530 see Common Options
531
532 --threads INT
533 see Common Options
534
535 Input/output options:
536 -A, --keep-alts
537 output all alternate alleles present in the alignments even if they
538 do not appear in any of the genotypes
539
540 -f, --format-fields list
541 comma-separated list of FORMAT fields to output for each sample.
542 Currently GQ and GP fields are supported. For convenience, the
fields can be given as lower case letters. A tag prefixed with "^"
is removed from the output; this is meant for auxiliary tags that
are useful only for calling.
546
547 -F, --prior-freqs AN,AC
548 take advantage of prior knowledge of population allele frequencies.
549 The workflow looks like this:
550
# Extract AN,AC values from an existing VCF, such as 1000Genomes
552 bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz
553
554 # If the tags AN,AC are not already present, use the +fill-tags plugin
555 bcftools +fill-tags 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz
556 tabix -s1 -b2 -e2 AFs.tab.gz
557
558 # Create a VCF header description, here we name the tags REF_AN,REF_AC
559 cat AFs.hdr
560 ##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes">
561 ##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele">
562
563 # Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies
564 bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]
565
566 -G, --group-samples FILE|-
567 by default, all samples are assumed to come from a single
population. This option groups samples into populations and applies
the HWE assumption within but not across the populations.
570 FILE is a tab-delimited text file with sample names in the first
571 column and group names in the second column. If - is given instead,
572 no HWE assumption is made at all and single-sample calling is
573 performed. (Note that in low coverage data this inflates the rate
574 of false positives.) The -G option requires the presence of
575 per-sample FORMAT/QS or FORMAT/AD tag generated with bcftools
576 mpileup -a QS (or -a AD).
577
578 -g, --gvcf INT
579 output also gVCF blocks of homozygous REF calls. The parameter INT
580 is the minimum per-sample depth required to include a site in the
581 non-variant block.
582
583 -i, --insert-missed INT
584 output also sites missed by mpileup but present in -T,
585 --targets-file.
586
587 -M, --keep-masked-ref
588 output sites where REF allele is N
589
590 -V, --skip-variants snps|indels
591 skip indel/SNP sites
592
593 -v, --variants-only
594 output variant sites only
595
596 Consensus/variant calling options:
597 -c, --consensus-caller
598 the original samtools/bcftools calling method (conflicts with -m)
599
600 -C, --constrain alleles|trio
601
602 alleles
603 call genotypes given alleles. See also -T, --targets-file.
604
605 trio
606 call genotypes given the father-mother-child constraint. See
607 also -s, --samples and -n, --novel-rate.
608
609 -m, --multiallelic-caller
610 alternative model for multiallelic and rare-variant calling
611 designed to overcome known limitations in -c calling model
612 (conflicts with -c)
613
614 -n, --novel-rate float[,...]
615 likelihood of novel mutation for constrained -C trio calling. The
616 trio genotype calling maximizes likelihood of a particular
617 combination of genotypes for father, mother and the child
618 P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) * (1-Pn).
619 By providing three values, the mutation rate Pn is set explicitly
620 for SNPs, deletions and insertions, respectively. If two values are
621 given, the first is interpreted as the mutation rate of SNPs and
622 the second is used to calculate the mutation rate of indels
623 according to their length as Pn=float*exp(-a-b*len), where
624 a=22.8689, b=0.2994 for insertions and a=21.9313, b=0.2856 for
625 deletions [pubmed:23975140]. If only one value is given, the same
626 mutation rate Pn is used for SNPs and indels.
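
The length-dependent indel rate in the two-value case can be evaluated directly from the formula above. The sketch below is illustrative: the helper name indel_novel_rate is hypothetical, and the value 1 is used for float so that the output shows only the length-dependent factor exp(-a-b*len).

```shell
#!/bin/sh
# Hypothetical helper: Pn = float * exp(-a - b*len), with the constants
# quoted above for insertions ("ins") and deletions (anything else).
indel_novel_rate() {
    awk -v f="$1" -v kind="$2" -v len="$3" 'BEGIN {
        if (kind == "ins") { a = 22.8689; b = 0.2994 }
        else               { a = 21.9313; b = 0.2856 }
        printf "%.6e\n", f * exp(-a - b * len)
    }'
}

for len in 1 5 10; do
    printf 'len=%s ins=%s del=%s\n' "$len" \
        "$(indel_novel_rate 1 ins "$len")" \
        "$(indel_novel_rate 1 del "$len")"
done
```

With these constants the rate falls as the indel gets longer, and for the same length the value for deletions is higher than for insertions.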
627
628 -p, --pval-threshold float
629 with -c, accept variant if P(ref|D) < float.
630
631 -P, --prior float
632 expected substitution rate, or 0 to disable the prior. Only with
633 -m.
634
635 -t, --targets file|chr|chr:pos|chr:from-to|chr:from-[,...]
636 see Common Options
637
638 -X, --chromosome-X
639 haploid output for male samples (requires PED file with -s)
640
641 -Y, --chromosome-Y
642 haploid output for males and skips females (requires PED file with
643 -s)
644
645 bcftools cnv [OPTIONS] FILE
Copy number variation caller, requires a VCF annotated with
Illumina’s B-allele frequency (BAF) and Log R Ratio intensity (LRR)
648 values. The HMM considers the following copy number states: CN 2
649 (normal), 1 (single-copy loss), 0 (complete loss), 3 (single-copy
650 gain).
651
652 General Options:
653 -c, --control-sample string
654 optional control sample name. If given, pairwise calling is
655 performed and the -P option can be used
656
657 -f, --AF-file file
658 read allele frequencies from a tab-delimited file with the columns
659 CHR,POS,REF,ALT,AF
660
661 -o, --output-dir path
662 output directory
663
664 -p, --plot-threshold float
665 call matplotlib to produce plots for chromosomes with quality at
666 least float, useful for visual inspection of the calls. With -p 0,
667 plots for all chromosomes will be generated. If not given, a
668 matplotlib script will be created but not called.
669
670 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
671 see Common Options
672
673 -R, --regions-file file
674 see Common Options
675
676 -s, --query-sample string
677 query sample name
678
679 -t, --targets LIST
680 see Common Options
681
682 -T, --targets-file FILE
683 see Common Options
684
685 HMM Options:
686 -a, --aberrant float[,float]
687 fraction of aberrant cells in query and control. The hallmark of
688 duplications and contaminations is the BAF value of heterozygous
689 markers which is dependent on the fraction of aberrant cells.
690 Sensitivity to smaller fractions of cells can be increased by
691 setting -a to a lower value. Note however, that this comes at the
692 cost of increased false discovery rate.
693
694 -b, --BAF-weight float
695 relative contribution from BAF
696
697 -d, --BAF-dev float[,float]
698 expected BAF deviation in query and control, i.e. the noise
699 observed in the data.
700
701 -e, --err-prob float
702 uniform error probability
703
704 -l, --LRR-weight float
relative contribution from LRR. With noisy data, this option can
have a big effect on the number of calls produced. In truly random
707 noise (such as in simulated data), the value should be set high
708 (1.0), but in the presence of systematic noise when LRR are not
709 informative, lower values result in cleaner calls (0.2).
710
711 -L, --LRR-smooth-win int
712 reduce LRR noise by applying moving average given this window size
713
714 -O, --optimize float
715 iteratively estimate the fraction of aberrant cells, down to the
716 given fraction. Lowering this value from the default 1.0 to say,
717 0.3, can help discover more events but also increases noise
718
719 -P, --same-prob float
720 the prior probability of the query and the control sample being the
721 same. Setting to 0 calls both independently, setting to 1 forces
722 the same copy number state in both.
723
724 -x, --xy-prob float
725 the HMM probability of transition to another copy number state.
Increasing this value leads to smaller and more frequent calls.
727
728 bcftools concat [OPTIONS] FILE1 FILE2 [...]
729 Concatenate or combine VCF/BCF files. All source files must have the
730 same sample columns appearing in the same order. Can be used, for
731 example, to concatenate chromosome VCFs into one VCF, or combine a SNP
732 VCF and an indel VCF into one. The input files must be sorted by chr
733 and position. The files must be given in the correct order to produce
734 sorted VCF on output unless the -a, --allow-overlaps option is
specified. With the --naive option, the files are concatenated without
being recompressed, which is very fast.
737
738 -a, --allow-overlaps
739 First coordinate of the next file can precede last record of the
740 current file.
741
742 -c, --compact-PS
743 Do not output PS tag at each site, only at the start of a new phase
744 set block.
745
746 -d, --rm-dups snps|indels|both|all|exact
747 Output duplicate records of specified type present in multiple
748 files only once. Requires -a, --allow-overlaps.
749
750 -D, --remove-duplicates
751 Alias for -d exact
752
753 -f, --file-list FILE
754 Read file names from FILE, one file name per line.
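
Because the input order matters, it can help to write the file list explicitly rather than rely on shell globbing, which sorts chr10 before chr2. The sketch below assumes per-chromosome files named chr1.vcf.gz through chrY.vcf.gz and an intended order of 1-22, X, Y. These names are hypothetical, and the concat step runs only when bcftools and the first file are present.

```shell
#!/bin/sh
workdir=$(mktemp -d)
list="$workdir/files.txt"

# One file name per line, in the order the chromosomes should appear.
for c in $(seq 1 22) X Y; do
    printf 'chr%s.vcf.gz\n' "$c"
done > "$list"

head -n 3 "$list"

if command -v bcftools >/dev/null 2>&1 && [ -f chr1.vcf.gz ]; then
    bcftools concat -f "$list" -Oz -o all.vcf.gz
fi
```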
755
756 -l, --ligate
757 Ligate phased VCFs by matching phase at overlapping haplotypes.
758 Note that the option is intended for VCFs with perfect overlap,
759 sites in overlapping regions present in one but missing in other
760 are dropped.
761
762 --no-version
763 see Common Options
764
765 -n, --naive
766 Concatenate VCF or BCF files without recompression. This is very
767 fast but requires that all files are of the same type (all VCF or
768 all BCF) and have the same headers. This is because all tags and
769 chromosome names in the BCF body rely on the order of the contig
and tag definitions in the header. A header compatibility check is
performed and the program throws an error if it is not safe to use
the option.
773
774 --naive-force
775 Same as --naive, but header compatibility is not checked.
776 Dangerous, use with caution.
777
778 -o, --output FILE
779 see Common Options
780
781 -O, --output-type b|u|z|v
782 see Common Options
783
784 -q, --min-PQ INT
785 Break phase set if phasing quality is lower than INT
786
787 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
788 see Common Options. Requires -a, --allow-overlaps.
789
790 -R, --regions-file FILE
791 see Common Options. Requires -a, --allow-overlaps.
792
793 --threads INT
794 see Common Options
795
796 bcftools consensus [OPTIONS] FILE
797 Create consensus sequence by applying VCF variants to a reference fasta
798 file. By default, the program will apply all ALT variants to the
799 reference fasta to obtain the consensus sequence. Using the --sample
800 (and, optionally, --haplotype) option will apply genotype (haplotype)
801 calls from FORMAT/GT. Note that the program does not act as a primitive
802 variant caller and ignores allelic depth information, such as INFO/AD
803 or FORMAT/AD. For that, consider using the setGT plugin.
804
805 -c, --chain FILE
806 write a chain file for liftover
807
808 -e, --exclude EXPRESSION
809 exclude sites for which EXPRESSION is true. For valid expressions
810 see EXPRESSIONS.
811
812 -f, --fasta-ref FILE
813 reference sequence in fasta format
814
815 -H, --haplotype 1|2|R|A|I|LR|LA|SR|SA|1pIu|2pIu
816 choose which allele from the FORMAT/GT field to use (the codes are
817 case-insensitive):
818
819 1
820 the first allele, regardless of phasing
821
822 2
823 the second allele, regardless of phasing
824
825 R
826 the REF allele (in heterozygous genotypes)
827
828 A
829 the ALT allele (in heterozygous genotypes)
830
831 I
832 IUPAC code for all genotypes
833
834 LR, LA
835 the longer allele. If both have the same length, use the REF
836 allele (LR), or the ALT allele (LA)
837
838 SR, SA
839 the shorter allele. If both have the same length, use the REF
840 allele (SR), or the ALT allele (SA)
841
842 1pIu, 2pIu
843 first/second allele for phased genotypes and IUPAC code for
844 unphased genotypes
845
846 This option requires -s, unless exactly one sample is present in the VCF.
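The length-based codes (LR, LA, SR, SA) can be illustrated with a small shell sketch. The helper name pick_allele is hypothetical and is not part of bcftools. It only restates the tie-breaking rule described above for a heterozygous REF/ALT genotype, and assumes plain (non-symbolic) alleles.

```shell
# Hypothetical helper, not part of bcftools: restates the documented
# LR/LA/SR/SA rule for one heterozygous REF/ALT genotype.
pick_allele() {
    # Codes are case-insensitive, so normalise to upper case first.
    code=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    ref=$2
    alt=$3
    case "$code" in
        LR|LA)
            # Longer allele wins; on a tie LR takes REF and LA takes ALT.
            if [ "${#ref}" -gt "${#alt}" ]; then echo "$ref"
            elif [ "${#alt}" -gt "${#ref}" ]; then echo "$alt"
            elif [ "$code" = LR ]; then echo "$ref"
            else echo "$alt"
            fi ;;
        SR|SA)
            # Shorter allele wins; on a tie SR takes REF and SA takes ALT.
            if [ "${#ref}" -lt "${#alt}" ]; then echo "$ref"
            elif [ "${#alt}" -lt "${#ref}" ]; then echo "$alt"
            elif [ "$code" = SR ]; then echo "$ref"
            else echo "$alt"
            fi ;;
        *)
            echo "unsupported code: $1" >&2
            return 1 ;;
    esac
}

pick_allele LR A AT   # insertion: the longer allele is ALT
pick_allele SA A C    # equal lengths: SA falls back to ALT
```

The sketch covers only the four length-based codes. The remaining codes depend on phasing or IUPAC encoding and are not modelled here.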
847
848 -i, --include EXPRESSION
849 include only sites for which EXPRESSION is true. For valid
850 expressions see EXPRESSIONS.
851
852 -I, --iupac-codes
853 output variants in the form of IUPAC ambiguity codes
854
855 --mark-del CHAR
856 instead of removing sequence, insert CHAR for deletions
857
858 --mark-ins uc|lc
859 highlight inserted sequence in uppercase (uc) or lowercase (lc),
860 leaving the rest of the sequence as is
861
862 --mark-snv uc|lc
863 highlight substitutions in uppercase (uc) or lowercase (lc),
864 leaving the rest of the sequence as is
865
866 -m, --mask FILE
867 BED file or TAB file with regions to be replaced with N (the
868 default) or as specified by the next --mask-with option. See
869 discussion of --regions-file in Common Options for file format
870 details.
871
872 --mask-with CHAR|lc|uc
873 replace sequence from --mask with CHAR, skipping overlapping
874 variants, or change to lowercase (lc) or uppercase (uc)
875
876 -M, --missing CHAR
877 instead of skipping the missing genotypes, output the character
878 CHAR (e.g. "?")
879
880 -o, --output FILE
881 write output to a file
882
883 -s, --sample NAME
884 apply variants of the given sample
885
886 Examples:
887
888 # Apply variants present in sample "NA001", output IUPAC codes for hets
889 bcftools consensus -I -s NA001 -f in.fa in.vcf.gz > out.fa
890
891 # Create consensus for one region. The fasta header lines are then expected
892 # in the form ">chr:from-to".
893 samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa
894
895 bcftools convert [OPTIONS] FILE
896 VCF input options:
897 -e, --exclude EXPRESSION
898 exclude sites for which EXPRESSION is true. For valid expressions
899 see EXPRESSIONS.
900
901 -i, --include EXPRESSION
902 include only sites for which EXPRESSION is true. For valid
903 expressions see EXPRESSIONS.
904
905 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
906 see Common Options
907
908 -R, --regions-file FILE
909 see Common Options
910
911 -s, --samples LIST
912 see Common Options
913
914 -S, --samples-file FILE
915 see Common Options
916
917 -t, --targets LIST
918 see Common Options
919
920 -T, --targets-file FILE
921 see Common Options
922
923 VCF output options:
924 --no-version
925 see Common Options
926
927 -o, --output FILE
928 see Common Options
929
930 -O, --output-type b|u|z|v
931 see Common Options
932
933 --threads INT
934 see Common Options
935
936 GEN/SAMPLE conversion:
937 -G, --gensample2vcf prefix or gen-file,sample-file
938 convert IMPUTE2 output to VCF. The second column must be of the
939 form "CHROM:POS_REF_ALT" to detect possible strand swaps; IMPUTE2
940 leaves the first one empty ("--") when sites from reference panel
941 are filled in. See also -g below.
942
943 -g, --gensample prefix or gen-file,sample-file
944 convert from VCF to gen/sample format used by IMPUTE2 and SHAPEIT.
945 The columns of .gen file format are ID1,ID2,POS,A,B followed by
946 three genotype probabilities P(AA), P(AB), P(BB) for each sample.
947 In order to prevent strand swaps, the program uses IDs of the form
948 "CHROM:POS_REF_ALT". For example:
949
950 .gen
951 ----
952 1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0
953 1:111494194_C_T 1:111494194_C_T 111494194 C T 0 1 0 0 0 1
954
955 .sample
956 -------
957 ID_1 ID_2 missing
958 0 0 0
959 sample1 sample1 0
960 sample2 sample2 0
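As a minimal sketch of the ID convention shown in the .gen example above, the following shell function prints the five leading columns for one biallelic site. The function name gen_site_columns is hypothetical. It illustrates only the "CHROM:POS_REF_ALT" naming, not the genotype probability columns.

```shell
# Hypothetical helper, not part of bcftools: build the ID1, ID2, POS, A, B
# columns of a .gen line from CHROM, POS, REF and ALT.
gen_site_columns() {
    chrom=$1
    pos=$2
    ref=$3
    alt=$4
    id="${chrom}:${pos}_${ref}_${alt}"
    # Both ID columns carry the same CHROM:POS_REF_ALT string by default.
    printf '%s %s %s %s %s\n' "$id" "$id" "$pos" "$ref" "$alt"
}

gen_site_columns 1 111485207 G A
# prints: 1:111485207_G_A 1:111485207_G_A 111485207 G A
```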
961
962 --tag STRING
963 tag to take values for .gen file: GT,PL,GL,GP
964
965 --chrom
966 output chromosome in the first column instead of CHROM:POS_REF_ALT
967
968 --sex FILE
969 output sex column in the sample file. The FILE format is
970
971 MaleSample M
972 FemaleSample F
973
974 --vcf-ids
975 output VCF IDs in the second column instead of CHROM:POS_REF_ALT
976
977 gVCF conversion:
978 --gvcf2vcf
979 convert gVCF to VCF, expanding REF blocks into sites. Note that the
980 -i and -e options work differently with this switch. In this
981 situation the filtering expressions define which sites should be
982 expanded and which sites should be left unmodified, but all sites
983 are printed on output. In order to drop sites, stream first through
984 bcftools view.
985
986 -f, --fasta-ref file
987 reference sequence in fasta format. Must be indexed with samtools
988 faidx
989
990 HAP/SAMPLE conversion:
991 --hapsample2vcf prefix or hap-file,sample-file
992 convert from hap/sample format to VCF. The columns of .hap file are
993 similar to .gen file above, but there are only two haplotype
994 columns per sample. Note that the first column of the .hap file is
995 expected to be in the form "CHR:POS_REF_ALT(_END)?", with the _END
996 being optional for defining the INFO/END tag when ALT is a symbolic
997 allele, for example:
998
999 .hap
1000 ----
1001 1:111485207_G_A rsID1 111485207 G A 0 1 0 0
1002 1:111494194_C_T rsID2 111494194 C T 0 1 0 0
1003 1:111495231_A_<DEL>_111495784 rsID3 111495231 A <DEL> 0 0 1 0
1004
1005 --hapsample prefix or hap-file,sample-file
1006 convert from VCF to hap/sample format used by IMPUTE2 and SHAPEIT.
1007 The columns of .hap file begin with ID,RSID,POS,REF,ALT. In order
1008 to prevent strand swaps, the program uses IDs of the form
1009 "CHROM:POS_REF_ALT".
1010
1011 --haploid2diploid
1012 with -h option converts haploid genotypes to homozygous diploid
1013 genotypes. For example, the program will print 0 0 instead of the
1014 default 0 -. This is useful for programs which do not handle
1015 haploid genotypes correctly.
1016
1017 --sex FILE
1018 output sex column in the sample file. The FILE format is
1019
1020 MaleSample M
1021 FemaleSample F
1022
1023 --vcf-ids
1024 output VCF IDs instead of "CHROM:POS_REF_ALT" IDs
1025
1026 HAP/LEGEND/SAMPLE conversion:
1027 -H, --haplegendsample2vcf prefix or hap-file,legend-file,sample-file
1028 convert from hap/legend/sample format used by IMPUTE2 to VCF, see
1029 also -h, --haplegendsample below.
1030
1031 -h, --haplegendsample prefix or hap-file,legend-file,sample-file
1032 convert from VCF to hap/legend/sample format used by IMPUTE2 and
1033 SHAPEIT. The columns of .legend file are ID,POS,REF,ALT. In order to
1034 prevent strand swaps, the program uses IDs of the form
1035 "CHROM:POS_REF_ALT". The .sample file is quite basic at the moment
1036 with columns for population, group and sex expected to be edited by
1037 the user. For example:
1038
1039 .hap
1040 -----
1041 0 1 0 0 1 0
1042 0 1 0 0 0 1
1043
1044 .legend
1045 -------
1046 id position a0 a1
1047 1:111485207_G_A 111485207 G A
1048 1:111494194_C_T 111494194 C T
1049
1050 .sample
1051 -------
1052 sample population group sex
1053 sample1 sample1 sample1 2
1054 sample2 sample2 sample2 2
1055
1056 --haploid2diploid
1057 with -h option converts haploid genotypes to homozygous diploid
1058 genotypes. For example, the program will print 0 0 instead of the
1059 default 0 -. This is useful for programs which do not handle
1060 haploid genotypes correctly.
1061
1062 --sex FILE
1063 output sex column in the sample file. The FILE format is
1064
1065 MaleSample M
1066 FemaleSample F
1067
1068 --vcf-ids
1069 output VCF IDs instead of "CHROM:POS_REF_ALT" IDs
1070
1071 TSV conversion:
1072 --tsv2vcf file
1073 convert from TSV (tab-separated values) format (such as generated
1074 by 23andMe) to VCF. The input file fields can be tab- or space-
1075 delimited
1076
1077 -c, --columns list
1078 comma-separated list of fields in the input file. In the current
1079 version, the fields CHROM, POS, ID, and AA are expected and can
1080 appear in arbitrary order. Columns which should be ignored in the
1081 input file can be indicated by "-". The AA field lists alleles on
1082 the forward reference strand, for example "CC" or "CT" for diploid
1083 genotypes or "C" for haploid genotypes (sex chromosomes).
1084 Insertions and deletions are not supported yet. Missing data can be
1085 indicated with "--".
1086
1087 -f, --fasta-ref file
1088 reference sequence in fasta format. Must be indexed with samtools
1089 faidx
1090
1091 -s, --samples LIST
1092 list of sample names. See Common Options
1093
1094 -S, --samples-file FILE
1095 file of sample names. See Common Options
1096
1097 Example:
1098
1099 # Convert 23andme results into VCF
1100 bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -Oz -o out.vcf.gz
1101
1102 bcftools csq [OPTIONS] FILE
1103 Haplotype aware consequence predictor which correctly handles combined
1104 variants such as MNPs split over multiple VCF records, SNPs separated
1105 by an intron (but adjacent in the spliced transcript) or nearby
1106 frame-shifting indels which in combination in fact are not
1107 frame-shifting.
1108
1109 The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tags
1110 (configurable with the -c option). The latter is a bitmask of indexes
1111 to INFO/BCSQ, with interleaved haplotypes. See the usage examples below
1112 for using the %TBCSQ converter in query for extracting a more human
1113 readable form from this bitmask. The construction of the bitmask limits
1114 the number of consequences that can be referenced per sample in the
1115 FORMAT/BCSQ tags. By default this is 15, but if more are required, see
1116 the --ncsq option.
1117
1118 The program requires on input a VCF/BCF file, the reference genome in
1119 fasta format (--fasta-ref) and genomic features in the GFF3 format
1120 downloadable from the Ensembl website (--gff-annot), and outputs an
1121 annotated VCF/BCF file. Currently, only Ensembl GFF3 files are
1122 supported.
1123
1124 By default, the input VCF should be phased. If phase is unknown, or
1125 only partially known, the --phase option can be used to indicate how to
1126 handle unphased data. Alternatively, haplotype aware calling can be
1127 turned off with the --local-csq option.
1128
1129 If conflicting (overlapping) variants within one haplotype are
1130 detected, a warning will be emitted and predictions will be based on
1131 only the first variant in the analysis.
1132
1133 Symbolic alleles are not supported. They will remain unannotated in the
1134 output VCF and are ignored for the prediction analysis.
1135
1136 -c, --custom-tag STRING
1137 use this custom tag to store consequences rather than the default
1138 BCSQ tag
1139
1140 -B, --trim-protein-seq INT
1141 abbreviate protein-changing predictions to a maximum of INT
1142 amino acids. For example, instead of writing the whole modified
1143 protein sequence with potentially hundreds of amino acids, with -B 1
1144 only an abbreviated version such as 25E..329>25G..94 will be
1145 written.
1146
1147 -e, --exclude EXPRESSION
1148 exclude sites for which EXPRESSION is true. For valid expressions
1149 see EXPRESSIONS.
1150
1151 -f, --fasta-ref FILE
1152 reference sequence in fasta format (required)
1153
1154 --force
1155 run even if some sanity checks fail. Currently the option allows
1156 skipping transcripts in malformed GFFs with incorrect phase
1157
1158 -g, --gff-annot FILE
1159 GFF3 annotation file (required), such as <ftp://ftp.ensembl.org/
1160 pub/current_gff3/homo_sapiens>. An example of a minimal working GFF
1161 file:
1162
1163 # The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
1164 # looks up their parent transcript (determined from the "Parent=transcript:" attribute),
1165 # the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
1166 # (the most interesting is "protein_coding").
1167 #
1168 # Attributes required for
1169 # gene lines:
1170 # - ID=gene:<gene_id>
1171 # - biotype=<biotype>
1172 # - Name=<gene_name> [optional]
1173 #
1174 # transcript lines:
1175 # - ID=transcript:<transcript_id>
1176 # - Parent=gene:<gene_id>
1177 # - biotype=<biotype>
1178 #
1179 # other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
1180 # - Parent=transcript:<transcript_id>
1181 #
1182 # Supported biotypes:
1183 # - see the function gff_parse_biotype() in bcftools/csq.c
1184
1185 1 ignored_field gene 21 2148 . - . ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1186 1 ignored_field transcript 21 2148 . - . ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1187 1 ignored_field three_prime_UTR 21 2054 . - . Parent=transcript:TranscriptId
1188 1 ignored_field exon 21 2148 . - . Parent=transcript:TranscriptId
1189 1 ignored_field CDS 21 2148 . - 1 Parent=transcript:TranscriptId
1190 1 ignored_field five_prime_UTR 210 2148 . - . Parent=transcript:TranscriptId
1191
1192 -i, --include EXPRESSION
1193 include only sites for which EXPRESSION is true. For valid
1194 expressions see EXPRESSIONS.
1195
1196 -l, --local-csq
1197 switch off haplotype-aware calling, run localized predictions
1198 considering only one VCF record at a time
1199
1200 -n, --ncsq INT
1201 maximum number of per-haplotype consequences to consider for each
1202 site. The INFO/BCSQ column includes all consequences, but only the
1203 first INT will be referenced by the FORMAT/BCSQ fields. The default
1204 value is 15 which corresponds to one 32-bit integer per diploid
1205 sample, after accounting for values reserved by the BCF
1206 specification. Note that increasing the value leads to increased
1207 size of the output BCF.
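One way to read the arithmetic above is that two haplotypes with 15 consequences each need 30 bits, which fits in one 32-bit integer once reserved values are set aside. The sketch below estimates how many 32-bit integers per sample a given --ncsq value would need under that reading. The function name and the figure of 30 usable bits per integer are assumptions inferred from the text, not taken from the bcftools source.

```shell
# Hypothetical helper, not part of bcftools.
# Assumption: each 32-bit integer offers 30 usable bits, and a diploid
# sample needs one bit per consequence per haplotype.
bcsq_ints_per_sample() {
    ncsq=$1
    bits=$((2 * ncsq))
    # Integer ceiling of bits / 30.
    echo $(( (bits + 29) / 30 ))
}

bcsq_ints_per_sample 15   # the default fits into a single integer
bcsq_ints_per_sample 16   # one more consequence spills into a second integer
```

Under this assumption, raising --ncsq just above a multiple of 15 adds a whole extra integer per sample, which is consistent with the note about increased output size.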
1208
1209 --no-version
1210 see Common Options
1211
1212 -o, --output FILE
1213 see Common Options
1214
1215 -O, --output-type b|t|u|z|v
1216 see Common Options. In addition, a custom tab-delimited plain text
1217 output can be printed (t).
1218
1219 -p, --phase a|m|r|R|s
1220 how to handle unphased heterozygous genotypes:
1221
1222 a
1223 take GTs as is, create haplotypes regardless of phase (0/1 →
1224 0|1)
1225
1226 m
1227 merge all GTs into a single haplotype (0/1 → 1, 1/2 → 1)
1228
1229 r
1230 require phased GTs, throw an error on unphased heterozygous GTs
1231
1232 R
1233 create non-reference haplotypes if possible (0/1 → 1|1, 1/2 →
1234 1|2)
1235
1236 s
1237 skip unphased heterozygous GTs
1238
1239 -q, --quiet
1240 suppress warning messages
1241
1242 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1243 see Common Options
1244
1245 -R, --regions-file FILE
1246 see Common Options
1247
1248 -s, --samples LIST
1249 samples to include or "-" to apply all variants and ignore samples
1250
1251 -S, --samples-file FILE
1252 see Common Options
1253
1254 -t, --targets LIST
1255 see Common Options
1256
1257 -T, --targets-file FILE
1258 see Common Options
1259
1260 Examples:
1261
1262 # Basic usage
1263 bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf
1264
1265 # Extract the translated haplotype consequences. The following TBCSQ variations
1266 # are recognised:
1267 # %TBCSQ .. print consequences in all haplotypes in separate columns
1268 # %TBCSQ{0} .. print the first haplotype only
1269 # %TBCSQ{1} .. print the second haplotype only
1270 # %TBCSQ{*} .. print a list of unique consequences present in either haplotype
1271 bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf
1272
1273 Examples of BCSQ annotation:
1274
1275 # Two separate VCF records at positions 2:122106101 and 2:122106102
1276 # change the same codon. This UV-induced C>T dinucleotide mutation
1277 # has been annotated fully at the position 2:122106101 with
1278 # - consequence type
1279 # - gene name
1280 # - ensembl transcript ID
1281 # - coding strand (+ fwd, - rev)
1282 # - amino acid position (in the coding strand orientation)
1283 # - list of corresponding VCF variants
1284 # The annotation at the second position gives the position of the full
1285 # annotation
1286 BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
1287 BCSQ=@122106101
1288
1289 # A frame-restoring combination of two frameshift insertions C>CG and T>TGG
1290 BCSQ=@46115084
1291 BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG
1292
1293 # Stop gained variant
1294 BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T
1295
1296 # Consequence types of variants downstream from a stop are prefixed with *
1297 BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C
1298
1299 bcftools filter [OPTIONS] FILE
1300 Apply fixed-threshold filters.
1301
1302 -e, --exclude EXPRESSION
1303 exclude sites for which EXPRESSION is true. For valid expressions
1304 see EXPRESSIONS.
1305
1306 -g, --SnpGap INT[:'indel',mnp,bnd,other,overlap]
1307 filter SNPs within INT base pairs of an indel or other
1308 variant type. The following example demonstrates the logic of
1309 --SnpGap 3 applied on a deletion and an insertion:
1310
1311 The SNPs at positions 1 and 7 are filtered, positions 0 and 8 are not:
1312 0123456789
1313 ref .G.GT..G..
1314 del .A.G-..A..
1315 Here the positions 1 and 6 are filtered, 0 and 7 are not:
1316 0123-456789
1317 ref .G.G-..G..
1318 ins .A.GT..A..
1319
1320 -G, --IndelGap INT
1321 filter clusters of indels separated by INT or fewer base pairs
1322 allowing only one to pass. The following example demonstrates the
1323 logic of --IndelGap 2 applied on a deletion and an insertion:
1324
1325 The second indel is filtered:
1326 012345678901
1327 ref .GT.GT..GT..
1328 del .G-.G-..G-..
1329 And similarly here, the second is filtered:
1330 01 23 456 78
1331 ref .A-.A-..A-..
1332 ins .AT.AT..AT..
1333
1334 -i, --include EXPRESSION
1335 include only sites for which EXPRESSION is true. For valid
1336 expressions see EXPRESSIONS.
1337
1338 -m, --mode [+x]
1339 define behaviour at sites with existing FILTER annotations. The
1340 default mode replaces existing filters of failed sites with a new
1341 FILTER string while leaving sites which pass untouched when
1342 non-empty and setting to "PASS" when the FILTER string is absent.
1343 The "+" mode appends new FILTER strings of failed sites instead of
1344 replacing them. The "x" mode resets filters of sites which pass to
1345 "PASS". Modes "+" and "x" can both be set.
1346
1347 --no-version
1348 see Common Options
1349
1350 -o, --output FILE
1351 see Common Options
1352
1353 -O, --output-type b|u|z|v
1354 see Common Options
1355
1356 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1357 see Common Options
1358
1359 -R, --regions-file file
1360 see Common Options
1361
1362 -s, --soft-filter STRING|+
1363 annotate FILTER column with STRING or, with +, a unique filter name
1364 generated by the program ("Filter%d").
1365
1366 -S, --set-GTs .|0
1367 set genotypes of failed samples to missing value (.) or reference
1368 allele (0)
1369
1370 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1371 see Common Options
1372
1373 -T, --targets-file file
1374 see Common Options
1375
1376 --threads INT
1377 see Common Options
1378
1379 bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz
1380 Checks sample identity. The program can operate in two modes. If the -g
1381 option is given, the identity of samples from query.vcf.gz is checked
1382 against the samples in the -g file. Without the -g option, multi-sample
1383 cross-check of samples in query.vcf.gz is performed.
1384
1385 --distinctive-sites NUM[,MEM[,DIR]]
1386 Find sites that can distinguish between at least NUM sample pairs.
1387 If the number is smaller than or equal to 1, it is interpreted as the
1388 fraction of pairs. The optional MEM string sets the maximum memory
1389 used for in-memory sorting and DIR is the temporary directory for
1390 external sorting. This option also requires --pairs to be given.
1391
1392 --dry-run
1393 Stop after first record to estimate required time.
1394
1395 -e, --error-probability INT
1396 Interpret genotypes and genotype likelihoods probabilistically. The
1397 value of INT represents genotype quality when GT tag is used (e.g.
1398 Q=30 represents one error in 1,000 genotypes and Q=40 one error in
1399 10,000 genotypes) and is ignored when PL tag is used (in that case
1400 an arbitrary non-zero integer can be provided). See also the -u,
1401 --use option below. If set to 0, the discordance equals the
1402 number of mismatching genotypes when GT vs GT is compared. If
1403 performance is an issue, set to 0 for faster run but less accurate
1404 results.
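The quality values quoted above follow the usual Phred convention, where the error probability is 10^(-Q/10). A minimal sketch of the conversion follows. The function name is hypothetical and the calculation relies only on a POSIX awk.

```shell
# Hypothetical helper, not part of bcftools: convert a Phred-scaled quality
# Q into the corresponding error probability 10^(-Q/10).
phred_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error_prob 30   # one error in 1,000 genotypes
phred_to_error_prob 40   # one error in 10,000 genotypes
```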
1405
1406 -g, --genotypes FILE
1407 VCF/BCF file with reference genotypes to compare against
1408
1409 -H, --homs-only
1410 Homozygous genotypes only, useful with low coverage data (requires
1411 -g, --genotypes)
1412
1413 --n-matches INT
1414 Print only top INT matches for each sample, 0 for unlimited. Use
1415 negative value to sort by HWE probability rather than the number of
1416 discordant sites. Note that average score is used to determine the
1417 top matches, not absolute values.
1418
1419 --no-HWE-prob
1420 Disable calculation of HWE probability to reduce memory
1421 requirements with comparisons between very large number of sample
1422 pairs.
1423
1424 -p, --pairs LIST
1425 A comma-separated list of sample pairs to compare. When the -g
1426 option is given, the first sample must be from the query file, the
1427 second from the -g file, third from the query file etc
1428 (qry,gt[,qry,gt..]). Without the -g option, the pairs are created
1429 the same way but both samples are from the query file
1430 (qry,qry[,qry,qry..])
1431
1432 -P, --pairs-file FILE
1433 A file with tab-delimited sample pairs to compare. The first sample
1434 in the pair must come from the query file, the second from the
1435 genotypes file when -g is given
1436
1437 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1438 Restrict to comma-separated list of regions, see Common Options
1439
1440 -R, --regions-file FILE
1441 Restrict to regions listed in a file, see Common Options
1442
1443 -s, --samples [qry|gt]:LIST
1444 List of query samples or -g samples. If neither -s nor -S is
1445 given, all possible sample pair combinations are compared.
1447
1448 -S, --samples-file [qry|gt]:FILE
1449 File with the query or -g samples to compare. If neither -s nor -S
1450 is given, all possible sample pair combinations are compared.
1452
1453 -t, --targets file
1454 see Common Options
1455
1456 -T, --targets-file file
1457 see Common Options
1458
1459 -u, --use TAG1[,TAG2]
1460 specifies which tag to use in the query file (TAG1) and the -g
1461 (TAG2) file. By default, the PL tag is used in the query file and
1462 GT in the -g file when available.
1463
1464 Examples:
1465
1466 # Check discordance of all samples from B against all samples in A
1467 bcftools gtcheck -g A.bcf B.bcf
1468
1469 # Limit comparisons to the given list of samples
1470 bcftools gtcheck -s gt:a1,a2,a3 -s qry:b1,b2 -g A.bcf B.bcf
1471
1472 # Compare only two pairs b1,a1 and b2,a1 (query sample first, -g sample second)
1473 bcftools gtcheck -p b1,a1,b2,a1 -g A.bcf B.bcf
1474
1475 bcftools index [OPTIONS] in.bcf|in.vcf.gz
1476 Creates index for bgzip compressed VCF/BCF files for random access. CSI
1477 (coordinate-sorted index) is created by default. The CSI format
1478 supports indexing of chromosomes up to length 2^31. TBI (tabix index)
1479 files, which support chromosome lengths up to 2^29, can be created
1480 by using the -t/--tbi option or using the tabix program packaged with
1481 htslib. When loading an index file, bcftools will try the CSI first
1482 and then the TBI.
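Taking the two length limits stated above at face value, the choice of index format for a given reference can be sketched in shell. The names TBI_MAX, CSI_MAX and suggest_index are hypothetical. The sketch looks only at the longest contig length and ignores the --min-shift setting.

```shell
# Hypothetical helper, not part of bcftools: suggest an index format from
# the length of the longest contig, using the limits quoted in the text.
TBI_MAX=$((1 << 29))   # 536870912
CSI_MAX=$((1 << 31))   # 2147483648

suggest_index() {
    len=$1
    if [ "$len" -le "$TBI_MAX" ]; then
        echo "tbi-or-csi"      # short enough for either format
    elif [ "$len" -le "$CSI_MAX" ]; then
        echo "csi"             # too long for TBI, still within the CSI limit
    else
        echo "too-long"        # beyond the limit quoted for CSI
    fi
}

suggest_index 248956422   # a contig of about 249 Mbp fits both formats
suggest_index 600000000   # a 600 Mbp contig needs CSI
```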
1483
1484 Indexing options:
1485 -c, --csi
1486 generate CSI-format index for VCF/BCF files [default]
1487
1488 -f, --force
1489 overwrite index if it already exists
1490
1491 -m, --min-shift INT
1492 set minimal interval size for CSI indices to 2^INT; default: 14
1493
1494 -o, --output FILE
1495 output file name. If not set, then the index will be created using
1496 the input file name plus a .csi or .tbi extension
1497
1498 -t, --tbi
1499 generate TBI-format index for VCF files
1500
1501 --threads INT
1502 see Common Options
1503
1504 Stats options:
1505 -n, --nrecords
1506 print the number of records based on the CSI or TBI index files
1507
1508 -s, --stats
1509 Print per contig stats based on the CSI or TBI index files. Output
1510 format is three tab-delimited columns listing the contig name,
1511 contig length (. if unknown) and number of records for the contig.
1512 Contigs with zero records are not printed.
1513
1514 bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz [...]
1515 Creates intersections, unions and complements of VCF files. Depending
1516 on the options, the program can output records from one (or more) files
1517 which have (or do not have) corresponding records with the same
1518 position in the other files.
1519
1520 -c, --collapse snps|indels|both|all|some|none
1521 see Common Options
1522
1523 -C, --complement
1524 output positions present only in the first file but missing in the
1525 others
1526
1527 -e, --exclude -|EXPRESSION
1528 exclude sites for which EXPRESSION is true. If -e (or -i) appears
1529 only once, the same filtering expression will be applied to all
1530 input files. Otherwise, -e or -i must be given for each input file.
1531 To indicate that no filtering should be performed on a file, use
1532 "-" in place of EXPRESSION, as shown in the example below. For
1533 valid expressions see EXPRESSIONS.
1534
1535 -f, --apply-filters LIST
1536 see Common Options
1537
1538 -i, --include EXPRESSION
1539 include only sites for which EXPRESSION is true. See discussion of
1540 -e, --exclude above.
1541
1542 -n, --nfiles [+-=]INT|~BITMAP
1543 output positions present in this many (=), this many or more (+),
1544 this many or fewer (-), or the exact same (~) files
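The four forms of the -n argument can be restated as a small predicate. In this sketch the presence of a site across the input files is written as a string of 0 and 1 characters, one per file in command-line order, following the BITMAP notation. The function name nfiles_match is hypothetical. The sketch mirrors only the description above, not the bcftools implementation.

```shell
# Hypothetical helper, not part of bcftools.
# Usage: nfiles_match SPEC PRESENCE
#   SPEC     one of +INT, -INT, =INT or ~BITMAP
#   PRESENCE string of 0/1 flags, one per input file (e.g. 1100)
# Returns 0 (true) when the site would be selected.
nfiles_match() {
    spec=$1
    presence=$2
    # Count the files that contain the site by dropping the zeros.
    ones=$(printf '%s' "$presence" | tr -d '0')
    count=${#ones}
    case "$spec" in
        '~'*) [ "${spec#?}" = "$presence" ] ;;   # exact same files
        +*)   [ "$count" -ge "${spec#?}" ] ;;    # this many or more
        -*)   [ "$count" -le "${spec#?}" ] ;;    # this many or fewer
        =*)   [ "$count" -eq "${spec#?}" ] ;;    # exactly this many
        *)    echo "unsupported spec: $spec" >&2; return 2 ;;
    esac
}

nfiles_match +2 1100 && echo "selected by -n +2"
nfiles_match '~1100' 1110 || echo "not selected by -n ~1100"
```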
1545
1546 -o, --output FILE
1547 see Common Options. When several files are being output, their
1548 names are controlled via -p instead.
1549
1550 -O, --output-type b|u|z|v
1551 see Common Options
1552
1553 -p, --prefix DIR
1554 if given, subset each of the input files accordingly. See also -w.
1555
1556 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1557 see Common Options
1558
1559 -R, --regions-file file
1560 see Common Options
1561
1562 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1563 see Common Options
1564
1565 -T, --targets-file file
1566 see Common Options
1567
1568 -w, --write LIST
1569 list of input files to output given as 1-based indices. With -p and
1570 no -w, all files are written.
1571
1572 Examples:
1573 Create intersection and complements of two sets saving the output in
1574 dir/*
1575
1576 bcftools isec -p dir A.vcf.gz B.vcf.gz
1577
1578 Filter sites in A (require INFO/MAF>=0.01) and B (require INFO/dbSNP)
1579 but not in C, and create an intersection, including only sites which
1580 appear in at least two of the files after filters have been applied
1581
1582 bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e- A.vcf.gz B.vcf.gz C.vcf.gz -n +2 -p dir
1583
1584 Extract and write records from A shared by both A and B using exact
1585 allele match
1586
1587 bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
1588
1589 Extract records private to A or B comparing by position only
1590
1591 bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
1592
1593 Print a list of records which are present in A and B but not in C and D
1594
1595 bcftools isec -n~1100 -c all A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz
1596
1597 bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz [...]
1598 Merge multiple VCF/BCF files from non-overlapping sample sets to create
1599 one multi-sample file. For example, when merging file A.vcf.gz
1600 containing samples S1, S2 and S3 and file B.vcf.gz containing samples
1601 S3 and S4, the output file will contain five samples named S1, S2, S3,
1602 2:S3 and S4.
1603
1604 Note that it is the responsibility of the user to ensure that the sample
1605 names are unique across all files. If they are not, the program will
1606 exit with an error unless the option --force-samples is given. The
1607 sample names can be also given explicitly using the --print-header and
1608 --use-header options.
1609
1610 Note that only records from different files can be merged, never from
1611 the same file. For "vertical" merge take a look at bcftools concat or
1612 bcftools norm -m instead.
1613
1614 --force-samples
1615 if the merged files contain duplicate sample names, proceed
1616 anyway. Duplicate sample names will be resolved by prepending the
1617 index of the file as it appeared on the command line to the
1618 conflicting sample name (see 2:S3 in the above example).
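The renaming rule can be sketched in shell as follows. Each argument is the comma-separated sample list of one input file, in command-line order. The function name merged_sample_names is hypothetical. It only illustrates the naming scheme described above and does not read VCF headers.

```shell
# Hypothetical helper, not part of bcftools: show the sample names that
# result when duplicates are resolved by prepending the 1-based file index.
merged_sample_names() {
    idx=0
    seen=","
    out=""
    for list in "$@"; do
        idx=$((idx + 1))
        old_ifs=$IFS
        IFS=,
        for name in $list; do
            # A name already seen in an earlier file gets the "INDEX:" prefix.
            case "$seen" in
                *",$name,"*) name="$idx:$name" ;;
            esac
            seen="$seen$name,"
            out="${out:+$out,}$name"
        done
        IFS=$old_ifs
    done
    echo "$out"
}

merged_sample_names S1,S2,S3 S3,S4
# prints: S1,S2,S3,2:S3,S4
```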
1619
1620 --print-header
1621 print only merged header and exit
1622
1623 --use-header FILE
1624 use the VCF header in the provided text FILE
1625
1626 -0, --missing-to-ref
1627 assume genotypes at missing sites are 0/0
1628
1629 -f, --apply-filters LIST
1630 see Common Options
1631
1632 -F, --filter-logic x|+
1633 Set the output record to PASS if any of the inputs is PASS (x), or
1634 apply all filters (+), which is the default.
1635
1636 -g, --gvcf -|FILE
1637 merge gVCF blocks, INFO/END tag is expected. If the reference fasta
1638 file FILE is not given and the dash (-) is given, unknown reference
1639 bases generated at gVCF block splits will be substituted with N’s.
1640 The --gvcf option uses the following default INFO rules: -i
1641 QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max.
1642
1643 -i, --info-rules -|TAG:METHOD[,...]
1644 Rules for merging INFO fields (scalars or vectors) or - to disable
1645 the default rules. METHOD is one of sum, avg, min, max, join.
1646 Default is DP:sum,DP4:sum if these fields exist in the input files.
1647 Fields with no specified rule will take the value from the first
1648 input file. The merged QUAL value is currently set to the maximum.
1649 This behaviour is not user controllable at the moment.
1650
1651 -l, --file-list FILE
1652 Read file names from FILE, one file name per line.
1653
1654 -L, --local-alleles INT
1655 Sites with many alternate alleles can require extremely large
1656 storage space which can exceed the 2GB size limit representable by
1657 BCF. This is caused by Number=G tags (such as FORMAT/PL) which
1658 store a value for each combination of reference and alternate
1659 alleles. The -L, --local-alleles option allows replacing such tags
1660 with a localized tag (FORMAT/LPL) which only includes a subset of
1661 alternate alleles relevant for that sample. A new FORMAT/LAA tag is
1662 added which lists 1-based indices of the alternate alleles relevant
1663 (local) for the current sample. The number INT gives the maximum
1664 number of alternate alleles that can be included in the PL tag. The
1665 default value is 0 which disables the feature and outputs values
1666 for all alternate alleles.
1667
1668 -m, --merge snps|indels|both|all|none|id
1669 The option controls what types of multiallelic records can be
1670 created:
1671
1672 -m none .. no new multiallelics, output multiple records instead
1673 -m snps .. allow multiallelic SNP records
1674 -m indels .. allow multiallelic indel records
1675 -m both .. both SNP and indel records can be multiallelic
1676 -m all .. SNP records can be merged with indel records
1677 -m id .. merge by ID
1678
1679 --no-index
1680 the option allows merging files without indexing them first. In
1681 order for this option to work, the user must ensure that the input
1682 files have chromosomes in the same order and consistent with the
1683 order of sequences in the VCF header.
1684
1685 --no-version
1686 see Common Options
1687
1688 -o, --output FILE
1689 see Common Options
1690
1691 -O, --output-type b|u|z|v
1692 see Common Options
1693
1694 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1695 see Common Options
1696
1697 -R, --regions-file file
1698 see Common Options
1699
1700 --threads INT
1701 see Common Options
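
As a rough illustration of the METHOD names accepted by -i, --info-rules above, the sketch below combines scalar values taken from several input files. It is not the bcftools implementation: the helper name merge_info is hypothetical, only scalars are handled, and avg is assumed to be a plain unweighted mean.

```shell
# merge_info METHOD VALUE...: combine per-file scalar INFO values.
# Illustration only; merge_info is a hypothetical helper, not part of bcftools.
merge_info() {
    method=$1; shift
    printf '%s\n' "$@" | awk -v m="$method" '
        NR == 1 { sum = $1 + 0; min = $1 + 0; max = $1 + 0; joined = $1; next }
        {
            sum += $1
            if ($1 + 0 < min) min = $1 + 0
            if ($1 + 0 > max) max = $1 + 0
            joined = joined "," $1
        }
        END {
            if      (m == "sum")  print sum
            else if (m == "avg")  print sum / NR   # assumption: unweighted mean
            else if (m == "min")  print min
            else if (m == "max")  print max
            else if (m == "join") print joined
            else exit 1                            # unknown METHOD
        }'
}

merge_info sum 10 20 5     # e.g. DP from three files, prints 35
merge_info max 10 20 5     # prints 20
merge_info join 10 20 5    # prints 10,20,5
```

With the default rules, DP and DP4 would be combined with sum, while a field with no rule keeps the value from the first input file.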
1702
1703 bcftools mpileup [OPTIONS] -f ref.fa in.bam [in2.bam [...]]
1704 Generate VCF or BCF containing genotype likelihoods for one or multiple
1705 alignment (BAM or CRAM) files. This is based on the original samtools
1706 mpileup command (with the -v or -g options) producing genotype
1707 likelihoods in VCF or BCF format, but not the textual pileup output.
1708 The mpileup command was transferred to bcftools in order to avoid
1709 errors resulting from use of incompatible versions of samtools and
1710 bcftools when used in the mpileup+bcftools call pipeline.
1711
1712 Individuals are identified from the SM tags in the @RG header lines.
1713 Multiple individuals can be pooled in one alignment file, also one
1714 individual can be separated into multiple files. If sample identifiers
1715 are absent, each input file is regarded as one sample.
1716
1717 Note that there are two orthogonal ways to specify locations in the
1718 input file; via -r region and -t positions. The former uses (and
1719 requires) an index to do random access while the latter streams through
1720 the file contents filtering out the specified regions, requiring no
1721 index. The two may be used in conjunction. For example a BED file
1722 containing locations of genes in chromosome 20 could be specified using
1723 -r 20 -t chr20.bed, meaning that the index is used to find chromosome
1724 20 and then it is filtered for the regions listed in the BED file. Also
1725 note that the -r option can be much slower than -t with many regions
1726 and can require more memory when multiple regions and many alignment
1727 files are processed.
1728
1729 Input options
1730 -6, --illumina1.3+
1731 Assume the quality is in the Illumina 1.3+ encoding.
1732
1733 -A, --count-orphans
1734 Do not skip anomalous read pairs in variant calling.
1735
1736 -b, --bam-list FILE
1737 List of input alignment files, one file per line [null]
1738
1739 -B, --no-BAQ
1740 Disable probabilistic realignment for the computation of base
1741 alignment quality (BAQ). BAQ is the Phred-scaled probability of a
1742 read base being misaligned. Applying BAQ (that is, leaving this
1743 option off) greatly helps to reduce false SNPs caused by misalignments.
1744
1745 -C, --adjust-MQ INT
1746 Coefficient for downgrading mapping quality for reads containing
1747 excessive mismatches. Given a read with a phred-scaled probability
1748 q of being generated from the mapped position, the new mapping
1749 quality is about sqrt((INT-q)/INT)*INT. A zero value (the default)
1750 disables this functionality.
1751
1752 -D, --full-BAQ
1753 Run the BAQ algorithm on all reads, not just those in problematic
1754 regions. This matches the behaviour for Bcftools 1.12 and earlier.
1755
1756 By default mpileup uses heuristics to decide when to apply the BAQ
1757 algorithm. Most sequences will not be BAQ adjusted, giving a CPU
1758 time closer to --no-BAQ, but it will still be applied in regions
1759 with suspected problematic alignments. This has been tested to work
1760 well on single sample data with even allele frequency, but the
1761 reliability is unknown for multi-sample calling and for low allele
1762 frequency variants so full BAQ is still recommended in those
1763 scenarios.
1764
1765 -d, --max-depth INT
1766 At a position, read at most INT reads per input file. Note that
1767 the original samtools mpileup command had a minimum value of 8000/n
1768 where n was the number of input files given to mpileup. This means
1769 that in samtools mpileup the default was highly likely to be
1770 increased and the -d parameter would have an effect only once above
1771 the cross-sample minimum of 8000. This behavior was problematic
1772 when working with a combination of single- and multi-sample bams,
1773 therefore in bcftools mpileup the user is given full control
1774 (and responsibility), and an informative message is printed instead
1775 [250]
1776
1777 -E, --redo-BAQ
1778 Recalculate BAQ on the fly, ignore existing BQ tags
1779
1780 -f, --fasta-ref FILE
1781 The faidx-indexed reference file in the FASTA format. The file can
1782 be optionally compressed by bgzip. Reference is required by default
1783 unless the --no-reference option is set [null]
1784
1785 --no-reference
1786 Do not require the --fasta-ref option.
1787
1788 -G, --read-groups FILE
1789 list of read groups to include or exclude if prefixed with "^".
1790 One read group per line. This file can also be used to assign new
1791 sample names to read groups by giving the new sample name as a
1792 second white-space-separated field, like this: "read_group_id
1793 new_sample_name". If the read group name is not unique, also the
1794 bam file name can be included: "read_group_id file_name
1795 sample_name". If all reads from the alignment file should be
1796 treated as a single sample, the asterisk symbol can be used: "*
1797 file_name sample_name". Alignments without a read group ID can be
1798 matched with "?". NOTE: The meaning of bcftools mpileup -G is the
1799 opposite of samtools mpileup -G.
1800
1801 RG_ID_1
1802 RG_ID_2 SAMPLE_A
1803 RG_ID_3 SAMPLE_A
1804 RG_ID_4 SAMPLE_B
1805 RG_ID_5 FILE_1.bam SAMPLE_A
1806 RG_ID_6 FILE_2.bam SAMPLE_A
1807 * FILE_3.bam SAMPLE_C
1808 ? FILE_3.bam SAMPLE_D
1809
1810 -q, --min-MQ INT
1811 Minimum mapping quality for an alignment to be used [0]
1812
1813 -Q, --min-BQ INT
1814 Minimum base quality for a base to be considered [13]
1815
1816 --max-BQ INT
1817 Caps the base quality to a maximum value [60]. This can be
1818 particularly useful on technologies that produce overly optimistic
1819 high qualities, leading to too many false positives or incorrect
1820 genotype assignments.
1821
1822 -r, --regions CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,...]
1823 Only generate mpileup output in given regions. Requires the
1824 alignment files to be indexed. If used in conjunction with -t or -T then
1825 considers the intersection; see Common Options
1826
1827 -R, --regions-file FILE
1828 As for -r, --regions, but regions read from FILE; see Common
1829 Options
1830
1831 --ignore-RG
1832 Ignore RG tags. Treat all reads in one alignment file as one
1833 sample.
1834
1835 --rf, --incl-flags STR|INT
1836 Required flags: skip reads with mask bits unset [null]
1837
1838 --ff, --excl-flags STR|INT
1839 Filter flags: skip reads with mask bits set
1840 [UNMAP,SECONDARY,QCFAIL,DUP]
1841
1842 -s, --samples LIST
1843 list of sample names. See Common Options
1844
1845 -S, --samples-file FILE
1846 file of sample names to include or exclude if prefixed with
1847 "^". One sample per line. This file can also be used to rename
1848 samples by giving the new sample name as a second
1849 white-space-separated column, like this: "old_name new_name". If a
1850 sample name contains spaces, the spaces can be escaped using the
1851 backslash character, for example "Not\ a\ good\ sample\ name".
1852
1853 -t, --targets LIST
1854 see Common Options
1855
1856 -T, --targets-file FILE
1857 see Common Options
1858
1859 -x, --ignore-overlaps
1860 Disable read-pair overlap detection.
1861
1862 --seed INT
1863 Set the random number seed used when sub-sampling deep regions [0].
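
The formula given for -C, --adjust-MQ above can be evaluated directly. The sketch below simply computes sqrt((INT-q)/INT)*INT as written in the option description; the helper name adjust_mq is hypothetical, and clamping to 0 when q exceeds INT is an assumption of this sketch, since that case is not described here.

```shell
# adjust_mq INT Q: evaluate sqrt((INT-Q)/INT)*INT from the -C description.
# Hypothetical helper for illustration; not the bcftools source.
adjust_mq() {
    awk -v c="$1" -v q="$2" 'BEGIN {
        if (c == 0) { print q; exit }   # -C 0 (the default) disables the adjustment
        t = (c - q) / c
        if (t < 0) t = 0                # assumption: clamp when q exceeds INT
        printf "%.2f\n", sqrt(t) * c
    }'
}

adjust_mq 50 30    # prints 31.62
adjust_mq 50 0     # prints 50.00
```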
1864
1865 Output options
1866 -a, --annotate LIST
1867 Comma-separated list of FORMAT and INFO tags to output.
1868 (case-insensitive, the "FORMAT/" prefix is optional, and use "?" to
1869 list available annotations on the command line) [null]:
1870
1871 FORMAT/AD .. Allelic depth (Number=R,Type=Integer)
1872 FORMAT/ADF .. Allelic depths on the forward strand (Number=R,Type=Integer)
1873 FORMAT/ADR .. Allelic depths on the reverse strand (Number=R,Type=Integer)
1874 FORMAT/DP .. Number of high-quality bases (Number=1,Type=Integer)
1875 FORMAT/SP .. Phred-scaled strand bias P-value (Number=1,Type=Integer)
1876 FORMAT/SCR .. Number of soft-clipped reads (Number=1,Type=Integer)
1877
1878 INFO/AD .. Total allelic depth (Number=R,Type=Integer)
1879 INFO/ADF .. Total allelic depths on the forward strand (Number=R,Type=Integer)
1880 INFO/ADR .. Total allelic depths on the reverse strand (Number=R,Type=Integer)
1881 INFO/SCR .. Number of soft-clipped reads (Number=1,Type=Integer)
1882
1883 FORMAT/DV .. Deprecated in favor of FORMAT/AD; Number of high-quality non-reference bases (Number=1,Type=Integer)
1884 FORMAT/DP4 .. Deprecated in favor of FORMAT/ADF and FORMAT/ADR; Number of high-quality ref-forward, ref-reverse,
1885 alt-forward and alt-reverse bases (Number=4,Type=Integer)
1886 FORMAT/DPR .. Deprecated in favor of FORMAT/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
1887 INFO/DPR .. Deprecated in favor of INFO/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
1888
1889 -g, --gvcf INT[,...]
1890 output gVCF blocks of homozygous REF calls, with depth (DP) ranges
1891 specified by the list of integers. For example, passing 5,15 will
1892 group sites into two types of gVCF blocks, the first with minimum
1893 per-sample DP from the interval [5,15) and the latter with minimum
1894 depth 15 or more. In this example, sites with minimum per-sample
1895 depth less than 5 will be printed as separate records, outside of
1896 gVCF blocks.
1897
1898 --no-version
1899 see Common Options
1900
1901 -o, --output FILE
1902 Write output to FILE, rather than the default of standard output.
1903 (The same short option is used for both --open-prob and --output.
1904 If -o’s argument contains any non-digit characters other than a
1905 leading + or - sign, it is interpreted as --output. Usually the
1906 filename extension will take care of this, but to write to an
1907 entirely numeric filename use -o ./123 or --output 123.)
1908
1909 -O, --output-type b|u|z|v
1910 see Common Options
1911
1912 --threads INT
1913 see Common Options
1914
1915 -U, --mwu-u
1916 Use the previous Mann-Whitney U test score from version 1.12 and
1917 earlier. This is a probability score, but importantly it folds
1918 probabilities above or below the desired score into the same P. The
1919 new Mann-Whitney U test score is a "Z score", expressing the score
1920 as the number of standard deviations away from the mean (with zero
1921 matching the mean). It keeps both positive and negative
1922 values. This can be important for some tests where errors are
1923 asymmetric.
1924
1925 This option changes the INFO field names produced back to the ones
1926 used by the earlier Bcftools releases. For example BQBZ becomes
1927 BQB.
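
The depth ranges passed to -g, --gvcf above can be read as half-open bins. The sketch below reports which block type a site would fall into for a given minimum per-sample depth, following the 5,15 example from the option description. The helper name gvcf_block is hypothetical and the list is assumed to be in ascending order.

```shell
# gvcf_block MIN_DP LIST: classify a site by its minimum per-sample depth
# for a -g LIST such as 5,15. Hypothetical helper for illustration only;
# LIST is assumed to be sorted in ascending order.
gvcf_block() {
    awk -v dp="$1" -v list="$2" 'BEGIN {
        n = split(list, b, ",")
        if (dp + 0 < b[1] + 0) { print "separate record"; exit }
        for (i = n; i >= 1; i--)
            if (dp + 0 >= b[i] + 0) {
                if (i == n) print "block [" b[i] ",inf)"
                else        print "block [" b[i] "," b[i + 1] ")"
                exit
            }
    }'
}

gvcf_block 3 5,15     # prints: separate record
gvcf_block 14 5,15    # prints: block [5,15)
gvcf_block 40 5,15    # prints: block [15,inf)
```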
1928
1929 Options for SNP/INDEL genotype likelihood computation
1930 -X, --config STR
1931 Specify a platform specific configuration profile. The profile
1932 should be one of 1.12, illumina, ont or pacbio-ccs. Settings
1933 applied are as follows:
1934
1935 1.12 -Q13 -h100 -m1
1936 illumina [ default values ]
1937 ont -B -Q5 --max-BQ 30 -I
1938 pacbio-ccs -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
1939
1940 --ar, --ambig-reads drop|incAD|incAD0
1941 What to do with ambiguous indel reads that do not span an entire
1942 short tandem repeat region: discard ambiguous reads from calling
1943 and do not increment high-quality AD depth counters (drop), exclude
1944 from calling but increment AD counters proportionally (incAD),
1945 exclude from calling and increment the first value of the AD
1946 counter (incAD0) [drop]
1947
1948 -e, --ext-prob INT
1949 Phred-scaled gap extension sequencing error probability. Reducing
1950 INT leads to longer indels [20]
1951
1952 -F, --gap-frac FLOAT
1953 Minimum fraction of gapped reads [0.002]
1954
1955 -h, --tandem-qual INT
1956 Coefficient for modeling homopolymer errors. Given an l-long
1957 homopolymer run, the sequencing error of an indel of size s is
1958 modeled as INT*s/l [500]. Increasing this informs the caller that
1959 indels in long homopolymers are more likely genuine and less likely
1960 to be sequencing artifacts. Hence increasing tandem-qual will give
1961 higher recall and lower precision. Bcftools 1.12 and earlier had a
1962 default of 100, which was tuned around more error prone
1963 instruments. Note changing this may have a minor impact on SNP
1964 calling too. For maximum SNP calling accuracy, it may be preferable
1965 to adjust this lower again, although this will adversely affect
1966 indels.
1967
1968 --indel-bias FLOAT
1969 Skews the indel scores up or down, trading recall (low
1970 false-negative) vs precision (low false-positive) [1.0]. In
1971 Bcftools 1.12 and earlier this parameter didn’t exist, but had an
1972 implied value of 1.0. If you are planning to do heavy filtering of
1973 variants, selecting the best quality ones only (favouring precision
1974 over recall), it is advisable to set this lower (such as 0.75)
1975 while higher depth samples, or cases where you favour recall over
1976 precision, may work better with a higher value such as 2.0.
1977
1978 -I, --skip-indels
1979 Do not perform INDEL calling
1980
1981 -L, --max-idepth INT
1982 Skip INDEL calling if the average per-sample depth is above INT
1983 [250]
1984
1985 -m, --min-ireads INT
1986 Minimum number of gapped reads for indel candidates [1]
1987
1988 -M, --max-read-len INT
1989 The maximum read length permitted by the BAQ algorithm [500].
1990 Variants are still called on longer reads, but they will not be
1991 passed through the BAQ method. This limit exists to prevent
1992 excessively long BAQ times and high memory usage. Note that if
1993 partial BAQ is in use (the default, without -D) then raising this
1994 parameter will likely not have a significant CPU cost.
1995
1996 -o, --open-prob INT
1997 Phred-scaled gap open sequencing error probability. Reducing INT
1998 leads to more indel calls. (The same short option is used for both
1999 --open-prob and --output. When -o’s argument contains only an
2000 optional + or - sign followed by the digits 0 to 9, it is
2001 interpreted as --open-prob.) [40]
2002
2003 -p, --per-sample-mF
2004 Apply -m and -F thresholds per sample to increase sensitivity of
2005 calling. By default both options are applied to reads pooled from
2006 all samples.
2007
2008 -P, --platforms STR
2009 Comma-delimited list of platforms (determined by @RG-PL) from
2010 which indel candidates are obtained. It is recommended to collect
2011 indel candidates from sequencing technologies that have low indel
2012 error rate such as ILLUMINA [all]
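
Because -o is shared between --open-prob and --output, it may help to see the stated rule written out. The sketch below applies the rule from the two option descriptions (an optional + or - sign followed only by digits means --open-prob, anything else means --output). The helper name classify_o is hypothetical and this is not how bcftools itself is implemented.

```shell
# classify_o ARG: say how "-o ARG" would be interpreted under the documented
# rule. Hypothetical helper for illustration only.
classify_o() {
    if printf '%s\n' "$1" | grep -Eq '^[+-]?[0-9]+$'; then
        echo open-prob
    else
        echo output
    fi
}

classify_o 40         # prints open-prob
classify_o out.vcf    # prints output
classify_o ./123      # prints output
```

Using the long options --open-prob and --output avoids the ambiguity altogether.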
2013
2014 Examples:
2015 Call SNPs and short INDELs, then mark low quality sites and sites with
2016 the read depth exceeding a limit. (The read depth should be adjusted to
2017 about twice the average read depth as higher read depths usually
2018 indicate problematic regions which are often enriched for artefacts.)
2019 One may consider adding -C50 to mpileup if mapping quality is
2020 overestimated for reads containing excessive mismatches. Applying
2021 this option usually helps for BWA-backtrack alignments, but may not
2022 help for other aligners.
2023
2024 bcftools mpileup -Ou -f ref.fa aln.bam | \
2025 bcftools call -Ou -mv | \
2026 bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
2027
2028 bcftools norm [OPTIONS] file.vcf.gz
2029 Left-align and normalize indels, check if REF alleles match the
2030 reference, split multiallelic sites into multiple rows; recover
2031 multiallelics from multiple rows. Left-alignment and normalization will
2032 only be applied if the --fasta-ref option is supplied.
2033
2034 -a, --atomize
2035 Decompose complex variants, e.g. split MNVs into consecutive SNVs.
2036 See also --atom-overlaps and --old-rec-tag.
2037
2038 --atom-overlaps .|*
2039 Alleles missing because of an overlapping variant can be set either
2040 to missing (.) or to the star allele (*), as recommended by the VCF
2041 specification. IMPORTANT: Note that the asterisk is expanded by the shell
2042 and must be put in quotes or escaped by a backslash:
2043
2044 # Before atomization:
2045 100 CC C,GG 1/2
2046
2047 # After:
2048 # bcftools norm -a --atom-overlaps .
2049 100 C G ./1
2050 100 CC C 1/.
2051 101 C G ./1
2052
2053 # After:
2054 # bcftools norm -a --atom-overlaps '*'
2055 # bcftools norm -a --atom-overlaps \*
2056 100 C G,* 2/1
2057 100 CC C,* 1/2
2058 101 C G,* 2/1
2059
2060 -c, --check-ref e|w|x|s
2061 what to do when incorrect or missing REF allele is encountered:
2062 exit (e), warn (w), exclude (x), or set/fix (s) bad sites. The w
2063 option can be combined with x and s. Note that s can swap alleles
2064 and will update genotypes (GT) and AC counts, but will not attempt
2065 to fix PL or other fields. Also note, and this cannot be stressed
2066 enough, that s will NOT fix strand issues in your VCF, do NOT use
2067 it for that purpose!!! (Instead see
2068 <http://samtools.github.io/bcftools/howtos/plugin.af-dist.html> and
2069 <http://samtools.github.io/bcftools/howtos/plugin.fixref.html>.)
2070
2071 -d, --rm-dup snps|indels|both|all|exact
2072 If a record is present multiple times, output only the first
2073 instance. See also --collapse in Common Options.
2074
2075 -D, --remove-duplicates
2076 If a record is present in multiple files, output only the first
2077 instance. Alias for -d none, deprecated.
2078
2079 -f, --fasta-ref FILE
2080 reference sequence. Supplying this option will turn on
2081 left-alignment and normalization, however, see also the
2082 --do-not-normalize option below.
2083
2084 --force
2085 try to proceed with -m- even if malformed tags with incorrect
2086 number of fields are encountered, discarding such tags.
2087 (Experimental, use at your own risk.)
2088
2089 --keep-sum TAG[,...]
2090 keep vector sum constant when splitting multiallelic sites. Only AD
2091 tag is currently supported. See also
2092 <https://github.com/samtools/bcftools/issues/360>
2093
2094 -m, --multiallelics -|+[snps|indels|both|any]
2095 split multiallelic sites into biallelic records (-) or join
2096 biallelic sites into multiallelic records (+). An optional type
2097 string can follow which controls variant types which should be
2098 split or merged together: If only SNP records should be split or
2099 merged, specify snps; if both SNPs and indels should be merged
2100 separately into two records, specify both; if SNPs and indels
2101 should be merged into a single record, specify any.
2102
2103 --no-version
2104 see Common Options
2105
2106 -N, --do-not-normalize
2107 the -c s option can be used to fix or set the REF allele from the
2108 reference -f. The -N option will not turn on indel normalisation as
2109 the -f option normally implies
2110
2111 --old-rec-tag STR
2112 Add INFO/STR annotation with the original record. The format of the
2113 annotation is CHROM|POS|REF|ALT|USED_ALT_IDX.
2114
2115 -o, --output FILE
2116 see Common Options
2117
2118 -O, --output-type b|u|z|v
2119 see Common Options
2120
2121 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2122 see Common Options
2123
2124 -R, --regions-file file
2125 see Common Options
2126
2127 -s, --strict-filter
2128 when merging (-m+), merged site is PASS only if all sites being
2129 merged PASS
2130
2131 -t, --targets LIST
2132 see Common Options
2133
2134 -T, --targets-file FILE
2135 see Common Options
2136
2137 --threads INT
2138 see Common Options
2139
2140 -w, --site-win INT
2141 maximum distance between two records to consider when locally
2142 sorting variants which changed position during the realignment
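
The annotation written by --old-rec-tag above is a pipe-separated string, so it can be split with standard shell tools. The sketch below is a minimal example: the helper name parse_old_rec is hypothetical, and the input value is made up to match the documented CHROM|POS|REF|ALT|USED_ALT_IDX layout rather than taken from real bcftools output.

```shell
# parse_old_rec VALUE: split a CHROM|POS|REF|ALT|USED_ALT_IDX string.
# Hypothetical helper; the sample value below is invented for illustration.
parse_old_rec() {
    printf '%s\n' "$1" | {
        IFS='|' read -r chrom pos ref alt idx
        printf 'chrom=%s pos=%s ref=%s alt=%s used_alt_idx=%s\n' \
            "$chrom" "$pos" "$ref" "$alt" "$idx"
    }
}

parse_old_rec '1|100|CC|C,GG|1'
# prints: chrom=1 pos=100 ref=CC alt=C,GG used_alt_idx=1
```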
2143
2144 bcftools [plugin NAME|+NAME] [OPTIONS] FILE -- [PLUGIN OPTIONS]
2145 A common framework for various utilities. The plugins can be used the
2146 same way as normal commands, only their name is prefixed with "+". Most
2147 plugins accept two types of parameters: general options shared by all
2148 plugins followed by a separator, and a list of plugin-specific options.
2149 There are some exceptions to this rule, some plugins do not accept the
2150 common options and implement their own parameters. Therefore please pay
2151 attention to the usage examples that each plugin comes with.
2152
2153 VCF input options:
2154 -e, --exclude EXPRESSION
2155 exclude sites for which EXPRESSION is true. For valid expressions
2156 see EXPRESSIONS.
2157
2158 -i, --include EXPRESSION
2159 include only sites for which EXPRESSION is true. For valid
2160 expressions see EXPRESSIONS.
2161
2162 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2163 see Common Options
2164
2165 -R, --regions-file file
2166 see Common Options
2167
2168 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2169 see Common Options
2170
2171 -T, --targets-file file
2172 see Common Options
2173
2174 VCF output options:
2175 --no-version
2176 see Common Options
2177
2178 -o, --output FILE
2179 see Common Options
2180
2181 -O, --output-type b|u|z|v
2182 see Common Options
2183
2184 --threads INT
2185 see Common Options
2186
2187 Plugin options:
2188 -h, --help
2189 list plugin’s options
2190
2191 -l, --list-plugins
2192 List all available plugins.
2193
2194 By default, appropriate system directories are searched for
2195 installed plugins. You can override this by setting the
2196 BCFTOOLS_PLUGINS environment variable to a colon-separated list of
2197 directories to search. If BCFTOOLS_PLUGINS begins with a colon,
2198 ends with a colon, or contains adjacent colons, the system
2199 directories are also searched at that position in the list of
2200 directories.
2204
2205 -v, --verbose
2206 print debugging information to debug plugin failure
2207
2208 -V, --version
2209 print version string and exit
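
The BCFTOOLS_PLUGINS rule described under -l, --list-plugins above (an empty element in the colon-separated list stands for the system directories) can be sketched as follows. The helper name plugin_dirs, the <system> placeholder and the /opt paths are all hypothetical; treating an empty value the same as an unset variable is an assumption of this sketch.

```shell
# plugin_dirs [VALUE]: print the search order implied by a BCFTOOLS_PLUGINS
# value, one entry per line. "<system>" is a placeholder for the built-in
# system directories. Hypothetical helper for illustration only.
plugin_dirs() {
    # Unset (no argument) means the default; an empty value is assumed
    # to behave the same way.
    if [ $# -eq 0 ] || [ -z "$1" ]; then
        echo '<system>'
        return
    fi
    awk -v p="$1" 'BEGIN {
        n = split(p, d, ":")
        for (i = 1; i <= n; i++) print (d[i] == "" ? "<system>" : d[i])
    }'
}

plugin_dirs '/opt/a:/opt/b'    # only the two listed directories
plugin_dirs '/opt/a:'          # /opt/a first, then the system directories
plugin_dirs '/opt/a::/opt/b'   # system directories searched between the two
```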
2210
2211 List of plugins coming with the distribution:
2212 ad-bias
2213 find positions with wildly varying ALT allele frequency (Fisher
2214 test on FMT/AD)
2215
2216 add-variantkey
2217 add VariantKey INFO fields VKX and RSX
2218
2219 af-dist
2220 collect AF deviation stats and GT probability distribution given AF
2221 and assuming HWE
2222
2223 allele-length
2224 count the frequency of the length of REF, ALT and REF+ALT
2225
2226 check-ploidy
2227 check if ploidy of samples is consistent for all sites
2228
2229 check-sparsity
2230 print samples without genotypes in a region or chromosome
2231
2232 color-chrs
2233 color shared chromosomal segments, requires trio VCF with phased
2234 GTs
2235
2236 contrast
2237 runs a basic association test, per-site or in a region, and checks
2238 for novel alleles and genotypes in two groups of samples. Adds the
2239 following INFO annotations:
2240
2241 • PASSOC .. Fisher’s exact test probability of genotypic
2242 association (REF vs non-REF allele)
2243
2244 • FASSOC .. proportion of non-REF allele in controls and cases
2245
2246 • NASSOC .. number of control-ref, control-alt, case-ref and
2247 case-alt alleles
2248
2249 • NOVELAL .. lists samples with a novel allele not observed in
2250 the control group
2251
2252 • NOVELGT .. lists samples with a novel genotype not observed in
2253 the control group
2254
2255 counts
2256 a minimal plugin which counts number of SNPs, Indels, and total
2257 number of sites.
2258
2259 dosage
2260 print genotype dosage. By default the plugin searches for PL, GL
2261 and GT, in that order.
2262
2263 fill-from-fasta
2264 fill INFO or REF field based on values in a fasta file
2265
2266 fill-tags
2267 set various INFO tags. The list of tags supported in this version:
2268
2269 • INFO/AC Number:A Type:Integer .. Allele count in
2270 genotypes
2271
2272 • INFO/AC_Hom Number:A Type:Integer .. Allele counts in
2273 homozygous genotypes
2274
2275 • INFO/AC_Het Number:A Type:Integer .. Allele counts in
2276 heterozygous genotypes
2277
2278 • INFO/AC_Hemi Number:A Type:Integer .. Allele counts in
2279 hemizygous genotypes
2280
2281 • INFO/AF Number:A Type:Float .. Allele frequency
2282
2283 • INFO/AN Number:1 Type:Integer .. Total number of
2284 alleles in called genotypes
2285
2286 • INFO/ExcHet Number:A Type:Float .. Test excess
2287 heterozygosity; 1=good, 0=bad
2288
2289 • INFO/END Number:1 Type:Integer .. End position of the
2290 variant
2291
2292 • INFO/F_MISSING Number:1 Type:Float .. Fraction of missing
2293 genotypes
2294
2295 • INFO/HWE Number:A Type:Float .. HWE test
2296 (PMID:15789306); 1=good, 0=bad
2297
2298 • INFO/MAF Number:A Type:Float .. Minor Allele
2299 frequency
2300
2301 • INFO/NS Number:1 Type:Integer .. Number of samples
2302 with data
2303
2304 • INFO/TYPE Number:. Type:String .. The record type
2305 (REF,SNP,MNP,INDEL,etc)
2306
2307 • FORMAT/VAF Number:A Type:Float .. The fraction of
2308 reads with the alternate allele, requires FORMAT/AD or ADF+ADR
2309
2310 • FORMAT/VAF1 Number:1 Type:Float .. The same as
2311 FORMAT/VAF but for all alternate alleles cumulatively
2312
2313 • TAG=func(TAG) Number:1 Type:Integer .. Experimental support
2314 for user-defined expressions such as "DP=sum(DP)"
2315
2316 fix-ploidy
2317 sets correct ploidy
2318
2319 fixref
2320 determine and fix strand orientation
2321
2322 frameshifts
2323 annotate frameshift indels
2324
2325 GTisec
2326 count genotype intersections across all possible sample subsets in
2327 a vcf file
2328
2329 GTsubset
2330 output only sites where the requested samples all exclusively share
2331 a genotype
2332
2333 guess-ploidy
2334 determine sample sex by checking genotype likelihoods (GL,PL) or
2335 genotypes (GT) in the non-PAR region of chrX.
2336
2337 gvcfz
2338 compress gVCF file by resizing non-variant blocks according to
2339 specified criteria
2340
2341 impute-info
2342 add imputation information metrics to the INFO field based on
2343 selected FORMAT tags
2344
2345 indel-stats
2346 calculates per-sample or de novo indel stats. The usage and format
2347 is similar to smpl-stats and trio-stats
2348
2349 isecGT
2350 compare two files and set non-identical genotypes to missing
2351
2352 mendelian
2353 count Mendelian consistent / inconsistent genotypes.
2354
2355 missing2ref
2356 sets missing genotypes ("./.") to ref allele ("0/0" or "0|0")
2357
2358 parental-origin
2359 determine parental origin of a CNV region
2360
2361 prune
2362 prune sites by missingness, allele frequency or linkage
2363 disequilibrium. Alternatively, annotate sites with r2, Lewontin’s
2364 D' (PMID:19433632), Ragsdale’s D (PMID:31697386).
2365
2366 remove-overlaps
2367 remove overlapping variants and duplicate sites
2368
2369 scatter
2370 intended as an inverse to bcftools concat, scatter VCF by chunks or
2371 regions, creating multiple VCFs.
2372
2373 setGT
2374 general tool to set genotypes according to rules requested by the
2375 user
2376
2377 smpl-stats
2378 calculates basic per-sample stats. The usage and format is similar
2379 to indel-stats and trio-stats.
2380
2381 split
2382 split VCF by sample, creating single- or multi-sample VCFs
2383
2384 split-vep
2385 extract fields from structured annotations such as INFO/CSQ created
2386 by bcftools/csq or VEP. These can be added as a new INFO field to
2387 the VCF or in a custom text format. See
2388 <http://samtools.github.io/bcftools/howtos/plugin.split-vep.html> for more.
2389
2390 tag2tag
2391 Convert between similar tags, such as GL,PL,GP or QR,QA,QS.
2392
2393 trio-dnm2
2394 screen variants for possible de-novo mutations in trios
2395
2396 trio-stats
2397 calculate transmission rate in trio children. The usage and format
2398 is similar to indel-stats and smpl-stats.
2399
2400 trio-switch-rate
2401 calculate phase switch rate in trio samples, children samples must
2402 have phased GTs
2403
2404 variantkey-hex
2405 generate unsorted VariantKey-RSid index files in hexadecimal format
2406
2407 Examples:
2408 # List options common to all plugins
2409 bcftools plugin
2410
2411 # List available plugins
2412 bcftools plugin -l
2413
2414 # Run a plugin
2415 bcftools plugin counts in.vcf
2416
2417 # Run a plugin using the abbreviated "+" notation
2418 bcftools +counts in.vcf
2419
2420 # Run a plugin from an explicit location
2421 bcftools +/path/to/counts.so in.vcf
2422
2423 # The input VCF can be streamed just like in other commands
2424 cat in.vcf | bcftools +counts
2425
2426 # Print usage information of plugin "dosage"
2427 bcftools +dosage -h
2428
2429 # Replace missing genotypes with 0/0
2430 bcftools +missing2ref in.vcf
2431
2432 # Replace missing genotypes with 0|0
2433 bcftools +missing2ref in.vcf -- -p
2434
2435 Plugins troubleshooting:
2436 Things to check if your plugin does not show up in the bcftools plugin
2437 -l output:
2438
2439 • Run with the -v option for verbose output: bcftools plugin -lv
2440
2441 • Does the environment variable BCFTOOLS_PLUGINS include the correct
2442 path?
2443
2444 Plugins API:
2445 // Short description used by 'bcftools plugin -l'
2446 const char *about(void);
2447
2448 // Longer description used by 'bcftools +name -h'
2449 const char *usage(void);
2450
2451 // Called once at startup, allows initialization of local variables.
2452 // Return 1 to suppress normal VCF/BCF header output, -1 on critical
2453 // errors, 0 otherwise.
2454 int init(int argc, char **argv, bcf_hdr_t *in_hdr, bcf_hdr_t *out_hdr);
2455
2456 // Called for each VCF record, return NULL to suppress the output
2457 bcf1_t *process(bcf1_t *rec);
2458
2459 // Called after all lines have been processed to clean up
2460 void destroy(void);
2461
2462 bcftools polysomy [OPTIONS] file.vcf.gz
2463 Detect the number of chromosomal copies in VCFs annotated with
2464 Illumina’s B-allele frequency (BAF) values. Note that this command is
2465 not compiled in by default, see the section Optional Compilation with
2466 GSL in the INSTALL file for help.
2467
2468 General options:
2469 -o, --output-dir path
2470 output directory
2471
2472 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2473 see Common Options
2474
2475 -R, --regions-file file
2476 see Common Options
2477
2478 -s, --sample string
2479 sample name
2480
2481 -t, --targets LIST
2482 see Common Options
2483
2484 -T, --targets-file FILE
2485 see Common Options
2486
2487 -v, --verbose
2488 verbose debugging output which gives hints about the thresholds and
2489 decisions made by the program. Note that the exact output can
2490 change between versions.
2491
2492 Algorithm options:
2493 -b, --peak-size float
2494 the minimum peak size considered a good match; a value from the
2495 interval [0,1], where larger is stricter
2496
2497 -c, --cn-penalty float
2498 a penalty for increasing copy number state. How this works:
2499 multiple peaks are always a better fit than a single peak,
2500 therefore the program prefers a single peak (normal copy number)
2501 unless the absolute deviation of the multiple peaks fit is
2502 significantly smaller. Here the meaning of "significant" is given
2503 by the float from the interval [0,1] where larger is stricter.
2504
2505 -f, --fit-th float
2506 threshold for goodness of fit (normalized absolute deviation),
2507 smaller is stricter
2508
2509 -i, --include-aa
2510 include also the AA peak in CN2 and CN3 evaluation. This usually
2511 requires increasing -f.
2512
2513 -m, --min-fraction float
2514 minimum distinguishable fraction of aberrant cells. Experience
2515 shows that estimates of 20% and more are trustworthy.
2516
2517 -p, --peak-symmetry float
2518 a heuristic to filter failed fits where the expected peak symmetry
2519 is violated. The float is from the interval [0,1] and larger is
2520 stricter
2521
2522 bcftools query [OPTIONS] file.vcf.gz [file.vcf.gz [...]]
2523 Extracts fields from VCF or BCF files and outputs them in user-defined
2524 format.
2525
2526 -e, --exclude EXPRESSION
2527 exclude sites for which EXPRESSION is true. For valid expressions
2528 see EXPRESSIONS.
2529
2530 -f, --format FORMAT
2531 learn by example, see below
2532
2533 -H, --print-header
2534 print header
2535
2536 -i, --include EXPRESSION
2537 include only sites for which EXPRESSION is true. For valid
2538 expressions see EXPRESSIONS.
2539
2540 -l, --list-samples
2541 list sample names and exit
2542
2543 -o, --output FILE
2544 see Common Options
2545
2546 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2547 see Common Options
2548
2549 -R, --regions-file file
2550 see Common Options
2551
2552 -s, --samples LIST
2553 see Common Options
2554
2555 -S, --samples-file FILE
2556 see Common Options
2557
2558 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2559 see Common Options
2560
2561 -T, --targets-file file
2562 see Common Options
2563
2564 -u, --allow-undef-tags
2565 do not throw an error if there are undefined tags in the format
2566 string, print "." instead
2567
2568 -v, --vcf-list FILE
2569 process multiple VCFs listed in the file
2570
2571 Format:
2572 %CHROM The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER)
2573 %END End position of the REF allele
2574 %END0 End position of the REF allele in 0-based coordinates
2575 %FIRST_ALT Alias for %ALT{0}
2576 %FORMAT Prints all FORMAT fields or a subset of samples with -s or -S
2577 %GT Genotype (e.g. 0/1)
2578 %INFO Prints the whole INFO column
2579 %INFO/TAG Any tag in the INFO column
2580 %IUPACGT Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A)
2581 %LINE Prints the whole line
2582 %MASK Indicates presence of the site in other files (with multiple files)
2583 %N_PASS(expr) Number of samples that pass the filtering expression (see EXPRESSIONS)
2584 %POS0 POS in 0-based coordinates
2585 %PBINOM(TAG) Calculate phred-scaled binomial probability, the allele index is determined from GT
2586 %SAMPLE Sample name
2587 %TAG{INT} Curly brackets to print a subfield (e.g. INFO/TAG{1}, the indexes are 0-based)
2588 %TBCSQ Translated FORMAT/BCSQ. See the csq command above for explanation and examples.
2589 %TGT Translated genotype (e.g. C/A)
2590 %TYPE Variant type (REF, SNP, MNP, INDEL, BND, OTHER)
2591 [] Format fields must be enclosed in brackets to loop over all samples
2592 \n new line
2593 \t tab character
2594
2595 Everything else is printed verbatim.
2596
2597 Examples:
2598 # Print chromosome, position, ref allele and the first alternate allele
2599 bcftools query -f '%CHROM %POS %REF %ALT{0}\n' file.vcf.gz
2600
2601 # Similar to above, but use tabs instead of spaces, add sample name and genotype
2602 bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
2603
2604 # Print FORMAT/GQ fields followed by FORMAT/GT fields
2605 bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf
2606
2607 # Make a BED file: chr, pos (0-based), end pos (1-based), id
2608 bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
2609
2610 # Print only samples with alternate (non-reference) genotypes
2611 bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i'GT="alt"' file.bcf
2612
2613 # Print all samples at sites with at least one alternate genotype
2614 bcftools view -i'GT="alt"' file.bcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
2615
2616 # Print phred-scaled binomial probability from FORMAT/AD tag for all heterozygous genotypes
2617 bcftools query -i'GT="het"' -f'[%CHROM:%POS %SAMPLE %GT %PBINOM(AD)\n]' file.vcf
2618
2619 # Print the second value of AC field if bigger than 10. Note the (unfortunate) difference in
2620 # index subscript notation: formatting expressions (-f) uses "{}" while filtering expressions
2621 # (-i) use "[]". This is for historic reasons and backward-compatibility.
2622 bcftools query -f '%AC{1}\n' -i 'AC[1]>10' file.vcf.gz
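
The BED example above relies on %POS0 and %END. The arithmetic behind the two columns can be sketched in Python as follows. This is only an illustration: it assumes the end position is derived from POS and the length of REF alone (records that carry an INFO/END tag are not handled), and bed_interval is a hypothetical name, not part of bcftools.

```python
def bed_interval(pos, ref):
    """Return (start, end) for a BED line from a 1-based VCF POS and REF allele.

    start is 0-based (like %POS0); end is the 1-based position of the last
    REF base (like %END under the assumption stated above).
    """
    if pos < 1 or not ref:
        raise ValueError("POS must be >= 1 and REF must be non-empty")
    start = pos - 1
    end = pos + len(ref) - 1
    return start, end


# A SNP at position 100 and a 3-base REF allele at position 200.
print(bed_interval(100, "A"))
print(bed_interval(200, "ACG"))
```

Because BED intervals are half-open, the 1-based inclusive end can be used directly as the BED end column.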
2623
2624 bcftools reheader [OPTIONS] file.vcf.gz
2625 Modify header of VCF/BCF files, change sample names.
2626
2627 -f, --fai FILE
2628 add to the header contig names and their lengths from the provided
2629 fasta index file (.fai). Lengths of existing contig lines will be
2630 updated and contig lines not present in the fai file will be
2631 removed
2632
2633 -h, --header FILE
2634 new VCF header
2635
2636 -o, --output FILE
2637 see Common Options
2638
2639 -s, --samples FILE
2640 new sample names, one name per line, in the same order as they
2641 appear in the VCF file. Alternatively, only samples which need to
2642 be renamed can be listed as "old_name new_name\n" pairs separated
2643 by whitespace, each on a separate line. If a sample name contains
2644 spaces, the spaces can be escaped using the backslash character,
2645 for example "Not\ a\ good\ sample\ name".
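
The "old_name new_name" line format described above can be illustrated with a short Python sketch. It splits a line on whitespace that is not preceded by a backslash and then removes the escapes. The helper parse_rename_line is hypothetical and is not the parser used by bcftools; it only mirrors the rules stated in the text.

```python
import re


def parse_rename_line(line):
    """Split an 'old_name new_name' line; a backslash-escaped space is literal.

    Hypothetical helper for illustration, not part of bcftools.
    """
    # Split on runs of whitespace that are not escaped with a backslash.
    fields = re.split(r"(?<!\\)\s+", line.strip())
    if len(fields) != 2:
        raise ValueError("expected exactly two names: %r" % line)
    # Turn each escaped space back into a plain space.
    old_name, new_name = (f.replace("\\ ", " ") for f in fields)
    return old_name, new_name


print(parse_rename_line(r"Not\ a\ good\ sample\ name sampleA"))
```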
2646
2647 -T, --temp-prefix PATH
2648 template for temporary file names, used with -f
2649
2650 --threads INT
2651 see Common Options
2652
2653 bcftools roh [OPTIONS] file.vcf.gz
2654 A program for detecting runs of homo/autozygosity. Only bi-allelic
2655 sites are considered.
2656
2657 The HMM model:
2658 Notation:
2659 D = Data, AZ = autozygosity, HW = Hardy-Weinberg (non-autozygosity),
2660 f = non-ref allele frequency
2661
2662 Emission probabilities:
2663 oAZ = P_i(D|AZ) = (1-f)*P(D|RR) + f*P(D|AA)
2664 oHW = P_i(D|HW) = (1-f)^2 * P(D|RR) + f^2 * P(D|AA) + 2*f*(1-f)*P(D|RA)
2665
2666 Transition probabilities:
2667 tAZ = P(AZ|HW) .. from HW to AZ, the -a parameter
2668 tHW = P(HW|AZ) .. from AZ to HW, the -H parameter
2669
2670 ci = P_i(C) .. probability of cross-over at site i, from genetic map
2671 AZi = P_i(AZ) .. probability of site i being AZ/non-AZ, scaled so that AZi+HWi = 1
2672 HWi = P_i(HW)
2673
2674 P_{i+1}(AZ) = oAZ * max[(1 - tAZ * ci) * AZ{i-1} , tAZ * ci * (1-AZ{i-1})]
2675 P_{i+1}(HW) = oHW * max[(1 - tHW * ci) * (1-AZ{i-1}) , tHW * ci * AZ{i-1}]
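
The two emission probabilities defined above can be written out directly. The following Python sketch evaluates oAZ and oHW for a single site; the function name and argument names are assumptions made for illustration, and the genotype likelihoods P(D|RR), P(D|RA) and P(D|AA) are taken as given inputs.

```python
def emission_probs(f, p_rr, p_ra, p_aa):
    """Return (oAZ, oHW) for one site.

    f    -- non-ref allele frequency
    p_rr -- P(D|RR), p_ra -- P(D|RA), p_aa -- P(D|AA)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0,1]")
    # Autozygous state: only the two homozygous genotypes are possible.
    o_az = (1 - f) * p_rr + f * p_aa
    # Hardy-Weinberg state: genotype frequencies (1-f)^2, 2f(1-f), f^2.
    o_hw = (1 - f) ** 2 * p_rr + f ** 2 * p_aa + 2 * f * (1 - f) * p_ra
    return o_az, o_hw


# Data that strongly support a heterozygous genotype favour the HW state.
print(emission_probs(0.2, 0.01, 0.98, 0.01))
```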
2676
2677 General Options:
2678 --AF-dflt FLOAT
2679 in case allele frequency is not known, use the FLOAT. By default,
2680 sites where allele frequency cannot be determined, or is 0, are
2681 skipped.
2682
2683 --AF-tag TAG
2684 use the specified INFO tag TAG as an allele frequency estimate
2685 instead of the default AC and AN tags. Sites which do not have TAG
2686 will be skipped.
2687
2688 --AF-file FILE
2689 Read allele frequencies from a tab-delimited file containing the
2690 columns: CHROM\tPOS\tREF,ALT\tAF. The file can be compressed with
2691 bgzip and indexed with tabix -s1 -b2 -e2. Sites which are not
2692 present in the FILE or have different reference or alternate allele
2693 will be skipped. Note that such a file can be easily created from a
2694 VCF using:
2695
2696 bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%INFO/TAG\n' file.vcf | bgzip -c > freqs.tab.gz
2697
2698 -b, --buffer-size INT[,INT]
2699 when the entire many-sample file cannot fit into memory, a sliding
2700 buffer approach can be used. The first value is the number of sites
2701 to keep in memory. If negative, it is interpreted as the maximum
2702 memory to use, in MB. The second, optional, value sets the number
2703 of overlapping sites. The default overlap is set to roughly 1% of
2704 the buffer size.
2705
2706 -e, --estimate-AF [TAG],FILE
2707 estimate the allele frequency by recalculating INFO/AC and INFO/AN
2708 on the fly, using the specified TAG which can be either FORMAT/GT
2709 ("GT") or FORMAT/PL ("PL"). If TAG is not given, "GT" is assumed.
2710 Either all samples ("-") or samples listed in FILE will be
2711 included. For example, use "PL,-" to estimate AF from FORMAT/PL of
2712 all samples. If neither -e nor the other --AF-... options are
2713 given, the allele frequency is estimated from AC and AN counts
2714 which are already present in the INFO field.
2715
2716 --exclude EXPRESSION
2717 exclude sites for which EXPRESSION is true. For valid expressions
2718 see EXPRESSIONS.
2719
2720 -G, --GTs-only FLOAT
2721 use genotypes (FORMAT/GT fields) ignoring genotype likelihoods
2722 (FORMAT/PL), setting PL of unseen genotypes to FLOAT. Safe value to
2723 use is 30 to account for GT errors.
2724
2725 --include EXPRESSION
2726 include only sites for which EXPRESSION is true. For valid
2727 expressions see EXPRESSIONS.
2728
2729 -I, --skip-indels
2730 skip indels as their genotypes are usually enriched for errors
2731
2732 -m, --genetic-map FILE
2733 genetic map in the format required also by IMPUTE2. Only the first
2734 and third column are used (position and Genetic_Map(cM)). The FILE
2735 can be a single file or a file mask, where string "{CHROM}" is
2736 replaced with chromosome name.
2737
2738 -M, --rec-rate FLOAT
2739 constant recombination rate per bp. In combination with
2740 --genetic-map, the --rec-rate parameter is interpreted differently,
2741 as FLOAT-fold increase of transition probabilities, which allows
2742 the model to become more sensitive yet still account for
2743 recombination hotspots. Note that also the range of the values is
2744 therefore different in both cases: normally the parameter will be
2745 in the range (1e-3,1e-9) but with --genetic-map it will be in the
2746 range (10,1000).
2747
2748 -o, --output FILE
2749 Write output to the FILE, by default the output is printed on
2750 stdout
2751
2752 -O, --output-type s|r[z]
2753 Generate per-site output (s) or per-region output (r). By default
2754 both types are printed and the output is uncompressed. Add z for a
2755 compressed output.
2756
2757 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2758 see Common Options
2759
2760 -R, --regions-file file
2761 see Common Options
2762
2763 -s, --samples LIST
2764 see Common Options
2765
2766 -S, --samples-file FILE
2767 see Common Options
2768
2769 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2770 see Common Options
2771
2772 -T, --targets-file file
2773 see Common Options
2774
2775 HMM Options:
2776 -a, --hw-to-az FLOAT
2777 P(AZ|HW) transition probability from HW (Hardy-Weinberg) to AZ
2778 (autozygous) state
2779
2780 -H, --az-to-hw FLOAT
2781 P(HW|AZ) transition probability from AZ to HW state
2782
2783 -V, --viterbi-training FLOAT
2784 estimate HMM parameters using Baum-Welch algorithm, using the
2785 convergence threshold FLOAT, e.g. 1e-10 (experimental)
2786
2787 bcftools sort [OPTIONS] file.bcf
2788 -m, --max-mem FLOAT[kMG]
2789 Maximum memory to use. Approximate, affects the number of temporary
2790 files written to the disk. Note that if the command fails at this
2791 step because of too many open files, your system limit on the
2792 number of open files ("ulimit") may need to be increased.
2793
2794 -o, --output FILE
2795 see Common Options
2796
2797 -O, --output-type b|u|z|v
2798 see Common Options
2799
2800 -T, --temp-dir DIR
2801 Use this directory to store temporary files
2802
2803 bcftools stats [OPTIONS] A.vcf.gz [B.vcf.gz]
2804 Parses VCF or BCF and produces text file stats which is suitable for
2805 machine processing and can be plotted using plot-vcfstats. When two
2806 files are given, the program generates separate stats for intersection
2807 and the complements. By default only sites are compared; -s/-S must
2808 be given to include also sample columns. When one VCF file is specified on
2809 the command line, then stats by non-reference allele frequency, depth
2810 distribution, stats by quality and per-sample counts, singleton stats,
2811 etc. are printed. When two VCF files are given, then stats such as
2812 concordance (Genotype concordance by non-reference allele frequency,
2813 Genotype concordance by sample, Non-Reference Discordance) and
2814 correlation are also printed. Per-site discordance (PSD) is also
2815 printed in --verbose mode.
2816
2817 --af-bins LIST|FILE
2818 comma separated list of allele frequency bins (e.g. 0.1,0.5,1) or a
2819 file listing the allele frequency bins one per line (e.g.
2820 0.1\n0.5\n1)
2821
2822 --af-tag TAG
2823 allele frequency INFO tag to use for binning. By default the allele
2824 frequency is estimated from AC/AN, if available, or directly from
2825 the genotypes (GT) if not.
2826
2827 -1, --1st-allele-only
2828 consider only the 1st alternate allele at multiallelic sites
2829
2830 -c, --collapse snps|indels|both|all|some|none
2831 see Common Options
2832
2833 -d, --depth INT,INT,INT
2834 ranges of depth distribution: min, max, and size of the bin
2835
2836 --debug
2837 produce verbose per-site and per-sample output
2838
2839 -e, --exclude EXPRESSION
2840 exclude sites for which EXPRESSION is true. For valid expressions
2841 see EXPRESSIONS.
2842
2843 -E, --exons file.gz
2844 tab-delimited file with exons for indel frameshifts statistics. The
2845 columns of the file are CHR, FROM, TO, with 1-based, inclusive,
2846 positions. The file is BGZF-compressed and indexed with tabix
2847
2848 tabix -s1 -b2 -e3 file.gz
2849
2850 -f, --apply-filters LIST
2851 see Common Options
2852
2853 -F, --fasta-ref ref.fa
2854 faidx indexed reference sequence file to determine INDEL context
2855
2856 -i, --include EXPRESSION
2857 include only sites for which EXPRESSION is true. For valid
2858 expressions see EXPRESSIONS.
2859
2860 -I, --split-by-ID
2861 collect stats separately for sites which have the ID column set
2862 ("known sites") or which do not have the ID column set ("novel
2863 sites").
2864
2865 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2866 see Common Options
2867
2868 -R, --regions-file file
2869 see Common Options
2870
2871 -s, --samples LIST
2872 see Common Options
2873
2874 -S, --samples-file FILE
2875 see Common Options
2876
2877 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2878 see Common Options
2879
2880 -T, --targets-file file
2881 see Common Options
2882
2883 -u, --user-tstv <TAG[:min:max:n]>
2884 collect Ts/Tv stats for any tag using the given binning [0:1:100]
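
The TAG[:min:max:n] argument and its default binning of 0:1:100 can be read as in the following Python sketch. The function parse_user_tstv is a hypothetical helper written for illustration; how bcftools itself validates the argument is not described here and is not assumed.

```python
def parse_user_tstv(spec):
    """Parse 'TAG' or 'TAG:min:max:n' into (tag, min, max, n_bins)."""
    tag, _, rest = spec.partition(":")
    if not tag:
        raise ValueError("missing tag name in %r" % spec)
    if not rest:
        # No binning given: fall back to the documented default 0:1:100.
        return tag, 0.0, 1.0, 100
    parts = rest.split(":")
    if len(parts) != 3:
        raise ValueError("expected TAG:min:max:n, got %r" % spec)
    lo, hi, n_bins = float(parts[0]), float(parts[1]), int(parts[2])
    if n_bins <= 0 or hi <= lo:
        raise ValueError("need max > min and n > 0 in %r" % spec)
    return tag, lo, hi, n_bins


print(parse_user_tstv("QUAL"))
print(parse_user_tstv("DP:0:500:50"))
```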
2885
2886 -v, --verbose
2887 produce verbose per-site and per-sample output
2888
2889 bcftools view [OPTIONS] file.vcf.gz [REGION [...]]
2890 View, subset and filter VCF or BCF files by position and filtering
2891 expression. Convert between VCF and BCF. Former bcftools subset.
2892
2893 Output options
2894 -G, --drop-genotypes
2895 drop individual genotype information (after subsetting if -s option
2896 is set)
2897
2898 -h, --header-only
2899 output the VCF header only
2900
2901 -H, --no-header
2902 suppress the header in VCF output
2903
2904 -l, --compression-level [0-9]
2905 compression level. 0 stands for uncompressed, 1 for best speed and
2906 9 for best compression.
2907
2908 --no-version
2909 see Common Options
2910
2911 -O, --output-type b|u|z|v
2912 see Common Options
2913
2914 -o, --output FILE
2915 output file name. If not present, the default is to print to
2916 standard output (stdout).
2917
2918 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2919 see Common Options
2920
2921 -R, --regions-file file
2922 see Common Options
2923
2924 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2925 see Common Options
2926
2927 -T, --targets-file file
2928 see Common Options
2929
2930 --threads INT
2931 see Common Options
2932
2933 Subset options:
2934 -a, --trim-alt-alleles
2935 remove alleles not seen in the genotype fields from the ALT column.
2936 Note that if no alternate allele remains after trimming, the record
2937 itself is not removed but ALT is set to ".". If the option -s or -S
2938 is given, removes alleles not seen in the subset. INFO and FORMAT
2939 tags declared as Type=A, G or R will be trimmed as well.
2940
2941 --force-samples
2942 only warn about unknown subset samples
2943
2944 -I, --no-update
2945 do not (re)calculate INFO fields for the subset (currently INFO/AC
2946 and INFO/AN)
2947
2948 -s, --samples LIST
2949 see Common Options. Note that it is possible to create multiple
2950 subsets simultaneously using the split plugin.
2951
2952 -S, --samples-file FILE
2953 see Common Options. Note that it is possible to create multiple
2954 subsets simultaneously using the split plugin.
2955
2956 Filter options:
2957 Note that filter options below dealing with counting the number of
2958 alleles will, for speed, first check for the values of AC and AN in the
2959 INFO column to avoid parsing all the genotype (FORMAT/GT) fields in the
2960 VCF. This means that a filter like --min-af 0.1 will be calculated from
2961 INFO/AC and INFO/AN when available or FORMAT/GT otherwise. However, it
2962 will not attempt to use any other existing field, like INFO/AF for
2963 example. For that, use --exclude 'AF<0.1' instead.
2964
2965 Also note that one must be careful when sample subsetting and filtering
2966 is performed in a single command because the order of internal
2967 operations can influence the result. For example, the -i/-e filtering
2968 is performed before sample removal, but the -P filtering is performed
2969 after, and some are inherently ambiguous, for example allele counts can
2970 be taken from the INFO column when present but calculated on the fly
2971 when absent. Therefore it is strongly recommended to spell out the
2972 required order explicitly by separating such commands into two steps.
2973 (Make sure to use the -O u option when piping!)
2974
2975 -c, --min-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
2976 minimum allele count (INFO/AC) of sites to be printed. Specifying
2977 the type of allele is optional and can be set to non-reference
2978 (nref, the default), 1st alternate (alt1), the least frequent
2979 (minor), the most frequent (major) or sum of all but the most
2980 frequent (nonmajor) alleles.
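
The five allele types can be illustrated with a small Python sketch that takes per-allele counts (REF first, then each ALT) and returns the count the threshold would be compared against. The helper allele_count is hypothetical, and treating "minor" and "major" as the minimum and maximum over all alleles including REF is an assumption based on the wording above.

```python
def allele_count(counts, kind="nref"):
    """counts[0] is the REF allele count, counts[1:] are the ALT counts."""
    if len(counts) < 2:
        raise ValueError("need a REF count and at least one ALT count")
    if kind == "nref":
        return sum(counts[1:])           # all non-reference alleles
    if kind == "alt1":
        return counts[1]                 # first alternate allele only
    if kind == "minor":
        return min(counts)               # least frequent allele
    if kind == "major":
        return max(counts)               # most frequent allele
    if kind == "nonmajor":
        return sum(counts) - max(counts)  # everything but the most frequent
    raise ValueError("unknown allele type: %s" % kind)


counts = [6, 3, 1]  # REF, ALT1, ALT2
for kind in ("nref", "alt1", "minor", "major", "nonmajor"):
    print(kind, allele_count(counts, kind))
```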
2981
2982 -C, --max-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
2983 maximum allele count (INFO/AC) of sites to be printed. Specifying
2984 the type of allele is optional and can be set to non-reference
2985 (nref, the default), 1st alternate (alt1), the least frequent
2986 (minor), the most frequent (major) or sum of all but the most
2987 frequent (nonmajor) alleles.
2988
2989 -e, --exclude EXPRESSION
2990 exclude sites for which EXPRESSION is true. For valid expressions
2991 see EXPRESSIONS.
2992
2993 -f, --apply-filters LIST
2994 see Common Options
2995
2996 -g, --genotype [^][hom|het|miss]
2997 include only sites with one or more homozygous (hom), heterozygous
2998 (het) or missing (miss) genotypes. When prefixed with ^, the
2999 logic is reversed; thus ^het excludes sites with heterozygous
3000 genotypes.
3001
3002 -i, --include EXPRESSION
3003 include sites for which EXPRESSION is true. For valid expressions
3004 see EXPRESSIONS.
3005
3006 -k, --known
3007 print known sites only (ID column is not ".")
3008
3009 -m, --min-alleles INT
3010 print sites with at least INT alleles listed in REF and ALT columns
3011
3012 -M, --max-alleles INT
3013 print sites with at most INT alleles listed in REF and ALT columns.
3014 Use -m2 -M2 -v snps to only view biallelic SNPs.
3015
3016 -n, --novel
3017 print novel sites only (ID column is ".")
3018
3019 -p, --phased
3020 print sites where all samples are phased. Haploid genotypes are
3021 considered phased. Missing genotypes are considered unphased unless the
3022 phased bit is set.
3023
3024 -P, --exclude-phased
3025 exclude sites where all samples are phased
3026
3027 -q, --min-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
3028 minimum allele frequency (INFO/AC / INFO/AN) of sites to be
3029 printed. Specifying the type of allele is optional and can be set
3030 to non-reference (nref, the default), 1st alternate (alt1), the
3031 least frequent (minor), the most frequent (major) or sum of all but
3032 the most frequent (nonmajor) alleles.
3033
3034 -Q, --max-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
3035 maximum allele frequency (INFO/AC / INFO/AN) of sites to be
3036 printed. Specifying the type of allele is optional and can be set
3037 to non-reference (nref, the default), 1st alternate (alt1), the
3038 least frequent (minor), the most frequent (major) or sum of all but
3039 the most frequent (nonmajor) alleles.
3040
3041 -u, --uncalled
3042 print sites without a called genotype
3043
3044 -U, --exclude-uncalled
3045 exclude sites without a called genotype
3046
3047 -v, --types snps|indels|mnps|other
3048 comma-separated list of variant types to select. Site is selected
3049 if any of the ALT alleles is of the type requested. Types are
3050 determined by comparing the REF and ALT alleles in the VCF record
3051 not INFO tags like INFO/INDEL or INFO/VT. Use --include to select
3052 based on INFO tags.
3053
3054 -V, --exclude-types snps|indels|mnps|ref|bnd|other
3055 comma-separated list of variant types to exclude. Site is excluded
3056 if any of the ALT alleles is of the type requested. Types are
3057 determined by comparing the REF and ALT alleles in the VCF record
3058 not INFO tags like INFO/INDEL or INFO/VT. Use --exclude to exclude
3059 based on INFO tags.
3060
3061 -x, --private
3062 print sites where only the subset samples carry a non-reference
3063 allele. Requires --samples or --samples-file.
3064
3065 -X, --exclude-private
3066 exclude sites where only the subset samples carry a non-reference
3067 allele
3068
3069 bcftools help [COMMAND] | bcftools --help [COMMAND]
3070 Display a brief usage message listing the bcftools commands
3071 available. If the name of a command is also given, e.g., bcftools help
3072 view, the detailed usage message for that particular command is
3073 displayed.
3074
3075 bcftools [--version|-v]
3076 Display the version numbers and copyright information for bcftools and
3077 the important libraries used by bcftools.
3078
3079 bcftools [--version-only]
3080 Display the full bcftools version number in a machine-readable format.
3081
3082 EXPRESSIONS
3083 These filtering expressions are accepted by most of the commands.
3084
3085 Valid expressions may contain:
3086
3087 • numerical constants, string constants, file names (this is
3088 currently supported only to filter by the ID column)
3089
3090 1, 1.0, 1e-4
3091 "String"
3092 @file_name
3093
3094 • arithmetic operators
3095
3096 +,*,-,/
3097
3098 • comparison operators
3099
3100 == (same as =), >, >=, <=, <, !=
3101
3102 • regex operators "~" and its negation "!~". The expressions are
3103 case sensitive unless "/i" is added.
3104
3105 INFO/HAYSTACK ~ "needle"
3106 INFO/HAYSTACK ~ "NEEDless/i"
3107
3108 • parentheses
3109
3110 (, )
3111
3112 • logical operators. See also the examples below and the filtering
3113 tutorial <http://samtools.github.io/bcftools/howtos/filtering.html>
3114 about the distinction between "&&" vs "&" and "||" vs
3115 "|".
3116
3117 &&, &, ||, |
3118
3119 • INFO tags, FORMAT tags, column names
3120
3121 INFO/DP or DP
3122 FORMAT/DV, FMT/DV, or DV
3123 FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]
3124
3125 • starting with 1.11, the FILTER column can be queried as follows:
3126
3127 FILTER="PASS"
3128 FILTER="A" .. exact match, for example "A;B" does not pass
3129 FILTER!="A" .. exact match, for example "A;B" does pass
3130 FILTER~"A" .. both "A" and "A;B" pass
3131 FILTER!~"A" .. neither "A" nor "A;B" pass
3132
3133 • 1 (or 0) to test the presence (or absence) of a flag
3134
3135 FlagA=1 && FlagB=0
3136
3137 • "." to test missing values
3138
3139 DP=".", DP!=".", ALT="."
3140
3141 • missing genotypes can be matched regardless of phase and ploidy
3142 (".|.", "./.", ".", "0|.") using these expressions
3143
3144 GT="mis", GT~"\.", GT!~"\."
3145
3146 • missing genotypes can be matched including the phase and ploidy
3147 (".|.", "./.", ".") using these expressions
3148
3149 GT=".|.", GT="./.", GT="."
3150
3151 • sample genotype: reference (haploid or diploid), alternate (hom or
3152 het, haploid or diploid), missing genotype, homozygous,
3153 heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het,
3154 alt-alt het, haploid ref, haploid alt (case-insensitive)
3155
3156 GT="ref"
3157 GT="alt"
3158 GT="mis"
3159 GT="hom"
3160 GT="het"
3161 GT="hap"
3162 GT="RR"
3163 GT="AA"
3164 GT="RA" or GT="AR"
3165 GT="Aa" or GT="aA"
3166 GT="R"
3167 GT="A"
3168
3169 • TYPE for variant type in REF,ALT columns
3170 (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "~"
3171 to require at least one allele of the given type or the equal sign
3172 "=" to require that all alleles are of the given type. Compare
3173
3174 TYPE="snp"
3175 TYPE~"snp"
3176 TYPE!="snp"
3177 TYPE!~"snp"
3178
3179 • array subscripts (0-based), "*" for any element, "-" to indicate a
3180 range. Note that for querying FORMAT vectors, the colon ":" can be
3181 used to select a sample and an element of the vector, as shown in
3182 the examples below
3183
3184 INFO/AF[0] > 0.3 .. first AF value bigger than 0.3
3185 FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30
3186 FORMAT/AD[0:1] .. first sample, second AD value
3187 FORMAT/AD[1:0] .. second sample, first AD value
3188 DP4[*] == 0 .. any DP4 value
3189 FORMAT/DP[0] > 30 .. DP of the first sample bigger than 30
3190 FORMAT/DP[1-3] > 10 .. samples 2-4
3191 FORMAT/DP[1-] < 7 .. all samples but the first
3192 FORMAT/DP[0,2-4] > 20 .. samples 1, 3-5
3193 FORMAT/AD[0:1] .. first sample, second AD field
3194 FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
3195 FORMAT/AD[*:1] or AD[:1] .. any sample, second AD field
3196 (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
3197 CSQ[*] ~ "missense_variant.*deleterious"
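
The subscript notation in the examples above ("*", single indexes, ranges, open ranges and comma-separated lists) can be expanded into explicit 0-based indexes as in this Python sketch. The function expand_subscript is a hypothetical helper, not the bcftools parser, and the sample:element form with a colon is deliberately left out.

```python
def expand_subscript(spec, n):
    """Expand '0', '1-3', '1-', '0,2-4' or '*' into 0-based indexes below n."""
    if spec in ("*", ""):
        return list(range(n))            # any element
    indices = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            start = int(first)
            stop = int(last) if last else n - 1  # open range runs to the end
        else:
            start = stop = int(part)
        # Keep only indexes that exist for a vector of length n.
        indices.extend(i for i in range(start, stop + 1) if i < n)
    return indices


print(expand_subscript("1-3", 6))    # samples 2-4
print(expand_subscript("1-", 4))     # all samples but the first
print(expand_subscript("0,2-4", 6))  # samples 1, 3-5
```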
3198
3199 • with many samples it can be more practical to provide a file with
3200 sample names, one sample name per line
3201
3202 GT[@samples.txt]="het" & binom(AD)<0.01
3203
3204 • function on FORMAT tags (over samples) and INFO tags (over vector
3205 fields): maximum; minimum; arithmetic mean (AVG is synonymous with
3206 MEAN); median; standard deviation from mean; sum; string length;
3207 absolute value; number of elements:
3208
3209 MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT
3210
3211 Note that functions above evaluate to a single value across all
3212 samples and are intended to select sites, not samples, even when
3213 applied on FORMAT tags. However, when prefixed with SMPL_ (or "s"
3214 for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector
3215 of per-sample values when applied on FORMAT tags:
3216
3217 SMPL_MAX, SMPL_MIN, SMPL_AVG, SMPL_MEAN, SMPL_MEDIAN, SMPL_STDEV, SMPL_SUM,
3218 sMAX, sMIN, sAVG, sMEAN, sMEDIAN, sSTDEV, sSUM
3219
3220 • two-tailed binomial test. Note that for N=0 the test evaluates to a
3221 missing value and when FORMAT/GT is used to determine the vector
3222 indices, it evaluates to 1 for homozygous genotypes.
3223
3224 binom(FMT/AD) .. GT can be used to determine the correct index
3225 binom(AD[0],AD[1]) .. or the fields can be given explicitly
3226 phred(binom()) .. the same as binom but phred-scaled
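
To make the behaviour described above concrete, the following Python sketch computes a two-tailed binomial p-value for two allele depths under an expected ratio of 0.5 and phred-scales it. It is one common formulation (doubling the tail beyond the larger count and capping at 1); whether it matches the bcftools implementation in every detail is an assumption, and the function names are hypothetical.

```python
import math


def binom_two_sided(na, nb):
    """Two-tailed binomial p-value for counts na and nb with p = 0.5.

    Returns None (a missing value) when na + nb == 0.
    """
    n = na + nb
    if n == 0:
        return None
    if na == nb:
        return 1.0
    k = max(na, nb)
    # Probability of seeing k or more of the larger allele out of n reads.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def phred(p):
    """Phred-scale a probability: -10 * log10(p); p must be positive."""
    return -10.0 * math.log10(p)


p = binom_two_sided(10, 0)
print(p, phred(p))
```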
3227
3228 • variables calculated on the fly if not present: number of alternate
3229 alleles; number of samples; count of alternate alleles; minor
3230 allele count (similar to AC but is always smaller than 0.5);
3231 frequency of alternate alleles (AF=AC/AN); frequency of minor
3232 alleles (MAF=MAC/AN); number of alleles in called genotypes; number
3233 of samples with missing genotype; fraction of samples with missing
3234 genotype; indel length (deletions negative, insertions positive)
3235
3236 N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
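
The relationships between these variables can be sketched in Python for a biallelic site. The sketch derives the values from genotype strings; site_summary is a hypothetical helper, and counting a sample as missing when any of its alleles is "." is an assumption made for this illustration.

```python
def site_summary(genotypes):
    """Summarise a biallelic site from GT strings such as '0/1' or '.|.'."""
    an = ac = n_missing = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            n_missing += 1
        for allele in alleles:
            if allele == ".":
                continue          # missing alleles do not contribute to AN
            an += 1
            if allele != "0":
                ac += 1
    mac = min(ac, an - ac)        # count of the less frequent allele
    return {
        "N_SAMPLES": len(genotypes),
        "AN": an,
        "AC": ac,
        "MAC": mac,
        "AF": ac / an if an else None,
        "MAF": mac / an if an else None,
        "N_MISSING": n_missing,
        "F_MISSING": n_missing / len(genotypes) if genotypes else None,
    }


print(site_summary(["0/1", "1|1", "./.", "0/0"]))
```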
3237
3238 • the number (N_PASS) or fraction (F_PASS) of samples which pass the
3239 expression
3240
3241 N_PASS(GQ>90 & GT!="mis") > 90
3242 F_PASS(GQ>90 & GT!="mis") > 0.9
3243
3244 • custom perl filtering. Note that this command is not compiled in by
3245 default, see the section Optional Compilation with Perl in the
3246 INSTALL file for help and misc/demo-flt.pl for a working example.
3247 The demo defines the perl subroutine "severity" which can be
3248 invoked from the command line as follows:
3249
3250 perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
3251
3252 Notes:
3253
3254 • String comparisons and regular expressions are case-insensitive
3255
3256 • Comma in strings is interpreted as a separator and when multiple
3257 values are compared, the OR logic is used. Consequently, the
3258 following two expressions are equivalent but not the third:
3259
3260 -i 'TAG="hello,world"'
3261 -i 'TAG="hello" || TAG="world"'
3262 -i 'TAG="hello" && TAG="world"'
3263
3264 • Variables and function names are case-insensitive, but not tag
3265 names. For example, "qual" can be used instead of "QUAL",
3266 "strlen()" instead of "STRLEN()" , but not "dp" instead of "DP".
3267
3268 • When querying multiple values, all elements are tested and the OR
3269 logic is used on the result. For example, when querying
3270 "TAG=1,2,3,4", it will be evaluated as follows:
3271
3272 -i 'TAG[*]=1' .. true, the record will be printed
3273 -i 'TAG[*]!=1' .. true
3274 -e 'TAG[*]=1' .. false, the record will be discarded
3275 -e 'TAG[*]!=1' .. false
3276 -i 'TAG[0]=1' .. true
3277 -i 'TAG[0]!=1' .. false
3278 -e 'TAG[0]=1' .. false
3279 -e 'TAG[0]!=1' .. true
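
The table above follows from two rules: the expression is true if any tested element satisfies the comparison, and -e prints a record only when the expression is false. A short Python sketch, with hypothetical helper names, reproduces the eight outcomes for TAG=1,2,3,4.

```python
import operator

OPS = {"=": operator.eq, "!=": operator.ne}


def expr_true(values, op, target, index=None):
    """Evaluate TAG[*] op target (index=None) or TAG[index] op target."""
    selected = values if index is None else [values[index]]
    # OR logic: one matching element is enough.
    return any(OPS[op](v, target) for v in selected)


def record_printed(mode, values, op, target, index=None):
    """mode is '-i' (print when true) or '-e' (print when false)."""
    hit = expr_true(values, op, target, index)
    return hit if mode == "-i" else not hit


tag = [1, 2, 3, 4]
for mode in ("-i", "-e"):
    for index in (None, 0):
        for op in ("=", "!="):
            subscript = "*" if index is None else index
            print(mode, "TAG[%s]%s1" % (subscript, op),
                  record_printed(mode, tag, op, 1, index))
```

Note that -e 'TAG[*]!=1' is not the complement of -i 'TAG[*]!=1' in the sense of "all elements equal 1"; both test whether at least one element differs from 1.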
3280
3281 Examples:
3282
3283 MIN(DV)>5 .. selects the whole site, evaluates min across all values and samples
3284
3285 SMPL_MIN(DV)>5 .. selects matching samples, evaluates within samples
3286
3287 MIN(DV/DP)>0.3
3288
3289 MIN(DP)>10 & MIN(DV)>3
3290
3291 FMT/DP>10 & FMT/GQ>10 .. both conditions must be satisfied within one sample
3292
3293 FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
3294
3295 QUAL>10 | FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
3296
3297 QUAL>10 || FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
3298
3299 TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
3300
3301 COUNT(GT="hom")=0 .. no homozygous genotypes at the site
3302
3303 AVG(GQ)>50 .. average (arithmetic mean) of genotype qualities bigger than 50
3304
3305 ID=@file .. selects lines with ID present in the file
3306
3307 ID!=@~/file .. skip lines with ID present in the ~/file
3308
3309 MAF[0]<0.05 .. select rare variants at 5% cutoff
3310
3311 POS>=100 .. restrict your range query, e.g. 20:100-200 to strictly sites with POS in that range.
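
The difference between "&" and "&&" in the FMT/DP and FMT/GQ examples above can be sketched in Python. Each sample is represented as a (DP, GQ) pair; the function names are hypothetical, and the sketch covers only the site-level truth value, not which samples end up selected.

```python
def single_amp(samples, dp_min=10, gq_min=10):
    """'FMT/DP>10 & FMT/GQ>10': both conditions hold within one sample."""
    return any(dp > dp_min and gq > gq_min for dp, gq in samples)


def double_amp(samples, dp_min=10, gq_min=10):
    """'FMT/DP>10 && FMT/GQ>10': conditions may hold in different samples."""
    return (any(dp > dp_min for dp, _ in samples)
            and any(gq > gq_min for _, gq in samples))


# First sample has high DP but low GQ, second sample the other way round.
samples = [(20, 5), (5, 20)]
print(single_amp(samples), double_amp(samples))
```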
3312
3313 Shell expansion:
3314
3315 Note that expressions must often be quoted because some characters have
3316 special meaning in the shell. Enclosing the expression in single
3317 quotes ensures that the whole expression is passed to the program as
3318 intended, for example:
3319
3320 bcftools view -i '%ID!="." & MAF[0]<0.01'
3321
3322 Please refer to the documentation of your shell for details.
3323
3325 plot-vcfstats [OPTIONS] file.vchk [...]
3326 Script for processing output of bcftools stats. It can merge results
3327 from multiple outputs (useful when running the stats for each
3328 chromosome separately), plots graphs and creates a PDF presentation.
3329
3330 -m, --merge
3331 Merge vcfstats files to STDOUT, skip plotting.
3332
3333 -p, --prefix DIR
3334 The output directory. This directory will be created if it does not
3335 exist.
3336
3337 -P, --no-PDF
3338 Skip the PDF creation step.
3339
3340 -r, --rasterize
3341 Rasterize PDF images for faster rendering. This is the default and
3342 the opposite of -v, --vectors.
3343
3344 -s, --sample-names
3345 Use sample names for xticks rather than numeric IDs.
3346
3347 -t, --title STRING
3348 Identify files by these titles in plots. The option can be given
3349 multiple times, for each ID in the bcftools stats output. If not
3350 present, the script will use abbreviated source file names for the
3351 titles.
3352
3353 -v, --vectors
3354 Generate vector graphics for PDF images, the opposite of -r,
3355 --rasterize.
3356
3357 -T, --main-title STRING
3358 Main title for the PDF.
3359
3360 Example:
3361
3362 # Generate the stats
3363 bcftools stats -s - file.vcf.gz > file.vchk
3364
3365 # Plot the stats
3366 plot-vcfstats -p outdir file.vchk
3367
3368 # The final looks can be customized by editing the generated
3369 # 'outdir/plot.py' script and re-running manually
3370 cd outdir && python plot.py && pdflatex summary.tex
3371
3373 HTSlib was designed with BCF format in mind. When parsing VCF files,
3374 all records are internally converted into BCF representation. Simple
3375 operations, like removing a single column from a VCF file, can
3376 therefore be done much faster with standard UNIX commands, such as awk or
3377 cut. Therefore it is recommended to use BCF as input/output format
3378 whenever possible to avoid large overhead of the VCF → BCF → VCF
3379 conversion.
3380
3382 Please report any bugs you encounter on the github website:
3383 <http://github.com/samtools/bcftools>
3384
3386 Heng Li from the Sanger Institute wrote the original C version of
3387 htslib, samtools and bcftools. Bob Handsaker from the Broad Institute
3388 implemented the BGZF library. Petr Danecek, Shane McCarthy and John
3389 Marshall are maintaining and further developing bcftools. Many other
3390 people contributed to the program and to the file format
3391 specifications, both directly and indirectly by providing patches,
3392 testing and reporting bugs. We thank them all.
3393
3395 BCFtools GitHub website: <http://github.com/samtools/bcftools>
3396
3397 Samtools GitHub website: <http://github.com/samtools/samtools>
3398
3399 HTSlib GitHub website: <http://github.com/samtools/htslib>
3400
3401 File format specifications: <http://samtools.github.io/hts-specs>
3402
3403 BCFtools documentation: <http://samtools.github.io/bcftools>
3404
3405 BCFtools wiki page: <https://github.com/samtools/bcftools/wiki>
3406
3408 The MIT/Expat License or GPL License, see the LICENSE document for
3409 details. Copyright (c) Genome Research Ltd.
3410
3411
3412
3413 2021-07-07 BCFTOOLS(1)