1BCFTOOLS(1) BCFTOOLS(1)
2
3
4
6 bcftools - utilities for variant calling and manipulating VCFs and
7 BCFs.
8
10 bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]
11
13 BCFtools is a set of utilities that manipulate variant calls in the
14 Variant Call Format (VCF) and its binary counterpart BCF. All commands
15 work transparently with both VCFs and BCFs, both uncompressed and
16 BGZF-compressed.
17
18 Most commands accept VCF, bgzipped VCF and BCF with filetype detected
19 automatically even when streaming from a pipe. Indexed VCF and BCF will
20 work in all situations. Un-indexed VCF and BCF and streams will work in
21 most, but not all situations. In general, whenever multiple VCFs are
22 read simultaneously, they must be indexed and therefore also
23 compressed. (Note that files with non-standard index names can be
24 accessed as e.g. "bcftools view -r X:2928329
25 file.vcf.gz##idx##non-standard-index-name".)
26
27 BCFtools is designed to work on a stream. It regards an input file "-"
28 as the standard input (stdin) and outputs to the standard output
29 (stdout). Several commands can thus be combined with Unix pipes.
30
31 VERSION
32 This manual page was last updated 2022-04-07 and refers to bcftools git
33 version 1.15.1.
34
35 BCF1
36 The BCF1 format output by versions of samtools <= 0.1.19 is not
37 compatible with this version of bcftools. To read BCF1 files one can
38 use the view command from old versions of bcftools packaged with
39 samtools versions <= 0.1.19 to convert to VCF, which can then be read
40 by this version of bcftools.
41
42 samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view
43
44 VARIANT CALLING
45 See bcftools call for variant calling from the output of the samtools
46 mpileup command. In versions of samtools <= 0.1.19 calling was done
47 with bcftools view. Users are now required to choose between the old
48 samtools calling model (-c/--consensus-caller) and the new multiallelic
49 calling model (-m/--multiallelic-caller). The multiallelic calling
50 model is recommended for most tasks.
51
53 For a full list of available commands, run bcftools without arguments.
54 For a full list of available options, run bcftools COMMAND without
55 arguments.
56
57 • annotate .. edit VCF files, add or remove annotations
58
59 • call .. SNP/indel calling (former "view")
60
61 • cnv .. Copy Number Variation caller
62
63 • concat .. concatenate VCF/BCF files from the same set of samples
64
65 • consensus .. create consensus sequence by applying VCF variants
66
67 • convert .. convert VCF/BCF to other formats and back
68
69 • csq .. haplotype aware consequence caller
70
71 • filter .. filter VCF/BCF files using fixed thresholds
72
73 • gtcheck .. check sample concordance, detect sample swaps and
74 contamination
75
76 • head .. view VCF/BCF file headers
77
78 • index .. index VCF/BCF
79
80 • isec .. intersections of VCF/BCF files
81
• merge .. merge VCF/BCF files from non-overlapping sample
sets
84
85 • mpileup .. multi-way pileup producing genotype likelihoods
86
87 • norm .. normalize indels
88
89 • plugin .. run user-defined plugin
90
91 • polysomy .. detect contaminations and whole-chromosome aberrations
92
93 • query .. transform VCF/BCF into user-defined formats
94
95 • reheader .. modify VCF/BCF header, change sample names
96
97 • roh .. identify runs of homo/auto-zygosity
98
99 • sort .. sort VCF/BCF files
100
101 • stats .. produce VCF/BCF stats (former vcfcheck)
102
103 • view .. subset, filter and convert VCF and BCF files
104
106 Some helper scripts are bundled with the bcftools code.
107
108 • plot-vcfstats .. plots the output of stats
109
111 Common Options
112 The following options are common to many bcftools commands. See usage
113 for specific commands to see if they apply.
114
115 FILE
Files can be either VCF or BCF, uncompressed or BGZF-compressed.
The file "-" is interpreted as standard input. Some tools may
require tabix- or CSI-indexed files.
119
120 -c, --collapse snps|indels|both|all|some|none|id
121 Controls how to treat records with duplicate positions and defines
122 compatible records across multiple input files. Here by
123 "compatible" we mean records which should be considered as
124 identical by the tools. For example, when performing line
125 intersections, the desire may be to consider as identical all sites
126 with matching positions (bcftools isec -c all), or only sites with
127 matching variant type (bcftools isec -c snps -c indels), or only
128 sites with all alleles identical (bcftools isec -c none).
129
130 none
131 only records with identical REF and ALT alleles are compatible
132
133 some
134 only records where some subset of ALT alleles match are
135 compatible
136
137 all
138 all records are compatible, regardless of whether the ALT
139 alleles match or not. In the case of records with the same
140 position, only the first will be considered and appear on
141 output.
142
143 snps
144 any SNP records are compatible, regardless of whether the ALT
145 alleles match or not. For duplicate positions, only the first
146 SNP record will be considered and appear on output.
147
148 indels
149 all indel records are compatible, regardless of whether the
150 REF and ALT alleles match or not. For duplicate positions, only
151 the first indel record will be considered and appear on output.
152
153 both
154 abbreviation of "-c indels -c snps"
155
156 id
157 only records with identical ID column are compatible. Supported
158 by bcftools merge only.
159
160 -f, --apply-filters LIST
161 Skip sites where FILTER column does not contain any of the strings
162 listed in LIST. For example, to include only sites which have no
163 filters set, use -f .,PASS.
164
165 --no-version
166 Do not append version and command line information to the output
167 VCF header.
168
169 -o, --output FILE
170 When output consists of a single stream, write it to FILE rather
171 than to standard output, where it is written by default. The file
172 type is determined automatically from the file name suffix and in
173 case a conflicting -O option is given, the file name suffix takes
174 precedence.
175
176 -O, --output-type b|u|z|v[0-9]
Output compressed BCF (b), uncompressed BCF (u), compressed VCF
(z), uncompressed VCF (v). Use the -Ou option when piping between
bcftools subcommands to speed up performance by removing
unnecessary compression/decompression and VCF↔BCF conversion. The
compression level of the compressed formats (b and z) can be set
by appending a number between 0-9.
183
184 -r, --regions chr|chr:pos|chr:beg-end|chr:beg-[,...]
185 Comma-separated list of regions, see also -R, --regions-file.
186 Overlapping records are matched even when the starting coordinate
187 is outside of the region, unlike the -t/-T options where only the
188 POS coordinate is checked. Note that -r cannot be used in
189 combination with -R.
190
191 -R, --regions-file FILE
192 Regions can be specified either on command line or in a VCF, BED,
193 or tab-delimited file (the default). The columns of the
194 tab-delimited file can contain either positions (two-column format)
195 or intervals (three-column format): CHROM, POS, and, optionally,
196 END, where positions are 1-based and inclusive. The columns of the
197 tab-delimited BED file are also CHROM, POS and END (trailing
198 columns are ignored), but coordinates are 0-based, half-open. To
199 indicate that a file be treated as BED rather than the 1-based
200 tab-delimited file, the file must have the ".bed" or ".bed.gz"
201 suffix (case-insensitive). Uncompressed files are stored in memory,
202 while bgzip-compressed and tabix-indexed region files are streamed.
203 Note that sequence names must match exactly, "chr20" is not the
204 same as "20". Also note that chromosome ordering in FILE will be
205 respected, the VCF will be processed in the order in which
206 chromosomes first appear in FILE. However, within chromosomes, the
207 VCF will always be processed in ascending genomic coordinate order
208 no matter what order they appear in FILE. Note that overlapping
209 regions in FILE can result in duplicated out of order positions in
210 the output. This option requires indexed VCF/BCF files. Note that
211 -R cannot be used in combination with -r.
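The two coordinate conventions accepted by -R are easy to confuse. The following is a minimal sketch of converting a BED file into the equivalent 1-based, inclusive tab-delimited form. The file names regions.bed and regions.tab are hypothetical, and the conversion assumes a plain three-column BED without track or browser lines.

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)

# Two BED intervals: 0-based, half-open (start included, end excluded).
printf '20\t0\t100\n20\t999\t2000\n' > "$workdir/regions.bed"

# BED start + 1 gives the 1-based first position; the BED end already
# equals the 1-based, inclusive last position, so it is left unchanged.
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3 }' \
    "$workdir/regions.bed" > "$workdir/regions.tab"

cat "$workdir/regions.tab"
```

Both files describe the same bases (20:1-100 and 20:1000-2000). Which convention bcftools applies is decided by the file suffix alone, so the converted file must not be named with a ".bed" suffix.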
212
213 --regions-overlap pos|record|variant|0|1|2
214 This option controls how overlapping records are determined: set to
215 pos or 0 if the VCF record has to have POS inside a region (this
216 corresponds to the default behavior of -t/-T); set to record or 1
217 if also overlapping records with POS outside a region should be
218 included (this is the default behavior of -r/-R, and includes
219 indels with POS at the end of a region, which are technically
220 outside the region); or set to variant or 2 to include only true
221 overlapping variation (compare the full VCF representation "TA>T-"
222 vs the true sequence variation "A>-").
223
224 -s, --samples [^]LIST
Comma-separated list of samples to include, or to exclude if
prefixed with "^". (Note that when multiple samples are to be
excluded, the "^" prefix is still present only once, e.g.
"^SAMPLE1,SAMPLE2".)
228 The sample order is updated to reflect that given on the command
229 line. Note that in general tags such as INFO/AC, INFO/AN, etc are
230 not updated to correspond to the subset samples. bcftools view is
231 the exception where some tags will be updated (unless the -I,
232 --no-update option is used; see bcftools view documentation). To
233 use updated tags for the subset in another command one can pipe
234 from view into that command. For example:
235
bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f '%INFO/AC\t%INFO/AN\n'
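To try the subset-then-query pattern without real data, the sketch below writes a tiny three-sample VCF and runs the same pipeline on it. The file name tiny.vcf, the contig and the genotypes are all made up for illustration, and the bcftools step is skipped when bcftools is not installed.

```shell
# Hypothetical three-sample VCF written to a scratch directory.
vcfdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=1,length=1000>\n'
    printf '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">\n'
    printf '##INFO=<ID=AN,Number=1,Type=Integer,Description="Allele number">\n'
    printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3\n'
    printf '1\t100\t.\tA\tG\t.\t.\tAC=3;AN=6\tGT\t0/1\t0/1\t0/1\n'
} > "$vcfdir/tiny.vcf"

# Run the pipeline only where bcftools is available.
if command -v bcftools >/dev/null 2>&1; then
    # view subsets to two samples; query then prints AC and AN as seen
    # by the downstream command.
    bcftools view -Ou -s sample1,sample2 "$vcfdir/tiny.vcf" |
        bcftools query -f '%INFO/AC\t%INFO/AN\n'
fi
```

Quoting the format string matters: without the quotes the shell strips the backslashes before bcftools sees them, so "\t" and "\n" are no longer interpreted as tab and newline.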
237
238 -S, --samples-file [^]FILE
239 File of sample names to include or exclude if prefixed with "^".
240 One sample per line. See also the note above for the -s, --samples
241 option. The sample order is updated to reflect that given in the
242 input file. The command bcftools call accepts an optional second
243 column indicating ploidy (0, 1 or 2) or sex (as defined by
244 --ploidy, for example "F" or "M"), for example:
245
246 sample1 1
247 sample2 2
248 sample3 2
249
250 or
251
252 sample1 M
253 sample2 F
254 sample3 F
255
If the second column is not present, the sex "F" is assumed. With
bcftools call -C trio, a PED file is expected. The program ignores
the first column, and the last column indicates sex (1=male,
2=female), for example:
260
261 ignored_column daughterA fatherA motherA 2
262 ignored_column sonB fatherB motherB 1
263
264 -t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]
Similar to -r, --regions, but the next position is accessed by
266 streaming the whole VCF/BCF rather than using the tbi/csi index.
267 Both -r and -t options can be applied simultaneously: -r uses the
268 index to jump to a region and -t discards positions which are
269 not in the targets. Unlike -r, targets can be prefixed with "^" to
270 request logical complement. For example, "^X,Y,MT" indicates that
271 sequences X, Y and MT should be skipped. Yet another difference
272 between the -t/-T and -r/-R is that -r/-R checks for proper
273 overlaps and considers both POS and the end position of an indel,
274 while -t/-T considers the POS coordinate only (by default; see also
275 --regions-overlap and --targets-overlap). Note that -t cannot be
276 used in combination with -T.
277
278 -T, --targets-file [^]FILE
Same as -t, --targets, but reads regions from a file. Note that -T
280 cannot be used in combination with -t.
281
With the call -C alleles command, the third column of the targets
file must be a comma-separated list of alleles, starting with the
reference allele. Note that the file must be compressed and
indexed. Such a file can be easily created from a VCF using:
286
287 bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz
288
289 --targets-overlap pos|record|variant|0|1|2
290 Same as --regions-overlap but for -t/-T.
291
292 --threads INT
293 Use multithreading with INT worker threads. The option is currently
294 used only for the compression of the output stream, only when
295 --output-type is b or z. Default: 0.
296
297 bcftools annotate [OPTIONS] FILE
298 Add or remove annotations.
299
300 -a, --annotations file
301 Bgzip-compressed and tabix-indexed file with annotations. The file
302 can be VCF, BED, or a tab-delimited file with mandatory columns
303 CHROM, POS (or, alternatively, FROM and TO), optional columns REF
304 and ALT, and arbitrary number of annotation columns. BED files are
305 expected to have the ".bed" or ".bed.gz" suffix (case-insensitive),
306 otherwise a tab-delimited file is assumed. Note that in case of
307 tab-delimited file, the coordinates POS, FROM and TO are one-based
308 and inclusive. When REF and ALT are present, only matching VCF
309 records will be annotated. If the END coordinate is present in the
310 annotation file and given on command line as "-c ~INFO/END", then
311 VCF records will be matched also by the INFO/END coordinate. If ID
312 is present in the annotation file and given as "-c ~ID", then VCF
313 records will be matched also by the ID column.
314
315 When multiple ALT alleles are present in the annotation file (given
316 as comma-separated list of alleles), at least one must match one of
317 the alleles in the corresponding VCF record. Similarly, at least
318 one alternate allele from a multi-allelic VCF record must be
319 present in the annotation file.
320
321 Missing values can be added by providing "." in place of actual
322 value and using the missing value modifier with -c, such as ".TAG".
323
324 Note that flag types, such as "INFO/FLAG", can be annotated by
325 including a field with the value "1" to set the flag, "0" to remove
326 it, or "." to keep existing flags. See also -c, --columns and -h,
327 --header-lines.
328
329 # Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG
330 1 752566 SomeString 5
331 1 798959 SomeOtherString 6
332
333 -c, --columns list
334 Comma-separated list of columns or tags to carry over from the
335 annotation file (see also -a, --annotations). If the annotation
336 file is not a VCF/BCF, list describes the columns of the annotation
337 file and must include CHROM, POS (or, alternatively, FROM and TO),
338 and optionally REF and ALT. Unused columns which should be ignored
339 can be indicated by "-".
340
341 If the annotation file is a VCF/BCF, only the edited columns/tags
342 must be present and their order does not matter. The columns ID,
343 QUAL, FILTER, INFO and FORMAT can be edited, where INFO tags can be
344 written both as "INFO/TAG" or simply "TAG", and FORMAT tags can be
345 written as "FORMAT/TAG" or "FMT/TAG". The imported VCF annotations
346 can be renamed as "DST_TAG:=SRC_TAG" or "FMT/DST_TAG:=FMT/SRC_TAG".
347
348 To carry over all INFO annotations, use "INFO". To add all INFO
349 annotations except "TAG", use "^INFO/TAG". By default, existing
350 values are replaced.
351
352 By default, existing tags are overwritten unless the source value
353 is a missing value (i.e. "."). If also missing values should be
354 carried over (and overwrite existing tags), use ".TAG" instead of
355 "TAG". To add annotations without overwriting existing values (that
356 is, to add tags that are absent or to add values to existing tags
357 with missing values), use "+TAG" instead of "TAG". These can be
358 combined, for example ".+TAG" can be used to add TAG even if the
359 source value is missing but only if TAG does not exist in the
360 target file; existing tags will not be overwritten. To append to
361 existing values (rather than replacing or leaving untouched), use
362 "=TAG" (instead of "TAG" or "+TAG"). To replace only existing
363 values without modifying missing annotations, use "-TAG". To match
364 the record also by ID or INFO/END, in addition to REF and ALT, use
365 "~ID" or "~INFO/END". If position needs to be replaced, mark the
366 column with the new position as "~POS".
367
368 If the annotation file is not a VCF/BCF, all new annotations must
369 be defined via -h, --header-lines.
370
371 See also the -l, --merge-logic option.
372
373 -C, --columns-file file
374 Read the list of columns from a file (normally given via the -c,
375 --columns option). "-" to skip a column of the annotation file. One
376 column name per row, an additional space- or tab-separated field
377 can be present to indicate the merge logic (normally given via the
378 -l, --merge-logic option). This is useful when many annotations are
379 added at once.
380
381 -e, --exclude EXPRESSION
382 exclude sites for which EXPRESSION is true. For valid expressions
383 see EXPRESSIONS.
384
385 --force
386 continue even when parsing errors, such as undefined tags, are
387 encountered. Note this can be an unsafe operation and can result in
388 corrupted BCF files. If this option is used, make sure to sanity
389 check the result thoroughly.
390
391 -h, --header-lines file
392 Lines to append to the VCF header, see also -c, --columns and -a,
393 --annotations. For example:
394
395 ##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
396 ##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
397
398 -I, --set-id [+]FORMAT
399 assign ID on the fly. The format is the same as in the query
400 command (see below). By default all existing IDs are replaced. If
401 the format string is preceded by "+", only missing IDs will be set.
402 For example, one can use
403
404 bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf
405
406 -i, --include EXPRESSION
407 include only sites for which EXPRESSION is true. For valid
408 expressions see EXPRESSIONS.
409
410 -k, --keep-sites
411 keep sites which do not pass -i and -e expressions instead of
412 discarding them
413
414 -l, --merge-logic
415 tag:first|append|append-missing|unique|sum|avg|min|max[,...]
When multiple regions overlap a single record, this option defines
how to treat multiple annotation values when setting tag in the
destination file: use the first encountered value ignoring the rest
(first); append allowing duplicates (append); append even if the
appended value is missing, i.e. is a dot (append-missing); append
discarding duplicate values (unique); sum the values (sum, numeric
fields only); average the values (avg); use the minimum value (min)
or the maximum (max).

Note that this option is intended for use with BED or tab-delimited
annotation files only. Moreover, it is effective only when either
REF and ALT or BEG and END --columns are present.

Multiple rules can be given either as a comma-separated list or by
giving the option multiple times. This is an experimental feature.
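As a rough illustration of what the individual rules do to a set of overlapping values, the sketch below applies them to a list of numbers. The helper merge_values is hypothetical and not part of bcftools; it only mimics the described arithmetic, omits append-missing, and says nothing about how bcftools formats its output.

```shell
# merge_values RULE VALUE...   (hypothetical helper, not part of bcftools)
# Combines the given values according to one merge rule.
merge_values() {
    mv_rule=$1; shift
    printf '%s\n' "$@" | awk -v rule="$mv_rule" '
        { v[NR] = $1 }                       # collect values in input order
        END {
            n = NR
            if (rule == "first") { print v[1]; exit }
            if (rule == "append" || rule == "unique") {
                out = ""
                for (i = 1; i <= n; i++) {
                    # unique: skip values that were already emitted
                    if (rule == "unique" && (v[i] in seen)) continue
                    seen[v[i]] = 1
                    out = (out == "" ? v[i] : out "," v[i])
                }
                print out; exit
            }
            # numeric rules
            sum = 0; min = v[1] + 0; max = v[1] + 0
            for (i = 1; i <= n; i++) {
                sum += v[i]
                if (v[i] + 0 < min) min = v[i] + 0
                if (v[i] + 0 > max) max = v[i] + 0
            }
            if (rule == "sum") print sum
            else if (rule == "avg") print sum / n
            else if (rule == "min") print min
            else if (rule == "max") print max
        }'
}

# Four annotation values overlapping one record: 2, 4, 6, 4
for r in first append unique sum avg min max; do
    printf '%s\t%s\n' "$r" "$(merge_values "$r" 2 4 6 4)"
done
```

In an actual run the rule is attached to a tag name, as in the option synopsis (tag:rule), with one rule per tag.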
429
430 -m, --mark-sites TAG
431 annotate sites which are present ("+") or absent ("-") in the -a
432 file with a new INFO/TAG flag
433
434 --min-overlap ANN:'VCF'
435 minimum overlap required as a fraction of the variant in the
436 annotation -a file (ANN), in the target VCF file (:VCF), or both
437 for reciprocal overlap (ANN:VCF). By default overlaps of arbitrary
438 length are sufficient. The option can be used only with the
439 tab-delimited annotation -a file and with BEG and END columns
440 present.
441
442 --no-version
443 see Common Options
444
445 -o, --output FILE
446 see Common Options
447
448 -O, --output-type b|u|z|v[0-9]
449 see Common Options
450
451 --pair-logic snps|indels|both|all|some|exact
Controls how to match records from the annotation file to the
target VCF. Effective only when -a is a VCF or BCF. The option
replaces the former unintuitive --collapse. See Common Options for
more.
456
457 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
458 see Common Options
459
460 -R, --regions-file file
461 see Common Options
462
463 --regions-overlap 0|1|2
464 see Common Options
465
466 --rename-annots file
rename annotations according to the map in file, with "old_name
new_name\n" pairs separated by whitespace, each on a separate
line. The old name must be prefixed with the annotation type: INFO,
FORMAT, or FILTER.
471
472 --rename-chrs file
rename chromosomes according to the map in file, with "old_name
new_name\n" pairs separated by whitespace, each on a separate
line.
476
477 -s, --samples [^]LIST
478 subset of samples to annotate, see also Common Options
479
480 -S, --samples-file FILE
subset of samples to annotate. If the samples are named differently
in the target VCF and the -a, --annotations VCF, the name mapping
can be given as "src_name dst_name\n", separated by whitespace,
each pair on a separate line.
485
486 --single-overlaps
487 use this option to keep memory requirements low with very large
488 annotation files. Note, however, that this comes at a cost, only
489 single overlapping intervals are considered in this mode. This was
490 the default mode until the commit af6f0c9 (Feb 24 2019).
491
492 --threads INT
493 see Common Options
494
495 -x, --remove list
496 List of annotations to remove. Use "FILTER" to remove all filters
497 or "FILTER/SomeFilter" to remove a specific filter. Similarly,
498 "INFO" can be used to remove all INFO tags and "FORMAT" to remove
499 all FORMAT tags except GT. To remove all INFO tags except "FOO" and
500 "BAR", use "^INFO/FOO,INFO/BAR" (and similarly for FORMAT and
501 FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".
502
503 Examples:
504
505 # Remove three fields
506 bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz
507
508 # Remove all INFO fields and all FORMAT fields except for GT and PL
509 bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf
510
511 # Add ID, QUAL and INFO/TAG, not replacing TAG if already present
512 bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf
513
514 # Carry over all INFO and FORMAT annotations except FORMAT/GT
515 bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
516
517 # Annotate from a tab-delimited file with six columns (the fifth is ignored),
518 # first indexing with tabix. The coordinates are 1-based.
519 tabix -s1 -b2 -e2 annots.tab.gz
520 bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf
521
522 # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
523 tabix -s1 -b2 -e3 annots.tab.gz
524 bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
525
526 # Annotate from a bed file (0-based coordinates, half-closed, half-open intervals)
527 bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
528
529 # Transfer the INFO/END tag, matching by POS,REF,ALT and ID. This example assumes
530 # that INFO/END is already present in the VCF header.
531 bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,INFO/END input.vcf
532
533 # For more examples see http://samtools.github.io/bcftools/howtos/annotate.html
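The tab-delimited examples above assume that annots.tab.gz and annots.hdr already exist. The sketch below shows one way those two inputs might be prepared, reusing the STRING_TAG and NUMERIC_TAG sample from the -a, --annotations description. The scratch directory and input.vcf are assumptions, and the compression, indexing and annotation steps run only if bgzip, tabix, bcftools and input.vcf are all present.

```shell
# Hypothetical scratch directory; file names are illustrative.
anndir=$(mktemp -d)

# Tab-delimited annotations: CHROM, POS (1-based), STRING_TAG, NUMERIC_TAG.
printf '1\t752566\tSomeString\t5\n1\t798959\tSomeOtherString\t6\n' \
    > "$anndir/annots.tab"

# Header lines defining the two new INFO tags, as required by -h when
# the annotation file is not a VCF/BCF.
cat > "$anndir/annots.hdr" <<'EOF'
##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Example string annotation">
##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example numeric annotation">
EOF

# Compress, index and annotate only when the tools and an input VCF
# (assumed to be ./input.vcf) are available.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1 &&
   command -v bcftools >/dev/null 2>&1 && [ -f input.vcf ]; then
    bgzip -c "$anndir/annots.tab" > "$anndir/annots.tab.gz"
    tabix -s1 -b2 -e2 "$anndir/annots.tab.gz"
    bcftools annotate -a "$anndir/annots.tab.gz" -h "$anndir/annots.hdr" \
        -c CHROM,POS,STRING_TAG,NUMERIC_TAG input.vcf
fi
```

The order of names given to -c must follow the column order of the annotation file, and each tag named there needs a matching line in the header file.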
534
535 bcftools call [OPTIONS] FILE
536 This command replaces the former bcftools view caller. Some of the
537 original functionality has been temporarily lost in the process of
538 transition under htslib <http://github.com/samtools/htslib>, but will
539 be added back on popular demand. The original calling model can be
540 invoked with the -c option.
541
542 File format options:
543 --no-version
544 see Common Options
545
546 -o, --output FILE
547 see Common Options
548
549 -O, --output-type b|u|z|v[0-9]
550 see Common Options
551
552 --ploidy ASSEMBLY[?]
553 predefined ploidy, use list (or any other unused word) to print a
554 list of all predefined assemblies. Append a question mark to print
555 the actual definition. See also --ploidy-file.
556
557 --ploidy-file FILE
558 ploidy definition given as a space/tab-delimited list of CHROM,
559 FROM, TO, SEX, PLOIDY. The SEX codes are arbitrary and correspond
560 to the ones used by --samples-file. The default ploidy can be given
561 using the starred records (see below), unlisted regions have ploidy
562 2. The default ploidy definition is
563
564 X 1 60000 M 1
565 X 2699521 154931043 M 1
566 Y 1 59373566 M 1
567 Y 1 59373566 F 0
568 MT 1 16569 M 1
569 MT 1 16569 F 1
570 * * * M 2
571 * * * F 2
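To make the default definition above easier to read, the sketch below looks up the ploidy for a given chromosome, position and sex. The helper ploidy_of is hypothetical and only reflects one plausible reading of the table: a listed region containing the position wins, otherwise the starred record for that sex applies, otherwise ploidy 2. The precedence rules bcftools applies internally may differ.

```shell
# Write the default ploidy definition to a scratch file.
ploidy_file=$(mktemp)
cat > "$ploidy_file" <<'EOF'
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
* * * M 2
* * * F 2
EOF

# ploidy_of FILE CHROM POS SEX   (hypothetical helper, not part of bcftools)
ploidy_of() {
    awk -v c="$2" -v p="$3" -v s="$4" '
        # starred record: default ploidy for this sex
        $1 == "*" && $4 == s { dflt = $5; next }
        # listed region (1-based, inclusive) containing the position
        $1 == c && $4 == s && p >= $2 && p <= $3 { hit = $5 }
        END {
            if (hit != "")       print hit
            else if (dflt != "") print dflt
            else                 print 2     # unlisted regions have ploidy 2
        }' "$1"
}

ploidy_of "$ploidy_file" X 3000000 M   # inside a listed male X region
ploidy_of "$ploidy_file" X 100000 M    # between the two listed X regions
ploidy_of "$ploidy_file" Y 1000 F      # Y in a female sample
```

Under these assumptions the three lookups print 1, 2 and 0 respectively. The gap between the two X records is left to the starred default, so male samples are treated as diploid there.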
572
573 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
574 see Common Options
575
576 -R, --regions-file file
577 see Common Options
578
579 --regions-overlap 0|1|2
580 see Common Options
581
582 -s, --samples LIST
583 see Common Options
584
585 -S, --samples-file FILE
586 see Common Options
587
588 -t, --targets LIST
589 see Common Options
590
591 -T, --targets-file FILE
592 see Common Options
593
594 --targets-overlap 0|1|2
595 see Common Options
596
597 --threads INT
598 see Common Options
599
600 Input/output options:
601 -A, --keep-alts
602 output all alternate alleles present in the alignments even if they
603 do not appear in any of the genotypes
604
605 -f, --format-fields list
comma-separated list of FORMAT fields to output for each sample.
Currently GQ and GP fields are supported. For convenience, the
fields can be given as lower case letters. A "^" prefix requests
removal of auxiliary tags that are useful only for calling.
611
612 -F, --prior-freqs AN,AC
613 take advantage of prior knowledge of population allele frequencies.
614 The workflow looks like this:
615
# Extract AN,AC values from an existing VCF, such as 1000Genomes
617 bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz
618
619 # If the tags AN,AC are not already present, use the +fill-tags plugin
620 bcftools +fill-tags 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz
621 tabix -s1 -b2 -e2 AFs.tab.gz
622
623 # Create a VCF header description, here we name the tags REF_AN,REF_AC
624 cat AFs.hdr
625 ##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes">
626 ##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele">
627
628 # Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies
629 bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]
630
631 -G, --group-samples FILE|-
by default, all samples are assumed to come from a single
population. This option allows samples to be grouped into
populations, with the HWE assumption applied within but not across
the populations.
635 FILE is a tab-delimited text file with sample names in the first
636 column and group names in the second column. If - is given instead,
637 no HWE assumption is made at all and single-sample calling is
638 performed. (Note that in low coverage data this inflates the rate
639 of false positives.) The -G option requires the presence of
640 per-sample FORMAT/QS or FORMAT/AD tag generated with bcftools
641 mpileup -a QS (or -a AD).
642
643 -g, --gvcf INT
644 output also gVCF blocks of homozygous REF calls. The parameter INT
645 is the minimum per-sample depth required to include a site in the
646 non-variant block.
647
648 -i, --insert-missed INT
649 output also sites missed by mpileup but present in -T,
650 --targets-file.
651
652 -M, --keep-masked-ref
653 output sites where REF allele is N
654
655 -V, --skip-variants snps|indels
656 skip indel/SNP sites
657
658 -v, --variants-only
659 output variant sites only
660
661 Consensus/variant calling options:
662 -c, --consensus-caller
663 the original samtools/bcftools calling method (conflicts with -m)
664
665 -C, --constrain alleles|trio
666
667 alleles
668 call genotypes given alleles. See also -T, --targets-file.
669
670 trio
671 call genotypes given the father-mother-child constraint. See
672 also -s, --samples and -n, --novel-rate.
673
674 -m, --multiallelic-caller
675 alternative model for multiallelic and rare-variant calling
676 designed to overcome known limitations in -c calling model
677 (conflicts with -c)
678
679 -n, --novel-rate float[,...]
680 likelihood of novel mutation for constrained -C trio calling. The
681 trio genotype calling maximizes likelihood of a particular
682 combination of genotypes for father, mother and the child
683 P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) * (1-Pn).
684 By providing three values, the mutation rate Pn is set explicitly
685 for SNPs, deletions and insertions, respectively. If two values are
686 given, the first is interpreted as the mutation rate of SNPs and
687 the second is used to calculate the mutation rate of indels
688 according to their length as Pn=float*exp(-a-b*len), where
689 a=22.8689, b=0.2994 for insertions and a=21.9313, b=0.2856 for
690 deletions [pubmed:23975140]. If only one value is given, the same
691 mutation rate Pn is used for SNPs and indels.
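The length-dependent indel rate used in the two-value form can be evaluated directly from the formula Pn=float*exp(-a-b*len). The sketch below does this with awk. The helper name indel_rate and the input value 1e-8 are illustrative assumptions, not bcftools defaults.

```shell
# indel_rate FLOAT TYPE LEN   (hypothetical helper, not part of bcftools)
# Evaluates Pn = FLOAT * exp(-a - b*LEN) with the constants quoted above.
# TYPE is "ins" for insertions; anything else is treated as a deletion.
indel_rate() {
    awk -v f="$1" -v type="$2" -v len="$3" 'BEGIN {
        if (type == "ins") { a = 22.8689; b = 0.2994 }   # insertions
        else               { a = 21.9313; b = 0.2856 }   # deletions
        printf "%.6g\n", f * exp(-a - b * len)
    }'
}

# Arbitrary example value for the second number given to -n.
indel_rate 1e-8 ins 1
indel_rate 1e-8 ins 5
indel_rate 1e-8 del 5
```

Because len enters through a negative exponent, longer indels always receive a smaller novel-mutation rate than shorter ones of the same type.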
692
693 -p, --pval-threshold float
694 with -c, accept variant if P(ref|D) < float.
695
696 -P, --prior float
697 expected substitution rate, or 0 to disable the prior. Only with
698 -m.
699
700 -t, --targets file|chr|chr:pos|chr:from-to|chr:from-[,...]
701 see Common Options
702
703 -X, --chromosome-X
704 haploid output for male samples (requires PED file with -s)
705
706 -Y, --chromosome-Y
707 haploid output for males and skips females (requires PED file with
708 -s)
709
710 bcftools cnv [OPTIONS] FILE
Copy number variation caller, requires a VCF annotated with
Illumina's B-allele frequency (BAF) and Log R Ratio intensity (LRR)
values. The HMM considers the following copy number states: CN 2
(normal), 1 (single-copy loss), 0 (complete loss), 3 (single-copy
gain).
716
717 General Options:
718 -c, --control-sample string
719 optional control sample name. If given, pairwise calling is
720 performed and the -P option can be used
721
722 -f, --AF-file file
723 read allele frequencies from a tab-delimited file with the columns
724 CHR,POS,REF,ALT,AF
725
726 -o, --output-dir path
727 output directory
728
729 -p, --plot-threshold float
730 call matplotlib to produce plots for chromosomes with quality at
731 least float, useful for visual inspection of the calls. With -p 0,
732 plots for all chromosomes will be generated. If not given, a
733 matplotlib script will be created but not called.
734
735 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
736 see Common Options
737
738 -R, --regions-file file
739 see Common Options
740
741 --regions-overlap 0|1|2
742 see Common Options
743
744 -s, --query-sample string
745 query sample name
746
747 -t, --targets LIST
748 see Common Options
749
750 -T, --targets-file FILE
751 see Common Options
752
753 --targets-overlap 0|1|2
754 see Common Options
755
756 HMM Options:
757 -a, --aberrant float[,float]
758 fraction of aberrant cells in query and control. The hallmark of
759 duplications and contaminations is the BAF value of heterozygous
760 markers which is dependent on the fraction of aberrant cells.
761 Sensitivity to smaller fractions of cells can be increased by
762 setting -a to a lower value. Note however, that this comes at the
763 cost of increased false discovery rate.
764
765 -b, --BAF-weight float
766 relative contribution from BAF
767
768 -d, --BAF-dev float[,float]
769 expected BAF deviation in query and control, i.e. the noise
770 observed in the data.
771
772 -e, --err-prob float
773 uniform error probability
774
775 -l, --LRR-weight float
relative contribution from LRR. With noisy data, this option can
have a big effect on the number of calls produced. In truly random
778 noise (such as in simulated data), the value should be set high
779 (1.0), but in the presence of systematic noise when LRR are not
780 informative, lower values result in cleaner calls (0.2).
781
782 -L, --LRR-smooth-win int
783 reduce LRR noise by applying moving average given this window size
784
785 -O, --optimize float
786 iteratively estimate the fraction of aberrant cells, down to the
787 given fraction. Lowering this value from the default 1.0 to, say,
788 0.3 can help discover more events but also increases noise.
789
790 -P, --same-prob float
791 the prior probability of the query and the control sample being the
792 same. Setting to 0 calls both independently, setting to 1 forces
793 the same copy number state in both.
794
795 -x, --xy-prob float
796 the HMM probability of transition to another copy number state.
797 Increasing this value leads to smaller and more frequent calls.
798
799 bcftools concat [OPTIONS] FILE1 FILE2 [...]
800 Concatenate or combine VCF/BCF files. All source files must have the
801 same sample columns appearing in the same order. Can be used, for
802 example, to concatenate chromosome VCFs into one VCF, or combine a SNP
803 VCF and an indel VCF into one. The input files must be sorted by chr
804 and position. The files must be given in the correct order to produce
805 sorted VCF on output unless the -a, --allow-overlaps option is
806 specified. With the --naive option, the files are concatenated without
807 being recompressed, which is very fast.
808
809 -a, --allow-overlaps
810 First coordinate of the next file can precede last record of the
811 current file.
812
813 -c, --compact-PS
814 Do not output PS tag at each site, only at the start of a new phase
815 set block.
816
817 -d, --rm-dups snps|indels|both|all|exact
818 Output duplicate records of specified type present in multiple
819 files only once. Note that records duplicated within one file are
820 not removed with this option, for that use bcftools norm -d
821 instead.
822 In other words, the default behavior of the program is similar to
823 unix "cat" in that when two files contain a record with the same
824 position, that position will appear twice on output. With -d, every
825 line that finds a matching record in another file will be printed
826 only once.
827 Requires -a, --allow-overlaps.
828
829 -D, --remove-duplicates
830 Alias for -d exact
831
832 -f, --file-list FILE
833 Read file names from FILE, one file name per line.
834
835 -l, --ligate
836 Ligate phased VCFs by matching phase at overlapping haplotypes.
837 Note that the option is intended for VCFs with perfect overlap,
838 sites in overlapping regions present in one but missing in the
839 other are dropped.
840
841 --ligate-force
842 Keep all sites and ligate even non-overlapping chunks and chunks
843 with imperfect overlap
844
845 --ligate-warn
846 Drop sites in imperfect overlaps
847
848 --no-version
849 see Common Options
850
851 -n, --naive
852 Concatenate VCF or BCF files without recompression. This is very
853 fast but requires that all files are of the same type (all VCF or
854 all BCF) and have the same headers. This is because all tags and
855 chromosome names in the BCF body rely on the order of the contig
856 and tag definitions in the header. A header compatibility check is
857 performed and the program throws an error if it is not safe to use
858 the option.
859
860 --naive-force
861 Same as --naive, but header compatibility is not checked.
862 Dangerous, use with caution.
863
864 -o, --output FILE
865 see Common Options
866
867 -O, --output-type b|u|z|v[0-9]
868 see Common Options
869
870 -q, --min-PQ INT
871 Break phase set if phasing quality is lower than INT
872
873 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
874 see Common Options. Requires -a, --allow-overlaps.
875
876 -R, --regions-file FILE
877 see Common Options. Requires -a, --allow-overlaps.
878
879 --regions-overlap 0|1|2
880 see Common Options
881
882 --threads INT
883 see Common Options
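Example (a minimal sketch, not taken from the upstream manual): the commands below create two tiny single-chromosome VCFs with the hypothetical names chr1.vcf and chr2.vcf, list them in files.txt in the order they should appear on output, and concatenate them with -f. The bcftools call is skipped when bcftools is not installed; all file names and the temporary directory are assumptions made for illustration.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)

# Hypothetical inputs: two minimal VCFs with identical headers and sample columns.
for chr in 1 2; do
    printf '##fileformat=VCFv4.2\n##contig=<ID=1>\n##contig=<ID=2>\n##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n%s\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\n' "$chr" > "$workdir/chr$chr.vcf"
done

# One file name per line, in the order needed to produce sorted output.
printf '%s\n' "$workdir/chr1.vcf" "$workdir/chr2.vcf" > "$workdir/files.txt"

# Concatenate in the listed order and write uncompressed VCF.
if command -v bcftools >/dev/null 2>&1; then
    bcftools concat -f "$workdir/files.txt" -Ov -o "$workdir/all.vcf"
fi
```

Listing the files explicitly keeps the output order under the user's control, which matters because the inputs must be given in the correct order unless -a is used.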
884
885 bcftools consensus [OPTIONS] FILE
886 Create consensus sequence by applying VCF variants to a reference fasta
887 file. By default, the program will apply all ALT variants to the
888 reference fasta to obtain the consensus sequence. Using the --sample
889 (and, optionally, --haplotype) option will apply genotype (haplotype)
890 calls from FORMAT/GT. Note that the program does not act as a primitive
891 variant caller and ignores allelic depth information, such as INFO/AD
892 or FORMAT/AD. For that, consider using the setGT plugin.
893
894 -a, --absent CHAR
895 replace positions absent from VCF with CHAR
896
897 -c, --chain FILE
898 write a chain file for liftover
899
900 -e, --exclude EXPRESSION
901 exclude sites for which EXPRESSION is true. For valid expressions
902 see EXPRESSIONS.
903
904 -f, --fasta-ref FILE
905 reference sequence in fasta format
906
907 -H, --haplotype 1|2|R|A|I|LR|LA|SR|SA|1pIu|2pIu
908 choose which allele from the FORMAT/GT field to use (the codes are
909 case-insensitive):
910
911 1
912 the first allele, regardless of phasing
913
914 2
915 the second allele, regardless of phasing
916
917 R
918 the REF allele (in heterozygous genotypes)
919
920 A
921 the ALT allele (in heterozygous genotypes)
922
923 I
924 IUPAC code for all genotypes
925
926 LR, LA
927 the longer allele. If both have the same length, use the REF
928 allele (LR), or the ALT allele (LA)
929
930 SR, SA
931 the shorter allele. If both have the same length, use the REF
932 allele (SR), or the ALT allele (SA)
933
934 1pIu, 2pIu
935 first/second allele for phased genotypes and IUPAC code for
936 unphased genotypes
937
938 This option requires -s, unless exactly one sample is present in the VCF.
939
940 -i, --include EXPRESSION
941 include only sites for which EXPRESSION is true. For valid
942 expressions see EXPRESSIONS.
943
944 -I, --iupac-codes
945 output variants in the form of IUPAC ambiguity codes
946
947 --mark-del CHAR
948 instead of removing sequence, insert CHAR for deletions
949
950 --mark-ins uc|lc
951 highlight inserted sequence in uppercase (uc) or lowercase (lc),
952 leaving the rest of the sequence as is
953
954 --mark-snv uc|lc
955 highlight substitutions in uppercase (uc) or lowercase (lc),
956 leaving the rest of the sequence as is
957
958 -m, --mask FILE
959 BED file or TAB file with regions to be replaced with N (the
960 default) or as specified by the next --mask-with option. See
961 discussion of --regions-file in Common Options for file format
962 details.
963
964 --mask-with CHAR|lc|uc
965 replace sequence from --mask with CHAR, skipping overlapping
966 variants, or change to lowercase (lc) or uppercase (uc)
967
968 -M, --missing CHAR
969 instead of skipping the missing genotypes, output the character
970 CHAR (e.g. "?")
971
972 -o, --output FILE
973 write output to a file
974
975 -s, --sample NAME
976 apply variants of the given sample
977
978 Examples:
979
980 # Apply variants present in sample "NA001", output IUPAC codes for hets
981 bcftools consensus -I -s NA001 -f in.fa in.vcf.gz > out.fa
982
983 # Create consensus for one region. The fasta header lines are then expected
984 # in the form ">chr:from-to".
985 samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa
986
987 bcftools convert [OPTIONS] FILE
988 VCF input options:
989 -e, --exclude EXPRESSION
990 exclude sites for which EXPRESSION is true. For valid expressions
991 see EXPRESSIONS.
992
993 -i, --include EXPRESSION
994 include only sites for which EXPRESSION is true. For valid
995 expressions see EXPRESSIONS.
996
997 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
998 see Common Options
999
1000 -R, --regions-file FILE
1001 see Common Options
1002
1003 --regions-overlap 0|1|2
1004 see Common Options
1005
1006 -s, --samples LIST
1007 see Common Options
1008
1009 -S, --samples-file FILE
1010 see Common Options
1011
1012 -t, --targets LIST
1013 see Common Options
1014
1015 -T, --targets-file FILE
1016 see Common Options
1017
1018 --targets-overlap 0|1|2
1019 see Common Options
1020
1021 VCF output options:
1022 --no-version
1023 see Common Options
1024
1025 -o, --output FILE
1026 see Common Options
1027
1028 -O, --output-type b|u|z|v[0-9]
1029 see Common Options
1030
1031 --threads INT
1032 see Common Options
1033
1034 GEN/SAMPLE conversion:
1035 -G, --gensample2vcf prefix or gen-file,sample-file
1036 convert IMPUTE2 output to VCF. One of the ID columns ("SNP ID" or
1037 "rsID" in https://www.cog-genomics.org/plink/2.0/formats#gen) must
1038 be of the form "CHROM:POS_REF_ALT" to detect possible strand swaps.
1039 When the --vcf-ids option is given, the other column (autodetected)
1040 is used to fill the ID column of the VCF.
1041 See also -g and --3N6 options.
1042
1043 -g, --gensample prefix or gen-file,sample-file
1044 convert from VCF to gen/sample format used by IMPUTE2 and SHAPEIT.
1045 The columns of .gen file format are ID1,ID2,POS,A,B followed by
1046 three genotype probabilities P(AA), P(AB), P(BB) for each sample.
1047 In order to prevent strand swaps, the program uses IDs of the form
1048 "CHROM:POS_REF_ALT". When the --vcf-ids option is given, the second
1049 column is set to match the ID column of the VCF.
1050 See also -G and --3N6 options.
1051 The .gen and .sample file formats are:
1052
1053 .gen (with --3N6 --vcf-ids)
1054 ---------------------------
1055 chr1 1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1 0
1056 chr1 1:111494194_C_T rsID2 111494194 C T 0 1 0 0 0 1
1057
1058 .gen (with --vcf-ids)
1059 ---------------------------
1060 1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1 0
1061 1:111494194_C_T rsID2 111494194 C T 0 1 0 0 0 1
1062
1063 .gen (the default)
1064 ------------------------------
1065 1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0
1066 1:111494194_C_T 1:111494194_C_T 111494194 C T 0 1 0 0 0 1
1067
1068 .sample
1069 -------
1070 ID_1 ID_2 missing
1071 0 0 0
1072 sample1 sample1 0
1073 sample2 sample2 0
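The "CHROM:POS_REF_ALT" IDs shown above can be reproduced from VCF columns with a short awk one-liner. This is only an illustrative sketch: the function name make_gen_id is hypothetical, and bcftools builds these IDs itself during conversion.

```shell
# Build "CHROM:POS_REF_ALT" from tab-delimited VCF data lines read on stdin.
# Header lines (starting with "#") are skipped.
make_gen_id() {
    awk -F '\t' '!/^#/ { print $1 ":" $2 "_" $4 "_" $5 }'
}

# A single VCF-like record: CHROM, POS, ID, REF, ALT.
printf '1\t111485207\trsID1\tG\tA\n' | make_gen_id
```

Because REF and ALT are part of the ID, a downstream tool that swaps the two alleles can be detected by comparing the ID with the allele columns.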
1074
1075 --3N6
1076 Expect/Create files in the 3*N+6 column format. This is the new
1077 .gen file format with the first column containing the chromosome
1078 name, see https://www.cog-genomics.org/plink/2.0/formats#gen
1079
1080 --tag STRING
1081 tag to take values for .gen file: GT,PL,GL,GP
1082
1083 --sex FILE
1084 output sex column in the sample file. The FILE format is
1085
1086 MaleSample M
1087 FemaleSample F
1088
1089 --vcf-ids
1090 output VCF IDs in the second column instead of CHROM:POS_REF_ALT
1091
1092 gVCF conversion:
1093 --gvcf2vcf
1094 convert gVCF to VCF, expanding REF blocks into sites. Note that the
1095 -i and -e options work differently with this switch. In this
1096 situation the filtering expressions define which sites should be
1097 expanded and which sites should be left unmodified, but all sites
1098 are printed on output. In order to drop sites, stream first through
1099 bcftools view.
1100
1101 -f, --fasta-ref file
1102 reference sequence in fasta format. Must be indexed with samtools
1103 faidx
1104
1105 HAP/SAMPLE conversion:
1106 --hapsample2vcf prefix or hap-file,sample-file
1107 convert from hap/sample format to VCF. The columns of .hap file are
1108 similar to .gen file above, but there are only two haplotype
1109 columns per sample. Note that the first or the second column of the
1110 .hap file is expected to be in the form "CHR:POS_REF_ALT[_END]",
1111 with the _END being optional for defining the INFO/END tag when ALT
1112 is a symbolic allele. For example:
1113
1114 .hap (with --vcf-ids)
1115 ---------------------
1116 1:111485207_G_A rsID1 111485207 G A 0 1 0 0
1117 1:111495231_A_<DEL>_111495784 rsID3 111495231 A <DEL> 0 0 1 0
1118
1119 .hap (the default)
1120 ------------------
1121 1 1:111485207_G_A 111485207 G A 0 1 0 0
1122 1 1:111495231_A_<DEL>_111495784 111495231 A <DEL> 0 0 1 0
1123
1124 --hapsample prefix or hap-file,sample-file
1125 convert from VCF to hap/sample format used by IMPUTE2 and SHAPEIT.
1126 The columns of .hap file begin with ID,RSID,POS,REF,ALT. In order
1127 to prevent strand swaps, the program uses IDs of the form
1128 "CHROM:POS_REF_ALT".
1129
1130 --haploid2diploid
1131 with -h option converts haploid genotypes to homozygous diploid
1132 genotypes. For example, the program will print 0 0 instead of the
1133 default 0 -. This is useful for programs which do not handle
1134 haploid genotypes correctly.
1135
1136 --sex FILE
1137 output sex column in the sample file. The FILE format is
1138
1139 MaleSample M
1140 FemaleSample F
1141
1142 --vcf-ids
1143 the second column of the .hap file holds the VCF ids, the first
1144 column is of the form "CHR:POS_REF_ALT[_END]". Without the option,
1145 the format follows
1146 https://www.cog-genomics.org/plink/2.0/formats#haps with ids (the
1147 second column) of the form "CHR:POS_REF_ALT[_END]"
1148
1149 HAP/LEGEND/SAMPLE conversion:
1150 -H, --haplegendsample2vcf prefix or hap-file,legend-file,sample-file
1151 convert from hap/legend/sample format used by IMPUTE2 to VCF. See
1152 also -h, --haplegendsample below.
1153
1154 -h, --haplegendsample prefix or hap-file,legend-file,sample-file
1155 convert from VCF to hap/legend/sample format used by IMPUTE2 and
1156 SHAPEIT. The columns of the .legend file are ID,POS,REF,ALT. In order to
1157 prevent strand swaps, the program uses IDs of the form
1158 "CHROM:POS_REF_ALT". The .sample file is quite basic at the moment
1159 with columns for population, group and sex expected to be edited by
1160 the user. For example:
1161
1162 .hap
1163 -----
1164 0 1 0 0 1 0
1165 0 1 0 0 0 1
1166
1167 .legend
1168 -------
1169 id position a0 a1
1170 1:111485207_G_A 111485207 G A
1171 1:111494194_C_T 111494194 C T
1172
1173 .sample
1174 -------
1175 sample population group sex
1176 sample1 sample1 sample1 2
1177 sample2 sample2 sample2 2
1178
1179 --haploid2diploid
1180 with -h option converts haploid genotypes to homozygous diploid
1181 genotypes. For example, the program will print 0 0 instead of the
1182 default 0 -. This is useful for programs which do not handle
1183 haploid genotypes correctly.
1184
1185 --sex FILE
1186 output sex column in the sample file. The FILE format is
1187
1188 MaleSample M
1189 FemaleSample F
1190
1191 --vcf-ids
1192 output VCF IDs instead of "CHROM:POS_REF_ALT". Note that this
1193 option can be used with --haplegendsample but not with
1194 --haplegendsample2vcf.
1195
1196 TSV conversion:
1197 --tsv2vcf file
1198 convert from TSV (tab-separated values) format (such as generated
1199 by 23andMe) to VCF. The input file fields can be tab- or space-
1200 delimited
1201
1202 -c, --columns list
1203 comma-separated list of fields in the input file. In the current
1204 version, the fields CHROM, POS, ID, and AA are expected and can
1205 appear in arbitrary order, columns which should be ignored in the
1206 input file can be indicated by "-". The AA field lists alleles on
1207 the forward reference strand, for example "CC" or "CT" for diploid
1208 genotypes or "C" for haploid genotypes (sex chromosomes).
1209 Insertions and deletions are not supported yet, missing data can be
1210 indicated with "--".
1211
1212 -f, --fasta-ref file
1213 reference sequence in fasta format. Must be indexed with samtools
1214 faidx
1215
1216 -s, --samples LIST
1217 list of sample names. See Common Options
1218
1219 -S, --samples-file FILE
1220 file of sample names. See Common Options
1221
1222 Example:
1223
1224 # Convert 23andme results into VCF
1225 bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -Oz -o out.vcf.gz
1226
1227 bcftools csq [OPTIONS] FILE
1228 Haplotype aware consequence predictor which correctly handles combined
1229 variants such as MNPs split over multiple VCF records, SNPs separated
1230 by an intron (but adjacent in the spliced transcript) or nearby
1231 frame-shifting indels which in combination in fact are not
1232 frame-shifting.
1233
1234 The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tags
1235 (configurable with the -c option). The latter is a bitmask of indexes
1236 to INFO/BCSQ, with interleaved haplotypes. See the usage examples below
1237 for using the %TBCSQ converter in query for extracting a more human
1238 readable form from this bitmask. The construction of the bitmask limits
1239 the number of consequences that can be referenced per sample in the
1240 FORMAT/BCSQ tags. By default this is 15, but if more are required, see
1241 the --ncsq option.
1242
1243 The program requires on input a VCF/BCF file, the reference genome in
1244 fasta format (--fasta-ref) and genomic features in the GFF3 format
1245 downloadable from the Ensembl website (--gff-annot), and outputs an
1246 annotated VCF/BCF file. Currently, only Ensembl GFF3 files are
1247 supported.
1248
1249 By default, the input VCF should be phased. If phase is unknown, or
1250 only partially known, the --phase option can be used to indicate how to
1251 handle unphased data. Alternatively, haplotype aware calling can be
1252 turned off with the --local-csq option.
1253
1254 If conflicting (overlapping) variants within one haplotype are
1255 detected, a warning will be emitted and predictions will be based on
1256 only the first variant in the analysis.
1257
1258 Symbolic alleles are not supported. They will remain unannotated in the
1259 output VCF and are ignored for the prediction analysis.
1260
1261 -c, --custom-tag STRING
1262 use this custom tag to store consequences rather than the default
1263 BCSQ tag
1264
1265 -B, --trim-protein-seq INT
1266 abbreviate protein-changing predictions to a maximum of INT
1267 amino acids. For example, instead of writing the whole modified
1268 protein sequence with potentially hundreds of amino acids, with -B 1
1269 only an abbreviated version such as 25E..329>25G..94 will be
1270 written.
1271
1272 -e, --exclude EXPRESSION
1273 exclude sites for which EXPRESSION is true. For valid expressions
1274 see EXPRESSIONS.
1275
1276 -f, --fasta-ref FILE
1277 reference sequence in fasta format (required)
1278
1279 --force
1280 run even if some sanity checks fail. Currently the option allows
1281 skipping transcripts in malformed GFFs with incorrect phase.
1282
1283 -g, --gff-annot FILE
1284 GFF3 annotation file (required), such as
1285 ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens. An example of
1286 a minimal working GFF file:
1287
1288 # The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
1289 # looks up their parent transcript (determined from the "Parent=transcript:" attribute),
1290 # the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
1291 # (the most interesting is "protein_coding").
1292 #
1293 # Attributes required for
1294 # gene lines:
1295 # - ID=gene:<gene_id>
1296 # - biotype=<biotype>
1297 # - Name=<gene_name> [optional]
1298 #
1299 # transcript lines:
1300 # - ID=transcript:<transcript_id>
1301 # - Parent=gene:<gene_id>
1302 # - biotype=<biotype>
1303 #
1304 # other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
1305 # - Parent=transcript:<transcript_id>
1306 #
1307 # Supported biotypes:
1308 # - see the function gff_parse_biotype() in bcftools/csq.c
1309
1310 1 ignored_field gene 21 2148 . - . ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1311 1 ignored_field transcript 21 2148 . - . ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1312 1 ignored_field three_prime_UTR 21 2054 . - . Parent=transcript:TranscriptId
1313 1 ignored_field exon 21 2148 . - . Parent=transcript:TranscriptId
1314 1 ignored_field CDS 21 2148 . - 1 Parent=transcript:TranscriptId
1315 1 ignored_field five_prime_UTR 210 2148 . - . Parent=transcript:TranscriptId
1316
1317 -i, --include EXPRESSION
1318 include only sites for which EXPRESSION is true. For valid
1319 expressions see EXPRESSIONS.
1320
1321 -l, --local-csq
1322 switch off haplotype-aware calling, run localized predictions
1323 considering only one VCF record at a time
1324
1325 -n, --ncsq INT
1326 maximum number of per-haplotype consequences to consider for each
1327 site. The INFO/BCSQ column includes all consequences, but only the
1328 first INT will be referenced by the FORMAT/BCSQ fields. The default
1329 value is 15 which corresponds to one 32-bit integer per diploid
1330 sample, after accounting for values reserved by the BCF
1331 specification. Note that increasing the value leads to increased
1332 size of the output BCF.
1333
1334 --no-version
1335 see Common Options
1336
1337 -o, --output FILE
1338 see Common Options
1339
1340 -O, --output-type t|b|u|z|v[0-9]
1341 see Common Options. In addition, a custom tab-delimited plain text
1342 output can be printed (t).
1343
1344 -p, --phase a|m|r|R|s
1345 how to handle unphased heterozygous genotypes:
1346
1347 a
1348 take GTs as is, create haplotypes regardless of phase (0/1 →
1349 0|1)
1350
1351 m
1352 merge all GTs into a single haplotype (0/1 → 1, 1/2 → 1)
1353
1354 r
1355 require phased GTs, throw an error on unphased heterozygous GTs
1356
1357 R
1358 create non-reference haplotypes if possible (0/1 → 1|1, 1/2 →
1359 1|2)
1360
1361 s
1362 skip unphased heterozygous GTs
1363
1364 -q, --quiet
1365 suppress warning messages
1366
1367 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1368 see Common Options
1369
1370 -R, --regions-file FILE
1371 see Common Options
1372
1373 --regions-overlap 0|1|2
1374 see Common Options
1375
1376 -s, --samples LIST
1377 samples to include or "-" to apply all variants and ignore samples
1378
1379 -S, --samples-file FILE
1380 see Common Options
1381
1382 -t, --targets LIST
1383 see Common Options
1384
1385 -T, --targets-file FILE
1386 see Common Options
1387
1388 --targets-overlap 0|1|2
1389 see Common Options
1390
1391 Examples:
1392
1393 # Basic usage
1394 bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf
1395
1396 # Extract the translated haplotype consequences. The following TBCSQ variations
1397 # are recognised:
1398 # %TBCSQ .. print consequences in all haplotypes in separate columns
1399 # %TBCSQ{0} .. print the first haplotype only
1400 # %TBCSQ{1} .. print the second haplotype only
1401 # %TBCSQ{*} .. print a list of unique consequences present in either haplotype
1402 bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf
1403
1404 Examples of BCSQ annotation:
1405
1406 # Two separate VCF records at positions 2:122106101 and 2:122106102
1407 # change the same codon. This UV-induced C>T dinucleotide mutation
1408 # has been annotated fully at the position 2:122106101 with
1409 # - consequence type
1410 # - gene name
1411 # - ensembl transcript ID
1412 # - coding strand (+ fwd, - rev)
1413 # - amino acid position (in the coding strand orientation)
1414 # - list of corresponding VCF variants
1415 # The annotation at the second position gives the position of the full
1416 # annotation
1417 BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
1418 BCSQ=@122106101
1419
1420 # A frame-restoring combination of two frameshift insertions C>CG and T>TGG
1421 BCSQ=@46115084
1422 BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG
1423
1424 # Stop gained variant
1425 BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T
1426
1427 # Consequence types of variants downstream from a stop are prefixed with *
1428 BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C
1429
1430 Supported consequence types
1431
1432 3_prime_utr
1433 5_prime_utr
1434 coding_sequence
1435 feature_elongation
1436 frameshift
1437 inframe_altering
1438 inframe_deletion
1439 inframe_insertion
1440 intergenic
1441 intron
1442 missense
1443 non_coding
1444 splice_acceptor
1445 splice_donor
1446 splice_region
1447 start_lost
1448 start_retained
1449 stop_gained
1450 stop_lost
1451 stop_retained
1452 synonymous
1453
1454 See also
1455 https://ensembl.org/info/genome/variation/prediction/predicted_data.html
1456
1457 bcftools filter [OPTIONS] FILE
1458 Apply fixed-threshold filters.
1459
1460 -e, --exclude EXPRESSION
1461 exclude sites for which EXPRESSION is true. For valid expressions
1462 see EXPRESSIONS.
1463
1464 -g, --SnpGap INT[:indel,mnp,bnd,other,overlap]
1465 filter SNPs within INT base pairs of an indel or other
1466 variant type. The following example demonstrates the logic of
1467 --SnpGap 3 applied on a deletion and an insertion:
1468
1469 The SNPs at positions 1 and 7 are filtered, positions 0 and 8 are not:
1470 0123456789
1471 ref .G.GT..G..
1472 del .A.G-..A..
1473 Here the positions 1 and 6 are filtered, 0 and 7 are not:
1474 0123-456789
1475 ref .G.G-..G..
1476 ins .A.GT..A..
1477
1478 -G, --IndelGap INT
1479 filter clusters of indels separated by INT or fewer base pairs
1480 allowing only one to pass. The following example demonstrates the
1481 logic of --IndelGap 2 applied on a deletion and an insertion:
1482
1483 The second indel is filtered:
1484 012345678901
1485 ref .GT.GT..GT..
1486 del .G-.G-..G-..
1487 And similarly here, the second is filtered:
1488 01 23 456 78
1489 ref .A-.A-..A-..
1490 ins .AT.AT..AT..
1491
1492 -i, --include EXPRESSION
1493 include only sites for which EXPRESSION is true. For valid
1494 expressions see EXPRESSIONS.
1495
1496 --mask [^]REGION
1497 Soft filter regions, prepend "^" to negate. Requires -s,
1498 --soft-filter.
1499
1500 -M, --mask-file [^]FILE
1501 Soft filter regions listed in a file, "^" to negate. Requires -s,
1502 --soft-filter.
1503
1504 --mask-overlap 0|1|2
1505 Same as --regions-overlap but for --mask/--mask-file. See Common
1506 Options. [1]
1507
1508 -m, --mode [+x]
1509 define behaviour at sites with existing FILTER annotations. The
1510 default mode replaces existing filters of failed sites with a new
1511 FILTER string while leaving sites which pass untouched when
1512 non-empty and setting to "PASS" when the FILTER string is absent.
1513 The "+" mode appends new FILTER strings of failed sites instead of
1514 replacing them. The "x" mode resets filters of sites which pass to
1515 "PASS". Modes "+" and "x" can both be set.
1516
1517 --no-version
1518 see Common Options
1519
1520 -o, --output FILE
1521 see Common Options
1522
1523 -O, --output-type b|u|z|v[0-9]
1524 see Common Options
1525
1526 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1527 see Common Options
1528
1529 -R, --regions-file file
1530 see Common Options
1531
1532 --regions-overlap 0|1|2
1533 see Common Options
1534
1535 -s, --soft-filter STRING|+
1536 annotate FILTER column with STRING or, with +, a unique filter name
1537 generated by the program ("Filter%d").
1538
1539 -S, --set-GTs .|0
1540 set genotypes of failed samples to missing value (.) or reference
1541 allele (0)
1542
1543 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1544 see Common Options
1545
1546 -T, --targets-file file
1547 see Common Options
1548
1549 --targets-overlap 0|1|2
1550 see Common Options
1551
1552 --threads INT
1553 see Common Options
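Example (a hedged sketch, not part of the upstream manual): soft-filtering with -s and -e. The input file, its two records and the filter name LowQual are made up for illustration, and the bcftools call is skipped when bcftools is not installed.

```shell
# Hypothetical input with one low-quality and one high-quality SNP.
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n##contig=<ID=1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t10\t.\t.\n1\t200\t.\tC\tT\t50\t.\t.\n' > "$tmpdir/in.vcf"

# Soft-filter (annotate, do not drop) sites with QUAL below 20 as "LowQual".
if command -v bcftools >/dev/null 2>&1; then
    bcftools filter -s LowQual -e 'QUAL<20' -Ov -o "$tmpdir/out.vcf" "$tmpdir/in.vcf"
fi
```

Under the default --mode described above, the failing record would be expected to carry the LowQual string in its FILTER column, while the passing record, whose FILTER is absent, would be set to "PASS".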
1554
1555 bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz
1556 Checks sample identity. The program can operate in two modes. If the -g
1557 option is given, the identity of samples from query.vcf.gz is checked
1558 against the samples in the -g file. Without the -g option, multi-sample
1559 cross-check of samples in query.vcf.gz is performed.
1560
1561 --distinctive-sites NUM[,MEM[,DIR]]
1562 Find sites that can distinguish between at least NUM sample pairs.
1563 If the number is smaller than or equal to 1, it is interpreted as the
1564 fraction of pairs. The optional MEM string sets the maximum memory
1565 used for in-memory sorting and DIR is the temporary directory for
1566 external sorting. This option requires also --pairs to be given.
1567
1568 --dry-run
1569 Stop after first record to estimate required time.
1570
1571 -e, --error-probability INT
1572 Interpret genotypes and genotype likelihoods probabilistically. The
1573 value of INT represents genotype quality when GT tag is used (e.g.
1574 Q=30 represents one error in 1,000 genotypes and Q=40 one error in
1575 10,000 genotypes) and is ignored when PL tag is used (in that case
1576 an arbitrary non-zero integer can be provided). See also the -u,
1577 --use option below. If set to 0, the discordance equals the
1578 number of mismatching genotypes when GT vs GT is compared. Note
1579 that the values with and without -e are not comparable, only values
1580 generated with -e 0 correspond to mismatching genotypes. If
1581 performance is an issue, set to 0 for faster run but less accurate
1582 results.
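The Q values quoted for -e follow the usual Phred scale, where the error probability is 10^(-Q/10). A small sketch of that conversion (the function name phred_to_prob is made up for this illustration and is not a bcftools feature):

```shell
# Convert a Phred-scaled quality Q to an error probability, 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_prob 30   # Q=30: one error in 1,000 genotypes
phred_to_prob 40   # Q=40: one error in 10,000 genotypes
```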
1583
1584 -g, --genotypes FILE
1585 VCF/BCF file with reference genotypes to compare against
1586
1587 -H, --homs-only
1588 Homozygous genotypes only, useful with low coverage data (requires
1589 -g, --genotypes)
1590
1591 --n-matches INT
1592 Print only top INT matches for each sample, 0 for unlimited. Use
1593 negative value to sort by HWE probability rather than the number of
1594 discordant sites. Note that average score is used to determine the
1595 top matches, not absolute values.
1596
1597 --no-HWE-prob
1598 Disable calculation of HWE probability to reduce memory
1599 requirements with comparisons between a very large number of sample
1600 pairs.
1601
1602 -p, --pairs LIST
1603 A comma-separated list of sample pairs to compare. When the -g
1604 option is given, the first sample must be from the query file, the
1605 second from the -g file, third from the query file etc
1606 (qry,gt[,qry,gt..]). Without the -g option, the pairs are created
1607 the same way but both samples are from the query file
1608 (qry,qry[,qry,qry..])
1609
1610 -P, --pairs-file FILE
1611 A file with tab-delimited sample pairs to compare. The first sample
1612 in the pair must come from the query file, the second from the
1613 genotypes file when -g is given
1614
1615 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1616 Restrict to comma-separated list of regions, see Common Options
1617
1618 -R, --regions-file FILE
1619 Restrict to regions listed in a file, see Common Options
1620
1621 --regions-overlap 0|1|2
1622 see Common Options
1623
1624 -s, --samples [qry|gt]:LIST
1625 List of query samples or -g samples. If neither -s nor -S are given,
1626 all possible sample pair combinations are compared.
1627
1628 -S, --samples-file [qry|gt]:FILE
1629 File with the query or -g samples to compare. If neither -s nor -S
1630 are given, all possible sample pair combinations are compared.
1631
1632 -t, --targets file
1633 see Common Options
1634
1635 -T, --targets-file file
1636 see Common Options
1637
1638 --targets-overlap 0|1|2
1639 see Common Options
1640
1641 -u, --use TAG1[,TAG2]
1642 specifies which tag to use in the query file (TAG1) and the -g
1643 (TAG2) file. By default, the PL tag is used in the query file and
1644 GT in the -g file when available.
1645
1646 Examples:
1647
1648 # Check discordance of all samples from B against all samples in A
1649 bcftools gtcheck -g A.bcf B.bcf
1650
1651 # Limit comparisons to the fiven list of samples
1652 bcftools gtcheck -s gt:a1,a2,a3 -s qry:b1,b2 -g A.bcf B.bcf
1653
1654 # Compare only two pairs a1,b1 and a1,b2
1655 bcftools gtcheck -p a1,b1,a1,b2 -g A.bcf B.bcf
1656
1657 bcftools head [OPTIONS] [FILE]
1658 By default, prints all headers from the specified input file to
1659 standard output in VCF format. The input file may be in VCF or BCF
1660 format; if no FILE is specified, standard input will be read. With
1661 appropriate options, only some of the headers and/or additionally some
1662 of the variant records will be printed.
1663
1664 The bcftools head command outputs VCF headers almost exactly as they
1665 appear in the input file: it may add a ##FILTER=<ID=PASS> header if not
1666 already present, but it never adds version or command line information
1667 itself.
1668
1669 Options:
1670 -h, --header INT
1671 Display only the first INT header lines. By default, all header
1672 lines are displayed.
1673
1674 -n, --records INT
1675 Also display the first INT variant records. By default, no variant
1676 records are displayed.
1677
1678 bcftools index [OPTIONS] in.bcf|in.vcf.gz
1679 Creates index for bgzip compressed VCF/BCF files for random access. CSI
1680 (coordinate-sorted index) is created by default. The CSI format
1681 supports indexing of chromosomes up to length 2^31. TBI (tabix index)
1682 files, which support chromosome lengths up to 2^29, can be
1683 created by using the -t/--tbi option or using the tabix program
1684 packaged with htslib. When loading an index file, bcftools will try the
1685 CSI first and then the TBI.
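
As a rough illustration of the length limits quoted above, the following Python sketch lists which index formats could cover a set of contigs. The function name and return labels are hypothetical, and the limits are simply the 2^29 and 2^31 figures stated in this section (treated here as inclusive, which is an assumption); this is not how bcftools itself decides.

```python
TBI_MAX_LEN = 2**29  # TBI limit quoted in the text above
CSI_MAX_LEN = 2**31  # CSI limit quoted in the text above


def index_formats_for(contig_lengths):
    """Return the index formats whose stated limit covers the longest contig.

    contig_lengths maps contig name -> length in bases (hypothetical input).
    """
    longest = max(contig_lengths.values())
    formats = []
    if longest <= CSI_MAX_LEN:
        formats.append("csi")
    if longest <= TBI_MAX_LEN:
        formats.append("tbi")
    return formats
```

For a genome whose longest chromosome exceeds 2^29 bases, only "csi" is returned, which corresponds to the default behaviour of bcftools index.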
1686
1687 Indexing options:
1688 -c, --csi
1689 generate CSI-format index for VCF/BCF files [default]
1690
1691 -f, --force
1692 overwrite index if it already exists
1693
1694 -m, --min-shift INT
1695 set minimal interval size for CSI indices to 2^INT; default: 14
1696
1697 -o, --output FILE
1698 output file name. If not set, then the index will be created using
1699 the input file name plus a .csi or .tbi extension
1700
1701 -t, --tbi
1702 generate TBI-format index for VCF files
1703
1704 --threads INT
1705 see Common Options
1706
1707 Stats options:
1708 -n, --nrecords
1709 print the number of records based on the CSI or TBI index files
1710
1711 -s, --stats
1712 Print per contig stats based on the CSI or TBI index files. Output
1713 format is three tab-delimited columns listing the contig name,
1714 contig length (. if unknown) and number of records for the contig.
1715 Contigs with zero records are not printed.
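
The three-column output described for -s, --stats is easy to post-process. The sketch below parses such text in Python; the function name and the sample contig names are made up for illustration, and the only format assumption is the one stated above (tab-delimited name, length or ".", record count).

```python
def parse_index_stats(text):
    """Parse per-contig stats text into (name, length_or_None, n_records) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, length, n_records = line.split("\t")
        # "." marks an unknown contig length
        rows.append((name, None if length == "." else int(length), int(n_records)))
    return rows
```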
1716
1717 bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz [...]
1718 Creates intersections, unions and complements of VCF files. Depending
1719 on the options, the program can output records from one (or more) files
1720 which have (or do not have) corresponding records with the same
1721 position in the other files.
1722
1723 -c, --collapse snps|indels|both|all|some|none
1724 see Common Options
1725
1726 -C, --complement
1727 output positions present only in the first file but missing in the
1728 others
1729
1730 -e, --exclude -|EXPRESSION
1731 exclude sites for which EXPRESSION is true. If -e (or -i) appears
1732 only once, the same filtering expression will be applied to all
1733 input files. Otherwise, -e or -i must be given for each input file.
1734 To indicate that no filtering should be performed on a file, use
1735 "-" in place of EXPRESSION, as shown in the example below. For
1736 valid expressions see EXPRESSIONS.
1737
1738 -f, --apply-filters LIST
1739 see Common Options
1740
1741 -i, --include EXPRESSION
1742 include only sites for which EXPRESSION is true. See discussion of
1743 -e, --exclude above.
1744
1745 -n, --nfiles [+-=]INT|~BITMAP
1746 output positions present in this many (=), this many or more (+),
1747 this many or fewer (-), or the exact same (~) files
1748
1749 -o, --output FILE
1750 see Common Options. When several files are being output, their
1751 names are controlled via -p instead.
1752
1753 -O, --output-type b|u|z|v[0-9]
1754 see Common Options
1755
1756 -p, --prefix DIR
1757 if given, subset each of the input files accordingly. See also -w.
1758
1759 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1760 see Common Options
1761
1762 -R, --regions-file file
1763 see Common Options
1764
1765 --regions-overlap 0|1|2
1766 see Common Options
1767
1768 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
1769 see Common Options
1770
1771 -T, --targets-file file
1772 see Common Options
1773
1774 --targets-overlap 0|1|2
1775 see Common Options
1776
1777 -w, --write LIST
1778 list of input files to output given as 1-based indices. With -p and
1779 no -w, all files are written.
1780
1781 Examples:
1782 Create intersection and complements of two sets saving the output in
1783 dir/*
1784
1785 bcftools isec -p dir A.vcf.gz B.vcf.gz
1786
1787 Filter sites in A (require INFO/MAF>=0.01) and B (require INFO/dbSNP)
1788 but not in C, and create an intersection, including only sites which
1789 appear in at least two of the files after filters have been applied
1790
1791 bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e- A.vcf.gz B.vcf.gz C.vcf.gz -n +2 -p dir
1792
1793 Extract and write records from A shared by both A and B using exact
1794 allele match
1795
1796 bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
1797
1798 Extract records private to A or B comparing by position only
1799
1800 bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
1801
1802 Print a list of records which are present in A and B but not in C and D
1803
1804 bcftools isec -n~1100 -c all A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz
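
The selection rule of -n, --nfiles used in the examples above can be sketched in Python as follows. This is only a model of the documented semantics, not the bcftools implementation; the function name is hypothetical, and the sketch deliberately accepts only the prefixed forms (=, +, -, ~).

```python
def nfiles_match(spec, present):
    """Decide whether a site passes an -n style specification.

    spec    -- "=INT", "+INT", "-INT" or "~BITMAP"
    present -- list of booleans, one per input file in command-line order
    """
    if spec.startswith("~"):
        bitmap = spec[1:]
        # exact same set of files: every bit must agree with the presence flag
        return len(bitmap) == len(present) and all(
            (bit == "1") == flag for bit, flag in zip(bitmap, present)
        )
    op, n = spec[0], int(spec[1:])
    count = sum(present)
    if op == "=":
        return count == n
    if op == "+":
        return count >= n
    if op == "-":
        return count <= n
    raise ValueError("this sketch handles only =, +, - and ~ prefixes")
```

With four input files, nfiles_match("~1100", [True, True, False, False]) is true only for sites seen in the first two files and absent from the last two.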
1805
1806 bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz [...]
1807 Merge multiple VCF/BCF files from non-overlapping sample sets to create
1808 one multi-sample file. For example, when merging file A.vcf.gz
1809 containing samples S1, S2 and S3 and file B.vcf.gz containing samples
1810 S3 and S4, the output file will contain five samples named S1, S2, S3,
1811 2:S3 and S4.
1812
1813 Note that it is the responsibility of the user to ensure that the sample
1814 names are unique across all files. If they are not, the program will
1815 exit with an error unless the option --force-samples is given. The
1816 sample names can also be given explicitly using the --print-header and
1817 --use-header options.
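
The sample naming in the S1..S4 example above can be reproduced with a few lines of Python. This is a minimal sketch of the renaming rule as documented for --force-samples (prepend the 1-based file index and a colon to a conflicting name); the function name is hypothetical and the real program may differ in corner cases.

```python
def merged_sample_names(files):
    """files: list of per-file sample name lists, in command-line order."""
    seen = set()
    merged = []
    for file_index, samples in enumerate(files, start=1):
        for sample in samples:
            # a name already taken is prefixed with the index of its file
            name = sample if sample not in seen else f"{file_index}:{sample}"
            seen.add(name)
            merged.append(name)
    return merged
```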
1818
1819 Note that only records from different files can be merged, never from
1820 the same file. For "vertical" merge take a look at bcftools concat or
1821 bcftools norm -m instead.
1822
1823 --force-samples
1824 if the merged files contain duplicate sample names, proceed
1825 anyway. Duplicate sample names will be resolved by prepending the
1826 index of the file as it appeared on the command line to the
1827 conflicting sample name (see 2:S3 in the above example).
1828
1829 --print-header
1830 print only merged header and exit
1831
1832 --use-header FILE
1833 use the VCF header in the provided text FILE
1834
1835 -0, --missing-to-ref
1836 assume genotypes at missing sites are 0/0
1837
1838 -f, --apply-filters LIST
1839 see Common Options
1840
1841 -F, --filter-logic x|+
1842 Set the output record to PASS if any of the inputs is PASS (x), or
1843 apply all filters (+), which is the default.
1844
1845 -g, --gvcf -|FILE
1846 merge gVCF blocks, INFO/END tag is expected. If the reference fasta
1847 file FILE is not given and the dash (-) is given, unknown reference
1848 bases generated at gVCF block splits will be substituted with N’s.
1849 The --gvcf option uses the following default INFO rules: -i
1850 QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max.
1851
1852 -i, --info-rules -|TAG:METHOD[,...]
1853 Rules for merging INFO fields (scalars or vectors) or - to disable
1854 the default rules. METHOD is one of sum, avg, min, max, join.
1855 Default is DP:sum,DP4:sum if these fields exist in the input files.
1856 Fields with no specified rule will take the value from the first
1857 input file. The merged QUAL value is currently set to the maximum.
1858 This behaviour is not user controllable at the moment.
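
To make the METHOD names concrete, here is a small Python sketch of how one INFO tag might be combined across input files. It assumes element-wise operation on equal-length vectors and treats join as simple concatenation; both are assumptions made for illustration, missing values are ignored, and the function name is hypothetical.

```python
def merge_info(values, method):
    """values: one list per input file holding that file's vector for the tag."""
    if method == "join":
        # assumption: join concatenates the per-file vectors
        return [x for vec in values for x in vec]
    ops = {
        "sum": sum,
        "min": min,
        "max": max,
        "avg": lambda column: sum(column) / len(column),
    }
    if method not in ops:
        raise ValueError(f"unknown method: {method}")
    # element-wise over files; assumes all vectors have the same length
    return [ops[method](column) for column in zip(*values)]
```

For example, the default DP:sum rule applied to DP=10 in one file and DP=20 in another corresponds to merge_info([[10], [20]], "sum").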
1859
1860 -l, --file-list FILE
1861 Read file names from FILE, one file name per line.
1862
1863 -L, --local-alleles INT
1864 Sites with many alternate alleles can require extremely large
1865 storage space which can exceed the 2GB size limit representable by
1866 BCF. This is caused by Number=G tags (such as FORMAT/PL) which
1867 store a value for each combination of reference and alternate
1868 alleles. The -L, --local-alleles option allows replacing such tags
1869 with a localized tag (FORMAT/LPL) which only includes a subset of
1870 alternate alleles relevant for that sample. A new FORMAT/LAA tag is
1871 added which lists 1-based indices of the alternate alleles relevant
1872 (local) for the current sample. The number INT gives the maximum
1873 number of alternate alleles that can be included in the PL tag. The
1874 default value is 0 which disables the feature and outputs values
1875 for all alternate alleles.
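
The relationship between a full PL vector and its localized counterpart can be sketched in Python. The code assumes diploid genotypes and the standard VCF genotype ordering (the value for genotype j/k, j <= k, is at position k*(k+1)/2 + j); it takes the LAA indices as given and does not model how bcftools chooses which alleles are relevant. Function names are hypothetical.

```python
def genotype_index(j, k):
    """Position of diploid genotype j/k (j <= k) in a Number=G vector."""
    return k * (k + 1) // 2 + j


def localize_pl(pl, laa):
    """Subset a full PL vector to REF plus the ALT alleles listed in laa.

    laa holds 1-based ALT allele indices in ascending order.
    """
    alleles = [0] + list(laa)  # 0 is the reference allele
    return [
        pl[genotype_index(alleles[a], alleles[b])]
        for b in range(len(alleles))
        for a in range(b + 1)
    ]
```

With n alternate alleles a Number=G tag holds (n+1)(n+2)/2 values per sample, which is why restricting each sample to a few local alleles can reduce the record size substantially.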
1876
1877 -m, --merge snps|indels|both|all|none|id
1878 The option controls what types of multiallelic records can be
1879 created:
1880
1881 -m none .. no new multiallelics, output multiple records instead
1882 -m snps .. allow multiallelic SNP records
1883 -m indels .. allow multiallelic indel records
1884 -m both .. both SNP and indel records can be multiallelic
1885 -m all .. SNP records can be merged with indel records
1886 -m id .. merge by ID
1887
1888 --no-index
1889 the option allows merging files without indexing them first. In
1890 order for this option to work, the user must ensure that the input
1891 files have chromosomes in the same order and consistent with the
1892 order of sequences in the VCF header.
1893
1894 --no-version
1895 see Common Options
1896
1897 -o, --output FILE
1898 see Common Options
1899
1900 -O, --output-type b|u|z|v[0-9]
1901 see Common Options
1902
1903 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
1904 see Common Options
1905
1906 -R, --regions-file file
1907 see Common Options
1908
1909 --regions-overlap 0|1|2
1910 see Common Options
1911
1912 --threads INT
1913 see Common Options
1914
1915 bcftools mpileup [OPTIONS] -f ref.fa in.bam [in2.bam [...]]
1916 Generate VCF or BCF containing genotype likelihoods for one or multiple
1917 alignment (BAM or CRAM) files. This is based on the original samtools
1918 mpileup command (with the -v or -g options) producing genotype
1919 likelihoods in VCF or BCF format, but not the textual pileup output.
1920 The mpileup command was transferred to bcftools in order to avoid
1921 errors resulting from use of incompatible versions of samtools and
1922 bcftools when used in the mpileup+bcftools call pipeline.
1923
1924 Individuals are identified from the SM tags in the @RG header lines.
1925 Multiple individuals can be pooled in one alignment file; also one
1926 individual can be separated into multiple files. If sample identifiers
1927 are absent, each input file is regarded as one sample.
1928
1929 Note that there are two orthogonal ways to specify locations in the
1930 input file; via -r region and -t positions. The former uses (and
1931 requires) an index to do random access while the latter streams through
1932 the file contents filtering out the specified regions, requiring no
1933 index. The two may be used in conjunction. For example a BED file
1934 containing locations of genes in chromosome 20 could be specified using
1935 -r 20 -T chr20.bed, meaning that the index is used to find chromosome
1936 20 and then it is filtered for the regions listed in the BED file. Also
1937 note that the -r option can be much slower than -t with many regions
1938 and can require more memory when multiple regions and many alignment
1939 files are processed.
1940
1941 Input options
1942 -6, --illumina1.3+
1943 Assume the quality is in the Illumina 1.3+ encoding.
1944
1945 -A, --count-orphans
1946 Do not skip anomalous read pairs in variant calling.
1947
1948 -b, --bam-list FILE
1949 List of input alignment files, one file per line [null]
1950
1951 -B, --no-BAQ
1952 Disable probabilistic realignment for the computation of base
1953 alignment quality (BAQ). BAQ is the Phred-scaled probability of a
1954 read base being misaligned. Applying BAQ, that is leaving this option
1955 off, greatly helps to reduce false SNPs caused by misalignments.
1956
1957 -C, --adjust-MQ INT
1958 Coefficient for downgrading mapping quality for reads containing
1959 excessive mismatches. Given a read with a phred-scaled probability
1960 q of being generated from the mapped position, the new mapping
1961 quality is about sqrt((INT-q)/INT)*INT. A zero value (the default)
1962 disables this functionality.
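
The approximate formula quoted for -C can be evaluated directly. The Python sketch below is only an illustration of that expression; returning None when the coefficient is zero and clamping negative ratios to zero are assumptions of this sketch, and the function name is hypothetical.

```python
import math


def adjusted_mapq(q, coef):
    """Evaluate sqrt((coef - q) / coef) * coef as given in the text above."""
    if coef == 0:
        return None  # a zero coefficient disables the adjustment
    ratio = (coef - q) / coef
    # assumption: values of q above coef are clamped so the root stays real
    return math.sqrt(max(ratio, 0.0)) * coef
```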
1963
1964 -D, --full-BAQ
1965 Run the BAQ algorithm on all reads, not just those in problematic
1966 regions. This matches the behaviour for Bcftools 1.12 and earlier.
1967
1968 By default mpileup uses heuristics to decide when to apply the BAQ
1969 algorithm. Most sequences will not be BAQ adjusted, giving a CPU
1970 time closer to --no-BAQ, but it will still be applied in regions
1971 with suspected problematic alignments. This has been tested to work
1972 well on single sample data with even allele frequency, but the
1973 reliability is unknown for multi-sample calling and for low allele
1974 frequency variants so full BAQ is still recommended in those
1975 scenarios.
1976
1977 -d, --max-depth INT
1978 At a position, read maximally INT reads per input file. Note that
1979 the original samtools mpileup command had a minimum value of 8000/n
1980 where n was the number of input files given to mpileup. This means
1981 that in samtools mpileup the default was highly likely to be
1982 increased and the -d parameter would have an effect only once above
1983 the cross-sample minimum of 8000. This behavior was problematic
1984 when working with a combination of single- and multi-sample bams,
1985 therefore in bcftools mpileup the user is given the full control
1986 (and responsibility), and an informative message is printed instead
1987 [250]
1988
1989 -E, --redo-BAQ
1990 Recalculate BAQ on the fly, ignore existing BQ tags
1991
1992 -f, --fasta-ref FILE
1993 The faidx-indexed reference file in the FASTA format. The file can
1994 be optionally compressed by bgzip. Reference is required by default
1995 unless the --no-reference option is set [null]
1996
1997 --no-reference
1998 Do not require the --fasta-ref option.
1999
2000 -G, --read-groups FILE
2001 list of read groups to include or exclude if prefixed with "^". One
2002 read group per line. This file can also be used to assign new
2003 sample names to read groups by giving the new sample name as a
2004 second white-space-separated field, like this: "read_group_id
2005 new_sample_name". If the read group name is not unique, also the
2006 bam file name can be included: "read_group_id file_name
2007 sample_name". If all reads from the alignment file should be
2008 treated as a single sample, the asterisk symbol can be used: "*
2009 file_name sample_name". Alignments without a read group ID can be
2010 matched with "?". NOTE: The meaning of bcftools mpileup -G is the
2011 opposite of samtools mpileup -G.
2012
2013 RG_ID_1
2014 RG_ID_2 SAMPLE_A
2015 RG_ID_3 SAMPLE_A
2016 RG_ID_4 SAMPLE_B
2017 RG_ID_5 FILE_1.bam SAMPLE_A
2018 RG_ID_6 FILE_2.bam SAMPLE_A
2019 * FILE_3.bam SAMPLE_C
2020 ? FILE_3.bam SAMPLE_D
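
The one-, two- and three-field line layouts shown above can be told apart by field count. The following Python sketch parses one line of such a file; it is a hypothetical helper, it assumes fields contain no embedded whitespace, and it does not interpret the special "*" and "?" values beyond passing them through.

```python
def parse_read_group_line(line):
    """Split a read-groups line into read group, optional file and optional sample."""
    fields = line.split()
    if len(fields) == 1:
        read_group, file_name, sample = fields[0], None, None
    elif len(fields) == 2:
        read_group, sample = fields
        file_name = None
    elif len(fields) == 3:
        read_group, file_name, sample = fields
    else:
        raise ValueError(f"expected 1 to 3 fields, got {len(fields)}")
    return {"read_group": read_group, "file": file_name, "sample": sample}
```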
2021
2022 -q, --min-MQ INT
2023 Minimum mapping quality for an alignment to be used [0]
2024
2025 -Q, --min-BQ INT
2026 Minimum base quality for a base to be considered [13]
2027
2028 --max-BQ INT
2029 Caps the base quality to a maximum value [60]. This can be
2030 particularly useful on technologies that produce overly optimistic
2031 high qualities, leading to too many false positives or incorrect
2032 genotype assignments.
2033
2034 -r, --regions CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,...]
2035 Only generate mpileup output in given regions. Requires the
2036 alignment files to be indexed. If used in conjunction with -t or -T then
2037 considers the intersection; see Common Options
2038
2039 -R, --regions-file FILE
2040 As for -r, --regions, but regions read from FILE; see Common
2041 Options
2042
2043 --regions-overlap 0|1|2
2044 see Common Options
2045
2046 --ignore-RG
2047 Ignore RG tags. Treat all reads in one alignment file as one
2048 sample.
2049
2050 --ls, --skip-all-set
2051 Skip reads with all of the FLAG bits set [null]
2052
2053 --ns, --skip-any-set
2054 Skip reads with any of the FLAG bits set. This option replaces and
2055 is synonymous to the deprecated --ff, --excl-flags
2056 [UNMAP,SECONDARY,QCFAIL,DUP]
2057
2058 --lu, --skip-all-unset
2059 Skip reads with all of the FLAG bits unset. This option replaces
2060 and is synonymous to the deprecated --rf, --incl-flags [null]
2061
2062 --nu, --skip-any-unset
2063 Skip reads with any of the FLAG bits unset [null]
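
The four flag filters above differ only in how the read's FLAG is compared against the given bit mask. A minimal Python sketch of these comparisons follows; the bit values are the standard SAM FLAG bits, while the function and parameter names are hypothetical and a value of None stands for the [null] default.

```python
UNMAP, SECONDARY, QCFAIL, DUP = 0x4, 0x100, 0x200, 0x400
DEFAULT_NS = UNMAP | SECONDARY | QCFAIL | DUP  # default of --ns


def skip_read(flag, ls=None, ns=None, lu=None, nu=None):
    """Return True if a read with this FLAG would be skipped."""
    if ls is not None and (flag & ls) == ls:
        return True  # --ls: all of the bits are set
    if ns is not None and (flag & ns) != 0:
        return True  # --ns: any of the bits is set
    if lu is not None and (flag & lu) == 0:
        return True  # --lu: all of the bits are unset
    if nu is not None and (flag & nu) != nu:
        return True  # --nu: any of the bits is unset
    return False
```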
2064
2065 -s, --samples LIST
2066 list of sample names. See Common Options
2067
2068 -S, --samples-file FILE
2069 file of sample names to include or exclude if prefixed with "^".
2070 One sample per line. This file can also be used to rename samples
2071 by giving the new sample name as a second white-space-separated
2072 column, like this: "old_name new_name". If a sample name contains
2073 spaces, the spaces can be escaped using the backslash character,
2074 for example "Not\ a\ good\ sample\ name".
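
The backslash escaping described above can be handled when reading a samples file in other tools. The Python sketch below splits one line on unescaped whitespace and then removes the escapes; it is a hypothetical helper for illustration and does not cover the "^" exclusion prefix.

```python
import re


def parse_sample_line(line):
    """Split a samples-file line into columns, honouring backslash-escaped spaces."""
    # split only on whitespace that is not preceded by a backslash
    fields = re.split(r"(?<!\\)\s+", line.strip())
    return [field.replace("\\ ", " ") for field in fields]
```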
2075
2076 -t, --targets LIST
2077 see Common Options
2078
2079 -T, --targets-file FILE
2080 see Common Options
2081
2082 --targets-overlap 0|1|2
2083 see Common Options
2084
2085 -x, --ignore-overlaps
2086 Disable read-pair overlap detection.
2087
2088 --seed INT
2089 Set the random number seed used when sub-sampling deep regions [0].
2090
2091 Output options
2092 -a, --annotate LIST
2093 Comma-separated list of FORMAT and INFO tags to output.
2094 (case-insensitive, the "FORMAT/" prefix is optional, and use "?" to
2095 list available annotations on the command line) [null]:
2096
2097 FORMAT/AD .. Allelic depth (Number=R,Type=Integer)
2098 FORMAT/ADF .. Allelic depths on the forward strand (Number=R,Type=Integer)
2099 FORMAT/ADR .. Allelic depths on the reverse strand (Number=R,Type=Integer)
2100 FORMAT/DP .. Number of high-quality bases (Number=1,Type=Integer)
2101 FORMAT/SP .. Phred-scaled strand bias P-value (Number=1,Type=Integer)
2102 FORMAT/SCR .. Number of soft-clipped reads (Number=1,Type=Integer)
2103
2104 INFO/AD .. Total allelic depth (Number=R,Type=Integer)
2105 INFO/ADF .. Total allelic depths on the forward strand (Number=R,Type=Integer)
2106 INFO/ADR .. Total allelic depths on the reverse strand (Number=R,Type=Integer)
2107 INFO/SCR .. Number of soft-clipped reads (Number=1,Type=Integer)
2108
2109 FORMAT/DV .. Deprecated in favor of FORMAT/AD; Number of high-quality non-reference bases (Number=1,Type=Integer)
2110 FORMAT/DP4 .. Deprecated in favor of FORMAT/ADF and FORMAT/ADR; Number of high-quality ref-forward, ref-reverse,
2111 alt-forward and alt-reverse bases (Number=4,Type=Integer)
2112 FORMAT/DPR .. Deprecated in favor of FORMAT/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
2113 INFO/DPR .. Deprecated in favor of INFO/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
2114
2115 -g, --gvcf INT[,...]
2116 output gVCF blocks of homozygous REF calls, with depth (DP) ranges
2117 specified by the list of integers. For example, passing 5,15 will
2118 group sites into two types of gVCF blocks, the first with minimum
2119 per-sample DP from the interval [5,15) and the latter with minimum
2120 depth 15 or more. In this example, sites with minimum per-sample
2121 depth less than 5 will be printed as separate records, outside of
2122 gVCF blocks.
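
The grouping rule in the 5,15 example can be expressed as a lookup into the sorted list of thresholds. The Python sketch below returns the index of the block type for a given minimum per-sample depth, or None for sites that fall below the first threshold; the function name is hypothetical and the code only models the intervals described above.

```python
import bisect


def gvcf_block_type(min_dp, thresholds):
    """Return the 0-based block type for min_dp, or None if below all thresholds.

    thresholds is the sorted list passed to -g, e.g. [5, 15].
    """
    position = bisect.bisect_right(thresholds, min_dp)
    return None if position == 0 else position - 1
```

With thresholds [5, 15], depths 5 to 14 map to block type 0, depths of 15 or more to block type 1, and depths below 5 to None (printed as separate records).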
2123
2124 --no-version
2125 see Common Options
2126
2127 -o, --output FILE
2128 Write output to FILE, rather than the default of standard output.
2129 (The same short option is used for both --open-prob and --output.
2130 If -o's argument contains any non-digit characters other than a
2131 leading + or - sign, it is interpreted as --output. Usually the
2132 filename extension will take care of this, but to write to an
2133 entirely numeric filename use -o ./123 or --output 123.)
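
The rule for telling the two meanings of -o apart can be written as a single pattern match. The Python sketch below follows the description above (an optional + or - sign followed only by digits means --open-prob, anything else means --output); the function name and return labels are hypothetical.

```python
import re


def classify_o_argument(arg):
    """Return which long option a -o argument would be taken as."""
    if re.fullmatch(r"[+-]?[0-9]+", arg):
        return "open-prob"
    return "output"
```

This is why -o ./123 is suggested for a numeric file name: the leading "./" introduces non-digit characters.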
2134
2135 -O, --output-type b|u|z|v[0-9]
2136 see Common Options
2137
2138 --threads INT
2139 see Common Options
2140
2141 -U, --mwu-u
2142 Use the previous Mann-Whitney U test score from version 1.12 and
2143 earlier. This is a probability score, but importantly it folds
2144 probabilities above or below the desired score into the same P. The
2145 new Mann-Whitney U test score is a "Z score", expressing the score
2146 as the number of standard deviations away from the mean (with zero
2147 matching the mean). It keeps both positive and negative
2148 values. This can be important for some tests where errors are
2149 asymmetric.
2150
2151 This option changes the INFO field names produced back to the ones
2152 used by the earlier Bcftools releases. For example, BQBZ becomes
2153 BQB.
2154
2155 Options for SNP/INDEL genotype likelihood computation
2156 -X, --config STR
2157 Specify a platform specific configuration profile. The profile
2158 should be one of 1.12, illumina, ont or pacbio-ccs. Settings
2159 applied are as follows:
2160
2161 1.12 -Q13 -h100 -m1
2162 illumina [ default values ]
2163 ont -B -Q5 --max-BQ 30 -I
2164 pacbio-ccs -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
2165
2166 --ar, --ambig-reads drop|incAD|incAD0
2167 What to do with ambiguous indel reads that do not span an entire
2168 short tandem repeat region: discard ambiguous reads from calling
2169 and do not increment high-quality AD depth counters (drop), exclude
2170 from calling but increment AD counters proportionally (incAD),
2171 exclude from calling and increment the first value of the AD
2172 counter (incAD0) [drop]
2173
2174 -e, --ext-prob INT
2175 Phred-scaled gap extension sequencing error probability. Reducing
2176 INT leads to longer indels [20]
2177
2178 -F, --gap-frac FLOAT
2179 Minimum fraction of gapped reads [0.002]
2180
2181 -h, --tandem-qual INT
2182 Coefficient for modeling homopolymer errors. Given an l-long
2183 homopolymer run, the sequencing error of an indel of size s is
2184 modeled as INT*s/l [500]. Increasing this informs the caller that
2185 indels in long homopolymers are more likely genuine and less likely
2186 to be sequencing artifacts. Hence increasing tandem-qual will have
2187 higher recall and lower precision. Bcftools 1.12 and earlier had a
2188 default of 100, which was tuned around more error prone
2189 instruments. Note changing this may have a minor impact on SNP
2190 calling too. For maximum SNP calling accuracy, it may be preferable
2191 to adjust this lower again, although this will adversely affect
2192 indels.
2193
2194 --indel-bias FLOAT
2195 Skews the indel scores up or down, trading recall (low
2196 false-negative) vs precision (low false-positive) [1.0]. In
2197 Bcftools 1.12 and earlier this parameter didn’t exist, but had an
2198 implied value of 1.0. If you are planning to do heavy filtering of
2199 variants, selecting the best quality ones only (favouring precision
2200 over recall), it is advisable to set this lower (such as 0.75)
2201 while higher depth samples or where you favour recall rates over
2202 precision may work better with a higher value such as 2.0.
2203
2204 --indel-size INT
2205 Indel window size to use when assessing the quality of candidate
2206 indels. Note that although the window size approximately
2207 corresponds to the maximum indel size considered, it is not an
2208 exact threshold [110]
2209
2210 -I, --skip-indels
2211 Do not perform INDEL calling
2212
2213 -L, --max-idepth INT
2214 Skip INDEL calling if the average per-sample depth is above INT
2215 [250]
2216
2217 -m, --min-ireads INT
2218 Minimum number of gapped reads for indel candidates [1]
2219
2220 -M, --max-read-len INT
2221 The maximum read length permitted by the BAQ algorithm [500].
2222 Variants are still called on longer reads, but they will not be
2223 passed through the BAQ method. This limit exists to prevent
2224 excessively long BAQ times and high memory usage. Note if partial
2225 BAQ is enabled with -D then raising this parameter will likely not
2226 have a significant CPU cost.
2227
2228 -o, --open-prob INT
2229 Phred-scaled gap open sequencing error probability. Reducing INT
2230 leads to more indel calls. (The same short option is used for both
2231 --open-prob and --output. When -o’s argument contains only an
2232 optional + or - sign followed by the digits 0 to 9, it is
2233 interpreted as --open-prob.) [40]
2234
2235 -p, --per-sample-mF
2236 Apply -m and -F thresholds per sample to increase sensitivity of
2237 calling. By default both options are applied to reads pooled from
2238 all samples.
2239
2240 -P, --platforms STR
2241 Comma-delimited list of platforms (determined by @RG-PL) from
2242 which indel candidates are obtained. It is recommended to collect
2243 indel candidates from sequencing technologies that have low indel
2244 error rate such as ILLUMINA [all]
2245
2246 Examples:
2247 Call SNPs and short INDELs, then mark low quality sites and sites with
2248 the read depth exceeding a limit. (The read depth should be adjusted to
2249 about twice the average read depth as higher read depths usually
2250 indicate problematic regions which are often enriched for artefacts.)
2251 One may consider adding -C50 to mpileup if mapping quality is
2252 overestimated for reads containing excessive mismatches. Applying
2253 this option usually helps for BWA-backtrack alignments, but may not
2254 for other aligners.
2255
2256 bcftools mpileup -Ou -f ref.fa aln.bam | \
2257 bcftools call -Ou -mv | \
2258 bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
2259
2260 bcftools norm [OPTIONS] file.vcf.gz
2261 Left-align and normalize indels, check if REF alleles match the
2262 reference, split multiallelic sites into multiple rows; recover
2263 multiallelics from multiple rows. Left-alignment and normalization will
2264 only be applied if the --fasta-ref option is supplied.
2265
2266 -a, --atomize
2267 Decompose complex variants, e.g. split MNVs into consecutive SNVs.
2268 See also --atom-overlaps and --old-rec-tag.
2269
2270 --atom-overlaps .|*
2271 Alleles missing because of an overlapping variant can be set either
2272 to missing (.) or to the star allele (*), as recommended by the VCF
2273 specification. IMPORTANT: Note that the asterisk is expanded by the shell
2274 and must be put in quotes or escaped by a backslash:
2275
2276 # Before atomization:
2277 100 CC C,GG 1/2
2278
2279 # After:
2280 # bcftools norm -a --atom-overlaps .
2281 100 C G ./1
2282 100 CC C 1/.
2283 101 C G ./1
2284
2285 # After:
2286 # bcftools norm -a --atom-overlaps '*'
2287 # bcftools norm -a --atom-overlaps \*
2288 100 C G,* 2/1
2289 100 CC C,* 1/2
2290 101 C G,* 2/1
2291
2292 -c, --check-ref e|w|x|s
2293 what to do when incorrect or missing REF allele is encountered:
2294 exit (e), warn (w), exclude (x), or set/fix (s) bad sites. The w
2295 option can be combined with x and s. Note that s can swap alleles
2296 and will update genotypes (GT) and AC counts, but will not attempt
2297 to fix PL or other fields. Also note, and this cannot be stressed
2298 enough, that s will NOT fix strand issues in your VCF, do NOT use
2299 it for that purpose!!! (Instead see
2300 http://samtools.github.io/bcftools/howtos/plugin.af-dist.html and
2301 http://samtools.github.io/bcftools/howtos/plugin.fixref.html.)
2302
2303 -d, --rm-dup snps|indels|both|all|exact
2304 If a record is present multiple times, output only the first
2305 instance. See also --collapse in Common Options.
2306
2307 -D, --remove-duplicates
2308 If a record is present in multiple files, output only the first
2309 instance. Alias for -d none, deprecated.
2310
2311 -f, --fasta-ref FILE
2312 reference sequence. Supplying this option will turn on
2313 left-alignment and normalization, however, see also the
2314 --do-not-normalize option below.
2315
2316 --force
2317 try to proceed with -m- even if malformed tags with incorrect
2318 number of fields are encountered, discarding such tags.
2319 (Experimental, use at your own risk.)
2320
2321 --keep-sum TAG[,...]
2322 keep vector sum constant when splitting multiallelic sites. Only AD
2323 tag is currently supported. See also
2324 https://github.com/samtools/bcftools/issues/360
2325
2326 -m, --multiallelics -|+[snps|indels|both|any]
2327 split multiallelic sites into biallelic records (-) or join
2328 biallelic sites into multiallelic records (+). An optional type
2329 string can follow which controls variant types which should be
2330 split or merged together: If only SNP records should be split or
2331 merged, specify snps; if both SNPs and indels should be merged
2332 separately into two records, specify both; if SNPs and indels
2333 should be merged into a single record, specify any.
2334
2335 --no-version
2336 see Common Options
2337
2338 -N, --do-not-normalize
2339 do not normalize indels. The -c s option can be used to fix or set
2340 the REF allele from the reference given with -f; with -N, supplying
2341 -f does not turn on indel normalisation as it normally would.
2342
2343 --old-rec-tag STR
2344 Add INFO/STR annotation with the original record. The format of the
2345 annotation is CHROM|POS|REF|ALT|USED_ALT_IDX.
2346
2347 -o, --output FILE
2348 see Common Options
2349
2350 -O, --output-type b|u|z|v[0-9]
2351 see Common Options
2352
2353 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2354 see Common Options
2355
2356 -R, --regions-file file
2357 see Common Options
2358
2359 --regions-overlap 0|1|2
2360 see Common Options
2361
2362 -s, --strict-filter
2363 when merging (-m+), merged site is PASS only if all sites being
2364 merged PASS
2365
2366 -t, --targets LIST
2367 see Common Options
2368
2369 -T, --targets-file FILE
2370 see Common Options
2371
2372 --targets-overlap 0|1|2
2373 see Common Options
2374
2375 --threads INT
2376 see Common Options
2377
2378 -w, --site-win INT
2379 maximum distance between two records to consider when locally
2380 sorting variants which changed position during the realignment
2381
2382 bcftools [plugin NAME|+NAME] [OPTIONS] FILE -- [PLUGIN OPTIONS]
2383 A common framework for various utilities. The plugins can be used the
2384 same way as normal commands, except that their name is prefixed with "+". Most
2385 plugins accept two types of parameters: general options shared by all
2386 plugins followed by a separator, and a list of plugin-specific options.
2387 There are some exceptions to this rule: some plugins do not accept the
2388 common options and implement their own parameters. Therefore please pay
2389 attention to the usage examples that each plugin comes with.
2390
2391 VCF input options:
2392 -e, --exclude EXPRESSION
2393 exclude sites for which EXPRESSION is true. For valid expressions
2394 see EXPRESSIONS.
2395
2396 -i, --include EXPRESSION
2397 include only sites for which EXPRESSION is true. For valid
2398 expressions see EXPRESSIONS.
2399
2400 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2401 see Common Options
2402
2403 -R, --regions-file file
2404 see Common Options
2405
2406 --regions-overlap 0|1|2
2407 see Common Options
2408
2409 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2410 see Common Options
2411
2412 -T, --targets-file file
2413 see Common Options
2414
2415 --targets-overlap 0|1|2
2416 see Common Options
2417
2418 VCF output options:
2419 --no-version
2420 see Common Options
2421
2422 -o, --output FILE
2423 see Common Options
2424
2425 -O, --output-type b|u|z|v[0-9]
2426 see Common Options
2427
2428 --threads INT
2429 see Common Options
2430
2431 Plugin options:
2432 -h, --help
2433 list plugin’s options
2434
2435 -l, --list-plugins
2436 List all available plugins.
2437
2438 By default, appropriate system directories are searched for
2439 installed plugins. You can override this by setting the
2440 BCFTOOLS_PLUGINS environment variable to a colon-separated list of
2441 directories to search. If BCFTOOLS_PLUGINS begins with a colon,
2442 ends with a colon, or contains adjacent colons, the system
2443 directories are also searched at that position in the list of
2444 directories.
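
The treatment of empty elements in BCFTOOLS_PLUGINS can be sketched as follows. This Python model expands a colon-separated value into a search order, inserting the system directories wherever an empty element occurs (leading colon, trailing colon or adjacent colons). The function name and the directory names used with it are hypothetical.

```python
def plugin_search_path(env_value, system_dirs):
    """Expand a BCFTOOLS_PLUGINS-style value into an ordered directory list."""
    if not env_value:
        return list(system_dirs)  # variable unset or empty: system directories only
    search_path = []
    for element in env_value.split(":"):
        if element == "":
            search_path.extend(system_dirs)  # empty element: system directories here
        else:
            search_path.append(element)
    return search_path
```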
2445
2446 -v, --verbose
2447 print debugging information to debug plugin failure
2448
2449 -V, --version
2450 print version string and exit
2451
2452 List of plugins coming with the distribution:
2453 ad-bias
2454 find positions with wildly varying ALT allele frequency (Fisher
2455 test on FMT/AD)
2456
2457 add-variantkey
2458 add VariantKey INFO fields VKX and RSX
2459
2460 af-dist
2461 collect AF deviation stats and GT probability distribution given AF
2462 and assuming HWE
2463
2464 allele-length
2465 count the frequency of the length of REF, ALT and REF+ALT
2466
2467 check-ploidy
2468 check if ploidy of samples is consistent for all sites
2469
2470 check-sparsity
2471 print samples without genotypes in a region or chromosome
2472
2473 color-chrs
2474 color shared chromosomal segments, requires trio VCF with phased
2475 GTs
2476
2477 contrast
2478 runs a basic association test, per-site or in a region, and checks
2479 for novel alleles and genotypes in two groups of samples. Adds the
2480 following INFO annotations:
2481
2482 • PASSOC .. Fisher’s exact test probability of genotypic
2483 association (REF vs non-REF allele)
2484
2485 • FASSOC .. proportion of non-REF allele in controls and cases
2486
2487 • NASSOC .. number of control-ref, control-alt, case-ref and
2488 case-alt alleles
2489
2490 • NOVELAL .. lists samples with a novel allele not observed in
2491 the control group
2492
2493 • NOVELGT .. lists samples with a novel genotype not observed in
2494 the control group
2495
2496 counts
2497 a minimal plugin which counts number of SNPs, Indels, and total
2498 number of sites.
2499
2500 dosage
2501 print genotype dosage. By default the plugin searches for PL, GL
2502 and GT, in that order.
2503
2504 fill-from-fasta
2505 fill INFO or REF field based on values in a fasta file
2506
2507 fill-tags
2508 set various INFO tags. The list of tags supported in this version:
2509
• INFO/AC         Number:A  Type:Integer  ..  Allele count in genotypes
• INFO/AC_Hom     Number:A  Type:Integer  ..  Allele counts in homozygous genotypes
• INFO/AC_Het     Number:A  Type:Integer  ..  Allele counts in heterozygous genotypes
• INFO/AC_Hemi    Number:A  Type:Integer  ..  Allele counts in hemizygous genotypes
• INFO/AF         Number:A  Type:Float    ..  Allele frequency
• INFO/AN         Number:1  Type:Integer  ..  Total number of alleles in called genotypes
• INFO/ExcHet     Number:A  Type:Float    ..  Test excess heterozygosity; 1=good, 0=bad
• INFO/END        Number:1  Type:Integer  ..  End position of the variant
• INFO/F_MISSING  Number:1  Type:Float    ..  Fraction of missing genotypes
• INFO/HWE        Number:A  Type:Float    ..  HWE test (PMID:15789306); 1=good, 0=bad
• INFO/MAF        Number:A  Type:Float    ..  Minor Allele frequency
• INFO/NS         Number:1  Type:Integer  ..  Number of samples with data
• INFO/TYPE       Number:.  Type:String   ..  The record type (REF,SNP,MNP,INDEL,etc)
• FORMAT/VAF      Number:A  Type:Float    ..  The fraction of reads with the alternate allele, requires FORMAT/AD or ADF+ADR
• FORMAT/VAF1     Number:1  Type:Float    ..  The same as FORMAT/VAF but for all alternate alleles cumulatively
• TAG=func(TAG)   Number:1  Type:Integer  ..  Experimental support for user-defined expressions such as "DP=sum(DP)"
2556
2557 fix-ploidy
2558 sets correct ploidy
2559
2560 fixref
2561 determine and fix strand orientation
2562
2563 frameshifts
2564 annotate frameshift indels
2565
2566 GTisec
2567 count genotype intersections across all possible sample subsets in
2568 a vcf file
2569
2570 GTsubset
2571 output only sites where the requested samples all exclusively share
2572 a genotype
2573
2574 guess-ploidy
2575 determine sample sex by checking genotype likelihoods (GL,PL) or
2576 genotypes (GT) in the non-PAR region of chrX.
2577
2578 gvcfz
2579 compress gVCF file by resizing non-variant blocks according to
2580 specified criteria
2581
2582 impute-info
2583 add imputation information metrics to the INFO field based on
2584 selected FORMAT tags
2585
2586 indel-stats
calculates per-sample or de novo indel stats. The usage and format
2588 is similar to smpl-stats and trio-stats
2589
2590 isecGT
2591 compare two files and set non-identical genotypes to missing
2592
2593 mendelian
2594 count Mendelian consistent / inconsistent genotypes.
2595
2596 missing2ref
2597 sets missing genotypes ("./.") to ref allele ("0/0" or "0|0")
2598
2599 parental-origin
2600 determine parental origin of a CNV region
2601
2602 prune
2603 prune sites by missingness, allele frequency or linkage
disequilibrium. Alternatively, annotate sites with r2, Lewontin’s
D' (PMID:19433632), or Ragsdale’s D (PMID:31697386).
2606
2607 remove-overlaps
2608 remove overlapping variants and duplicate sites
2609
2610 scatter
2611 intended as an inverse to bcftools concat, scatter VCF by chunks or
2612 regions, creating multiple VCFs.
2613
2614 setGT
2615 general tool to set genotypes according to rules requested by the
2616 user
2617
2618 smpl-stats
2619 calculates basic per-sample stats. The usage and format is similar
2620 to indel-stats and trio-stats.
2621
2622 split
2623 split VCF by sample, creating single- or multi-sample VCFs
2624
2625 split-vep
2626 extract fields from structured annotations such as INFO/CSQ created
2627 by bcftools/csq or VEP. These can be added as a new INFO field to
2628 the VCF or in a custom text format. See
2629 http://samtools.github.io/bcftools/howtos/plugin.split-vep.html for
2630 more.
2631
2632 tag2tag
2633 Convert between similar tags, such as GL,PL,GP or QR,QA,QS.
2634
2635 trio-dnm2
2636 screen variants for possible de-novo mutations in trios
2637
2638 trio-stats
2639 calculate transmission rate in trio children. The usage and format
2640 is similar to indel-stats and smpl-stats.
2641
2642 trio-switch-rate
2643 calculate phase switch rate in trio samples, children samples must
2644 have phased GTs
2645
2646 variantkey-hex
2647 generate unsorted VariantKey-RSid index files in hexadecimal format
2648
2649 Examples:
2650 # List options common to all plugins
2651 bcftools plugin
2652
2653 # List available plugins
2654 bcftools plugin -l
2655
2656 # Run a plugin
2657 bcftools plugin counts in.vcf
2658
2659 # Run a plugin using the abbreviated "+" notation
2660 bcftools +counts in.vcf
2661
2662 # Run a plugin from an explicit location
2663 bcftools +/path/to/counts.so in.vcf
2664
2665 # The input VCF can be streamed just like in other commands
2666 cat in.vcf | bcftools +counts
2667
2668 # Print usage information of plugin "dosage"
2669 bcftools +dosage -h
2670
2671 # Replace missing genotypes with 0/0
2672 bcftools +missing2ref in.vcf
2673
2674 # Replace missing genotypes with 0|0
2675 bcftools +missing2ref in.vcf -- -p
2676
2677 Plugins troubleshooting:
2678 Things to check if your plugin does not show up in the bcftools plugin
2679 -l output:
2680
2681 • Run with the -v option for verbose output: bcftools plugin -lv
2682
2683 • Does the environment variable BCFTOOLS_PLUGINS include the correct
2684 path?
2685
2686 Plugins API:
2687 // Short description used by 'bcftools plugin -l'
2688 const char *about(void);
2689
2690 // Longer description used by 'bcftools +name -h'
2691 const char *usage(void);
2692
2693 // Called once at startup, allows initialization of local variables.
2694 // Return 1 to suppress normal VCF/BCF header output, -1 on critical
2695 // errors, 0 otherwise.
2696 int init(int argc, char **argv, bcf_hdr_t *in_hdr, bcf_hdr_t *out_hdr);
2697
2698 // Called for each VCF record, return NULL to suppress the output
2699 bcf1_t *process(bcf1_t *rec);
2700
2701 // Called after all lines have been processed to clean up
2702 void destroy(void);
2703
2704 bcftools polysomy [OPTIONS] file.vcf.gz
Detect the number of chromosomal copies in VCFs annotated with
Illumina’s B-allele frequency (BAF) values. Note that this command is
2707 not compiled in by default, see the section Optional Compilation with
2708 GSL in the INSTALL file for help.
2709
2710 General options:
2711 -o, --output-dir path
2712 output directory
2713
2714 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2715 see Common Options
2716
2717 -R, --regions-file file
2718 see Common Options
2719
2720 --regions-overlap 0|1|2
2721 see Common Options
2722
2723 -s, --sample string
2724 sample name
2725
2726 -t, --targets LIST
2727 see Common Options
2728
2729 -T, --targets-file FILE
2730 see Common Options
2731
2732 --targets-overlap 0|1|2
2733 see Common Options
2734
2735 -v, --verbose
2736 verbose debugging output which gives hints about the thresholds and
2737 decisions made by the program. Note that the exact output can
2738 change between versions.
2739
2740 Algorithm options:
2741 -b, --peak-size float
the minimum peak size considered a good match, given as a float from
the interval [0,1] where larger is stricter
2744
2745 -c, --cn-penalty float
2746 a penalty for increasing copy number state. How this works:
2747 multiple peaks are always a better fit than a single peak,
2748 therefore the program prefers a single peak (normal copy number)
2749 unless the absolute deviation of the multiple peaks fit is
2750 significantly smaller. Here the meaning of "significant" is given
2751 by the float from the interval [0,1] where larger is stricter.
2752
2753 -f, --fit-th float
2754 threshold for goodness of fit (normalized absolute deviation),
2755 smaller is stricter
2756
2757 -i, --include-aa
2758 include also the AA peak in CN2 and CN3 evaluation. This usually
2759 requires increasing -f.
2760
2761 -m, --min-fraction float
minimum distinguishable fraction of aberrant cells. Experience shows
that estimates of 20% and more are trustworthy.
2764
2765 -p, --peak-symmetry float
a heuristic to filter failed fits where the expected peak symmetry
2767 is violated. The float is from the interval [0,1] and larger is
2768 stricter
2769
2770 bcftools query [OPTIONS] file.vcf.gz [file.vcf.gz [...]]
2771 Extracts fields from VCF or BCF files and outputs them in user-defined
2772 format.
2773
2774 -e, --exclude EXPRESSION
2775 exclude sites for which EXPRESSION is true. For valid expressions
2776 see EXPRESSIONS.
2777
2778 --force-samples
2779 continue even when some samples requested via -s/-S do not exist
2780
2781 -f, --format FORMAT
2782 learn by example, see below
2783
2784 -H, --print-header
2785 print header
2786
2787 -i, --include EXPRESSION
2788 include only sites for which EXPRESSION is true. For valid
2789 expressions see EXPRESSIONS.
2790
2791 -l, --list-samples
2792 list sample names and exit
2793
2794 -o, --output FILE
2795 see Common Options
2796
2797 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
2798 see Common Options
2799
2800 -R, --regions-file file
2801 see Common Options
2802
2803 --regions-overlap 0|1|2
2804 see Common Options
2805
2806 -s, --samples LIST
2807 see Common Options
2808
2809 -S, --samples-file FILE
2810 see Common Options
2811
2812 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
2813 see Common Options
2814
2815 -T, --targets-file file
2816 see Common Options
2817
2818 --targets-overlap 0|1|2
2819 see Common Options
2820
2821 -u, --allow-undef-tags
2822 do not throw an error if there are undefined tags in the format
2823 string, print "." instead
2824
2825 -v, --vcf-list FILE
2826 process multiple VCFs listed in the file
2827
2828 Format:
%CHROM          The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER)
%END            End position of the REF allele
%END0           End position of the REF allele in 0-based coordinates
%FIRST_ALT      Alias for %ALT{0}
%FORMAT         Prints all FORMAT fields or a subset of samples with -s or -S
%GT             Genotype (e.g. 0/1)
%INFO           Prints the whole INFO column
%INFO/TAG       Any tag in the INFO column
%IUPACGT        Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A)
%LINE           Prints the whole line
%MASK           Indicates presence of the site in other files (with multiple files)
%N_PASS(expr)   Number of samples that pass the filtering expression (see EXPRESSIONS)
%POS0           POS in 0-based coordinates
%PBINOM(TAG)    Calculate phred-scaled binomial probability, the allele index is determined from GT
%SAMPLE         Sample name
%TAG{INT}       Curly brackets to print a subfield (e.g. INFO/TAG{1}, the indexes are 0-based)
%TBCSQ          Translated FORMAT/BCSQ. See the csq command above for explanation and examples.
%TGT            Translated genotype (e.g. C/A)
%TYPE           Variant type (REF, SNP, MNP, INDEL, BND, OTHER)
[]              Format fields must be enclosed in brackets to loop over all samples
\n              new line
\t              tab character
2851
2852 Everything else is printed verbatim.
2853
2854 Examples:
2855 # Print chromosome, position, ref allele and the first alternate allele
2856 bcftools query -f '%CHROM %POS %REF %ALT{0}\n' file.vcf.gz
2857
2858 # Similar to above, but use tabs instead of spaces, add sample name and genotype
2859 bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
2860
# Print FORMAT/GQ fields followed by FORMAT/GT fields
2862 bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf
2863
2864 # Make a BED file: chr, pos (0-based), end pos (1-based), id
2865 bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
2866
2867 # Print only samples with alternate (non-reference) genotypes
2868 bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i'GT="alt"' file.bcf
2869
2870 # Print all samples at sites with at least one alternate genotype
2871 bcftools view -i'GT="alt"' file.bcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
2872
2873 # Print phred-scaled binomial probability from FORMAT/AD tag for all heterozygous genotypes
2874 bcftools query -i'GT="het"' -f'[%CHROM:%POS %SAMPLE %GT %PBINOM(AD)\n]' file.vcf
2875
2876 # Print the second value of AC field if bigger than 10. Note the (unfortunate) difference in
# index subscript notation: formatting expressions (-f) use "{}" while filtering expressions
2878 # (-i) use "[]". This is for historic reasons and backward-compatibility.
2879 bcftools query -f '%AC{1}\n' -i 'AC[1]>10' file.vcf.gz
2880
2881 bcftools reheader [OPTIONS] file.vcf.gz
2882 Modify header of VCF/BCF files, change sample names.
2883
2884 -f, --fai FILE
2885 add to the header contig names and their lengths from the provided
2886 fasta index file (.fai). Lengths of existing contig lines will be
2887 updated and contig lines not present in the fai file will be
2888 removed
2889
2890 -h, --header FILE
2891 new VCF header
2892
2893 -o, --output FILE
2894 see Common Options
2895
2896 -s, --samples FILE
2897 new sample names, one name per line, in the same order as they
2898 appear in the VCF file. Alternatively, only samples which need to
2899 be renamed can be listed as "old_name new_name\n" pairs separated
2900 by whitespaces, each on a separate line. If a sample name contains
2901 spaces, the spaces can be escaped using the backslash character,
2902 for example "Not\ a\ good\ sample\ name".
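A short sketch of the "old_name new_name" form of the -s file, including a name with escaped spaces. The sample names and the files in.vcf.gz and renamed.vcf.gz are placeholders; the reheader call is guarded so that the snippet also runs where bcftools or the input file is missing.

```shell
#!/bin/sh
# Write a rename file with one "old_name new_name" pair per line.
# Spaces inside a sample name are escaped with a backslash.
tmp=$(mktemp -d)
printf '%s\n' \
    'NA00001 sampleA' \
    'Not\ a\ good\ sample\ name good_name' > "$tmp/rename.txt"
cat "$tmp/rename.txt"

# Apply the renaming only when bcftools and the placeholder input exist.
if command -v bcftools >/dev/null 2>&1 && [ -f in.vcf.gz ]; then
    bcftools reheader -s "$tmp/rename.txt" -o renamed.vcf.gz in.vcf.gz
fi
```

Samples not listed in the file keep their original names, so only the names that need changing have to appear.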
2903
2904 -T, --temp-prefix PATH
2905 template for temporary file names, used with -f
2906
2907 --threads INT
2908 see Common Options
2909
2910 bcftools roh [OPTIONS] file.vcf.gz
2911 A program for detecting runs of homo/autozygosity. Only bi-allelic
2912 sites are considered.
2913
2914 The HMM model:
2915 Notation:
2916 D = Data, AZ = autozygosity, HW = Hardy-Weinberg (non-autozygosity),
2917 f = non-ref allele frequency
2918
2919 Emission probabilities:
2920 oAZ = P_i(D|AZ) = (1-f)*P(D|RR) + f*P(D|AA)
2921 oHW = P_i(D|HW) = (1-f)^2 * P(D|RR) + f^2 * P(D|AA) + 2*f*(1-f)*P(D|RA)
2922
2923 Transition probabilities:
2924 tAZ = P(AZ|HW) .. from HW to AZ, the -a parameter
2925 tHW = P(HW|AZ) .. from AZ to HW, the -H parameter
2926
2927 ci = P_i(C) .. probability of cross-over at site i, from genetic map
2928 AZi = P_i(AZ) .. probability of site i being AZ/non-AZ, scaled so that AZi+HWi = 1
2929 HWi = P_i(HW)
2930
P_i(AZ) = oAZ * max[(1 - tAZ * ci) * AZ{i-1} , tAZ * ci * (1-AZ{i-1})]
P_i(HW) = oHW * max[(1 - tHW * ci) * (1-AZ{i-1}) , tHW * ci * AZ{i-1}]
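The recursion above can be followed by hand with a small awk calculation. This is only a worked example of the formulas as written, not the bcftools implementation: the function name hmm_step and all input numbers are invented, and the final rescaling step is an assumption based on the note that AZi+HWi = 1.

```shell
#!/bin/sh
# One update step of the two-state model, using the formulas as printed.
# Arguments: f P(D|RR) P(D|RA) P(D|AA) tAZ tHW ci AZ_prev
hmm_step() {
    awk -v f="$1" -v pRR="$2" -v pRA="$3" -v pAA="$4" \
        -v tAZ="$5" -v tHW="$6" -v ci="$7" -v az="$8" 'BEGIN {
        # Emission probabilities
        oAZ = (1-f)*pRR + f*pAA
        oHW = (1-f)^2*pRR + f^2*pAA + 2*f*(1-f)*pRA
        # Candidate paths into each state
        az_stay  = (1 - tAZ*ci) * az;      az_enter = tAZ*ci * (1-az)
        hw_stay  = (1 - tHW*ci) * (1-az);  hw_enter = tHW*ci * az
        pAZ = oAZ * (az_stay > az_enter ? az_stay : az_enter)
        pHW = oHW * (hw_stay > hw_enter ? hw_stay : hw_enter)
        s = pAZ + pHW                      # rescale so that AZ + HW = 1
        printf "oAZ=%.4f oHW=%.4f AZ=%.4f HW=%.4f\n", oAZ, oHW, pAZ/s, pHW/s
    }'
}

# A confidently heterozygous site (P(D|RA)=0.9) at allele frequency 0.2.
hmm_step 0.2 0.01 0.9 0.01 0.1 0.1 1 0.5
```

With these made-up inputs the heterozygous call pulls the site strongly towards the HW state, which is the intuition behind the model: heterozygous genotypes are unlikely inside an autozygous run.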
2933
2934 General Options:
2935 --AF-dflt FLOAT
2936 in case allele frequency is not known, use the FLOAT. By default,
2937 sites where allele frequency cannot be determined, or is 0, are
2938 skipped.
2939
2940 --AF-tag TAG
2941 use the specified INFO tag TAG as an allele frequency estimate
2942 instead of the default AC and AN tags. Sites which do not have TAG
2943 will be skipped.
2944
2945 --AF-file FILE
2946 Read allele frequencies from a tab-delimited file containing the
2947 columns: CHROM\tPOS\tREF,ALT\tAF. The file can be compressed with
2948 bgzip and indexed with tabix -s1 -b2 -e2. Sites which are not
2949 present in the FILE or have different reference or alternate allele
2950 will be skipped. Note that such a file can be easily created from a
2951 VCF using:
2952
2953 bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%INFO/TAG\n' file.vcf | bgzip -c > freqs.tab.gz
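The sketch below checks that a frequency table has the expected four columns before it is compressed, indexed and passed to --AF-file. The helper check_af_table, the example rows and the file name file.vcf.gz are assumptions made for illustration; the check also assumes a single ALT allele per row, in line with roh considering only bi-allelic sites.

```shell
#!/bin/sh
# Hypothetical sanity check: every row must have CHROM, POS, REF,ALT and an
# AF value between 0 and 1. Exits non-zero if any row is malformed.
check_af_table() {
    awk -F'\t' '
        NF != 4 || $3 !~ /^[^,]+,[^,]+$/ || $4 < 0 || $4 > 1 { bad = 1 }
        END { exit bad }' "$1"
}

tmp=$(mktemp -d)
printf '1\t10583\tG,A\t0.14\n1\t10611\tC,G\t0.02\n' > "$tmp/freqs.tab"
check_af_table "$tmp/freqs.tab" && echo "freqs.tab looks well-formed"

# Compress, index and use the table when the tools and input are available.
if command -v bcftools >/dev/null 2>&1 && command -v bgzip >/dev/null 2>&1 \
   && command -v tabix >/dev/null 2>&1 && [ -f file.vcf.gz ]; then
    bgzip -c "$tmp/freqs.tab" > "$tmp/freqs.tab.gz"
    tabix -s1 -b2 -e2 "$tmp/freqs.tab.gz"
    bcftools roh --AF-file "$tmp/freqs.tab.gz" file.vcf.gz
fi
```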
2954
2955 -b, --buffer-size INT[,INT]
2956 when the entire many-sample file cannot fit into memory, a sliding
2957 buffer approach can be used. The first value is the number of sites
2958 to keep in memory. If negative, it is interpreted as the maximum
2959 memory to use, in MB. The second, optional, value sets the number
2960 of overlapping sites. The default overlap is set to roughly 1% of
2961 the buffer size.
2962
-e, --estimate-AF [TAG],FILE
2964 estimate the allele frequency by recalculating INFO/AC and INFO/AN
2965 on the fly, using the specified TAG which can be either FORMAT/GT
2966 ("GT") or FORMAT/PL ("PL"). If TAG is not given, "GT" is assumed.
2967 Either all samples ("-") or samples listed in FILE will be
2968 included. For example, use "PL,-" to estimate AF from FORMAT/PL of
2969 all samples. If neither -e nor the other --AF-... options are
2970 given, the allele frequency is estimated from AC and AN counts
2971 which are already present in the INFO field.
2972
2973 --exclude EXPRESSION
2974 exclude sites for which EXPRESSION is true. For valid expressions
2975 see EXPRESSIONS.
2976
2977 -G, --GTs-only FLOAT
2978 use genotypes (FORMAT/GT fields) ignoring genotype likelihoods
(FORMAT/PL), setting PL of unseen genotypes to FLOAT. A safe value to
use is 30 to account for GT errors.
2981
2982 --include EXPRESSION
2983 include only sites for which EXPRESSION is true. For valid
2984 expressions see EXPRESSIONS.
2985
2986 -I, --skip-indels
2987 skip indels as their genotypes are usually enriched for errors
2988
2989 -m, --genetic-map FILE
2990 genetic map in the format required also by IMPUTE2. Only the first
2991 and third column are used (position and Genetic_Map(cM)). The FILE
2992 can be a single file or a file mask, where string "{CHROM}" is
2993 replaced with chromosome name.
2994
2995 -M, --rec-rate FLOAT
2996 constant recombination rate per bp. In combination with
2997 --genetic-map, the --rec-rate parameter is interpreted differently,
2998 as FLOAT-fold increase of transition probabilities, which allows
2999 the model to become more sensitive yet still account for
3000 recombination hotspots. Note that also the range of the values is
3001 therefore different in both cases: normally the parameter will be
3002 in the range (1e-3,1e-9) but with --genetic-map it will be in the
3003 range (10,1000).
3004
3005 -o, --output FILE
3006 Write output to the FILE, by default the output is printed on
3007 stdout
3008
3009 -O, --output-type s|r[z]
3010 Generate per-site output (s) or per-region output (r). By default
3011 both types are printed and the output is uncompressed. Add z for a
3012 compressed output.
3013
3014 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
3015 see Common Options
3016
3017 -R, --regions-file file
3018 see Common Options
3019
3020 --regions-overlap 0|1|2
3021 see Common Options
3022
3023 -s, --samples LIST
3024 see Common Options
3025
3026 -S, --samples-file FILE
3027 see Common Options
3028
3029 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
3030 see Common Options
3031
3032 -T, --targets-file file
3033 see Common Options
3034
3035 --targets-overlap 0|1|2
3036 see Common Options
3037
3038 HMM Options:
3039 -a, --hw-to-az FLOAT
P(AZ|HW) transition probability from HW (Hardy-Weinberg) to AZ
(autozygous) state

-H, --az-to-hw FLOAT
P(HW|AZ) transition probability from AZ to HW state
3045
3046 -V, --viterbi-training FLOAT
3047 estimate HMM parameters using Baum-Welch algorithm, using the
3048 convergence threshold FLOAT, e.g. 1e-10 (experimental)
3049
3050 bcftools sort [OPTIONS] file.bcf
3051 -m, --max-mem FLOAT[kMG]
3052 Maximum memory to use. Approximate, affects the number of temporary
3053 files written to the disk. Note that if the command fails at this
3054 step because of too many open files, your system limit on the
3055 number of open files ("ulimit") may need to be increased.
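A small sketch of checking the open-file limit before sorting with a modest memory cap. The file names in.bcf and sorted.bcf and the 1G value are placeholders; the sort itself is guarded so the snippet runs even without bcftools.

```shell
#!/bin/sh
# Show the current per-process limit on open files. A small --max-mem
# value means more temporary files, so a low limit is more likely to bite.
limit=$(ulimit -n)
echo "open-file limit: $limit"

# Placeholder file names; run only when bcftools and the input exist.
if command -v bcftools >/dev/null 2>&1 && [ -f in.bcf ]; then
    bcftools sort -m 1G -Ob -o sorted.bcf in.bcf
fi
```

If the sort fails with a "too many open files" error, either raising the limit in the current shell or giving -m a larger value are the two obvious things to try.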
3056
3057 -o, --output FILE
3058 see Common Options
3059
3060 -O, --output-type b|u|z|v[0-9]
3061 see Common Options
3062
3063 -T, --temp-dir DIR
3064 Use this directory to store temporary files
3065
3066 bcftools stats [OPTIONS] A.vcf.gz [B.vcf.gz]
Parses VCF or BCF and produces text-format stats which are suitable for
machine processing and can be plotted using plot-vcfstats. When two
files are given, the program generates separate stats for intersection
and the complements. By default only sites are compared, -s/-S must be
given to include also sample columns. When one VCF file is specified on
3072 the command line, then stats by non-reference allele frequency, depth
3073 distribution, stats by quality and per-sample counts, singleton stats,
3074 etc. are printed. When two VCF files are given, then stats such as
3075 concordance (Genotype concordance by non-reference allele frequency,
3076 Genotype concordance by sample, Non-Reference Discordance) and
3077 correlation are also printed. Per-site discordance (PSD) is also
3078 printed in --verbose mode.
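Because the output is meant for machine processing, individual sections can be pulled out with standard text tools. The sketch below assumes the summary section consists of tab-separated lines of the form "SN, id, key, value"; this layout is not spelled out in the text above and should be checked against the header comments of a real stats file. The helper summary_numbers and the sample lines are invented.

```shell
#!/bin/sh
# Hypothetical helper: print "key<TAB>value" for every SN (summary
# numbers) line, reading from the given files or from standard input.
summary_numbers() {
    awk -F'\t' '$1 == "SN" { print $3 "\t" $4 }' "$@"
}

# Made-up lines standing in for real bcftools stats output.
printf 'SN\t0\tnumber of samples:\t2\nSN\t0\tnumber of SNPs:\t15\nTSTV\t0\t10\t5\t2.00\n' |
    summary_numbers
```

With real data the same helper could be applied to a saved file, for example after bcftools stats -s - A.vcf.gz B.vcf.gz > AB.stats, and the file could then be passed to plot-vcfstats for plotting.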
3079
3080 --af-bins LIST|FILE
3081 comma separated list of allele frequency bins (e.g. 0.1,0.5,1) or a
3082 file listing the allele frequency bins one per line (e.g.
3083 0.1\n0.5\n1)
3084
3085 --af-tag TAG
3086 allele frequency INFO tag to use for binning. By default the allele
3087 frequency is estimated from AC/AN, if available, or directly from
3088 the genotypes (GT) if not.
3089
3090 -1, --1st-allele-only
3091 consider only the 1st alternate allele at multiallelic sites
3092
3093 -c, --collapse snps|indels|both|all|some|none
3094 see Common Options
3095
3096 -d, --depth INT,INT,INT
3097 ranges of depth distribution: min, max, and size of the bin
3098
3099 --debug
3100 produce verbose per-site and per-sample output
3101
3102 -e, --exclude EXPRESSION
3103 exclude sites for which EXPRESSION is true. For valid expressions
3104 see EXPRESSIONS.
3105
3106 -E, --exons file.gz
3107 tab-delimited file with exons for indel frameshifts statistics. The
3108 columns of the file are CHR, FROM, TO, with 1-based, inclusive,
3109 positions. The file is BGZF-compressed and indexed with tabix
3110
3111 tabix -s1 -b2 -e3 file.gz
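One way to prepare such a file, sketched under the assumption that the exons start out as BED intervals (0-based start, end-exclusive), which the text above does not require. The helper bed_to_exons and the coordinates are made up; the conversion adds 1 to the start so that both positions become 1-based and inclusive.

```shell
#!/bin/sh
# Hypothetical helper: convert BED intervals to the CHR, FROM, TO layout
# with 1-based, inclusive positions.
bed_to_exons() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3 }' "$@"
}

tmp=$(mktemp -d)
printf '1\t999\t1200\n1\t1499\t1600\n' | bed_to_exons > "$tmp/exons.tab"
cat "$tmp/exons.tab"

# Compress and index when the htslib tools are installed.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
    bgzip -c "$tmp/exons.tab" > "$tmp/exons.tab.gz"
    tabix -s1 -b2 -e3 "$tmp/exons.tab.gz"
fi
```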
3112
3113 -f, --apply-filters LIST
3114 see Common Options
3115
3116 -F, --fasta-ref ref.fa
3117 faidx indexed reference sequence file to determine INDEL context
3118
3119 -i, --include EXPRESSION
3120 include only sites for which EXPRESSION is true. For valid
3121 expressions see EXPRESSIONS.
3122
3123 -I, --split-by-ID
3124 collect stats separately for sites which have the ID column set
3125 ("known sites") or which do not have the ID column set ("novel
3126 sites").
3127
3128 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
3129 see Common Options
3130
3131 -R, --regions-file file
3132 see Common Options
3133
3134 --regions-overlap 0|1|2
3135 see Common Options
3136
3137 -s, --samples LIST
3138 see Common Options
3139
3140 -S, --samples-file FILE
3141 see Common Options
3142
3143 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
3144 see Common Options
3145
3146 -T, --targets-file file
3147 see Common Options
3148
3149 --targets-overlap 0|1|2
3150 see Common Options
3151
3152 -u, --user-tstv <TAG[:min:max:n]>
3153 collect Ts/Tv stats for any tag using the given binning [0:1:100]
3154
3155 -v, --verbose
3156 produce verbose per-site and per-sample output
3157
3158 bcftools view [OPTIONS] file.vcf.gz [REGION [...]]
3159 View, subset and filter VCF or BCF files by position and filtering
3160 expression. Convert between VCF and BCF. Former bcftools subset.
3161
3162 Output options
3163 -G, --drop-genotypes
3164 drop individual genotype information (after subsetting if -s option
3165 is set)
3166
3167 -h, --header-only
3168 output the VCF header only (see also bcftools head)
3169
3170 -H, --no-header
3171 suppress the header in VCF output
3172
3173 --with-header
3174 output both VCF header and records (this is the default, but the
3175 option is useful for explicitness or to reset the effects of -h or
3176 -H)
3177
3178 -l, --compression-level [0-9]
3179 compression level. 0 stands for uncompressed, 1 for best speed and
3180 9 for best compression.
3181
3182 --no-version
3183 see Common Options
3184
3185 -O, --output-type b|u|z|v[0-9]
3186 see Common Options
3187
-o, --output FILE
output file name. If not present, the default is to print to standard
output (stdout).
3190
3191 -r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
3192 see Common Options
3193
3194 -R, --regions-file file
3195 see Common Options
3196
3197 --regions-overlap 0|1|2
3198 see Common Options
3199
3200 -t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
3201 see Common Options
3202
3203 -T, --targets-file file
3204 see Common Options
3205
3206 --targets-overlap 0|1|2
3207 see Common Options
3208
3209 --threads INT
3210 see Common Options
3211
3212 Subset options:
3213 -a, --trim-alt-alleles
3214 remove alleles not seen in the genotype fields from the ALT column.
3215 Note that if no alternate allele remains after trimming, the record
3216 itself is not removed but ALT is set to ".". If the option -s or -S
3217 is given, removes alleles not seen in the subset. INFO and FORMAT
3218 tags declared as Type=A, G or R will be trimmed as well.
3219
3220 --force-samples
3221 only warn about unknown subset samples
3222
3223 -I, --no-update
3224 do not (re)calculate INFO fields for the subset (currently INFO/AC
3225 and INFO/AN)
3226
3227 -s, --samples LIST
3228 see Common Options. Note that it is possible to create multiple
3229 subsets simultaneously using the split plugin.
3230
3231 -S, --samples-file FILE
3232 see Common Options. Note that it is possible to create multiple
3233 subsets simultaneously using the split plugin.
3234
3235 Filter options:
3236 Note that filter options below dealing with counting the number of
3237 alleles will, for speed, first check for the values of AC and AN in the
3238 INFO column to avoid parsing all the genotype (FORMAT/GT) fields in the
3239 VCF. This means that a filter like --min-af 0.1 will be calculated from
3240 INFO/AC and INFO/AN when available or FORMAT/GT otherwise. However, it
3241 will not attempt to use any other existing field, like INFO/AF for
example. For that, use --exclude 'AF<0.1' instead.
3243
3244 Also note that one must be careful when sample subsetting and filtering
3245 is performed in a single command because the order of internal
3246 operations can influence the result. For example, the -i/-e filtering
3247 is performed before sample removal, but the -P filtering is performed
3248 after, and some are inherently ambiguous, for example allele counts can
3249 be taken from the INFO column when present but calculated on the fly
3250 when absent. Therefore it is strongly recommended to spell out the
3251 required order explicitly by separating such commands into two steps.
3252 (Make sure to use the -O u option when piping!)
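Why the order matters can be seen in a toy model. The awk code below is not bcftools; it imitates the two orders of operation on two invented sites, where each row holds POS, the INFO/AC value counted over all samples, and the genotypes of two hypothetical samples A and B.

```shell
#!/bin/sh
# Toy input: POS, INFO/AC over all samples, GT of sample A, GT of sample B.
sites='100 1 0/0 0/1
200 2 0/1 0/1'

# Order 1: filter on the original AC>0, then keep sample A.
filter_then_subset() {
    printf '%s\n' "$sites" | awk '$2 > 0 { print $1, $3 }'
}

# Order 2: keep sample A, recount AC from its genotype, then filter AC>0.
subset_then_filter() {
    printf '%s\n' "$sites" | awk '{
        ac = gsub(/1/, "1", $3)      # number of ALT alleles in sample A
        if (ac > 0) print $1, $3
    }'
}

echo "filter, then subset:"; filter_then_subset
echo "subset, then filter:"; subset_then_filter

# Real-world counterparts (file and sample names are placeholders):
#   bcftools view -s A -i 'AC>0' in.bcf
#   bcftools view -s A -Ou in.bcf | bcftools view -i 'AC>0'
```

In the first order site 100 survives even though sample A is homozygous reference there; in the second it is dropped because the allele count is taken from the subset.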
3253
-c, --min-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
3255 minimum allele count (INFO/AC) of sites to be printed. Specifying
3256 the type of allele is optional and can be set to non-reference
3257 (nref, the default), 1st alternate (alt1), the least frequent
3258 (minor), the most frequent (major) or sum of all but the most
3259 frequent (nonmajor) alleles.
3260
-C, --max-ac INT[:nref|:alt1|:minor|:major|:nonmajor]
3262 maximum allele count (INFO/AC) of sites to be printed. Specifying
3263 the type of allele is optional and can be set to non-reference
3264 (nref, the default), 1st alternate (alt1), the least frequent
3265 (minor), the most frequent (major) or sum of all but the most
3266 frequent (nonmajor) alleles.
3267
3268 -e, --exclude EXPRESSION
3269 exclude sites for which EXPRESSION is true. For valid expressions
3270 see EXPRESSIONS.
3271
3272 -f, --apply-filters LIST
3273 see Common Options
3274
3275 -g, --genotype [^][hom|het|miss]
3276 include only sites with one or more homozygous (hom), heterozygous
3277 (het) or missing (miss) genotypes. When prefixed with ^, the logic
3278 is reversed; thus ^het excludes sites with heterozygous genotypes.
3279
3280 -i, --include EXPRESSION
3281 include sites for which EXPRESSION is true. For valid expressions
3282 see EXPRESSIONS.
3283
3284 -k, --known
3285 print known sites only (ID column is not ".")
3286
3287 -m, --min-alleles INT
3288 print sites with at least INT alleles listed in REF and ALT columns
3289
3290 -M, --max-alleles INT
3291 print sites with at most INT alleles listed in REF and ALT columns.
3292 Use -m2 -M2 -v snps to only view biallelic SNPs.
3293
3294 -n, --novel
3295 print novel sites only (ID column is ".")
3296
3297 -p, --phased
3298 print sites where all samples are phased. Haploid genotypes are
considered phased. Missing genotypes are considered unphased unless
the phased bit is set.
3301
3302 -P, --exclude-phased
3303 exclude sites where all samples are phased
3304
3305 -q, --min-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
3306 minimum allele frequency (INFO/AC / INFO/AN) of sites to be
3307 printed. Specifying the type of allele is optional and can be set
3308 to non-reference (nref, the default), 1st alternate (alt1), the
3309 least frequent (minor), the most frequent (major) or sum of all but
3310 the most frequent (nonmajor) alleles.
3311
3312 -Q, --max-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
3313 maximum allele frequency (INFO/AC / INFO/AN) of sites to be
3314 printed. Specifying the type of allele is optional and can be set
3315 to non-reference (nref, the default), 1st alternate (alt1), the
3316 least frequent (minor), the most frequent (major) or sum of all but
3317 the most frequent (nonmajor) alleles.
3318
3319 -u, --uncalled
3320 print sites without a called genotype
3321
3322 -U, --exclude-uncalled
3323 exclude sites without a called genotype
3324
3325 -v, --types snps|indels|mnps|other
3326 comma-separated list of variant types to select. Site is selected
3327 if any of the ALT alleles is of the type requested. Types are
3328 determined by comparing the REF and ALT alleles in the VCF record
3329 not INFO tags like INFO/INDEL or INFO/VT. Use --include to select
3330 based on INFO tags.
3331
3332 -V, --exclude-types snps|indels|mnps|ref|bnd|other
3333 comma-separated list of variant types to exclude. Site is excluded
3334 if any of the ALT alleles is of the type requested. Types are
3335 determined by comparing the REF and ALT alleles in the VCF record
3336 not INFO tags like INFO/INDEL or INFO/VT. Use --exclude to exclude
3337 based on INFO tags.
3338
3339 -x, --private
print sites where only the subset samples carry a non-reference
3341 allele. Requires --samples or --samples-file.
3342
3343 -X, --exclude-private
exclude sites where only the subset samples carry a non-reference
3345 allele
3346
3347 bcftools help [COMMAND] | bcftools --help [COMMAND]
3348 Display a brief usage message listing the bcftools commands
3349 available. If the name of a command is also given, e.g., bcftools help
3350 view, the detailed usage message for that particular command is
3351 displayed.
3352
3353 bcftools [--version|-v]
3354 Display the version numbers and copyright information for bcftools and
3355 the important libraries used by bcftools.
3356
3357 bcftools [--version-only]
3358 Display the full bcftools version number in a machine-readable format.
3359
EXPRESSIONS
3361 These filtering expressions are accepted by most of the commands.
3362
3363 Valid expressions may contain:
3364
3365 • numerical constants, string constants, file names (this is
3366 currently supported only to filter by the ID column)
3367
3368 1, 1.0, 1e-4
3369 "String"
3370 @file_name
3371
3372 • arithmetic operators
3373
3374 +,*,-,/
3375
3376 • comparison operators
3377
3378 == (same as =), >, >=, <=, <, !=
3379
• regex operators "~" and its negation "!~". The expressions are
3381 case sensitive unless "/i" is added.
3382
3383 INFO/HAYSTACK ~ "needle"
3384 INFO/HAYSTACK ~ "NEEDless/i"
3385
3386 • parentheses
3387
3388 (, )
3389
3390 • logical operators. See also the examples below and the filtering
3391 tutorial <http://samtools.github.io/bcftools/howtos/filtering.html>
3392 about the distinction between "&&" vs "&" and "||" vs "|".
3393
3394 &&, &, ||, |
3395
3396 • INFO tags, FORMAT tags, column names
3397
3398 INFO/DP or DP
3399 FORMAT/DV, FMT/DV, or DV
3400 FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]
3401
3402 • starting with 1.11, the FILTER column can be queried as follows:
3403
3404 FILTER="PASS"
3405 FILTER="A" .. exact match, for example "A;B" does not pass
3406 FILTER!="A" .. exact match, for example "A;B" does pass
3407 FILTER~"A" .. both "A" and "A;B" pass
3408 FILTER!~"A" .. neither "A" nor "A;B" pass
3409
3410 • 1 (or 0) to test the presence (or absence) of a flag
3411
3412 FlagA=1 && FlagB=0
3413
3414 • "." to test missing values
3415
3416 DP=".", DP!=".", ALT="."
3417
3418 • missing genotypes can be matched regardless of phase and ploidy
3419 (".|.", "./.", ".", "0|.") using these expressions
3420
3421 GT="mis", GT~"\.", GT!~"\."
3422
3423 • missing genotypes can be matched including the phase and ploidy
3424 (".|.", "./.", ".") using these expressions
3425
3426 GT=".|.", GT="./.", GT="."
3427
3428 • sample genotype: reference (haploid or diploid), alternate (hom or
3429 het, haploid or diploid), missing genotype, homozygous,
3430 heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het,
3431 alt-alt het, haploid ref, haploid alt (case-insensitive)
3432
3433 GT="ref"
3434 GT="alt"
3435 GT="mis"
3436 GT="hom"
3437 GT="het"
3438 GT="hap"
3439 GT="RR"
3440 GT="AA"
3441 GT="RA" or GT="AR"
3442 GT="Aa" or GT="aA"
3443 GT="R"
3444 GT="A"
3445
3446 • TYPE for variant type in REF,ALT columns
3447 (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "~"
3448 to require at least one allele of the given type or the equal sign
3449 "=" to require that all alleles are of the given type. Compare
3450
3451 TYPE="snp"
3452 TYPE~"snp"
3453 TYPE!="snp"
3454 TYPE!~"snp"
3455
3456 • array subscripts (0-based), "*" for any element, "-" to indicate a
3457 range. Note that for querying FORMAT vectors, the colon ":" can be
3458 used to select a sample and an element of the vector, as shown in
3459 the examples below
3460
3461 INFO/AF[0] > 0.3 .. first AF value bigger than 0.3
3462 FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30
3463 FORMAT/AD[0:1] .. first sample, second AD value
3464 FORMAT/AD[1:0] .. second sample, first AD value
3465 DP4[*] == 0 .. any DP4 value
3466 FORMAT/DP[0] > 30 .. DP of the first sample bigger than 30
3467 FORMAT/DP[1-3] > 10 .. samples 2-4
3468 FORMAT/DP[1-] < 7 .. all samples but the first
3469 FORMAT/DP[0,2-4] > 20 .. samples 1, 3-5
3470 FORMAT/AD[0:1] .. first sample, second AD field
3471 FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
3472 FORMAT/AD[*:1] or AD[:1] .. any sample, second AD field
3473 (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
3474 CSQ[*] ~ "missense_variant.*deleterious"
3475
3476 • with many samples it can be more practical to provide a file with
3477 sample names, one sample name per line
3478
3479 GT[@samples.txt]="het" & binom(AD)<0.01
3480
3481 • function on FORMAT tags (over samples) and INFO tags (over vector
3482 fields): maximum; minimum; arithmetic mean (AVG is synonymous with
3483 MEAN); median; standard deviation from mean; sum; string length;
3484 absolute value; number of elements:
3485
3486 MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT
3487
3488 Note that functions above evaluate to a single value across all
3489 samples and are intended to select sites, not samples, even when
3490 applied on FORMAT tags. However, when prefixed with SMPL_ (or "s"
3491 for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector
3492 of per-sample values when applied on FORMAT tags:
3493
3494 SMPL_MAX, SMPL_MIN, SMPL_AVG, SMPL_MEAN, SMPL_MEDIAN, SMPL_STDEV, SMPL_SUM,
3495 sMAX, sMIN, sAVG, sMEAN, sMEDIAN, sSTDEV, sSUM
3496
3497 • two-tailed binomial test. Note that for N=0 the test evaluates to a
3498 missing value and when FORMAT/GT is used to determine the vector
3499 indices, it evaluates to 1 for homozygous genotypes.
3500
3501 binom(FMT/AD) .. GT can be used to determine the correct index
3502 binom(AD[0],AD[1]) .. or the fields can be given explicitly
3503 phred(binom()) .. the same as binom but phred-scaled
3504
3505 • variables calculated on the fly if not present: number of alternate
3506 alleles; number of samples; count of alternate alleles; minor
3507 allele count (similar to AC, but the derived MAF is never larger than 0.5);
3508 frequency of alternate alleles (AF=AC/AN); frequency of minor
3509 alleles (MAF=MAC/AN); number of alleles in called genotypes; number
3510 of samples with missing genotype; fraction of samples with missing
3511 genotype; indel length (deletions negative, insertions positive)
3512
3513 N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
3514
3515 • the number (N_PASS) or fraction (F_PASS) of samples which pass the
3516 expression
3517
3518 N_PASS(GQ>90 & GT!="mis") > 90
3519 F_PASS(GQ>90 & GT!="mis") > 0.9
3520
3521 • custom perl filtering. Note that this command is not compiled in by
3522 default, see the section Optional Compilation with Perl in the
3523 INSTALL file for help and misc/demo-flt.pl for a working example.
3524 The demo defines the perl subroutine "severity" which can be
3525 invoked from the command line as follows:
3526
3527 perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
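
The on-the-fly variables listed above (AN, AC, MAC, AF, MAF, N_MISSING, F_MISSING) can be illustrated with a small Python sketch. It is not the bcftools code: site_variables is a hypothetical helper, it handles a single ALT allele index, it assumes MAC = min(AC, AN-AC), and it counts a genotype as missing when any of its alleles is ".", mirroring the GT="mis" examples above.

```python
import re


def site_variables(genotypes, alt_index=1):
    """Derive site-level counts from GT strings such as "0/1" or "1|1"."""
    called = []        # allele indices of all non-missing alleles
    n_missing = 0      # samples with at least one missing allele (assumption)
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)
        if "." in alleles:
            n_missing += 1
        called.extend(int(a) for a in alleles if a != ".")

    an = len(called)                                   # alleles in called genotypes
    ac = sum(1 for a in called if a == alt_index)      # count of the ALT allele
    mac = min(ac, an - ac)                             # assumed definition of MAC
    n_samples = len(genotypes)
    return {
        "N_SAMPLES": n_samples,
        "AN": an,
        "AC": ac,
        "MAC": mac,
        "AF": ac / an if an else None,                 # AF = AC/AN
        "MAF": mac / an if an else None,               # MAF = MAC/AN
        "N_MISSING": n_missing,
        "F_MISSING": n_missing / n_samples if n_samples else None,
    }
```

For the genotypes "0/1", "1|1", "1/1" and "./." the sketch gives AN=6, AC=5 and MAC=1, so AF is 5/6 while MAF is 1/6.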
3528
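The two-tailed binomial test mentioned in the list above can be sketched as follows. This is a textbook exact test under the assumption of an expected allele balance of 0.5, not necessarily the computation bcftools performs; the function names are hypothetical. It follows the two documented special cases: a missing value (None here) for N=0, and 1 for homozygous genotypes when GT selects the indices.

```python
import math


def binom_two_tailed(a, b):
    """Exact two-tailed binomial p-value for counts a and b, assuming p=0.5."""
    n = a + b
    if n == 0:
        return None                       # documented: N=0 gives a missing value
    k = min(a, b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)             # symmetric null, so double one tail


def binom_from_gt(ad, gt):
    """Pick the AD elements using the genotype allele indices, e.g. gt=(0, 1)."""
    i, j = gt
    if i == j:
        return 1.0                        # documented: homozygous evaluates to 1
    return binom_two_tailed(ad[i], ad[j])


def phred(p):
    """Phred-scale a probability; missing stays missing."""
    if p is None:
        return None
    if p >= 1:
        return 0.0
    return math.inf if p == 0 else -10 * math.log10(p)
```

With these definitions, a heterozygous call with AD=(0,10) gives a p-value of 2/1024, while a balanced AD=(5,5) gives 1.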
3529 Notes:
3530
3531 • String comparisons and regular expressions are case-insensitive
3532
3533 • Comma in strings is interpreted as a separator and when multiple
3534 values are compared, the OR logic is used. Consequently, the
3535 following two expressions are equivalent but not the third:
3536
3537 -i 'TAG="hello,world"'
3538 -i 'TAG="hello" || TAG="world"'
3539 -i 'TAG="hello" && TAG="world"'
3540
3541 • Variables and function names are case-insensitive, but not tag
3542 names. For example, "qual" can be used instead of "QUAL",
3543 "strlen()" instead of "STRLEN()", but not "dp" instead of "DP".
3544
3545 • When querying multiple values, all elements are tested and the OR
3546 logic is used on the result. For example, when querying
3547 "TAG=1,2,3,4", it will be evaluated as follows:
3548
3549 -i 'TAG[*]=1' .. true, the record will be printed
3550 -i 'TAG[*]!=1' .. true
3551 -e 'TAG[*]=1' .. false, the record will be discarded
3552 -e 'TAG[*]!=1' .. false
3553 -i 'TAG[0]=1' .. true
3554 -i 'TAG[0]!=1' .. false
3555 -e 'TAG[0]=1' .. false
3556 -e 'TAG[0]!=1' .. true
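
The table above can be reproduced with a minimal Python model of the OR logic. The helpers evaluate and printed are hypothetical names used only for this illustration; index=None stands for the "[*]" subscript.

```python
import operator


def evaluate(values, op, rhs, index=None):
    """Test the selected elements and combine the results with OR."""
    selected = values if index is None else [values[index]]
    return any(op(v, rhs) for v in selected)


def printed(values, op, rhs, index=None, exclude=False):
    """-i prints the record when the expression is true, -e when it is false."""
    result = evaluate(values, op, rhs, index)
    return not result if exclude else result


tag = [1, 2, 3, 4]   # TAG=1,2,3,4

# -i 'TAG[*]=1' and -i 'TAG[*]!=1' are both true: some element equals 1
# and some element differs from 1.
both_true = (printed(tag, operator.eq, 1), printed(tag, operator.ne, 1))
```

The sketch shows why "=" and "!=" are not complementary on multi-valued tags: with "[*]" both can be true for the same record, so -e discards the record in both cases.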
3557
3558 Examples:
3559
3560 MIN(DV)>5 .. selects the whole site, evaluates min across all values and samples
3561
3562 SMPL_MIN(DV)>5 .. selects matching samples, evaluates within samples
3563
3564 MIN(DV/DP)>0.3
3565
3566 MIN(DP)>10 & MIN(DV)>3
3567
3568 FMT/DP>10 & FMT/GQ>10 .. both conditions must be satisfied within one sample
3569
3570 FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
3571
3572 QUAL>10 | FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
3573
3574 QUAL>10 || FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
3575
3576 TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
3577
3578 COUNT(GT="hom")=0 .. no homozygous genotypes at the site
3579
3580 AVG(GQ)>50 .. average (arithmetic mean) of genotype qualities bigger than 50
3581
3582 ID=@file .. selects lines with ID present in the file
3583
3584 ID!=@~/file .. skip lines with ID present in the ~/file
3585
3586 MAF[0]<0.05 .. select rare variants at 5% cutoff
3587
3588 POS>=100 .. restrict your range query, e.g. 20:100-200 to strictly sites with POS in that range.
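
The difference between "&" and "&&" in the FMT/DP and FMT/GQ examples above can be modelled in Python as follows. This is only a sketch of the documented semantics with hypothetical function names: "&" is evaluated within each sample, whereas "&&" combines two site-level results.

```python
def single_amp(dp, gq):
    """FMT/DP>10 & FMT/GQ>10: per-sample result, both conditions in one sample."""
    return [d > 10 and g > 10 for d, g in zip(dp, gq)]


def double_amp(dp, gq):
    """FMT/DP>10 && FMT/GQ>10: each condition may be met by a different sample."""
    return any(d > 10 for d in dp) and any(g > 10 for g in gq)


# Two samples: the first has high depth only, the second high quality only.
dp = [20, 5]
gq = [5, 20]

site_passes_single = any(single_amp(dp, gq))   # no sample meets both conditions
site_passes_double = double_amp(dp, gq)        # conditions met in different samples
```

Here the site fails the "&" expression but passes the "&&" expression.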
3589
3590 Shell expansion:
3591
3592 Note that expressions must often be quoted because some characters have
3593 special meaning in the shell. In the following example, enclosing the
3594 expression in single quotes ensures that the whole expression is passed
3595 to the program as intended:
3596
3597 bcftools view -i '%ID!="." & MAF[0]<0.01'
3598
3599 Please refer to the documentation of your shell for details.
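
When the command is built from a script rather than typed by hand, the quoting can be delegated to a library. The Python sketch below uses the standard shlex module; the input file name in.vcf.gz is a placeholder and the command is only constructed, not executed.

```python
import shlex

# The filtering expression from the example above, as a plain string.
expr = '%ID!="." & MAF[0]<0.01'

# As an argument list no shell is involved, so no quoting is needed
# (this is the form accepted by subprocess.run without shell=True).
argv = ["bcftools", "view", "-i", expr, "in.vcf.gz"]

# For display or for pasting into a POSIX shell, shlex.join quotes each
# argument that contains characters special to the shell.
command_line = shlex.join(argv)
print(command_line)
```

Splitting command_line again with shlex.split returns the original argument list, which is a quick way to check that the expression survives quoting unchanged.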
3600
3601 SCRIPTS AND OPTIONS
3602 plot-vcfstats [OPTIONS] file.vchk [...]
3603 Script for processing output of bcftools stats. It can merge results
3604 from multiple outputs (useful when running the stats for each
3605 chromosome separately), plot graphs and create a PDF presentation.
3606
3607 -m, --merge
3608 Merge vcfstats files to STDOUT, skip plotting.
3609
3610 -p, --prefix DIR
3611 The output directory. This directory will be created if it does not
3612 exist.
3613
3614 -P, --no-PDF
3615 Skip the PDF creation step.
3616
3617 -r, --rasterize
3618 Rasterize PDF images for faster rendering. This is the default and
3619 the opposite of -v, --vectors.
3620
3621 -s, --sample-names
3622 Use sample names for xticks rather than numeric IDs.
3623
3624 -t, --title STRING
3625 Identify files by these titles in plots. The option can be given
3626 multiple times, for each ID in the bcftools stats output. If not
3627 present, the script will use abbreviated source file names for the
3628 titles.
3629
3630 -v, --vectors
3631 Generate vector graphics for PDF images, the opposite of -r,
3632 --rasterize.
3633
3634 -T, --main-title STRING
3635 Main title for the PDF.
3636
3637 Example:
3638
3639 # Generate the stats
3640 bcftools stats -s - file.vcf.gz > file.vchk
3641
3642 # Plot the stats
3643 plot-vcfstats -p outdir file.vchk
3644
3645 # The final looks can be customized by editing the generated
3646 # 'outdir/plot.py' script and re-running manually
3647 cd outdir && python plot.py && pdflatex summary.tex
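
The per-chromosome workflow mentioned in the description of plot-vcfstats can be sketched as a small Python helper that only assembles the command lines. The file names and chromosome names are placeholders, the helper names are hypothetical, and the use of the bcftools stats -r option to restrict each run to one chromosome is an assumption of this sketch.

```python
def stats_commands(vcf, chroms):
    """One 'bcftools stats' run per chromosome, paired with its output file."""
    return [
        (["bcftools", "stats", "-s", "-", "-r", chrom, vcf], f"{chrom}.vchk")
        for chrom in chroms
    ]


def merge_command(vchk_files):
    """Merge the per-chromosome outputs to STDOUT with -m, skipping the plots."""
    return ["plot-vcfstats", "-m", *vchk_files]


jobs = stats_commands("file.vcf.gz", ["1", "2", "X"])
merge = merge_command([output for _, output in jobs])
```

Each command in jobs would be run with its standard output redirected to the paired file, and the output of the merge command could then be passed to plot-vcfstats -p as in the example above.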
3648
3649 PERFORMANCE
3650 HTSlib was designed with BCF format in mind. When parsing VCF files,
3651 all records are internally converted into BCF representation. Simple
3652 operations, like removing a single column from a VCF file, can
3653 therefore be done much faster with standard UNIX commands, such as awk
3654 or cut. For the same reason, BCF is the recommended input/output format
3655 whenever possible, as it avoids the large overhead of the VCF → BCF →
3656 VCF conversion.
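
The kind of text-level operation referred to above can be illustrated with a few lines of Python that drop one tab-separated column from an uncompressed VCF without parsing the records. This is a sketch in the spirit of awk or cut: the function drop_column is hypothetical, it does no validation, and the choice of column is left to the caller.

```python
def drop_column(vcf_lines, column):
    """Remove one 0-based tab-separated column from the header and data lines."""
    for line in vcf_lines:
        if line.startswith("##"):
            yield line                      # meta-information lines pass through
            continue
        fields = line.rstrip("\n").split("\t")
        del fields[column]
        yield "\t".join(fields) + "\n"


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n",
]
result = list(drop_column(example, 10))     # drop the second sample column
```

Because no record is converted to the internal BCF representation, the cost is that of plain string splitting.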
3657
3658 BUGS
3659 Please report any bugs you encounter on the GitHub website:
3660 http://github.com/samtools/bcftools
3661
3662 AUTHORS
3663 Heng Li from the Sanger Institute wrote the original C version of
3664 htslib, samtools and bcftools. Bob Handsaker from the Broad Institute
3665 implemented the BGZF library. Petr Danecek, Shane McCarthy and John
3666 Marshall are maintaining and further developing bcftools. Many other
3667 people contributed to the program and to the file format
3668 specifications, both directly and indirectly by providing patches,
3669 testing and reporting bugs. We thank them all.
3670
3671 RESOURCES
3672 BCFtools GitHub website: http://github.com/samtools/bcftools
3673
3674 Samtools GitHub website: http://github.com/samtools/samtools
3675
3676 HTSlib GitHub website: http://github.com/samtools/htslib
3677
3678 File format specifications: http://samtools.github.io/hts-specs
3679
3680 BCFtools documentation: http://samtools.github.io/bcftools
3681
3682 BCFtools wiki page: https://github.com/samtools/bcftools/wiki
3683
3684 COPYING
3685 The MIT/Expat License or GPL License, see the LICENSE document for
3686 details. Copyright (c) Genome Research Ltd.
3687
3688
3689
3690 2022-04-07 BCFTOOLS(1)