vcftools(man) 2 August 2018 vcftools(man)

NAME
vcftools v0.1.16 - Utilities for the variant call format (VCF) and
binary variant call format (BCF)

SYNOPSIS
vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE ] [ --out OUTPUT
PREFIX ] [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]

DESCRIPTION
vcftools is a suite of functions for use on genetic variation data in
the form of VCF and BCF files. The tools provided are mainly used to
summarize data, run calculations on data, filter out data, and convert
data into other useful file formats.

EXAMPLES
Output allele frequency for all sites in the input vcf file from
chromosome 1
vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

Output a new vcf file from the input vcf file that removes any indel
sites
vcftools --vcf input_file.vcf --remove-indels --recode \
    --recode-INFO-all --out SNPs_only

Output file comparing the sites in two vcf files
vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz \
    --diff-site --out in1_v_in2

Output a new vcf file to standard out without any sites that have a
filter tag, then compress it with gzip
vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode \
    --stdout | gzip -c > output_PASS_only.vcf.gz

Output a Hardy-Weinberg p-value for every site in the bcf file that
does not have any missing genotypes
vcftools --bcf input_file.bcf --hardy --max-missing 1.0 \
    --out output_noMissing

Output nucleotide diversity at a list of positions
zcat input_file.vcf.gz | vcftools --vcf - --site-pi \
    --positions SNP_list.txt --out nucleotide_diversity

BASIC OPTIONS
These options are used to specify the input and output files.

INPUT FILE OPTIONS
--vcf <input_filename>
This option defines the VCF file to be processed. VCFtools expects
files in VCF format v4.0, v4.1 or v4.2. The latter two are supported
with some small limitations. If the user provides a dash character '-'
as a file name, the program expects a VCF file to be piped in through
standard in.

--gzvcf <input_filename>
This option can be used in place of the --vcf option to read
compressed (gzipped) VCF files directly.

--bcf <input_filename>
This option can be used in place of the --vcf option to read BCF2
files directly. You do not need to specify if this file is compressed
with BGZF encoding. If the user provides a dash character '-' as a
file name, the program expects a BCF2 file to be piped in through
standard in.

OUTPUT FILE OPTIONS
--out <output_prefix>
This option defines the output filename prefix for all files generated
by vcftools. For example, if <output_prefix> is set to
output_filename, then all output files will be of the form
output_filename.*** . If this option is omitted, all output files will
have the prefix "out." in the current working directory.

--stdout
-c
These options direct the vcftools output to standard out so it can be
piped into another program or written directly to a filename of
choice. However, a select few output functions cannot be written to
standard out.

--temp <temporary_directory>
This option can be used to redirect any temporary files that vcftools
creates into a specified directory.

SITE FILTERING OPTIONS
These options are used to include or exclude certain sites from any
analysis being performed by the program.

POSITION FILTERING
--chr <chromosome>
--not-chr <chromosome>
Includes or excludes sites with identifiers matching <chromosome>.
These options may be used multiple times to include or exclude more
than one chromosome.

--from-bp <integer>
--to-bp <integer>
These options specify a lower bound and upper bound for a range of
sites to be processed. Sites with positions less than the lower bound
or greater than the upper bound will be excluded. These options can
only be used in conjunction with a single usage of --chr. Using one of
these does not require use of the other.

--positions <filename>
--exclude-positions <filename>
Include or exclude a set of sites on the basis of a list of positions
in a file. Each line of the input file should contain a
(tab-separated) chromosome and position. The file can have comment
lines that start with a "#"; they will be ignored.
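
As a rough illustration of the positions file format described above,
the following sketch writes a small tab-separated list and passes it to
--positions. The file name keep_sites.txt, the coordinates, the output
prefix and input_file.vcf are all made up for the example, and the
vcftools call is skipped when the program or the input file is absent.

```shell
# Hypothetical positions list: one "chromosome<TAB>position" pair per line.
# The first line starts with "#", so it is treated as a comment.
printf '#chrom\tpos\n1\t10583\n1\t13302\n2\t45895\n' > keep_sites.txt

# Run the filter only when vcftools and the (assumed) input file exist.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --positions keep_sites.txt \
        --recode --out kept_positions
fi
```

Swapping --positions for --exclude-positions would use the same file to
drop the listed sites instead of keeping them.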

--positions-overlap <filename>
--exclude-positions-overlap <filename>
Include or exclude a set of sites on the basis of the reference allele
overlapping with a list of positions in a file. Each line of the input
file should contain a (tab-separated) chromosome and position. The
file can have comment lines that start with a "#"; they will be
ignored.

--bed <filename>
--exclude-bed <filename>
Include or exclude a set of sites on the basis of a BED file. Only the
first three columns (chrom, chromStart and chromEnd) are required. The
BED file is expected to have a header line. A site will be kept or
excluded if any part of any allele (REF or ALT) at a site is within
the range of one of the BED entries.
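
The BED input described above can be sketched as follows. The file name
regions.bed, the intervals, the output prefix and input_file.vcf are
assumptions for illustration only; the point is the header line followed
by three tab-separated columns.

```shell
# Hypothetical BED file. The first line is the header line that vcftools
# expects; the other lines hold chrom, chromStart and chromEnd.
printf 'chrom\tchromStart\tchromEnd\n' > regions.bed
printf '1\t10000\t20000\n2\t500000\t750000\n' >> regions.bed

# Keep only sites overlapping the listed intervals, if vcftools and the
# (assumed) input file are available.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --bed regions.bed \
        --recode --out in_regions
fi
```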

--thin <integer>
Thin sites so that no two sites are within the specified distance from
one another.

--mask <filename>
--invert-mask <filename>
--mask-min <integer>
These options are used to specify a FASTA-like mask file to filter
with. The mask file contains a sequence of integer digits (between 0
and 9) for each position on a chromosome that specify if a site at
that position should be filtered or not.
An example mask file would look like:
>1
0000011111222...
>2
2222211111000...
In this example, sites in the VCF file located within the first 5
bases of the start of chromosome 1 would be kept, whereas sites at
position 6 onwards would be filtered out. By the same rule, sites
within the first 10 bases of chromosome 2 would be filtered out,
whereas sites from position 11 onwards (mask value 0) would be kept.
The "--invert-mask" option takes the same format mask file as the
"--mask" option, but inverts the mask file before filtering with it.
The "--mask-min" option specifies a threshold mask value between 0 and
9 to filter positions by. The default threshold is 0, meaning only
sites with that value or lower will be kept.
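
To make the threshold rule concrete, the sketch below applies it to the
chromosome 1 mask string from the example. It only emulates the rule as
worded above (a position is kept when its mask digit is at or below the
threshold); it is not vcftools code, and the variable names are
invented for the illustration.

```shell
# Mask digits for one chromosome (the chromosome 1 example above) and the
# threshold that --mask-min would set (0 is the documented default).
mask='0000011111222'
mask_min=0

# Collect the 1-based positions whose mask digit is <= the threshold,
# i.e. the positions at which a site would be kept under the stated rule.
kept=$(printf '%s\n' "$mask" | awk -v min="$mask_min" '{
    n = length($0)
    for (i = 1; i <= n; i++)
        if (substr($0, i, 1) + 0 <= min + 0)
            printf "%s%d", (out++ ? " " : ""), i
    print ""
}')
echo "$kept"
```

With mask_min=0 this lists positions 1 to 5; setting mask_min=1 would
also list positions 6 to 10, since their mask digit is 1.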

SITE ID FILTERING
--snp <string>
Include SNP(s) with matching ID (e.g. a dbSNP rsID). This option can
be used multiple times in order to include more than one SNP.

--snps <filename>
--exclude <filename>
Include or exclude a list of SNPs given in a file. The file should
contain a list of SNP IDs (e.g. dbSNP rsIDs), with one ID per line. No
header line is expected.

VARIANT TYPE FILTERING
--keep-only-indels
--remove-indels
Include or exclude sites that contain an indel. For these options
"indel" means any variant that alters the length of the REF allele.

FILTER FLAG FILTERING
--remove-filtered-all
Removes all sites with a FILTER flag other than PASS.

--keep-filtered <string>
--remove-filtered <string>
Includes or excludes all sites marked with a specific FILTER flag.
These options may be used more than once to specify multiple FILTER
flags.

INFO FIELD FILTERING
--keep-INFO <string>
--remove-INFO <string>
Includes or excludes all sites with a specific INFO flag. These
options only filter on the presence of the flag and not its value.
These options can be used multiple times to specify multiple INFO
flags.

ALLELE FILTERING
--maf <float>
--max-maf <float>
Include only sites with a Minor Allele Frequency greater than or equal
to the "--maf" value and less than or equal to the "--max-maf" value.
One of these options may be used without the other. Allele frequency
is defined as the number of times an allele appears over all
individuals at that site, divided by the total number of non-missing
alleles at that site.
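
The allele frequency definition above amounts to a simple ratio. The
following sketch works one made-up site through it; the counts, the
0.05 threshold, the output prefix and input_file.vcf are illustrative
assumptions rather than values taken from vcftools.

```shell
# Hypothetical site: 10 diploid individuals, 2 of them with missing
# genotypes, leaving 16 non-missing alleles. The rarer allele is seen 4 times.
minor_count=4
non_missing=16

# Frequency = allele count / total number of non-missing alleles.
maf=$(awk -v c="$minor_count" -v n="$non_missing" \
    'BEGIN { printf "%.2f", c / n }')
echo "MAF = $maf"

# A site like this would pass a lower bound of 0.05. The call is skipped
# when vcftools or the (assumed) input file is not available.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --maf 0.05 --recode --out maf_filtered
fi
```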

--non-ref-af <float>
--max-non-ref-af <float>
--non-ref-ac <integer>
--max-non-ref-ac <integer>

--non-ref-af-any <float>
--max-non-ref-af-any <float>
--non-ref-ac-any <integer>
--max-non-ref-ac-any <integer>
Include only sites with all Non-Reference (ALT) Allele Frequencies
(af) or Counts (ac) within the range specified, and including the
specified value. The default options require all alleles to meet the
specified criteria, whereas the options appended with "any" require
only one allele to meet the criteria. The Allele frequency is defined
as the number of times an allele appears over all individuals at that
site, divided by the total number of non-missing alleles at that site.

--mac <integer>
--max-mac <integer>
Include only sites with Minor Allele Count greater than or equal to
the "--mac" value and less than or equal to the "--max-mac" value. One
of these options may be used without the other. Allele count is simply
the number of times that allele appears over all individuals at that
site.

--min-alleles <integer>
--max-alleles <integer>
Include only sites with a number of alleles greater than or equal to
the "--min-alleles" value and less than or equal to the
"--max-alleles" value. One of these options may be used without the
other. For example, to include only bi-allelic sites, one could use:
vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

GENOTYPE VALUE FILTERING
--min-meanDP <float>
--max-meanDP <float>
Includes only sites with mean depth values (over all included
individuals) greater than or equal to the "--min-meanDP" value and
less than or equal to the "--max-meanDP" value. One of these options
may be used without the other. These options require that the "DP"
FORMAT tag is included for each site.

--hwe <float>
Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as
defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value
below the threshold defined by this option are taken to be out of HWE,
and therefore excluded.

--max-missing <float>
Exclude sites on the basis of the proportion of missing data (defined
to be between 0 and 1, where 0 allows sites that are completely
missing and 1 indicates no missing data allowed).

--max-missing-count <integer>
Exclude sites with more than this number of missing genotypes over all
individuals.

--phased
Excludes all sites that contain unphased genotypes.

MISCELLANEOUS FILTERING
--minQ <float>
Includes only sites with Quality value above this threshold.

INDIVIDUAL FILTERING OPTIONS
These options are used to include or exclude certain individuals from
any analysis being performed by the program.
--indv <string>
--remove-indv <string>
Specify an individual to be kept or removed from the analysis. These
options can be used multiple times to specify multiple individuals. If
both options are specified, then the "--indv" option is executed
before the "--remove-indv" option.

--keep <filename>
--remove <filename>
Provide files containing a list of individuals to either include or
exclude in subsequent analysis. Each individual ID (as defined in the
VCF header line) should be included on a separate line. If both
options are used, then the "--keep" option is executed before the
"--remove" option. When multiple files are provided, the individuals
kept are the union of individuals from all keep files minus the union
of individuals from all remove files. No header line is expected.
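
The set arithmetic described above (union of the keep files minus the
union of the remove files) can be previewed with standard tools before
running vcftools. This is only a sketch of that rule; the sample IDs,
file names, output prefix and input_file.vcf are hypothetical.

```shell
# Hypothetical ID lists: one individual per line, no header line.
printf 'NA001\nNA002\nNA003\n' > keep_a.txt
printf 'NA003\nNA004\n'        > keep_b.txt
printf 'NA002\n'               > remove_a.txt

# Union of all keep files, minus the union of all remove files.
sort -u keep_a.txt keep_b.txt > keep_union.txt
sort -u remove_a.txt          > remove_union.txt
retained=$(comm -23 keep_union.txt remove_union.txt | paste -sd ' ' -)
echo "$retained"

# The same lists passed to vcftools, if it and the input file are present.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --keep keep_a.txt --keep keep_b.txt \
        --remove remove_a.txt --recode --out subset
fi
```

Here NA002 appears in a keep file and in the remove file, so it is not
among the retained individuals.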

--max-indv <integer>
Randomly thins individuals so that only the specified number are
retained.

GENOTYPE FILTERING OPTIONS
These options are used to exclude genotypes from any analysis being
performed by the program. If excluded, these values will be treated as
missing.
--remove-filtered-geno-all
Excludes all genotypes with a FILTER flag not equal to "." (a missing
value) or PASS.

--remove-filtered-geno <string>
Excludes genotypes with a specific FILTER flag.

--minGQ <float>
Exclude all genotypes with a quality below the threshold specified.
This option requires that the "GQ" FORMAT tag is specified for all
sites.

--minDP <float>
--maxDP <float>
Includes only genotypes with a depth greater than or equal to the
"--minDP" value and less than or equal to the "--maxDP" value. These
options require that the "DP" FORMAT tag is specified for all sites.

OUTPUT OPTIONS
These options specify which analyses or conversions to perform on the
data that passed through all specified filters.

OUTPUT ALLELE STATISTICS
--freq
--freq2
Outputs the allele frequency for each site in a file with the suffix
".frq". The second option is used to suppress output of any
information about the alleles.

--counts
--counts2
Outputs the raw allele counts for each site in a file with the suffix
".frq.count". The second option is used to suppress output of any
information about the alleles.

--derived
For use with the previous four frequency and count options only.
Re-orders the output file columns so that the ancestral allele appears
first. This option relies on the ancestral allele being specified in
the VCF file using the AA tag in the INFO field.

OUTPUT DEPTH STATISTICS
--depth
Generates a file containing the mean depth per individual. This file
has the suffix ".idepth".

--site-depth
Generates a file containing the depth per site summed across all
individuals. This output file has the suffix ".ldepth".

--site-mean-depth
Generates a file containing the mean depth per site averaged across
all individuals. This output file has the suffix ".ldepth.mean".

--geno-depth
Generates a (possibly very large) file containing the depth for each
genotype in the VCF file. Missing entries are given the value -1. The
file has the suffix ".gdepth".

OUTPUT LD STATISTICS
--hap-r2
Outputs a file reporting the r2, D, and D' statistics using phased
haplotypes. These are the traditional measures of LD often reported in
the population genetics literature. The output file has the suffix
".hap.ld". This option assumes that the VCF input file has phased
haplotypes.

--geno-r2
Calculates the squared correlation coefficient between genotypes
encoded as 0, 1 and 2 to represent the number of non-reference alleles
in each individual. This is the same as the LD measure reported by
PLINK. The D and D' statistics are only available for phased
genotypes. The output file has the suffix ".geno.ld".

--geno-chisq
If your data contains sites with more than two alleles, then this
option can be used to test for genotype independence via the
chi-squared statistic. The output file has the suffix ".geno.chisq".

--hap-r2-positions <positions list file>
--geno-r2-positions <positions list file>
Outputs a file reporting the r2 statistics of the sites contained in
the provided file versus all other sites. The output files have the
suffix ".list.hap.ld" or ".list.geno.ld", depending on which option is
used.

--ld-window <integer>
This optional parameter defines the maximum number of SNPs between the
SNPs being tested for LD in the "--hap-r2", "--geno-r2", and
"--geno-chisq" functions.

--ld-window-bp <integer>
This optional parameter defines the maximum number of physical bases
between the SNPs being tested for LD in the "--hap-r2", "--geno-r2",
and "--geno-chisq" functions.

--ld-window-min <integer>
This optional parameter defines the minimum number of SNPs between the
SNPs being tested for LD in the "--hap-r2", "--geno-r2", and
"--geno-chisq" functions.

--ld-window-bp-min <integer>
This optional parameter defines the minimum number of physical bases
between the SNPs being tested for LD in the "--hap-r2", "--geno-r2",
and "--geno-chisq" functions.

--min-r2 <float>
This optional parameter sets a minimum value for r2, below which the
LD statistic is not reported by the "--hap-r2", "--geno-r2", and
"--geno-chisq" functions.

--interchrom-hap-r2
--interchrom-geno-r2
Outputs a file reporting the r2 statistics for sites on different
chromosomes. The output files have the suffix ".interchrom.hap.ld" or
".interchrom.geno.ld", depending on the option used.

OUTPUT TRANSITION/TRANSVERSION STATISTICS
--TsTv <integer>
Calculates the Transition / Transversion ratio in bins of size defined
by this option. Only uses bi-allelic SNPs. The resulting output file
has the suffix ".TsTv".

--TsTv-summary
Calculates a simple summary of all Transitions and Transversions. The
output file has the suffix ".TsTv.summary".

--TsTv-by-count
Calculates the Transition / Transversion ratio as a function of
alternative allele count. Only uses bi-allelic SNPs. The resulting
output file has the suffix ".TsTv.count".

--TsTv-by-qual
Calculates the Transition / Transversion ratio as a function of SNP
quality threshold. Only uses bi-allelic SNPs. The resulting output
file has the suffix ".TsTv.qual".

--FILTER-summary
Generates a summary of the number of SNPs and Ts/Tv ratio for each
FILTER category. The output file has the suffix ".FILTER.summary".

OUTPUT NUCLEOTIDE DIVERSITY STATISTICS
--site-pi
Measures nucleotide diversity on a per-site basis. The output file has
the suffix ".sites.pi".

--window-pi <integer>
--window-pi-step <integer>
Measures the nucleotide diversity in windows, with the number provided
as the window size. The output file has the suffix ".windowed.pi". The
"--window-pi-step" option is an optional argument used to specify the
step size in between windows.

OUTPUT FST STATISTICS
--weir-fst-pop <filename>
This option is used to calculate an Fst estimate from Weir and
Cockerham's 1984 paper. This is the preferred calculation of Fst. The
provided file must contain a list of individuals (one individual per
line) from the VCF file that correspond to one population. This option
can be used multiple times to calculate Fst for more than two
populations. These files will also be included as "--keep" options. By
default, calculations are done on a per-site basis. The output file
has the suffix ".weir.fst".

--fst-window-size <integer>
--fst-window-step <integer>
These options can be used with "--weir-fst-pop" to do the Fst
calculations on a windowed basis instead of a per-site basis. These
arguments specify the desired window size and the desired step size
between windows.
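
A possible two-population run, per-site and then windowed, might look
like the sketch below. The population files, sample IDs, output
prefixes, window size and step are invented for the example, and
input_file.vcf is assumed to contain those individuals.

```shell
# Hypothetical population files: one individual ID per line.
printf 'NA001\nNA002\nNA003\n' > pop1.txt
printf 'NA101\nNA102\nNA103\n' > pop2.txt

if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    # Per-site estimates (the documented default).
    vcftools --vcf input_file.vcf \
        --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt \
        --out pop1_vs_pop2

    # Windowed estimates; 100000 and 25000 are arbitrary example values.
    vcftools --vcf input_file.vcf \
        --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt \
        --fst-window-size 100000 --fst-window-step 25000 \
        --out pop1_vs_pop2_windowed
fi
```

Because the population files are also applied as "--keep" options,
individuals not listed in either file are left out of the calculation.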

OUTPUT OTHER STATISTICS
--het
Calculates a measure of heterozygosity on a per-individual basis.
Specifically, the inbreeding coefficient, F, is estimated for each
individual using a method of moments. The resulting file has the
suffix ".het".

--hardy
Reports a p-value for each site from a Hardy-Weinberg Equilibrium test
(as defined by Wigginton, Cutler and Abecasis (2005)). The resulting
file (with suffix ".hwe") also contains the Observed numbers of
Homozygotes and Heterozygotes and the corresponding Expected numbers
under HWE.

--TajimaD <integer>
Outputs Tajima's D statistic in bins with size of the specified
number. The output file has the suffix ".Tajima.D".

--indv-freq-burden
This option calculates the number of variants within each individual
of a specific frequency. The resulting file has the suffix
".ifreqburden".

--LROH
This option will identify and output Long Runs of Homozygosity. The
output file has the suffix ".LROH". This function is experimental, and
will use a lot of memory if applied to large datasets.

--relatedness
This option is used to calculate and output a relatedness statistic
based on the method of Yang et al, Nature Genetics 2010
(doi:10.1038/ng.608). Specifically, it calculates the unadjusted Ajk
statistic. Expectation of Ajk is zero for individuals within a
population, and one for an individual with themselves. The output file
has the suffix ".relatedness".

--relatedness2
This option is used to calculate and output a relatedness statistic
based on the method of Manichaikul et al., BIOINFORMATICS 2010
(doi:10.1093/bioinformatics/btq559). The output file has the suffix
".relatedness2".

--site-quality
Generates a file containing the per-site SNP quality, as found in the
QUAL column of the VCF file. This file has the suffix ".lqual".

--missing-indv
Generates a file reporting the missingness on a per-individual basis.
The file has the suffix ".imiss".

--missing-site
Generates a file reporting the missingness on a per-site basis. The
file has the suffix ".lmiss".

--SNPdensity <integer>
Calculates the number and density of SNPs in bins of size defined by
this option. The resulting output file has the suffix ".snpden".

--kept-sites
Creates a file listing all sites that have been kept after filtering.
The file has the suffix ".kept.sites".

--removed-sites
Creates a file listing all sites that have been removed after
filtering. The file has the suffix ".removed.sites".

--singletons
This option will generate a file detailing the location of singletons,
and the individual they occur in. The file reports both true
singletons, and private doubletons (i.e. SNPs where the minor allele
only occurs in a single individual and that individual is homozygous
for that allele). The output file has the suffix ".singletons".

--hist-indel-len
This option will generate a histogram file of the length of all indels
(including SNPs). It shows both the count and the percentage of all
indels for indel lengths that occur at least once in the input file.
SNPs are considered indels with length zero. The output file has the
suffix ".indel.hist".

--hapcount <BED file>
This option will output the number of unique haplotypes within user
specified bins, as defined by the BED file. The output file has the
suffix ".hapcount".

--mendel <PED file>
This option is used to report mendel errors identified in trios. The
command requires a PLINK-style PED file, with the first four columns
specifying a family ID, the child ID, the father ID, and the mother
ID. The output of this command has the suffix ".mendel".

--extract-FORMAT-info <string>
Extract information from the genotype fields in the VCF file relating
to a specified FORMAT identifier. The resulting output file has the
suffix ".<FORMAT_ID>.FORMAT". For example, the following command would
extract all of the GT (i.e. Genotype) entries:
vcftools --vcf file1.vcf --extract-FORMAT-info GT

--get-INFO <string>
This option is used to extract information from the INFO field in the
VCF file. The <string> argument specifies the INFO tag to be
extracted, and the option can be used multiple times in order to
extract multiple INFO entries. The resulting file, with suffix
".INFO", contains the required INFO information in a tab-separated
table. For example, to extract the NS and DB flags, one would use the
command:
vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

OUTPUT VCF FORMAT
--recode
--recode-bcf
These options are used to generate a new file in either VCF or BCF
format from the input VCF or BCF file after applying the filtering
options specified by the user. The output file has the suffix
".recode.vcf" or ".recode.bcf". By default, the INFO fields are
removed from the output file, as the INFO values may be invalidated by
the recoding (e.g. the total depth may need to be recalculated if
individuals are removed). This behavior may be overridden by the
following options. By default, BCF files are written out as BGZF
compressed files.

--recode-INFO <string>
--recode-INFO-all
These options can be used with the above recode options to define an
INFO key name to keep in the output file. The first option can be used
multiple times to keep more of the INFO fields. The second option is
used to keep all INFO values in the original file.

--contigs <string>
This option can be used in conjunction with the --recode-bcf option
when the input file does not have any contig declarations. This option
expects a file name with one contig header per line. These lines are
included in the output file.

OUTPUT OTHER FORMATS
--012
This option outputs the genotypes as a large matrix. Three files are
produced. The first, with suffix ".012", contains the genotypes of
each individual on a separate line. Genotypes are represented as 0, 1
and 2, where the number represents the number of non-reference
alleles. Missing genotypes are represented by -1. The second file,
with suffix ".012.indv", details the individuals included in the main
file. The third file, with suffix ".012.pos", details the site
locations included in the main file.

--IMPUTE
This option outputs phased haplotypes in IMPUTE reference-panel
format. As IMPUTE requires phased data, using this option also implies
--phased. Unphased individuals and genotypes are therefore excluded.
Only bi-allelic sites are included in the output. Using this option
generates three files. The IMPUTE haplotype file has the suffix
".impute.hap", and the IMPUTE legend file has the suffix
".impute.hap.legend". The third file, with suffix ".impute.hap.indv",
details the individuals included in the haplotype file, although this
file is not needed by IMPUTE.

--ldhat
--ldhelmet
--ldhat-geno
These options output data in LDhat/LDhelmet format. They require the
"--chr" filter option to also be used. The "--ldhat" and "--ldhelmet"
options output phased data only, and therefore also imply "--phased",
leading to unphased individuals and genotypes being excluded. For
LDhelmet, only SNPs will be considered, and therefore it implies
"--remove-indels". The "--ldhat-geno" option treats all of the data as
unphased, and therefore outputs LDhat files in genotype/unphased
format. Two output files are generated with the suffixes
".ldhat.sites" and ".ldhat.locs", which correspond to the LDhat
"sites" and "locs" input files respectively; for LDhelmet, the two
files generated have the suffixes ".ldhelmet.snps" and
".ldhelmet.pos", which correspond to the "SNPs" and "positions" files.

--BEAGLE-GL
--BEAGLE-PL
These options output genotype likelihood information for input into
the BEAGLE program. The VCF file is required to contain FORMAT fields
with "GL" or "PL" tags, which can generally be output by SNP callers
such as the GATK. Use of these options requires a chromosome to be
specified via the "--chr" option. The resulting output file has the
suffix ".BEAGLE.GL" or ".BEAGLE.PL" and contains genotype likelihoods
for biallelic sites. This file is suitable for input into BEAGLE via
the "like=" argument.

--plink
--plink-tped
--chrom-map
These options output the genotype data in PLINK PED format. With the
first option, two files are generated, with suffixes ".ped" and
".map". Note that only bi-allelic loci will be output. Further details
of these files can be found in the PLINK documentation.
Note: The first option can be very slow on large datasets. Using the
--chr option to divide up the dataset is advised, or alternatively use
the --plink-tped option which outputs the files in the PLINK
transposed format with suffixes ".tped" and ".tfam".
For usage with variant sites in species other than humans, the
--chrom-map option may be used to specify a file name that has a
tab-delimited mapping of chromosome name to a desired integer value
with one line per chromosome. This file must contain a mapping for
every chromosome value found in the input file.
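
One way to build the tab-delimited chromosome map described above is
sketched here. The chromosome names, file names and output prefix are
hypothetical, and assigning consecutive integers is just one possible
choice of "desired integer value".

```shell
# Hypothetical chromosome names, one per line. In practice these could be
# taken from the CHROM column of the input file.
printf 'scaffold_1\nscaffold_2\nchrZ\n' > chrom_names.txt

# Map each name to a consecutive integer: "name<TAB>integer" per line.
awk '{ printf "%s\t%d\n", $1, NR }' chrom_names.txt > chrom_map.txt
cat chrom_map.txt

# Use the map for PLINK output, if vcftools and the input file are present.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --plink --chrom-map chrom_map.txt \
        --out plink_out
fi
```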

COMPARISON OPTIONS
These options are used to compare the original variant file to another
variant file and output the results. All of the diff functions require
both files to contain the same chromosomes and that the files be
sorted in the same order. If one of the files contains chromosomes
that the other file does not, use the --not-chr filter to remove them
from the analysis.

DIFF VCF FILE
--diff <filename>
--gzdiff <filename>
--diff-bcf <filename>
These options compare the original input file to this specified VCF,
gzipped VCF, or BCF file. These options must be specified with one
additional option described below in order to specify what type of
comparison is to be performed. See the examples section for typical
usage.

DIFF OPTIONS
--diff-site
Outputs the sites that are common / unique to each file. The output
file has the suffix ".diff.sites_in_files".

--diff-indv
Outputs the individuals that are common / unique to each file. The
output file has the suffix ".diff.indv_in_files".

--diff-site-discordance
This option calculates discordance on a site by site basis. The
resulting output file has the suffix ".diff.sites".

--diff-indv-discordance
This option calculates discordance on a per-individual basis. The
resulting output file has the suffix ".diff.indv".

--diff-indv-map <filename>
This option allows the user to specify a mapping of individual IDs in
the second file to those in the first file. The program expects the
file to contain a tab-delimited line containing an individual's name
in file one followed by that same individual's name in file two, with
one mapping per line.
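
The mapping file described above could be prepared as in the following
sketch. The sample names, file names and output prefix are made up, and
pairing the map with --diff-indv-discordance is only one example of a
comparison it might accompany.

```shell
# Hypothetical mapping: name in file one <TAB> name in file two.
printf 'NA001\tsample_A\nNA002\tsample_B\n' > indv_map.txt

# Compare two (assumed) VCF files using the mapping, if both the program
# and the files are available.
if command -v vcftools >/dev/null 2>&1 \
        && [ -f input_file1.vcf ] && [ -f input_file2.vcf ]; then
    vcftools --vcf input_file1.vcf --diff input_file2.vcf \
        --diff-indv-map indv_map.txt --diff-indv-discordance \
        --out in1_v_in2
fi
```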

--diff-discordance-matrix
This option calculates a discordance matrix. This option only works
with bi-allelic loci with matching alleles that are present in both
files. The resulting output file has the suffix
".diff.discordance.matrix".

--diff-switch-error
This option calculates phasing errors (specifically "switch errors").
This option creates an output file describing switch errors found
between sites, with suffix ".diff.switch".

AUTHORS
Adam Auton
Anthony Marcketta