vcftools(1)

vcftools(man)                    2 August 2018                   vcftools(man)

NAME

       vcftools v0.1.16 - Utilities for the variant call format (VCF) and
       binary variant call format (BCF)

SYNOPSIS

       vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE ] [ --out OUTPUT PREFIX ]
       [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]

DESCRIPTION

       vcftools is a suite of functions for use on genetic variation data in
       the form of VCF and BCF files. The tools provided are used mainly to
       summarize data, run calculations on data, filter out data, and convert
       data into other useful file formats.

EXAMPLES

       Output allele frequency for all sites in the input vcf file from
       chromosome 1
         vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

       Output a new vcf file from the input vcf file that removes any indel
       sites
         vcftools --vcf input_file.vcf --remove-indels --recode \
           --recode-INFO-all --out SNPs_only

       Output file comparing the sites in two vcf files
         vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz \
           --diff-site --out in1_v_in2

       Output a new vcf file to standard out without any sites that have a
       filter tag, then compress it with gzip
         vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode \
           --stdout | gzip -c > output_PASS_only.vcf.gz

       Output a Hardy-Weinberg p-value for every site in the bcf file that
       does not have any missing genotypes
         vcftools --bcf input_file.bcf --hardy --max-missing 1.0 \
           --out output_noMissing

       Output nucleotide diversity at a list of positions
         zcat input_file.vcf.gz | vcftools --vcf - --site-pi \
           --positions SNP_list.txt --out nucleotide_diversity

BASIC OPTIONS

       These options are used to specify the input and output files.

   INPUT FILE OPTIONS
         --vcf <input_filename>
           This option defines the VCF file to be processed. VCFtools expects
           files in VCF format v4.0, v4.1 or v4.2. The latter two are
           supported with some small limitations. If the user provides a dash
           character '-' as a file name, the program expects a VCF file to be
           piped in through standard in.

         --gzvcf <input_filename>
           This option can be used in place of the --vcf option to read
           compressed (gzipped) VCF files directly.

         --bcf <input_filename>
           This option can be used in place of the --vcf option to read BCF2
           files directly. You do not need to specify if this file is
           compressed with BGZF encoding. If the user provides a dash
           character '-' as a file name, the program expects a BCF2 file to be
           piped in through standard in.

   OUTPUT FILE OPTIONS
         --out <output_prefix>
           This option defines the output filename prefix for all files
           generated by vcftools. For example, if <output_prefix> is set to
           output_filename, then all output files will be of the form
           output_filename.***. If this option is omitted, all output files
           will have the prefix "out." in the current working directory.

         --stdout
         -c
           These options direct the vcftools output to standard out so it can
           be piped into another program or written directly to a filename of
           choice. However, a select few output functions cannot be written to
           standard out.

         --temp <temporary_directory>
           This option can be used to redirect any temporary files that
           vcftools creates into a specified directory.

SITE FILTERING OPTIONS

       These options are used to include or exclude certain sites from any
       analysis being performed by the program.

   POSITION FILTERING
         --chr <chromosome>
         --not-chr <chromosome>
           Includes or excludes sites with identifiers matching <chromosome>.
           These options may be used multiple times to include or exclude more
           than one chromosome.

         --from-bp <integer>
         --to-bp <integer>
           These options specify a lower bound and upper bound for a range of
           sites to be processed. Sites with positions less than the
           "--from-bp" value or greater than the "--to-bp" value will be
           excluded. These options can only be used in conjunction with a
           single usage of --chr. Using one of these does not require use of
           the other.

         --positions <filename>
         --exclude-positions <filename>
           Include or exclude a set of sites on the basis of a list of
           positions in a file. Each line of the input file should contain a
           (tab-separated) chromosome and position. The file can have comment
           lines that start with a "#"; these are ignored.

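           As an illustration of the positions file format described above,
           the sketch below writes a small list and counts the entries left
           once "#" comment lines are skipped. The coordinates and the names
           SNP_list.txt, input_file.vcf and listed_sites are made up for the
           example, and the vcftools call only runs when vcftools and the
           assumed input file are present.

```shell
# Hypothetical positions file: tab-separated chromosome and position,
# preceded by one "#" comment line.
workdir=$(mktemp -d)
printf '#chrom\tpos\n1\t10583\n1\t13302\n2\t20144\n' > "$workdir/SNP_list.txt"

# Count the entries that remain after comment lines are ignored.
n_positions=$(grep -vc '^#' "$workdir/SNP_list.txt")
echo "positions listed: $n_positions"

# Only attempt the real run when vcftools and the (assumed) input exist.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
  vcftools --vcf input_file.vcf --positions "$workdir/SNP_list.txt" \
    --recode --out listed_sites
fi
```
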
         --positions-overlap <filename>
         --exclude-positions-overlap <filename>
           Include or exclude a set of sites on the basis of the reference
           allele overlapping with a list of positions in a file. Each line of
           the input file should contain a (tab-separated) chromosome and
           position. The file can have comment lines that start with a "#";
           these are ignored.

         --bed <filename>
         --exclude-bed <filename>
           Include or exclude a set of sites on the basis of a BED file. Only
           the first three columns (chrom, chromStart and chromEnd) are
           required. The BED file is expected to have a header line. A site
           will be kept or excluded if any part of any allele (REF or ALT) at
           a site is within the range of one of the BED entries.

         --thin <integer>
           Thin sites so that no two sites are within the specified distance
           from one another.

         --mask <filename>
         --invert-mask <filename>
         --mask-min <integer>
           These options are used to specify a FASTA-like mask file to filter
           with. The mask file contains a sequence of integer digits (between
           0 and 9) for each position on a chromosome that specify whether a
           site at that position should be filtered or not.
           An example mask file would look like:
             >1
             0000011111222...
             >2
             2222211111000...
           In this example, sites in the VCF file located within the first 5
           bases of the start of chromosome 1 would be kept, whereas sites at
           position 6 onwards would be filtered out. On chromosome 2, sites
           within the first 10 positions would be filtered out, whereas sites
           at positions 11 to 13 (mask value 0) would be kept.
           The "--invert-mask" option takes the same format mask file as the
           "--mask" option, but inverts the mask file before filtering with
           it.
           The "--mask-min" option specifies a threshold mask value between 0
           and 9 to filter positions by. The default threshold is 0, meaning
           only sites with that value or lower will be kept.

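           The threshold rule described above can be sketched in plain shell.
           This only illustrates the stated rule (a site is kept when its mask
           digit is less than or equal to the threshold) and is not taken from
           the vcftools source; the helper name mask_keep is hypothetical.

```shell
# mask_keep MASK_SEQUENCE POSITION [MASK_MIN]
# Prints "keep" when the mask digit at the 1-based POSITION is less than
# or equal to MASK_MIN (default 0), otherwise "filter".
# Illustrative helper only, not part of vcftools.
mask_keep() {
  digit=$(printf '%s' "$1" | cut -c "$2")
  if [ "$digit" -le "${3:-0}" ]; then
    echo keep
  else
    echo filter
  fi
}

chr1_mask=0000011111222       # chromosome 1 line from the example mask
mask_keep "$chr1_mask" 5      # digit 0, default threshold 0
mask_keep "$chr1_mask" 6      # digit 1, default threshold 0
mask_keep "$chr1_mask" 6 1    # digit 1, threshold raised as with --mask-min 1
```

           Raising the threshold with "--mask-min" widens the set of kept
           positions, which is why the same site changes status between the
           last two calls.
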
   SITE ID FILTERING
         --snp <string>
           Include SNP(s) with matching ID (e.g. a dbSNP rsID). This command
           can be used multiple times in order to include more than one SNP.

         --snps <filename>
         --exclude <filename>
           Include or exclude a list of SNPs given in a file. The file should
           contain a list of SNP IDs (e.g. dbSNP rsIDs), with one ID per line.
           No header line is expected.

   VARIANT TYPE FILTERING
         --keep-only-indels
         --remove-indels
           Include or exclude sites that contain an indel. For these options
           "indel" means any variant that alters the length of the REF allele.

   FILTER FLAG FILTERING
         --remove-filtered-all
           Removes all sites with a FILTER flag other than PASS.

         --keep-filtered <string>
         --remove-filtered <string>
           Includes or excludes all sites marked with a specific FILTER flag.
           These options may be used more than once to specify multiple FILTER
           flags.

   INFO FIELD FILTERING
         --keep-INFO <string>
         --remove-INFO <string>
           Includes or excludes all sites with a specific INFO flag. These
           options only filter on the presence of the flag and not its value.
           These options can be used multiple times to specify multiple INFO
           flags.

   ALLELE FILTERING
         --maf <float>
         --max-maf <float>
           Include only sites with a Minor Allele Frequency greater than or
           equal to the "--maf" value and less than or equal to the
           "--max-maf" value. One of these options may be used without the
           other. Allele frequency is defined as the number of times an allele
           appears over all individuals at that site, divided by the total
           number of non-missing alleles at that site.

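           To make the frequency definition above concrete, the following
           sketch counts alleles in a handful of made-up bi-allelic genotypes
           and derives the minor allele frequency. It only illustrates the
           stated definition (allele count divided by the number of
           non-missing alleles) and is not taken from the vcftools source.

```shell
# Hypothetical genotypes for one bi-allelic site in five diploid
# individuals. "." marks a missing allele and is left out of the total.
genotypes="0/0 0/1 1/1 ./. 0/0"

maf=$(echo "$genotypes" | tr ' ' '\n' | LC_ALL=C awk '{
  n = split($0, a, "[/|]")         # split each genotype into its alleles
  for (i = 1; i <= n; i++) {
    if (a[i] == ".") continue      # skip missing alleles
    total++
    if (a[i] == "1") alt++
  }
} END {
  f = alt / total                  # ALT allele frequency
  if (f > 0.5) f = 1 - f           # the minor allele is the rarer one
  printf "%.3f", f
}')
echo "MAF = $maf"
```

           With these genotypes there are 8 non-missing alleles, 3 of which
           are ALT, so a filter such as "--maf 0.4" would exclude the site
           while "--maf 0.3" would keep it.
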
         --non-ref-af <float>
         --max-non-ref-af <float>
         --non-ref-ac <integer>
         --max-non-ref-ac <integer>

         --non-ref-af-any <float>
         --max-non-ref-af-any <float>
         --non-ref-ac-any <integer>
         --max-non-ref-ac-any <integer>
           Include only sites with all Non-Reference (ALT) Allele Frequencies
           (af) or Counts (ac) within the range specified, and including the
           specified value. The default options require all alleles to meet
           the specified criteria, whereas the options appended with "any"
           require only one allele to meet the criteria. The Allele frequency
           is defined as the number of times an allele appears over all
           individuals at that site, divided by the total number of
           non-missing alleles at that site.

         --mac <integer>
         --max-mac <integer>
           Include only sites with Minor Allele Count greater than or equal to
           the "--mac" value and less than or equal to the "--max-mac" value.
           One of these options may be used without the other. Allele count is
           simply the number of times that allele appears over all individuals
           at that site.

         --min-alleles <integer>
         --max-alleles <integer>
           Include only sites with a number of alleles greater than or equal
           to the "--min-alleles" value and less than or equal to the
           "--max-alleles" value. One of these options may be used without the
           other. For example, to include only bi-allelic sites, one could
           use:
             vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

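           A rough way to see what "number of alleles" means for a site is
           sketched below. It assumes the count is the REF allele plus each
           comma-separated entry in the ALT column, with "." meaning no ALT
           allele; that reading is an assumption, and the helper name
           count_alleles is hypothetical.

```shell
# count_alleles ALT_FIELD
# Assumes the number of alleles is 1 (REF) plus the comma-separated ALT
# entries, with "." meaning no ALT allele. Illustrative helper only.
count_alleles() {
  if [ "$1" = "." ]; then
    echo 1
  else
    printf '%s\n' "$1" | awk -F, '{ print NF + 1 }'
  fi
}

count_alleles G      # one ALT allele: a bi-allelic site
count_alleles G,T    # two ALT alleles: a tri-allelic site
count_alleles .      # no ALT allele listed
```
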
   GENOTYPE VALUE FILTERING
         --min-meanDP <float>
         --max-meanDP <float>
           Includes only sites with mean depth values (over all included
           individuals) greater than or equal to the "--min-meanDP" value and
           less than or equal to the "--max-meanDP" value. One of these
           options may be used without the other. These options require that
           the "DP" FORMAT tag is included for each site.

         --hwe <float>
           Assesses sites for Hardy-Weinberg Equilibrium using an exact test,
           as defined by Wigginton, Cutler and Abecasis (2005). Sites with a
           p-value below the threshold defined by this option are taken to be
           out of HWE, and therefore excluded.

         --max-missing <float>
           Exclude sites on the basis of the proportion of missing data
           (defined to be between 0 and 1, where 0 allows sites that are
           completely missing and 1 indicates no missing data allowed).

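           One way to read the "--max-missing" value, consistent with the
           description above, is as the minimum fraction of called
           (non-missing) genotypes a site needs in order to be kept. The
           sketch below computes that fraction for made-up genotypes; the
           comparison rule is an assumption drawn from the text, not from the
           vcftools source.

```shell
# Hypothetical genotypes for one site and a hypothetical threshold.
genotypes="0/0 ./. 0/1 1/1"
threshold=0.8

# Fraction of genotypes that are called. As a simplification, any
# genotype containing "." is treated as missing.
call_rate=$(echo "$genotypes" | tr ' ' '\n' | LC_ALL=C awk '
  index($0, ".") == 0 { called++ }
  END { printf "%.2f", called / NR }')

# Assumed rule: keep the site when the call rate reaches the threshold.
decision=$(LC_ALL=C awk -v r="$call_rate" -v t="$threshold" 'BEGIN {
  if (r >= t) print "keep"; else print "exclude"
}')
echo "call rate $call_rate, threshold $threshold: $decision"
```
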
         --max-missing-count <integer>
           Exclude sites with more than this number of missing genotypes over
           all individuals.

         --phased
           Excludes all sites that contain unphased genotypes.

   MISCELLANEOUS FILTERING
         --minQ <float>
           Includes only sites with Quality value above this threshold.

INDIVIDUAL FILTERING OPTIONS

       These options are used to include or exclude certain individuals from
       any analysis being performed by the program.
         --indv <string>
         --remove-indv <string>
           Specify an individual to be kept or removed from the analysis. This
           option can be used multiple times to specify multiple individuals.
           If both options are specified, then the "--indv" option is executed
           before the "--remove-indv" option.

         --keep <filename>
         --remove <filename>
           Provide files containing a list of individuals to either include or
           exclude in subsequent analysis. Each individual ID (as defined in
           the VCF header line) should be included on a separate line. If both
           options are used, then the "--keep" option is executed before the
           "--remove" option. When multiple files are provided, the
           individuals kept are the union of individuals from all keep files
           minus the union of individuals from all remove files. No header
           line is expected.

         --max-indv <integer>
           Randomly thins individuals so that only the specified number are
           retained.

GENOTYPE FILTERING OPTIONS

       These options are used to exclude genotypes from any analysis being
       performed by the program. If excluded, these values will be treated as
       missing.
         --remove-filtered-geno-all
           Excludes all genotypes with a FILTER flag not equal to "." (a
           missing value) or PASS.

         --remove-filtered-geno <string>
           Excludes genotypes with a specific FILTER flag.

         --minGQ <float>
           Exclude all genotypes with a quality below the threshold specified.
           This option requires that the "GQ" FORMAT tag is specified for all
           sites.

         --minDP <float>
         --maxDP <float>
           Includes only genotypes with a depth greater than or equal to the
           "--minDP" value and less than or equal to the "--maxDP" value.
           These options require that the "DP" FORMAT tag is specified for all
           sites.

OUTPUT OPTIONS

       These options specify which analyses or conversions to perform on the
       data that passed through all specified filters.

   OUTPUT ALLELE STATISTICS
         --freq
         --freq2
           Outputs the allele frequency for each site in a file with the
           suffix ".frq". The second option is used to suppress output of any
           information about the alleles.

         --counts
         --counts2
           Outputs the raw allele counts for each site in a file with the
           suffix ".frq.count". The second option is used to suppress output
           of any information about the alleles.

         --derived
           For use with the previous four frequency and count options only.
           Re-orders the output file columns so that the ancestral allele
           appears first. This option relies on the ancestral allele being
           specified in the VCF file using the AA tag in the INFO field.

   OUTPUT DEPTH STATISTICS
         --depth
           Generates a file containing the mean depth per individual. This
           file has the suffix ".idepth".

         --site-depth
           Generates a file containing the depth per site summed across all
           individuals. This output file has the suffix ".ldepth".

         --site-mean-depth
           Generates a file containing the mean depth per site averaged across
           all individuals. This output file has the suffix ".ldepth.mean".

         --geno-depth
           Generates a (possibly very large) file containing the depth for
           each genotype in the VCF file. Missing entries are given the value
           -1. The file has the suffix ".gdepth".

   OUTPUT LD STATISTICS
         --hap-r2
           Outputs a file reporting the r2, D, and D' statistics using phased
           haplotypes. These are the traditional measures of LD often reported
           in the population genetics literature. The output file has the
           suffix ".hap.ld". This option assumes that the VCF input file has
           phased haplotypes.

         --geno-r2
           Calculates the squared correlation coefficient between genotypes
           encoded as 0, 1 and 2 to represent the number of non-reference
           alleles in each individual. This is the same as the LD measure
           reported by PLINK. The D and D' statistics are only available for
           phased genotypes. The output file has the suffix ".geno.ld".

         --geno-chisq
           If your data contains sites with more than two alleles, then this
           option can be used to test for genotype independence via the
           chi-squared statistic. The output file has the suffix
           ".geno.chisq".

         --hap-r2-positions <positions list file>
         --geno-r2-positions <positions list file>
           Outputs a file reporting the r2 statistics of the sites contained
           in the provided file versus all other sites. The output files have
           the suffix ".list.hap.ld" or ".list.geno.ld", depending on which
           option is used.

         --ld-window <integer>
           This optional parameter defines the maximum number of SNPs between
           the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and
           "--geno-chisq" functions.

         --ld-window-bp <integer>
           This optional parameter defines the maximum number of physical
           bases between the SNPs being tested for LD in the "--hap-r2",
           "--geno-r2", and "--geno-chisq" functions.

         --ld-window-min <integer>
           This optional parameter defines the minimum number of SNPs between
           the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and
           "--geno-chisq" functions.

         --ld-window-bp-min <integer>
           This optional parameter defines the minimum number of physical
           bases between the SNPs being tested for LD in the "--hap-r2",
           "--geno-r2", and "--geno-chisq" functions.

         --min-r2 <float>
           This optional parameter sets a minimum value for r2, below which
           the LD statistic is not reported by the "--hap-r2", "--geno-r2",
           and "--geno-chisq" functions.

         --interchrom-hap-r2
         --interchrom-geno-r2
           Outputs a file reporting the r2 statistics for sites on different
           chromosomes. The output files have the suffix ".interchrom.hap.ld"
           or ".interchrom.geno.ld", depending on the option used.

   OUTPUT TRANSITION/TRANSVERSION STATISTICS
         --TsTv <integer>
           Calculates the Transition / Transversion ratio in bins of size
           defined by this option. Only uses bi-allelic SNPs. The resulting
           output file has the suffix ".TsTv".

         --TsTv-summary
           Calculates a simple summary of all Transitions and Transversions.
           The output file has the suffix ".TsTv.summary".

         --TsTv-by-count
           Calculates the Transition / Transversion ratio as a function of
           alternative allele count. Only uses bi-allelic SNPs. The resulting
           output file has the suffix ".TsTv.count".

         --TsTv-by-qual
           Calculates the Transition / Transversion ratio as a function of SNP
           quality threshold. Only uses bi-allelic SNPs. The resulting output
           file has the suffix ".TsTv.qual".

         --FILTER-summary
           Generates a summary of the number of SNPs and Ts/Tv ratio for each
           FILTER category. The output file has the suffix ".FILTER.summary".

   OUTPUT NUCLEOTIDE DIVERGENCE STATISTICS
         --site-pi
           Measures nucleotide diversity on a per-site basis. The output file
           has the suffix ".sites.pi".

         --window-pi <integer>
         --window-pi-step <integer>
           Measures the nucleotide diversity in windows, with the number
           provided as the window size. The output file has the suffix
           ".windowed.pi". The latter is an optional argument used to specify
           the step size in between windows.

   OUTPUT FST STATISTICS
         --weir-fst-pop <filename>
           This option is used to calculate an Fst estimate from Weir and
           Cockerham's 1984 paper. This is the preferred calculation of Fst.
           The provided file must contain a list of individuals (one
           individual per line) from the VCF file that correspond to one
           population. This option can be used multiple times to calculate Fst
           for more than two populations. These files will also be included as
           "--keep" options. By default, calculations are done on a per-site
           basis. The output file has the suffix ".weir.fst".

         --fst-window-size <integer>
         --fst-window-step <integer>
           These options can be used with "--weir-fst-pop" to do the Fst
           calculations on a windowed basis instead of a per-site basis. These
           arguments specify the desired window size and the desired step size
           between windows.

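           As a sketch of how the Fst options above fit together, the example
           below writes two population files and passes each to its own
           "--weir-fst-pop" option, with a windowed calculation requested.
           The sample IDs, file names (pop1.txt, pop2.txt, input_file.vcf,
           pop1_vs_pop2) and window sizes are placeholders, and the vcftools
           call only runs when vcftools and the assumed input file are
           present.

```shell
workdir=$(mktemp -d)
# Hypothetical population files: one individual ID per line.
printf 'NA00001\nNA00002\n' > "$workdir/pop1.txt"
printf 'NA00003\nNA00004\n' > "$workdir/pop2.txt"

# Each population file is passed with its own --weir-fst-pop option.
n_pops=$(ls "$workdir"/pop*.txt | wc -l | tr -d ' ')
echo "populations defined: $n_pops"

if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
  vcftools --vcf input_file.vcf \
    --weir-fst-pop "$workdir/pop1.txt" --weir-fst-pop "$workdir/pop2.txt" \
    --fst-window-size 100000 --fst-window-step 50000 \
    --out pop1_vs_pop2
fi
```
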
   OUTPUT OTHER STATISTICS
         --het
           Calculates a measure of heterozygosity on a per-individual basis.
           Specifically, the inbreeding coefficient, F, is estimated for each
           individual using a method of moments. The resulting file has the
           suffix ".het".

         --hardy
           Reports a p-value for each site from a Hardy-Weinberg Equilibrium
           test (as defined by Wigginton, Cutler and Abecasis (2005)). The
           resulting file (with suffix ".hwe") also contains the Observed
           numbers of Homozygotes and Heterozygotes and the corresponding
           Expected numbers under HWE.

         --TajimaD <integer>
           Outputs Tajima's D statistic in bins with size of the specified
           number. The output file has the suffix ".Tajima.D".

         --indv-freq-burden
           This option calculates the number of variants within each
           individual of a specific frequency. The resulting file has the
           suffix ".ifreqburden".

         --LROH
           This option will identify and output Long Runs of Homozygosity. The
           output file has the suffix ".LROH". This function is experimental,
           and will use a lot of memory if applied to large datasets.

         --relatedness
           This option is used to calculate and output a relatedness statistic
           based on the method of Yang et al., Nature Genetics 2010
           (doi:10.1038/ng.608). Specifically, it calculates the unadjusted
           Ajk statistic. The expectation of Ajk is zero for individuals
           within a population, and one for an individual with themselves. The
           output file has the suffix ".relatedness".

         --relatedness2
           This option is used to calculate and output a relatedness statistic
           based on the method of Manichaikul et al., BIOINFORMATICS 2010
           (doi:10.1093/bioinformatics/btq559). The output file has the suffix
           ".relatedness2".

         --site-quality
           Generates a file containing the per-site SNP quality, as found in
           the QUAL column of the VCF file. This file has the suffix ".lqual".

         --missing-indv
           Generates a file reporting the missingness on a per-individual
           basis. The file has the suffix ".imiss".

         --missing-site
           Generates a file reporting the missingness on a per-site basis. The
           file has the suffix ".lmiss".

         --SNPdensity <integer>
           Calculates the number and density of SNPs in bins of size defined
           by this option. The resulting output file has the suffix ".snpden".

         --kept-sites
           Creates a file listing all sites that have been kept after
           filtering. The file has the suffix ".kept.sites".

         --removed-sites
           Creates a file listing all sites that have been removed after
           filtering. The file has the suffix ".removed.sites".

         --singletons
           This option will generate a file detailing the location of
           singletons, and the individual they occur in. The file reports both
           true singletons, and private doubletons (i.e. SNPs where the minor
           allele only occurs in a single individual and that individual is
           homozygous for that allele). The output file has the suffix
           ".singletons".

         --hist-indel-len
           This option will generate a histogram file of the length of all
           indels (including SNPs). It shows both the count and the percentage
           of all indels for indel lengths that occur at least once in the
           input file. SNPs are considered indels with length zero. The output
           file has the suffix ".indel.hist".

         --hapcount <BED file>
           This option will output the number of unique haplotypes within
           user-specified bins, as defined by the BED file. The output file
           has the suffix ".hapcount".

         --mendel <PED file>
           This option is used to report Mendel errors identified in trios.
           The command requires a PLINK-style PED file, with the first four
           columns specifying a family ID, the child ID, the father ID, and
           the mother ID. The output of this command has the suffix ".mendel".

         --extract-FORMAT-info <string>
           Extract information from the genotype fields in the VCF file
           relating to a specified FORMAT identifier. The resulting output
           file has the suffix ".<FORMAT_ID>.FORMAT". For example, the
           following command would extract all of the GT (i.e. Genotype)
           entries:
             vcftools --vcf file1.vcf --extract-FORMAT-info GT

         --get-INFO <string>
           This option is used to extract information from the INFO field in
           the VCF file. The <string> argument specifies the INFO tag to be
           extracted, and the option can be used multiple times in order to
           extract multiple INFO entries. The resulting file, with suffix
           ".INFO", contains the required INFO information in a tab-separated
           table. For example, to extract the NS and DB flags, one would use
           the command:
             vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

   OUTPUT VCF FORMAT
         --recode
         --recode-bcf
           These options are used to generate a new file in either VCF or BCF
           from the input VCF or BCF file after applying the filtering options
           specified by the user. The output file has the suffix ".recode.vcf"
           or ".recode.bcf". By default, the INFO fields are removed from the
           output file, as the INFO values may be invalidated by the recoding
           (e.g. the total depth may need to be recalculated if individuals
           are removed). This behavior may be overridden by the following
           options. By default, BCF files are written out as BGZF compressed
           files.

         --recode-INFO <string>
         --recode-INFO-all
           These options can be used with the above recode options to define
           an INFO key name to keep in the output file. This option can be
           used multiple times to keep more of the INFO fields. The second
           option is used to keep all INFO values in the original file.

         --contigs <string>
           This option can be used in conjunction with --recode-bcf when the
           input file does not have any contig declarations. This option
           expects a file name with one contig header per line. These lines
           are included in the output file.

   OUTPUT OTHER FORMATS
         --012
           This option outputs the genotypes as a large matrix. Three files
           are produced. The first, with suffix ".012", contains the genotypes
           of each individual on a separate line. Genotypes are represented as
           0, 1 and 2, where the number represents the number of non-reference
           alleles. Missing genotypes are represented by -1. The second file,
           with suffix ".012.indv", details the individuals included in the
           main file. The third file, with suffix ".012.pos", details the site
           locations included in the main file.

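           The 0/1/2 encoding described above can be illustrated with a small
           helper that counts non-reference alleles in a diploid genotype
           string. This is a sketch of the encoding only, not vcftools code;
           the function name gt_to_012 is hypothetical, and any genotype
           containing "." is treated as missing as a simplification.

```shell
# gt_to_012 GENOTYPE
# Prints the number of non-reference alleles in a diploid genotype, or
# -1 when the genotype is missing. Illustrative helper only.
gt_to_012() {
  printf '%s\n' "$1" | awk '{
    if (index($0, ".") > 0) { print -1; next }   # missing genotype
    n = split($0, a, "[/|]")                     # unphased "/" or phased "|"
    count = 0
    for (i = 1; i <= n; i++) if (a[i] != "0") count++
    print count
  }'
}

gt_to_012 "0/0"    # homozygous reference
gt_to_012 "0/1"    # heterozygous
gt_to_012 "1|1"    # homozygous non-reference
gt_to_012 "./."    # missing
```
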
612         --IMPUTE
613           This option outputs phased  haplotypes  in  IMPUTE  reference-panel
614           format.  As  IMPUTE  requires  phased  data, using this option also
615           implies --phased. Unphased individuals and genotypes are  therefore
616           excluded.  Only  bi-allelic sites are included in the output. Using
617           this option generates three files. The IMPUTE  haplotype  file  has
618           the suffix ".impute.hap", and the IMPUTE legend file has the suffix
619           ".impute.hap.legend".    The    third     file,     with     suffix
620           ".impute.hap.indv",  details the individuals included in the haplo‐
621           type file, although this file is not needed by IMPUTE.
622
         --ldhat
         --ldhelmet
         --ldhat-geno
           These options output data in LDhat/LDhelmet format and require the
           "--chr" filter option to also be used. The first two options
           output phased data only, and therefore imply "--phased", so
           unphased individuals and genotypes are excluded. For LDhelmet,
           only SNPs are considered, so "--ldhelmet" also implies
           "--remove-indels". The "--ldhat-geno" option treats all of the
           data as unphased, and therefore outputs LDhat files in
           genotype/unphased format. For LDhat, two output files are
           generated with the suffixes ".ldhat.sites" and ".ldhat.locs",
           which correspond to the LDhat "sites" and "locs" input files
           respectively; for LDhelmet, the two files generated have the
           suffixes ".ldhelmet.snps" and ".ldhelmet.pos", which correspond to
           the "SNPs" and "positions" files.

         --BEAGLE-GL
         --BEAGLE-PL
           These options output genotype likelihood information for input into
           the BEAGLE program. The VCF file is required to contain FORMAT
           fields with "GL" or "PL" tags, which can generally be output by SNP
           callers such as the GATK. Use of these options requires a
           chromosome to be specified via the "--chr" option. The resulting
           output file has the suffix ".BEAGLE.GL" or ".BEAGLE.PL" and
           contains genotype likelihoods for biallelic sites. This file is
           suitable for input into BEAGLE via the "like=" argument.

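           Because a chromosome must be given via "--chr", data spanning
           several chromosomes can be processed one chromosome at a time. A
           minimal sketch (beagle_gl_by_chrom is a hypothetical helper name,
           and the "chr<name>" output prefix is an arbitrary choice) that
           runs --BEAGLE-GL once per chromosome, using only options
           documented in this manual:

```shell
# beagle_gl_by_chrom VCF CHROM...
# Runs vcftools once per chromosome; each run writes chr<CHROM>.BEAGLE.GL.
# Stops at the first chromosome for which vcftools fails.
beagle_gl_by_chrom() {
    vcf=$1
    shift
    for chrom in "$@"; do
        vcftools --vcf "$vcf" --chr "$chrom" --BEAGLE-GL --out "chr$chrom" || return 1
    done
}
```

           Possible usage, for chromosomes named 1, 2 and 3:
             beagle_gl_by_chrom input_file.vcf 1 2 3
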
         --plink
         --plink-tped
         --chrom-map
           These options output the genotype data in PLINK PED format. With
           the first option, two files are generated, with suffixes ".ped" and
           ".map". Note that only bi-allelic loci will be output. Further
           details of these files can be found in the PLINK documentation.
           Note: The first option can be very slow on large datasets. Using
           the --chr option to divide up the dataset is advised;
           alternatively, use the --plink-tped option, which outputs the files
           in the PLINK transposed format with suffixes ".tped" and ".tfam".
           For usage with variant sites in species other than humans, the
           --chrom-map option may be used to specify a file name that has a
           tab-delimited mapping of chromosome name to a desired integer value
           with one line per chromosome. This file must contain a mapping for
           every chromosome value found in the input file.

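           A chromosome map can be generated from the VCF itself. The sketch
           below (make_chrom_map is a hypothetical helper name) numbers the
           chromosomes 1, 2, 3, ... in order of first appearance in an
           uncompressed VCF. The numbering scheme is an arbitrary assumption;
           any desired integers may be used instead.

```shell
# make_chrom_map VCF
# Prints "<chromosome><TAB><integer>" for each distinct chromosome name,
# numbered in order of first appearance in the (uncompressed) VCF.
make_chrom_map() {
    grep -v '^#' "$1" |      # drop header lines
        cut -f1 |            # keep the CHROM column
        awk '!seen[$0]++ { n++; print $0 "\t" n }'
}
```

           Possible usage:
             make_chrom_map input_file.vcf > chrom_map.txt
             vcftools --vcf input_file.vcf --plink --chrom-map chrom_map.txt --out output
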

COMPARISON OPTIONS

       These options are used to compare the original variant file to another
       variant file and output the results. All of the diff functions require
       both files to contain the same chromosomes and to be sorted in the
       same order. If one of the files contains chromosomes that the other
       file does not, use the --not-chr filter to remove them from the
       analysis.

   DIFF VCF FILE
         --diff <filename>
         --gzdiff <filename>
         --diff-bcf <filename>
           These options compare the original input file to the specified
           VCF, gzipped VCF, or BCF file. They must be combined with one of
           the additional options described below, which specifies the type
           of comparison to be performed. See the examples section for
           typical usage.

   DIFF OPTIONS
         --diff-site
           Outputs the sites that are common / unique to each file. The output
           file has the suffix ".diff.sites_in_files".

         --diff-indv
           Outputs the individuals that are common / unique to each file. The
           output file has the suffix ".diff.indv_in_files".

         --diff-site-discordance
           This option calculates discordance on a site-by-site basis. The
           resulting output file has the suffix ".diff.sites".

         --diff-indv-discordance
           This option calculates discordance on a per-individual basis. The
           resulting output file has the suffix ".diff.indv".

         --diff-indv-map <filename>
           This option allows the user to specify a mapping of individual IDs
           in the second file to those in the first file. The program expects
           each line of the file to contain an individual's name in file one
           followed, after a tab, by that same individual's name in file two,
           with one mapping per line.

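           A mapping file written by hand can easily end up with spaces in
           place of tabs. As a sketch, the hypothetical helper below
           (check_indv_map is not part of vcftools) verifies that every line
           holds exactly two non-empty, tab-separated names before the file
           is passed to --diff-indv-map:

```shell
# check_indv_map FILE
# Succeeds only if every line is "<name in file one><TAB><name in file two>".
# Reports each offending line number on standard output.
check_indv_map() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected two tab-separated names\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

           Possible usage, assuming a mapping file named indv_map.txt:
             check_indv_map indv_map.txt
             vcftools --vcf input_file1.vcf --diff input_file2.vcf
             --diff-indv-map indv_map.txt --diff-indv-discordance --out
             in1_v_in2
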
         --diff-discordance-matrix
           This option calculates a discordance matrix. This option only works
           with bi-allelic loci with matching alleles that are present in both
           files. The resulting output file has the suffix
           ".diff.discordance.matrix".

         --diff-switch-error
           This option calculates phasing errors (specifically "switch
           errors"). This option creates an output file describing switch
           errors found between sites, with suffix ".diff.switch".


AUTHORS

       Adam Auton
       Anthony Marcketta