NAME
vcftools v0.1.16 - Utilities for the variant call format (VCF) and binary variant call format (BCF)
SYNOPSIS
vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT PREFIX ] [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]
DESCRIPTION
vcftools is a suite of functions for use on genetic variation data in the form of VCF and BCF files. The tools provided will be used mainly to summarize data, run calculations on data, filter out data, and convert data into other useful file formats.
EXAMPLES
Output allele frequency for all sites in the input vcf file from chromosome 1
    vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis
Output a new vcf file from the input vcf file that removes any indel sites
    vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only
Output file comparing the sites in two vcf files
    vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out in1_v_in2
Output a new vcf file to standard out without any sites that have a filter tag, then compress it with gzip
    vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > output_PASS_only.vcf.gz
Output a Hardy-Weinberg p-value for every site in the bcf file that does not have any missing genotypes
    vcftools --bcf input_file.bcf --hardy --max-missing 1.0 --out output_noMissing
Output nucleotide diversity at a list of positions
    zcat input_file.vcf.gz | vcftools --vcf - --site-pi --positions SNP_list.txt --out nucleotide_diversity
BASIC OPTIONS
These options are used to specify the input and output files.
INPUT FILE OPTIONS
--vcf <input_filename>
This option defines the VCF file to be processed.
VCFtools expects files in VCF format v4.0, v4.1 or v4.2. The latter two are
supported with some small limitations. If the user provides a dash character
'-' as a file name, the program expects a VCF file to be piped in through
standard in.
--gzvcf <input_filename> This option can be used in place of the --vcf option to
read compressed (gzipped) VCF files directly.
--bcf <input_filename> This option can be used in place of the --vcf option to
read BCF2 files directly. You do not need to specify if this file is
compressed with BGZF encoding. If the user provides a dash character '-' as a
file name, the program expects a BCF2 file to be piped in through standard
in.
OUTPUT FILE OPTIONS
--out <output_prefix>
This option defines the output filename prefix for all
files generated by vcftools. For example, if <output_prefix> is set to
output_filename, then all output files will be of the form output_filename.***.
If this option is omitted, all output files will have the prefix
"out." in the current working directory.
--stdout
-c
These options direct the vcftools output to standard out
so it can be piped into another program or written directly to a filename of
choice. However, a select few output functions cannot be written to standard
out.
--temp <temporary_directory> This option can be used to redirect any temporary files
that vcftools creates into a specified directory.
SITE FILTERING OPTIONS
These options are used to include or exclude certain sites from any analysis being performed by the program.
POSITION FILTERING
--chr <chromosome>
--not-chr <chromosome>
Includes or excludes sites with identifiers matching
<chromosome>. These options may be used multiple times to include or
exclude more than one chromosome.
--from-bp <integer>
--to-bp <integer>
These options specify a lower bound and upper bound for a
range of sites to be processed. Sites with positions less than or greater than
these values will be excluded. These options can only be used in conjunction
with a single usage of --chr. Using one of these does not require use of the
other.
--positions <filename>
--exclude-positions <filename>
Include or exclude a set of sites on the basis of a list
of positions in a file. Each line of the input file should contain a
(tab-separated) chromosome and position. The file can have comment lines that
start with a "#"; they will be ignored.
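The positions file format described above can be sketched as follows. The chromosome names, coordinates, and the input file name `input_file.vcf` are placeholders invented for illustration; the vcftools call reuses only options documented on this page and is skipped when vcftools or the input file is not present.

```shell
# Build a small positions list: one tab-separated chromosome and position
# per line. A leading "#" marks a comment line, which vcftools ignores.
printf '# chrom\tpos\n' > SNP_list.txt
printf '1\t10583\n1\t13302\n2\t20144\n' >> SNP_list.txt

# Run vcftools only if it is installed and the (hypothetical) input exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --positions SNP_list.txt \
        --site-pi --out nucleotide_diversity
fi
```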
--positions-overlap <filename>
--exclude-positions-overlap <filename>
Include or exclude a set of sites on the basis of the
reference allele overlapping with a list of positions in a file. Each line of
the input file should contain a (tab-separated) chromosome and position. The
file can have comment lines that start with a "#"; they will be
ignored.
--bed <filename>
--exclude-bed <filename>
Include or exclude a set of sites on the basis of a BED
file. Only the first three columns (chrom, chromStart and chromEnd) are
required. The BED file is expected to have a header line. A site will be kept
or excluded if any part of any allele (REF or ALT) at a site is within the
range of one of the BED entries.
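A minimal sketch of a BED file in the shape described above, with a header line followed by the three required columns. The regions and the input file name are invented for illustration, and the vcftools call is skipped unless both the tool and the file are available.

```shell
# BED file with the header line that vcftools expects, followed by
# tab-separated chrom, chromStart and chromEnd columns (made-up regions).
printf 'chrom\tchromStart\tchromEnd\n' > regions.bed
printf '1\t10000\t20000\n2\t500000\t510000\n' >> regions.bed

# Keep only sites overlapping the listed regions and write a new VCF.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --bed regions.bed \
        --recode --recode-INFO-all --out in_regions
fi
```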
--thin <integer> Thin sites so that no two sites are within the specified
distance from one another.
--mask <filename>
--invert-mask <filename>
--mask-min <integer>
These options are used to specify a FASTA-like mask file
to filter with. The mask file contains a sequence of integer digits (between 0
and 9) for each position on a chromosome that specify if a site at that
position should be filtered or not.
An example mask file would look like:
    >1
    0000011111222...
    >2
    2222211111000...
In this example, sites in the VCF file located within the first 5 bases of the
start of chromosome 1 would be kept, whereas sites at position 6 onwards would
be filtered out. Sites before the 11th position on chromosome 2 would be
filtered out as well.
The "--invert-mask" option takes the same format mask file as the "--mask" option, but inverts the mask file before filtering with it. The "--mask-min" option specifies a threshold mask value between 0 and 9 to filter positions by. The default threshold is 0, meaning only sites with that value or lower will be kept.
SITE ID FILTERING
--snp <string>
Include SNP(s) with matching ID (e.g. a dbSNP rsID). This
command can be used multiple times in order to include more than one
SNP.
--snps <filename>
--exclude <filename>
Include or exclude a list of SNPs given in a file. The
file should contain a list of SNP IDs (e.g. dbSNP rsIDs), with one ID per
line. No header line is expected.
VARIANT TYPE FILTERING
--keep-only-indels
--remove-indels Include or exclude sites that contain an indel. For these
options "indel" means any variant that alters the length of the REF
allele.
FILTER FLAG FILTERING
--remove-filtered-all
Removes all sites with a FILTER flag other than
PASS.
--keep-filtered <string>
--remove-filtered <string>
Includes or excludes all sites marked with a specific
FILTER flag. These options may be used more than once to specify multiple
FILTER flags.
INFO FIELD FILTERING
--keep-INFO <string>
--remove-INFO <string> Includes or excludes all sites with a specific INFO flag.
These options only filter on the presence of the flag and not its value. These
options can be used multiple times to specify multiple INFO flags.
ALLELE FILTERING
--maf <float>
--max-maf <float> Include only sites with a Minor Allele Frequency greater
than or equal to the "--maf" value and less than or equal to the
"--max-maf" value. One of these options may be used without the
other. Allele frequency is defined as the number of times an allele appears
over all individuals at that site, divided by the total number of non-missing
alleles at that site.
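As a worked illustration of the allele frequency definition above, the following sketch counts alleles by hand for one hypothetical bi-allelic site. The genotype strings are invented, and the awk logic only illustrates the definition; it is not vcftools' own implementation.

```shell
# Genotypes of five hypothetical individuals at one bi-allelic site.
# "./." is a missing genotype and contributes no alleles to the total.
alt_freq=$(printf '%s\n' 0/0 0/1 ./. 1/1 0/0 | awk -F'[/|]' '
    {
        for (i = 1; i <= NF; i++) {
            if ($i == ".") continue      # skip missing alleles
            total++                      # count every non-missing allele
            if ($i == "1") alt++         # count copies of the ALT allele
        }
    }
    END { printf "%.3f\n", alt / total }')

echo "ALT allele frequency: $alt_freq"
```

Here the ALT allele appears 3 times among 8 non-missing alleles, giving 0.375. Because that is below 0.5, it is also the minor allele frequency for this site, so under the definition above the site would satisfy "--maf 0.3" but not "--maf 0.4".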
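--non-ref-af <float>
--max-non-ref-af <float>
--non-ref-ac <integer>
--max-non-ref-ac <integer>
--non-ref-af-any <float>
--max-non-ref-af-any <float>
--non-ref-ac-any <integer>
--max-non-ref-ac-any <integer>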
Include only sites with all Non-Reference (ALT) Allele
Frequencies (af) or Counts (ac) within the range specified, and including the
specified value. The default options require all alleles to meet the
specified criteria, whereas the options appended with "any" require
only one allele to meet the criteria. The Allele frequency is defined as the
number of times an allele appears over all individuals at that site, divided
by the total number of non-missing alleles at that site.
--mac <integer>
--max-mac <integer>
Include only sites with Minor Allele Count greater than
or equal to the "--mac" value and less than or equal to the
"--max-mac" value. One of these options may be used without the
other. Allele count is simply the number of times that allele appears over all
individuals at that site.
--min-alleles <integer>
--max-alleles <integer>
Include only sites with a number of alleles greater than
or equal to the "--min-alleles" value and less than or equal to the
"--max-alleles" value. One of these options may be used without the
other.
For example, to include only bi-allelic sites, one could use:
    vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2
GENOTYPE VALUE FILTERING
--min-meanDP <float>
--max-meanDP <float> Includes only sites with mean depth values (over all
included individuals) greater than or equal to the "--min-meanDP"
value and less than or equal to the "--max-meanDP" value. One of
these options may be used without the other. These options require that the
"DP" FORMAT tag is included for each site.
--hwe <float> Assesses sites for Hardy-Weinberg Equilibrium using an
exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a
p-value below the threshold defined by this option are taken to be out of HWE,
and therefore excluded.
--max-missing <float> Exclude sites on the basis of the proportion of missing
data (defined to be between 0 and 1, where 0 allows sites that are completely
missing and 1 indicates no missing data allowed).
--max-missing-count <integer> Exclude sites with more than this number of missing
genotypes over all individuals.
--phased Excludes all sites that contain unphased genotypes.
MISCELLANEOUS FILTERING
--minQ <float>
Includes only sites with Quality value above this
threshold.
INDIVIDUAL FILTERING OPTIONS
These options are used to include or exclude certain individuals from any analysis being performed by the program.
--indv <string>
--remove-indv <string> Specify an individual to be kept or removed from the
analysis. This option can be used multiple times to specify multiple
individuals. If both options are specified, then the "--indv" option
is executed before the "--remove-indv" option.
--keep <filename>
--remove <filename>
Provide files containing a list of individuals to either
include or exclude in subsequent analysis. Each individual ID (as defined in
the VCF headerline) should be included on a separate line. If both options are
used, then the "--keep" option is executed before the
"--remove" option. When multiple files are provided, the individuals kept are
the union of individuals from all keep files minus the union of individuals
from all remove files. No header line is expected.
--max-indv <integer> Randomly thins individuals so that only the specified
number are retained.
GENOTYPE FILTERING OPTIONS
These options are used to exclude genotypes from any analysis being performed by the program. If excluded, these values will be treated as missing.
--remove-filtered-geno-all
Excludes all genotypes with a FILTER flag not equal to
"." (a missing value) or PASS.
--remove-filtered-geno <string> Excludes genotypes with a specific FILTER flag.
--minGQ <float> Exclude all genotypes with a quality below the threshold
specified. This option requires that the "GQ" FORMAT tag is
specified for all sites.
--minDP <float>
--maxDP <float>
Includes only genotypes with a depth greater than or equal to the
"--minDP" value and less than or equal to the "--maxDP"
value. This option requires that the "DP" FORMAT tag is specified
for all sites.
OUTPUT OPTIONS
These options specify which analyses or conversions to perform on the data that passed through all specified filters.
OUTPUT ALLELE STATISTICS
--freq
--freq2 Outputs the allele frequency for each site in a file with
the suffix ".frq". The second option is used to suppress output of
any information about the alleles.
--counts
--counts2
Outputs the raw allele counts for each site in a file
with the suffix ".frq.count". The second option is used to suppress
output of any information about the alleles.
--derived For use with the previous four frequency and count
options only. Re-orders the output file columns so that the ancestral allele
appears first. This option relies on the ancestral allele being specified in
the VCF file using the AA tag in the INFO field.
OUTPUT DEPTH STATISTICS
--depth
Generates a file containing the mean depth per
individual. This file has the suffix ".idepth".
--site-depth Generates a file containing the depth per site summed
across all individuals. This output file has the suffix
".ldepth".
--site-mean-depth Generates a file containing the mean depth per site
averaged across all individuals. This output file has the suffix
".ldepth.mean".
--geno-depth Generates a (possibly very large) file containing the
depth for each genotype in the VCF file. Missing entries are given the value
-1. The file has the suffix ".gdepth".
OUTPUT LD STATISTICS
--hap-r2
Outputs a file reporting the r2, D, and D' statistics
using phased haplotypes. These are the traditional measures of LD often
reported in the population genetics literature. The output file has the suffix
".hap.ld". This option assumes that the VCF input file has phased
haplotypes.
--geno-r2 Calculates the squared correlation coefficient between
genotypes encoded as 0, 1 and 2 to represent the number of non-reference
alleles in each individual. This is the same as the LD measure reported by
PLINK. The D and D' statistics are only available for phased genotypes. The
output file has the suffix ".geno.ld".
--geno-chisq If your data contains sites with more than two alleles,
then this option can be used to test for genotype independence via the
chi-squared statistic. The output file has the suffix
".geno.chisq".
--hap-r2-positions <positions list file>
--geno-r2-positions <positions list file>
Outputs a file reporting the r2 statistics of the sites
contained in the provided file versus all other sites. The output files have
the suffix ".list.hap.ld" or ".list.geno.ld", depending on
which option is used.
--ld-window <integer> This optional parameter defines the maximum number of
SNPs between the SNPs being tested for LD in the "--hap-r2",
"--geno-r2", and "--geno-chisq" functions.
--ld-window-bp <integer> This optional parameter defines the maximum number of
physical bases between the SNPs being tested for LD in the
"--hap-r2", "--geno-r2", and "--geno-chisq"
functions.
--ld-window-min <integer> This optional parameter defines the minimum number of
SNPs between the SNPs being tested for LD in the "--hap-r2",
"--geno-r2", and "--geno-chisq" functions.
--ld-window-bp-min <integer> This optional parameter defines the minimum number of
physical bases between the SNPs being tested for LD in the
"--hap-r2", "--geno-r2", and "--geno-chisq"
functions.
--min-r2 <float> This optional parameter sets a minimum value for r2,
below which the LD statistic is not reported by the "--hap-r2",
"--geno-r2", and "--geno-chisq" functions.
--interchrom-hap-r2
--interchrom-geno-r2
Outputs a file reporting the r2 statistics for sites on
different chromosomes. The output files have the suffix
".interchrom.hap.ld" or ".interchrom.geno.ld", depending
on the option used.
OUTPUT TRANSITION/TRANSVERSION STATISTICS
--TsTv <integer>
Calculates the Transition / Transversion ratio in bins of
size defined by this option. Only uses bi-allelic SNPs. The resulting output
file has the suffix ".TsTv".
--TsTv-summary Calculates a simple summary of all Transitions and
Transversions. The output file has the suffix ".TsTv.summary".
--TsTv-by-count Calculates the Transition / Transversion ratio as a
function of alternative allele count. Only uses bi-allelic SNPs. The resulting
output file has the suffix ".TsTv.count".
--TsTv-by-qual Calculates the Transition / Transversion ratio as a
function of SNP quality threshold. Only uses bi-allelic SNPs. The resulting
output file has the suffix ".TsTv.qual".
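The Ts/Tv options above rest on classifying each bi-allelic SNP as a transition (A<->G or C<->T) or a transversion (any other single-base change). The sketch below applies that classification to a handful of made-up REF/ALT pairs and reports the ratio. It illustrates the statistic only and does not reproduce vcftools' binning or output format.

```shell
# Hypothetical REF/ALT pairs for six bi-allelic SNPs, one pair per line.
tstv=$(printf '%s\n' "A G" "C T" "G A" "A C" "T C" "G T" | awk '
    {
        pair = $1 $2
        if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") {
            ts++        # purine<->purine or pyrimidine<->pyrimidine
        } else {
            tv++        # any other single-base substitution
        }
    }
    END { printf "Ts=%d Tv=%d ratio=%.2f\n", ts, tv, ts / tv }')

echo "$tstv"
```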
--FILTER-summary Generates a summary of the number of SNPs and Ts/Tv ratio
for each FILTER category. The output file has the suffix
".FILTER.summary".
OUTPUT NUCLEOTIDE DIVERGENCE STATISTICS
--site-pi
Measures nucleotide diversity on a per-site basis. The
output file has the suffix ".sites.pi".
--window-pi <integer>
--window-pi-step <integer>
Measures the nucleotide diversity in windows, with the
number provided as the window size. The output file has the suffix
".windowed.pi". The latter is an optional argument used to specify
the step size in between windows.
OUTPUT FST STATISTICS
--weir-fst-pop <filename>
This option is used to calculate an Fst estimate from
Weir and Cockerham's 1984 paper. This is the preferred calculation of Fst. The
provided file must contain a list of individuals (one individual per line)
from the VCF file that correspond to one population. This option can be used
multiple times to calculate Fst for more than two populations. These files
will also be included as "--keep" options. By default, calculations
are done on a per-site basis. The output file has the suffix
".weir.fst".
--fst-window-size <integer>
--fst-window-step <integer>
These options can be used with "--weir-fst-pop"
to do the Fst calculations on a windowed basis instead of a per-site basis.
These arguments specify the desired window size and the desired step size
between windows.
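A per-site Fst run between two populations, as described for "--weir-fst-pop" above, might be set up as in the sketch below. The sample names, file names, and input VCF are placeholders; the IDs would need to match those in the header of a real VCF file. The vcftools call is skipped when the tool or the input file is not present.

```shell
# One individual ID per line, one file per population (placeholder names).
printf '%s\n' sampleA1 sampleA2 sampleA3 > population_1.txt
printf '%s\n' sampleB1 sampleB2 sampleB3 > population_2.txt

# Repeat --weir-fst-pop once per population file.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf \
        --weir-fst-pop population_1.txt \
        --weir-fst-pop population_2.txt \
        --out pop1_vs_pop2
fi
```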
OUTPUT OTHER STATISTICS
--het
Calculates a measure of heterozygosity on a
per-individual basis. Specifically, the inbreeding coefficient, F, is estimated
for each individual using a method of moments. The resulting file has the
suffix ".het".
--hardy Reports a p-value for each site from a Hardy-Weinberg
Equilibrium test (as defined by Wigginton, Cutler and Abecasis (2005)). The
resulting file (with suffix ".hwe") also contains the Observed
numbers of Homozygotes and Heterozygotes and the corresponding Expected
numbers under HWE.
--TajimaD <integer> Outputs Tajima's D statistic in bins with size of the
specified number. The output file has the suffix ".Tajima.D".
--indv-freq-burden This option calculates the number of variants within each
individual of a specific frequency. The resulting file has the suffix
".ifreqburden".
--LROH This option will identify and output Long Runs of
Homozygosity. The output file has the suffix ".LROH". This function
is experimental, and will use a lot of memory if applied to large
datasets.
--relatedness This option is used to calculate and output a relatedness
statistic based on the method of Yang et al, Nature Genetics 2010
(doi:10.1038/ng.608). Specifically, it calculates the unadjusted Ajk statistic.
The expectation of Ajk is zero for individuals within a population, and one for
an individual with themselves. The output file has the suffix
".relatedness".
--relatedness2 This option is used to calculate and output a relatedness
statistic based on the method of Manichaikul et al., BIOINFORMATICS 2010
(doi:10.1093/bioinformatics/btq559). The output file has the suffix
".relatedness2".
--site-quality Generates a file containing the per-site SNP quality, as
found in the QUAL column of the VCF file. This file has the suffix
".lqual".
--missing-indv Generates a file reporting the missingness on a
per-individual basis. The file has the suffix ".imiss".
--missing-site Generates a file reporting the missingness on a per-site
basis. The file has the suffix ".lmiss".
--SNPdensity <integer> Calculates the number and density of SNPs in bins of size
defined by this option. The resulting output file has the suffix
".snpden".
--kept-sites Creates a file listing all sites that have been kept
after filtering. The file has the suffix ".kept.sites".
--removed-sites Creates a file listing all sites that have been removed
after filtering. The file has the suffix ".removed.sites".
--singletons This option will generate a file detailing the location
of singletons, and the individual they occur in. The file reports both true
singletons, and private doubletons (i.e. SNPs where the minor allele only
occurs in a single individual and that individual is homozygous for that
allele). The output file has the suffix ".singletons".
--hist-indel-len This option will generate a histogram file of the length
of all indels (including SNPs). It shows both the count and the percentage of
all indels for indel lengths that occur at least once in the input file. SNPs
are considered indels with length zero. The output file has the suffix
".indel.hist".
--hapcount <BED file> This option will output the number of unique haplotypes
within user specified bins, as defined by the BED file. The output file has
the suffix ".hapcount".
--mendel <PED file> This option is used to report Mendelian errors identified in
trios. The command requires a PLINK-style PED file, with the first four
columns specifying a family ID, the child ID, the father ID, and the mother
ID. The output of this command has the suffix ".mendel".
--extract-FORMAT-info <string> Extract information from the genotype fields in the VCF
file relating to a specified FORMAT identifier. The resulting output file has
the suffix ".<FORMAT_ID>.FORMAT". For example, the following
command would extract all of the GT (i.e. Genotype) entries:
    vcftools --vcf file1.vcf --extract-FORMAT-info GT
--get-INFO <string> This option is used to extract information from the INFO
field in the VCF file. The <string> argument specifies the INFO tag to
be extracted, and the option can be used multiple times in order to extract
multiple INFO entries. The resulting file, with suffix ".INFO",
contains the required INFO information in a tab-separated table. For example,
to extract the NS and DB flags, one would use the command:
    vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB
OUTPUT VCF FORMAT
--recode
--recode-bcf These options are used to generate a new file in either
VCF or BCF from the input VCF or BCF file after applying the filtering options
specified by the user. The output file has the suffix ".recode.vcf"
or ".recode.bcf". By default, the INFO fields are removed from the
output file, as the INFO values may be invalidated by the recoding (e.g. the
total depth may need to be recalculated if individuals are removed). This
behavior may be overridden by the following options. By default, BCF files are
written out as BGZF compressed files.
--recode-INFO <string>
--recode-INFO-all
These options can be used with the above recode options
to define an INFO key name to keep in the output file. This option can be used
multiple times to keep more of the INFO fields. The second option is used to
keep all INFO values in the original file.
--contigs <string> This option can be used in conjunction with the
--recode-bcf option when the input file does not have any contig declarations. This
option expects a file name with one contig header per line. These lines are
included in the output file.
OUTPUT OTHER FORMATS
--012
This option outputs the genotypes as a large matrix.
Three files are produced. The first, with suffix ".012", contains
the genotypes of each individual on a separate line. Genotypes are represented
as 0, 1 and 2, where the number represents the number of non-reference
alleles. Missing genotypes are represented by -1. The second file, with suffix
".012.indv" details the individuals included in the main file. The
third file, with suffix ".012.pos" details the site locations
included in the main file.
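The 0/1/2 encoding described above can be sketched as a small shell function that counts the non-reference alleles in a GT string and returns -1 for a missing genotype. The function name `encode_012` is hypothetical, and treating a partially missing genotype (such as "0/.") as missing is an assumption of this sketch rather than documented vcftools behavior.

```shell
# Convert one GT string to the number of non-reference alleles,
# or -1 if any allele is missing (an assumption, see the lead-in).
encode_012() {
    printf '%s\n' "$1" | awk -F'[/|]' '
        {
            n = 0
            for (i = 1; i <= NF; i++) {
                if ($i == ".") { print -1; exit }   # missing genotype
                if ($i != "0") n++                  # non-reference allele
            }
            print n
        }'
}

# Both phased ("|") and unphased ("/") separators are accepted.
for gt in 0/0 0/1 "1|1" ./.; do
    printf '%s -> %s\n' "$gt" "$(encode_012 "$gt")"
done
```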
--IMPUTE This option outputs phased haplotypes in IMPUTE
reference-panel format. As IMPUTE requires phased data, using this option also
implies --phased. Unphased individuals and genotypes are therefore excluded.
Only bi-allelic sites are included in the output. Using this option generates
three files. The IMPUTE haplotype file has the suffix ".impute.hap",
and the IMPUTE legend file has the suffix ".impute.hap.legend". The
third file, with suffix ".impute.hap.indv", details the individuals
included in the haplotype file, although this file is not needed by
IMPUTE.
--ldhat
--ldhelmet
--ldhat-geno
These options output data in LDhat/LDhelmet format. These
options require the "--chr" filter option to also be used. The first two
options output phased data only, and therefore also imply
"--phased", leading to unphased individuals and genotypes
being excluded. For LDhelmet, only SNPs are considered, and it therefore
implies "--remove-indels". The "--ldhat-geno" option treats all of the data
as unphased, and therefore outputs LDhat files in genotype/unphased format.
Two output files are generated with the suffixes ".ldhat.sites" and
".ldhat.locs", which correspond to the LDhat "sites" and
"locs" input files respectively; for LDhelmet, the two files
generated have the suffixes ".ldhelmet.snps" and
".ldhelmet.pos", which correspond to the "SNPs" and
"positions" files.
--BEAGLE-GL
--BEAGLE-PL
These options output genotype likelihood information for
input into the BEAGLE program. The VCF file is required to contain FORMAT
fields with "GL" or "PL" tags, which can generally be
output by SNP callers such as the GATK. Use of this option requires a
chromosome to be specified via the "--chr" option. The resulting
output file has the suffix ".BEAGLE.GL" or ".BEAGLE.PL"
and contains genotype likelihoods for biallelic sites. This file is suitable
for input into BEAGLE via the "like=" argument.
--plink
--plink-tped
--chrom-map <filename>
These options output the genotype data in PLINK PED
format. With the first option, two files are generated, with suffixes
".ped" and ".map". Note that only bi-allelic loci will be
output. Further details of these files can be found in the PLINK
documentation.
Note: The first option can be very slow on large datasets. Using the --chr option to divide up the dataset is advised, or alternatively use the --plink-tped option which outputs the files in the PLINK transposed format with suffixes ".tped" and ".tfam".
For usage with variant sites in species other than humans, the --chrom-map option may be used to specify a file name that has a tab-delimited mapping of chromosome name to a desired integer value with one line per chromosome. This file must contain a mapping for every chromosome value found in the file.
COMPARISON OPTIONS
These options are used to compare the original variant file to another variant file and output the results. All of the diff functions require both files to contain the same chromosomes and that the files be sorted in the same order. If one of the files contains chromosomes that the other file does not, use the --not-chr filter to remove them from the analysis.
DIFF VCF FILE
--diff <filename>
--gzdiff <filename>
--diff-bcf <filename>
These options compare the original input file to this
specified VCF, gzipped VCF, or BCF file. These options must be specified with
one additional option described below in order to specify what type of
comparison is to be performed. See the examples section for typical
usage.
DIFF OPTIONS
--diff-site
Outputs the sites that are common / unique to each file.
The output file has the suffix ".diff.sites_in_files".
--diff-indv Outputs the individuals that are common / unique to each
file. The output file has the suffix ".diff.indv_in_files".
--diff-site-discordance This option calculates discordance on a site by site
basis. The resulting output file has the suffix ".diff.sites".
--diff-indv-discordance This option calculates discordance on a per-individual
basis. The resulting output file has the suffix ".diff.indv".
--diff-indv-map <filename> This option allows the user to specify a mapping of
individual IDs in the second file to those in the first file. The program
expects the file to contain a tab-delimited line containing an individual's
name in file one followed by that same individual's name in file two with one
mapping per line.
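The mapping file described above could look like the sketch below, with each line holding an individual's ID in file one, a tab, and the same individual's ID in file two. All IDs and file names are placeholders. Pairing "--diff-indv-map" with "--diff-indv-discordance" is just one plausible combination of the documented options, and the vcftools call is skipped unless the tool and both input files are present.

```shell
# Tab-delimited mapping: name in file one, then name in file two.
printf 'NA00001\tsample_1\nNA00002\tsample_2\n' > id_map.txt

if command -v vcftools >/dev/null 2>&1 \
        && [ -f input_file1.vcf ] && [ -f input_file2.vcf ]; then
    vcftools --vcf input_file1.vcf --diff input_file2.vcf \
        --diff-indv-map id_map.txt --diff-indv-discordance \
        --out in1_v_in2
fi
```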
--diff-discordance-matrix This option calculates a discordance matrix. This option
only works with bi-allelic loci with matching alleles that are present in both
files. The resulting output file has the suffix
".diff.discordance.matrix".
--diff-switch-error This option calculates phasing errors (specifically
"switch errors"). This option creates an output file describing
switch errors found between sites, with suffix ".diff.switch".
AUTHORS
Adam Auton
Anthony Marcketta