NAME

vcf-split - Efficiently split a multi-sample VCF stream into single-sample files

SYNOPSIS

vcf-split \
    [--het-only] [--alt-only] [--max-calls N] \
    [--sample-id-file file] [--output-fields field-spec] \
    output-file-prefix first-column last-column < file.vcf
bcftools view file.bcf | vcf-split ...

OPTIONS and ARGUMENTS

--het-only

Output only heterozygous sites. When decoding a BCF file, "bcftools view --genotype het" slows down the "bcftools view" process and vcf-split uses far less CPU. Since bcftools is already saturating a CPU core and vcf-split has CPU cycles to spare, allowing vcf-split to perform the heterozygous site selection increases pipeline performance considerably.

--alt-only

Output only sites with at least one ALT allele.

--max-calls N

Limit the number of VCF calls processed (for quick testing without the need to generate smaller test input files).

--sample-id-file filename

File containing a whitespace-separated list of arbitrary sample IDS, which must match the sample names in the VCF input.

--output-fields field-spec

Indicates which fields to pass to the output. Fields not indicated here are replaced with a reasonable placeholder for that field, such as ".". field-spec is a comma-separated list of fields to include in the output including one or more of chrom,pos,id,ref,alt,qual,filter, and info.

output-file-prefix

Common filename prefix for all single-sample output files (see Examples directory).

first-column last-column

1-based column numbers limiting the number of samples for each run. vcf-split opens one output stream for each sample and many systems cannot support more than 30,000 to 40,000 open files at a time. E.g. for a 100,000 sample VCF stream, you may want to do multiple runs of 10,000 each (see EXAMPLES below). To prevent system overload, the maximum number of open files is hard-coded at 10,000. To override this limit, you must edit vcf-split.h and recompile. Note that using --sample-id-file may limits the number of open files to less than last-column - first_column + 1 and may make multiple runs unnecessary.

PURPOSE

vcf-split efficiently splits a multi-sample VCF stream into single-sample VCF files.

DESCRIPTION

Traditional methods for splitting a multi-sample VCF stream into single-sample files involve a loop or parallel job that rereads the multi-sample input for every sample. This is grossly inefficient and can become a major bottleneck where there are many samples and/or the input is compressed. For example, using "bcftools view" with optimal filtering options to decode one human chromosome BCF with 137,977 samples and pipe the VCF output through "wc" took 12 hours on a fast server using 2 cores. To split it into 137,977 single-sample VCFs would therefore require about 137,977 * 12 * 2 = ~3 million core-hours. This translates to 171 years on a single server or 125 days using 1000 cores on an HPC cluster, assuming file I/O is not a bottleneck with 1000 processes reading the same input file. ( The input would need to be prestaged on multiple local disks to avoid overloading the network file system. )

vcf-split solves this problem by writing a large number of single-sample VCFs simultaneously during a single read of the multi-sample input. Modern Unix systems support tens of thousands of simultaneously open files, providing a simple way to achieve enormous speedup.

To avoid system overload, vcf-split has a hard-coded limit of 10,000 samples at a time. Hence, vcf-split may reduce the time required to split a large VCF by a factor of 10,000 and can process 137,977 samples in 14 passes.

vcf-split is written entirely in C and attempts to optimize CPU, memory, and disk access. It does not inhale large amounts of data into RAM, so memory use is trivial and it runs mostly in cache RAM, making computational code as fast as possible.

The example BCF file mentioned above can be split in a few days on a single server using the maximum of 10,000 samples per run.

EXAMPLES

Split a simple VCF file with 100 samples, filtering for specific sample IDs:

vcf-split < input.vcf --het-only --sample-id-file samples.csv \
    single-sample- 1 100

Split a large BCF file with 120,000 samples (too many for your open file limit):

bcftools view --min-ac 2 --exclude-types indels \
    freeze.8.chr1.pass_only.phased.bcf \
    | vcf-split --het-only chr01. 1 30000
bcftools view --min-ac 2 --exclude-types indels \
    freeze.8.chr1.pass_only.phased.bcf \
    | vcf-split --het-only chr01. 30001 60000
bcftools view --min-ac 2 --exclude-types indels \
    freeze.8.chr1.pass_only.phased.bcf \
    | vcf-split --het-only chr01. 60001 90000
bcftools view --min-ac 2 --exclude-types indels \
    freeze.8.chr1.pass_only.phased.bcf \
    | vcf-split --het-only chr01. 90001 120000

BUGS

Please report bugs to the author and send patches in unified diff format. (Run "man diff" for more information)

AUTHOR

Jason W. Bacon
Paul Auer Lab
UW -- Milwaukee Zilber School of Public Health