NAME
    vcf-split - Efficiently split a multi-sample VCF stream into single-sample files

SYNOPSIS
    vcf-split \
        [--het-only] [--alt-only] [--max-calls N] \
        [--sample-id-file file] [--output-fields field-spec] \
        output-file-prefix first-column last-column < file.vcf

    bcftools view file.bcf | vcf-split ...

OPTIONS and ARGUMENTS
PURPOSE
    vcf-split efficiently splits a multi-sample VCF stream into single-sample VCF files.

DESCRIPTION
    Traditional methods for splitting a multi-sample VCF stream into single-sample files involve a loop or parallel job that rereads the multi-sample input for every sample. This is grossly inefficient and can become a major bottleneck when there are many samples and/or the input is compressed.

    For example, using "bcftools view" with optimal filtering options to decode one human chromosome BCF with 137,977 samples and pipe the VCF output through "wc" took 12 hours on a fast server using 2 cores. Splitting it into 137,977 single-sample VCFs this way would therefore require about 137,977 * 12 * 2 = ~3 million core-hours. This translates to roughly 171 years on a single server, or about 125 days using 1000 cores on an HPC cluster, assuming file I/O is not a bottleneck with 1000 processes reading the same input file. (The input would need to be prestaged on multiple local disks to avoid overloading the network file system.)

    vcf-split solves this problem by writing a large number of single-sample VCFs simultaneously during a single read of the multi-sample input. Modern Unix systems support tens of thousands of simultaneously open files, providing a simple way to achieve an enormous speedup. To avoid system overload, vcf-split has a hard-coded limit of 10,000 samples at a time. Hence, vcf-split may reduce the time required to split a large VCF by a factor of up to 10,000, and can process 137,977 samples in 14 passes.

    vcf-split is written entirely in C and attempts to optimize CPU, memory, and disk access. It does not load large amounts of data into RAM, so memory use is trivial and it runs mostly in cache, making the computational code as fast as possible. The example BCF file mentioned above can be split in a few days on a single server using the maximum of 10,000 samples per run.
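The 14-pass figure above can be reproduced with a short shell sketch. It assumes that sample columns are numbered from 1, as the first-column and last-column arguments in EXAMPLES suggest. The helper name pass_ranges, the input name input.bcf, and the output prefix "sample." are hypothetical placeholders, not part of vcf-split. The loop only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: pass_ranges is a hypothetical helper, not part of vcf-split.
# It prints "first last" sample-column pairs, one pair per pass.
pass_ranges() {
    total=$1        # total number of samples in the input
    per_pass=$2     # samples per run (10,000 is the limit stated above)
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + per_pass - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total     # the final pass may be shorter
        fi
        echo "$first $last"
        first=$((last + 1))
    done
}

# Dry run: print one command per pass instead of executing it.
# input.bcf and the "sample." prefix are placeholders.
pass_ranges 137977 10000 | while read -r first last; do
    echo "bcftools view input.bcf | vcf-split sample. $first $last"
done
```

Each printed command rereads the whole input, so the total run time scales with the number of passes rather than the number of samples.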
SEE ALSO
    ad2vcf, vcf2hap, haplohseq, biolibc

EXAMPLES
    Split a simple VCF file with 100 samples, filtering for specific sample IDs:

    vcf-split < input.vcf --het-only --sample-id-file samples.csv \
        single-sample- 1 100

    Split a large BCF file with 120,000 samples (too many for your open file limit):

    bcftools view --min-ac 2 --exclude-types indels \
        freeze.8.chr1.pass_only.phased.bcf \
        | vcf-split --het-only chr01. 1 30000

    bcftools view --min-ac 2 --exclude-types indels \
        freeze.8.chr1.pass_only.phased.bcf \
        | vcf-split --het-only chr01. 30001 60000

    bcftools view --min-ac 2 --exclude-types indels \
        freeze.8.chr1.pass_only.phased.bcf \
        | vcf-split --het-only chr01. 60001 90000

    bcftools view --min-ac 2 --exclude-types indels \
        freeze.8.chr1.pass_only.phased.bcf \
        | vcf-split --het-only chr01. 90001 120000

BUGS
    Please report bugs to the author and send patches in unified diff format. (Run "man diff" for more information.)

AUTHOR
    Jason W. Bacon
    Paul Auer Lab
    UW -- Milwaukee Zilber School of Public Health