vcfppR: rapid manipulation of the VCF/BCF file


The vcfppR package implements various useful functions for rapidly manipulating VCF/BCF files in R using the C++ API of vcfpp.h.


## install.packages("vcfppR") ## from CRAN
remotes::install_github("Zilong-Li/vcfppR") ## from latest github

Cite the work

If you find it useful, please cite the paper


vcftable: read VCF as tabular data

vcftable gives you fine control over what you want to extract from VCF/BCF files.

Read only SNP variants

vcffile <- ""
res <- vcftable(vcffile, "chr21:1-5100000", vartype = "snps")

Read only SNP variants with PL format and drop the INFO column in the VCF/BCF

vcffile <- ""
res <- vcftable(vcffile, "chr21:1-5100000", vartype = "snps", format = "PL", info = FALSE)

Read only INDEL variants with DP format in the VCF/BCF

vcffile <- ""
res <- vcftable(vcffile, "chr21:1-5100000", vartype = "indels", format = "DP")

vcfcomp: compare two VCF files and report concordance

Want to investigate the concordance between two VCF files? vcfcomp is the utility function you need!

Genotype correlation

vcffile <- ""
res <- vcfcomp(test = vcffile, truth = vcffile, region = "chr21:1-5100000", stats = "r2", formats = c('GT','GT'))

Genotype F1 score

vcffile <- ""
res <- vcfcomp(test = vcffile, truth = vcffile, region = "chr21:1-5100000", stats = "f1")

Genotype Non-Reference Concordance

vcffile <- ""
res <- vcfcomp(test = vcffile, truth = vcffile, region = "chr21:1-5100000", stats = "nrc")
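
To make two of these statistics less abstract, here is a minimal base-R sketch on toy genotype dosages (0, 1 or 2 copies of the alternate allele). It does not call vcfcomp and is not its implementation. The vectors `truth` and `test` and the helper `nrc_toy` are hypothetical, and the NRC formula is one commonly used definition (mismatches divided by mismatches plus non-reference matches), assumed here for illustration only.

```r
# Toy genotype dosages for 8 sample-by-site calls (hypothetical data).
truth <- c(0, 1, 2, 0, 1, 0, 2, 1)
test  <- c(0, 1, 2, 1, 1, 0, 1, 0)

# Genotype r2: squared Pearson correlation between the two dosage vectors.
r2 <- cor(truth, test)^2

# Non-reference concordance under the assumed definition:
# matching homozygous-reference calls are ignored.
nrc_toy <- function(truth, test) {
  mismatch     <- truth != test
  nonref_match <- truth == test & truth > 0
  1 - sum(mismatch) / (sum(mismatch) + sum(nonref_match))
}

nrc_toy(truth, test)  # 3 mismatches, 3 non-reference matches -> 0.5
```

Ignoring the homozygous-reference matches keeps the score from being inflated by the many sites where both files simply agree on the reference allele.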

vcfsummary: variant characterization

Want to summarize the variants discovered by a genotype caller such as GATK? vcfsummary is the utility function you need!

Small variants

vcffile <- ""
region <- "chr21:10000000-10010000"
res <- vcfsummary(vcffile, region)
# get labels and do plotting
ped <- read.table("", header = TRUE)
ped <- ped[order(ped$Superpopulation),]
out <- sapply(unique(ped$Superpopulation), function(pop) {
  id <- subset(ped, Superpopulation == pop)[,"SampleID"]
  ord <- match(id, res$samples)
  res$SNP[ord] + res$INDEL[ord]
})

boxplot(out, main = paste0("Average number of SNP & INDEL variants\nin region ", region))
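
The grouping step above relies on `match` to align the pedigree table with the sample order of the summary. The following is a small base-R sketch of that alignment on made-up data. The toy `res` list only mimics the `samples`, `SNP` and `INDEL` fields used above, and the sample IDs, counts and population labels are invented for illustration.

```r
# Toy stand-in for the vcfsummary result (only the fields used above).
res <- list(
  samples = c("S1", "S2", "S3", "S4"),
  SNP     = c(10L, 12L, 7L, 9L),
  INDEL   = c(2L, 1L, 3L, 0L)
)

# Toy pedigree table, deliberately in a different order than res$samples.
ped <- data.frame(
  SampleID        = c("S3", "S1", "S4", "S2"),
  Superpopulation = c("EUR", "AFR", "EUR", "AFR")
)
ped <- ped[order(ped$Superpopulation), ]

# For each superpopulation, look up its samples in res and add the counts.
out <- sapply(unique(ped$Superpopulation), function(pop) {
  id  <- ped$SampleID[ped$Superpopulation == pop]
  ord <- match(id, res$samples)  # positions of these samples in res
  res$SNP[ord] + res$INDEL[ord]
})

out[, "AFR"]  # S1 and S2: 12 13
out[, "EUR"]  # S3 and S4: 10 9
```

When the groups have unequal sizes, `sapply` returns a list rather than a matrix. `boxplot` accepts both.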

Complex structural variants

svfile <- ""
sv <- vcfsummary(svfile, svtype = TRUE, region = "chr20")
allsvs <- sv$summary[-1]
bar <- barplot(allsvs, ylim = c(0, 1.1*max(allsvs)),
               main = "Variant Counts on chr20 (all SVs)")


Two classes, vcfreader and vcfwriter, offer full R bindings of vcfpp.h. See the examples in the tests folder or refer to the manual.
