Generating realistic data with known truth using the jointseg package

M. Pierre-Jean, G. Rigaill, P. Neuvial

2025-05-14

This vignette describes how to use the jointseg package to partition bivariate DNA copy number signals from SNP array data into segments of constant parent-specific copy number. We demonstrate the use of the PSSeg function of this package to apply two different strategies. Both strategies consist of first identifying a list of candidate change points through a fast (greedy) segmentation method, and then pruning this list using dynamic programming [1]. The segmentation method presented here is Recursive Binary Segmentation (RBS, [2]). We refer to [3] for a more comprehensive performance assessment of this method and other segmentation methods.

Keywords: segmentation, change point model, binary segmentation, dynamic programming, DNA copy number, parent-specific copy number.

Please see the section "Citing jointseg" for how to cite jointseg.


This vignette illustrates how the jointseg package may be used to generate a variety of copy-number profiles from the same biological "truth". Such profiles have been used to compare the performance of segmentation methods in [3].

Citing jointseg

citation("jointseg")
## 
## To cite package 'jointseg' in publications, please use the following
## references:
## 
##   Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial (). jointseg:
##   Joint segmentation of multivariate (copy number) signals.R package
##   version 1.0.3.
## 
##   Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial. Performance
##   evaluation of DNA copy number segmentation methods.  Briefings in
##   Bioinformatics (2015) 16 (4): 600-615.
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

Setup

The parameters are defined as follows:

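library("jointseg")                      ## getCopyNumberDataByResampling(), plotSeg()
n <- 1e4                                 ## signal length (number of loci)
bkp <- c(2334, 6121)                     ## breakpoint positions
regions <- c("(1,1)", "(1,2)", "(0,2)")  ## copy number regions, one per segment
ylims <- cbind(c(0, 5), c(-0.1, 1.1))    ## y-axis limits, one column per signal
colG <- rep("#88888855", n)              ## default point color (transparent grey)
hetCol <- "#00000088"                    ## point color for heterozygous SNPs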
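
To make the role of n, bkp and regions concrete, the following base-R sketch tabulates the segments these parameters imply. It assumes a convention that is not stated above, namely that a breakpoint at position t separates locus t from locus t + 1. The helper segmentTable is hypothetical and is not part of jointseg.

```r
## Hypothetical helper (not part of jointseg): tabulate the segments implied
## by a signal length, breakpoint positions and region labels.
## Assumption: a breakpoint at position t separates locus t from locus t + 1.
segmentTable <- function(n, bkp, regions) {
    stopifnot(length(regions) == length(bkp) + 1L,
              !is.unsorted(bkp, strictly = TRUE),
              all(bkp >= 1), all(bkp < n))
    start <- c(1, bkp + 1)   ## first locus of each segment
    end <- c(bkp, n)         ## last locus of each segment
    data.frame(region = regions, start = start, end = end,
               length = end - start + 1, stringsAsFactors = FALSE)
}

segs <- segmentTable(n = 1e4, bkp = c(2334, 6121),
                     regions = c("(1,1)", "(1,2)", "(0,2)"))
print(segs)
```

Two breakpoints always define three segments, which is why regions must contain exactly one more label than bkp contains positions.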

For convenience we define a custom plot function for this vignette:

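plotFUN <- function(dataSet, tumorFraction) {
    ## annotated copy number data for this data set and tumor fraction
    regDat <- acnr::loadCnRegionData(dataSet=dataSet, tumorFraction=tumorFraction)
    ## resample a profile of length n with the known breakpoints and regions
    sim <- getCopyNumberDataByResampling(n, bkp=bkp,
                                         regions=regions, regData=regDat)
    dat <- sim$profile
    ## highlight heterozygous SNPs (genotype 1/2) in a darker color
    wHet <- which(dat$genotype==1/2)
    colGG <- colG
    colGG[wHet] <- hetCol
    ## plot the simulated profile together with its true breakpoints
    plotSeg(dat, sim$bkp, col=colGG)
}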
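
Conceptually, the simulation step in this function fills each segment with observations resampled from data annotated with the corresponding region label, so that repeated calls share the same breakpoints and regions but differ in their noise. The base-R sketch below mimics that idea. It is a simplified illustration and not the implementation of getCopyNumberDataByResampling: the helper resampleProfile is hypothetical, and synthetic Gaussian pools (with made-up means) stand in for the real annotated data.

```r
## Hypothetical helper (not part of jointseg): build a profile by sampling,
## with replacement, from a pool of values for each segment's region.
## Assumption: a breakpoint at position t separates locus t from locus t + 1.
resampleProfile <- function(n, bkp, regions, pools) {
    stopifnot(length(regions) == length(bkp) + 1L,
              all(regions %in% names(pools)))
    start <- c(1, bkp + 1)
    end <- c(bkp, n)
    signal <- numeric(n)
    region <- character(n)
    for (k in seq_along(regions)) {
        idx <- start[k]:end[k]
        signal[idx] <- sample(pools[[regions[k]]], length(idx), replace = TRUE)
        region[idx] <- regions[k]
    }
    data.frame(pos = seq_len(n), signal = signal, region = region,
               stringsAsFactors = FALSE)
}

set.seed(1)
## Synthetic stand-ins for annotated data; the means are illustrative only
pools <- list("(1,1)" = rnorm(500, mean = 2, sd = 0.2),
              "(1,2)" = rnorm(500, mean = 3, sd = 0.2),
              "(0,2)" = rnorm(500, mean = 2, sd = 0.2))
regs <- c("(1,1)", "(1,2)", "(0,2)")
sim1 <- resampleProfile(1e4, bkp = c(2334, 6121), regions = regs, pools = pools)
sim2 <- resampleProfile(1e4, bkp = c(2334, 6121), regions = regs, pools = pools)

## Same truth (region labels), different realizations of the signal
identical(sim1$region, sim2$region)
identical(sim1$signal, sim2$signal)
```

Because the truth is fixed by the breakpoints and region labels, any number of such profiles can be generated and compared against the same known segmentation.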

Affymetrix data

ds <- "GSE29172"
pct <- 1
plotFUN(ds, pct)
Data set GSE29172: 100% tumor cells
plotFUN(ds, pct)
Data set GSE29172: 100% tumor cells (another resampling)
pct <- 0.7
plotFUN(ds, pct)
Data set GSE29172: 70% tumor cells
pct <- 0.5
plotFUN(ds, pct)
Data set GSE29172: 50% tumor cells

Illumina data

ds <- "GSE11976"
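
A profile for this data set can presumably be generated in the same way as for the Affymetrix data. The sketch below is self-contained and calls the package functions directly. It rests on two assumptions that are not confirmed in this vignette: that a tumor fraction of 1 is available for GSE11976 in the installed version of acnr, and that this data set contains the three region labels used above. The code is guarded so that it does nothing when jointseg or acnr is not installed.

```r
## Sketch only: assumes tumorFraction = 1 and the regions below are
## available for GSE11976 in the installed acnr version.
ran <- FALSE
if (requireNamespace("jointseg", quietly = TRUE) &&
    requireNamespace("acnr", quietly = TRUE)) {
    ds <- "GSE11976"
    pct <- 1
    regDat <- acnr::loadCnRegionData(dataSet = ds, tumorFraction = pct)
    sim <- jointseg::getCopyNumberDataByResampling(
        1e4, bkp = c(2334, 6121),
        regions = c("(1,1)", "(1,2)", "(0,2)"), regData = regDat)
    ## plot the simulated profile together with its true breakpoints
    jointseg::plotSeg(sim$profile, sim$bkp)
    ran <- TRUE
}
```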

Session information

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] jointseg_1.0.3 knitr_1.45    
## 
## loaded via a namespace (and not attached):
##  [1] matrixStats_0.57.0 digest_0.6.36      acnr_1.0.0         R6_2.5.1          
##  [5] lifecycle_1.0.3    jsonlite_1.8.8     evaluate_0.23      highr_0.8         
##  [9] cachem_1.0.6       rlang_1.1.1        cli_3.6.1          rstudioapi_0.11   
## [13] jquerylib_0.1.4    bslib_0.6.1        rmarkdown_2.25     tools_4.0.2       
## [17] xfun_0.41          yaml_2.3.8         fastmap_1.1.1      compiler_4.0.2    
## [21] htmltools_0.5.7    DNAcopy_1.62.0     sass_0.4.8

References

[1] Bellman, Richard. 1961. “On the Approximation of Curves by Line Segments Using Dynamic Programming.” Communications of the ACM 4 (6). ACM: 284.

[2] Gey, Servane, et al. 2008. “Using CART to Detect Multiple Change Points in the Mean for Large Sample.” https://hal.science/hal-00327146.

[3] Pierre-Jean, Morgane, et al. 2015. “Performance Evaluation of DNA Copy Number Segmentation Methods.” Briefings in Bioinformatics 16 (4): 600-615.