This vignette demonstrates the use of the **oosse**
package for estimating for estimating out-of-sample R² and its standard
error through resampling algorithms, described in “Out-of-sample R²:
estimation and inference” by Hawinkel et al. 2023.

The *R2oosse* function works with any pair of fitting and
prediction functions. Here we illustrate a number of them, but any
prediction function implemented in R can be used. The built-in dataset
*Brassica* is used, which contains *rlog*-transformed gene
expression measurements for the 1,000 most expressed genes in the
*Expr* slot, as well as 5 outcome phenotypes in the
*Pheno* slot.

The fitting model must accept at least an outcome vector *y*
and a regressor matrix *x*:

The predictive model must accept arguments *mod* (the fitted
model) and *x*, the regressor matrix for a new set of
observations.

Now that these functions have been defined, we apply the prediction
model for leaf_8_width using the first 10 genes. Multithreading is used
automatically using the *BiocParallel* package. Change the
following setup depending on your system.

```
nCores = 2 # For CRAN build max 2
library(BiocParallel)
if(.Platform$OS.type == "unix"){
#On unix-based systems, use MulticoreParam
register(MulticoreParam(nCores))
} else {
#On windows, use makeCluster
library(doParallel)
Clus = makeCluster(nCores)
registerDoParallel(Clus)
register(DoparParam(), default = TRUE)
}
```

Now estimate out-of-sample \(R^2\), also a rough estimate of the computation time is given. Remember to provide the cluster for multithreading.

```
R2lm = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, 1:5],
fitFun = fitFunLM, predFun = predFunLM)
```

```
## Fitting and evaluating the model once took 0 seconds.
## You requested 200 repeats of 10-fold cross-validation with 2 cores, which is expected to last for roughly
## 0 seconds
```

Estimates and standard error of the different components are now available.

```
## R2 R2SE
## 0.4830960 0.1234188
```

```
## MSE MSESE
## 2.9563220 0.4768678
```

```
## MST MSTSE
## 5.719286 1.035600
```

Also confidence intervals can be constructed:

```
## 2.5% 97.5%
## 0.2411996 0.7249923
```

```
## 5% 95%
## 2.171944 3.740700
```

```
## 2.5% 97.5%
## 4.129867 8.446729
```

By default, cross-validation (CV) is used to estimate the MSE, and nonparametric bootstrapping is used to estimate the correlation between MSE and MST estimators. Other choices can be supplied though, e.g. for bootstrap .632 estimation of the MSE and jackknife estimation of the correlation:

```
R2lm632jn = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, 1:5],
fitFun = fitFunLM, predFun = predFunLM, methodMSE = "bootstrap",
methodCor = "jackknife")
```

```
## Fitting and evaluating the model once took 0 seconds.
## You requested 200 .632 bootstrap instances with 2 cores, which is expected to last for roughly
## 12.8 seconds
```

Supplying a dataframe with predictor variables is not allowed. The
user is asked to build the design matrix yourself with model.matrix
prior to calling *R2oosse*:

```
#We construct a fake data frame also containing genotypes
fakeDf = data.frame(Brassica$Expr[, 1:5], "genotype" = sample(c("Genotype1", "Genotype2", "Genotype3"), replace = TRUE, nrow(Brassica$Expr)))
#Build the design matrix. The model.matrix variables automatically constructs dummy variables
designMatrix = model.matrix(~ . , data = fakeDf)[, -1] #Include no intercept as fitting function already does this
#Now run oosse
R2modMat = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = designMatrix,
fitFun = fitFunLM, predFun = predFunLM)
```

```
## Fitting and evaluating the model once took 0 seconds.
## You requested 200 repeats of 10-fold cross-validation with 2 cores, which is expected to last for roughly
## 10.25 seconds
```

For high-dimensional problems, such as the Brassica dataset, a
regularised linear model is better suited to incorporate information for
all genes. We use the *cv.glmnet* function from the
*glmnet* package, which includes internal cross-validation for
tuning the penalty parameter. Following custom function definitions are
needed to fit in with the naming convention of the *oosse*
package.

```
fitFunReg = function(y, x, ...) {cv.glmnet(y = y, x = x, ...)}
predFunReg = function(mod, x, ...){predict(mod, newx = x, ...)}
```

We adapt the parameter settings a bit to reduce computation time of the vignette, it is recommended to use 10-fold cross-validation, at least 200 repeats of the cross-validation splits and 50 bootstrap replicates.

```
nFolds = 5; cvReps = 1e2; nBoots = 4e1;numFeat = 25
if(require(glmnet)){
if(onWindows <- (.Platform$OS.type == "windows")){
clusterEvalQ(Clus, require(glmnet))
}
R2pen = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, seq_len(numFeat)], #Subset genes for speed
nFolds = nFolds, cvReps = cvReps, nBootstrapsCor = nBoots,
fitFun = fitFunReg, predFun = predFunReg, alpha = 1)#Lasso model
R2pen$R2
}
```

`## Loading required package: glmnet`

`## Loading required package: Matrix`

`## Loaded glmnet 4.1-8`

```
## Fitting and evaluating the model once took 0.05 seconds.
## You requested 100 repeats of 5-fold cross-validation with 2 cores, which is expected to last for roughly
## 1 minutes and 11.55 seconds
```

```
## R2 R2SE
## 0.6360594 0.0977301
```

As a final example we use a random forest as a prediction model. We
use the implementation from the *randomForest* package.

```
if(require(randomForest)){
if(onWindows){
clusterEvalQ(Clus, require(randomForest))
}
fitFunrf = function(y, x, ...){randomForest(y = y, x, ...)}
predFunrf = function(mod, x, ...){predict(mod, x, ...)}
R2rf = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, seq_len(numFeat)],
nFolds = nFolds, cvReps = cvReps, nBootstrapsCor = nBoots,
fitFun = fitFunrf, predFun = predFunrf)
R2rf$R2
}
```

`## Loading required package: randomForest`

`## randomForest 4.7-1.1`

`## Type rfNews() to see new features/changes/bug fixes.`

```
## Fitting and evaluating the model once took 0.05 seconds.
## You requested 100 repeats of 5-fold cross-validation with 2 cores, which is expected to last for roughly
## 1 minutes and 3.45 seconds
```

```
## R2 R2SE
## 0.69839009 0.08881258
```

The \(R^2\) estimate is comparable to that of the penalised regression model.

```
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_BE.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_BE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Amsterdam
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.7-1.1 glmnet_4.1-8 Matrix_1.6-5
## [4] BiocParallel_1.36.0 oosse_1.0.11
##
## loaded via a namespace (and not attached):
## [1] doParallel_1.0.17 cli_3.6.2 knitr_1.45 rlang_1.1.3
## [5] xfun_0.41 jsonlite_1.8.8 htmltools_0.5.7 sass_0.4.8
## [9] rmarkdown_2.25 grid_4.3.2 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 foreach_1.5.2 lifecycle_1.0.4
## [17] compiler_4.3.2 codetools_0.2-19 Rcpp_1.0.12 rstudioapi_0.15.0
## [21] lattice_0.22-5 digest_0.6.34 R6_2.5.1 splines_4.3.2
## [25] shape_1.4.6 Rdpack_2.6 parallel_4.3.2 rbibutils_2.2.16
## [29] bslib_0.6.1 tools_4.3.2 iterators_1.0.14 survival_3.5-7
## [33] cachem_1.0.8
```