---
title: "Using FLORAL for Microbiome Analysis"
output:
rmarkdown::html_vignette:
md_extensions: [
"-autolink_bare_uris"
]
vignette: >
%\VignetteIndexEntry{Using FLORAL for Microbiome Analysis}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
```
```{r setup, warning=FALSE, message=FALSE}
library(FLORAL)
library(dplyr)
library(patchwork)
```
## Data
We will be using data from the `curatedMetagenomicData` package. For easier installation, we saved a flat copy of the data, but the steps below show how that was created.
```r
if (! "BiocManager" %in% installed.packages()) install.packages("BiocManager")
if (! "curatedMetagenomicData" %in% installed.packages()) BiocManager::install("curatedMetagenomicData")
if (! "patchwork" %in% installed.packages()) install.packages("patchwork")
library(curatedMetagenomicData)
# Take a look at the summary of the studies available:
curatedMetagenomicData::sampleMetadata |> group_by(.data$study_name, .data$study_condition) |> count() |> arrange(.data$study_name)
#As an example, let us look at the `YachidaS_2019` study between healthy controls and colorectal cancer (CRC) patients.
curatedMetagenomicData::curatedMetagenomicData("YachidaS_2019")
# note -- if you are behind a firewall, see the solutions to 500 errors here:
# https://support.bioconductor.org/p/132864/
rawdata <- curatedMetagenomicData::curatedMetagenomicData("2021-03-31.YachidaS_2019.relative_abundance", dryrun = FALSE, counts = TRUE) |> mergeData()
x <- SummarizedExperiment::assays(rawdata)$relative_abundance %>% t()
y <- rawdata@colData$disease
save(list = c("x", "y"), file = file.path("inst", "extdata", "YachidaS_2019.Rdata"))
```
```{r getData}
load(system.file("extdata", "YachidaS_2019.Rdata", package="FLORAL"))
```
## Running FLORAL
We extracted the data from the `TreeSummarizedExperiment` object to two objects: the taxa matrix `x` and the "outcomes" vector `y` of whether a patient is healthy or has colorectal cancer (CRC). Note that for binary outcomes, the input vector `y` needs to be formatted with entries equal to either `0` or `1`. In addition, we need to specify `family = "binomial"` in `FLORAL` to fit the logistic regression model. To print the progress bar as the algorithm runs, please use `progress = TRUE`.
```{r floral}
x <- x[y %in% c("CRC","healthy"),]
x <- x[,colSums(x >= 100) >= nrow(x)*0.2] # filter low abundance taxa
colnames(x) <- sapply(colnames(x), function(x) strsplit(x,split="[|]")[[1]][length(strsplit(x,split="[|]")[[1]])])
y <- as.numeric(as.factor(y[y %in% c("CRC","healthy")]))-1
fit <- FLORAL(x = x, y = y, family="binomial", ncv=10, progress=TRUE)
```
## Interpreting the Model
FLORAL, like other methods that have an optimization step, has two "best" solutions for $\lambda$ available: one minimizing the mean squared error ($\lambda_\min$), and one maximizing the value of $\lambda$ withing 1 standard error of the minimum mean squared error ($\lambda_{\text{1se}}$). These are referred to as the `min` and `1se` solutions, respectively.
We can see the mean squared error (MSE) and the coefficients vs log($\lambda$) as follows:
```{r plots,fig.height=4,fig.width=10,dpi=300}
fit$pmse + fit$pcoef
```
In both plots, the vertical dashed line and dotted line represent $\lambda_\min$ and $\lambda_{\text{1se}}$, respectively. In the MSE plot, the bands represent plus minus one standard error of the MSE. In the coefficient plot, the colored lines represent individual taxa, where taxa with non-zero values at $\lambda_\min$ and $\lambda_{\text{1se}}$ are selected as predictive of the outcome.
To view specific names of the selected taxa, please see `fit$selected.feature$min` or `fit$selected.feature$1se` vectors. To view all coefficient estimates, please see `fit$best.beta$min` or `fit$best.beta$1se`. Without looking into ratios, one can crudely interpret positive or negative association between a taxon and the outcome by the positive or negative sign of the coefficient estimates. However, we recommend referring to the two-step procedure discussed below for a more rigorous interpretation based on ratios, which is derived from the log-ratio model assumption.
```{r viewTaxa}
head(fit$selected.feature$min)
head(sort(fit$best.beta$min))
```
## The Two-step Procedure
In the previous section, we checked the lasso estimates without identifying specific ratios that are predictive of the outcome (CRC in this case). By default, `FLORAL` performs a two-step selection procedure to use `glmnet` and `step` regression to further identify taxa pairs which form predictive log-ratios. To view those pairs, use `fit$step2.ratios$min` or `fit$step2.ratios$1se` for names of ratios and `fit$step2.ratios$min.idx` or `fit$step2.ratios$1se.idx` for the pairs of indices in the original input count matrix `x`. Note that one taxon can occur in multiple ratios.
```{r view2step}
head(fit$step2.ratios$`1se`)
fit$step2.ratios$`1se.idx`
```
To further interpret the positive or negative associations between the outcome, please refer to the output `step` regression tables, where the effect sizes of the ratios can be found.
While the corresponding p-values are also available, we recommend only using the p-values as a criterion to rank the strength of the association. We do not recommend directly reporting the p-values for inference, because these p-values were obtained after running the first step lasso model without rigorous post-selective inference. However, it is still valid to claim these selected log-ratios are predictive of the outcome, as demonstrated by the improved 10-fold cross-validated prediction errors.
```{r viewTable}
fit$step2.tables$`1se`
```
## Generating taxa selection probabilities
It is encouraged to run k-fold cross-validation for several times to account for the random fold splits. `FLORAL` provides `mcv.FLORAL` functions to repeat cross-validations for `mcv` times and on `ncore` cores. The output summarizes taxa selection probabilities, average coefficients based on $\lambda_\min$ and $\lambda_{\text{1se}}$. Interpretable plots can be created if `plot = TRUE` is specified.
```{r mcv}
mcv.fit <- mcv.FLORAL(mcv=2,
ncore=1,
x = x,
y = y,
family = "binomial",
ncv = 3,
progress=TRUE)
```
```{r mcvplots,fig.height=6,fig.width=10,dpi=300}
mcv.fit$p_min
#Other options are also available
#mcv.fit$p_min_ratio
#mcv.fit$p_1se
#mcv.fit$p_1se_ratio
```
## Elastic net
Beyond lasso model, `FLORAL` also supports elastic net models by specifying the tuning parameter `a` between 0 and 1. Lasso penalty will be used when `a=1` while ridge penalty will be used when `a=0`.
The `a.FLORAL` function can help investigate the prediction performance for different choices of `a` and return a plot of the corresponding prediction metric trajectories against the choice of $\lambda$.
```{r a.floral,out.width = '50%',fig.height=4,fig.width=4,dpi=300}
a.fit <- a.FLORAL(a = c(0.1,1),
ncore = 1,
x = x,
y = y,
family = "binomial",
ncv = 3,
progress=TRUE)
a.fit
```