The enpls package offers an algorithmic framework for measuring feature importance, detecting outliers, and ensemble modeling based on (sparse) partial least squares regression. The key functions included in the package are listed in the table below.
Task | Partial Least Squares | Sparse Partial Least Squares
---|---|---
Model fitting | enpls.fit() | enspls.fit()
Cross validation | cv.enpls() | cv.enspls()
Detect outliers | enpls.od() | enspls.od()
Measure feature importance | enpls.fs() | enspls.fs()
Evaluate applicability domain | enpls.ad() | enspls.ad()
Next, we will use the data from Wang et al. (2015) to demonstrate the general workflow of enpls. The dataset contains 1,000 compounds, each characterized by 80 molecular descriptors. The response is the octanol/water partition coefficient at pH 7.4 (logD7.4).
Let’s load the data and take a look at it:
library("enpls")
library("ggplot2")
data("logd1k")
x <- logd1k$x
y <- logd1k$y
head(x)[, 1:5]
## BalabanJ BertzCT Chi0 Chi0n Chi0v
## 1 1.949 882.760 16.845 13.088 13.088
## 2 1.970 781.936 15.905 13.204 14.021
## 3 2.968 343.203 9.845 7.526 7.526
## 4 2.050 1133.679 19.836 15.406 15.406
## 5 2.719 437.346 12.129 9.487 9.487
## 6 2.031 983.304 19.292 15.289 15.289
head(y)
## [1] -0.96 -0.92 -0.90 -0.83 -0.82 -0.79
Here we fit an ensemble sparse partial least squares model to the data: compared with vanilla partial least squares, the sparsity constraint usually reduces the complexity of each individual model in the ensemble.
set.seed(42)
fit <- enspls.fit(x, y, ratio = 0.7, reptimes = 20, maxcomp = 3)
y.pred <- predict(fit, newx = x)
df <- data.frame(y, y.pred)
ggplot(df, aes(x = y, y = y.pred)) +
geom_abline(slope = 1, intercept = 0, colour = "darkgrey") +
geom_point(size = 3, shape = 1, alpha = 0.8) +
coord_fixed(ratio = 1) +
xlab("Observed Response") +
ylab("Predicted Response")
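To put a number on the agreement shown in the plot, one can compute the root-mean-square error of the fitted values. A minimal base-R sketch; the rmse helper below is our own, not part of enpls. With the objects from above, you would call rmse(y, y.pred); here we illustrate it on toy numbers so the snippet stands alone:

```r
# Root-mean-square error between observed and predicted responses
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Self-contained illustration on toy numbers
obs  <- c(-0.96, -0.92, -0.90, -0.83)
pred <- c(-1.00, -0.85, -0.95, -0.80)
rmse(obs, pred)
```

Note that this is the resubstitution (training-set) error; for an honest performance estimate, the cross-validation functions listed in the table above are the appropriate tool.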