The ENphylo_modeling
function is meant to couple
Environmental Niche Factor Analysis (ENFA) and phylogenetic imputation
(ENphylo) to model the spatial distribution of rare
(Mondanaro et al. 2023) and extremely rare (i.e. 1-5 occurrences)
species (Mondanaro et al. 2024). By tuning the arguments in
ENphylo_modeling
, the user can refine the model validation
process and evaluate the performance of both ENFA and
ENphylo models. In the following lines, we provide a
detailed example demonstrating how to format the data for running
ENphylo_modeling
, along with a description of its key
arguments.
The main input object of ENphylo_modeling
, the argument
input_data
, is a list of sf::data.frame
objects containing species occurrence data in binary format (ones for
presence, zero for background points) along with the explanatory
continuous variables.
Important note: before launching
ENphylo_modeling
, please ensure that all non-explanatory
variables are removed from input_data
and make sure that
each element of the list is named using the names of the target
species.
To demonstrate how to prepare input_data
, we load a list
of 15 geopackage files, generated by means of
[eucop_data_preparation
], as a case study. We chose four
random bioclimatic variables as relevant for the target species (“bio1”,
“bio4”, “bio11”, and “bio19”), thus removed all others from the data,
and assigned species names to the input_data
list. If the
user generated the geopackages by means of
eucop_data_preparation
as shown in the first step of this
tutorial, they should run the unevaluated lines in the following
chunk.
library(RRgeo)
library(terra)
library(sf)
library(ape)
# datG<-lapply(grep(".gpkg",list.files(),value=TRUE),st_read)
# names(datG)<-sapply(strsplit(grep(".gpkg",list.files(),value=TRUE),"_"),"[[",1)
# dat<-lapply(datG,function(x) x[,c("OBS","age","bio1", "bio4", "bio11", "bio19")])
yourdir<-"YOUR_DIRECTORY"
setwd(yourdir)
latesturl<-RRgeo:::get_latest_version("12734585")
load(url(paste0(latesturl,"/files/dat.Rda?download=1")))
read.tree(system.file("exdata/Eucopdata_tree.txt", package="RRgeo"))->tree
tree$tip.label<-gsub("_"," ",tree$tip.label)
head(dat[1])
#> $`Canis latrans`
#> Simple feature collection with 52630 features and 6 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -10615100 ymin: 2006845 xmax: -7715096 ymax: 6256845
#> Projected CRS: World_Mollweide
#> First 10 features:
#> OBS age bio1 bio4 bio11 bio19 geom
#> 1 1 0 15.364 903.399 4.527 1459.937 POINT (-9115819 4064085)
#> 2 0 0 3.244 651.593 -4.342 2762.059 POINT (-8865096 6256845)
#> 3 0 0 1.522 669.779 -6.328 2257.113 POINT (-8815096 6256845)
#> 4 0 0 0.327 688.093 -7.788 1905.053 POINT (-8765096 6256845)
#> 5 0 0 -1.166 706.524 -9.546 1546.615 POINT (-8715096 6256845)
#> 6 0 0 0.591 725.077 -8.054 1636.931 POINT (-8665096 6256845)
#> 7 0 0 -2.777 739.570 -11.635 1146.882 POINT (-8615096 6256845)
#> 8 0 0 -3.373 751.297 -12.410 1023.144 POINT (-8565096 6256845)
#> 9 0 0 -0.921 763.073 -10.138 1159.108 POINT (-8515096 6256845)
#> 10 0 0 2.390 624.733 -4.815 2822.780 POINT (-8965096 6206845)
As shown in the example, the standard format for input data should include the relevant explanatory variables, the geometry, and columns indicating species occurrence data in binary format. Additionally, if available, columns representing the time intervals associated with species presence and background points may be added.
ENphylo_modeling
Now, we can proceed with setting all ENphylo_modeling
parameters. Other than the abovementioned input_data
argument, the function requires a phylogenetic tree
and an
input_mask
as mandatory arguments. The latter is the
geographical mask defining the spatial domain encompassing the
background area enclosing all the target species. Other possible
arguments include:
min_occ_enfa
: the threshold number of presence
datapoints discriminating whether ENFA or ENphylo will be implemented.
If the number of occurrences (i.e., OBS=1 in the input data example)
exceeds the min_occ_enfa
threshold, the species is modeled
by using ENFA, otherwise ENphylo_modeling
performs
phylogenetic imputation.boot_test_perc
and boot_reps
: regulate the
cross-validation scheme by defining the proportion of data allocated for
training/testing and the number of iterations used to assess model
accuracy.swap.args
: is a list of settings to generate as many
alternative phylogenies as nsim
. A large number of
nsim
allows testing different phylogenies, therefore
accounting for phylogenetic uncertainty, and potentially improving
ENphylo model performance. Nonetheless, increasing
nsim
also leads to higher computational costs. To reach a
trade-off between model performance and computational efficiency the
size of the initial phylogeny must be taken into account, as extensive
modifications of small trees have limited impact on model accuracy.eval.args
: a list of settings to assess the accuracy of
both ENFA and ENphylo. It allows choosing the model
validation metric (eval_metric_for_imputation
), define the
evaluation threshold number (eval_threshold
), and specify
the strategy to select the best-fit model for each species
(output_options
).Note: ENphylo_modeling
is highly
suitable for a multi-species modeling approach. However, it is strongly
recommended not to exceed an abundant/rare species ratio of 0.7. This
limit ensures accurate and reliable estimation of marginality and
specialization values for rare species via imputation. If the 0.7 ratio
is exceeded, ENphylo_modeling
internally splits the
original phylogeny (tree
) into multiple phylogenies to
ensure that the threshold is respected. In each subtree, the function
automatically selects species that are phylogenetically close to the
species to impute and preferentially splits the latter into different
trees.
latesturl<-RRgeo:::get_latest_version("12734585")
curl::curl_download(paste0(latesturl,"/files/X35kya.tif?download=1"),
destfile = "X35kya.tif", quiet = FALSE)
rast("X35kya.tif")->map35
project(map35,st_crs(dat[[1]])$proj4string,res = 50000)->map
ENphylo_modeling(input_data=dat,
tree=tree,
input_mask=map[[1]],
obs_col="OBS",
time_col="age",
min_occ_enfa=15,
boot_test_perc=20,
boot_reps=10,
swap.args=list(nsim=10,si=0.2,si2=0.2),
eval.args=list(eval_metric_for_imputation="AUC",
eval_threshold=0.7,
output_options="best"),
clust=0.5,
output.dir=yourdir)
ENphylo_modeling
outputsENphylo_modeling
creates two output folders within the
output.dir
path:
The content of model outputs object depends on the algorithm used for
modeling individual species and on the combination of
ENphylo_modeling
arguments chosen by users. In any case,
model outputs include the CO matrices (i.e., the correlation matrix of
the environmental predictors) and the model performances.
Note: ENphylo_modeling
returns all the
evaluation metrics available for both ENFA and ENphylo.
However, it internally relies on the selected
eval_metric_for_imputation
and output_options
arguments to retrieve the best-fit “imputed” model. After testing
nsim
alternative phylogenies and calculating the
performances of models generated from these phylogenies,
ENphylo_modeling
selects the model (or multiple models
depending on the output_options
argument) with the highest
eval_metric_for_imputation
value.
ENphylo_prediction
After running ENphylo_modeling
, it is recommended to use
the function ENphylo_prediction
to make spatially explicit
predictions into a new geographical area. Irrespective of whether ENFA
or phylogenetic imputation was performed for modelling each species,
ENphylo_prediction
calculates new marginality and
specialization estimates based on the specified geographical area.
Additionally, the user can convert these predictions into habitat
suitability maps.
To simplify the retrieval of ENFA/ENphylo models, we
developed a new function, getENphylo_results
, which
automatically imports ENFA, ENphylo, or both models
from the specified folder path. In addition, user can focus on single or
multiple species by setting the species_name
argument.
As a case study, we project two models, one ENFA model for Ursus
maritimus and one imputed model for Vulpes velox, by using
the map including bioclimatic variables at 35 kya. First, the map is set
to retain the focal variables only (the ones used in
ENphylo_modeling
) and reprojected by using the Mollweide
World projection (that is the same as input_data
). Then,
getENphylo_results
retrieves the models for Ursus
maritimus and Vulpes velox, and finally the user may set
ENphylo_prediction
to generate their habitat suitability
maps in North America at 35 kya.
Note: ENphylo_prediction
stores all the
outputs in the output.dir
folder by creating a new
ENphylo_prediction folder. Inside the folder, a number of
nested sub-folders, one per species, is created according to the
proj_name
argument of the function to store all the spatial
predictions.
library(rnaturalearth)
ne_countries(returnclass = "sf")->globalmap
subset(globalmap,continent=="North America")->ame_map
map35[[c("bio1","bio4","bio11","bio19")]]->newmap
crop(newmap,ext(ame_map))->newmap
project(newmap,st_crs(dat[[1]])$proj4string,res = 50000)->newmap
getENphylo_results(input.dir =yourdir,
mods="all",
species_name=c("Vulpes velox","Ursus maritimus"))->mod
ENphylo_prediction(object = mod,
newdata = newmap,
convert.to.suitability = TRUE,
output.dir=yourdir,
proj_name="proj_example")
In the following chunk we show how to plot habitat suitibility maps
of the selected species by means of ggplot2. Such maps are provided as
supporting material within the package. Yet, if the users wish to plot
their own maps they can use the unevaluated lines to retrieve them and
modify the fill
argument in ggplot
according
to their output.
library(ggplot2)
library(ggtext)
library(viridis)
rast(system.file("exdata/Suit_Vulpes.tif", package="RRgeo"))->map1
rast(system.file("exdata/Suit_Ursus.tif", package="RRgeo"))->map2
as.data.frame(map1,xy=T)->map1
as.data.frame(map2,xy=T)->map2
# rast("./ENphylo_prediction/Vulpes velox/proj_example/Suitability.tif")->map1
# rast("./ENphylo_prediction/Ursus maritimus/proj_example/Suitability.tif")->map2
p1<-ggplot(map1,aes(x=x,y=y,fill=Suitability_swap_9))+
geom_tile()+
scale_fill_viridis(name = "Suitability")+
labs(title="*Vulpes velox* at 35 kya")+
theme(panel.background = element_rect(fill="aliceblue",colour = "black"),
panel.grid = element_blank(),
axis.text= element_text(size=10),
axis.title = element_blank(),
plot.title = element_markdown(size=12,hjust=0.5),
plot.margin = unit(c(0.1,0.1,0.1,0.1),"cm"))
p2<-ggplot(map2,aes(x=x,y=y,fill=Suitability))+
geom_tile()+
scale_fill_viridis()+
labs(title="*Ursus maritimus* at 35 kya")+
theme(panel.background = element_rect(fill="aliceblue",colour = "black"),
panel.grid = element_blank(),
axis.text=element_text(size=10),
axis.title = element_blank(),
plot.title = element_markdown(size=12,hjust=0.5),
plot.margin = unit(c(0.1,0.1,0.1,0.1),"cm"))
plot(p1)
plot(p2)