Analyses can slow to a crawl when models need hours to run. In this
article you will find a few tricks to prevent this bottleneck when using
orsf()
.
orsf_control_fast()
This is the default control
value for
orsf()
and its run-time compared to other approaches can be
striking. For example:
time_fast <- system.time(
expr = orsf(pbc_orsf,
formula = time+status~. -id,
control = orsf_control_fast(),
n_tree = 5)
)
time_net <- system.time(
expr = orsf(pbc_orsf,
formula = time+status~. -id,
control = orsf_control_net(),
n_tree = 5)
)
# control_fast() is much faster
time_net['elapsed'] / time_fast['elapsed']
#> elapsed
#> Inf
n_thread
The n_thread
argument uses multi-threading to run
aorsf
functions in parallel when possible. If you know how
many threads you want, e.g. you want exactly 5, just say
n_thread = 5
. If you aren’t sure how many threads you have
available but want to use as many as you can, say
n_thread = 0
and aorsf
will figure out the
number for you.
# automatically pick number of threads based on amount available
orsf(pbc_orsf,
formula = time+status~. -id,
n_tree = 5,
n_thread = 0)
#> ---------- Oblique random survival forest
#>
#> Linear combinations: Accelerated Cox regression
#> N observations: 276
#> N events: 111
#> N trees: 5
#> N predictors total: 17
#> N predictors per node: 5
#> Average leaves per tree: 21
#> Min observations in leaf: 5
#> Min events in leaf: 1
#> OOB stat value: 0.79
#> OOB stat type: Harrell's C-statistic
#> Variable importance: anova
#>
#> -----------------------------------------
Because R is a single threaded language, multi-threading cannot be
applied when orsf()
needs to call R functions from C++,
which occurs when a customized R function is used to find linear
combination of variables or compute prediction accuracy.
There are some defaults in orsf()
that can be adjusted
to make it run faster:
set n_retry
to 0 instead of 3 (the default)
set oobag_pred_type
to ‘none’ instead of ‘surv’ (the
default)
set ‘importance’ to ‘none’ instead of ‘anova’ (the default)
increase split_min_events
,
split_min_obs
, leaf_min_events
, or
leaf_min_obs
to make trees stop growing sooner
increase split_min_stat
to make trees stop growing
sooner
Applying these tips:
orsf(pbc_orsf,
formula = time+status~.,
na_action = 'na_impute_meanmode',
n_thread = 0,
n_tree = 5,
n_retry = 0,
oobag_pred_type = 'none',
importance = 'none',
split_min_events = 20,
leaf_min_events = 10,
split_min_stat = 10)
#> ---------- Oblique random survival forest
#>
#> Linear combinations: Accelerated Cox regression
#> N observations: 276
#> N events: 111
#> N trees: 5
#> N predictors total: 18
#> N predictors per node: 5
#> Average leaves per tree: 5.4
#> Min observations in leaf: 5
#> Min events in leaf: 10
#> OOB stat value: none
#> OOB stat type: none
#> Variable importance: none
#>
#> -----------------------------------------
While these default values do make orsf()
run slower,
they also usually make its predictions more accurate or make the fit
easier to interpret.