Verbal autopsy (VA) is a well-established approach to ascertaining the cause of death (COD) when medical certification and full autopsies are not feasible or practical (Garenne 2014; Taylor et al. 1983). After a death is identified, a specially-trained fieldworker interviews the caregivers (usually family members) of the decedent. A typical VA interview includes a set of structured questions with categorical or quantitative responses and a narrative section that records the “story” of the death from the respondent’s point of view (World Health Organization 2012).

This vignette aims to provide a quick introduction to the **openVA** package using simple case studies. For more detailed discussion of the package and related VA methods, please refer to (Zehang R. Li et al. 2021)

The **openVA** package (Zehang Richard Li et al. 2022) addresses this issue through an open-source toolkit. The **openVA** suite comprises a collection of R packages for the analysis of verbal autopsy data. The goal of this package is to provide researchers with an open-source tool that supports flexible data input formats and allows easier data manipulation and model fitting. The **openVA** suite consists of four core packages that are on CRAN, **InterVA4** (Zehang R. Li et al. 2019), **InterVA5** (Thomas et al. 2021), **InSilicoVA** (Zehang R. Li, McCormick, and Clark 2022), and **Tariff** (Zehang R. Li, McCormick, and Clark 2018), and an optional package **nbc4va** (Wen et al. 2018). Each of these packages implements one coding algorithm. For three of these algorithms – namely, InterVA-4, InterVA-5 and Tariff – there are also compiled software programs distributed by the original authors.

The **openVA** package is hosted on CRAN and can be installed with the following commands. For the analysis in this paper, we also install the **nbc4va** package separately from GitHub. The versions of the supporting packages can be checked in R using the `openVA_status()`

function.

```
set.seed(12345)
library(openVA)
::install_github("rrwen/nbc4va")
remoteslibrary(nbc4va)
openVA_status()
```

The common modeling framework for VA data consists of first converting the survey responses into a series of binary variables, and then assigning a COD to each death based on the binary input variables. Typically, the target of inference consists of two parts: the individual cause-of-death assignment, and the population-level cause-specific mortality fractions (CSMF), i.e., the fraction of deaths due to each cause. In this section, we formally compare the modeling approaches utilized by each algorithm for these two targets. We adopt the following notations. Consider \(N\) deaths, each with \(S\) binary indicators of symptoms. Let \(s_{ij}\) denote the indicator for the presence of \(j\)-th symptom in the \(i\)-th death, which can take values 0, 1, or NA (for missing data). We consider a pre-defined set of causes of size \(C\). For the \(i\)-th death, denote the COD by \(y_i \in \{1, ..., C\}\) and the probability of dying from cause \(k\) is denoted by \(P_{ik}\). For the population, the CSMF of cause \(k\) is denoted as \(\pi_k\), with \(\sum_{k=1}^C \pi_k = 1\).

**InterVA4**(Byass et al. 2012) and**InterVA5**(Byass et al. 2019) algorithms calculate the probability of each COD given the observed symptoms using the following formula, \[ P_{ik} = \frac{\pi_{k}^{(0)} \prod_{j=1}^S P(s_{ij}=1|y_{i}=k) \mathbf{1}_{s_{ij} = 1}} {\sum_{k' = 1}^C \pi_{k'}^{(0)} \prod_{j=1}^S P(s_{ij}=1|y_{i}=k') \mathbf{1}_{s_{ij} = 1}} \] where both the prior distribution of each of the causes, \(\pi_{k}^{(0)}\) and the conditional probabilities \(P(s_{ij} = 1 | y_i = k)\) are fixed values provided in the InterVA software. It is worth noting that the formula does not follow the standard Bayes’ rule as it omits the probability that any symptom is absent. A detailed discussion of this modeling choice can be found in McCormick et al. (2016). The conditional probabilities, \(P(s_{ij}=1|y_{i}=k)\), used in InterVA algorithms are represented as rankings with letter grades instead of numerical values (Byass et al. 2012). For example, \(P(s_{ij}=1|y_{i}=k) = A+\) is translated into \(P(s_{ij}=1|y_{i}=k) = 0.8\), etc. The standard InterVA software only supports the fixed set of symptoms and causes where such prior information is provided. For a different data input format, this formulation can be easily generalized if training data are available. We include in the**openVA**package an extension of the algorithm that calculates \(\hat P(s_{ij}=1|y_{i}=k)\) from the empirical distribution in the training data and then maps to letter grades with different truncation rules. Details of the extended InterVA algorithm can be found in McCormick et al. (2016).After the individual COD distributions are calculated, InterVA-4 utilizes a series of pre-defined rules to identify up to the top three most likely COD assignments, and truncates the probabilities for the rest of the CODs to 0 and adds an ‘undetermined’ category so that the probabilities sum up to 1 (See the user guide of Byass (2015)). Then the population-level CSMFs are calculated as the aggregation of individual COD distributions, such that \[ \pi_k = \sum_{i=1}^N P^*_{ik} \] where \(P^*_{ik}\) denotes the individual COD distribution after introducing the undetermined category.

**Naive Bayes Classifier**(Miasnikof et al. 2015) is very similar to the InterVA algorithm with two major differences. First, instead of considering only symptoms that present, the NBC algorithm also considers symptoms that are absent. Second, the conditional probabilities of symptoms given causes are calculated from training data instead of given by physicians, which is similar to our extension of InterVA discussed above. Similar to InterVA, the NBC method can be written as \[ P_{ik} = \frac{\pi_{k}^{(0)} \prod_{j=1}^S (P(s_{ij}=1|y_{i}=k) \mathbf{1}_{s_{ij} = 1} + P(s_{ij} \neq 1|y_{i}=k) \mathbf{1}_{s_{ij} \neq 1})} {\sum_{k' = 1}^C \pi_{k'}^{(0)} \prod_{j=1}^S (P(s_{ij}=1|y_{i}=k') \mathbf{1}_{s_{ij} = 1}+ P(s_{ij} \neq 1|y_{i}=k') \mathbf{1}_{s_{ij} \neq 1})} \] and the CSMFs are calculated by \(\pi_k = \sum_{i=1}^N P_{ik}\).**InSilicoVA**algorithm (McCormick et al. 2016) assumes a generative model that characterizes both the CSMF at the population level, and the COD distributions at the individual level. In short, the core generative process assumes \[\begin{eqnarray} s_{ij} | y_i = k &\propto& \mbox{Bernoulli}(P(s_{ij} | y_i = k)) \\ y_i | \pi_1, ..., \pi_C &\propto& \mbox{Categorical}(\pi_1, ..., \pi_C) \\ \pi_k &=& \exp \theta_k / \sum_{k=1}^C \exp \theta_k \\ \theta_k &\propto& \mbox{Normal}(\mu, \sigma^2) \end{eqnarray}\] and hyperpriors are also placed on \(P(s_{ij} | y_i = k)\), \(\mu\), and \(\sigma^2\). The priors for \(P(s_{ij} | y_i = k)\) are set by the rankings used in InterVA-4 if the data are prepared into InterVA format, otherwise they are learned from training data. The priors on \(\mu\) and \(\sigma^2\) are diffuse uniform priors. Parameter estimation is performed using MCMC, so that a sample of posterior distributions of \(\pi_k\) can be obtained after the sampler converges.**Tariff**algorithm (James et al. 2011) differs from the other three methods in that it does not calculate an explicit probability distribution of the COD for each death. Instead, for each death \(i\), a Tariff score is calculated for each COD \(k\) so that \[ Score_{ik} = \sum_{j = 1}^{S} \mbox{Tariff}_{kj}\mathbf{1}_{s_{ij}=1} \] where the symptom-specific Tariff score \(\mbox{Tariff}_{kj}\) is defined as \[ \mbox{Tariff}_{kj} = \frac{n_{kj} - median(n_{1j}, n_{2j}, ..., n_{Cj})} {IQR(n_{1j}, n_{2j}, ..., n_{Cj})} \] where \(n_{kj}\) is the count of how many deaths from cause \(k\) contain symptom \(j\) in the training data. The Tariff scores are then turned into rankings by comparing them to a reference distribution of scores calculated from re-sampling the training dataset to obtain a uniform COD distribution. It is worth noting that the Tariff algorithm produces the COD distribution for each death in terms of their rankings instead of the probability distributions. And thus the CSMF for each cause \(k\) is calculated by the fraction of deaths with cause \(k\) being the highest ranked cause, i.e., \[ \pi_k = \frac{\sum_{i=1}^N\mathbf{1}_{y_i = k}}{N} \]

In addition to the different model specifications underlying each algorithm, there is also a major conceptual difference in handling missing symptoms across the algorithms. Missing symptoms could arise from different stages of the data collection process. For example, the respondent may not know whether certain symptoms existed or may refuse to answer a question. From a statistical point of view, knowing that a symptom does not exist provides some information to the possible cause assignment, while a missing symptom does not. Although in theory, most of the VA algorithms could benefit from distinguishing ‘missing’ from ‘absent’, InSilicoVA is the only algorithm that has been implemented to acknowledge missing data. Missing indicators are assumed to be equivalent to ‘absence’ in InterVA, NBC, and Tariff.

In the **openVA** package, we consider two main types of standardized questionnaire: the WHO instrument and the IHME questionnaire. In this section, we focus on describing these two data formats and tools to clean and convert data. Pre-processing the raw data collected from the survey instrument (usually with Open Data Toolkit) is usually performed with additional packages and software outside of the analysis pipeline in R. We briefly mention software for data pre-processing towards the end of this paper.

For users familiar with InterVA software and the associated data processing steps, the standard input format from the WHO 2012 and 2016 instruments is usually well understood. For the 2012 instrument, the data expected by the InterVA-4 software are organized into a data frame where each row represents one death and the corresponding VA information is contained in \(246\) fields, starting from the first item being the ID of the death. The \(245\) items following the ID each represent one binary variable of symptom/indicator, where ‘presence’ is coded by ‘Y’, and ‘absence’ is coded by an empty cell.

To accommodate updates for the WHO 2016 instrument (D’Ambruoso et al. 2017), the InterVA-5 software accepts a data frame with \(354\) columns that include \(353\) columns of symptom/indicators followed by an additional column for the record ID. It should be noted that the R package **InterVA5** retains the format with the record ID residing in the first column. Another important update with InterVA-5 is that it acknowledges the difference between “Yes” and “No” (or “Y/y” and “N/n”, which is different from the coding scheme in InterVA-4), both of which are processed as relevant responses, while all other responses are treated as missing values and ignored. With respect to the list of causes of death, InterVA-5 utilizes the WHO 2016 COD categories, which is nearly identical to the WHO 2012 COD categories (used by InterVA-4) except that hemorrhagic fever and dengue fever are two separate categories in the 2016 COD categories.

The same input format is inherited by the **openVA** package, except for one modification. We further distinguish ‘missing’ and ‘absent’ in the input data frame explicitly. We highly recommend that users pre-process all the raw data so that a ‘missing’ value in the data spreadsheet is coded as a ‘.’ (following the Stata practice familiar to many VA practitioners), and an ‘absent’ value is indicated by an empty cell, as in the standard InterVA-4 software. For WHO 2016 data, both ‘.’ and ‘-’ (following the default coding scheme of InterVA-5 software) are interpreted as missing values. For methods other than InSilicoVA, ‘missing’ and ‘absent’ will be considered the same internally and thus will not introduce a compatibility problem.

The Population Health Metrics Research Consortium (PHMRC) gold standard VA data (Murray et al. 2011) consist of three datasets corresponding to adult, child, and neonatal deaths, respectively. All deaths occurred in health facilities and gold-standard causes are determined based on laboratory, pathology and medical imaging findings. These datasets can be downloaded directly using the link returned by the function `getPHMRC_url()`

. For example, we can read the adult VA dataset using the following command.

`<- read.csv(getPHMRC_url("adult")) PHMRC_adult `

Although the data are publicly accessible, a major practical challenge for studies involving the PHMRC gold standard dataset is that the pre-processing steps described from the relevant publications are not clear enough nor easy to implement. The **openVA** package internally cleans up the PHMRC gold standard data when calling the `codeVA()`

function on the PHMRC data. The procedure follows the steps described in the supplement material of McCormick et al. (2016). Notice that the original PHMRC data are useful for comparing and validating new methods, as well as for using as training data, but the cleaning functions only require that the columns are exactly the same as the PHMRC gold standard datasets, so they could also be used for new data that are pre-processed into the same format.

In addition to the two standard questionnaires discussed previously, researchers might also be interested in including customized dichotomous symptoms in their analysis. The **openVA** package also supports customized inputs as long as they are dichotomous. In such case, neither the built-in conditional probability matrix of InterVA nor the PHMRC gold standard dataset could be used to learn the relationship between training and testing data, thus different training data with known causes of death are necessary for all three algorithms. The `ConvertData()`

function can be used to convert data with customized coding schemes into the format recognized by the **openVA** package.

In the first example, we demonstrate fitting InterVA and InSilicoVA on a random sample of \(1,000\) deaths from the ALPHA network without cause-of-death labels collected with the WHO 2012 instrument. The dataset is already included in the openVA package as a dataset `RandomVA1`

and can be loaded directly.

```
data(RandomVA1)
dim(RandomVA1)
```

`## [1] 1000 246`

`head(RandomVA1[, 1:10])`

```
## ID elder midage adult child under5 infant neonate male female
## 1 d1 Y Y
## 2 d2 Y Y
## 3 d3 Y Y
## 4 d4 Y Y
## 5 d5 Y Y
## 6 d6 Y Y
```

The `codeVA()`

function provides a standardized syntax to fit different VA models. Internally, the `codeVA()`

function organizes the input data according to the specified data type, checks for incompatibility of the data and specified model, and calls the corresponding model fitting functions. It returns a classed object of the specified model class. In this example, we use version 4.03 of the InterVA software, which is the latest release of the original software compatible with the WHO 2012 instrument. Any additional model-specific parameters can be passed through the arguments of `codeVA()`

. Here we specify the HIV and malaria prevalence levels required by the InterVA model to be ‘high’. Guidelines on how to set these parameters can be found in Byass et al. (2012).

```
<- codeVA(data = RandomVA1, data.type = "WHO2012",
fit_inter_who model = "InterVA", version = "4.03",
HIV = "h", Malaria = "h")
```

`summary(fit_inter_who) `

```
## InterVA-4 fitted on 1000 deaths
## CSMF calculated using reported causes by InterVA-4 only
## The remaining probabilities are assigned to 'Undetermined'
##
## Top 5 CSMFs:
## cause likelihood
## Undetermined 0.154
## HIV/AIDS related death 0.122
## Stroke 0.072
## Reproductive neoplasms MF 0.058
## Pulmonary tuberculosis 0.055
```

We can implement InSilicoVA method with similar syntax. We use the default parameters and run the MCMC for \(10,000\) iterations. Setting the `auto.length`

argument to FALSE specifies that the algorithm does not automatically increase the length of the chain when convergence failed. The InSilicoVA algorithm is implemented using a Metropolis-Hastings within Gibbs sampler. The acceptance rate is printed as part of the message as the model samples from the posterior distribution.

```
<- codeVA(RandomVA1, data.type = "WHO2012", model = "InSilicoVA",
fit_ins_who Nsim = 10000, auto.length = FALSE)
```

`summary(fit_ins_who) `

```
## InSilicoVA Call:
## 1000 death processed
## 10000 iterations performed, with first 5000 iterations discarded
## 250 iterations saved after thinning
## Fitted with re-estimated conditional probability level table
## Data consistency check performed as in InterVA4
##
## Top 10 CSMFs:
## Mean Std.Error Lower Median Upper
## Other and unspecified infect dis 0.266 0.0168 0.235 0.265 0.301
## HIV/AIDS related death 0.102 0.0091 0.085 0.102 0.119
## Renal failure 0.101 0.0108 0.084 0.101 0.123
## Other and unspecified neoplasms 0.062 0.0089 0.046 0.061 0.080
## Other and unspecified cardiac dis 0.058 0.0076 0.044 0.058 0.075
## Digestive neoplasms 0.050 0.0077 0.033 0.050 0.065
## Acute resp infect incl pneumonia 0.048 0.0073 0.034 0.049 0.063
## Pulmonary tuberculosis 0.039 0.0068 0.025 0.039 0.054
## Stroke 0.038 0.0061 0.027 0.038 0.052
## Other and unspecified NCD 0.034 0.0089 0.018 0.034 0.052
```

In the second example, we consider a prediction task using the PHMRC adult dataset. We first load the complete PHMRC adult dataset from its on-line repository, and organize it into training and test datasets. We treat all deaths from Andhra Pradesh, India as the test dataset.

```
<- read.csv(getPHMRC_url("adult"))
PHMRC_adult <- which(PHMRC_adult$site == "AP")
is.test <- PHMRC_adult[is.test, ]
test <- PHMRC_adult[-is.test, ]
train dim(test)
```

`## [1] 1554 946`

`dim(train)`

`## [1] 6287 946`

In order to fit the models on the PHMRC data, we specify `data.type = "PHMRC"`

and `phmrc.type = "adult"`

to indicate the data input is collected using the PHMRC adult questionnaire. We also specify the column of the causes-of-death label in the training data. The rest of the syntax is similar to the previous example.

When the input consists of both training and testing data, the InterVA and InSilicoVA algorithms estimate the conditional probabilities of symptoms using the training data, instead of using the built-in values. In such case, the `version`

argument for the InterVA algorithm is suppressed. There are several ways to map the conditional probabilities of symptoms given causes in the training dataset to a letter grade system, specified by the `convert.type`

argument. The `convert.type = "quantile"`

performs the mapping so that the percentile of each rank stays the same as the original \(P_{s|c}\) matrix in InterVA software. Alternatively we can also use the original fixed values of translation, and assign letter grades closest to each entry in \(\hat{P}_{s|c}\). This conversion is specified by `convert.type = "fixed"`

, and is more closely aligned to the original InterVA and InSilicoVA setting. Finally, we can also directly use the values in the \(\hat{P}_{s|c}\) without converting them to ranks and re-estimating the values associated with each rank. This can be specified by `convert.type = "empirical"`

. In this demonstration, we assume the fixed value conversion.

```
<- codeVA(data = test, data.type = "PHMRC", model = "InterVA",
fit_inter data.train = train, causes.train = "gs_text34",
phmrc.type = "adult", convert.type = "fixed")
```

```
<- codeVA(data = test, data.type = "PHMRC", model = "InSilicoVA",
fit_ins data.train = train, causes.train = "gs_text34",
phmrc.type = "adult", convert.type = "fixed",
Nsim=10000, auto.length = FALSE)
```

The NBC and Tariff method can be fit using similar syntax.

```
<- codeVA(data = test, data.type = "PHMRC", model = "NBC",
fit_nbc data.train = train, causes.train = "gs_text34",
phmrc.type = "adult")
```

```
<- codeVA(data = test, data.type = "PHMRC", model = "Tariff",
fit_tariff data.train = train, causes.train = "gs_text34",
phmrc.type = "adult")
```

Notice that we do not need to transform the PHMRC data manually. Data transformations are performed automatically within the `codeVA()`

function.

In this section we demonstrate how to summarize results, extract output, and visualize and compare fitted results. All the fitted object returned by `codeVA()`

are S3 objects, for which a readable summary of model results can be obtained with the `summary()`

function as shown in the previous section. In addition, several other metrics are commonly used to evaluate and compare VA algorithms at either the population or individual levels. In the rest of this section, we show how to easily calculate and visualize some of these metrics with the **openVA** package.

We can extract the CSMFs directly using the `getCSMF()`

function. The function returns a vector of the point estimates of the CSMFs, or a matrix of posterior summaries of the CSMF for the InSilicoVA algorithm.

```
<- getCSMF(fit_inter)
csmf_inter <- getCSMF(fit_ins)
csmf_ins <- getCSMF(fit_nbc)
csmf_nbc <- getCSMF(fit_tariff) csmf_tariff
```

One commonly used metric to evaluate the CSMF estimates is the so-called CSMF accuracy, defined as \[
CSMF_{acc} = 1 - \frac{\sum_j^C CSMF_i - CSMF_j^{(true)}}{2(1 - \min CSMF^{(true)})}
\] The CSMF accuracy can be readily computed using functions in **openVA** as the codes below shows.

```
<- table(c(test$gs_text34, unique(PHMRC_adult$gs_text34))) - 1
csmf_true <- csmf_true / sum(csmf_true)
csmf_true c(getCSMF_accuracy(csmf_inter, csmf_true, undet = "Undetermined"),
getCSMF_accuracy(csmf_ins[, "Mean"], csmf_true),
getCSMF_accuracy(csmf_nbc, csmf_true),
getCSMF_accuracy(csmf_tariff, csmf_true))
```

`## [1] 0.53 0.74 0.77 0.68`

We use the empirical distribution in the test data to calculate the true CSMF distribution, i.e., $CSMF_i^{(true)} = $. Then we evaluate the CSMF accuracy using the `getCSMF_accuracy()`

function. As discussed previously, the default CSMF calculation is slightly different for diCfferent methods. For example, the InterVA algorithm creates the additional category of `Undetermined`

by default, which is not in the true CSMF categories and needs to be specified. The creation of the undetermined category can also be suppressed by `interVA.rule = FALSE`

in the `getCSMF()`

function call. For the InSilicoVA algorithm, we use the posterior mean to calculate the point estimates of the CSMF accuracy.

At the individual level, we can extract the most likely cause-of-death assignment from the fitted object using the `getTopCOD()`

function.

```
<- getTopCOD(fit_inter)
cod_inter <- getTopCOD(fit_ins)
cod_ins <- getTopCOD(fit_nbc)
cod_nbc <- getTopCOD(fit_tariff) cod_tariff
```

With the most likely COD assignment, other types of metrics based on individual COD assignment accuracy can be similarly constructed by users. The summary methods can also be called for each death ID. For example, using the Tariff method, we can extract the fitted rankings of causes for the death with ID 6288 by

`summary(fit_inter, id = "6288")`

```
## InterVA-4 fitted top 5 causes for death ID: 6288
##
## Cause Likelihood
## Stroke 0.509
## Pneumonia 0.318
## COPD 0.081
## Other Infectious Diseases 0.064
## Renal Failure 0.013
```

The `summary()`

function for InSilcoVA does not provide uncertainty estimates for individual COD assignments by default. This is because, in practice, the calculation of individual posterior probabilities of COD distribution can be memory-intensive when the dataset is large. To obtain individual-level uncertainty measurements, we can either run the MCMC chain with the additional argument `indiv.CI = 0.95`

when calling `codeVA()`

, or update the fitted object directly with the saved posterior draws.

```
<- updateIndiv(fit_ins, CI = 0.95)
fit_ins summary(fit_ins, id = "6288")
```

```
## InSilicoVA fitted top causes for death ID: 6288
## Credible intervals shown: 95%
## Mean Lower Median Upper
## Stroke 0.5043 0.3485 0.5083 0.6361
## Pneumonia 0.4116 0.2615 0.4083 0.5834
## Other Infectious Diseases 0.0660 0.0411 0.0642 0.0966
## Epilepsy 0.0099 0.0064 0.0097 0.0142
## COPD 0.0053 0.0031 0.0052 0.0079
## Malaria 0.0007 0.0005 0.0007 0.0011
## Diabetes 0.0005 0.0003 0.0005 0.0009
## Acute Myocardial Infarction 0.0004 0.0003 0.0004 0.0006
## Falls 0.0004 0.0001 0.0004 0.0013
## Renal Failure 0.0004 0.0002 0.0003 0.0005
```

For \(N\) deaths, \(C\) causes, the posterior mean of individual COD distributions returned by the InSilicoVA model, along with median and with credible intervals can be represented by a \((N \times C \times 4)\)-dimensional array. The function `getIndivProb()`

extracts this summary in the form of a list of \(4\) matrices of dimension \(N\) by \(C\), which can then be saved to other formats to facilitate further analysis. For other methods, the point estimates of individual COD distribution are returned as the \(N\) by \(C\) matrix.

```
<- getIndivProb(fit_inter)
fit_prob dim(fit_prob)
```

`## [1] 1554 34`

The previous sections discuss how results could be extracted and examined in R. In this subsection, we show some visualization tools provided in the **openVA** package for presenting these results. The fitted CSMFs for the top causes can be easily visualized by the `plotVA()`

function. The default graph components are specific to each algorithm and individual package implementations, with options for further customization. For example, the figure below shows the estimated CSMF from the InterVA algorithm in the PHMRC data example.

`plotVA(fit_inter, title = "InterVA")`

The CSMFs can also be aggregated for easier visualization of groups of causes. For the InterVA-4 cause list, we included an example grouping built into the package, so the aggregated CSMFs can be compared directly. In practice, the grouping of causes of deaths often needs to be determined according to context and the research question of interest. Changing the grouping can be easily achieved by modifying the `grouping`

argument in the `stackplotVA()`

function. For example, to facilitate the new category of `Undetermined`

returned by InterVA, we first modify the grouping matrix to include it as a new cause and visualize the aggregated CSMF estimates in the figure below.

```
data(SampleCategory)
<- SampleCategory
grouping 1] <- as.character(grouping[,1])
grouping[,<- rbind(grouping, c("Undetermined", "Undetermined"))
grouping <- list(InterVA4 = fit_inter_who,
compare InSilicoVA = fit_ins_who)
stackplotVA(compare, xlab = "", angle = 0, grouping = grouping)
```

The ordering of the stacked bars can also be changed to reflect the structures within the aggregated causes, as demonstrated in the figure below.

```
<- c("TB/AIDS", "Communicable", "NCD", "External", "Maternal",
group_order "causes specific to infancy", "Undetermined")
stackplotVA(compare, xlab = "", angle = 0, grouping = grouping,
group_order = group_order)
```

Among the VA methods discussed in this paper, the InSilicoVA algorithm (McCormick et al. 2016) allows for more flexible modifications to the Bayesian hierarchical model structure when additional information is available. In this section, we illustrate two features unique to the InSilicoVA method: jointly estimating CSMFs from multiple populations, and incorporating partial and potentially noisy physician codings into the algorithm.

In practice researchers may want to estimate and compare CSMFs for different regions, time periods, or demographic groups in the population. Running separate models on subsets of data can be inefficient and does not allow parameter estimation to borrow information across different groups. The generative framework adopted by InSilicoVA allows the specification of sub-populations in analyzing VA data. Consider an input dataset with \(G\) different sub-populations. We can estimate different CSMFs \(\pi^{(g)}\) for \(g = 1, ..., G\) for each sub-population, while assuming the same conditional probability matrix, \(P_{s|c}\) and other hyperpriors. As an example, we show how to estimate different CSMFs for sub-populations specified by sex and age groups, using a randomly sampled ALPHA dataset with additional columns specifying the sub-population each death belongs to.

```
data(RandomVA2)
head(RandomVA2[, 244:248])
```

```
## stradm smobph scosts sex age
## 1 . . . Men 60+
## 2 . . . Women 60-
## 3 . . . Women 60-
## 4 . . . Women 60+
## 5 . . . Women 60-
## 6 . . . Women 60-
```

Then we can fit the model with one or multiple additional columns specifying sub-population membership for each observation.

```
<- codeVA(RandomVA2, model = "InSilicoVA",
fit_sub subpop = list("sex", "age"), indiv.CI = 0.95,
Nsim = 10000, auto.length = FALSE)
```

Functions discussed in the previous sections work in the same way for the fitted object with multiple sub-populations. Additional visualization tools are also available. The figure below plots the CSMFs for two sub-populations on the same plot by specify `type = "compare"`

.

`plotVA(fit_sub, type = "compare", title = "Comparing CSMFs", top = 3)`

By default, the comparison plots will select all the CODs that are included in the top \(K\) for each of the sub-populations. We can also plot only subsets of them by specifying the causes of interest, as shown in the figure below.

```
plotVA(fit_sub, type = "compare", title = "Comparing CSMFs",
causelist = c("HIV/AIDS related death",
"Pulmonary tuberculosis",
"Other and unspecified infect dis",
"Other and unspecified NCD"))
```