%\VignetteIndexEntry{Using CorrBin for the Shell Toxicology data}
%\VignetteDepends{lattice}
\documentclass[reqno]{amsart}
\usepackage[margin=0.8in]{geometry}
\usepackage{graphicx}
\title{Using the CorrBin package for nonparametric analysis of correlated binary
data}
\author[A. Szabo]{Aniko Szabo}
\SweaveOpts{echo=TRUE}
\begin{document}
\setkeys{Gin}{width=0.5\textwidth}
\maketitle
<>=
options(useFancyQuotes=FALSE)
@
The \texttt{CorrBin} package focuses on non-parametric methods for exchangeable correlated binary
data with varying cluster sizes. Exchangeability implies that the order of responses within a
cluster does not matter, only the total number of responses; the package uses the clustersize/
number of responses combination as input. Currently only one categorical cluster-level predictor,
treatment group, is allowed. Many of the functions are geared toward testing trend, so treatment
groups are usually treated as ordered categories.
<>=
library(CorrBin)
library(lattice)
@
\section{Data input}
All the analysis functions in the package work on \texttt{CBData} objects, so
we start by setting up the data in the format needed for analysis. The Shell toxicology
data set is available in the CBData format in the package, however we will load
it from a text file using the \texttt{read.CBData} function to show more typical usage. The
``ShellTox.txt'' file contains four space-delimited columns. The first column contains an integer 1 -- 4 giving the treatment group, the second
column gives the size of the cluster, the third the number of responses in the cluster, and the
last gives the number of times the given combination occured in the data.
<>=
options(width=100)
ps.options(colormodel="rgb")
<>=
sh <- read.CBData("ShellTox.txt", with.freq=TRUE)
levels(sh$Trt) <- c("Control","Low","Medium", "High")
str(sh)
@
Alternatively, if the data is already in a data frame, the \texttt{CBData} function
can be used to define the roles of the variables. Both \texttt{read.CBData} and \texttt{CBData}
can accommodate cluster-level data that has not been summarized to have frequencies of each
clustersize/number of responses combination.
\section{Marginal compatibility}
A basic assumption of all of the following analyses is that of \emph{marginal compatibility} (MC),
which states that the size of the cluster has no effect on either the marginal probability of
response, or the correlation (any order) within the cluster. Of course, this is only relevant if
the clusters of different sizes are present in the data. We
can test for marginal compatibility:
<>=
mc.test.chisq(sh)
@
Neither the overall p-value of 0.4, or the individual treatment group p-values show evidence
of deviation from marginal compatibility.
Now we can obtain non-parametric estimates of the distribution of the number of
responses in the cluster under MC:
<>=
sh.mc <- mc.est(sh)
@
Even though the \texttt{mc.est} functions gives estimates for all cluster-sizes, due to the
marginal compatibility assumption the estimates $\pi_{r,M}$ for the largest cluster-size $M$
determine all the other estimates:
\begin{equation}
P(\text{$r$ responses}\mid\text{cluster size $n$})=\pi_{r,n} = \sum_{t=0}^{M}h(r,t,n)\pi_{t,M},
\end{equation}
where $h(r, t, n) = \binom{t}{r}\binom{M-t}{n-r}/\binom{M}{n}$ is the hypergeometric density function.
Figure~\ref{F:MCest} shows the estimates.
\begin{figure}
<>=
print(xyplot(Prob~NResp|factor(ClusterSize), groups=Trt, data=sh.mc, subset=ClusterSize>0 & ClusterSize<13,
type="l", as.table=TRUE, auto.key=list(columns=4, lines=TRUE, points=FALSE),
xlab="Number of responses", ylab="P(R=r|N=n)"))
@
\caption{Density function of number of responses under MC by cluster-size estimated separately
for each treatment group}\label{F:MCest}
\end{figure}
<>=
<>
@
The density functions in Figure~\ref{F:MCest} are difficult to compare (there is no
obvious shift); distribution functions often provide a cleaner comparison, so
they are plotted in Figure~\ref{F:MCsurvest}. These plots show that curves for the ``Control'' and
``Low'' groups tend to be above the ``Medium'' and ``High'' group, suggesting the
presence of a dose related trend.
\begin{figure}
<>=
panel.cumsum <- function(x,y,...){
x.ord <- order(x)
panel.xyplot(x[x.ord], cumsum(y[x.ord]), ...)}
print(xyplot(Prob~NResp|factor(ClusterSize), groups=Trt, data=sh.mc,
subset=ClusterSize>0&ClusterSize<13, type="s",
panel=panel.superpose, panel.groups=panel.cumsum,
as.table=T, auto.key=list(columns=4, lines=T, points=F),
xlab="Number of responses", ylab="Cumulative Probability R(R>=r|N=n)",
ylim=c(0,1.1)))
@
\caption{Distribution functions of number of responses under MC by cluster-size estimated separately
for each treatment group}\label{F:MCsurvest}
\end{figure}
<>=
<>
@
\section{Testing for trend}
Several methods have been proposed for testing for trend with correlated binary data; this
package implements 3 types of tests: the Rao-Scott (RS) test, 3 versions of a GEE-based test\footnote{The GEE approach is implemented,
even though it is not quite a non-parametric test}, and a stochastic
ordering (SO) based test. RS and GEE test for linear trend in the marginal probability of response, while SO tests for ordering of the distribution functions (as in Figure~\ref{F:MCsurvest}) of the number of responses..
The common interface for accessing the trend tests is the \texttt{trend.test} function.
It allows to pick the test, whether a permutation-test based exact or an asymptotic (only for RS and GEE)
p-value should be used, and set algorithm options for the SO test. In this vignette we use $R=50$ permutations
to limit running time, but in actual work larger values should be used.
<>=
set.seed(4461)
(so.res <- trend.test(sh, test="SO", R=50, control=soControl(eps=0.1, max.directions=40)))
@
The \texttt{eps} parameter sets the limit on the absolute error of the log-likelihood (and thus, the
test statistic), while \texttt{max.directions} is a tuning parameter for the default ISDM algorithm: it
affects only the running time, and, in general, larger values lead to fewer iterations that however take
longer. The details of the permutation test are saved in the output object, so the null distribution of
the test statistic can be extracted for, say, a plot as in Figure~\ref{F:SOboot}.
\begin{figure}
<>=
hist(attr(so.res, "boot")$t[,1], freq=FALSE, xlab="Statistic", ylab="Density", main="")
points(so.res$statistic, 0, pch="*",col="red", cex=3)
@
\caption{Null distribution of the SO test statistic with the observed value marked with a red star.}\label{F:SOboot}
\end{figure}
<>=
<>
@
The other tests also show the presence of a statistically significant trend:
<>=
trend.test(sh, test="RS")
trend.test(sh, test="GEE")
@
The stochastic ordering approach provides not only a test for trend, but also the estimated distribution
of the number of responses under the alternative hypothesis of stochastic order. These can be obtained by the
\texttt{SO.mc.est} function, with the value of the log-likelihood, and convergence information. The estimates are
then plotted (see Figure~\ref{F:MCsoest}).
<>=
sh.SO.est <- SO.mc.est(sh, control=soControl(eps=0.1, max.directions=40))
str(sh.SO.est)
@
\begin{figure}
<>=
print(xyplot(Prob~NResp|factor(ClusterSize), groups=Trt, data=sh.SO.est,
subset=ClusterSize<13, type="s",
panel=panel.superpose, panel.groups=panel.cumsum,
as.table=T, auto.key=list(columns=4, lines=T, points=F),
xlab="Number of responses", ylab="Cumulative Probability R(R>=r|N=n)",
ylim=c(0,1.1), main=""))
@
\caption{Stochastically ordered distribution functions of the number of responses under MC by cluster-size.}\label{F:MCsoest}
\end{figure}
<>=
<>
@
\section{Risk assessment}
The No-Statistical-Significance-Of-Trend dose -- the largest dose at which no trend in the rate of response has
been observed -- is often used to determine a safe dosage level for a potentially toxic compound.
The \texttt{NOSTASOT} function computes this dose by a step-down approach of testing all doses, all but the last, etc.
All three tests (stochastic order, Rao-Scott, and GEE) are available.
<>=
NOSTASOT(sh, test="RS")
@
For the Shell toxicology data, the ``Low'' dose of the drug shows no trend, but all higher doses do, so that's the NOSTASOT
dose.
\end{document}
setwd("c:/RForge/CorrBin/inst/doc")
Sweave("CorrBinVignette.Rnw", stylepath=FALSE)