k-sample test

Giovanni Saraceno

\(k\)-sample test

We generated three samples, with \(n=200\) observations each, from 2-dimensional Gaussian distributions with mean vectors \(\mu_1 = (0, \frac{\sqrt{3}}{3})\), \(\mu_2 = (-\frac{1}{2}, -\frac{\sqrt{3}}{6})\) and \(\mu_3 = (\frac{1}{2}, -\frac{\sqrt{3}}{6})\), and the identity matrix as covariance matrix. In this setting, the three samples follow different Gaussian distributions, i.e. \(X_1 \sim N_2(\mu_1, I)\), \(X_2 \sim N_2(\mu_2, I)\) and \(X_3 \sim N_2(\mu_3, I)\). The vector y indicates group membership.

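library(mvtnorm)
library(QuadratiK)
# Three groups of 200 observations each
sizes <- rep(200, 3)
# eps scales the distance between the group means
eps <- 1
set.seed(2468)
# Bivariate normal samples; rmvnorm uses the identity covariance by default
x1 <- rmvnorm(sizes[1], mean = c(0, sqrt(3) * eps / 3))
x2 <- rmvnorm(sizes[2], mean = c(-eps / 2, -sqrt(3) * eps / 6))
x3 <- rmvnorm(sizes[3], mean = c(eps / 2, -sqrt(3) * eps / 6))
# Stack the samples and record the group membership of each row
x <- rbind(x1, x2, x3)
y <- as.factor(rep(c(1, 2, 3), times = sizes))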
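The geometry of the three mean vectors can be checked with a few lines of base R. The sketch below only restates the means from the example (the names `mu`, `mean_dist` and `centroid` are introduced here for illustration) and computes their pairwise distances and centroid. The three means are the vertices of an equilateral triangle with side length `eps`, centred at the origin.

```r
# Mean vectors from the example (eps = 1 reproduces the values in the text)
eps <- 1
mu <- rbind(
  mu1 = c(0, sqrt(3) * eps / 3),
  mu2 = c(-eps / 2, -sqrt(3) * eps / 6),
  mu3 = c(eps / 2, -sqrt(3) * eps / 6)
)

# Pairwise Euclidean distances between the three means
mean_dist <- dist(mu)
print(mean_dist)

# Centroid of the three means
centroid <- colMeans(mu)
print(centroid)
```

Since the covariance is the identity, `eps` directly controls how far apart the group means are relative to the spread of each group.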

Recall that the computed test statistics correspond to the omnibus tests.

h <- 1.5
set.seed(2468)
k_test <- kb.test(x=x, y=y, h=h)
k_test
## 
##  Kernel-based quadratic distance k-sample test 
## 
## H0 is rejected:  TRUE 
## 
## Test Statistic:  0.02452253 0.01226127 
## Critical value (CV):  0.001164279 0.0005821397 
## CV method:  subsampling 
## Selected tuning parameter h:  1.5
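The rejection decision in this output can be read as a comparison of each test statistic with its critical value. The following sketch copies the printed numbers by hand and applies that comparison. It assumes the usual rule that the null hypothesis is rejected when a statistic exceeds its critical value, and the variable names are illustrative rather than fields of the kb.test object.

```r
# Values copied from the printed kb.test output
test_stat <- c(0.02452253, 0.01226127)
crit_val  <- c(0.001164279, 0.0005821397)

# Assumed decision rule: reject H0 when the statistic exceeds its critical value
reject_h0 <- test_stat > crit_val
print(reject_h0)
```

Both statistics are more than an order of magnitude larger than their critical values, which agrees with the reported rejection of the null hypothesis.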

When the \(k\)-sample test is performed, the summary method on the kb.test object returns the test results together with standard descriptive statistics for each variable, computed both overall and within each of the provided groups.

summary_ktest <- summary(k_test)
## 
##  Kernel-based quadratic distance k-sample test 
##   Test_Statistic Critical_Value Reject_H0
## 1     0.02452253   0.0011642795      TRUE
## 2     0.01226127   0.0005821397      TRUE
summary_ktest$summary_tables
## [[1]]
##             Group 1    Group 2    Group 3       Overall
## mean   -0.005959147 -0.5370127  0.5442058  0.0004113282
## sd      0.997319811  0.9583059  1.0374834  1.0900980006
## median -0.028244038 -0.5477108  0.5297478 -0.0239486027
## IQR     1.478884929  1.4105832  1.4234532  1.5377418198
## min    -2.860006689 -3.1869808 -2.2119189 -3.1869807848
## max     2.151784802  2.0647648  3.1580700  3.1580700259
## 
## [[2]]
##           Group 1    Group 2    Group 3     Overall
## mean    0.4935364 -0.4042219 -0.2461729 -0.05228613
## sd      1.0449582  1.0411639  1.0474989  1.11391575
## median  0.5281635 -0.4325995 -0.2950922 -0.09520111
## IQR     1.4001089  1.4662111  1.2867345  1.48444495
## min    -2.6448703 -2.8786352 -3.4932849 -3.49328492
## max     3.0792766  2.6788424  2.8290722  3.07927659
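Tables with this layout can be reproduced in base R, which may help when checking or extending the descriptive statistics. The sketch below is not the package's implementation: it uses hypothetical stand-in data generated with `rnorm` (one variable, three groups), and the helper `describe` is a name introduced here for illustration.

```r
set.seed(1)

# Hypothetical stand-in data: one variable observed in three groups of 200
g <- factor(rep(1:3, each = 200))
v <- rnorm(600, mean = c(0, -0.5, 0.5)[as.integer(g)])

# The six descriptive statistics shown in the summary tables
describe <- function(z) {
  c(mean = mean(z), sd = sd(z), median = median(z),
    IQR = IQR(z), min = min(z), max = max(z))
}

# One column per group, then the statistics on the pooled data
by_group <- sapply(split(v, g), describe)
colnames(by_group) <- paste("Group", levels(g))
stat_table <- cbind(by_group, Overall = describe(v))
print(stat_table)
```

Applying the same steps to each column of a data matrix would give one table per variable, as in the list returned by the summary method.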

Selection of h

If a value of \(h\) is not provided, the function automatically calls select_h to choose it.

#k_test_h <- kb.test(x=x, y=y)

For a more accurate search of the tuning parameter, the function select_h can be used directly. For the \(k\)-sample problem it takes the same inputs x and y as kb.test.

set.seed(2468)
h_k <- select_h(x=x, y=y, alternative="skewness")
h_k$h_sel

The figure generated by select_h for the \(k\)-sample data set displays the estimated power against the considered values of \(h\), for each value \(\delta\) of the skewness alternative.

As the figure shows, when the alternative distribution \(F_\delta\) with \(\delta=0.2\) is considered, no value of \(h\) achieves power greater than or equal to 0.5. The second value, \(\delta=0.3\), is then taken into account, and \(h=1.6\) is chosen as the optimal value since it is the smallest value with power of at least 0.5. The figure also indicates a range of values of \(h\) with high power.
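The selection rule described in this paragraph can be written down in a few lines. The sketch below is an illustration of that rule only, not the code of select_h: the power values in `power_tab` are made up to mimic the situation described in the text, and `pick_h` is a hypothetical helper.

```r
# Hypothetical power estimates (made up for illustration, not select_h output)
power_tab <- data.frame(
  delta = rep(c(0.2, 0.3), each = 4),
  h     = rep(c(0.8, 1.2, 1.6, 2.0), times = 2),
  power = c(0.10, 0.20, 0.30, 0.35,
            0.30, 0.45, 0.60, 0.70)
)

# Go through the alternatives in increasing order of delta and return the
# smallest h whose power reaches the threshold for the first delta that has one
pick_h <- function(tab, threshold = 0.5) {
  for (d in sort(unique(tab$delta))) {
    ok <- tab$delta == d & tab$power >= threshold
    if (any(ok)) {
      return(list(delta = d, h = min(tab$h[ok])))
    }
  }
  NULL  # no (delta, h) pair reaches the threshold
}

sel <- pick_h(power_tab)
print(sel)
```

With these illustrative numbers no value of h reaches the threshold for the first alternative, so the rule moves on to the second one and returns the smallest h that qualifies there.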