When first knowing about the kerntools
package, or
perhaps after downloading it, you may have asked yourself some of these
questions:
kerntools
?This document is meant to answer exactly that. The first section, Introduction, deals with the basic concepts
regarding the kernel approach. Subsequent sections review all kernels
implemented in kerntools
one by one, highlighting their
most important traits and giving some examples of use.
Kernel methods are all those machine learning methods that use kernel functions. Let \(X = {x_1, x_2, ... , x_N}\) be a dataset of N objects of any type (real-valued, graphs, rankings, matrices, categorical data, etc.). The kernel function evaluates all possible pairs of objects, which are then stored in a kernel matrix with dimension NxN:
\[\mathbf{K_{ij}} = k(x_i , x_j )\]
Intuitively, we can understand kernels as functions \(k(x_i , x_j )\) that measure the similarity between \(x_i\) and \(x_j\). Generally speaking, all kernels are similarity measures; however, not every similarity measure is a kernel: only those that generate proper kernel matrices. Kernel matrices are characterized by four properties: they are real-valued, squared, symmetric and PSD (positive semi-definite: all their eigenvalues are nonnegative). Thus, the kernel matrix K, and not the original dataset X, is the input data for our kernel method. In this approach, choosing a good kernel function is capital, because it’s the “lens” that the method will use to “see” the data.
If using kernels implies an “extra step” (computing K), why should we bother at all? Surely, it’s easier to use any other machine learning method and dealing directly with X, isn’t it? Well, the thing is that the kernel approach has some advantages:
Most machine learning methods require a numeric input (some of them, like Random Forests, also can accept categorical data). Usually, if our dataset is different from the typical real-valued matrix X with D features (variables), we will have to recode it somehow. But in kernel methods, the objects in X are never represented explicitly (only via their pairwise similarities), which has an interesting consequence: kernels can handle data types that are not standard. Instead, the challenge is using (or designing) kernels that extract valuable information from specific data types or problems. This requires a notion of what is considered “similar” in that given context.
Kernel functions can be combined or “custom-tailored” for a specific problem. For instance, previous knowledge about the problem can be encoded explicitly into the kernel. This is facilitated by the fact that kernel design is modular in nature: kernels can be combined. For example, the linear combination of kernel matrices is another kernel matrix.
When using nonlinear kernel functions, methods that were originally linear (for instance: Principal Components Analysis or PCA) “become” nonlinear. In some sense, we “trick” the method: it still works in a linear way, but as its input data has been transformed by a nonlinear function, the output it produces is nonlinear.
Curse of dimensionality: contrarily to a lot of other methods, this is not a big issue when working with kernels. In fact, when dealing with NxD datasets where N << D, kernels can decrease the computational burden if used cleverly, as the method will deal with a kernel matrix of dimension NxN.
The most basic kernel function is the dot product of two real vectors (see Linear kernel). This definition is very important, because all kernels can be written as inner products in some space:
\[x → φ(x)\] \[k(x_i , x_j ) = <φ(x_i),φ(x_j)>\] \(φ(x)\) maps an object from the original (input) space to the so-called feature space. This feature space typically has a higher dimension than the original space (other times it has the same dimension) and it’s endowed with an inner product. However, all the process happens “under the hood”: the kernel function uses this mapping without telling so (recall that the objects’ explicit representations are never used). Sometimes, even knowing the feature space associated to a given kernel can be tricky. However, this take avoids the computational cost of 1) explicitly mapping the dataset into feature space (which sometimes can be very high dimensional) + 2) computing the inner product there: the kernel conflates the two in a single step, which is known as the kernel trick. By the way, this also allows solving nonlinear problems using linear methods: the kernel takes the original data onto a higher dimensional space where it can be operated in a linear way with inner products. (A visual example of this can be found here).
Now we introduce the kernels in kerntools
that deal with
real vectors. There are a lot of kernels that can handle this
data, but here we are going to focus only on those implemented in this
package. By real vectors I refer the typical continuous dataset
X with N samples in rows and D
continuous variables (or features) in columns. Thus, for each sample we
have \(\mathbf{x} \in {\rm
I\!R}^D\).
The Linear kernel is the dot product of two real vectors:
\[Linear(\mathbf{x_i},\mathbf{x_j}) = \mathbf{x_i}^T\mathbf{x_j}\]
It can take positive values, negative values, and even be zero. Obviously, it has a linear nature. Thus, if we use it in a “kernelized” machine learning method, the output will be the same that when using the non-kernelized method (for instance: linear kPCA is equivalent to standard PCA).
The linear kernel is related to the cosine of the angle between two vectors, which is maximum when they share the same direction and 0 if they are perpendicular. Sometimes, is more easy to see the linear kernel as a “similarity score” if the cosine normalization is applied:
\[Linear_{cosine}(\mathbf{x_i},\mathbf{x_j}) = \frac{\mathbf{x_i}^T\mathbf{x_j}}{\sqrt{\mathbf{x_i}^T\mathbf{x_i}}\sqrt{\mathbf{x_j}^T\mathbf{x_j}}}\] This is also known as the “cosine similarity”. A score of 1 implies that \(\mathbf{x_i},\mathbf{x_j}\) are equal, while the lower the score, the lower the similarity.
This is the only kernel that fulfills that feature space = input space.
This kernel is suitable for typical numeric, continuous datasets (especially if you know in advance that your problem has a linear solution). For instance, if we work with the famous iris dataset:
In general, however, it is recommended to standardize a dataset before presenting it to the linear kernel. Imagine that your features are very different in magnitude: for instance, if the range of feature 1 is \([0.01,0.1]\) and the range of feature 2 is \([10^2,10^3]\), feature 2 will absolutely dominate the output of the kernel, even if this feature is almost irrelevant to the problem you want to solve. The reverse case arises when you want to manually prioritize some features over others (for example, because you have previous knowledge about your problem). In that case, you can weight them in advance, specifying your desired coefficients. For example:
Cosine-normalization of the linear kernel can be done with:
The Gaussian Radial Basis Function (RBF) kernel is regarded as the gold standard among kernels. Typically is written as:
\[RBF(\mathbf{x_i},\mathbf{x_j}) = e^{-\gamma {||\mathbf{x_i}-\mathbf{x_j}||}^2}\] It is immediately obvious that it is nonlinear. The output ranges between 0 and 1; again, the higher the RBF kernel value, the higher the similarity between \(\mathbf{x_i},\mathbf{x_j}\). The \({||\mathbf{x_i}-\mathbf{x_j}||}\) term correspond to the Euclidean distance between two real vectors, while \(\gamma>0\) is a hyperparameter. There is an alternative definition of the RBF in some books or packages, where it has a “bandwidth” hyperparameter called \(\sigma>0\). Don’t worry. The relationship between the two is: \(\gamma = \frac{1}{2\sigma^2}\).
The RBF kernel is related to the linear kernel. This is because the inner product induces a L2 norm that, in real vectors, is equivalent to the Euclidean distance. But contrarily to the linear kernel, the RBF kernel is an universal approximator: it can be used to model any decision boundary. In this regard, \(\gamma\) governs the “smoothness” of this kernel when adapting to data. The smaller the value, the more complex the resulting model, and the risk of overfitting is greater. Instead, if the value is too small, the decision boundary become very smooth and “linearized”, to the point that (with very large values) the RBF kernel starts behaving as the linear kernel.
With this kernel, data is mapped onto an infinite-dimensional feature space. Several techniques to approximate it exist. For instance, \(φ(\mathbf{x})\) can be computed via Taylor series expansion (more info in Explicit Approximations of the Gaussian Kernel).
This kernel is perfectly fine for typical numeric, continuous
datasets, especially if the solution to your problem is nonlinear. Here,
you can see how to compute it with kerntools
using the iris
example dataset:
(Centering or not the dataset is irrelevant, but it is advisable to scale the variables by the their standard deviation before presenting the dataset to the RBF kernel. See Linear kernel for more info.)
Here, RBF(...,g)
is the value of the hyperparameter
gamma, which should be chosen by the user. There are some heuristics
that may be useful for choosing a good value. In that respect,
kerntools
provides this handy function:
(Remember that, in kerntools
, the hyperparameter of the
RBF kernel is \(\gamma\), not \(\sigma\).)
Finally, kerntools
’s implementation currently does not
compute the feature space of the RBF kernel.
The Laplacian kernel is similar to the RBF kernel. However, instead of computing the L2 norm (Euclidean distance) between two real vectors, it uses the L1 norm:
\[Laplacian(\mathbf{x_i},\mathbf{x_j}) = e^{-\gamma |\mathbf{x_i}-\mathbf{x_j}|}\]
It ranges between 0 and 1. The higher the laplacian kernel value, the higher the similarity between \(\mathbf{x_i},\mathbf{x_j}\). The hyperparameter \(\gamma > 0\) has a similar role that in the RBF kernel. However, now its relationship to \(\sigma\) is: \(\gamma = \frac{1}{\sigma}\). The hyperparameter governs how the laplacian kernel adapts to data, though using the L1 norm makes this kernel more “sharp” than the RBF kernel.
You can use this kernel with a typical numeric, continuous dataset, especially if the solution to your problem is nonlinear. An example of use with the famous iris dataset:
Here, Laplace(...,g)
is the value of the hyperparameter
\(\gamma\), which should be chosen by
the user. Centering or not the dataset is irrelevant, but it is
advisable to scale the variables by the their standard deviation before
presenting it to the laplacian kernel. See Linear kernel for more info.
Remember that in kerntools
, the hyperparameter of the
laplacian kernel is \(\gamma\), not
\(\sigma\). Moreover,
kerntools
’s implementation right now does not compute the
feature space of the laplace kernel.
The following kernel works with data such as each sample is a real matrix (its entries are real numbers).
The Frobenius kernel of two matrices \(\mathbf{X_i},\mathbf{X_j}\) with dimension NxD is defined as:
\[Frobenius(\mathbf{X_i},\mathbf{X_j}) = tr(\mathbf{X_i} \mathbf{X_j}^T) = \sum_{n=1}^N \sum_{d=1}^D\mathbf{X}_{ind} \circ \mathbf{X}_{jnd} \] \(\circ\) denotes the Hadamard product. It is an inner product and, thus, the Frobenius kernel is a sort of “linear kernel” for matrices. The result can be positive, negative or even zero. As always, the higher the kernel the higher the similarity between the two samples. This is understood more easily when the cosine normalization is applied:
\[Frobenius_{cosine}(\mathbf{X_i},\mathbf{X_j}) = \frac{tr(\mathbf{X_i} \mathbf{X_j}^T)}{\sqrt{tr(\mathbf{X_i} \mathbf{X_i}^T)}\sqrt{tr(\mathbf{X_j} \mathbf{X_j}^T)}}\]
A score of 1 implies that the two samples (matrices) are equal.
The feature space of this kernel is straightforward. It can be obtained by vectorizing (converting into column vectors) each one of the matrices.
This kernel can be used to compare matrix-formatted information, for
instance image data. It can be also used to compare the similarity
between two kernel matrices. In kerntools
, the Frobenius
kernel can be called typing:
### Let's suppose our samples come from three different matrix tables:
setosa <- iris[iris$Species == "setosa", 1:4]
versicolor <- iris[iris$Species == "versicolor", 1:4]
virginica <- iris[iris$Species == "virginica", 1:4]
matrices <- list(setosa,versicolor,virginica)
Frobenius(matrices)
Cosine-normalization of the Frobenius kernel can be done with:
and the feature space (cosine-normalized or not) can be recovered like this:
The following kernels deal with counts and frequency data, be it absolute or relative frequencies. This kind of data lives in datasets of dimension NxD, which (at first glance) may appear very similar to real data. However, some nuances should be taken into account. Count data consists in nonnegative integer vectors of dimension D \(\mathbf{x} \in {\rm I\!R}_{>0}^D\) that quantify the abundance of some trait or analyzed variable. This data has a non-symmetric distribution, a potentially large dynamic range (from zero up to millions) and, when the quantification is done by an instrument, other factors like upper and lower bounds on detection can be at play. Thus, zeroes may be ambiguous: they may signal an actual absence, or a quantity below the detection limit.
Sometimes, counts or absolute frequencies are converted to relative frequencies. This data is usually called compositional and constitutes a specific category within abundance data. In that case, all samples (rows) sum to an arbitrary number (for instance 1, or 100). This data does not live in the positive real space \(\mathbf{x} \in {\rm I\!R}_{>0}^D\) but in the simplex, which is a D - 1 subspace of it. This is important because the variables are not independent, but parts of a whole. Thus, increasing the abundance of a feature within a sample decreases the abundance of the rest.
The kernels that follow are well known in some fields, like ecology, because of their relation to widespread diversity metrics: the Bray-Curtis dissimilarity and the Ruzicka distance.
The Bray-Curtis kernel is defined as:
\[ BCK(x_{i},x_{j}) = 2\frac{\sum_d min(x_{id},x_{jd})}{\sum_d(x_{id}+x_{jd})}\] where \(x_{id}, x_{jd}\) are two samples (that is: two rows of the dataset). It ranges between 0 and 1.
This kernel is \(BCK = 1 - BCD\), where BCD is the Bray-Curtis dissimilarity (it is called “dissimilarity” because it is semimetric and, therefore, not a distance in the strict sense). It can be applied over counts and frequencies; however, the “correctness” of using directly this kernel over compositional data is disputed.
In kerntools
, it can be computed typing:
The Ruzicka kernel (also known as: quantitative Jaccard, or min-max kernel) is defined as:
\[ Ruzicka(x_{i},x_{j}) = \frac{\sum_d min(x_{id},x_{jd})}{\sum_dmax(x_{id}+x_{jd})}\] where \(x_{id}, x_{jd}\) are two samples (that is: two rows of the dataset). It is bounded between 0 and 1.
The Ruzicka kernel is related to the quantitative Jaccard distance, aka Ruzicka distance, in the same manner as the Bray-Curtis kernel is related to the Bray-Curtis dissimilarity. The Ruzicka distance is rank-order similar to BCD, but it fulfills the conditions to be a metric. It is also related to the Jaccard kernel (which is introduced in the Jaccard kernel subsection) for sets or presence/absence data.
It can be applied over counts and frequencies; however, the “correctness” of using directly this kernel over compositional data is disputed.
In kerntools
, it can be computed typing:
Categorical data consist on qualitative variables that can take a finite number of values (called classes, modalities, categories, or levels). Within this package, “categorical data” refers to nominal data whose categories don’t have a meaningful order, in contrast with ordinal data. The typical categorical dataset is a matrix X with N samples and D categorical variables or features. (In R, “data.frames” having columns of class “factor” can be used for this kind of data).
First we define the Overlap kernel over a single categorical variable as:
\[x_{id}=x_{jd} \Leftrightarrow Overlap(x_{id},x_{jd}) = 1\] \[x_{id} \neq x_{jd} \Leftrightarrow Overlap(x_{id},x_{jd}) = 0\]
where \(x_{id},x_{jd}\) are two samples and D = 1. This is the most basic categorical kernel, akin in a way to the linear kernel for continuous data. From there, we can define the Dirac kernel over multivariate categorical data as the linear combination of the Overlap kernel’s evaluations:
\[ Dirac(x_{i},x_{j}) = \sum_d c_d Overlap(x_{id},x_{jd})\] The coefficients \(c_d>0\) can be all 1 (and then, the Dirac kernel is simply the sum of the Overlap kernel evaluations), 1/D (so the Dirac kernel is the mean), or may be used to weight differently the variables (for example, because you have previous knowledge about your problem). When the coefficients are all set to 1/D, the values for the Dirac kernel range between 0 and 1, the 1 meaning that the two samples are equal, and 0 that they are completely different. This is equivalent to computing the “sum” Dirac kernel and then cosine-normalizing it. When custom coefficients are used but \(\sum_d c_d=1\), this output of this kernel is a weighted average bounded between 0 and 1 (thus, trying to cosine-normalize it is useless).
We can map our data onto the feature space of the Overlap kernel via one-hot-encoding. That is: the map \(φ(x)\) recodes a categorical variable (using the R nomenclature, “factor”) with L levels into L dummy variables (each one representing a given level) that can be 0 or 1. For instance, if we have a variable that can take three levels (“A”, “B” and “C”):
cat_feat <- data.frame(var=factor(sample(LETTERS[1:3],10,replace = TRUE)))
rownames(cat_feat) <- 1:10
cat_feat
#> var
#> 1 B
#> 2 C
#> 3 A
#> 4 A
#> 5 C
#> 6 B
#> 7 B
#> 8 B
#> 9 A
#> 10 B
the feature space for the Overlap kernel is:
dummy_data(cat_feat)
#> var_A var_B var_C
#> 1 0 1 0
#> 2 0 0 1
#> 3 1 0 0
#> 4 1 0 0
#> 5 0 0 1
#> 6 0 1 0
#> 7 0 1 0
#> 8 0 1 0
#> 9 1 0 0
#> 10 0 1 0
The feature space for the Dirac kernel results of “concatenating” the feature spaces for all D categorical variables. If the weights \(c_d \neq 1\), then data in feature space should be scaled accordingly. For instance, if \(c_d = 1/D\) for all weights, which (as stated before) is the same than cosine-normalize the Dirac kernel, then the feature space is normalized to unit length.
In kerntools
, we need a NxD dataset of class
“matrix” (2-d dimensional array) that contains only characters, or
instead a dataset of class “data.frame” (with all entries = “character”,
or columns = “factor”). showdata
is an example dataset that
meets this requirements. The Dirac kernel can be called doing:
Moreover, if we want to compute the feature space too, we can type:
By default Dirac(...,comp = "sum")
; that is, all
coefficients \(c_d\) are set to 1. If
we want the Dirac kernel computed as the mean of the Overlap kernel
evaluations, we can do Dirac(...,comp = "mean")
. Instead,
to have the average weighting according to some coefficients, we can
do:
The feature space returned by this function will change accordingly
to the coefficients chosen. Of course,
Dirac(...,comp = "mean")
is the same than doing:
A set S is a collection of different objects (characters, numbers, strings, … even other sets). We call universe to the set of all objects we want to consider in a given problem. The objects of S are called their elements, and we denote as \(|S|\) the size of the set: that is, the number of elements it contains.
Imagine that we have a dataset (size N) such as each sample is a set: \(S_1,...,S_N\). The most simple kernel for this data is the Intersect kernel:
\[Intersect(S_i,S_j)= |S_i \cap S_j| \] that is, the size of the two sets’ shared elements. This is akin to the linear kernel for continuous data (see the “Feature space” subsection below). This kernel ranges from 0 (the sets are completely different) to any positive number. In general higher values indicate higher similarity; however, some sets may be bigger (that is: they contain more elements) than others. You can use the cosine-normalization to remove the size effect and have a result that is bounded between 0 and 1:
\[Intersect_{cosine}(S_i,S_j)= \frac{|S_i \cap S_j|}{\sqrt{|S_i|·|S_j|}}\]
From there, we can consider a multivariate case: each sample has several (D) variables or features that happen to be sets. As we have done with the Dirac kernel, we can do a linear combination of the Intersect kernel’s evaluations to produce a “multivariate” kernel:
\[ Intersect_{multi}(S_{i},S_{j}) = \sum_{d=1}^{D} c_d Intersect(S_{id},S_{jd})\]
The coefficients \(c_d>0\) can be all 1 (and then, this multivariate kernel is simply the sum of the Intersect kernel evaluations), 1/D (so the Intersect-multi kernel is the mean of evaluations), or may be used to weight differently the variables (for example, because you have previous knowledge about your problem). In this last case, the Intersect-multi kernel is a lineal combination of the Intersect kernel’s evaluations. When the coefficients are all set to 1/D and we also perform a cosine normalization over the univariate evaluations, the values for the Intersect-multi kernel range between 0 and 1, the 1 meaning that the two samples are equal, and 0 that they are completely different.
We can map our data onto the feature space of the Intersect kernel via a sort of one-hot-encoding. In this case, the map \(φ(S)\) recodes a set coming from a universe with L elements into L dummy variables (each one representing a given level) that can be 1, if a given element l is present in the set S, or 0 if it is absent. For example:
Favorite colors of 3 people:
universe <- c("blue","green","lightblue","orange","purple","red","white","yellow")
person1 <- c(2, 6, 8)
person2 <- c(1, 8)
person3 <- c(2, 3, 6)
list(person1=universe[person1],person2=universe[person2],person3=universe[person3])
#> $person1
#> [1] "green" "red" "yellow"
#>
#> $person2
#> [1] "blue" "yellow"
#>
#> $person3
#> [1] "green" "lightblue" "red"
The data mapped onto the feature space for this kernel is:
colors <- matrix(0,nrow=3,ncol=length(universe))
rownames(colors) <- c("person1","person2","person3")
colnames(colors) <- universe
colors[1,person1] <- 1
colors[2,person2] <- 1
colors[3,person3] <- 1
colors
#> blue green lightblue orange purple red white yellow
#> person1 0 1 0 0 0 1 0 1
#> person2 1 0 0 0 0 0 0 1
#> person3 0 1 1 0 0 1 0 0
and now the Intersect kernel can be computed directly as the dot product between samples.
If we use the Intersect-cosine, then the feature space is normalized accordingly to the unit vector:
cosnormX(colors)
#> blue green lightblue orange purple red white yellow
#> person1 0.0000000 0.5773503 0.0000000 0 0 0.5773503 0 0.5773503
#> person2 0.7071068 0.0000000 0.0000000 0 0 0.0000000 0 0.7071068
#> person3 0.0000000 0.5773503 0.5773503 0 0 0.5773503 0 0.0000000
Moreover, the Intersect-multi kernel’s feature space results of “concatenating” the feature spaces of all sets associated to a sample, each one scaled accordingly by the weights \(c_d\).
kerntools
implementation requires that the set members
(elements) are character symbols (length=1) combined into a string.
Compare this with the last example:
universe <- c("b","g","l","o","p","r","w","y")
colors <- matrix("",ncol=1,nrow=3)
rownames(colors) <- c("person1","person2","person3")
colors[1,] <-paste(universe[person1],collapse="")
colors[2,] <-paste(universe[person2],collapse="")
colors[3,] <-paste(universe[person3],collapse="")
colors
#> [,1]
#> person1 "gry"
#> person2 "by"
#> person3 "glr"
That is a NxD “matrix” or “data.frame” containing characters. Each sample is in a different row. Now, you can do:
This function requires that we state the universe’s elements explicitly. The feature space can be returned too:
If data is multivariate (\(D>1\) columns, and each one contains a set feature), the universe should span all elements found in all the D sets. For instance, consider this second feature:
Best friends of the same 3 people:
(Anna = A; Bernard = B; Cesc = C; Diana = D ; Elsa = E ; Fran = F; Gisele = G; Horaci = H; Iria = I; Jaume = J)
universe2 <- LETTERS[1:10]
person1 <- c(1,2,9,10)
person2 <- c(1,4,6,7)
person3 <- c(2,3,5,9,10)
person1=universe2[person1]
person2=universe2[person2]
person3=universe2[person3]
list(person1=person3,person2=person3,person3=person3)
#> $person1
#> [1] "B" "C" "E" "I" "J"
#>
#> $person2
#> [1] "B" "C" "E" "I" "J"
#>
#> $person3
#> [1] "B" "C" "E" "I" "J"
friends <- colors
friends[1,] <- paste(person1,collapse="")
friends[2,] <- paste(person2,collapse="")
friends[3,] <- paste(person3,collapse="")
friends
#> [,1]
#> person1 "ABIJ"
#> person2 "ADFG"
#> person3 "BCEIJ"
So the final dataset is:
set_data <- data.frame(colors=colors,friends=friends)
set_data
#> colors friends
#> person1 gry ABIJ
#> person2 by ADFG
#> person3 glr BCEIJ
And now we can compute the kernel:
(We should indicate the elements of the two universes: colors and friends).
When Intersect(...,comp = "sum")
, all \(c_d = 1\) and, in principle, all “features”
(following with our last example: “friends” and “colors”) have the same
weight. In practice you should be careful, because
kerntools
doesn’t cosine-normalizes each univariate
evaluation of the kernel. Thus, you may inadvertently introduce some
bias in favor of “bigger” sets (“friends” will have, implicitly, more
weight than “colors”). Instead, when comp = "mean"
or
comp = "weighted"
is chosen, kerntools
cosine-normalizes all univariate evaluations, so the result ranges
between 0 and 1, and there are not implicit biases.
Intersect(...,comp = "mean")
sets \(c_d = 1/D\) and therefore gives the same
importance to all features. However, if you want to give more importance
to some features over others, you can choose manually the weights that
you want:
### We will make that "friends" has 3 times more importance than "colors":
coeffs <- c(1/4,3/4)
Intersect(set_data, elements = c(universe,universe2),comp = "weighted", coeff=coeffs)
Here, the result of the kernel (between 0 and 1) is the weighted
average of the features. If you set
Intersect(..., feat_space = TRUE)
, the feature space
returned by this function will be scaled accordingly to the \(c_d\) coefficients chosen. The same will
occur when doing Intersect(...,comp = "mean")
, but in this
case the scaling consist in simply dividing by D (the set
features).
The Jaccard index (or Jaccard kernel) of two sets is defined as:
\[Jaccard(S_i,S_j)= \frac{|S_i \cap S_j|}{|S_i \cup S_j|} \] It naturally ranges between 0 (the sets don’t share any element) and 1 (the sets are identical). Thus, the cosine normalization is redundant for this kernel.
Like the Dirac and the Intersect kernels, we can consider a multivariate case for the Jaccard kernel. If each sample has several (D) variables that happen to be sets, we can do a linear combination of the kernel’s evaluations to produce a “multivariate” kernel:
\[ Jaccard_{multi}(S_{i},S_{j}) = \sum_{d=1}^{D} c_d Jaccard(S_{id},S_{jd})\]
The coefficients \(c_d>0\) can all be 1 (and then, this multivariate kernel is simply the sum of the Jaccard kernel evaluations), 1/D (so the Jaccard-multi kernel is the mean of evaluations), or may be used to weight variables (sets) differently (for example, because you have previous knowledge about your problem). In this last case, the Jaccard-multi kernel is a weighted average of the Jaccard kernel’s evaluations. When the coefficients are all set to \(\sum c_d=1\), the values given by the Jaccard-multi kernel range between 0 and 1.
The usage is pretty similar to the Intersect kernel:
The only nuance here is that the univariate Jaccard kernel always is
bounded between 0 and 1. There are not underlying biases and all
features have the same weight (unless you explicitly do
Intersect(...,comp = "weighted")
and enter some
coefficients). Thus, this:
or this:
Is the same than this:
Also, take into account that the Jaccard()
function does
not return the data mapped onto the feature space of the kernel.
Ordinal data consist on qualitative variables that can take a finite number of values that follow an intrinsic and meaningful order. In that regard, ordinal data is in-between continuous and categorical (nominal) data. For instance, in a survey, the typical agree/disagree questions that have answers like “strongly agree”/“agree”/“neither agree or disagree”/“disagree”/etc. are ordinal variables. This is related to the concept of “ranking”, which consists in assigning the ordering labels “1st”, “2nd”, “3rd”, … to these values.
Kendall’s \(\tau\) coefficient measures the similarity or rank correlation between to two ordinal variables. It can be applied to different ordinal variables or to different rankings of the same variable. It is also a kernel function, defined as:
\[Kendall_\tau (x,y)= \frac{nc-nd}{np}\] Let \((x_1,y_1),...,(x_L,y_L)\) be two ordinal variables, or two samples of the same variable. For simplicity, we shall consider that all values taken by \(x_i\) and \(y_i\) are unique. Observations \((x_i, y_i)\) and (\(x_j , y_j)\) with \(i < j\) are “concordant” if we either observe \(x_i < x_j\) and \(y_i < y_j\), or \(x_i > x_j\) and \(y_i > y_j\). Otherwise, they are “discordant”. (Please note that, as \(x_i\) and \(y_i\) are unique, there are no ties: it’s impossible that \(x_i = x_j\) or \(y_i = y_j\)). All possible pairwise evaluations are taken into account: \(np\) is the total number of pairs, given by \(\binom{L}{2}\), \(nc\) is the number of total concordant pairs, and \(nd\) is the number of total discordant pairs.
The result of the Kendall’s \(\tau\) kernel can range between -1 and 1. 1 means that the two rankings are identical: the two ordinal samples/variables are equal, -1 that one ranking is the reverse of the other, and 0 that the two rankings are completely independent.
If we call \(r_i\) and \(s_i\) the ranks of the i-th member in the two ordinal samples, we can then define two matrices A and B that:
\[\mathbf{A}_{ij} = sign(r_j - r_i)\] \[\mathbf{B}_{ij} = sign(s_j - s_i)\] These matrices fulfill that \(\mathbf{A}^T = -\mathbf{A}\) and that \(\mathbf{B}^T = -\mathbf{B}\). Now, we can observe that:
\[Kendall_\tau (x,y) = Frobenius_{cosine}(\mathbf{A},\mathbf{B})\] From there, the data in feature space can be found the same way than in the Frobenius kernel. It can be deduced that, for the Kendall’s \(\tau\) kernel, feature space has dimension \(\binom{L}{2}\), and each feature corresponds to a “relative order” (\(+ = x_i < x_j\); \(- = x_i > x_j\)) between all pair of levels within a ranking.
A NxL or LxN dataset is required. N is the sample size and L the number of values (levels) of the ordinal variable. This is an example that such a dataset:
Three people are given a list of 10 colors. They rank them from most (1) to least (10) favorite.
color_list <- c("black","blue","green","grey","lightblue","orange","purple",
"red","white","yellow")
survey1 <- 1:10
survey2 <- 10:1
survey3 <- sample(10)
color <- cbind(survey1,survey2,survey3) # Samples in columns
rownames(color) <- color_list
color
#> survey1 survey2 survey3
#> black 1 10 10
#> blue 2 9 7
#> green 3 8 5
#> grey 4 7 4
#> lightblue 5 6 1
#> orange 6 5 3
#> purple 7 4 2
#> red 8 3 9
#> white 9 2 8
#> yellow 10 1 6
In this case, the ranking is already done. The Kendall’s \(\tau\) kernel is computed like:
(If we prefer samples in rows, we can do:)
Alternatively, the ranking may be not given explicitly by the user. Consider this other dataset:
The same three people are asked the number of times they ate 5 different kinds of food during the last month
food <- matrix(c(10, 1,18, 25,30, 7, 5,20, 5, 12, 7,20, 20, 3,22),ncol=3,nrow=5,byrow = TRUE)
colnames(food) <- colnames(color)
rownames(food) <- c("spinach", "chicken", "beef" , "salad","lentils")
food
#> survey1 survey2 survey3
#> spinach 10 1 18
#> chicken 25 30 7
#> beef 5 20 5
#> salad 12 7 20
#> lentils 20 3 22
In that case, the ranking is done internally before computing the Kendall’s \(\tau\).
Furthermore, we can combine these two datasets, because the participants are the same:
By default, the Kendall’s \(\tau\)
kernel values of the two datasets are averaged. The alternative is to
sum them, doing: Kendall(...,comp="sum")
.
Right now, Kendall()
does not return the data mapped
onto the feature space.
Strings or sequences are vectors of objects (usually represented as characters) in which repetitions are allowed and order matters. These characters are drawn from a given alphabet \(A\). An example of an alphabet is:
LETTERS
#> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#> [20] "T" "U" "V" "W" "X" "Y" "Z"
We denote the size of an alphabet as \(|A|\). In the example above, this is:
Let \(x\) be a string or sequence, and \(s\) a substring of length L that consist of characters drawn from an alphabet \(A\), so we can count the times (\(t_s\)) that \(s\) occurs in \(x\). Then, we do the same for all possible substrings of length L generated from \(A\). Each \(t_s\) is a feature in feature space, which has dimension \(|A|^L\).
The Spectrum kernel is the dot product of two strings mapped onto the feature space we have just defined. Alternatively, it can also be written as:
\[Spectrum(x_i,x_j) = \sum_{s \in A^L} |s \in x_i|·|s \in x_j| \] This kernel ranges from 0 (the strings are completely different: they don’t share any substring) to any positive number. In general, higher values indicate higher similarity; however, it’s important to remember that some strings may be longer than others. This is fine if the strings’ length is relevant to the problem you are studying; but if it is not, you can cosine-normalize the Spectrum kernel so the result is bounded between 0 and 1:
\[Spectrum_{cosine}(x_i,x_j) = \frac{Spectrum(x_i,x_j)}{\sqrt{Spectrum(x_i,x_i)Spectrum(x_j,x_j)}}\] A result of 0 means complete dissimilarity, 1 means complete similarity, and the strings’ length is not taken into account.
You need a dataset containing N strings or text data. For instance:
alphabet <- c(letters,"_")
strings <- c("hello_world","hello_word","hola_mon","kaixo_mundua",
"saluton_mondo","ola_mundo", "bonjour_le_monde")
names(strings) <- c("english1","english_typo","catalan","basque",
"esperanto","galician","french")
strings
#> english1 english_typo catalan basque
#> "hello_world" "hello_word" "hola_mon" "kaixo_mundua"
#> esperanto galician french
#> "saluton_mondo" "ola_mundo" "bonjour_le_monde"
alphabet
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z" "_"
Then you can compute the Spectrum kernel. For instance, we could consider substrings of length L = 2:
The feature space can be recovered doing:
The cosine-normalized version can be computed typing:
(if you set feat_space=TRUE
, the feature space returned
by Spectrum()
will be scaled accordingly).
Right now, the Spectrum kernel implemented by kerntools
is very basic and, in large datasets and/or when L is high, the
computation may be excessively slow. In that case, there are other R
packages that you may use. For instance, you may refer to
kernlab::stringdot()
, or check the kebabs
package, which covers complex kernels for strings like the motif kernel,
the mismatch kernel, the gappy pair kernel, etc.