A Gaussian process is a stochastic process that assumes that the outputs for any set of input points follows a multivariate normal distribution. To determine the normal distribution, we must select a mean function that gives a mean for each point and a covariance function that gives the covariance between any set of points.

Thus if we have mean function \(\mu\), covariance function \(\Sigma\), and the \(n \times d\) matrix X with the input vectors \(\mathbf{x_1}, ..., \mathbf{x_n}\) in its rows, then distribution of the output at these points, \(\mathbf{y} = [y_1, \ldots, y_n]^T\) is given by:

\[ \mathbf{y} \sim N(\mu(X),~ \Sigma(X)) \]

Or in full element notation:

\[ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \sim N(\begin{bmatrix} \mu(\mathbf{x_1}) \\ \vdots \\ \mu(\mathbf{x_n}) \end{bmatrix}, \begin{bmatrix} \Sigma(\mathbf{x_1},\mathbf{x_1}) &\cdots &\Sigma(\mathbf{x_1},\mathbf{x_n}) \\ \vdots &\ddots &\vdots \\ \Sigma(\mathbf{x_n},\mathbf{x_1}) &\cdots &\Sigma(\mathbf{x_n},\mathbf{x_n}) \end{bmatrix})\]

The mean function can be any function mapping the input space to the real numbers. The most commonly used mean function is a constant, so \(\mu(\mathbf{x}) = \mu\). This means that over the entire space the predicted mean given no other information a constant. When fitting a GP model to data, \(\mu\) is usually estimated using the data. Another commonly used mean function is zero. This works surprisingly well since the GP will interpolated between your data, meaning that the mean function work have much of an effect when there is enough data. A note on notation: for a vector \(u\), \(\mu(u)\) is sometimes written as \(\mu_u\) for simplicity. For a matrix \(X\), \(\mu(X)=\mu_X\) is the vector obtained from applying \(\mu\) to each row of \(X\).

A more advanced choice of mean function is to use a linear model, so \[\mu(\mathbf{x}) = \beta_0 + \sum_{i=1}^d \beta_i x_i\]. Again these parameters must be estimated. This can be generalized to a linear combination of functions of the input data, \(f_1, \ldots, f_m\) so the mean function is \[\mu(\mathbf{x}) = \beta_0 + \sum_{i=1}^m \beta_i f_i(\mathbf{x})\].

It is generally recommended to just use a constant mean since the data itself should provide enough information to fit the true function. Some researchers say that using a linear model can have negative effects on fitting a good model.

The covariance function determines how strong the correlation is between points. A note on notation: the covariance/correlation functions are heavily overloaded, meaning that their meaning depends on the context. For vectors \(u\) and \(v\), \(\Sigma(u,v)\) is the covariance between the points, which is also sometimes written as \(\Sigma_{u,v}\) or \(\Sigma_{uv}\). \(\Sigma(u)\) or \(\Sigma_u\) is the same thing as the covariance of \(u\) with itself, or \(\Sigma(u,u)\). For a matrix \(X\), \(\Sigma(X,u)\) or \(\Sigma_{Xu}\) is a column vector whose elements are the covariance of the rows of \(X\) with \(u\). For another matrix \(W\), \(\Sigma(X, W)\) is a matrix whose \((i,j)\) element is the covariance of the \(i\) row of \(X\) and the \(j\) row of \(W\). \(\Sigma(X)\) or \(\Sigma_X\) means the same thing as \(\Sigma(X,X)\).

Often a correlation function, \(R\), is used instead of a covariance function. The correlation function should map any pair of points to \([0,1]\). The correlation for any point with itself should be 1, i.e. \(R(\mathbf{x}, \mathbf{x}) = 1\). When a correlation function is used, a variance parameter \(\sigma^2\) must be estimated to scale the correlation matrix into a covariance matrix. Thus the covariance matrix is \(C(X) = \hat{\sigma}^2 R(X)\). \(R\) is overloaded similarly to \(Sigma\).

The most commonly used correlation function is the Gaussian.

\[ R(\mathbf{u}, \mathbf{v}) = \exp \left( -\sum_{i=1}^d \theta_i (u_i - v_i)^2 \right)\]

The parameters \(\mathbf{\theta} = (\theta_1, \ldots, \theta_d)\) are the correlation parameters for each dimensions. Generally they must be estimated from the data when fitting a Gaussian process model to data.

The parameters are often estimated by finding the parameters that maximize the likelihood given a data set.

The likelihood function is the usual multivariate normal pdf shown below, where \(\mathbf{\mu} = \mu(X)\), \(\Sigma = \Sigma(X)\)

\[ f(\mathbf{\theta};~X,~\mathbf{y}) = f(X,~\mathbf{y};~\mathbf{\theta}) = \frac{1}{(2 \pi)^{n/2} |\Sigma|^{1/2} } \exp{(-\frac{1}{2} (\mathbf{y} - \mathbf{\mu})^T \Sigma^{-1} (\mathbf{y} - \mathbf{\mu}))}\]

As usual, we use negative two times the log-likelihood for simplicity, ignoring the constant terms.

\[ \ell(\theta) = \ln |\Sigma| + (\mathbf{y} - \mathbf{\mu})^T \Sigma^{-1} (\mathbf{y} - \mathbf{\mu})\]

This equation is minimized as a function of the correlation parameters to find the parameters that give the greatest likelihood. Since there is a determinant and matrix solve, this is an expensive function to optimize, with each evaluation being \(O(n^3)\)

If the mean is set to be constant, \(\mu(X) = \mu \mathbf{1_n}\), then there is a single parameter \(\mu\) to estimate. Differentiating \(\ell\) with respect to \(\mu\) will then give \[ \frac{d \ell}{d \mu} = \mathbf{1_n}^T \Sigma^{-1}(\mathbf{y} - \mu \mathbf{1_n})\] Setting this equal to zero and solving for \(\mu\) gives the maximum likelihood estimate \(\hat{\mu}\) \[ \hat{\mu} = \frac{\mathbf{1_n}^T \Sigma^{-1}\mathbf{y}}{\mathbf{1_n}^T \Sigma^{-1}\mathbf{1_n}}\]

When using a correlation matrix so that \(\Sigma = \sigma^2 R\), \(\sigma\) must be estimated using maximum likelihood.

\[ \frac{d \ell}{d \sigma^2} = \frac{n}{\sigma^2} - \frac{1}{\sigma^4}(\mathbf{y} - \mathbf{\mu})^T R^{-1} (\mathbf{y} - \mathbf{\mu})\] Setting equal to zero and solving for \(\sigma^2\) gives \[ \hat{\sigma}^2 = \frac{1}{n} (\mathbf{y} - \mathbf{\mu})^T R^{-1} (\mathbf{y} - \mathbf{\mu})\] When estimating \(\mu\) and \(\sigma^2\) simultaneously, these estimates are valid and the estimate can simply be plugged into the equation.

Suppose there are vectors \(\mathbf{y_1}\) and \(\mathbf{y_2}\) that are jointly multivariate normal. The joint distribution is \[ \begin{bmatrix} \mathbf{y_1} \\ \mathbf{y_2} \end{bmatrix} \sim N( \begin{bmatrix} \mathbf{\mu_1} \\ \mathbf{\mu_2} \end{bmatrix}, ~\begin{bmatrix} \Sigma_{11} \Sigma_{12} \\ \Sigma_{21} \Sigma_{22} \end{bmatrix} ) \]

The conditional distribution of \(\mathbf{y_1}\) given \(\mathbf{y_2}\) is

\[\mathbf{y_1} ~|~\mathbf{y_2} \sim N(\mathbf{\mu_1} + \Sigma_{12} \Sigma_{22}^{-1}( \mathbf{y_2} - \mathbf{\mu_2})), ~\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) \]

Suppose there are two input matrices, \(X_1\) and \(X_2\), whose rows are the input points, with corresponding output vectors \(\mathbf{y_1}\) and \(\mathbf{y_2}\). Suppose we have the actual values for \(\mathbf{y_2}\), and want to estimate, or predict, \(\mathbf{y_1}\). We can use the conditional distribution above to get a posterior distribution for \(\mathbf{y_1}\).

If we only want to predict for a single point, i.e. we want to predict the output \(y\) at \(\mathbf{x}\), then this equation gives

\[y ~|~\mathbf{y_2} \sim N(\hat{y}, ~\hat{\sigma}^2(y)) \] where \[ \hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \mu\mathbf{1_n})) \] and \[ \hat{\sigma}^2(y) = R(\mathbf{x}) - R(\mathbf{x},~X_2) R(X_2)^{-1} R(X_2,~\mathbf{x}) \]

Notice we get an estimate not only for the value of \(y\), but also the standard error. This can be useful when we need a way to judge the prediction accuracy of the model.