February 16th, 2022

Gaussian Processes (GPs) are a powerful Bayesian machine learning method. A problem, however, is that when inferring the value of a new data point (i.e. making a prediction with the GP), the kernel matrix \(K \in \mathbb{R}^{n \times n}\), where \(n\) is the number of training points, needs to be inverted. The inversion of a \(n \times n\) matrix is a \(\mathcal{O}(n^3)\) operation and therefore infeasible for large data sets.

This inversion can be sped up by the matrix inversion lemma to a \(\mathcal{O}(q^3)\) operation, where \(q\) is the rank of the matrix \(K\) (i.e. \(q \leq n\)). We can use this lemma (besides the straighforward application if \(rank(K) \lt n\)) to speed up the GP inference by approximating \(K\) with a matrix \(\tilde{K}\) that is as similar to \(K\) as possible but with (much) smaller rank (similarity is measured with some matrix norm).

To construct a good approximation (optimal w.r.t. the Frobenius norm) with rank \(q\), we need the \(q\) leading eigenvalues and eigenvectors of \(K\). But computing the eigendecomposition is itself a \(\mathcal{O}(n^3)\) operation. We therefore resort to approximating the eigenvalues and -vectors using the Nyström method. As this can be done in \(\mathcal{O}(q^2 n)\), we can significantly speed up GP inference for a small value of \(q\). (Of course, the smaller we choose \(q\), the worse our approximation will be).

Now, what do we actually need to do to compute the approximate kernel matrix? First, we select a sample of \(q \lt n\) elements from the training set. This is typically done by uniform sampling but more sophisticated methods exist. We then reorder the rows and columns of the the kernel \[ K = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix} \] so that \(K_{11} \in \mathbb{R}^{q \times q}\) is the kernel for the \(q\) selected elements and \(K_{21} \in \mathbb{R}^{(n-q) \times q}\). The approximate kernel is then computed as \[ \tilde{K} = \begin{bmatrix} K_{11} \\ K_{21} \end{bmatrix} K_{11}^{-1} \begin{bmatrix} K_{11} & K_{21}^T \end{bmatrix} \] and can as such be used in the GP predictive equations by replacing the kernel matrix \(K = K(X, X)\) for the training points \(X\) (The kernels involving the test-points \(X_{\star}\) are left untouched).

This post is meant to give you an introduction to the Nyström approximation to help you understand the related literature. For further information or details on the implementation, please refer to the Wikipedia article or to the Gaussian Processes for Machine Learning book. There, you can also find the concrete equations to approximate the eigenvalues and -vectors and how to use these to construct the approximate kernel.