Gaussian Processes for Regression

In this tutorial we're going to discuss the humble Gaussian Process, a popular method for performing regression.

It's worth noting that, for the sake of illustration, there is a lot of plotting code in this notebook. I have commented where the plotting code is; feel free to skip over it, since it isn't important to the actual subject.

Broadly speaking, most tasks in data science are either classification or regression. In this tutorial I'm only going to discuss regression (GPs are capable of classification too, but they are most naturally suited to regression).

The Gaussian Distribution

The Gaussian distribution (also commonly referred to as the Normal distribution) is one of the most heavily used distributions in data science and machine learning. This is in large part because it appears so often in the world (see: the central limit theorem), but also because it offers a great deal of mathematical convenience.

A univariate Gaussian distribution is specified by a mean ($\mu$) and a variance ($\sigma^2$). This is often shortened to $ \mathcal{N}(\mu, \sigma^2) $.

To evaluate the probability density that the Gaussian assigns to an input location $x$, we use the formula:

$$ p(x|\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp \bigg\{ -\frac{1}{2\sigma^2}(x-\mu)^2\bigg\} $$

It's easiest to picture this.

First I'm going to import some libraries:
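Something along these lines (the exact set of libraries is an assumption, based on what's used in the rest of the notebook):

import numpy as np
import matplotlib.pyplot as plt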

Now we can illustrate:
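As a small sketch of what that illustration might look like, here is the density formula implemented directly and plotted for a few different values of $\sigma$ (the specific values are just my choices for illustration):

def gaussian_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) evaluated at x, straight from the formula above
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# plotting code: draw the density for a few widths
x = np.linspace(-5, 5, 200)
for sigma in (0.5, 1.0, 2.0):
    plt.plot(x, gaussian_pdf(x, 0.0, sigma), label="sigma = %.1f" % sigma)
plt.legend()
plt.show()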

Notice in particular that the standard deviation, $\sigma$ (the square root of the variance), describes the width of the Gaussian. If the width is small, then the peak (around the mean) is higher. Intuitively, a wider Gaussian means there is a wider range of plausible values, so we expect to see each individual value a little less often.

Imagine we are modelling the height (in centimetres) of a population. We could specify that the mean height of the population is 160cm, with a standard deviation of 10cm, i.e. $H_{cm} \sim \mathcal{N}(160, 10^2)$.

We can plot a probability density for this using the Gaussian equation. The density at a given height (in cm) tells us how often we can expect to see people at that height.

Sampling from a Gaussian

One of the benefits of defining our population this way is that we can take samples. We could take 100 samples from our Gaussian (each sample representing the height of one person), and we would expect this list of 100 heights to be roughly representative of our real population.

numpy provides a particularly simple way to sample from Gaussians. For example, to take n samples with mean mu and standard deviation sigma (note that numpy expects the standard deviation here, not the variance):

n_samples = np.random.normal(mu, sigma, n)
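For example, sticking with the height model from above (mean 160cm, standard deviation 10cm), a sketch of drawing 100 samples could look like:

mu, sigma, n = 160, 10, 100
heights = np.random.normal(mu, sigma, n)  # 100 sampled heights, roughly representative of the population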

To picture what this looks like, imagine that each of the crosses on the below plot are individual samples from our model:

Imagine now that we would also like to incorporate weight (in kg) into our model.

We could define a new Gaussian distribution, with a different mean and variance. If we wanted to sample a new person (represented by a height/weight pair $(h,w)$), we could first draw a sample from our model of height, and then another from our model of weight.

Multivariate Gaussians

The problem with this is that we are assuming that height and weight are independent, when in reality they are heavily linked. We could very easily sample a height from above the mean and a weight from below the mean, and we would have nothing to alert us that this was unlikely.

So far, we have been considering univariate Gaussians, and treating each variable as independent. When we have multiple, related variables (e.g. height and weight), a more powerful choice is the multivariate Gaussian.

When we specify a multivariate Gaussian, we swap our scalar mean for a vector $\mathbf{\mu}$, defining one mean per variable.

For example, in our height/weight example:

$$ \mathbf{\mu} = \begin{bmatrix} \mu_{h} \\ \mu_{w} \\ \end{bmatrix} = \begin{bmatrix} 160 \\ 50 \\ \end{bmatrix} $$

To replace the variance ($\sigma^2$), we need to specify a covariance matrix ($\Sigma$). A covariance matrix specifies the variance of each variable, as well as how each pair of variables covaries. A positive covariance says that if one variable goes up in value, so does the other. A negative covariance says that if one goes up, the other goes down. A covariance of zero specifies that there is no (linear) relationship between the two.

The exact definition of the covariance of two variables is given by our choice of covariance function, which I will denote as $K$ from now on. We want a function $K$ that, given two input vectors, returns a scalar (a single value) specifying their relationship.

The most common choice for the covariance function is the Radial Basis Function kernel. Given a pair of vectors $(x_i, x_j)$:

$$ K(x_i, x_j) = \exp \big(-\frac{\|x_i - x_j\|^2}{2 \sigma^2} \big) $$

Where $\sigma$ is a parameter we can use to tune the kernel, often called the lengthscale. For clarity, $\exp(a)$ is a simplified way to write $e^a$ (where $e$ is the mathematical constant). The notation $\| x \|$ means we are taking the magnitude (Euclidean norm) of $x$.

Remember that the output of this is a scalar (just a single number): a value close to $1$ implies the two inputs are strongly related, while a value close to $0$ implies little relationship.

We compile each of these pairwise covariances into a matrix, which is just a convenient way to store them in one place:

$$ \mathbf{\Sigma} = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) \\ K(x_2, x_1) & K(x_2, x_2) \\ \end{bmatrix} $$
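As a minimal sketch (the lengthscale and the two example inputs are arbitrary choices of mine), the covariance function and the resulting matrix could be written as:

def rbf(x_i, x_j, lengthscale=1.0):
    # RBF covariance: exp(-||x_i - x_j||^2 / (2 * lengthscale^2))
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2 * lengthscale ** 2))

points = [np.array([0.3]), np.array([0.6])]
Sigma = np.array([[rbf(a, b) for b in points] for a in points])
# the diagonal entries are exp(0) = 1; the off-diagonals shrink as the inputs move apart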

There are two things about this function to be cognisant of: it is symmetric, i.e. $K(x_i, x_j) = K(x_j, x_i)$, and $K(x_i, x_i) = \exp(0) = 1$, so the diagonal of the covariance matrix is always made up of $1$s.

For the sake of clarity, I'm going to manually write out the covariance matrices in the code so that you can play with them and see how changing their values affects the samples we take.

The easiest way to get some intuition for this is visually:

This fits nicely on a 2D plot, so we can visualise individual samples from our model very easily (the red x):

One of the mathematical niceties we get from modelling everything as a Gaussian is that we can take slices through the 3D density surface above, and each slice is itself a (scaled) Gaussian.

For example, imagine we are given as input a height of 159cm and we would like to output what we think a likely weight might be. We can "condition" our multivariate Gaussian on a height of 159cm by taking a slice through the 3D shape at the 159cm mark.

Unfortunately, this kind of visualisation becomes impossible once we have more than two variables to consider (the density surface already takes up three plot dimensions).

A more scalable way to picture a sample from the distribution is to give each of the variables its own location on the x-axis, and to plot the sampled value of each variable on a shared y-axis:

To see how these samples correspond to samples on the plot above, let's do away with our height/weight example for a moment and imagine that we have two variables ($x_1, x_2$) each in the range 0-1.

To model them, we construct a multivariate Gaussian:

$$ \begin{bmatrix} x_1 \\ x_2 \\ \end{bmatrix} \sim \mathcal{N}\bigg(\mathbf{\mu} = \begin{bmatrix}0.3 \\ 0.6 \\ \end{bmatrix}, \mathbf{\Sigma} = \begin{bmatrix}1&0.9 \\ 0.9 & 1 \\\end{bmatrix}\bigg)$$

Now we can take a sample, and picture the sample in two ways:
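A minimal sketch of taking that sample (the plotting details are my assumption about what the original cell does):

mu = np.array([0.3, 0.6])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

sample = np.random.multivariate_normal(mu, Sigma)

# plotting code, view 1: the sample as a single point in (x1, x2) space
plt.figure()
plt.scatter(sample[0], sample[1], marker="x", color="red")
plt.xlabel("x1")
plt.ylabel("x2")

# plotting code, view 2: each variable at its own location on the x-axis
plt.figure()
plt.plot([1, 2], sample, "o-")
plt.xticks([1, 2], ["x1", "x2"])
plt.show()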

I encourage you to re-run the above cell a few times to get some different samples so that you have a good intuition for how both visualisations work. It's also worth changing the means and covariance matrix and trying to predict what the outputs will look like.

Something you might notice is that if you decrease the magnitude of the covariance, the distance between $x_1$ and $x_2$ in a sample tends to be larger. This is exactly what we would expect, since the covariance describes how strongly the two values move together.

In order to do conditioning in our new visualisation system (where we fix one variable at a certain value), all we have to do is fix one of the points at its current location, and continue sampling the others.

Generalising to more variables

Imagine now that we add a third variable into the mix: $x_3$. In our population model, this could be something like shoe size.

When designing our new covariance matrix, we decide that $x_3$ is quite related to $x_2$, and a little less related to $x_1$:

$$ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \end{bmatrix} \sim \mathcal{N}\Bigg(\mathbf{\mu} = \begin{bmatrix}0.3 \\ 0.6 \\ 0.4 \\ \end{bmatrix}, \mathbf{\Sigma} = \begin{bmatrix}1&0.9&0.8 \\ 0.9 & 1 & 0.9 \\ 0.8 & 0.9 & 1 \\ \end{bmatrix}\Bigg)$$

With three variables, this is a little tricky to visualise sampling the old way, but very natural in our new way:
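A sketch of the three-variable version in the new style (again, the plotting specifics are assumed):

mu = np.array([0.3, 0.6, 0.4])
Sigma = np.array([[1.0, 0.9, 0.8],
                  [0.9, 1.0, 0.9],
                  [0.8, 0.9, 1.0]])

sample = np.random.multivariate_normal(mu, Sigma)

# plotting code: one location on the x-axis per variable
plt.plot([1, 2, 3], sample, "o-")
plt.xticks([1, 2, 3], ["x1", "x2", "x3"])
plt.show()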

We can likewise extend this to 4 variables:

Or even more:

We can generalise this to as many dimensions as we want, and since we're scaling up, I'm going to make use of sklearn from now on.

Let's start by picturing a 1000-dimensional multivariate Gaussian. The mean vector, $\mathbf{\mu}$, is just a vector with 1000 elements. The covariance matrix can be visualised as:
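One way to build such a matrix is with sklearn's RBF kernel; this sketch (the input locations and lengthscale are my own choices) constructs a 1000 x 1000 covariance matrix and pictures it as a heatmap:

from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 1000).reshape(-1, 1)  # 1000 input locations
kernel = RBF(length_scale=1.0)
Sigma = kernel(X, X)                         # 1000 x 1000 covariance matrix

# plotting code
plt.imshow(Sigma, cmap="viridis")
plt.colorbar()
plt.show()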

This is exactly the same as the covariance matrices we had above, just pictured as a heatmap because there are too many numbers to print exactly. The diagonal is all $1$s, and the further away from the diagonal we get, the lower the values become.

Now that we are equipped with a mean vector and a covariance matrix, we can draw some samples:
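Continuing from the covariance matrix built above, and assuming (for simplicity) a zero mean vector, drawing a few samples might look like:

mu = np.zeros(1000)
jitter = 1e-8 * np.eye(1000)  # small jitter keeps the matrix numerically positive definite

samples = np.random.multivariate_normal(mu, Sigma + jitter, size=5)

# plotting code: each sample is a smooth curve over the 1000 input locations
for s in samples:
    plt.plot(X, s)
plt.show()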

Remember how we said we could condition the Gaussian on some variables having a fixed value? Visually, this is akin to fixing a few of the points on that graph, and re-sampling the rest of them.

It turns out, this is exactly how Gaussian Processes work. Those samples above are actually functions that aren't conditioned on anything. It's for this reason that Gaussian Processes are called "distributions over functions" - we are trying to model all of the possible functions that could produce our data.

This is quite a lot to take in so I recommend going back up through the examples one more time and making sure that the way we have extended the means/covariance out to many dimensions makes sense.

Sampling tricks

One way to make sampling easier (and the code writing easier) is to make it so we are only taking samples from a Gaussian $\mathcal{N}(0,1)$, and then scaling that sample to fit our range.

First convince yourself that:

$$ \mathcal{N}(\mu, \sigma^2) = \mu + \sigma \, \mathcal{N}(0, 1) $$

Picture this as our standard Gaussian, centred at $0$, being stretched by a factor of $\sigma$ in the multiplication and then shifted by $\mu$ in the addition.
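A quick sketch of this in code (the numbers are just the height example again; the multivariate comment is my assumption about how the trick generalises):

mu, sigma = 160.0, 10.0
z = np.random.standard_normal(1000)  # samples from N(0, 1)
samples = mu + sigma * z             # now distributed as N(mu, sigma^2)

# multivariate analogue (assumed, not derived here): if L is the Cholesky factor
# of Sigma (Sigma = L @ L.T), then mu_vec + L @ z is distributed as N(mu_vec, Sigma)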

Conditioning multivariate Gaussians

Please try not to be intimidated; you can take the following rules for granted. The actual maths is reasonably straightforward, and if you want to see a full derivation you can look here: [INCLUDE MR LINK]

Imagine we have some data $\mathbf{X}$ and some labels for the data, $\mathbf{y}$.

We also have a covariance function $K(\circ,\circ)$ which takes two inputs and returns a covariance score. We use this to construct our covariance matrix as usual.

We choose to model $\mathbf{X}$ with $f \sim \mathcal{N}(\mu, \Sigma)$. We then get some new, unlabelled data, $\mathbf{X_{*}}$, which we can incorporate into our model:

$$ \begin{bmatrix}f \\ f_* \\ \end{bmatrix} \sim \mathcal{N}\bigg(\begin{bmatrix}\mu \\ \mu_* \\ \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix} \bigg) $$

where $K_* = K(X, X_*)$ and $K_{**} = K(X_*, X_*)$.

To get the distribution over functions that could have produced our test set, we can then directly use:

$$ f_* \sim \mathcal{N}(\mu_*, \Sigma_*) $$

This is the key insight and benefit of using Gaussians. It essentially says: the covariance function defines a prior belief about the kinds of functions we could see; we then condition on the exact training data we saw, and take slices to get the mean and variance at our test points. More concretely, the above distribution describes all of the possible functions that could have passed through our training data, and the mean of those functions is our prediction.

We already have $\mu$ and $K$, and can easily compute $K_*$ and $K_{**}$ directly from the covariance function. However, we still need to calculate $\mu_*$ and $\Sigma_*$.

The formulae for these are given by the standard rules for Gaussians:

$$ \mu_* = K_{*}^T(K+\sigma^2_y I)^{-1}\mathbf{y} $$
$$ \Sigma_* = K_{**} - K_*^T (K+\sigma^2_y I)^{-1} K_* $$

Where $A^{-1}$ means we are taking the matrix inverse of $A$, $I$ is the identity matrix, and $\sigma^2_y$ is the variance of the observation noise.

Don't worry about deriving these results, just know that whenever you need to fetch $\mu_*$ and $\Sigma_*$ they can be computed exactly with the above formulae.

Toy example

Now that we have covered the key topics, let's do some inference with a Gaussian Process.

I'm going to start by generating a dataset, in this case a noisy sine wave. We'll split the data up into two sets, train (which we are allowed to see the targets for) and test (which we want to guess the targets for).
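A sketch of generating such a dataset (the sizes, noise level, and the exact split are assumptions):

n = 50
X = np.linspace(0, 2 * np.pi, n).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.standard_normal(n)  # noisy sine wave

# hold out every 4th point as the test set we want to guess
test_idx = np.arange(0, n, 4)
train_idx = np.setdiff1d(np.arange(n), test_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]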

So we are going to fit the Gaussian process to the blue points and then try to guess for the red points.

Let's first calculate $\mu_*$, using the formula:

$$ \mu_* = K_{*}^T(K+\sigma^2_y I)^{-1}\mathbf{y} $$

I'm going to assume that the noise parameter $\sigma^2_y$ is known, but in practice we may need to optimise this via gradient-based methods.
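A sketch of that calculation, using sklearn's RBF kernel to build $K$ and $K_*$ (the lengthscale and noise value are assumptions that happen to match the toy data above):

from sklearn.gaussian_process.kernels import RBF

kernel = RBF(length_scale=1.0)
noise = 0.1 ** 2  # assumed known noise variance, sigma_y^2

K = kernel(X_train, X_train)       # K(X, X)
K_star = kernel(X_train, X_test)   # K(X, X_*)

# mu_* = K_*^T (K + sigma_y^2 I)^{-1} y
mu_star = K_star.T @ np.linalg.inv(K + noise * np.eye(len(X_train))) @ y_train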

And now, $ \Sigma_* = K_{**} - K_*^T (K+\sigma^2_y I)^{-1} K_* $
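Continuing the same sketch:

K_star_star = kernel(X_test, X_test)  # K(X_*, X_*)

# Sigma_* = K_** - K_*^T (K + sigma_y^2 I)^{-1} K_*
Sigma_star = K_star_star - K_star.T @ np.linalg.inv(K + noise * np.eye(len(X_train))) @ K_star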

And now we can make predictions (by inspecting the mean of the distribution) at our test points!

One of the other nice things we can do with Gaussian Processes is inspect the diagonal of the covariance matrix $\Sigma_*$ to get an uncertainty estimate (i.e. a variance describing the spread of plausible values) at each of the test locations.

That is to say, as well as predicting a mean, we also predict a variance:
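Continuing the sketch once more, the predicted variances sit on the diagonal of $\Sigma_*$, and a common way to picture them is as a band of a couple of standard deviations around the predicted mean:

var_star = np.clip(np.diag(Sigma_star), 0.0, None)  # clip tiny negatives from numerical round-off
std_star = np.sqrt(var_star)

# plotting code: training points, predicted mean, and a +/- 2 standard deviation band
plt.plot(X_train, y_train, "bx", label="train")
plt.plot(X_test, mu_star, "r-", label="predicted mean")
plt.fill_between(X_test.ravel(),
                 mu_star - 2 * std_star,
                 mu_star + 2 * std_star,
                 alpha=0.3, label="+/- 2 std")
plt.legend()
plt.show()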