A visual exploration of Gaussian Processes and Infinite Neural Networks
Gaussian Processes (GPs) generalize Gaussian distributions to random variables that are functions (stochastic processes). As such, they are a powerful tool for regression and Bayesian inference. In recent years, they have received increased attention due to the interest in machine learning. An intriguing connection between GPs and neural networks exists when the width of the layers in feed-forward neural networks tends towards infinity. For a specific choice of the prior distributions that govern the network parameters, the resulting network will be a GP.
In this blog post, I intend to explore the connection between GPs and infinite neural networks, following the exposition in this ICLR article and adding some practical examples in the form of code and visuals. Moreover, I believe there is value in building intuition about GPs and how they can help us better understand neural networks (NNs). To this end, this article starts with a detailed introduction to Gaussian Processes. Readers familiar with the topic are encouraged to skip to the second part.
Finally, a few words on the code itself. The examples here are intended to be part of the text, and I encourage the reader to copy the code and try out the examples. This is ideal for forming an intuitive understanding of the methods presented here. As the programming language, I chose Julia, which has become an increasingly popular choice in scientific computing and machine learning. I took this opportunity to introduce interested readers to this exciting new language using practical examples. I hope that the simplicity and elegance of the language will speak for itself.
Recap of Gaussian distributions
Before we get started, we need to introduce GPs and motivate their use. To do so, we start with a short introduction to multivariate and conditional Gaussians, as we will make heavy use of them. Readers familiar with the topic may want to skip this introduction.
Multivariate Gaussian distributions
We introduce the multivariate Gaussian distribution, which is defined as
\[\mathcal{N}(x \vert \mu, \Sigma) = \frac{1}{\sqrt{ (2 \pi)^d \vert \Sigma\vert }} \exp \left\{ - \frac{1}{2} (x - \mu)^T \, \Sigma^{-1} \, (x - \mu)\right\}.\]The parameters of the distribution are the mean $\mu \in \mathbb{R}^d$ and the covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. We can restrict our analysis to symmetric positive definite matrices $\Sigma$, as the antisymmetric part of $\Sigma$ does not contribute to the quadratic form in the exponent.
Let us load some packages and then define a multivariate Gaussian in 2D. In Julia, we can make use of the Distributions package to do this.
using LinearAlgebra
using Random, Distributions
using Plots # for plotting
Random.seed!(123) # this sets the random seed to 123, so that our code becomes reproducible
theme(:vibrant) # select the color theme for aesthetic reasons
d = 2
μ = Random.randn(d)
X = Random.randn(d, d)
Σ = X'*X
dist = MvNormal(μ, Σ)
FullNormal(
dim: 2
μ: [1.1902678809862768, 2.04817970778924]
Σ: [1.5167137980972474 -0.7586450344468244; -0.7586450344468244 0.5991970596857172]
)
We can visualize the resulting distribution and sample from it. Sampling from the distribution is as simple as x = rand(dist, n), which gives us n samples from dist. The syntax (x,y)->Distributions.pdf(dist,[x,y]) is used to extract the PDF of the distribution in its functional form for plotting.
n = 100
x = rand(dist, n)
xmin = min(x[1,:]...)
xmax = max(x[1,:]...)
ymin = min(x[2,:]...)
ymax = max(x[2,:]...)
heatmap(xmin:0.05:xmax, ymin:0.05:ymax, (x,y)->Distributions.pdf(dist,[x,y]))
scatter!(x[1,:], x[2,:], label="sample", xlims = (xmin, xmax), ylims = (ymin, ymax))
Conditional Gaussians
Let us discuss conditional distributions. We stick to the common practice of distinguishing between random variables denoted by capital letters and their actual realizations denoted by lowercase letters. Assume $X \sim \mathcal{N}(\mu, \Sigma)$, where we have observed the first $k$ components $x_1, x_2, \dots, x_k$ of $X$. We are interested in the conditional distribution $p(x_{k+1}, \dots, x_d \vert x_1, \dots, x_k)$ of the remaining, unobserved variables. For this purpose, let us partition the covariance matrix and the mean according to the variables:
\[x = \begin{bmatrix} x_a \\ x_b \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}.\]Here, the subscript $a$ corresponds to the first $k$ entries, which are the observed entries, whereas $b$ corresponds to the remaining $d-k$ entries. One can show that the conditional distribution $p(x_{k+1}, \dots, x_d \vert x_1, \dots, x_k) = p(x_b \vert x_a)$ is a multivariate Gaussian itself, i.e.
\[p(x_b \vert x_a) = \mathcal{N}(x_b \vert \mu_{b\vert a}, \Sigma_{b\vert a}),\]where $\mu_{b\vert a}, \Sigma_{b\vert a}$ are the parameters of the conditional distribution. By inserting the partitioned variables and re-shuffling the terms in the exponent, we find that $\mu_{b\vert a}$ is given by the conditional mean
\[\mu_{b\vert a} = \mu_b + \Sigma_{ba} \Sigma_{aa}^{-1} (x_a - \mu_a)\]and the conditional covariance matrix
\[\Sigma_{b\vert a} = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}.\]The attentive reader may recognise this as the Schur complement, which appears when the blocked covariance matrix $\Sigma$ is factored and the degrees of freedom corresponding to $a$ are factored out.
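To make this concrete, here is a minimal sketch (my own addition) that conditions the 2D Gaussian dist from above on an observed value of its first component and evaluates the formulas directly; the observed value xa_obs is an arbitrary choice.
# condition the 2D Gaussian defined above on its first component
xa_obs = 1.0 # arbitrary observed value for the first component
μa, μb = μ[1], μ[2]
Σaa, Σab, Σba, Σbb = Σ[1,1], Σ[1,2], Σ[2,1], Σ[2,2]
μ_cond = μb + Σba / Σaa * (xa_obs - μa) # conditional mean μ_{b|a}
Σ_cond = Σbb - Σba / Σaa * Σab # conditional variance Σ_{b|a} (the Schur complement)
cond_dist = Normal(μ_cond, sqrt(Σ_cond)) # p(x_b | x_a = xa_obs) is again a Gaussian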
Marginalization of Gaussians
We have seen that the conditional distribution of a multivariate Gaussian is itself a multivariate Gaussian. Another nice property applies to marginal distributions. We recall that the marginal distribution over $x_b$ is obtained by integrating out $x_a$. One can show that, after inserting the partitioned Gaussian, one finds the satisfying result
\[\begin{align*} p(x_b) &= \int p(x_a, x_b) \, \mathrm{d} x_a = \int \mathcal{N}\left(x \bigg\vert \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right) \, \mathrm{d} x_a \\ &= \mathcal{N}(x_b \vert \mu_b, \Sigma_{bb}). \end{align*}\]
Gaussian Processes as generalizations of multivariate Gaussians
Now that we have seen how the Gaussian distribution is generalized to $d$ dimensions, we may ask: “What if the random variable was a function?” This is where Gaussian Processes come in.
Gaussian Processes
Definition: For any index set $S$, a Gaussian Process (GP) on $S$ is a collection of random variables $\{Y_x; \, x \in S\}$, of which any finite subset $[Y_{x_1},Y_{x_2},\dots,Y_{x_k}]^T$ is a multivariate Gaussian.
In other words, for any number $k \in \mathbb{N}$ of samples taken at indices $x = [x_1,x_2,\dots,x_k]^T$, the random variables $\{Y_{x_1},Y_{x_2},\dots,Y_{x_k}\}$ jointly follow a multivariate Gaussian. This definition seems somewhat impractical; thankfully, however, we can prescribe a Gaussian Process by specifying functions $m(x)$ and $K(x,x')$, such that
\[Y = [Y_{x_1},Y_{x_2},\dots,Y_{x_k}]^T \sim \mathcal{N}(\mu, \Sigma)\]with mean and covariance given by
\[\mu_i = m(x_i), \qquad \Sigma_{ij} = K(x_i, x_j).\]Informally, we can think of functions as infinite-dimensional vectors. For functions $f$ sampled from the GP, we simply write
\[f(x) \sim GP(m(x), K(x,x')).\]We can think of this as the infinite variant of the marginalization property, as any finite subset of points will be described by the corresponding “marginalized” mean $m(x)$ and covariance matrix $K(x,x’)$.
Radial basis kernel
Let us explore some GPs. As we have pointed out, it suffices to specify a positive semi-definite kernel function and a mean. A popular kernel is the radial-basis function (RBF) kernel
\[K(x,x') = \exp \left\{ - \frac{1}{2\sigma^2} || x - x' ||^2\right\},\]where $\sigma$ is some parameter controlling the width of the kernel. Due to Julia’s functional nature, defining the RBF kernel is straight-forward.
sqexp_kernel(xa, xb; σ=1.0) = exp.(- [norm(a).^2 for a in (xa .- xb)] ./ (2.0 * σ^2)) # RBF kernel for matrix-valued inputs
sqexp_kernel (generic function with 1 method)
Let us write a routine for visualizing kernel functions.
function visualize_kernel(kernel; bounds=(-5,5))
plot1 = heatmap(LinRange(bounds[1], bounds[2], 100), LinRange(bounds[1], bounds[2], 100), kernel, yflip = true)
xa = collect(LinRange(bounds[1], bounds[2], 100))
xb = zero(xa)
plot2 = plot(xa, kernel(xa, xb), label="K(x, 0)")
plot(plot1, plot2, layout = @layout([a{1.0h} b]), size=(770,300))
end
visualize_kernel(sqexp_kernel)
The figure on the right illustrates the kernel function $K(x, 0)$ at a fixed location $x’=0$. By varying $x’$, we slide the kernel across the domain of $x$. This implies that the covariance $K(x, x’)$ for any two points $x, x’$ will fall off exponentially with the distance squared.
Let us plot some realizations of the corresponding GP. Observing realizations is as easy as picking $n$ points on the real axis $[x_1, x_2, \dots, x_n]^T$ and observing $f(x_i)$ at these points. To simplify this, we set $m(x) = 0$. By the definition of GPs, we have that $[f(x_1), f(x_2), \dots, f(x_n)]^T \sim \mathcal{N}(0, \Sigma)$. In other words, we need to sample $n$ observations from the multivariate Gaussian $\mathcal{N}(0, \Sigma)$, where the covariance matrix $\Sigma$ is defined by $\Sigma_{ij} = K(x_i, x_j)$. The following code samples $10$ realizations from the Gaussian process defined by $m(x) = 0$ and the radial-basis kernel $K$.
function visualize_gp_samples(mean, kernel, nsamples; bounds=(0,1))
mesh = LinRange(bounds[1], bounds[2], 500)
S = [kernel(xa,xb) for xa in mesh, xb in mesh]
# The following lines are a fix to make sure that S is numerically positive definite.
# To do so, we shift the spectrum by adding a multiple of the identity so that the smallest eigenvalue becomes 1e-12.
if !isposdef(S)
S = S + (1e-12 - min(eigvals(S)...)) * I
end
GP = MvNormal(mean(mesh), S) # the finite-dimensional Gaussian over the mesh points
# sampling the distribution gives us different instances of the random variable f
samples = rand(GP,nsamples)
plot(mesh,samples)
end
visualize_gp_samples(zero, sqexp_kernel, 10, bounds=(-4,4))
We observe that the ten samples vary smoothly around the mean $0$. We have effectively created a distribution over some function space. For now, we are not concerned with the precise properties of this function space. Rather, let us explore some other kernel functions.
Ornstein-Uhlenbeck process
Let us use the exponential kernel (also known as the Laplacian kernel), which decays with the absolute distance rather than the squared distance. It is defined as
\[K(x,x') = \exp \left\{ - \frac{||x - x'||}{l}\right\}.\]
exp_kernel(xa, xb; l=1.) = exp.(- [norm(a) for a in (xa .- xb)] ./ l)
exp_kernel (generic function with 1 method)
visualize_kernel(exp_kernel)
We observe that the kernel function is sharply peaked around $x’=0$. We use the same code as before to sample the Gaussian Process:
visualize_gp_samples(zero, exp_kernel, 10)
We observe that the resulting functions are not smooth everywhere and resemble realizations from a stochastic process. In fact, one can show that the realizations we observe are equivalent to the random walk of a Brownian particle subject to friction. This is also known as the Ornstein-Uhlenbeck process. It is evident that the smoothness properties of the resulting functions are related to the regularity (or lack thereof) of the kernel function.
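As an informal cross-check (a sketch, not a proof), we can also simulate an Ornstein-Uhlenbeck process directly by discretizing the stochastic differential equation $\mathrm{d}X_t = -X_t \, \mathrm{d}t + \sqrt{2} \, \mathrm{d}W_t$, whose stationary covariance is $\exp(-\vert t - t'\vert)$ and thus matches our kernel with $l=1$, and compare the paths visually with the GP samples above. The step size and the number of paths below are arbitrary choices.
# Euler-Maruyama discretization of the Ornstein-Uhlenbeck SDE dX = -X dt + √2 dW
function simulate_ou(; T=1.0, dt=1e-3, npaths=10)
    nsteps = Int(round(T/dt))
    X = zeros(nsteps, npaths)
    X[1, :] = randn(npaths) # start in the stationary distribution N(0, 1)
    for i in 2:nsteps
        X[i, :] = X[i-1, :] .- X[i-1, :] .* dt .+ sqrt(2*dt) .* randn(npaths)
    end
    return LinRange(0, T, nsteps), X
end
t_ou, X_ou = simulate_ou()
plot(t_ou, X_ou, legend=false)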
Gaussian Process regression (Kriging)
We have seen some examples of GPs and are interested in what we can do with them. A powerful application of GPs is regression, which is also known as Kriging in this context. Let us assume that we have some measurements of a function $t(x)$ at points $x_i$ and that we would like to predict $t(x)$ everywhere else. Instead of measuring $t(x)$ directly, however, we observe $Y = t(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is a noise variable that adds uncertainty to the measurement. We could model this by prescribing the conditional distribution $p(y\vert x) = \mathcal{N}(y \vert t(x), \sigma^2)$. Alternatively, we can interpret this as a Gaussian process $GP(t(x), \sigma^2 \delta(x-x'))$, where $\delta(x-x')$ denotes the Dirac delta, so that the noise at different points $x \neq x'$ is uncorrelated.
σ=0.1
dirac_kernel(xa, xb; σ=1.0) = σ^2 .* [Float64(norm(a) == 0.) for a in (xa .- xb)]
# visualize_kernel(dirac_kernel)
f(x) = sin.(x)
visualize_gp_samples(f, (xa,xb)->dirac_kernel(xa,xb,σ=σ), 1, bounds=(-π,π))
Let us make this more concrete. We choose $f(x) = \sin (x)$ and $\sigma = 0$. This gives us a deterministic outcome, as there will be no noise on the prior. We sample eight points at random from the interval $[- \pi, \pi]$, which will serve as our training points.
# sample observed points from the GP
σ = 0.0
n_obs = 8
x_obs = rand(Uniform(-π,π), n_obs)
S_obs = [dirac_kernel(xa, xb, σ=σ) for xa in x_obs, xb in x_obs]
if !isposdef(S_obs)
S_obs = S_obs + (1e-12 - min(eigvals(S_obs)...)) * I
end
GP = MvNormal(f(x_obs), S_obs)
f_obs = rand(GP, 1)
# plot the observations
scatter(x_obs, f_obs, label="observations", color=4)
After observing $k$ points from this Gaussian Process, we are interested in constructing a posterior distribution
\[Y \sim GP(m(x), K(x,x'))\]Luckily, we can use conditional Gaussians to make a prediction at the query points. We assume that $x_1, x_2, \dots, x_k$ are the reference points, at which we have observed $y_1, y_2, \dots, y_k$, and $x_{k+1}, x_{k+2}, \dots, x_d$ are the query points at which we would like to make a prediction $y_{k+1}, y_{k+2}, \dots, y_d$ using the GP posterior. As before, we will use the subscript $a$ to denote the first $k$ reference points and $b$ the remaining query points. By the definition of GPs, the vector $Y$ follows
\[Y = \begin{bmatrix} Y_a \\ Y_b \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right).\]To make a prediction at the query points, all we need to do is evaluate the required blocks of $\Sigma$ and $\mu$ and compute the conditional Gaussian distribution. We make the simple choice of assuming a zero prior, such that $\mu_a = 0$. The blocks $\Sigma_{aa}$, $\Sigma_{ba}$ and $\Sigma_{bb}$ are computed by plugging the points $x_i$ into the kernel. We recall that the conditional Gaussian distribution is defined by
\[\mu_{b\vert a} = \Sigma_{ba} \Sigma_{aa}^{-1} y_a\]and
\[\Sigma_{b\vert a} = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}.\]The maximum likelihood prediction in this case is given by the mean $\mu_{b\vert a}$ and the variance can be estimated by taking the diagonal entries of $\Sigma_{b\vert a}$. Let us visualize this with some code.
# compute the function of the posterior
function compute_gp_posterior(x_obs, f_obs, x_mesh, post_kernel, prior_kernel)
# the outer products with vectors of ones build the matrices of all pairwise combinations, on which the element-wise kernels act
Σaa = post_kernel(x_obs*ones(length(x_obs))', ones(length(x_obs))*x_obs') + prior_kernel(x_obs*ones(length(x_obs))', ones(length(x_obs))*x_obs')
Σba = post_kernel(x_mesh*ones(length(x_obs))', ones(length(x_mesh))*x_obs')
Σbb = post_kernel(x_mesh*ones(length(x_mesh))', ones(length(x_mesh))*x_mesh')
# the regression part
m = Σba * (Σaa \ f_obs) # we assume μ=0 for the prior
S = Σbb - Σba * (Σaa \ Σba')
return m, S
end
x_mesh = LinRange(-2*π,2*π,100)
prior_kernel = (xa, xb) -> dirac_kernel(xa, xb, σ=σ)
post_kernel = (xa, xb) -> sqexp_kernel(xa, xb, σ=1.0)
m, S = compute_gp_posterior(x_obs, f_obs, x_mesh, post_kernel, prior_kernel)
([-9.369695918338859e-5; -0.00015075979983293126; … ; -0.005352845937511571; -0.003520077442188424], [0.9999989337571946 0.9919746613050364 … 1.169090721559824e-6 7.794250769238117e-7; 0.9919746613050364 0.9999972786788515 … 1.8644416057254722e-6 1.243011747806888e-6; … ; 1.1690907215600333e-6 1.864441605725805e-6 … 0.9998702236000084 0.9918910258492498; 7.794250769239503e-7 1.2430117478071082e-6 … 0.9918910258492498 0.9999438745325055])
The resulting mean and covariance matrix determine the distribution on all points of the domain. To visualize the posterior GP, we plot the mean and a confidence interval of $\pm 2 \sigma$. The latter can be extracted by taking the diagonal entries of the posterior covariance matrix.
# plot the posterior mean, as well as its 2σ confidence interval
var = sqrt.(diag(S))
plot(x_mesh, m, ribbon=(2*var, 2*var), fc=:green, fa=0.3, label="model", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function")
scatter!(x_obs, f_obs, label="observations", color=4)
This is very nice. Not only were we able to fit the data points, but we also obtained a posterior distribution that gives us a confidence interval. Because the prior distribution was deterministic, there is no uncertainty at the observed points. In between these points, however, we observe that the covariance increases. Moreover, the kernel that we chose encodes some regularity, which would result in smooth realizations if we were to sample this GP.
Let us now re-introduce noise into our prior. To this end, we set $\sigma = 0.1$, and adapt the GP prior accordingly. The modified GP is determined by
\[Y = \begin{bmatrix} Y_a \\ Y_b \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} + \sigma^2 I & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right),\]which means that the prediction is given by
\[\mu_{b\vert a} = \Sigma_{ba} (\Sigma_{aa} + \sigma^2 I)^{-1} y_a.\]
# sample new observed points from the GP with non-zero variance
σ = 0.1
S_obs = [dirac_kernel(xa, xb, σ=σ) for xa in x_obs, xb in x_obs]
if !isposdef(S_obs)
S_obs = S_obs + (1e-12 - min(eigvals(S_obs)...)) * I
end
GP = MvNormal(f(x_obs), S_obs)
f_obs = rand(GP, 1)
# take the same mesh, but adapt the prior to the new variance
prior_kernel = (xa, xb) -> dirac_kernel(xa, xb, σ=σ)
post_kernel = (xa, xb) -> sqexp_kernel(xa, xb, σ=1.0)
m, S = compute_gp_posterior(x_obs, f_obs, x_mesh, post_kernel, prior_kernel)
# plot the posterior mean, as well as its 2σ confidence interval
var = sqrt.(diag(S))
plot(x_mesh, m, ribbon=(2*var, 2*var), fc=:green, fa=0.3, label="model", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function")
scatter!(x_obs, f_obs, label="observations", color=4)
We observe very similar results, with the difference that there is now some uncertainty around the observed points. This is nice, as we can clearly see that the method is robust with respect to the noise in the observed data. Unfortunately, this plot also shows one of the main limitations of the approach. Towards the boundaries of the domain, far away from any observations, there is no information to constrain the prediction, so it reverts to the prior, which has zero mean.
Comparison to Bayesian linear regression
In the setting of Bayesian linear regression, we fit the function with
\[f(x) = wx + b,\]where the parameters $w, b \in \mathbb{R}$ follow the prior distributions $w \sim \mathcal{N}(0, \sigma^2_w)$ and $b \sim \mathcal{N}(0, \sigma^2_b)$, with corresponding hyperparameters $\sigma^2_w$ and $\sigma^2_b$. This also forms a Gaussian Process, which we can see by forming the covariance of $f$ evaluated on two points $x$ and $x’$, given by
\[\begin{align*} \mathop{cov}[f(x), f(x')] &= \mathbb{E}[f(x)f(x')] - \mathbb{E}[f(x)]\mathbb{E}[f(x')]\\ &= \mathbb{E}[(wx + b)(wx' + b)]\\ &= \sigma^2_w x x' + \sigma^2_b = K(x,x'). \end{align*}\]This can also be interpreted as a kernel. We fit our sinusoidal observations with this polynomial kernel GP.
polynomial_kernel(xa, xb; σw=1.0, σb=1.0) = σw^2*(xa.*xb) .+ σb^2
polynomial_kernel (generic function with 1 method)
visualize_kernel(polynomial_kernel)
# take the same mesh, but adapt the prior to the new variance
prior_kernel = (xa, xb) -> dirac_kernel(xa, xb, σ=σ)
post_kernel = (xa, xb) -> polynomial_kernel(xa, xb)
m, S = compute_gp_posterior(x_obs, f_obs, x_mesh, post_kernel, prior_kernel)
# plot the posterior mean, as well as its 2σ confidence interval
var = sqrt.(diag(S))
plot(x_mesh, m, ribbon=(2*var, 2*var), fc=:green, fa=0.3, label="model", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function")
scatter!(x_obs, f_obs, label="observations", color=4)
Unsurprisingly, this is not a particularly good fit, as we are fitting the data with a linear function. Moreover, the result looks very similar to what we would expect from Bayesian regression. To see how the two methods compare, let us take a few steps back and recall how we compute the estimator in a Bayesian regression setting. As we did previously, we assume $w \sim \mathcal{N}(0, \sigma^2_w)$ and $b \sim \mathcal{N}(0, \sigma^2_b)$. After observing the data $x_a, y_a$, the Maximum-Likelihood estimate (MLE) is given by
\[y_b = \Phi_b (\Phi_a^T \Phi_a)^{-1} \Phi_a^T y_a,\]where $\Phi_a = \Phi(x_a)$ and $\Phi_b = \Phi(x_b)$ are defined by
\[\Phi(x) = \begin{bmatrix} \varphi_1(x_1) & \varphi_2(x_1) & \dots & \varphi_p(x_1) \\ \varphi_1(x_2) & \varphi_2(x_2) & & \vdots \\ \vdots & & \ddots & \\ \varphi_1(x_d) & \dots & & \varphi_p(x_d)\\ \end{bmatrix}.\]We have used $\varphi_i$ to denote the $p$ basis functions for some $p \in \mathbb{N}_0$. In the setting of polynomial regression, these could be the monomial basis functions $\varphi_i: x \rightarrow x^i$. The matrix $\Phi(x)$ is also called the design matrix or, in the case of polynomial basis functions, Vandermonde matrix. In our particular (linear) case, this matrix is
\[\Phi(x) = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_d\\ \end{bmatrix}.\]If we instead take the Maximum-a-Posteriori (MAP) estimator, we get
\[y_b = \Phi_b (\Phi_a^T \Phi_a + \Lambda)^{-1} \Phi_a^T y_a = \Phi_b (\Phi_a^T \Phi_a + \lambda I)^{-1} \Phi_a^T y_a.\]$\Lambda$ is defined as
\[\Lambda = \begin{bmatrix} \sigma^2/\sigma_b^2 & 0 \\ 0 & \sigma^2/\sigma_w^2\\ \end{bmatrix},\]and acts as a regularization term, putting the noise variance $\sigma^2$ in relation to the hyperparameters $\sigma_b^2$, $\sigma_w^2$. To simplify matters, we set $\sigma_b^2 = \sigma_w^2 = \sigma_0^2$ and $\lambda = \sigma^2 / \sigma_0^2$.
We compare the MAP estimator to the GP prediction
\[\mu_{b\vert a} = \Sigma_{ba} (\Sigma_{aa} + \sigma^2 I )^{-1} y_a,\]which looks similar but is not quite the same. We recall that $\Sigma_{aa} = \Phi_a \Phi_a^T$, which for the GP is a $k \times k$ matrix, as opposed to the $p \times p$ matrix $\Phi_a^T \Phi_a$ which appears in the regression scenario. Using $\Sigma_{ba} = \Phi_b \Phi_a^T$ and the Sherman-Morrison-Woodbury formula, we can rewrite it as
\[\begin{align*} \mu_{b\vert a} &= \Phi_b \Phi_a^T (\Phi_a \Phi_a^T + \sigma^2 I )^{-1} y_a \\ &= \frac{1}{\sigma^2} \Phi_b \Phi_a^T [I - \Phi_a (\sigma^2 I + \Phi_a^T \Phi_a )^{-1}\Phi_a^T]y_a \\ &= \frac{1}{\sigma^2} \Phi_b [I - \Phi_a^T \Phi_a (\sigma^2 I + \Phi_a^T \Phi_a )^{-1}] \Phi_a^T y_a \\ &= \frac{1}{\sigma^2} \Phi_b [(\sigma^2 I + \Phi_a^T \Phi_a )(\sigma^2 I + \Phi_a^T \Phi_a )^{-1} - \Phi_a^T \Phi_a (\sigma^2 I + \Phi_a^T \Phi_a )^{-1}] \Phi_a^T y_a \\ &= \Phi_b (\Phi_a^T \Phi_a + \sigma^2 I)^{-1} \Phi_a^T y_a. \end{align*}\]This pleasing result is also called the push-through property. More importantly, it shows that GP regression with this polynomial kernel is equivalent to Bayesian regression using a prior distribution on the parameters. In fact, this property was already hinted at when we realized that the process by which we generate the data could be interpreted as a GP. As such, we have demonstrated that GPs offer a great amount of flexibility and can model a wide variety of problems. Moreover, the choice of kernel was the crucial step which allowed us to model Bayesian regression using GPs. It is therefore essential to construct kernels based on the properties we expect our functions to have.
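We can verify this equivalence numerically with a small sketch. Under the simplification $\sigma_w = \sigma_b = 1$ (so that $\Lambda = \sigma^2 I$), we build the design matrix explicitly, compute the MAP prediction, and compare it with the GP posterior mean obtained from the polynomial kernel above; the helper Φ below is my own shorthand.
# MAP prediction with the explicit design matrix Φ(x) = [1 x], assuming σ_w = σ_b = 1
Φ(x) = hcat(ones(length(x)), collect(x))
Φa = Φ(x_obs)
Φb = Φ(x_mesh)
y_map = Φb * ((Φa' * Φa + σ^2 * I) \ (Φa' * f_obs))
# GP posterior mean with the polynomial kernel, as computed above
m_gp, _ = compute_gp_posterior(x_obs, f_obs, x_mesh, polynomial_kernel, (xa, xb) -> dirac_kernel(xa, xb, σ=σ))
maximum(abs.(y_map .- m_gp)) # the difference should be numerically negligible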
Finally, there is an important practical difference to Bayesian regression. GPs are non-parametric: there are no learned parameters. We have essentially skipped the step of learning and storing the parameters and instead compute a prediction directly from the data. However, this comes at the cost of having to factorize a larger, $k \times k$ matrix.
Constructing Kernels from scratch
The polynomial kernel hinted at how a kernel function can be constructed to implicitly perform regression in some basis. According to Mercer’s theorem, we can represent positive definite covariance functions $K(x,x’)$ as an infinite expansion
\[K(x,x') = \sum_{i=1}^\infty \lambda_i \varphi_i(x) \varphi_i(x'),\]where the basis functions $\varphi_i$ and the constants $\lambda_i$ are determined by the eigenvalue problem
\[\int K(x,x') \varphi_i(x') \, \mathrm{d}x' = \lambda_i \varphi_i(x).\]It is evident that this representation of $K$ essentially corresponds to the eigendecomposition of a matrix. We can use this to derive a kernel for periodic basis functions by inserting the Fourier expansion, which yields
\[K(x,x') = \exp \left\{- \frac{2 \sin^2(\vert x - x'\vert/2)}{l^2} \right\}.\]
periodic_kernel(xa, xb; l=1.0) = exp.(- [2.0*sin(0.5*norm(a))^2 for a in (xa .- xb)] ./ (l^2)) # periodic kernel for matrix-valued inputs
visualize_kernel(periodic_kernel)
# take the same mesh, but adapt the prior to the new variance
prior_kernel = (xa, xb) -> dirac_kernel(xa, xb, σ=σ)
post_kernel = (xa, xb) -> periodic_kernel(xa, xb, l=1.0)
m, S = compute_gp_posterior(x_obs, f_obs, x_mesh, post_kernel, prior_kernel)
# plot the posterior mean, as well as its 2σ confidence interval
var = sqrt.(diag(S))
plot(x_mesh, m, ribbon=(2*var, 2*var), fc=:green, fa=0.3, label="model", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function")
scatter!(x_obs, f_obs, label="observations", color=4)
The additional information regarding the periodicity clearly helps the GP to generalize (extrapolate) better. We see that the uncertainty in the prediction has an interesting structure, which follows the same periodicity as the kernel. As we have prescribed the periodicity, the remaining uncertainty lies in the phase and amplitude of the solution.
Connection to Neural Networks
At the beginning of this article I promised to sketch the connection to neural networks. The analysis here follows this and this article and we will largely stick to the notation of the former. The overarching goal of this section is to better understand the parallels between infinite Neural Nets and the GPs that they are equivalent to, such that they may be better analyzed and understood.
We consider a deep, feed-forward neural network with $L$ hidden layers. Let us denote by $x \in \mathbb{R}^{d_\text{in}}$ the input vector to the network. The $i$-th component of the input to layer $l$ is denoted similarly as $x^{l}_i$; it is computed by applying the non-linear activation $\phi$ to the $i$-th output of the previous layer, $z_i^{l-1}$. In other words, $z_i^l$ denotes the output of the $i$-th neuron after applying the affine transformation and before applying the nonlinearity. This is summarized as
\[\begin{align*} z_i^l(x) &= b_i^l + \sum_{j=1}^{N_l} W^l_{ij} x_j^l(x), \\ x_i^l(x) &= \phi(z_i^{l-1}(x)) \end{align*}\]where $W^l_{ij}, b_i^l$ are the parameters of layer $l$ and $x_j^{l}(x)$ is the activated output of the previous layer, with $x_j^{0}(x) = x_j$. We have explicitly expressed the dependence of the output in each layer on the input vector $x$. The output of the neural network is the vector $z^L(x)$, which is the output of the last affine transformation.
Shallow Networks and Gaussian Processes
Let us consider the simplified case of a “shallow” network with a single hidden layer comprising $N_1$ neurons. In this case, we can express the neural network as
\[\begin{align*} z_i^1(x) &= b_i^1 + \sum_{j=1}^{N_1} W^1_{ij} x^1_j(x), \\ x_j^1(x) &= \phi\left(b_j^0 + \sum_{k=1}^{d_\text{in}} W^0_{jk} x_k\right). \end{align*}\]So far, we have not assumed any priors for the weight and bias parameters. In the following, we assume that in each layer $l$, bias parameters $b_i^l$ are drawn i.i.d. from a Gaussian distribution with zero mean and $\sigma_{b,l}^2$ variance. We assume the same for the weight parameters $W_{ij}^l$, which are drawn i.i.d. from a Gaussian distribution with zero mean and $\sigma_{W,l}^2$ variance.
We are interested in computing the expected value $\mathbb{E}[z_i^1(x)]$ for a fixed input $x$. We observe that the output is the weighted sum of the $N_1$ activations and the bias parameter. Due to the weight and bias parameters $b_j^0, W_{jk}^0$ being i.i.d., we observe that $x_j^1$ and $x_{j'}^1$ will be independent for $j \neq j'$. Due to the linearity of the expected value and the independence of $W^1_{ij}$ and $x_j^1$, we can write
\[\mathbb{E}[z_i^1(x)] = \mathbb{E}[b_i^1] + \sum_{j=1}^{N_1} \mathbb{E}[W^1_{ij} x^1_j(x)] = \sum_{j=1}^{N_1} \mathbb{E}[W^1_{ij}]\mathbb{E}[x^1_j(x)] = 0.\]The last step is a consequence of $\mathbb{E}[W_{ij}^1] = 0$ and the independence between $x_j^1(x)$ and the parameters $W_{ij}^1$. A similar argument holds for $\mathbb{E}[(W_{ij}^1 x_j^1(x))^2]$, which can be rewritten as
\[\mathbb{E}[(W^1_{ij} x^1_j(x))^2] = \mathbb{E}[(W^1_{ij})^2] \mathbb{E}[(x^1_j(x))^2] = \sigma_{W,1}^2 \mathbb{E}[(x^1_j(x))^2].\]This term is bounded if $x^1_j(x)$ is bounded as well, which is the case if, for instance, we set the activation function to $\phi = \tanh$. Consequently, the variance of $z_i^1(x)$ is given by
\[\operatorname{var}[z_i^1(x)] = \sigma_{b,1}^2 + N_1\sigma_{W,1}^2 \mathbb{E}[(x^1_j(x))^2].\]By the multivariate CLT, this implies that in the limit of infinite network width $N_1 \rightarrow \infty$, the prior of $z_i^1(x)$ will converge to a Gaussian distribution with mean zero and variance $\sigma_{b,1}^2 + N_1\sigma_{W,1}^2 V^1(x)$, where $V^1(x) = \mathbb{E}[(x^1_j(x))^2]$. Thus, by adopting $\sigma_{W,1} = \omega_{W,1} N_1^{-1/2}$, where $\omega_{W,1}$ is a new parameter, we obtain a prior with constant variance as the number of hidden units increases. More importantly, for any finite collection of inputs $x^{\alpha=1}, x^{\alpha=2}, \dots, x^{\alpha=k}$, the set $\{z_i^1(x^{\alpha=1}), z_i^1(x^{\alpha=2}), \dots, z_i^1(x^{\alpha=k})\}$ will be a multivariate Gaussian, which is precisely the definition of a GP.
Let us explore some shallow networks to get a feeling for the matter. We use the popular Julia library Flux to handle neural networks for us. To simplify matters, we set $\omega_{W,1} = \sigma_{W,0} = \sigma_{b,0} = 1$ and $\sigma_{b,1} = 0$. The following code generates a neural network, which is parametrized by weights and bias parameters drawn from the specified Gaussian priors.
using Flux
function shallow_network(; width=100, activation=tanh)
# initialize the weights according to our prior Distributions
W0 = reshape(rand(Normal(0, 1), width*1), (width, 1))
b0 = reshape(rand(Normal(0, 1), width*1), (width))
W1 = reshape(rand(Normal(0, 1.0/sqrt(width)), 1*width), (1, width)) # this is the layer in which we rescale the variance to ensure finite variance of the joint Gaussian (Normal takes the standard deviation, so σ = 1/√width gives variance 1/width)
b1 = reshape(rand(Normal(0, 0), 1*1), (1)) # this one is turned off
return Chain(Dense(W0, b0, activation), Dense(W1, b1, identity))
end
shallow_network (generic function with 1 method)
This simple function will generate shallow networks, with the specified width. We can generate functions and access their parameters in the following manner:
nn = shallow_network(width=100)
size(nn[2].W)
(1, 100)
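Before visualizing samples, here is a quick, informal check of the scaling argument from above: we sample many shallow networks at a fixed input and look at the empirical variance of the output for increasing widths. With the $N_1^{-1/2}$ scaling of the output-layer weights it should stay roughly constant; the test input and the sample count below are arbitrary choices.
import Statistics # qualified call below, to avoid clashing with the variable `var` defined earlier
x_test = [0.5] # arbitrary fixed input
for width in (10, 100, 1000)
    outputs = [shallow_network(width=width)(x_test)[1] for _ in 1:1000]
    println("width = ", width, ", empirical output variance ≈ ", Statistics.var(outputs))
end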
Let us write a function to visualize some samples from this prior, much in the same way as we have visualized samples from GPs.
function visualize_nn_samples(generator, nsamples; bounds=(-5,5))
mesh = LinRange(bounds[1], bounds[2], 500)
nns = [generator() for i in 1:nsamples]
graphs = [nns[i](collect(mesh'))' for i in 1:nsamples]
plot(mesh, graphs)
end
visualize_nn_samples(()->shallow_network(width=100), 10)
We observe that this results in smooth functions, with variation that is almost always concentrated in a region around $x=0$ with a width of roughly $\sigma_{b,0}/\sigma_{W,0}$. Outside of this region, the functions are almost always constant. Moreover, by varying the number of hidden units, we observe that the functions become increasingly complex.
We can generate all kinds of GPs depending on the activation function that we choose. The following example shows the result for step activation functions. We observe that the resulting functions resemble Brownian motion, especially in the limit of many hidden units, which equates to many jumps.
visualize_nn_samples( ()->shallow_network(width=10000, activation=sign), 10 ) # using step functions as activations
To analyze neural networks with the tool of Gaussian processes, we need to determine their kernel function. More specifically, we are interested in finding the covariance for two inputs $x$ and $x’$. We have
\[\begin{align*} K^1(x,x') = \operatorname{cov}[z_i^1(x), \, z_i^1(x')] = \mathbb{E}[z_i^1(x) \, z_i^1(x')] &= \sigma_{b,1}^2 + \sum_{j=1}^{N_1} \sigma_{W,1}^2 \mathbb{E}[x^1_j(x) \, x^1_j(x')] \\ &= \sigma_{b,1}^2 + \omega_{W,1}^2 C(x, x'), \end{align*}\]where we have introduced $C(x, x') = \mathbb{E}[x^1_j(x) \, x^1_j(x')]$. This function can in turn be expressed as
\[C(x, x') = \frac{1}{2} \left(\mathbb{E}[(x^1_j(x))^2] + \mathbb{E}[(x^1_j(x'))^2] - \mathbb{E}[(x^1_j(x) - x^1_j(x'))^2]\right).\]For nearby values $x \approx x'$, the first two terms have roughly the same value, whereas the last term is the expected squared difference between the values of a single hidden unit at $x$ and $x'$. For step activation functions, this difference will be either $0$ or $4$, depending on whether the step occurs between $x$ and $x'$. Moreover, the location of this step is approximately uniform in the local vicinity, which implies that the probability of the step occurring between $x$ and $x'$ grows proportionally with $\vert x - x'\vert$. As such, we can estimate
\[\mathbb{E}[(x^1_j(x) - x^1_j(x'))^2] \sim |x - x'|\]for nearby points $x$ and $x'$, which is consistent with our previous observations for the Brownian-motion-like GP kernel. A similar analysis can be carried out for $\tanh$ activation functions, which yields
\[\mathbb{E}[(x^1_j(x) - x^1_j(x'))^2] \sim |x - x'|^2\]for nearby points, which aligns with the smoothness properties observed in our experiments. I skip the analysis and point the interested reader to the original technical report by Neal.
Deep Neural Networks and Gaussian Processes
We are interested in deep neural networks and their relation to GPs. We recall that the output of the $l$-th layer is computed as
\[\begin{align*} z_i^l(x) &= b_i^l + \sum_{j=1}^{N_l} W^l_{ij} x_j^l(x), \\ x_i^l(x) &= \phi(z_i^{l-1}(x)). \end{align*}\]Assuming that $z_i^{l-1}(x)$ is a GP with zero mean and covariance determined by $K^{l-1}$, we can show that $z_i^l(x)$ is likewise a GP as $N_l$ tends towards infinity. As before, $z_i^l(x)$ is a sum of i.i.d. terms with bounded variance and zero mean, due to the zero mean of the weight priors. In this limit, $\{z_i^l(x^{\alpha=1}), z_i^l(x^{\alpha=2}), \dots, z_i^l(x^{\alpha=k})\}$ also follows a multivariate Gaussian, and $z_i^l(x)$ is a GP. As for the shallow networks, we can write the kernel function as
\[\begin{align*} K^l(x,x') &= \mathbb{E}[z_i^l(x) \, z_i^l(x')] \\ &= \sigma_{b}^2 + \omega_{W}^2 \mathbb{E}[x^l_j(x) \, x^l_j(x')] = \sigma_{b}^2 + \omega_{W}^2 \mathbb{E}\left[\phi(z^{l-1}_j(x)) \, \phi(z^{l-1}_j(x'))\right], \end{align*}\]where we have dropped the subscripts indicating the layer of the variances, as we take them all to be equal to the variances in the first layer. The expectation $\mathbb{E}\left[ \phi(z_j^{l-1}(x)) \, \phi(z_j^{l-1}(x')) \right]$ is slightly more complicated than in the shallow case, as we need to integrate over the Gaussian distributions governing the input variables $z_j^{l-1}(x)$ and $z_j^{l-1}(x')$. As these are characterized by the kernel function of the previous layer $K^{l-1}(x,x')$, we can summarize this last term as a function $F_\phi$ of $K^{l-1}(x,x'), K^{l-1}(x,x), K^{l-1}(x',x')$, which are the three terms that can appear in this computation:
\[K^l(x,x') = \sigma_{b}^2 + \omega_{W}^2 F_\phi (K^{l-1}(x,x'), K^{l-1}(x,x), K^{l-1}(x',x')).\]By identifying $F_\phi$, we can construct the kernel recursively starting from the Kernel function for the first layer, which is given by
\[K^0(x,x') = \mathbb{E}[z_i^0(x) \, z_i^0(x')] = \sigma_{b}^2 + \omega_{W}^2 \frac{x x'}{d_\text{in}}.\]We identify this as the polynomial Kernel from linear regression.
visualize_gp_samples(zero, polynomial_kernel, 10, bounds=(-10,10)) # P is the polynomial kernel which we defined for polynomial regression using GPs
We proceed to construct kernels for other layers recursively. For some nonlinearities, this requires numerical integration and an algorithm is outlined in the paper, which we will not explore here. We will rather focus on ReLU activation functions, which allow the Kernel to be computed analytically. In this case, the Kernel is given by
\[\begin{align*} K^l(x,x') &= \sigma_b^2 + \frac{\sigma_W^2}{2 \pi} \sqrt{K^{l-1}(x,x) \, K^{l-1}(x',x')} \; \left( \sin \theta^{l-1}_{x, x'} + (\pi - \theta^{l-1}_{x, x'}) \cos \theta^{l-1}_{x, x'} \right)\\ \theta^l_{x, x'} &= \cos^{-1} \left( \frac{K^l(x,x')}{\sqrt{K^l(x,x) \, K^l(x',x')}} \right). \end{align*}\]Let us implement a function to compute this kernel recursively.
function ReLU_kernel(xa, xb; σw=1.0, σb=1.0, L=1)
Kl = polynomial_kernel(xa, xb, σw=σw, σb=σb)
Kdiaga = polynomial_kernel(xa, xa, σw=σw, σb=σb)
Kdiagb = polynomial_kernel(xb, xb, σw=σw, σb=σb) # careful with the transpose in the case of matrix valued functions
#@show Kdiag
for l=1:L
# compute the angular values from the previous layer
θ = acos.(Kl ./ sqrt.(Kdiaga .* Kdiagb) )
Kl = σb^2 .+ σw^2 / (2*pi) .* sqrt.(Kdiaga .* Kdiagb) .* (sin.(θ) .+ (pi .- θ) .* cos.(θ))
# if you wonder about all the dots, this is to ensure that the functions are applied point-wise and so the output has the right dimensions
# update diagonal entries for the next computation
Kdiaga = σb^2 .+ σw^2 / (2*pi) .* Kdiaga * pi
Kdiagb = σb^2 .+ σw^2 / (2*pi) .* Kdiagb * pi
end
return Kl
end
ReLU_kernel (generic function with 1 method)
Let us visualize this Kernel for a single layer, i.e. for the shallow case:
visualize_kernel((xa, xb)->ReLU_kernel(xa, xb, L=1))
# visualize_kernel((xa, xb)->ReLU_kernel(xa, xb, L=5))
We observe that this kernel resembles a smoother version of the polynomial kernel. In fact, we can visualize this kernel for an increasing number of levels:
xmesh = LinRange(-5,5,100)
x0 = zero(xmesh)
graphs = [ReLU_kernel(xmesh, x0, σw=sqrt(1.6), σb=sqrt(0.1), L=l) for l in 1:20]
plot(xmesh, graphs)
We observe that the kernel functions flatten out as the number of layers increases. This implies that a higher number of layers amounts to a more uniform correlation among all points. (The same is true if we vary $x'$, although the kernel will be asymmetrical.)
Let us also sample functions from the GP characterized by this kernel function. We draw samples from the NNGP (the GP characterized by this kernel), and compare it to the NN samples drawn from the priors we previously defined for finite width, shallow ReLU networks.
plot1 = visualize_nn_samples( ()->shallow_network(width=10000, activation=relu), 10 ) # samples drawn from width=10000 shallow network priors
plot2 = visualize_gp_samples(zero, (xa, xb)->ReLU_kernel(xa, xb, L=1), 10, bounds=(-5,5)) # samples drawn from the corresponding GP
plot(plot1, plot2, layout = @layout([a{1.0h} b]), size=(770,300))
As we would expect, samples from the NNGP prior closely resemble those of our finite-width neural networks. The differences in scaling are mainly due to the different parameters that we have chosen for each layer in the finite-width case.
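We can also make this comparison quantitative by estimating the covariance $\operatorname{cov}[z^1(x), z^1(x')]$ over many sampled finite-width ReLU networks and comparing it to the analytic kernel. One caveat: ReLU_kernel uses its default output-bias variance $\sigma_b^2 = 1$, while our finite networks have the output bias switched off, so we add that constant back before comparing. The test inputs, width and number of networks below are arbitrary, and the agreement is only approximate; this is a rough Monte Carlo sanity check rather than an exact statement.
# Monte Carlo estimate of cov[z¹(x), z¹(x')] over many finite-width shallow ReLU networks
x1, x2 = [0.5], [-1.0] # two arbitrary test inputs
nnets = 2000
samples = zeros(nnets, 2)
for i in 1:nnets
    net = shallow_network(width=500, activation=relu)
    samples[i, 1] = net(x1)[1]
    samples[i, 2] = net(x2)[1]
end
z1 = samples[:, 1] .- sum(samples[:, 1]) / nnets
z2 = samples[:, 2] .- sum(samples[:, 2]) / nnets
emp_cov = sum(z1 .* z2) / (nnets - 1)
println("Monte Carlo estimate plus output-bias term: ", emp_cov + 1.0)
println("analytic ReLU kernel, L = 1: ", ReLU_kernel(x1[1], x2[1], L=1))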
Inference using NNGPs
Let us revisit GP regression, this time using the ReLU neural network kernel. As we have already defined the kernel function, it is straight-forward to apply it to the regression problem.
We re-use our previous problem of fitting noisy data from a sinusoidal function. We adapt it to the neural network setting by adding more training data:
# sample observed points from the GP
σ = 0.1
n_obs = 100
x_obs = rand(Uniform(-π,π), n_obs)
S_obs = [dirac_kernel(xa, xb, σ=σ) for xa in x_obs, xb in x_obs]
if !isposdef(S_obs)
S_obs = S_obs + (1e-12 - min(eigvals(S_obs)...)) * I
end
GP = MvNormal(f(x_obs), S_obs)
f_obs = rand(GP, 1);
# define a slightly wider mesh for prediction and adapt the prior to the new variance
x_mesh = LinRange(-1.25*π,1.25*π,100)
prior_kernel = (xa, xb) -> dirac_kernel(xa, xb, σ=σ)
post_kernel = (xa, xb) -> ReLU_kernel(xa, xb, σw=sqrt(1.6), σb=sqrt(1.0), L=1)
m, S = compute_gp_posterior(x_obs, f_obs, x_mesh, post_kernel, prior_kernel)
# plot the posterior mean, as well as its 2σ confidence interval
var = sqrt.(diag(S))
plot(x_mesh, m, ribbon=(2*var, 2*var), fc=:green, fa=0.3, label="GP mean", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function", size=(770,300))
scatter!(x_obs, f_obs, label="observations", color=4, size=(770,300))
What does this tell us? So far, we have understood how sampling infinitely wide neural nets using certain priors can be understood as a Gaussian process. Once we perform GP inference with it, we are essentially computing the MAP estimator under the prescribed kernel and mean functions for the prior and posterior. This is especially useful, as it not only gives us a prediction but also an estimate of its uncertainty. Moreover, it represents the optimal choice in the maximum a posteriori setting, and can therefore be regarded as the optimal performance when distributions are assumed over the parameters of the infinite network. This resembles Bayesian neural networks, where Bayesian inference is performed on the network weights. This potential connection has also been noted by Lee et al. and might present an interesting direction for exploration.
We can compare this result to the output of a shallow neural network trained on the same training data.
width = 1000
nn = Chain(Dense(1, width, relu), Dense(width, 1, identity))
parameters = Flux.params(nn)
sqnorm(x) = sum(abs2, x)
loss(x, y) = sum(Flux.Losses.mse(nn(x), y)) + 0.001*sum(sqnorm, parameters)
for epoch in 1:500
Flux.train!(loss, parameters, [(x_obs', f_obs')], Descent(0.1))
end
plot(x_mesh, nn(collect(x_mesh'))', fa=0.3, label="model", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function", size=(770,300))
scatter!(x_obs, f_obs, label="observations", color=4, size=(770,300))
As we can see, the outcome differs slightly from the GP prediction, as it has less regularity. This can be partially attributed to the finite width of the network and to the fact that training renders many neurons dead, leaving the model stuck in a local minimum. As mentioned previously, Bayesian neural networks might present an interesting test case to compare this to.
What happens if we use more layers? We observe that the performance of the NNGP can be improved by increasing the number of layers. For this particular example, we observe the best performance using $L=9$ layers. We also observe that the predictions of deeper finite networks resemble this prediction (a rough sketch of such a comparison follows after the NNGP code below). This is not entirely surprising, as the kernel determines the functional form of our approximation. Thus, the question remains how much we can use the NNGP posterior to infer properties about trained networks. Another interesting question in this regard is whether NNGPs can be used to inform the choice of neural network architectures.
prior_kernel = (xa, xb) -> dirac_kernel(xa, xb, σ=σ)
post_kernel = (xa, xb) -> ReLU_kernel(xa, xb, σw=sqrt(1.6), σb=sqrt(1.0), L=9)
m, S = compute_gp_posterior(x_obs, f_obs, x_mesh, post_kernel, prior_kernel)
# plot the posterior mean, as well as its 2σ confidence interval
var = sqrt.(diag(S))
plot(x_mesh, m, ribbon=(2*var, 2*var), fc=:green, fa=0.3, label="GP mean", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function", size=(770,300))
scatter!(x_obs, f_obs, label="observations", color=4, size=(770,300))
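As referenced above, here is a rough sketch of how one might train a deeper finite-width ReLU network on the same data for a visual comparison with the $L=9$ NNGP prediction. The depth, width, optimizer and number of epochs are arbitrary choices of mine and may need tuning; I use ADAM here, which tends to be more forgiving for deeper networks than plain gradient descent, and the result will also depend on the random initialization.
# a deeper finite-width ReLU network, trained on the same observations
depth = 9
deep_width = 100
deep_nn = Chain(Dense(1, deep_width, relu), [Dense(deep_width, deep_width, relu) for _ in 1:depth-1]..., Dense(deep_width, 1, identity))
deep_params = Flux.params(deep_nn)
deep_loss(x, y) = Flux.Losses.mse(deep_nn(x), y) + 0.001*sum(sqnorm, deep_params)
for epoch in 1:500
    Flux.train!(deep_loss, deep_params, [(x_obs', f_obs')], ADAM(0.01))
end
plot(x_mesh, deep_nn(collect(x_mesh'))', label="deep network", linewidth=2, legend=:bottomright, size=(770,300))
plot!(x_mesh, f(x_mesh), label="target function")
scatter!(x_obs, f_obs, label="observations", color=4)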
While we have studied inference only with one-dimensional univariate functions, Lee et al. further compare the performance of NNGPs on image data such as MNIST and CIFAR to networks trained on these datasets. The test accuracy of the trained networks approaches the accuracy of the NNGP posterior, but is in most cases outmatched by the NNGP. While it is possible that NNGPs represent a limit for the accuracy of the associated trained networks, it is unclear what role training plays in this picture.
The reason why this approach is interesting is that the achievable quality of approximations depends on the functional form of the approximating functions, as well as on the training. The former is rather well understood and can be studied using approximation theoretic methods. The training dynamics and how they relate to the initialization however, are a topic of much debate (see Frankle et al. (2019)). Neural Network GPs offer an interesting approach that considers the prior distributions on the weights and then computes the optimal solution in the MAP setting. As such, they may present a possible avenue towards a better understanding of the achievable performance when training neural networks.
Why aren’t NNGPs more popular? Our implementation of the $L$-layer ReLU GP kernel already revealed a central problem. In order to do inference, we need to evaluate the kernel function for each pair $x^{\alpha}, x^{\beta}$ from both the training and query points. This is due to the non-parametric nature of GPs, which requires a computation that involves the entire training dataset during inference. This is especially costly, considering that the kernel had to be built up recursively, layer by layer. The situation becomes even worse if multidimensional inputs and outputs are considered.
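To get a rough feeling for this cost, one can time the posterior computation for a growing number of observations: the Gram matrix grows quadratically in the number of training points and its factorization roughly cubically. The sketch below reuses the currently defined $L=9$ ReLU kernel and noise prior; the sizes are arbitrary, and the first call also includes compilation time.
# rough illustration of how the cost of NNGP inference grows with the number of observations
for n in (100, 200, 400)
    xo = rand(Uniform(-π, π), n)
    fo = sin.(xo) .+ σ .* randn(n) # noisy observations of the target function
    @time compute_gp_posterior(xo, fo, x_mesh, post_kernel, prior_kernel)
end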
Conclusion
I hope I could give you an intuitive introduction to Gaussian Processes and shed some light on their connection to Neural Networks. As we have seen, they offer an interesting link between stochastic processes and neural networks in the limit of infinite layer width. To the best of my knowledge, this connection has only recently been investigated to better understand the accuracy achievable by training neural networks that are initialized using some prior distribution on the parameters. In particular, it might be worth comparing trained Bayesian networks and NNGP posteriors to see whether there is a connection between the two.
While doing my research for this article, I stumbled across many interesting articles on the topic. You can find them listed below.
References
- Lee J., Bahri Y., Novak R., Schoenholz S.S., Pennington J. and Sohl-Dickstein J. (2018). Deep Neural Networks as Gaussian Processes. ICLR 2018. openreview.net/pdf?id=B1EA-M-0Z
- Neal R.M. (1994). Priors for Infinite Networks. Technical Report, University of Toronto. https://www.cs.toronto.edu/~radford/ftp/pin.pdf.
- Rasmussen C.E. (2006). Gaussian Processes for Machine Learning. MIT Press. http://www.gaussianprocess.org/gpml/chapters/RW.pdf.
- Frankle J., Carbin M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. openreview.net/forum?id=rJl-b3RcF7.
- Bishop C.M. (2006). Pattern Recognition and Machine Learning. Springer.
- en.wikipedia.org/wiki/Gaussian_process (as opened 22.12.2021).
- en.wikipedia.org/wiki/Ornstein–Uhlenbeck_process (as opened 22.12.2021).
- en.wikipedia.org/wiki/Central_limit_theorem#Multidimensional_CLT (as opened 22.12.2021).
- Andersen M.S. and Chen T. (2020). Smoothing Splines and Rank Structured Matrices: Revisiting the Spline Kernel. SIAM Journal on Matrix Analysis and Applications. doi.org/10.1137/19M1267349
Further reading
- A note on the Julia language: Julia is compiled just in time, which means that running the code in global scope as we do here tends to be slow. If you want to experience the speed of Julia, I encourage you to put the code into functions and execute these more than once.
- There is an interesting connection between kernel matrices, spline interpolation and so-called rank-structured matrices. These matrices have a nice structure, with low-rank off-diagonal blocks. If you are interested in some further reading, I strongly suggest reading this article.
- I found this post by Bruno Magalhaes super useful to revise some of the basics regarding MAP estimators.
- A helpful and very well written introduction to GPs can be found here.