Kernel Trick Example

I’d like to demonstrate the kernel trick with a simple example. The equations are taken from [1]; my goal is to look at them from a more practical perspective.

Inputs and Outputs

Of course we have X as input and y as output, but here I want to focus on the inputs to the prediction equation itself. After the kernel-trick derivation for regularized least squares is done (see [1]), we end up with this equation: \(y(x) = k(x)^T (K + \lambda I_N)^{-1}y\)

The only thing we need to do to predict a new data point is to evaluate this equation, and for that we have to calculate two quantities: \(k(x)\) and \(K\).

\(K = \Phi \Phi ^ T\) is the Gram matrix of the training features: \(K_{ij} = \phi(x_i)^T \phi(x_j)\).

\(k(x)\) is the vector of kernel evaluations between the new point \(x\) and all training points; its \(i\)-th entry is \(\phi(x)^T \phi(x_i)\).
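To make the moving parts concrete before the worked example, here is a minimal numpy sketch of this prediction equation (the function name predict, the explicit feature-map argument, and the variable names are my own choices, not code from [1]):

import numpy as np

def predict(phi, X_train, y_train, x_new, lam=0.0):
    # Prediction y(x) = k(x)^T (K + lambda*I_N)^{-1} y from the equation above.
    Phi = np.stack([phi(x) for x in X_train])  # feature matrix, one row per sample
    K = Phi @ Phi.T                            # Gram matrix K = Phi Phi^T
    k = Phi @ phi(x_new)                       # k(x): kernel of x with every training point
    # Solve (K + lambda*I) a = y instead of forming the inverse explicitly.
    a = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return k @ a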

Matrix Sizes

n: number of training samples
d: number of features (the output size of \(\phi\) for a single sample)

\(\Phi \in R^{n \times d}\)
\(K \in R^{n \times n}\)
\(k(x) \in R^{n \times 1}\)
\(y \in R^{n \times 1}\)
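As a quick sanity check of these shapes, a short numpy snippet (using the feature map from the example in the next section, so n = 3 and d = 3; note that numpy stores \(k(x)\) as a flat array of length n):

import numpy as np

phi = lambda x: np.array([x**2, x, 1.0])   # d = 3 features per sample
X = np.array([0.0, 1.0, 2.0])              # n = 3 training inputs

Phi = np.stack([phi(x) for x in X])
K = Phi @ Phi.T
k_x = Phi @ phi(5.0)

print(Phi.shape, K.shape, k_x.shape)       # (3, 3) (3, 3) (3,)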

Example

Our example is the function \(x^2 + x + 1\), and we will use the feature map \(\phi(x) = (x^2, x, 1)\). The training data consist of three points \(x = 0, 1, 2\) (with noisy targets):

(Figure: training points)

Here we can calculate \(\Phi\) using our transformation. Each row corresponds to one sample:
\( \Phi =
\begin{bmatrix} 0 & 0 & 1 \\
1 & 1 & 1 \\
4 & 2 & 1 \\
\end{bmatrix}
\)

For example, the first row is (0, 0, 1), which corresponds to the input x = 0 and yields \(\phi(x=0) = (0^2, 0, 1)\) (the third entry is the constant term from our feature transformation). After that, we can compute K as well using matrix multiplication:
\(K = \Phi \times \Phi ^ T = \begin{bmatrix}
1 & 1 & 1 \\
1 & 3 & 7 \\
1 & 7 & 21
\end{bmatrix}
\)
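The same matrices fall out of a few lines of numpy (a sketch under the same feature map; the lambda-style phi is my own shorthand):

import numpy as np

phi = lambda x: np.array([x**2, x, 1.0])
Phi = np.stack([phi(x) for x in [0.0, 1.0, 2.0]])
print(Phi)
# [[0. 0. 1.]
#  [1. 1. 1.]
#  [4. 2. 1.]]

K = Phi @ Phi.T
print(K)
# [[ 1.  1.  1.]
#  [ 1.  3.  7.]
#  [ 1.  7. 21.]]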

We only need to calculate \(k(x)\) to find the predictions. Let’s say we want predictions for the new points \(x = 4, 5, 6\) (I will only work through \(x = 5\)).

\(k(x=5) = \begin{bmatrix}
\phi(x=5)^T \phi(x_1=0) \\
\phi(x=5)^T \phi(x_2=1) \\
\phi(x=5)^T \phi(x_3=2) \\
\end{bmatrix}
\)

This is equal to:
\(k(x=5) = \begin{bmatrix}
(25, 5, 1) \cdot (0, 0, 1) \\
(25, 5, 1) \cdot (1, 1, 1) \\
(25, 5, 1) \cdot (4, 2, 1) \\
\end{bmatrix}
\)

Because each entry is a dot product, this is:
\(k(x=5) = \begin{bmatrix}
1 \\
31 \\
111 \\
\end{bmatrix}
\)
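In numpy this is a single matrix-vector product, since each entry of \(k(x=5)\) is the dot product of \(\phi(5)\) with one row of \(\Phi\):

import numpy as np

phi = lambda x: np.array([x**2, x, 1.0])
Phi = np.stack([phi(x) for x in [0.0, 1.0, 2.0]])

k_5 = Phi @ phi(5.0)   # one dot product per training point
print(k_5)             # [  1.  31. 111.]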

If we assume \(\lambda=0\), we get the following equation:
\(
y(x=5) = \begin{bmatrix}
1 \\
31 \\
111 \\
\end{bmatrix} ^ T \begin{bmatrix}
1 & 1 & 1 \\
1 & 3 & 7 \\
1 & 7 & 21
\end{bmatrix}^{-1} \begin{bmatrix}
0 \\
1 \\
4 \\
\end{bmatrix}
= 25 \)

(Sanity check: with \(\lambda = 0\) the model interpolates the three training targets exactly. Solving \(\Phi w = y\) gives \(w = (1, 0, 0)\), i.e. the fitted function is \(x^2\), so \(y(x=5) = 25\).)
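Putting it all together in numpy (with \(\lambda = 0\) and the training targets shown above; predict is my own helper name):

import numpy as np

phi = lambda x: np.array([x**2, x, 1.0])
X_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.0, 1.0, 4.0])          # training targets used above

Phi = np.stack([phi(x) for x in X_train])
K = Phi @ Phi.T

def predict(x_new, lam=0.0):
    k = Phi @ phi(x_new)
    a = np.linalg.solve(K + lam * np.eye(3), y_train)   # (K + lam*I)^{-1} y
    return k @ a

print(predict(5.0))                           # 25.0 (up to floating-point error)
print([predict(x) for x in [4.0, 5.0, 6.0]])  # roughly [16.0, 25.0, 36.0]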
The plot shows the predictions at the other points:

(Figure: prediction points)

References

[1] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

