I’d like to demonstrate the kernel trick with a simple example. The equations are taken from [1]; my goal is to look at them from a more practical perspective.

### Inputs and Outputs

Of course we have X as input and y as output. But here I want to talk about the inputs to the prediction equation itself. After the kernel-trick derivation is done (see [1]), we end up with this equation: \(y(x) = k(x)^T (K + \lambda I_N)^{-1}y\)

The only thing we need to do in order to predict new data points is to evaluate this equation, which requires two quantities: \(k(x)\) and \(K\).

\(K = \Phi \Phi ^ T\)

\(k(x)\) is the vector whose \(n\)-th entry is the kernel of the new point \(x\) with the \(n\)-th training point, i.e. \(k_n(x) = \phi(x)^T \phi(x_n)\).
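Before working through the numbers, here is a minimal sketch of the prediction equation in NumPy. The function name and signature are mine, not from [1]; `kernel` is any kernel function \(k(x, x')\):

```python
import numpy as np

def kernel_predict(x_new, X_train, y_train, kernel, lam=0.0):
    """Evaluate y(x) = k(x)^T (K + lam*I_N)^{-1} y for a single test point."""
    n = len(X_train)
    # Gram matrix: K[i, j] = kernel(x_i, x_j)
    K = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    # k(x): kernel of the new point with every training point
    k = np.array([kernel(x_new, a) for a in X_train])
    # Solve the linear system instead of forming the inverse explicitly
    return k @ np.linalg.solve(K + lam * np.eye(n), y_train)
```

Passing `lam=0.0` recovers the unregularized case used in the example below.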

### Matrix Sizes

n: number of training samples

d: number of features generated (output size of \(\phi\) for single sample)

\(\Phi \in R^{n \times d}\)

\(K \in R^{n \times n}\)

\(k(x) \in R^{n \times 1}\)

\(y \in R^{n \times 1}\)

### Example

The example function is \(x^2 + x + 1\), and we use the features \(\phi(x) = (x^2, x, 1)\). The training data consist of the 3 input points \(x = 0, 1, 2\) (with noisy targets).

Here we can calculate \(\Phi\) using our transformation. Each row corresponds to one sample:

\( \Phi = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 1 & 1 \\ 4 & 2 & 1 \end{bmatrix} \)

For example, the first row is \((0, 0, 1)\), which corresponds to the input \(x=0\): \(\phi(x=0) = (0^2, 0, 1)\), where the third component is the constant from our feature transformation. After that, we can compute \(K\) by matrix multiplication:

\(K = \Phi \Phi^T = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 7 \\ 1 & 7 & 21 \end{bmatrix}\)
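These two matrices are easy to check numerically. A short sketch (variable names are mine):

```python
import numpy as np

# Feature map phi(x) = (x^2, x, 1) applied to the training inputs 0, 1, 2
def phi(x):
    return np.array([x**2, x, 1.0])

Phi = np.stack([phi(x) for x in [0.0, 1.0, 2.0]])  # shape (3, 3), one row per sample
K = Phi @ Phi.T                                    # Gram matrix, shape (3, 3)
print(K)
```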

The only remaining piece is \(k(x)\). Suppose we want predictions at the test points \(x = 4, 5, 6\); I will work through \(x = 5\) only.

\(k(x=5) = \begin{bmatrix} \phi(5)^T \phi(0) \\ \phi(5)^T \phi(1) \\ \phi(5)^T \phi(2) \end{bmatrix}\)

This is equal to:

\(k(x=5) = \begin{bmatrix} (25, 5, 1) \cdot (0, 0, 1) \\ (25, 5, 1) \cdot (1, 1, 1) \\ (25, 5, 1) \cdot (4, 2, 1) \end{bmatrix}\)

Since each entry is a dot product, this evaluates to:

\(k(x=5) = \begin{bmatrix} 1 \\ 31 \\ 111 \end{bmatrix}\)
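In code, \(k(x=5)\) is just \(\Phi\) applied to \(\phi(5)\):

```python
import numpy as np

def phi(x):
    return np.array([x**2, x, 1.0])

Phi = np.stack([phi(x) for x in [0.0, 1.0, 2.0]])
k5 = Phi @ phi(5.0)  # dot product of phi(5) with each training row
print(k5)
```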

If we assume \(\lambda=0\), we get the following equation:

\( y(x=5) = \begin{bmatrix} 1 \\ 31 \\ 111 \end{bmatrix}^T \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 7 \\ 1 & 7 & 21 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 4 \end{bmatrix} = 25 \)

(Sanity check: with \(\lambda = 0\) the solution interpolates the training targets exactly, and the targets \((0, 1, 4)\) lie on the curve \(x^2\), so the prediction at \(x = 5\) is \(5^2 = 25\).)
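Putting all the pieces together gives a quick numeric check of the whole pipeline (with \(\lambda = 0\); \(K\) happens to be invertible here, so no regularization is needed for the solve):

```python
import numpy as np

def phi(x):
    return np.array([x**2, x, 1.0])

X_train = [0.0, 1.0, 2.0]
y_train = np.array([0.0, 1.0, 4.0])

Phi = np.stack([phi(x) for x in X_train])
K = Phi @ Phi.T
k5 = Phi @ phi(5.0)

lam = 0.0
y5 = k5 @ np.linalg.solve(K + lam * np.eye(3), y_train)
print(y5)
```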

The plot below shows the predictions for the other test points.

## References

[1] C. M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.