# Kernel Trick Example

I’d like to demonstrate the kernel trick using a simple example here. The equations are taken from [1]. My motivation is to look from a more practical perspective.

### Inputs and Outputs

Of course we have X as input and y as output. But I want to talk about inputs to the prediction equation to solve for data. After the calculations are done for kernel trick(check [1]), we end up with this equation: $$y(x) = k(x)^T (K + \lambda I_N)^{-1}y$$

Only thing we need to do in order to predict new data points is to calculate this equation. We have to calculate these two quantities: $$k(x), K$$

$$K = \Phi \Phi ^ T$$

$$k(x)$$ is a vector representing the kernel operation with all input points.

### Matrix Sizes

n: number of training samples
d: number of features generated (output size of $$\phi$$ for single sample)

$$\Phi \in R^{n \times d}$$
$$K \in R^{n \times n}$$
$$k(x) \in R^{1 \times n}$$
$$y \in R^{n \times 1}$$

### Example

The example is function $$x^2 + x + 1$$. We will also use features $$\phi(x) = (x^2,x,1)$$. The training data contain 3 points 0, 1, 2 (with noise):

Here we can calculate $$\Phi$$ using our transformation. Each row corresponds to one sample:
$$\Phi = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 1 & 1 \\ 4 & 2 & 1 \\ \end{bmatrix}$$

Here each row corresponds to a single sample. For example the first row is (1,0,0) which corresponds to input x=0 which yields $$\phi(x=0) = (0^2,0,1)$$ (the third one is constant from our feature transformation). After that, we can compute K as well using matrix multiplication:
$$K = \Phi \times \Phi ^ T = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 7 \\ 1 & 7 & 21 \end{bmatrix}$$

We only need to calculate $$k(x)$$ to find the predictions. Let’s assume we are searching predictions for data points $$(4,5,6)$$. (I will just calculate for 5)

$$k(x=5) = \begin{bmatrix} \phi(x=5) \phi(x=x_0=0) \\ \phi(x=5) \phi(x=x_0=1) \\ \phi(x=5) \phi(x=x_0=2) \\ \end{bmatrix}$$

This is equal to:
$$k(x=5) = \begin{bmatrix} (25, 5, 1) (0, 0, 1) \\ (25, 5, 1) (1, 1, 1) \\ (25, 5, 1) (4, 2, 1) \\ \end{bmatrix}$$

Which is(because the operation is dot product):
$$k(x=5) = \begin{bmatrix} 1 \\ 31 \\ 111 \\ \end{bmatrix}$$

If we assume $$\lambda=0$$, we get the following equation:
$$y(x=5) = \begin{bmatrix} 1 \\ 31 \\ 111 \\ \end{bmatrix} ^ T \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 7 \\ 1 & 7 & 21 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 4 \\ \end{bmatrix} \approx 46.56$$
The plot shows the other points:

## References

[1] Bishop – Pattern Recognition And Machine Learning – Springer 2006