We start with a basic formula, expected value. It is the expected value when you follow some transition function \(\pi\) with some start state \(X_0\).

Eq. 1: \(G_t = \sum_{k=0}^{\inf} \gamma^k R_{t+k+1}\)

Note that the initial state is included in the reward function \(R\).

**Bellman Theorem**

A policy is optimal if and only if it is greedy with respect to its induced value function.

### Policy Iteration

If we knew the value function for all possible next states, we would greedily select the policy iteration. Moreover, if we knew the optimal policy, we would calculate the value function using the policy.

Policy iteration aims to iteratively update policy and value by examining them one by one. More simply, the algorithm can be summarized as:`1. Create random policy and zero value function`

`2. Find optimal policy for the given value function`

`3. Find optimal value function for the given policy`

`4. Repeat until convergence`

### Value Iteration

The value iteration is similar to the policy iteration, but we get rid of the policy completely as we can determine it from value function. More simply, we write the value function in terms of itself. This is given as:

Eq. 2:\(V^*(x) = max_a r(x, a) + \gamma \sum_{x’} P(x’ | x, a)V^*(x’)\)

Simply following this equation by iteratively calculating value function converges to the value function which gives the optimal policy.

## Leave a Reply