**Policy gradient methods:** Methods that directly optimize the policy without learning a value function. Examples: REINFORCE, Proximal Policy Optimization (PPO).

**Actor-Critic:** Actor-critic methods combine an actor and a critic. The actor is the policy of the agent, and the critic is a value function that evaluates the actor's actions. The goal is a fast and stable method: actor-only (pure policy gradient) methods can explore, but their gradient estimates have high variance, so we add a critic to reduce the variance during training.
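To make the policy gradient idea concrete, here is a minimal REINFORCE sketch on a two-armed bandit. The environment, hyperparameters, and variable names are illustrative assumptions, not from the text:

```python
import numpy as np

# Minimal REINFORCE on a 2-armed bandit (illustrative setup):
# arm 0 pays 0.2, arm 1 pays 1.0; softmax policy over two logits.
rng = np.random.default_rng(0)
theta = np.zeros(2)              # policy parameters (logits)
alpha = 0.1                      # learning rate (assumed value)
arm_rewards = np.array([0.2, 1.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)          # sample an action from the policy
    r = arm_rewards[a]
    grad_log_pi = -probs                # grad of log pi(a|theta) for softmax:
    grad_log_pi[a] += 1.0               #   one_hot(a) - probs
    theta += alpha * r * grad_log_pi    # ascend E[r * grad log pi]

probs = softmax(theta)                  # policy now prefers the better arm
```

Note the high variance mentioned above: each update uses a single sampled reward, which is exactly what a critic (or a baseline) is meant to tame.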

## Q-Learning

Q-Learning is a subtype of temporal-difference learning. The idea is simply to find a Q function that assigns a long-term reward to a specific state-action pair. In its basic form, Q-learning stores these values in a table. It is:

- model-free

- off-policy

- a temporal-difference learning algorithm

Until convergence, apply the following Bellman update:

\(Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]\)
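The update above can be sketched as tabular Q-learning on a tiny chain environment. The environment, hyperparameters, and the uniformly random behavior policy are assumptions for illustration; the random behavior also demonstrates the off-policy property:

```python
import numpy as np

# Tabular Q-learning on a tiny 4-state chain (illustrative environment):
# action 1 moves right, action 0 moves left; reaching state 3 gives
# reward 1.0 and ends the episode.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9          # assumed hyperparameters
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == n_states - 1 else 0.0
    return s2, reward, s2 == n_states - 1

for _ in range(500):
    s, done = 0, False
    while not done:
        # Behave uniformly at random: Q-learning is off-policy, so it
        # still learns the values of the greedy policy.
        a = int(rng.integers(n_actions))
        s2, r, done = step(s, a)
        # The Bellman update from the text (max over next actions).
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
```

With \(\gamma = 0.9\), the greedy action in every non-terminal state becomes "right", and \(Q\) of the state next to the goal converges to 1.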

### Deep Q-Learning

The idea is simple and straightforward: instead of storing all possible state-action pairs in a table, we train a neural network to learn and predict their Q values. Agent policies can be learnable or pre-defined, and in either case stochastic or deterministic.

Practically,

- the TD target \(r + \gamma \max_{a'} Q(s', a')\) is modelled via a **target network**,
- \(Q(s, a)\) is modelled via a **policy network**.
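The two-network setup can be sketched for a single transition \((s, a, r, s')\), using linear Q-functions as stand-ins for the neural networks (all shapes, values, and names are illustrative assumptions):

```python
import numpy as np

# One step of the DQN loss with linear Q-functions standing in for the
# two networks (illustrative sketch, not a full implementation).
rng = np.random.default_rng(0)
gamma = 0.99
W_policy = rng.normal(size=(2, 4))   # policy network: 2 actions, 4 features
W_target = W_policy.copy()           # target network: delayed copy

s, s2 = rng.normal(size=4), rng.normal(size=4)
a, r = 1, 0.5

q_sa = W_policy[a] @ s                           # estimate from the policy net
td_target = r + gamma * (W_target @ s2).max()    # target net, held fixed
loss = (td_target - q_sa) ** 2                   # squared TD error to minimize
```

Only the policy network's parameters receive gradients from this loss; the target network is held fixed between its periodic updates.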

The target network is a delayed version of the policy network. We train the policy network using gradient-based approaches and update the target network via a soft (Polyak) update, e.g. \(\text{target\_net}[\text{key}] = \text{policy\_net}[\text{key}] \times \tau + \text{target\_net}[\text{key}] \times (1 - \tau)\)

So, the target network slowly converges to the policy network.
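The soft update above might look like this with parameters stored as plain dicts of arrays (the parameter names and \(\tau\) value are illustrative assumptions):

```python
import numpy as np

# Soft (Polyak) update of the target network toward the policy network,
# with parameters stored as dicts of arrays (illustrative names).
tau = 0.005  # small mixing factor (assumed value)

policy_net = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target_net = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

def soft_update(target_net, policy_net, tau):
    for key in policy_net:
        target_net[key] = policy_net[key] * tau + target_net[key] * (1 - tau)

soft_update(target_net, policy_net, tau)
# Each call moves target_net a small step (tau) toward policy_net.
```

A small \(\tau\) keeps the TD targets nearly stationary between updates, which is what stabilizes training.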