Policy Gradient
Different Types of Policy Parameterization
Assumptions on the policy function:
- Analytical Gradient: we assume the gradient of the policy function can be computed analytically for all state-action pairs.
Softmax Policy
Softmax policy weights actions using a linear combination of features $\phi(s,a)^\intercal\theta$. Each action's probability is the normalized exponential of its weight:
$$\pi_\theta(s,a) = \frac{e^{\phi(s,a)^\intercal\theta}}{\sum_{a'} e^{\phi(s,a')^\intercal\theta}}$$
The score function (gradient of the log probability) becomes:
$$\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$$
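As a sanity check, the score function above can be verified against a finite-difference gradient. The feature matrix and parameters below are an illustrative toy setup, not from the text:

```python
import numpy as np

# Hypothetical toy setup: 3 actions, 4-dim features phi(s, a) for one state s.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))   # one feature row per action
theta = rng.normal(size=4)

def softmax_policy(phi, theta):
    """pi_theta(s, a) proportional to exp(phi(s,a)^T theta)."""
    logits = phi @ theta
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def score(phi, theta, a):
    """Score function: phi(s,a) - E_pi[phi(s, .)]."""
    p = softmax_policy(phi, theta)
    return phi[a] - p @ phi

# Compare against a central finite difference of log pi(s, a=0).
a, eps = 0, 1e-6
num = np.zeros_like(theta)
for i in range(len(theta)):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[i] += eps
    t_minus[i] -= eps
    num[i] = (np.log(softmax_policy(phi, t_plus)[a])
              - np.log(softmax_policy(phi, t_minus)[a])) / (2 * eps)
print(np.allclose(score(phi, theta, a), num, atol=1e-4))  # True
```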
Gaussian Policy
Gaussian policies model the action distribution as a Gaussian distribution parameterized by mean and variance. They are suitable for continuous action spaces.
For a Gaussian policy with mean $\mu_\theta(s)$ and fixed variance $\sigma^2$:
$$\pi_\theta(a|s) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(a - \mu_\theta(s))^2}{2\sigma^2}\right)$$
The score function is:
$$\nabla_\theta \log \pi_\theta(a|s) = \frac{(a - \mu_\theta(s))}{\sigma^2} \nabla_\theta \mu_\theta(s)$$
Neural Network Policy
Neural network policies use deep networks to parameterize the policy function. The network takes the state as input and outputs action probabilities (discrete) or Gaussian parameters (continuous).
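A minimal sketch of the discrete case, assuming a one-hidden-layer network and a 3-action space (the layer sizes and weights are illustrative only):

```python
import numpy as np

# Illustrative one-hidden-layer network mapping a state vector to
# action probabilities via a softmax output layer.
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(8, 4)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)) * 0.1, np.zeros(3)

def policy(s):
    h = np.tanh(W1 @ s + b1)           # hidden layer
    logits = W2 @ h + b2
    p = np.exp(logits - logits.max())  # softmax over discrete actions
    return p / p.sum()

probs = policy(np.array([0.1, -0.2, 0.3, 0.0]))
print(np.isclose(probs.sum(), 1.0))  # True: a valid probability distribution
```

For the continuous case, the output layer would instead emit $\mu_\theta(s)$ (and optionally $\log\sigma_\theta(s)$).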
Importance Sampling
When learning from data collected by a different policy (off-policy), we use importance sampling:
$$\mathbb{E}_{\pi}[f(x)] = \mathbb{E}_{\beta}\left[\frac{\pi(x)}{\beta(x)} f(x)\right]$$
The ratio $\frac{\pi(x)}{\beta(x)}$ is called the importance weight.
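The identity is easy to verify numerically. In this toy example (policies and $f$ chosen arbitrarily for illustration), samples are drawn only from $\beta$, yet the importance-weighted average recovers the expectation under $\pi$:

```python
import numpy as np

rng = np.random.default_rng(3)
pi   = np.array([0.7, 0.2, 0.1])   # target policy over 3 actions
beta = np.array([1/3, 1/3, 1/3])   # behaviour policy (uniform)
f    = np.array([1.0, 5.0, -2.0])  # arbitrary function of the action

x = rng.choice(3, size=200_000, p=beta)        # data collected under beta
is_estimate = np.mean(pi[x] / beta[x] * f[x])  # importance-weighted average
print(abs(is_estimate - pi @ f) < 0.05)        # True: close to E_pi[f] = 1.5
```

Note that when $\beta$ puts little mass where $\pi$ puts a lot, the weights blow up and the estimator's variance becomes large.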
Baseline
A major issue with vanilla policy gradient is high variance. We can reduce variance by subtracting a baseline $b(s)$ from the return:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\left[\nabla_\theta \log \pi_\theta(a|s) \left(Q^{\pi}(s,a) - b(s)\right)\right]$$
Common choices for baseline:
- Constant baseline: Average return
- State-dependent baseline: $b(s) = V^{\pi}(s)$
When using $V^{\pi}(s)$ as baseline, the term $(Q^{\pi}(s,a) - V^{\pi}(s))$ is the advantage function $A^{\pi}(s,a)$.
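The variance reduction is easy to see empirically. The toy setup below (one state, two actions, a scalar logit parameter, values chosen for illustration) shows that subtracting $V^{\pi}(s)$ leaves the gradient estimate's mean unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(4)
p = np.array([0.6, 0.4])             # pi_theta(a|s) under a scalar logit theta
Q = np.array([10.0, 11.0])           # action values (both large: high variance)
V = p @ Q                            # baseline b(s) = V(s)
score = np.array([1 - p[0], -p[0]])  # d/dtheta log pi_theta(a|s) per action

a = rng.choice(2, size=100_000, p=p)
g_plain    = score[a] * Q[a]         # vanilla per-sample gradient term
g_baseline = score[a] * (Q[a] - V)   # same with the value baseline

print(np.isclose(g_plain.mean(), g_baseline.mean(), atol=0.1))  # True: same mean
print(g_baseline.var() < g_plain.var())                         # True: lower variance
```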
Natural Policy Gradient
Natural Policy Gradient (NPG) addresses the sensitivity to parameterization in vanilla policy gradient. It uses the Fisher Information Matrix to define a more appropriate metric:
$$F = \mathbb{E}_{\pi}\left[\nabla_\theta \log \pi_\theta \, \nabla_\theta \log \pi_\theta^\intercal\right]$$
The natural gradient update:
$$\theta_{k+1} = \theta_k + \alpha F^{-1} \nabla_\theta J(\theta_k)$$
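A sketch for the tabular softmax case (single state, 3 actions, $\theta$ = the logits themselves; an illustrative setup). For this parameterization the score is $e_a - \pi$, and the natural gradient recovers $Q$ up to a constant shift, illustrating its insensitivity to how the policy is parameterized:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # current pi_theta over 3 actions
scores = np.eye(3) - p          # score(a) = e_a - p for logit parameters
Q = np.array([1.0, 0.0, -1.0])  # action values

# Fisher matrix F = E_pi[score score^T] and policy gradient grad J.
F = sum(p[a] * np.outer(scores[a], scores[a]) for a in range(3))
grad_J = sum(p[a] * scores[a] * Q[a] for a in range(3))

# F is singular (logits are shift-invariant), so use the pseudo-inverse.
nat_grad = np.linalg.pinv(F) @ grad_J
print(np.round(nat_grad, 6))  # [ 1.  0. -1.]: equals Q up to a constant shift
```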
TRPO & PPO
TRPO (Trust Region Policy Optimization) constrains the policy update:
$$\max_\theta \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{old}}(s,a)\right] \quad \text{s.t.} \quad D_{KL}(\pi_{\theta_{old}} || \pi_\theta) \leq \delta$$
PPO (Proximal Policy Optimization) simplifies this with a clipped objective:
$$L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$
PPO is widely used due to its simplicity and strong performance.
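The clipped objective is a few lines of code. The sketch below evaluates $L^{CLIP}$ for a small batch of (ratio, advantage) pairs chosen for illustration; note how, for positive advantages, ratios above $1+\epsilon$ earn no extra credit:

```python
import numpy as np

def ppo_clip_loss(ratios, advantages, eps=0.2):
    """L^CLIP: elementwise min of unclipped and clipped surrogate, averaged."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - eps, 1 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

r = np.array([0.5, 1.0, 1.5])   # pi_theta / pi_theta_old per sample
A = np.array([1.0, 1.0, 1.0])   # positive advantages
print(round(ppo_clip_loss(r, A), 6))  # 0.9  = (0.5 + 1.0 + 1.2) / 3
```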
References
- Sutton & Barto (2018). Reinforcement Learning: An Introduction
- Schulman et al. (2015). Trust Region Policy Optimization
- Schulman et al. (2017). Proximal Policy Optimization Algorithms