Policy Gradient

Different types of policy parameterizations are commonly used, depending on the assumptions made about the policy function.

Assumptions on the Policy Function:

Softmax Policy

A softmax policy weights actions using a linear combination of features, $\phi(s,a)^\intercal\theta$. The probability of an action is the normalized exponential of its weight:

$$\pi_\theta(s,a) = \frac{e^{\phi(s,a)^\intercal\theta}}{\sum_{a'} e^{\phi(s,a')^\intercal\theta}}$$

The score function (gradient of the log probability) becomes:

$$\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$$
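As an illustrative sketch (NumPy, with made-up feature values), the softmax probabilities and score can be computed directly. A useful sanity check is that the score has zero expectation under the policy:

```python
import numpy as np

def softmax_policy(phi, theta):
    """Action probabilities from per-action features phi[a] = phi(s, a)."""
    logits = phi @ theta              # phi(s,a)^T theta for each action
    logits -= logits.max()            # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def score(phi, theta, a):
    """Score function: phi(s,a) - E_pi[phi(s, .)]."""
    p = softmax_policy(phi, theta)
    return phi[a] - p @ phi

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))         # 4 actions, 3 features (toy values)
theta = rng.normal(size=3)
probs = softmax_policy(phi, theta)
# The score averages to zero over actions drawn from the policy
expected_score = sum(probs[a] * score(phi, theta, a) for a in range(4))
assert np.allclose(expected_score, 0.0)
```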

Gaussian Policy

Gaussian policies model the action distribution as a Gaussian distribution parameterized by mean and variance. They are suitable for continuous action spaces.

For a Gaussian policy with mean $\mu_\theta(s)$ and fixed variance $\sigma^2$:

$$\pi_\theta(a|s) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(a - \mu_\theta(s))^2}{2\sigma^2}\right)$$

The score function is:

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{(a - \mu_\theta(s))}{\sigma^2} \nabla_\theta \mu_\theta(s)$$
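A minimal sketch of this score function, assuming a linear mean $\mu_\theta(s) = \phi(s)^\intercal\theta$ (so $\nabla_\theta \mu_\theta(s) = \phi(s)$) and $\sigma = 1$; the analytic score is checked against a numerical gradient of $\log \pi_\theta$:

```python
import numpy as np

def gaussian_score(a, phi, theta, sigma=1.0):
    """Score for a Gaussian policy with linear mean mu(s) = phi(s)^T theta."""
    mu = phi @ theta
    return (a - mu) / sigma**2 * phi  # grad_theta mu = phi for a linear mean

# Verify against a finite-difference gradient of log pi (toy values)
phi = np.array([0.5, -1.0, 2.0])
theta = np.array([0.1, 0.2, 0.3])
a = 1.5

def log_pi(th):
    mu = phi @ th
    return -0.5 * np.log(2 * np.pi) - 0.5 * (a - mu) ** 2

eps = 1e-6
num_grad = np.array([(log_pi(theta + eps * e) - log_pi(theta - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
assert np.allclose(gaussian_score(a, phi, theta), num_grad, atol=1e-5)
```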

Neural Network Policy

Neural network policies use deep networks to parameterize the policy function. The network takes the state as input and outputs action probabilities (discrete) or Gaussian parameters (continuous).
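For the discrete case, a sketch in plain NumPy (a small two-layer network with illustrative sizes; a real implementation would use an autodiff framework):

```python
import numpy as np

def mlp_policy(s, W1, b1, W2, b2):
    """Two-layer network: state -> tanh hidden layer -> softmax action probs."""
    h = np.tanh(W1 @ s + b1)
    logits = W2 @ h + b2
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(1)
s = rng.normal(size=4)                            # state with 4 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)     # hidden layer of width 8
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)     # 3 discrete actions
probs = mlp_policy(s, W1, b1, W2, b2)
assert np.isclose(probs.sum(), 1.0)
```

For continuous actions, the same network would instead output $\mu_\theta(s)$ (and optionally $\log \sigma$), feeding the Gaussian policy above.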

Importance Sampling

When learning from data collected by a different policy (off-policy), we use importance sampling:

$$\mathbb{E}_{\pi}[f(x)] = \mathbb{E}_{\beta}\left[\frac{\pi(x)}{\beta(x)} f(x)\right]$$

The ratio $\frac{\pi(x)}{\beta(x)}$ is called the importance weight.
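A quick numerical sketch with made-up distributions: estimate $\mathbb{E}_{\pi}[x^2]$ for a target $\pi = \mathcal{N}(1, 1)$ (whose true value is $\mu^2 + \sigma^2 = 2$) using only samples from a behavior distribution $\beta = \mathcal{N}(0, 2)$:

```python
import numpy as np

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
x = rng.normal(0.0, 2.0, size=200_000)        # samples from beta = N(0, 2)
w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # importance weights pi/beta
estimate = np.mean(w * x**2)                  # estimates E_pi[x^2] = 2
assert abs(estimate - 2.0) < 0.05
```

Note the behavior policy is wider than the target here, which keeps the importance weights bounded; a narrow $\beta$ would blow up the variance of the estimator.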

Baseline

A major issue with vanilla policy gradient is high variance. We can reduce variance by subtracting a baseline $b(s)$ from the return:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\left[\nabla_\theta \log \pi_\theta(a|s) \, (Q^{\pi}(s,a) - b(s))\right]$$

Common choices for the baseline include a constant (e.g. the average observed return) and, most commonly, the state-value function $V^{\pi}(s)$.

When using $V^{\pi}(s)$ as baseline, the term $(Q^{\pi}(s,a) - V^{\pi}(s))$ is the advantage function $A^{\pi}(s,a)$.
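A toy single-state, two-action sketch (made-up $Q$ values, uniform softmax policy) showing that subtracting $V^{\pi}(s)$ leaves the expected gradient unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([0.0, 0.0])             # uniform softmax policy
q = np.array([1.0, 3.0])                 # toy Q(s, a); V(s) = E_pi[Q] = 2

def grad_log_pi(a, theta):
    p = np.exp(theta) / np.exp(theta).sum()
    g = -p
    g[a] += 1.0                          # one_hot(a) - pi: softmax score
    return g

def pg_samples(baseline, n=50_000):
    """Per-sample policy-gradient estimates grad log pi * (Q - b)."""
    p = np.exp(theta) / np.exp(theta).sum()
    acts = rng.choice(2, size=n, p=p)
    return np.array([grad_log_pi(a, theta) * (q[a] - baseline) for a in acts])

no_base = pg_samples(0.0)
with_base = pg_samples(2.0)              # baseline b(s) = V(s) = 2
# Same mean gradient, strictly lower variance with the baseline
assert np.allclose(no_base.mean(0), with_base.mean(0), atol=0.05)
assert with_base.var(0).sum() < no_base.var(0).sum()
```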

Natural Policy Gradient

Natural Policy Gradient (NPG) addresses the sensitivity to parameterization in vanilla policy gradient. It uses the Fisher Information Matrix to define a more appropriate metric:

$$F = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta \, \nabla_\theta \log \pi_\theta^\intercal\right]$$

The natural gradient update:

$$\theta_{k+1} = \theta_k + \alpha F^{-1} \nabla_\theta J(\theta_k)$$
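A sketch of one natural-gradient step for a single-state softmax policy (toy $Q$ values). One caveat this example makes visible: for a softmax the score components sum to zero, so $F$ is singular and a pseudo-inverse (or damping $F + \lambda I$) is used in place of $F^{-1}$:

```python
import numpy as np

theta = np.array([0.5, -0.5])
q = np.array([1.0, 2.0])                 # toy Q(s, a)

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(a, theta):
    g = -pi(theta)
    g[a] += 1.0
    return g

p = pi(theta)
# Fisher information: F = E_pi[ grad log pi  grad log pi^T ]
F = sum(p[a] * np.outer(grad_log_pi(a, theta), grad_log_pi(a, theta))
        for a in range(2))
grad_J = sum(p[a] * grad_log_pi(a, theta) * q[a] for a in range(2))
nat_grad = np.linalg.pinv(F) @ grad_J    # pseudo-inverse: F is singular here
theta_new = theta + 0.1 * nat_grad       # natural gradient update
```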

TRPO & PPO

TRPO (Trust Region Policy Optimization) constrains the policy update:

$$\max_\theta \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{old}}(s,a)\right] \quad \text{s.t.} \quad D_{KL}(\pi_{\theta_{old}} || \pi_\theta) \leq \delta$$

PPO (Proximal Policy Optimization) simplifies this with a clipped objective, where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio:

$$L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$

PPO is widely used due to its simplicity and strong performance.
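The clipped objective can be sketched elementwise in NumPy (the ratio and advantage values below are illustrative). Clipping caps the ratio at $1+\epsilon$ when the advantage is positive and at $1-\epsilon$ when it is negative, so the update has no incentive to move the ratio far from 1:

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """Clipped surrogate objective (to be maximized): min(r*A, clip(r)*A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.5, 1.0, 1.5])        # pi_theta / pi_theta_old per sample
adv = np.array([1.0, -1.0, 2.0])         # advantage estimates per sample
loss = ppo_clip_loss(ratio, adv)
# Third sample is clipped: min(1.5*2, 1.2*2) = 2.4 instead of 3.0
```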
