In the last post, I explored foundational concepts of reinforcement learning, including policy iteration and value iteration. These methods provide tools for finding optimal policies in Markov Decision Processes (MDPs). One essential step in both is policy evaluation: estimating the value of the current policy using the Bellman equation.

$$ V^{\pi}(s) = Q^{\pi}(s,\pi(s)) $$

$$ Q^{\pi}(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(s,a)} \left[ V^{\pi}(s') \right], \quad a = \pi(s) $$

To calculate this, we still need to know the transition model $P$ and the reward function $R$. In the real world, these are hard to obtain. Instead, what we can get is trajectories of interaction with the environment.

Monte Carlo (MC) Policy Evaluation

Monte Carlo methods estimate value functions by averaging sample returns. The algorithm:

  1. Initialize $N(s) = 0$, $G(s) = 0$ for all states
  2. Loop:
    • Sample episode $i$: $s_{i,0}, a_{i,0}, r_{i,1}, s_{i,1}, …, s_{i,T_i}$
    • For each state $s$ visited in the episode, with visit time $t$:
      • Compute return $G_{i,t} = r_{i,t+1} + \gamma r_{i,t+2} + … + \gamma^{T_i-t-1} r_{i,T_i}$
      • Update: $N(s) \leftarrow N(s) + 1$
      • Update: $G(s) \leftarrow G(s) + G_{i,t}$
    • $V(s) = G(s) / N(s)$
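The loop above can be sketched in Python. This is a minimal first-visit MC implementation operating on pre-collected trajectories; the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the toy episodes are hypothetical choices for illustration:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    N = defaultdict(int)        # visit counts N(s)
    G_sum = defaultdict(float)  # cumulative returns G(s)
    for episode in episodes:
        # Compute returns backwards: G_t = r_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G_t in returns:
            if state in seen:   # first-visit: count each state once per episode
                continue
            seen.add(state)
            N[state] += 1
            G_sum[state] += G_t
    return {s: G_sum[s] / N[s] for s in N}

# Two hypothetical episodes collected under a fixed policy.
episodes = [
    [("A", 0.0), ("B", 1.0)],   # A -> B -> terminal, reward 1 at the end
    [("A", 0.0), ("B", 0.0)],
]
V = mc_policy_evaluation(episodes, gamma=1.0)
# With gamma=1: V["A"] = (1 + 0)/2 = 0.5, V["B"] = (1 + 0)/2 = 0.5
```

Computing returns backwards in one pass is the standard trick for avoiding a quadratic re-summation of rewards.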

MC methods are unbiased but have high variance and require complete episodes.

Temporal Difference (TD) Learning

TD learning combines ideas from MC and dynamic programming. Instead of waiting for the episode to end, TD updates values after each step using the bootstrapped estimate.

TD(0)

The simplest TD method updates the value function as:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

The term $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error.
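A single TD(0) update is one line of arithmetic. The sketch below stores $V$ as a dict defaulting to zero; the state names are hypothetical:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: V(s) += alpha * (r + gamma * V(s') - V(s)).
    Returns the TD error delta_t."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta

V = {}
delta = td0_update(V, "A", 1.0, "B", alpha=0.5, gamma=0.9)
# V is all zeros before the update, so delta = 1.0 and V["A"] becomes 0.5
```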

Key properties:

  • TD bootstraps: it updates $V(s_t)$ from its own estimate $V(s_{t+1})$, so it can learn online, before the episode ends.
  • Bootstrapping introduces bias but substantially reduces variance compared to MC.

TD vs MC

| Property | Monte Carlo | TD Learning |
|---|---|---|
| Bias | Unbiased | Biased |
| Variance | High | Low |
| Episodes | Requires complete episodes | Can use incomplete episodes |
| Convergence | Slow | Typically faster |

SARSA (On-Policy TD)

SARSA learns the action-value function $Q(s,a)$ using the update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

The name comes from the tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in the update.
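A single SARSA update, applied to exactly that tuple. This is a minimal sketch with $Q$ stored as a dict keyed by `(state, action)`; the states, actions, and numbers are hypothetical:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA update from the tuple (S, A, R, S', A').
    Q maps (state, action) -> value; unseen pairs default to 0."""
    q_sa = Q.get((s, a), 0.0)
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)
    return Q[(s, a)]

Q = {("B", "stay"): 1.0}
# Target = 0.5 + 0.9 * Q(B, stay) = 1.4, so Q(A, go) moves halfway there:
sarsa_update(Q, "A", "go", 0.5, "B", "stay", alpha=0.5)
# Q[("A", "go")] = 0 + 0.5 * (1.4 - 0) = 0.7
```

Note that the target uses $Q(s_{t+1}, a_{t+1})$ for the action the policy *actually* took, which is what makes SARSA on-policy.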

Q-Learning (Off-Policy TD)

Q-learning directly learns the optimal Q-function:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Unlike SARSA, Q-learning takes the max over actions in the target, making it off-policy: it learns the optimal Q-function regardless of the behavior policy used to collect the data.
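A complete tabular Q-learning loop can be demonstrated on a toy problem. The chain MDP below is a hypothetical environment invented for this sketch: states $0..n-1$, action 0 moves left, action 1 moves right, and reaching the rightmost state gives reward 1 and terminates:

```python
import random

def q_learning_chain(n_states=4, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy on a
    deterministic chain: states 0..n-1, actions 0 (left) / 1 (right),
    reward 1 on reaching the terminal rightmost state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy behavior policy (ties broken toward "right")
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # off-policy target: max over next actions, not the action taken
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning_chain()
# The greedy policy ends up "right" in every non-terminal state, with
# Q[s][1] approaching gamma**(2 - s) for s = 0, 1, 2.
```

Even though exploration sometimes takes the agent left, the max in the target means the learned values describe the optimal (always-right) policy.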

TD($\lambda$) and Eligibility Traces

TD($\lambda$) unifies MC ($\lambda=1$) and TD(0) ($\lambda=0$). In the forward view, it targets the $\lambda$-return, a geometrically weighted average of n-step returns:

$$G_t^{(\lambda)} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $G_t^{(n)}$ is the n-step return.

This provides a smooth interpolation between TD and MC methods. The equivalent backward view implements the same idea incrementally with eligibility traces: each state keeps a trace $e(s)$ that marks how recently it was visited, and every TD error is distributed over all traced states.
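A minimal sketch of backward-view TD($\lambda$) with accumulating traces, assuming episodes arrive as lists of `(s, r, s_next)` transitions (a hypothetical format chosen for this example):

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) over one episode.

    V: dict state -> value (unseen states default to 0).
    episode: list of (s, r, s_next) transitions.
    """
    e = {}  # eligibility traces, reset at the start of each episode
    for s, r, s_next in episode:
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        e[s] = e.get(s, 0.0) + 1.0            # accumulating trace
        for state, trace in e.items():
            # spread the TD error over all recently visited states
            V[state] = V.get(state, 0.0) + alpha * delta * trace
            e[state] = gamma * lam * trace    # decay every trace
    return V

V = td_lambda_episode({}, [("A", 0.0, "B"), ("B", 1.0, "T")],
                      alpha=0.1, gamma=0.9, lam=0.5)
# The TD error at B (delta = 1) also credits A through its decayed
# trace gamma*lam = 0.45: V["B"] = 0.1, V["A"] = 0.1 * 0.45 = 0.045
```

With `lam=0` only the current state has a nonzero trace at update time, recovering TD(0); with `lam=1` credit decays only by $\gamma$, matching the MC end of the spectrum.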
