In the last post, I explored foundational concepts of reinforcement learning, including policy iteration and value iteration. These methods provide tools for finding optimal policies in Markov Decision Processes (MDPs). One essential step in both is policy evaluation: estimating the value of the current policy using the Bellman equation.

$$ V^{\pi}(s) = Q^{\pi}(s,\pi(s)) $$

$$ Q^{\pi}(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(s,a)} \left[ V^{\pi}(s') \right], \quad a = \pi(s) $$

To calculate this, we still need to know the transition model $P$ and the reward function $R$. In the real world, these are hard to obtain. Instead, what we can get is trajectories of interaction with the environment.

Monte Carlo (MC) Policy Evaluation

Monte Carlo methods estimate value functions by averaging sample returns. The algorithm:

  1. Initialize $N(s) = 0$, $G(s) = 0$ for all states
  2. Loop:
    • Sample episode $i$: $s_{i,0}, a_{i,0}, r_{i,1}, s_{i,1}, …, s_{i,T_i}$
    • For each state $s$ visited in the episode, with visit time $t$:
      • Compute return $G_{i,t} = r_{i,t+1} + \gamma r_{i,t+2} + … + \gamma^{T_i-t-1} r_{i,T_i}$
      • Update: $N(s) \leftarrow N(s) + 1$
      • Update: $G(s) \leftarrow G(s) + G_{i,t}$
    • $V(s) = G(s) / N(s)$
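The loop above can be sketched in Python. This is a minimal first-visit MC implementation operating on pre-collected trajectories; the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the toy episodes are hypothetical choices for illustration:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    N = defaultdict(int)        # visit counts N(s)
    G_sum = defaultdict(float)  # cumulative returns G(s)
    for episode in episodes:
        # Compute returns backwards: G_t = r_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G_t in returns:
            if state in seen:   # first-visit: count each state once per episode
                continue
            seen.add(state)
            N[state] += 1
            G_sum[state] += G_t
    return {s: G_sum[s] / N[s] for s in N}

# Two hypothetical episodes collected under a fixed policy.
episodes = [
    [("A", 0.0), ("B", 1.0)],   # A -> B -> terminal, reward 1 at the end
    [("A", 0.0), ("B", 0.0)],
]
V = mc_policy_evaluation(episodes, gamma=1.0)
# With gamma=1: V["A"] = (1 + 0)/2 = 0.5, V["B"] = (1 + 0)/2 = 0.5
```

Computing returns backwards in one pass is the standard trick for avoiding a quadratic re-summation of rewards.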

MC methods are unbiased but have high variance and require complete episodes.

Temporal Difference (TD) Learning

TD learning combines ideas from MC and dynamic programming. Instead of waiting for the episode to end, TD updates values after each step using the bootstrapped estimate.

TD(0)

The simplest TD method updates the value function as:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

The term $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error.
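A single TD(0) update is one line of arithmetic. The sketch below stores $V$ as a dict defaulting to zero; the state names are hypothetical:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: V(s) += alpha * (r + gamma * V(s') - V(s)).
    Returns the TD error delta_t."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta

V = {}
delta = td0_update(V, "A", 1.0, "B", alpha=0.5, gamma=0.9)
# V is all zeros before the update, so delta = 1.0 and V["A"] becomes 0.5
```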

Key properties:

  • TD bootstraps: it updates $V(s_t)$ from its own estimate $V(s_{t+1})$, so it can learn online, before the episode ends.
  • Bootstrapping introduces bias but substantially reduces variance compared to MC.

TD vs MC

| Property | Monte Carlo | TD Learning |
|---|---|---|
| Bias | Unbiased | Biased |
| Variance | High | Low |
| Episodes | Requires complete episodes | Can use incomplete episodes |
| Convergence | Slow | Typically faster |

SARSA (On-Policy TD)

SARSA learns the action-value function $Q(s,a)$ using the update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

The name comes from the tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in the update.
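A single SARSA update, applied to exactly that tuple. This is a minimal sketch with $Q$ stored as a dict keyed by `(state, action)`; the states, actions, and numbers are hypothetical:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA update from the tuple (S, A, R, S', A').
    Q maps (state, action) -> value; unseen pairs default to 0."""
    q_sa = Q.get((s, a), 0.0)
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)
    return Q[(s, a)]

Q = {("B", "stay"): 1.0}
# Target = 0.5 + 0.9 * Q(B, stay) = 1.4, so Q(A, go) moves halfway there:
sarsa_update(Q, "A", "go", 0.5, "B", "stay", alpha=0.5)
# Q[("A", "go")] = 0 + 0.5 * (1.4 - 0) = 0.7
```

Note that the target uses $Q(s_{t+1}, a_{t+1})$ for the action the policy *actually* took, which is what makes SARSA on-policy.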

Q-Learning (Off-Policy TD)

Q-learning directly learns the optimal Q-function:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Unlike SARSA, Q-learning takes the max over actions in the target, making it off-policy: it learns the optimal Q-function regardless of the behavior policy used to collect the data.
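A complete tabular Q-learning loop can be demonstrated on a toy problem. The chain MDP below is a hypothetical environment invented for this sketch: states $0..n-1$, action 0 moves left, action 1 moves right, and reaching the rightmost state gives reward 1 and terminates:

```python
import random

def q_learning_chain(n_states=4, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy on a
    deterministic chain: states 0..n-1, actions 0 (left) / 1 (right),
    reward 1 on reaching the terminal rightmost state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy behavior policy (ties broken toward "right")
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # off-policy target: max over next actions, not the action taken
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning_chain()
# The greedy policy ends up "right" in every non-terminal state, with
# Q[s][1] approaching gamma**(2 - s) for s = 0, 1, 2.
```

Even though exploration sometimes takes the agent left, the max in the target means the learned values describe the optimal (always-right) policy.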

TD($\lambda$) and Eligibility Traces

TD($\lambda$) unifies MC ($\lambda=1$) and TD(0) ($\lambda=0$). In the forward view, it targets the $\lambda$-return, a geometrically weighted average of n-step returns:

$$G_t^{(\lambda)} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $G_t^{(n)}$ is the n-step return.

This provides a smooth interpolation between TD and MC methods. The equivalent backward view implements the same idea incrementally with eligibility traces: each state keeps a trace $e(s)$ that marks how recently it was visited, and every TD error is distributed over all traced states.
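A minimal sketch of backward-view TD($\lambda$) with accumulating traces, assuming episodes arrive as lists of `(s, r, s_next)` transitions (a hypothetical format chosen for this example):

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) over one episode.

    V: dict state -> value (unseen states default to 0).
    episode: list of (s, r, s_next) transitions.
    """
    e = {}  # eligibility traces, reset at the start of each episode
    for s, r, s_next in episode:
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        e[s] = e.get(s, 0.0) + 1.0            # accumulating trace
        for state, trace in e.items():
            # spread the TD error over all recently visited states
            V[state] = V.get(state, 0.0) + alpha * delta * trace
            e[state] = gamma * lam * trace    # decay every trace
    return V

V = td_lambda_episode({}, [("A", 0.0, "B"), ("B", 1.0, "T")],
                      alpha=0.1, gamma=0.9, lam=0.5)
# The TD error at B (delta = 1) also credits A through its decayed
# trace gamma*lam = 0.45: V["B"] = 0.1, V["A"] = 0.1 * 0.45 = 0.045
```

With `lam=0` only the current state has a nonzero trace at update time, recovering TD(0); with `lam=1` credit decays only by $\gamma$, matching the MC end of the spectrum.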
