What is Reinforcement Learning?

Reinforcement Learning is often described as the “science of decision making.” It is a machine learning paradigm in which an agent learns to make sequential decisions by interacting with an environment in order to achieve a specific goal. The agent learns through trial and error, receiving feedback in the form of rewards or penalties for its actions. The agent’s goal is typically to maximize the cumulative reward it receives over time.
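The trial-and-error interaction loop described above can be sketched in a few lines. The two-state environment and the random policy below are hypothetical stand-ins, chosen only to make the agent–environment loop concrete:

```python
import random

# A toy two-state environment (hypothetical, for illustration only):
# action 1 moves the agent to state 1, which pays reward +1;
# action 0 moves it to state 0, which pays reward 0.
def step(state, action):
    next_state = 1 if action == 1 else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

random.seed(0)
state = 0
total_reward = 0.0
for t in range(10):                 # one episode of 10 time steps
    action = random.choice([0, 1])  # trial and error: a purely random policy
    state, reward = step(state, action)
    total_reward += reward          # the quantity the agent wants to maximize
print(total_reward)
```

A learning algorithm would replace the random `choice` with a policy that is updated from the observed rewards.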

Basic Concepts

Markov Decision Process (MDP)

Policy

Value Function

$$ V_{\pi}(s) = E_{\pi}[G_t \mid S_t = s] = E_{\pi}[R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s] $$
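This expectation can be computed by iterative policy evaluation: repeatedly applying the Bellman backup until the values stop changing. The tiny two-state MDP below (transition probabilities and rewards) is made up purely for illustration:

```python
# Iterative policy evaluation on a tiny 2-state MDP (hypothetical numbers).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
policy = {0: 1, 1: 1}   # a fixed deterministic policy: always take action 1
gamma = 0.9             # discount factor

V = {s: 0.0 for s in P}
for _ in range(1000):   # sweep until (approximately) converged
    for s in P:
        a = policy[s]
        # Bellman expectation backup: V(s) = E[R_{t+1} + gamma * V(S_{t+1})]
        V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
print(V)
```

For this policy every step earns reward 1, so both values converge to the geometric sum $1/(1-\gamma) = 10$.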

Q Function

Optimal Policy and Value Function

Policy Iteration vs Value Iteration

Policy Iteration

$$\pi_i \rightarrow V^{\pi_i}$$

$$ \pi_{i+1}(s) = \argmax_{a} \big[\sum_{s'} p(s'|s,a)\big(r + \gamma \cdot V^{\pi_i}(s')\big)\big] $$

or

$$ \pi_{i+1}(s) = \argmax_{a} Q^{\pi_i}(s,a) \quad \forall s \in S$$

$$ V^{\pi_i} \rightarrow \pi_{i+1}$$
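The two alternating steps above (evaluate $\pi_i \rightarrow V^{\pi_i}$, then improve $V^{\pi_i} \rightarrow \pi_{i+1}$) can be sketched as follows. The two-state MDP is again a hypothetical toy example:

```python
# Policy iteration on a tiny 2-state MDP (hypothetical numbers).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma, theta = 0.9, 1e-8
policy = {s: 0 for s in P}          # start from an arbitrary policy
V = {s: 0.0 for s in P}

stable = False
while not stable:
    # Step 1 -- policy evaluation: pi_i -> V^{pi_i}
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Step 2 -- policy improvement:
    # pi_{i+1}(s) = argmax_a sum_{s'} p(s'|s,a) (r + gamma * V^{pi_i}(s'))
    stable = True
    for s in P:
        best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                           for p, s2, r in P[s][a]))
        if best != policy[s]:
            policy[s] = best
            stable = False          # policy changed, so iterate again
print(policy, V)
```

The loop terminates when improvement leaves the policy unchanged, at which point $\pi$ is optimal for this MDP (here, always taking action 1).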

Value Iteration

Policy iteration computes the infinite-horizon value of a policy and then improves that policy. In some cases, however, we do not need to maintain a policy explicitly at every step. Value iteration is an alternative technique that folds the improvement step directly into the value update.

$$ V_{k+1}(s) = \max_a \big[ \sum_{s'} p(s'|s,a)\big(r + \gamma \cdot V_k(s')\big) \big] $$

So each sweep estimates $Q(s,a)$ by summing the immediate reward $R(s,a)$ and the discounted future value $\gamma \cdot V(s')$ under the current estimate, and then backs up the maximum over actions.
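A minimal sketch of value iteration on the same style of hypothetical two-state MDP, with a greedy policy extracted only once at the end:

```python
# Value iteration on a tiny 2-state MDP (hypothetical numbers).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma, theta = 0.9, 1e-8
V = {s: 0.0 for s in P}

while True:
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) <- max_a Q(s, a)
        v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < theta:
        break

# Extract the greedy policy from the converged values.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(policy, V)
```

Note that no intermediate policies are stored; unlike policy iteration, a policy appears only after the values have converged.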

Example code for value iteration and policy iteration: https://colab.research.google.com/drive/1JbYBgZUg74yrfo1VQbJPaWkDREbhuDsW?usp=sharing#scrollTo=gg97TU1j2ED5