What is Reinforcement Learning?

Reinforcement Learning is about the “science of decision making.” It is a machine learning paradigm in which an agent learns to make sequential decisions by interacting with an environment in order to achieve a specific goal. In RL, the agent learns through trial and error, receiving feedback in the form of rewards or penalties based on its actions. The goal of the agent is typically to maximize the cumulative reward it receives over time.

Basic Concepts

Markov Decision Process (MDP)

A Markov Decision Process is a mathematical framework used to model decision-making problems in which an agent interacts with an environment. It is a fundamental concept in the field of reinforcement learning. A Markov Decision Process consists of a tuple M = (S, A, P, R):

- State space S: the set of all possible states s that the environment can be in.

- Action space A: the set of all possible actions a that the agent can take.

- Transition probability function P(s'|s, a): the probability of transitioning from state s to state s' after taking action a.

- Reward function R(s, a): the expected value of the next reward R_{t+1} after action a_t = a is taken in state s_t = s.
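The tuple M = (S, A, P, R) maps naturally onto plain data structures. Below is a minimal sketch in Python; the 2-state, 2-action MDP and all its numbers are made up purely for illustration (they do not come from the text):

```python
# A toy MDP M = (S, A, P, R) as plain Python data structures.
S = [0, 1]                               # state space
A = [0, 1]                               # action space

# P[(s, a)] lists (probability, next_state) pairs for P(s' | s, a).
P = {
    (0, 0): [(0.9, 0), (0.1, 1)],
    (0, 1): [(0.2, 0), (0.8, 1)],
    (1, 0): [(1.0, 0)],
    (1, 1): [(0.5, 0), (0.5, 1)],
}

# R[(s, a)] is the expected immediate reward for taking a in state s.
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}

# Sanity check: each P(. | s, a) is a proper probability distribution.
for (s, a) in P:
    assert abs(sum(p for p, _ in P[(s, a)]) - 1.0) < 1e-9
```

Storing transitions as sparse `(probability, next_state)` lists keeps the representation small when most transitions have zero probability.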

Policy

A policy is a strategy or mapping that specifies which action the agent should take in each state: \pi(s) = a.
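In code, a deterministic policy is just a state-to-action mapping, and a stochastic policy maps each state to a distribution over actions. Both toy policies below are made-up examples over the hypothetical 2-state MDP used earlier:

```python
# Deterministic policy: pi(s) = a, a plain state -> action mapping.
pi_det = {0: 1, 1: 0}                     # pi(0) = 1, pi(1) = 0

# Stochastic policy: pi(a | s), one action distribution per state.
pi_stoch = {
    0: {0: 0.3, 1: 0.7},
    1: {0: 1.0, 1: 0.0},
}

# Each pi(. | s) must sum to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in pi_stoch.values())
```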

Value Function

The value function V(s) gives the long-term reward that the agent can obtain from a given state s. The state-value function V_{\pi}(s) of an MDP is the expected return starting from state s and following the policy \pi. Mathematically, the value function is defined as the expected sum of discounted future rewards starting from state s: V_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]. It is an expectation because the environment can be stochastic. The value function can be decomposed into two parts: the immediate reward R_{t+1} and the discounted value of the successor state, \gamma V_{\pi}(S_{t+1}). Therefore, V_{\pi}(s) can be expressed as

V_{\pi}(s) = E_{\pi}[G_t \mid S_t = s] = E_{\pi}[R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s].
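This decomposition suggests a direct way to compute V_\pi: repeatedly apply the backup V(s) ← R(s, \pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) V(s') until it stops changing (iterative policy evaluation). Here is a minimal sketch on a hypothetical toy MDP; all the numbers and the policy are made up:

```python
# Iterative policy evaluation on a made-up 2-state MDP.
gamma = 0.9
P = {(0, 0): [(0.9, 0), (0.1, 1)], (0, 1): [(0.2, 0), (0.8, 1)],
     (1, 0): [(1.0, 0)],           (1, 1): [(0.5, 0), (0.5, 1)]}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}
pi = {0: 1, 1: 0}                         # a fixed deterministic policy

V = {0: 0.0, 1: 0.0}
for _ in range(500):                      # enough sweeps to converge
    # Bellman backup: immediate reward + discounted successor value.
    V = {s: R[(s, pi[s])]
            + gamma * sum(p * V[s2] for p, s2 in P[(s, pi[s])])
         for s in V}
```

Because the backup is a \gamma-contraction, the error shrinks by a factor of \gamma each sweep, so a few hundred sweeps suffice here.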

Q Function

The action-value function, denoted Q(s, a), represents the expected cumulative reward that the agent can obtain from being in state s, taking action a, and thereafter following the given policy \pi. Mathematically,

V_{\pi}(s) = Q_{\pi}(s, \pi(s))

Q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] = R(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)}[V_{\pi}(s')]

These equations are called the Bellman equations. When \pi is stochastic, the first equation can also be expressed as V_{\pi}(s) = \sum_{a \in A} \pi(a|s) Q_{\pi}(s, a).

Optimal Policy and Value Function

The optimal value function, denoted V^{*}(s), represents the maximum expected cumulative reward that an agent can achieve from a given state s by following the optimal policy \pi^{*}. Similarly, the optimal Q-function Q^{*}(s, a) represents the maximum expected cumulative reward that an agent can achieve by taking action a in state s and then following the optimal policy thereafter.
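For completeness, the two optimal quantities are tied together by the Bellman optimality equations, restated here in the same notation as the Bellman equations above (a standard identity, not new material from this text):

```latex
V^{*}(s) = \max_{a \in A} Q^{*}(s, a)
\qquad
Q^{*}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(s, a)}\big[ V^{*}(s') \big]
```

The max over actions replaces the expectation over \pi(a|s): the optimal policy always picks the highest-valued action.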

Policy Iteration vs Value Iteration

How can we get the optimal policy \pi^{*}? One way might be to enumerate all possible policies \pi and choose the one yielding the best rewards. However, this exhaustive search is computationally expensive and often impractical, especially in environments with large state and action spaces. Iterative methods offer a smarter and more efficient alternative for finding the optimal policy. There are two main iterative approaches: Policy Iteration and Value Iteration.

Policy Iteration

1. Set i = 0

2. Initialize \pi_0(s) as the uniformly random policy over all a \in A

3. While |\pi_i - \pi_{i-1}| > 0

  • Policy Evaluation: evaluate the value of \pi_i by using V^{\pi_i}(s) = \sum_{s'} p(s'|s, \pi_i(s))(r + \gamma \cdot V^{\pi_i}(s'))

\pi_i \rightarrow V^{\pi_i}

  • Policy Improvement : Based on the evaluation of πi\pi_i, improve πi\pi_i to πi+1\pi_{i+1}

\pi_{i+1}(s) = \argmax_{a} \big[\sum_{s'} p(s'|s,a)(r + \gamma \cdot V^{\pi_i}(s'))\big]

or

\pi_{i+1}(s) = \argmax_{a} Q^{\pi_i}(s,a) \quad \forall s \in S

V^{\pi_i} \rightarrow \pi_{i+1}

  • i \leftarrow i + 1
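The loop above can be sketched in a few lines of Python. The toy MDP is a made-up example, and the helper `lookahead` (a name introduced here for illustration) computes the one-step backup R(s, a) + \gamma \sum_{s'} p(s'|s, a) V(s'):

```python
# Policy iteration on a made-up 2-state toy MDP.
gamma = 0.9
S, A = [0, 1], [0, 1]
P = {(0, 0): [(0.9, 0), (0.1, 1)], (0, 1): [(0.2, 0), (0.8, 1)],
     (1, 0): [(1.0, 0)],           (1, 1): [(0.5, 0), (0.5, 1)]}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}

def lookahead(s, a, V):
    # One-step backup: R(s, a) + gamma * sum_s' P(s'|s, a) * V(s').
    return R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in P[(s, a)])

pi = {s: 0 for s in S}                    # steps 1-2: an initial policy
while True:
    # Policy evaluation: iterate the Bellman backup under pi.
    V = {s: 0.0 for s in S}
    for _ in range(500):
        V = {s: lookahead(s, pi[s], V) for s in S}
    # Policy improvement: pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a).
    new_pi = {s: max(A, key=lambda a: lookahead(s, a, V)) for s in S}
    if new_pi == pi:                      # step 3: stop when policy is stable
        break
    pi = new_pi
```

For simplicity this sketch evaluates each policy with a fixed number of sweeps rather than solving the linear system exactly; the initial policy is deterministic rather than uniformly random, which does not affect convergence.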

Value Iteration

Policy iteration computes the infinite-horizon value of a policy and then improves that policy. However, in some cases, we do not have to maintain explicit policies. Value iteration is an alternative technique based on this idea.

1. Set i = 0

2. Initialize Q_0(s,a) = 0 for all s \in S and a \in A

3. While |Q_i - Q_{i-1}| > 0

  • Q_{i+1}(s,a) = R(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)}[V_{Q_i}(s')], where

V_{Q_i}(s') = \max_{a'} Q_i(s', a')

So it amounts to estimating Q(s,a) by summing the immediate reward R(s,a) and the discounted maximum cumulative reward V(s') under the current estimate.
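The steps above can be sketched as the Q-form of value iteration; again the toy MDP is made up for illustration:

```python
# Value iteration in Q-form on a made-up 2-state toy MDP:
# Q_{i+1}(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q_i(s',a').
gamma = 0.9
A = [0, 1]
P = {(0, 0): [(0.9, 0), (0.1, 1)], (0, 1): [(0.2, 0), (0.8, 1)],
     (1, 0): [(1.0, 0)],           (1, 1): [(0.5, 0), (0.5, 1)]}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}

Q = {sa: 0.0 for sa in R}                 # steps 1-2: Q_0(s, a) = 0
while True:
    Q_next = {(s, a): R[(s, a)]
              + gamma * sum(p * max(Q[(s2, b)] for b in A)
                            for p, s2 in P[(s, a)])
              for (s, a) in Q}
    # Step 3's stopping rule, with a small tolerance in place of "> 0".
    if max(abs(Q_next[sa] - Q[sa]) for sa in Q) < 1e-10:
        Q = Q_next
        break
    Q = Q_next

# A greedy policy can be read off the converged Q-values at the end.
pi_star = {s: max(A, key=lambda a: Q[(s, a)]) for s in [0, 1]}
```

Note that, unlike policy iteration, no policy is maintained during the loop; `pi_star` is extracted only once at the end.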

Example code for Value iteration and policy iteration : https://colab.research.google.com/drive/1JbYBgZUg74yrfo1VQbJPaWkDREbhuDsW?usp=sharing#scrollTo=gg97TU1j2ED5