What is Reinforcement Learning?
Reinforcement Learning is about the “science of decision making.” It is a machine learning paradigm in which an agent learns to make sequential decisions by interacting with an environment in order to achieve a specific goal. In RL, the agent learns through trial and error, receiving feedback in the form of rewards or penalties based on its actions. The goal of the agent is typically to maximize the cumulative reward it receives over time.
Basic Concepts
Markov Decision Process (MDP)
A Markov Decision Process is a mathematical framework used to model decision-making problems in which an agent interacts with an environment. It is a fundamental concept in the field of reinforcement learning. A Markov Decision Process consists of a tuple \((S, A, P, R, \gamma)\):
- State space \(S\) : the set of all possible states that the environment can be in.
- Action space \(A\) : the set of all possible actions that the agent can take.
- Transition probability function \(P(s', r \mid s, a)\) : the probability of transitioning from state \(s\) to \(s'\) after taking action \(a\) while obtaining reward \(r\).
- Reward function \(R(s, a)\) : the expected value of the next reward after action \(a\) is taken in state \(s\).
- Discount factor \(\gamma \in [0, 1]\) : how strongly future rewards are weighted relative to immediate ones.
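As a concrete illustration, a small finite MDP can be written down directly as plain Python data structures. The two-state weather MDP below (states, actions, probabilities, and rewards) is invented for this sketch:

```python
# A hypothetical two-state MDP written as plain Python data structures.
# All states, actions, probabilities, and rewards here are made up for illustration.

states = ["sunny", "rainy"]
actions = ["walk", "drive"]

# Transition probabilities: P[(s, a)][s2] = probability of moving to s2
# after taking action a in state s.
P = {
    ("sunny", "walk"):  {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "drive"): {"sunny": 0.9, "rainy": 0.1},
    ("rainy", "walk"):  {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "drive"): {"sunny": 0.5, "rainy": 0.5},
}

# Expected immediate reward R[(s, a)] for taking action a in state s.
R = {
    ("sunny", "walk"): 2.0,  ("sunny", "drive"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5,
}

gamma = 0.9  # discount factor

# Sanity check: each transition distribution must sum to 1.
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Representing \(P\) and \(R\) as dictionaries keyed by \((s, a)\) keeps the correspondence with the tuple definition explicit; a NumPy array indexed by state and action IDs is the usual alternative for larger problems.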
Policy
A strategy or mapping that specifies which action the agent should take in each state; for a deterministic policy, \(\pi(s) = a\).
Value Function
The value function gives the long-term reward that the agent can obtain from a given state \(s\). The state value function \(V^{\pi}(s)\) of an MDP is the expected return starting from state \(s\) and following the policy \(\pi\). Mathematically, the value function is defined as the expected sum of discounted future rewards starting from state \(s\):
\[ V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]. \]
It is an expectation because the environment (and possibly the policy) can be stochastic. The value function can be decomposed into two parts: the immediate reward \(R_{t+1}\) and the discounted value of the successor state \(S_{t+1}\). Therefore, \(V^{\pi}(s)\) can be expressed as
\[ V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma V^{\pi}(S_{t+1}) \,\middle|\, S_t = s\right]. \]
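This decomposition is exactly what iterative policy evaluation exploits: start from zeros and repeatedly apply the Bellman expectation equation until the values stop changing. A minimal sketch, using a made-up two-state MDP and a fixed deterministic policy:

```python
# Iterative policy evaluation: repeatedly apply
#   V(s) <- R(s, pi(s)) + gamma * sum over s2 of P(s2 | s, pi(s)) * V(s2)
# until the values stop changing. The MDP and policy below are hypothetical.

states = ["s0", "s1"]
P = {  # P[(s, a)][s2] = transition probability
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}
R = {("s0", "stay"): 1.0, ("s0", "go"): 0.0,
     ("s1", "stay"): -0.5, ("s1", "go"): 0.5}
gamma = 0.9
pi = {"s0": "stay", "s1": "go"}  # a fixed deterministic policy

V = {s: 0.0 for s in states}
while True:
    delta = 0.0  # largest change to any state's value this sweep
    for s in states:
        a = pi[s]
        v_new = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-10:  # values have converged to V^pi
        break
```

Because the update is a \(\gamma\)-contraction, the loop is guaranteed to converge to \(V^{\pi}\) for any starting values.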
Q Function
The action-value function, denoted \(Q^{\pi}(s, a)\), represents the expected cumulative reward that the agent can obtain from being in state \(s\), taking action \(a\), and then following the policy \(\pi\). Mathematically,
\[ Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma Q^{\pi}(S_{t+1}, A_{t+1}) \,\middle|\, S_t = s, A_t = a\right]. \]
These equations are called Bellman equations. When \(\pi\) is nondeterministic, the value function can also be expressed in terms of \(Q^{\pi}\) as
\[ V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a). \]
Optimal Policy and Value Function
The optimal value function, denoted \(V^{*}(s) = \max_{\pi} V^{\pi}(s)\), represents the maximum expected cumulative reward that an agent can achieve from a given state \(s\) by following the optimal policy \(\pi^{*}\). Similarly, the optimal Q-function \(Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)\) represents the maximum expected cumulative reward that an agent can achieve by taking action \(a\) in state \(s\) and then following the optimal policy thereafter.
Policy Iteration vs Value Iteration
How can we get the optimal policy \(\pi^{*}\)? One way would be to explore all possible policies and choose the one that yields the highest return. However, this exhaustive search is computationally expensive and often impractical, especially in environments with large state and action spaces. Iteration methods offer a smarter and more efficient alternative for finding the optimal policy. There are two main iteration approaches: Policy Iteration and Value Iteration.
Policy Iteration
1. Set \(k = 0\)
2. Initialize \(\pi_0\) as the uniformly random policy over all states
3. While \(\pi_k\) has not converged:
- Policy Evaluation : evaluate the value of \(\pi_k\) by solving \(V^{\pi_k}(s) = R(s, \pi_k(s)) + \gamma \sum_{s'} P(s' \mid s, \pi_k(s))\, V^{\pi_k}(s')\)
- Policy Improvement : based on the evaluation of \(V^{\pi_k}\), improve \(\pi_k\) to \(\pi_{k+1}(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi_k}(s') \right]\), or equivalently \(\pi_{k+1}(s) = \arg\max_{a} Q^{\pi_k}(s, a)\)
- Set \(k \leftarrow k + 1\)
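The steps above can be sketched in code. This helper assumes a finite MDP given as state/action lists plus transition and reward dictionaries keyed by \((s, a)\); the tiny deterministic MDP used to drive it is invented for the example:

```python
def policy_iteration(states, actions, P, R, gamma, tol=1e-10):
    """Alternate policy evaluation and greedy policy improvement."""
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman expectation equation.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = pi[s]
                v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to V.
        pi_new = {
            s: max(actions, key=lambda a: R[(s, a)]
                   + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
            for s in states
        }
        if pi_new == pi:      # policy is stable => it is optimal
            return pi, V
        pi = pi_new

# Hypothetical two-state MDP to drive the function: from s1, "stay" earns
# reward 2 forever; from s0, "go" moves to s1 and earns reward 1.
states, actions = ["s0", "s1"], ["stay", "go"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
pi_star, V_star = policy_iteration(states, actions, P, R, gamma=0.9)
# Optimal policy: go from s0, then stay in s1 (V(s1) = 2 / (1 - 0.9) = 20).
```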
Value Iteration
Policy iteration computes the infinite-horizon value of a policy and then improves that policy. However, in some cases we do not have to maintain explicit policies at all. Value iteration is another technique based on this idea.
1. Set \(k = 0\)
2. Initialize \(Q_0(s, a) = 0\) for all \(s\) and \(a\)
3. While \(Q_k\) has not converged: \(Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s')\), where \(V_k(s') = \max_{a'} Q_k(s', a')\)
So \(Q_{k+1}(s, a)\) is obtained by summing the immediate reward and the discounted maximum cumulative reward of the successor states based on the current estimate \(Q_k\).
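This update can be sketched as follows. As before, the dict-based MDP representation and the tiny two-state example driving it are assumptions of this sketch, not part of the linked notebook:

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-10):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_{s2} P(s2|s,a) * max_{a2} Q(s2,a2)."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                q = R[(s, a)] + gamma * sum(
                    p * max(Q[(s2, a2)] for a2 in actions)  # V_k(s2)
                    for s2, p in P[(s, a)].items()
                )
                delta = max(delta, abs(q - Q[(s, a)]))
                Q[(s, a)] = q
        if delta < tol:
            break
    # Extract the greedy policy from the converged Q-values.
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return pi, Q

# Hypothetical two-state MDP: staying in s1 earns reward 2 forever.
states, actions = ["s0", "s1"], ["stay", "go"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
pi_star, Q_star = value_iteration(states, actions, P, R, gamma=0.9)
```

Note the contrast with policy iteration: no policy is maintained during the loop; a greedy policy is read off from \(Q\) only once at the end.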
Example code for value iteration and policy iteration: https://colab.research.google.com/drive/1JbYBgZUg74yrfo1VQbJPaWkDREbhuDsW?usp=sharing#scrollTo=gg97TU1j2ED5