Description
This project uses one of the most famous reinforcement learning algorithms, the policy gradients algorithm, to play Atari games using just the raw pixels.
This algorithm, too, was introduced to me by Andrej Karpathy, who in about 90 lines of code trained an agent that was able to play Pong and even win against the Atari game AI after sufficient training. The link to the post can be found here. While researching this algorithm, I came to a stop: I could understand the rationale behind it and why it worked, but I failed to find an implementation that worked on something other than Pong, just recreations of the original post. So I decided to train an agent to play Space Invaders using the policy gradients algorithm in TensorFlow, and generalised it so that it could be used out of the box without many changes.
Reinforcement learning, in the broadest sense, refers to learning problems where the aim is to maximize a long-term objective. The setup consists of an agent that interacts with an environment through its actions at discrete time steps and receives a reward; each action transitions the agent into a new state.
There are two fundamental families of methods for solving reinforcement learning problems: value-based methods and policy-based methods. Value-function approaches attempt to find a policy that maximizes the return by maintaining estimates of the expected returns under some policy. These methods rely on the theory of Markov decision processes (MDPs), where a policy is called optimal if it achieves the best expected return from any initial state. In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we directly learn the policy function that maps states to actions (i.e. we select actions without using a value function).
A deterministic policy is a policy that maps states to actions: you give it a state and it returns the action to take.
A stochastic policy is a policy that maps a state to a probability distribution over the actions that can be taken. A stochastic policy is especially useful when the environment is uncertain or only partially observable, a setting known as a Partially Observable Markov Decision Process (POMDP).
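To make the distinction concrete, here is a minimal NumPy sketch (not from the repository; the feature sizes, weights, and action names are made up for illustration): a deterministic policy always returns the same action for a given state, while a stochastic policy returns a sample from a probability distribution over actions.

```python
import numpy as np

ACTIONS = ["left", "right", "fire"]              # hypothetical action set, for illustration only

def deterministic_policy(state_features, weights):
    """Deterministic policy: a given state always maps to the same action."""
    scores = state_features @ weights            # one score per action
    return ACTIONS[int(np.argmax(scores))]       # always pick the highest-scoring action

def stochastic_policy(state_features, weights, rng):
    """Stochastic policy: a state maps to a probability distribution over actions."""
    scores = state_features @ weights
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # softmax -> valid probabilities
    return rng.choice(ACTIONS, p=probs), probs   # sample an action from the distribution

rng = np.random.default_rng(0)
state = np.array([0.2, -1.0, 0.5, 0.1])          # toy 4-feature state
weights = rng.normal(size=(4, 3))                # toy parameters: 4 features x 3 actions

print(deterministic_policy(state, weights))      # same action on every call
print(stochastic_policy(state, weights, rng))    # sampled action plus its probabilities
```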
Convergence
For one, policy-based methods have better convergence properties.
The problem with value-based methods is that training can oscillate badly: the choice of action may change dramatically for an arbitrarily small change in the estimated action values.
With policy gradients, on the other hand, we just follow the gradient to find the best parameters, so we see a smooth update of our policy at each step.
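As a rough illustration of what "following the gradient" means here, below is a minimal NumPy sketch of a REINFORCE-style update for a linear softmax policy. This is just the idea, not the project's TensorFlow code; the learning rate and variable names are illustrative. The parameters are nudged a small step in the direction that makes the taken action more likely, scaled by the return that followed it.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(weights, state, action, ret, lr=0.01):
    """One REINFORCE-style update for a linear softmax policy:
    weights <- weights + lr * ret * grad_w log pi(action | state)."""
    probs = softmax(state @ weights)        # pi(. | state)
    grad_log = np.outer(state, -probs)      # gradient of log pi w.r.t. the weights...
    grad_log[:, action] += state            # ...with the taken action's column corrected
    return weights + lr * ret * grad_log    # small, smooth move along the gradient

# Toy usage: a 4-feature state, 3 actions, the agent took action 2 and got return +1.
weights = np.zeros((4, 3))
state = np.array([0.2, -1.0, 0.5, 0.1])
weights = policy_gradient_step(weights, state, action=2, ret=1.0)
```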
Effective in high-dimensional action spaces
The second advantage is that policy gradients are more effective in high-dimensional action spaces, or when using continuous actions.
The problem with deep Q-learning is that its predictions assign a score (the maximum expected future reward) to each possible action at each time step, given the current state.
But what if we have an infinite number of possible actions?
For instance, with a self-driving car, in each state you have a (nearly) infinite choice of actions (turning the wheel by 15°, 17.2°, 19.4°, honking…). We would need to output a Q-value for each possible action!
On the other hand, in policy-based methods you just adjust the parameters directly: the policy itself learns which actions are best, rather than estimating the maximum over every possible action at every step.
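To make the continuous case concrete, here is a minimal NumPy sketch (the feature values and weights are invented for illustration) of a Gaussian steering policy: rather than producing a Q-value for every possible angle, the policy outputs the parameters of a distribution over angles and samples one.

```python
import numpy as np

def gaussian_steering_policy(state_features, weights, log_std=-1.0, rng=None):
    """Continuous-action policy sketch: instead of scoring every possible
    steering angle, output the mean of a Gaussian over angles and sample."""
    rng = rng or np.random.default_rng(0)
    mean_angle = float(state_features @ weights)   # the policy predicts the mean angle
    std = np.exp(log_std)                          # spread (could also be learned)
    return rng.normal(mean_angle, std)             # a single sampled steering angle

state = np.array([0.4, -0.2, 1.0])                 # toy state features
weights = np.array([10.0, 5.0, 2.0])               # illustrative parameters only
print(gaussian_steering_policy(state, weights))    # e.g. an angle near 5 degrees
```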
Policy gradients can learn stochastic policies
A third advantage is that policy gradients can learn a stochastic policy, while value functions can't. One important consequence is that we don't need to hand-code an exploration/exploitation trade-off: a stochastic policy allows our agent to explore the state space without always taking the same action.
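As a rough contrast (epsilon-greedy is not discussed here; it is simply the most common explicit exploration rule used with value-based control), the sketch below shows how a value-based agent typically needs a hand-tuned exploration probability, while a stochastic policy explores just by sampling its own action distribution. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Value-based control usually needs an explicit exploration rule like this."""
    if rng.random() < epsilon:                 # explore with a hand-tuned probability
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))            # otherwise exploit the current estimates

def sample_stochastic_policy(action_probs):
    """A stochastic policy explores by construction: sampling already visits
    non-greedy actions in proportion to their probability."""
    return int(rng.choice(len(action_probs), p=action_probs))

q_values = np.array([1.2, 0.8, 1.1])           # toy action-value estimates
action_probs = np.array([0.5, 0.2, 0.3])       # toy policy output
print(epsilon_greedy(q_values))                # mostly action 0, occasionally random
print(sample_stochastic_policy(action_probs))  # action 0 about half the time
```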
Policy Gradients Algorithm