I began my Reinforcement Learning studies by reading Karpathy (May 31, 2016).
The article explains policy gradients, applied to games like Pong.
In the case of Pong, the input is a matrix of pixel differences between two consecutive frames, fed into a 2-layer fully connected network whose output p parameterizes a Bernoulli distribution over the two actions (up or down); at every timestep the action is sampled from that distribution. When the game ends, the episode reward (+1 if the agent won, -1 if it lost) is backpropagated through every one of the episode's decisions.
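To make that loop concrete, here is a minimal REINFORCE sketch in numpy. The environment interaction is replaced by dummy pixel-diff frames and a made-up episode outcome, and the names and shapes (policy_forward, policy_backward, D, H) are my own rather than Karpathy's exact code:

```python
import numpy as np

# Minimal REINFORCE sketch mirroring the Pong setup described above.
# Dummy data stands in for the real environment loop.

D = 80 * 80          # flattened pixel-diff input
H = 200              # hidden units of the 2-layer fully connected policy
rng = np.random.default_rng(0)

W1 = rng.standard_normal((H, D)) / np.sqrt(D)
W2 = rng.standard_normal(H) / np.sqrt(H)

def policy_forward(x):
    """Return p = P(action = UP) and the hidden activations."""
    h = np.maximum(0, W1 @ x)             # ReLU
    p = 1.0 / (1.0 + np.exp(-(W2 @ h)))   # sigmoid -> Bernoulli parameter
    return p, h

def policy_backward(xs, hs, dlogps):
    """Backprop reward-weighted d(log pi(a|s)) through the 2-layer net."""
    dW2 = hs.T @ dlogps
    dh = np.outer(dlogps, W2)
    dh[hs <= 0] = 0                        # ReLU gate
    dW1 = dh.T @ xs
    return dW1, dW2

# One "episode" of dummy pixel-diff frames.
T = 50
xs = rng.standard_normal((T, D))
hs, dlogps = np.zeros((T, H)), np.zeros(T)
for t in range(T):
    p, h = policy_forward(xs[t])
    action = 1 if rng.random() < p else 0  # sample UP/DOWN from Bernoulli(p)
    hs[t] = h
    dlogps[t] = action - p                 # grad of log-prob of the taken action

reward = 1.0                               # pretend the agent won this game
dlogps *= reward                           # modulate every decision by the outcome

dW1, dW2 = policy_backward(xs, hs, dlogps)
learning_rate = 1e-3
W1 += learning_rate * dW1                  # gradient ascent on expected reward
W2 += learning_rate * dW2
```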
The current consensus is that policy gradients work well only in settings with a small number of discrete actions, so that one is not hopelessly sampling through a huge search space.
For example, AlphaGo first uses supervised learning to predict human moves from expert Go games, and the resulting human-mimicking policy is later fine-tuned with policy gradients on the “real” objective of winning the game.
My notes on Proximal Policy Optimization[🌿]: The current state of the art in Reinforcement Learning, developed by OpenAI: a policy-gradient method plus a clipping/regularization term that limits how much the policy can change from one update to the next.
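The specific mechanism in the PPO paper is a clipped surrogate objective on the probability ratio between the new and old policies. A minimal numpy sketch of that loss (function name and the made-up numbers are mine):

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    logp_new:   log pi_theta(a_t | s_t) under the current policy
    logp_old:   log pi_theta_old(a_t | s_t) under the policy that collected the data
    advantages: advantage estimates A_t
    eps:        clip range; keeps the probability ratio in [1 - eps, 1 + eps]
    """
    ratio = np.exp(logp_new - logp_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # pessimistic lower bound

# Tiny usage example with made-up numbers:
logp_old = np.log(np.array([0.5, 0.4, 0.9]))
logp_new = np.log(np.array([0.6, 0.2, 0.95]))
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clipped_objective(logp_new, logp_old, adv))
```

Taking the minimum of the clipped and unclipped terms removes any incentive to push the policy far from the one that collected the data, which is the "regularization" referred to above.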
My notes on Sutton’s Alberta Research Plan[🌿]: Richard Sutton et al. lay down the foundations for the next 5–10 years of AI (particularly Reinforcement Learning) research.
Mnih et al. (2015)[🌱]: A paper where a Q-learning algorithm is trained on Atari 2600 games and matches or beats human-level performance. They used Q-learning with experience replay: a “tape” of experiences is kept, storing (S_t, A_t, R_{t+1}, S_{t+1}) tuples, and these are replayed by sampling uniformly from the tape, in a planning-like manner.
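A minimal sketch of that replay mechanism, using a tabular Q-function and dummy transitions for brevity (Mnih et al. actually use a convolutional network over raw frames; all names here are my own):

```python
import random
from collections import deque

import numpy as np

# Experience replay sketch: store transitions on a bounded "tape",
# then learn by sampling uniformly from it.

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
replay = deque(maxlen=10_000)      # the tape of (S_t, A_t, R_{t+1}, S_{t+1}) tuples

gamma, alpha = 0.99, 0.1
rng = random.Random(0)

def store(s, a, r, s_next):
    replay.append((s, a, r, s_next))

def replay_step(batch_size=32):
    """Sample uniformly from the tape and apply a Q-learning update per transition."""
    batch = rng.sample(list(replay), min(batch_size, len(replay)))
    for s, a, r, s_next in batch:
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

# Fill the tape with dummy transitions, then learn from replayed experience.
for _ in range(500):
    s, a = rng.randrange(n_states), rng.randrange(n_actions)
    s_next = rng.randrange(n_states)
    r = 1.0 if s_next == n_states - 1 else 0.0
    store(s, a, r, s_next)

for _ in range(100):
    replay_step()
```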