Reinforcement Learning - Base Note and Bibliography

Karpathy 2016: Pong with Policy Gradients

I began my Reinforcement Learning studies by reading Karpathy (May 31, 2016).

This article explains policy gradients, applied to solving games like Pong.

In the case of Pong, the input is the matrix of pixel diffs between two consecutive frames. A 2-layer fully-connected network takes that whole (preprocessed) screen as input and outputs p, the probability of moving up; each decision to go up or down is sampled from a Bernoulli with that p. When a game ends, every timestep and decision in it gets a reward of +1 if the agent won and -1 if it didn't, and we run backprop over each of those (input, action, reward) triples.

This, simple as it is, amazingly works.
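A minimal numpy sketch of that setup, assuming 80x80 preprocessed difference frames, a 200-unit hidden layer and plain gradient ascent; the environment loop and the RMSProp optimizer of the original article are left out, and a synthetic "episode" of random frames stands in for real Pong play:

```python
import numpy as np

# Sketch of the policy-gradient setup described above.
# Assumptions: 80x80 difference frames, hidden size 200, learning rate 1e-3;
# the real Atari environment loop is omitted and replaced by random frames.

D, H = 80 * 80, 200                      # input and hidden dimensions
rng = np.random.default_rng(0)
model = {
    "W1": rng.standard_normal((H, D)) / np.sqrt(D),   # input -> hidden
    "W2": rng.standard_normal(H) / np.sqrt(H),        # hidden -> p(UP)
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_forward(x):
    """2-layer fully-connected net: returns p(UP) and the hidden activations."""
    h = np.maximum(0, model["W1"] @ x)   # ReLU hidden layer
    p = sigmoid(model["W2"] @ h)         # probability of moving the paddle up
    return p, h

def discount_rewards(r, gamma=0.99):
    """Spread the end-of-game +1/-1 reward back over every timestep."""
    out = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:                    # game boundary: reset the running sum
            running = 0.0
        running = running * gamma + r[t]
        out[t] = running
    return out

def policy_backward(xs, hs, dlogps):
    """Gradient of sum(advantage * log p(action)) w.r.t. the weights."""
    dW2 = hs.T @ dlogps
    dh = np.outer(dlogps, model["W2"])
    dh[hs <= 0] = 0                      # backprop through ReLU
    dW1 = dh.T @ xs
    return {"W1": dW1, "W2": dW2}

# --- one synthetic episode, to show the update ---
T = 100
xs = rng.standard_normal((T, D))         # stand-ins for pixel-difference frames
ps, hs, actions = [], [], []
for x in xs:
    p, h = policy_forward(x)
    a = 1 if rng.random() < p else 0     # sample UP (1) / DOWN (0) from Bernoulli(p)
    ps.append(p); hs.append(h); actions.append(a)

rewards = np.zeros(T); rewards[-1] = 1.0 # pretend the agent won this game
advantages = discount_rewards(rewards)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# gradient of the log-prob of the taken action, scaled by the advantage
dlogps = (np.array(actions) - np.array(ps)) * advantages
grads = policy_backward(xs, np.array(hs), dlogps)

lr = 1e-3
for k in model:
    model[k] += lr * grads[k]            # gradient ascent on expected reward
```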

“For a more thorough derivation and discussion I recommend John Schulman’s lecture.” [🌱]

The current consensus is that PG works well only in settings where there are a few discrete choices, so one is not hopelessly sampling through huge search spaces.

For example, AlphaGo first uses supervised learning to predict human moves from expert Go games, and the resulting human-mimicking policy is later fine-tuned with policy gradients on the “real” objective of winning the game.
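As a toy illustration of that two-stage recipe (not AlphaGo's actual network, just a made-up softmax policy over small state/action dimensions): the supervised steps maximize the log-probability of the expert's move, and the policy-gradient steps reuse the same gradient but weight it by the game's outcome.

```python
import numpy as np

# Hypothetical two-stage recipe: (1) supervised pretraining on expert
# (state, move) pairs, (2) policy-gradient fine-tuning on win/loss.
# Dimensions are invented; a real Go policy net is a deep conv net.

rng = np.random.default_rng(0)
S, A = 64, 8                       # toy state and action dimensions
W = rng.standard_normal((A, S)) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stage 1: supervised learning -- maximize log-prob of the expert's move.
def supervised_step(state, expert_move, lr=0.1):
    global W
    p = softmax(W @ state)
    grad = -np.outer(p, state)     # d log p(move) / dW, softmax part
    grad[expert_move] += state
    W += lr * grad

# Stage 2: policy gradients -- same network and same log-prob gradient,
# but now for the *sampled* move, weighted by the outcome (+1 win / -1 loss).
def policy_gradient_step(state, sampled_move, outcome, lr=0.01):
    global W
    p = softmax(W @ state)
    grad = -np.outer(p, state)
    grad[sampled_move] += state
    W += lr * outcome * grad
```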

Additional Reading:

01 Jul 2022 - importance: 7