Compute Policy Gradient for UC Berkeley Deep RL Bootcamp Lab 4 Exercise 3.6
UC Berkeley organised a great bootcamp on reinforcement learning back in 2017. Exercise 3.6 of lab 4 asked participants to compute the gradient of a softmax policy.
I found that the formulation of the policy in lab 4 is slightly different from the formulation given in lectures (e.g. David Silver's course, Chapter 13 of the Sutton & Barto book).
For the formulation given in lectures, you may find the steps to compute the policy gradient here.
Compute Policy Gradient
Now I am going to give the steps to compute the policy gradient for exercise 3.6.
Notations
Let's begin by defining some useful notation.
Formulation
This formulation of the policy is based on the code of lab 4.
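As a sketch under stated assumptions, a linear softmax policy of this kind can be written as below. The symbols are assumptions made for illustration and may differ from the lab's own notation: \theta is a matrix with one row \theta_a per action, and x is the observation vector with a constant 1 appended as a bias term.

```latex
\pi_\theta(a \mid s) = \frac{\exp(\theta_a^\top x)}{\sum_{b} \exp(\theta_b^\top x)},
\qquad
\log \pi_\theta(a \mid s) = \theta_a^\top x - \log \sum_{b} \exp(\theta_b^\top x)
```

Written this way, the log-probability splits into two terms, a linear term for the chosen action and a log-sum-exp normaliser, and each can be differentiated separately.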
Steps
For the left term of the policy gradient from the formulation above
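Assuming the log-policy has the form \log \pi_\theta(a \mid s) = \theta_a^\top x - \log \sum_b \exp(\theta_b^\top x), with \theta a matrix holding one row per action and x the observation with a bias appended (assumed notation, not necessarily the lab's), the left term is linear in \theta. A sketch of its gradient, where e_a denotes the one-hot vector for action a:

```latex
\nabla_\theta \left( \theta_a^\top x \right) = e_a \, x^\top
```

Only row a of the matrix is non-zero, because the other rows of \theta do not appear in \theta_a^\top x.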
For the right term
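Under the same assumed form, \log \pi_\theta(a \mid s) = \theta_a^\top x - \log \sum_b \exp(\theta_b^\top x), the right term is the log-sum-exp normaliser. A sketch of its gradient, obtained with the chain rule (e_b is the one-hot vector for action b, and \pi_\theta(\cdot \mid s) is the column vector of action probabilities; both are illustrative notation):

```latex
\nabla_\theta \log \sum_{b} \exp(\theta_b^\top x)
= \sum_{b} \frac{\exp(\theta_b^\top x)}{\sum_{c} \exp(\theta_c^\top x)} \, e_b \, x^\top
= \pi_\theta(\cdot \mid s) \, x^\top
```

Row b of this gradient is x scaled by the probability of action b.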
Combining both terms gives us the policy gradient
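For a linear softmax policy with logits \theta x, the gradient of the log-probability works out to (e_a - \pi_\theta(\cdot \mid s)) x^\top, i.e. the outer product of "one-hot action minus action probabilities" with the input vector. The following is a minimal NumPy sketch that checks this against finite differences. It assumes \theta has shape (number of actions, observation dimension + 1) and that x already has the bias appended. The function names and the random test values are hypothetical and are not taken from the lab code.

```python
import numpy as np


def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def logp_action(theta, x, action):
    # Log-probability of `action` under the assumed linear softmax policy.
    return np.log(softmax(theta @ x)[action])


def grad_logp_action(theta, x, action):
    # Analytic gradient: (e_a - pi) x^T, same shape as theta.
    e_a = np.zeros(theta.shape[0])
    e_a[action] = 1.0
    return np.outer(e_a - softmax(theta @ x), x)


def numerical_grad_logp_action(theta, x, action, eps=1e-6):
    # Central finite differences, one entry of theta at a time.
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        plus, minus = theta.copy(), theta.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (logp_action(plus, x, action)
                     - logp_action(minus, x, action)) / (2 * eps)
    return grad


# Illustrative values: 2 actions, 4 observation dimensions plus a bias.
rng = np.random.default_rng(0)
theta_demo = rng.normal(size=(2, 5))
x_demo = np.append(rng.normal(size=4), 1.0)

analytic = grad_logp_action(theta_demo, x_demo, action=1)
numeric = numerical_grad_logp_action(theta_demo, x_demo, action=1)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

A useful sanity check is that the rows of the gradient sum to zero, since the entries of e_a - \pi_\theta(\cdot \mid s) sum to zero.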