Compute Policy Gradient for UC Berkeley Deep RL Bootcamp Lab 4 Exercise 3.6

Louis Kit Lung Law
Feb 28, 2019

UC Berkeley organised a great bootcamp on reinforcement learning back in 2017. Exercise 3.6 of lab 4 asks participants to compute the gradient of a softmax policy.

Lab 4 Exercise 3.6

I found that the formulation of the policy in lab 4 is a little different from the formulation usually given in lectures (e.g. David Silver’s course, or Chapter 13 of the Sutton & Barto book).

Typical formulation of softmax policy given in lecture
Formulation of softmax policy from the code of lab4
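As a rough sketch in symbols, the difference can be written as follows. The first line is the usual linear softmax policy from the lectures. The second line is my reading of the lab code, and it rests on assumptions: that theta is a matrix with one row per action, and that a constant 1 is appended to the observation to act as a bias. The symbols x(s, a), \theta_a and \tilde{s} are my own notation.

```latex
% Lecture-style softmax policy: a single parameter vector \theta
% and state-action features x(s, a)
\pi_\theta(a \mid s)
  = \frac{\exp\big(\theta^\top x(s, a)\big)}
         {\sum_{b} \exp\big(\theta^\top x(s, b)\big)}

% Lab-style softmax policy (assumed): one row \theta_a per action,
% applied to the observation with a constant 1 appended, \tilde{s} = [s; 1]
\pi_\theta(a \mid s)
  = \frac{\exp\big(\theta_a^\top \tilde{s}\big)}
         {\sum_{b} \exp\big(\theta_b^\top \tilde{s}\big)}
```

In the first form the action enters through the features, while in the second form it selects a row of the parameter matrix.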

For the formulation given in lecture, you may find the steps to compute the policy gradient here.

Compute Policy Gradient

Now I am going to give the steps to compute the policy gradient of exercise 3.6.

Notations

Let's begin by defining some useful notation.

Notations

Formulation

This formulation of policy is based on the code of lab4.

Formulation

Steps

For the left term of the policy gradient from the formulation above

Left term

For the right term

Right term
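For reference, here is a sketch of how the two terms work out, assuming the lab policy is a softmax over per-action linear logits and that "left" and "right" refer to the two terms of the log-probability. The symbols \theta_k (row k of the parameter matrix) and \tilde{s} (the observation with a constant 1 appended) are my own notation and may differ from the lab handout.

```latex
% Log-probability of the chosen action a, split into two terms
\log \pi_\theta(a \mid s)
  = \underbrace{\theta_a^\top \tilde{s}}_{\text{left term}}
  - \underbrace{\log \sum_{b} \exp\big(\theta_b^\top \tilde{s}\big)}_{\text{right term}}

% Left term: only the row of the chosen action receives a gradient
\frac{\partial}{\partial \theta_k} \, \theta_a^\top \tilde{s}
  = \mathbb{1}[k = a] \, \tilde{s}

% Right term: each row k is weighted by the probability of action k
\frac{\partial}{\partial \theta_k} \log \sum_{b} \exp\big(\theta_b^\top \tilde{s}\big)
  = \frac{\exp\big(\theta_k^\top \tilde{s}\big)}
         {\sum_{b} \exp\big(\theta_b^\top \tilde{s}\big)} \, \tilde{s}
  = \pi_\theta(k \mid s) \, \tilde{s}
```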

Combining both terms gives us the policy gradient

Policy gradient
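To sanity-check the result, here is a minimal NumPy sketch, not the official lab solution. It assumes theta has shape (number of actions, observation size + 1) and that the policy is a softmax over logits computed as theta times the bias-augmented observation. Under that assumption the gradient of the log-probability is the outer product of (one-hot action minus action probabilities) and the augmented observation. The helper names (include_bias, log_prob, grad_log_prob, numerical_grad) are my own, and the sketch compares the analytic gradient with a finite-difference estimate.

```python
import numpy as np


def include_bias(ob):
    # Append a constant 1 so the last column of theta acts as a bias
    # (assumed to mirror what the lab code does).
    return np.concatenate([ob, np.ones(1)])


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()


def log_prob(theta, ob, action):
    # log pi(a|s) = logit of the chosen action minus log-sum-exp of all logits.
    logits = theta @ include_bias(ob)
    m = logits.max()
    return logits[action] - (m + np.log(np.sum(np.exp(logits - m))))


def grad_log_prob(theta, ob, action):
    # Analytic gradient: (one_hot(action) - pi(.|s)) outer [ob, 1].
    ob_1 = include_bias(ob)
    probs = softmax(theta @ ob_1)
    one_hot = np.zeros(theta.shape[0])
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, ob_1)


def numerical_grad(theta, ob, action, eps=1e-6):
    # Central finite differences, one parameter at a time.
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        plus = log_prob(theta + step, ob, action)
        minus = log_prob(theta - step, ob, action)
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


# Hypothetical sizes: 2 actions and a 4-dimensional observation.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 5))
ob = rng.normal(size=4)

analytic = grad_log_prob(theta, ob, action=1)
numeric = numerical_grad(theta, ob, action=1)
max_abs_error = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_abs_error)
```

If the derivation is right, the printed difference should be tiny. Each column of the gradient also sums to zero across actions, because the one-hot vector and the probabilities both sum to one.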
