
Policy Gradient Methods

At least for the finite difference method, you may normalize the gradient to a unit vector before the weight update, i.e.

$\displaystyle \theta_{h+1} = \theta_{h} + \alpha \frac{\nabla_{\theta} J(\theta)}{\vert\nabla_{\theta} J(\theta)\vert}.$

This usually improves the learning speed. For a more detailed description of the methods and the equations, see the lecture slides.
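As a minimal sketch, one normalized finite-difference update could look as follows. Here episodeReward(theta) is a hypothetical stand-in for the routine in the Matlab package that runs the swimmer with parameters theta and returns the summed reward of that episode; alpha and epsilon are the step size and the perturbation size.

% One normalized finite-difference gradient update (sketch).
% episodeReward is a hypothetical placeholder for the package's
% rollout/return function.
function theta = fdUpdate(theta, alpha, epsilon)
    grad = zeros(size(theta));
    J0 = episodeReward(theta);             % reward at the current parameters
    for i = 1:numel(theta)                 % perturb one parameter at a time
        dtheta = zeros(size(theta));
        dtheta(i) = epsilon;
        grad(i) = (episodeReward(theta + dtheta) - J0) / epsilon;
    end
    theta = theta + alpha * grad / norm(grad);   % normalized update
end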

All policy gradient methods should be compared with respect to their learning speed. To this end, create a performance curve (x-axis: number of episodes seen by the algorithm; y-axis: summed reward obtained with the current parameter values) for each algorithm, as sketched below. To get a reliable estimate, average over at least 10 trials for each curve.
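A sketch of such an averaged performance curve is given below. trainOnce is a hypothetical stand-in for one complete learning run that returns the summed reward after every episode, and numEpisodes is an assumed episode budget; substitute the corresponding routine and budget from your setup.

% Performance curve averaged over independent trials (sketch).
numTrials = 10;
numEpisodes = 500;                         % assumed episode budget
R = zeros(numTrials, numEpisodes);
for t = 1:numTrials
    R(t, :) = trainOnce(numEpisodes);      % one independent learning run
end
plot(1:numEpisodes, mean(R, 1));           % average curve over all trials
xlabel('number of episodes');
ylabel('average summed reward');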

