
Policy Gradient Methods: Swimmer [3+1* P]

Figure 3: 3-Link Swimmer task: the simulated snake-like robot should swim fast and energy-efficiently.
[Image: example_swim]

In this task you have to learn optimal policies for the swimmer (see Figure 3) using different policy gradient methods.
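All of the methods compared below estimate a gradient of the expected return with respect to the policy parameters and plug it into the same gradient-ascent update. The following is a minimal Matlab sketch of this outer loop; estimate_gradient is a hypothetical placeholder (it is not part of swimmer.zip), and the step size, iteration count and parameter dimension are assumed values for illustration only.

% Outer gradient-ascent loop shared by all three gradient estimators.
alpha  = 0.05;             % step size (assumption)
n_iter = 200;              % number of policy updates (assumption)
theta  = zeros(12, 1);     % policy parameters b: 6 centers x 2 joints

for h = 1:n_iter
    g     = estimate_gradient(theta);              % FD, LR or natural gradient
    theta = theta + alpha * g / max(norm(g), eps); % normalized ascent step
end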

You have to compare three algorithms for computing the gradient $ \nabla_{\theta} J(\theta_h)$ , namely Finite Differences, Likelihood Ratio and the Natural Policy Gradient. The robot is a 3-link (2-joint) snake-like robot swimming in water. It has two actuators, and you have to learn how to use them to swim as fast as possible in a given direction. The model, the policy and the reward function are already given in the provided Matlab package swimmer.zip. The policy is a stochastic Gaussian policy implemented with a Dynamic Movement Primitive (DMP). The DMP uses 6 centers per joint. Since we deal with a periodic movement, the phase variable $ x$ of the DMP is also periodic. The policy adds noise to the velocity variable of the DMP, i.e.

$\displaystyle \pi(\dot{\mathbf{y}}\vert \mathbf{z}, x; \mathbf{b}) = \mathcal{N}(\dot{\mathbf{y}}\vert \Phi(x) \mathbf{b} + \mathbf{z}, \sigma^2 \mathbf{I}),$

where $ \mathbf{b}$ is the parameter vector of the policy (also denoted by $ \theta$ in the remainder of the description), which we have to learn. The DMP itself is already implemented, so you do not have to deal with it; the Matlab package provides you with all the information you need to calculate the gradients. The reward function (already implemented) is given by $ r_t = 10^{-2} v_x - 10^{-6} \Vert\mathbf{u}\Vert^2$ , where $ v_x$ is the velocity in x-direction and $ \mathbf{u}$ is the applied torque.
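For the Gaussian policy above, the gradient of the log-likelihood with respect to $ \mathbf{b}$ follows directly:

$\displaystyle \nabla_{\mathbf{b}} \log \pi(\dot{\mathbf{y}}\vert \mathbf{z}, x; \mathbf{b}) = \frac{1}{\sigma^2} \Phi(x)^T \left( \dot{\mathbf{y}} - \Phi(x)\mathbf{b} - \mathbf{z} \right).$

Below is a minimal Matlab sketch of the Likelihood Ratio (REINFORCE) estimator built on this expression. The helper sample_rollout and the data layout are assumptions made for illustration only; they do not correspond to the interface of swimmer.zip.

function g = likelihood_ratio_gradient(theta, N, sigma)
% Likelihood-ratio (REINFORCE) gradient estimate with a constant baseline.
% theta : current policy parameters b
% N     : number of rollouts per gradient estimate
% sigma : exploration noise of the Gaussian policy
grads   = zeros(numel(theta), N);
returns = zeros(1, N);
for n = 1:N
    % sample_rollout is a hypothetical helper: it runs one episode and
    % returns the stacked basis matrix Phi (one row per action dimension
    % and time step), the sampled actions A, the policy means M, and the
    % episode return R accumulated from r_t = 1e-2*v_x - 1e-6*||u||^2.
    [Phi, A, M, R] = sample_rollout(theta);
    grads(:, n) = Phi' * (A - M) / sigma^2;   % sum_t grad_b log pi(a_t|s_t)
    returns(n)  = R;
end
b0 = mean(returns);                 % baseline to reduce variance
g  = grads * (returns - b0)' / N;   % average over rollouts
end

The Finite Differences estimator needs no log-likelihood at all: it perturbs $ \mathbf{b}$ with small random vectors, records the resulting changes in return, and recovers the gradient by a least-squares fit. The Natural Policy Gradient additionally premultiplies the likelihood-ratio estimate by the inverse of the Fisher information matrix.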



Haeusler Stefan 2011-01-25