
RL application I: On- and off-policy learning [3 P]

Download the Reinforcement Learning (RL) MATLAB Toolbox and the example files. Adapt the mountain car demo example and apply RL to the following learning task. Consider the gridworld shown in Figure 2. Implement this environment with the RL Toolbox as an undiscounted ( $ \gamma = 1$ ), episodic task with a start state at $ S$ and a goal state at $ G$ . The actions move the agent up, down, left and right, unless it bumps into a wall, in which case the position is not changed. The reward is $ -1$ on all normal transitions, $ -10$ for bumping into a wall, and $ 0$ at the bonus state marked with $ B$ .

Figure 2: Gridworld with bonus state.
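
The MATLAB fragment below is a minimal sketch of the environment dynamics described above; it does not use the RL Toolbox interface, and the grid size, the wall layout and the coordinates of $ S$ , $ G$ and $ B$ are placeholders that have to be filled in from Figure 2.

% Minimal sketch of the gridworld dynamics (not the RL Toolbox API).
% walls is a 0/1 matrix of the grid, goal and bonus are [row col]
% coordinates -- all of them placeholders to be read off Figure 2.
function [nextPos, reward, done] = gridworldStep(pos, action, walls, goal, bonus)
    % pos = [row col]; action: 1 = up, 2 = down, 3 = left, 4 = right
    moves = [-1 0; 1 0; 0 -1; 0 1];
    cand  = pos + moves(action, :);
    [nR, nC] = size(walls);
    hitWall = cand(1) < 1 || cand(1) > nR || cand(2) < 1 || cand(2) > nC ...
              || walls(cand(1), cand(2)) == 1;
    if hitWall
        nextPos = pos;              % bumping into a wall leaves the position unchanged
        reward  = -10;
    elseif isequal(cand, bonus)
        nextPos = cand;
        reward  = 0;                % bonus state B
    else
        nextPos = cand;
        reward  = -1;               % normal transition
    end
    done = isequal(nextPos, goal);  % the episode ends at the goal state G
end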

Use Q-Learning and SARSA without eligibility traces to learn policies for this task. Use $ \epsilon$ -greedy action selection with a constant $ \epsilon = 0.1$ . Measure and plot the online performance of both learning algorithms (i.e. average reward per episode), and also sketch the policies that the algorithms find. Explain any differences in the performance of the algorithms. Are the learned policies optimal? Submit your MATLAB code.
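
For reference, the stand-alone MATLAB sketch below shows how the two tabular updates differ when combined with $ \epsilon$ -greedy action selection. It does not use the RL Toolbox; the function names (runAgent, epsGreedy), the learning rate alpha and the episode budget are assumptions made only for this illustration, and gridworldStep is the hypothetical environment function sketched after Figure 2.

% Tabular Q-Learning / SARSA with epsilon-greedy exploration (sketch).
% method is 'qlearning' or 'sarsa'; gamma = 1 (undiscounted task).
function episodeReturn = runAgent(method, epsilon, alpha, nEpisodes, walls, start, goal, bonus)
    [nR, nC] = size(walls);
    nA = 4;
    Q  = zeros(nR, nC, nA);
    episodeReturn = zeros(nEpisodes, 1);
    for ep = 1:nEpisodes
        pos = start;
        a   = epsGreedy(Q, pos, epsilon, nA);
        total = 0; done = false;
        while ~done
            [nextPos, r, done] = gridworldStep(pos, a, walls, goal, bonus);
            aNext = epsGreedy(Q, nextPos, epsilon, nA);    % behaviour action for the next step
            if done
                target = r;                                % no bootstrapping at G
            elseif strcmp(method, 'qlearning')
                target = r + max(squeeze(Q(nextPos(1), nextPos(2), :)));   % off-policy target
            else
                target = r + Q(nextPos(1), nextPos(2), aNext);             % on-policy target
            end
            Q(pos(1), pos(2), a) = Q(pos(1), pos(2), a) + alpha * (target - Q(pos(1), pos(2), a));
            pos = nextPos; a = aNext;
            total = total + r;
        end
        episodeReturn(ep) = total;   % undiscounted return of this episode
    end
end

function a = epsGreedy(Q, pos, epsilon, nA)
    % epsilon-greedy action selection over Q(pos(1), pos(2), :)
    if rand < epsilon
        a = randi(nA);
    else
        [~, a] = max(squeeze(Q(pos(1), pos(2), :)));
    end
end

Averaging the per-episode returns of many independent runs gives the online-performance curves asked for above; rerunning the same sketch with the $ \epsilon$ values listed in part a) covers that parameter sweep.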

a)
Repeat the exercise with $ \epsilon \in \{0.01, 0.001, 0.0001, 0.00001\}$ . Explain and interpret your results for both algorithms, i.e. Q-Learning and SARSA.

b)
Verify whether the online performance and the final policies change if you use eligibility traces for SARSA. Try different values of $ \lambda$ .
Present your results in a clear, structured and legible way. Document them in such a way that anybody can reproduce them effortlessly.
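
For part b), the sketch below shows one common way to add eligibility traces to the SARSA update (accumulating traces; with $ \gamma = 1$ only $ \lambda$ decays the traces). It reuses the hypothetical gridworldStep and epsGreedy functions sketched above, and alpha, lambda and the episode budget are again free parameters of the illustration rather than values prescribed by the exercise.

% SARSA(lambda) with accumulating eligibility traces (sketch).
function episodeReturn = runSarsaLambda(epsilon, alpha, lambda, nEpisodes, walls, start, goal, bonus)
    [nR, nC] = size(walls);
    nA = 4;
    Q  = zeros(nR, nC, nA);
    episodeReturn = zeros(nEpisodes, 1);
    for ep = 1:nEpisodes
        E   = zeros(nR, nC, nA);       % eligibility traces, reset at the start of each episode
        pos = start;
        a   = epsGreedy(Q, pos, epsilon, nA);
        total = 0; done = false;
        while ~done
            [nextPos, r, done] = gridworldStep(pos, a, walls, goal, bonus);
            aNext = epsGreedy(Q, nextPos, epsilon, nA);
            if done
                delta = r - Q(pos(1), pos(2), a);                                % terminal step
            else
                delta = r + Q(nextPos(1), nextPos(2), aNext) - Q(pos(1), pos(2), a);
            end
            E(pos(1), pos(2), a) = E(pos(1), pos(2), a) + 1;   % accumulate trace for (s, a)
            Q = Q + alpha * delta * E;                         % credit all recently visited pairs
            E = lambda * E;                                    % decay traces (gamma = 1)
            pos = nextPos; a = aNext;
            total = total + r;
        end
        episodeReturn(ep) = total;
    end
end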



Haeusler Stefan 2011-01-25