
On- and off-policy learning [5 P]

Download the Reinforcement Learning (RL) Toolbox and the example files. See ToolboxTutorial.pdf for a tutorial on the RL Toolbox. A similar example in the folder cliffworld will help you get started. Consider the gridworld shown in Figure 2. Implement this environment with the RL Toolbox as an undiscounted ( $ \gamma = 1$ ), episodic task with start state $ S$ and goal state $ G$ . The actions move the agent up, down, left, and right, unless the agent bumps into a wall, in which case its position does not change. The reward is $ -1$ on all normal transitions, $ -10$ for bumping into a wall, and $ 0$ at the bonus state marked with $ B$ .

Figure 2: Gridworld with bonus state (bonusworld.eps).
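The layout of Figure 2 is not reproduced in this text, so the following standalone Python sketch of the environment dynamics uses a placeholder map; the wall positions and the coordinates of S, G, and B are assumptions and must be replaced by the actual layout from the figure (or the gridworld configuration file). It is not RL Toolbox code; it only illustrates the transition and reward logic described above.

# Standalone sketch of the gridworld dynamics described above (NOT RL Toolbox code).
# The map below is a placeholder; take the real layout from Figure 2.
LAYOUT = [
    "########",
    "#S     #",
    "# ## B #",
    "#     G#",
    "########",
]

ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right


class BonusGridworld:
    def __init__(self, layout=LAYOUT):
        self.grid = [list(row) for row in layout]
        self.start = self._find('S')
        self.goal = self._find('G')
        self.bonus = self._find('B')
        self.state = self.start

    def _find(self, ch):
        for r, row in enumerate(self.grid):
            for c, cell in enumerate(row):
                if cell == ch:
                    return (r, c)
        raise ValueError("symbol %r not in layout" % ch)

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        """Returns (next_state, reward, done)."""
        dr, dc = ACTIONS[action]
        r, c = self.state
        nr, nc = r + dr, c + dc
        if self.grid[nr][nc] == '#':       # bumping into a wall: position unchanged, reward -10
            return self.state, -10.0, False
        self.state = (nr, nc)
        if self.state == self.goal:        # reaching G ends the episode (normal transition, reward -1)
            return self.state, -1.0, True
        if self.state == self.bonus:       # entering the bonus state B gives reward 0
            return self.state, 0.0, False
        return self.state, -1.0, False     # all other transitions: reward -1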

Use Q-Learning and SARSA without eligibility traces to learn policies for this task. Use $ \epsilon$ -greedy action selection with a constant $ \epsilon = 0.1$ . Measure and plot the online performance of both learning algorithms (i.e., the average reward per episode), and also sketch the policies that the algorithms find. Explain any differences in the performance of the algorithms. Are the learned policies optimal? Repeat the exercise with $ \epsilon$ gradually reduced after every episode and explain what you find. Submit your code and the gridworld configuration file.
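As a starting point for comparing the two methods, here is a minimal tabular sketch in standalone Python (again not RL Toolbox code); the step size alpha and the environment interface (the BonusGridworld sketch above) are assumptions. The only difference between the two updates is the bootstrap target: Q-Learning uses the greedy maximum over next actions (off-policy), while SARSA uses the value of the next action actually selected by the $ \epsilon$ -greedy policy (on-policy).

import random
from collections import defaultdict


def eps_greedy(Q, s, n_actions, eps):
    """Epsilon-greedy action selection over a tabular Q-function."""
    if random.random() < eps:
        return random.randrange(n_actions)
    values = [Q[(s, a)] for a in range(n_actions)]
    return values.index(max(values))


def run_episode(env, Q, eps=0.1, alpha=0.1, gamma=1.0, sarsa=False, n_actions=4):
    """One episode of tabular Q-Learning (sarsa=False) or SARSA (sarsa=True).
    Returns the total undiscounted reward of the episode (the online performance measure)."""
    s = env.reset()
    a = eps_greedy(Q, s, n_actions, eps)
    total, done = 0.0, False
    while not done:
        s2, r, done = env.step(a)
        total += r
        a2 = eps_greedy(Q, s2, n_actions, eps)
        if done:
            target = r                                    # no bootstrapping from terminal states
        elif sarsa:
            target = r + gamma * Q[(s2, a2)]              # on-policy target: the action taken next
        else:
            target = r + gamma * max(Q[(s2, b)] for b in range(n_actions))  # off-policy max
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2
    return total


# Usage sketch: record the reward per episode for the learning curve.
# Q = defaultdict(float)
# env = BonusGridworld()
# returns = [run_episode(env, Q, eps=0.1, sarsa=True) for _ in range(500)]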

a)
(1 point) For a fixed $ \epsilon = 0.1$ , how large does the bonus reward at $ B$ have to be so that both algorithms converge to the same policy? Plot the online performance of both methods.
b)
(1 point) How do the online performance and the final policy change if you use eligibility traces with both algorithms? Try different values of $ \lambda$ (see the sketch below).
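For part b), the following sketch shows how eligibility traces enter the tabular update, here for SARSA( $ \lambda$ ) with replacing traces; it reuses eps_greedy from the sketch above, and the default values of $ \lambda$ and $ \alpha$ are arbitrary assumptions. For Q-Learning, traces are typically cut after exploratory actions (Watkins's Q( $ \lambda$ )).

from collections import defaultdict


def sarsa_lambda_episode(env, Q, lam=0.8, eps=0.1, alpha=0.1, gamma=1.0, n_actions=4):
    """One episode of tabular SARSA(lambda) with replacing eligibility traces."""
    e = defaultdict(float)                    # eligibility traces, reset at the start of each episode
    s = env.reset()
    a = eps_greedy(Q, s, n_actions, eps)
    total, done = 0.0, False
    while not done:
        s2, r, done = env.step(a)
        total += r
        a2 = eps_greedy(Q, s2, n_actions, eps)
        delta = r - Q[(s, a)] if done else r + gamma * Q[(s2, a2)] - Q[(s, a)]
        e[(s, a)] = 1.0                       # replacing trace for the current state-action pair
        for key in list(e):
            Q[key] += alpha * delta * e[key]  # every eligible pair shares the TD error
            e[key] *= gamma * lam             # traces decay by gamma * lambda
        s, a = s2, a2
    return total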



Haeusler Stefan 2009-01-19