
RL theory II [3 P]

Assume that for a given continuing MDP with discount factor $\gamma < 1$ we modify the reward signal in one of the following ways:

a)
adding a constant $d$ to all rewards
b)
multiplying every reward by a constant $k > 0$
c)
linearly transforming the reward signal to $k \cdot r + d$, with $k > 0$

Can this change the optimal policy of the MDP? For all three cases, express the new state values in terms of $V(s)$, $\gamma$ and the constants (where $V(s)$ is the optimal value of state $s$ under the original reward function).
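Hint (this only recaps the standard definition, it is not additional information from the exercise): the value of a state is the expected discounted sum of rewards, so the new values can be obtained by substituting the transformed reward into

$\displaystyle V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s\right].$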

Now consider the following modifications for deterministic MDPs:

d)
Let $(s_{max}, a_{max})$ be the state-action pair that leads to the highest possible immediate reward $r_{max} = \max_{s,a} r(s,a)$ in the MDP. Set $r(s_{max},a_{max}) \leftarrow r_{max} + d$, $d > 0$.
e)
Let $(s_{min}, a_{min})$ be the state-action pair that leads to the lowest possible immediate reward $r_{min} = \min_{s,a} r(s,a)$ in the MDP. Set $r(s_{min},a_{min}) \leftarrow r_{min} - d$, $d > 0$.
For simplicity, you may assume in both cases that the maximum/minimum is unique, i.e. it is attained at exactly one state-action pair. Can you guarantee for arbitrary deterministic MDPs that the optimal policy stays the same? If not, give a counterexample.
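A quick way to build intuition for d) and e) is to check a small example numerically. The following sketch (a hypothetical toy MDP with arbitrarily chosen rewards and $d$, not part of the exercise) runs value iteration on a deterministic 3-state MDP, adds $d$ to the single largest immediate reward, and compares the greedy policies before and after:

import numpy as np

n_states, gamma = 3, 0.9

# Deterministic toy MDP: next_state[s, a] is the successor, r[s, a] the reward
next_state = np.array([[1, 2],
                       [2, 0],
                       [0, 1]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [0.0, 1.5]])

def greedy_policy(rewards):
    # Value iteration: V(s) <- max_a [ r(s,a) + gamma * V(next_state(s,a)) ]
    V = np.zeros(n_states)
    for _ in range(2000):
        Q = rewards + gamma * V[next_state]
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

pi_orig, V_orig = greedy_policy(r)

# Modification d): add d > 0 to the single highest immediate reward
r_mod = r.copy()
s_max, a_max = np.unravel_index(np.argmax(r_mod), r_mod.shape)
r_mod[s_max, a_max] += 5.0   # d = 5.0, an arbitrary choice for this check

pi_mod, V_mod = greedy_policy(r_mod)
print("greedy policy before:", pi_orig, " after:", pi_mod)

Modification e) can be checked analogously by subtracting $d$ from the smallest reward entry.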


Haeusler Stefan 2013-01-16