CTDLearner Class Reference
Class for temporal difference learning.
#include <ctdlearner.h>
Public Member Functions
CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAbstractQETraces *etraces, CAgentController *estimationPolicy)
    Creates a TD Learner with the given abstract Q-Function and Q-ETraces.
CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAgentController *estimationPolicy)
    Creates a TD Learner with the given composed Q-Function and a new composed Q-ETraces object.
virtual ~CTDLearner ()
virtual void loadValues (char *filename)
virtual void saveValues (char *filename)
virtual void loadValues (FILE *stream)
virtual void saveValues (FILE *stream)
virtual void nextStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
    Calls the update function learnStep.
virtual void intermediateStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
    Updates the Q-Function for an intermediate step.
virtual void newEpisode ()
    Resets the ETraces.
void setAlpha (double alpha)
    Sets the learning rate.
void setLambda (double lambda)
    Sets the lambda parameter of the ETraces.
CAgentController * getEstimationPolicy ()
void setEstimationPolicy (CAgentController *estimationPolicy)
CAbstractQFunction * getQFunction ()
CAbstractQETraces * getETraces ()
Protected Member Functions
virtual void learnStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
    Updates the Q-Function and manages the ETraces.
virtual double getTemporalDifference (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
    Calculates the temporal difference.
virtual double getResidual (double oldQ, double reward, int duration, double newQ)
    Returns the temporal difference error residual.
virtual void addETraces (CStateCollection *oldState, CStateCollection *newState, CAction *action)
    Adds the current state to the ETraces.
Protected Attributes
bool externETraces
    Use external ETraces.
CAgentController * estimationPolicy
    Estimation policy: the policy which is learned.
CAction * lastEstimatedAction
    The last action estimated by the policy.
CAbstractQFunction * qfunction
CAbstractQETraces * etraces
CActionDataSet * actionDataSet
Detailed Description
Class for temporal difference learning.
Temporal difference (TD) Q-Value learners are common model-free
reinforcement learning algorithms. For each step sample they move
the current Q-Value towards the target value
R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}).
The TD update for the Q-Values is therefore
Q_new(s_t, a_t) = (1 - alpha) * Q_old(s_t, a_t) + alpha * (R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}))
which can be rewritten as
Q_new(s_t, a_t) = Q_old(s_t, a_t) + alpha * (R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q_old(s_t, a_t))
where R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
is the temporal difference. For the semi-Markov case the temporal
difference is R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) -
Q(s_t, a_t), where N is the duration of the multi-step action. The
temporal difference update is usually applied to states from the
past as well, using eligibility traces (ETraces). This method is
called TD-Lambda.
In the RIL toolbox, TD learners are represented by the class
CTDLearner, which provides an implementation of the TD-Lambda
algorithm. The class maintains a Q-Function, an ETraces object, a
reward function, and a policy that serves as the estimation policy
needed to calculate a_{t+1}. The Q-Function, the reward function and
the policy have to be passed in by the user. The ETraces object is
usually initialized with the standard ETraces object for the
Q-Function, but it can also be specified explicitly.
The learnStep function updates the Q-Function according to the step
sample. It is called by the nextStep event. First, the last
estimated action (a_{t+1}) is compared to the action that was
actually executed. If the two actions are not equal, the ETraces
have to be reset, because the agent did not follow the policy being
learned. If you do not want to reset the ETraces, set the parameter
"ResetETracesOnWrongEstimate" to false (0.0). If the two actions are
equal, the ETraces are multiplied by lambda*gamma. After that, the
ETrace of the current state-action pair is added to the ETraces
object, and the next estimated action is calculated by the given
policy and stored. Now the temporal difference error can be
calculated as R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) -
Q(s_t, a_t), or as R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1},
a_{t+1}) - Q(s_t, a_t) for multi-step actions. With the temporal
difference error, all the states in the ETraces are updated by the
updateQFunction method of the Q-ETraces object. Before the update,
the temporal difference error is multiplied by the learning rate
(parameter "QLearningRate").
The getTemporalDifference function calculates the old Q-Value
and the new Q-Value and then calls the getResidual function, which
does the actual temporal difference error computation.
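A minimal sketch of the residual computation follows, using the formulas given in this description. The free function residual is a hypothetical stand-in for getResidual; it takes the discount factor as an explicit argument, whereas the class itself is documented as having a "DiscountFactor" parameter.

```cpp
#include <cmath>

// Hypothetical stand-in for getResidual(oldQ, reward, duration, newQ).
// Computes reward + gamma^duration * newQ - oldQ; duration == 1 gives the
// ordinary single-step temporal difference.
double residual(double oldQ, double reward, int duration, double newQ,
                double gamma = 0.95) {
    return reward + std::pow(gamma, duration) * newQ - oldQ;
}
```

Here oldQ corresponds to Q(s_t, a_t) and newQ to Q(s_{t+1}, a_{t+1}). Keeping this computation in one overridable function is presumably what lets subclasses substitute a different error definition.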
For hierarchical MDPs, intermediate steps get special treatment in
the TD algorithm. Since the intermediate steps are not actually
members of the episode, their ETraces have to be handled
differently. The state of the intermediate step is added to the
ETraces object as usual, but the multiplication of all other ETraces
is skipped, and the Q-Function is not updated with the whole ETraces
object; only the Q-Value of the intermediate state is updated. This
is done because the intermediate step is not directly reachable from
the past states, and updating all intermediate steps via ETraces
would falsify the Q-Values, since the same step would be updated
several times.
CTDLearner has the following parameters:
- inherits all parameters from the Q-Function
- inherits all parameters from the ETraces
- "QLearningRate", 0.2 : learning rate of the algorithm
- "DiscountFactor", 0.95 : discount factor of the learning
problem
- "ResetETracesOnWrongEstimate", 1.0 : reset the ETraces when the
estimated action was not the one actually executed.
See also:
    CQLearner
    CSarsaLearner
Constructor & Destructor Documentation
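CTDLearner::CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAbstractQETraces *etraces, CAgentController *estimationPolicy)
    Creates a TD Learner with the given abstract Q-Function and
    Q-ETraces.
CTDLearner::CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAgentController *estimationPolicy)
    Creates a TD Learner with the given composed Q-Function and a
    new composed Q-ETraces object.
    The ETraces are initialised with the standard V-ETraces of the
    Q-Function's V-Functions. If you want to access the V-ETraces, you
    have to cast the result of getETraces() from (CAbstractQETraces *)
    to (CQETraces *).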
virtual CTDLearner::~CTDLearner () [virtual]
Member Function Documentation
virtual double CTDLearner::getResidual (double oldQ, double reward, int duration, double newQ) [protected, virtual]
virtual void CTDLearner::intermediateStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState) [virtual]
    Updates the Q-Function for an intermediate step.
    Since the intermediate steps are not actually members of the
    episode, their ETraces have to be handled differently. The state
    of the intermediate step is added to the ETraces object as usual,
    but the multiplication of all other ETraces is skipped, and the
    Q-Function is not updated with the whole ETraces object; only the
    Q-Value of the intermediate state is updated. This is done because
    the intermediate step is not directly reachable from the past
    states, and updating all intermediate steps via ETraces would
    falsify the Q-Values, since the same step would be updated several
    times.
    Reimplemented from CSemiMDPRewardListener.
virtual void CTDLearner::learnStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState) [protected, virtual]
    Updates the Q-Function and manages the ETraces.
    The learnStep function updates the Q-Function according to the
    step sample. It is called by the nextStep event. First, the last
    estimated action (a_{t+1}) is compared to the action that was
    actually executed. If the two actions are not equal, the ETraces
    have to be reset: the agent did not follow the policy being
    learned, so using the ETraces of older states would falsify the
    Q-Values. If the two actions are equal, the ETraces are multiplied
    by lambda*gamma. After that, the ETrace of the current
    state-action pair is added to the ETraces object, and the next
    estimated action is calculated by the given policy. Now the
    temporal difference can be calculated as R(s_t, a_t, s_{t+1}) +
    gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), or as R(s_t, a_t,
    s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) for
    multi-step actions. With the temporal difference, all the states
    in the ETraces are updated by the updateQFunction method of the
    Q-ETraces object.
    Reimplemented in CAdvantageUpdating, and CTDResidualLearner.
virtual void CTDLearner::loadValues (FILE *stream) [virtual]
virtual void CTDLearner::loadValues (char *filename) [virtual]
virtual void CTDLearner::newEpisode () [virtual]
virtual void CTDLearner::saveValues (FILE *stream) [virtual]
virtual void CTDLearner::saveValues (char *filename) [virtual]
void CTDLearner::setAlpha (double alpha)
void CTDLearner::setLambda (double lambda)
    Sets the lambda parameter of the ETraces.
Member Data Documentation
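CAgentController * CTDLearner::estimationPolicy [protected]
    Estimation policy: the policy which is learned.
CAction * CTDLearner::lastEstimatedAction [protected]
    The last action estimated by the policy.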