Reinforcement Learning Toolbox 2.0

CTDLearner Class Reference

Class for temporal Difference Learning.

#include <ctdlearner.h>

Inheritance diagram for CTDLearner:

Base classes: CSemiMDPRewardListener, CErrorSender, CSemiMDPListener, CParameterObject, CParameters
Derived classes: CAdvantageUpdating, CQLearner, CSarsaLearner, CTDGradientLearner, CTDResidualLearner, CAdvantageLearner


Public Member Functions

  CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAbstractQETraces *etraces, CAgentController *estimationPolicy)
  Creates a TD Learner with the given abstract Q-Function and Q-ETraces.

  CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAgentController *estimationPolicy)
  Creates a TD Learner with the given composed Q-Function and a new composed Q-Etraces object.

virtual  ~CTDLearner ()
virtual void  loadValues (char *filename)
virtual void  saveValues (char *filename)
virtual void  loadValues (FILE *stream)
virtual void  saveValues (FILE *stream)
virtual void  nextStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
  Calls the update function learnStep.

virtual void  intermediateStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
  Updates the Q-Function for an intermediate step.

virtual void  newEpisode ()
  Resets the Etraces.

void  setAlpha (double alpha)
  Sets the learning rate.

void  setLambda (double lambda)
  Sets the lambda parameter of the etraces.

CAgentController *  getEstimationPolicy ()
void  setEstimationPolicy (CAgentController *estimationPolicy)
CAbstractQFunction *  getQFunction ()
CAbstractQETraces *  getETraces ()


Protected Member Functions

virtual void  learnStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
  Updates the Q-Function and manages the Etraces.

virtual double  getTemporalDifference (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)
  calculates the temporal difference

virtual double  getResidual (double oldQ, double reward, int duration, double newQ)
  returns the temporal difference error residual

virtual void  addETraces (CStateCollection *oldState, CStateCollection *newState, CAction *action)
  adds the current state to the etraces



Protected Attributes

bool  externETraces
  use extern eTraces

CAgentController *  estimationPolicy
  estimation Policy - policy which is learned

CAction *  lastEstimatedAction
  The last action estimated by the policy.

CAbstractQFunction *  qfunction
CAbstractQETraces *  etraces
CActionDataSet *  actionDataSet

Detailed Description

Class for temporal Difference Learning.

Temporal difference (TD) Q-value learners are the common model-free reinforcement learning algorithms. For each step sample, they move the current Q-value toward the calculated target

  Q(s_t, a_t) = R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1})

So the TD update for the Q-values is

  Q_new(s_t, a_t) = (1 - alpha) * Q_old(s_t, a_t) + alpha * (R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}))

which can be rewritten as

  Q_new(s_t, a_t) = Q_old(s_t, a_t) + alpha * (R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q_old(s_t, a_t))

where R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q_old(s_t, a_t) is the temporal difference. In the semi-Markov case, where N is the duration of the action, the temporal difference is R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) - Q_old(s_t, a_t). The temporal difference update is usually applied to states from the past as well, using ETraces. This method is called TD-Lambda.
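
The two forms of the TD update above are algebraically identical. The following is a minimal sketch of both on a single Q-value; the function names tdUpdate and tdUpdateBlend are hypothetical and are not part of the toolbox.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not a toolbox function): one TD update of a single
// Q-value, written as old value plus learning rate times temporal difference.
double tdUpdate(double oldQ, double reward, double nextQ, double alpha, double gamma) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    double td = reward + gamma * nextQ - oldQ;  // temporal difference
    return oldQ + alpha * td;
}

// The same update written as a blend of the old value and the target.
double tdUpdateBlend(double oldQ, double reward, double nextQ, double alpha, double gamma) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    double target = reward + gamma * nextQ;
    return (1.0 - alpha) * oldQ + alpha * target;
}
```

With oldQ = 1.0, reward = 0.5, nextQ = 2.0, alpha = 0.2 and gamma = 0.95, the temporal difference is 1.4 and both functions yield 1.28.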

In the RL toolbox, TD learners are represented by the class CTDLearner, which provides an implementation of the TD-Lambda algorithm. The class maintains a Q-Function, an ETraces object, a reward function, and a policy used as the estimation policy for calculating a_{t+1}. The Q-Function, the reward function, and the policy have to be passed in by the user. The ETraces object is usually initialized with the standard ETraces object for the Q-Function, but it can also be specified explicitly.

The learnStep function updates the Q-Function according to the step sample. It is called by the nextStep event. First, the last estimated action (a_{t+1}) is compared to the action actually executed. If the two actions are not equal, the ETraces have to be reset, because the agent did not follow the policy being learned. If you do not want to reset the ETraces, set the parameter "ResetETracesOnWrongEstimate" to false (0.0). If the two actions are equal, the ETraces are multiplied by lambda*gamma. After that, the ETrace of the current state-action pair is added to the ETraces object, and the next estimated action is calculated by the given policy and stored. Now the temporal difference error can be calculated as R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), or as R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) for multi-step actions. With the temporal difference error available, all states in the ETraces are updated by the updateQFunction method of the Q-ETraces object. Before the update, the temporal difference error is multiplied by the learning rate (parameter "QLearningRate").
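
The sequence described for learnStep can be sketched with a small tabular learner. This is an illustration only, not the toolbox implementation: the struct TabularTDLambda, the use of plain ints for states and actions, the accumulating trace, the lambda value of 0.9, and decaying the traces when resetting is switched off are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>

// Hypothetical stand-ins: states and actions are plain ints, not toolbox types.
using StateAction = std::pair<int, int>;

struct TabularTDLambda {
    double alpha = 0.2;                // learning rate ("QLearningRate")
    double gamma = 0.95;               // discount factor ("DiscountFactor")
    double lambda = 0.9;               // trace decay, value chosen for illustration
    bool resetOnWrongEstimate = true;  // "ResetETracesOnWrongEstimate"

    std::map<StateAction, double> q;       // Q-values; missing entries read as 0
    std::map<StateAction, double> traces;  // eligibility traces
    int lastEstimatedAction = -1;          // -1 means "no estimate yet"

    double getQ(int s, int a) const {
        auto it = q.find(StateAction(s, a));
        return it == q.end() ? 0.0 : it->second;
    }

    // estimatedNext is the action the estimation policy picks in nextS.
    // Returns the temporal difference error of this step.
    double learnStep(int s, int a, double reward, int nextS, int estimatedNext,
                     int duration = 1) {
        assert(duration >= 1);
        bool wrongEstimate = lastEstimatedAction != -1 && lastEstimatedAction != a;
        if (wrongEstimate && resetOnWrongEstimate) {
            traces.clear();  // the agent left the policy being learned
        } else {
            // Assumption: traces also decay when resetting is switched off.
            for (auto &t : traces) t.second *= lambda * gamma;
        }
        traces[StateAction(s, a)] += 1.0;  // accumulating trace (an assumption)
        lastEstimatedAction = estimatedNext;

        // gamma^duration covers multi-step actions; duration 1 is the plain case.
        double td = reward + std::pow(gamma, duration) * getQ(nextS, estimatedNext)
                    - getQ(s, a);
        // Every traced pair moves by learning rate * TD error * its trace.
        for (const auto &t : traces) q[t.first] += alpha * td * t.second;
        return td;
    }
};
```

Starting from an empty table, two consecutive on-policy steps with reward 1.0 leave the first state-action pair with a Q-value of 0.2 + 0.2 * 0.855 = 0.371, because its trace has decayed to lambda*gamma = 0.855 by the second update.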

The getTemporalDifference function calculates the old Q-Value and the new Q-Value and then calls the getResidual function, which does the actual temporal difference error computation.
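
A minimal sketch of the residual computation follows, assuming the discount factor is passed in explicitly. The free function residual is hypothetical; the member getResidual takes no gamma argument and presumably reads the "DiscountFactor" parameter instead.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical free-function version of the residual computation:
// reward + gamma^duration * newQ - oldQ.
// duration is the number of steps the action lasted (1 for a primitive action).
double residual(double oldQ, double reward, int duration, double newQ, double gamma) {
    assert(duration >= 1);
    return reward + std::pow(gamma, duration) * newQ - oldQ;
}
```

For a single-step action this reduces to the ordinary TD error; for an action lasting two steps with gamma = 0.5, the next Q-value is weighted by 0.25.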

For hierarchic MDPs, intermediate steps get special treatment in the TD algorithm. Since the intermediate steps are not really members of the episode, they need special treatment for the ETraces. The state of the intermediate step is added to the ETraces object as usual, but the multiplication of all other ETraces is skipped, and the Q-Function is not updated with the whole ETraces object; only the Q-value of the intermediate state is updated. This is done because the intermediate step is not directly reachable from the past states, and updating all intermediate steps via ETraces would falsify the Q-values, since the same step would be updated several times.
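
The intermediate-step treatment can be sketched on the same kind of tabular data. This is a hedged illustration, not the toolbox code: the function name intermediateStepSketch, the map-based tables, the accumulating trace, and scaling the update by the learning rate are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>

// Hypothetical stand-ins for the toolbox's Q-Function and ETraces objects.
using StateAction = std::pair<int, int>;
using Table = std::map<StateAction, double>;

// Intermediate step: the pair is added to the traces, the other traces are
// left undecayed, and only the Q-value of the intermediate pair is updated.
double intermediateStepSketch(Table &q, Table &traces, StateAction sa,
                              double reward, double nextQ, int duration,
                              double alpha, double gamma) {
    assert(duration >= 1);
    traces[sa] += 1.0;  // added as usual; no decay of the other traces
    double td = reward + std::pow(gamma, duration) * nextQ - q[sa];
    q[sa] += alpha * td;  // past state-action pairs stay untouched
    return td;
}
```

Past state-action pairs keep both their trace values and their Q-values, so a step that is reported several times as an intermediate step does not push the same error into them repeatedly.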

CTDLearner has the following parameters:

  • inherits all parameters from the Q-Function
  • inherits all parameters from the ETraces
  • "QLearningRate", 0.2 : learning rate of the algorithm
  • "DiscountFactor", 0.95 : discount factor of the learning problem
  • "ResetETracesOnWrongEstimate", 1.0 : reset the ETraces when the estimated action was not the one actually executed
See also:
CQLearner

CSarsaLearner


Constructor & Destructor Documentation

CTDLearner::CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAbstractQETraces *etraces, CAgentController *estimationPolicy)
 

Creates a TD Learner with the given abstract Q-Function and Q-ETraces.

CTDLearner::CTDLearner (CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAgentController *estimationPolicy)
 

Creates a TD Learner with the given composed Q-Function and a new composed Q-Etraces object.

The ETraces are initialised with the standard V-ETraces of the Q-Function's V-Functions. To access the V-ETraces, cast the result of getETraces() from (CAbstractQETraces *) to (CQETraces *).

virtual CTDLearner::~CTDLearner ()  [virtual]
 

Member Function Documentation

virtual void CTDLearner::addETraces (CStateCollection *oldState, CStateCollection *newState, CAction *action)  [protected, virtual]
 

adds the current state to the etraces

Reimplemented in CAdvantageUpdating, and CTDGradientLearner.

CAgentController* CTDLearner::getEstimationPolicy ()

CAbstractQETraces* CTDLearner::getETraces ()

CAbstractQFunction* CTDLearner::getQFunction ()
 
virtual double CTDLearner::getResidual (double oldQ, double reward, int duration, double newQ)  [protected, virtual]
 

returns the temporal difference error residual

Reimplemented in CTDGradientLearner.

virtual double CTDLearner::getTemporalDifference (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)  [protected, virtual]
 

calculates the temporal difference

Reimplemented in CAdvantageUpdating, and CAdvantageLearner.

virtual void CTDLearner::intermediateStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)  [virtual]
 

Updates the Q-Function for an intermediate step.

Since the intermediate steps are not really members of the episode, they need special treatment for the ETraces. The state of the intermediate step is added to the ETraces object as usual, but the multiplication of all other ETraces is skipped, and the Q-Function is not updated with the whole ETraces object; only the Q-value of the intermediate state is updated. This is done because the intermediate step is not directly reachable from the past states, and updating all intermediate steps via ETraces would falsify the Q-values, since the same step would be updated several times.

Reimplemented from CSemiMDPRewardListener.

virtual void CTDLearner::learnStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)  [protected, virtual]
 

Updates the Q-Function and manages the Etraces.

The learnStep function updates the Q-Function according to the step sample. It is called by the nextStep event. First, the last estimated action (a_{t+1}) is compared to the action actually executed. If the two actions are not equal, the ETraces have to be reset: the agent did not follow the policy being learned, so using the ETraces of older states would falsify the Q-values. If the two actions are equal, the ETraces are multiplied by lambda*gamma. After that, the ETrace of the current state-action pair is added to the ETraces object, and the next estimated action is calculated by the given policy. Now the temporal difference can be calculated as R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), or as R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) for multi-step actions. With the temporal difference available, all states in the ETraces are updated by the updateQFunction method of the Q-ETraces object.

Reimplemented in CAdvantageUpdating, and CTDResidualLearner.

virtual void CTDLearner::loadValues (FILE *stream)  [virtual]

virtual void CTDLearner::loadValues (char *filename)  [virtual]

virtual void CTDLearner::newEpisode ()  [virtual]
 

Resets the Etraces.

Reimplemented from CSemiMDPListener.

Reimplemented in CTDResidualLearner.

virtual void CTDLearner::nextStep (CStateCollection *oldState, CAction *action, double reward, CStateCollection *nextState)  [virtual]
 

Calls the update function learnStep.

Reimplemented from CSemiMDPRewardListener.

virtual void CTDLearner::saveValues (FILE *stream)  [virtual]

virtual void CTDLearner::saveValues (char *filename)  [virtual]

void CTDLearner::setAlpha (double alpha)
 

Sets the learning rate.

void CTDLearner::setEstimationPolicy (CAgentController *estimationPolicy)

void CTDLearner::setLambda (double lambda)
 

Sets the lambda parameter of the etraces.


Member Data Documentation

CActionDataSet* CTDLearner::actionDataSet [protected]
 
CAgentController* CTDLearner::estimationPolicy [protected]
 

estimation Policy - policy which is learned

CAbstractQETraces* CTDLearner::etraces [protected]
 
bool CTDLearner::externETraces [protected]
 

use extern eTraces

CAction* CTDLearner::lastEstimatedAction [protected]
 

The last action estimated by the policy.

CAbstractQFunction* CTDLearner::qfunction [protected]
 
