Reinforcement Learning Toolbox 2.0

CValueIteration Class Reference

The Value Iteration Algorithm.

#include <cdynamicprogramming.h>

Inheritance diagram for CValueIteration (image not reproduced). Classes shown in the diagram: CParameterObject, CParameters, CPrioritizedSweeping.


Public Member Functions

  CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
  Creates the Value Iteration algorithm with Q-Function learning and a greedy policy.

  CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
  Creates the Value Iteration algorithm with Q-Function learning and given policy for policy evaluation.

  CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
  Creates the Value Iteration algorithm with V-Function learning and a greedy policy.

  CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
  Creates the Value Iteration algorithm with V-Function learning and a given policy for policy evaluation.

virtual  ~CValueIteration ()
virtual void  updateFeature (int feature)
  Updates the given feature.

void  updateFirstFeature ()
  Updates the first feature from the list.

void  addPriority (int feature, double priority)
  Adds the given priority to the given feature.

void  addPriorities (CFeatureList *featList)
  Adds the priorities of all features in the feature list.

CAbstractFeatureStochasticModel *  getTheoreticalModel ()
CAbstractVFunction *  getVFunction ()
CFeatureQFunction *  getQFunction ()
CStochasticPolicy *  getStochasticPolicy ()
int  getMaxListSize ()
void  setMaxListSize (int maxListSize)
void  doUpdateSteps (int k)
  Updates the first k states in the priority list.

void  doUpdateStepsUntilEmptyList (int k)
  Updates the states from the priority list until it is empty.

void  doUpdateBackwardStates (int state)
  Updates all backward states of the given state.



Protected Member Functions

virtual double  getPriority (CTransition *trans, double bellE)
  Returns the priority of a specific transition given the Bellman error.

void  init (CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)


Protected Attributes

CAbstractVFunction *  vFunction
  The used V-Function.

CAbstractVFunction *  vFunctionFromQFunction
  V-Function used for the new value calculation when using Q-Function learning.

CFeatureQFunction *  qFunction
  The used Q-Function.

CQFunctionFromStochasticModel *  qFunctionFromVFunction
  Q-Function for the action value calculation when using V-Function learning.

CAbstractFeatureStochasticModel *  model
  The stochastic model of the learning problem.

CFeatureRewardFunction *  rewardModel
  Reward function of the learning problem.

CActionSet *  actions
  The actions used by the value iteration.

bool  learnVFunction
  Selects whether the V-Function or the Q-Function is learned.

CState *  discState
  Temporary state object.

CFeatureList *  priorityList
  Sorted list of the priorities.

CStochasticPolicy *  stochPolicy
  The stochastic policy which is used.


Detailed Description

The Value Iteration Algorithm.

Value Iteration calculates the value function of an arbitrary policy for a given learning problem. It expects a stochastic model of the learning problem to be given, so if you need to learn the model as well, use the prioritized sweeping algorithm. The value iteration classes of the toolbox provide both V-Function learning and Q-Function learning.

For V-Function learning, value iteration uses the update rule $V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P(s'|s,a) \, (R(s,a,s') + \gamma V_k(s'))$, where $\pi$ is a stochastic policy. For Q-Function learning it uses $Q_{k+1}(s,a) = \sum_{s'} P(s'|s,a) \, (R(s,a,s') + \gamma V_k(s'))$, where $V_k(s') = \sum_a Q_k(s',a) \, \pi(s',a)$. If you repeat that step arbitrarily often, the update converges to the value function of the policy. Usually a greedy policy is used for learning, since you want the optimal value function, but you can also evaluate some other, possibly self-coded policy (as long as it implements the interface CStochasticPolicy).

Dynamic programming is usually a safe way to obtain the optimal value function, but it is also very CPU-intensive, so it matters a great deal which state is updated: in most states the update is very small or even zero. The class CValueIteration therefore maintains a priority list of the states, indicating which state has to be updated first. When a state is updated according to the given rules, the error of its former value is calculated, and then every state in the backward list of the updated state from the stochastic model (that is, every state which leads to the updated state) has its priority increased by the value error times the probability of that (backward) transition. This concept comes from prioritized sweeping. As a result, the states whose values are likely to change considerably get updated first.

The class provides functions for updating the states in the priority list k times (if the list is empty, a random state is chosen), for updating the states until the list is empty, and for updating a single given state. To give the algorithm a hint where to start, you can also update all features in the backward transitions of a specific state. For the priority list the algorithm uses a sorted feature list.
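
The V-Function update rule above can be illustrated with a minimal tabular sketch. It does not use the toolbox classes: `Outcome`, `TabularModel`, `TabularPolicy` and `backupV` are hypothetical names introduced only for this example, and the model, rewards and policy are assumed to be plain lookup tables.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tabular stand-ins for the toolbox's model, reward and policy
// objects; none of these names come from the toolbox itself.
struct Outcome {
    int nextState;       // s'
    double probability;  // P(s'|s,a)
    double reward;       // R(s,a,s')
};

// model[s][a] lists the possible outcomes of taking action a in state s.
using TabularModel = std::vector<std::vector<std::vector<Outcome>>>;
// policy[s][a] is pi(s,a); each row is assumed to sum to 1.
using TabularPolicy = std::vector<std::vector<double>>;

// One backup of state s:
// V_{k+1}(s) = sum_a pi(s,a) * sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V_k(s'))
double backupV(const TabularModel& model, const TabularPolicy& policy,
               const std::vector<double>& v, int s, double gamma) {
    assert(model.size() == v.size() && policy.size() == v.size());
    double newValue = 0.0;
    for (std::size_t a = 0; a < model[s].size(); ++a) {
        double actionValue = 0.0;
        for (const Outcome& o : model[s][a]) {
            actionValue += o.probability * (o.reward + gamma * v[o.nextState]);
        }
        newValue += policy[s][a] * actionValue;
    }
    return newValue;
}
```

The difference between the returned value and the old `v[s]` plays the role of the value error that drives the priority updates described above.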

You can choose whether to learn a V-Function or directly a Q-Function by passing either a Q-Function or a V-Function to the constructor. Learning a Q-Function has the advantage that this Q-Function can be used by other learning algorithms too. If you use a V-Function, the policies need a Q-Function derived from it; this is done by CQFunctionFromStochasticModel, which takes the stochastic model and a V-Function and calculates the Q-Values when they are requested. The update process works as follows:
  • Learning with the V-Function: The new value of the state is calculated by $V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P(s'|s,a) \, (R(s,a,s') + \gamma V_k(s'))$, then the error is calculated and used for priority updates.
  • Learning with the Q-Function: The value of the state is calculated by $V_k(s') = \sum_a Q_k(s',a) \, \pi(s',a)$; this is done by the class CVFunctionFromQFunction. Then each action value is updated by $Q_{k+1}(s,a) = \sum_{s'} P(s'|s,a) \, (R(s,a,s') + \gamma V_k(s'))$. After the update, the new V-Value is calculated to get the error needed for the priorities.
If you use Q-Function learning, a CVFunctionFromQFunction is used to calculate the value; if you use V-Function learning, a CQFunctionFromStochasticModel is used to calculate the action values.
You can also specify the policy whose V-Function or Q-Function you want to learn, so you can do policy evaluation if the policy is fixed. The default policy is the greedy policy, in which case you calculate the optimal value function.
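
The Q-Function case can be sketched in the same tabular style. All names here (`Outcome`, `TabularModel`, `Table`, `valueFromQ`, `backupQ`) are hypothetical and not part of the toolbox. The sketch assumes a fixed policy table (a greedy policy would have to be recomputed from the Q-Values) and reports the error as the signed change of the state value, since the text does not say whether a signed or absolute error is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tabular stand-ins; not toolbox classes.
struct Outcome {
    int nextState;       // s'
    double probability;  // P(s'|s,a)
    double reward;       // R(s,a,s')
};
using TabularModel = std::vector<std::vector<std::vector<Outcome>>>;
using Table = std::vector<std::vector<double>>;  // indexed [state][action]

// V_k(s) = sum_a Q_k(s,a) * pi(s,a)
double valueFromQ(const Table& q, const Table& policy, int s) {
    assert(q[s].size() == policy[s].size());
    double value = 0.0;
    for (std::size_t a = 0; a < q[s].size(); ++a) {
        value += q[s][a] * policy[s][a];
    }
    return value;
}

// Updates every Q(s,a) of state s in place and returns the change of V(s)
// (assumption: the signed difference serves as the error).
double backupQ(const TabularModel& model, const Table& policy, Table& q,
               int s, double gamma) {
    const double oldValue = valueFromQ(q, policy, s);
    // Compute all new action values first so that each one uses V_k.
    std::vector<double> newQ(q[s].size(), 0.0);
    for (std::size_t a = 0; a < model[s].size(); ++a) {
        for (const Outcome& o : model[s][a]) {
            newQ[a] += o.probability *
                       (o.reward + gamma * valueFromQ(q, policy, o.nextState));
        }
    }
    q[s] = newQ;
    return valueFromQ(q, policy, s) - oldValue;
}
```

Buffering the new action values before writing them back keeps a self-transition from mixing old and new Q-Values within a single update.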

Constructor & Destructor Documentation

CValueIteration::CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
 

Creates the Value Iteration algorithm with Q-Function learning and a greedy policy.

CValueIteration::CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
 

Creates the Value Iteration algorithm with Q-Function learning and a given policy for policy evaluation.

CValueIteration::CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
 

Creates the Value Iteration algorithm with V-Function learning and a greedy policy.

CValueIteration::CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
 

Creates the Value Iteration algorithm with V-Function learning and a given policy for policy evaluation.

virtual CValueIteration::~CValueIteration () [virtual]
 

Member Function Documentation

void CValueIteration::addPriorities (CFeatureList *featList)
 

Adds the priorities of all features in the feature list.

void CValueIteration::addPriority (int feature, double priority)
 

Adds the given priority to the given feature.

void CValueIteration::doUpdateBackwardStates (int state)
 

Updates all backward states of the given state.

Used to give the algorithm a hint where to start: due to the updates, all backward states of the backward states get added to the priority list (as long as they produced a Bellman error).

void CValueIteration::doUpdateSteps (int k)
 

Updates the first k states in the priority list.

If the list is empty, a random state is chosen.
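
The behaviour described here (take the highest-priority entry, fall back to a random state when the list is empty) might look roughly like the following self-contained sketch. `PriorityList`, `popHighest` and `runUpdateSteps` are hypothetical names, a `std::map` stands in for the toolbox's sorted feature list, and the `update` callback stands in for updateFeature.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <random>

// Hypothetical priority list: feature index -> accumulated priority.
using PriorityList = std::map<int, double>;

// Removes and returns the feature with the highest priority.
int popHighest(PriorityList& list) {
    assert(!list.empty());
    auto best = list.begin();
    for (auto it = list.begin(); it != list.end(); ++it) {
        if (it->second > best->second) {
            best = it;
        }
    }
    const int feature = best->first;
    list.erase(best);
    return feature;
}

// Performs k updates. The callback may add new entries to the list,
// as an update of a state would do for its backward states.
void runUpdateSteps(PriorityList& list, int numFeatures, int k,
                    std::mt19937& rng,
                    const std::function<void(int)>& update) {
    assert(numFeatures > 0);
    std::uniform_int_distribution<int> anyFeature(0, numFeatures - 1);
    for (int i = 0; i < k; ++i) {
        // Empty list: pick a random feature, otherwise the most urgent one.
        const int feature = list.empty() ? anyFeature(rng) : popHighest(list);
        update(feature);
    }
}
```

A linear scan is used here only to keep the sketch short; a list kept sorted by priority makes taking the first element cheap.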

void CValueIteration::doUpdateStepsUntilEmptyList (int k)
 

Updates the states from the priority list until it is empty.

int CValueIteration::getMaxListSize ()
 
virtual double CValueIteration::getPriority (CTransition *trans, double bellE) [protected, virtual]
 

Returns the priority of a specific transition given the Bellman error.

The default priority calculation is trans->getProbapility() * bellE, but subclasses can change this.
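
As an illustration of how a subclass could change the priority calculation, the following sketch uses small stand-in types instead of the toolbox classes. `Transition`, `PriorityRule` and `AbsolutePriorityRule` are hypothetical, and using the magnitude of the error is only one possible variant, not something the toolbox is known to do.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for CTransition; only the probability is modelled.
struct Transition {
    double probability;
};

// Hypothetical stand-in for the part of CValueIteration that computes priorities.
class PriorityRule {
public:
    virtual ~PriorityRule() = default;

    // Default rule described above: transition probability times Bellman error.
    virtual double getPriority(const Transition& trans, double bellE) const {
        return trans.probability * bellE;
    }
};

// Example variant: use the magnitude of the error, so that negative errors
// raise the priority as well.
class AbsolutePriorityRule : public PriorityRule {
public:
    double getPriority(const Transition& trans, double bellE) const override {
        return trans.probability * std::fabs(bellE);
    }
};
```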

CFeatureQFunction* CValueIteration::getQFunction ()
 
CStochasticPolicy* CValueIteration::getStochasticPolicy ()
 
CAbstractFeatureStochasticModel* CValueIteration::getTheoreticalModel ()
 
CAbstractVFunction* CValueIteration::getVFunction ()
 
void CValueIteration::init (CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel) [protected]
 
void CValueIteration::setMaxListSize (int maxListSize)
 
virtual void CValueIteration::updateFeature (int feature) [virtual]
 

Updates the given feature.

Clears the feature from the priority list and then makes either a Q-Function or a V-Function update. The update process works as follows:

  • Learning with the V-Function: The new value of the state is calculated by $V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P(s'|s,a) \, (R(s,a,s') + \gamma V_k(s'))$, then the error is calculated and used for priority updates.
  • Learning with the Q-Function: The value of the state is calculated by $V_k(s') = \sum_a Q_k(s',a) \, \pi(s',a)$; this is done by the class CVFunctionFromQFunction. Then each action value is updated by $Q_{k+1}(s,a) = \sum_{s'} P(s'|s,a) \, (R(s,a,s') + \gamma V_k(s'))$. After the update, the new V-Value is calculated to get the error needed for the priorities.
After that, all backward states are fetched from the model and added to the priority list with the priority getPriority(transition, bellError), which by default is transition->getPropability() * bellError.
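
The bookkeeping around one update (remove the state from the list, then raise the priority of its backward states) can be sketched as follows. `BackwardTransition`, `PriorityList` and `propagateError` are hypothetical names, the backward transitions are assumed to be available as a simple table, and the sketch assumes the Bellman error passed in is already non-negative.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical backward transition: a predecessor state and the probability
// of reaching the updated state from it.
struct BackwardTransition {
    int fromState;
    double probability;
};

// Hypothetical priority list: state index -> accumulated priority.
using PriorityList = std::map<int, double>;

// Called after state s has been updated with the given Bellman error.
// backward[s] lists the transitions that lead into s.
void propagateError(const std::vector<std::vector<BackwardTransition>>& backward,
                    PriorityList& priorities, int s, double bellError) {
    assert(bellError >= 0.0);  // assumption: error is given as a magnitude
    priorities.erase(s);       // the updated state leaves the list
    if (bellError == 0.0) {
        return;                // nothing changed, so nothing to propagate
    }
    for (const BackwardTransition& t : backward[s]) {
        // Default priority rule: transition probability times Bellman error.
        priorities[t.fromState] += t.probability * bellError;
    }
}
```

Because the state is erased before its predecessors are processed, a state with a self-transition is put back into the list with a fresh priority.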

void CValueIteration::updateFirstFeature ()
 

Updates the first feature from the list.


Member Data Documentation

CActionSet* CValueIteration::actions [protected]
 

The actions used by the value iteration.

CState* CValueIteration::discState [protected]
 

Temporary state object.

bool CValueIteration::learnVFunction [protected]
 

Selects whether the V-Function or the Q-Function is learned.

CAbstractFeatureStochasticModel* CValueIteration::model [protected]
 

The stochastic model of the learning problem.

CFeatureList* CValueIteration::priorityList [protected]
 

Sorted list of the priorities.

CFeatureQFunction* CValueIteration::qFunction [protected]
 

The used Q-Function.

CQFunctionFromStochasticModel* CValueIteration::qFunctionFromVFunction [protected]
 

Q-Function for the action value calculation when using V-Function learning.

CFeatureRewardFunction* CValueIteration::rewardModel [protected]
 

Reward function of the learning problem.

CStochasticPolicy* CValueIteration::stochPolicy [protected]
 

The stochastic policy which is used.

CAbstractVFunction* CValueIteration::vFunction [protected]
 

The used V-Function.

CAbstractVFunction* CValueIteration::vFunctionFromQFunction [protected]
 

V-Function used for the new value calculation when using Q-Function learning.


The documentation for this class was generated from the following file: