CValueIteration Class Reference
The Value Iteration algorithm.
#include <cdynamicprogramming.h>
Public Member Functions
CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
    Creates the Value Iteration algorithm with Q-Function learning and a greedy policy.
CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
    Creates the Value Iteration algorithm with Q-Function learning and a given policy for policy evaluation.
CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
    Creates the Value Iteration algorithm with V-Function learning and a greedy policy.
CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
    Creates the Value Iteration algorithm with V-Function learning and a given policy for policy evaluation.
virtual ~CValueIteration ()
virtual void updateFeature (int feature)
    Updates the given feature.
void updateFirstFeature ()
    Updates the first feature from the list.
void addPriority (int feature, double priority)
    Adds the given priority to the given feature.
void addPriorities (CFeatureList *featList)
    Adds the priorities of all features in the feature list.
CAbstractFeatureStochasticModel * getTheoreticalModel ()
CAbstractVFunction * getVFunction ()
CFeatureQFunction * getQFunction ()
CStochasticPolicy * getStochasticPolicy ()
int getMaxListSize ()
void setMaxListSize (int maxListSize)
void doUpdateSteps (int k)
    Updates the first k states in the priority list.
void doUpdateStepsUntilEmptyList (int k)
    Updates the states from the priority list until it is empty.
void doUpdateBackwardStates (int state)
    Updates all backward states of the given state.
Protected Member Functions
virtual double getPriority (CTransition *trans, double bellE)
    Returns the priority of a specific transition given the Bellman error.
void init (CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
Protected Attributes
CAbstractVFunction * vFunction
    The V-Function being used.
CAbstractVFunction * vFunctionFromQFunction
    V-Function used for the new Value calculation when using Q-Function learning.
CFeatureQFunction * qFunction
    The Q-Function being used.
CQFunctionFromStochasticModel * qFunctionFromVFunction
    Q-Function for the action value calculation when using V-Function learning.
CAbstractFeatureStochasticModel * model
    The stochastic model of the learning problem.
CFeatureRewardFunction * rewardModel
    Reward function of the learning problem.
CActionSet * actions
    The actions used by the value iteration.
bool learnVFunction
    Whether a V-Function (true) or a Q-Function is learned.
CState * discState
    Temporary state object.
CFeatureList * priorityList
    Sorted list of the priorities.
CStochasticPolicy * stochPolicy
    The stochastic policy being used.
Detailed Description
The Value Iteration Algorithm.

Value Iteration calculates the value function of an arbitrary policy for a given learning problem. It expects a stochastic model of the learning problem, so if you need to learn the model as well, use the prioritized sweeping algorithm. The value iteration classes of the toolbox provide both V-Function learning and Q-Function learning. For value function learning, value iteration uses the update rule

$V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P(s'|s,a) \left( R(s,a,s') + \gamma V_k(s') \right)$

(where $\pi$ is a stochastic policy). For the Q-Value learning case it uses

$Q(s,a) = \sum_{s'} P(s'|s,a) \left( R(s,a,s') + \gamma V_k(s') \right)$, where $V_k(s') = \sum_a Q(s',a) \pi(s',a)$.

If this step is repeated often enough, the update converges to the value function of the policy. Usually a greedy policy is used for learning, since you want the optimal value function, but you can also choose to evaluate the value function of some other, possibly self-coded policy (as long as it implements the interface CStochasticPolicy).

Dynamic programming approaches are usually a reliable way to obtain the optimal value function, but they are very CPU-intensive, so the choice of which state to update matters: in most states the update is very small or even zero. For this reason the class CValueIteration also maintains a priority list of the states, indicating which state has to be updated first. When a state is updated according to the given rules, the error of the former value is calculated, and then every state in the backward list of the updated state from the stochastic model (that is, every state which leads to the updated state) has its priority increased by value error * prop, where prop is the probability of that (backward) transition. This concept comes from prioritized sweeping. As a result, the states whose values are likely to change considerably are updated first.

The class provides functions for updating the states in the priority list k times (if the list is empty, a random state is chosen), for updating the states until the list is empty, and for updating a single given state. To give the algorithm a hint where to start, you can also update all features in the backward transitions of a specific state. For the priority list the algorithm uses a sorted feature list.
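
The prioritized update described above can be sketched on a small tabular problem. This is only an illustration under stated assumptions, not the toolbox implementation: `Outcome`, `ToyModel`, `actionValue`, `backupState` and `prioritizedSweep` are hypothetical names, the policy is assumed to be greedy, the priority list is a plain vector scanned for its maximum instead of a sorted feature list, and the loop stops when no priority exceeds a tolerance rather than picking a random state.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical tabular stand-ins; none of these types belong to the toolbox.
struct Outcome {
    int next;       // successor state s'
    double prob;    // P(s'|s,a)
    double reward;  // R(s,a,s')
};

struct ToyModel {
    int numStates;
    int numActions;
    // outcomes[s][a] lists every possible successor of action a in state s.
    std::vector<std::vector<std::vector<Outcome>>> outcomes;
};

// sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
double actionValue(const ToyModel &m, const std::vector<double> &V,
                   int s, int a, double gamma) {
    double q = 0.0;
    for (const Outcome &o : m.outcomes[s][a])
        q += o.prob * (o.reward + gamma * V[o.next]);
    return q;
}

// Greedy backup of state s; returns the absolute change of V[s].
double backupState(const ToyModel &m, std::vector<double> &V, int s, double gamma) {
    double best = actionValue(m, V, s, 0, gamma);
    for (int a = 1; a < m.numActions; ++a)
        best = std::max(best, actionValue(m, V, s, a, gamma));
    const double error = std::fabs(best - V[s]);
    V[s] = best;
    return error;
}

// Backs up the highest-priority state until every priority is <= tolerance
// or maxUpdates is reached. Returns the number of backups performed.
int prioritizedSweep(const ToyModel &m, std::vector<double> &V, double gamma,
                     int startState, double tolerance, int maxUpdates) {
    assert(gamma >= 0.0 && gamma <= 1.0);
    assert(m.numActions > 0);
    std::vector<double> priority(m.numStates, 0.0);
    priority[startState] = 1.0;  // seed: the hint where to start
    int updates = 0;
    while (updates < maxUpdates) {
        const int s = static_cast<int>(
            std::max_element(priority.begin(), priority.end()) - priority.begin());
        if (priority[s] <= tolerance) break;
        priority[s] = 0.0;  // the state leaves the list before it is updated
        const double error = backupState(m, V, s, gamma);
        ++updates;
        // Every state that can reach s gains priority error * P(s|pred,a).
        for (int pred = 0; pred < m.numStates; ++pred)
            for (int a = 0; a < m.numActions; ++a)
                for (const Outcome &o : m.outcomes[pred][a])
                    if (o.next == s) priority[pred] += error * o.prob;
    }
    return updates;
}
```

On a three-state chain (0 leads to 1, 1 leads to a terminal state 2 with reward 1) seeded at state 1, the sketch needs only two backups, because the error of state 1 is passed on to its single predecessor and no other state ever receives a priority.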
- You can choose whether to learn a Value-Function or directly a Q-Function by providing either a Q-Function or a Value-Function to the constructor. Learning a Q-Function has the advantage that the Q-Function can also be used by other learning algorithms. If you use a V-Function, the policies need a Q-Function derived from the V-Function; this is done by CQFunctionFromStochasticModel, which takes the stochastic model and a V-Function and calculates the Q-Values when they are requested. The update process works as follows:
- Learning with the V-Function: the new value of the state is calculated by $V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P(s'|s,a) \left( R(s,a,s') + \gamma V_k(s') \right)$; then the error is calculated and used for the priority updates.
- Learning with the Q-Function: the value of a state is calculated by $V_k(s') = \sum_a Q(s',a) \pi(s',a)$, which is done by the class CVFunctionFromQFunction. Then each action value is updated by $Q(s,a) = \sum_{s'} P(s'|s,a) \left( R(s,a,s') + \gamma V_k(s') \right)$. After the update, the new V-Value is calculated to get the error needed for the priorities.
If you use Q-Function learning, a CVFunctionFromQFunction is used for the calculation of the Value; if you use V-Function learning, a CFeatureQFunctionFromVFunction is used for the calculation of the action Values.
You can also specify the policy whose Value or Q-Function you want to learn, so you can do policy evaluation if the policy is fixed. The standard policy is the greedy policy, in which case you calculate the optimal Value function.
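
The Q-Function branch of the update process, evaluated for a fixed stochastic policy, might look like the following sketch. It is a standalone illustration and does not use the toolbox classes: `Outcome`, `Table`, `Model`, `stateValue` and `backupQ` are hypothetical names, and the policy is assumed to be given as a table of action probabilities.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only.
struct Outcome {
    int next;       // successor state s'
    double prob;    // P(s'|s,a)
    double reward;  // R(s,a,s')
};
using Table = std::vector<std::vector<double>>;                // [state][action]
using Model = std::vector<std::vector<std::vector<Outcome>>>;  // [state][action] -> outcomes

// V(s) = sum_a Q(s,a) * pi(s,a)
double stateValue(const Table &Q, const Table &pi, int s) {
    assert(Q[s].size() == pi[s].size());
    double v = 0.0;
    for (std::size_t a = 0; a < Q[s].size(); ++a)
        v += Q[s][a] * pi[s][a];
    return v;
}

// Updates every Q(s,a) of state s from the current values of its successors
// and returns |V_new(s) - V_old(s)|, the error used for the priorities.
double backupQ(const Model &model, Table &Q, const Table &pi, int s, double gamma) {
    const double oldV = stateValue(Q, pi, s);
    std::vector<double> newQ(Q[s].size(), 0.0);
    for (std::size_t a = 0; a < newQ.size(); ++a)
        for (const Outcome &o : model[s][a])
            newQ[a] += o.prob * (o.reward + gamma * stateValue(Q, pi, o.next));
    Q[s] = newQ;  // written back only after all actions used the old values
    return std::fabs(stateValue(Q, pi, s) - oldV);
}
```

With a uniform policy over two actions that yield rewards 1 and 3 and both lead to a terminal state, one backup sets the two action values to 1 and 3 and reports an error of 2, the change of the policy-weighted state value.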
Constructor & Destructor Documentation
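CValueIteration::CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
    Creates the Value Iteration algorithm with Q-Function learning and a greedy policy.
CValueIteration::CValueIteration (CFeatureQFunction *qFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
    Creates the Value Iteration algorithm with Q-Function learning and a given policy for policy evaluation.
CValueIteration::CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel)
    Creates the Value Iteration algorithm with V-Function learning and a greedy policy.
CValueIteration::CValueIteration (CFeatureVFunction *vFunction, CAbstractFeatureStochasticModel *model, CFeatureRewardFunction *rewardModel, CStochasticPolicy *stochPolicy)
    Creates the Value Iteration algorithm with V-Function learning and a given policy for policy evaluation.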
virtual CValueIteration::~CValueIteration () [virtual]
Member Function Documentation
void CValueIteration::addPriorities (CFeatureList *featList)
    Adds the priorities of all features in the feature list.
void CValueIteration::addPriority (int feature, double priority)
    Adds the given priority to the given feature.
void CValueIteration::doUpdateBackwardStates (int state)
    Updates all backward states of the given state.
    Used to give the algorithm a hint where to start: due to these updates, all backward states of the backward states are added to the priority list (as long as they produced a Bellman error).
void CValueIteration::doUpdateSteps (int k)
    Updates the first k states in the priority list.
    If the list is empty, a random state is chosen.
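
The selection step behind this function (take the state with the highest priority, or fall back to a random state when the list is empty) can be sketched as follows. The names `PriorityList` and `takeNextState` are hypothetical, and a `std::map` scanned for its maximum stands in for the toolbox's sorted feature list; how the toolbox draws its random state is not documented here, so the uniform draw is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <random>

// Hypothetical stand-in for the priority list: state index -> priority.
using PriorityList = std::map<int, double>;

// Removes and returns the state with the highest priority. If the list is
// empty, returns a uniformly drawn state index in [0, numStates).
int takeNextState(PriorityList &priorities, int numStates, std::mt19937 &rng) {
    assert(numStates > 0);
    if (priorities.empty()) {
        std::uniform_int_distribution<int> pick(0, numStates - 1);
        return pick(rng);
    }
    auto best = std::max_element(
        priorities.begin(), priorities.end(),
        [](const PriorityList::value_type &a, const PriorityList::value_type &b) {
            return a.second < b.second;
        });
    const int state = best->first;
    priorities.erase(best);  // the state leaves the list before it is updated
    return state;
}
```

Calling such a helper k times and updating the returned state each time would correspond to k update steps. A list that is kept sorted, as the class description states, would avoid the linear scan used here.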
void CValueIteration::doUpdateStepsUntilEmptyList (int k)
    Updates the states from the priority list until it is empty.
int CValueIteration::getMaxListSize ()
virtual double CValueIteration::getPriority (CTransition *trans, double bellE) [protected, virtual]
    Returns the priority of a specific transition given the Bellman error.
    The standard priority calculation is trans->getProbapility() * bellE, but subclasses can change this.
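
How a subclass could change the priority calculation is sketched below with standalone types. `ToyTransition`, `PriorityRule` and `ThresholdedPriorityRule` are hypothetical and only mirror the shape of the virtual function; they are not toolbox classes, and the threshold variant is an invented example of an alternative rule.

```cpp
#include <cassert>

// Hypothetical stand-in for a backward transition.
struct ToyTransition {
    int source;          // state the transition starts from
    double probability;  // probability of reaching the updated state
};

class PriorityRule {
public:
    virtual ~PriorityRule() = default;
    // Default rule as described above: transition probability * Bellman error.
    virtual double getPriority(const ToyTransition &t, double bellE) const {
        assert(t.probability >= 0.0 && t.probability <= 1.0);
        return t.probability * bellE;
    }
};

// Example override: priorities below a threshold are treated as zero.
class ThresholdedPriorityRule : public PriorityRule {
public:
    explicit ThresholdedPriorityRule(double minPriority) : minPriority_(minPriority) {}
    double getPriority(const ToyTransition &t, double bellE) const override {
        const double p = PriorityRule::getPriority(t, bellE);
        return p < minPriority_ ? 0.0 : p;
    }
private:
    double minPriority_;
};
```

With a transition probability of 0.25 and a Bellman error of 2, the default rule gives 0.5, while a rule with threshold 0.6 returns 0 for the same input.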
void CValueIteration::setMaxListSize (int maxListSize)
virtual void CValueIteration::updateFeature (int feature) [virtual]
    Updates the given feature.
    Clears the feature from the priority list and then performs either a Q-Function or a V-Function update. The update process works as follows:
    - Learning with the V-Function: the new value of the state is calculated by $V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P(s'|s,a) \left( R(s,a,s') + \gamma V_k(s') \right)$; then the error is calculated and used for the priority updates.
    - Learning with the Q-Function: the value of a state is calculated by $V_k(s') = \sum_a Q(s',a) \pi(s',a)$, which is done by the class CVFunctionFromQFunction. Then each action value is updated by $Q(s,a) = \sum_{s'} P(s'|s,a) \left( R(s,a,s') + \gamma V_k(s') \right)$. After the update, the new V-Value is calculated to get the error needed for the priorities.
    After that, all backward states are fetched from the model and added to the priority list with the priority getPriority(transition, bellError), which by default is transition->getPropability() * bellError.
void CValueIteration::updateFirstFeature ()
    Updates the first feature from the list.
Member Data Documentation
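CActionSet* CValueIteration::actions [protected]
    The actions used by the value iteration.
CFeatureList* CValueIteration::priorityList [protected]
    Sorted list of the priorities.
CQFunctionFromStochasticModel* CValueIteration::qFunctionFromVFunction [protected]
    Q-Function for the action value calculation when using V-Function learning.
CFeatureRewardFunction* CValueIteration::rewardModel [protected]
    Reward function of the learning problem.
CStochasticPolicy* CValueIteration::stochPolicy [protected]
    The stochastic policy being used.
CAbstractVFunction* CValueIteration::vFunctionFromQFunction [protected]
    V-Function used for the new Value calculation when using Q-Function learning.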