Simulation-based optimization of Markov decision processes
Author(s): Marbach, Peter, 1966-; Tsitsiklis, John N.
Markov decision processes have been a popular paradigm for sequential decision making under uncertainty. Dynamic programming provides a framework for studying such problems, as well as for devising algorithms to compute an optimal control policy. Dynamic programming methods rely on a suitably defined value function that has to be computed for every state in the state space. However, many interesting problems involve very large state spaces (the "curse of dimensionality"), which prohibits the application of dynamic programming. In addition, dynamic programming assumes the availability of an exact model, in the form of transition probabilities (the "curse of modeling"). In many situations, such a model is not available and one must resort to simulation or experimentation with an actual system. For all of these reasons, dynamic programming in its pure form may be inapplicable. In this thesis we study an approach for overcoming these difficulties where we use (a) compact (parametric) representations of the control policy, thus avoiding the curse of dimensionality, and (b) simulation to estimate quantities of interest, thus avoiding model-based computations and the curse of modeling. Furthermore, our approach is not limited to Markov decision processes, but applies to general Markov reward processes for which the transition probabilities and the one-stage rewards depend on a tunable parameter vector θ. We propose gradient-type algorithms for updating θ based on the simulation of a single sample path, so as to improve a given performance measure. As possible performance measures, we consider the weighted reward-to-go and the average reward. The corresponding algorithms (a) can be implemented online and update the parameter vector either at visits to a certain state or at every time step, and (b) have the property that the gradient (with respect to θ) of the performance measure converges to 0 with probability 1.
This is the strongest possible result for gradient-related stochastic approximation algorithms.
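To make the abstract's idea concrete, here is a minimal sketch of a simulation-based, gradient-type update of a tunable parameter θ along a single sample path of a toy Markov reward process. This is an illustration in the spirit of the approach, not the thesis's exact algorithm: the two-state chain, the sigmoid parameterization, the reward values, and all function names (`p_move`, `step`, `optimize`) are assumptions chosen for the example. The eligibility trace is reset at visits to a designated recurrent state, mirroring the "update at visits to a certain state" variant mentioned above.

```python
import math
import random

# Toy Markov reward process on states {0, 1}. The probability of moving
# from state 0 to state 1 depends on a scalar tunable parameter theta.
REWARD = {0: 1.0, 1: 2.0}  # one-stage rewards (illustrative values)

def p_move(theta):
    """Transition probability 0 -> 1; sigmoid keeps it in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-theta))

def step(state, theta, rng):
    """Simulate one transition; return (next_state, score), where score
    is d/dtheta of the log-probability of the sampled transition."""
    if state == 0:
        p = p_move(theta)
        if rng.random() < p:
            return 1, (1.0 - p)   # d/dtheta log p
        return 0, -p              # d/dtheta log (1 - p)
    # From state 1 we return to state 0 with fixed probability 0.5,
    # independent of theta, so the score contribution is zero.
    return (0 if rng.random() < 0.5 else 1), 0.0

def optimize(theta=0.0, step_size=0.01, n_steps=200_000, seed=0):
    """Gradient-type update of theta from a single simulated sample path,
    aiming to improve the average reward."""
    rng = random.Random(seed)
    state = 0
    avg = 0.0   # running estimate of the average reward
    z = 0.0     # eligibility trace: accumulated score since last reset
    for t in range(1, n_steps + 1):
        nxt, score = step(state, theta, rng)
        r = REWARD[state]
        avg += (r - avg) / t
        z += score
        # Push theta in the direction that raises the centered reward
        # observed along the sample path (likelihood-ratio estimate).
        theta += step_size * (r - avg) * z
        if nxt == 0:  # visit to the recurrent state: reset the trace
            z = 0.0
        state = nxt
    return theta, avg
```

In this toy chain, raising θ increases the time spent in the higher-reward state 1, so the update drives θ upward and the running average reward toward its maximum; no model of the transition probabilities is needed beyond the ability to simulate and score sampled transitions.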
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 127-129).
Department: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science