Simulation-based optimization of Markov decision processes

Marbach, Peter, 1966-

Author(s)

Marbach, Peter, 1966-

DownloadFull printable version (11.94Mb)

Advisor

John N. Tsitsiklis.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Metadata

Show full item record

Abstract

Markov decision processes have been a popular paradigm for sequential decision making under uncertainty. Dynamic programming provides a framework for studying such problems, as well as for devising algorithms to compute an optimal control policy. Dynamic programming methods rely on a suitably defined value function that has to be computed for every state in the state space. However, many interesting problems involve very large state spaces ( "curse of dimensionality"), which prohibits the application of dynamic programming. In addition, dynamic programming assumes the availability of an exact model, in the form of transition probabilities ( "curse of modeling"). In many situations, such a model is not available and one must resort to simulation or experimentation with an actual system. For all of these reasons, dynamic programming in its pure form may be inapplicable. In this thesis we study an approach for overcoming these difficulties where we use (a) compact (parametric) representations of the control policy, thus avoiding the curse of dimensionality, and (b) simulation to estimate quantities of interest, thus avoiding model-based computations and the curse of modeling. ,Furthermore, .our approach is not limited to Markov decision processes, but applies to general Markov reward processes for which the transition probabilities and the one-stage rewards depend on a tunable parameter vector 0. We propose gradient-type algorithms for updating 0 based on the simulation of a single sample path, so as to improve a given performance measure. As possible performance measures, we consider the weighted reward-to-go and the average reward. The corresponding algorithms(a) can be implemented online and update the parameter vector either at visits to a certain state; or at every time step . . . ,(b) have the property that the gradient ( with respect to 0) of the performance 'measure converges to O with probability 1. This is the strongest possible result · for gradient:-related stochastic approximation algorithms.

Description

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.

Includes bibliographical references (p. 127-129).

Date issued

1998

URI

http://hdl.handle.net/1721.1/9660

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science

Collections

Doctoral Theses