Show simple item record

dc.contributor.advisor: Roy E. Welsch (en_US)
dc.contributor.author: Tracà, Stefano (en_US)
dc.contributor.other: Massachusetts Institute of Technology. Operations Research Center (en_US)
dc.date.accessioned: 2018-11-28T15:44:31Z
dc.date.available: 2018-11-28T15:44:31Z
dc.date.copyright: 2018 (en_US)
dc.date.issued: 2018 (en_US)
dc.identifier.uri: http://hdl.handle.net/1721.1/119353
dc.description: Thesis: Ph. D., Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2018. (en_US)
dc.description: Cataloged from PDF version of thesis. (en_US)
dc.description: Includes bibliographical references (pages 65-70). (en_US)
dc.description.abstract: In retail, there are predictable yet dramatic time-dependent patterns in customer behavior, such as periodic changes in the number of visitors, or increases in customers just before major holidays. The standard paradigm of multi-armed bandit analysis does not take these known patterns into account. This means that for applications in retail, where prices are fixed for periods of time, current bandit algorithms will not suffice. This work provides a framework and methods that take the time-dependent patterns into account. In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher-reward periods, and more exploration occurs in periods of low reward. In order to understand why regret is reduced with the corrected methods, a set of bounds on the expected regret is presented, and insights into why we would want to exploit during periods of high reward are discussed. When the set of available options changes over time, mortal bandit algorithms have proven extremely useful in a number of settings, for example, providing news article recommendations or running automated online advertising campaigns. Previous work on this problem showed how to regulate exploration of new arms when they have recently appeared, but those methods do not adapt when arms are about to disappear. Since in most applications we can determine, either exactly or approximately, when arms will disappear, we can leverage this information to improve performance: we should not be exploring arms that are about to disappear. For this framework as well, adapted algorithms and regret bounds are provided. The proposed methods perform well in experiments, and were inspired by a high-scoring entry in the Exploration and Exploitation 3 contest using data from Yahoo! Front Page. That entry heavily used time-series methods to regulate greed over time, which was substantially more effective than other contextual bandit methods. (A toy sketch of both mechanisms follows the record below.) (en_US)
dc.description.statementofresponsibility: by Stefano Tracà (en_US)
dc.format.extent: 173 pages (en_US)
dc.language.iso: eng (en_US)
dc.publisher: Massachusetts Institute of Technology (en_US)
dc.rights: MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source, but further reproduction or distribution in any format is prohibited without written permission. (en_US)
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 (en_US)
dc.subject: Operations Research Center (en_US)
dc.title: Regulating exploration in multi-armed bandit problems with time patterns and dying arms (en_US)
dc.type: Thesis (en_US)
dc.description.degree: Ph. D. (en_US)
dc.contributor.department: Massachusetts Institute of Technology. Operations Research Center
dc.contributor.department: Sloan School of Management
dc.identifier.oclc: 1065541758 (en_US)
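
The abstract above describes two levers: shrinking exploration during known high-reward periods, and withholding exploration from arms that are known to be about to disappear. The sketch below is a minimal Python illustration of those two ideas, not the thesis's actual algorithms: the epsilon-greedy base policy, the linear schedule in regulated_epsilon, and the lookahead gate for dying arms are all assumptions introduced here for illustration.

    import random

    def regulated_epsilon(base_eps, intensity, peak_intensity):
        # Assumed linear schedule: less exploration (more greed) when the
        # known reward intensity is near its peak; eps -> 0 at the peak.
        return base_eps * (1.0 - intensity / peak_intensity)

    def choose_arm(est_means, t, intensity, peak_intensity,
                   death_times=None, lookahead=5, base_eps=0.3):
        # Epsilon-greedy step with time-regulated greed and a dying-arm
        # gate: arms that vanish within the lookahead window are never
        # chosen for exploration (hypothetical gate, for illustration).
        eps = regulated_epsilon(base_eps, intensity, peak_intensity)
        arms = range(len(est_means))
        explorable = [a for a in arms
                      if death_times is None or death_times[a] - t > lookahead]
        if explorable and random.random() < eps:
            return random.choice(explorable)          # explore
        return max(arms, key=lambda a: est_means[a])  # exploit

    if __name__ == "__main__":
        est_means = [0.40, 0.55, 0.50]   # current empirical reward estimates
        death_times = [1000, 8, 1000]    # arm 1 is known to die at t = 8
        for t in range(12):
            level = 1.0 if (t % 10) < 5 else 0.2  # known periodic pattern
            print(t, choose_arm(est_means, t, level, 1.0, death_times))

The toy schedule drives the exploration rate to zero at peak intensity, matching the abstract's intuition that greed should dominate during high-reward periods, while the gate spends the remaining exploration only on arms that will still exist long enough for the information to pay off.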

