Counterfactual off-policy evaluation with gumbel-max structural causal models

Oberst, Michael; Sontag, David Alexander

Author(s)

Oberst, Michael; Sontag, David Alexander

DownloadPublished version (1.417Mb)

Publisher Policy

Terms of use

Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.

Metadata

Show full item record

Abstract

We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.

Date issued

2019-06

URI

https://hdl.handle.net/1721.1/130437

Department

Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory

Journal

Proceedings of the 36th International Conference on Machine Learning

Publisher

MLResearch Press

Citation

Oberst, Michael and David Sontag. "Counterfactual off-policy evaluation with gumbel-max structural causal models." Proceedings of the 36th International Conference on Machine Learning, June 2019, Long Beach, California, MLResearch Press, 2019.

Version: Final published version

Collections

MIT Open Access Articles