Ramesh Johari, Stanford University
Suppose an online platform wants to compare a treatment and control policy, e.g., two different
matching algorithms in a ridesharing system, or two different inventory management algorithms in an
online retail site. Standard randomized controlled trials are typically not feasible, since the goal
is to estimate policy performance on the entire system. Instead, current practice typically
alternates between the two policies over fixed-length time intervals and compares the average
performance of each policy over the intervals in which it was run as an estimate of the treatment
effect; this is known as a “switchback” design.
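To make the design concrete, here is a minimal simulation sketch of a switchback experiment. The two-state transition matrices and the per-state rewards below are illustrative assumptions, not taken from the talk; the point is only the mechanics of alternating policies in fixed-length intervals and differencing the interval means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical control and treatment policies, each modeled as a
# two-state Markov chain with rewards attached to states (assumptions).
P_control = np.array([[0.9, 0.1],
                      [0.5, 0.5]])
P_treat = np.array([[0.8, 0.2],
                    [0.3, 0.7]])
r = np.array([0.0, 1.0])  # reward earned in each state

def run_interval(P, state, length):
    """Run one switchback interval under transition matrix P,
    returning the interval's average reward and the final state."""
    total = 0.0
    for _ in range(length):
        state = rng.choice(2, p=P[state])
        total += r[state]
    return total / length, state

def switchback_estimate(n_intervals=200, interval_len=50):
    """Alternate control/treatment over fixed-length intervals; the
    treatment-effect estimate is the difference of the interval means."""
    state = 0
    control_means, treat_means = [], []
    for k in range(n_intervals):
        P = P_treat if k % 2 else P_control
        mean, state = run_interval(P, state, interval_len)
        (treat_means if k % 2 else control_means).append(mean)
    return np.mean(treat_means) - np.mean(control_means)

print(switchback_estimate())
```

Note that the chain's state carries over across interval boundaries, so the estimate is biased by carryover effects unless intervals are long relative to the chains' mixing times; this is exactly the kind of structure the design-optimization problem must account for.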
In this talk, we cast switchback design optimization as a problem of estimating the difference in
steady state reward between two Markov chains: a “control” chain and a “treatment” chain. We
discuss the rich structure exhibited by this formulation of the problem, including reduction of the
variance-optimal design to a convex optimization problem when parametric or nonparametric maximum
likelihood estimation is used. We’ll close with a discussion of open issues, novel directions of
application, and alternative experimental designs of interest.
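The estimand in this formulation, the difference in steady-state reward between the two chains, can be computed exactly for small chains by solving for each stationary distribution. A minimal sketch, using the same kind of illustrative two-state chains as above (the matrices and rewards are assumptions, not from the talk):

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible transition matrix P,
    solved from pi P = pi together with the normalization sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Illustrative control and treatment chains with per-state rewards.
P_c = np.array([[0.9, 0.1], [0.5, 0.5]])
P_t = np.array([[0.8, 0.2], [0.3, 0.7]])
r = np.array([0.0, 1.0])

# Steady-state treatment effect: difference in long-run average reward.
delta = stationary(P_t) @ r - stationary(P_c) @ r
print(delta)  # approximately 0.2333 for these illustrative chains
```

In an experiment the transition matrices are of course unknown; the design question is how to schedule switchbacks so that an estimator of this quantity (e.g., maximum likelihood) has minimal variance.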
Joint work with Jose Blanchet, Peter Glynn, Mohammad Rasouli, and Linjia Wu.