Abstract
Motivated by concerns about making online decisions that incur undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated to a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
Abstract (translated)
受到关于在每次决策中都存在不必要的风险的担忧的启发,本文提出了可能在任何时间步上都安全的随机组合半衰期问题。在这个问题上,Agent被赋予了选择一组不超过$K$个子集的选择权利,每个子集都有特定的平均收益和方差,表示其风险。为了减轻Agent面临的风险,我们需要有至少$1-delta$的概率,在整个时间$T$的生命周期内,每个 Agent 做出的选择都应该包含子集的方差之和不超过某个方差预算。我们称之为可能在任何时间步上都安全的限定。在这个限定下,我们设计和分析了一个算法{sc PASCombUCB},它最小化$T$时间 horizon内 regret。通过开发伴随信息论的下界,我们证明了在问题依存和问题无关的范式下,{sc PASCombUCB}几乎总是最优的。我们的算法架构、提出的{sc PASCombUCB}算法以及新的分析适用于领域如推荐系统和交通,其中Agent可以在一个时间步上选择多个子集,并希望控制整个时间 horizon 的风险。
URL
https://arxiv.org/abs/2301.13393