Description
The multi-armed bandit problem is a simple model of decision-making under uncertainty that belongs to the
class of classical reinforcement learning problems. Given a set of arms, a learner interacts with these arms
sequentially, sampling a reward at each round; the learner's objective is to identify the arm with the
largest expected reward while maximizing the total cumulative reward. The learner thus faces a trade-off
between exploration and exploitation: one wants to explore all the arms to identify the best one, but also
wants to exploit the arms that have given the best rewards so far.
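As a concrete illustration of this trade-off, here is a minimal sketch of the classical bandit protocol with the standard UCB1 index policy; the Bernoulli arms and all parameter values are illustrative and not taken from [1] or [2]:

    import math
    import random

    def ucb1(arm_means, n_rounds, seed=0):
        """Play a Bernoulli multi-armed bandit with the UCB1 index policy."""
        rng = random.Random(seed)
        k = len(arm_means)
        counts = [0] * k    # number of times each arm was pulled
        sums = [0.0] * k    # cumulative reward collected per arm
        total = 0.0
        for t in range(1, n_rounds + 1):
            if t <= k:
                arm = t - 1  # pull every arm once to initialize the estimates
            else:
                # UCB index: empirical mean plus an exploration bonus that
                # shrinks as an arm is pulled more often.
                arm = max(range(k), key=lambda a: sums[a] / counts[a]
                          + math.sqrt(2 * math.log(t) / counts[a]))
            reward = 1.0 if rng.random() < arm_means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            total += reward
        return total, counts

    # Example: the policy concentrates its pulls on the best arm (mean 0.7).
    print(ucb1([0.3, 0.5, 0.7], n_rounds=10_000))

The exploration bonus guarantees that suboptimal arms are still sampled occasionally, which is exactly the exploration-versus-exploitation tension described above.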
In [1] we initiate the study of trade-offs between exploration and exploitation in the online learning of
properties of quantum states. Given sequential oracle access to an unknown quantum state, in each round
we must choose an observable from a set of actions with the aim of maximizing its expectation value on
the state (the reward). Information gained about the unknown state in previous rounds can be used to
gradually improve the choice of action, thus reducing the gap between the reward and the maximal
reward attainable with the given action set (the regret). We provide various information-theoretic lower
bounds on the cumulative regret that an optimal learner must incur, and show that it scales at least as the
square root of the number of rounds played. We also investigate how the cumulative regret depends on
the number of available actions and the dimension of the underlying space. Moreover, we exhibit
strategies that are optimal for bandits with a finite number of arms and general mixed states.
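In symbols, writing \(\rho\) for the unknown state, \(\mathcal{A}\) for the set of observables, and \(O_t\) for the action chosen in round \(t\) (notation assumed here for illustration, matching the informal description above), the cumulative regret after \(n\) rounds is

    \[
      \mathrm{Regret}_n \;=\; \sum_{t=1}^{n}
      \Bigl( \max_{O \in \mathcal{A}} \operatorname{Tr}(O\rho)
             \;-\; \operatorname{Tr}(O_t\rho) \Bigr),
    \]

and the lower bounds mentioned above state that \(\mathrm{Regret}_n = \Omega(\sqrt{n})\) for any learner.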
In [2] we study a recommender system for quantum data using the linear contextual bandit framework. In
each round, a learner receives an observable (the context) and has to recommend, from a finite set of
unknown quantum states (the actions), which one to measure. The learner's goal is to maximize the
reward in each round, that is, the outcome of measuring the context observable on the recommended state.
Using this model we formulate the low-energy quantum state recommendation problem, where the context
is a Hamiltonian and the goal is to recommend the state with the lowest energy. For this task, we study two
families of contexts: the Ising model and a generalized cluster model. We observe that if we interpret the
actions as different phases of these models, then recommending amounts to classifying the phase of the
given Hamiltonian, and the strategy can be interpreted as an online quantum phase classifier.
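A minimal sketch of the linear contextual bandit protocol in this setting follows. Everything in it is an illustrative assumption rather than the construction of [2]: contexts are random vectors standing in for vectorized Hamiltonians, each action hides a fixed feature vector standing in for an unknown state, and the reward is linear in the context, just as Tr(H rho) is linear in H. The recommendation rule is the standard LinUCB upper-confidence bound with one ridge-regression estimate per action:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4                                   # dimension of the context features
    n_actions = 3
    # Hidden feature vectors standing in for the unknown states rho_a.
    states = rng.random((n_actions, d))

    # One ridge-regression model per action (LinUCB "disjoint" variant).
    A = [np.eye(d) for _ in range(n_actions)]    # regularized Gram matrices
    b = [np.zeros(d) for _ in range(n_actions)]  # reward-weighted context sums
    alpha = 1.0                                  # width of the confidence bonus

    for t in range(2000):
        h = rng.normal(size=d)              # context: a vectorized Hamiltonian
        ucb = np.empty(n_actions)
        for a in range(n_actions):
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]            # current estimate of the state features
            ucb[a] = h @ theta + alpha * np.sqrt(h @ A_inv @ h)
        a = int(np.argmax(ucb))             # recommend the most promising state
        # Noisy linear reward playing the role of a measurement outcome
        # ~ Tr(H rho_a); for the low-energy problem one would use minus the energy.
        reward = h @ states[a] + 0.1 * rng.normal()
        A[a] += np.outer(h, h)
        b[a] += reward * h

Inverting A[a] anew each round is fine at this toy scale; a practical implementation would maintain the inverse incrementally.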
References
[1] J. Lumbreras, E. Haapasalo, and M. Tomamichel. Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states. Quantum, 6:749, 2022.
[2] S. Brahmachari, J. Lumbreras, and M. Tomamichel. Quantum contextual bandits and recommender systems for quantum data. arXiv:2301.13524, 2023.