Centre Inria de l'Université de Lille
Team project Scool -- Spring 2025
Keywords: Multi-armed bandits, Markov Decision Processes, Regret minimization.
Investigator: The project is proposed and advised by Odalric-Ambrym Maillard from Inria team-project Scool.
Place: This project will primarily take place at the Centre Inria de l'Université de Lille, 40 avenue Halley, 59650 Villeneuve d'Ascq, France.
Since the early development of mathematical statistics, collaboration between mathematicians and experimental scientists has played a crucial role in shaping rigorous methodologies for data analysis and decision-making. Fundamental principles such as likelihood estimation, deviation analysis, and experimental design provide the statistical foundation of modern inference and decision sciences. These ideas are now being revisited and extended within reinforcement learning (RL) and stochastic optimization, where finite-time analysis and non-parametric techniques offer new insights into classical and emerging challenges. In particular, statistical methods have shaped the design of state-of-the-art information-based RL algorithms, which build on Chernoff's pioneering 1959 work on optimal testing while adapting these ideas to RL settings, with a focus on finite-time performance and uncertainty quantification. Despite these advances, the interplay between statistical theory, robust RL algorithm design, and experimental validation in diverse environments remains insufficiently explored. This proposal seeks to deepen this interaction by leveraging insights from both statistics and algorithmic learning theory.
Despite significant advances in regret minimization for bandits and reinforcement learning (RL), efficiently handling structured bandits and general structured Markov Decision Processes (MDPs) remains an open challenge, even in discrete state-action spaces (Boone & Maillard, 2025). One promising approach is to leverage the concept of the most-confusing instance, initially developed for performance lower bounds, and turn it into an algorithmic principle rooted in information theory.
The "most-confusing instance" methodology departs from traditional optimistic and Bayesian approaches, aligning instead with hypothesis-testing principles. Recent studies demonstrate its efficacy across multiple structured settings, leading to both theoretical and computational improvements Saber & Maillard 2024, Pesquerel & Maillard, 2022. Key challenges include:
The Linear Quadratic Regulator (LQR): One significant extension involves adapting these principles to the LQR, a fundamental control problem with applications in robotics, autonomous systems, and large-scale industrial processes. The LQR provides an ideal benchmark for structured control due to its tractability and practical importance. The goal is to extend regret-minimization techniques to continuous-state MDPs where function approximation becomes necessary.
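To fix ideas, here is a minimal sketch, in Python with NumPy, of the nominal LQR problem underlying this extension: dynamics \(x_{t+1} = A x_t + B u_t + w_t\) with per-step cost \(x_t^\top Q x_t + u_t^\top R u_t\), solved by iterating the discrete Riccati recursion. The matrices, tolerances, and names below are illustrative only; in the regret-minimization setting, \(A\) and \(B\) are unknown and must be estimated online.

```python
import numpy as np

def lqr_gain(A, B, Q, R, tol=1e-9, max_iter=10_000):
    """Infinite-horizon discrete-time LQR via the Riccati recursion.

    Dynamics: x_{t+1} = A x_t + B u_t + noise; cost per step: x'Qx + u'Ru.
    Returns the gain K of the optimal linear policy u_t = -K x_t and the
    value matrix P.
    """
    P = Q.copy()
    for _ in range(max_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # (R + B'PB)^{-1} B'PA
        P_next = Q + A.T @ P @ (A - B @ K)                  # Riccati update
        if np.max(np.abs(P_next - P)) < tol:
            return K, P_next
        P = P_next
    return K, P

# Illustrative example: a double integrator with unit costs.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K, P = lqr_gain(A, B, Q, R)
print("optimal gain K =", K)
```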
Regret minimization in MDPs can be approached through multi-armed bandit theory. A modern information-theoretic approach consists of tracking, for each sub-optimal action \(a\), the confusion cost:
\[ \inf \{ L(\hat{\mathbf{M}}_t, \mathbf{M}^\dagger) : \mathbf{M}^\dagger \in \mathcal{C}(a, \hat{\mathbf{M}}_t) \} \]
where \( L(\hat{\mathbf{M}}_t, \mathbf{M}^\dagger) \) denotes the log-likelihood ratio between the current maximum-likelihood estimate \( \hat{\mathbf{M}}_t \) and a model \( \mathbf{M}^\dagger \), and where \(\mathcal{C}(a, \hat{\mathbf{M}}_t)\) denotes the set of models in which \(a\) is optimal and which are indistinguishable from \( \hat{\mathbf{M}}_t \) when playing optimally in \( \hat{\mathbf{M}}_t \).
Tracking this cost can be performed deterministically using IMED or stochastically via Maillard sampling (Bian & Jun, 2022). Applying this principle to MDPs led, for example, to the development of IMED-RL, a provably optimal regret-minimization algorithm for ergodic MDPs.
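For unstructured Bernoulli bandits, the infimum above reduces to \(N_a(t)\,\mathrm{kl}(\hat\mu_a(t), \hat\mu^\star(t))\), the number of pulls of \(a\) times the Kullback-Leibler divergence between its empirical mean and the best empirical mean, which is the quantity driving the IMED index of Honda and Takemura. The following Python sketch, with illustrative names and parameters, tracks this index and plays the arm minimizing it at each round; it is a minimal bandit illustration, not the IMED-RL algorithm itself.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def imed_play(counts, sums):
    """Return the arm minimizing the IMED index.

    counts[a] = number of pulls of arm a, sums[a] = total reward from arm a.
    Index of arm a: N_a * kl(mu_a, mu_star) + log(N_a), where mu_star is the
    best empirical mean; unexplored arms are pulled first.
    """
    counts, sums = np.asarray(counts, float), np.asarray(sums, float)
    if np.any(counts == 0):
        return int(np.argmin(counts))           # force initial exploration
    means = sums / counts
    index = counts * kl_bernoulli(means, means.max()) + np.log(counts)
    return int(np.argmin(index))

# One simulated run on a 3-armed Bernoulli bandit (illustrative parameters).
rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.6]
counts, sums = np.zeros(3), np.zeros(3)
for t in range(10_000):
    a = imed_play(counts, sums)
    r = rng.binomial(1, true_means[a])
    counts[a] += 1
    sums[a] += r
print("pull counts:", counts)   # most pulls should go to the best arm
```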
This research program formalizes statistical decision-making as a three-player game.
This game-theoretic perspective enables a transition from passive lower-bound derivations to active learning processes, informing algorithm design for regret-efficient decision-making.
Future research will explore reinforcement learning as a dynamic game against stochastic environments, extend structured bandit techniques to large-scale reinforcement learning problems, and analyze the policy neighborhood structure in high-dimensional MDPs. For example, in discrete ergodic MDPs, every sub-optimal policy \(\pi\) has a neighbor differing in only one state, yet achieving a strictly larger gain. This property enables efficient policy iteration via local updates, significantly reducing computational complexity.
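As a minimal illustration of this local-update idea, the following Python sketch, with hypothetical toy transition and reward tensors and assuming the model is known, computes the average-reward gain of a stationary policy in an ergodic MDP and then repeatedly switches the action in a single state whenever the switch strictly increases the gain: exactly the one-state neighborhood moves described above.

```python
import numpy as np

def gain(P, r, policy):
    """Average reward (gain) of a stationary policy in an ergodic MDP.

    P[s, a, s'] : transition probabilities, r[s, a] : mean rewards.
    The gain is the stationary distribution of the induced chain dotted with
    the per-state rewards under the policy.
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]                 # S x S chain under the policy
    r_pi = r[np.arange(S), policy]
    w, v = np.linalg.eig(P_pi.T)                   # stationary dist.: left eigenvector
    mu = np.real(v[:, np.argmin(np.abs(w - 1))])
    mu = mu / mu.sum()
    return float(mu @ r_pi)

def local_policy_iteration(P, r):
    """Improve the policy one state at a time until no single-state switch helps."""
    S, A = r.shape
    policy = np.zeros(S, dtype=int)
    g, improved = gain(P, r, policy), True
    while improved:
        improved = False
        for s in range(S):
            for a in range(A):
                if a == policy[s]:
                    continue
                candidate = policy.copy()
                candidate[s] = a                   # neighbor differing in one state
                g_cand = gain(P, r, candidate)
                if g_cand > g + 1e-10:
                    policy, g, improved = candidate, g_cand, True
    return policy

# Tiny illustrative 2-state, 2-action ergodic MDP (numbers are hypothetical).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print("locally improved policy:", local_policy_iteration(P, r))
```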
The project will be hosted at Centre Inria de l'Université de Lille, in the Scool team. Scool (Sequential COntinual and Online Learning) focuses on the study of sequential decision-making under uncertainty.