Centre Inria de l'Université de Lille
Team project Scool -- Spring 2025
Keywords: Multi-armed bandits, Markov Decision Processes, Regret minimization.
Investigator: The project is proposed and advised by Odalric-Ambrym Maillard from Inria team-project Scool.
Place: This project will primarily take place at the Centre Inria de l'Université de Lille, 40 avenue Halley, 59650 Villeneuve d'Ascq, France.
Since the early development of mathematical statistics, collaboration between mathematicians and experimental scientists has played a crucial role in shaping rigorous methodologies for data analysis and decision-making. Fundamental principles such as likelihood estimation, deviation analysis, and experimental design provide the statistical foundation of modern inference and decision sciences. These ideas are now being revisited and extended within reinforcement learning (RL) and stochastic optimization, where finite-time analysis and non-parametric techniques offer new insights into classical and emerging challenges. In particular, statistical methods have shaped the design of state-of-the-art information-based RL algorithms, which build on Chernoff's pioneering 1959 work on optimal testing while adapting these ideas to RL settings, with a focus on finite-time performance and uncertainty quantification. Despite these advances, the interplay between statistical theory, robust RL algorithm design, and experimental validation in diverse environments remains insufficiently explored. This proposal seeks to deepen this interaction by leveraging insights from both statistics and algorithmic learning theory.
Despite significant advances in regret minimization for bandits and reinforcement learning (RL), efficiently handling structured bandits and general structured Markov Decision Processes (MDPs) remains an open challenge, even in discrete state-action spaces (Boone & Maillard, 2025). One promising approach is to leverage the concept of the most-confusing instance, initially developed for performance lower bounds, and turn it into an algorithmic principle rooted in information theory.
The "most-confusing instance" methodology departs from traditional optimistic and Bayesian approaches, aligning instead with hypothesis-testing principles. Recent studies demonstrate its efficacy across multiple structured settings, leading to both theoretical and computational improvements Saber & Maillard 2024, Pesquerel & Maillard, 2022. Key challenges include:
The Linear Quadratic Regulator (LQR): One significant extension involves adapting these principles to the LQR, a fundamental control problem with applications in robotics, autonomous systems, and large-scale industrial processes. The LQR provides an ideal benchmark for structured control due to its tractability and practical importance. The goal is to extend regret-minimization techniques to continuous-state MDPs where function approximation becomes necessary.
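To fix ideas, here is a minimal sketch, in Python with NumPy, of the nominal LQR problem underlying this extension: dynamics \(x_{t+1} = A x_t + B u_t + w_t\) with per-step cost \(x_t^\top Q x_t + u_t^\top R u_t\), solved by iterating the discrete Riccati recursion. The matrices, tolerances, and names below are illustrative only; in the regret-minimization setting, \(A\) and \(B\) are unknown and must be estimated online.

```python
import numpy as np

def lqr_gain(A, B, Q, R, tol=1e-9, max_iter=10_000):
    """Infinite-horizon discrete-time LQR via the Riccati recursion.

    Dynamics: x_{t+1} = A x_t + B u_t + noise; cost per step: x'Qx + u'Ru.
    Returns the gain K of the optimal linear policy u_t = -K x_t and the
    value matrix P.
    """
    P = Q.copy()
    for _ in range(max_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # (R + B'PB)^{-1} B'PA
        P_next = Q + A.T @ P @ (A - B @ K)                  # Riccati update
        if np.max(np.abs(P_next - P)) < tol:
            return K, P_next
        P = P_next
    return K, P

# Illustrative example: a double integrator with unit costs.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K, P = lqr_gain(A, B, Q, R)
print("optimal gain K =", K)
```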
Regret minimization in MDPs can be approached through multi-armed bandit theory. A modern information-theoretic approach consists of tracking, for each sub-optimal action \(a\), the confusion cost:
\[ \inf \{ L(\hat{\mathbf{M}}_t, \mathbf{M}^\dagger) : \mathbf{M}^\dagger \in \mathcal{C}(a, \hat{\mathbf{M}}_t) \} \]
where \( L(\hat{\mathbf{M}}_t, \mathbf{M}^\dagger) \) denotes the log-likelihood ratio between the current maximum-likelihood estimate \( \hat{\mathbf{M}}_t \) and a model \( \mathbf{M}^\dagger \), and where \(\mathcal{C}(a, \hat{\mathbf{M}}_t)\) denotes the set of models in which \(a\) is optimal and which are indistinguishable from \( \hat{\mathbf{M}}_t \) when playing optimally in \( \hat{\mathbf{M}}_t \).
Tracking this cost can be performed deterministically using IMED or stochastically via Maillard sampling (Bian & Jun, 2022). Applying this principle to MDPs led, for example, to the development of IMED-RL, a provably optimal regret-minimization algorithm for ergodic MDPs.
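For unstructured Bernoulli bandits, the infimum above reduces to \(N_a(t)\,\mathrm{kl}(\hat\mu_a(t), \hat\mu^\star(t))\), the number of pulls of \(a\) times the Kullback-Leibler divergence between its empirical mean and the best empirical mean, which is the quantity driving the IMED index of Honda and Takemura. The following Python sketch, with illustrative names and parameters, tracks this index and plays the arm minimizing it at each round; it is a minimal bandit illustration, not the IMED-RL algorithm itself.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def imed_play(counts, sums):
    """Return the arm minimizing the IMED index.

    counts[a] = number of pulls of arm a, sums[a] = total reward from arm a.
    Index of arm a: N_a * kl(mu_a, mu_star) + log(N_a), where mu_star is the
    best empirical mean; unexplored arms are pulled first.
    """
    counts, sums = np.asarray(counts, float), np.asarray(sums, float)
    if np.any(counts == 0):
        return int(np.argmin(counts))           # force initial exploration
    means = sums / counts
    index = counts * kl_bernoulli(means, means.max()) + np.log(counts)
    return int(np.argmin(index))

# One simulated run on a 3-armed Bernoulli bandit (illustrative parameters).
rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.6]
counts, sums = np.zeros(3), np.zeros(3)
for t in range(10_000):
    a = imed_play(counts, sums)
    r = rng.binomial(1, true_means[a])
    counts[a] += 1
    sums[a] += r
print("pull counts:", counts)   # most pulls should go to the best arm
```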
This research program formalizes statistical decision-making as a three-player game.
This game-theoretic perspective enables a transition from passive lower-bound derivations to active learning processes, informing algorithm design for regret-efficient decision-making.
Future research will explore reinforcement learning as a dynamic game against stochastic environments, extend structured bandit techniques to large-scale reinforcement learning problems, and analyze the policy neighborhood structure in high-dimensional MDPs. For example, in discrete ergodic MDPs, every sub-optimal policy \(\pi\) has a neighbor differing in only one state, yet achieving a strictly larger gain. This property enables efficient policy iteration via local updates, significantly reducing computational complexity.
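As a minimal illustration of this local-update idea, the following Python sketch, with hypothetical toy transition and reward tensors and assuming the model is known, computes the average-reward gain of a stationary policy in an ergodic MDP and then repeatedly switches the action in a single state whenever the switch strictly increases the gain: exactly the one-state neighborhood moves described above.

```python
import numpy as np

def gain(P, r, policy):
    """Average reward (gain) of a stationary policy in an ergodic MDP.

    P[s, a, s'] : transition probabilities, r[s, a] : mean rewards.
    The gain is the stationary distribution of the induced chain dotted with
    the per-state rewards under the policy.
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]                 # S x S chain under the policy
    r_pi = r[np.arange(S), policy]
    w, v = np.linalg.eig(P_pi.T)                   # stationary dist.: left eigenvector
    mu = np.real(v[:, np.argmin(np.abs(w - 1))])
    mu = mu / mu.sum()
    return float(mu @ r_pi)

def local_policy_iteration(P, r):
    """Improve the policy one state at a time until no single-state switch helps."""
    S, A = r.shape
    policy = np.zeros(S, dtype=int)
    g, improved = gain(P, r, policy), True
    while improved:
        improved = False
        for s in range(S):
            for a in range(A):
                if a == policy[s]:
                    continue
                candidate = policy.copy()
                candidate[s] = a                   # neighbor differing in one state
                g_cand = gain(P, r, candidate)
                if g_cand > g + 1e-10:
                    policy, g, improved = candidate, g_cand, True
    return policy

# Tiny illustrative 2-state, 2-action ergodic MDP (numbers are hypothetical).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print("locally improved policy:", local_policy_iteration(P, r))
```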
The project will be hosted at Centre Inria de l'Université de Lille, in the Scool team. Scool (Sequential COntinual and Online Learning) focuses on the study of sequential decision-making under uncertainty.