
Research Topic Proposal

Centre Inria de l'Université de Lille
Team project Scool -- Spring 2025

“Sim-to-Real Adaptation in Continual Reinforcement Learning”

Keywords: Reinforcement Learning, Transfer Learning, Simulators, Aggregation of Experts.

Investigator: The project is proposed and advised by Odalric-Ambrym Maillard from Inria team-project Scool.

Place: This project will take place primarily at the research center Inria Lille -- Nord Europe, 40 avenue Halley, 59650 Villeneuve d'Ascq, France.


Context

Simulators provide valuable yet often imperfect approximations of real-world systems. They typically fail to capture complex interactions, leading to discrepancies between simulated and real dynamics. A striking example arises in mixed-cropping systems, where multiple plant species grow simultaneously. While AI models can learn the growth behavior of each species in isolation, no current simulator accurately describes their joint dynamics. Consequently, reinforcement learning (RL) agents trained in simulators must not only learn effective policies but also adapt these policies to real-world data, where interactions between species introduce new, unmodeled effects. This adaptation falls under transfer reinforcement learning, where an agent refines its decision-making process as it transitions from simulation to reality.

A fundamental challenge in sim-to-real adaptation is that the optimal policy in simulation is not necessarily optimal in reality. Some adaptation is required, but the extent of this adaptation varies: in some cases, a simple fine-tuning of parameters may suffice, while in others, entirely new strategies must be learned from scratch. Our goal is to understand when and how existing knowledge can be efficiently leveraged to minimize learning costs while maintaining strong performance in the real environment.

Formalization

Let us consider two environments: Environment A representing a simplified, approximate version of the real-world system, and Environment B, representing the real-world environment, where the agent ultimately needs to perform well. The goal is to transfer a policy \( \pi \), learned in \( A \), to perform well in \( B \), and later extend this adaptation process to a sequence of environments representing variations of \( B \).

Representing Policy Spaces To fix ideas, consider a policy parameterized by a (potentially high-dimensional) vector \( \theta \in \mathbb{R}^d \), where typically \( \log_{10}(d) > 3 \) (i.e., thousands to millions of parameters). We denote the set of \( \varepsilon \)-near-optimal policies in environment \( A \) by \( \Theta(A, \varepsilon) \), and likewise introduce \( \Theta(B, \varepsilon) \).
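As a concrete illustration, membership in \( \Theta(A, \varepsilon) \) can be tested empirically by comparing a Monte-Carlo estimate of a policy's return against the best return obtained so far. The sketch below is a minimal illustration of such a test; the Gym-like environment interface, the rollout budget, and the function names (evaluate_policy, is_near_optimal) are assumptions made for the example, not part of the proposal.

```python
import numpy as np

def evaluate_policy(env, policy, n_rollouts=20, horizon=200):
    """Monte-Carlo estimate of the expected return of `policy` in `env`.

    `env` is assumed to expose a minimal Gym-like interface:
    reset() -> obs, step(action) -> (obs, reward, done, info).
    """
    returns = []
    for _ in range(n_rollouts):
        obs, total = env.reset(), 0.0
        for _ in range(horizon):
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))

def is_near_optimal(env, policy, best_return, eps):
    """Empirical membership test for Theta(env, eps): the policy is accepted
    if its estimated return is within eps of the best return observed so far
    (used as a proxy for the unknown optimal value)."""
    return evaluate_policy(env, policy) >= best_return - eps
```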

For computational efficiency, one may further approximate the set \( \Theta(A, \varepsilon) \) using a convex polytope built on a small set of representative policies. Formally:

\[ \Theta^+(\boldsymbol{\theta}_A, \varepsilon) = \left\{ \sum_{\ell=1}^{L} \alpha_\ell \, \theta_\ell \;:\; \alpha \in \Delta_L \right\}, \qquad \Delta_L = \left\{ \alpha \in [0,1]^L : \sum_{\ell=1}^{L} \alpha_\ell = 1 \right\} \]

where \( L \) is significantly smaller than \( d \) (typically \( L = O(\log(d)) \)), and \( \boldsymbol{\theta}_A = \{\theta_\ell\}_{\ell=1}^{L} \) is a small collection of diverse, near-optimal policies, akin to an "ensemble of experts". Searching in this low-dimensional space drastically reduces optimization time compared to searching over all of \( \mathbb{R}^d \).
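A minimal sketch of this idea, assuming access to a performance estimate in the target environment: instead of optimizing over the \( d \) network parameters, one optimizes over the \( L \) mixture weights on the probability simplex. The random-search routine and the names below (mix_experts, search_polytope, score_fn) are illustrative assumptions, not a prescribed algorithm.

```python
import numpy as np

def mix_experts(alpha, experts):
    """Convex combination sum_l alpha_l * theta_l of L expert parameter
    vectors (array of shape (L, d)), with alpha on the probability simplex."""
    return np.tensordot(alpha, experts, axes=1)  # resulting shape: (d,)

def search_polytope(experts, score_fn, n_samples=500, rng=None):
    """Random search over the simplex of mixture weights.

    `score_fn(theta)` is assumed to return an estimate of the performance of
    the policy with parameters `theta` in the target environment B (e.g. the
    Monte-Carlo estimate sketched earlier). The search space has dimension L
    rather than d, hence is much cheaper to explore when L << d."""
    rng = rng or np.random.default_rng()
    L = experts.shape[0]
    best_alpha, best_score = None, -np.inf
    for _ in range(n_samples):
        alpha = rng.dirichlet(np.ones(L))  # uniform draw on the simplex
        score = score_fn(mix_experts(alpha, experts))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

More refined optimizers over the simplex (e.g. projected gradient steps or bandit-style exploration of the weights) could of course replace the random search; the point of the sketch is only the reduced dimensionality.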

Adaptation scenarios The interplay between the near-optimal policy sets \( \Theta(A, \varepsilon) \) and \( \Theta(B, \varepsilon) \) yields four illustrative scenarios of increasing adaptation cost, ranging from cases where a simple fine-tuning suffices to cases where entirely new strategies must be learned from scratch.

Balancing Learning Costs and Adaptation In practice, we do not know in advance which scenario applies. Instead, we must balance the cost of attempting adaptation against the cost of full retraining.

Denoting by \( f_1, f_2, f_3, f_4 \) the costs of the different approaches, and considering a progressive procedure that prioritizes the cheapest approaches and switches to more costly ones only when they fail, transfer learning is beneficial only if:

\[ \sum_{m < 4} f_m < f_4 \]

Note that \(f_1\) and \(f_2\) typically depend on \(L\) rather than \(d\), while \(f_4\) depends on \(d\). The key challenge is designing stopping criteria to recognize when adaptation is failing and to switch to a more costly but necessary approach.
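The progressive procedure can be written down schematically: attempt the approaches in order of increasing cost, spend a budget \( f_m \) on each, and escalate when a stopping criterion declares failure; in the worst case the overhead with respect to direct retraining is \( \sum_{m<4} f_m \), which is exactly the condition above. The budgets, candidate routines, and acceptance test in the sketch below are assumed placeholders; designing the actual stopping criteria is one of the objectives of the project.

```python
def progressive_adaptation(approaches, is_good_enough):
    """Try adaptation approaches from cheapest to most costly.

    `approaches` is a list of (name, budget, run_fn) triples ordered by
    increasing cost, e.g. reuse of a single simulator policy (f_1), search
    over the expert polytope (f_2), fine-tuning (f_3), and full retraining
    from scratch (f_4). `run_fn(budget)` returns a candidate policy obtained
    within that budget; `is_good_enough(policy)` is the stopping criterion
    deciding whether to accept the candidate or to escalate."""
    total_cost = 0.0
    for name, budget, run_fn in approaches:
        candidate = run_fn(budget)
        total_cost += budget
        if is_good_enough(candidate):
            return candidate, name, total_cost
    # Every approach failed, including full retraining: the stopping
    # criterion (or the budgets) should be revisited.
    raise RuntimeError("no approach produced an acceptable policy")
```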

Continual Learning Across Multiple Environments Next, consider adapting not just to \( B \), but to a sequence of environments \( B_1, B_2, ..., B_p \). The goal is now to maintain a small, efficient set of policies that generalize well across multiple tasks.

Define a core set of policies \( \boldsymbol{\theta} = \{\theta_1, ..., \theta_{L_p}\} \) such that for all \( i \):

\[ \Theta(B_i, \varepsilon) \cap \Theta^+(\boldsymbol{\theta}, \varepsilon) \neq \emptyset \]

Ideally, we want \( \frac{L_p}{p} \to 0 \), ensuring that the number of required policies does not grow linearly with the number of tasks. If it did, the system would become unmanageable, as we would need to store and retrieve an excessive number of policies. Likewise, we want to prevent \( L_p \) from growing too large compared to \( \log(d) \). Instead, we seek sparse additions of novel policies, avoiding computational and memory blow-up while maintaining near-optimal performance in all environments. An interesting question is to characterize simple situations in which such core sets exist and \( L_p/p \to 0 \) indeed holds.
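One natural way to keep the additions sparse, sketched below under the same assumptions as the earlier snippets: for each new environment \( B_i \), first search the current expert polytope (a cost that depends on \( L_p \) only), and train and append a new expert (a cost that depends on \( d \)) only when no mixture appears \( \varepsilon \)-near-optimal. The function names and arguments are again illustrative assumptions.

```python
import numpy as np

def update_core_set(experts, score_fn, best_return, eps,
                    search_polytope, train_new_expert):
    """Sparse update of the core set when a new environment B_i arrives.

    `experts` is an array of shape (L_p, d). We first look for a near-optimal
    mixture inside the polytope of current experts; a new expert is trained
    and appended only when the intersection of Theta(B_i, eps) with the
    polytope appears empty, so that L_p grows only when strictly needed."""
    alpha, score = search_polytope(experts, score_fn)
    if score >= best_return - eps:
        # An existing mixture is eps-near-optimal in B_i: no growth of L_p.
        return experts, np.tensordot(alpha, experts, axes=1)
    # Otherwise, pay the full training cost once and enlarge the core set.
    new_theta = train_new_expert()
    return np.vstack([experts, new_theta]), new_theta
```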

Key Research Questions

By leveraging aggregations of expert policies, we hope to transfer knowledge efficiently while minimizing unnecessary retraining. The ultimate goal is to strike a balance between computational efficiency, adaptability, and long-term performance across diverse environments. Research questions to be investigated include the design of stopping criteria for the progressive adaptation procedure, the characterization of situations in which a small core set of policies suffices, and the sparse maintenance of this core set across a sequence of environments.

Bibliography

References are available upon request or via the Scool project page.

Host Institution and Supervision

The project will be hosted at the Centre Inria de l'Université de Lille, in the Scool team. Scool (Sequential COntinual and Online Learning) focuses on the study of sequential decision-making under uncertainty.