
Research Topic Proposal

Centre Inria de l'Université de Lille
Team project Scool -- Spring 2025

“Sim-to-Real Adaptation in Continual Reinforcement Learning”

Keywords: Reinforcement Learning, Transfer Learning, Simulators, Aggregation of Experts.

Investigator: The project is proposed and advised by Odalric-Ambrym Maillard from Inria team-project Scool.

Place: This project will take place primarily at the research center Inria Lille -- Nord Europe, 40 avenue Halley, 59650 Villeneuve d'Ascq, France.


Context

Simulators provide valuable yet often imperfect approximations of real-world systems. They typically fail to capture complex interactions, leading to discrepancies between simulated and real dynamics. A striking example arises in mixed-cropping systems, where multiple plant species grow simultaneously. While AI models can learn the growth behavior of each species in isolation, no current simulator accurately describes their joint dynamics. Consequently, reinforcement learning (RL) agents trained in simulators must not only learn effective policies but also adapt these policies to real-world data, where interactions between species introduce new, unmodeled effects. This adaptation falls under transfer reinforcement learning, where an agent refines its decision-making process as it transitions from simulation to reality.

A fundamental challenge in sim-to-real adaptation is that the optimal policy in simulation is not necessarily optimal in reality. Some adaptation is required, but the extent of this adaptation varies: in some cases, a simple fine-tuning of parameters may suffice, while in others, entirely new strategies must be learned from scratch. Our goal is to understand when and how existing knowledge can be efficiently leveraged to minimize learning costs while maintaining strong performance in the real environment.

Formalization

Let us consider two environments: Environment A representing a simplified, approximate version of the real-world system, and Environment B, representing the real-world environment, where the agent ultimately needs to perform well. The goal is to transfer a policy \( \pi \), learned in \( A \), to perform well in \( B \), and later extend this adaptation process to a sequence of environments representing variations of \( B \).

Representing Policy Spaces To fix ideas, consider a policy parameterized by a (potentially high-dimensional) vector \( \theta \in \mathbb{R}^d \), where typically \( \log_{10}(d) > 3 \) (i.e., thousands to millions of parameters). We denote the set of \( \varepsilon \)-near-optimal policies in environment \( A \) by \( \Theta(A, \varepsilon) \), and likewise introduce \( \Theta(B, \varepsilon) \).
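As a concrete illustration, membership in \( \Theta(A, \varepsilon) \) can be tested empirically by comparing a Monte-Carlo estimate of a policy's return against the best return obtained so far. The sketch below is a minimal illustration of such a test; the Gym-like environment interface, the rollout budget, and the function names (evaluate_policy, is_near_optimal) are assumptions made for the example, not part of the proposal.

```python
import numpy as np

def evaluate_policy(env, policy, n_rollouts=20, horizon=200):
    """Monte-Carlo estimate of the expected return of `policy` in `env`.

    `env` is assumed to expose a minimal Gym-like interface:
    reset() -> obs, step(action) -> (obs, reward, done, info).
    """
    returns = []
    for _ in range(n_rollouts):
        obs, total = env.reset(), 0.0
        for _ in range(horizon):
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))

def is_near_optimal(env, policy, best_return, eps):
    """Empirical membership test for Theta(env, eps): the policy is accepted
    if its estimated return is within eps of the best return observed so far
    (used as a proxy for the unknown optimal value)."""
    return evaluate_policy(env, policy) >= best_return - eps
```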

For computational efficiency, one may further approximate the set \( \Theta(A, \varepsilon) \) using a convex polytope built on a small set of representative policies. Formally:

\[ \Theta^+(\boldsymbol{\theta}_A, \varepsilon) = \left\{ \sum_{\ell=1}^{L} \alpha_\ell \, \theta_\ell \;:\; \alpha \in \Delta_L \right\}, \qquad \Delta_L = \left\{ \alpha \in [0,1]^L : \sum_{\ell=1}^{L} \alpha_\ell = 1 \right\} \]

where \( L \) is significantly smaller than \( d \) (typically \( L = O(\log(d)) \)), and \( \boldsymbol{\theta}_A = \{\theta_\ell\}_{\ell=1}^{L} \) is a small collection of diverse, near-optimal policies, akin to an "ensemble of experts". Searching in this low-dimensional space drastically reduces optimization time compared to searching over all of \( \mathbb{R}^d \).
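A minimal sketch of this idea, assuming access to a performance estimate in the target environment: instead of optimizing over the \( d \) network parameters, one optimizes over the \( L \) mixture weights on the probability simplex. The random-search routine and the names below (mix_experts, search_polytope, score_fn) are illustrative assumptions, not a prescribed algorithm.

```python
import numpy as np

def mix_experts(alpha, experts):
    """Convex combination sum_l alpha_l * theta_l of L expert parameter
    vectors (array of shape (L, d)), with alpha on the probability simplex."""
    return np.tensordot(alpha, experts, axes=1)  # resulting shape: (d,)

def search_polytope(experts, score_fn, n_samples=500, rng=None):
    """Random search over the simplex of mixture weights.

    `score_fn(theta)` is assumed to return an estimate of the performance of
    the policy with parameters `theta` in the target environment B (e.g. the
    Monte-Carlo estimate sketched earlier). The search space has dimension L
    rather than d, hence is much cheaper to explore when L << d."""
    rng = rng or np.random.default_rng()
    L = experts.shape[0]
    best_alpha, best_score = None, -np.inf
    for _ in range(n_samples):
        alpha = rng.dirichlet(np.ones(L))  # uniform draw on the simplex
        score = score_fn(mix_experts(alpha, experts))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

More refined optimizers over the simplex (e.g. projected gradient steps or bandit-style exploration of the weights) could of course replace the random search; the point of the sketch is only the reduced dimensionality.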

Adaptation scenarios The interplay between the near-optimal policy sets \( \Theta(A, \varepsilon) \) and \( \Theta(B, \varepsilon) \) yields four illustrative scenarios of increasing adaptation cost, ranging from cases where a simple fine-tuning suffices to cases where entirely new strategies must be learned from scratch.

Balancing Learning Costs and Adaptation In practice, we do not know in advance which scenario applies. Instead, we must balance the cost of attempting adaptation against the cost of full retraining.

Denoting by \( f_1, f_2, f_3, f_4 \) the costs of the different approaches, and considering a progressive procedure that prioritizes the cheapest approaches and switches to more costly ones only when they fail, transfer learning is beneficial only if:

\[ \sum_{m < 4} f_m < f_4 \]

Note that \(f_1\) and \(f_2\) typically depend on \(L\) rather than \(d\), while \(f_4\) depends on \(d\). The key challenge is designing stopping criteria to recognize when adaptation is failing and to switch to a more costly but necessary approach.
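The progressive procedure can be written down schematically: attempt the approaches in order of increasing cost, spend a budget \( f_m \) on each, and escalate when a stopping criterion declares failure; in the worst case the overhead with respect to direct retraining is \( \sum_{m<4} f_m \), which is exactly the condition above. The budgets, candidate routines, and acceptance test in the sketch below are assumed placeholders; designing the actual stopping criteria is one of the objectives of the project.

```python
def progressive_adaptation(approaches, is_good_enough):
    """Try adaptation approaches from cheapest to most costly.

    `approaches` is a list of (name, budget, run_fn) triples ordered by
    increasing cost, e.g. reuse of a single simulator policy (f_1), search
    over the expert polytope (f_2), fine-tuning (f_3), and full retraining
    from scratch (f_4). `run_fn(budget)` returns a candidate policy obtained
    within that budget; `is_good_enough(policy)` is the stopping criterion
    deciding whether to accept the candidate or to escalate."""
    total_cost = 0.0
    for name, budget, run_fn in approaches:
        candidate = run_fn(budget)
        total_cost += budget
        if is_good_enough(candidate):
            return candidate, name, total_cost
    # Every approach failed, including full retraining: the stopping
    # criterion (or the budgets) should be revisited.
    raise RuntimeError("no approach produced an acceptable policy")
```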

Continual Learning Across Multiple Environments Next, consider adapting not just to \( B \), but to a sequence of environments \( B_1, B_2, ..., B_p \). The goal is now to maintain a small, efficient set of policies that generalize well across multiple tasks.

Define a core set of policies \( \boldsymbol{\theta} = \{\theta_1, ..., \theta_{L_p}\} \) such that for all \( i \):

\[ \Theta(B_i, \varepsilon) \cap \Theta^+(\boldsymbol{\theta}, \varepsilon) \neq \emptyset \]

Ideally, we want \( \frac{L_p}{p} \to 0 \), ensuring that the number of required policies does not grow linearly with the number of tasks. If it did, the system would become unmanageable, as we would need to store and retrieve an excessive number of policies. Likewise, we want to prevent \( L_p \) from growing too large compared to \( \log(d) \). Instead, we seek sparse additions of novel policies, avoiding computational and memory blow-up while maintaining near-optimal performance in all environments. An interesting question is to characterize simple situations in which such core sets exist and \( L_p/p \to 0 \) indeed holds.
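One natural way to keep the additions sparse, sketched below under the same assumptions as the earlier snippets: for each new environment \( B_i \), first search the current expert polytope (a cost that depends on \( L_p \) only), and train and append a new expert (a cost that depends on \( d \)) only when no mixture appears \( \varepsilon \)-near-optimal. The function names and arguments are again illustrative assumptions.

```python
import numpy as np

def update_core_set(experts, score_fn, best_return, eps,
                    search_polytope, train_new_expert):
    """Sparse update of the core set when a new environment B_i arrives.

    `experts` is an array of shape (L_p, d). We first look for a near-optimal
    mixture inside the polytope of current experts; a new expert is trained
    and appended only when the intersection of Theta(B_i, eps) with the
    polytope appears empty, so that L_p grows only when strictly needed."""
    alpha, score = search_polytope(experts, score_fn)
    if score >= best_return - eps:
        # An existing mixture is eps-near-optimal in B_i: no growth of L_p.
        return experts, np.tensordot(alpha, experts, axes=1)
    # Otherwise, pay the full training cost once and enlarge the core set.
    new_theta = train_new_expert()
    return np.vstack([experts, new_theta]), new_theta
```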

Key Research Questions

By leveraging aggregations of expert policies, we hope to transfer knowledge efficiently while minimizing unnecessary retraining. The ultimate goal is to strike a balance between computational efficiency, adaptability, and long-term performance across diverse environments. Research questions to be investigated include the design of stopping criteria for the progressive adaptation procedure, the characterization of situations in which a small core set of policies suffices, and the sparse maintenance of this core set across a sequence of environments.

Bibliography

References are available upon request or via the Scool project page.

Host Institution and Supervision

The project will be hosted at the Centre Inria de l'Université de Lille, in the Scool team. Scool (Sequential COntinual and Online Learning) focuses on the study of sequential decision-making under uncertainty.