
Research Topic Proposal

Centre Inria de l'Université de Lille
Team project Scool -- Spring 2025

“Hierarchical Bandits with Wavelet Coherence”

Keywords: Multi-armed bandits, hierarchical bandits, wavelet theory.

Investigator: The project is proposed and advised by Odalric-Ambrym Maillard from Inria team-project Scool.

Place: This project will primarily take place at the research center Inria Lille -- Nord Europe, 40 avenue Halley, 59650 Villeneuve d'Ascq, France.

Introduction and Background Motivation

Multi-armed bandit (MAB) problems are a fundamental framework in decision-making under uncertainty, with applications in domains such as online advertising, clinical trials, and recommendation systems. However, real-world problems often involve hierarchical structures where decisions must be refined progressively across multiple levels. To address this, we propose a novel extension of MAB to Hierarchical Bandits, positioning this work within the broader Continual Reinforcement Learning (CRL) paradigm.

CRL focuses on learning and adapting sequentially across a sequence of environments. Hierarchical Bandits extend MAB by introducing a tree-partition structure, where decisions evolve from broad categories to specific actions, supporting both adaptation and knowledge transfer across levels. For example, in agricultural treatments, a coarse-level decision might specify a general strategy (e.g., applying nitrogen), while finer levels refine this strategy (e.g., specifying a dosage). While one may initially consider only the coarse-level bandit problem, its refined version may be addressed later. This hierarchy thus aligns with CRL's goals of progressive skill refinement and efficient adaptation, by positing a specific structure linking the successive environments.

The primary motivation is to develop a formal framework that captures the hierarchical nature of such problems, enabling coherent decision-making across different levels of granularity. Current bandit literature often neglects these hierarchical relationships, leading to suboptimal strategies in structured decision spaces. By integrating wavelet-like decomposition and hierarchical structures, we aim to provide both theoretical insights and practical algorithms for addressing these challenges in a way that complements and enhances CRL methodologies.

Modeling and Formalization

The hierarchical bandit problem can be formalized as a tuple \( ( \mathcal{A},\mathcal{T}, \mathbf{r}) \), where \( \mathbf{r} \) denotes the collection of arm-reward distributions (with mean-reward function \( \mathbf{m} \) defined below), and:

The arm set \( \mathcal{A} \) is structured as a tree-partition \( \mathcal{T} \), where each vertex \( \nu \in \mathcal{T} \) corresponds to a subset \( \mathcal{A}_\nu \subset \mathcal{A} \), and its children vertices \( \mathcal{C}_\nu \) form a partition of \( \mathcal{A}_\nu \). The root node \( \nu_0 \) represents the full set \( \mathcal{A}_{\nu_0} = \mathcal{A} \), and leaf nodes are singletons corresponding to individual arms.
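As a minimal sketch of this constraint (all names here are hypothetical, and Python is used only for illustration), each vertex can store its arm subset and children, and the tree-partition property can be verified recursively:

```python
def make_node(arms, children=()):
    """A vertex nu of T: its arm subset A_nu and its child vertices C_nu."""
    return {"arms": frozenset(arms), "children": list(children)}

# A small tree: the root holds all arms, leaves are singletons.
tree = make_node({1, 2, 3}, [
    make_node({1, 2}, [make_node({1}), make_node({2})]),
    make_node({3}),
])

def is_tree_partition(node):
    """Check that at every internal vertex, the children's arm sets partition A_nu."""
    if not node["children"]:
        return True
    kids = [c["arms"] for c in node["children"]]
    union = frozenset().union(*kids)
    disjoint = sum(len(k) for k in kids) == len(union)   # no overlaps
    covers = union == node["arms"]                        # no gaps
    return disjoint and covers and all(is_tree_partition(c) for c in node["children"])
```

The recursive check mirrors the definition: children must be pairwise disjoint and jointly cover the parent's arm set at every level.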

A slice of \( \mathcal{T} \) is defined as a partition of \( \mathcal{A} \) where each element corresponds to a subset \( \mathcal{A}_\nu \) for \( \nu \in \mathcal{T} \). Slices can be refined progressively, modeling scenarios where treatments or actions become more detailed over time. For example, consider an agricultural context with three broad treatments that are later refined into specific dosages or techniques. This hierarchical modeling allows for coherent exploration and exploitation across levels.

To enforce coherence, we adopt ideas from wavelet theory. The mean-reward function \( \mathbf{m}: \mathcal{A} \to \mathbb{R} \) is treated as a signal, decomposed into levels corresponding to \( \mathcal{T} \). For each node \( \nu \in \mathcal{T} \), we define transformations \( T_\nu[\mathbf{m}] \) and partial reconstructions \( \mathbf{m}[\nu] \), enabling a meta-bandit framework at varying levels of granularity.

More formally, let \(\phi:\mathbb{R}\to\mathbb{C}\) and assume that \(\phi(x)=0\) for all \(x\notin[0,1]\), \(\int_{0}^1 \phi(x)dx=0\) and \(\int_{0}^1 |\phi(x)|^2dx=1\). For any subset \(\mathcal{A}^\dagger=\{a_1,a_1+1,\dots,a_2\}\subset \mathbb{N}\) of consecutive integers, we define the \(\mathcal{A}^\dagger\)-rescaled version of \(\phi\) as $$\phi_{\mathcal{A}^\dagger}(a)=\frac{1}{\sqrt{|\mathcal{A}^\dagger|}}\int_{a}^{a+1}\phi(\frac{x-a_1}{|\mathcal{A}^\dagger|})dx.$$ We further define, for a signal \(x:\mathcal{A}\to\mathbb{R}^d\) and \(\nu\in\mathcal{T}\), the transform $$T_\nu[x]=\sum_{a\in\mathcal{A}}x(a)\phi_{\mathcal{A}_\nu}(a),$$ which enables reconstruction of the signal via \(x(a) = \sum_{\nu'\in \mathcal{T}} T_{\nu'}[x] \phi_{\mathcal{A}_{\nu'}}(a)\).
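As a minimal sketch, assuming the Haar wavelet for \(\phi\) (which satisfies the stated support, zero-mean, and unit-norm conditions), the rescaled weights \(\phi_{\mathcal{A}^\dagger}\) and the transform \(T_\nu\) can be computed as follows (function names are hypothetical):

```python
import math

def haar_integral(u0, u1):
    """Integral over [u0, u1] of the Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1)."""
    pos = max(0.0, min(u1, 0.5) - max(u0, 0.0))
    neg = max(0.0, min(u1, 1.0) - max(u0, 0.5))
    return pos - neg

def phi_rescaled(a, cell):
    """phi_{A†}(a) = (1/sqrt(|A†|)) * integral over [a, a+1] of phi((x - a1)/|A†|)."""
    a1, n = min(cell), len(cell)
    # substitute u = (x - a1)/n, so dx = n du
    return (n / math.sqrt(n)) * haar_integral((a - a1) / n, (a - a1 + 1) / n)

def transform(x, cell):
    """T_nu[x] = sum over a of x(a) * phi_{A_nu}(a); phi vanishes outside A_nu."""
    return sum(x[a] * phi_rescaled(a, cell) for a in cell)
```

For a two-arm cell this reduces to \(T_\nu[x]=(x(a_1)-x(a_2))/\sqrt{2}\), the familiar Haar detail coefficient contrasting the two arms.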

For each node \( \nu \in \mathcal{T} \), denoting by \( p_\nu=(\nu_0,\dots,\nu_{j-1},\nu_j) \) the sequence of ancestor nodes of \(\nu=\nu_j\), such that \(\mathcal{A}_\nu\subset\mathcal{A}_{\nu_{j-1}}\subset\dots\subset\mathcal{A}_{\nu_0}=\mathcal{A}\), we can now use the \(\mathcal{T}\)-wavelet coefficients at these nodes to reconstruct a partial version \(x[\nu]\) of the signal, adapted to the precision of node \(\nu\).
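A partial reconstruction along the ancestor path \(p_\nu\) can be sketched as follows (a hypothetical 4-arm tree with discrete Haar weights on even-sized cells; since every \(\phi_{\mathcal{A}_{\nu'}}\) integrates to zero, the reconstruction recovers the signal up to its global mean):

```python
import math

def phi(cell):
    """Discrete Haar weights on an even-sized cell of consecutive arms."""
    n = len(cell)
    return {a: (1 if i < n // 2 else -1) / math.sqrt(n) for i, a in enumerate(cell)}

def coeff(x, cell):
    """Wavelet coefficient T_nu[x] on the given cell."""
    w = phi(cell)
    return sum(x[a] * w[a] for a in cell)

# Hypothetical tree: root {0,1,2,3} with children {0,1} and {2,3}.
cells = {"root": [0, 1, 2, 3], "left": [0, 1], "right": [2, 3]}

def partial_reconstruction(x, a, path):
    """x[nu](a): keep only the coefficients at the ancestors p_nu of nu."""
    return sum(coeff(x, cells[v]) * phi(cells[v]).get(a, 0.0) for v in path)
```

For \(x = (5, 3, 1, 1)\), the path \((\mathrm{root}, \mathrm{left})\) reconstructs \(x(0)\) up to the global mean \(2.5\), i.e., it returns \(5 - 2.5 = 2.5\).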

Illustrative Example

Consider a bandit problem in agriculture involving three initial treatments: nitrogen application, tillage, and plant part removal.

This structure corresponds to the following hierarchical tree \( \mathcal{T} \):

        Root (All Actions): {1, 2, 3, 4, 5}
        ├── Node 1: {1, 2} (Nitrogen Application)
        │   ├── Leaf 1: {1} (Low Dose)
        │   └── Leaf 2: {2} (High Dose)
        ├── Node 2: {3, 4} (Tillage)
        │   ├── Leaf 3: {3} (Low Intensity)
        │   └── Leaf 4: {4} (High Intensity)
        └── Node 3: {5} (Plant Part Removal)
             └── Leaf 5: {5} (Single Option)
    

Initially, decisions are made at a coarse level, exploring the broader categories \( \{\{1, 2\}, \{3, 4\}, \{5\}\} \), referred to as a "slice" of \( \mathcal{T} \). As more data is gathered, refinements such as \( \{\{1\}, \{2\}, \{3, 4\}, \{5\}\} \) allow for more detailed exploration of nitrogen doses while keeping other treatments aggregated.
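The progressive refinement of slices described above can be sketched as follows (hypothetical helper names; arm subsets are frozensets so they can index the children map of \( \mathcal{T} \)):

```python
# Children map of the example tree T (restricted to the splittable nodes).
children = {
    frozenset({1, 2}): [frozenset({1}), frozenset({2})],
    frozenset({3, 4}): [frozenset({3}), frozenset({4})],
}

def refine(slice_, node):
    """Replace `node` in the slice by its children, yielding a finer slice of T."""
    return [c for s in slice_ for c in (children[s] if s == node else [s])]

coarse = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5})]
finer = refine(coarse, frozenset({1, 2}))
# finer is [{1}, {2}, {3, 4}, {5}]: nitrogen doses split, other treatments aggregated
```

Each refinement keeps the slice a valid partition of \( \mathcal{A} \), since children always partition their parent's arm set.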

Using wavelet-like decomposition, the signal (mean reward \( \mathbf{m}: \mathcal{A} \to \mathbb{R} \)) is analyzed hierarchically. Transformations \( T_\nu[\mathbf{m}] \) capture reward variations at each node, and partial reconstructions \( \mathbf{m}[\nu] \) summarize the aggregated performance across levels.
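As a numerical illustration with hypothetical mean rewards \( \mathbf{m} = (0.6, 0.4, 0.3, 0.3, 0.5) \) and a Haar-type \(\phi\): at Node 1 (\(\mathcal{A}_\nu = \{1,2\}\)) the rescaled weights are \(\pm 1/\sqrt{2}\), so $$T_\nu[\mathbf{m}] = \frac{0.6 - 0.4}{\sqrt{2}} \approx 0.14,$$ a positive coefficient indicating that the low nitrogen dose outperforms the high dose within its category, while the coefficient at Node 2 vanishes since both tillage intensities share the same mean reward.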

This hierarchical structure ensures efficient exploration by leveraging information from broader categories to guide finer-level decisions, optimizing both exploration and exploitation across the decision space.

Proposed Methodology

The primary challenge lies in balancing exploration and exploitation across levels while maintaining coherence between coarse and fine decisions. This requires designing algorithms that exploit the tree-partition structure and the associated wavelet coefficients to transfer information across granularities.

The research will focus on theoretical analyses establishing regret bounds and convergence properties, together with empirical studies validating the framework in domains such as agriculture and adaptive testing.

Impact and Relevance

Hierarchical bandits address a critical gap in decision-making under uncertainty by incorporating structured relationships among actions. This research has significant implications for fields requiring adaptive and granular decision-making, such as personalized medicine, education, and sustainable agriculture. By aligning theoretical advancements with practical needs, the proposed work aims to bridge the gap between abstract modeling and real-world applications.

Host Institution and Supervision

The project will be hosted at Centre Inria de l'Université de Lille, in the Scool team. Scool (Sequential COntinual and Online Learning) focuses on the study of sequential decision-making under uncertainty.