Centre Inria de l'Université de Lille
Team project Scool -- Spring 2025
Keywords: Multi-armed bandits, Group-sequential, Hypothesis testing.
Investigator: The project is proposed and advised by Odalric-Ambrym Maillard from the Inria project-team Scool.
Place: This project will be primarily held at the research center Inria Lille -- Nord Europe, 40 avenue Halley, 59650 Villeneuve d'Ascq, France.
Experimental research in long-cycle domains such as agroecology and clinical trials faces unique challenges that arise from the interplay between the length of individual experiments and the need for timely, actionable insights. Unlike short-duration studies, where results can be quickly synthesized, these domains require extended observation periods—spanning entire growing seasons for agricultural practices or several months to years for clinical outcomes. This temporal constraint makes traditional experimental designs insufficiently agile to address the dynamic needs of stakeholders, such as farmers seeking practical guidance or clinicians adapting to emerging therapeutic options.
In agroecology, the study of sustainable farming practices involves complex, multivariate systems that demand meticulous observation over extended periods to capture outcomes like crop yield, soil health, and biodiversity. While the underlying agroecosystem remains relatively stationary over these timescales, the set of available practices may evolve incrementally. New innovations, such as biofertilizers or intercropping techniques, emerge and must be integrated into ongoing studies. Simultaneously, older practices may become obsolete as they are outperformed by more sustainable or cost-effective alternatives. The evolving action space reflects the practical reality of agricultural systems, where farmers need timely recommendations that can be implemented across diverse contexts without waiting for the full conclusion of multi-year trials.
Similarly, clinical trials for chronic conditions operate within a stable disease framework but may face an evolving therapeutic landscape. The introduction of novel drugs, combination therapies, or personalized treatment approaches alters the available set of interventions, while older treatments may be deprioritized due to inferior efficacy or emerging safety concerns. Given the high stakes of patient outcomes, ethical imperatives demand early identification of effective therapies to benefit current patients, even as the experiment is still underway. Long observation periods, required to assess efficacy and safety, further complicate the design of traditional trials, necessitating methodologies that can accommodate gradual changes in the set of available treatments without compromising statistical rigor or ethical oversight.
The confluence of these factors—long experiment cycles, the evolving set of practices or treatments, and the need for real-time decision-making—positions the research challenge at the intersection of adaptive experimentation and group-sequential methodologies. Unlike classical designs that assume fixed action spaces, the evolving nature of agroecological practices and clinical therapies calls for a dynamic approach where new options can be integrated into the experimental framework as they emerge. Group-sequential methods provide a natural structure for addressing these challenges by enabling interim analyses that respect the long observational cycles while ensuring early actionable insights. This framework allows researchers to dynamically manage evolving action sets, balance ethical considerations, and maximize the practical utility of their findings, all while maintaining the robustness and reliability demanded in high-stakes experimental settings.
Building on the challenges of long-cycle experimentation and evolving action sets, a further complexity arises when incorporating contextual information into decision-making. Unlike standard hypothesis testing or multi-armed bandit frameworks that operate under stationary and context-independent assumptions, many real-world applications require models that adapt to variations in contextual features. For example, in agroecology, the optimal practice for a specific field may depend on soil type, weather patterns, or crop variety. Similarly, in clinical trials, the efficacy of a treatment may vary across patient subgroups, such as age, comorbidities, or genetic markers. Addressing this contextual dependency requires extending traditional methods to contextual hypothesis testing and, correspondingly, to contextual bandits with a response model that depends on observable covariates, such as a linear response function.
This contextual extension introduces a new paradigm for experimentation that we term "hypothesis charting" or "bandit charting" (akin to contextual best-arm identification). The goal is no longer simply to identify the best overall action but rather to map the contextual space to determine which action is optimal for each region of the context. In essence, this process seeks to chart a "map" of contexts where the superiority of each action can be established with confidence. The challenge becomes efficiently identifying these regions while balancing exploration and exploitation, particularly under resource constraints or ethical considerations. This paradigm intersects with both the hypothesis testing and bandit axes, as well as the fully sequential and group-sequential axes. The most challenging formulation emerges in group-sequential bandit charting, where the need for interim decisions, evolving contexts, and adaptive experimentation converges.
We consider a sequential decision-making framework where the learner interacts with an environment over multiple rounds \( t = 1, 2, \ldots \). At each round, the learner is presented with a **context** \( x_t \in \mathbb{R}^d \), drawn from a distribution \( \mathcal{D} \). Based on this context, the learner selects an **action** \( a_t \in \mathcal{A}_t \), where \( \mathcal{A}_t \subseteq \mathcal{A} \) denotes the set of available actions at time \( t \). Importantly, we allow \( \mathcal{A}_t \) to evolve over time, reflecting practical scenarios where actions (e.g., treatments, interventions) can appear, disappear, or be deemed obsolete.
Upon selecting an action \( a_t \), the learner observes a **reward** \( r_t \sim \mathcal{R}(x_t, a_t) \) drawn from a stochastic reward distribution whose mean is modeled as:
$$ \mathbb{E}[r_t | x_t, a_t] = \mu(x_t, a_t) = x_t^\top \theta_{a_t}. $$
Here, \( \theta_{a_t} \in \mathbb{R}^d \) is the parameter vector associated with action \( a_t \), and the linear reward structure reflects the **contextual nature** of the problem.
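One round of this interaction protocol can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the parameter vectors in `thetas`, the uniform context distribution, the Gaussian noise, and the uniformly random action choice are all assumptions made for the example.

```python
import random


def expected_reward(x, theta):
    """Linear mean reward mu(x, a) = x^T theta_a."""
    return sum(xi * ti for xi, ti in zip(x, theta))


def draw_reward(x, theta, noise_std=0.1, rng=random):
    """One noisy reward with mean x^T theta_a (Gaussian noise is an assumption)."""
    return expected_reward(x, theta) + rng.gauss(0.0, noise_std)


# Hypothetical parameter vectors for two actions in dimension d = 2.
thetas = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}

rng = random.Random(0)
x_t = [rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)]  # context x_t drawn from D (uniform here)
available = ["a1", "a2"]                               # available action set A_t
a_t = rng.choice(available)                            # placeholder policy: uniform choice in A_t
r_t = draw_reward(x_t, thetas[a_t], rng=rng)           # observed reward r_t
```

Any learning strategy would replace the uniform choice of `a_t`; the rest of the loop stays the same.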
The overarching goal depends on the specific problem setting:
For each action \( a \), a key object of study is the set of contexts \( \mathcal{C}_a \) for which \( a \) is the optimal action. Formally:
$$ \mathcal{C}_a = \{ x \in \mathbb{R}^d : \mu(x, a) > \mu(x, a'), \forall a' \neq a \}. $$
In other words, \( \mathcal{C}_a \) is the region of the context space where the reward for action \( a \) is strictly greater than the reward for all other actions. Identifying these regions is crucial for understanding the decision boundaries between actions and plays a key role in defining efficient sampling strategies.
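Under the linear model \( \mu(x, a) = x^\top \theta_a \), each pairwise comparison is a linear inequality in \( x \), so the region can be rewritten as an intersection of open half-spaces:

$$ \mathcal{C}_a = \bigcap_{a' \neq a} \{ x \in \mathbb{R}^d : x^\top (\theta_a - \theta_{a'}) > 0 \}. $$

In particular, \( \mathcal{C}_a \) is a convex cone whose boundaries lie on the hyperplanes \( x^\top (\theta_a - \theta_{a'}) = 0 \). When the parameters are known, the "chart" can be read off directly, as in the sketch below. The parameter values and the grid of contexts are hypothetical; in practice the \( \theta_a \) are unknown and must be estimated.

```python
def mu(x, theta):
    """Linear mean reward mu(x, a) = x^T theta_a."""
    return sum(xi * ti for xi, ti in zip(x, theta))


def optimal_action(x, thetas):
    """Return the action a such that x lies in C_a, or None if x is on a boundary (tie)."""
    ranked = sorted(thetas, key=lambda a: mu(x, thetas[a]), reverse=True)
    best = ranked[0]
    if len(ranked) > 1 and mu(x, thetas[best]) == mu(x, thetas[ranked[1]]):
        return None  # strict inequality fails: x belongs to no C_a
    return best


# Hypothetical parameters: a1 is favoured by the first feature, a2 by the second.
thetas = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}

# Chart the optimal action over a coarse grid of contexts in [0, 1]^2.
grid = [[i / 4, j / 4] for i in range(5) for j in range(5)]
chart = {tuple(x): optimal_action(x, thetas) for x in grid}
```

Contexts on the diagonal of the grid map to `None`, which reflects the strict inequality in the definition of \( \mathcal{C}_a \).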
In the group-sequential setting, decisions are made not round by round but in batches, or groups, of rounds. At the end of each group \( g = 1, 2, \ldots, G \), the learner evaluates the accumulated evidence to decide whether to continue or stop. Formally, the timeline of \( T \) rounds is divided into \( G \) groups, the \( g \)-th comprising \( n_g \) rounds, so that \( T = \sum_{g=1}^G n_g \). The stopping criteria are defined in terms of statistical thresholds, balancing exploration, exploitation, and error control.
For example, in a group-sequential hypothesis testing setup, the null hypothesis \( H_0: \mu(x, a_1) \leq \mu(x, a_2) \) might be rejected in favor of the alternative \( H_1: \mu(x, a_1) > \mu(x, a_2) \) if:
$$ \text{Cumulative evidence: } \sum_{t \in \text{Group } g} \big( r_t(a_1) - r_t(a_2) \big) > \text{Threshold}. $$
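A minimal sketch of such an interim-analysis rule is given below, under several assumptions that are not part of the text above: both rewards \( r_t(a_1) \) and \( r_t(a_2) \) are observed in each round (paired differences), the standard deviation `sigma` of one difference is known, evidence is accumulated over all groups up to the current look, and the threshold is a Bonferroni-corrected normal quantile. The Bonferroni correction is conservative; boundaries such as those of Pocock or O'Brien-Fleming are common alternatives.

```python
import math
from statistics import NormalDist


def group_sequential_test(groups, sigma, alpha=0.05):
    """One-sided group-sequential test of H0: mean difference <= 0.

    groups: list of G lists of paired differences r_t(a1) - r_t(a2).
    sigma:  assumed known standard deviation of a single difference.
    Returns (decision, look) where look is the group at which the test stopped.
    """
    G = len(groups)
    # Bonferroni split of alpha across the G looks (conservative by the union bound).
    threshold = NormalDist().inv_cdf(1.0 - alpha / G)
    total, n = 0.0, 0
    for g, diffs in enumerate(groups, start=1):
        total += sum(diffs)
        n += len(diffs)
        if n == 0:
            continue  # no data yet, nothing to test at this look
        z = total / (sigma * math.sqrt(n))  # standardized cumulative evidence
        if z > threshold:
            return "reject H0", g           # early stopping at look g
    return "fail to reject H0", G
```

For example, `group_sequential_test([[1.0] * 10, [1.0] * 10], sigma=1.0)` stops at the first look, since the standardized evidence already exceeds the threshold there.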
To handle evolving action sets, we redefine the decision process dynamically. At each round \( t \), the learner is presented with a subset \( \mathcal{A}_t \subseteq \mathcal{A} \), where \( \mathcal{A}_t \) may change over time. This reflects real-world constraints, such as:
The key challenge lies in maintaining **optimality** despite the dynamically evolving action set. The reward function remains contextual, but the learner must adaptively re-evaluate the expected rewards \( \mu(x_t, a) \) for actions \( a \in \mathcal{A}_t \) as the set evolves.
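The bookkeeping required by an evolving action set can be sketched as follows. The class names `ArmModel` and `EvolvingActionSet`, the ridge-regression estimator, and the rule "try each unexplored action once, then act greedily" are illustrative assumptions; this is a deliberately naive baseline, not a strategy with optimality guarantees.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


class ArmModel:
    """Ridge statistics for one action: V = lam * I + sum x x^T, b = sum r x."""

    def __init__(self, d, lam=1.0):
        self.V = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d
        self.n = 0  # number of observations for this action

    def update(self, x, r):
        for i, xi in enumerate(x):
            self.b[i] += r * xi
            for j, xj in enumerate(x):
                self.V[i][j] += xi * xj
        self.n += 1

    def estimate(self, x):
        """Estimated mean reward x^T theta_hat, with theta_hat = V^{-1} b."""
        theta_hat = solve(self.V, self.b)
        return sum(xi * ti for xi, ti in zip(x, theta_hat))


class EvolvingActionSet:
    def __init__(self, d):
        self.d = d
        self.models = {}     # statistics for every action ever introduced
        self.active = set()  # currently available actions A_t

    def add(self, a):
        self.models.setdefault(a, ArmModel(self.d))
        self.active.add(a)

    def retire(self, a):
        self.active.discard(a)  # statistics are kept in case the action returns

    def greedy(self, x):
        if not self.active:
            raise ValueError("no available action")
        for a in sorted(self.active):
            if self.models[a].n == 0:
                return a        # a newly introduced action is tried first
        return max(sorted(self.active), key=lambda a: self.models[a].estimate(x))
```

Keeping the statistics of retired actions is a design choice: it costs little memory and avoids relearning if an action becomes available again.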
For a given setting, the learner optimizes one of the following objectives:
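**Hypothesis testing:** decide between
$$ H_0: \mu(x, a_1) \leq \mu(x, a_2), \quad \text{vs.} \quad H_1: \mu(x, a_1) > \mu(x, a_2), $$
with **group-sequential error control**.
**Best-arm identification:** identify, for each context \( x \), the optimal available action
$$ a^*(x) = \arg\max_{a \in \mathcal{A}_t} \mu(x, a). $$
**Regret minimization:** minimize the cumulative regret
$$ \text{Regret}(T) = \sum_{t=1}^T \big(\mu(x_t, a^*(x_t)) - \mu(x_t, a_t)\big). $$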
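The regret above can be evaluated in simulation, where the true parameters are known. The sketch below assumes hypothetical parameter vectors `thetas` and takes the benchmark \( a^*(x_t) \) over the action set \( \mathcal{A}_t \) available at round \( t \); the function name `cumulative_regret` is introduced for illustration only.

```python
def mu(x, theta):
    """Linear mean reward mu(x, a) = x^T theta_a."""
    return sum(xi * ti for xi, ti in zip(x, theta))


def cumulative_regret(contexts, chosen, available_sets, thetas):
    """Sum over rounds of mu(x_t, a*(x_t)) - mu(x_t, a_t), with a* taken over A_t."""
    regret = 0.0
    for x, a, A_t in zip(contexts, chosen, available_sets):
        best = max(mu(x, thetas[b]) for b in A_t)  # value of the best available action
        regret += best - mu(x, thetas[a])
    return regret


# Hypothetical two-action, two-round example.
thetas = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
contexts = [[1.0, 0.0], [0.0, 1.0]]
chosen = ["a1", "a1"]
regret_full = cumulative_regret(contexts, chosen, [["a1", "a2"], ["a1", "a2"]], thetas)
regret_restricted = cumulative_regret(contexts, chosen, [["a1", "a2"], ["a1"]], thetas)
```

In this example the learner is only penalized in the second round when `a2` is actually available, which illustrates how the benchmark moves with the action set.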
The most challenging and novel setting combines both dimensions:
The decision-making strategy must account for:
1. Agroecological Example: Consider an agricultural field divided into multiple regions, each characterized by a unique context \( x \), including soil type, sunlight exposure, and moisture levels. The actions \( \mathcal{A} \) represent different cropping strategies, such as intercropping, monoculture, and organic fertilizers. Initially, only a subset of actions \( \mathcal{A}_0 \subset \mathcal{A} \) is available due to resource constraints or prior knowledge. Over time, new actions are introduced as additional techniques or resources become accessible, dynamically expanding the action set \( \mathcal{A}_t \).
In the group-sequential setting, experiments are conducted in predefined phases, with each phase corresponding to a growing season. After each phase, the cumulative data are analyzed to update the estimates of the expected rewards \( \mu(x, a) \) for each action \( a \) and context \( x \). For example, the analysis may reveal that intercropping is superior in sandy soils but less effective in clay soils with low rainfall. The goal is to iteratively refine the "chart" of optimal actions across the context space while ensuring efficient use of experimental resources.
As the action set evolves, the challenge becomes balancing exploration of newly introduced cropping strategies with exploitation of previously identified optimal strategies, all while respecting the sequential nature of the growing seasons.
2. Clinical Trial Example: In a clinical trial setting, the context \( x \) represents patient characteristics, such as age, comorbidities, and genetic markers. The actions \( \mathcal{A} \) correspond to different treatment options. Initially, only a few treatments are available, and new treatments are introduced over time as they pass preliminary safety checks.
In the group-sequential framework, the trial is conducted in stages, with interim analyses performed after each batch of patients is enrolled. For example, after the first stage, it may be observed that Treatment A is more effective for older patients, whereas Treatment B is optimal for younger patients with a specific genetic marker. These findings inform the allocation of patients in subsequent stages, prioritizing exploration of less-understood subgroups while exploiting the known optimal treatments for certain contexts.
The evolving action set adds complexity to the problem, as newly introduced treatments must be evaluated against existing treatments while maintaining ethical constraints and ensuring patient safety. The ultimate goal is to map the context space to determine the optimal treatment for each patient subgroup as efficiently and rigorously as possible.
We organize this research program around two key axes of methodological advancement: the integration of group-sequential methodologies with contextual structure and the incorporation of evolving action sets into sequential decision-making frameworks. While these axes are largely independent, their combination within the three fundamental problems of hypothesis testing, best arm identification (BAI), and regret minimization offers unique challenges and opportunities for advancing the field. Our approach is to methodically explore each axis, starting from the foundational problems, and progressively addressing their most challenging extensions.
The first axis focuses on the intersection of group-sequential methods and contextual decision-making. Standard hypothesis testing and BAI approaches typically assume continuous or fully sequential data collection, while group-sequential methods partition observations into pre-determined stages, allowing for interim analyses. Extending these methods to incorporate contextual information, such as environmental conditions in agriculture or patient demographics in healthcare, introduces additional complexity. For hypothesis testing, this involves designing group-sequential tests that remain robust under varying contextual conditions. For BAI, the goal is to efficiently identify the best action in each region of the contextual space while balancing exploration and exploitation across stages. In regret minimization, the challenge is to adaptively optimize decisions while respecting group-sequential constraints on information flow and error control.
The second axis addresses the dynamic nature of action sets in many real-world applications. Traditional bandit problems assume a fixed set of actions, but practical scenarios often feature new actions emerging over time and others becoming obsolete. This evolution poses unique challenges for hypothesis testing, BAI, and regret minimization. For instance, in hypothesis testing, new hypotheses may need to be tested as they appear, requiring flexible designs that account for a growing number of comparisons. In BAI, the identification of optimal actions must be revisited as new actions are introduced, demanding algorithms that can efficiently explore and evaluate these additions. Similarly, in regret minimization, the objective is to maintain near-optimal performance despite the changing landscape of available actions.
The most ambitious objective combines these two axes: group-sequential decision-making in the presence of both contextual structure and evolving action sets. For example, this could involve mapping the optimal agricultural practice across varying soil types and weather conditions while accounting for the introduction of new practices over time. Alternatively, in clinical trials, this might mean identifying the best treatment for patient subgroups as new drugs enter the pipeline. By unifying group-sequential methods, contextual models, and adaptive mechanisms for evolving action sets, we aim to develop a comprehensive framework capable of addressing these complex, real-world problems.
This research program is designed to contribute not only to theoretical advancements but also to academic growth and interdisciplinary collaboration. Each step in the program offers opportunities for students and researchers to engage with cutting-edge challenges, fostering a deeper understanding of statistical and machine learning principles. Moreover, the program’s focus on applications in agronomy and healthcare underscores its societal relevance, providing a foundation for future work in sequential experimentation and decision-making under uncertainty.
The project will be hosted at Centre Inria de l'Université de Lille, in the Scool team. Scool (Sequential COntinual and Online Learning) focuses on the study of sequential decision-making under uncertainty.