Epistemic Exploration Toward Artificial General Intelligence

The Five-Level Trajectory Toward AGI. Exploration serves as the transition mechanism across five levels of increasing agent sophistication: Responder → Reasoner (reasoning space), Reasoner → Agent (interaction space), Agent → Prospector (imagination space), and Prospector → Ecosystem (coordination space).

🔥 This is a curated paper list for the survey "Epistemic Exploration Toward Artificial General Intelligence", covering exploration mechanisms across reasoning, embodied AI, world models, and multi-agent systems.

🔥 Stay tuned for our full paper release, incorporating the latest developments.

[Always] We welcome all related papers! If you find any missed or new work, please open a Pull Request or contact us. We will keep this list updated frequently!

1. Overview

1.1 What is Epistemic Exploration?

Epistemic exploration is the agent's capacity to actively acquire information that reduces its uncertainty about the world, convert that reduction into durable policy improvement, and keep future acquisition possible.

Unlike undirected exploration (e.g., ε-greedy), epistemic exploration is intentional, belief-driven, and multi-scale: the agent reasons about which actions are most informative and plans multi-step information-gathering strategies across reasoning trajectories, tool-use policies, embodied sensorimotor loops, world-model rollouts, and multi-agent coordination protocols.

1.2 Three Criteria

Foundation of Epistemic Exploration — Why, What, and How.

We ground epistemic exploration in three jointly necessary criteria, each addressing a distinct failure mode of static optimisation:

C1: Information Gain

Actively reduces epistemic uncertainty via belief-updating observations.

  • Failure Mode: Belief Stagnation (frozen internal model under distribution shift)
  • Explores: ...where it knows least

C2: Value Improvement

Converts new information into durable policy improvement.

  • Failure Mode: Value Stagnation (local optima lock-in, surrogate misalignment)
  • Explores: ...what it cannot yet do well

C3: Epistemic Reachability

Preserves positive visitation over belief-consistent regions.

  • Failure Mode: Reachability Collapse (irreversible contraction of behavioural diversity)
  • Explores: ...where it might otherwise never go

These form a closed loop: gain information → convert to value → keep the capacity to gain information alive → ...

1.3 Unified Epistemic Exploration Objective

The three criteria combine into a single constrained objective:

$$ \pi_{\mathfrak{A},t}^{*} \;=\; \underset{\underbrace{\pi_{\mathfrak{A}} \,\in\, \Pi_{\mathrm{reach}}(b_t)}_{\text{Reachability (C3)}}}{\arg\max}\; \underbrace{\mathbb{E}_{\theta \sim b_t} \Big[V^{\pi_{\mathfrak{A}}}_\theta(s_t, h_t)\Big]}_{\text{Value Improvement (C2)}} \;+\;\beta\;\cdot\; \underbrace{\mathbb{E}^{\pi_{\mathfrak{A}}}_{b_t} \left[\sum_{t'=t}^{\infty} \gamma^{\,t'-t}\,\mathcal{U}(s_{t'}, a_{t'};\, b_{t'})\right]}_{\text{Information Gain (C1)}} $$

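where $\mathcal{U}(s, a;\, b) = I(\theta;\, s', r \mid s, a, b)$ is the epistemic-uncertainty term, measured as the mutual information between the latent parameters $\theta$ and the next observation $(s', r)$ under belief $b$; $\beta$ weights information gain against value, $\gamma$ is the discount factor, and $\Pi_{\mathrm{reach}}(b_t)$ is the reachability-feasible policy set.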

Information-Gain Term (C1)

Expected cumulative epistemic uncertainty the agent anticipates resolving along its trajectory.

Value Improvement (C2)

Expected cumulative reward under current beliefs; what a pure exploiter would maximise.

Reachability (C3)

Visitation probability must remain positive over every region that is plausibly relevant under current beliefs, preventing short-term gains from foreclosing future learning.
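
A one-step, discrete-belief analogue of the objective can make the three terms concrete. This is a minimal sketch, not the survey's implementation: the two-hypothesis bandit, the action names `safe` and `unknown`, and the helper names `info_gain` and `select_action` are all assumptions made for illustration. The `reachable` list stands in for $\Pi_{\mathrm{reach}}(b_t)$, the expected reward for the C2 term, and the mutual information for the C1 term.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def info_gain(belief, outcome_probs):
    """Mutual information I(theta; outcome | action) under a discrete belief.

    belief[i]           : probability of hypothesis i
    outcome_probs[i][k] : P(outcome k | hypothesis i, action)
    """
    n_outcomes = len(outcome_probs[0])
    marginal = [sum(b * p[k] for b, p in zip(belief, outcome_probs))
                for k in range(n_outcomes)]
    expected_cond = sum(b * entropy(p) for b, p in zip(belief, outcome_probs))
    return entropy(marginal) - expected_cond

def select_action(belief, models, rewards, reachable, beta=1.0):
    """Argmax over reachable actions of E_theta[reward] + beta * info gain."""
    def score(action):
        probs = [m[action] for m in models]
        value = sum(b * sum(pk * rk for pk, rk in zip(p, rewards))
                    for b, p in zip(belief, probs))
        return value + beta * info_gain(belief, probs)
    return max(reachable, key=score)

# Two hypotheses about a two-armed bandit; outcomes are (fail, success).
belief = [0.5, 0.5]
models = [
    {"safe": [0.4, 0.6], "unknown": [0.9, 0.1]},  # hypothesis 0
    {"safe": [0.4, 0.6], "unknown": [0.1, 0.9]},  # hypothesis 1
]
rewards = [0.0, 1.0]

greedy = select_action(belief, models, rewards, ["safe", "unknown"], beta=0.0)
curious = select_action(belief, models, rewards, ["safe", "unknown"], beta=1.0)
```

With `beta=0.0` the agent picks `safe` (higher expected reward); with `beta=1.0` it picks `unknown`, because only that arm distinguishes the two hypotheses. Dropping `unknown` from `reachable` removes the option entirely, which is the one-step counterpart of a reachability collapse.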

1.4 Five-Level Trajectory Toward AGI

We propose exploration as the transition mechanism between five levels of increasing agent sophistication. Each level introduces a qualitatively new exploration space that the previous level cannot access:

| Transition | Exploration Space | What Becomes Explorable |
| --- | --- | --- |
| L1 → L2: Responder → Reasoner | Reasoning space | Hypotheses, alternative reasoning trajectories, latent thought representations; self-verification and revision |
| L2 → L3: Reasoner → Agent | Interaction space | Embodied perception, tool invocation, memory management, closed-loop action under partial observability |
| L3 → L4: Agent → Prospector | Imagination space | Counterfactual futures in learned world models; the dual exploration problem across real and imagined environments |
| L4 → L5: Prospector → Ecosystem | Coordination space | Communication topologies, co-evolving role specialisations, shared representations, collaborative strategies |

1.5 3×5 Taxonomy

Our survey is organized as a 3×5 taxonomy crossing three signal-driven methodologies with the five levels:

| Methodology | L1 Responder | L2 Reasoner | L3 Agent | L4 Prospector | L5 Ecosystem |
| --- | --- | --- | --- | --- | --- |
| Uncertainty-Driven | (single forward pass; no internal search) | Token / step entropy, entropy-guided branching | Active SLAM, prediction variance, pose uncertainty | Ensemble disagreement in latent world models | Inter-agent disagreement, joint-belief uncertainty |
| Competence-Driven | (no learning loop at inference) | Difficulty-adaptive curricula, self-verification, self-play | Skill bootstrapping, goal-conditioned self-play | Imagination-based skill discovery, learning-progress curricula | Emergent multi-agent self-play, co-evolving curricula |
| Reachability-Driven | (fixed output manifold) | Beam diversity, anti-repetition, KL-to-reference trust regions | Go-Explore, coverage-maximising curricula | Latent-space diversity bonuses, action-entropy regularisation | Role-diversity bonuses, anti-convergence on coordination topologies |

2. Levels 1–2: Responder → Reasoner — Reasoning-Space Exploration

The transition from Responder to Reasoner requires exploration in reasoning space: branching over token sequences, reasoning trajectories, and latent thought representations. The agent must search for informative hypotheses rather than simply produce reactive outputs.

Levels 1–2 Reasoning-Space Exploration — Why (entropy escalation & reward stagnation), Where (tokens → turns → latent trajectories), and How (uncertainty / competence / reachability-driven).

2.1 Uncertainty-Driven Exploration

Methods that prioritise exploration at high-uncertainty branching points in the reasoning process.

Date Method Key Idea Links
2025-08 CURE Expands the training-state distribution at critical decision points to sustain exploration arXiv
GitHub
2026-03 SPINE Preserves exploration by selectively updating high-entropy branch tokens arXiv
2025-06 TreeRL Explores reasoning via on-policy tree search from uncertain intermediate states arXiv
GitHub
2025-09 CE-GPPO Collapses epistemic uncertainty into adaptive exploration bonuses for reasoning RL arXiv
2025-10 STEER Preserves exploration by stabilizing token-level entropy change through adaptive reweighting arXiv
2025-10 AEPO Adaptive entropy-guided policy optimization for exploration in reasoning models arXiv
GitHub
2025-11 ICPO Promotes exploration by combining verifiable rewards with confidence-based preference advantages arXiv
2026-02 REAL Stabilizes exploration via balanced gradient allocation arXiv
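
The common idea behind this category, branching where the next-token distribution is most uncertain, can be sketched in a few lines. This is an illustrative sketch only and does not reproduce any listed method; the function names and the toy `steps` distributions are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def branch_points(step_distributions, k=2):
    """Return the indices of the k highest-entropy steps, in sequence order."""
    ranked = sorted(range(len(step_distributions)),
                    key=lambda i: token_entropy(step_distributions[i]),
                    reverse=True)
    return sorted(ranked[:k])

# Toy next-token distributions for four reasoning steps (hypothetical values).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.90, 0.05, 0.03, 0.02],  # confident
    [0.40, 0.40, 0.10, 0.10],  # two competing continuations
]
forks = branch_points(steps, k=2)
```

Here `forks` selects steps 1 and 3, the two positions where alternative continuations are most plausible, so extra rollouts would be spent there rather than on near-deterministic tokens.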

2.2 Competence-Driven Exploration

Methods that match problem difficulty to the model's evolving competence frontier.

Date Method Key Idea Links
2025-06 E2H Guides exploration through an easy-to-hard curriculum arXiv
GitHub
2025-10 RLAAR Steers exploration through curriculum learning and rewarded abstention arXiv
2025-05 CDAS Explores the competence frontier by sampling problems matched to the model's current ability arXiv
GitHub
2026-01 HA-DW Reduces exploration imbalance by debiasing group-relative advantages across prompt difficulty arXiv
2025-08 SvS Sustains exploration by self-synthesizing diverse but answer-equivalent problems during RLVR arXiv
GitHub
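
One simple way to match problem difficulty to the competence frontier is to sample problems in proportion to `p * (1 - p)`, where `p` is the model's current pass rate on that problem. The weighting rule, function names, and toy pass rates below are assumptions chosen for illustration, not the sampling rule of any method in the table.

```python
import random

def frontier_weights(pass_rates):
    """Weight each problem by p * (1 - p), which peaks at p = 0.5."""
    return [p * (1.0 - p) for p in pass_rates]

def sample_batch(problems, pass_rates, batch_size, rng):
    """Sample a training batch concentrated on partially solved problems."""
    weights = frontier_weights(pass_rates)
    if sum(weights) == 0.0:
        # Every problem is fully solved or fully unsolved: fall back to uniform.
        weights = [1.0] * len(problems)
    return rng.choices(problems, weights=weights, k=batch_size)

problems = ["easy", "frontier", "hard"]
pass_rates = [1.0, 0.5, 0.0]  # hypothetical current pass rates
batch = sample_batch(problems, pass_rates, 8, random.Random(0))
```

Problems that are always solved or never solved receive zero weight, so the batch here contains only the `frontier` problem. In practice the pass rates would be re-estimated as training proceeds, which moves the frontier.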

2.3 Reachability-Driven Exploration

Methods that prevent irreversible contraction of reasoning trajectory distributions.

Date Method Key Idea Links
2025-10 TROLL Stabilizes exploration with principled trust-region updates instead of PPO-style clipping arXiv
2024-04 ROPO Preserves useful exploration by downweighting noisy preference signals instead of overfitting to them arXiv
2025-05 KTAE Explores better by assigning credit to key reasoning tokens rather than whole rollouts arXiv
GitHub
2025-08 VRPRM Guides exploration with visual step-level rewards that encourage deeper reasoning paths arXiv
2026-01 RLVRR Turns sparse end rewards into a verifiable reward chain that supports broader open-ended exploration arXiv
GitHub
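
A KL-to-reference trust region, one of the reachability mechanisms named in the taxonomy, can be sketched for a single discrete distribution. This is a minimal illustration under stated assumptions: the KL direction, the bisection over a mixing weight, and the names `kl` and `project_to_trust_region` are choices made here and are not taken from TROLL or any other listed paper.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def project_to_trust_region(reference, proposal, max_kl, iters=30):
    """Pull `proposal` toward `reference` until KL(result || reference) <= max_kl.

    Bisects the mixing weight w in (1 - w) * reference + w * proposal.
    The KL to the reference grows monotonically with w, so bisection is valid.
    """
    def mix(w):
        return [(1.0 - w) * r + w * p for r, p in zip(reference, proposal)]

    if kl(proposal, reference) <= max_kl:
        return list(proposal)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(mix(mid), reference) <= max_kl:
            lo = mid
        else:
            hi = mid
    return mix(lo)

reference = [0.25, 0.25, 0.25, 0.25]
proposal = [0.97, 0.01, 0.01, 0.01]   # nearly collapsed onto one option
clipped = project_to_trust_region(reference, proposal, max_kl=0.1)
```

The projected distribution still favours the first option but keeps substantial probability on the others, which is the sense in which a trust region prevents irreversible contraction.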

3. Level 3: Reasoner → Agent — Perception- & Action-Space Exploration

At Level 3, the agent crosses from internal reasoning into situated interaction with external environments. Exploration unfolds in perception and action space, where every step incurs real cost. The transition splits into Digital Agents (software-mediated) and Embodied Agents (physical interaction).

3.1 Digital Agents

Agents operating in software-mediated environments (web, APIs, code interpreters):

3.1.1 Uncertainty-Driven Exploration

Methods that acquire information under partial observability by prioritising uncertain states, tool calls, or capability boundaries.

Date Method Key Idea Links
2026-01 JitRL Uses count-based uncertainty bonuses to explore unseen state-action pairs arXiv
GitHub
2023-05 RAP Explores alternative reasoning paths with MCTS and UCB guidance Link
GitHub
2024-08 Agent Q Expands high-value action trajectories via MCTS-guided exploration arXiv
2023-10 LATS Explores reasoning-action branches through language-agent tree search arXiv
GitHub
2025-04 KnowSelf Explores capability boundaries by detecting uncertain self-knowledge arXiv
GitHub
2025-01 Search-o1 Explores external evidence when reasoning exposes knowledge uncertainty Link
GitHub
2025-04 TTRL Test-time RL via majority-voted pseudo-rewards turns inference disagreement into exploration arXiv
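
Several entries above guide tree search with a UCB-style rule. The sketch below shows plain UCB1 action selection over visit statistics; the `stats` layout, the action names, and the constant `c=1.4` are assumptions, and the listed methods use their own variants.

```python
import math

def ucb_select(stats, c=1.4):
    """Pick the action with the highest UCB1 score.

    stats maps action -> (visit_count, total_value).
    Unvisited actions get an infinite score and are tried first.
    """
    total = sum(n for n, _ in stats.values())

    def score(action):
        n, value = stats[action]
        if n == 0:
            return float("inf")
        return value / n + c * math.sqrt(math.log(total) / n)

    return max(stats, key=score)

# Hypothetical statistics for three candidate tool calls.
stats = {"click": (100, 60.0), "search": (5, 2.5), "scroll": (0, 0.0)}
first = ucb_select(stats)
```

`first` is `scroll` because it has never been tried. Among visited actions, `search` outranks `click` despite a lower mean value, since its small visit count leaves a larger uncertainty bonus.
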
3.1.2 Competence-Driven Exploration

Methods that tame combinatorial tool-use spaces through curricula, process-level credit assignment, and self-generated training tasks.

Date Method Key Idea Links
2025-08 PilotRL Stages curricula to expand agent exploration from planning to tool use arXiv
2025-09 ReSum-GRPO Sustains long-horizon search exploration through context summarization arXiv
2024-03 ETO Optimizes exploratory trial-and-error trajectories for agent learning arXiv
GitHub
2024-11 WebRL Self-evolving online curriculum from failure trajectories for web agents arXiv
2025-09 Planner-R1 Uses dense process rewards to steer exploration toward feasible plans arXiv
2025-08 RLTR Rewards complete tool-use processes to improve exploratory planning arXiv
2025-04 ReTool RL rewards strategic tool-invocation patterns, penalises redundant calls arXiv
2025-05 GiGPO Assigns state-level credit across grouped rollouts for exploration arXiv
GitHub
2025-11 Agent0-VL Evolves tool-integrated exploration through repeated reasoning cycles arXiv
GitHub
2025-05 Absolute Zero Uses proposer-solver self-play to explore new reasoning tasks arXiv
GitHub
3.1.3 Reachability-Driven Exploration

Methods that preserve behavioural flexibility by regulating entropy or injecting useful off-policy experience.

Date Method Key Idea Links
2025-08 EGPO Adds entropy bonuses to encourage exploration in function-call reasoning arXiv
GitHub
2025-09 EPO Regularizes entropy to sustain exploration in multi-turn agent RL arXiv
GitHub
2025-09 ENTROPO Uses entropy-enhanced preferences to diversify coding-agent exploration arXiv
2026-03 RAPO Expands policy exploration with retrieval-augmented experience arXiv
2026-04 E³-TIR Branches from high-entropy prefixes to exploit exploratory experience arXiv
GitHub

3.2 Embodied Agents

Level 3 Embodied Agent Exploration — Uncertainty-driven active perception, competence-driven navigation & RL & test-time compute, and reachability-driven reward engineering & constrained safety.

Embodied agents operate in continuous, high-dimensional action spaces where every physical interaction costs time and energy and causes mechanical wear. The three exploration paradigms adapt as follows:

3.2.1 Uncertainty-Driven Exploration

Geometric & high-fidelity reconstruction: Viewpoint selection for active mapping, information-theoretic coverage, and ensemble-disagreement-based exploration of dynamics.

Date Method Key Idea Links
2018-10 MAX Ensemble-disagreement drives active exploration of dynamics arXiv
GitHub
2020-04 Active Neural SLAM Coverage-maximising hierarchical policies explore unknown occupancy maps arXiv
GitHub
2021-03 APT Non-parametric entropy maximisation for unsupervised active pre-training arXiv
GitHub
2023-12 Model-Free Active Exploration Information-theoretic lower-bound approximation for ensemble-based exploration Link
2024-10 ActiveSplat Gaussian-splat viewpoint exploration maximises reconstruction fidelity under a time budget arXiv
GitHub
2023-11 Conan Active interactive exploration as Bayesian query to disambiguate latent scene state arXiv
GitHub
2024-04 ActiveRIR Cross-modal audio-visual exploration for acoustic scene mapping (room impulse responses) arXiv
2025-10 Active Semantic Perception Entropy-driven exploration over LLM-sampled scene graph hypotheses arXiv
GitHub

Semantic & multi-modal active inference: Probing the environment to disambiguate alternative scene-graph completions or to gather cross-modal (audio/language) evidence.
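
Ensemble disagreement, the signal behind several methods in this subsection, can be sketched as the variance of next-state predictions across ensemble members. The two-dimensional predictions and the action names below are hypothetical, and the sketch is not the implementation of any listed method.

```python
from statistics import pvariance

def disagreement(predictions):
    """Mean per-dimension variance across ensemble members' predictions."""
    dims = len(predictions[0])
    return sum(pvariance([p[d] for p in predictions]) for d in range(dims)) / dims

def most_informative(candidates):
    """candidates maps action -> list of ensemble predictions for that action."""
    return max(candidates, key=lambda a: disagreement(candidates[a]))

# Three ensemble members predict a 2-D next state for each candidate action.
candidates = {
    "forward":   [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],   # members agree
    "open_door": [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]],   # members disagree
}
choice = most_informative(candidates)
```

The agent chooses `open_door`, the action whose outcome the ensemble cannot yet predict consistently. Collecting data there is what reduces the disagreement on later visits.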

3.2.2 Competence-Driven Exploration

Competence-driven exploration spans navigation to task-relevant states and manipulation to achieve objectives. Both push beyond pre-trained priors at the frontier of current capability.

Objective-driven navigation
Date Method Key Idea Links
2022-04 SayCan Affordance value-function reweights LLM-proposed action exploration arXiv
GitHub
2022-07 Inner Monologue Closed-loop replanning via inner-speech feedback re-explores failed plans arXiv
2022-07 LM-Nav Goal-directed exploration over LLM-annotated topological graphs arXiv
GitHub
2022-10 VLMaps Open-vocabulary visual-language maps guide language-conditioned spatial exploration arXiv
GitHub
2023-10 LFG LLM semantic-priors prune frontier exploration toward goal-relevant regions arXiv
GitHub
2024-10 Fisher-Info Planning MLLM-guided exploration balancing information gain vs. localisation risk (Fisher information) arXiv
GitHub
RL for VLA policy exploration
Date Method Key Idea Links
2023-03 Cal-QL Calibrated offline value exploration enabling safe online fine-tuning arXiv
2023-09 Q-Transformer Scales autoregressive value-based exploration to static multi-task trajectories arXiv
2024-09 DPPO Formulates denoising trajectories as auxiliary MDP for stable PPO on diffusion policies arXiv
GitHub
2024-09 FLaRe Large-scale online RL fine-tuning exploration on pretrained VLAs arXiv
GitHub
2024-10 HIL-SERL Sample-efficient on-robot RL with human-in-the-loop interventions for dexterous tasks arXiv
GitHub
2024-11 GRAPE Preference-aligned exploration generalises VLA policies to novel scenarios arXiv
2025-02 ConRFT Consistency-regularised offline-to-online exploration (HIL-SERL + consistency) for diffusion VLA arXiv
GitHub
2025-05 ReinboT RL amplifies VLA manipulation exploration via reward-guided offline alignment arXiv
GitHub
2025-05 VLA-RL Scalable PPO-based online action-space exploration for VLA policies arXiv
GitHub
2025-09 SimpleVLA-RL GRPO group-relative exploration scales VLA skill acquisition arXiv
GitHub
2025-09 Dual-Actor FT Dual-actor decoupling of exploration vs. exploitation for stable offline-to-online RL arXiv
2025-10 π_RL First online PPO/GRPO RL fine-tuning for flow-matching VLA arXiv
2025-11 SRPO Self-refined exploration bridging static data and online rollouts arXiv
GitHub
2025-11 π*₀.₆ Flow-matching VLA that learns from online experience via offline RL arXiv
2025-11 WMPO Pure world-model PPO enables safe online action-space exploration for VLA arXiv
2026-01 SOP Scalable online post-training infrastructure for fleet-scale VLA exploration arXiv
2026-02 GigaBrain-0.5M Foundation VLA learned directly from world-model-based RL at fleet scale arXiv
2026-04 π₀.₇ Steerable flow VLA trained with diverse multimodal context for out-of-the-box generalist skills arXiv
Test-time compute & cognitive search
Date Method Key Idea Links
2024-10 V-GPS Offline value guidance steers generalist AR / diffusion VLA decoding at test time arXiv
2025-05 Hume System-2 deliberative exploration via continuous flow value guidance arXiv
GitHub
2025-08 MB-Search VLA Model-based MCTS over AR / Diffusion VLA imagines trajectories before acting arXiv
2025-09 VLA-Reasoner MCTS imagination-time exploration over autoregressive action trajectories arXiv
2025-11 DeepThinkVLA Slow-thinking test-time exploration through deliberate chain-of-action reasoning arXiv
GitHub
2025-12 TACO Anti-exploration test-time steering via continuous normalising flows arXiv
GitHub
2026-01 TT-VLA Value-free on-the-fly test-time RL adapts VLA policies per-episode arXiv
2026-02 Recurrent-Depth VLA Implicit test-time compute scaling via latent iterative reasoning (no explicit tokens) arXiv
3.2.3 Reachability-Driven Exploration

Automated reward engineering & curiosity: LLM-driven reward synthesis and curiosity / curriculum mechanisms sustain broad exploration incentives.

Date Method Key Idea Links
2018-10 RND Curiosity-driven exploration bonus via random network distillation arXiv
GitHub
2020-02 Never Give Up Episodic + lifelong novelty bonuses sustain directed exploration across long horizons arXiv
2023-06 Language-to-Rewards LLM synthesises dense language-conditioned reward for skill exploration arXiv
GitHub
2023-10 Eureka LLM-synthesised executable reward code evolves the explorable task manifold arXiv
GitHub
2024-09 CurricuLLM LLM-designed curricula for progressive exploration of hard manipulation skills arXiv
GitHub
2025-05 TeViR Text-to-video diffusion rewards enable efficient sparse-task exploration arXiv
2020-10 Recovery RL Learned recovery zones bound exploration without collapsing reachable set arXiv
GitHub
2024-04 RECOVER Neuro-symbolic failure detection bounds exploratory trajectories in manipulation arXiv
2025-03 SafeVLA Constrained policy exploration under hard safety guarantees for VLA arXiv
GitHub

4. Level 4: Agent → Prospector — Imagination-Space Exploration

The Prospector internalises a world model and faces a dual exploration problem: it must gather real data to refine world-model fidelity while simultaneously searching imagined trajectories to extract policies.

Level 4 Imagination-Space Exploration — Why (the dual exploration problem), Where (simulated rollouts, hazard zones, latent value landscapes), and How (MBRL, video generation, autonomous driving, social dynamics).

4.1 Why: The Dual Exploration Problem

4.1.1 Compounding Errors and Reality Drift

World models act as recursive self-simulators: each imagined step feeds on the previous prediction, so small single-step errors compound over long imagined horizons and the rollout drifts away from reality. Agents must therefore proactively probe the boundaries of what the model knows.
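
A toy calculation shows how quickly a small model error can grow. The scalar linear system and the 1% gain error below are assumptions chosen for illustration; real world models need not drift at this rate.

```python
def rollout(x0, gain, horizon):
    """Iterate x_{t+1} = gain * x_t and return the full trajectory."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(gain * xs[-1])
    return xs

true_gain, model_gain = 1.00, 1.01   # the model is off by 1% per step
real = rollout(1.0, true_gain, 50)
imagined = rollout(1.0, model_gain, 50)

# Absolute gap between imagined and real state at every step.
errors = [abs(i - r) for i, r in zip(imagined, real)]
```

After one step the gap is about 0.01, but after 50 imagined steps it exceeds 0.6, because each step multiplies the error already accumulated. This is the motivation for bounding imagined rollout length or for collecting real data where the model is least reliable.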

Date Method Key Idea Links
2023-01 DreamerV3 Explores long imagined rollouts with entropy-regularised actor to prevent premature convergence in sparse-reward settings arXiv
GitHub
2019-06 MBPO Limits imagined rollout length to prevent compounding error accumulation; iterative policy–model alternation drives exploration of true dynamics arXiv
GitHub
2018-05 PETS Ensemble of probabilistic networks quantifies epistemic uncertainty; explores regions where ensemble predictions most disagree to anchor model to reality arXiv
GitHub
2018-07 SLBO Constructs return lower bound jointly optimised over policy and model; optimism under uncertainty encourages exploration of under-covered state–action regions arXiv
GitHub
2018-07 STEVE Stochastic ensemble value expansion explicitly propagates epistemic uncertainty across multi-step imagined rollouts… arXiv
2025-12 Long-Horizon MBRL Identifies compounding error as the core bottleneck in offline long-horizon model-based RL… arXiv
GitHub
2025-12 Surprise-Robust WM Trains world models to explicitly handle out-of-distribution "surprise" inputs; surprise-resilient training reduces catastrophic reality drift when imagination enters unexplored regions of state space arXiv
GitHub
4.1.2 The Noise-Hijacking Trap

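Curiosity-driven agents can waste their exploration budget on irreducible stochasticity (the noisy-TV problem), because raw prediction error stays high wherever outcomes are inherently random. Disentangling aleatoric from epistemic uncertainty via reachability metrics and learning-progress monitoring is essential.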
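
One standard way to separate the two kinds of uncertainty is the law of total variance applied to an ensemble of probabilistic predictors: the average predicted variance estimates aleatoric noise, and the variance of the predicted means estimates epistemic uncertainty. The sketch below assumes each member outputs a mean and a variance; the numbers are hypothetical.

```python
from statistics import mean, pvariance

def decompose(member_means, member_variances):
    """Split ensemble uncertainty into (aleatoric, epistemic) parts."""
    aleatoric = mean(member_variances)    # average predicted noise level
    epistemic = pvariance(member_means)   # disagreement between members
    return aleatoric, epistemic

# A "noisy TV": members agree on the mean but all predict large noise.
noisy_tv = decompose([0.50, 0.51, 0.49], [4.0, 4.1, 3.9])

# An unseen room: members predict little noise but disagree on the mean.
unseen_room = decompose([0.1, 0.9, -0.4], [0.05, 0.04, 0.06])
```

Rewarding only the epistemic part sends the agent to the unseen room and ignores the noisy TV, whereas a bonus based on total prediction error would do the opposite.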

Date Method Key Idea Links
2020-05 Plan2Explore Maximises future ensemble disagreement in latent space for task-agnostic exploration; disagreement targets epistemic uncertainty, not irreducible noise arXiv
GitHub
2020-06 RIDES Reward-weighted state-reachability intrinsic motivation separates reachable novel states from high-entropy irreducible noise arXiv
GitHub
2018-10 RND Random network distillation as epistemic novelty signal; highlights persistent failure to distinguish aleatoric from epistemic uncertainty in stochastic environments arXiv
GitHub
2017-05 ICM Curiosity via self-supervised inverse/forward dynamics; forward-model prediction error as intrinsic reward to explore informative state transitions arXiv
GitHub
2025-09 Beyond Noisy-TVs Systematically categorises sources of stochastic noise in exploration environments… arXiv
GitHub
2017-11 Bayesian Uncertainties Canonical treatment of aleatoric vs. epistemic uncertainty in deep networks… arXiv
2018-10 Model-Based Active Exploration Explicitly optimises for epistemic information gain rather than prediction novelty… arXiv
GitHub
4.1.3 Fatal Detail Loss in Latent Space

Aggressive compression of high-dimensional sensory streams loses safety-critical details. Physical stress-testing and structured latent representations force world models to encode functionally critical geometric realities.

Date Method Key Idea Links
2019-02 PlaNet RSSM with deterministic + stochastic latent components; stochastic branch explores multiple plausible states rather than collapsing to a single deterministic prediction arXiv
GitHub
2018-03 World Models V–M–C architecture compresses pixels to latent then explores futures via RNN-based mental simulation; highlights information loss from pure deterministic latents arXiv
GitHub
2023-06 I-JEPA Image Joint-Embedding Predictive Architecture… arXiv
GitHub

4.2 Where: Exploration Across Different Spaces

4.2.1 Simulated Future Rollouts

Agents generate imagined trajectories to discover effective behaviours before physical execution, exploiting computational parallelism — thousands of hypothetical scenarios per second.
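
Planning by imagined rollouts can be sketched with random shooting: sample candidate action sequences, score each inside the model, and execute only the first action of the best one. The one-dimensional `model`, the `reward` function, and the action set below are hypothetical stand-ins for a learned world model.

```python
import random

def imagine_return(model, reward, state, actions):
    """Roll the model forward through `actions` and sum the predicted rewards."""
    total = 0.0
    for a in actions:
        state = model(state, a)
        total += reward(state)
    return total

def random_shooting(model, reward, state, horizon, n_candidates, rng):
    """Score random action sequences in imagination; return the best first action."""
    best_seq, best_ret = None, float("-inf")
    for _ in range(n_candidates):
        seq = [rng.choice([-1, 0, 1]) for _ in range(horizon)]
        ret = imagine_return(model, reward, state, seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq[0], best_ret

# Hypothetical learned model: a 1-D position shifted by the action; goal at x = 3.
model = lambda x, a: x + a
reward = lambda x: -abs(x - 3)

action, ret = random_shooting(model, reward, 0, horizon=3,
                              n_candidates=500, rng=random.Random(0))
```

No real environment step is taken while the 500 candidates are evaluated, which is the sense in which imagination trades physical cost for computation. The plan is only as good as the model, so this loop inherits any compounding error in `model`.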

Date Method Key Idea Links
2025-09 DreamerV4 Shared world-model/policy backbone with phased training; first agent to obtain Minecraft diamonds from offline data via exhaustive imagined rollouts arXiv
2020-11 MuZero Learns latent dynamics supporting MCTS planning without pixel reconstruction; achieves superhuman performance by planning over imagined state sequences arXiv
GitHub
2019-12 DreamerV1 RSSM-based latent world model; explores via action noise during environment interaction to broaden state-space coverage for model training arXiv
GitHub
2020-10 DreamerV2 Discrete categorical latents + entropy-regularised actor; entropy bonus is explicit exploration regulariser preventing premature behavioural convergence arXiv
GitHub
2019-03 Atari 100k (SimPLe) First video-prediction world model competitive with model-free RL at 100k environment steps; imagined rollouts from pixel-based world model enable sample-efficient exploration of Atari games arXiv
GitHub
2026-01 Ctrl-World Controllable world model with structured latent decomposition… OpenReview
GitHub
4.2.2 Counterfactual Hazard Zones

Safety-critical exploration probes operational failure boundaries before physical deployment; world models evaluate "what-if" counterfactuals without incurring real-world risk.

Date Method Key Idea Links
2023-01 DayDreamer Transfers Dreamer latent-imagination to physical robots; explores counterfactual hardware interactions in latent space before committing to unsafe real actions arXiv
GitHub
2018-05 PETS Probabilistic ensemble models epistemic uncertainty for planning; agents probe high-uncertainty regions to discover failure modes before physical execution arXiv
GitHub
2024-10 ActSafe Active safe exploration via worst-case trajectory imagination; uses constrained world model rollouts to identify unsafe counterfactual outcomes before committing to any real action arXiv
GitHub
2025-04 BUMEx Boundary-uncertainty model exploration: identifies safety-critical boundary regions in the world model's state space and actively probes them with imagined counterfactual rollouts to discover latent… arXiv
GitHub
4.2.3 Latent Value Landscapes

In sparse-reward settings, agents construct internal value landscapes through intrinsic motivation, turning world-model predictive errors into exploration bonuses.

Date Method Key Idea Links
2020-05 Plan2Explore World model ensemble disagreement as intrinsic reward; constructs a latent value landscape rewarding states where model uncertainty is highest arXiv
GitHub
2020-06 RIDES Reachability-weighted novelty bonus sculpts intrinsic value landscape to emphasise informative and accessible states arXiv
2018-10 RND Prediction error of a trained predictor against a fixed random target network as novelty signal; constructs a pseudo-value landscape for count-free exploration in high-dimensional spaces arXiv
GitHub
2017-05 ICM Self-supervised curiosity: forward-model error in latent feature space as exploration bonus, ignoring unpredictable environmental noise arXiv
GitHub
2025-03 Curiosity-Driven Imagination Curiosity bonus directly inside the latent imagination loop: world model generates diverse hypothetical futures and rewards the agent for imagining states with high latent novelty, sculpting an… arXiv
GitHub
2025-10 General Exploratory Bonus Unified framework for count-free exploration bonuses in latent space… arXiv
GitHub
2026-01 SuS (Surprise-based Successor) Surprise-modulated successor representations for exploration… arXiv
GitHub
4.2.4 Action-Grounded Latent Manifolds

Latent spaces must be action-grounded "Embodied-Native" manifolds — every imagined future tightly coupled with executable motor commands.

Date Method Key Idea Links
2025-06 V-JEPA 2 Video-pretrained JEPA model enabling zero-shot robotic grasping; explores action-grounded latent manifold via abstract representation rather than pixel reconstruction arXiv
GitHub
2025-06 WorldVLA Unified autoregressive framework for text, image, and action generation; joint latent manifold treats physical actions and visual evolution as first-class citizens arXiv
GitHub
2024-04 V-JEPA First pure-video self-supervised JEPA; latent prediction of masked spatiotemporal blocks produces rich action-predictive representations without pixel reconstruction arXiv
GitHub
2019-02 PlaNet RSSM latent manifold for planning; stochastic + deterministic components allow exploration over multiple plausible physical futures simultaneously arXiv
GitHub
2025-01 AD-L-JEPA Autonomous driving latent-space JEPA: predicts action-conditioned future representations of driving scenes… arXiv
GitHub

4.3 How: Exploration Across World-Model Domains

4.3.1 Model-Based Reinforcement Learning (MBRL)
Deterministic Dynamics and Iterative Exploration
Date Method Key Idea Links
2019-06 MBPO Short imagined rollouts prevent compounding error; iterative model–policy alternation guides exploration toward regions of true dynamics not yet covered arXiv
GitHub
2018-07 SLBO Jointly maximises return lower bound over policy and model; optimism in model optimisation encourages exploration of state–action regions not yet well covered arXiv
GitHub
2019-03 SimPLe Sequential policy optimisation in latent model: pixel-based video prediction model supports model-free policy gradient exploration… arXiv
GitHub
Uncertainty-Aware Exploration
Date Method Key Idea Links
2018-05 PETS Probabilistic ensemble separates aleatoric from epistemic uncertainty: each probabilistic network models observation noise, while disagreement across bootstrap ensemble members captures epistemic uncertainty; explores where epistemic uncertainty is highest arXiv
GitHub
2018-10 ME-TRPO Model-ensemble trust-region policy optimisation: uses N independently trained models and limits policy updates to regions where all models agree, preventing exploitation of epistemic uncertainty in… arXiv
GitHub
2018-10 Model-Based Active Exploration (MAX) Treats exploration as active learning: at each step selects actions that maximally reduce epistemic uncertainty under the ensemble… arXiv
GitHub
From Pixels to Latent Planning: Representation Learning for World Models
Date Method Key Idea Links
2020-05 Plan2Explore Novelty = future ensemble disagreement in RSSM latent space; task-agnostic exploration pre-trains a world model before any reward signal is available arXiv
GitHub
2019-02 PlaNet Pioneers RSSM: deterministic hidden state + stochastic Gaussian latent; stochastic branch forces imagination to explore multiple plausible environmental outcomes arXiv
GitHub
2018-03 World Models V–M–C architecture: VAE compresses pixels, RNN explores temporal structure in latent space, controller acts within learned representation arXiv
GitHub
Imagination-Based Exploration: The Dreamer Family
Date Method Key Idea Links
2025-09 DreamerV4 Phased training (WM pre-train → policy post-train) solves dual exploration problem; policy leverages WM priors for efficient exploration of long-horizon imagined trajectories arXiv
2023-01 DreamerV3 Percentile return normalisation stabilises exploration intensity across sparse and dense reward scales; adapts entropy regularisation automatically arXiv
GitHub
2020-10 DreamerV2 Discrete categorical latents + actor entropy bonus as explicit exploration regulariser; prevents premature policy collapse in imagined rollouts arXiv
GitHub
2019-12 DreamerV1 RSSM world model with action noise for environment exploration; broader state-space coverage improves quality of imagined training data arXiv
GitHub
2023-01 DayDreamer Transfers DreamerV2 to physical robots; latent imagination enables efficient hardware exploration without prohibitive real-world sample requirements arXiv
GitHub
Predictive Architectures: JEPA
Date Method Key Idea Links
2025-06 V-JEPA 2 Video-pretrained world model enabling zero-shot robotic deployment; latent-space exploration closes the loop between imagination and physical execution arXiv
GitHub
2024-04 V-JEPA First pure-video JEPA: predicts masked spatiotemporal block representations; abstract latent prediction explores rich spatiotemporal structure without pixel-level noise arXiv
GitHub
2023-06 I-JEPA Image JEPA: learns representations by predicting abstract features of masked image regions from context… arXiv
GitHub
4.3.2 Video Generation as World Simulation

Model as Environment: Video generation models serve as physics engines, enabling RL agents to explore within generated "dreams" at zero physical cost.

Model as Environment
Date Method Key Idea Links
2024-05 Genie Learns action-conditioned state transitions from unlabelled video; creates controllable virtual sandboxes for agent exploration without real-world interaction arXiv
2024-03 UniSim Universal simulator of sensorimotor interactions; trains RL agents entirely in simulated video environments for safe long-tail exploration arXiv
2024-01 DriveDreamer Driving-domain video world model conditioned on structured HD-map and traffic annotations… arXiv
GitHub
2024-11 Genie 2 Scalable interactive environment generator: produces persistent 3D-consistent game worlds from a single image prompt… arXiv
2024-04 Video Language Planning Combines video generation with language-conditioned planning: generates goal-directed video plans as imagined futures, then executes them via a learned policy… arXiv
GitHub
Planning and Policy Learning in Video World Models
Date Method Key Idea Links
2025-01 Cosmos Policy Injects latent frames for video–action co-diffusion; RL reward guides generation to explore physically consistent action-conditioned futures arXiv
2025-01 VideoDPO DPO applied to video generation; preference data forces generative exploration toward spatiotemporally consistent physical trajectories arXiv
GitHub
2026-01 TAGRPO Token-level advantage-guided reward policy optimisation for video generation… arXiv
GitHub
2026-02 DreamZero Zero-shot world model policy: directly uses a pre-trained video world model as a policy by selecting action sequences that steer imagined futures toward high-reward outcomes… arXiv
GitHub
Language as World Model: LLM-Based Simulation
Date Method Key Idea Links
2025-01 Video-T1 Test-time scaling for video generation: generates multiple candidate trajectories, uses verifiers to select the most physically consistent — inference as tree search arXiv
GitHub
2026-01 WMReward (Inference-Time) Uses world model value estimates as verifier rewards at inference time… arXiv
GitHub
2026-02 DreamZero (Inference) Leverages a frozen video world model at inference time to score action proposals via forward imagination… arXiv
GitHub
World Action Models (WAM)
Date Method Key Idea Links
2026-03 FastWAM Decouples video generation from policy inference; skips test-time future imagination entirely to achieve 190ms latency for real-time closed-loop action exploration arXiv
GitHub
2025-06 WorldVLA Unified autoregressive framework generating text, images, and actions; explores action-grounded latent manifold as joint first-class representation arXiv
GitHub
2025-01 Cosmos Policy Latent frame injection synchronises video and action co-diffusion; closed-loop counterfactual simulation with RL-guided physically consistent exploration arXiv
2026-02 DreamZero (WAM) Zero-shot WAM that couples a frozen video world model with a learned action decoder; closed-loop action exploration via iterative world-model querying without any task-specific fine-tuning arXiv
GitHub
Autonomous Driving: Vectorised and Occupancy Exploration
Date Method Key Idea Links
2024-01 DriveDreamer Driving world model synthesising future scenes conditioned on actions; explores diverse driving futures in latent space to validate plans before physical execution arXiv
GitHub
2023-09 GAIA-1 Generative world model for autonomous driving; produces diverse imagined driving scenarios as a data engine for exploring rare safety-critical events arXiv
2022-10 MILE Model-based imitation learning in compact latent space; imagined rollouts in latent representation for sample-efficient exploration of driving behaviours arXiv
GitHub
2024-12 DrivingWorld Spatiotemporal autoregressive world model for autonomous driving… arXiv
GitHub
2025-06 GenAD Generalised autonomous driving world model: trains a single generative model across diverse driving datasets… arXiv
GitHub
2024-03 Think2Drive Converts a pre-trained world model into an online planner… arXiv
GitHub
2023-06 UniAD Unified autonomous driving framework integrating perception, prediction, and planning in a shared representation… arXiv
GitHub
Date Method Key Idea Links
2024-01 Drive-WM Action-conditioned multi-view video generation for driving; visualises consequences of hypothetical manoeuvres to explore safe counterfactual futures arXiv
GitHub
2024-03 RealGen Adversarial retrieval-augmented generation targeting safety-critical scenario boundaries; probes failure-mode hazard zones via adversarial imagination exploration arXiv
GitHub
2025-01 AD-L-JEPA Autonomous driving latent-space JEPA: predicts action-conditioned future driving representations… arXiv
GitHub
2024-06 Delphi Dense latent point cloud world model for driving; probabilistic forecasting of future scene states enables uncertainty-aware exploration of counterfactual traffic evolutions arXiv
GitHub
2025-01 UncAD Uncertainty-aware autonomous driving: estimates both epistemic and aleatoric uncertainty in world model predictions… arXiv
GitHub
Date Method Key Idea Links
2024-08 OccSora Diffusion-based 4D occupancy generation; synthesises long-horizon occupancy sequences enabling exploration of temporally consistent physical futures arXiv
GitHub
2024-05 OccWorld Vision-centric 3D occupancy world model for driving; forecasts occupancy evolution providing collision-risk cost volumes for safe spatial exploration arXiv
GitHub
2025-01 Drive-OccWorld Drives entirely in a 4D occupancy world: unified model for scene generation and ego-planning… arXiv
GitHub
2024-04 Copilot4D Discretises 3D point cloud scenes into tokens and applies discrete diffusion for 4D world modelling… arXiv
2025-01 DynamicCity Dynamic 4D city generation via HexPlane-based occupancy world model; generates temporally consistent large-scale urban occupancy sequences enabling exploration of rare urban environment configurations arXiv
GitHub
2025-04 Gaussian World Model 4D Gaussian splatting world model for autonomous driving… arXiv
GitHub
Date Method Key Idea Links
2024-10 DriveArena Creates reactive 4D worlds where the policy actively interacts with a neural simulator; enables closed-loop exploration of diverse reactive traffic scenarios arXiv
GitHub
2024-09 SimGen Cascaded diffusion for high-fidelity, controllable scenario augmentation; addresses long-tail exploration by generating rare safety-critical scenarios on demand arXiv
GitHub
2024-03 DrivingDiffusion Multi-view video diffusion for closed-loop driving simulation… arXiv
GitHub
2025-05 Raw2Drive End-to-end closed-loop driving directly from raw sensor data… arXiv
GitHub
2023-06 TrafficBots Multi-agent traffic simulation via conditional behaviour generation… arXiv
GitHub
2025-04 DrivingSphere Spherical-projection world model for full 360° closed-loop driving simulation; complete spatial coverage removes blind spots so policies train against surrounding agents in all directions arXiv
GitHub
Social Dynamics: Exploration in Strategic and Normative Environments
Date Method Key Idea Links
2024-01 VBench Comprehensive benchmark evaluating world model quality including social regularity capture; provides metrics to assess whether exploration produces socially plausible behaviours arXiv
GitHub
2020-08 Social-STGCNN Spatio-temporal graph CNN models pedestrian trajectory social forces; world model for social dynamics enabling exploration of crowd interaction counterfactuals arXiv
GitHub
2025-10 LCTGen Language-conditioned traffic generation: natural-language scene specs, structured map retrieval, and multi-agent rollouts for counterfactual social traffic exploration arXiv
GitHub

5. Level 5: Prospector → Ecosystem — Coordination-Space Exploration

Level 5 Coordination-Space Exploration — Why (single-agent limitations), Where (communication, collaboration, role, deployment), and How (orchestration, ensemble, MARL, self-evolving agents).

Key Challenges at Level 5

  • Scalable Coordination Exploration: The search space is combinatorial, hierarchical, and dynamic: which agents to activate, which communication channels to establish, and how information flows between them
  • Ecosystem-Level Credit Assignment: Disentangling behavioural contribution from structural contribution under sparse, delayed feedback
  • Diversity vs. Convergence Tension: Balancing ecological diversity against system-level coherence
  • Role–Communication Co-evolution: Jointly evolving functional specialisation and information exchange protocols

5.1 Multi-Agent Orchestration

5.1.1 Rule-based Orchestration (Reachability-Driven)

Methods that coordinate collaboration through pre-defined routing rules, role protocols, or structured workflows to guarantee deterministic reachability.

Date Method Key Idea Links
2026-02 ORCH Explores many parallel analysis trajectories and merges them via a deterministic EMA-guided router to ensure reachable consensus in discrete-choice reasoning arXiv
2025-11 MA-IR Deterministic multi-agent orchestration for high-quality incident response decision support arXiv
2025-10 MOSAIC Task-intelligent orchestration routes specialised agents to explore scientific coding workflows within a rule-governed collaboration space arXiv
2025-06 AgentOrchestra Defines the reachable coordination space via a Tool-Environment-Agent (TEA) protocol, scaffolding scalable multi-agent exploration arXiv
2025 Croto Cross-team communication rules carve out a reachable space of inter-team collaboration for multi-agent exploration -
2024-06 MACNET Predefined topological links bound the agent interaction graph that large-scale multi-agent collaboration can traverse arXiv
2023-08 MetaGPT Encodes SOPs as meta-programming so role-based workflows follow reliably reachable collaboration trajectories arXiv
2023 AgentVerse Rule-driven collaboration scaffolds multi-agent exploration of emergent group behaviours within a reachable role space OpenReview
5.1.2 Learnable Orchestration (Competence-Driven)

Methods that train orchestrators, meta-agents, or agent graphs to route tasks toward the most competent executors and expand multi-agent capability boundaries.

Date Method Key Idea Links
2026-01 MAS-Orchestra Expands multi-agent reasoning competence through holistic orchestration, with controlled benchmarks probing the explored system space arXiv
2025-05 Puppeteer-Puppet Evolves a puppeteer that explores dynamic orchestration strategies over puppet agents to extend collaborative competence arXiv
2025-04 W4S Trains a weak meta-agent to explore task decompositions and harness strong executor agents beyond its own competence arXiv
2024-04 CMAT Collaboration tuning expands small-model agent competence by exploring multi-agent interaction signals arXiv
2024-02 GPTSwarm Treats language agents as optimizable computation graphs, enabling competence-driven search over prompts and inter-agent edges arXiv
5.1.3 Reflection & Information-Theoretic Orchestration (Uncertainty-Driven)

Methods that adapt orchestration policies through reflection feedback or information-theoretic uncertainty signals across long-horizon multi-agent tasks.

Date Method Key Idea Links
2025-09 Orchestrator Uses active inference to drive multi-agent exploration under epistemic uncertainty across long-horizon tasks arXiv
2025-04 W4S Weak meta-agent explores orchestration policies over strong executors, guided by reflection on uncertain outcomes arXiv
2025-03 MAS-GPT Trains LLMs to synthesise multi-agent systems per query, exploring the system-design space conditioned on task uncertainty arXiv
2024-04 CMAT Reflective multi-agent tuning explores feedback-driven collaboration to calibrate small-agent competence under uncertainty arXiv
5.1.4 Memory & Knowledge Substrate Exploration

Methods that build self-evolving multi-agent systems on shared memory or knowledge substrates to support long-horizon specialisation and exploration.

Date Method Key Idea Links
2025-05 PiFlow Principle-aware orchestration grows a scientific knowledge substrate that guides multi-agent discovery exploration arXiv
GitHub
2025-03 MedAgentSim Self-evolving clinical multi-agent simulation explores new cases by accumulating case-level memory as a shared substrate arXiv
GitHub
2025 SEMC Self-evolving consultation grows a shared diagnostic knowledge base, expanding the reachable medical case space over time -
2025-02 MobileSteward Orchestrates app-oriented agents with self-evolving memory to explore cross-app instruction compositions arXiv
GitHub
2025-01 Mobile-Agent-E Self-evolving mobile assistant explores complex tasks by accumulating reusable tips and shortcuts as a growing experience substrate arXiv
GitHub
2023-04 Generative Agents Interactive simulacra use memory-retrieval substrates to explore and surface emergent social behaviour over long horizons arXiv
GitHub

5.2 Agentic Ensemble Papers

5.2.1 Ensemble-During-Inference Papers

Methods that explore how to combine or choose candidates from multiple LLMs during decoding, at the level of individual tokens, short spans, or reasoning steps.

Token-Level Ensemble
Date Method Key Idea Links
2025-10 SAFE Only ensembles at a few well-chosen token steps to keep decoding stable and fast arXiv
2025-10 CoRe Uses token and model agreement to downweight unreliable signals arXiv
2025-05 Transformer Copilot A Copilot learns from past token mistakes and fixes the Pilot’s logits arXiv
GitHub
2025-02 ABE Makes different-vocabulary models agree on the same surface token before choosing it arXiv
GitHub
2025-02 CITER Sends easy tokens to a small model and hard ones to a large model arXiv
GitHub
2024-10 UniTe Ensembles only the union of top-*k* tokens instead of the full vocabulary arXiv
2024-06 GaC Treats next-token generation like classification and averages token probabilities arXiv
GitHub
2024-04 DeePEn Maps different vocabularies into a shared space before merging token distributions arXiv
GitHub
2024-04 PackLLM Gives more weight to models that fit the prompt better arXiv
GitHub
2024-04 EVA Learns vocabulary mappings so different models can ensemble token by token arXiv
GitHub
2024-02 - Uses a benign small model to pull token probabilities away from harmful outputs arXiv
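A minimal sketch of the token-level idea, combining two moves described in the table above: restricting the candidate set to the union of each model's top-k tokens and averaging token probabilities across models. It is not the implementation of any listed method. It assumes all models share one vocabulary, and the function name `ensemble_next_token` and the toy distributions are hypothetical.

```python
from typing import Dict, List


def ensemble_next_token(dists: List[Dict[str, float]], k: int = 2) -> str:
    """Pick the next token by averaging probabilities over the union of top-k sets."""
    if not dists:
        raise ValueError("need at least one distribution")
    # Candidate set: union of every model's k most probable tokens.
    candidates = set()
    for dist in dists:
        candidates.update(sorted(dist, key=dist.get, reverse=True)[:k])
    # Average probability per candidate; a token a model never proposed counts as 0.
    avg = {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in candidates}
    # Iterate in sorted order so ties resolve deterministically.
    return max(sorted(avg), key=avg.get)


# Toy next-token distributions (illustrative numbers only).
model_a = {"cat": 0.5, "dog": 0.4, "fox": 0.1}
model_b = {"dog": 0.6, "fox": 0.3, "cat": 0.1}
print(ensemble_next_token([model_a, model_b]))
```

Models with different vocabularies would first need a mapping into a shared token space, which is the problem several of the listed methods address.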
Span-Level Ensemble
Date Method Key Idea Links
2025-06 RLAE Adjusts model weights on the fly as generation goes on arXiv
2025-02 Speculative Ensemble Lets one model draft a span and others verify it for faster decoding arXiv
GitHub
2024-12 SpecFuse Lets models draft short spans, then picks the best one for the next step arXiv
2024-09 SweetSpan Lets each model write a short span, then uses mutual scoring to choose one arXiv
2024-07 Cool-Fusion Waits for a shared word boundary, then selects the best whole span arXiv
Reasoning-Step Ensemble
Date Method Key Idea Links
2025-11 CBS Explores many next reasoning steps, then keeps the ones backed by collective consensus Link
2024-12 LE-MCTS Searches over next reasoning steps and keeps the path with the best process reward arXiv
5.2.2 Ensemble-After-Inference Papers

Methods that explore how to compare multiple complete responses after generation, either by selecting the single best answer or by choosing a strong subset for regeneration.

Date Method Key Idea Links
2025-12 LLM-PeerReview Lets LLM judges score candidate answers and picks the best-reviewed one arXiv
GitHub
2025-10 LLMartini Aligns answer parts so users can compare and compose a final response arXiv
2025-10 Beyond Consensus Uses minority veto to stop overly agreeable judges from accepting bad answers arXiv
GitHub
2025-10 OW/ISP Uses who agrees with whom, not just vote counts arXiv
2025-09 FLAME Aggregates line-level annotations from several LLMs to rank bug locations arXiv
GitHub
2025-09 CARGO Uses confidence-aware scoring to decide which model to trust more arXiv
2025-07 LENS Learns how much to trust each answer from internal states arXiv
2025-05 EL4NER Merges small-LLM NER outputs and self-checks the final spans arXiv
2025-03 Symbolic-MoE Selects skill-matched experts and then combines their finished reasonings arXiv
GitHub
2025-01 DFPE Keeps diverse strong models, filters weak ones, and reweights the rest arXiv
GitHub
2025-01 DMoA Balances diversity and consistency before mixing answers OpenReview
2024-12 Smoothie Picks the answer most supported by the others, without labels arXiv
GitHub
2024-10 LLM-Forest Uses weighted voting across graph-guided prompt variants arXiv
GitHub
2024-10 LLM-TOPLA Selects a diverse top-*k* set before regeneration arXiv
GitHub
2024-10 MLKF Fuses complementary reasoning from multiple LLMs into one answer Link
GitHub
2024-08 URG Learns ranking and regeneration together Link
2024-02 Agent-Forest Samples many answers and keeps the one most supported by voting arXiv
GitHub
2023-06 LLM-Blender Ranks answers first, then fuses the best few into one arXiv
GitHub
2023-05 MoRE Uses agreement among reasoning experts to choose an answer or abstain arXiv
GitHub
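The simplest selection rule in this family is majority voting over complete answers. The sketch below is an illustration under stated assumptions, not any listed paper's code: answers are compared after trimming whitespace and lowercasing, ties go to the answer that appeared first, and the name `majority_vote` is hypothetical.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer after light normalisation."""
    if not answers:
        raise ValueError("need at least one answer")
    normalised = [a.strip().lower() for a in answers]
    counts = Counter(normalised)
    best = max(counts.values())
    # Ties go to the answer that appeared first in the sample list.
    for answer in normalised:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always holds the top count")


samples = ["42", " 42", "41", "42 ", "40"]
print(majority_vote(samples))
```

Free-form answers usually need a stronger equivalence check than string normalisation, which is where judge-based and similarity-based selectors come in.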
5.2.3 Ensemble-Before-Inference Papers

Methods that route each query by predicting discrete model utility—such as whether a model is likely to be good enough, or which model is better under the query.

Date Method Key Idea Links
2025-10 DiSRouter Lets models help decide which peer should answer arXiv
2025-06 TagRouter Matches queries to model tags instead of training a heavy router arXiv
2025-06 Router-R1 Learns multi-round routing to choose and combine models better arXiv
GitHub
2025-06 RadialRouter Builds a structured query view for more robust routing arXiv
2025-05 RTR Chooses both the model and the reasoning style arXiv
GitHub
2024-12 Bench-CoE Routes queries using benchmark-based model strengths arXiv
GitHub
2024-10 GraphRouter Uses graph structure to choose the best model arXiv
GitHub
2024-09 Eagle Compares candidate models without extra router training arXiv
2024-08 SelectLLM Scores each query and picks an efficient model arXiv
2024-06 RouteLLM Learns when a cheaper model can replace a stronger one arXiv
GitHub
2024-05 LLM Routing Lessons Shows which prompt cues help choose the right model arXiv
GitHub
2024-04 Hybrid-LLM Balances answer quality and cost before choosing a model arXiv
GitHub
2024-03 ETR Routes expert tokens to the most suitable specialist model arXiv
GitHub
2024-01 Routoo Learns to send each query to the model most likely to work arXiv
2024 RouterDC Learns query embeddings that make routing easier Link
GitHub
2023-11 ZOOTER Uses reward signals to pick the right expert model arXiv
2023-08 FORC Chooses the cheapest model that is still good enough arXiv
GitHub
2023 Benchmark Routing Builds routing rules from benchmark-level model performance OpenReview
Continuous Model Utility Routing
Date Method Key Idea Links
2025-10 WebRouter Compresses web-agent prompts and routes with cost in mind arXiv
2025-10 LLMRank Uses rich query features to rank which model should answer arXiv
2025-05 Avengers Combines small models by routing queries to their strengths arXiv
GitHub
2025-05 InferenceDynamics Profiles model skills and knowledge before routing arXiv
2025-05 kNN Router Uses nearest past queries instead of a complex learned router arXiv
2025 RELM Learns recommendation and evaluation together for model selection OpenReview
2025-02 LLM Bandit Explores cheap options first and learns cost-aware routing online arXiv
2024-12 PickLLM Uses RL to pick the best model from context and budget cues arXiv
2024-08 TO-Router Predicts utility under latency and cost constraints arXiv
2024-07 MetaLLM Wraps several models and picks one using predicted utility arXiv
GitHub
2024-06 HomoRouter Routes queries among similar tools with fine-grained scoring arXiv
2024-01 Blending Blends model strengths as a cheaper alternative to one giant model arXiv
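The two routing styles in this section can be contrasted in a few lines. This is a sketch under assumptions: the `Candidate` fields, the quality estimates, the costs, and the `cost_weight` trade-off parameter are all hypothetical stand-ins for whatever a trained router would predict.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    predicted_quality: float  # hypothetical router estimate in [0, 1]
    cost: float               # hypothetical cost per query, arbitrary units


def route_cheapest_good_enough(cands: List[Candidate], threshold: float) -> str:
    """Discrete utility: cheapest model predicted to clear the quality bar."""
    good = [c for c in cands if c.predicted_quality >= threshold]
    if good:
        return min(good, key=lambda c: c.cost).name
    # Nothing clears the bar: fall back to the highest predicted quality.
    return max(cands, key=lambda c: c.predicted_quality).name


def route_max_utility(cands: List[Candidate], cost_weight: float) -> str:
    """Continuous utility: maximise predicted quality minus weighted cost."""
    return max(cands, key=lambda c: c.predicted_quality - cost_weight * c.cost).name


POOL = [
    Candidate("small", 0.62, 1.0),
    Candidate("medium", 0.80, 4.0),
    Candidate("large", 0.93, 20.0),
]
print(route_cheapest_good_enough(POOL, threshold=0.75))
print(route_max_utility(POOL, cost_weight=0.01))
```

With these toy numbers both rules choose `medium`; raising `cost_weight` pushes the continuous rule toward `small`, while setting it to zero always selects the highest-quality model.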
5.2.4 Cascade-Based Papers

Methods that arrange models in a sequence, typically from cheapest to strongest, and escalate a query to the next model only when the current output is judged unreliable.

Date Method Key Idea Links
2025-12 RoBoN Routes best-of-n samples across multiple LLMs, exploring which model adds the highest next-response gain via reward and agreement signals arXiv
GitHub
2025-09 Semantic Agreement Uses meaning-level agreement between model outputs to explore whether a query can stop at a smaller model or should defer upward arXiv
2025-04 EMAFusion Combines taxonomy routing, learned routing, and confidence-triggered escalation to explore the cheapest reliable model path for each query arXiv
2025-04 ModelSwitch Uses sample consistency to explore when repeated sampling should stay with the current model or switch to a complementary one arXiv
GitHub
2024-12 DER Treats expert selection as sequential route exploration, choosing the next LLM to add complementary knowledge with minimal compute arXiv
2024-10 Cascade Routing Unifies routing and cascading to explore the model chain only when quality estimates suggest extra capacity will pay off arXiv
GitHub
2024-04 LM Cascades Uses token-level uncertainty to explore whether a generative response is reliable enough to stop or should be deferred to a larger model arXiv
2023-10 AutoMix Uses self-verification and POMDP routing to explore whether a weaker model is sufficient or a larger model is needed arXiv
GitHub
2023-10 Neural Caching Uses active selection to explore which queries a continuously distilled student can absorb and which should still go to the teacher arXiv
GitHub
2023-10 MoT Cascade Uses weak-model answer consistency, enriched with CoT/PoT thought mixtures, to explore whether escalation is necessary arXiv
GitHub
2023-10 EcoAssistant Explores a hierarchy of assistants, refining with execution feedback and backing off to stronger models only when needed arXiv
GitHub
2023-05 FrugalGPT Explores budget-aware combinations of LLMs, adaptively choosing a query-specific cascade for cost-efficient accuracy arXiv
2023-01 Confidence Deferral Clarifies when confidence-only deferral can explore the cascade effectively and when downstream-aware signals are required Link
2022-10 Model Cascading Explores early exit across models of increasing capacity, reserving large-model compute for harder inputs arXiv
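The common pattern behind the table above can be sketched as a confidence-thresholded cascade. This is an illustrative sketch, not any listed method: it assumes each model returns an answer together with a confidence score in [0, 1], and the stub models, their names, and the threshold are hypothetical.

```python
from typing import Callable, List, Tuple

# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, models: List[Tuple[str, Model]], threshold: float) -> Tuple[str, str]:
    """Try models cheapest-first and stop at the first confident answer."""
    if not models:
        raise ValueError("cascade needs at least one model")
    for name, model in models:
        answer, confidence = model(query)
        if confidence >= threshold:
            break  # confident enough: stop escalating
    # If no model clears the threshold, the last (strongest) answer is returned.
    return name, answer


def small_model(query: str) -> Tuple[str, float]:
    # Stub: pretends to be confident only on short queries.
    return "small:" + query, 0.9 if len(query) <= 8 else 0.3


def large_model(query: str) -> Tuple[str, float]:
    # Stub: always fairly confident.
    return "large:" + query, 0.95


CHAIN = [("small", small_model), ("large", large_model)]
print(cascade("2+2", CHAIN, threshold=0.8))
print(cascade("prove the lemma", CHAIN, threshold=0.8))
```

The listed methods differ mainly in where the confidence signal comes from, for example answer consistency, self-verification, token-level uncertainty, or agreement between models.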

5.3 Multi-Agent Reinforcement Learning (MARL)

Date Method Key Idea Links
2025-12 SDAX Treats unsupervised skill discovery as high-level exploration and tunes the diversity-vs-task reward balance via bi-level optimisation to learn agile locomotion behaviours arXiv
2025-09 CERMIC Calibrates curiosity with inferred peer-intention context to filter noisy novelty and reward high information-gain transitions in sparse-reward MARL arXiv
GitHub
2025-02 TEE Maximizes cross-agent trajectory entropy in a contrastive latent space via particle-based estimation, yielding intrinsic rewards for diverse coordinated exploration arXiv
2025-02 Consensus-Diversity Tradeoff Shows implicit consensus with partial disagreement can preserve exploration diversity and improve robustness in dynamic multi-agent settings arXiv
GitHub
2023-02 EMAX Uses per-agent value ensembles for UCB-guided exploration, low-variance ensemble targets, and majority-vote action selection to reduce miscoordination arXiv
2022-08 MACE Combines collaborative voxel mapping, global goal assignment, and time-aware safe-corridor planning for collision-safe multi-robot exploration of unknown spaces arXiv
2021-12 MASAC Proposes a CTDE multi-agent soft actor-critic front-end for collaborative waypoint search, coupled with minimum-snap trajectory optimisation for executable robot motion arXiv
2021-11 EMC Uses prediction errors of induced individual Q-values as coordinated intrinsic rewards, plus episodic memory to reinforce informative experiences arXiv
GitHub
2021-07 CMAE Selects shared exploration goals from entropy-scored projected state spaces and trains agents to reach them in a coordinated way arXiv
GitHub
2019-10 MAVEN Introduces latent-variable hierarchical control to induce temporally committed exploration modes while retaining value-decomposition scalability arXiv
GitHub
2019-10 EITI/EDTI Encourages coordinated exploration by maximizing inter-agent influence, using mutual information (EITI) and value-of-interaction rewards (EDTI) arXiv
GitHub
2017-05 ICM Defines curiosity as forward-model prediction error in inverse-dynamics features, promoting exploration without relying on extrinsic rewards arXiv
GitHub
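Several entries above reward prediction error. A minimal sketch of that signal, following the ICM row's description of curiosity as forward-model prediction error in feature space: the intrinsic reward is half the squared distance between predicted and observed next-state features, scaled by a factor `eta`. The feature vectors here are plain lists supplied by the caller; the encoder and forward model that would produce them are assumed and not shown.

```python
from typing import Sequence


def curiosity_reward(
    predicted_next_features: Sequence[float],
    actual_next_features: Sequence[float],
    eta: float = 1.0,
) -> float:
    """Intrinsic reward: (eta / 2) * squared L2 error of the forward model."""
    if len(predicted_next_features) != len(actual_next_features):
        raise ValueError("feature vectors must have the same length")
    squared_error = sum(
        (p - a) ** 2 for p, a in zip(predicted_next_features, actual_next_features)
    )
    return 0.5 * eta * squared_error


# A well-predicted transition earns no bonus; a surprising one earns a large bonus.
print(curiosity_reward([1.0, 2.0], [1.0, 2.0]))
print(curiosity_reward([0.0, 0.0], [3.0, 4.0]))
```

In multi-agent settings this raw signal is noisy because other agents' behaviour is itself hard to predict, which is the problem that calibrated or coordinated variants in the table aim to reduce.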

5.4 Self-Evolving Agent Systems

Methods that enable agents to evolve their own capabilities, roles, or knowledge representations through exploration and self-modification.

Date Method Key Idea Links
2025-11 AgentEvolver Self-evolves an LLM agent through self-questioning, self-navigating, and self-attributing modules so it generates its own tasks, trajectories, and credit signals for autonomous capability growth arXiv
GitHub
2025-05 SPA-RL Decomposes a long-horizon agent's final reward into per-step progress contributions, providing dense intermediate rewards that stabilise RL on sparse-reward tasks arXiv
GitHub
2025-05 GiGPO Adds an inner step-level group baseline on top of trajectory-level GRPO, enabling fine-grained credit assignment for multi-turn LLM agent training arXiv
GitHub
2025-05 SRSI LLM acts as its own judge to score self-generated solutions and uses these self-rewards as RL signal, bootstrapping improvement without external labels arXiv
2025-03 DAPO Open-source large-scale LLM RL recipe introducing clip-higher, dynamic sampling, token-level policy-gradient loss, and overlong-reward shaping for stable scaling arXiv
GitHub
2025-02 SiriuS Builds an experience library of successful multi-agent reasoning trajectories—augmented by re-trying and rewriting failed ones—and fine-tunes the agents on it for self-improvement arXiv
GitHub
2024-11 WebRL Self-evolving online curriculum turns failed web tasks into new training instructions, paired with an outcome-supervised reward model and KL-constrained policy updates arXiv
GitHub
2024-06 TextGrad Treats LLM-generated natural-language critiques as textual "gradients" and back-propagates them through compound AI systems to optimise prompts, code, and components end-to-end arXiv
GitHub
2024-06 DigiRL Two-stage offline-then-online autonomous RL with a VLM-based evaluator and an advantage-filtered curriculum to train robust Android device-control agents in the wild arXiv
GitHub
2024-03 Quiet-STaR Generalises STaR by sampling internal rationales at every token position and reinforcing the ones that improve next-token prediction, teaching LMs to "think" silently before speaking arXiv
GitHub
2024-02 GRPO Removes PPO's value network by using group-sampled rollouts to compute the baseline, enabling memory-efficient on-policy RL for LLMs (introduced in DeepSeekMath) arXiv
GitHub
2024-01 SRLM LLM acts as its own judge via LLM-as-a-Judge prompting, generates preference pairs on its own outputs, and iteratively DPO-trains on them—removing the human reward bottleneck arXiv
2023-10 SELF Iterative self-evolution loop where the model self-critiques and self-refines its outputs in natural-language feedback, then fine-tunes on the improved data arXiv
2023-05 DPO Re-parameterises RLHF so the LM itself implicitly defines the reward, reducing alignment to a simple classification loss on preference pairs—no separate reward model or RL loop needed arXiv
GitHub
2023-04 RRHF Aligns LLMs to human preferences by sampling multiple candidate responses and applying a ranking loss based on their reward order, sidestepping the complexity of PPO arXiv
GitHub
2023-03 Self-Refine Same LLM iteratively produces feedback on its own output and refines it across rounds, improving quality at inference time without any extra training or external supervision arXiv
GitHub
2023-03 Reflexion Agent verbalises why a trial failed, stores the reflection in episodic memory, and uses it to guide subsequent trials—"verbal" reinforcement learning without weight updates arXiv
GitHub
2022-03 STaR Generates rationales for problems, rationalises wrong answers given the correct one, and iteratively fine-tunes on rationales that yield correct answers, bootstrapping reasoning ability arXiv
GitHub
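The group-sampled baseline mentioned in the GRPO row can be sketched in a few lines: each rollout's advantage is its reward standardised against the other rollouts sampled for the same prompt, so no value network is needed. This is a simplified illustration; the use of the population standard deviation and the `eps` stabiliser are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardise each reward against its own sampling group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt with binary outcome rewards (toy values).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one motivation for sampling schemes that skip such groups.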

6. Exploration Evaluation

Exploration is not a primitive observable but an inferential claim about how an agent handles uncertainty, learning, and future choice. An adequate evaluation must separate exploratory behaviour from mere activity, benefit from base competence, and open-ended search from premature convergence.

6.1 Three Evaluation Principles

C1

Information Gain

The agent actively reduces uncertainty through directed information-seeking.

  • Signals: uncertainty reduction, calibration improvement, informative state acquisition

C2

Value Improvement

The agent converts acquired information into improved competence.

  • Signals: performance gains, sample-efficiency, boundary expansion on harder tasks

C3

Epistemic Reachability

The agent preserves access to plausible future states, actions, and hypotheses.

  • Signals: coverage, behavioural diversity, anti-collapse
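Two of the signals above have simple proxies, sketched here under assumptions: uncertainty reduction is measured as the drop in Shannon entropy of a discrete belief, and coverage as the fraction of a known, finite state set that was visited. Both are illustrative stand-ins, and the function names are hypothetical.

```python
import math
from typing import Hashable, Iterable, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(prior: Sequence[float], posterior: Sequence[float]) -> float:
    """C1 proxy: entropy reduction from prior belief to posterior belief."""
    return entropy(prior) - entropy(posterior)


def state_coverage(visited: Iterable[Hashable], all_states: Iterable[Hashable]) -> float:
    """C3 proxy: fraction of the known state set visited at least once."""
    universe = set(all_states)
    if not universe:
        raise ValueError("state set must be non-empty")
    return len(set(visited) & universe) / len(universe)


# A uniform belief over four hypotheses collapses onto one of them.
print(information_gain([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]))
print(state_coverage(["a", "b", "a"], ["a", "b", "c", "d"]))
```

A single observation can increase entropy, so this gain may be negative on one step; only its expectation over observations is guaranteed to be non-negative.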

6.2 Evaluation Benchmarks

Date Benchmark Key Idea Links
Reasoning-Space Evaluation
2024-11 FrontierMath Exposes the pass@1-to-pass@k gap at capability frontiers, measuring whether reasoning search reaches answers beyond one-shot competence arXiv
2024-02 OlympiadBench Probes whether extended reasoning chains resolve genuine uncertainty at olympiad-level difficulty GitHub
2024-03 LiveCodeBench Adds executable feedback to reasoning evaluation, making self-correction gain directly measurable GitHub
Interaction-Space Evaluation
2024-03 WorkArena Reveals whether the controller reopens exploration when web-based interfaces change across knowledge-work scenarios GitHub
2024-04 OSWorld Tests whether multimodal agents reopen search and adapt action policies when open-ended computer tasks shift tool demands GitHub
Imagination-Space Evaluation
2022-06 MineDojo Evaluates whether embodied agents leverage internal simulation to reduce real interaction cost in open-ended environments GitHub
2022-06 PlanBench Isolates planning-layer failures by testing whether agents reason coherently about action and change GitHub
Coordination-Space Evaluation
2019-02 SMAC Tests whether decentralised agents reduce joint uncertainty through tactical coordination GitHub
2023-10 Sotopia Assesses whether social interaction reduces joint uncertainty rather than producing superficial verbal agreement GitHub
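The pass@1-to-pass@k gap mentioned in the reasoning-space rows is usually computed with the unbiased estimator introduced alongside the HumanEval benchmark: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k samples drawn without replacement is correct. The sketch below implements that formula; whether a given benchmark in the table reports it this way is not stated in the source.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset has a hit
    return 1.0 - comb(n - c, k) / comb(n, k)


# 20 samples for one problem, 2 of them correct (toy numbers).
gap = pass_at_k(20, 2, 10) - pass_at_k(20, 2, 1)
print(pass_at_k(20, 2, 1), pass_at_k(20, 2, 10), gap)
```

A large gap means the model can reach the answer through search but rarely on the first attempt, which is the regime where exploration has the most to contribute.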

6.3 Open Challenges

Open-Domain Scalable Exploration

Scalable exploration for reasoning beyond verifiable tasks remains an open problem.

Safe Exploration with World Action Models

Keeping exploration safe inside predictive world action models is unresolved.

Causal & Counterfactual Reasoning

Supporting causal and counterfactual reasoning in imagination space is still open.

Scalable Coordination

Achieving coordination-space exploration without combinatorial explosion is an open problem.