Epistemic Exploration Toward Artificial General Intelligence

The Five-Level Trajectory Toward AGI. Exploration serves as the transition mechanism across five levels of increasing agent sophistication: Responder → Reasoner (reasoning space), Reasoner → Agent (interaction space), Agent → Prospector (imagination space), and Prospector → Ecosystem (coordination space).

🔥 This is a curated paper list for the survey "Epistemic Exploration Toward Artificial General Intelligence", covering exploration mechanisms across reasoning, embodied AI, world models, and multi-agent systems.

🔥 Stay tuned for our full paper release, incorporating the latest developments.

[Always] We welcome all related papers! If you find any missed or new work, please open a Pull Request or contact us. We will keep this list updated frequently!

1. Overview

1.1 What is Epistemic Exploration?

Epistemic exploration is the agent's capacity to actively acquire information that reduces its uncertainty about the world, convert that reduction into durable policy improvement, and keep future acquisition possible.

Unlike undirected exploration (e.g., ε-greedy), epistemic exploration is intentional, belief-driven, and multi-scale: the agent reasons about which actions are most informative and plans multi-step information-gathering strategies across reasoning trajectories, tool-use policies, embodied sensorimotor loops, world-model rollouts, and multi-agent coordination protocols.

1.2 Three Criteria

Foundation of Epistemic Exploration — Why, What, and How.

We ground epistemic exploration in three jointly necessary criteria, each addressing a distinct failure mode of static optimisation:

Criterion	Name	Definition	Failure Mode	Explores
C1	Information Gain	Actively reduces epistemic uncertainty via belief-updating observations	Belief Stagnation: frozen internal model under distribution shift	...where it knows least
C2	Value Improvement	Converts new information into durable policy improvement	Value Stagnation: local optima lock-in, surrogate misalignment	...what it cannot yet do well
C3	Epistemic Reachability	Preserves positive visitation over belief-consistent regions	Reachability Collapse: irreversible contraction of behavioural diversity	...where it might otherwise never go

These form a closed loop: gain information → convert to value → keep the capacity to gain information alive → ...

1.3 Unified Epistemic Exploration Objective

The three criteria combine into a single constrained objective:

$$ \pi_{\mathfrak{A},t}^{*} \;=\; \underset{\underbrace{\pi_{\mathfrak{A}} \,\in\, \Pi_{\mathrm{reach}}(b_t)}_{\text{Reachability (C3)}}}{\arg\max}\; \underbrace{\mathbb{E}_{\theta \sim b_t} \Big[V^{\pi_{\mathfrak{A}}}_\theta(s_t, h_t)\Big]}_{\text{Value Improvement (C2)}} \;+\;\beta\;\cdot\; \underbrace{\mathbb{E}^{\pi_{\mathfrak{A}}}_{b_t} \left[\sum_{t'=t}^{\infty} \gamma^{\,t'-t}\,\mathcal{U}(s_{t'}, a_{t'};\, b_{t'})\right]}_{\text{Information Gain (C1)}} $$

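where $\mathcal{U}(s, a;\, b) = I(\theta;\, s', r \mid s, a, b)$ is the epistemic uncertainty that taking action $a$ in state $s$ is expected to resolve under belief $b$, measured as the mutual information between the model parameters $\theta$ and the next observation $(s', r)$; $\beta$ weights information gain against value, $\gamma$ is the discount factor, and $\Pi_{\mathrm{reach}}(b_t)$ is the reachability-feasible policy set.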

Information-Gain Term (C1)

Expected cumulative epistemic uncertainty the agent anticipates resolving along its trajectory.

Value Improvement (C2)

Expected cumulative reward under current beliefs; what a pure exploiter would maximise.

Reachability (C3)

Visitation must remain positive over every region that is plausibly relevant under current beliefs, preventing short-term gains from foreclosing future learning.
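
To make the information-gain term concrete, the following is a minimal sketch of $\mathcal{U}(s, a;\, b) = I(\theta;\, s', r \mid s, a, b)$ for a single fixed $(s, a)$, assuming a discrete belief over a finite set of hypotheses and a finite set of outcomes. The function names (`information_gain`, `score`) and the one-step scoring rule are illustrative assumptions, not part of the survey; the sketch drops both the discounted sum and the reachability constraint.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def information_gain(belief, likelihoods):
    """Mutual information I(theta; o) for one fixed (s, a).

    belief[i]         : b(theta_i), current probability of hypothesis i
    likelihoods[i][k] : p(o_k | s, a, theta_i), outcome model of hypothesis i
    An outcome o stands for the pair (s', r).
    """
    n_outcomes = len(likelihoods[0])
    # Marginal predictive distribution p(o | s, a, b).
    marginal = [
        sum(b * lik[k] for b, lik in zip(belief, likelihoods))
        for k in range(n_outcomes)
    ]
    # I = H[marginal] - E_theta[ H[p(o | theta)] ]
    expected_conditional = sum(b * entropy(lik) for b, lik in zip(belief, likelihoods))
    return entropy(marginal) - expected_conditional


def score(expected_value, gain, beta):
    """Hypothetical one-step analogue of the objective: value + beta * gain."""
    return expected_value + beta * gain
```

For two equally likely hypotheses, an action whose outcome distribution is identical under both yields zero gain, while an action whose outcomes separate them perfectly yields ln 2 nats. With a large enough $\beta$, the second action can outscore a slightly more rewarding but uninformative one.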

1.4 Five-Level Trajectory Toward AGI

We propose exploration as the transition mechanism between five levels of increasing agent sophistication. Each level introduces a qualitatively new exploration space that the previous level cannot access:

Transition	Exploration Space	What Becomes Explorable
L1 → L2: Responder → Reasoner	Reasoning space	Hypotheses, alternative reasoning trajectories, latent thought representations; self-verification and revision
L2 → L3: Reasoner → Agent	Interaction space	Embodied perception, tool invocation, memory management, closed-loop action under partial observability
L3 → L4: Agent → Prospector	Imagination space	Counterfactual futures in learned world models; the dual exploration problem across real and imagined environments
L4 → L5: Prospector → Ecosystem	Coordination space	Communication topologies, co-evolving role specialisations, shared representations, collaborative strategies

1.5 3×5 Taxonomy

Our survey is organized as a 3×5 taxonomy crossing three signal-driven methodologies with the five levels:

	L1 Responder	L2 Reasoner	L3 Agent	L4 Prospector	L5 Ecosystem
Uncertainty-Driven	—(single forward pass; no internal search)	Token / step entropy, entropy-guided branching	Active SLAM, prediction variance, pose uncertainty	Ensemble disagreement in latent world models	Inter-agent disagreement, joint-belief uncertainty
Competence-Driven	—(no learning loop at inference)	Difficulty-adaptive curricula, self-verification, self-play	Skill bootstrapping, goal-conditioned self-play	Imagination-based skill discovery, learning-progress curricula	Emergent multi-agent self-play, co-evolving curricula
Reachability-Driven	—(fixed output manifold)	Beam diversity, anti-repetition, KL-to-reference trust regions	Go-Explore, coverage-maximising curricula	Latent-space diversity bonuses, action-entropy regularisation	Role-diversity bonuses, anti-convergence on coordination topologies

2. Levels 1–2: Responder → Reasoner — Reasoning-Space Exploration

The transition from Responder to Reasoner requires exploration in reasoning space: branching over token sequences, reasoning trajectories, and latent thought representations. The agent must search for informative hypotheses rather than simply produce reactive outputs.

Levels 1–2 Reasoning-Space Exploration — Why (entropy escalation & reward stagnation), Where (tokens → turns → latent trajectories), and How (uncertainty / competence / reachability-driven).

2.1 Uncertainty-Driven Exploration

Methods that prioritise exploration at high-uncertainty branching points in the reasoning process.

Date	Method	Key Idea	Links
2025-08	CURE	Expands the training-state distribution at critical decision points to sustain exploration	arXiv GitHub
2026-03	SPINE	Preserves exploration by selectively updating high-entropy branch tokens	arXiv
2025-06	TreeRL	Explores reasoning via on-policy tree search from uncertain intermediate states	arXiv GitHub
2025-09	CE-GPPO	Collapses epistemic uncertainty into adaptive exploration bonuses for reasoning RL	arXiv
2025-10	STEER	Preserves exploration by stabilizing token-level entropy change through adaptive reweighting	arXiv
2025-10	AEPO	Adaptive entropy-guided policy optimization for exploration in reasoning models	arXiv GitHub
2025-11	ICPO	Promotes exploration by combining verifiable rewards with confidence-based preference advantages	arXiv
2026-02	REAL	Stabilizes exploration via balanced gradient allocation	arXiv
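
As a rough illustration of what "high-uncertainty branching points" can mean in practice, the sketch below ranks the steps of a reasoning trace by the entropy of their next-token distributions and returns the top-k positions as candidate branch points. It is not the algorithm of any method listed above; `branch_points` and its inputs are hypothetical names, and the per-step distributions are assumed to be given.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_points(step_distributions, k):
    """Return the indices of the k highest-entropy steps, in position order.

    step_distributions[t] is the next-token distribution at step t.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    ranked = sorted(range(len(entropies)), key=lambda t: entropies[t], reverse=True)
    return sorted(ranked[:k])
```

A search or RL procedure could then spend its extra samples at the returned positions rather than uniformly along the trace.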

2.2 Competence-Driven Exploration

Methods that match problem difficulty to the model's evolving competence frontier.

Date	Method	Key Idea	Links
2025-06	E2H	Guides exploration through an easy-to-hard curriculum	arXiv GitHub
2025-10	RLAAR	Steers exploration through curriculum learning and rewarded abstention	arXiv
2025-05	CDAS	Explores the competence frontier by sampling problems matched to the model's current ability	arXiv GitHub
2026-01	HA-DW	Reduces exploration imbalance by debiasing group-relative advantages across prompt difficulty	arXiv
2025-08	SvS	Sustains exploration by self-synthesizing diverse but answer-equivalent problems during RLVR	arXiv GitHub
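
One simple way to sample near a competence frontier, sketched here under stated assumptions, is to weight each problem by p(1 - p), where p is the model's current empirical pass rate. This is the variance of a binary success outcome, so it peaks at p = 0.5 and vanishes for problems that are always or never solved. The weighting rule, the `floor` parameter, and the function names are illustrative choices, not the procedure of any listed method.

```python
import random


def frontier_weights(pass_rates, floor=1e-3):
    """Weight each problem by p * (1 - p), with a small floor so that
    solved and unsolved problems are still revisited occasionally."""
    return [max(p * (1.0 - p), floor) for p in pass_rates]


def sample_batch(problems, pass_rates, batch_size, rng):
    """Draw a training batch concentrated on mid-difficulty problems."""
    weights = frontier_weights(pass_rates)
    return rng.choices(problems, weights=weights, k=batch_size)
```

Pass rates would need to be re-estimated as training proceeds, so that the sampled frontier moves with the model.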

2.3 Reachability-Driven Exploration

Methods that prevent irreversible contraction of reasoning trajectory distributions.

Date	Method	Key Idea	Links
2025-10	TROLL	Stabilizes exploration with principled trust-region updates instead of PPO-style clipping	arXiv
2024-04	ROPO	Preserves useful exploration by downweighting noisy preference signals instead of overfitting to them	arXiv
2025-05	KTAE	Explores better by assigning credit to key reasoning tokens rather than whole rollouts	arXiv GitHub
2025-08	VRPRM	Guides exploration with visual step-level rewards that encourage deeper reasoning paths	arXiv
2026-01	RLVRR	Turns sparse end rewards into a verifiable reward chain that supports broader open-ended exploration	arXiv GitHub
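
The KL-to-reference trust regions mentioned in the taxonomy can be illustrated with a small sketch: a policy that collapses onto a single output pays a KL penalty relative to a broader reference distribution. The discrete distributions, the coefficient `kl_coef`, and the function names are assumptions made for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalised_objective(expected_reward, policy, reference, kl_coef):
    """Expected reward minus a KL penalty that discourages drifting
    (and collapsing) away from the reference distribution."""
    return expected_reward - kl_coef * kl_divergence(policy, reference)
```

With a uniform two-way reference, a fully collapsed policy incurs a penalty of `kl_coef * ln 2`, while a policy equal to the reference incurs none.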

3. Level 3: Reasoner → Agent — Perception- & Action-Space Exploration

At Level 3, the agent crosses from internal reasoning into situated interaction with external environments. Exploration unfolds in perception and action space, where every step incurs real cost. The transition splits into Digital Agents (software-mediated) and Embodied Agents (physical interaction).

3.1 Digital Agents

Agents operating in software-mediated environments (web, APIs, code interpreters):

3.1.1 Uncertainty-Driven Exploration

Methods that acquire information under partial observability by prioritising uncertain states, tool calls, or capability boundaries.

Date	Method	Key Idea	Links
2026-01	JitRL	Uses count-based uncertainty bonuses to explore unseen state-action pairs	arXiv GitHub
2023-05	RAP	Explores alternative reasoning paths with MCTS and UCB guidance	Link GitHub
2024-08	Agent Q	Expands high-value action trajectories via MCTS-guided exploration	arXiv
2023-10	LATS	Explores reasoning-action branches through language-agent tree search	arXiv GitHub
2025-04	KnowSelf	Explores capability boundaries by detecting uncertain self-knowledge	arXiv GitHub
2025-01	Search-o1	Explores external evidence when reasoning exposes knowledge uncertainty	Link GitHub
2025-04	TTRL	Test-time RL via majority-voted pseudo-rewards turns inference disagreement into exploration	arXiv
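
Several entries above rely on UCB-style or count-based bonuses. The sketch below shows a generic UCT-style selection rule (mean value plus a bonus that shrinks with the visit count), assuming per-action statistics are stored as `(total_value, visits)` pairs. The `stats` layout and the default constant `c` are illustrative assumptions rather than settings taken from any listed method.

```python
import math


def ucb_score(mean_value, visits, parent_visits, c=1.4):
    """Mean value plus an exploration bonus; unvisited actions score +inf."""
    if visits == 0:
        return math.inf
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_action(stats, c=1.4):
    """stats maps action -> (total_value, visits); return the best UCB action."""
    parent_visits = max(sum(visits for _, visits in stats.values()), 1)

    def score(action):
        total, visits = stats[action]
        mean = total / visits if visits else 0.0
        return ucb_score(mean, visits, parent_visits, c)

    return max(stats, key=score)
```

Setting `c=0` recovers purely greedy selection, which makes the role of the bonus easy to see: rarely tried tool calls or web actions are preferred until their estimates become reliable.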

3.1.2 Competence-Driven Exploration

Methods that tame combinatorial tool-use spaces through curricula, process-level credit assignment, and self-generated training tasks.

Date	Method	Key Idea	Links
2025-08	PilotRL	Stages curricula to expand agent exploration from planning to tool use	arXiv
2025-09	ReSum-GRPO	Sustains long-horizon search exploration through context summarization	arXiv
2024-03	ETO	Optimizes exploratory trial-and-error trajectories for agent learning	arXiv GitHub
2024-11	WebRL	Self-evolving online curriculum from failure trajectories for web agents	arXiv
2025-09	Planner-R1	Uses dense process rewards to steer exploration toward feasible plans	arXiv
2025-08	RLTR	Rewards complete tool-use processes to improve exploratory planning	arXiv
2025-04	ReTool	RL rewards strategic tool-invocation patterns, penalises redundant calls	arXiv
2025-05	GiGPO	Assigns state-level credit across grouped rollouts for exploration	arXiv GitHub
2025-11	Agent0-VL	Evolves tool-integrated exploration through repeated reasoning cycles	arXiv GitHub
2025-05	Absolute Zero	Uses proposer-solver self-play to explore new reasoning tasks	arXiv GitHub

3.1.3 Reachability-Driven Exploration

Methods that preserve behavioural flexibility by regulating entropy or injecting useful off-policy experience.

Date	Method	Key Idea	Links
2025-08	EGPO	Adds entropy bonuses to encourage exploration in function-call reasoning	arXiv GitHub
2025-09	EPO	Regularizes entropy to sustain exploration in multi-turn agent RL	arXiv GitHub
2025-09	ENTROPO	Uses entropy-enhanced preferences to diversify coding-agent exploration	arXiv
2026-03	RAPO	Expands policy exploration with retrieval-augmented experience	arXiv
2026-04	E³-TIR	Branches from high-entropy prefixes to exploit exploratory experience	arXiv GitHub

3.2 Embodied Agents

Level 3 Embodied Agent Exploration — Uncertainty-driven active perception, competence-driven navigation & RL & test-time compute, and reachability-driven reward engineering & constrained safety.

Embodied agents operate in continuous, high-dimensional action spaces where every physical interaction costs time and energy and causes mechanical wear. The three exploration paradigms adapt as follows:

3.2.1 Uncertainty-Driven Exploration

Geometric & high-fidelity reconstruction: Viewpoint selection for active mapping, information-theoretic coverage, and ensemble-disagreement-based exploration of dynamics.

Date	Method	Key Idea	Links
2018-10	MAX	Ensemble-disagreement drives active exploration of dynamics	arXiv GitHub
2020-04	Active Neural SLAM	Coverage-maximising hierarchical policies explore unknown occupancy maps	arXiv GitHub
2021-03	APT	Non-parametric entropy maximisation for unsupervised active pre-training	arXiv GitHub
2023-12	Model-Free Active Exploration	Information-theoretic lower-bound approximation for ensemble-based exploration	Link
2024-10	ActiveSplat	Gaussian-splat viewpoint exploration maximises reconstruction fidelity under a time budget	arXiv GitHub
2023-11	Conan	Active interactive exploration as Bayesian query to disambiguate latent scene state	arXiv GitHub
2024-04	ActiveRIR	Cross-modal audio-visual exploration for acoustic scene mapping (room impulse responses)	arXiv
2025-10	Active Semantic Perception	Entropy-driven exploration over LLM-sampled scene graph hypotheses	arXiv GitHub

Semantic & multi-modal active inference: Probing the environment to disambiguate alternative scene-graph completions or to gather cross-modal (audio/language) evidence.
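
Ensemble-disagreement-based exploration, as mentioned above, can be sketched as follows: several dynamics models predict the next state for each candidate input, and the agent prefers the input on which the predictions vary most. The ensemble here is a list of plain Python callables and the variance-based score is one common choice among several; both are assumptions made for illustration.

```python
from statistics import pvariance


def disagreement(predictions):
    """predictions[m][d] is member m's prediction for output dimension d.
    Returns the across-member variance, averaged over dimensions."""
    dims = len(predictions[0])
    per_dim = [pvariance([pred[d] for pred in predictions]) for d in range(dims)]
    return sum(per_dim) / dims


def most_informative(candidates, ensemble):
    """Pick the candidate input whose ensemble predictions disagree most."""
    return max(candidates, key=lambda x: disagreement([f(x) for f in ensemble]))
```

For example, with three linear models whose slopes differ slightly, disagreement grows with the magnitude of the input, so the candidate farthest from the origin is selected.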

3.2.2 Competence-Driven Exploration

Competence-driven exploration spans navigation to task-relevant states and manipulation to achieve objectives. Both push beyond pre-trained priors at the frontier of current capability.

Objective-driven navigation

Date	Method	Key Idea	Links
2022-04	SayCan	Affordance value-function reweights LLM-proposed action exploration	arXiv GitHub
2022-07	Inner Monologue	Closed-loop replanning via inner-speech feedback re-explores failed plans	arXiv
2022-07	LM-Nav	Goal-directed exploration over LLM-annotated topological graphs	arXiv GitHub
2022-10	VLMaps	Open-vocabulary visual-language maps guide language-conditioned spatial exploration	arXiv GitHub
2023-10	LFG	LLM semantic-priors prune frontier exploration toward goal-relevant regions	arXiv GitHub
2024-10	Fisher-Info Planning	MLLM-guided exploration balancing information gain vs. localisation risk (Fisher information)	arXiv GitHub

RL for VLA policy exploration

Date	Method	Key Idea	Links
2023-03	Cal-QL	Calibrated offline value exploration enabling safe online fine-tuning	arXiv
2023-09	Q-Transformer	Scales autoregressive value-based exploration to static multi-task trajectories	arXiv
2024-09	DPPO	Formulates denoising trajectories as auxiliary MDP for stable PPO on diffusion policies	arXiv GitHub
2024-09	FLaRe	Large-scale online RL fine-tuning exploration on pretrained VLAs	arXiv GitHub
2024-10	HIL-SERL	Sample-efficient on-robot RL with human-in-the-loop interventions for dexterous tasks	arXiv GitHub
2024-11	GRAPE	Preference-aligned exploration generalises VLA policies to novel scenarios	arXiv
2025-02	ConRFT	Consistency-regularised offline-to-online exploration (HIL-SERL + consistency) for diffusion VLA	arXiv GitHub
2025-05	ReinboT	RL amplifies VLA manipulation exploration via reward-guided offline alignment	arXiv GitHub
2025-05	VLA-RL	Scalable PPO-based online action-space exploration for VLA policies	arXiv GitHub
2025-09	SimpleVLA-RL	GRPO group-relative exploration scales VLA skill acquisition	arXiv GitHub
2025-09	Dual-Actor FT	Dual-actor decoupling of exploration vs. exploitation for stable offline-to-online RL	arXiv
2025-10	π_RL	First online PPO/GRPO RL fine-tuning for flow-matching VLA	arXiv
2025-11	SRPO	Self-refined exploration bridging static data and online rollouts	arXiv GitHub
2025-11	π*₀.₆	Flow-matching VLA that learns from online experience via offline RL	arXiv
2025-11	WMPO	Pure world-model PPO enables safe online action-space exploration for VLA	arXiv
2026-01	SOP	Scalable online post-training infrastructure for fleet-scale VLA exploration	arXiv
2026-02	GigaBrain-0.5M	Foundation VLA learned directly from world-model-based RL at fleet scale	arXiv
2026-04	π₀.₇	Steerable flow VLA trained with diverse multimodal context for out-of-the-box generalist skills	arXiv

Test-time compute & cognitive search

Date	Method	Key Idea	Links
2024-10	V-GPS	Offline value guidance steers generalist AR / diffusion VLA decoding at test time	arXiv
2025-05	Hume	System-2 deliberative exploration via continuous flow value guidance	arXiv GitHub
2025-08	MB-Search VLA	Model-based MCTS over AR / Diffusion VLA imagines trajectories before acting	arXiv
2025-09	VLA-Reasoner	MCTS imagination-time exploration over autoregressive action trajectories	arXiv
2025-11	DeepThinkVLA	Slow-thinking test-time exploration through deliberate chain-of-action reasoning	arXiv GitHub
2025-12	TACO	Anti-exploration test-time steering via continuous normalising flows	arXiv GitHub
2026-01	TT-VLA	Value-free on-the-fly test-time RL adapts VLA policies per-episode	arXiv
2026-02	Recurrent-Depth VLA	Implicit test-time compute scaling via latent iterative reasoning (no explicit tokens)	arXiv

3.2.3 Reachability-Driven Exploration

Automated reward engineering & curiosity: LLM-driven reward synthesis and curiosity / curriculum mechanisms sustain broad exploration incentives.

Date	Method	Key Idea	Links
2018-10	RND	Curiosity-driven exploration bonus via random network distillation	arXiv GitHub
2020-02	Never Give Up	Episodic + lifelong novelty bonuses sustain directed exploration across long horizons	arXiv
2023-06	Language-to-Rewards	LLM synthesises dense language-conditioned reward for skill exploration	arXiv GitHub
2023-10	Eureka	LLM-synthesised executable reward code evolves the explorable task manifold	arXiv GitHub
2024-09	CurricuLLM	LLM-designed curricula for progressive exploration of hard manipulation skills	arXiv GitHub
2025-05	TeViR	Text-to-video diffusion rewards enable efficient sparse-task exploration	arXiv
2020-10	Recovery RL	Learned recovery zones bound exploration without collapsing reachable set	arXiv GitHub
2024-04	RECOVER	Neuro-symbolic failure detection bounds exploratory trajectories in manipulation	arXiv
2025-03	SafeVLA	Constrained policy exploration under hard safety guarantees for VLA	arXiv GitHub

4. Level 4: Agent → Prospector — Imagination-Space Exploration

The Prospector internalises a world model and faces a dual exploration problem: it must simultaneously gather real data to refine world-model fidelity and search imagined trajectories to extract policies.

Level 4 Imagination-Space Exploration — Why (the dual exploration problem), Where (simulated rollouts, hazard zones, latent value landscapes), and How (MBRL, video generation, autonomous driving, social dynamics).

4.1 Why: The Dual Exploration Problem

4.1.1 Compounding Errors and Reality Drift

World models act as recursive self-simulators: each imagined step feeds the model its own previous prediction, so small single-step errors can compound rapidly over long imagined horizons. Agents must therefore proactively probe the boundaries of what the model knows.

Date	Method	Key Idea	Links
2023-01	DreamerV3	Explores long imagined rollouts with entropy-regularised actor to prevent premature convergence in sparse-reward settings	arXiv GitHub
2019-06	MBPO	Limits imagined rollout length to prevent compounding error accumulation; iterative policy–model alternation drives exploration of true dynamics	arXiv GitHub
2018-05	PETS	Ensemble of probabilistic networks quantifies epistemic uncertainty; explores regions where ensemble predictions most disagree to anchor model to reality	arXiv GitHub
2018-07	SLBO	Constructs return lower bound jointly optimised over policy and model; optimism under uncertainty encourages exploration of under-covered state–action regions	arXiv GitHub
2018-07	STEVE	Stochastic ensemble value expansion explicitly propagates epistemic uncertainty across multi-step imagined rollouts…	arXiv
2025-12	Long-Horizon MBRL	Identifies compounding error as the core bottleneck in offline long-horizon model-based RL…	arXiv GitHub
2025-12	Surprise-Robust WM	Trains world models to explicitly handle out-of-distribution "surprise" inputs; surprise-resilient training reduces catastrophic reality drift when imagination enters unexplored regions of state space	arXiv GitHub
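
The compounding-error effect can be seen in a toy example. The sketch below rolls out a "true" scalar system and a slightly wrong learned model from the same start state and records the gap at each step. The dynamics (slopes 1.05 and 1.06) are hypothetical numbers chosen for illustration; how fast the gap grows in practice depends on the real dynamics and the model class.

```python
def rollout(step, x0, horizon):
    """Apply a one-step function repeatedly, feeding it its own output."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(step(xs[-1]))
    return xs


def true_step(x):
    return 1.05 * x  # hypothetical ground-truth dynamics


def model_step(x):
    return 1.06 * x  # hypothetical learned model with a small slope error


def rollout_error(x0, horizon):
    """Absolute gap between real and imagined trajectories at each step."""
    real = rollout(true_step, x0, horizon)
    imagined = rollout(model_step, x0, horizon)
    return [abs(a - b) for a, b in zip(real, imagined)]
```

Starting from `x0 = 1.0`, the one-step gap is about 0.01, yet it keeps widening at every further step because each prediction is built on the previous erroneous one. This is the motivation for keeping imagined rollouts short or anchoring them to real data.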

4.1.2 The Noise-Hijacking Trap

Curiosity-driven agents can waste their exploration budget on irreducible stochasticity (the noisy-TV problem). Disentangling aleatoric from epistemic uncertainty, for example via reachability metrics and learning-progress monitoring, is essential.

Date	Method	Key Idea	Links
2020-05	Plan2Explore	Maximises future ensemble disagreement in latent space for task-agnostic exploration; disagreement targets epistemic uncertainty, not irreducible noise	arXiv GitHub
2020-06	RIDES	Reward-weighted state-reachability intrinsic motivation separates reachable novel states from high-entropy irreducible noise	arXiv GitHub
2018-10	RND	Random network distillation as epistemic novelty signal; highlights persistent failure to distinguish aleatoric from epistemic uncertainty in stochastic environments	arXiv GitHub
2017-05	ICM	Curiosity via self-supervised inverse/forward dynamics; forward-model prediction error as intrinsic reward to explore informative state transitions	arXiv GitHub
2025-09	Beyond Noisy-TVs	Systematically categorises sources of stochastic noise in exploration environments…	arXiv GitHub
2017-11	Bayesian Uncertainties	Canonical treatment of aleatoric vs. epistemic uncertainty in deep networks…	arXiv
2018-10	Model-Based Active Exploration	Explicitly optimises for epistemic information gain rather than prediction novelty…	arXiv GitHub
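
A common way to separate the two kinds of uncertainty, sketched here for a scalar prediction, uses an ensemble of probabilistic models and the law of total variance: the average predicted variance estimates aleatoric noise, and the variance of the predicted means estimates epistemic uncertainty. The function name and the scalar Gaussian setting are assumptions made for illustration.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """means[m], variances[m]: Gaussian mean and variance predicted by
    ensemble member m for a scalar quantity.

    Returns (aleatoric, epistemic):
      aleatoric = average predicted noise variance
      epistemic = variance of the predicted means across members
    """
    aleatoric = fmean(variances)
    epistemic = pvariance(means)
    return aleatoric, epistemic
```

On a "noisy TV", all members agree that the outcome is noisy (high aleatoric, near-zero epistemic), so a bonus based only on the epistemic part stays small. In an unvisited region the members' means diverge, and the epistemic part is large.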

4.1.3 Fatal Detail Loss in Latent Space

Aggressive compression of high-dimensional sensory streams loses safety-critical details. Physical stress-testing and structured latent representations force world models to encode functionally critical geometric realities.

Date	Method	Key Idea	Links
2019-02	PlaNet	RSSM with deterministic + stochastic latent components; stochastic branch explores multiple plausible states rather than collapsing to a single deterministic prediction	arXiv GitHub
2018-03	World Models	V–M–C architecture compresses pixels to latent then explores futures via RNN-based mental simulation; highlights information loss from pure deterministic latents	arXiv GitHub
2023-01	I-JEPA	Image Joint-Embedding Predictive Architecture…	arXiv GitHub

4.2 Where: Exploration Across Different Spaces

4.2.1 Simulated Future Rollouts

Agents generate imagined trajectories to discover effective behaviours before physical execution, exploiting computational parallelism — thousands of hypothetical scenarios per second.

Date	Method	Key Idea	Links
2025-09	DreamerV4	Shared world-model/policy backbone with phased training; first agent to obtain Minecraft diamonds from offline data via exhaustive imagined rollouts	arXiv
2020-11	MuZero	Learns latent dynamics supporting MCTS planning without pixel reconstruction; achieves superhuman performance by planning over imagined state sequences	arXiv GitHub
2019-12	DreamerV1	RSSM-based latent world model; explores via action noise during environment interaction to broaden state-space coverage for model training	arXiv GitHub
2020-10	DreamerV2	Discrete categorical latents + entropy-regularised actor; entropy bonus is explicit exploration regulariser preventing premature behavioural convergence	arXiv GitHub
2019-03	Atari 100k (SimPLe)	First video-prediction world model competitive with model-free RL at 100k environment steps; imagined rollouts from pixel-based world model enable sample-efficient exploration of Atari games	arXiv GitHub
2026-01	Ctrl-World	Controllable world model with structured latent decomposition…	OpenReview GitHub
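
Planning over imagined trajectories can be sketched with a random-shooting planner: sample candidate action sequences, roll each one out inside the model only, and execute the first action of the best sequence. The discrete action set `(-1, 0, 1)`, the callables `model` and `reward`, and all parameter names are hypothetical; the listed methods use far richer models and search procedures.

```python
import random


def plan_by_imagination(model, reward, state, horizon, n_candidates, rng):
    """Random-shooting planner over a (learned) one-step model.

    model(state, action) -> next state, reward(state) -> float.
    Returns the first action of the highest-return imagined sequence.
    """
    best_return, best_first = float("-inf"), None
    for _ in range(n_candidates):
        actions = [rng.choice((-1, 0, 1)) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = model(s, a)  # imagined step; no real interaction
            total += reward(s)
        if total > best_return:
            best_return, best_first = total, actions[0]
    return best_first
```

Because every candidate is evaluated in the model, the cost of a bad imagined sequence is computation only, and the candidate loop parallelises trivially.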

4.2.2 Counterfactual Hazard Zones

Safety-critical exploration probes operational failure boundaries before physical deployment; world models evaluate "what-if" counterfactuals without incurring real-world risk.

Date	Method	Key Idea	Links
2023-01	DayDreamer	Transfers Dreamer latent-imagination to physical robots; explores counterfactual hardware interactions in latent space before committing to unsafe real actions	arXiv GitHub
2018-05	PETS	Probabilistic ensemble models epistemic uncertainty for planning; agents probe high-uncertainty regions to discover failure modes before physical execution	arXiv GitHub
2024-10	ActSafe	Active safe exploration via worst-case trajectory imagination; uses constrained world model rollouts to identify unsafe counterfactual outcomes before committing to any real action	arXiv GitHub
2025-04	BUMEx	Boundary-uncertainty model exploration: identifies safety-critical boundary regions in the world model's state space and actively probes them with imagined counterfactual rollouts to discover latent…	arXiv GitHub

4.2.3 Latent Value Landscapes

In sparse-reward settings, agents construct internal value landscapes through intrinsic motivation, turning world-model predictive errors into exploration bonuses.

Date	Method	Key Idea	Links
2020-05	Plan2Explore	World model ensemble disagreement as intrinsic reward; constructs a latent value landscape rewarding states where model uncertainty is highest	arXiv GitHub
2020-06	RIDES	Reachability-weighted novelty bonus sculpts intrinsic value landscape to emphasise informative and accessible states	arXiv
2018-10	RND	Prediction error against a fixed, randomly initialised target network as novelty signal; constructs a pseudo-value landscape for count-free exploration in high-dimensional spaces	arXiv GitHub
2017-05	ICM	Self-supervised curiosity: forward-model error in latent feature space as exploration bonus, ignoring unpredictable environmental noise	arXiv GitHub
2025-03	Curiosity-Driven Imagination	Curiosity bonus directly inside the latent imagination loop: world model generates diverse hypothetical futures and rewards the agent for imagining states with high latent novelty, sculpting an…	arXiv GitHub
2025-10	General Exploratory Bonus	Unified framework for count-free exploration bonuses in latent space…	arXiv GitHub
2026-01	SuS (Surprise-based Successor)	Surprise-modulated successor representations for exploration…	arXiv GitHub

4.2.4 Action-Grounded Latent Manifolds

Latent spaces must be action-grounded "Embodied-Native" manifolds — every imagined future tightly coupled with executable motor commands.

Date	Method	Key Idea	Links
2025-06	V-JEPA 2	Video-pretrained JEPA model enabling zero-shot robotic grasping; explores action-grounded latent manifold via abstract representation rather than pixel reconstruction	arXiv GitHub
2025-06	WorldVLA	Unified autoregressive framework for text, image, and action generation; joint latent manifold treats physical actions and visual evolution as first-class citizens	arXiv GitHub
2024-04	V-JEPA	First pure-video self-supervised JEPA; latent prediction of masked spatiotemporal blocks produces rich action-predictive representations without pixel reconstruction	arXiv GitHub
2019-02	PlaNet	RSSM latent manifold for planning; stochastic + deterministic components allow exploration over multiple plausible physical futures simultaneously	arXiv GitHub
2025-01	AD-L-JEPA	Autonomous driving latent-space JEPA: predicts action-conditioned future representations of driving scenes…	arXiv GitHub

4.3 How: Exploration Across World-Model Domains

4.3.1 Model-Based Reinforcement Learning (MBRL)

Deterministic Dynamics and Iterative Exploration

Date	Method	Key Idea	Links
2019-06	MBPO	Short imagined rollouts prevent compounding error; iterative model–policy alternation guides exploration toward regions of true dynamics not yet covered	arXiv GitHub
2018-07	SLBO	Jointly maximises return lower bound over policy and model; optimism in model optimisation encourages exploration of state–action regions not yet well covered	arXiv GitHub
2019-03	SimPLe	Sequential policy optimisation in latent model: pixel-based video prediction model supports model-free policy gradient exploration…	arXiv GitHub

Uncertainty-Aware Exploration

Date	Method	Key Idea	Links
2018-05	PETS	Probabilistic ensemble separates aleatoric from epistemic uncertainty: each probabilistic network models aleatoric noise, while disagreement across ensemble members captures epistemic uncertainty; explores where epistemic uncertainty is highest	arXiv GitHub
2018-10	ME-TRPO	Model-ensemble trust-region policy optimisation: uses N independently trained models and limits policy updates to regions where all models agree, preventing exploitation of epistemic uncertainty in…	arXiv GitHub
2018-10	Model-Based Active Exploration (MAX)	Treats exploration as active learning: at each step selects actions that maximally reduce epistemic uncertainty under the ensemble…	arXiv GitHub

From Pixels to Latent Planning: Representation Learning for World Models

Date	Method	Key Idea	Links
2020-05	Plan2Explore	Novelty = future ensemble disagreement in RSSM latent space; task-agnostic exploration pre-trains a world model before any reward signal is available	arXiv GitHub
2019-02	PlaNet	Pioneers RSSM: deterministic hidden state + stochastic Gaussian latent; stochastic branch forces imagination to explore multiple plausible environmental outcomes	arXiv GitHub
2018-03	World Models	V–M–C architecture: VAE compresses pixels, RNN explores temporal structure in latent space, controller acts within learned representation	arXiv GitHub

Imagination-Based Exploration: The Dreamer Family

Date	Method	Key Idea	Links
2025-09	DreamerV4	Phased training (WM pre-train → policy post-train) solves dual exploration problem; policy leverages WM priors for efficient exploration of long-horizon imagined trajectories	arXiv
2023-01	DreamerV3	Percentile return normalisation stabilises exploration intensity across sparse and dense reward scales; adapts entropy regularisation automatically	arXiv GitHub
2020-10	DreamerV2	Discrete categorical latents + actor entropy bonus as explicit exploration regulariser; prevents premature policy collapse in imagined rollouts	arXiv GitHub
2019-12	DreamerV1	RSSM world model with action noise for environment exploration; broader state-space coverage improves quality of imagined training data	arXiv GitHub
2023-01	DayDreamer	Transfers DreamerV2 to physical robots; latent imagination enables efficient hardware exploration without prohibitive real-world sample requirements	arXiv GitHub

Predictive Architectures: JEPA

Date	Method	Key Idea	Links
2025-06	V-JEPA 2	Video-pretrained world model enabling zero-shot robotic deployment; latent-space exploration closes the loop between imagination and physical execution	arXiv GitHub
2024-04	V-JEPA	First pure-video JEPA: predicts masked spatiotemporal block representations; abstract latent prediction explores rich spatiotemporal structure without pixel-level noise	arXiv GitHub
2023-01	I-JEPA	Image JEPA: learns representations by predicting abstract features of masked image regions from context…	arXiv GitHub

4.3.2 Video Generation as World Simulation

Model as Environment: Video generation models serve as physics engines, enabling RL agents to explore within generated "dreams" at zero physical cost.

Model as Environment

Date	Method	Key Idea	Links
2024-05	Genie	Learns action-conditioned state transitions from unlabelled video; creates controllable virtual sandboxes for agent exploration without real-world interaction	arXiv
2024-03	UniSim	Universal simulator of sensorimotor interactions; trains RL agents entirely in simulated video environments for safe long-tail exploration	arXiv
2024-01	DriveDreamer	Driving-domain video world model conditioned on structured HD-map and traffic annotations…	arXiv GitHub
2024-11	Genie 2	Scalable interactive environment generator: produces persistent 3D-consistent game worlds from a single image prompt…	arXiv
2024-04	Video Language Planning	Combines video generation with language-conditioned planning: generates goal-directed video plans as imagined futures, then executes them via a learned policy…	arXiv GitHub

Planning and Policy Learning in Video World Models

Date	Method	Key Idea	Links
2025-01	Cosmos Policy	Injects latent frames for video–action co-diffusion; RL reward guides generation to explore physically consistent action-conditioned futures	arXiv
2025-01	VideoDPO	DPO applied to video generation; preference data forces generative exploration toward spatiotemporally consistent physical trajectories	arXiv GitHub
2026-01	TAGRPO	Token-level advantage-guided reward policy optimisation for video generation…	arXiv GitHub
2026-02	DreamZero	Zero-shot world model policy: directly uses a pre-trained video world model as a policy by selecting action sequences that steer imagined futures toward high-reward outcomes…	arXiv GitHub
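
The planning idea shared by several entries above (score candidate action sequences by their imagined outcomes, then act on the best one) can be sketched with a generic random-shooting planner. This is an illustrative toy under stated assumptions, not the algorithm of any listed method: `toy_world_model`, `imagined_return`, and `plan_by_random_shooting` are hypothetical names, and the "world model" is a hand-written 1-D function standing in for a learned video model.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface for a learned world model:
# (state, action) -> (next_state, reward).
WorldModel = Callable[[float, float], Tuple[float, float]]


def toy_world_model(state: float, action: float) -> Tuple[float, float]:
    """Stand-in dynamics: a point on a line, rewarded for staying near 0."""
    next_state = state + action
    return next_state, -abs(next_state)


def imagined_return(model: WorldModel, state: float, actions: Sequence[float]) -> float:
    """Roll the action sequence out inside the model and sum the rewards."""
    total = 0.0
    for action in actions:
        state, reward = model(state, action)
        total += reward
    return total


def plan_by_random_shooting(
    model: WorldModel,
    state: float,
    horizon: int = 5,
    n_candidates: int = 64,
    seed: int = 0,
) -> List[float]:
    """Sample random action sequences and keep the best imagined one."""
    rng = random.Random(seed)
    candidates = [
        [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        for _ in range(n_candidates)
    ]
    return max(candidates, key=lambda plan: imagined_return(model, state, plan))


if __name__ == "__main__":
    plan = plan_by_random_shooting(toy_world_model, state=3.0)
    print(plan, imagined_return(toy_world_model, 3.0, plan))
```

No real environment step is taken while planning; only the first action of the chosen plan would typically be executed before replanning.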

Inference-Time Scaling with Video World Models

Date	Method	Key Idea	Links
2025-01	Video-T1	Test-time scaling for video generation: generates multiple candidate trajectories, uses verifiers to select the most physically consistent — inference as tree search	arXiv GitHub
2026-01	WMReward (Inference-Time)	Uses world model value estimates as verifier rewards at inference time…	arXiv GitHub
2026-02	DreamZero (Inference)	Leverages a frozen video world model at inference time to score action proposals via forward imagination…	arXiv GitHub
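
The "inference as tree search" framing in the table above can be illustrated with a small beam search that expands partial trajectories and keeps only those a verifier scores highest. This is a generic sketch, not the procedure of Video-T1 or any other listed method; `beam_search`, `toy_expand`, and `toy_verifier` are hypothetical names, and the verifier here is a trivial sum rather than a learned physical-consistency model.

```python
from typing import Callable, List, Tuple, TypeVar

Node = TypeVar("Node")


def beam_search(
    start: Node,
    expand: Callable[[Node], List[Node]],
    score: Callable[[Node], float],
    depth: int,
    beam_width: int,
) -> Node:
    """Grow candidates step by step, keeping the top `beam_width` by verifier score."""
    beam = [start]
    for _ in range(depth):
        children = [child for node in beam for child in expand(node)]
        if not children:
            break
        # sorted() is stable, so ties keep their expansion order.
        beam = sorted(children, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)


def toy_expand(node: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Hypothetical generator: extend a partial trajectory with one of three steps."""
    return [node + (step,) for step in (-1, 0, 1)]


def toy_verifier(node: Tuple[int, ...]) -> float:
    """Hypothetical verifier: higher is better (here simply the sum of steps)."""
    return float(sum(node))


if __name__ == "__main__":
    print(beam_search((), toy_expand, toy_verifier, depth=3, beam_width=2))
```

Widening the beam or deepening the search spends more test-time compute in exchange for a better verified candidate.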

World Action Models (WAM): Closed-Loop Action Exploration

Date	Method	Key Idea	Links
2026-03	FastWAM	Decouples video generation from policy inference; skips test-time future imagination entirely to achieve 190ms latency for real-time closed-loop action exploration	arXiv GitHub
2025-06	WorldVLA	Unified autoregressive framework generating text, images, and actions; explores action-grounded latent manifold as joint first-class representation	arXiv GitHub
2025-01	Cosmos Policy	Latent frame injection synchronises video and action co-diffusion; closed-loop counterfactual simulation with RL-guided physically consistent exploration	arXiv
2026-02	DreamZero (WAM)	Zero-shot WAM that couples a frozen video world model with a learned action decoder; closed-loop action exploration via iterative world-model querying without any task-specific fine-tuning	arXiv GitHub

Autonomous Driving: Vectorised and Occupancy Exploration

Date	Method	Key Idea	Links
2024-01	DriveDreamer	Driving world model synthesising future scenes conditioned on actions; explores diverse driving futures in latent space to validate plans before physical execution	arXiv GitHub
2023-09	GAIA-1	Generative world model for autonomous driving; produces diverse imagined driving scenarios as a data engine for exploring rare safety-critical events	arXiv
2022-10	MILE	Model-based imitation learning in compact latent space; imagined rollouts in latent representation for sample-efficient exploration of driving behaviours	arXiv GitHub
2024-12	DrivingWorld	Spatiotemporal autoregressive world model for autonomous driving…	arXiv GitHub
2025-06	GenAD	Generalised autonomous driving world model: trains a single generative model across diverse driving datasets…	arXiv GitHub
2024-03	Think2Drive	Converts a pre-trained world model into an online planner…	arXiv GitHub
2023-06	UniAD	Unified autonomous driving framework integrating perception, prediction, and planning in a shared representation…	arXiv GitHub

Date	Method	Key Idea	Links
2024-01	Drive-WM	Action-conditioned multi-view video generation for driving; visualises consequences of hypothetical manoeuvres to explore safe counterfactual futures	arXiv GitHub
2024-03	RealGen	Adversarial retrieval-augmented generation targeting safety-critical scenario boundaries; probes failure-mode hazard zones via adversarial imagination exploration	arXiv GitHub
2025-01	AD-L-JEPA	Autonomous driving latent-space JEPA: predicts action-conditioned future driving representations…	arXiv GitHub
2024-06	Delphi	Dense latent point cloud world model for driving; probabilistic forecasting of future scene states enables uncertainty-aware exploration of counterfactual traffic evolutions	arXiv GitHub
2025-01	UncAD	Uncertainty-aware autonomous driving: estimates both epistemic and aleatoric uncertainty in world model predictions…	arXiv GitHub

Date	Method	Key Idea	Links
2024-08	OccSora	Diffusion-based 4D occupancy generation; synthesises long-horizon occupancy sequences enabling exploration of temporally consistent physical futures	arXiv GitHub
2024-05	OccWorld	Vision-centric 3D occupancy world model for driving; forecasts occupancy evolution providing collision-risk cost volumes for safe spatial exploration	arXiv GitHub
2025-01	Drive-OccWorld	Drives entirely in a 4D occupancy world: unified model for scene generation and ego-planning…	arXiv GitHub
2024-04	Copilot4D	Discretises 3D point cloud scenes into tokens and applies discrete diffusion for 4D world modelling…	arXiv
2025-01	DynamicCity	Dynamic 4D city generation via HexPlane-based occupancy world model; generates temporally consistent large-scale urban occupancy sequences enabling exploration of rare urban environment configurations	arXiv GitHub
2025-04	Gaussian World Model	4D Gaussian splatting world model for autonomous driving…	arXiv GitHub

Date	Method	Key Idea	Links
2024-10	DriveArena	Creates reactive 4D worlds where the policy actively interacts with a neural simulator; enables closed-loop exploration of diverse reactive traffic scenarios	arXiv GitHub
2024-09	SimGen	Cascaded diffusion for high-fidelity, controllable scenario augmentation; addresses long-tail exploration by generating rare safety-critical scenarios on demand	arXiv GitHub
2024-03	DrivingDiffusion	Multi-view video diffusion for closed-loop driving simulation…	arXiv GitHub
2025-05	Raw2Drive	End-to-end closed-loop driving directly from raw sensor data…	arXiv GitHub
2023-06	TrafficBots	Multi-agent traffic simulation via conditional behaviour generation…	arXiv GitHub
2025-04	DrivingSphere	Spherical-projection world model for full 360° closed-loop driving simulation; complete spatial coverage removes blind spots so policies train against surrounding agents in all directions	arXiv GitHub

Social Dynamics: Exploration in Strategic and Normative Environments

Date	Method	Key Idea	Links
2024-01	VBench	Comprehensive benchmark evaluating world model quality including social regularity capture; provides metrics to assess whether exploration produces socially plausible behaviours	arXiv GitHub
2020-08	Social-STGCNN	Spatio-temporal graph CNN models pedestrian trajectory social forces; world model for social dynamics enabling exploration of crowd interaction counterfactuals	arXiv GitHub
2025-10	LCTGen	Language-conditioned traffic generation: natural-language scene specs, structured map retrieval, and multi-agent rollouts for counterfactual social traffic exploration	arXiv GitHub

5. Level 5: Prospector → Ecosystem — Coordination-Space Exploration

Level 5 Coordination-Space Exploration — Why (single-agent limitations), Where (communication, collaboration, role, deployment), and How (orchestration, ensemble, MARL, self-evolving agents).

Key Challenges at Level 5

Scalable Coordination Exploration: The search space is combinatorial, hierarchical, and dynamic, spanning which agents to activate, what communication to establish, and how information flows
Ecosystem-Level Credit Assignment: Disentangling behavioural contribution from structural contribution under sparse, delayed feedback
Diversity vs. Convergence Tension: Balancing ecological diversity against system-level coherence
Role–Communication Co-evolution: Jointly evolving functional specialisation and information exchange protocols

5.1 Multi-Agent Orchestration

5.1.1 Rule-based Orchestration (Reachability-Driven)

Methods that coordinate collaboration through pre-defined routing rules, role protocols, or structured workflows to guarantee deterministic reachability.

Date	Method	Key Idea	Links
2026-02	ORCH	Explores many parallel analysis trajectories and merges them via a deterministic EMA-guided router to ensure reachable consensus in discrete-choice reasoning	arXiv
2025-11	MA-IR	Deterministic multi-agent orchestration for high-quality incident response decision support	arXiv
2025-10	MOSAIC	Task-intelligent orchestration routes specialised agents to explore scientific coding workflows within a rule-governed collaboration space	arXiv
2025-06	AgentOrchestra	Defines the reachable coordination space via a Tool-Environment-Agent (TEA) protocol, scaffolding scalable multi-agent exploration	arXiv
2025	Croto	Cross-team communication rules carve out a reachable space of inter-team collaboration for multi-agent exploration	-
2024-06	MACNET	Predefined topological links bound the agent interaction graph that large-scale multi-agent collaboration can traverse	arXiv
2023-08	MetaGPT	Encodes SOPs as meta-programming so role-based workflows follow reliably reachable collaboration trajectories	arXiv
2023	AgentVerse	Rule-driven collaboration scaffolds multi-agent exploration of emergent group behaviours within a reachable role space	OpenReview

5.1.2 Learnable Orchestration (Competence-Driven)

Methods that train orchestrators, meta-agents, or agent graphs to route tasks toward the most competent executors and expand multi-agent capability boundaries.

Date	Method	Key Idea	Links
2026-01	MAS-Orchestra	Expands multi-agent reasoning competence through holistic orchestration, with controlled benchmarks probing the explored system space	arXiv
2025-05	Puppeteer-Puppet	Evolves a puppeteer that explores dynamic orchestration strategies over puppet agents to extend collaborative competence	arXiv
2025-04	W4S	Trains a weak meta-agent to explore task decompositions and harness strong executor agents beyond its own competence	arXiv
2024-04	CMAT	Collaboration tuning expands small-model agent competence by exploring multi-agent interaction signals	arXiv
2024-02	GPTSwarm	Treats language agents as optimizable computation graphs, enabling competence-driven search over prompts and inter-agent edges	arXiv

5.1.3 Reflection & Information-Theoretic Orchestration (Uncertainty-Driven)

Methods that adapt orchestration policies through reflection feedback or information-theoretic uncertainty signals across long-horizon multi-agent tasks.

Date	Method	Key Idea	Links
2025-09	Orchestrator	Uses active inference to drive multi-agent exploration under epistemic uncertainty across long-horizon tasks	arXiv
2025-04	W4S	Weak meta-agent explores orchestration policies over strong executors, guided by reflection on uncertain outcomes	arXiv
2025-03	MAS-GPT	Trains LLMs to synthesise multi-agent systems per query, exploring the system-design space conditioned on task uncertainty	arXiv
2024-04	CMAT	Reflective multi-agent tuning explores feedback-driven collaboration to calibrate small-agent competence under uncertainty	arXiv

5.1.4 Memory & Knowledge Substrate Exploration

Methods that build self-evolving multi-agent systems on shared memory or knowledge substrates to support long-horizon specialisation and exploration.

Date	Method	Key Idea	Links
2025-05	PiFlow	Principle-aware orchestration grows a scientific knowledge substrate that guides multi-agent discovery exploration	arXiv GitHub
2025-03	MedAgentSim	Self-evolving clinical multi-agent simulation explores new cases by accumulating case-level memory as a shared substrate	arXiv GitHub
2025	SEMC	Self-evolving consultation grows a shared diagnostic knowledge base, expanding the reachable medical case space over time	-
2025-02	MobileSteward	Orchestrates app-oriented agents with self-evolving memory to explore cross-app instruction compositions	arXiv GitHub
2025-01	Mobile-Agent-E	Self-evolving mobile assistant explores complex tasks by accumulating reusable tips and shortcuts as a growing experience substrate	arXiv GitHub
2023-04	Generative Agents	Interactive simulacra use memory-retrieval substrates to explore and surface emergent social behaviour over long horizons	arXiv GitHub

5.2 Agentic Ensemble Papers

5.2.1 Ensemble-During-Inference Papers

Methods that explore how to combine or choose candidates from multiple LLMs during decoding, at the token, span, or reasoning-step level.

Token-Level Ensemble

Date	Method	Key Idea	Links
2025-10	SAFE	Only ensembles at a few well-chosen token steps to keep decoding stable and fast	arXiv
2025-10	CoRe	Uses token and model agreement to downweight unreliable signals	arXiv
2025-05	Transformer Copilot	A Copilot learns from past token mistakes and fixes the Pilot’s logits	arXiv GitHub
2025-02	ABE	Makes different-vocabulary models agree on the same surface token before choosing it	arXiv GitHub
2025-02	CITER	Sends easy tokens to a small model and hard ones to a large model	arXiv GitHub
2024-10	UniTe	Ensembles only the union of top-k tokens instead of the full vocabulary	arXiv
2024-06	GaC	Treats next-token generation like classification and averages token probabilities	arXiv GitHub
2024-04	DeePEn	Maps different vocabularies into a shared space before merging token distributions	arXiv GitHub
2024-04	PackLLM	Gives more weight to models that fit the prompt better	arXiv GitHub
2024-04	EVA	Learns vocabulary mappings so different models can ensemble token by token	arXiv GitHub
2024-02	-	Uses a benign small model to pull token probabilities away from harmful outputs	arXiv
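
As a rough illustration of token-level ensembling, the sketch below averages next-token probability distributions from several models and picks the most probable token. It assumes all models already share one vocabulary (several entries above exist precisely because that is often not true), and the names `average_next_token` and `pick_token` are hypothetical; this is not the implementation of any listed method.

```python
from typing import Dict, List, Optional


def average_next_token(
    distributions: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Weighted average of per-model next-token distributions (shared vocabulary assumed)."""
    if not distributions:
        raise ValueError("need at least one distribution")
    if weights is None:
        weights = [1.0 / len(distributions)] * len(distributions)
    merged: Dict[str, float] = {}
    for dist, weight in zip(distributions, weights):
        for token, prob in dist.items():
            merged[token] = merged.get(token, 0.0) + weight * prob
    total = sum(merged.values())
    # Renormalise so the result is a proper distribution.
    return {token: prob / total for token, prob in merged.items()}


def pick_token(distributions: List[Dict[str, float]]) -> str:
    """Greedy choice from the averaged distribution."""
    merged = average_next_token(distributions)
    return max(merged, key=merged.get)


if __name__ == "__main__":
    model_a = {"cat": 0.6, "dog": 0.4}
    model_b = {"cat": 0.2, "dog": 0.8}
    print(average_next_token([model_a, model_b]), pick_token([model_a, model_b]))
```

Passing non-uniform `weights` gives a simple way to trust one model more than another for a given prompt.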

Span-Level Ensemble

Date	Method	Key Idea	Links
2025-06	RLAE	Adjusts model weights on the fly as generation goes on	arXiv
2024-12	SpecFuse	Lets models draft short spans, then picks the best one for the next step	arXiv
2025-02	Speculative Ensemble	Lets one model draft a span and others verify it for faster decoding	arXiv GitHub
2024-09	SweetSpan	Lets each model write a short span, then uses mutual scoring to choose one	arXiv
2024-07	Cool-Fusion	Waits for a shared word boundary, then selects the best whole span	arXiv

Reasoning-Step Ensemble

Date	Method	Key Idea	Links
2025-11	CBS	Explores many next reasoning steps, then keeps the ones backed by collective consensus	Link
2024-12	LE-MCTS	Searches over next reasoning steps and keeps the path with the best process reward	arXiv

5.2.2 Ensemble-After-Inference Papers

Methods that explore how to compare multiple complete responses after generation, either by selecting the single best answer or by choosing a strong subset for regeneration.

Date	Method	Key Idea	Links
2025-12	LLM-PeerReview	Lets LLM judges score candidate answers and picks the best-reviewed one	arXiv GitHub
2025-10	LLMartini	Aligns answer parts so users can compare and compose a final response	arXiv
2025-10	Beyond Consensus	Uses minority veto to stop overly agreeable judges from accepting bad answers	arXiv GitHub
2025-10	OW/ISP	Uses who agrees with whom, not just vote counts	arXiv
2025-09	FLAME	Aggregates line-level annotations from several LLMs to rank bug locations	arXiv GitHub
2025-09	CARGO	Uses confidence-aware scoring to decide which model to trust more	arXiv
2025-07	LENS	Learns how much to trust each answer from internal states	arXiv
2025-05	EL4NER	Merges small-LLM NER outputs and self-checks the final spans	arXiv
2025-03	Symbolic-MoE	Selects skill-matched experts and then combines their finished reasonings	arXiv GitHub
2025-01	DFPE	Keeps diverse strong models, filters weak ones, and reweights the rest	arXiv GitHub
2025-01	DMoA	Balances diversity and consistency before mixing answers	OpenReview
2024-12	Smoothie	Picks the answer most supported by the others, without labels	arXiv GitHub
2024-10	LLM-Forest	Uses weighted voting across graph-guided prompt variants	arXiv GitHub
2024-10	LLM-TOPLA	Selects a diverse top-k set before regeneration	arXiv GitHub
2024-10	MLKF	Fuses complementary reasoning from multiple LLMs into one answer	Link GitHub
2024-08	URG	Learns ranking and regeneration together	Link
2024-02	Agent-Forest	Samples many answers and keeps the one most supported by voting	arXiv GitHub
2023-06	LLM-Blender	Ranks answers first, then fuses the best few into one	arXiv GitHub
2023-05	MoRE	Uses agreement among reasoning experts to choose an answer or abstain	arXiv GitHub
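
The simplest post-hoc ensemble in the table above is voting over complete answers, optionally abstaining when agreement is weak. The following is a minimal sketch of that idea; `majority_vote` and its `min_agreement` parameter are hypothetical names chosen for illustration, and real systems usually normalise answers before comparing them.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the most common answer, or None (abstain) if support is too low."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return answer
    return None


if __name__ == "__main__":
    print(majority_vote(["42", "42", "41"]))  # agreement 2/3, so an answer is returned
    print(majority_vote(["a", "b", "c"]))     # agreement 1/3, so the ensemble abstains
```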

5.2.3 Ensemble-Before-Inference Papers

Methods that route each query by predicting discrete model utility, such as whether a model is likely to be good enough or which model is better for that query.

Date	Method	Key Idea	Links
2025-10	DiSRouter	Lets models help decide which peer should answer	arXiv
2025-06	TagRouter	Matches queries to model tags instead of training a heavy router	arXiv
2025-06	Router-R1	Learns multi-round routing to choose and combine models better	arXiv GitHub
2025-06	RadialRouter	Builds a structured query view for more robust routing	arXiv
2025-05	RTR	Chooses both the model and the reasoning style	arXiv GitHub
2024-12	Bench-CoE	Routes queries using benchmark-based model strengths	arXiv GitHub
2024-10	GraphRouter	Uses graph structure to choose the best model	arXiv GitHub
2024-09	Eagle	Compares candidate models without extra router training	arXiv
2024-08	SelectLLM	Scores each query and picks an efficient model	arXiv
2024-06	RouteLLM	Learns when a cheaper model can replace a stronger one	arXiv GitHub
2024-05	LLM Routing Lessons	Shows which prompt cues help choose the right model	arXiv GitHub
2024-04	Hybrid-LLM	Balances answer quality and cost before choosing a model	arXiv GitHub
2024-03	ETR	Routes expert tokens to the most suitable specialist model	arXiv GitHub
2024-01	Routoo	Learns to send each query to the model most likely to work	arXiv
2024	RouterDC	Learns query embeddings that make routing easier	Link GitHub
2023-11	ZOOTER	Uses reward signals to pick the right expert model	arXiv
2023-08	FORC	Chooses the cheapest model that is still good enough	arXiv GitHub
2023	Benchmark Routing	Builds routing rules from benchmark-level model performance	OpenReview
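
A common routing rule in this family is "send the query to the cheapest model that is predicted to be good enough". The sketch below shows only that decision rule; the `Candidate` record, the `route` function, and the `predicted_quality` values are hypothetical, and in practice the quality estimate would come from a learned predictor that is outside the scope of this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float               # e.g. price per call (assumed unit)
    predicted_quality: float  # assumed output of some quality predictor, in [0, 1]


def route(candidates: List[Candidate], quality_threshold: float) -> Candidate:
    """Pick the cheapest candidate above the threshold, else the best available."""
    if not candidates:
        raise ValueError("need at least one candidate model")
    good_enough = [c for c in candidates if c.predicted_quality >= quality_threshold]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost)
    # Nothing clears the bar: fall back to the highest predicted quality.
    return max(candidates, key=lambda c: c.predicted_quality)


if __name__ == "__main__":
    pool = [
        Candidate("small", cost=1.0, predicted_quality=0.55),
        Candidate("medium", cost=3.0, predicted_quality=0.80),
        Candidate("large", cost=10.0, predicted_quality=0.95),
    ]
    print(route(pool, quality_threshold=0.75).name)
```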

Continuous Model Utility Routing

Date	Method	Key Idea	Links
2025-10	WebRouter	Compresses web-agent prompts and routes with cost in mind	arXiv
2025-10	LLMRank	Uses rich query features to rank which model should answer	arXiv
2025-05	Avengers	Combines small models by routing queries to their strengths	arXiv GitHub
2025-05	InferenceDynamics	Profiles model skills and knowledge before routing	arXiv
2025-05	kNN Router	Uses nearest past queries instead of a complex learned router	arXiv
2025	RELM	Learns recommendation and evaluation together for model selection	OpenReview
2025-02	LLM Bandit	Explores cheap options first and learns cost-aware routing online	arXiv
2024-12	PickLLM	Uses RL to pick the best model from context and budget cues	arXiv
2024-08	TO-Router	Predicts utility under latency and cost constraints	arXiv
2024-07	MetaLLM	Wraps several models and picks one using predicted utility	arXiv GitHub
2024-06	HomoRouter	Routes queries among similar tools with fine-grained scoring	arXiv
2024-01	Blending	Blends model strengths as a cheaper alternative to one giant model	arXiv

5.2.4 Cascade-Based Papers

Methods that cascade models in sequence—each model handles a subset of the query or task, passing intermediate outputs to the next model in the pipeline.

Date	Method	Key Idea	Links
2025-12	RoBoN	Routes best-of-n samples across multiple LLMs, exploring which model adds the highest next-response gain via reward and agreement signals	arXiv GitHub
2025-09	Semantic Agreement	Uses meaning-level agreement between model outputs to explore whether a query can stop at a smaller model or should defer upward	arXiv
2025-04	EMAFusion	Combines taxonomy routing, learned routing, and confidence-triggered escalation to explore the cheapest reliable model path for each query	arXiv
2025-04	ModelSwitch	Uses sample consistency to explore when repeated sampling should stay with the current model or switch to a complementary one	arXiv GitHub
2024-12	DER	Treats expert selection as sequential route exploration, choosing the next LLM to add complementary knowledge with minimal compute	arXiv
2024-10	Cascade Routing	Unifies routing and cascading to explore the model chain only when quality estimates suggest extra capacity will pay off	arXiv GitHub
2024-04	LM Cascades	Uses token-level uncertainty to explore whether a generative response is reliable enough to stop or should be deferred to a larger model	arXiv
2023-10	AutoMix	Uses self-verification and POMDP routing to explore whether a weaker model is sufficient or a larger model is needed	arXiv GitHub
2023-10	Neural Caching	Uses active selection to explore which queries a continuously distilled student can absorb and which should still go to the teacher	arXiv GitHub
2023-10	MoT Cascade	Uses weak-model answer consistency, enriched with CoT/PoT thought mixtures, to explore whether escalation is necessary	arXiv GitHub
2023-10	EcoAssistant	Explores a hierarchy of assistants, refining with execution feedback and backing off to stronger models only when needed	arXiv GitHub
2023-05	FrugalGPT	Explores budget-aware combinations of LLMs, adaptively choosing a query-specific cascade for cost-efficient accuracy	arXiv
2023-01	Confidence Deferral	Clarifies when confidence-only deferral can explore the cascade effectively and when downstream-aware signals are required	Link
2022-10	Model Cascading	Explores early exit across models of increasing capacity, reserving large-model compute for harder inputs	arXiv
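
Many cascade methods above share one control loop: try a cheaper model first and escalate only when its confidence is too low. A minimal sketch of that loop follows, assuming each model returns an answer together with a confidence score; `run_cascade` and the two toy models are hypothetical, and how confidence is estimated is exactly where the listed papers differ.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical model interface: query -> (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def run_cascade(query: str, models: Sequence[Model], threshold: float) -> Tuple[str, int]:
    """Query models from cheapest to most expensive; stop at the first confident answer.

    Returns the answer and the index of the model that produced it.
    """
    if not models:
        raise ValueError("cascade needs at least one model")
    answer = ""
    for index, model in enumerate(models):
        answer, confidence = model(query)
        if confidence >= threshold:
            return answer, index
    # No model was confident: keep the last (strongest) model's answer.
    return answer, len(models) - 1


def small_model(query: str) -> Tuple[str, float]:
    return "maybe", 0.4  # toy output


def large_model(query: str) -> Tuple[str, float]:
    return "yes", 0.9  # toy output


if __name__ == "__main__":
    print(run_cascade("Is 7 prime?", [small_model, large_model], threshold=0.7))
```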

5.3 Multi-Agent Reinforcement Learning (MARL)

Date	Method	Key Idea	Links
2025-12	SDAX	Treats unsupervised skill discovery as high-level exploration and tunes the diversity-vs-task reward trade-off via bi-level optimisation to learn agile locomotion behaviours	arXiv
2025-09	CERMIC	Calibrates curiosity with inferred peer-intention context to filter noisy novelty and reward high information-gain transitions in sparse-reward MARL	arXiv GitHub
2025-02	TEE	Maximizes cross-agent trajectory entropy in a contrastive latent space via particle-based estimation, yielding intrinsic rewards for diverse coordinated exploration	arXiv
2025-02	Consensus-Diversity Tradeoff	Shows implicit consensus with partial disagreement can preserve exploration diversity and improve robustness in dynamic multi-agent settings	arXiv GitHub
2023-02	EMAX	Uses per-agent value ensembles for UCB-guided exploration, low-variance ensemble targets, and majority-vote action selection to reduce miscoordination	arXiv
2022-08	MACE	Combines collaborative voxel mapping, global goal assignment, and time-aware safe-corridor planning for collision-safe multi-robot exploration of unknown spaces	arXiv
2021-12	MASAC	Proposes a CTDE multi-agent soft actor-critic front-end for collaborative waypoint search, coupled with minimal-snap trajectory optimization for executable robot motion	arXiv
2021-11	EMC	Uses prediction errors of induced individual Q-values as coordinated intrinsic rewards, plus episodic memory to reinforce informative experiences	arXiv GitHub
2021-07	CMAE	Selects shared exploration goals from entropy-scored projected state spaces and trains agents to reach them in a coordinated way	arXiv GitHub
2019-10	MAVEN	Introduces latent-variable hierarchical control to induce temporally committed exploration modes while retaining value-decomposition scalability	arXiv GitHub
2019-10	EITI/EDTI	Encourages coordinated exploration by maximizing inter-agent influence, using mutual information (EITI) and value-of-interaction rewards (EDTI)	arXiv GitHub
2017-05	ICM	Defines curiosity as forward-model prediction error in inverse-dynamics features, promoting exploration without relying on extrinsic rewards	arXiv GitHub
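
Ensemble-based optimism, one of the exploration signals in the table above, can be sketched as an upper-confidence-bound rule: score each action by the ensemble's mean value plus a bonus proportional to its disagreement. This is a generic single-agent illustration, not EMAX's exact formulation; `ucb_action` and `beta` are hypothetical names.

```python
import statistics
from typing import List


def ucb_action(q_ensemble: List[List[float]], beta: float = 1.0) -> int:
    """Pick the action maximising mean + beta * std across ensemble members.

    q_ensemble[m][a] is member m's value estimate for action a.
    """
    if not q_ensemble or not q_ensemble[0]:
        raise ValueError("need at least one member and one action")
    n_actions = len(q_ensemble[0])
    scores = []
    for action in range(n_actions):
        values = [member[action] for member in q_ensemble]
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values)  # disagreement as an uncertainty proxy
        scores.append(mean + beta * spread)
    return max(range(n_actions), key=scores.__getitem__)


if __name__ == "__main__":
    ensemble = [[1.0, 0.0], [1.0, 2.0]]  # members agree on action 0, disagree on action 1
    print(ucb_action(ensemble, beta=1.0), ucb_action(ensemble, beta=0.0))
```

With `beta=0` the rule reduces to greedy selection on the ensemble mean; larger `beta` favours actions the ensemble is unsure about.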

5.4 Self-Evolving Agent Systems

Methods that enable agents to evolve their own capabilities, roles, or knowledge representations through exploration and self-modification.

Date	Method	Key Idea	Links
2025-11	AgentEvolver	Self-evolves an LLM agent through self-questioning, self-navigating, and self-attributing modules so it generates its own tasks, trajectories, and credit signals for autonomous capability growth	arXiv GitHub
2025-05	SPA-RL	Decomposes a long-horizon agent's final reward into per-step progress contributions, providing dense intermediate rewards that stabilise RL on sparse-reward tasks	arXiv GitHub
2025-05	GiGPO	Adds an inner step-level group baseline on top of trajectory-level GRPO, enabling fine-grained credit assignment for multi-turn LLM agent training	arXiv GitHub
2025-05	SRSI	LLM acts as its own judge to score self-generated solutions and uses these self-rewards as RL signal, bootstrapping improvement without external labels	arXiv
2025-03	DAPO	Open-source large-scale LLM RL recipe introducing clip-higher, dynamic sampling, token-level policy-gradient loss, and overlong-reward shaping for stable scaling	arXiv GitHub
2025-02	SiriuS	Builds an experience library of successful multi-agent reasoning trajectories—augmented by re-trying and rewriting failed ones—and fine-tunes the agents on it for self-improvement	arXiv GitHub
2024-11	WebRL	Self-evolving online curriculum turns failed web tasks into new training instructions, paired with an outcome-supervised reward model and KL-constrained policy updates	arXiv GitHub
2024-06	TextGrad	Treats LLM-generated natural-language critiques as textual "gradients" and back-propagates them through compound AI systems to optimise prompts, code, and components end-to-end	arXiv GitHub
2024-06	DigiRL	Two-stage offline-then-online autonomous RL with a VLM-based evaluator and an advantage-filtered curriculum to train robust Android device-control agents in the wild	arXiv GitHub
2024-03	Quiet-STaR	Generalises STaR by sampling internal rationales at every token position and reinforcing the ones that improve next-token prediction, teaching LMs to "think" silently before speaking	arXiv GitHub
2024-02	GRPO	Removes PPO's value network by using group-sampled rollouts to compute the baseline, enabling memory-efficient on-policy RL for LLMs (introduced in DeepSeekMath)	arXiv GitHub
2024-01	SRLM	LLM acts as its own judge via LLM-as-a-Judge prompting, generates preference pairs on its own outputs, and iteratively DPO-trains on them—removing the human reward bottleneck	arXiv
2023-10	SELF	Iterative self-evolution loop where the model self-critiques and self-refines its outputs in natural-language feedback, then fine-tunes on the improved data	arXiv
2023-05	DPO	Re-parameterises RLHF so the LM itself implicitly defines the reward, reducing alignment to a simple classification loss on preference pairs—no separate reward model or RL loop needed	arXiv GitHub
2023-04	RRHF	Aligns LLMs to human preferences by sampling multiple candidate responses and applying a ranking loss based on their reward order, sidestepping the complexity of PPO	arXiv GitHub
2023-03	Self-Refine	Same LLM iteratively produces feedback on its own output and refines it across rounds, improving quality at inference time without any extra training or external supervision	arXiv GitHub
2023-03	Reflexion	Agent verbalises why a trial failed, stores the reflection in episodic memory, and uses it to guide subsequent trials—"verbal" reinforcement learning without weight updates	arXiv GitHub
2022-03	STaR	Generates rationales for problems, rationalises wrong answers given the correct one, and iteratively fine-tunes on rationales that yield correct answers, bootstrapping reasoning ability	arXiv GitHub
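
The group baseline mentioned for GRPO above can be sketched in a few lines: each sampled response's reward is compared with the mean of its own group, so no separate value network is needed. The sketch assumes the commonly described form that also divides by the group's standard deviation; `group_relative_advantages` and `eps` are hypothetical names, and details such as population versus sample standard deviation vary between implementations.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group of rollouts sampled for the same prompt."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(reward - mean) / (std + eps) for reward in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no learning signal.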

6. Exploration Evaluation

Exploration is not a primitive observable but an inferential claim about how an agent handles uncertainty, learning, and future choice. An adequate evaluation must separate exploratory behaviour from mere activity, benefit from base competence, and open-ended search from premature convergence.

6.1 Three Evaluation Principles

C1

Information Gain

The agent actively reduces uncertainty through directed information-seeking.

Signals: uncertainty reduction, calibration improvement, informative state acquisition

C2

Value Improvement

The agent converts acquired information into improved competence.

Signals: performance gains, sample-efficiency, boundary expansion on harder tasks

C3

Epistemic Reachability

The agent preserves access to plausible future states, actions, and hypotheses.

Signals: coverage, behavioural diversity, anti-collapse
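
Two of the signals named above have simple textbook proxies, sketched here for a discrete setting: uncertainty reduction as the drop in Shannon entropy between a prior and a posterior belief, and coverage as the fraction of reachable states actually visited. The function names are hypothetical, and these proxies are illustrative assumptions rather than the survey's prescribed metrics.

```python
import math
from typing import Hashable, Iterable, Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain_bits(prior: Sequence[float], posterior: Sequence[float]) -> float:
    """Entropy reduction from prior to posterior belief (C1 proxy)."""
    return entropy_bits(prior) - entropy_bits(posterior)


def state_coverage(visited: Iterable[Hashable], reachable: Iterable[Hashable]) -> float:
    """Fraction of reachable states that were visited at least once (C3 proxy)."""
    reachable_set = set(reachable)
    if not reachable_set:
        raise ValueError("reachable set must be non-empty")
    return len(set(visited) & reachable_set) / len(reachable_set)


if __name__ == "__main__":
    print(information_gain_bits([0.25] * 4, [0.5, 0.5, 0.0, 0.0]))
    print(state_coverage(["a", "b", "a"], ["a", "b", "c", "d"]))
```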

6.2 Evaluation Benchmarks

Date	Benchmark	Key Idea	Links
Reasoning-Space Evaluation
2024-11	FrontierMath	Exposes the pass@1-to-pass@k gap at capability frontiers, measuring whether reasoning search reaches answers beyond one-shot competence	arXiv
2024-02	OlympiadBench	Probes whether extended reasoning chains resolve genuine uncertainty at olympiad-level difficulty	GitHub
2024-03	LiveCodeBench	Adds executable feedback to reasoning evaluation, making self-correction gain directly measurable	GitHub
Interaction-Space Evaluation
2024-03	WorkArena	Reveals whether the controller reopens exploration when web-based interfaces change across knowledge-work scenarios	GitHub
2024-04	OSWorld	Tests whether multimodal agents reopen search and adapt action policies when open-ended computer tasks shift tool demands	GitHub
Imagination-Space Evaluation
2022-06	MineDojo	Evaluates whether embodied agents leverage internal simulation to reduce real interaction cost in open-ended environments	GitHub
2022-06	PlanBench	Isolates planning-layer failures by testing whether agents reason coherently about action and change	GitHub
Coordination-Space Evaluation
2019-02	SMAC	Tests whether decentralised agents reduce joint uncertainty through tactical coordination	GitHub
2023-10	Sotopia	Assesses whether social interaction reduces joint uncertainty rather than producing superficial verbal agreement	GitHub
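
The pass@1-to-pass@k gap mentioned in the table is usually computed with the standard unbiased pass@k estimator from code-generation evaluation: given n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; whether a particular benchmark above reports it in exactly this form is an assumption, and `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated, c: number of correct samples, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 1
    gap = pass_at_k(n, c, 10) - pass_at_k(n, c, 1)
    print(pass_at_k(n, c, 1), pass_at_k(n, c, 10), gap)
```

A large gap between pass@1 and pass@k suggests that correct answers are reachable by search but not yet the model's default output.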

6.3 Open Challenges

Open-Domain Scalable Exploration

Scalable exploration for reasoning beyond verifiable tasks remains an open problem.

Safe Exploration with World Action Models

Ensuring that exploration stays safe inside predictive world action models remains an open problem.

Causal & Counterfactual Reasoning

Supporting causal and counterfactual reasoning in imagination space remains an open problem.

Scalable Coordination

Exploring the coordination space without combinatorial explosion remains an open problem.