v0.15.0 Cross-environment validation ships: four-arm CPG matrix replayed on DMC Cartpole-swingup at two TD-MPC2 capacities. MODEL BOTTLENECK reproduces in all four cells at model_size = 5; at model_size = 1, the CEM×TD-MPC2 cell flips to INCONCLUSIVE — the first moderate-$n$ verdict the metric's gate refuses to commit on. First non-zero learned-arm successes in the paper. Release notes →

Evaluating world models like they will ship

Static AI benchmarks measure how well a model predicts. They miss what an applied team actually needs to know: success rate, latency budget, compute cost, robustness under perturbation. This is a small, opinionated evaluation layer that closes that gap.

The next bottleneck for world models is not only model quality. It is proof of usefulness.
A 7x7 maze with an animated agent walking the optimal path from start to goal.
An agent walks the 7x7 maze. Optimal path = 14 actions. The world-model planner finds it in ~34 steps with replanning.


Step 01

The problem

Action-conditioned world models are routinely evaluated on prediction quality: reconstruction loss, FID, held-out next-frame accuracy. None of these answer the question an applied team must answer before integrating a model into a control loop.

The question is decision quality, not prediction quality. Does the model, when used by a planner, produce decisions that succeed within the latency and compute budget the deployment will accept? Does it recover from perturbations? Does it generalise across related tasks?

A low validation MSE is necessary but not sufficient. The framework’s headline example: a Markovian MLP world model with val_mse $= 0.026$ on DMC Acrobot. Predicting accurately. Planning a $0\%$ success rate.

See Thesis and Evaluation gap for the long form.

Step 02

The evaluation contract

Every adapter exposes four hooks. The benchmark runner does the rest: rollouts, perturbations, latency measurement, scorecard.


The contract is intentionally minimal: encode (observation → latent), rollout (latent + action sequence → latent sequence), score (latent + goal → reward), plan (observation + goal + horizon → action sequence). Anything that implements these four methods plugs into the same runner and gets compared on the same scorecard structure.

Concrete subclasses live in src/wmel/adapters/: a stdlib tabular planner, a PyTorch MLP, and the DMC Acrobot oracle. None of them know about the runner, the metrics, or each other.
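The four hooks can be pictured as an abstract base class. The sketch below is illustrative only: the class names `WorldModelAdapter` and `LineWorldAdapter`, the type hints, and the 1-D toy are assumptions made for this example, not the actual code in src/wmel/adapters/.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Sequence


class WorldModelAdapter(ABC):
    """Hypothetical base class mirroring the four hooks named above."""

    @abstractmethod
    def encode(self, observation: Any) -> Any:
        """observation -> latent"""

    @abstractmethod
    def rollout(self, latent: Any, actions: Sequence[Any]) -> List[Any]:
        """latent + action sequence -> latent sequence"""

    @abstractmethod
    def score(self, latent: Any, goal: Any) -> float:
        """latent + goal -> reward"""

    @abstractmethod
    def plan(self, observation: Any, goal: Any, horizon: int) -> List[Any]:
        """observation + goal + horizon -> action sequence"""


class LineWorldAdapter(WorldModelAdapter):
    """Toy 1-D world (an assumption for illustration): state is an int, actions are -1 or +1."""

    def encode(self, observation):
        return int(observation)

    def rollout(self, latent, actions):
        latents = []
        for action in actions:
            latent = latent + action
            latents.append(latent)
        return latents

    def score(self, latent, goal):
        # Negative distance to the goal, so higher is better.
        return -abs(goal - latent)

    def plan(self, observation, goal, horizon):
        latent = self.encode(observation)
        step = 1 if goal > latent else -1
        # Walk straight toward the goal, never planning past the horizon.
        return [step] * min(horizon, abs(goal - latent))


adapter = LineWorldAdapter()
actions = adapter.plan(observation=0, goal=3, horizon=5)
final_latent = adapter.rollout(adapter.encode(0), actions)[-1]
print(actions, adapter.score(final_latent, goal=3))
```

Because the runner would only ever call these four methods, a stdlib planner and a PyTorch model can sit behind the same interface without knowing about each other.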

Step 03

How it behaves on a toy

A 7x7 maze with a vertical wall and one doorway. Same env, same 30 episodes, same seed; three different planners. The framework's first non-trivial demonstration that the contract holds and the metrics discriminate.

Three policies, side by side

Random

Samples actions uniformly at random.

0%

success rate over 30 episodes

latency / call
0.03 ms
compute / decision
n/a
verdict
Wanders near the start. Goal stays out of reach.

Greedy (no waypoint)

Always step toward the goal in Manhattan distance.

0%

success rate over 30 episodes

latency / call
0.001 ms
compute / decision
n/a
verdict
Walks into the wall. Plan diverges from env, stuck.

Tabular world model

Random-shooting MPC over a learned-style dynamics function.

100%

success rate over 30 episodes

latency / call
3.12 ms
compute / decision
~256 rollout-units
verdict
Finds the corridor. Goal in ~34 steps (optimal is 14).
Three side-by-side mini-mazes. The random agent wanders near the start; the greedy agent walks into the wall and stays stuck; the world-model agent finds the corridor and walks through it to the goal.
Three agents, three outcomes, one shared evaluation contract.
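For readers unfamiliar with random-shooting MPC, here is a minimal stdlib sketch of the idea: sample random action sequences, roll each one out through a `dynamics` callable, keep the best, execute only its first action, then replan. The open grid without a wall, the cumulative Manhattan score, and every function name and default below are assumptions for illustration, not the framework's planner.

```python
import random

SIZE = 7
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # unit moves as (dx, dy)


def grid_dynamics(state, action):
    """Illustrative dynamics: move on an open SIZE x SIZE grid, clamped at the border."""
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))


def score(state, goal):
    """Negative Manhattan distance to the goal, so higher is better."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))


def random_shooting_plan(state, goal, dynamics, horizon=20, n_candidates=64, rng=None):
    """Return the sampled action sequence with the best cumulative score."""
    if rng is None:
        rng = random.Random(0)
    best_seq, best_total = None, float("-inf")
    for _ in range(n_candidates):
        seq = [rng.choice(ACTIONS) for _ in range(horizon)]
        s, total = state, 0
        for action in seq:
            s = dynamics(s, action)   # imagined step through the model
            total += score(s, goal)
        if total > best_total:
            best_seq, best_total = seq, total
    return best_seq


def run_episode(start, goal, dynamics, max_steps=200, seed=0):
    """Closed-loop MPC: replan at every step, execute only the first action.

    Returns the number of steps taken to reach the goal, or None on timeout.
    """
    rng = random.Random(seed)
    state = start
    for step in range(1, max_steps + 1):
        action = random_shooting_plan(state, goal, dynamics, rng=rng)[0]
        state = grid_dynamics(state, action)  # the "real" environment step
        if state == goal:
            return step
    return None


steps = run_episode(start=(0, 0), goal=(6, 6), dynamics=grid_dynamics)
print("steps to goal:", steps)
```

Swapping `grid_dynamics` for any other callable with the same signature changes the model the planner imagines with, while the planner itself stays untouched.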

The same contract holds for a learned model

The thesis is only credible if a real learned model can plug into the same evaluation layer. The smallest demonstration: a tiny PyTorch MLP trained on 64 maze transitions, passed in as the dynamics= callable.

Oracle dynamics (stdlib)

The reference run from the section above.

100%

success rate over 30 episodes

latency / call
3.12 ms
compute / decision
~256 rollout-units
verdict
Reaches the goal in ~34 steps.

Learned MLP dynamics (PyTorch)

Same MPC planner, dynamics is now a tiny MLP trained on 64 transitions.

100%

success rate over 30 episodes

latency / call
236.93 ms
compute / decision
~256 rollout-units
verdict
Contract holds. Latency is 76x higher.

Same success, same steps to success, same nominal compute, yet 76 times the per-call latency at horizon 20. Without measuring latency per call, you would conclude “it works just as well!” while the actual deployment cost is nearly two orders of magnitude higher.
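Per-call latency is easy to measure with the standard library. The sketch below times two stand-in callables that return the same answer at very different cost; the helper name `latency_ms_per_call` and both dummy dynamics functions are hypothetical and only mimic the situation described above.

```python
import statistics
import time


def latency_ms_per_call(fn, *args, n_calls=50, warmup=5):
    """Mean wall-clock latency of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up calls are not timed
    samples = []
    for _ in range(n_calls):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples)


def cheap_dynamics(x):
    """Stand-in for a table lookup."""
    return x + 1


def costly_dynamics(x):
    """Stand-in for a heavier model: same answer, more work per call."""
    total = 0
    for i in range(20000):
        total += i
    return x + 1


fast = latency_ms_per_call(cheap_dynamics, 0)
slow = latency_ms_per_call(costly_dynamics, 0)
print(f"cheap: {fast:.4f} ms/call, costly: {slow:.4f} ms/call, ratio: {slow / fast:.0f}x")
```

Both callables are indistinguishable by output alone; only the timing column separates them.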

Effective planning horizon, made visible

Sweep the planning horizon of the tabular world-model planner and watch where it pays off. Hover any horizon to see all four metrics together. Success saturates at $h = 15$; per-call latency keeps climbing past the plateau without buying any extra success.

Planning-horizon sweep (maze toy, tabular world model). The chart plots success rate and planning latency per call against plan_horizon, each with a 95% CI band.

| plan_horizon | success rate | latency (ms/call) | compute / decision |
|---|---|---|---|
| 5 | 0.000 | 0.887 | 368.3 |
| 10 | 0.900 | 1.664 | 350.6 |
| 15 | 1.000 | 2.504 | 278.7 |
| 20 | 1.000 | 3.183 | 256.4 |
| 30 | 1.000 | 4.593 | 277.5 |
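One way to read the sweep programmatically is to pick the cheapest horizon that still reaches the best observed success rate. The sketch below uses the five data points from the sweep above; the dictionary keys and the helper `effective_horizon` are hypothetical and do not reflect the schema of the real sweep.json.

```python
# Data points copied from the horizon sweep on this page.
sweep = [
    {"plan_horizon": 5, "success": 0.000, "latency_ms": 0.887},
    {"plan_horizon": 10, "success": 0.900, "latency_ms": 1.664},
    {"plan_horizon": 15, "success": 1.000, "latency_ms": 2.504},
    {"plan_horizon": 20, "success": 1.000, "latency_ms": 3.183},
    {"plan_horizon": 30, "success": 1.000, "latency_ms": 4.593},
]


def effective_horizon(rows):
    """Cheapest row (by latency) among those with the best success rate."""
    best_success = max(row["success"] for row in rows)
    saturated = [row for row in rows if row["success"] == best_success]
    return min(saturated, key=lambda row: row["latency_ms"])


choice = effective_horizon(sweep)
print(choice["plan_horizon"])  # -> 15

# Latency paid at the longest horizon relative to the chosen one.
overhead = sweep[-1]["latency_ms"] / choice["latency_ms"]
print(f"horizon 30 costs {overhead:.2f}x the latency of horizon {choice['plan_horizon']}")
```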

Step 04

How it scales to real control

The framework's flagship metric. Run the same random-shooting MPC planner twice on DeepMind Control Suite Acrobot-swingup — once against oracle dynamics (real MuJoCo physics), once against a Markovian MLP world model trained on 2 000 random transitions. The only thing that changes is the dynamics= callable. The success-rate difference is the Counterfactual Planning Gap.

Oracle dynamics

Random-shooting MPC against real MuJoCo physics.

30%

success rate over 10 episodes

latency / call
77.3 ms
compute / decision
407.1 rollout-units
avg steps to success
180.7

Learned MLP dynamics

Same MPC, same scoring, MLP trained on 2 000 random transitions.

0%

success rate over 10 episodes

latency / call
65.3 ms
compute / decision
157.3 rollout-units
val MSE
0.026 (low) - yet success collapses

Counterfactual Planning Gap

Decoupling model error from planner capacity.

+0.30

raw difference of success rates

AC 95% CI
[-0.06, +0.56]
n / arm
10 episodes
verdict
INCONCLUSIVE

A low validation MSE does not translate into closed-loop success. CPG quantifies the planning-side gap with an Agresti–Caffo $95\%$ confidence interval that does not collapse at the boundary proportions $p \in \{0, 1\}$, where the standard Wald approximation degenerates. The verdict is gated on the CI lower bound, not the raw point estimate: at $n = 10$ the framework reports INCONCLUSIVE rather than over-claiming a model bottleneck.
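The interval can be reproduced by hand. The sketch below implements the textbook Agresti–Caffo interval for a difference of two proportions (add one success and one failure to each arm, then apply the Wald formula) and a deliberately simplified two-way gate on the lower bound. The function names and the two-branch gate are assumptions for illustration; the framework's own verdict logic has more branches than shown here.

```python
import math


def cpg(x_oracle, n_oracle, x_learned, n_learned):
    """Raw Counterfactual Planning Gap: oracle success rate minus learned success rate."""
    return x_oracle / n_oracle - x_learned / n_learned


def agresti_caffo_ci(x_oracle, n_oracle, x_learned, n_learned, z=1.96):
    """95% Agresti-Caffo CI for the difference of two proportions."""
    # Adjusted proportions: one pseudo-success and one pseudo-failure per arm.
    p_o = (x_oracle + 1) / (n_oracle + 2)
    p_l = (x_learned + 1) / (n_learned + 2)
    diff = p_o - p_l
    se = math.sqrt(p_o * (1 - p_o) / (n_oracle + 2) + p_l * (1 - p_l) / (n_learned + 2))
    return diff - z * se, diff + z * se


def simplified_verdict(ci_low):
    """Hypothetical two-branch gate: commit only if the CI lower bound clears zero."""
    return "MODEL BOTTLENECK" if ci_low > 0 else "INCONCLUSIVE"


# Single-seed run on this page: oracle 3/10, learned 0/10.
low, high = agresti_caffo_ci(3, 10, 0, 10)
print(round(cpg(3, 10, 0, 10), 2), round(low, 2), round(high, 2), simplified_verdict(low))
# -> 0.3 -0.06 0.56 INCONCLUSIVE

# Pooled counts from the multi-seed extension on this page: oracle 40/150, learned 0/150.
low_pooled, high_pooled = agresti_caffo_ci(40, 150, 0, 150)
print(round(low_pooled, 3), round(high_pooled, 3), simplified_verdict(low_pooled))
# -> 0.191 0.335 MODEL BOTTLENECK
```

Note that the interval is centred on the adjusted difference, not on the raw CPG, which is why it stays well-defined when one arm scores exactly zero.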

Multi-seed extension: capacity vs. coverage

At $n = 10$ the framework refused to commit. We then pooled three seeds at $n = 50$ episodes per arm per seed and swept the MLP’s training-set size by a factor of $100$. The verdict hardens to MODEL BOTTLENECK with a tight, identical confidence interval in every cell.

| Train size | Val MSE | Oracle (successes/episodes) | Learned (successes/episodes) | Raw CPG | AC 95% CI | Verdict |
|---|---|---|---|---|---|---|
| $200$ | $0.0651$ | $40/150$ | $0/150$ | $+0.267$ | $[+0.191, +0.335]$ | MODEL BOTTLENECK |
| $2{,}000$ | $0.0233$ | $40/150$ | $0/150$ | $+0.267$ | $[+0.191, +0.335]$ | MODEL BOTTLENECK |
| $20{,}000$ | $0.0004$ | $40/150$ | $0/150$ | $+0.267$ | $[+0.191, +0.335]$ | MODEL BOTTLENECK |

Held-out validation MSE falls roughly 150-fold across the three cells. Planning success stays at exactly zero. The gap does not close. A prediction-quality metric alone would have declared the largest-data cell solved; CPG points to a data-coverage bottleneck (random rollouts in Acrobot never visit the upright-balancing regime) as the most parsimonious read, with planner-side and score-function residuals as plausible second-order contributors. The recommended remediation is to change the data-collection policy, not to grow the model.

Read the full page on CPG →  ·  Read the paper →

Step 05

Reproduce in 25 seconds

No GPU, no heavy ML dependency at runtime. Core install plus an installed CLI; optional extras for PyTorch and DMC.

git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev]"

Then run a single benchmark or sweep the planning horizon via the installed wmel console script:

# One scorecard, one JSON report
wmel run --env maze_toy --policy tabular-world-model --episodes 30 --output run.json

# Horizon sweep, comma-separated horizons, one combined JSON
wmel sweep --env maze_toy --plan-horizons 5,10,15,20,30 --output sweep.json

The DMC Acrobot CPG worked example needs the [control] and [learned] extras:

pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.cpg
# -> results/dmc_acrobot/cpg.json

Every JSON report carries a versioned envelope (schema_version, wmel_version, generated_at).
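A consumer of these reports might check the envelope before reading anything else. The sketch below assumes the three fields sit at the top level of the JSON object, which the page does not state; the helper name `missing_envelope_keys` is hypothetical, and the sample values other than the version number are placeholders.

```python
import json

REQUIRED_ENVELOPE_KEYS = ("schema_version", "wmel_version", "generated_at")


def missing_envelope_keys(report):
    """Return the envelope keys absent from a parsed report (assumed top-level)."""
    return [key for key in REQUIRED_ENVELOPE_KEYS if key not in report]


# Synthetic report for illustration; a real one would come from e.g. run.json.
example = json.loads(
    '{"schema_version": "placeholder", "wmel_version": "0.15.0",'
    ' "generated_at": "placeholder", "results": {}}'
)

missing = missing_envelope_keys(example)
if missing:
    raise ValueError(f"report is missing envelope keys: {missing}")
print("envelope ok, produced by wmel", example["wmel_version"])
```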

Releases

v0.15.0 2026-05-23 · current

Cross-environment: DMC Cartpole-swingup, two capacities

Four-arm CPG matrix on a second env at TD-MPC2 model_size = 5 AND model_size = 1, $n = 30$ pooled each. All four cells at size = 5 reproduce MODEL BOTTLENECK; the CEM×TD-MPC2 cell at size = 1 flips to INCONCLUSIVE (learned $0.533$ vs oracle $0.500$, CPG $-0.033$, CI $[-0.28, +0.21]$) — first moderate-$n$ INCONCLUSIVE in the paper. Paper §5.10 + Figures 3 and 4. GPU queue added at experiments/GPU_ROADMAP.md.

v0.14.1 2026-05-23

First two paper figures (CPG vs data, coverage histogram)

PGF/TikZ Figure 1 (val MSE plummets $\sim 150\times$ while CPG stays flat at $+0.267$, asymmetric Agresti-Caffo CI) and Figure 2 (uprightness coverage: $0/2000$ random states reach upright vs $20.2\%$ for oracle). Two adversarial-review fixes addressed: sig-fig parity and asymmetric error bars.

v0.14.0 2026-05-23

Robustness sweep: published model, stronger planner, perturbation

Three new axes test the v0.11 MODEL BOTTLENECK verdict: TD-MPC2 (2M env steps) as dynamics, CEM as planner, DropNextActions(k) as in-episode perturbation. Verdict survives all three; pooled-150 under CEM tightens CI half-width to 0.054. Paper Sections 5.8 + 5.9.

v0.11.0 2026-05-16

Multi-seed CPG sweep: capacity vs. coverage

Verdict hardens from INCONCLUSIVE (n = 10) to MODEL BOTTLENECK (n = 150 pooled across three seeds); training-set sweep across {200, 2 000, 20 000} transitions leaves verdict and CI identical while validation MSE drops 150×. Paper Section 5.5 + 5.6 updated.

v0.10.0 2026-05-16

Short paper: Counterfactual Planning Gap

~7-page LaTeX paper under paper/, 23 BibTeX entries, reproducibility script, three adversarial-review findings addressed before tag.

v0.9.0 2026-05

CPG metric with Agresti-Caffo CI and gated verdict

Five-branch verdict gated on the CI lower bound; honest INCONCLUSIVE at n=10 instead of over-claiming a Wald-CI-driven significance.

v0.8.0 2026-05

DMC Acrobot-swingup wired in

First non-toy environment via wmel.envs.dmc_acrobot, with a Markovian MLP learned dynamics and an oracle dynamics factory. dm-control is an optional extra.

v0.7.0 2026-04

CLI, versioned JSON schema, perturbation-aware sweep

wmel run / wmel sweep console scripts; JSON envelope with schema_version, wmel_version, generated_at. Second CI job locks the no-torch runtime promise.

v0.6.0 2026-04

Proof of contract for learned PyTorch dynamics

PyTorch MLP fits the maze's transition table and plugs in as a drop-in dynamics= callable. Identical success, 76x higher per-call latency -- the trade-off the framework is built to expose.

v0.5.0 2026-04

Pluggable perturbation library

Perturbation, EnvPerturbation, DropNextActions, CompositePerturbation. Runner inner loop switched to deque for O(1) action-queue pops.

Disclaimer

This is an independent study of evaluation methodology for action-conditioned world models. It is not an official artifact of AMI, Meta, the LeWorldModel project, or any of their authors, and not an artifact of any current or past employer of the author. References to JEPA-style or LeWorldModel concepts are conceptual, not affiliational.