Random
Samples actions uniformly at random. Success rate is measured over 30 episodes.
- latency / call: 0.03 ms
- compute / decision: n/a
- verdict: Wanders near the start. Goal stays out of reach.
Static AI benchmarks measure how well a model predicts. They miss what an applied team actually needs to know: success rate, latency budget, compute cost, robustness under perturbation. This is a small, opinionated evaluation layer that closes that gap.
The next bottleneck for world models is not only model quality. It is proof of usefulness.
Step 01
Action-conditioned world models are routinely evaluated on prediction quality: reconstruction loss, FID, held-out next-frame accuracy. None of these answer the question an applied team must answer before integrating a model into a control loop.
The question is decision quality, not prediction quality. Does the model, when used by a planner, produce decisions that succeed within the latency and compute budget the deployment will accept? Does it recover from perturbations? Does it generalise across related tasks?
A low validation MSE is necessary but not sufficient. The framework’s headline example is a Markovian MLP world model with val_mse $= 0.026$ on DMC Acrobot: it predicts accurately, yet planning with it yields a $0\%$ success rate.
See Thesis and Evaluation gap for the long form.
Step 02
Every adapter exposes four hooks. The benchmark runner does the rest: rollouts, perturbations, latency measurement, scorecard.
The contract is intentionally minimal: encode (observation → latent), rollout (latent + action sequence → latent sequence), score (latent + goal → reward), plan (observation + goal + horizon → action sequence). Anything that implements these four methods plugs into the same runner and gets compared on the same scorecard structure.
Concrete subclasses live in src/wmel/adapters/: a stdlib tabular planner, a PyTorch MLP, and the DMC Acrobot oracle. None of them know about the runner, the metrics, or each other.
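A minimal sketch of what such a four-method contract could look like. The class names `WorldModelAdapter` and `LineWorldAdapter`, the type hints, and the toy line world are assumptions made for illustration; the actual base class in src/wmel/adapters/ may be shaped differently.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Sequence


class WorldModelAdapter(ABC):
    """Hypothetical base class mirroring the four hooks named in the text."""

    @abstractmethod
    def encode(self, observation: Any) -> Any:
        """observation -> latent"""

    @abstractmethod
    def rollout(self, latent: Any, actions: Sequence[Any]) -> List[Any]:
        """latent + action sequence -> latent sequence"""

    @abstractmethod
    def score(self, latent: Any, goal: Any) -> float:
        """latent + goal -> reward (higher is better)"""

    @abstractmethod
    def plan(self, observation: Any, goal: Any, horizon: int) -> List[Any]:
        """observation + goal + horizon -> action sequence"""


class LineWorldAdapter(WorldModelAdapter):
    """Toy adapter: integer position on a line, actions are -1 or +1."""

    def encode(self, observation):
        return int(observation)  # identity latent

    def rollout(self, latent, actions):
        states = []
        for action in actions:
            latent = latent + action
            states.append(latent)
        return states

    def score(self, latent, goal):
        return -abs(goal - latent)

    def plan(self, observation, goal, horizon):
        latent = self.encode(observation)
        # Only two candidate plans: always left or always right.
        candidates = [[-1] * horizon, [+1] * horizon]
        return max(
            candidates,
            key=lambda seq: self.score(self.rollout(latent, seq)[-1], goal),
        )


adapter = LineWorldAdapter()
plan = adapter.plan(observation=0, goal=3, horizon=3)
print(plan)  # -> [1, 1, 1]
```

The point of the sketch is that the adapter never references a runner or a metric: anything that satisfies the four signatures can be driven by the same evaluation loop.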
Step 03
A 7x7 maze with a vertical wall and one doorway. Same env, same 30 episodes, same seed; three different planners. The framework's first non-trivial demonstration that the contract holds and the metrics discriminate.
The three planners, each scored by success rate over 30 episodes:
- Samples actions uniformly at random.
- Always steps toward the goal in Manhattan distance.
- Random-shooting MPC over a learned-style dynamics function.
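The third planner, random-shooting MPC, can be sketched in a few lines of stdlib Python. Everything specific here is an assumption for illustration, not taken from the repository: the wall column and doorway row, the start and goal cells, the sample count, and the Manhattan-distance cost. Only the pattern matters: sample action sequences, roll each through a `dynamics` callable, execute the first action of the best one, then replan.

```python
import random

SIZE = 7
WALL_X, DOOR_Y = 3, 3  # assumed layout: wall on column 3, doorway at row 3
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def dynamics(state, action):
    """Tabular-style transition: move unless blocked by a border or the wall."""
    x, y = state
    nx, ny = x + action[0], y + action[1]
    if not (0 <= nx < SIZE and 0 <= ny < SIZE):
        return state
    if nx == WALL_X and ny != DOOR_Y:
        return state
    return (nx, ny)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def plan(state, goal, horizon, n_samples, rng, dynamics=dynamics):
    """Random shooting: keep the sampled sequence that ends closest to the goal."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_samples):
        seq = [rng.choice(ACTIONS) for _ in range(horizon)]
        s = state
        for a in seq:
            s = dynamics(s, a)
            if s == goal:
                break
        cost = manhattan(s, goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq


def run_episode(start, goal, horizon, max_steps=60, seed=0):
    """Closed loop: replan at every step, execute only the first action."""
    rng = random.Random(seed)
    state = start
    for step in range(max_steps):
        if state == goal:
            return True, step
        action = plan(state, goal, horizon, n_samples=64, rng=rng)[0]
        state = dynamics(state, action)
    return state == goal, max_steps


success, steps = run_episode(start=(0, 3), goal=(6, 3), horizon=10)
print(success, steps)
```

Because `dynamics` is an argument of `plan`, a learned model with the same `(state, action) -> next_state` signature can be swapped in without touching the planner.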
The thesis is only credible if a real learned model can plug into the same evaluation layer. The smallest demonstration: a tiny PyTorch MLP trained on 64 maze transitions, passed in as the dynamics= callable.
Two runs, each scored by success rate over 30 episodes:
- The reference run from the section above.
- Same MPC planner; the dynamics is now a tiny MLP trained on 64 transitions.
Same success, same steps to success, same nominal compute, but 76 times the per-call latency at horizon 20. Without measuring latency per call, you would conclude “it works just as well!” while the actual deployment cost is nearly two orders of magnitude higher.
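A per-call latency figure of this kind can be obtained with a small timing harness. The sketch below is an assumption about how one might measure it, not the framework's own instrumentation; `latency_ms_per_call`, `cheap_dynamics`, and `costly_dynamics` are hypothetical names, and the busy loop merely stands in for a heavier model that returns the same answer.

```python
import statistics
import time


def latency_ms_per_call(fn, *args, repeats=200, warmup=10):
    """Median wall-clock latency of fn(*args), in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


def cheap_dynamics(state, action):
    return state + action


def costly_dynamics(state, action):
    # Stand-in for a heavier model: identical output, more work per call.
    acc = 0
    for _ in range(2000):
        acc += 1
    return state + action


fast = latency_ms_per_call(cheap_dynamics, 0, 1)
slow = latency_ms_per_call(costly_dynamics, 0, 1)
ratio = slow / max(fast, 1e-9)  # guard against a zero reading on coarse clocks
print(f"fast={fast:.4f} ms, slow={slow:.4f} ms, ratio={ratio:.0f}x")
```

Both callables would yield identical success rates in a planner, so only the timing column separates them.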
Sweep the planning horizon of the tabular world-model planner and watch where it pays off. Success saturates at $h = 15$; per-call latency keeps climbing past the plateau without buying any extra success.
Step 04
The framework's flagship metric. Run the same random-shooting MPC planner twice on DeepMind Control Suite Acrobot-swingup — once against oracle dynamics (real MuJoCo physics), once against a Markovian MLP world model trained on 2 000 random transitions. The only thing that changes is the dynamics= callable. The success-rate difference is the Counterfactual Planning Gap.
The two arms and the resulting gap:
- Oracle: random-shooting MPC against real MuJoCo physics (success rate over 10 episodes).
- Learned: same MPC, same scoring, MLP trained on 2 000 random transitions (success rate over 10 episodes).
- CPG: raw difference of success rates, decoupling model error from planner capacity.
A low validation MSE on prediction quality does not translate into closed-loop success. CPG quantifies the planning-side gap with an Agresti–Caffo $95\%$ confidence interval that does not collapse at the boundary proportions $p \in \{0, 1\}$ where the standard Wald approximation degenerates. The verdict is gated on the CI lower bound, not the raw point estimate – at $n = 10$ the framework reports INCONCLUSIVE rather than over-claiming a model bottleneck.
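The interval itself is short to compute. The sketch below uses the textbook Agresti-Caffo adjustment (add one success and one failure to each arm, then apply the Wald formula to the adjusted proportions); the function name `cpg_agresti_caffo` is hypothetical and the framework's own implementation and verdict logic may differ. The inputs are the pooled counts reported for this experiment, 40/150 for the oracle arm and 0/150 for the learned arm.

```python
from math import sqrt
from statistics import NormalDist


def cpg_agresti_caffo(oracle_successes, oracle_n,
                      learned_successes, learned_n, level=0.95):
    """Raw CPG (oracle minus learned) and an Agresti-Caffo CI for the difference."""
    raw = oracle_successes / oracle_n - learned_successes / learned_n
    # Agresti-Caffo: add one success and one failure to each arm.
    p1 = (oracle_successes + 1) / (oracle_n + 2)
    p2 = (learned_successes + 1) / (learned_n + 2)
    se = sqrt(p1 * (1 - p1) / (oracle_n + 2) + p2 * (1 - p2) / (learned_n + 2))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = p1 - p2
    return raw, (centre - z * se, centre + z * se)


raw, (lo, hi) = cpg_agresti_caffo(40, 150, 0, 150)
print(f"raw CPG = {raw:+.3f}, AC 95% CI = [{lo:+.3f}, {hi:+.3f}]")
# -> raw CPG = +0.267, AC 95% CI = [+0.191, +0.335]
```

Two properties are visible here. The interval is centred on the adjusted difference (about $0.263$) rather than on the raw CPG, so it is asymmetric around the point estimate. And with a learned arm at exactly $0$ successes the standard error stays strictly positive, which is the boundary behaviour the plain Wald interval lacks.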
At $n = 10$ the framework refused to commit. We then pooled three seeds at $n = 50$ episodes per arm per seed and swept the MLP’s training-set size by a factor of $100$. The verdict hardens to MODEL BOTTLENECK with a tight, identical confidence interval in every cell.
| Train size | Val MSE | Oracle | Learned | Raw CPG | AC 95% CI | Verdict |
|---|---|---|---|---|---|---|
| $200$ | $0.0651$ | $40/150$ | $0/150$ | $+0.267$ | $[+0.191, +0.335]$ | MODEL BOTTLENECK |
| $2{,}000$ | $0.0233$ | $40/150$ | $0/150$ | $+0.267$ | $[+0.191, +0.335]$ | MODEL BOTTLENECK |
| $20{,}000$ | $0.0004$ | $40/150$ | $0/150$ | $+0.267$ | $[+0.191, +0.335]$ | MODEL BOTTLENECK |
Held-out validation MSE drops by ~150 times across the three cells. Planning success stays at exactly zero. The gap does not close. A prediction-quality metric alone would have declared the largest-data cell solved; CPG points to a data-coverage bottleneck (random rollouts in Acrobot never visit the upright-balancing regime) as the most parsimonious read, with planner-side and score-function residuals as plausible second-order contributors. The recommended remediation is to change the data-collection policy, not to grow the model.
Step 05
No GPU, no heavy ML dependency at runtime. The core install includes the wmel CLI; optional extras add PyTorch and DMC.
```shell
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev]"
```
Then run a single benchmark or sweep the planning horizon via the installed wmel console script:
```shell
# One scorecard, one JSON report
wmel run --env maze_toy --policy tabular-world-model --episodes 30 --output run.json

# Horizon sweep, comma-separated horizons, one combined JSON
wmel sweep --env maze_toy --plan-horizons 5,10,15,20,30 --output sweep.json
```
The DMC Acrobot CPG worked example needs the [control] and [learned] extras:
```shell
pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.cpg
# -> results/dmc_acrobot/cpg.json
```
Every JSON report carries a versioned envelope (schema_version, wmel_version, generated_at).
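A consumer of these reports might check the envelope before trusting the payload. The sketch below assumes the three fields sit at the top level of the JSON object; that placement, the helper name `missing_envelope_keys`, the sample values, and the `results` key are all hypothetical.

```python
import json

REQUIRED_ENVELOPE_KEYS = ("schema_version", "wmel_version", "generated_at")


def missing_envelope_keys(report: dict) -> list:
    """Return the envelope fields that are absent from a parsed report."""
    return [key for key in REQUIRED_ENVELOPE_KEYS if key not in report]


# Hypothetical report text; real field values and payload layout may differ.
example = json.loads(
    '{"schema_version": "1", "wmel_version": "0.0.0", '
    '"generated_at": "2024-01-01T00:00:00Z", "results": {}}'
)
print(missing_envelope_keys(example))                  # -> []
print(missing_envelope_keys({"schema_version": "1"}))  # -> ['wmel_version', 'generated_at']
```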
For the researcher
The metric taxonomy, the four-method evaluation contract, and the CPG definition with its Agresti-Caffo CI and gated verdict.
For the practitioner
A walkthrough of one scorecard, the benchmark cards each environment maps to, and the industrial use-cases the framework is built around.
For the reader
The short paper accompanying the framework, the LaTeX source, the reproducibility script, and the citation entry.
Cross-environment: DMC Cartpole-swingup, two capacities
Four-arm CPG matrix on a second env at TD-MPC2 model_size = 5 and model_size = 1, $n = 30$ pooled each. All four cells at size = 5 reproduce MODEL BOTTLENECK; the CEM×TD-MPC2 cell at size = 1 flips to INCONCLUSIVE (learned $0.533$ vs oracle $0.500$, CPG $-0.033$, CI $[-0.28, +0.21]$), the first moderate-$n$ INCONCLUSIVE in the paper. Paper §5.10 + Figures 3 and 4. GPU queue added at experiments/GPU_ROADMAP.md.
First two paper figures (CPG vs data, coverage histogram)
PGF/TikZ Figure 1 (val MSE plummets $\sim 150\times$ while CPG stays flat at $+0.267$, asymmetric Agresti-Caffo CI) and Figure 2 (uprightness coverage: $0/2000$ random states reach upright vs $20.2\%$ for oracle). Two adversarial-review fixes addressed: sig-fig parity and asymmetric error bars.
Robustness sweep: published model, stronger planner, perturbation
Three new axes test the v0.11 MODEL BOTTLENECK verdict: TD-MPC2 (2M env steps) as dynamics, CEM as planner, DropNextActions(k) as in-episode perturbation. Verdict survives all three; pooled-150 under CEM tightens CI half-width to 0.054. Paper Sections 5.8 + 5.9.
Multi-seed CPG sweep: capacity vs. coverage
Verdict hardens from INCONCLUSIVE (n = 10) to MODEL BOTTLENECK (n = 150 pooled across three seeds); training-set sweep across {200, 2 000, 20 000} transitions leaves verdict and CI identical while validation MSE drops 150×. Paper Sections 5.5 + 5.6 updated.
Short paper: Counterfactual Planning Gap
~7-page LaTeX paper under paper/, 23 BibTeX entries, reproducibility script, three adversarial-review findings addressed before tag.
CPG metric with Agresti-Caffo CI and gated verdict
Five-branch verdict gated on the CI lower bound; honest INCONCLUSIVE at n=10 instead of over-claiming a Wald-CI-driven significance.
DMC Acrobot-swingup wired in
First non-toy environment via wmel.envs.dmc_acrobot, with a Markovian MLP learned dynamics and an oracle dynamics factory. dm-control is an optional extra.
CLI, versioned JSON schema, perturbation-aware sweep
wmel run / wmel sweep console scripts; JSON envelope with schema_version, wmel_version, generated_at. Second CI job locks the no-torch runtime promise.
Proof of contract for learned PyTorch dynamics
PyTorch MLP fits the maze's transition table and plugs in as a drop-in dynamics= callable. Identical success, 76x higher per-call latency -- the trade-off the framework is built to expose.
Pluggable perturbation library
Perturbation, EnvPerturbation, DropNextActions, CompositePerturbation. Runner inner loop switched to deque for O(1) action-queue pops.
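The deque change and a drop-style perturbation can be illustrated together. This is a sketch under assumptions: `DropNextK` is a hypothetical stand-in, and the real DropNextActions class may have a different interface and different semantics.

```python
from collections import deque


class DropNextK:
    """Hypothetical stand-in for a DropNextActions(k)-style perturbation."""

    def __init__(self, k):
        self.k = k

    def apply(self, queue):
        # Discard up to k pending actions from the front of the plan.
        for _ in range(min(self.k, len(queue))):
            queue.popleft()  # O(1); list.pop(0) would shift every element


queue = deque(["up", "up", "right", "right", "down"])
DropNextK(2).apply(queue)
next_action = queue.popleft()
print(next_action, list(queue))  # -> right ['right', 'down']
```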
This is an independent study of evaluation methodology for action-conditioned world models. It is not an official artifact of AMI, Meta, the LeWorldModel project, or any of their authors, and not an artifact of any current or past employer of the author. References to JEPA-style or LeWorldModel concepts are conceptual, not affiliational.