03 - Benchmark Cards

Each card maps a known research task to the applied question it actually represents. The intent is to make it possible to skim a scorecard and understand what the numbers mean for an industrial application.

This repository ships the Two-Room and Maze environments in code. Push-T, Reacher, and OGBench Cube cards are included as targets for v0.3+.

Push-T

Task type: 2D rigid-body manipulation; push a T-shaped block to a target pose with a circular pusher.
Applied interpretation: closed-loop, contact-rich, low-DoF manipulation under partial observability.
Relevant industries: light assembly, packaging lines, kitting robots, lab automation.
World model value hypothesis: a learned dynamics model can short-cut the cost of physical simulation and enable plan-then-act under tight latency budgets.
Candidate metrics: Action Success Rate, Planning Latency, Compute per Decision, Perturbation Recovery (block nudged mid-rollout).
Applied question: Can a learned world model push a part into spec faster and more reliably than a hand-tuned controller on a 50 ms decision loop?

Reacher

Task type: 2-link arm reaching a target position in 2D.
Applied interpretation: low-DoF kinematic control with a goal in workspace coordinates.
Relevant industries: cobots, lab manipulation, simple pick-and-place, prosthetics research.
World model value hypothesis: a latent dynamics model should generalize across target positions without retraining a controller per goal.
Candidate metrics: Action Success Rate, Average Steps to Success, Planning Horizon, Sample Efficiency.
Applied question: How many demonstrations are needed before a world-model-based planner matches an analytical inverse-kinematics controller on success rate?

Two-Room

Task type: discrete 2D grid navigation; two rooms separated by a wall with a single doorway.
Applied interpretation: minimal example of partially observable planning with a topological bottleneck.
Relevant industries: warehouse routing, indoor robot navigation, building automation, evacuation planning.
World model value hypothesis: a model that learns the doorway as a latent structure should plan through it without explicit graph search.
Candidate metrics: Action Success Rate, Average Steps to Success, Perturbation Recovery, Planning Latency.
Applied question: Can a learned model discover and exploit topological structure (the doorway) without being told it exists?

This is the environment shipped in examples/two_room_toy/.

Maze

Task type: discrete 2D grid navigation through a small maze with walls and dead-ends.
Applied interpretation: minimal example where a non-trivial planner is required - naive greedy fails and only a model that simulates candidate futures succeeds.
Relevant industries: warehouse routing under partial maps, indoor robotics, building automation, last-mile delivery.
World model value hypothesis: an action-conditioned predictor combined with random-shooting MPC can solve tasks that defeat reactive heuristics, at the cost of higher per-decision latency.
Candidate metrics: Action Success Rate, Average Steps to Success, Planning Latency, Compute per Decision, Perturbation Recovery.
Applied question: At what planning latency does a world-model-based planner stop being competitive with a reactive heuristic on routing tasks with topological bottlenecks?

This is the environment shipped in examples/maze_toy/. It is the smallest setup where the full LeWMAdapterStub contract is exercised end-to-end via the TabularWorldModelPlanner subclass.

OGBench Cube

Task type: multi-stage block-stacking from the OGBench suite; pick, transport, place cubes to form a target configuration.
Applied interpretation: long-horizon manipulation with composable subgoals.
Relevant industries: assembly automation, logistics palletizing, kitting, construction robotics.
World model value hypothesis: hierarchical world models with subgoal latents should outperform flat planners on tasks that need multi-step reasoning.
Candidate metrics: Action Success Rate, Planning Horizon, Sample Efficiency, Latent Interpretability (do subgoal latents emerge?).
Applied question: Does a world model trained on diverse manipulation transfer to a new stacking goal without retraining, and at what success rate?

How to add a card

Open a pull request that:

States the task type in one sentence.
Names the applied interpretation in plain English.
Lists at least two industries where the question matters.
Proposes a hypothesis the benchmark is testing.
Lists candidate metrics from 02_metric_taxonomy.md.
Ends with a single, falsifiable applied question.