
03 - Benchmark Cards

Each card maps a well-known research task to the applied question it represents, so that a reader skimming a scorecard can tell what the numbers mean for an industrial application.

This repository ships the Two-Room and Maze environments in code. The Push-T, Reacher, and OGBench Cube cards are included as targets for v0.3+; their environments are not shipped yet.


Push-T


Reacher


Two-Room

This is the environment shipped in examples/two_room_toy/.


Maze

This is the environment shipped in examples/maze_toy/. It is the smallest setup where the full LeWMAdapterStub contract is exercised end-to-end via the TabularWorldModelPlanner subclass.


OGBench Cube


How to add a card

Open a pull request that:

  1. States the task type in one sentence.
  2. Names the applied interpretation in plain English.
  3. Lists at least two industries where the question matters.
  4. Proposes a hypothesis the benchmark is testing.
  5. Lists candidate metrics from 02_metric_taxonomy.md.
  6. Ends with a single, falsifiable applied question.
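
The six items above can be sketched as a small record with a pre-submission check. This is only an illustration, not part of the repository: the `BenchmarkCard` class, its field names, the `problems` method, and the example values (including the placeholder metric names) are all hypothetical. Real metric names should come from 02_metric_taxonomy.md, and whether a question is falsifiable still needs human review.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BenchmarkCard:
    """Hypothetical container for the six items a card pull request provides."""

    name: str
    task_type: str                       # item 1: one sentence
    applied_interpretation: str          # item 2: plain English
    industries: tuple[str, ...]          # item 3: at least two
    hypothesis: str                      # item 4
    candidate_metrics: tuple[str, ...]   # item 5: names from the metric taxonomy
    applied_question: str                # item 6: a single question

    def problems(self) -> list[str]:
        """Return the checklist items this card does not yet satisfy."""
        issues: list[str] = []
        for label in ("task_type", "applied_interpretation", "hypothesis"):
            if not getattr(self, label).strip():
                issues.append(f"{label} is empty")
        if len(self.industries) < 2:
            issues.append("list at least two industries")
        if not self.candidate_metrics:
            issues.append("list at least one candidate metric")
        question = self.applied_question.strip()
        if not question.endswith("?") or question.count("?") != 1:
            issues.append("end with exactly one applied question")
        return issues


# Made-up card used only to exercise the check; not a real benchmark.
example = BenchmarkCard(
    name="Example-Task",
    task_type="Goal-conditioned navigation on a small grid.",
    applied_interpretation="Whether a planner can find a route it has not seen before.",
    industries=("warehouse logistics", "facility inspection"),
    hypothesis="A learned world model generalizes to unseen layouts.",
    candidate_metrics=("placeholder_metric_a", "placeholder_metric_b"),
    applied_question="Does the planner reach the goal in at least 9 of 10 unseen layouts?",
)

# Same card with only one industry, to show a failing check.
incomplete = replace(example, industries=("warehouse logistics",))

print(example.problems())
print(incomplete.problems())
```

An empty list from `problems()` means the card covers all six items structurally; it says nothing about the quality of the hypothesis or the question.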