05 - 30-Day Prototype Plan
A focused four-week plan that takes the repository from “credible scaffolding” to “shareable applied narrative”.
The plan assumes one contributor, part-time. Every week ends with a runnable artifact, not just a doc.
Week 1 - Repository and toy benchmark
Goal: a public-facing scaffold a stranger can clone, install, and run on a laptop.
- Land the repository structure, README, AGENTS.md, MIT license, and 5 doc pages.
- Implement the BenchmarkEnvironment and PlannerPolicy interfaces.
- Implement the Two-Room toy environment with optional perturbation.
- Implement random and greedy baselines.
- Ship examples/two_room_toy/run_baseline.py and a sample JSON report.
- Tests cover metrics on synthetic results and the toy environment.
Exit criterion: pytest is green and python -m examples.two_room_toy.run_baseline prints a scorecard from a clean checkout.
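As a rough illustration of the Week 1 deliverables, the sketch below shows one way the two interfaces, a random baseline, and a greedy baseline could fit together. Only the class names BenchmarkEnvironment and PlannerPolicy come from this plan; every method name and signature, the CorridorEnv stand-in (a one-dimensional corridor rather than the Two-Room environment), and the run_episode helper are assumptions made for illustration, not the repository's actual API.

```python
import random
from abc import ABC, abstractmethod
from typing import List, Tuple


class BenchmarkEnvironment(ABC):
    """Assumed environment contract; the real signatures may differ."""

    @abstractmethod
    def reset(self, seed: int) -> int:
        """Start a new episode and return the first observation."""

    @abstractmethod
    def step(self, action: int) -> Tuple[int, bool]:
        """Apply one action and return (observation, goal_reached)."""


class PlannerPolicy(ABC):
    """Assumed policy contract: map an observation to a list of actions."""

    @abstractmethod
    def plan(self, observation: int) -> List[int]:
        """Return one or more actions to execute next."""


class CorridorEnv(BenchmarkEnvironment):
    """Hypothetical stand-in: walk right along a 1-D corridor to the goal."""

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.pos = 0

    def reset(self, seed: int) -> int:
        self.pos = 0  # deterministic start; the seed is unused in this toy
        return self.pos

    def step(self, action: int) -> Tuple[int, bool]:
        # Actions are -1 (left) or +1 (right); positions clamp to the corridor.
        self.pos = max(0, min(self.length, self.pos + action))
        return self.pos, self.pos == self.length


class RandomPolicy(PlannerPolicy):
    """Baseline that picks a direction uniformly at random."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def plan(self, observation: int) -> List[int]:
        return [self.rng.choice([-1, 1])]


class GreedyPolicy(PlannerPolicy):
    """Baseline that always steps toward the goal."""

    def plan(self, observation: int) -> List[int]:
        return [1]


def run_episode(env: BenchmarkEnvironment, policy: PlannerPolicy,
                seed: int, max_steps: int = 20) -> bool:
    """Run one episode, replanning every step; return True on success."""
    obs = env.reset(seed)
    for _ in range(max_steps):
        action = policy.plan(obs)[0]  # execute only the first planned action
        obs, done = env.step(action)
        if done:
            return True
    return False
```

Calling run_episode over a fixed list of seeds and averaging the boolean results gives the kind of success rate a scorecard would report.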
Week 2 - Adapter interface and scorecards
Goal: prove the evaluation layer is model-agnostic.
- Tighten PlannerPolicy and BenchmarkEnvironment to be the only contract a model needs to implement.
- Add the LeWMAdapterStub as a documented contract example (does not import or reimplement any specific model).
- Add a second toy environment that exercises planning horizon (a small maze).
- Introduce the Scorecard dataclass and a human-readable text formatter.
- Add the to_json_report reporter and snapshot one example for the docs.
Exit criterion: a developer can implement a new PlannerPolicy in under 50 lines and produce a scorecard without touching the rest of the codebase.
Status: shipped in v0.2 - see examples/maze_toy/ and src/wmel/adapters/tabular_world_model.py. BenchmarkEnvironment now exposes action_space. TabularWorldModelPlanner is a concrete LeWMAdapterStub subclass that fills in encode, rollout, score, and plan end-to-end with no third-party dependency.
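To make the encode, rollout, score, and plan split concrete, here is a minimal random-shooting planner over a known grid transition function. It is a sketch under stated assumptions and does not reproduce TabularWorldModelPlanner or subclass LeWMAdapterStub: the class name RandomShootingPlanner, the grid_transition helper, the constructor arguments, and the Manhattan-distance score are all hypothetical choices.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, int]
Action = Tuple[int, int]

ACTIONS: List[Action] = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def grid_transition(state: State, action: Action, size: int = 4) -> State:
    """Hypothetical world model: move on an open size x size grid, clamped."""
    x = min(size - 1, max(0, state[0] + action[0]))
    y = min(size - 1, max(0, state[1] + action[1]))
    return (x, y)


class RandomShootingPlanner:
    """Illustrative adapter with the four steps named in the status note."""

    def __init__(self, transition: Callable[[State, Action], State],
                 goal: State, actions: Sequence[Action],
                 num_candidates: int = 16, plan_horizon: int = 4,
                 seed: int = 0) -> None:
        self.transition = transition
        self.goal = goal
        self.actions = list(actions)
        self.num_candidates = num_candidates
        self.plan_horizon = plan_horizon
        self.rng = random.Random(seed)

    def encode(self, observation: Sequence[int]) -> State:
        # The observation is already a grid cell, so encoding is a cast.
        return (int(observation[0]), int(observation[1]))

    def rollout(self, state: State, actions: Sequence[Action]) -> State:
        # Apply the world model once per action and return the final state.
        for action in actions:
            state = self.transition(state, action)
        return state

    def score(self, state: State) -> float:
        # Higher is better: negative Manhattan distance to the goal.
        return -float(abs(state[0] - self.goal[0]) + abs(state[1] - self.goal[1]))

    def plan(self, observation: Sequence[int]) -> List[Action]:
        # Sample candidate action sequences and keep the best-scoring one.
        start = self.encode(observation)
        candidates = [
            [self.rng.choice(self.actions) for _ in range(self.plan_horizon)]
            for _ in range(self.num_candidates)
        ]
        return max(candidates, key=lambda seq: self.score(self.rollout(start, seq)))
```

Each plan call here performs num_candidates rollouts of plan_horizon steps, which is why a planner of this shape has a natural compute cost per call.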
Week 3 - Baseline comparison and reporting
Goal: make the comparison story sharp.
- Add Compute per Decision and Planning Horizon to the scorecard.
- Add deterministic seeds and basic confidence-interval reporting on success metrics.
- Add a perturbation library: at minimum displacement, blocked-cell, and delayed-action perturbations.
- Add a Markdown report exporter so a scorecard can land in a doc page directly.
- Author one benchmark card in code, not just in prose - the toy maze with full scorecard output.
Exit criterion: a single command produces a Markdown report comparing two policies on two environments with confidence intervals.
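For reference, the two standard interval formulas for a success rate can be sketched as follows. The function names and the default z = 1.96 (roughly a 95% interval) are assumptions for illustration; this is the textbook normal-approximation interval and Wilson score interval, not a copy of the code in wmel.experiments.

```python
import math
from typing import Tuple


def normal_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

The difference matters at the extremes: for 10 successes in 10 episodes the normal interval collapses to a single point at 1.0, while the Wilson interval still reports a lower bound of about 0.72, which is a more honest statement of uncertainty for a small run.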
Status (v0.4):
- Planning-horizon sweep with Wilson and normal confidence intervals - see wmel.experiments.horizon_sweep, examples/maze_toy/run_horizon_sweep.py, and the worked example in docs/02_metric_taxonomy.md.
- Markdown report exporters: wmel.report.to_markdown_scorecard, to_markdown_report, and wmel.experiments.to_markdown_horizon_sweep. The output is paste-ready in a PR body or doc page.
- Compute-per-decision wired: PlannerPolicy.compute_per_plan_call is a class attribute that subclasses set. TabularWorldModelPlanner declares it as num_candidates * plan_horizon, and compute_scorecard derives an average over the run. The maze baseline reports ~256 rollout-units per decision for the world-model planner versus n/a for random / greedy.
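The compute-per-decision bookkeeping described above can be sketched roughly as follows. This is an illustration, not the repository's code: the averaging rule (total rollout-units divided by the number of executed decisions), the helper name average_compute_per_decision, and the example values 64 and 4 are assumptions. The plan text gives only the product, about 256, not the individual settings.

```python
from typing import Optional


class PlannerPolicy:
    # None means the policy does no model rollouts; a report would show "n/a".
    compute_per_plan_call: Optional[int] = None


class RandomPolicy(PlannerPolicy):
    """Baseline with no world model, so the attribute stays None."""


class ShootingPlanner(PlannerPolicy):
    """Hypothetical planner whose cost is candidates times horizon."""

    def __init__(self, num_candidates: int, plan_horizon: int) -> None:
        # One rollout-unit per simulated step of each candidate sequence.
        self.compute_per_plan_call = num_candidates * plan_horizon


def average_compute_per_decision(policy: PlannerPolicy, plan_calls: int,
                                 decisions: int) -> Optional[float]:
    """Assumed rule: total rollout-units spread over executed decisions."""
    if policy.compute_per_plan_call is None or decisions == 0:
        return None
    return policy.compute_per_plan_call * plan_calls / decisions
```

Under this assumed rule, a planner that replans on every step has plan_calls equal to decisions, so the average equals compute_per_plan_call. A planner that executes several actions per plan call would report a proportionally lower figure.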
What remains for v0.5: a pluggable perturbation library (displacement, blocked-cell, delayed-action) and a small CLI front-end.
Status (v0.5): perturbation library shipped. wmel.perturbations defines a Perturbation ABC with two override hooks (apply_to_env, transform_actions) and three concrete subclasses: EnvPerturbation (delegates to env.perturb(), the runner’s default), DropNextActions(k) (action-level drop), and CompositePerturbation(*parts) (composable). The runner’s inner loop was refactored to a deque-based action queue so action-level perturbations are O(1). Scorecard.perturbation_name records the strategy. The CLI front-end remains for a later release.
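The shape of that perturbation hierarchy can be sketched as below. The class and hook names follow the status note, but the signatures, the name attribute, and the choice to mutate a deque in place are assumptions, so treat this as a reading aid rather than the wmel.perturbations source.

```python
from abc import ABC
from collections import deque
from typing import Any, Deque


class Perturbation(ABC):
    """Base class with two override hooks; both default to no-ops."""

    name = "none"

    def apply_to_env(self, env: Any) -> None:
        """Hook for changing the environment state."""

    def transform_actions(self, queue: Deque[Any]) -> None:
        """Hook for editing the pending action queue in place."""


class EnvPerturbation(Perturbation):
    """Delegates to the environment's own perturb() method."""

    name = "env"

    def apply_to_env(self, env: Any) -> None:
        env.perturb()


class DropNextActions(Perturbation):
    """Discards the next k queued actions."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.name = f"drop_next_{k}"

    def transform_actions(self, queue: Deque[Any]) -> None:
        # popleft() on a deque is O(1), unlike list.pop(0).
        for _ in range(min(self.k, len(queue))):
            queue.popleft()


class CompositePerturbation(Perturbation):
    """Applies each part in order for both hooks."""

    def __init__(self, *parts: Perturbation) -> None:
        self.parts = parts
        self.name = "+".join(part.name for part in parts)

    def apply_to_env(self, env: Any) -> None:
        for part in self.parts:
            part.apply_to_env(env)

    def transform_actions(self, queue: Deque[Any]) -> None:
        for part in self.parts:
            part.transform_actions(queue)
```

Keeping both hooks as no-ops on the base class lets a subclass override only the one it needs, which is what makes an environment-level and an action-level perturbation composable without either knowing about the other.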
Week 4 - Public demo and applied narrative
Goal: make the artifact persuasive to a non-researcher.
- Write a short blog-style page (docs/06_demo.md) walking through one scorecard and what it implies for an applied decision.
- Record a 90-second screen capture of the toy benchmark running and the report being read. (Optional, do not block on it.)
- Tighten the README into a 60-second pitch.
- Add a CONTRIBUTING.md describing how to add a benchmark card, a metric, and an adapter.
- Tag v0.1.0 and write release notes that explicitly state the non-affiliation disclaimer.
Exit criterion: a non-researcher can read the README, run the demo, and articulate the thesis without help.
Status: docs/06_demo.md is shipped (a row-by-row product walkthrough of the maze horizon sweep). CONTRIBUTING.md is shipped. Tagged releases are at v0.3.1 and v0.4.0 with explicit non-affiliation disclaimers. CI runs the suite plus a smoke test of the three example scripts on Python 3.11/3.12/3.13. Screen capture remains optional and is not done.
What is explicitly out of scope for the first 30 days
- Training any model.
- Downloading any dataset or checkpoint.
- Adding any GPU dependency.
- Implementing Push-T, Reacher, or OGBench Cube fully - they remain benchmark cards until v0.2.
- Hosting a public scoreboard.
These are deliberate omissions. The whole point of this study is that the evaluation layer is what is missing, not yet another model.
Recipe for executing this plan with an LLM coding agent
The full recipe (setup, per-week loop, pre-tag adversarial review pattern, anti-patterns) is now its own page so it does not crowd the technical study plan: see process/llm_agent_recipe.html. The track record on this repo: zero releases shipped with metric-correctness bugs after the review pattern was adopted; 4 review passes caught 1 critical, 8 majors, and ~15 minors before they reached a tag.