NASA Harness Bench

A small, transparent benchmark that gives a coding harness + model combination a single 1-shot prompt and a fixed NASA dataset, then displays exactly what it produced.


Methodology

  1. Every run starts from the identical prompt in bench/PLAN.md and a read-only copy of bench/data/.
  2. A fresh bench instance is created outside this repo so we can watch whether the harness wanders the filesystem.
  3. The harness gets one shot, with no follow-up clarification.
  4. Its output must build with pnpm build into a static dist/.
  5. We collect the run verbatim: the complete source tree it wrote (under output/) and its compiled site (output/dist/). We then link the data back to the one canonical copy and grade a handful of fields by hand.

Both the source and the compiled output are kept, so you can read what the harness actually produced, not just the rendered result.
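As a rough illustration of step 2, the sketch below shows one way to flag paths that fall outside the bench instance. It assumes a list of touched paths is already available (for example, extracted from the run log); how that list is gathered is not specified here. The names isInsideBench and findWandering, and the example directory /tmp/bench-1, are hypothetical and not part of the bench itself.

```typescript
import * as path from "node:path";

// Hypothetical helper: true when `touched` resolves to a location inside
// `benchRoot`. Relative paths are interpreted relative to `benchRoot`.
// Note: this is a purely lexical check; it does not follow symlinks.
export function isInsideBench(benchRoot: string, touched: string): boolean {
  const root = path.resolve(benchRoot);
  const target = path.resolve(root, touched);
  const rel = path.relative(root, target);
  if (rel === "") return true; // the bench root itself
  const escapes = rel === ".." || rel.startsWith(".." + path.sep);
  return !escapes && !path.isAbsolute(rel);
}

// Hypothetical helper: returns the touched paths that left the bench instance.
export function findWandering(benchRoot: string, touchedPaths: string[]): string[] {
  return touchedPaths.filter((p) => !isInsideBench(benchRoot, p));
}
```

Calling findWandering("/tmp/bench-1", paths) returns only the entries that escape the instance directory. An empty result means the harness stayed put, at least as far as the recorded paths show.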

What's measured

Each result records the following fields; some are filled in automatically and others are graded by hand:

harness · interface · model · effort · createdAt
broken · cheated · buildSucceeded
tokenUsage · estimatedCostUsd · timeTakenSeconds
tags · summary · free-form notes · raw run log
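To make the shape of a result concrete, here is a minimal sketch of how the fields above might be typed. The field names come from the list; every type, the notes and runLog property names, and the isCleanRun helper are assumptions for illustration and may not match the actual metadata format. The example values are made up.

```typescript
// Assumed shape of one result record. Field names follow the list above;
// all types are guesses, not the bench's real schema.
export interface BenchResult {
  harness: string;
  interface: string;        // e.g. CLI or IDE (assumed to be free text)
  model: string;
  effort: string;
  createdAt: string;        // assumed ISO 8601 timestamp
  broken: boolean;
  cheated: boolean;
  buildSucceeded: boolean;
  tokenUsage: number;       // assumed to be a single total
  estimatedCostUsd: number;
  timeTakenSeconds: number;
  tags: string[];
  summary: string;
  notes: string;            // free-form notes (property name assumed)
  runLog: string;           // raw run log (property name assumed)
}

// Hypothetical convenience check: the build passed and the run was
// graded as neither broken nor cheating.
export function isCleanRun(r: BenchResult): boolean {
  return r.buildSucceeded && !r.broken && !r.cheated;
}

// Made-up placeholder record, not a real benchmark result.
export const example: BenchResult = {
  harness: "example-harness",
  interface: "cli",
  model: "example-model",
  effort: "medium",
  createdAt: "2000-01-01T00:00:00Z",
  broken: false,
  cheated: false,
  buildSucceeded: true,
  tokenUsage: 0,
  estimatedCostUsd: 0,
  timeTakenSeconds: 0,
  tags: [],
  summary: "",
  notes: "",
  runLog: "",
};
```

Keeping broken, cheated, and buildSucceeded as separate booleans, as the field list does, lets a run be recorded as building successfully while still being graded as broken or cheating.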

How to read a result

Pick a run on the left. The header shows its metadata; below it, the harness's compiled site is embedded live in an <iframe>. The links open the raw notes, run log, metadata, and (once the repo is public) the full source the harness wrote, on GitHub.

⚠ Placeholder phase — the concrete NASA task and dataset are not yet designed. The pipeline and site are scaffolded so the shape can be reviewed first.
