NASA Harness Bench
A small, transparent benchmark that gives a coding harness + model combination a single 1-shot prompt and a fixed NASA dataset, then displays exactly what it produced.
Methodology
- Every run starts from the identical prompt in bench/PLAN.md and a read-only copy of bench/data/.
- A fresh bench instance is created outside this repo so we can watch whether the harness wanders the filesystem.
- The harness gets one shot: no follow-up clarification.
- Its output is required to build with pnpm build into a static dist/.
- We collect the run verbatim: the complete source tree it wrote (under output/) and its compiled site (output/dist/). We then link the data back to the one canonical copy and grade a handful of fields by hand.
Both the source and the compiled output are kept, so you can read what the harness actually produced, not just the rendered result.
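The first two setup steps above (identical prompt, read-only data, instance outside the repo) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the benchmark's actual pipeline: the helper names prepareBenchInstance and copyReadOnly, the use of the OS temp directory, and the choice to strip write bits with chmod are all hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Recursively copy `src` into `dest` and strip write permission from every
// copied file. Directories stay writable so the instance is easy to delete.
function copyReadOnly(src: string, dest: string): void {
  fs.mkdirSync(dest, { recursive: true });
  for (const entry of fs.readdirSync(src, { withFileTypes: true })) {
    const from = path.join(src, entry.name);
    const to = path.join(dest, entry.name);
    if (entry.isDirectory()) {
      copyReadOnly(from, to);
    } else {
      fs.copyFileSync(from, to);
      fs.chmodSync(to, 0o444); // read-only for owner, group, and others
    }
  }
}

// Hypothetical helper: build a fresh bench instance in the OS temp directory,
// i.e. outside the repo, containing the prompt and a read-only data copy.
function prepareBenchInstance(repoRoot: string): string {
  const instance = fs.mkdtempSync(path.join(os.tmpdir(), "bench-instance-"));
  fs.copyFileSync(
    path.join(repoRoot, "bench", "PLAN.md"),
    path.join(instance, "PLAN.md"),
  );
  copyReadOnly(
    path.join(repoRoot, "bench", "data"),
    path.join(instance, "data"),
  );
  return instance;
}
```

Calling prepareBenchInstance with the repo root returns the path of the new instance directory, which is where a harness would then be started. Permission bits are a weak guard (a harness could change them back), so this only makes accidental writes to the data visible; it does not sandbox anything.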
What's measured
Each result records (some automatic, some hand-graded):
harness · interface · model · effort · createdAt
broken · cheated · buildSucceeded
tokenUsage · estimatedCostUsd · timeTakenSeconds
tags · summary · free-form notes · raw run log
How to read a result
Pick a run on the left. The header shows its metadata; below it, the
harness's compiled site is embedded live in an <iframe>.
The links open the raw notes, run log, metadata, and (once the repo is
public) the full source the harness wrote, on GitHub.
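For reference, the fields listed under What's measured could be modeled as a typed record like the one below. The field names come from that list; every type, the runLog and notes property names, and the headerLine helper are assumptions made for illustration, not the repository's actual schema.

```typescript
// Sketch of one result record. Names follow the list under "What's measured";
// all types are assumed, since the document does not specify them.
interface RunResult {
  harness: string;
  interface: string;
  model: string;
  effort: string;
  createdAt: string; // assumed to be an ISO 8601 timestamp
  broken: boolean;
  cheated: boolean;
  buildSucceeded: boolean;
  tokenUsage: number; // assumed to be a single total token count
  estimatedCostUsd: number;
  timeTakenSeconds: number;
  tags: string[];
  summary: string;
  notes: string; // free-form notes
  runLog: string; // raw run log
}

// Hypothetical helper: format the identifying metadata for a result header.
function headerLine(r: RunResult): string {
  return [r.harness, r.interface, r.model, r.effort].join(" · ");
}
```

A record typed this way lets the compiler flag a result that is missing a field before it reaches the site.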
⚠ Placeholder phase — the concrete NASA task and dataset are not yet designed. The pipeline and site are scaffolded so the shape can be reviewed first.