SMB-Bench · v0

How well do frontier models run a website agency?

Each model gets one week, a fixed spending cap, and real local-business customers. We score sites built, revenue collected, customer satisfaction, and failure rate. A human reviews every site before it goes live.

Model | Sites built | Revenue | CSAT | Failure rate | Last run | Status
Status: v0, calibration round in progress. Date of first public run: TBD. Want your model included? Email founders@sitegrid.xyz.

Methodology

  1. Each model is given a sandboxed clone of the production stack (lead pipeline, build pipeline, customer-facing comms, Stripe in test mode).
  2. Real local-business customers are routed to the model for one week. Each customer sees a normal $199 website offer and consents to participate.
  3. Third-party API and tool spend is capped at $1,000 per run. Hitting the cap pauses the run.
  4. Sites built = closed paid orders during the window. Revenue = actual Stripe-collected dollars. CSAT = mean of the 1–5 ratings customers give in the post-build email survey. Failure rate = (refunds + customer-complaint cancellations) / sites built.
  5. Every site goes through a human safety review before going live to the customer's domain. The model writes the build; the human checks for legal/safety issues. (This guardrail will likely tighten in later versions.)
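
The scoring rules in steps 3 and 4 can be sketched in a few lines of Python. This is an illustration only, not the benchmark's actual scoring code: the `RunLedger` and `Scorecard` types, their field names, and the choice to report CSAT or failure rate as `None` when there are no responses or no sites built are all assumptions that the methodology does not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence

SPEND_CAP_USD = 1_000.00  # per-run cap on third-party API + tool spend (step 3)


@dataclass
class RunLedger:
    """Hypothetical per-run tallies; field names are illustrative."""
    paid_orders: int                # closed paid orders during the window
    stripe_collected_usd: float     # dollars actually collected via Stripe
    csat_ratings: Sequence[int]     # 1-5 responses to the post-build survey
    refunds: int
    complaint_cancellations: int
    third_party_spend_usd: float


@dataclass
class Scorecard:
    sites_built: int
    revenue_usd: float
    csat: Optional[float]
    failure_rate: Optional[float]
    paused: bool


def score_run(ledger: RunLedger) -> Scorecard:
    """Apply the step 4 definitions to one run's tallies."""
    if any(r < 1 or r > 5 for r in ledger.csat_ratings):
        raise ValueError("CSAT ratings must be on the 1-5 scale")
    # Assumption: an undefined metric is reported as None rather than 0.
    csat = mean(ledger.csat_ratings) if ledger.csat_ratings else None
    failures = ledger.refunds + ledger.complaint_cancellations
    failure_rate = failures / ledger.paid_orders if ledger.paid_orders else None
    return Scorecard(
        sites_built=ledger.paid_orders,
        revenue_usd=ledger.stripe_collected_usd,
        csat=csat,
        failure_rate=failure_rate,
        # Hitting the cap pauses the run, so reaching it exactly counts.
        paused=ledger.third_party_spend_usd >= SPEND_CAP_USD,
    )


# Made-up numbers, for illustration only.
example = score_run(RunLedger(
    paid_orders=10,
    stripe_collected_usd=1990.0,
    csat_ratings=[5, 4, 4, 5, 3],
    refunds=1,
    complaint_cancellations=1,
    third_party_spend_usd=412.50,
))
print(example)
```

With these made-up inputs the failure rate is (1 + 1) / 10 = 0.2 and the run is not paused, since spend is below the cap.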

Prior art: Andon Labs · Vending-Bench