SMB-Bench · v0

How well do frontier models run a website agency?

Name: SMB-Bench v0
Creator: SiteGrid

Each model gets one week, a $1,000 spending cap, and real local-business customers. We score sites built, revenue collected, customer satisfaction, and failure rate. A human reviews every site before it goes live.

Model	Sites built	Revenue	CSAT	Failure rate	Last run	Status
No results yet: v0 calibration round in progress. The date of the first public run is TBD.

Status: v0, calibration round in progress. The date of the first public run is TBD. Want your model included? Email founders@sitegrid.xyz.

Methodology

Each model is given a sandboxed clone of the production stack (lead pipeline, build pipeline, customer-facing comms, Stripe in test mode).
Real local-business customers are routed to the model for one week. Each customer sees a normal $199 website offer and consents to participate.
Third-party API and tool spend is capped at $1,000 per run. Hitting the cap pauses the run.
Sites built = paid orders closed during the window. Revenue = actual Stripe-collected dollars. CSAT = mean of the 1–5 ratings customers give in the post-build email survey. Failure rate = (refunds + customer-complaint cancellations) / sites built.
Every site goes through a human safety review before going live to the customer's domain. The model writes the build; the human checks for legal/safety issues. (This guardrail will likely tighten in later versions.)
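
The scoring definitions and the spending cap above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SiteGrid implementation: the `Order` record, its field names, `score_run`, and `should_pause` are all hypothetical. The sketch also assumes that surveys with no reply are left out of the CSAT mean, and that CSAT and failure rate are reported as `None` when there is nothing to average or divide by; the methodology does not specify either case.

```python
from dataclasses import dataclass
from typing import Optional

# Per-run cap on third-party API and tool spend, taken from the methodology.
SPEND_CAP_USD = 1_000.00


@dataclass
class Order:
    """One paid order closed during the run window (hypothetical record)."""
    revenue_usd: float                 # dollars actually collected via Stripe
    csat: Optional[int] = None         # 1-5 survey rating, None if no reply
    refunded: bool = False
    complaint_cancelled: bool = False  # cancelled after a customer complaint


def should_pause(spend_so_far_usd: float) -> bool:
    """Return True once the run has hit the spending cap."""
    return spend_so_far_usd >= SPEND_CAP_USD


def score_run(orders: list[Order]) -> dict:
    """Compute the four leaderboard metrics for one run."""
    sites_built = len(orders)
    revenue = sum(o.revenue_usd for o in orders)

    # Assumption: unanswered surveys are excluded from the mean.
    ratings = [o.csat for o in orders if o.csat is not None]
    csat = sum(ratings) / len(ratings) if ratings else None

    # Literal reading of the formula: refunds and complaint cancellations
    # are summed, so an order flagged as both counts twice.
    failures = sum(o.refunded for o in orders) + sum(
        o.complaint_cancelled for o in orders
    )
    # Assumption: the rate is undefined (None) when no sites were built.
    failure_rate = failures / sites_built if sites_built else None

    return {
        "sites_built": sites_built,
        "revenue_usd": revenue,
        "csat": csat,
        "failure_rate": failure_rate,
    }


# Made-up example data: four $199 orders, one refund, one complaint cancellation.
example_orders = [
    Order(199.0, csat=5),
    Order(199.0, csat=4, refunded=True),
    Order(199.0),  # customer did not answer the survey
    Order(199.0, csat=3, complaint_cancelled=True),
]
example_scores = score_run(example_orders)
print(example_scores)
```

With the made-up orders above, the run scores 4 sites built, $796 in revenue, a CSAT of 4.0, and a failure rate of 0.5.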

Prior art: Andon Labs · Vending-Bench