
Context: I'm a solo founder (Rapid Claw), my brother Brandon handles most of the infra, and we run about 5 agents in production on any given day. Small crew, small blast radius, and honestly that's the only reason we can get away with what I'm about to describe.
Last week there was a Hacker News post (and a real paper) showing researchers getting near-perfect scores on prominent AI agent benchmarks without solving a single task. That hit a nerve. We'd been quietly drifting away from benchmarks for months and this gave us the excuse to finally write down why.
Quick honesty check on numbers before I go further. We are at low-4-figure MRR, five agents live, and fewer than two dozen paying customers. I am not about to tell you what works at scale. I'm telling you what works at our scale, this month.
Here's the arc.
Phase 1: benchmarks made us feel smart
When we started, we cared a lot about how our default agent templates scored on public benchmarks. Pass@1 on SWE-Bench Lite, tool-use accuracy, browser nav success, that whole menu. It felt rigorous. We'd swap a model, rerun a suite, and if the number went up, we'd ship it.
Problem: our customers never once complained about benchmark deltas. They complained about things like "the agent burned through my budget on a loop," "the agent silently stopped picking up jobs," and "the agent said it finished but my queue still had the task." None of those shows up in a benchmark score.
Phase 2: we replaced the benchmark suite with four boring production numbers
These are the only four we look at now, per agent, per day:

- Loop rate: share of runs where the agent repeated itself until the budget cap tripped
- Silent stall rate: share of runs where the agent stopped picking up jobs with no error
- False completion rate: share of runs where the agent claimed done but the task was still in the queue
- Time to first useful output: how long a run took to produce something a customer could actually use

That's it. Four numbers. Per agent. Every day.
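For flavor, here's roughly what the daily rollup looks like. The field names and record shape are illustrative, not our actual schema; the point is that each number is a dumb aggregate over per-run logs.

```python
from collections import defaultdict

# Each run record is assumed to look like:
# {"agent": "research", "looped": False, "stalled": False,
#  "claimed_done": True, "actually_done": True,
#  "secs_to_first_output": 42.0}

def daily_rollup(runs):
    """Aggregate per-agent daily numbers from a list of run records."""
    by_agent = defaultdict(list)
    for r in runs:
        by_agent[r["agent"]].append(r)

    rollup = {}
    for agent, rs in by_agent.items():
        n = len(rs)
        rollup[agent] = {
            # fraction of runs that tripped the loop/budget cap
            "loop_rate": sum(r["looped"] for r in rs) / n,
            # fraction of runs that went quiet with no error
            "stall_rate": sum(r["stalled"] for r in rs) / n,
            # claimed done, but the task was still in the queue
            "false_done_rate": sum(
                r["claimed_done"] and not r["actually_done"] for r in rs) / n,
            # upper median of time to first useful output
            "median_secs_to_first_output": sorted(
                r["secs_to_first_output"] for r in rs)[n // 2],
        }
    return rollup
```

Nothing clever, and that's the point: if a customer could phrase the complaint, it should fall out of a one-line aggregate.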
Phase 3: traces are the thing
The numbers point at the agent. The traces tell you why. We log every tool call, every model call, every retry, with inputs, outputs, and cost, pinned to a run ID. When a number moves, we don't guess; we open the worst trace of the day and read it end to end.
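A minimal sketch of what one trace event carries (the structure here is illustrative, not our real schema): every call becomes one record, and every record shares the run ID, so a whole run reads as one story.

```python
import time
import uuid

def make_tracer(run_id, sink):
    """Return a function that appends one trace event per tool/model call."""
    def trace(kind, name, inputs, outputs, cost_usd, retry=0):
        sink.append({
            "run_id": run_id,       # pins every event to one run
            "ts": time.time(),
            "kind": kind,           # "tool_call" or "model_call"
            "name": name,
            "inputs": inputs,
            "outputs": outputs,
            "cost_usd": cost_usd,
            "retry": retry,         # which attempt this was
        })
    return trace

# Usage: one event per call, so the worst run of the day
# can be read end to end, cost included.
events = []
trace = make_tracer(run_id=str(uuid.uuid4()), sink=events)
trace("tool_call", "search_queue", {"q": "pending"}, {"n": 3}, 0.0)
trace("model_call", "plan_step", {"prompt_tokens": 512}, {"text": "..."}, 0.004)
```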
I wrote up our stack for this over here: AI agent observability. It's the boring load-bearing part of running agents unattended. If I could go back, I would have built this before I built the second agent.
What actually moved since we switched
The honest caveats
If you're running agents in production and you're still staring at benchmark scores to decide what to ship, I'd gently suggest switching to whatever four numbers your customers would actually pay to improve. Different for everyone. Mine are above.
Curious what broke first for folks here and what signal replaced it. If you're weighing hosting choices for this kind of setup, our take is at managed AI agents.
Tijo
Moving from leaderboard scores to "Time-to-first-useful-output" is the ultimate transition from "AI researcher" to "SaaS founder," Tijo. Benchmarks are fun for Twitter, but loop rates and silent stalls are what actually determine whether a customer churns or stays. You've essentially traded "vibe metrics" for an AutoOps heartbeat.
I'm currently running a project in Tokyo (Tokyo Lore) that highlights builders who prioritize production reliability over flashy leaderboards. Since you're running a five-agent shop with a focused observability stack, entering Rapid Claw into it could be a good way to turn your "boring production numbers" into a case study.
This is a really solid breakdown — especially the shift from benchmarks to real production signals.
That “looks good on paper vs actually works” gap is way bigger than most people think.
I’ve been seeing something similar but more on the trust/safety side — where systems appear accurate until real-world inputs hit (scam links, phishing messages, etc.). That’s where things start slipping through.
Curious — have you found any good way to simulate those messy real-world edge cases, or is it all coming from live data now?
The silent stall rate is the one that bites you hardest the first time — you're right that it used to eat entire nights before heartbeats got wired up.
One signal I'd add to your four: cross-agent context contamination. When two agents share a common state store — same Supabase row, same Redis key — Agent B can partially overwrite Agent A's working context mid-run. Doesn't trigger a loop, doesn't stall, doesn't crash. Outputs look plausible, pass a shallow acceptance check. You catch it weeks later when the research agent's results start drifting on exactly the days the cleanup agent runs.
The fix that worked: a context fingerprint written at job start, checked at job close. If the fingerprint changed and Agent A didn't change it, that's a contamination event. Cheap to add to the trace, genuinely invisible without it.
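The fingerprint idea sketched in code (the hashing choice and function signatures are my guesses at an implementation, not necessarily how anyone else wired it):

```python
import hashlib
import json

def context_fingerprint(context: dict) -> str:
    """Stable hash of an agent's working context, taken at job start."""
    blob = json.dumps(context, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def check_contamination(start_fp: str, end_context: dict,
                        writes_made_by_me: bool) -> bool:
    """At job close: if the fingerprint changed and this agent made no
    writes, another agent touched the shared state mid-run."""
    end_fp = context_fingerprint(end_context)
    if end_fp != start_fp and not writes_made_by_me:
        return True  # contamination event: worth a line in the trace
    return False
```

Two hashes and a comparison per run, so it's cheap enough to leave on permanently, and it turns a weeks-later drift mystery into a same-day trace entry.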
On "time to first useful output" — it behaves differently for long-horizon tasks. For anything over roughly 15 minutes, you actually need two numbers: first useful checkpoint AND a "last checkpointable progress" rate (how often the agent produces something saveable before timeout). Without the second, you know the agent started well but you're blind to what it's doing in the back half.
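One way the two numbers could be tracked together (a sketch under my own assumptions about the run record, with a ~15-minute horizon as the cutoff):

```python
def long_horizon_metrics(runs, horizon_secs=15 * 60):
    """For runs past the horizon: time to first useful checkpoint, plus
    the rate at which timed-out runs saved *any* checkpoint first."""
    long_runs = [r for r in runs if r["duration_secs"] > horizon_secs]
    if not long_runs:
        return None

    # Number 1: how fast did the agent produce something saveable?
    first_cp = [r["secs_to_first_checkpoint"] for r in long_runs
                if r["secs_to_first_checkpoint"] is not None]

    # Number 2: of the runs that timed out, how many left progress behind?
    timed_out = sum(1 for r in long_runs if r["timed_out"])
    saved_first = sum(1 for r in long_runs
                      if r["timed_out"] and r["checkpoints_saved"] > 0)

    return {
        "median_secs_to_first_checkpoint":
            sorted(first_cp)[len(first_cp) // 2] if first_cp else None,
        "checkpoint_before_timeout_rate":
            saved_first / timed_out if timed_out else 1.0,
    }
```

The first number tells you the front half is healthy; the second is the only visibility you get into the back half short of reading every trace.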
Five agents is the exact size where these problems are survivable. At fifteen they start compounding across runs.
What's your current blast radius when the cleanup agent writes to a key the research agent is reading?
Really inspiring
Thanks for sharing, looks very interesting.
Benchmarks optimize for bragging rights, production metrics optimize for reality.
The moment you charge money, leaderboard scores matter less than whether the job gets done reliably.