
Context: I'm a solo founder (Rapid Claw), my brother Brandon handles most of the infra, and we run about 5 agents in production on any given day. Small crew, small blast radius, and honestly that's the only reason we can get away with what I'm about to describe.
Last week there was a Hacker News post (and a real paper) showing researchers getting near-perfect scores on prominent AI agent benchmarks without solving a single task. That hit a nerve. We'd been quietly drifting away from benchmarks for months and this gave us the excuse to finally write down why.
Quick honesty check on numbers before I go further. We are at low-4-figure MRR, five agents live, and fewer than two dozen paying customers. I am not about to tell you what works at scale. I'm telling you what works at our scale, this month.
Here's the arc.
Phase 1: benchmarks made us feel smart
When we started, we cared a lot about how our default agent templates scored on public benchmarks. Pass@1 on SWE-Bench Lite, tool-use accuracy, browser nav success, that whole menu. It felt rigorous. We'd swap a model, rerun a suite, and if the number went up, we'd ship it.
Problem: our customers never once complained about benchmark deltas. They complained about things like "the agent burned through my budget on a loop," "the agent silently stopped picking up jobs," and "the agent said it finished but my queue still had the task." None of those shows up in a benchmark score.
Phase 2: we replaced the benchmark suite with four boring production numbers
These are the only four we look at now, per agent, per day:

- Loop rate: share of runs where the agent repeated itself until the budget cap tripped
- Silent stall rate: share of runs where the agent stopped picking up jobs with no error
- False completion rate: share of runs where the agent claimed done but the task was still in the queue
- Time to first useful output: how long a run took to produce something a customer could actually use

That's it. Four numbers. Per agent. Every day.
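For flavor, here's roughly what the daily rollup looks like. The field names and record shape are illustrative, not our actual schema; the point is that each number is a dumb aggregate over per-run logs.

```python
from collections import defaultdict

# Each run record is assumed to look like:
# {"agent": "research", "looped": False, "stalled": False,
#  "claimed_done": True, "actually_done": True,
#  "secs_to_first_output": 42.0}

def daily_rollup(runs):
    """Aggregate per-agent daily numbers from a list of run records."""
    by_agent = defaultdict(list)
    for r in runs:
        by_agent[r["agent"]].append(r)

    rollup = {}
    for agent, rs in by_agent.items():
        n = len(rs)
        rollup[agent] = {
            # fraction of runs that tripped the loop/budget cap
            "loop_rate": sum(r["looped"] for r in rs) / n,
            # fraction of runs that went quiet with no error
            "stall_rate": sum(r["stalled"] for r in rs) / n,
            # claimed done, but the task was still in the queue
            "false_done_rate": sum(
                r["claimed_done"] and not r["actually_done"] for r in rs) / n,
            # upper median of time to first useful output
            "median_secs_to_first_output": sorted(
                r["secs_to_first_output"] for r in rs)[n // 2],
        }
    return rollup
```

Nothing clever, and that's the point: if a customer could phrase the complaint, it should fall out of a one-line aggregate.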
Phase 3: traces are the thing
The numbers point at the agent. The traces tell you why. We log every tool call, every model call, every retry, with inputs, outputs, and cost, pinned to a run ID. When a number moves, we don't guess; we open the worst trace of the day and read it end to end.
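A minimal sketch of what one trace event carries (the structure here is illustrative, not our real schema): every call becomes one record, and every record shares the run ID, so a whole run reads as one story.

```python
import time
import uuid

def make_tracer(run_id, sink):
    """Return a function that appends one trace event per tool/model call."""
    def trace(kind, name, inputs, outputs, cost_usd, retry=0):
        sink.append({
            "run_id": run_id,       # pins every event to one run
            "ts": time.time(),
            "kind": kind,           # "tool_call" or "model_call"
            "name": name,
            "inputs": inputs,
            "outputs": outputs,
            "cost_usd": cost_usd,
            "retry": retry,         # which attempt this was
        })
    return trace

# Usage: one event per call, so the worst run of the day
# can be read end to end, cost included.
events = []
trace = make_tracer(run_id=str(uuid.uuid4()), sink=events)
trace("tool_call", "search_queue", {"q": "pending"}, {"n": 3}, 0.0)
trace("model_call", "plan_step", {"prompt_tokens": 512}, {"text": "..."}, 0.004)
```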
I wrote up our stack for this over here: AI agent observability. It's the boring load-bearing part of running agents unattended. If I could go back, I would have built this before I built the second agent.
What actually moved since we switched
The honest caveats
If you're running agents in production and you're still staring at benchmark scores to decide what to ship, I'd gently suggest switching to whatever four numbers your customers would actually pay to improve. Different for everyone. Mine are above.
Curious what broke first for folks here and what signal replaced it. If you're weighing hosting choices for this kind of setup, our take is at managed AI agents.
Tijo
Moving from leaderboard scores to "Time-to-first-useful-output" is the ultimate transition from "AI researcher" to "SaaS founder," Tijo. Benchmarks are fun for Twitter, but loop rates and silent stalls are what actually determine whether a customer churns or stays. You've essentially traded "vibe metrics" for an AutoOps heartbeat.
I'm currently running a project in Tokyo (Tokyo Lore) that highlights builders who prioritize production reliability over flashy leaderboards. Since you're running a five-agent shop with a focused observability stack, entering Rapid Claw into it could be a good way to turn your "boring production numbers" into a case study.
This is a really solid breakdown — especially the shift from benchmarks to real production signals.
That “looks good on paper vs actually works” gap is way bigger than most people think.
I’ve been seeing something similar but more on the trust/safety side — where systems appear accurate until real-world inputs hit (scam links, phishing messages, etc.). That’s where things start slipping through.
Curious — have you found any good way to simulate those messy real-world edge cases, or is it all coming from live data now?
The silent stall rate is the one that bites you hardest the first time — you're right that it used to eat entire nights before heartbeats got wired up.
One signal I'd add to your four: cross-agent context contamination. When two agents share a common state store — same Supabase row, same Redis key — Agent B can partially overwrite Agent A's working context mid-run. Doesn't trigger a loop, doesn't stall, doesn't crash. Outputs look plausible, pass a shallow acceptance check. You catch it weeks later when the research agent's results start drifting on exactly the days the cleanup agent runs.
The fix that worked: a context fingerprint written at job start, checked at job close. If the fingerprint changed and Agent A didn't change it, that's a contamination event. Cheap to add to the trace, genuinely invisible without it.
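The fingerprint idea sketched in code (the hashing choice and function signatures are my guesses at an implementation, not necessarily how anyone else wired it):

```python
import hashlib
import json

def context_fingerprint(context: dict) -> str:
    """Stable hash of an agent's working context, taken at job start."""
    blob = json.dumps(context, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def check_contamination(start_fp: str, end_context: dict,
                        writes_made_by_me: bool) -> bool:
    """At job close: if the fingerprint changed and this agent made no
    writes, another agent touched the shared state mid-run."""
    end_fp = context_fingerprint(end_context)
    if end_fp != start_fp and not writes_made_by_me:
        return True  # contamination event: worth a line in the trace
    return False
```

Two hashes and a comparison per run, so it's cheap enough to leave on permanently, and it turns a weeks-later drift mystery into a same-day trace entry.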
On "time to first useful output" — it behaves differently for long-horizon tasks. For anything over roughly 15 minutes, you actually need two numbers: first useful checkpoint AND a "last checkpointable progress" rate (how often the agent produces something saveable before timeout). Without the second, you know the agent started well but you're blind to what it's doing in the back half.
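One way the two numbers could be tracked together (a sketch under my own assumptions about the run record, with a ~15-minute horizon as the cutoff):

```python
def long_horizon_metrics(runs, horizon_secs=15 * 60):
    """For runs past the horizon: time to first useful checkpoint, plus
    the rate at which timed-out runs saved *any* checkpoint first."""
    long_runs = [r for r in runs if r["duration_secs"] > horizon_secs]
    if not long_runs:
        return None

    # Number 1: how fast did the agent produce something saveable?
    first_cp = [r["secs_to_first_checkpoint"] for r in long_runs
                if r["secs_to_first_checkpoint"] is not None]

    # Number 2: of the runs that timed out, how many left progress behind?
    timed_out = sum(1 for r in long_runs if r["timed_out"])
    saved_first = sum(1 for r in long_runs
                      if r["timed_out"] and r["checkpoints_saved"] > 0)

    return {
        "median_secs_to_first_checkpoint":
            sorted(first_cp)[len(first_cp) // 2] if first_cp else None,
        "checkpoint_before_timeout_rate":
            saved_first / timed_out if timed_out else 1.0,
    }
```

The first number tells you the front half is healthy; the second is the only visibility you get into the back half short of reading every trace.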
Five agents is the exact size where these problems are survivable. At fifteen they start compounding across runs.
What's your current blast radius when the cleanup agent writes to a key the research agent is reading?
Really inspiring
Thanks for sharing, looks very interesting.
Benchmarks optimize for bragging rights, production metrics optimize for reality.
The moment you charge money, leaderboard scores matter less than whether the job gets done reliably.