
Why I built the first browser benchmark that gives the true power of your device

Hey IH!

Most benchmarks are single-tasking relics. In 2025, we are running local AI models (LLMs, speech and image recognition, etc.) and complex data processing at the same time in the browser, which demands a new standard for performance measurement.

I've built SpeedPower.run to solve this critical need for modern, comprehensive benchmarking. Instead of a single, isolated task, our system runs a rigorous set of seven concurrent benchmarks, including core JavaScript operations, high-frequency data exchange simulations, and the execution of five different AI models. This process is specifically designed to force concurrent execution across every available CPU and GPU core in your device, simulating a real-world, multi-tasking environment.

Our benchmark is constructed using the most popular and cutting-edge web AI stack: TensorFlow.js and Transformers.js, ensuring relevance and fidelity to applications being built today.

The Challenge: Traditional scores fail to capture this complexity. Does our overall geometric mean score accurately and transparently reflect the true concurrent processing power of your browser? We believe our holistic approach provides the most accurate answer.

The test is pure and simple: No network interference, no installation or external dependencies—just a raw measurement of your device's compute capabilities as seen by the browser. See your comprehensive score and performance breakdown here: https://speedpower.run/?ref=indiehacker-1

I'll be here all day to discuss the specifics of our multi-tasking scoring logic, the selection of the seven benchmarks, and how we derived the geometric mean to best represent concurrent power.
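For anyone curious about the math before diving into the comments: a weighted geometric mean can be sketched in a few lines of JavaScript. The benchmark names and weights below are illustrative placeholders, not our production values.

```javascript
// Weighted geometric mean: score = exp( sum(w_i * ln(s_i)) / sum(w_i) )
// Names and weights here are illustrative, not the production configuration.
function weightedGeometricMean(scores) {
  let weightSum = 0;
  let logSum = 0;
  for (const { score, weight } of scores) {
    if (score <= 0) return 0; // a failed benchmark zeroes the whole score
    weightSum += weight;
    logSum += weight * Math.log(score);
  }
  return Math.exp(logSum / weightSum);
}

const overall = weightedGeometricMean([
  { score: 120, weight: 6 }, // e.g. JavaScript
  { score: 95,  weight: 6 }, // e.g. Exchange
  { score: 140, weight: 3 }, // e.g. AI inference
]);
console.log(overall.toFixed(1));
```

Note the weakest-link behavior: because the scores are multiplied (summed in log space), any near-zero category collapses the overall score, which a simple average would not do.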

posted to Product Launch on January 29, 2026
  1. 5

    Another benchmark? How is this different from JetStream 2 or Speedometer? I feel like we’ve solved browser speed.

    1. 2

      JetStream and Speedometer test classic JS speed.
      This measures modern browser power: WebGPU + concurrent AI workloads.

    2. 2

      I just checked the 'About' page. They are using Transformers.js v3 for the LLM and Speech tests. That uses WebGPU compute shaders for parallel inference. If you're comparing this to old-school JS benchmarks, you're missing the point. We're talking about asynchronous command queues in the browser. I'd be curious to see how the 'Score Stability' handles thermal throttling over multiple runs.

      1. 2

        Spot on! Thermal throttling is the 'invisible variable' in mobile benchmarking.

        We don't normalize for it because we want to measure peak real-world capacity. However, that’s exactly why we implemented the 'Warm-up Execution.' We prime the JIT and compile the shaders first, so we aren't measuring 'startup coldness.'

        If you run the benchmark three times in a row on a fanless MacBook Air, you will see the score dip. To us, that’s a feature, not a bug: it reveals the device's true sustained compute limit for modern, heavy AI workloads.

      2. 1

        Good catch. They mention a 'Warm-up Execution' to prime the caches and JIT, but they also say to run it several times for the maximum score. It seems they are measuring 'Peak Capacity' rather than average sustained performance, which makes sense for bursty AI tasks in a web app.

    3. 1

      Speedometer tests how fast a page feels; this tests if your browser can actually handle local LLMs 😉. Most benchmarks are single-threaded relics.

      1. 2

        But if I'm running an LLM, isn't that almost entirely a GPU bound task? Why does the main thread communication even matter that much once the model is loaded into VRAM?

        1. 1

          👆 Excellent question right here

        2. 1

          That’s a common misconception we’re trying to highlight! You’re right that the matrix multiplication happens on the GPU, but an LLM in a browser isn't a 'set it and forget it' process.

          With Transformers.js v3, orchestration, tokenization, KV cache management, and autoregressive decoding still require constant 'handshakes' between the worker and the main thread. If your 'Exchange' performance is poor, the GPU sits idle waiting for the next instruction. We specifically included the SmolLM2-135M test to show that even a 'small' model can be bottlenecked by how efficiently the browser moves data between threads.


  2. 3

    My phone's browser got a better score than my 5-year-old desktop. That feels totally unbelievable, haha. I thought the desktop would crush it with more cores. What gives with the mobile anomaly?

    1. 1

      That's what we call the "Parallel Paradox," and it's what we find fascinating! We've seen some modern mobile ARM chips show better task-switching efficiency than older x86 desktops due to more aggressive, thermal-aware scheduling in the mobile browser engines. Raw clock speed isn't the whole story anymore. Our scoring uses a weighted geometric mean, where JavaScript and Exchange efficiency are key factors.

  3. 1

    This is exactly the kind of benchmark we need. I've been building browser-based dev tools and the "Exchange" metric you mentioned is something I run into constantly — the overhead of moving data between threads kills perceived performance even when the actual computation is fast.

    Question: have you considered adding a "cold start" vs "warm" comparison? For tools that users open occasionally (vs apps that stay open), that initial JIT compilation and shader loading can dominate the experience. Would be interesting to see how different browsers handle that first-run penalty.

    1. 1

      You’ve hit on the 'silent killer' of web performance. Even with a world class WebGPU backend, a slow Exchange score means your GPU is effectively 'starving' while waiting for the main thread to hand off the next buffer. It’s the difference between a fast engine and a fast transmission.

      Regarding your Cold vs. Warm question: It’s a fantastic suggestion. Currently, we prioritize the 'Warm' run because we want to measure the hardware's peak throughput once the JIT and shaders are optimized.

      However, for dev tools and occasional-use apps, you're 100% right: the 'Cold Start' is the user's first impression. We’ve actually discussed adding a 'Time to First Inference' metric to capture that initial JIT/shader compilation penalty. Different browser engines handle this very differently, and exposing that 'first-run penalty' would be a huge win for developers optimizing for quick-hit interactions.

      Definitely adding this to our roadmap. Thanks for the insight!

  4. 1

    This is a really insightful benchmark that finally bridges the gap between classic single-task tests and the concurrent, AI-heavy workloads browsers actually run today.

  5. 1

    Wow, really impressive work! 👏
    Running seven concurrent benchmarks with AI models must have been challenging.
    Curious—did you build all of this yourself or use some tools to help manage it?

    1. 1

      Thank you! It was definitely a challenge to orchestrate. We didn’t want to reinvent the wheel where established standards already existed, so we strategically integrated top-tier open-source tools to handle the heavy lifting.

      We used TensorFlow.js for our established AI pipelines (like BlazeFace and MobileNetV3) and Transformers.js v3 for our next-gen WebGPU workloads, including the SmolLM2 and Moonshine-Tiny models. For the core JS processing, we adapted specific tests from Apple’s JetStream 2 suite, such as Regexp DNA and Access Binary Trees.

      The 'secret sauce' we built ourselves was the Parallel Execution Engine. We had to write the logic that forces all these different technologies (WASM, WebGPU, and WebGL) to fight for resources across multiple Web Workers simultaneously without crashing the browser. It’s that 'Task Saturation' orchestration that makes the benchmark unique!

  6. 1

    This is really impressive! I love that this actually measures real multitasking performance instead of those old single-task benchmarks. Curious—how does it handle CPU vs GPU scheduling, and have you noticed any cases where the geometric mean score might hide specific performance quirks?

    1. 1

      Exactly. It’s about moving from 'lab results' to 'street performance.'

      On Scheduling: We don't interfere with the OS scheduler; we just saturate it. By spawning a swarm of Web Workers (CPU) while simultaneously filling the WebGPU command queue, we force the browser to play traffic cop. The 'magic' happens when the browser has to decide between processing your 50MB JSON and rendering the next frame of an AI-generated token.

      On the Geometric Mean: It’s a double-edged sword. It’s excellent at preventing cheating. You can't just have a god-tier GPU and a potato CPU and get a good score. However, it can hide specific quirks because it’s a measure of balance. It won't tell you why your device hit a wall, only that it did. It's the ultimate weakest link detector.

      If you see a score that feels off, it usually means one of those benchmarks hit a near-zero bottleneck that dragged the whole system down. That's the messy reality we're trying to capture!

  7. 1

    This is interesting because it finally reflects how browsers are actually used today, not how benchmarks assumed they were used 10 years ago. Running concurrent AI + compute tasks feels much closer to real-world workloads.

    Curious — how did you decide on the weighting for the geometric mean across the seven benchmarks? And did you see big variance between devices that usually score similarly on traditional benchmarks?

    1. 1

      Exactly! It’s about measuring the 'traffic jam' of modern apps.

      We weighted JavaScript and Exchange highest (6) because they are the 'plumbing.' If your data pre-processing or thread communication is slow, your AI inference will starve regardless of your GPU. We want to reward systems that handle that handshake efficiently.

      Regarding variance: Absolutely. We see devices that look identical on single-thread tests diverge by 20-30% here. It usually comes down to how the OS and browser manage Task Saturation; some handle the simultaneous heat and scheduling much better than others 😉

  8. 1

    This is a really interesting angle. Most browser benchmarks I’ve tried only test one thing at a time, which never matches real usage. Running AI models + data tasks together feels much closer to how people actually use their devices now.

    I like the “no install, pure browser” approach too — makes it easy to trust and compare. Curious to see how results differ between laptops and phones.

    1. 1

      Spot on. We’re moving into a ‘Compute Web’ era where the browser is effectively a multi-tasking OS, so testing one script at a time just doesn't cut it anymore.

      Regarding your question on Laptops vs. Phones: the results are often surprising. High-end phones with modern ARM architectures sometimes outperform mid-range laptops in our AI Transformers tests because their NPUs are specifically optimized for those 4-bit quantized models.

      However, phones usually hit a wall on the Exchange score. Their thermal management is much more aggressive; once we saturate the cores with simultaneous JSON and AI tasks, they throttle much faster than a cooled laptop. It’s the difference between 'peak speed' and 'sustained capacity.'

      Glad the 'no-install' approach gives you that extra layer of trust; it’s 100% bare-metal compute once those models are loaded. If you see a weird gap between your devices, check your JavaScript vs. Exchange scores; that’s usually where the hardware bottleneck is hiding!

  9. 1

    This makes sense to me.

    Most benchmarks feel like trivia now — interesting on their own, but not really how people use machines in the real world.

    What really clicked for me was the idea of forcing contention. As soon as you’re running models, JS, and data movement at the same time, the weak spots show up fast. And that’s usually where real apps fall apart.

    I’ve watched teams optimize around clean, single-thread scores, then get blindsided when everything slows to a crawl in production. So testing “messy reality” instead of idealized tasks feels like the right direction.

    The mental model I keep coming back to is:

    Synthetic benchmarks show potential

    Concurrent benchmarks show limits

    Limits are what users actually feel

    I’m curious how you thought about weighting. If one task degrades badly, does it drag the whole score down, or get smoothed out by the rest? In real apps, one slow lane often ruins the entire experience.

    I’m also interested in how you expect people to use the results over time — comparing devices and browsers, or mostly tracking regressions on their own setup?

    This feels less like a leaderboard and more like a diagnostic tool, which is probably a good thing.

    1. 1

      You nailed the philosophy behind this. We actually use that exact phrase internally: 'Synthetic benchmarks show potential; concurrent benchmarks show limits.'

      Regarding your question on weighting and 'drag': that’s precisely why we chose the Geometric Mean. In a standard average, a high score can 'smooth out' a failure. But with a Geometric Mean, if one task, like your Exchange score, degrades badly because of thread contention, it pulls the entire overall score down significantly. It treats the browser like a chain: it’s only as strong as its weakest link.

      As for how to use this: you're right, it’s a diagnostic tool. We expect developers to use it to see where the 'ceiling' is for their specific stack. If you're building an AI-heavy app and your target users are on hardware that tanks under our Transformers.js + JSON mix, you know you have to optimize your data handoff, not just your model.

      It’s less about who has the 'biggest number' and more about identifying which browsers/devices actually stay fluid when the reality gets messy. Great to have you thinking along the same lines!

  10. 1

    Pretty smooth experience, only took 30 seconds. My main question is: how are you distinguishing between a true browser-engine efficiency lead (like Brave vs. Chrome) and just a thermal throttling difference on the OS/driver side?

    1. 1

      That's the million-dollar question for any benchmark! We run a quick pre-test to check for baseline thermal status, and we perform a Warm-up Execution to ensure we're measuring peak throughput. We engineered the total runtime to be a short, maximum-load burst to isolate the browser's scheduler efficiency (the Exchange and JavaScript scores) before OS/hardware thermal throttling becomes the dominant factor.

  11. 1

    I'm building a complex dashboard app right now. My biggest bottleneck is garbage collection when I'm running multiple WebWorkers. Are you guys tracking GC pauses in your methodology? That's the one metric I'd love to see.

    1. 1

      That's an excellent feature request. We are focused on CPU/GPU throughput saturation right now, with a key focus on Web Worker communication in our Exchange benchmark. A metric on GC pause time/frequency during heavy concurrent load would be a perfect addition for our next phase. Mind sharing what framework/library you are using? We'd love to hear more about your real-world use case.
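Since browsers don't expose GC events directly, one common proxy (sketched below with illustrative thresholds; this is not part of SpeedPower.run's current methodology) is to sample the clock in a tight loop and record any unusually large gaps, which correspond to main-thread stalls such as GC pauses:

```javascript
// Crude main-thread stall detector: in a tight loop, any gap between
// consecutive clock samples well above the sampling cost indicates a
// stall (GC pause, JIT deopt, or OS preemption). Values are illustrative.
function detectStalls(durationMs = 50, thresholdMs = 5) {
  const stalls = [];
  const end = Date.now() + durationMs;
  let last = Date.now();
  while (Date.now() < end) {
    const now = Date.now();
    if (now - last > thresholdMs) stalls.push(now - last);
    last = now;
  }
  return stalls;
}

const stalls = detectStalls();
console.log(`${stalls.length} stall(s) over the threshold`);
```

This can't attribute a pause to GC specifically, but under heavy Web Worker allocation churn the gap distribution is a useful signal.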

  12. 1

    Very interesting; I'd love to read articles like this more often.

    1. 1

      Thank you so much! I’m really glad you found the article interesting.

      We definitely plan to keep sharing these kinds of deep dives. At ScaleDynamics, we’re obsessed with the technical details of the 'Compute Web'—especially how browsers handle the collision of local AI and heavy data processing.

      We’ll be posting more about our findings from the SpeedPower.run beta data soon, focusing on how different architectures handle task saturation. If there’s a specific part of browser performance or AI integration you're most curious about, let me know—I’d love to cover it in a future post!

  13. 1

    Love the focus on true concurrent workloads. Most benchmarks don’t reflect how browsers are actually used today: AI + heavy JS at the same time. The no-network, no-install approach makes the results feel trustworthy. Curious how the geometric mean avoids hiding CPU vs GPU bottlenecks. Great work 👏

    1. 1

      Thanks for the kind words! You’ve touched on exactly why we went with the Geometric Mean for the final score.

      In traditional benchmarks using the Arithmetic Mean, a massive score in a single category (like raw JS speed) can 'pull up' a terrible score in another (like AI inference). It effectively hides bottlenecks.

      By using the Geometric Mean, we ensure that every category matters equally. If a device has a 'bottleneck' where the Exchange score is near zero because of IPC lag, it drags the entire overall score down significantly. It’s a much more 'honest' average for hardware.

      Our goal was to make sure you couldn't just throw a fast GPU at the problem and ignore the CPU-to-GPU handshake. If one part of the pipeline is a 'weak link,' the final score will reflect that reality.
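A toy example makes the difference concrete (the scores are made up for illustration):

```javascript
// Why the geometric mean exposes a bottleneck the arithmetic mean hides.
// Scores are invented for this example; 'exchange' has a severe IPC bottleneck.
const scores = { js: 200, ai: 180, exchange: 5 };

const values = Object.values(scores);
const arithmetic = values.reduce((a, b) => a + b, 0) / values.length;
const geometric = Math.exp(
  values.reduce((a, b) => a + Math.log(b), 0) / values.length
);

console.log(arithmetic.toFixed(1)); // 128.3 — looks healthy
console.log(geometric.toFixed(1));  // ≈ 56.5 — the bottleneck dominates
```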

      Really glad to see the 'no-install' approach is resonating. We wanted to lower the barrier so developers could test their theories on the fly without the friction of a 5GB suite.

  14. 1

    Is this test network-dependent at all? Do I need a gigabit connection for a high score? Always skeptical of benchmarks that don't clearly state that.

    1. 1

      A totally fair skepticism. Absolutely not: this is a Zero Network Interference test. All ~350MB of data (AI models and assets) is fully preloaded into browser memory before the timer starts. This is a pure local compute test, not a network speed test.

  15. 1

    I love the 'Exchange' benchmark. Most devs ignore the cost of moving data between the main thread and workers. It’s the silent killer of performance.

    1. 1

      Exactly. If you're building a realtime speech-to-text app using their Moonshine-Tiny test, your GPU inference might be fast, but if your OffscreenCanvas or buffer transfers are slow, the UX feels laggy. This is the first tool I've seen that quantifies the IPC overhead specifically.

  16. 0

    I built the first browser benchmark to reveal your device’s true performance, eliminating misleading scores and showing real-world speed, power, and efficiency across everyday tasks.
