
We built our own monitoring because uptime dashboards kept lying to us

We were running production systems where everything looked healthy — but users were still blocked.

The issue wasn’t infrastructure uptime.
It was blind spots in real user flows.

We ended up building our own monitoring internally.

After running it quietly in production, we’re opening it up.

Launching soon — would love honest feedback from builders who’ve felt this pain.

on February 2, 2026
  1. 1

    @Rizqtek Replying to your last comment:

    That’s exactly the slippery part because the drift doesn’t announce itself, it just slowly reframes what feels “healthy.”

    One thing I’ve seen help is baking outcome language into rituals, not just intentions: e.g., every review starts with “what user promise broke this week?” before any charts come up.

    Once that question disappears, comfort metrics tend to fill the vacuum on their own.

    1. 1

      'Comfort metrics tend to fill the vacuum' — I’m stealing that. That is painfully accurate.

      We found ourselves in that exact trap. Our dashboard was all green (comfort), but our support tickets were piling up (reality).

      That shift from 'System Health' to 'User Promise' is actually the core design philosophy we're trying to bake into this tool. If the dashboard says 'Up' but the API took 5 seconds to respond, that's a broken promise, not a win.

      1. 1

        That distinction you made, “broken promise, not a win,” is exactly it.

        I’ve noticed once teams rename latency/errors as promise breaks, the conversation changes fast. It stops being “are we up?” and becomes “who did we disappoint today?”

        Do you plan to surface that directly in the UI (e.g. promises broken this week), or keep it implicit through metrics?

        1. 1

          We are making it explicit. That's actually why we called it PingSLA instead of 'PingUptime'.

          Right now in the UI, we treat a 'Latency Breach' (e.g., >500ms) as a hard DOWN event. It turns the dashboard red just like a 500 error would. If the promise is broken, the light shouldn't be green.

          But the 'End Game' we are building towards is Auto-Remediation. We want the dashboard to say: '3 Promises Broken → 3 Auto-Fixed by Ops Engine → 0 Humans Woke Up.'

          We're actually opening a waitlist for that 'Ops Engine' component tomorrow. The goal is to make the 'disappointment' last seconds (machine speed) instead of hours (human speed).
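          For anyone curious, the rule itself is simple enough to sketch. A minimal, hypothetical version (the 0.5 s budget and function names are illustrative, not PingSLA’s actual implementation):

```python
# Hedged sketch of the "latency breach counts as DOWN" rule described above.
# The 0.5 s budget and names are illustrative, not PingSLA's real API.
from typing import Optional

def classify(status: Optional[int], elapsed_s: float, max_latency_s: float = 0.5) -> str:
    """Map one probe result to an UP/DOWN verdict.

    status is the HTTP status code (None if the request never completed);
    elapsed_s is how long the probe took.
    """
    if status is None:
        return "DOWN (unreachable)"
    if status >= 400:
        return "DOWN (error)"
    if elapsed_s > max_latency_s:
        # A slow success is still a broken promise, so it turns the light red.
        return "DOWN (latency breach)"
    return "UP"
```

          The point is that a slow 200 lands in the same bucket as a 500: the promise is the unit, not the status code.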

  2. 1

    There is a real pain here that uptime % completely hides.

    I've encountered circumstances where the status page was green and infrastructure metrics were nominal, yet users were literally unable to complete their main tasks. According to the dashboard: healthy. To the user: broken.

    The blindspot is always the same: we monitor systems, not outcomes.

    I wonder how you model a “real user flow” (scripted journeys? recorded sessions? synthetic users?). Would love to see some examples.

    As flows change over time, things become noisier, so it remains to be seen whether this will be a debugging or alerting tool in practice.

    It would be powerful if it could reliably answer: “Is a new user capable of signing up, performing the main action, and receiving value immediately?”

    That signal has more weight than 99.99% uptime.

    It appears you are transitioning from monitoring “is the server alive?” to “is the product actually usable?” That’s the thing most users care about, but most teams poorly instrument.

    1. 1

      Really well put — that “healthy system, broken user” gap is exactly what kept burning us.

      Right now we’re modeling real flows as simple outcome checks (signup → key action → value received) and then watching mismatches over time. Still early, so we’re trying to keep it low-noise before adding more layers.

      You’re right that flows change and things get noisy fast — that’s probably the hardest part to get right.
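      A rough sketch of how we think about those outcome checks (hypothetical names, just the shape of the idea):

```python
# Hedged sketch: a user flow modeled as ordered outcome checks.
# Step names and the lambda checks are placeholders for real probes.
from typing import Callable, List, Tuple

def run_flow(steps: List[Tuple[str, Callable[[], bool]]]) -> dict:
    """Run outcome checks in order; report the first broken step, if any."""
    for name, check in steps:
        if not check():
            return {"flow_ok": False, "failed_at": name}
    return {"flow_ok": True, "failed_at": None}

result = run_flow([
    ("signup", lambda: True),
    ("key_action", lambda: True),
    ("value_received", lambda: False),  # simulate the value step breaking
])
# result -> {"flow_ok": False, "failed_at": "value_received"}
```

      Stopping at the first broken step keeps the alert low-noise: one failure, one name, instead of a cascade.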

  3. 1

    This resonates hard. Spent years as a PM watching dashboards turn green while support tickets piled up.

    The gap between "system is up" and "user can complete their journey" is where so many problems hide. Classic vanity metrics vs. actionable metrics problem.

    Curious — are you focusing on specific verticals (e.g., SaaS checkout flows, API integrations) or keeping it general-purpose? In my experience, the blind spots vary wildly depending on the product type.

    1. 1

      Appreciate that perspective — PMs usually feel this pain first.

      We’ve intentionally kept it fairly general so far, but we’re seeing the sharpest blind spots around API-driven flows and critical SaaS journeys like auth, webhooks, and payments.

      You’re right though — the failure modes vary a lot by product type, and we’re still learning where this approach creates the most leverage.

  4. 1

    The thread about business-event checkpoints vs synthetic journeys is great, but there's a gap both approaches share: they tell you what your code thinks happened, not what actually went over the wire.

    We ran into this building API tooling. Webhook "received" according to the handler, but the downstream service got a mangled payload because a middleware silently transformed it. Logs said success. Wire said otherwise.

    Ended up capturing actual HTTP exchanges at the boundary. Comparing what the code intended to send vs what the network carried turned up failures that no amount of business-event instrumentation would catch, because the instrumentation itself sat above the layer where things went wrong.
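    The comparison itself can be tiny; the hard part is capturing both sides. A toy sketch (payloads and field names invented):

```python
# Toy sketch of "intent vs wire": diff what the code meant to send against
# what actually crossed the boundary. The payloads here are invented.
def diff_payloads(intended: dict, on_wire: dict) -> list:
    """Return the keys whose values differ between intent and the wire."""
    keys = set(intended) | set(on_wire)
    return sorted(k for k in keys if intended.get(k) != on_wire.get(k))

intended = {"event": "invoice.paid", "amount_cents": 1299}
on_wire = {"event": "invoice.paid", "amount_cents": "1299"}  # middleware stringified it
# diff_payloads(intended, on_wire) -> ["amount_cents"]
```

    A strict equality diff like this catches exactly the silent-transform case above, where the handler's logs still say success.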

    1. 1

      This is a great point — and you’re right, that boundary layer is where a lot of truth gets lost.

      We’ve seen similar cases where everything above the transport layer looks “successful,” while the wire tells a very different story. Instrumentation sitting too high can absolutely lie by omission.

      Capturing intent vs actual exchange is uncomfortable, but it exposed issues we wouldn’t have seen otherwise. Appreciate you calling that out.

  5. 1

    This hits close to home. Everything can look “healthy” on dashboards while real users are still struggling. Focusing on real user flows instead of just uptime feels like the right move—looking forward to seeing how this works in practice.

    1. 1

      Appreciate that.
      We felt the same frustration — “healthy” systems that still left users stuck.

      Shifting the lens to real flows helped us see issues earlier and prioritize what actually mattered.

  6. 1

    Man, we dealt with this exact thing at my last startup, everything showing green while our support inbox was on fire. Turns out "servers are running" doesn't mean users can actually do what they need to do. For us it was always the weird edge cases - auth flows timing out, webhooks failing silently, that kind of stuff. What blind spots kept biting you guys? Would definitely check this out when you launch.

    1. 1

      Totally relate to that — “support inbox on fire” was usually our first alert too.

      The blind spots that kept biting us were partial failures: auth succeeds but session breaks later, webhooks return 200 but downstream actions never complete, jobs that start fine but stall under load.

      Curious what finally helped you catch those earlier at your startup?

      1. 1

        We ended up building better logging around the critical user flows: auth, payment processing, webhook chains. Once we could see exactly where things broke (not just "webhook failed" but which step in the sequence), we caught issues way earlier. Also added health checks that actually tested the full flow, not just server status. The synthetic journey approach you mentioned is solid, so we basically did a lighter version of that, tracking key user paths as transactions rather than just individual events.
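        The "paths as transactions" idea is easy to sketch (step names here are made up):

```python
# Hedged sketch: one user journey tracked as a single transaction of ordered
# steps, so a failure reports *which* step broke, not just "webhook failed".
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class FlowTransaction:
    """A named user path, recorded step by step as it executes."""
    name: str
    steps: List[Tuple[str, bool]] = field(default_factory=list)

    def record(self, step: str, ok: bool) -> None:
        self.steps.append((step, ok))

    def first_failure(self) -> Optional[str]:
        for step, ok in self.steps:
            if not ok:
                return step
        return None

txn = FlowTransaction("webhook_chain")
txn.record("received", True)
txn.record("verified_signature", True)
txn.record("downstream_update", False)  # the step that actually broke
# txn.first_failure() -> "downstream_update"
```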

        1. 1

          That makes a lot of sense.
          Seeing which step broke instead of just “failed” was a big turning point for us too.

          Treating key user paths as transactions rather than isolated events changed how early we caught issues.
          Appreciate you sharing how you approached it.

  7. 1

    This resonates. We’ve seen “green dashboards” while a single broken auth or checkout path silently blocks real users.

    Curious how you’re defining and tracking those critical user flows, are you modeling them as synthetic journeys, or instrumenting success/failure at the business-event level? That line seems to be where most teams either win or drown in false confidence.

    1. 1

      That’s exactly the gap we kept falling into.

      We started with synthetic journeys, but quickly realized they only catch expected paths. The real failures were happening around edge cases — auth refreshes, webhook callbacks, background jobs completing “successfully” but producing bad outcomes.

      What worked better for us was modeling business-critical checkpoints instead of just endpoints:

      • auth success vs token refresh failures
      • webhook received vs downstream action completed
      • checkout initiated vs payment actually settled

      It reduced false confidence a lot, but also forced us to decide what really mattered to users, not just infra.
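      In code, the checkpoint idea boils down to pairing initiated and completed events and treating the gap as the alert signal. A hypothetical sketch (event names invented):

```python
# Hedged sketch: any "initiated" event without a matching "completed" event
# counts as a broken promise. Checkpoint names here are invented.
from collections import Counter
from typing import Dict, List, Tuple

def broken_promises(events: List[Tuple[str, str]]) -> Dict[str, int]:
    """events: (checkpoint, phase) pairs, phase being 'initiated' or 'completed'."""
    counts = Counter(events)
    gaps = {}
    for (checkpoint, phase), n in counts.items():
        if phase == "initiated":
            done = counts.get((checkpoint, "completed"), 0)
            if n > done:
                gaps[checkpoint] = n - done  # promises made but not kept
    return gaps

events = [
    ("checkout", "initiated"), ("checkout", "completed"),
    ("checkout", "initiated"),                      # payment never settled
    ("webhook", "initiated"), ("webhook", "completed"),
]
# broken_promises(events) -> {"checkout": 1}
```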

      Curious — have you seen more value from synthetic checks or from instrumenting business events directly?

      1. 1

        That’s a really strong evolution; the moment teams switch from “did the request return 200?” to “did the user outcome actually happen?”, things get real fast.

        From what I’ve seen, the highest signal setups usually end up hybrid, but with business events as the source of truth.

        Synthetic journeys are still useful, but mostly for:

        • catching regressions on known golden paths
        • detecting obvious infra or config issues early
        • giving fast, low-noise alerts

        Where teams actually reduce false confidence is exactly where you went:

        • instrumenting business-critical transitions
        • treating mismatches (initiated ≠ completed) as first-class failures
        • alerting on broken promises, not broken endpoints

        A pattern I’ve seen work well:

        • synthetics answer: “is the system reachable?”
        • business events answer: “did the user get what they came for?”

        Once those are in place, some teams even down-weight endpoint health entirely and page only on outcome gaps, which is uncomfortable at first but forces ruthless clarity about what matters.

        The fact that this forced prioritization for you is a good sign; most products only discover what matters when customers are already angry.

        If you’re open to it, I’d be curious how you’re visualizing those gaps now (funnel deltas, time-to-complete, SLOs on events). That layer is usually where this approach really compounds.

        Really thoughtful work and this is a hard shift to make, and you’re clearly on the right side of it.

        1. 1

          Really appreciate this — you articulated the trade-offs better than I could.

          What worked for us was starting very simple: funnel deltas (initiated → completed) and time-to-complete on a few truly critical paths. We resisted adding too many metrics early, just focused on where promises broke.

          Over time that naturally turned into event-level SLOs, but only after we were confident those events actually represented user value.

          Completely agree — the forced prioritization is uncomfortable, but it’s the only thing that cut through false confidence for us.
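          Those two starter metrics are cheap to compute, too. A hypothetical sketch with unix-timestamp inputs (names and data shape are assumptions):

```python
# Hedged sketch of the two starter metrics above: funnel delta
# (initiated -> completed drop-off) and time-to-complete.
from typing import Dict, Optional

def funnel_metrics(initiated: Dict[str, float], completed: Dict[str, float]) -> dict:
    """Keys are flow ids; values are unix timestamps for each phase."""
    done = set(initiated) & set(completed)
    drop_off = 1 - len(done) / len(initiated) if initiated else 0.0
    durations = [completed[i] - initiated[i] for i in done]
    avg_ttc: Optional[float] = sum(durations) / len(durations) if durations else None
    return {"drop_off_rate": drop_off, "avg_time_to_complete_s": avg_ttc}

m = funnel_metrics(
    initiated={"a": 0.0, "b": 10.0, "c": 20.0},
    completed={"a": 3.0, "c": 27.0},  # "b" never completed
)
# m["avg_time_to_complete_s"] -> 5.0
```

          Two numbers per critical path is enough to stare at daily without drifting into metric collection.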

          1. 1

            That approach is honestly textbook in the best way.

            Starting with funnel deltas + time-to-complete is such a disciplined move and it keeps you anchored to broken promises instead of drifting into metric collection for its own sake. I especially like that you waited to formalize SLOs until you trusted the events actually mapped to user value. Most teams do that backwards and then spend months unwinding it.

            The “resist adding metrics early” point is underrated too. A few well-chosen gaps you stare at every day will change behavior far more than a dashboard full of green checks.

            Sounds like you’ve landed on a really healthy mental model:

            • outcomes first
            • instrumentation second
            • standards last

            Appreciate you sharing how it evolved and this is one of those things that’s obvious after you’ve been burned by false confidence, but hard to internalize before.

            1. 1

              Thanks — that means a lot coming from someone who’s clearly been through it.

              Getting burned by false confidence was the real teacher for us too. Once you feel that pain, it permanently changes how you think about “healthy.”

              1. 1

                Totally get it, that kind of pain is brutal, but it’s also a permanent upgrade in judgment. Once you’ve seen what “false confidence” costs, every metric you pick afterward carries a lot more weight.

                It’s interesting how often teams try to shortcut that lesson and end up chasing dashboards that feel good rather than outcomes that matter. It feels like you’ve internalized the right mental model (outcomes first, instrumentation second, standards last), which is exactly what separates repeatable high-velocity teams from the rest.

                And now that you’ve lived through it, are there any “early indicators” you look for to know if a team is drifting into the green-check trap again?

                1. 1

                  That’s a great question.

                  The earliest warning sign for us is when teams stop talking about outcomes in reviews and only talk about metrics. When dashboards look healthy but no one can clearly answer “did users actually succeed?”, we know drift has started.

                  That’s usually when false confidence creeps back in.

                  1. 1

                    That’s a strong signal, especially the “no one can clearly answer if users succeeded” part. I’ve seen the same thing: once outcomes disappear from the conversation, metrics quietly turn into comfort objects.

                    One subtle follow-on I’ve noticed is language drift: teams start saying “the numbers look good” instead of “users completed X without friction.” Catching that early usually saves a lot of downstream cleanup.

                    Appreciate you sharing this, it’s one of those lessons you only really learn the hard way.

                    1. 1

                      That “language drift” point is really sharp.
                      We’ve caught ourselves doing the same — once conversations move to “numbers look fine,” it gets harder to see where real friction is creeping in.

                      Trying to keep outcome questions front-and-center so we don’t slip back into comfort metrics.
