this post was submitted on 01 Apr 2025

Software Reliability Engineering


👋 Welcome to the SRE Community!

Whether you're just learning about Site Reliability Engineering/Software Reliability Engineering or you're a seasoned on-call warrior, you're in the right place.

SRE (Site Reliability Engineering - AKA Software Reliability Engineering) is a discipline that uses software engineering principles to ensure that systems are reliable, scalable, and resilient. It’s about balancing feature velocity with system stability—keeping things running even when they shouldn’t.

SREsyphus (programming.dev)
submitted 4 days ago by th3raid0r to c/sre
 

Maybe, one of these days, my SRE team can define SLOs that actually make sense. This "alert on whatever the customer wants" stance really really sucks.

th3raid0r, 4 days ago (last edited 4 days ago)

Definitely not that lucky. We have customers who seem to watch dashboards and open a Sev1 any time latency degrades by 10%. They tell their account manager that they need perfect performance all the time. The AM then comes to us demanding that we increase the alert's sensitivity. Management agrees. And then, voilà, just like that we have an alert that flaps all day and all night, and we aren't "allowed" to remove it until someone can show that the noise is literally stopping us from catching other stuff.

It's insanity.

EDIT: I only stay because new leadership seems to earnestly want to fix it. Things are headed in the right direction, but it takes a long time to turn a ship.
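
To illustrate why an alert set at a 10% latency degradation tends to flap, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sample latencies, the 100 ms baseline, and the function names are hypothetical, and the "sustained breach" variant is just one common way to damp noise, not the setup described in this thread.

```python
from typing import List


def count_alert_transitions(firing: List[bool]) -> int:
    """Count how often the alert changes state (fires or resolves)."""
    return sum(1 for prev, cur in zip(firing, firing[1:]) if prev != cur)


def naive_alert(latencies_ms: List[float], baseline_ms: float,
                degradation: float = 0.10) -> List[bool]:
    """Fire whenever a single sample exceeds the baseline by `degradation`."""
    limit = baseline_ms * (1 + degradation)
    return [sample > limit for sample in latencies_ms]


def sustained_alert(latencies_ms: List[float], baseline_ms: float,
                    degradation: float = 0.10,
                    min_consecutive: int = 3) -> List[bool]:
    """Fire only once `min_consecutive` samples in a row exceed the limit."""
    limit = baseline_ms * (1 + degradation)
    firing = []
    streak = 0
    for sample in latencies_ms:
        # Reset the streak as soon as one sample is back under the limit.
        streak = streak + 1 if sample > limit else 0
        firing.append(streak >= min_consecutive)
    return firing


# Hypothetical latency samples (ms) hovering around a 100 ms baseline.
samples = [100, 112, 98, 111, 99, 113, 101, 112, 115, 118, 120, 100]

naive = naive_alert(samples, baseline_ms=100)
sustained = sustained_alert(samples, baseline_ms=100)

print("naive transitions:    ", count_alert_transitions(naive))
print("sustained transitions:", count_alert_transitions(sustained))
```

With these made-up samples, the single-sample alert changes state far more often than the one that waits for a sustained breach. Alerting on an SLO (for example, on error-budget burn over a window) follows the same idea of judging a period of time rather than each individual blip.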