Software Reliability Engineering

👋 Welcome to the SRE Community!

Whether you're just learning about Site Reliability Engineering/Software Reliability Engineering or you're a seasoned on-call warrior, you're in the right place.

SRE (Site Reliability Engineering - AKA Software Reliability Engineering) is a discipline that uses software engineering principles to ensure that systems are reliable, scalable, and resilient. It’s about balancing feature velocity with system stability—keeping things running even when they shouldn’t.

💬 What can you post here?

Here are a few ideas to get started:

War stories from production incidents and what you learned
Cool tools for observability, monitoring, alerting, and automation
Best practices around on-call, SLOs, blameless postmortems, and chaos engineering
Questions about reliability engineering and career advice
Infrastructure as code, CI/CD pipelines, and deployment strategies
Memes. Tasteful ones. SREs need to laugh too 😅

🌐 Be excellent to each other

This community is part of the programming.dev network. Please make sure to:

Read and follow the programming.dev code of conduct
Keep discussions respectful and inclusive
Assume good faith, and be generous in your interpretations

founded 1 week ago

MODERATORS

th3raid0r

April Discussion: What problems are you working on? (self.sre)

submitted 1 week ago by th3raid0r to c/sre

Whether you're ironing out kinks in disk provisioning at scale or implementing a service mesh for microservices, let's talk about the problems we're currently tackling.

(Without violating NDAs, of course)

knightly@pawb.social 2 points 1 week ago

Ugh, my biggest problem is fighting through my executive dysfunction to get started on documentation updates. >_<

th3raid0r 2 points 1 week ago

I'll go first - I'm working on the reliability of some migration patterns between a couple of managed cloud implementations. The legacy setup was built on a bring-your-own-account model, while the newer cloud offering lets us keep everything in our own accounts.

This is on top of generally sussing out the reliability of the newer product with some chaos testing.

The unfortunate bit is that my ops workload is so high that I struggle to get much traction on the above. Not to mention that this new product moves so fast that it's hard to get any sort of week-to-week consistency, and it often requires environment re-provisioning. Not exactly stuff that the community can really help with, but hey, that's what's on my plate.

Corbin 1 points 1 week ago

This week (likely most of the month) I am rebuilding a backup process which appears to currently rely on Syncthing and hasn't been tested in years. Everybody got to put in their opinions, so I need to design something which has a single source of truth, also has 3-2-1 replica counts, takes snapshots of the production DBs without degrading them, and burns a monthly DVD on a workstation in the office. Once I'm done with that, I get to look at the office VPN's performance problems.
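
For anyone unfamiliar with the 3-2-1 rule mentioned above: it is usually stated as at least 3 copies of the data, on at least 2 different kinds of media, with at least 1 copy kept offsite. A minimal Python sketch of that check follows. The copy names, media labels, and the `BackupCopy` type are all hypothetical illustrations, not a description of the actual setup in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data. All field values here are illustrative."""
    name: str
    medium: str    # e.g. "disk", "optical", "object-storage"
    offsite: bool  # True if the copy lives outside the primary site


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule:
    at least 3 copies, at least 2 distinct media, at least 1 offsite."""
    distinct_media = {c.medium for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(distinct_media) >= 2 and has_offsite


# Hypothetical plan: production DB, a local snapshot, and a DVD kept in the office.
office_only = [
    BackupCopy("production-db", "disk", offsite=False),
    BackupCopy("local-snapshot", "disk", offsite=False),
    BackupCopy("monthly-dvd", "optical", offsite=False),
]

# Same plan plus one hypothetical offsite replica.
with_offsite = office_only + [
    BackupCopy("remote-replica", "object-storage", offsite=True),
]
```

Under these assumptions, `office_only` has enough copies and media types but fails the offsite requirement, while `with_offsite` passes. Whether a DVD stored in the same office counts as offsite is a policy decision the sketch leaves to the `offsite` flag.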