this post was submitted on 22 Jul 2024
260 points (95.1% liked)

[–] [email protected] 135 points 4 months ago (4 children)

Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.

[–] [email protected] 20 points 4 months ago* (last edited 4 months ago) (1 children)

Hey, why not just ask Dave Plummer, former Windows developer...

https://youtube.com/watch?v=wAzEJxOo1ts

The numbers I've read so far vary significantly, anywhere from 8.5 million to over a billion systems down. Either way, that's way too much failure for a simple borked update to a kernel-level driver, one not even made by Microsoft.

[–] [email protected] 7 points 4 months ago

that's a huge sign that their rollout process is garbage

[–] [email protected] 19 points 4 months ago (1 children)

That means spending time and money on developing such a system, which means increasing costs in the short term... which is kryptonite for current-day CEOs.

[–] [email protected] 8 points 4 months ago* (last edited 4 months ago)

Right. More than money, I say it's about incentives. You might change the entire C-suite, management, and engineering teams, but if the incentives remain the same (e.g. developers are evaluated by number of commits), the new staff is bound to make the same mistakes.

[–] [email protected] 18 points 4 months ago* (last edited 4 months ago) (1 children)

I strongly believe in no-blame mindsets, but "blame" is not the same as "consequences" and lack of consequences is definitely the biggest driver of corporate apathy. Every incident should trigger a review of systemic and process failures, but in my experience corporate leadership either sucks at this, does not care, or will bury suggestions that involve spending man-hours on a complex solution if the problem lies in that "low likelihood, big impact" corner.
Because likely when the problem happens (again) they'll be able to sweep it under the rug (again) or will have moved on to greener pastures.

What the author of the article suggests is actually a potential fix; if developers (in a broad sense of the word and including POs and such) were accountable (both responsible and empowered) then they would have the power to say No to shortsighted management decisions (and/or deflect the blame in a way that would actually stick to whoever went against an engineer's recommendation).

[–] [email protected] 3 points 4 months ago* (last edited 4 months ago) (1 children)

Edit: see my response, realised the comment was about engineering accountability which I 100% agree with, leaving my original post untouched aside from a typo that's annoying me.

I respectfully disagree, coming from a reliability POV: that way you won't address the culture or processes that enable a person to make a mistake. With the exception of malice or negligence, no one does something like this in a vacuum; there's insufficient or incorrect training, unreasonable pressure, poorly designed processes, or a culture that enables actions that lead to failure.

An example I recall from when I worked in manufacturing: an operator ran a piece of equipment that joins pieces together in manual rather than automatic, failed to return it to a ready flag, and caused a line stop. Yeah, the operator did something outside of process and caused an issue. Clear cut, right? Send them home? But that was a symptom, not a cause. The operator ran in manual because the auto cycle time was borderline causing line stops, especially on the material being run. The operator was also using manual because some location sensors had issues with that material and there were incoming quality issues, so running manually, while not standard procedure, was a workaround for processing issues. We also found that, culturally, a lot of the operators did not trust the auto cycles and would often override them. The operator was unlucky. If we had just put all the "accountability" on them, we'd never have started projects to improve reliability at that location and change the automation to flick over that flag the operator forgot about, regardless, whenever conditions were met.

Accountability is important, but it needs to be applied where appropriate. If someone is being negligent or malicious, yeah, there are consequences, but it's limiting to focus only on that. You can implement what you suggest, with the devs getting accountability for any failure so they're "empowered", but if your culture doesn't enable them to say no or make them feel comfortable doing so, you're not doing anything that will actually prevent an issue in the future.

Besides, I'd almost consider it a PPE control, and those are at the bottom of the controls hierarchy, with administrative controls just above. Yes, I'm applying OH&S to software, because risk is risk conceptually. Automated tests, multi-phase approvals, etc.: all of those are better controls than relying on a single developer saying no.
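
To make the "better controls" point concrete, here's a rough sketch of a release gate that layers automated tests and multi-phase approvals instead of relying on one person saying no. Everything in it (`ReleaseRequest`, `can_release`, the two-approval default) is made up for illustration and isn't taken from any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseRequest:
    """Hypothetical release request; all field names are illustrative."""
    author: str
    tests_passed: bool
    approvals: set[str] = field(default_factory=set)


def can_release(req: ReleaseRequest, required_approvals: int = 2) -> tuple[bool, str]:
    """Return (allowed, reason). Each check is an independent control."""
    # Control 1: automated tests must have passed.
    if not req.tests_passed:
        return False, "automated tests failed"
    # Control 2: multi-phase approval. The author's own approval does not count.
    independent = req.approvals - {req.author}
    if len(independent) < required_approvals:
        return False, (
            f"need {required_approvals} independent approvals, "
            f"have {len(independent)}"
        )
    return True, "ok"
```

The point of the sketch is only that the gate fails closed: a missing test run or a missing approval blocks the release no matter how confident any single developer is.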

[–] [email protected] 7 points 4 months ago (1 children)

Oh I was talking in the context of my specialty, software engineering. The main difference between an engineer and an operator is that one designs processes while the other executes on those processes. Negligence/malice aside the operator is never to blame.

If the dev is "the guy who presses the 'go live' button" then he's an operator. But what is generally being discussed is all the engineering (or lack thereof) around that "go live" button.

As a software engineer I get queasy when it is conceivable that a noncritical component reaches production without the build artifact being thoroughly tested (with CI tests AND real usage in lower environments).
The fact that CrowdStrike even had a button that could push a DOA update on such a highly critical component points to their processes being so far outside industry standards that no software engineer would have signed off on anything... if software engineers actually had the same accountability as Civil Engineers. If a bridge gets built outside the specifications of the Civil Engineer who signed off on the plans, and that bridge crumbles, someone is getting their tits sued off. Yet there is no equivalent accountability in Software Engineering (except perhaps in super safety-critical stuff like automotive/medical/aerospace/defense applications, and even there I think we'd be surprised).
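
For what it's worth, here's a minimal sketch of the kind of guard I mean around that "go live" button: the exact build artifact, identified by its hash, has to be marked as verified in every earlier environment before it can be promoted. The environment names, `ArtifactRegistry`, and `PromotionError` are all hypothetical; this is not a description of anyone's actual pipeline.

```python
import hashlib

# Hypothetical environment order; names are illustrative.
PIPELINE = ["ci", "staging", "production"]


class PromotionError(Exception):
    """Raised when an artifact skips a required environment."""


class ArtifactRegistry:
    """Tracks which exact artifact (by SHA-256 digest) was verified where."""

    def __init__(self) -> None:
        self._verified: dict[str, set[str]] = {}

    @staticmethod
    def digest(artifact: bytes) -> str:
        return hashlib.sha256(artifact).hexdigest()

    def mark_verified(self, artifact: bytes, env: str) -> None:
        self._verified.setdefault(self.digest(artifact), set()).add(env)

    def promote(self, artifact: bytes, target: str) -> str:
        """Allow promotion only if this exact artifact passed every earlier env."""
        earlier = PIPELINE[: PIPELINE.index(target)]
        seen = self._verified.get(self.digest(artifact), set())
        missing = [env for env in earlier if env not in seen]
        if missing:
            raise PromotionError(f"not verified in: {', '.join(missing)}")
        return f"promoted {self.digest(artifact)[:12]} to {target}"
```

Keying on the digest rather than a version label means a rebuilt or hand-patched file counts as a new, unverified artifact, even if it carries the same name.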

[–] [email protected] 6 points 4 months ago (1 children)

I realised over lunch that this is what you meant. I'm a mech eng who changed disciplines into software (data and systems mainly) over my career, so I 100% feel you. I have seen enough colleagues do things that wouldn't fly in other disciplines, and it's definitely put me off a number of times. I'm personally for stamping by a PEng and the responsibility that comes with that. There are enough regulatory and ethical considerations just in data usage to warrant an engineering review; systems designed for compliance should be stamped too.

Really bothers me sometimes how Wild West things are.

[–] sukhmel 2 points 4 months ago (1 children)

This might help in some regard, but it will also create a bottleneck of highly skilled, highly expensive Engineers with the accountability certificate. I've seen what happens when this is the cornerstone, even without the accountability that would make everything more expensive still: the company wants to cut expenses, so there's only one high-level engineer per five or so projects. Said engineer has no time and no resources to dig into what the fuck actually happens on the projects. Changes are either under-reviewed or never released because they are forever stuck in review.

On the other hand, maybe we do move a tad bit too fast, and some industries could do with a bit of thinking before doing. Not every software company should do that, though. To continue the bridge analogy, most software developers are more akin to carpenters, even if they think of themselves as architects of buildings and bridges. If a table fails, nothing good is going to happen, and some damage is likely to occur, but the scale is very different from what happens if a condo fails.

[–] [email protected] 4 points 4 months ago (1 children)

But a company that hires carpenters to build a roof will be held liable if that roof collapses in the first snowstorm. Plumbers and electricians must be accredited AFAIK and have the final word on what is good enough by their standards, and signing off on shoddy work exposes them to criminal negligence lawsuits.

Some software truly has no stakes (e.g. a free mp3 converter), but even boring office productivity tools can be more critical than my colleagues sometimes seem to think. Sure, we work on boring office productivity tools, but hospitals buy those tools and unreliable software means measurably worse health outcomes for the patients.

Engineers signing off on all software is an extreme end of the spectrum, but there are a whole lot of options between that and the current free-for-all where customers have no way to know if the product they're buying is following industry standard practices, or if the deployment process is "Dave receives a USB from Paula and connects to the FTP using a 15 year-old version of FileZilla and a post-it note with the credentials".

[–] sukhmel 1 points 4 months ago (1 children)

True, there is a spectrum of options, and some will work much better than what we have now. It's just that when I read about holding people accountable, I don't quite imagine it's going to be implemented in the optimal way, not in the first hundred years or so.

[–] [email protected] 3 points 4 months ago (1 children)

All of this has already been implemented for over a hundred years for other trades. Us software people have generally escaped this conversation, but I think we'll have to have it at some point. It doesn't have to be heavy-handed government regulation; a self-governed trades association may well aim to set the bar for licensing requirements and industry standards. This doesn't make it illegal to write code however you want, but it does set higher quality expectations and slightly lowers the bar for proving negligence on a company's part.

There should be an ISO-whateverthefuck or DIN-thisorother that every developer would know to point to when the software deployment process looks as bad as CrowdStrike's. Instead we're happy to shrug and move on when management doesn't even understand what CI is or why it should get prioritized. In other trades the follow-up for management would be a CYA email that clearly outlines the risk and standards noncompliance and draws a line in the sand liability-wise. That doesn't sound particularly outlandish to me.

[–] [email protected] 3 points 4 months ago

Heck, there are already ISO language standards and ISO software lifecycle standards, so it's absolutely not a leap to move to standards-adhering processes. It's not like there's no desire to do it either. Take code standards alone: how many times have you had discussions about company-wide style guides and coding standards? They make things more consistent and easier for different developers to maintain.

Semi-related: I see a lot of non-ISO-standard SQL that's a pain if you do migrations or refactors, and often it just sucks to read through (old-school Oracle joins look really strange and aren't as clear as ISO-standard joins). I really wish people would adhere to the standards as much as possible.
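
A quick illustration of the join styles, as a sketch using Python's built-in `sqlite3` with made-up `emp` and `dept` tables. SQLite doesn't accept Oracle's legacy `(+)` outer-join operator, so that form only appears in a comment; the comma-style inner join and the ISO-style joins are what actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp  (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'Eng'), (2, 'Ops');
    INSERT INTO emp  VALUES (1, 'Ana', 1), (2, 'Bo', NULL);
""")

# Old comma-style inner join: the join condition hides among the WHERE filters.
implicit = conn.execute("""
    SELECT e.name, d.name
    FROM emp e, dept d
    WHERE e.dept_id = d.id
    ORDER BY e.id
""").fetchall()

# ISO-style inner join: the join condition sits next to the join itself.
explicit = conn.execute("""
    SELECT e.name, d.name
    FROM emp e
    JOIN dept d ON e.dept_id = d.id
    ORDER BY e.id
""").fetchall()

# ISO-style outer join. The legacy Oracle spelling would be
#   FROM emp e, dept d WHERE e.dept_id = d.id (+)
# which is Oracle-only and is not executed here.
outer = conn.execute("""
    SELECT e.name, d.name
    FROM emp e
    LEFT JOIN dept d ON e.dept_id = d.id
    ORDER BY e.id
""").fetchall()

print(implicit == explicit)  # both inner-join spellings return the same rows
print(outer)                 # the outer join also keeps the employee with no dept
```

The ISO form keeps join conditions separate from filters, which is most of why it's easier to read and to port between databases.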