this post was submitted on 19 Jul 2024

131 points (100.0% liked)

TechTakes

Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

founded 2 years ago

Crowdstrike takes out last remaining threat vector (the users) (infosec.exchange)

submitted 9 months ago* (last edited 9 months ago) by [email protected] to c/[email protected]

The machines, now inaccessible, are arguably more secure than before.

[–] [email protected] 29 points 9 months ago* (last edited 9 months ago) (6 children)

Fair warning that I'll be ranty because I hate losers talking about DEI hires.

So why is memory address 0x9c trying to be read from? Well because... programmer error.

So what happened is that the programmer forgot to check that the object it's working with isn't valid, it tried to access one of the objects member variables...

This is a huge assumption. ~~The last rumor I've read from actual cybersecurity people is that Crowdstrike's update files were corrupt~~ (update: disproven by Crowdstrike's blog post). If this is true it's likely still from programmer error at some level, but maybe not as simple as "whoopsie I forgot an if (data == nullptr) teehee".

He, like the rest of us that don't work at Crowdstrike, has no idea what happened. I have seen computers do the weirdest gosh darn things. I know better than to assume anything at this point. I wouldn't even rule out weird stuff like the data getting corrupted between release qualification and release yet.

It turns out that C++, the language crowdstrike is using, likes to use address 0x0 as a special value to mean "there's nothing here", don't try to access it or you'll die.

This thread is full of these sorts of small technical inaccuracies and oversimplifications so I won't point out all of them, but nothing in the C++ standard requires null pointers to refer to memory address 0x0. Nor does it require that dereferencing a null pointer terminates the program.

Windows died not because C++ asked it nicely to, but because a driver tried to access an address which wasn't paged in.
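To make the 0x9c point concrete, here's a minimal sketch. The `Record` layout and `read_value` are made up for illustration and are not Crowdstrike's code. On a platform where null happens to be address zero, reading a member that sits at offset 0x9c through a null pointer would touch address 0x9c, which is one plausible way to end up with a fault at that address. Whether that is what actually happened is still a guess.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical layout, padded only so that `value` lands at offset 0x9c.
struct Record {
    std::uint8_t padding[0x9c];
    std::uint32_t value;
};

static_assert(offsetof(Record, value) == 0x9c,
              "value is expected at offset 0x9c in this sketch");

// Reads the field only when the pointer is non-null.
// Without the check, r->value on a null pointer is undefined behaviour;
// the C++ standard does not promise a crash, or any particular address.
std::optional<std::uint32_t> read_value(const Record* r) {
    if (r == nullptr) {
        return std::nullopt;
    }
    return r->value;
}
```

The check only guards against null. A pointer that is non-null but points at garbage would sail straight through it, which is part of why "they forgot a null check" is an assumption and not a diagnosis.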

Crowdstrike should have set up automated testing using address sanitizer and thread sanitizer that runs on every code update.

The funny thing about accessing memory that isn't paged in from kernel space:

1. It will crash regardless of whether it's running under ASan or not; sanitizers are literally irrelevant based on what we know so far.
2. The ASan version he linked to is for user-space. In the Windows kernel you'd need KASAN instead.

(If this was a simple nullptr dereference on bad input data then perhaps a fuzzer would have helped. Fuzzers are great though I have no idea how hard they are to use with kernel drivers)
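For anyone who hasn't seen one, a fuzz target looks roughly like this. It is a user-space sketch assuming libFuzzer; the record format and `parse_records` are hypothetical stand-ins for "code that parses untrusted data", and none of this says anything about how you'd fuzz a real kernel driver.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical format: one count byte followed by `count` 4-byte entries.
// Returns false on malformed input instead of reading out of bounds.
bool parse_records(const std::uint8_t* data, std::size_t size,
                   std::uint32_t* sum_out) {
    if (data == nullptr || size < 1) {
        return false;
    }
    const std::size_t count = data[0];
    if (size - 1 < count * 4) {
        return false;  // truncated input: fewer bytes than the header claims
    }
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t entry;
        std::memcpy(&entry, data + 1 + i * 4, sizeof entry);
        sum += entry;
    }
    *sum_out = sum;
    return true;
}

// libFuzzer entry point: the fuzzer calls this repeatedly with mutated inputs.
extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t* data,
                                      std::size_t size) {
    std::uint32_t sum = 0;
    parse_records(data, size, &sum);
    return 0;
}
```

With clang this would typically be built with something like `clang++ -fsanitize=fuzzer,address`. If the truncation check were missing, the fuzzer plus ASan would likely flag the out-of-bounds read fairly quickly.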

C++ is hard. Maybe they have a DEI engineer that did this

Dude would probably call me a "DEI hire"; but I bet I could beat him in a C++ deathmatch so neener neener.

[–] [email protected] 14 points 9 months ago (4 children)

Also, and this shouldn't be left unsaid, we're talking about the Windows kernel here. A place with C++ code so cursed it is legendarily unhealthy to work in, as the cosmic horrors contained within slowly eat away at your sanity and warp the perception of time and space. Seeing that code for a few hours is enough to make a grown man cry. Seeing that code for a few weeks is enough to make you never cry again, as the terrible truth worms its way into your mind.

"DEI hire", hah! The creature makes no distinction for race or gender as it fattens itself upon your failure! Even a glimpse at the edge of its abyss is enough to trigger a cycle of revelation - all modern software lies upon a rotting pile of ancient mistakes.

[–] [email protected] 7 points 9 months ago

@V0ldek @sailor_sega_saturn "That gibbering under the desk? Oh, that's just Azathoth. Poor thing got a look at the pump controller code last year. It's never been quite the same since."

[–] [email protected] 7 points 9 months ago

From a lovely response to the Crowdstrike error and various speculation on what caused it (https://ruby.social/deck/@[email protected]/112824202708490681), comes this gem:

> all modern software lies upon a rotting pile of ancient mistakes.

To be clear: this is 100% true. As we slowly, painfully work our way toward being less awful at software engineering, we are better than we have ever been. As fucked as modern code is, old code was worse.

The lower in the stack you go, the more horrifying the revelations, just as a rule.

[–] [email protected] 6 points 9 months ago (1 children)

@V0ldek @sailor_sega_saturn have you read the writings of James Mickens, e.g. https://www.usenix.org/system/files/1311_05-08_mickens.pdf ?

[–] [email protected] 4 points 9 months ago (1 children)

Absolutely stellar writing, except for this one weird bit

Database people are systems people. Modern databases have their own memory management, thread scheduler, and a fucking compiler inside. A promising research direction is to just bundle the database with your own bloody kernel that you handwrote with a box of scraps to make the entire thing less cursed and not have to wrestle with Linux.

You know, just in case you were looking for people to include in your postapo gang, database experts will also murder whatever you want with bare hands.

[–] [email protected] 2 points 9 months ago* (last edited 9 months ago) (1 children)

@V0ldek there's a difference between people who develop database engines, and people who use an existing database engine to write database applications in SQL or whatever.

It's just your comedic hyperbolic turns of phrase reminded me of Mickens'.

[–] [email protected] 2 points 9 months ago

If more than one systems dev launches into a Lovecraftian stream of epithets about how incomprehensibly horrific it is when you ask them about their work, then there just may be some truth in it.

[–] [email protected] 4 points 9 months ago

@V0ldek @sailor_sega_saturn Thanks! You are talking straight from my heart!

[–] [email protected] 12 points 9 months ago

Mention C (and to an extent C++) and turbo nerds froth to show off how ultra cool they are cause they are LoW lEvEl programmers. But like most things, these loud freaks are mostly incoherent with their random insertion of tech words. Putting aside the DEI stuff cause I will rant forever against this racist and sexist fuckwit, it’s massively annoying working in an industry where dummies love to be all hand wavy and suggest something like sanitizers. Thanks bro, let’s all add runtime sanitizers and watch perf tank in the most critical section of your computer. And as you pointed out he doesn’t even mention the right one.

Next time Crowdstrike should just have an if check all registers after every instruction to make sure their values are within your address space! And and and make sure a woman doesn’t program it cause according to him they are exempt from code reviews cause of the left agenda or some bullshit

[–] [email protected] 9 points 9 months ago

As someone who is still confused why C++ is different from a B-, thank you for your sacrifice in wading through that nonsense.

[–] [email protected] 8 points 9 months ago (3 children)

@sailor_sega_saturn And given enough time and enough scale even the most improbably weird things will eventually happen. Update file corrupted by a storage controller that flips a couple of bits at random after every 720 hours of uptime but only if it’s 23.682 seconds after the hour? Weirder shit has happened.

[–] [email protected] 16 points 9 months ago

I once helped one of my company's customers troubleshoot an issue that had seen the same ridiculous edge case error happen three times over the course of a few years. At one point the actual sustaining developer we worked with was able to narrow down a specific bit that was getting flipped somehow, and pitched that cosmic radiation was a plausible explanation given how rarely this kind of thing impacted other customers.

It was at this point that we remembered that the customer was either a university with a nuclear physics lab or a hospital with a nuclear medicine program (can't remember now, ironically enough) that the server rack lived adjacent to.

[–] [email protected] 11 points 9 months ago* (last edited 9 months ago)

some twenty four years ago i managed, amongst others, a company's samba and print server (that was at the time when all the company's servers were beige boxes with less memory and disk than the laptop i'm using to type this – and still they served a few hundred employees).

the machine developed a strange custom of hard-resetting itself, which we initially tracked to specific files being sent for printing; the behaviour was fully reproducible.

as it happened, it was a hardware fault somewhere between the mainboard and the integrated SCSI card; installing a separate SCSI card and reconnecting the disks and backup tape device fixed the problem. (i did not have the budget for a new server, no.)

establishing the actual cause took me fucking weeks.

[–] [email protected] 4 points 9 months ago (1 children)

@m @sailor_sega_saturn
Builds failing, but only at the new office, and only if you tried to build from scratch.

Funny, the Windows network crew that operated the network and suddenly had to operate NFS over UDP on their network never really realized that their switches were only capable of half-duplex operation. But announced full-duplex. And these Linux boxes fully used that. And big UDP packets used by NFS under load got corrupted.

[–] [email protected] 4 points 9 months ago (1 children)

@m @sailor_sega_saturn
Took a f%cking nightshift of the CTO (German company, so the CTO had a PhD in C.S. and still remembered hacking C++ code) and the resident external IT consultant working on the C++ code, getting frustrated with the builds crashing and literally debugging the whole shebang, to discover that besides a ton of C++ memory bugs, we also had a network issue.

[–] [email protected] 4 points 9 months ago

@m @sailor_sega_saturn And philosophically, I've now spent a decade in "automatic data entry from 3rd parties", ETL (nice phrasing for industry level web scraping and data clean-up).

Literally, from what I've seen, nothing is unthinkable in IT. (Sometimes, as I've also done website development, one wonders what the f%ck the dear colleague was thinking while (s)he developed THAT. Or I want the drugs they were on, that must have been a great trip.)

[–] [email protected] 6 points 9 months ago (1 children)

"DEI hire" is arrogant. That's a great way to other people instead of owning the flaw. I appreciate the call for maturity in the field. Own your flaws.

[–] [email protected] 19 points 9 months ago (2 children)

the use of “DEI hire” is a shorthand for “i'm a massive racist shitweasel”

[–] [email protected] 17 points 9 months ago

It actually blows my mind that these people can see a bad thing happen, know exactly zero about it, and conclude “must have been a (insert slur) who did that”. They did the same shit with the Baltimore bridge collapse.

[–] [email protected] 12 points 9 months ago (1 children)

Hold on now, it could also be shorthand for "I'm a massive misogynist shitweasel".

[–] [email protected] 9 points 9 months ago* (last edited 9 months ago)

technically both, but “dei” became a mostly racist dogwhistle.

eta: also, people rarely express only a single type of prejudice.

[–] [email protected] 4 points 9 months ago* (last edited 9 months ago) (1 children)

(update: disproven by Crowdstrike’s blog post).

How do you mean? The current top post on the blog seems to mention .sys files as part of the problem very prominently.

Channel file "C-00000291*.sys" with timestamp of 0527 UTC or later is the reverted (good) version. Channel file "C-00000291*.sys" with timestamp of 0409 UTC is the problematic version.

[–] [email protected] 11 points 9 months ago (2 children)

https://www.crowdstrike.com/blog/technical-details-on-todays-outage/

This is not related to null bytes contained within Channel File 291 or any other Channel File.

That to me implied that the channel file wasn't actually necessarily corrupt (or as corrupt as people thought), but that it triggered a logic error. In particular this point implies that it wasn't from garbage zero bytes in the file.

(That said I could have worded this better, in my defense I'm sick in bed and only half thinking straight)

[–] [email protected] 6 points 9 months ago

I see, thank you.

[–] [email protected] 3 points 9 months ago (1 children)

yeah that phrase of "null bytes" reads like addressing one of the rumours

"what was the problem?" "well it wasn't null bytes" "so.. what was it then?" "have definitely eliminated null bytes from the running!"

[–] [email protected] 4 points 9 months ago

Aside, but I have been in some weird as heck discussions about how to phrase public blog posts. A few times I've had to point out some phrasing is so cryptic that no one will even know what we're talking about, and really there's nothing wrong with being a bit clearer about what we want to express. Sometimes you'd think companies want the audience to be bewildered and confused; and I'm not totally sure where this instinct comes from.

(Though in this case they probably don't want to share too much yet for stonk or legal reasons)