this post was submitted on 25 May 2024
113 points (98.3% liked)


Hello,

Today my washing machine completely broke down. My parents desperately tried to get it working, but it resulted in the circuit breakers tripping and my server (an old Dell Wyse thin client) experiencing a hard power off.

When I tried to turn it back on, I received these errors on the screen.

I ran a memtest, and it completed without any issues. I also created a disk image backup just in case.

Is there any chance of getting this machine running again, or is it only fit for disposal?

[–] [email protected] 114 points 6 months ago (3 children)

At first, I thought it was some sort of iot washing machine that stopped working due to software error lol.

[–] [email protected] 24 points 6 months ago (2 children)

I was wondering what a washing machine needed a TPM for. Better title: "machine fails to boot after power failure"

[–] [email protected] 2 points 6 months ago

enforcing their bootloader signature obviously. can't risk having consumers prod their washing machine software, they might hurt themselves.

[–] [email protected] 2 points 6 months ago

Good point - I've fixed it

[–] [email protected] 17 points 6 months ago (1 children)

Me too. Read the title, looked at the picture and concluded that some people have really fancy washing machines. 😃

[–] [email protected] 7 points 6 months ago* (last edited 6 months ago) (1 children)

I assumed OP plugged himself in some hidden serial port (like cars' obd2) and the washing machine had indeed a tpm to prevent bootleg/non original spare parts.

The human mind can be the deepest well of imagination sometimes. I'm a bit too good at that o.Ô

[–] [email protected] 2 points 6 months ago

I wish my washing machine had an ethernet port so that I could SSH into it.

[–] [email protected] 7 points 6 months ago (2 children)
[–] [email protected] 17 points 6 months ago (2 children)

The washing machine caused some sort of electric failure that damaged a thin client being used as a server

[–] [email protected] 7 points 6 months ago

Oh wow I'm dense

[–] [email protected] 3 points 6 months ago

a thin client being used as a server

"Look at me, I'm the server now."

[–] [email protected] 3 points 6 months ago
[–] [email protected] 82 points 6 months ago* (last edited 6 months ago) (3 children)

You haven't given us much information about the CPU. That is very important when dealing with Machine Check Errors (MCEs).

I've done a bit of work with MCEs and AMD CPUs, so I'll help with understanding what may be going wrong and what you probably can do.

I've done a bit of searching based on the microcode and the Dell Wyse thin client that you mentioned. From what I can gather, are you using a Dell Wyse 5060 Thin Client with an AMD Steppe Eagle GX-424 [1]? This is my assumption for the rest of this comment.

Machine Check Errors (MCEs) are hard to decipher without the right documentation. As far as I can tell from AMD's Data Sheet for the G-Series [2], this CPU belongs to family 16H.

You have two MCEs in your image:

  • CPU Core 0, Bank 4: f600000000070f0f
  • CPU Core 1, Bank 1: b400000001020103

Now, you can attempt to decipher these with a tool I used some time ago, MCE-Ryzen-Decoder [4]. As the name says, this tool only decodes MCEs of Ryzen architectures. MCE designs may not change much between families, but I wouldn't bank (pun not intended) on it, because the G-Series parts are embedded SoCs while the Ryzen CPUs are not. I gave it a shot anyway, and the tool spit out the following:

$ python3 run.py 04 f600000000070f0f
Bank: Read-As-Zero (RAZ)
Error:  ( 0x7)

$ python3 run.py 01 b400000001020103
Bank: Instruction Fetch Unit (IF)
Error: IC Full Tag Parity Error (TagParity 0x2)

Wouldn't bank (pun intended this time) on it though.
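For anyone who wants to look at the raw status values without that tool, here is a minimal sketch that splits them into fields. The flag bits in the top byte (VAL, OVER, UC, EN, MISCV, ADDRV, PCC) are part of the architectural x86 machine check layout. Treating bits 20:16 as the extended error code is my assumption for this family; it does line up with the 0x7 and 0x2 the tool printed above. The function name `decode_mci_status` is made up for this example.

```python
# Architectural flag bits in the top byte of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,
        # Assumption: the extended error code sits in bits 20:16.
        "error_code_ext": (status >> 16) & 0x1F,
    }


# The two values from the screenshot, keyed by bank number.
for bank, raw in ((4, "f600000000070f0f"), (1, "b400000001020103")):
    info = decode_mci_status(int(raw, 16))
    print(f"bank {bank}: flags={','.join(info['flags'])} "
          f"code={info['error_code']:#06x} ext={info['error_code_ext']:#x}")
```

Both values have UC set, i.e. the hardware reported them as uncorrected, and the Bank 4 value also has PCC and OVER set. What the error code itself means still has to be looked up in the BKDG for the right family.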

What you can do is to go through the AMD Family 16H's BIOS and Kernel Developer Guide [3] (Section 2.16.1.5 Error Code). From Section 2.16.1.1 Machine Check Registers, it looks like Bank 01 corresponds to the IC (Instruction Cache) and Bank 04 corresponds to the NB (Northbridge). This means that the CPU found issues in the NB in core 0 and the IC in core 1. You can go even further and check what those exact codes decipher to, but I wouldn't put in that much effort - there's not much you can do with that info (maybe the NB, but... too much effort). There are some MSRs that you can read out that correspond to errors of these banks (from Table 86: Registers Commonly Used for Diagnosis), but like I said, there's not much you can do with this info anyway.

Okay, now that the boring part is over (it was fun for me), what can you do? It looks like the CPU is a quad core CPU. I take it to mean that it's 4 cores * 2 SMT threads. If you have access to the Linux command line parameters [5], say via GRUB for example, I would try to isolate the two faulty cores we see here: core 0 and core 1. Add isolcpus=0,1 to see if the kernel boots. There's a good chance that only these two cores are failing, but others may also be faulty without their errors having been printed. It's worth a shot, but it may not work.

Alternatively, you can tell the kernel to disable MCE checks entirely and continue executing; this can be done with the mce=off command line parameter [6]. Beware that this means that you're now willingly running code on a CPU with two cores that have been shown to be faulty (so far). isolcpus will make sure that the kernel doesn't execute any "user" code on those cores unless asked to (via taskset for example).
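If one of these parameters helps and you want it to persist across reboots, it has to end up on the kernel command line in the bootloader config. Below is a rough sketch of that edit, assuming a Debian-style GRUB setup where the parameters live in a GRUB_CMDLINE_LINUX_DEFAULT="..." line in /etc/default/grub; your distro may do it differently. The helper `add_kernel_params` is hypothetical and only rewrites a string, it does not touch any file.

```python
import re


def add_kernel_params(grub_line, params):
    """Append kernel parameters to a GRUB_CMDLINE_LINUX_DEFAULT line.

    Parameters that are already present are not added a second time.
    """
    match = re.fullmatch(r'(GRUB_CMDLINE_LINUX_DEFAULT=)"(.*)"', grub_line.strip())
    if match is None:
        raise ValueError("not a GRUB_CMDLINE_LINUX_DEFAULT line")
    current = match.group(2).split()
    for param in params:
        if param not in current:
            current.append(param)
    joined = " ".join(current)
    return f'{match.group(1)}"{joined}"'


# Example: keep the two suspect cores out of general scheduling.
print(add_kernel_params('GRUB_CMDLINE_LINUX_DEFAULT="quiet"', ["isolcpus=0,1"]))
```

For a one-off test you don't need any of this: press `e` on the GRUB menu entry, append the parameter to the line starting with `linux`, and boot from there. Nothing is saved, so a bad parameter is gone after the next reboot.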

Apart from this, like others have pointed out, the red dots on the screen aren't a great sign. Maybe you can individually replace defective parts, or maybe you have to buy a new machine entirely. What I told you with this comment is to check whether your CPU still works with 2 SMT threads faulty.

Good luck and I hope you fix your server 🤞.

Edited to add: I have seen MCEs appear due to extremely low/high/fluctuating voltages. As others pointed out, your PSU or other components related to power could be busted.

[1] https://www.dell.com/support/manuals/en-us/wyse-5060-thin-client/5060_wie10_ug/system-specifications?guid=guid-cbeecec5-25ac-4103-8b4b-7d3a975e91f0&lang=en-us

[2] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/datasheets/52259_KB_G-Series_Product_Data_Sheet.pdf

[3] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/52740_16h_Models_30h-3Fh_BKDG.pdf

[4] https://github.com/DimitriFourny/MCE-Ryzen-Decoder

[5] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

[6] https://elixir.bootlin.com/linux/v6.9.2/source/Documentation/arch/x86/x86_64/boot-options.rst

[–] [email protected] 24 points 6 months ago

Amazing. I'm not OP and have no use for this info, but it was fun to learn it still.

[–] [email protected] 17 points 6 months ago

Respect 🙏

[–] [email protected] 5 points 6 months ago (1 children)

Yes, this is exactly the Dell Wyse 5060 with an AMD GX-424CC processor. This thin client is already old, which is why I decided to purchase a newer one with a better processor.

Anyway, thank you for your analysis! I learned a lot of new things. I will try to get it running with your advice and let you know how it goes.

However, this server will probably no longer be needed, since half of its cores are damaged. Previously, its computing power was fully utilized (the load was almost always 4.0), and it handled my tasks very well with four cores. Therefore, I cannot imagine using it with only half of its power available 😁

[–] [email protected] 3 points 6 months ago (1 children)

Are you planning to scrap the CPU? I may be interested in it as I find faulty hardware fun to experiment on.

[–] [email protected] 2 points 6 months ago (1 children)

I will no longer use this device, rather I will throw it away, because I will have no more use for it.

[–] [email protected] 3 points 6 months ago (1 children)

Would you consider sending it to Austria? I'd pay shipping charges (if it's within reason lol). If so, you can send me an email at: sneela-hwelemmy92fd [at] port87.com

[–] [email protected] 2 points 6 months ago (1 children)
[–] [email protected] 2 points 6 months ago

Thank you, I'll send you an email within a day.

[–] [email protected] 41 points 6 months ago* (last edited 6 months ago) (2 children)

There’s so much incompetent advice here.

CPU is fine.

Linux is booting and tries to connect to TPM (trusted platform module).

It has nothing to do with the graphics card. The fact that it is booting means the CPU is most likely unaffected.

TPM is most likely fried.

Linux can run without TPM. Plenty of old boards were shipped with TPM socket, but without TPM itself.

Best option is to get the manual for your motherboard and pull out that TPM.

Any passwords stored there are lost, if you used it.

If TPM is fine, then board pathway to it may be damaged. If that’s the case and you really need it, then board replacement is your option. But that’s only after good TPM was tried.

[–] [email protected] 26 points 6 months ago

In some cases a wipe/reset of the TPM from the BIOS might do it as well, if it's still functional but scrambled.

[–] [email protected] 3 points 6 months ago

This is a thin client. It does not have a removable TPM module, so I cannot physically "pull out" that TPM.

[–] [email protected] 19 points 6 months ago* (last edited 6 months ago) (1 children)

You're able to run MemTest? That'd suggest it's not actually fried if it can still run things.

Check your BIOS/UEFI to see if Secure Boot was re-enabled. If your CMOS battery died and you didn't notice, your machine config could have reset to its default values during the power loss.

[–] [email protected] 3 points 6 months ago

Unfortunately no, Secure Boot is not enabled

[–] [email protected] 14 points 6 months ago

I honestly thought your washing machine was throwing the MCE when i opened the post 😹

[–] [email protected] 10 points 6 months ago (2 children)

With the hw MCE errors, it's probably toast.

You could try reseating or swapping the ram around, if it's socketed

[–] [email protected] 1 points 6 months ago

Yeah, I would think memory as well due to the screen artifacts in that low res mode. (That depends on how x86 memory is mapped these days, I suppose.)

[–] [email protected] 1 points 6 months ago

Alright, I'll give it a try. Thanks for the suggestion!

[–] [email protected] 9 points 6 months ago (1 children)

Replace psu.

If your tpm is not integrated into the cpu, replace motherboard.

[–] [email protected] 2 points 6 months ago (5 children)

Unfortunately, I don't have another power supply with the same specifications on hand, as this thin client is the only Dell device in my home 😕. Replacing the motherboard is not cost-effective for such old hardware. I'll just buy a new thin client.

[–] [email protected] 2 points 6 months ago (1 children)

Perhaps consider investing in a small UPS device as well, it might help out in any future events like this.

[–] [email protected] 2 points 6 months ago

The device was already protected by a UPS, but it failed and shut down.

[–] [email protected] 5 points 6 months ago

Can you run a live CD on the machine? You mentioned Memtest but I’m wondering if an Ubuntu one would work.

[–] [email protected] 4 points 6 months ago

A cursory web search suggests your CPU may have been damaged. Can’t say for sure as I don’t know jack about modern hardware. My indepth knowledge basically ends at x86.

I found this while searching the error codes in the photo. Maybe it will be of some help.

https://bbs.archlinux.org/viewtopic.php?id=266210

[–] [email protected] 4 points 6 months ago (2 children)

Not an expert, but I don’t think replacing the TPM chip is an option.

How did you run a memory test? Do you get a command line?

[–] [email protected] 5 points 6 months ago (1 children)

AFAIK, TPMs are usually socketed.

[–] [email protected] 3 points 6 months ago

I ran memtest from a flash drive (ventoy). and no, I don't get the command line

[–] [email protected] 4 points 6 months ago* (last edited 6 months ago)

Run the BIOS self tests.

Something's definitely broken (TPM errors, self test errors, graphical artifacts), but I can't tell what from the image. I would guess motherboard problems, or a subtly damaged CPU.

Could also be more than one problem in the case of overvoltage (worst case consequence of PSU damage), or intermittent failure from undervoltage (should be fixed with a new PSU).

[–] [email protected] 2 points 6 months ago

My guess is this reset your BIOS and flipped a TPM setting on. Maybe see if you can disable all TPM/Secure Boot settings and see if it carries on.

[–] [email protected] 2 points 6 months ago (1 children)

Does it do the same thing every time? Is there a working bios or uefi menu?

[–] [email protected] 1 points 6 months ago (1 children)
[–] [email protected] 1 points 6 months ago

Maybe try disabling microcode updates on boot?

You can use the dis_ucode_ldr kernel parameter according to the Debian wiki:

https://wiki.debian.org/Microcode#Working%20around%20boot%20problems%20caused%20by%20microcode%20updates

I would think microcode is checked for corruption before loading, but that's just an assumption.

[–] [email protected] 2 points 6 months ago (1 children)

It's just the demons doing their thing /s

[–] [email protected] 3 points 6 months ago (1 children)

First, you gotta check for ultraviolet, ghost writing, and freezing temps.

(I really hope somebody gets that reference)

[–] [email protected] 1 points 6 months ago

All the red dots look like some kind of GPU failure. I think the TPM error is a symptom of a bigger hardware issue that is insurmountable.

A live cd or usb might help as others have stated.
