this post was submitted on 21 Aug 2023

Increased monitoring & video uploads... Why we had an issue with the site in the last hour(s) (reddthat.com)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]

Apparently, the thing I do while I'm meant to be sleeping is break Reddthat! Sorry, folks. Here's a postmortem on the issue.

What happened

I wanted a way to view the metrics of our lemmy instance. This is a feature you need to turn on, called Prometheus Metrics. This isn't a new thing for me; I tackle these issues at $job daily.

After reading the documentation here on prometheus, it looks like it is a compile-time flag, meaning the pre-built packages that are generated do not contain the metrics endpoints.

No worries, I thought: I was already building the lemmy application and container successfully as part of getting our video support.

So I built a new container with the correct build flags, turned on my dev server again, deployed, and tested extensively.

graph of containers showing some data

We now have interesting metrics for easy diagnosis. I tested posting comments and uploading images, and tried out the new video upload!

So we'd done our best, and we deployed to prod.

ohno.webm

As you know from the other side... it didn't go too well.

After 2 minutes of huge latencies my phone lit up with a billion notifications, so I knew something wasn't working... Initial indications showed super high CPU usage in the Lemmy containers (the ones I had newly built!). That was the first minor outage / period of high latencies, around 6:30-7:00 UTC, and we "fixed" it by rolling back the version and confirming everything was back to normal. I went and had a bite to eat.

not_again.mp4

simpsons skinner meme saying, I'm not pushing a bad build to production am i? No, it must be the documentation that is wrong

Foolhardily, I attempted it again, this time with: clearing out the build cache, building directly on the server, more testing, more builds, more testing, and more testing.

I opened up 50 terminals and basically DoS'd the (my) dev server with GET and POST requests, in an attempt to generate enough load to reproduce the issue, so that I could figure out where I had gone wrong in the first place.
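For anyone who would rather script this than open 50 terminals, here is a minimal sketch of the same idea in Python, using only the standard library. It is not what was run during the incident; the function names (`fetch`, `run_load`, `summarize`) and the default of 50 workers are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10.0):
    """Issue one GET; return (status, seconds). status is None if no response arrived."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx/5xx
    except OSError:
        status = None  # refused, reset, DNS failure, or timed out
    return status, time.perf_counter() - start


def run_load(urls, workers=50, request=fetch):
    """Send one request per URL using `workers` threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(request, urls))


def summarize(results):
    """Count failures (no response or 5xx) and report the 95th percentile latency."""
    if not results:
        return {"requests": 0, "failures": 0, "p95_seconds": 0.0}
    latencies = sorted(seconds for _, seconds in results)
    failures = sum(1 for status, _ in results if status is None or status >= 500)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {"requests": len(results), "failures": failures, "p95_seconds": p95}
```

Something like `summarize(run_load([url] * 1000))` against a dev server URL gives a rough latency picture. As this postmortem shows, synthetic load of this kind can still miss whatever real production traffic triggers.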

Nothing would trigger the issue, so I continued along with my validation and "confirmed everything was working".

Final Issue

oogway meme with "fuckit = finally inner peace"

So we are deploying to production again, but we know it might not go so well, so we are doing everything we can to minimise the issues.
At this point we've completely ditched our video upload patch and gone with a completely plain 0.18.4 with just the metrics build flag, to minimise the possible causes.

NOPE

an army soldier using a flamethrower with text superimposed: my hopes and dreams, lemmy

At this point I accepted the downtime and attempted to work on a solution while on fire. (I would have added the everything-is-fine meme, but we've already got a few here.)
Things we know:

- My lemmy app build is not actioning requests fast enough.
- The error relates to a timeout happening inside lemmy, which I assumed was on the connection to postgres, because there were about 15 postgres processes in the process of performing "Authentication" (this should be INSTANT).
- The postgres logs show an error with concurrency.
  - But this error doesn't happen with the devs' packaged app.
- postgres isn't picking up the changes to the custom configuration.
  - Was this always the case?!?!?!? Are all the Lemmy admins stupid? Or are we all about to have a bad time?
- The docker container postgres:15-alpine now points to a new SHA! So a new release happened, and it was auto-updated, ~9 days ago.

So, I fsck'd with postgres for a bit, and attempted to get it to use our customPostgresql configuration. This was the cause of the complete ~5 minute downtime. I took down everything except postgres to perform a backup, because at this point I wasn't too certain what was going to be the cause of the issue.
You can never have too many backups! If you don't agree, let's have a civil discussion over espressos in the comments.
I brought postgres back up with max_connections set to what it should have been (it was at the default? 100), and prayed to a deity that would alienate the least amount of people from reddthat :P
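One way to check whether postgres has actually picked up a custom configuration is to ask the running server where each setting came from, via the `pg_settings` view. The sketch below assumes a DB-API cursor from a driver that uses `%s` placeholders (psycopg2, for example); the helper names are made up for illustration, and this is not the check that was run during the incident.

```python
def setting_source(cursor, name):
    """Return (value, source, sourcefile) for one setting on the running server."""
    cursor.execute(
        "SELECT setting, source, sourcefile FROM pg_settings WHERE name = %s",
        (name,),
    )
    row = cursor.fetchone()
    if row is None:
        raise KeyError(name)
    return tuple(row)


def uses_default(cursor, name):
    """True if the value is the built-in default, i.e. no config file overrode it."""
    _, source, _ = setting_source(cursor, name)
    return source == "default"
```

If `uses_default(cursor, "max_connections")` returns True after a restart, the custom file was most likely never loaded. When the source is `configuration file`, the `sourcefile` column names the file that supplied the value.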

To no avail. Even with our PostgreSQL tuning, the lemmy container I built was not performing under load as well as the developers' container.

So I pretty much wasted 3 hours of my life, and reverted everything back.

Results

a 3 picture meme, all with lemmy admins on each picture. 1 picture of a kid riding a bike, 1 picture of the kid poking a stick in the front wheel, 1 picture of the kid on the ground

Not all lemmy builds are equal. Idk what the devs are doing, but when you follow the docs you would hope it works. I'm probably missing something...
Postgres released a new 15-alpine docker container 9 days ago. Pretty sure no one noticed, but did it break our custom configuration? Or was our custom configuration always busted? (with reference to the lemmy-ansible repo here)
I need a way to replay logs from our production server against our dev environment to generate the correct amount of load to catch these edge cases. (Anyone know of anything?)
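As a starting point for the replay idea, here is a rough Python sketch that turns access-log lines into a timed replay plan. It assumes the logs are in the common/combined format that nginx writes by default (an assumption, the post doesn't say what the production logs look like), and it keeps only GET requests by default, since blindly replaying POSTs would create real content. The names `parse_line` and `replay_plan` are hypothetical.

```python
import re
from datetime import datetime

# Matches the timestamp and request line of a common/combined log entry, e.g.
# 203.0.113.7 - - [21/Aug/2023:06:30:00 +0000] "GET /api/v3/site HTTP/1.1" 200 512
LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"')


def parse_line(line):
    """Return (timestamp, method, path), or None if the line does not match."""
    match = LINE.search(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return when, match.group("method"), match.group("path")


def replay_plan(lines, speedup=1.0, methods=("GET",)):
    """Build [(delay_seconds, method, path)] relative to the first kept request."""
    entries = [e for e in map(parse_line, lines) if e and e[1] in methods]
    entries.sort(key=lambda e: e[0])
    if not entries:
        return []
    start = entries[0][0]
    return [
        ((when - start).total_seconds() / speedup, method, path)
        for when, method, path in entries
    ]
```

Each tuple says how long after the start of the replay to send the request. A `speedup` above 1 compresses the timeline to push more load than production saw. Actually sending the requests against the dev server (and dealing with authentication) is left out of the sketch.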

At this point I'm unsure of a way forward to build the lemmy containers so that they perform the same as the developers' ones. So instead of being one of the first instances to be able to upload video content, we might have to settle for video content when it comes. I've been chatting with the pict-rs dev, and I think we are on the same page relating to a path forward. There should be some easy wins coming in the next versions.

Final Finally

reddthat looking at metrics, while stability is looking at reddthat in disgust

I'll be choosing stability from now on.

Tiff

Notes for other lemmy admins / me for testing later. postgres:15-alpine:

2 months ago sha: sha256:696ffaadb338660eea62083440b46d40914b387b29fb1cb6661e060c8f576636
9 days ago sha: sha256:ab8fb914369ebea003697b7d81546c612f3e9026ac78502db79fcfee40a6d049
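A floating tag like postgres:15-alpine can move to a new image at any time, which is how the auto-update slipped in. Docker also accepts references of the form `image:tag@sha256:...`, which pin the exact image. A small sketch of building such a reference and spotting drift follows; the helper names are illustrative, and the two digests are the ones listed above.

```python
import re

OLD_DIGEST = "sha256:696ffaadb338660eea62083440b46d40914b387b29fb1cb6661e060c8f576636"
NEW_DIGEST = "sha256:ab8fb914369ebea003697b7d81546c612f3e9026ac78502db79fcfee40a6d049"


def pinned_reference(image, digest):
    """Return an image reference pinned to one digest, e.g. for a compose file."""
    if re.fullmatch(r"sha256:[0-9a-f]{64}", digest) is None:
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{image}@{digest}"


def tag_has_moved(recorded_digest, current_digest):
    """True if the tag now resolves to a different image than the one recorded."""
    return recorded_digest != current_digest
```

For example, `pinned_reference("postgres:15-alpine", OLD_DIGEST)` gives a reference that keeps resolving to the two-month-old image, which makes it possible to test the old and new images side by side.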

[–] [email protected] 6 points 1 year ago (1 children)

As a fellow admin, this was very entertaining thanks lmao

[–] [email protected] 2 points 1 year ago

I'm glad ☺️

[–] [email protected] 4 points 1 year ago (1 children)

Hi, just wanted to say thanks for all this.

I am a software engineer and I really regret not having the time to host my own server, reading your posts is the closest to the experience I can get right now.

I hope this is as fun as it looks!

[–] [email protected] 3 points 1 year ago

I'm glad you can live vicariously through me! Please continue to do so.

There will be plenty more no doubt, and when I need other sysops I might come a knocking :)

Honestly the biggest time sinks/headaches are really problems I make for myself. If I didn't bother with improvements or fixing the small issues then being the admin wouldn't amount to more than dealing with reports. And that can get tiresome.

It took about 10 minutes for me to actually get Reddthat running, as there are pre-made ansible scripts for automating the deployment process. It's making improvements and scaling the services way past the original scripts that are the real features.

In saying all this, fixing the small issues, making improvements, and continually upgrading reddthat is entertaining at the least. 😉

Tiff

PS. Once I make sure our ansible scripts are in a nicer state I'll tell people where the git repository is. So when people are interested they can go see how we do it.