this post was submitted on 25 Mar 2024
58 points (100.0% liked)

Reddthat Announcements

641 readers

Main Announcements related to Reddthat.

founded 1 year ago

Basically, I'm sick of these network problems, and I'm sure you are too. We'll be migrating everything: pictrs, frontends & backends, database & webservers, all onto a single server at OVH.

First it was a CPU issue, so we worked around that by making sure pictrs was on another server, leaving just enough CPU to keep us all okay. Everything was fine until the spammers attacked. Then we couldn't process the activities fast enough, and now we can't catch up.

We were having constant network dropouts/lag spikes where all the network connections got "pooled", with CPU steal of 15%. So we bought more vCPUs and threw resources at the problem. That temporarily fixed it, but our "NVMe" VPS, which housed our database and Lemmy applications, was still showing an IOWait of 10-20% half the time. Unbeknownst to me, it was not IO related but network related.
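If you want to watch the same two symptoms on your own box, here's a minimal sketch using psutil. This is an assumption on my part, not how we actually monitored it, and the fields are Linux-specific:

```python
# Watch CPU steal and iowait, the two numbers that kept misleading us.
# Requires: pip install psutil  (steal/iowait fields exist on Linux only)
import psutil

while True:
    cpu = psutil.cpu_times_percent(interval=5)  # averages over a 5s window
    print(f"steal={cpu.steal:5.1f}%  iowait={cpu.iowait:5.1f}%  idle={cpu.idle:5.1f}%")
    # steal around 15% -> the host is overselling CPU;
    # iowait of 10-20% looks like a disk problem, but (as we learned)
    # the real bottleneck can be the network instead.
```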

So we moved the database off to another server, but unfortunately that caused another issue (the unintended side effects of cheap hosting?). Now we have 1 main server accepting all network traffic, which then has to contact the NVMe DB server and the pict-rs server as well, then send all that information back to the users. This was part of the network problem.
Adding backend & frontend Lemmy containers to the pict-rs server helped alleviate this, and that is what you are seeing at the time of this post. Now a good 50% of the required database and web traffic is split across two servers, which keeps our servers from being completely saturated with requests.

On top of the recent nonsense, it looks like we are limited to 100Mb/s, which is roughly 12MB/s. So downloading a 20MB video via pictrs requires the current flow (a rough back-of-the-envelope follows the list):

  • User requests the image via Cloudflare.
  • (It's not already cached, so Cloudflare requests it from our servers.)
  • Cloudflare proxies the request to our server (app1).
  • Our app1 server connects to the pictrs server.
  • Our app1 server downloads the file from pictrs at a maximum of 100Mb/s.
  • At the same time, the app1 server is uploading the file via Cloudflare to you at a maximum of 100Mb/s.
  • During this point in time our connection is completely saturated and no other network queries can be handled.
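A quick back-of-the-envelope for that flow, using only the numbers above (illustrative, not a benchmark):

```python
# Rough numbers for the flow above (illustrative only).
link_mbps = 100               # the uplink cap, in megabits per second
link_MBps = link_mbps / 8     # = 12.5 MB/s, the "roughly 12MB/s" above
video_MB = 20                 # the example video

seconds_per_direction = video_MB / link_MBps   # ~1.6 s
print(f"~{seconds_per_direction:.1f}s to pull the file from pictrs, "
      f"and the same again to push it back out through Cloudflare")
# For that whole window the link is saturated, so every other
# request hitting app1 is stuck waiting behind it.
```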

This is an example of the network issue I discovered after moving to the multi-server setup. It is, of course, not a problem when you have everything on one beefy server.


Those are the broad strokes of the problems.

Thus we are completely ripping everything out and migrating to a HUGE OVH box. I say huge in capital letters because the OVH server is $108/m and has 8 vCPU, 32GB RAM, & 160GB of NVMe. That amount of RAM allows the whole database to fit into memory. If this doesn't help, then I'd be at a loss as to what will.
Currently (assuming we kept paying for the standalone Postgres server) our monthly costs would have been around $90/m ($60/m main + $9/m pictrs + $22/m db).
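The same cost figures in a tiny snippet, purely for clarity (nothing new here):

```python
# Monthly cost comparison, using the figures above (USD).
old = {"main": 60, "pictrs": 9, "db": 22}
new_ovh = 108

print(f"old multi-server total: ${sum(old.values())}/m")    # $91/m, i.e. "around $90/m"
print(f"new single OVH box:     ${new_ovh}/m")
print(f"difference:             ${new_ovh - sum(old.values())}/m more")  # $17/m more
```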

Migration plan:

The biggest downtime will be the database migration: to ensure consistency we need to take the database offline, which is just simpler than trying to keep it in sync during a live migration.

DB:

  • stop everything
  • start postgres
  • take a backup (20-25 mins)
  • send that backup to the new server (5-6 mins, limited to 12MB/s)
  • restore (10-15 mins)

pictrs:

  • syncing the file store across to the new server

app(s):

  • regular deployment

Which is the same process I recently did here, so I have the steps already cemented in my brain. As you can see, taking the backup ends up taking longer than the restore. That's because, when I tested the restore process on our OVH box, we were nowhere near any IO/CPU limits and it was, to my amazement, seriously fast. Now we'll have heaps of room to grow, with a stable donation goal for the next 12 months.
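For the curious, here is a minimal sketch of how those steps could be scripted. Every hostname, path, database name and compose service name below is a placeholder rather than our real setup, and the actual run may differ:

```python
#!/usr/bin/env python3
"""Rough sketch of the migration steps above. All hostnames, paths and
database names are placeholders; the real run may differ."""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. stop everything except postgres (service names are placeholders)
run(["docker", "compose", "stop", "lemmy", "lemmy-ui", "pictrs"])

# 2. take a custom-format dump (the ~20-25 minute step)
run(["pg_dump", "-Fc", "-U", "lemmy", "-d", "lemmy", "-f", "/tmp/lemmy.dump"])

# 3. ship it to the new box (~5-6 minutes at ~12MB/s)
run(["rsync", "-avz", "/tmp/lemmy.dump", "new-server:/tmp/"])

# 4. restore on the new box with parallel jobs (the ~10-15 minute step)
run(["ssh", "new-server", "pg_restore", "-U", "lemmy", "-d", "lemmy",
     "-j", "4", "/tmp/lemmy.dump"])

# pictrs: sync the file store across (can run alongside the DB steps)
run(["rsync", "-a", "/var/lib/pictrs/", "new-server:/var/lib/pictrs/"])

# apps: regular deployment on the new server
run(["ssh", "new-server", "docker", "compose", "up", "-d"])
```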

See you on the other side.

Tiff

top 32 comments
[–] [email protected] 13 points 7 months ago* (last edited 7 months ago) (3 children)

I managed to streamline the exports and syncs so we could perform them concurrently, allowing us to finish in just under 40 minutes! Enjoy the new hardware!

So it begins: (Federation "Queue")
[Image: federation queue graph showing an upward trend, then down, then slightly back up again]

[–] [email protected] 4 points 7 months ago (2 children)

OMG, posts load instantly now; they used to take 3 to 15 seconds. I'm on the US East Coast, for reference.

[–] [email protected] 4 points 7 months ago

That's what I love to hear! 🎉

[–] [email protected] 4 points 7 months ago (1 children)

I'm glad you mentioned this. It's snappy!

[–] [email protected] 1 points 7 months ago

🐊 Snap Snap!

[–] [email protected] 2 points 7 months ago* (last edited 7 months ago)
[–] [email protected] 2 points 7 months ago (1 children)
[–] [email protected] 7 points 7 months ago* (last edited 7 months ago) (1 children)

That's when the US timezones wake up. We physically cannot accept more than 3 requests per second, "physically" being the actual network limits (3 x 287ms = 861ms; we used to be 930ms+. The server move got us 21ms closer!). LW generates more than 3 activities per second during US "awake" time zones, so we have a period of 8 hours where we need to catch up.

Like I said in our forcing-federation post, there isn't anything to worry about: we are completely up to date on posts and comments because of our sync script.

It's just the sequential nature of Lemmy. I'm going to test a new container in the next 12 hours which removes the blocking metadata generation from the acceptance of activities. That way we can guarantee at least 3 activities a second.

Realistically, that is a minor fix and it won't help with those graphs in the long term. We will need parallel sending for it to ever scale.
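Rough numbers behind that, as a sketch using the ~287ms round trip mentioned above (illustrative only):

```python
# Sequential federation: each activity waits for the previous one's round trip.
rtt_s = 0.287                                                       # ~287 ms per request, as above
print(f"sequential ceiling: ~{1 / rtt_s:.1f} activities/second")    # ~3.5/s

# With N activities in flight at once, the ceiling scales roughly linearly
# (ignoring any server-side processing cost):
for n in (2, 4, 10):
    print(f"{n} parallel senders: ~{n / rtt_s:.1f} activities/second")
```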

On a side note: while we were on our old server using our forcing-federation script, we had it set to 10 parallel requests and the server didn't even worry about it; I saw no increase in server load. Which is good news for the lemmyverse in general, as everyone will be able to accept the new parallel sending without needing to increase their hardware.

Tiff

[–] [email protected] 4 points 7 months ago (1 children)

Thank you for the detailed answer!

There isn’t anything to worry about because we are completely up-to-date on posts and comments because of our sync script.

Sorry, it's a bit late for me on this side, but if I understand correctly, posts and comments are indeed up-to-date, but upvotes are synchronized later, is this correct?

Thank you for the work as always!

[–] [email protected] 3 points 7 months ago* (last edited 7 months ago)

but upvotes are synchronized later

Correct. All votes are synchronised eventually.

[–] [email protected] 8 points 7 months ago

Great news! Thank you so much for this!

[–] [email protected] 4 points 7 months ago (1 children)
[–] [email protected] 3 points 7 months ago
[–] [email protected] 4 points 7 months ago (1 children)

until the ~~fire nation~~ spammers attacked.

Hehe

[–] [email protected] 1 points 7 months ago
[–] [email protected] 4 points 7 months ago (1 children)
[–] [email protected] 2 points 7 months ago

/gif legen-dary

[–] [email protected] 3 points 7 months ago (1 children)

Good luck for the migration!

[–] [email protected] 2 points 7 months ago

It starts... Soon. 😎

[–] [email protected] 3 points 7 months ago (1 children)

Idk crap about lemmy backend stuff, I'm just here for Legends of the Hidden Temple.

[–] [email protected] 2 points 7 months ago (1 children)

Glad the link worked! It's always risky posting mp4 links. I'll be glad once the new frontend patches come through so that, by default, it shows an image of the video (iirc).

[–] [email protected] 3 points 7 months ago

FWIW I didn't know it was a video until you said something haha. The video did work when I clicked on it, though.

[–] [email protected] 2 points 7 months ago (1 children)

PS. Everyone enjoying this new wide layout?

[–] [email protected] 6 points 7 months ago (1 children)

Did anything change? If so, I didn't notice it ha ha

[–] [email protected] 2 points 7 months ago

I changed the default theme to the "Compact" version, which makes it widescreen, but if you've set your own theme it won't change. If you open reddthat.com in a private browser you should see it.

[–] [email protected] 2 points 7 months ago (2 children)

I've noticed some issues since moving to reddthat. Glad to see a fix is being worked on, keep up the good work :)

[–] [email protected] 3 points 7 months ago

If you moved recently, you are quite unlucky with the timeframe ha ha

[–] [email protected] 2 points 7 months ago (1 children)

How are your issues now? 🧐

[–] [email protected] 2 points 7 months ago

Things seem okay now, no weird behaviour like random logouts and communities not loading 😁

[–] [email protected] 2 points 7 months ago* (last edited 7 months ago) (1 children)

Did I ever let you know that I pissed off a CIA asset that launders Russian oligarch monies using the FBI, Filipino organized crime, the Albanians, and other US based law enforcement via FedEx, UPS, USPS, and Joann's Fabrics?

Should I contribute more monthly to cover their probable sabotaging of reddthat?

[–] [email protected] 4 points 7 months ago* (last edited 7 months ago)

Between you and me, you personally probably don't need to donate more in the short term, but I'm not going to stop you! 😛

We need about A$40-50/month extra to cover everything now. We have A$77.22 set up in recurring donations on OpenCollective, and our server bills alone are A$115 (converted from US$74.80), plus the domain renewal (€1.50/m) and Wasabi storage (~US$8/m). This will be updated in the Funding post. With the money on Ko-Fi, OpenCollective and the recurring donations on OpenCollective, we have at least 12 months of runway before we run out of money, so it isn't critical at the moment.

Thanks! 🤎

Edit: Actual prices

[–] [email protected] 2 points 7 months ago

Thank you for all the hard work!