this post was submitted on 15 Jun 2023
190 points (93.6% liked)
Programming
17658 readers
179 users here now
Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!
Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.
Hope you enjoy the instance!
Rules
Rules
- Follow the programming.dev instance rules
- Keep content related to programming in some way
- If you're posting long videos try to add in some form of tldr for those who don't want to watch videos
Wormhole
Follow the wormhole through a path of communities [email protected]
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
I agree, hearing about scaling issues so early into adoption is concerning. Lemmy advocates say "horizontal scaling is already built in! just add more instances!", but that doesn't explain the problem.
It's all just text! By my guess too, handling text alone a server should easily support a thousand concurrent users, and hundreds of thousands of daily users. A RasPI should handle thousands. I've heard the bottleneck is the database? In that case Rust is not to blame, Postgres is.
But my fear is that the data structures are implemented in a trivial way. If you have a good reddit-sized thread with a thousand comments, but you store each comment as a separate database entry, then every pageview will trigger a thousand database lookups! The way I imagined making a reddit clone is that I would store the comments as a flat list with some tree data on top, such that serving a single page with 1000 comments is no different that streaming a 100K text file. I'll go take a look how Lemmy does it currently once I get the courage!
No it wouldn't, that's called the N+1 query problem and it can be avoided by writing more efficient queries
Could you explain more how this works? I see how you should be reducing the number of SQL queries from N+1:
down to 1 query:
(Or something like this, I don't know SQL sorry). But wouldn't the database still have to lookup each comment line record on the backend? Yes, they are all indexed and hashed, but if you have a thousand comments, or even ten thousand (that reddit handles perfectly fine!) - isn't 10000 fetches from a hashtable still slower than fetching a 10000-long array? And what if you've been running your reddit clone for years and accumulated so many gigs of content that they don't fit in memory and have to be stored on disk. Aren't you looking at 10000 disk reads in worst case scenario with a hashtable?
You've got the right idea with your SQL example, that's pretty much exactly what N+1 would look like in your query logs.
This can happen when using an ORM, if you're not careful to avoid it. Many ORMs will query the database on attribute access, in a way that is not particularly obvious:
Given this simple python-ish example, many ORMs will let you do something like this:
Although
comment.author
looks like a simple attribute access, the ORM has to issue a DB query behind the scenes. As a dev, especially one learning a new tool, it's not particularly obvious that this is happening, unless you've got some query logging that you're likely to notice during development.A couple of fixes are possible here. Some ORMs will provide some method for fetching the comments via JOIN in the initial query. e.g.
post = Post.objects.get(id=11).select_related("comments")
instead of justpost = Post.objects.get(id=11)
. Alternately, you could fetch the Post, then do another query to grab all the comments. In this toy example, the former would almost certainly be faster, but in a more complex example where you're JOINing across multiple tables, you might try breaking the query up in different ways if you're really trying to squeeze out the last drop of performance.In general, DB query planners are very good at retrieving data efficiently, given a reasonable query + the presence of appropriate indexes.