this post was submitted on 05 Jul 2023
42 points (68.4% liked)
Fediverse
I'll tell you a secret: they care enough to scrape everything. Not just the fediverse, but every single website that's publicly accessible. And that's not some future threat; it has been reality at least since Google became popular. Do yourself a favor and look at the server logs of an average web host and you will find a whole bunch of crawlers. Some belong to search engines, some serve other purposes.
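To make that concrete: spotting crawlers in an access log is trivial, because most well-behaved bots identify themselves in the user-agent field. A minimal sketch, using hypothetical log lines in the standard "combined" format (the IPs, paths, and entries are invented for illustration):

```python
import re

# Hypothetical access-log lines in the common "combined" format.
LOG_LINES = [
    '66.249.66.1 - - [05/Jul/2023:10:00:01 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '40.77.167.1 - - [05/Jul/2023:10:00:02 +0000] "GET /post/42 HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.7 - - [05/Jul/2023:10:00:03 +0000] "GET /post/42 HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"',
]

# Substrings that commonly appear in crawler user-agent strings.
BOT_MARKERS = ("bot", "crawler", "spider")

# The user agent is the last quoted field of a combined-format line.
UA_RE = re.compile(r'"([^"]*)"$')

def crawler_hits(lines):
    """Return the user-agent strings of requests that look like crawlers."""
    hits = []
    for line in lines:
        m = UA_RE.search(line)
        if m and any(marker in m.group(1).lower() for marker in BOT_MARKERS):
            hits.append(m.group(1))
    return hits

print(crawler_hits(LOG_LINES))  # the Googlebot and bingbot entries
```

This only catches bots that announce themselves, of course; crawlers that spoof a browser user agent need IP-range or behavioral detection instead.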
I wrote my M.Sc. thesis on specialized crawlers (back in 2015), and you wouldn't believe how much research has gone into the field and how effective modern crawlers are at finding every single thing that ever got uploaded to the net. The only thing needed is enough hardware to throw at the problem, and that's exactly what Meta, Google, Microsoft, Amazon and all the others have. As a rule of thumb: if archive.org or your favorite search engine has indexed it, everyone else either has it as well or has access to someone they can buy it from. There is no such thing as unscraped content on the internet (unless you lock it behind access restrictions, and those would apply just the same to federation).
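The core algorithm behind all of this is small; the hardware is what makes it scale. A minimal sketch of the frontier-plus-visited-set loop at the heart of every crawler, run here over a tiny in-memory link graph (the URLs and link structure are invented stand-ins for real HTTP fetching and link extraction):

```python
from collections import deque

# A tiny hypothetical "web": page URL -> links found on that page.
# A real crawler would fetch each URL and extract links from the HTML.
WEB = {
    "https://example.org/":  ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/c"],
    "https://example.org/b": ["https://example.org/"],
    "https://example.org/c": [],
}

def crawl(seed):
    """Breadth-first crawl: pop a URL from the frontier, mark it visited,
    enqueue its outgoing links. Return the set of pages reached."""
    frontier = deque([seed])
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in WEB.get(url, []):
            if link not in visited:
                frontier.append(link)
    return visited

print(sorted(crawl("https://example.org/")))  # all four pages are reachable
```

Everything else in crawler research, from politeness delays and robots.txt handling to frontier prioritization and deduplication, is refinement of this loop; sharding the frontier and visited set across many machines is where the big players' hardware comes in.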
Edit: I don't have access logs enabled on my instance, and obviously I can't see what happens on other instances, but I would bet that this very thread gets picked up by at least five different crawlers before the day is over.
Yeah, I know. My own access logs are disabled on all the VS I have control over, too. I still feel that something, even if it's purely symbolic, is better than nothing.