this post was submitted on 04 Dec 2024
Technology
A nice place to discuss rumors, happenings, innovations, and challenges in the technology sphere. We also welcome discussions on the intersections of technology and society. If it’s technological news or discussion of technology, it probably belongs here.
Remember the overriding ethos on Beehaw: Be(e) Nice. Each user you encounter here is a person, and should be treated with kindness (even if they’re wrong, or use a Linux distro you don’t like). Personal attacks will not be tolerated.
This community's icon was made by Aaron Schneider, under the CC-BY-NC-SA 4.0 license.
But they aren't. They're not after ActivityPub specifically. They're scraping the whole internet, most of them with clearly identifiable bot User Agents. So I routinely block their bots, because the AI ones usually hit you multiple times a second, non-stop. If they started running fake ActivityPub nodes, they would not be scraping as a bot, and they would have to want fediverse data specifically. It's important to note, though, that an ActivityPub node doesn't "collect" data: it subscribes (to Mastodon users/hashtags or to communities) and then has new data delivered to it. So they wouldn't get the old stuff.
Having said that, I've seen some obvious bots using genuine browser user agents on IP addresses from certain very large Chinese companies. For those I just blocked their whole AS number.
So, most modern ActivityPub servers backfill threads and profiles. My single-user instance processes 30,000 notes a day. If I were actually trying, I’m sure it’d be easy to grab much more while appearing well behaved.
That's not how ActivityPub (at least on Lemmy/*bin servers) works. As far as I've ever seen, there is no API within ActivityPub that allows for this. (Lemmy/*bin implementations do have the separate API that the browser and apps use, which must provide it, but that's not ActivityPub.) The protocol actually looks to be cleverly designed to prevent it. It might look like backfilling is happening because old stuff appears, but there are reasons for this.
How it works from my experience (I did some work on the federation in kbin a year or so ago).
And so old posts and comments begin to appear as activities linked to them happen. But there isn't a method to ask for "all the posts in community X" using ActivityPub. I remember because I was specifically looking for this a year or so ago. It lets you see the parent object but not any children.
Maybe Mastodon etc. does it differently? No idea.
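To illustrate the "parent but not children" point, here's a rough Python sketch of how a server might walk upward from a comment it has just received. It's not taken from kbin or Lemmy. The `OBJECTS` store, the example URLs, and the `fetch` and `resolve_ancestors` helpers are all made up for illustration, and it assumes `inReplyTo` holds a plain id string, which real objects don't always do.

```python
from typing import Callable, Optional

# Hypothetical in-memory stand-in for remote objects. A real server would
# fetch each id over HTTP instead of looking it up in a dict.
OBJECTS = {
    "https://a.example/post/1": {
        "id": "https://a.example/post/1",
        "type": "Page",
    },
    "https://b.example/comment/2": {
        "id": "https://b.example/comment/2",
        "type": "Note",
        "inReplyTo": "https://a.example/post/1",
    },
    "https://c.example/comment/3": {
        "id": "https://c.example/comment/3",
        "type": "Note",
        "inReplyTo": "https://b.example/comment/2",
    },
}


def fetch(object_id: str) -> Optional[dict]:
    """Return the stored object for an id, or None if it is unknown."""
    return OBJECTS.get(object_id)


def resolve_ancestors(
    object_id: str,
    fetch: Callable[[str], Optional[dict]] = fetch,
    max_depth: int = 50,
) -> list[str]:
    """Follow inReplyTo links upward and return the ancestor ids, nearest first.

    Each object only names its parent, so this walk can go up the thread
    but has nothing to follow downward to replies or sibling posts.
    """
    chain: list[str] = []
    seen = {object_id}
    current = fetch(object_id)
    while current is not None and len(chain) < max_depth:
        parent_id = current.get("inReplyTo")
        if not parent_id or parent_id in seen:  # top of thread, or a loop
            break
        seen.add(parent_id)
        chain.append(parent_id)
        current = fetch(parent_id)
    return chain


ancestors = resolve_ancestors("https://c.example/comment/3")
```

Starting from the newest comment, you end up with the older comment and the post, which is how old stuff can show up without anyone asking for "everything in community X".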
And all of this is moot, because if I block a User Agent or an AS number/IP block, they're not getting anything, either via ActivityPub or by scraping, unless they change User Agent, AS number, or both.
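For anyone wondering what that kind of block looks like in practice, here's a minimal Python sketch of the check, not my actual setup. The User Agent tokens are just examples of crawler names, the prefixes are reserved documentation ranges standing in for whatever an AS announces, and `is_blocked` is a made-up helper. In real life you'd more likely do this in the reverse proxy or firewall.

```python
import ipaddress

# Example substrings to match in the User-Agent header (illustrative only).
BLOCKED_UA_TOKENS = ("gptbot", "ccbot")

# Placeholder prefixes from the reserved documentation ranges. In practice
# these would be the prefixes announced by the AS number being blocked.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked User Agent or network."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return True
    addr = ipaddress.ip_address(remote_ip)
    # Membership is simply False when the address and network differ in
    # IP version, so mixing v4 and v6 entries here is safe.
    return any(addr in net for net in BLOCKED_NETWORKS)
```

The same check applies whether the request is a page scrape or an ActivityPub fetch, since both arrive as HTTP requests with a source address and a User Agent.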