this post was submitted on 24 Feb 2024

Protecting user content and data on Lemmy (self.lemmy)

submitted 1 year ago* (last edited 1 year ago) by silas to c/[email protected]

I see talk here and there about how any company or individual can easily use anything we post on Lemmy however they want. This could include AI training, behavior analysis, or user profiling. With the recent news of Reddit data being sold and licensed for AI training, I thought this would be a great time to preemptively discuss how we feel about this topic and brainstorm ways to discourage unwanted use of the content we post.

I’ve seen some users add a license to the end of each of their comments. One idea might be this: add a feature to Lemmy where each user can choose a content license that applies to everything they post. For example, one user might choose to reserve no rights for their content (like CC0) because they don’t care how their data is used. Another user might not want companies profiting off their posts, so they’d choose a more restrictive license.

I’m eager to hear everyone’s thoughts on the whole topic, so to kick things off:

Do you care how your public data and posted content is used? Why or why not?
What do you think of choosing a content license for your Lemmy account? Does this contradict the FOSS model?
Should Lemmy have features to protect user data/content in this way, or should that be left up to the user to figure out on their own?

Data is becoming an increasingly valuable commodity in the digital world. Hopefully these big-picture conversations can help us see what we value as a community and be more prepared for the future.

[–] [email protected] 26 points 1 year ago* (last edited 1 year ago) (4 children)

I'm sorry, but that's just impossible here. I'm sorry to tell you, but it is.

ActivityPub is a protocol which takes your content and blasts it out to anyone who listens. That's the design of it, that we all listen on our own servers and we can then treat our servers as we want. There is no profit motive on our servers because anyone could just jump to a new server.

However, this means there is literally no opt-out protocol. Anyone can start a server, which means anyone can start a server. Governments, corporations, the jerk down the street, anyone. The only way to turn that off is by saying "Defederate from this server", but given the anonymous nature of the network, we don't necessarily know who they are.

Of course we can defederate from other servers, but since anyone can spin up a server on any domain, how do you know that Meta doesn't have a server right now at some weird domain? OpenAI could be listening right now and training. In fact, I'd be surprised if the site formerly known as Twitter didn't have a Mastodon server up so they could keep tabs on it.

Even deleting a message is another blast out to all other servers. "Hey, this user requests you delete this message". So what happens if someone modifies their code to just ignore that?
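To make the delete problem concrete, here is a minimal toy sketch in Python. It is not Lemmy's or any real server's code: the `Server` class, the `honor_deletes` flag, and the example domains are all made up for illustration, and real ActivityPub `Delete` activities carry more fields than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stand-in for a federated server (hypothetical, not real Lemmy code)."""
    name: str
    honor_deletes: bool = True          # assumption: a flag a modified server could flip
    store: dict = field(default_factory=dict)

    def receive(self, activity: dict) -> None:
        kind = activity["type"]
        if kind == "Create":
            obj = activity["object"]
            self.store[obj["id"]] = obj["content"]
        elif kind == "Delete" and self.honor_deletes:
            # In this sketch the Delete simply carries the id of the object to remove.
            self.store.pop(activity["object"], None)
        # A server with honor_deletes=False falls through and keeps its copy.


def broadcast(activity: dict, servers: list) -> None:
    """Deliver the same activity to every listening server."""
    for server in servers:
        server.receive(activity)


polite = Server("polite.example")
greedy = Server("greedy.example", honor_deletes=False)
post_id = "https://home.example/post/1"

broadcast({"type": "Create", "object": {"id": post_id, "content": "hello"}}, [polite, greedy])
broadcast({"type": "Delete", "object": post_id}, [polite, greedy])

print(post_id in polite.store)  # False: the delete request was honored
print(post_id in greedy.store)  # True: the copy survives
```

The sending side has no way to tell the two servers apart; both received the same delete request.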

I guess what I'm trying to say is that the fediverse is open and free - and the downside of being open and free is that it's open and free - to everyone. There is no permanent delete. There is no way to license it, because by clicking post you are saying "Blast this out to everyone who is listening". Once it's on their server, it's their data. You gave it to them. There is no way to protect data because the protocol quite literally does the opposite.

[–] silas 8 points 1 year ago (1 children)

You might be right, I definitely see your point. ActivityPub adds a whole new layer to this too. In the end though, isn’t the content we post no different than anything else published on the Internet? I guess it’s important to note that technically nothing public can be 100% prevented from being used in unwanted ways. However, there might be other ways (legally, socially, etc.) we could discourage it.

Regardless, I’d love to get a better sense of how much this matters to us here on Lemmy—or if it should even matter in the first place

[–] [email protected] 6 points 1 year ago

It's more akin to handing out flyers to people you meet randomly, with a note at the bottom that they can't do anything with it. The note might hold up in court, but at the end of the day you're probably going to be asked why you were handing the flyer out in the first place if you didn't want people to read it. On top of that, that's one court. We're talking about the entire world here; who knows who or what is listening. I think that's the biggest shift in thinking: you aren't posting to someone's server like Reddit, you're throwing it out to everyone who wants to listen.

To me, this doesn't make a huge difference. If someone wants to train on it, fine, at least we get a free open platform that we can modify however we want. I just also am a bit more careful about what I post.

[–] [email protected] 7 points 1 year ago (1 children)

I understand why this is not currently possible on the technical side, but the post still has some merit in that misuse of original posts can lead to legal action.

Right now, all content posted online is generally treated as unlicensed work, free to use however one pleases. This was fine at the beginning, but as the internet grew, keeping control of one's data got increasingly difficult - once social media became the dominant form of communicating, it was all over.

Early blogs still have copyright notices posted on them that, legally, can be enforced and respected. So if each user were able to indicate their choices in metadata, with most probably defaulting to a free license, then some level of control is returned to the user, regardless of protocol and how things get replicated on servers.

Licenses cover reproduction, and the way ActivityPub works can make that quite murky (it's being republished on servers), but that is not all they cover.

OP, I think this is a very interesting topic to discuss, thanks for bringing it up!

[–] silas 4 points 1 year ago

Of course! Yeah, this post was intended to be less of a proposal and more of a brainstorm session. Maybe licenses aren’t the way to go about this, or we create our own licenses to be compatible with ActivityPub and match Lemmy’s values? Maybe it doesn’t matter how our content is used, or there’s nothing we can do?

[–] Die4Ever 5 points 1 year ago (1 children)

cloning data in that way isn't legally different than what The Wayback Machine does for other websites, it doesn't mean a company can just ignore the legal license of the content just because they can get a copy of it

if the only concern was getting a copy of the data, then Reddit wouldn't be able to sell access to the data for $75mil or whatever, the AI company would just scrape the pages or pay the API fees directly, and then they could even store the data and serve it to other people as a mirror and make some money off of the content with ads too!

same thing with licenses on Git repos, you can't just clone it and do whatever you want with it, there are laws

[–] [email protected] 4 points 1 year ago (1 children)

The problem is whether another server has to honor the license. You're on programming.dev. Say they obey the license that you put there. Well, say my server explicitly says "Do not send me things if you want it licensed. By sending me your data you waive all rights to your data and waive all licenses". I can put this in my legal area too. So, who wins then? That's different from git, where if I clone a repo I'm pulling your data. Here, you willingly pushed it to my server, where I said what I would do with it.

ActivityPub sent it to me automatically, it's on my server, and on my server I say anything you give me has no license. To me, that's like the people who say FB has no right to their data in a FB post.

The difference between Lemmy and Reddit is that it was Reddit's servers, they owned the data, and there was an agreement by signing up on who owned it - Reddit. Lemmy has no such agreement, and the data is not on a "Lemmy" server, it's stored on everyone's servers.

[–] Die4Ever 4 points 1 year ago* (last edited 1 year ago)

you make a good point about push vs pull, although things are only pushed if someone is subscribed (opted in)

I think the proposal is for licenses to become part of the ActivityPub protocol, so all applications would retain the original license of the content, license would be a first class citizen
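a rough sketch of what "license as a first class citizen" could look like on a post object, in Python. the `license` property here is hypothetical (as far as I know it is not part of the ActivityStreams vocabulary today), and `make_note` / `effective_license` are made-up helper names, so treat this as an illustration of the idea rather than how any implementation does it:

```python
import json


def make_note(content, author, license_url=None):
    """Build a minimal ActivityStreams-style Note as a plain dict."""
    note = {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Note",
        "attributedTo": author,
        "content": content,
    }
    if license_url is not None:
        # Hypothetical extension property: not defined by the current vocabulary.
        note["license"] = license_url
    return note


def effective_license(note, default="unspecified"):
    """What a receiving application would read back when displaying or re-sharing."""
    return note.get("license", default)


note = make_note(
    "my comment",
    "https://home.example/u/alice",
    license_url="https://creativecommons.org/licenses/by-nc-sa/4.0/",
)
print(json.dumps(note, indent=2))
print(effective_license(note))
print(effective_license(make_note("no license set", "https://home.example/u/bob")))
```

a receiving server that ignores the field would still get the content, so this only records the author's choice, it doesn't enforce it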

although without licenses this is functionally the same as email, I wonder how the laws work for that, for example I don't think you can just plagiarize something that someone wrote, quoted, or copy-pasted to you in an email if it's actually copyrighted content like from a book (aka content that had a license)

[–] [email protected] 3 points 1 year ago

You can run Lemmy in allowlist mode so it only federates with instances that you trust.
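Conceptually, an allowlist boils down to a check like the one below. This is only a Python sketch of the idea, not Lemmy's actual implementation or configuration format; `ALLOWED_INSTANCES` and `is_allowed` are names invented for the example, and the two domains are just the ones mentioned in this thread.

```python
from urllib.parse import urlparse

# Example allowlist; in practice an admin would maintain this.
ALLOWED_INSTANCES = {"lemmy.ml", "programming.dev"}


def is_allowed(actor_url, allowlist=ALLOWED_INSTANCES):
    """Return True only if the actor's host is an explicitly trusted instance."""
    host = urlparse(actor_url).hostname  # None when the URL has no host part
    return host is not None and host.lower() in allowlist


print(is_allowed("https://programming.dev/u/someone"))  # True
print(is_allowed("https://weird-domain.example/u/bot"))  # False
```

Anything not on the list is refused by default, which is the opposite of a blocklist, where unknown servers are accepted until someone notices them.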

[–] [email protected] 9 points 1 year ago* (last edited 1 year ago)

These are interesting questions, thank you for getting the discussion started on them.

Do you care how your public data and posted content is used? Why or why not?

Not really. If I did, I wouldn't post it somewhere public like Lemmy. I guess if I were sharing source code or artwork I had made, I would feel differently about somebody taking those and breaking the license terms on that. But I don't care if they're used to train AI. Well-trained AI benefits all of humanity, and it's not like they're making copies, they're just learning piecemeal from millions of pieces of content like mine. Whether conventional licensing applies to AI at all is still an open legal question that will probably take years to resolve.

What do you think of choosing a content license for your Lemmy account? Does this contradict the FOSS model?

I think this is a great idea and gives users some degree of additional control and clarifies for people who might want to use the data how they can use it. This can also be an interesting marketing tool to be able to say that on Lemmy you choose who uses your data and how, even if the enforcement mechanism is on the legal side not the technical side. The default should be public domain or copyleft license as that benefits the commons the most, but users should be able to make their own choice.

Should Lemmy have features to protect user data/content in this way, or should that be left up to the user to figure out on their own?

Not really, aside from letting users choose licenses for their content. I do think AP should integrate encrypted DMs/messages like Nostr and others have; this is an important feature. But that's really outside of this particular discussion.

Edit: Additional thought on licensing fees. If users could post, for example, their Bitcoin lightning address in their profile, they could automatically "license" their content this way. They could set a flat license fee in their profile per post or per word or whatever, perhaps it could be modified on a given post if the user wanted to, and if some company wants to come along and use their content, they could automatically pay for the licensing for that content. This would be an interesting way for users to get paid for their content. Lemmy and/or the instance could even take a portion of those payments, say 10%, and put it towards development. Having this all done via lightning would make this process automatable. Companies scraping AP/Lemmy data could search, find content, and then buy the content that suits them best. They might be willing to pay more for rare content types, for example, content on niche communities. Companies get proof, via the lightning transaction, that payment was made.
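As a back-of-the-envelope illustration of the split, here is a small Python sketch. The per-word rate, the use of satoshis as the unit, and the function name `split_license_fee` are assumptions for the example; only the 10% development share comes from the idea above, and no actual Lightning payment is involved.

```python
def split_license_fee(word_count, fee_per_word_sats, dev_share_percent=10):
    """Split a flat per-word license fee between the author and development.

    Amounts are integers (satoshis in this example) to avoid float rounding.
    """
    total = word_count * fee_per_word_sats
    dev_cut = total * dev_share_percent // 100
    return {"total": total, "development": dev_cut, "author": total - dev_cut}


# A 250-word post licensed at 2 sats per word:
print(split_license_fee(250, 2))
# {'total': 500, 'development': 50, 'author': 450}
```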

As a user, I wouldn't mind getting a few bucks per year for my content and knowing that my money is also contributing to Lemmy development and the sustainability of this whole fediverse thing. Nostr has a similar functionality with tips/zaps and tip pools, though it's not based around licensing.

[–] [email protected] 8 points 1 year ago (1 children)

If you are that concerned, I wouldn't suggest posting anything on the internet. Only send encrypted messages directly to the people you'd like to talk to and share with.

[–] [email protected] 4 points 1 year ago (1 children)

@Jake_Farm @lemmy I’m launching a new, totally secure social network based on the PigeonNet™ protocol

[–] [email protected] 1 points 1 year ago

The OG social network... Other than like actually talking to people.

[–] [email protected] 7 points 1 year ago

I don't think any protection is necessary. What you write on the Fediverse is public. Everyone can access it. Companies can use it - but they can all use it, no single company has any advantage. It's like open source software with open licenses in that way.

So I'm not that worried about protecting the data or preventing anyone from using it. I say use it - just know that everyone can use it. We're helping everyone, not just one company.

[–] [email protected] 7 points 1 year ago* (last edited 1 year ago)

I think there are two big things:

1. No one can sell the data when it's already freely available for everyone.
2. Only the data that needs to be collected is collected. Only the data that needs to be public is public.

Pushing for these points to remain true should help a lot.

On the first point (pretty sure I also posted about possible licenses at one point), the problem is that a license doesn't really stop anyone who doesn't care about the license. Until we can collectively organize to defend our licenses legally, the next best thing might be to remove the profit incentive entirely.

On the second point, this does away with a LOT of other metadata (e.g. location, device orientation, contacts, etc.) that won't be available for abuse. Reddit killed off third-party apps because they want people to use the official app. Part of that was for ads, but part of that was ALSO to collect more data. If we build the platforms and the clients in a way that the data isn't collected, then we'll be better off.

[–] [email protected] 4 points 1 year ago* (last edited 1 year ago)

(me not lawyer nor study law)

I’ve seen some users add a license to the end of each of their comments. One idea might be this: add a feature to Lemmy where each user can choose a content license that applies to everything they post. For example, one user might choose to reserve no rights for their content (like CC0) because they don’t care how their data is used. Another user might not want companies profiting off their posts, so they’d choose a more restrictive license.

I don't think licensing your content prevents it from being used in AI models, considering that services such as Copilot were trained on data such as GPL-licensed source code without having to comply with the terms it imposes when modifying or copying GPL-licensed code. This isn't just restricted to restrictive licenses such as the GPL, since under licenses such as MIT they would also have to credit the authors of the original work. It seems that, for now, copyright law doesn't apply to data generated by AI models and that they don't need to comply with the terms of the licenses of the training data (or at least they don't seem to have been penalized for violating copyright law yet, AFAIK).

And even if it wasn't licensed, companies can't use your works without your permission (unless it constitutes fair use). When you license a work, you are simply giving permission to other people to do things with your work they would otherwise not be allowed to do.

[–] [email protected] 2 points 1 year ago

My personal idea of freedom would be to at least make it illegal for google, openai and other giant profit oriented corpos to use my stuff (they probably would still do it but I want them to have to break the law doing it).

I mean, if you use a license in your posts that dictates profit sharing, and prevents use without credit and use in proprietary formats, you might still sue. The interesting thing is that some lemming has done this under all their posts already. It's no big deal to have a client like Voyager put a signature under your posts and comments indicating the proper license.
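Just to show how little a client would need, here's a tiny Python sketch. It's purely hypothetical: `add_license_footer` is a made-up helper, not anything Voyager actually exposes, and it assumes the post body is Markdown.

```python
def add_license_footer(body, license_name, license_url):
    """Append a Markdown license link to a post, unless it is already there."""
    footer = f"[{license_name}]({license_url})"
    trimmed = body.rstrip()
    if trimmed.endswith(footer):
        return body  # already signed, leave untouched
    return f"{trimmed}\n\n{footer}"


signed = add_license_footer(
    "My comment.",
    "CC BY-NC-SA 4.0",
    "https://creativecommons.org/licenses/by-nc-sa/4.0/",
)
print(signed)
```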

The more interesting question for me is: would Google then exclude our information, would we hinder our growth unnecessarily, and how would we still be findable but not end up in some proprietary LLM?

[–] [email protected] 1 points 1 year ago

Having a different license for each user would cause too much complexity and fragmentation. It makes more sense to have a license per instance. This can already be configured and gets displayed under /legal. Then, based on the legal info, users can decide where to sign up, and admins can decide which instances to federate with.