this post was submitted on 24 Feb 2024

21 points (92.0% liked)

sh.itjust.works Main Community


Using the lemmyverse user generated content to train AI (sh.itjust.works)

submitted 8 months ago by [email protected] to c/[email protected]


With the latest announcement that Google is allegedly paying Reddit $60 million per year for access to user-created content to train its AI, what is stopping companies from using the freely available information on the lemmyverse to do the same for free?

How does everyone feel about the likelihood of this already happening and should something be done about it?

top 18 comments


[–] [email protected] 20 points 8 months ago (1 children)

Nothing is stopping them from doing it, and the only reason it might not be already happening is the possibility that nobody has cared enough to bother yet.

And by design nothing much can be done about it. That's the nature of a decentralized platform - it's explicitly set up to share content, and that's what it does, by default. And there is no central authority that can control access to the fediverse as a whole. And that's pretty much that.

And personally, I don't care. I've never bought into the nonsensical idea that the stuff I post in a public forum is in any meaningful sense my property after I post it.

[–] [email protected] 7 points 8 months ago

Agreed--if I put up a poster on a billboard, I'm not really in a position to complain if someone takes a picture of it.

[–] [email protected] 14 points 8 months ago (2 children)

Nothing is stopping it, except that one guy who copyrights all his comments; he frightens them.

[–] [email protected] 5 points 8 months ago

Ironically, their comments are more accessible than anyone else's here.

[–] [email protected] 1 points 8 months ago

Realistically, with how the Fediverse works, they could just ban his actor from their collection node, and it would ignore all requests made by him or replies to him, as if they never even happened.
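To make that idea concrete, here is a minimal sketch of how a collecting node might drop everything written by, or in reply to, a banned actor. It is only an illustration, not how any real crawler or Lemmy itself is known to work: the function name `filter_activities` and the sample IDs are hypothetical, and the activities are assumed to be plain dicts using the ActivityStreams fields `actor`, `object`, `id` and `inReplyTo`.

```python
def filter_activities(activities, blocked):
    """Return the activities not written by, or replying to, a blocked actor.

    `activities` is an iterable of dicts in arrival order; `blocked` is a set
    of actor IDs. Both shapes are assumptions made for this sketch.
    """
    authors = {}  # object id -> actor id, so replies can be traced to their parent's author
    kept = []
    for activity in activities:
        actor = activity.get("actor")
        obj = activity.get("object")
        # `object` may be a bare ID string (e.g. for a Like); treat that as having no fields.
        obj = obj if isinstance(obj, dict) else {}
        if obj.get("id"):
            # Remember the author even when blocked, so replies to them can be dropped too.
            authors[obj["id"]] = actor
        parent_author = authors.get(obj.get("inReplyTo"))
        if actor in blocked or parent_author in blocked:
            continue
        kept.append(activity)
    return kept


# Hypothetical sample data: "troll" is banned, so n2 and the reply to it (n3) are dropped.
sample = [
    {"actor": "alice", "object": {"id": "n1"}},
    {"actor": "troll", "object": {"id": "n2", "inReplyTo": "n1"}},
    {"actor": "carol", "object": {"id": "n3", "inReplyTo": "n2"}},
    {"actor": "dave", "object": {"id": "n4", "inReplyTo": "n1"}},
]
kept = filter_activities(sample, {"troll"})
```

Note that this only hides direct replies to the banned actor; replies further down the same branch would need the parent chain to be walked as well.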

[–] [email protected] 13 points 8 months ago

They have to pay Reddit now, as the API is gone. I'm quite certain that at least one of the companies scraping the web to train their LLM has been using it.

And I'm quite certain that this happens to the fediverse as well. You don't even need an API; just set up your own instance, make a few thousand accounts, and subscribe all over using these. Then you've got all the data in a nice DB.

[–] [email protected] 12 points 8 months ago

Nothing is stopping someone from scraping data off of Lemmy. It has probably already been happening.

[–] [email protected] 6 points 8 months ago (1 children)

Technically, copyright stops them. I know, the whole copyright debate on AI training hasn't been settled. But when you sign a contract with Reddit or Dropbox, I assume it includes a licence to use the content to train AI.

Here on Lemmy, I never gave a licence to my instance to reuse my content, and I keep full copyright on the content.

Well, I know, nobody cares about copyright, but there is a difference between OP downloading a torrent of My Little Pony and a company making tons of money out of it. Remember that The Pirate Bay founder got jail time.

[–] [email protected] 4 points 8 months ago (1 children)

Do you keep full copyright of your posts and comments here? Especially given the federated international nature of the platform, I'm not clear on how copyright works on Lemmy.

[–] [email protected] 3 points 8 months ago (2 children)

IANAL, but I don't see why you wouldn't:

At the moment you create intellectual content, you hold the copyright on it.
Many Lemmy instances haven't filled in the legal terms-of-service section, and the ones that did do not include a "You grant us a perpetual commercial licence on your content" clause. Therefore, they don't have the right to share your content without your consent. A picky lawyer may even argue that you never agreed that your content would be federated (but could also argue that it's implicit when publishing to the federation).

So with my limited understanding of copyright, an AI company scraping Lemmy's data would potentially be infringing copyright (well, there is an ongoing legal case against OpenAI, so we'll know whether AI training is considered re-using copyrighted data). That said, I have no doubt that it's occurring. Not only would I struggle to identify my content in an AI model (well, someone speaking some Frenglish while forgetting the plural s and mixing up some letters on the keyboard? Could be a lot of people), but lawyers are expensive, and I have better things to do with my money.

Judging by the kind of content we have on the fedi, I can't wait to see AI saying stuff like "eat the rich", "Blahaj is so cuuuuuuuuttte ewewewew", "There is no OS but GNU and Stallman is the prophet", or "Capitalism is the problem, we need to re-establish the proletariat dictatorship". That would at least be fun.

[–] [email protected] 2 points 8 months ago (2 children)

Judging by the kind of content we have on the fedi, I can't wait to see AI saying stuff like "eat the rich", "Blahaj is so cuuuuuuuuttte ewewewew", "There is no OS but GNU and Stallman is the prophet", or "Capitalism is the problem, we need to re-establish the proletariat dictatorship". That would at least be fun.

If someone did create an LLM using fedi content and let it loose in the comments, I wonder how long it would take for people to realize it's a bot? I'm sure not flagging it as a bot is a violation of most instances' rules, and its existence would probably upset some people, but it's still a fun question.

[–] [email protected] 2 points 8 months ago (1 children)

No one would notice. At worst, people would accuse it of trolling as it doubles down on factual inaccuracies. It may, and I say this without any irony, already be here and blending in. Paper books are the future.

[–] [email protected] 1 points 8 months ago (1 children)

Paper books are the future.

As if paper books can't contain garbage and misinformation... oh wait (the article links to an Amazon page whose listing has a paperback option).

[–] [email protected] 1 points 8 months ago (1 children)

Cool. Not remotely what I meant, but I do sincerely enjoy a good nitpick.

[–] [email protected] 2 points 8 months ago (1 children)

Oh, I didn't exactly understand what you meant 😅

[–] [email protected] 1 points 8 months ago

I realize that I didn’t exactly specify, so you were entirely right to say what you did. I was just referring to pre-AI books with established utility and veracity. Likening things to a modern fallen Rome, rife with knowledge to uncover. And I fully understand why none of that came through given that I neither wrote nor implied any of it. Your nitpick was very much appreciated.

[–] [email protected] 2 points 8 months ago

We're going to get a weird feedback loop soon, where future AI is trained on posts created by current AI, eventually poisoning the well of trainable content.

[–] [email protected] 1 points 8 months ago* (last edited 8 months ago)

Also not a lawyer but maybe more familiar with IP law than you are?

When an AI scrapes the post you just wrote... how exactly were you, the author of the post, harmed by that action? You weren't harmed, which is a powerful fair use defence. It's not enough on its own, but it's a huge step in that direction, and other factors, such as transforming the original, add to that, making a compelling case.

Consider the most recent fair use case: Google negotiated to pay license fees for Java, then refused to pay and created its own copy of Java instead. It dragged on in court a long time and bounced back and forth on appeal, but in the end the ruling came down to "Java is protected by copyright, but Sun was not sufficiently harmed, therefore it was fair use". Or at least that's where it was headed when Oracle (who bought Sun years after the infringement happened) decided to stop burning mountains of cash fighting a lawsuit that wasn't likely to end well for them.

I was somewhat surprised by that case - I felt the fact that Google had talks about paying, then decided not to pay, was pretty clear harm. But the judge didn't see that as real harm - Java's source code is not 'free as in freedom' but it is 'free as in dollars' to download and therefore not really properly protected by copyright. The fact the license added restrictions to what you can do with the copy you were given for free didn't hold up in court (which has pretty widespread ramifications for GPL... I wonder who will be brave enough to test that in court... the FSF isn't going to back down from a lawsuit like Oracle did).

Anyway, if Java is borderline, I think the fediverse is clear cut. Almost any copy of the fediverse would be fair use. Yes, it's technically copyrighted content, but there's a loophole so big it surrounds the entire universe.