OpenAI strikes Reddit deal to train its AI on your posts : reddit

this post was submitted on 17 May 2024

69 points (97.3% liked)


OpenAI strikes Reddit deal to train its AI on your posts (www.theverge.com)

submitted 5 months ago by [email protected] to c/[email protected]


cross-posted from: https://lemmy.world/post/15479755


top 17 comments


[–] [email protected] 26 points 5 months ago (2 children)

Each time this pops up, there is a rush of people saying to delete or edit your comments.

They have a database of your comments and all your edits. It's easy to see when you mass delete or edit them. Anything done past a certain point in time, especially all at once, is automatically reverted.

By deleting and editing, you are taking the data away from scrapers, which makes the dataset they are selling actually unique and more valuable.

[–] [email protected] 11 points 5 months ago* (last edited 5 months ago) (1 children)

I mean that's completely illegal at least in places like Germany where people have the right to be forgotten, but unfortunately you're still right. They already committed the biggest heist in human history and got away with it. I guess NFT grifters only got punished because they dared to also steal from some rich people, while Altman and his cronies are smart enough to only steal from the other 99.9%. Once they have your data, you can't request it back anymore, because the worst that can happen to them is a slap on the wrist and the cost of being in the fastest growing business of our times. In other words: World's fucked and shit sucks.

[–] [email protected] 7 points 5 months ago (1 children)

I just did a bit of poking around on the subject of the "right to be forgotten" and it's legally complex. Data without personally identifying information, and data that's been anonymized through statistical analysis (which LLM training is a form of) aren't covered.

[–] [email protected] 9 points 5 months ago* (last edited 5 months ago)

Yup. As someone who's worked a little bit on GDPR compliance, it's not some magic wand you wave at your data. Any data they receive after the request is also not covered by that request. Also, only EU citizens and residents are legally entitled to make a request. A company may choose to comply with non-EU users, but that's purely their choice.

Comments that contain any info about where you live, your ethnicity, disabilities (cognitive or physical), gender, where you work, etc must be deleted as part of a forget request, so that might impact LLM training data.

Personally identifying information can be somewhat of a grey area in some situations as well. If I were to say I'm from New York, that'd be personally identifying. If I were to say I'm a fan of a sports team in New York, that's not (even if that implies my location). If I were to say I'm a fan of a New York sports team, my favourite pizza place is in New York, my favourite park is in New York, etc etc, that might arguably be identifying, even if each of the pieces by itself is not.

EDIT: Oh, and I forgot one of the most important parts: it's not like there are any spot checks or anything. You'd need someone to actually lodge a formal complaint, with some kind of evidence they haven't done what they're supposed to, and the procedures are different for every EU country. They are normally very involved and complex. Essentially, you'd need to lawyer up and care enough to slowly and painfully shove it through the legal system.

[–] [email protected] 4 points 5 months ago (2 children)

Surely the use of user-deleted content as training data carries the same liabilities as reinstating it on the live site? I've checked my old content and it hasn't been reinstated. I'd assume such a dataset would inherently contain personal data protected by the right to erasure under GDPR, otherwise they'd use it for both purposes. If that is correct, regardless of how they filtered it, the data would be risky to use.

Perhaps the cumulative action of disenfranchised users could contribute to both the devaluation of a dataset based on a future checkpoint and a reduction in average post quality, leading to decreased popularity over time (if we assume content that is user-deleted en masse was useful, which I think is fair).

[–] [email protected] 4 points 5 months ago* (last edited 5 months ago) (1 children)

I think you need to make a special request to get the level of deletion that comes with GDPR. I'm not certain, I just remember other users specifically talking about how you need to send them an email so they have to comply.

I also wouldn't be surprised if their dataset is mostly stripped of user names to get around GDPR, though I'm no expert.

All that to say I'd be very very surprised if they deleted comments in their dataset.

Very valid point about devaluing the user experience though, especially when you take into account Google searches. I'm sure they have already fallen off compared to a year ago, when Reddit would pop up half the time no matter what you searched.

[–] [email protected] 3 points 5 months ago* (last edited 5 months ago)

Well, that'd be the mechanism of how GDPR protections are actioned, yes; but leaving themselves open to these ramifications broadly would be risky. I don't think it'd satisfy 'compliance' to ignore GDPR except upon request. Perhaps the issues with it are even more significant when using it as training data, given they're investing compute and potentially needing to re-train down the track.

Based on my understanding, de-identifying the dataset wouldn't be sufficient to be in compliance. That's actually how it worked prior to GDPR for the most part, but I know companies largely ended up just re-identifying data by cross-referencing multiple de-identified datasets. That nullification forms part of the basis for GDPR protections being as comprehensive as they are.

There'd almost certainly be actors who previously deleted their content that later seek to verify whether it was later used to train any public AI.

Definitely fair to say I'm making some assumptions, but essentially I think at a certain point trying to use user-deleted content as a value add just becomes more risk than it's worth for a public company.

[–] [email protected] 1 points 5 months ago

Surely the use of user-deleted content as training data carries the same liabilities as reinstating it on the live site?

Why would that be? It's not the same.

And what liabilities would there be for reinstating it on the live site, for that matter? Have there been any lawsuits?

[–] [email protected] 6 points 5 months ago

Your posts, maybe. Apparently I've been shadow banned.

I don't even know what the offensive post was, there's zero communication. I only post on my local city subreddit and one directly related to my trade, so whatever, I guess. It was a good site for a while.

[–] [email protected] 6 points 5 months ago

Enjoy the botslop and increased hallucinations that come with it.

[–] [email protected] 4 points 5 months ago

Asking similarly as I did with a Twitter post, because I think it's worth discussing (and people should want others to leave the corporate enclosures so info on the internet may move more freely):

How might we help and encourage people to leave Reddit?

[–] [email protected] 1 points 5 months ago (1 children)

We all need to move on from this, including myself. We know Reddit's business practices. If we don't like them, help make the fediverse better. Having said that, your posts will be scraped over here to the benefit of nobody, so is that better? At least on Reddit you have the option of owning their stock.

[–] [email protected] 5 points 5 months ago (1 children)

I disagree very strongly.

Owning any stock has nothing to do with your content being held hostage. When you and I made the posts on there, (I have to assume) we made them for the benefit of everyone. We didn't know that LLMs would come to pass and gobble up our stuff, but you can technically make your own LLM and gobble it up yourself. The only problem I have with this one and all the other "reddit sells your data to…" stories is that it's not Reddit's to sell (legal bla bla, I mean I never agreed to it being sold), and if it's Reddit's to sell, it's also mine to sell.

I know that's not how the corporate-owned legal shitshow works, but it is why I don't mind anyone scraping Lemmy or Mastodon. Everyone can do it, which makes it worthless to any one party. Control of access is what creates worth, which is why we need to abandon all proprietary media asap.

[–] [email protected] 2 points 5 months ago (1 children)

I don't really care if content I create is used to train LLMs, but I do object to Reddit monetizing the content I placed on their platform without somehow rewarding me for it. I'd rather just post it here, let the LLMs train on it, and let the value accrue to the LLMs. I do think there's a greater good for humanity. Whether I say something intelligent or unintelligent, I see it as a gift to the greater good of society. I did not create that gift as a benefit to Reddit shareholders. Having said all this, I'm realistic, and the best thing I can do is try to keep the value I add within open platforms like this one.

[–] [email protected] 1 points 5 months ago

Great that we are on the same side here.

Now I'd like to add that your idea is much better suited to the fediverse than to Reddit, because nobody can control which LLM reads your data. That gives the "greater good for humanity" to exactly that: humanity, instead of some corporation making money off of it.

[–] [email protected] 1 points 5 months ago (1 children)

http://reddit.ai/

[–] [email protected] 1 points 5 months ago

reddit Anguilla /s