this post was submitted on 12 Jun 2023
13 points (100.0% liked)

Experienced Devs


A community for discussion amongst professional software developers.

Posts should be relevant to those well into their careers.


cross-posted from: https://lemmy.world/post/76533

One of the arguments made for Reddit's API changes is that Reddit is now the go-to place for LLM training data (e.g. for ChatGPT).

https://www.reddit.com/r/reddit/comments/145bram/addressing_the_community_about_changes_to_our_api/jnk9izp/?context=3

I haven't seen a whole lot of discussion around this and would like to hear people's opinions. Are you concerned about your posts being used for LLM training? Do you not care? Do you prefer that your comments are available to train open source LLMs?

(I will post my personal opinion in a comment so it can be up/down voted separately)

top 18 comments
[–] framboos 12 points 1 year ago

Reddit provides a platform where regular users create the data. Moderators add value by ensuring the quality. Without any of these parties, there is no valuable data. Of course there is a cost in running the platform, but Reddit should avoid as much as possible charging users and especially moderators for using the platform.

Then there are search engines and 3rd party apps. They also add value. Search engines use the data, and in return they attract new contributors. 3rd party apps also attract regular users, and by providing a better experience make sure that the regular users stay active for longer. They should not be charged more than is required to keep the platform running and is reasonable with respect to their profits.

LLM trainers do not fit in this picture. They use large amounts of data, but do not provide anything in return that is valuable to the users, moderators or platform. Therefore, I absolutely support charging them more for accessing training data.

Users of the platform who provide value in return should not have to pay more than is reasonable and required to keep the platform running. LLM trainers do not provide value in return, and I support charging them more. It is unreasonable not to differentiate between 3rd party app developers and LLM trainers.

[–] [email protected] 11 points 1 year ago (1 children)

I think the claim is nonsense. If that were their concern they would rather change the usage agreement and maybe take some of them to court.

What they actually did is everything in their power to drive mobile users to their mobile app. They want old-fashioned user tracking data for advertising and selling on, together with more in-app ads.

[–] [email protected] 8 points 1 year ago

I totally agree that Reddit's motivation is probably not related to LLMs and the link I posted is more of an excuse than anything. However, I am curious what people think about data scraping and LLMs in general.

[–] HairHeel 9 points 1 year ago (1 children)

Reddit has every right to charge for their API, but the amount they wanted to charge was too high.

Other use cases aren’t relevant here either. They could have come to an agreement with Apollo etc. that would have charged them reasonable rates while charging more to data scrapers. They could have done ads and rev share on the mobile apps. Most people wouldn’t have objected to that.

That part’s not a Reddit-specific problem though. I’ve seen a similar pattern play out at several companies I’ve worked for:

  • charge extra for a new premium feature
  • a new client with deep pockets comes along and wants part of that feature, but doesn’t want all of it, so doesn’t want to pay for it
  • sales really wants to catch this big fish
  • sales promises to build a new feature that does the same thing as the existing feature
  • the company loses more money than they would have by just giving the feature away for free, since now they’re also paying engineers to build the free version.
[–] JackbyDev 9 points 1 year ago

I think another huge problem that you didn't mention was the timeframe. Had they given the apps even 6 months from announcing the price, they might have been able to pivot to subscriptions. The short timeframe (combined with the gaslighting from the CEO) makes it hard to want to try though.

[–] jmk1ng 9 points 1 year ago* (last edited 1 year ago)

I think Reddit does have a legitimate argument that the scales have tipped, and that it is unfair for Reddit to eat the costs of "whales" abusing its APIs for for-profit use cases without being compensated at all.

3P apps using the API at no cost while simultaneously monetizing Reddit's content by showing their own ads does seem to be taking advantage.

That said, the way Reddit approached this was so scorched-earth and boneheaded.

For example, Reddit gets tens of millions of dollars' worth of free content moderation from volunteers. The moderators of all its biggest subreddits rely on 3P moderation tools since Reddit's own are so poor.

So with the new API policy, they're asking their unpaid moderators to PAY them for the privilege. It's such a slap in the face.

Finally, to address the original question, Reddit should absolutely block API consumers who are just training their glorified chat bots to regurgitate plagiarized content.

[–] [email protected] 7 points 1 year ago (1 children)

I hope cross-posts are OK. But I am curious about Experienced Devs' perspective on this as well, since the question is rather technical.

Copying my opinion from the other thread in case you don't want to look at my other thread:

My personal opinion is that high API usage fees hurt open source LLMs (e.g. GPT4All). I would rather not see this new technology monopolized by those who can pay API fees.

[–] Clifspeare 6 points 1 year ago

I'd tend to agree. There are enough barriers to training large models without artificially increasing them just because the largest players can afford it.

[–] msage 6 points 1 year ago (1 children)

Scraping open content is OK. Search engines have been doing that, it's their main job.

LLMs won't exist without large inputs, hehe, and the internet is a good source for a big volume of language, most of which can even make sense.

I don't feel like Reddit should be against LLMs, ignoring their bogus claims. At least I hope GitHub doesn't share private and licenced repos.

[–] [email protected] 4 points 1 year ago (1 children)

I was wondering if someone would bring up search engine indexing. Google certainly has the upper hand for LLM training data with Reddit's new API change, since they have the comments anyway. This is a big reason I fear these API changes: they very much concentrate power in the hands of already powerful companies.

[–] msage 4 points 1 year ago (1 children)

Always has been meme.jpg

I really don't think Reddit changed because of the AI, it's just for the IPO, trying to pump and dump it sky high.

It's really sad when you imagine what we could do as a species, if we could work together instead of trying to one-up each other.

It kind of brings me back to decentralized services, which for me is the ultimate freedom model, and I'm loving this alternative to Reddit.

[–] [email protected] 3 points 1 year ago (1 children)

I am cautiously optimistic about the decentralization and federation. But I think the biggest hurdle is developing the user base right now. ExperiencedDevs is the only subreddit I followed before this all started that directly linked a Lemmy alternative.

[–] msage 1 points 1 year ago (1 children)

I'm not sure which subreddit mentioned Lemmy, but I've never been to r/ExperiencedDevs.

I've been using my Matrix instance for 99% of my private conversations since 2018, so I know it works, but I also see that people won't change unless their #1 solution is taken away.

So right now it's do or die for any Reddit alternative.

[–] [email protected] 1 points 1 year ago (1 children)

r/ExperiencedDevs pointed at programming.dev specifically which is why I am giving it a shot. I have used decentralized stuff in the past like Usenet and IRC. I kinda miss the lack of corporate overlords.

https://www.reddit.com/r/ExperiencedDevs/comments/147ebxd/experienceddevs_will_go_dark_until_the_end_of_the/

[–] msage 1 points 1 year ago

I found this instance from the list, checked out any interesting names, saw the content, and this one looked the best for me.

I will do my own instance in the future, but it will take about a month before my schedule clears out.

It's kind of weird though, I always wanted to microblog, but never really found the time, now I'm thinking of using Lemmy for that.

[–] [email protected] 6 points 1 year ago (1 children)

I do not want my content to contribute to a proprietary LLM that will make billions for a large tech company without giving back to the community. Unfortunately, I think the fediverse will have a harder time countering large-scale data harvesting than a centralized service like Reddit.

On the other hand, I don't mind open source, privacy-respecting (is this a thing for LLMs?) LLMs using my content.

[–] [email protected] 1 points 1 year ago

I am also wary of big tech companies using my comment history for their LLMs. However, I worry that the tech companies will scrape data anyway and Reddit's API pricing just locks out the open source LLMs. There are a few of them; here are a couple that I have played with:

https://github.com/nomic-ai/gpt4all

https://github.com/ggerganov/llama.cpp

Some projects even try to preserve privacy. But I think it's more a matter of what extra training data you give it and the queries you issue.

https://github.com/imartinez/privateGPT

[–] [email protected] 3 points 1 year ago

My posts are going to be used for LLM training regardless.