this post was submitted on 10 Jan 2024

1230 points (96.5% liked)


"Did you realize that we live in a reality where SciHub is illegal, and OpenAI is not?" (fosstodon.org)

submitted 1 year ago by [email protected] to c/[email protected]

223 comments

[+] [email protected] -26 points 1 year ago* (last edited 1 year ago) (3 children)

They're not serving you the exact content they scraped, and that makes all the difference.

[–] [email protected] 21 points 1 year ago (1 children)

Well, if you believe that, you should look at the Times lawsuit.

Word for word on hundreds/thousands of pages of stolen content, it's damning.

[+] [email protected] -7 points 1 year ago (1 children)

Why do you assume that I haven't? The case hasn't been resolved, and it's not clear how The NY Times did what they claim, which may as well be manipulation. It's a fair rebuttal by OpenAI. The Times hasn't provided the steps they used to achieve that.

So unless that's cleared up, it's not damning in the slightest. Not yet, anyway. And that still doesn't invalidate my statement above, because that still only happens under very specific circumstances.

[–] [email protected] 2 points 1 year ago (1 children)

Also, intention is pretty important when determining guilt for many crimes. OpenAI doesn't intentionally spit back an author's exact words; their intention is to summarize and create unique content.

[–] [email protected] 5 points 1 year ago (2 children)

Ah, yes. The defense of "I didn't mean to do it." Always a classic.

[–] [email protected] 3 points 1 year ago

No, the real defense is "that's not how LLMs work," but you are all hinging on the wrong idea. If you think that an LLM is capable of doing what you claim, I'd love to hear the mechanism in detail and the steps to replicate it.

[–] [email protected] 0 points 1 year ago (1 children)

I mean, I'm not sure why this conversation even needs to get this far. If I write an article about the history of Disney movies, and make it very clear the way I got all of those movies was to pirate them, this conversation is over pretty quick. OpenAI and most of the LLMs aren't doing anything different. The Times isn't Wikipedia, most of their stuff is behind a paywall with pretty clear terms of service and nothing entitles OpenAI to that content. OpenAI's argument is "well, we're pirating everything so it's okay." The output honestly seems irrelevant to me, they never should have had the content to begin with.

[–] [email protected] 2 points 1 year ago

That's not the claim that they're making. The Times is arguing that OpenAI retains the work the Times made publicly available, which OpenAI claims is fair use because it's wholly transformed into nodes, weights and biases, and because they don't store those articles in a database for reuse. But the Times's other argument is that OpenAI created a system that threatens their business, which is just ludicrous.

[–] [email protected] 13 points 1 year ago (1 children)

So it's content laundering

[–] [email protected] -4 points 1 year ago (2 children)

What a colorful mischaracterization. It sounds clever at face value but it's really naive. If anything about this is deceptive, it's the lengths that people go to to slander what they dislike.

[–] [email protected] 2 points 1 year ago (1 children)

I feel most people critical of AI don't know how a neural network works...

[–] [email protected] -1 points 1 year ago

That is exactly what's going on here. Or they hate it enough that they don't mind making stuff up or mischaracterizing what it does. Seems to be a common thread on the Fediverse. It's not the first time this week I've seen it.

[–] [email protected] 2 points 1 year ago (1 children)

Actually, content laundering is the best term I've heard to describe the process. Just like with money laundering, you no longer know the source, and now it's technically legal to use and distribute.

I mean, if the copyrighted content wasn't so critical, they would train models without it. They're essentially derivative works, but no one wants to acknowledge it because it would either require changing our copyright laws or make this potentially lucrative and important work illegal.

[–] [email protected] 4 points 1 year ago (1 children)

Content laundering is not a good way to describe it: the term is misleading because it oversimplifies and mischaracterizes what a language model actually does. It's a fundamental misunderstanding of how it works. Training language models is typically a transparent and well-documented process, as described by the mountains of research over the past decades. The real value comes from the weights of the nodes in the neural network, not from spitting out, in its entirety, the source it was trained on. The source material is evaluated and wholly transformed into new data in the form of nodes and weights. The original content does not exist as it was within the network because there's no way to encode it that way. It's a statistical system that compounds information.

And while LLMs do have the capacity to create derivative works in other ways, it's not all that they do, or what they always do. It's only one of the many functions that it has. What you say would probably be true if it was only trained on a single source, but that's not even feasible. But when you train it on millions of sources, what remains are the overall patterns of language within those works. It's much more sophisticated and flexible than what you describe.

So no, if it was cut and dry there would be grounds for a legitimate lawsuit. The problem is that people are arguing points that do not apply but sound reasonable when they haven't seen a neural network work under the hood. If anything, new laws need to be created to address what LLMs do if you're so concerned about proper compensation.

[–] [email protected] -1 points 1 year ago

I am familiar with how LLMs work and are trained. I've been using transformers for years.

The core question I'd ask is, if the copyrighted material isn't essential to the model, why don't they just train the models without that data? If it is core to the model, then can you really say they aren't derivative of that content?

I'm not saying that the models don't do something more, just that the more is built upon copyrighted material. In any other commercial situation, you'd have to license/get approval for the underlying content if you were packaging it up. When sampling music, for example, the output will differ greatly from the original song, but because you are building off someone else's work you must compensate them.

It's why content laundering is a great term. The models intermix so much data that it's hard to know if the content originated from copyrighted materials. Just like how money laundering tries to make it difficult to determine if the money comes from illicit sources.

[–] [email protected] 5 points 1 year ago (1 children)

It's great how most of us are taught that just changing the order of words is still plagiarism. These models frequently end up using the exact same words as other things, and people still argue it's somehow intelligent and somehow not plagiarism.

[–] [email protected] 2 points 1 year ago (1 children)

"Changing the order of words" is what it does? That's news to me. And do you have examples of it "using the exact same words as other things" without prompt manipulation?

[–] [email protected] 0 points 1 year ago (2 children)

Why does the prompting matter? If I "prompt" a band to play copyrighted music does that mean they get a free pass?

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago)

That's not a very good analogy, because the band would be reproducing an entire work of art, which an LLM does not and cannot. And by prompt manipulation I mean purposely making it seem like the LLM is doing something it wouldn't do on its own. The operative word is "seem," which is what I meant by manipulation. The prompting itself is irrelevant; how it's done is what matters. So unless The Times releases the steps they used to get ChatGPT to output what it did, you can't really claim that that's what it does.

In a blog post, OpenAI said the Times “is not telling the full story.” It took particular issue with claims that its ChatGPT AI tool reproduced Times stories verbatim, arguing that the Times had manipulated prompts to include regurgitated excerpts of articles. “Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts,” OpenAI said.

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago)

If you passed them a sheet of music, I'd say that's on you; it would be your responsibility to not sell recordings of them playing it.

It's just like if I typed the first chapter of Harry Potter into Word: it is not Microsoft's intent to breach copyright; it would have been my intent to make it do it. It would be my responsibility not to sell that first chapter, and they should come after me if I did, even though MS is the corporation that supplied the tools.