this post was submitted on 22 Dec 2024

1152 points (97.5% liked)


Make illegally trained LLMs public domain as punishment (www.theregister.com)

submitted 17 hours ago by [email protected] to c/[email protected]

110 comments

It's all made from our data, anyway, so it should be ours to use as we want


[–] [email protected] 12 points 2 hours ago* (last edited 2 hours ago) (1 children)

Intellectual property doesn't really exist in most of the world. They don't give a shit about it in India, Bangladesh, Vietnam, China, the Philippines, Malaysia, Singapore...

It's arbitrary law that is designed to protect corporations, and it's generally unenforceable.

[–] [email protected] 8 points 2 hours ago

But they're not developing AI in those countries; they're developing it mostly in the US. In the US, copyright law is enforced.

[–] [email protected] 8 points 3 hours ago

I used Whisper to create subtitles for a video, and in a section with relaxing instrumental music it filled in, on repeat:

La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un'ipnosi non verbale (in English: "Dr. Paret's school is a non-verbal hypnosis technology that is used for the results of non-verbal hypnosis")

Clearly stolen from this Dr. Paret's YouTube channel, where he sells hypnosis lessons in Italian. Probably in one or more videos he had subtitles stating this over the same relaxing instrumental music that I used, and the model assumed the sound corresponded to that text.

[–] [email protected] 1 points 2 hours ago

Nice one

[–] [email protected] 26 points 8 hours ago (2 children)

Although I'm a firm believer that most AI models should be public domain or open source by default, the premise of "illegally trained LLMs" is flawed, because there really is no assurance that LLMs currently in use are illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.

The idea of... well, ideas, being copyrightable should shake the boots of anyone in this discussion, especially since when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

The underlying technology simply has more than enough good uses that banning it would just cause it to flourish wherever it isn't banned, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn't tip further in their favor, to the point that AI technology only exists to benefit them.

If the model is built on the corpus of humanity, then humanity should benefit.

[–] [email protected] 2 points 18 minutes ago

the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world

They are not "analyzing" the data. They are feeding it into a regurgitating mechanism. There's a big difference. Their defense is only "good" because AI is being misrepresented and misunderstood.

I agree that we shouldn't strive for more strict copyright. We should fight for a much more liberal system. But as long as everyone else has to live by the current copyright laws, we should not let AI companies get away with what they're doing.

[–] [email protected] 2 points 2 hours ago* (last edited 2 hours ago)

Banning AI is out of the question. Even the EU accepts that and they tend to be pretty ban heavy, unlike the US.

But it's important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it's a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students want to build an AI right now, they can't, because acquiring training data is a legal nightmare, a Twilight Zone of law.

[–] [email protected] 12 points 7 hours ago

Another clown dick article by someone who knows fuck all about AI

[–] [email protected] 49 points 11 hours ago (1 children)

It's not punishment. LLMs do not belong to them; they belong to all of humanity. Tear down the enclosing fences.

This is our common heritage, not OpenAI's private property

[–] [email protected] 1 points 2 hours ago

It doesn't matter anyway, we still need the big companies to bankroll AI. So it effectively does belong to them whatever we do.

Hopefully at some point people can get the processor requirements to something sane and AI development opens up to us all.

[–] [email protected] 3 points 7 hours ago* (last edited 7 hours ago)

To speak of AI models being "made public domain" is to presuppose that the AI models in question are covered by some branch of intellectual property. Has it been established whether AI models (even those trained on properly licensed content) even are covered by some branch of intellectual property in any particular jurisdiction(s)? Or maybe by "public domain" the author means that they should be required to publish the weights and also that they shouldn't get any trade secret protections related to those weights?

[–] [email protected] 21 points 11 hours ago* (last edited 11 hours ago)

"Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data"

I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

I liked the author's earlier very-unlikely-to-be-met-demand activism last year better:

I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

...which at least yielded the amusingly misleading headline "OpenAI ordered to delete ChatGPT over false death claims" (it's technically true: a court didn't order it, but a guy who goes by the name "That One Privacy Guy" while blogging on LinkedIn did).

[–] [email protected] 55 points 14 hours ago (4 children)

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I'm not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

[–] [email protected] 6 points 12 hours ago

Yes, mining companies should all be nationalised for digging up the country's ground and putting carbon in the country's air.


[–] [email protected] 72 points 14 hours ago (5 children)

So banks will be public domain when they're bailed out with taxpayer funds, too, right?

[–] [email protected] 52 points 14 hours ago (1 children)

They should be, but currently it depends on the type of bailout, I suppose.

For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank's assets, and now effectively owns the bank.

[–] [email protected] 7 points 13 hours ago (1 children)

At the same time, if a bank goes under, that means they owe more than they own, so "ownership" of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.

The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.

The AIG bailout required the company to issue new stock, diluting existing owners down to 20% of the company while the government owned the other 80%, and the government made a big profit when it exited that transaction and sold the stock off to the public.

So it's not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don't want is the government owning a company long term, because there's some conflict of interest between its role as regulator and its interest as a shareholder.


[–] [email protected] 10 points 13 hours ago* (last edited 13 hours ago) (1 children)

Public domain wouldn't be the right term for banks being publicly owned, at least in the normal copyright sense of "public domain". You can copy text and data; you can't copy a company with unique customers and physical property.


[–] [email protected] 3 points 7 hours ago

Correct

[–] [email protected] 109 points 17 hours ago* (last edited 14 hours ago) (20 children)

It won't really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

[–] [email protected] 2 points 3 hours ago

Just a little note about the word "model": in the article it's used in a way that actually includes the weights, and I think this is the usual way of using it! If you change the weights, you get a different model, though the two models will have the same structure.
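To make that distinction concrete, here is a minimal Python sketch. `TinyModel` is a made-up toy class for illustration, not anything from the article or from a real ML library; it only shows that two models can share a structure while being different models because their weights differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TinyModel:
    """Hypothetical one-layer linear model: fixed structure, weights as data."""

    weights: tuple[float, ...]
    bias: float

    def predict(self, inputs: tuple[float, ...]) -> float:
        # Weighted sum of the inputs plus the bias.
        return sum(w * x for w, x in zip(self.weights, inputs)) + self.bias

    def structure(self) -> tuple[int, int]:
        # (number of inputs, number of outputs) describes the architecture only.
        return (len(self.weights), 1)


# Same architecture (two inputs, one output), different weights.
model_a = TinyModel(weights=(0.5, -1.0), bias=0.0)
model_b = TinyModel(weights=(2.0, 3.0), bias=1.0)

same_structure = model_a.structure() == model_b.structure()
same_model = model_a == model_b  # dataclass equality compares the weights too

print(same_structure, same_model)
```

In this toy, publishing only `structure()` would tell you very little; the behaviour lives in `weights` and `bias`.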

Anyway, you make good points!

[–] [email protected] 22 points 14 hours ago (8 children)

Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.


[–] [email protected] 29 points 15 hours ago (1 children)

They pulled a very pubic and out in the open data heist

Oh no, not the pubes! Get those curlies outta here!

[–] [email protected] 10 points 14 hours ago

Best correction ever. Fixed. ♥️


[–] [email protected] 24 points 15 hours ago (1 children)

Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

[–] [email protected] 13 points 14 hours ago (3 children)

It's like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same "buy an album from a record store" model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.

Spotify's solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their "buy an album" business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.

I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.


[–] [email protected] 36 points 16 hours ago (4 children)

It could also contain non-public-domain data, and you can't declare someone else's intellectual property public domain just like that. Otherwise a malicious actor could train a model on a bunch of misappropriated data, get caught (intentionally or not), and then force all that data into the public domain.

Laws are never simple.

[–] [email protected] 17 points 16 hours ago (8 children)

Forcing a bunch of neural weights into the public domain doesn't make the data they were trained on public domain too; in fact, it doesn't even reveal what they were trained on.
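A toy illustration of that last point, as a sketch only: the `fit_line` helper and both datasets below are invented for the example, and a two-parameter line fit is obviously nothing like an LLM. It shows two completely different training sets producing identical weights, so the weights alone cannot tell you which data was used. Real large models can still memorize and regurgitate fragments of their training data, so this should not be read as a claim that nothing ever leaks.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations in x, and of x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Two made-up training sets with no points in common.
dataset_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
dataset_b = [(10.0, 21.0), (20.0, 41.0), (40.0, 81.0), (50.0, 101.0)]

weights_a = fit_line(dataset_a)
weights_b = fit_line(dataset_b)

# Identical weights, so publishing them reveals neither dataset.
print(weights_a == weights_b)
```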


[–] [email protected] 13 points 16 hours ago (7 children)

So what you're saying is that there's no way to make it legal and it simply needs to be deleted entirely.

I agree.
