this post was submitted on 22 Dec 2024

1305 points (97.5% liked)

Technology

Make illegally trained LLMs public domain as punishment (www.theregister.com)

submitted 21 hours ago by [email protected] to c/[email protected]

137 comments

It's all made from our data, anyway, so it should be ours to use as we want

(page 2) 50 comments

[–] [email protected] 3 points 11 hours ago* (last edited 11 hours ago) (1 children)

To speak of AI models being "made public domain" is to presuppose that the AI models in question are covered by some branch of intellectual property. Has it been established whether AI models (even those trained on properly licensed content) even are covered by some branch of intellectual property in any particular jurisdiction(s)? Or maybe by "public domain" the author means that they should be required to publish the weights and also that they shouldn't get any trade secret protections related to those weights?

[–] [email protected] 3 points 12 hours ago

Correct

[–] [email protected] 25 points 19 hours ago (1 children)

Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

[–] [email protected] 13 points 18 hours ago (1 children)

It's like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same "buy an album from a record store" model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.

Spotify's solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their "buy an album" business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.

I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.

[–] [email protected] 6 points 18 hours ago (1 children)

Bandcamp still runs on this model though, and quite well

[–] [email protected] 8 points 18 hours ago (1 children)

It's also one of the few places that have lossless audio files available for download. I'm a big fan of Bandcamp. I like having all my music local.

[–] [email protected] 37 points 21 hours ago (4 children)

It could also contain non-public domain data, and you can't declare someone else's intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

Laws are never simple.

[–] [email protected] 17 points 20 hours ago (8 children)

Forcing a bunch of neural weights into the public domain doesn't make the data they were trained on public domain as well. In fact, it doesn't even reveal what they were trained on.

[–] [email protected] 13 points 20 hours ago (1 children)

So what you're saying is that there's no way to make it legal and it simply needs to be deleted entirely.

I agree.

[–] [email protected] 5 points 18 hours ago (8 children)

There's no need to "make it legal"; things are legal by default until a law is passed to make them illegal, or a court precedent is set establishing that an existing law applies to the new thing under discussion.

Training an AI doesn't involve copying the training data, and the AI model doesn't literally "contain" the stuff it's trained on. So it's not likely that existing copyright law makes it illegal to do without permission.

[–] [email protected] 7 points 18 hours ago

It wouldn't contain any public-domain data though. That's the thing with LLMs: once they're trained on data, the data is gone, just added to the series of weights somewhere in the model. If it ingested something private like your tax data, it couldn't re-create your tax data on command; that data is now gone. But if it's seen enough private tax data, it could produce something that looks a lot like a tax return to an untrained eye, though a tax accountant would easily see the flaws in it.

[–] [email protected] 8 points 19 hours ago* (last edited 19 hours ago) (3 children)

The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs... probably on par with a steel mill running for a long time, a comparative drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don't have much pressure to optimize GPU usage.

Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.

Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.

[–] brie 2 points 15 hours ago

At the current kWh/token, an LLM query uses 100x the energy of a regular Google search query. That's where the environmental meme came from. Also, Nvidia plans to manufacture enough chips to require global electricity production to increase by 20-30%.

[–] [email protected] 5 points 18 hours ago (1 children)

Delete them. Wipe their databases. Make the companies start from scratch with new, ethically acquired training data.

[–] [email protected] 2 points 16 hours ago (3 children)

Mmm yes so all that electricity is pure waste
