this post was submitted on 05 Aug 2024

469 points (97.2% liked)

Leaked Documents Show Nvidia Scraping ‘A Human Lifetime’ of Videos Per Day to Train AI (www.404media.co)

submitted 8 months ago by [email protected] to c/[email protected]

72 comments

https://archive.is/2024.08.05-162750/https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/

[–] [email protected] 26 points 8 months ago (1 children)

No, see, because it's "learning like a human", and everybody knows that you're allowed to bypass any licensing for learning. /s

But seriously I don't know how they make the jump to these conclusions either.

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago) (1 children)

This is a massive strawman argument. No one is saying you shouldn't have a license to view the content in order to train an AI on it. Most of the information used to train these models is publicly available and licensed for public viewing.

[–] [email protected] 17 points 8 months ago (3 children)

Just because something is available for public viewing does not mean it's licensed for anything except personal use.

The strawman here is that since physical people benefit from personal use exceptions in the law, machine learning software should too. But why should it? Since when is a piece of software run by a corporation equivalent to an individual person?

[–] [email protected] 9 points 8 months ago

A tangentially related but good example of this sort of thing is BluRays and community movie nights (like setting up a projector in a park).

Most of these movie nights are de facto illegal, as even though you own the BluRay, it is not licensed for public showings, just for personal use. Obviously no one gives enough of a shit to enforce this against small groups, especially if they aren't making money off it, but if a theater started offering showings of shit the owner just bought on BluRay or UHD disks, it wouldn't last too long.

Similar thing here. Just because you can access the content to view it yourself doesn't mean you have the rights to do more than that with it. As an individual, you're likely fine to break those rules. As a giant fucking corporation, it's time for you to pay up.

[–] [email protected] 2 points 8 months ago* (last edited 8 months ago)

Since when is a piece of software run by a ~~corporation~~ person equivalent to an individual person?

Gotta remember that legally a corporation IS a person.

Another great example of how the law is batshit serving capital and destroying the planet.

[–] [email protected] -3 points 8 months ago (2 children)

Copyright licensing allows the owner to control how a work is distributed, not how it's consumed. "Personal use" just means that you can't turn around and redistribute a work that you've obtained. Not that you're not allowed to consume it in a corporate setting.

[–] [email protected] 3 points 8 months ago (1 children)

Copyright licensing allows the owner to control how a work is distributed, not how it's consumed.

First of all, that's incorrect.

Secondly, by default you have zero rights to someone else's work. If something doesn't explicitly grant you rights, you have none. If there's a law or license, and if it's applicable to you, you get exactly what's specified in there.

The "personal use" or "fair use" exceptions in some places grant some basic rights but they are very narrow in scope and generally applicable only to individuals.

[–] [email protected] 6 points 8 months ago

I mean, it's in the name. The right to make copies. Not to be glib, but it really is:

A copyright is a type of intellectual property that gives its owner the exclusive legal right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time.

You may notice a conspicuous absence of control over how a copied work is used, short of distributing it. You can reencode it, compress it, decompress it, make a word cloud, statistically analyze its tone, anything you want as long as you're not redistributing the work or an adaptation (which has a pretty limited meaning as well). "Personal use" and "fair use" are stipulations that weaken a copyright owner's control over the work, not giving them new rights above and beyond copyright. And that's a great thing. You get to do whatever you want with the things you own.

You don't have a right to other people's work. That's what copyright enables. But that's beside the point. The owner doesn't get to say what you use a work for once they've distributed it to you.

[–] [email protected] 1 points 8 months ago* (last edited 8 months ago) (2 children)

Consuming is not the same thing as training. A machine is not a consumer, it is a tool.

[–] [email protected] 7 points 8 months ago* (last edited 8 months ago) (1 children)

Training literally is consuming. A copyright license doesn't get to dictate what computer programs the work is allowed to be used with. There's a ton of entertainment megacorps that would love for that to be the case, though.

You're saying that you're not allowed to do a statistical analysis on a copyrighted work. It's nonsense. It's well-established that copyright does not prevent that kind of use.

[–] [email protected] 0 points 8 months ago* (last edited 8 months ago) (2 children)

What makes you think copyright law doesn't apply to companies using copyrighted data to sell and profit from? That is not the case. Also, you're putting words in my mouth. Feel free to read my other replies in this thread, since I don't feel like repeating myself, but I think it's clear I'm not saying computers aren't allowed to process data. That's absurd.

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago)

Because that's not what copyright is for. It exists to give the creator exclusive rights over distribution. That's it. So unless the company is planning to distribute the work, and as long as the copy they obtained was willingly and legally distributed to them, copyright is the wrong law to lean on.

[–] [email protected] 4 points 8 months ago (1 children)

A program or machine can be a consumer of something, although if you want to be technical you could say the person using the machine is the consumer. In actual computer science we talk about programs consuming things all the time.

[–] [email protected] -4 points 8 months ago* (last edited 8 months ago) (2 children)

In actual computer science you talk about AI all the time as well, but it's not actually intelligent, is it? It's just SmarterChild 2.0 and literally has no idea what word it said just before its current one. Not intelligent. Words are often used inappropriately. The only things computers can consume are data and electricity, by definition, and consuming data is not the same as implementing it in a language (or visual) model that you intend to profit from. This is data theft, unless properly licensed.

[–] [email protected] 4 points 8 months ago* (last edited 8 months ago) (1 children)

How intelligent it is or isn't is irrelevant. We talk about much dumber programs than AI as being consumers of files and data, including things like compilers. Would it not be personal use for you to view a picture in a photo viewer or try to edit it in GIMP?

It's not data theft at all unless the courts and the law say it is. Ranting on Lemmy won't change that fact. Theft is a construct of law.

You can add clauses against use as AI training data to your licence if you wish.

[–] [email protected] -2 points 8 months ago* (last edited 8 months ago) (1 children)

You can try to equate humans to computers all day, and you can even pass laws that say they're the same thing. That does not make it true. A company using software to profit off data they have not licensed (whether it's public or not does not matter! That is not how copyright law works!) is theft.

Please try to sell DVDs of Markiplier's publicly available YouTube content and tell people how you're allowed to because it's publicly available.

[–] [email protected] 6 points 8 months ago (1 children)

I am not equating humans with computers. These businesses are not selling people's data when doing AI training (unlike actual data brokers). You can't say something AI generated is a clone of the original any more than you can say parody is.

[–] [email protected] -1 points 8 months ago* (last edited 8 months ago) (1 children)

I absolutely can. Parody is an art form, which is something that can only be created by human beings. AI is an art laundering service. Not an artist.

The law should reflect that these companies need to be first granted permission to use datasets by the rights holders, and holders of Creative Commons-licensed works need to be given an opportunity to opt out of being crawled for these datasets. Anything else is wrong. Machines are not humans. Creative Commons licensing and copyright law were not written with the concept of machines being "consumers". These companies took advantage of the sudden emergence of these models and the delay of law in holding their hunger for data in check. They need to be held accountable for their theft.

[–] [email protected] 4 points 8 months ago* (last edited 8 months ago)

There are already anti-AI licenses out there. If you didn't license your stuff with that in mind, that's on you. Deep learning models have been around for a lot longer than GPT-3 or anything that's happened in the current news cycle. They have needed training data for that long too. It was predictable that stuff like this would happen eventually, and if you didn't notice in time it's because you haven't been paying attention.

You don't get to dictate what's right and wrong. As far as I am concerned all copyright is wrong and dumb, but the law is what the law is. Obviously not everyone shares my opinion and not everyone shares yours.

Whether an artist is involved or not, it's still a transformative use.

[–] [email protected] 1 points 8 months ago (1 children)

Also the way you imply children can't be intelligent is disgusting.

[–] [email protected] 3 points 8 months ago

https://en.m.wikipedia.org/wiki/SmarterChild