this post was submitted on 28 Nov 2023
92 points (96.9% liked)

all 36 comments
[–] [email protected] 28 points 11 months ago (8 children)

So judges are saying:

If you trained a model on a single copyrighted work, then that would be a copyright violation because it would inevitably produce output similar to that single work.

But if you train it on hundreds of thousands of copyrighted works, that’s no longer a copyright violation, because output won’t closely match any single work.

How is something a crime if you do it once, but not if you do it a million times?

It reminds me of the scheme from Office Space: https://youtu.be/yZjCQ3T5yXo

[–] [email protected] 30 points 11 months ago (3 children)

A basic principle of copyright law and fair use is whether the result is transformative. People literally make collages out of copyrighted works, and it's fine in many cases.

Turning pictures into an AI model (and that's being really generous in my phrasing as if the pictures have anything to do with the math) is just about one of the most transformative things you can do with a picture.

This is copyright 101, and if you're shocked, you don't understand what you're talking about when it comes to copyright.

[–] [email protected] 7 points 11 months ago

Man, it's refreshing that Lemmy seems to have people with more nuanced takes on AI than the rest of the internet

[–] [email protected] 18 points 11 months ago* (last edited 11 months ago) (1 children)

Training the AI isn’t a copyright violation though. Producing content from a single source of training information is intuitively different from producing content from a litany of sources. Is there a distinction I’m not understanding that you are pointing out?

[–] [email protected] 9 points 11 months ago

Nope, I think you nailed it.

I've trained my personal AI, my brain, by ingesting 1,000+ books. So now I can't write a book?

Suppose I use a Stephen King phrase, "friends and neighbors". Can't use that? Of course I can.

[–] [email protected] 14 points 11 months ago* (last edited 11 months ago)

"AI" models are, essentially, solvers for mathematical systems that we humans cannot describe or build solvers for ourselves.

For example, a calculator for pure numbers is a pretty simple device whose logic can all be designed by a human directly. A language, though? Or an image classifier? Those are not possible to create by hand.

With "AI", instead of designing all the logic manually, we create a system that can end up in a finite, yet practically limitless, number of states, each of which defines different behavior. By slowly tuning the model on existing data and checking its performance, we (ideally) end up with a solver for some incredibly complex system.

If we tried to make a regular calculator that way and all we gave the model was "2+2=4", it would memorize the equation without understanding it. That's called "overfitting", and it's something the people behind AI are trying their best to prevent. It happens if the training data contains too many repeats of the same thing.

However, if there is no repetition in the training set, the model is forced to actually learn the patterns in the data, instead of the data itself.

Essentially: if you're training a model on a single copyrighted work, you're making a copy of that work via overfitting. If you're using terabytes of diverse data, overfitting is minimized. Instead, the resulting model has an actual understanding of the system you're training it on.
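
To make the calculator example concrete, here's a toy sketch of the memorization effect. It is not how any real AI model is built: the three-parameter linear model, the `train` and `predict` helpers, the learning rate, and the step count are all assumptions I picked for illustration. One copy of the model only ever sees "2+2=4"; the other sees every sum of two numbers from 0 to 4.

```python
def train(examples, steps=5000, lr=0.05):
    """Fit y ~ w_a*a + w_b*b + bias with plain gradient descent on mean squared error."""
    w_a = w_b = bias = 0.0
    n = len(examples)
    for _ in range(steps):
        g_a = g_b = g_bias = 0.0
        for a, b, target in examples:
            err = (w_a * a + w_b * b + bias) - target
            g_a += err * a
            g_b += err * b
            g_bias += err
        # Nudge each parameter slightly against its averaged gradient.
        w_a -= lr * g_a / n
        w_b -= lr * g_b / n
        bias -= lr * g_bias / n
    return w_a, w_b, bias


def predict(params, a, b):
    w_a, w_b, bias = params
    return w_a * a + w_b * b + bias


# Hypothetical training sets: one lone equation vs. 25 different sums.
single = [(2, 2, 4)]
diverse = [(a, b, a + b) for a in range(5) for b in range(5)]

memorizer = train(single)
generalizer = train(diverse)

for name, params in (("single example", memorizer), ("diverse data", generalizer)):
    print(name, round(predict(params, 2, 2), 3), round(predict(params, 10, 20), 3))
```

Both models answer 2+2 correctly. The one trained on the single equation lands near 27 for 10+20, because nothing in its data forced it to learn that each input counts exactly once. The one trained on varied sums recovers the addition rule and gets 30.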

[–] [email protected] 9 points 11 months ago* (last edited 11 months ago) (1 children)

How is something a crime if you do it once, but not if you do it a million times?

Because doing it a million times seriously dilutes the harm to any single content creator (assuming those million sources are from many, many different content creators, of course). Potential harm plays a major role in how copyright cases are determined, and in cases involving such a huge number of sources, harm can be immeasurably small.

In addition to right and wrong, the practicality of regulation and enforcement is often a part of groundbreaking decisions like these, and I’m not certain this particular issue is something our legal system is equipped to handle.

I’m not sure I agree with the reasoning here, but I see their thinking.

[–] [email protected] 3 points 11 months ago

An AI trained on a single image would also probably be fine if it was somehow a generalist AI that didn't overfit on that single image. The quantity really doesn't matter.

[–] [email protected] 6 points 11 months ago (1 children)

Imagine this situation if a human replaced the AI.

Imagine a human who wants to write a book. They've read hundreds of other books already, and lots of other things besides books. Then they write a book. The final work probably contains an amalgamation of all the other things they've read--similar characters, themes, plot points, etc.--but it's a unique combination, so it's distinct from those other works. No copyright violation.

Now imagine that same human has only ever read one book. Over and over. They know only the one book. The human wants to write a new book. But they only have experience with the one they've read again and again. So the book they write is almost exactly the same as the one book they read. That's a copyright violation.

Training an AI model is not a crime, any more than reading a book is a crime. You're not making "copies" or profiting directly from that single work.

[–] [email protected] 3 points 11 months ago

Thank you for putting into words my entire point to the people who always claim AI art is theft

[–] [email protected] 5 points 11 months ago

How is something a crime if you do it once, but not if you do it a million times?

You can dream up other examples of this.

If you're a DJ performing for a large audience and yell "I want to see you shake it for me!", that is legal. If you walk up to one specific woman on the street and pull her aside and say "I want to see you shake it for me", that's sexual harassment.

If the government announces that the median income of Detroit residents has gone up by 3%, that's normal. If the government publicly announces that John Fuckface, 36.2 years old, living at 123 Fake Street, had his income increase by 5% in the previous year, that's a privacy violation.

The whole point of training the AI is to build a model that can't reproduce a single work. It may seem superficially strange, but the more works you include, the less capable it is of reproducing one work.

[–] [email protected] 5 points 11 months ago

How is something a crime if you do it once, but not if you do it a million times?

Companies get to steal from people all the time without repercussions through erroneous fees, 'mistakes' in billing, and denied coverage, and even outright fraud gets only a slap-on-the-wrist fine at best. But when an average person steals $5, they are thrown in jail.

[–] [email protected] 2 points 11 months ago

How is something a crime if you do it once, but not if you do it a million times?

Because we are talking about a generalized knowledge base vs a specific one? Is it not obvious from the explanation you quoted that instructing an AI to respond off of millions of sources means that it isn't biased off of one person's work?

[–] [email protected] 5 points 11 months ago

finally, some sanity