this post was submitted on 06 Sep 2024
1726 points (90.3% liked)

Technology


Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.

This is fundamentally different from copying a book or song. It's more like the long-standing artistic tradition of being influenced by others' work. The law has always recognized that ideas themselves can't be owned - only particular expressions of them.

Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

[–] [email protected] 21 points 3 months ago (1 children)

The joke is, of course, that "paying for copyright" is impossible in this case. ONLY the large social media companies that own all the comments and content accumulated by their communities have enough data to train AI models. Or sites like stock photo libraries or DeviantArt, which own the distribution rights for the content. That means all copyright arguments practically argue that AI should be owned by big corporations and should be inaccessible to normal people.

Basically the "means of generation" will be owned by the capitalists, since they are the only ones with the economic power to license these things.

That is basically the worst case scenario. Not only will the value of work diminish greatly, the advances in productivity will also be accessible only to big capitalists.

Of course, that is basically inevitable anyway. Why wouldn't they want this? It's just sad seeing the stupid morons arguing for this as if they had anything to gain.

[–] [email protected] 13 points 3 months ago (1 children)

I'm getting really tired of saying this over and over on the Internet and getting either ignored or pounced on by pompous AI bros and boomers, but this "there isn't enough free data" claim has never been tested. The experiments that have come close (look up the early Phi and Starcoder papers, or the CommonCanvas text-to-image model) suggested that the claim is false, by showing that a) models trained on small, well-curated datasets can match and outperform models trained on lazily curated large web scrapes, and b) models trained solely on permissively licensed data can perform on par with at least the earlier versions of models trained more lazily (e.g. StarCoder 1.5 performing on par with Code-Davinci). But yes, a social network or other organization that has access to a bunch of data that they own, or have licensed, could almost certainly fine-tune a base LLM trained solely on permissively licensed data to get a tremendously useful tool that would probably be safer and more helpful than ChatGPT for that organization's specific business, at vastly lower risk of copyright claims or toxic generated content, for that matter.

[–] [email protected] 2 points 3 months ago* (last edited 3 months ago) (1 children)

Thanks for the info. But let's say you want to train a (future) AI to spot and tag disinformation and misinformation. You'd need to use and curate actual data from social media sites and articles.

If copyright is extended to learning from and analyzing publicly available data, such an AI will only be possible by licensing that data. That data will then be monetized to maximize profit: first for some lump sum, then later "per GB", and then later "per use".

I'm sure open source AI will make do, and for many applications there is enough free data, but I can imagine a lot of cases where there won't be. Anything that requires "commercially successful" media, articles, newspapers, screenplays, movies, books, social media posts and comments, images, photos, video clips...

We're basically setting up a world where the intellectual wealth of our civilization is being transformed into a commodity and then will be transferred into the hands of a few rich capitalists.

And even if there is an acceptable amount of free data, if the principle is that data needs to be specifically licensed to learn and train and derive AI works from it, that makes free data use expensive too. It needs to be specifically vetted, and whoever uses it is still vulnerable to being sued over mistakes or outrageous claims of copyright. Similar to patents, the uncertainty requires higher capitalization for any startup to defend against lawsuits.

[–] [email protected] 4 points 3 months ago (1 children)

Yeah, I've struggled with that myself, since my first AI detection model was technically trained on potentially non-free data scraped from Reddit image links. The more recent fine-tune of that used only Wikimedia and SDXL outputs, but because it was seeded with the earlier base model, I ultimately decided to apply a non-commercial CC license to the checkpoint. But here's an important distinction: that model, like many of the use cases you mention, is non-generative; you can't coerce it into reproducing any of the original training material--it's just a classification tool. I personally rate those models as much fairer uses of copyrighted material, though perhaps no better in terms of harm from a data dignity or bias propagation standpoint.

[–] [email protected] 5 points 3 months ago (1 children)

I just want a holodeck future without having to pay by the hour to DisneComBroSonyFlixMount.

[–] [email protected] 2 points 3 months ago

But that's unethical!