this post was submitted on 22 Dec 2024
1469 points (97.6% liked)

Technology


It's all made from our data, anyway, so it should be ours to use as we want

[–] [email protected] 5 points 7 hours ago* (last edited 7 hours ago) (1 children)

As per TorrentFreak:

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but in an older paper two databases are referenced; “Books1” and “Books2”. The first one contains roughly 63,000 titles and the latter around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

Should be easy to defend against, outright trivial: OpenAI, just tell us what those Books1 and Books2 databases are, where you got them from, and which licensing contracts you signed with publishers to get access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.

...crickets. They pirated the lot of it; otherwise they would already have gotten that case thrown out. It's US startup culture, plain and simple: "move fast and break laws", get lots of money, then use that money to pay the best lawyers to abuse the shit out of the US court system.

[–] [email protected] 2 points 4 hours ago

For OpenAI, I really wouldn't be surprised if that happened to be the case, considering they still call themselves "OpenAI" despite having the most censored and closed-source AI models on the market.

But my comment was aimed more at AI models in general. If you assume they did use material that was not publicly posted or gathered, and obtained it directly themselves, they would indeed have no defense. Unfortunately, if a third party provided them the data, and did so under false pretenses, that would likely let them off the hook legally, even if they had every ethical obligation to make sure it was publicly available. The third party that provided it would be the one infringing.

If that assumption turns out to be true (maybe through some kind of discovery in the trial), they should burn for that. Until then, even if it's a justified assumption, it's still an assumption, and most likely not true for most models, certainly not those trained recently.