this post was submitted on 22 Feb 2024
1019 points (98.8% liked)

Technology

60060 readers
3320 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related content.
  3. Be excellent to each another!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, to ask if your bot can be added please contact us.
  9. Check for duplicates before posting, duplicates may be removed

Approved Bots


founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 16 points 10 months ago (1 children)

This is only going to be adding recent Reddit data.

A growing amount of which I would wager is already the product of LLMs trying to simulate actual content while selling something. It's going to corrupt itself over time unless they figure out how to sanitize the input from other LLM content.

[–] [email protected] 7 points 10 months ago* (last edited 10 months ago)

It's not really. There is a potential issue of model collapse with only synthetic data, but the same research on model collapse found a mix of organic and synthetic data performed better than either or. Additionally that research for cost reasons was using worse models than what's typically being used today, and there's been separate research that you can enhance models significantly using synthetic data from SotA models.

The actual impact will be minimal on future models and at least a bit of a mixture is probably even a good thing for future training given research to date.