this post was submitted on 24 Aug 2023
165 points (94.6% liked)

Technology

The New York Times blocks OpenAI’s web crawler

The New York Times has officially blocked GPTBot, OpenAI’s web crawler. The outlet’s robots.txt page specifically disallows GPTBot, preventing OpenAI from scraping content from its website to train AI models.
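
For anyone wondering what a robots.txt block amounts to in practice, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The rules in `ROBOTS_TXT` and the example URL are assumptions made up for illustration, not a copy of the actual NYT file; only the `GPTBot` user-agent token comes from the post. It shows how a crawler that chooses to honor robots.txt would decide whether it may fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only;
# this is NOT the real file served by nytimes.com.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; any path falls under "Disallow: /" for GPTBot.
    url = "https://www.example.com/some-article"
    print("GPTBot allowed:", can_fetch("GPTBot", url))
    print("SomeOtherBot allowed:", can_fetch("SomeOtherBot", url))
```

Note that the check runs on the crawler's side. The file only states the site's wishes, so it has an effect only if the crawler performs a check like this and respects the answer.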

all 19 comments
[–] [email protected] 31 points 1 year ago (2 children)

as if a text file is going to stop them

[–] [email protected] 7 points 1 year ago

NYT also uses a third party bot identification and mitigation service.

[–] [email protected] 19 points 1 year ago (1 children)

The question is: does that crawler adhere to robots.txt policies?

[–] [email protected] 3 points 1 year ago

They made a flag specifically for their crawler, so they can say that they do but in the most annoying way possible.

[–] [email protected] 5 points 1 year ago

This is the best summary I could come up with:


Based on the Internet Archive’s Wayback Machine, it appears NYT blocked the crawler as early as August 17th.

The change comes after the NYT updated its terms of service at the beginning of this month to prohibit the use of its content to train AI models.

OpenAI didn’t immediately reply to a request for comment.

The NYT is also considering legal action against OpenAI for intellectual property rights violations, NPR reported last week.

If it did sue, the Times would be joining others like Sarah Silverman and two other authors who sued the company in July over its use of Books3, a dataset used to train ChatGPT that may have thousands of copyrighted works, as well as Matthew Butterick, a programmer and lawyer who alleges the company’s data scraping practices amount to software piracy.

Update August 21st, 7:55PM ET: The New York Times declined to comment.


The original article contains 202 words, the summary contains 146 words. Saved 28%. I'm a bot and I'm open source!

[–] [email protected] 5 points 1 year ago

What is the AI being trained for anyway, how to be an NYT journalist?