this post was submitted on 24 Aug 2023

165 points (94.6% liked)


The New York Times blocks OpenAI’s web crawler (www.theverge.com)

submitted 1 year ago by [email protected] to c/[email protected]


The New York Times has officially blocked GPTBot, OpenAI’s web crawler. The outlet’s robots.txt page specifically disallows GPTBot, preventing OpenAI from scraping content from its website to train AI models.
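
For anyone wondering what "disallows GPTBot" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, and the example.com URL are assumptions made for illustration and are not copied from the NYT's actual file. Note that robots.txt is advisory: this only shows the check a crawler that chooses to honor the file would perform.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for illustration only;
# this is NOT the actual file served by nytimes.com.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some-article"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

Nothing in this check is enforced by the server; a crawler that ignores the file can still send requests, which is where separate bot mitigation would come in.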


[–] [email protected] 31 points 1 year ago (2 children)

as if a text file is going to stop them

[–] [email protected] 7 points 1 year ago

NYT also uses a third party bot identification and mitigation service.

[–] [email protected] 19 points 1 year ago (1 children)

The question is: Does that crawler adhere to robots.txt policies?

[–] [email protected] 3 points 1 year ago

They made a flag specifically for their crawler, so they can say that they do but in the most annoying way possible.

[–] [email protected] 5 points 1 year ago

This is the best summary I could come up with:

Based on the Internet Archive’s Wayback Machine, it appears NYT blocked the crawler as early as August 17th.

The change comes after the NYT updated its terms of service at the beginning of this month to prohibit the use of its content to train AI models.

OpenAI didn’t immediately reply to a request for comment.

The NYT is also considering legal action against OpenAI for intellectual property rights violations, NPR reported last week.

If it did sue, the Times would be joining others like Sarah Silverman and two other authors who sued the company in July over its use of Books3, a dataset used to train ChatGPT that may have thousands of copyrighted works, as well as Matthew Butterick, a programmer and lawyer who alleges the company’s data scraping practices amount to software piracy.

Update August 21st, 7:55PM ET: The New York Times declined to comment.

The original article contains 202 words, the summary contains 146 words. Saved 28%. I'm a bot and I'm open source!

[–] [email protected] 5 points 1 year ago

What is the AI being trained for anyway, how to be a NYT journalist?

[+] [email protected] -39 points 1 year ago* (last edited 1 year ago) (5 children)

This goes against everything that the NYT preaches in terms of saying that the press is under attack and needs to be protected. AI consumption of news content makes the news more accessible. Their paid articles don’t overlap with what ChatGPT is doing. This is really a bunch of old people getting butt hurt about tech they don’t fully understand.

[–] [email protected] 37 points 1 year ago

While I am no fan of the NYT and other news sites’ pricing models, I don’t think that this goes against “protecting the press”. Journalists do a job. They research, compile, draft, and write articles in their own voice (or the voice of the news outlet). They are paid for this work. OpenAI wants to scrape the words off news sites so that their language model can regurgitate them for free.

This is the AI Art thing all over again. Creators should be paid for their work.

[–] [email protected] 24 points 1 year ago

Please don't tell me you get your news from LLMs.

[–] [email protected] 15 points 1 year ago (1 children)

"AI consumption of news content makes the news more accessible."

If journalists and their platforms do not get paid, their articles won't get written. So no, the free absorption of professional articles into an LLM that uses the article to answer a Pokemon question online in 6 months is not making "news" more "accessible".

[–] [email protected] -1 points 1 year ago

It’s more so an archive of historical knowledge. Thinking it just answers Pokémon questions is shortsighted.

[–] [email protected] 7 points 1 year ago (3 children)

If you claim to fully understand machine learning technology, you should also understand why it's considered theft by many. Everything that a generative AI churns out is ultimately derived from human works. Some of it is legally unencumbered, but much of it is protected by copyright and integrated into an AI model without the author's permission or knowledge, and reused without attribution.

I have no love for the NYT, but in this, they're right.

[–] [email protected] -2 points 1 year ago

Everything anyone churns out is ultimately derived from human works. I know that 2+2 = 4 because my teacher taught me that. I can read Hegel and understand it because both he and I read Kant. The corpus of work created by humanity collectively builds on itself.

When you listen to a song on the radio, there has been an infinitely long chain of influence that goes back hundreds of years.

Everything is built on everything else. AI isn't fundamentally different. It's just done automatically by a mathematical model.

In my opinion, instead of trying to prevent this technology like a neo-luddite, we need to be looking at new models for our creators to survive. I'm a big fan of the Patreon model. We don't have to use Patreon, of course (and we shouldn't).

But imagine a world where all content is free and people with money choose to support the creators they enjoy. Even a dollar or two when done en masse would be enough to sustain someone's lifestyle and reliably reward them for work.

We need to think forward and not act like conservatives. This technology isn't going away. It's simply going to accelerate and break a lot of things while it picks up speed.

[–] [email protected] -3 points 1 year ago* (last edited 1 year ago)

I can't say I fully understand how LLMs work (can anyone?) but I know a little, and your comment doesn't seem to understand how they use training data. They don't use their training data to "memorize" sentences; they use it as an example (among billions) of how language works. It's still just an analogy, but it really is pretty close to LLMs "learning" a language by seeing it used over and over. Keeping in mind that we're still in an analogy, it isn't considered "derivative" when someone learns a language from examples of that language and then goes on to write a poem in that language.

Copyright doesn't even apply, except perhaps in extremely fringe cases. If a journalist put their article up online for general consumption, then it doesn't violate copyright to use that work as a way to train an LLM on what the language looks like when used properly. There is no aspect of copyright law that covers this, but I don't see why it would be any different than the human equivalent. Would you really back up the NYT if they claimed that using their articles to learn English was in violation of their copyright? Do people need to attribute where they learned a new word or strengthened their understanding of a language if they answer a question using that word? Does that even make sense?

Here is a link to a high level primer to help understand how LLMs work: https://www.understandingai.org/p/large-language-models-explained-with