this post was submitted on 11 Jan 2024
287 points (96.4% liked)

At a Senate hearing on AI’s impact on journalism, lawmakers backed media industry calls to make OpenAI and other tech companies pay to license news articles and other data used to train algorithms.

[–] [email protected] 55 points 8 months ago* (last edited 8 months ago) (15 children)

“What would that even look like?” asks Sarah Kreps, who directs the Tech Policy Institute at Cornell University. “Requiring licensing data will be impractical, favor the big firms like OpenAI and Microsoft that have the resources to pay for these licenses, and create enormous costs for startup AI firms that could diversify the marketplace and guard against hegemonic domination and potential antitrust behavior of the big firms.”

As our economy becomes increasingly driven by AI, legislation like this will guarantee that Microsoft and Google get to own it.

[–] [email protected] 28 points 8 months ago* (last edited 8 months ago) (7 children)

Yes, and they'll use legislation to pull up the ladder behind them. It's a form of regulatory capture, and it will absolutely lock out small players.

There are open source AI training datasets, but the question is whether LLMs can be trained as accurately on them.

[–] [email protected] 8 points 8 months ago (3 children)

Any foundation model is trained on a subset of Common Crawl (https://commoncrawl.org/).

All the data in there is, arguably, copyrighted by one individual or another. There is no equivalent open- or closed-source dataset to it.

Every single post, page, blog, and site has a copyright holder. In the last year, big companies have started changing their TOS so that they can use, relicense, and generally sell your data hosted on their services as their own for AI training, so potentially some small parts of Common Crawl will become licensable in bulk, or obtainable directly from the source.

This still leaves out the majority of the data directly or indirectly used today, even if you were willing to pay, because it is infeasible to track down and contract with every single rights holder.
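
To get a feel for the scale involved, here's a rough sketch of peeking at a Common Crawl-derived corpus. I'm assuming the Hugging Face `datasets` library and the `allenai/c4` dataset (a cleaned Common Crawl subset) purely for illustration:

```python
# Rough sketch: streaming a Common Crawl-derived corpus.
# Assumes the Hugging Face `datasets` library and the `allenai/c4`
# dataset (a cleaned Common Crawl subset), chosen just for illustration.
from datasets import load_dataset

# Streaming avoids downloading the full corpus (hundreds of GB of text).
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4):
    # Each record is one scraped web page; its URL is essentially the
    # only pointer to whoever holds the copyright on that text.
    print(record["url"])
    print(record["text"][:200])
    if i >= 2:  # just peek at a few records
        break
```

Multiply those per-page rights holders by hundreds of millions of pages and it's clear why licensing it all is a non-starter.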

On the other side, there has been work on using less but more heavily curated data, which could potentially produce good small, domain-specific models. Still, they will not be like the ones we currently have, and the open source community will not have access to the same amount and quality of data.
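
To give an idea of what "heavily curated" can mean in practice, here's a toy sketch of C4-style heuristic filtering; the specific rules and thresholds are made-up examples, not a real pipeline:

```python
# Toy sketch of heuristic quality filters, loosely in the spirit of
# C4-style cleaning. Rules and thresholds are illustrative guesses.
def keep_document(text: str) -> bool:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        return False  # too short to be useful prose
    # Drop pages dominated by boilerplate-like short lines (menus, etc.).
    short = sum(1 for ln in lines if len(ln.split()) < 4)
    if short / len(lines) > 0.5:
        return False
    # Drop pages with no terminal punctuation anywhere (link farms, etc.).
    if not any(ln.endswith((".", "!", "?")) for ln in lines):
        return False
    return True

menu = "Home\nAbout\nContact us"
prose = ("This paragraph has complete sentences.\n"
         "It ends its lines with punctuation.\n"
         "So a simple filter keeps it.")
print(keep_document(menu), keep_document(prose))  # False True
```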

It's an interesting problem, and I'm really curious to see where it leads.

[–] [email protected] 3 points 8 months ago (1 children)

Thanks for the link to Common Crawl; I didn't know about that project but it looks interesting.

That's also an interesting point about heavily curated datasets. Would something like that be able to overcome some of the bias in current models? For example, if you were training a facial recognition model, you could use a curated, open source dataset with representative samples of all races and genders to try to reduce racial bias. Anyone training a facial recognition model, for any purpose, could then have a training set that can be peer reviewed for accuracy.
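
For instance, the peer review I'm imagining could take the form of a per-group accuracy audit, something like this sketch (the groups, labels, and predictions are all made up):

```python
# Minimal sketch of a per-group accuracy audit for a classifier.
# Groups, ground truth, and predictions are made up for illustration.
from collections import defaultdict

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
groups = ["a", "a", "b", "b", "a", "b", "b", "a"]  # hypothetical demographics

correct, total = defaultdict(int), defaultdict(int)
for t, p, g in zip(y_true, y_pred, groups):
    total[g] += 1
    correct[g] += int(t == p)

for g in sorted(total):
    print(f"group {g}: accuracy {correct[g] / total[g]:.2f} over {total[g]} samples")
# A large accuracy gap between groups is exactly the kind of bias a
# representative, reviewable dataset is supposed to reduce.
```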

[–] [email protected] 3 points 8 months ago

Face recognition is probably dead as an open endeavor. The surveillance aspect makes it too controversial. I mean not only that we won't see open source work on it, but that whatever work does happen stays behind closed doors.

In general, a major problem is that it is often not clear what reducing bias even means. With face recognition, it is clear: we just want it to work equally well for everyone. With genAI it is unclear. Say you type "US president" into an image generator. The historical fact is that all US presidents have been male, and all but one white. What's the unbiased output?

One answer is that it should reflect who is eligible for the US presidency. But in the future, one would expect far more people to be of "mixed race". So would that perhaps be biased against "interracial marriage"? In either case, one could accuse the makers of covering up historical injustice. I think in practice, people want image generators that just give them what they want with minimum fuss; wants which are probably biased by social expectations.

In any case, such curated datasets are used to fine-tune models trained on uncurated data. I don't think it's known exactly what such a dataset should look like to yield an unbiased model (however defined).
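
As a rough sketch of one common recipe for building such a fine-tuning set, you can resample so that every group is equally represented; the `group` key and the records here are hypothetical:

```python
# Sketch: balancing a fine-tuning dataset by group via upsampling.
# The "group" key and the records are hypothetical.
import random
from collections import defaultdict

records = [
    {"text": "example 1", "group": "a"},
    {"text": "example 2", "group": "a"},
    {"text": "example 3", "group": "a"},
    {"text": "example 4", "group": "b"},
]

by_group = defaultdict(list)
for r in records:
    by_group[r["group"]].append(r)

# Upsample each group to the size of the largest one, so the curated
# set no longer mirrors the skew of the raw crawl.
target = max(len(items) for items in by_group.values())
balanced = []
for items in by_group.values():
    balanced.extend(random.choices(items, k=target))

random.shuffle(balanced)
print([r["group"] for r in balanced])  # roughly equal counts per group
```

Whether equal counts is even the right target is itself part of what's unclear.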
