ClamDrinker

joined 1 year ago
[–] [email protected] 0 points 11 hours ago

This is an issue for the AI user though. And I do agree that people need to be more conscious of it, but I think time will change that. Perhaps when the photo camera came out there were some schmucks who took pictures of other people's artworks and claimed them as their own, because the novelty of the technology allowed that for a bit, but eventually those people were properly distinguished from the people using the technology properly.

[–] [email protected] 1 points 12 hours ago* (last edited 11 hours ago)

Like if I download a textbook to read for a class instead of buying it - I could be prosecuted for stealing

Ehh, no, almost certainly not (though it does depend on your local laws). That honestly just sounds like a corporate boogeyman to scare you away from pirating their books. The person hosting the download, though, if they did not have the rights to distribute it freely, could possibly be prosecuted.

To illustrate, there's this story of John Cena, who sold a special Ford after signing a contract with Ford that explicitly forbade him from doing so. However, the person who bought the car was never prosecuted or sued, because they received the car from Cena with no strings attached. They couldn't be held responsible for Cena's breach of contract, but Cena was held personally responsible by Ford.

For physical goods there is 'theft by proxy' though (knowingly receiving goods that are most likely stolen), but that quite certainly doesn't apply to digital, copyable goods. After all, to even access any kind of information on the internet, you have to download, and thus copy, it.

[–] [email protected] -1 points 12 hours ago* (last edited 12 hours ago)

That would be true if they used material that was paywalled, but the vast majority of the training information used is publicly available. There are plenty of freely available books and other sources that require only an internet connection to access and learn from.

[–] [email protected] 0 points 12 hours ago* (last edited 12 hours ago)

Your first point is misguided and incorrect. If you've ever learned something by 'cramming', i.e. repeatedly ingesting material until you remember it completely, you know you don't need the book in front of you anymore to write the material down verbatim in a test. You still discarded your training material despite knowing its exact contents. If this were all the AI could do, it would indeed be an infringement machine. But you said it yourself: you need to trick the AI into doing this. It's not made to do this, though certain sentences are indeed almost certain to show up with the right conditioning. That is something anyone using an AI should be aware of, and a kind of conditioning they should avoid. (Which in practice often just means: don't ask the AI to make something infringing.)

[–] [email protected] -1 points 12 hours ago (2 children)

This would be a good point if this were the explicit purpose of the AI, which it isn't. It can quote certain information verbatim despite not containing that data verbatim, through the process of learning, for the same reason we can.

I can ask you to quote famous lines from books all day as well. The fact that you know those lines doesn't mean you've infringed on copyright. Now, if you were to put them to paper and sell them, you might get a cease and desist or a lawsuit. Therein lies the difference: your goal would be explicitly to infringe on the specific expression of those words. Any human who explicitly tries to get an AI to produce infringing material... would be infringing. And unknowing infringement... well, there are countless court cases where both sides think they did nothing wrong.

You don't even need AI for that: if you followed the Infinite Monkey Theorem and just happened to stumble upon a work falling under copyright, you still could not sell it, even though it was produced by a purely random process.

Another great example is the Mona Lisa. Most people know what it looks like, and anyone with sufficient talent could mimic it 1:1. However, there are numerous adaptations of the Mona Lisa that are not infringing (by today's standards), because they transform the work to the point where it's no longer the original expression but a re-expression of the same idea. Anything less similar to the original than that is pretty much completely safe, infringement-wise.

You're right though that OpenAI tries to cover their ass by implementing safeguards. That's to be expected, because a common legal argument is that once they become aware of a problem they have to take steps to limit the harm. They indeed cannot prevent it completely, but it's the effort that counts. Practically no moderation of that kind is 100% effective; otherwise we'd live in a pretty good world.

[–] [email protected] 1 points 12 hours ago (1 children)

Honestly, that's why open source AI is such a good thing for small creatives. Hate it or love it, anyone wielding AI with the intention of making new expression will have a much safer and more efficient path to success until they can grow big enough to hire a team of specialists. People often look at those at the top but ignore the things that can grow from the bottom and actually create more creative expression.

[–] [email protected] 1 points 14 hours ago

I think I got the point just fine... you're wasting a ton of electricity, and potentially your own money, on making text that isn't actually bad training data. Which is exactly what I said would happen.

LLMs are trained on billions of lines of text; the last figures we know are for GPT-3, with sources ranging from 570 GB to 45 TB of text. A short reddit comment is quite literally a drop in a swimming pool. Its word prediction ability isn't going to change for the worse if you just post a readable comment; it will simply be reinforced.
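To put "a drop in a swimming pool" in rough numbers, here's a back-of-the-envelope sketch using the corpus figures above; the comment size is just an assumed example:

```python
# Back-of-the-envelope estimate of how much one comment contributes to a
# GPT-3-sized training corpus. Corpus sizes are the figures mentioned above;
# the comment length is an assumed example.
corpus_low  = 570e9   # ~570 GB of text (lower figure)
corpus_high = 45e12   # ~45 TB of text (upper figure)
comment_bytes = 500   # a fairly long comment, roughly 500 characters

print(comment_bytes / corpus_low)   # ~8.8e-10 -> about one part in a billion
print(comment_bytes / corpus_high)  # ~1.1e-11
```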

And sure, you can lie in your comment, but LLMs are trained on fiction as well and have to deal with that too. There are supplementary techniques applied to make the AI less prone to hallucinations that don't involve the training data, such as RLHF (Reinforcement Learning from Human Feedback). But honestly, telling the truth is a dumb thing to use the AI for anyway. Its primary function has always been to predict words, not truth.
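As a toy sketch of what "predict words, not truth" means (this is nothing like a real LLM internally, just the statistical idea):

```python
# Toy illustration: a model built purely on word statistics picks the most
# frequent continuation, with no notion of whether it is factually correct.
from collections import Counter, defaultdict

corpus = ("the moon is made of cheese . " * 2 +
          "the moon is made of rock .").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(word):
    # Return the most frequently observed next word.
    return bigrams[word].most_common(1)[0][0]

print(predict("of"))  # 'cheese' -- the frequent answer, not the true one
```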

You would have to do this at such a scale, and so successfully vote-wise, that by the time you are significantly enough represented in the data to poison it, you would be either dead, banned, bankrupt, or excluded from the data, or Google would have moved on from Reddit.

If you hate or dislike LLMs and want to stop them, let your voice be known. Talk to people about it. Convincing one person successfully will be worth more than a thousand reddit comments. Poisoning the data directly is a thing, but it's essentially impossible to pull off alone. It's more a consequence of bad data gathering, bad storage practice, and bad training. None of those are within your control through a reddit comment.

[–] [email protected] 2 points 2 days ago

Yeah, I can definitely say that was the case for me for a while as well. It's honestly why I like Lemmy: by the nature of federation it can be both self-contained and owned by the people actually using it, yet still kept around even if a specific instance doesn't last forever.

[–] [email protected] 2 points 2 days ago (2 children)

I'm sorry to hear that, that's a shame. My experiences are more with gaming communities from the early 2000s, so perhaps my view isn't universally applicable to other hobbies, professions, and such.

[–] [email protected] 1 points 2 days ago

I'm sorry to hear that. For my part, I've seen far more (relatively) big forums either turn into a Discord or a subreddit, or just die out altogether because their costs weren't sustainable. It just seems more logical to me that the less personal places have more trouble sustaining themselves, but we can disagree on that.

[–] [email protected] 3 points 2 days ago* (last edited 2 days ago) (2 children)

I hate to ruin this for you, but if you post nonsense, it will get downvoted by humans and excluded from any data set (or included as an example of what to avoid). If it's not nonsensical enough to be downvoted, it still won't do well vote-wise and won't realistically poison any data. And if it's upvoted... it just might be good data. That is why Reddit's data is valuable to Google: it basically has a built-in system for identifying 'bad' data.
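As a rough sketch of what that kind of score-based filtering could look like when curating a data set (the field names and threshold here are made up for illustration, not how Google or Reddit actually do it):

```python
# Hypothetical sketch of score-based filtering during dataset curation.
# The fields ("body", "score") and the cutoff are assumptions for illustration.
comments = [
    {"body": "Here's how to fix that bug...", "score": 41},
    {"body": "asdf qwerty nonsense nonsense", "score": -12},
    {"body": "meh", "score": 1},
]

MIN_SCORE = 5  # arbitrary cutoff: only well-received comments enter the training set

training_set = [c["body"] for c in comments if c["score"] >= MIN_SCORE]
rejected     = [c["body"] for c in comments if c["score"] < 0]  # could serve as negative examples

print(training_set)  # ["Here's how to fix that bug..."]
```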

[–] [email protected] 3 points 2 days ago (6 children)

What makes you think so? I read 'hardcore' as 'small and tight-knit': exactly the kind of forum that could easily survive on user donations, and where, due to the more personal relationships, there's more of a loss in leaving. I know some forums fitting that description that are still around now.
