this post was submitted on 22 Jul 2023
Technology
Spoiler alert: it's because of r/counting and some weirdness in the training data.
They really scraped all subs? No one had any concerns about that?
I believe, if this sort of generative AI is going to be trustworthy in the future, we need some sort of external verification system so we can make our own trust judgements based on the data used to train the system. For example, if a system is trained including 4chan as a data source, I'm going to trust it less than if it wasn't trained using that source.
I don't think big business yet realises how important the training data is but, as soon as they do, they will want the AI companies to provide guarantees about the sanity and appropriateness of the training data.
That's how human intelligence works. We assign a value to the source of the information. The fact that AIs seem to be trained without that explains why they "lie" so much. They simply reconstruct patterns without weighting them by where they came from.
For example, take the claim "President Biden will launch a ground invasion of Russia." If the New York Times, BBC, and CNN are all reporting it, we would give that claim a higher likelihood of being true than if it were found only on random blogs. However, if the random blogs reporting it belonged to reputable reporters or bloggers on military and international affairs, we would assign it a higher likelihood of being correct than if it came from Bob's Bigfoot and Alien Sightings Index.
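To make that weighting concrete, here is a rough Python sketch. The reliability numbers are invented for illustration, and combining sources as if they were independent is a simplifying assumption (real outlets often repeat each other), so treat it as a toy, not as how any actual model is trained.

```python
# Hypothetical reliability scores in [0, 1]; the values are made up for illustration.
SOURCE_RELIABILITY = {
    "New York Times": 0.9,
    "BBC": 0.9,
    "CNN": 0.85,
    "military affairs blog": 0.7,
    "random blog": 0.3,
    "Bob's Bigfoot and Alien Sightings Index": 0.05,
}


def claim_credibility(reporting_sources, default=0.2):
    """Estimate how believable a claim is from the sources reporting it.

    Uses a noisy-OR combination: the chance that at least one source is
    right, assuming (unrealistically) that the sources are independent.
    Unknown sources fall back to `default`.
    """
    p_all_wrong = 1.0
    for source in reporting_sources:
        p_all_wrong *= 1.0 - SOURCE_RELIABILITY.get(source, default)
    return 1.0 - p_all_wrong


if __name__ == "__main__":
    for sources in (
        ["New York Times", "BBC", "CNN"],
        ["military affairs blog"],
        ["random blog"],
        ["Bob's Bigfoot and Alien Sightings Index"],
    ):
        print(sources, round(claim_credibility(sources), 4))
```

The same claim scores very differently depending on who reports it, which is the judgement a model trained on unweighted text never gets to make.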
Without the ability to check the accuracy of the source data, any generative AI could be corrupted. If you fed an art AI photos of the Statue of Liberty but kept telling it that it was the Eiffel Tower, when asked to draw the Eiffel Tower it would spit out the Statue of Liberty. Right now, without the ability to assess the accuracy of a response, the chat-based AIs are garbage for most of the use cases companies are deploying them in.
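The Statue of Liberty example can be shown with a toy "generator" that just averages whatever it was told belongs to a label. The two-number feature vectors (greenness, lattice-ness) and the function names are hypothetical stand-ins for real image data, so this is only a sketch of the failure mode.

```python
from collections import defaultdict


def train(examples):
    """Map each label to the average feature vector of the examples given that label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for features, label in examples:
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in sums[label]] for label in sums}


def generate(model, prompt):
    """'Draw' the prompt by returning the average of what the model saw under that label."""
    return model[prompt]


# Hypothetical features: (greenness, lattice-ness). These look like the
# Statue of Liberty (very green, no lattice) but are labelled as the Eiffel Tower.
POISONED_PHOTOS = [
    ((0.9, 0.1), "Eiffel Tower"),
    ((0.8, 0.2), "Eiffel Tower"),
]

if __name__ == "__main__":
    model = train(POISONED_PHOTOS)
    # The output is statue-like, because the model has no way to question its labels.
    print(generate(model, "Eiffel Tower"))
```

Nothing inside the model can detect the mistake; the only defence is checking the labels before training.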
Whether 4chan is a good data source or not depends on what you intend to use the AI for. If you want to have it interact with users on a web forum or similar context then using 4chan data would likely be very useful indeed.
Bear in mind that as long as it's properly labelled then "bad" data is still useful as an example of bad data. A common example is with image AIs, where people can give negative prompts like "ugly" and "blurry" to tell the AI to make images that are not like that.
No. There's a Computerphile video, but iirc r/counting was accidentally left in the training data set for part of the training process.
r/SubredditSimulator? What could go wrong?