this post was submitted on 28 Aug 2024

I wanted to extract some crime statistics broken down by type of crime and by population group, all of course normalized by population size. I got a nice set of tables summarizing the data for each year that I requested.

When I shared these summaries I was told they are entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?

I compared results from ChatGPT-4, Copilot and Grok, and the results are the same (Gemini says the data is unavailable, btw :)

So are LLMs reliable for research like that?

[–] [email protected] 1 points 2 months ago

If the generation temperature is non-zero (which it often is), there is inherent randomness in the output. So even if the first number in a statistic should be 1, sometimes it will just randomly pick another plausible number. Even if the network always gives the correct token the highest probability, it's basically rolling weighted dice for every token to make answers more creative, so a less likely token can still win.
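To make that concrete, here is a minimal sketch of temperature sampling using only the Python standard library. The logits are made up for illustration and `sample_token` is a hypothetical helper, not any vendor's actual decoding code; real models do the same kind of draw over a vocabulary of many thousands of tokens.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token from a {token: logit} dict at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring token.
        return max(logits, key=logits.get)
    # Scale logits by 1/temperature, then apply a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    tokens = list(weights)
    # Weighted random draw: likelier tokens win more often, but not always.
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Made-up logits for the first digit of a statistic; "1" is the correct digit.
logits = {"1": 4.0, "2": 2.5, "7": 2.0, "9": 1.5}
rng = random.Random(0)

greedy = [sample_token(logits, 0, rng) for _ in range(1000)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(1000)]

wrong_greedy = sum(tok != "1" for tok in greedy)
wrong_sampled = sum(tok != "1" for tok in sampled)
print("wrong at T=0:", wrong_greedy, "| wrong at T=1:", wrong_sampled)
```

With these particular logits the correct digit has about a 69% probability at temperature 1, so roughly three in ten samples come out wrong even though "1" is always the top-ranked token. At temperature 0 the same logits never produce a wrong digit.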

That's on top of hoping the LLM has even seen that data during training AND managed to memorize it AND that the network just happens to be able to reproduce the correct data given your prompt (it might not be able to for a different prompt).

If you want any reliability at all, you need to use RAG (retrieval-augmented generation) AND you yourself have to double-check all the references it quotes (if it even has that capability).
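Part of that double-checking can be automated. The sketch below flags every number in an answer that does not appear anywhere in the source texts you retrieved. The function name, the regex, and the sample texts are all made up for illustration. It only catches numbers that are missing from the sources; it cannot tell whether a number that does appear is attached to the right claim, so it is a first filter and not a replacement for reading the references.

```python
import re

# Matches numbers with thousands separators (1,234) or plain ones (1234, 41.5).
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def extract_numbers(text):
    """Return all numbers in the text as strings, with thousands separators removed."""
    return [match.replace(",", "") for match in NUMBER.findall(text)]

def unsupported_numbers(answer, sources):
    """Return numbers that appear in the answer but in none of the source texts."""
    supported = set()
    for text in sources:
        supported.update(extract_numbers(text))
    return [n for n in extract_numbers(answer) if n not in supported]

# Made-up example texts, not real statistics.
sources = ["Region A recorded 1,234 burglaries in 2022, a rate of 41.5 per 100,000 residents."]
answer = "Region A had 1234 burglaries in 2022 (47.5 per 100,000)."

print(unsupported_numbers(answer, sources))  # the rate 47.5 is not in the source
```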

Even if it has all the necessary information to answer correctly in its context window, it can still answer incorrectly.

None of the current models are anywhere close to producing trustworthy output 100% of the time.