We now have a benchmark for AI hallucinations, and, based on OpenAI models, 3 useful findings: 1) Larger models hallucinate less 2) If you ask models their confidence in answers, high confidence = lower chance of hallucination 3) Where accuracy is low, answers you get vary a lot

Where accuracy is low (on topics where the AI makes a lot of mistakes), answers you get vary a lot. In areas where the AI has high accuracy, you get consistent answers
Our speech-to-text models are the most accurate on the market, with top rankings across industry benchmarks.
- The highest accuracy rates: up to 95%
- Up to 30% fewer hallucinations than other leading models
- Low latency: 63 minutes of audio converts in 35 seconds
Try it via the API for free today 👇
Reducing hallucinations should be the top priority in the development of new models. As these frontier models get more and more powerful, people are going to rely on their output with a greater degree of blind faith.
A problem to attack from both sides: training better reasoning and training intellectual humility. o1 models have shown improvement by replacing "However" with "Alternatively," reducing overconfidence in reasoning paths and opening up more opportunities for superior paths.
Interesting. I'm going to edit my custom coding GPT to only answer with responses it is extremely confident in; otherwise it should inform me and suggest alternate approaches. We'll see if that reduces some of the hallucinations.
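A minimal sketch of that kind of confidence-gated custom instruction, assuming the OpenAI Python client; the system-prompt wording, the model name, and the example question are placeholders rather than a tested recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical custom instruction: answer only when highly confident,
# otherwise say so and propose alternative approaches to check.
SYSTEM_PROMPT = (
    "Only give a direct answer when you are extremely confident it is correct. "
    "If you are not, say you are not confident enough to answer directly and "
    "suggest alternate approaches or sources the user could check instead."
)

def ask(question: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Which compiler flag fixes this obscure linker error?"))
```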
The study presents a benchmark called SimpleQA that evaluates the ability of large language models to answer short, fact-seeking questions. The researchers designed SimpleQA to be challenging and have a single, indisputable answer for each question. The researchers evaluated…
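For concreteness, a rough sketch of what a SimpleQA-style grading loop could look like; the two questions and the crude string-match grader below are illustrative stand-ins (the actual benchmark classifies each answer as correct, incorrect, or not attempted using a prompted model as the grader):

```python
# Illustrative grading loop for short, single-answer factual questions.
# The questions are made up; SimpleQA's real data and grader differ.
QUESTIONS = [
    {"question": "In what year was the Eiffel Tower completed?", "answer": "1889"},
    {"question": "What is the chemical symbol for gold?", "answer": "Au"},
]

def grade(model_answer: str, gold: str) -> str:
    """Crude grader: substring match, plus an explicit 'not attempted' bucket."""
    if not model_answer.strip() or "i don't know" in model_answer.lower():
        return "not_attempted"
    return "correct" if gold.lower() in model_answer.lower() else "incorrect"

def evaluate(answer_fn) -> dict:
    counts = {"correct": 0, "incorrect": 0, "not_attempted": 0}
    for item in QUESTIONS:
        counts[grade(answer_fn(item["question"]), item["answer"])] += 1
    return counts

# Example with a dummy "model" that always declines:
print(evaluate(lambda q: "I don't know"))
```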
I wonder if o1 is better than 4o because it uses reasoning to infer answers to questions based on what it already knows.
To me, this seemed logical. In the same way, I don't quite understand why they use softmax instead of using embedding outputs and searching for the closest vector. It feels like it would make more sense. Sometimes, I wonder if I'm missing something or overthinking it.
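For what it's worth: when the output projection reuses a tied token-embedding matrix, greedy decoding from the softmax and picking the token embedding with the largest dot product against the hidden state select the same token, because softmax is monotonic in the logits; softmax mainly adds a normalized probability distribution for sampling and cross-entropy training. A toy NumPy sketch of that point, with random matrices purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64

E = rng.normal(size=(vocab_size, d_model))  # tied token embedding / output matrix
h = rng.normal(size=d_model)                # final hidden state for the next token

logits = E @ h                              # one logit per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax over the vocabulary

softmax_pick = int(np.argmax(probs))        # greedy pick from the softmax
dot_pick = int(np.argmax(logits))           # "closest vector" by dot-product similarity

assert softmax_pick == dot_pick             # same token; softmax only adds probabilities
# Note: a nearest-neighbour search under cosine distance would normalise the rows
# of E first and can pick a different token than the raw dot product.
```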
Asking an AI model to rate its confidence level on an answer is a very interesting idea that I hadn't thought about. I'll have to play with this! I wonder if the simple act of asking a model to rate itself will result in more careful answers vs. not asking?
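One way to play with it: ask for a numeric confidence alongside the answer, record whether the answer turned out to be correct, and then bucket accuracy by stated confidence. A hedged sketch, again assuming the OpenAI Python client; the JSON prompt format, the model name, and the parsing are assumptions and would need error handling in practice:

```python
import json
import statistics
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_confidence(question: str, model: str = "gpt-4o") -> dict:
    """Ask for an answer plus a self-rated confidence (0-100) as JSON."""
    prompt = (
        f"Question: {question}\n"
        'Reply only with JSON: {"answer": "...", "confidence": <integer 0-100>}'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# After collecting (stated confidence, was the answer correct) pairs,
# bucket them to see whether high stated confidence tracks higher accuracy.
def accuracy_by_confidence(records: list[tuple[int, bool]], bin_width: int = 20) -> dict:
    bins: dict[int, list[bool]] = {}
    for confidence, correct in records:
        bins.setdefault(confidence // bin_width * bin_width, []).append(correct)
    return {lo: statistics.mean(flags) for lo, flags in sorted(bins.items())}
```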
In my personal-use benchmark, I have observed that customized GPTs with uploaded reference data (PDFs, docs, etc.) hallucinate much, much less than the raw model. For anything where I need accuracy, I make a custom GPT with context data. I hope we get file uploads in the o1 family soon.
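The same grounding idea can be sketched outside the custom-GPT UI by pasting the relevant document excerpts into the prompt and telling the model to answer only from them; the prompt wording and model name below are assumptions for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grounded_answer(question: str, excerpts: list[str], model: str = "gpt-4o") -> str:
    """Answer strictly from supplied excerpts, refusing instead of guessing."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(excerpts))
    messages = [
        {
            "role": "system",
            "content": (
                "Answer using only the provided excerpts. "
                "If they do not contain the answer, say so instead of guessing."
            ),
        },
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```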
Until that value reaches at least 90%, you should not trust any data you get from AI.
I'd love to see these findings applied to real-world scenarios. How do you think this research can be used to mitigate AI hallucinations in practical applications?
Quote: Adi Simhi (@AdiSimhi)
LLMs often "hallucinate". But not all hallucinations are the same! This paper reveals two distinct types: (1) due to lack of knowledge and (2) hallucination despite knowing. Check out our new preprint, "Distinguishing Ignorance from Error in LLM Hallucinations"
In summary, larger models (which generally know more) have less reason to "fill in the gaps" in their knowledge by making up bullshit that sounds convincing? Not surprising: the fewer data points you have, the shakier the extrapolation becomes.