We now have a benchmark for AI hallucinations, and, based on OpenAI models, 3 useful findings: 1) Larger models hallucinate less 2) If you ask models their confidence in answers, high confidence = lower chance of hallucination 3) Where accuracy is low, answers you get vary a lot

Where accuracy is low (on topics where the AI makes a lot of mistakes), answers you get vary a lot. In areas where the AI has high accuracy, you get consistent answers
Our speech-to-text models are the most accurate on the market, with top rankings across industry benchmarks.
- The highest accuracy rates: up to 95%
- Up to 30% fewer hallucinations than other leading models
- Low latency: 63 minutes of audio converts in 35 seconds
Try it via the API for free today 👇
Reducing hallucinations should be the top priority in the development of new models. As these frontier models get more and more powerful, people are going to rely on their output with a greater degree of blind faith.
A problem to attack from both sides: training better reasoning and training intellectual humility. o1 models have shown improvement by replacing "However" with "Alternatively," reducing overconfidence in reasoning paths and opening up more opportunities for superior paths.
Interesting. I'm going to edit my custom coding GPT to only answer with responses it is extremely confident in; otherwise it should inform me and suggest alternate approaches. We'll see if that reduces some of the hallucinations.
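A minimal sketch of that kind of confidence-gated custom instruction, assuming the OpenAI Python client; the system-prompt wording, the model name, and the example question are placeholders rather than a tested recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical custom instruction: answer only when highly confident,
# otherwise say so and propose alternative approaches to check.
SYSTEM_PROMPT = (
    "Only give a direct answer when you are extremely confident it is correct. "
    "If you are not, say you are not confident enough to answer directly and "
    "suggest alternate approaches or sources the user could check instead."
)

def ask(question: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Which compiler flag fixes this obscure linker error?"))
```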
The study presents a benchmark called SimpleQA that evaluates the ability of large language models to answer short, fact-seeking questions. The researchers designed SimpleQA to be challenging and have a single, indisputable answer for each question. The researchers evaluated…
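For concreteness, a rough sketch of what a SimpleQA-style grading loop could look like; the two questions and the crude string-match grader below are illustrative stand-ins (the actual benchmark classifies each answer as correct, incorrect, or not attempted using a prompted model as the grader):

```python
# Illustrative grading loop for short, single-answer factual questions.
# The questions are made up; SimpleQA's real data and grader differ.
QUESTIONS = [
    {"question": "In what year was the Eiffel Tower completed?", "answer": "1889"},
    {"question": "What is the chemical symbol for gold?", "answer": "Au"},
]

def grade(model_answer: str, gold: str) -> str:
    """Crude grader: substring match, plus an explicit 'not attempted' bucket."""
    if not model_answer.strip() or "i don't know" in model_answer.lower():
        return "not_attempted"
    return "correct" if gold.lower() in model_answer.lower() else "incorrect"

def evaluate(answer_fn) -> dict:
    counts = {"correct": 0, "incorrect": 0, "not_attempted": 0}
    for item in QUESTIONS:
        counts[grade(answer_fn(item["question"]), item["answer"])] += 1
    return counts

# Example with a dummy "model" that always declines:
print(evaluate(lambda q: "I don't know"))
```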
I wonder if o1 is better than 4o because it uses reasoning to infer answers to questions based on what it already knows.
To me, this seemed logical. In the same way, I don't quite understand why they use softmax instead of using embedding outputs and searching for the closest vector. It feels like it would make more sense. Sometimes, I wonder if I'm missing something or overthinking it.
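For what it's worth: when the output projection reuses a tied token-embedding matrix, greedy decoding from the softmax and picking the token embedding with the largest dot product against the hidden state select the same token, because softmax is monotonic in the logits; softmax mainly adds a normalized probability distribution for sampling and cross-entropy training. A toy NumPy sketch of that point, with random matrices purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64

E = rng.normal(size=(vocab_size, d_model))  # tied token embedding / output matrix
h = rng.normal(size=d_model)                # final hidden state for the next token

logits = E @ h                              # one logit per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax over the vocabulary

softmax_pick = int(np.argmax(probs))        # greedy pick from the softmax
dot_pick = int(np.argmax(logits))           # "closest vector" by dot-product similarity

assert softmax_pick == dot_pick             # same token; softmax only adds probabilities
# Note: a nearest-neighbour search under cosine distance would normalise the rows
# of E first and can pick a different token than the raw dot product.
```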
Asking an AI model to rate its confidence level on an answer is a very interesting idea that I hadn't thought about. I'll have to play with this! I wonder if the simple act of asking a model to rate itself will result in more careful answers vs. not asking?
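One way to play with it: ask for a numeric confidence alongside the answer, record whether the answer turned out to be correct, and then bucket accuracy by stated confidence. A hedged sketch, again assuming the OpenAI Python client; the JSON prompt format, the model name, and the parsing are assumptions and would need error handling in practice:

```python
import json
import statistics
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_confidence(question: str, model: str = "gpt-4o") -> dict:
    """Ask for an answer plus a self-rated confidence (0-100) as JSON."""
    prompt = (
        f"Question: {question}\n"
        'Reply only with JSON: {"answer": "...", "confidence": <integer 0-100>}'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# After collecting (stated confidence, was the answer correct) pairs,
# bucket them to see whether high stated confidence tracks higher accuracy.
def accuracy_by_confidence(records: list[tuple[int, bool]], bin_width: int = 20) -> dict:
    bins: dict[int, list[bool]] = {}
    for confidence, correct in records:
        bins.setdefault(confidence // bin_width * bin_width, []).append(correct)
    return {lo: statistics.mean(flags) for lo, flags in sorted(bins.items())}
```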
In my personal-use benchmark, I have observed that customized GPTs with uploaded reference data (PDFs, docs, etc.) hallucinate much, much less than the raw model. For anything where I need accuracy, I make a custom GPT with context data. I hope we get file uploads in the o1 family soon.
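The same grounding idea can be sketched outside the custom-GPT UI by pasting the relevant document excerpts into the prompt and telling the model to answer only from them; the prompt wording and model name below are assumptions for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grounded_answer(question: str, excerpts: list[str], model: str = "gpt-4o") -> str:
    """Answer strictly from supplied excerpts, refusing instead of guessing."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(excerpts))
    messages = [
        {
            "role": "system",
            "content": (
                "Answer using only the provided excerpts. "
                "If they do not contain the answer, say so instead of guessing."
            ),
        },
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```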
Until that value reaches at least 90%, you should not trust any data you get from AI.
I'd love to see these findings applied to real-world scenarios. How do you think this research can be used to mitigate AI hallucinations in practical applications?
Quote: Adi Simhi (@AdiSimhi)
LLMs often "hallucinate". But not all hallucinations are the same! This paper reveals two distinct types: (1) due to lack of knowledge and (2) hallucination despite knowing. Check out our new preprint, "Distinguishing Ignorance from Error in LLM Hallucinations"
In summary, larger models (which generally know more) have less reason to "fill in the gaps" in their knowledge by making up bullshit that sounds convincing? Not surprising: the fewer data points you have, the shakier the extrapolation becomes.