Conversation
If you ask models "Please give your best guess, along with your confidence as a percentage that that is the correct answer," their stated confidence tracks the chance of hallucination in relative terms, but the actual percentage is poorly calibrated
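A minimal sketch of that prompt pattern, assuming the OpenAI Python SDK; the model name, example question, and answer format below are illustrative choices, not anything from the original comment.

```python
# Sketch: ask for a best guess plus a self-reported confidence percentage.
# Assumes the OpenAI Python SDK; model name and answer format are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Please give your best guess, along with your confidence as a percentage "
    "that that is the correct answer.\n"
    "Question: {question}\n"
    "Reply in the form: GUESS: <answer> | CONFIDENCE: <0-100>%"
)

def guess_with_confidence(question: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(guess_with_confidence("Who painted 'The Garden of Earthly Delights'?"))
```

Comparing the reported percentage against how often the guesses are actually right over many questions is what a calibration check like this amounts to.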
Link:
Where accuracy is low (on topics where the AI makes a lot of mistakes), the answers you get vary a lot. In areas where the AI has high accuracy, you get consistent answers.
Our speech-to-text models are the most accurate on the market with top rankings across industry benchmarks.
- The highest accuracy rates: up to 95%
- Up to 30% fewer hallucinations than other leaders
- Low latency: 63 minutes converts in 35 seconds
Try via API for free today 
Reduction in hallucinations should be the top priority when it comes to the development of new models.
As these frontier models get more and more powerful, people are going to rely on their output with a greater degree of blind faith.
Well we can all thank our lucky stars that it at least doesn't say "it depends"
A problem to attack from both sides.
Training better reasoning and intellectual humility.
o1 models have shown improvement by replacing "However" with "Alternatively," reducing overconfidence in reasoning paths and opening up more opportunities for superior paths.
Interesting, I'm going to edit my custom coding GPT to only answer with responses it is extremely confident in, and otherwise to inform me and suggest alternate approaches.
We'll see if that improves some of the hallucinations.
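For anyone curious what that custom-GPT edit could look like, here is a hypothetical instruction block; the 90% threshold and the wording are invented, and whether a self-reported confidence gate actually cuts hallucinations is exactly what this experiment would test.

```python
# Hypothetical custom instructions for a confidence-gated coding assistant.
# The threshold and phrasing are illustrative, not a tested recipe.
CONFIDENCE_GATED_INSTRUCTIONS = """\
You are a coding assistant. Before answering, privately estimate how confident
you are that your answer is correct.
- If your confidence is very high (roughly 90% or more), give the answer.
- Otherwise, say that you are not confident, explain what is uncertain, and
  suggest one or two alternative approaches I could try instead.
"""
```

The same text works as a `system` message if you call the API directly instead of using a custom GPT.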
I wonder if #3 can be leveraged to generate more randomness and diverse ideas before being passed to fact checking.
Maybe it's worth running queries 5 times and checking consistency after all.
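A quick sketch of that consistency check, reusing the hypothetical `client` from the earlier snippet: sample the same question several times at a non-zero temperature and see how often the answers agree. Exact string matching on free-text answers is crude; this only shows the shape of the idea.

```python
from collections import Counter

def consistency_check(question: str, n: int = 5, model: str = "gpt-4o"):
    """Ask the same question n times and report the majority answer plus
    the fraction of samples that agree with it (a rough consistency score)."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=1.0,  # deliberately non-deterministic to surface disagreement
        )
        answers.append(response.choices[0].message.content.strip().lower())
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n

answer, agreement = consistency_check("In what year did the Eiffel Tower open to the public?")
print(f"majority answer: {answer!r}, agreement: {agreement:.0%}")
```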
The study presents a benchmark called SimpleQA that evaluates the ability of large language models to answer short, fact-seeking questions. The researchers designed SimpleQA to be challenging and have a single, indisputable answer for each question.
The researchers evaluated
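As a toy illustration of what such a benchmark implies, the sketch below scores short, single-answer questions as correct, incorrect, or not attempted. The questions, the string normalization, and the refusal heuristic here are all invented; the actual SimpleQA grading procedure may differ.

```python
# Toy grading loop for short fact-seeking questions with one gold answer.
# Questions, normalization, and the refusal heuristic are invented for illustration.
def grade(prediction: str, gold: str) -> str:
    pred = prediction.strip().lower()
    if pred in {"", "i don't know", "i'm not sure"}:
        return "not_attempted"
    return "correct" if gold.strip().lower() in pred else "incorrect"

dataset = [
    {"question": "Who wrote the novel 'The Master and Margarita'?", "answer": "Mikhail Bulgakov"},
    {"question": "In what year was the Rosetta Stone discovered?", "answer": "1799"},
]

for item in dataset:
    prediction = "I'm not sure"  # replace with a model call
    print(item["question"], "->", grade(prediction, item["answer"]))
```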
I wonder if o1 is better than 4o because it uses reasoning to infer answers to questions based on what it already knows.
To me, this seemed logical. In the same way, I don't quite understand why they use softmax instead of using embedding outputs and searching for the closest vector. It feels like it would make more sense. Sometimes, I wonder if I'm missing something or overthinking it.
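To make the contrast in that comment concrete, here is a small numpy sketch of the two readout strategies: a softmax over vocabulary logits versus taking the final hidden state and finding its closest vector among the output embeddings. Shapes and data are made up; when the logits are just `embeddings @ hidden`, the softmax argmax and the dot-product nearest neighbor pick the same token, and they only diverge once you normalize (e.g. cosine similarity).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64

embeddings = rng.normal(size=(vocab_size, d_model))  # toy output embedding matrix
hidden = rng.normal(size=(d_model,))                 # final hidden state for one position

# Strategy 1: softmax over vocabulary logits (the usual LM head).
logits = embeddings @ hidden
probs = np.exp(logits - logits.max())
probs /= probs.sum()
softmax_token = int(np.argmax(probs))

# Strategy 2: closest-vector search in embedding space (cosine similarity).
cosine = (embeddings @ hidden) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(hidden))
nearest_token = int(np.argmax(cosine))

print(softmax_token, nearest_token)
```

One practical reason for the softmax is that it yields a full probability distribution to sample from and to train against with cross-entropy, which a pure closest-vector lookup does not.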
Build powerful products with the most accurate Speech AI models on the market.
Superhuman transcription accuracy and low latency
30% fewer hallucinations than other providers
13.5% more accurate than models like Whisper
Start building with $50 in free credits 
Actually makes sense - like how experienced people tend to be more accurate and know when to say "I'm not sure"
Asking an AI model to rate its confidence level on an answer is a very interesting idea that I hadn't thought about. I'll have to play with this!
I wonder if the simple act of asking a model to rate itself will result in more careful answers vs not asking this?
In my personal use benchmark, I have observed that customized GPTs with training data (PDFs, docs, etc.) hallucinate much, much less than the raw model. For anything where I need accuracy, I make a custom GPT with context data. I hope we get file upload in the o1 family soon.
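A bare-bones version of that grounding pattern, again reusing the hypothetical `client` from above: paste relevant document text into the prompt and instruct the model to answer only from it. The keyword filter and prompt wording are stand-ins; this is not how custom GPTs work internally, just the general idea of answering from supplied context.

```python
# Bare-bones grounding sketch: answer only from supplied document text.
# The keyword filter and prompt wording are stand-ins for illustration.
def grounded_answer(question: str, documents: list[str], model: str = "gpt-4o") -> str:
    words = question.lower().split()
    relevant = [d for d in documents if any(w in d.lower() for w in words)]
    context = "\n\n".join(relevant or documents)
    prompt = (
        "Answer the question using ONLY the context below. If the context does "
        "not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```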
You probably saw this study that drew the opposite conclusion to #1. I didn't find it convincing - they tested the wrong tasks, I thought. Sharing fyi:
I'd love to see these findings applied to real-world scenarios. How do you think this research can be used to mitigate AI hallucinations in practical applications?
In summary, larger models (which generally know more) have less reason to "fill in the gaps" in their knowledge by making up bullshit that sounds convincing? Not surprising: the fewer data points you have, the shakier extrapolation becomes.