
New OpenAI models smarter but more prone to misinformation

OpenAI's latest AI models, o3 and o4-mini, produce more misinformation despite being more advanced, raising concerns about their reliability in real-world applications. Researchers link the rise in false outputs to their complex reasoning style.

Agencies and A News
Published May 08, 2025

AI is advancing rapidly, but not without problems. OpenAI's latest models, o3 and the more compact o4-mini, were designed to better mimic human reasoning. However, recent research reveals that these smarter models actually generate more misleading information.

Since the emergence of chatbots, hallucinations—false or fabricated responses—have remained a persistent issue. While newer models were expected to reduce such errors, OpenAI's findings show the opposite: hallucinations are increasing.

In OpenAI's PersonQA benchmark, a test of questions about public figures, o3 gave inaccurate information in 33% of its responses, roughly double the rate of the earlier o1 model. o4-mini performed even worse, producing false information 48% of the time.
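For readers curious how such percentages are produced, the sketch below shows the general shape of a hallucination test: ask the model factual questions with known answers and count the mismatches. It assumes OpenAI's public Python SDK; the reference questions and the crude substring check are invented for illustration and are not OpenAI's actual benchmark method.

```python
# Minimal sketch of measuring a hallucination rate against a reference set.
# Assumes the `openai` Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Illustrative reference pairs; real benchmarks use thousands of items.
reference_qa = [
    ("In what year was Marie Curie born?", "1867"),
    ("Who painted 'The Starry Night'?", "Vincent van Gogh"),
]

def hallucination_rate(model: str) -> float:
    wrong = 0
    for question, answer in reference_qa:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        reply = response.choices[0].message.content or ""
        # Crude substring check; real graders are far more careful.
        if answer.lower() not in reply.lower():
            wrong += 1
    return wrong / len(reference_qa)

print(f"Hallucination rate: {hallucination_rate('o4-mini'):.0%}")
```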

Ironically, the step-by-step "chain-of-thought" reasoning meant to improve accuracy may be part of the problem. Researchers say the more the model "thinks," the more opportunities it has to veer off track. Unlike older models that prioritized safe, concise answers, the newer ones try to bridge complex ideas, sometimes reaching strange or incorrect conclusions.
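One way to see why longer reasoning chains can hurt: if each intermediate step is right only most of the time, the errors compound. The toy calculation below makes the point under a simplifying assumption (independent, equally reliable steps); the 98% per-step figure is invented for illustration, not a measured property of any model.

```python
# Toy model of error compounding in multi-step reasoning:
# if each step is independently correct with probability p,
# a k-step chain is fully correct with probability p**k.
def chain_accuracy(p: float, steps: int) -> float:
    return p ** steps

for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps at 98% per-step accuracy: "
          f"{chain_accuracy(0.98, steps):.0%} end-to-end")
# Prints: 1 step -> 98%, 5 -> 90%, 10 -> 82%, 20 -> 67%
```

Even a very reliable per-step accuracy erodes quickly over a long chain, which mirrors the pattern researchers describe.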

OpenAI suggests the issue isn't just how the models think but how confidently and verbosely they express themselves. In striving to be thorough and helpful, an AI can blur the line between guesswork and fact, often sounding convincing while being completely wrong.

These hallucinations pose serious risks in real-world settings, such as legal, medical, educational, or governmental use. Misleading content in a court filing or medical report could be disastrous. There have already been cases where lawyers were sanctioned for submitting fabricated case citations from ChatGPT.

As AI becomes more integrated into daily life, we might expect it to make fewer errors. But the paradox is clear: the more useful it becomes, the greater the impact of its mistakes.