OpenAI’s o3 model has emerged as the most cognitively capable AI system in a new benchmark analysis published by Voronoi, based on data from Tracking AI. The test, which uses the Mensa Norway IQ test, a high-difficulty assessment designed for human intelligence evaluation, placed o3 at an IQ score of 135, well above the human average of 90–110.
Other high scorers include Anthropic’s Claude-4 Sonnet at 127 and Google’s Gemini 2.0 Flash at 126.
The analysis covered 24 leading AI models, with top positions mostly occupied by text-only models, while vision-enabled systems scored significantly lower.
GPT-4o with vision, for example, received an IQ score of 63, while Grok-3 Think (Vision) followed with 60.
These results suggest that while language-based reasoning capabilities in AI are rapidly surpassing human benchmarks, vision-based and multimodal systems still lag in abstract problem-solving tasks.
The test results raise important questions about how AI models are architected and trained, particularly when it comes to general intelligence versus domain-specific strengths.
Voronoi’s findings reflect a broader trend in AI development, where performance gains in language models continue to dominate, but genuine multimodal reasoning remains a key challenge.
Not Thinking, Yet
Coincidentally, in a new paper titled The Illusion of Thinking, researchers from Apple argued that even the most advanced AI models, including the so-called large reasoning models (LRMs), don’t actually think. Instead, they simulate reasoning without truly understanding or solving complex problems.
The paper, released just ahead of Apple’s Worldwide Developer Conference, tested leading AI models, including OpenAI’s o1 and o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, using specially designed algorithmic puzzle environments rather than standard benchmarks.
The researchers argue that traditional benchmarks, such as math and coding tests, are flawed by “data contamination”, where models may have encountered the test problems during training, and so fail to reveal how these models actually “think”.
“We show that state-of-the-art LRMs still fail to develop generalisable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,” the paper noted.
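To illustrate what such a puzzle environment looks like, the sketch below implements a Tower of Hanoi solver and checker in Python; the Tower of Hanoi is one of the puzzles the paper reports using, though this code is an illustrative assumption, not the authors’ implementation. Because the optimal solution requires 2^n − 1 moves, raising the disk count n gives researchers a precise dial for scaling problem complexity while still grading a model’s answer exactly.

```python
# Illustrative sketch (not from the paper's code): a Tower of Hanoi
# environment where difficulty scales with the disk count n, and where a
# proposed move sequence can be verified mechanically, rule by rule.

def solve_hanoi(n, src=0, aux=1, dst=2, moves=None):
    """Generate the optimal move list as (from_peg, to_peg) pairs."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, src, dst, aux, moves)
    moves.append((src, dst))                 # move the largest remaining disk
    solve_hanoi(n - 1, aux, src, dst, moves)
    return moves

def check_solution(n, moves):
    """Verify a move sequence against the puzzle's rules.

    This is how an algorithmic puzzle environment can grade an answer
    exactly, with no reliance on memorised benchmark data.
    """
    pegs = [list(range(n, 0, -1)), [], []]   # peg 0 holds disks n..1 (top = 1)
    for src, dst in moves:
        if not pegs[src]:
            return False                     # illegal: source peg is empty
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # illegal: larger disk on smaller
        pegs[dst].append(disk)
    return pegs[2] == list(range(n, 0, -1))  # solved iff all disks on peg 2

if __name__ == "__main__":
    for n in range(1, 11):
        optimal = solve_hanoi(n)
        assert check_solution(n, optimal)
        # Required moves double with each extra disk: 1, 3, 7, ..., 1023.
        print(f"n={n:2d} disks -> {len(optimal):4d} moves (2^n - 1 = {2**n - 1})")
```

Running the sketch shows the required move count doubling with each added disk, the kind of controllable complexity growth against which the researchers observed model accuracy eventually collapsing.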