Google's Gemini 3 model excels in benchmarks but faces reliability issues

Google has unveiled its latest AI model, Gemini 3, which outperforms rivals on several key benchmarks, including a score of 37.5 percent on Humanity’s Last Exam. The company claims it achieves PhD-level reasoning, yet experts caution that such scores may not reflect real-world capabilities. Persistent hallucinations remain a concern for practical applications.

In a recent blog post, Google executives Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu announced the release of Gemini 3, highlighting its superior performance on graduate-level tests. The model scored 37.5 percent on Humanity’s Last Exam, a set of 2,500 research-level questions across maths, science, and humanities, surpassing OpenAI’s GPT-5, which achieved 26.5 percent.

Experts like Luc Rocher from the University of Oxford emphasize the limitations of these benchmarks. “If a model goes from 80 per cent to 90 per cent on a benchmark, what does it mean? Does it mean that a model was 80 per cent PhD level and now is 90 per cent PhD level? I think it’s quite difficult to understand,” Rocher said. He added, “There is no number that we can put on whether an AI model has reasoning, because this is a very subjective notion.” Benchmarks often rely on multiple-choice formats that do not require showing work, and there is a risk that training data includes test answers, allowing models to effectively cheat.

Google says Gemini 3’s improvements will help users produce software, organise email inboxes and analyse documents, and will enhance Google search results with added graphics and simulations. Adam Mahdi from the University of Oxford predicts the benefits will show up in agentic coding workflows rather than casual chatting: “I think we’re hitting the upper limit of what a typical chatbot can do, and the real benefits of Gemini 3 Pro will probably be in more complex, potentially agentic workflows, rather than everyday chatting.”

Online reactions mix praise for coding and reasoning abilities with criticism over failures in simple visual tasks, like tracing hand-drawn arrows. Google acknowledges ongoing hallucinations and factual inaccuracies at rates similar to competitors. Artur d’Avila Garcez from City St George’s, University of London, warns, “The problem is that all AI companies have been trying to reduce hallucinations for more than two years, but you only need one very bad hallucination to destroy trust in the system for good.” These issues raise questions about whether the massive investments in AI infrastructure are justified.
