A New York Times analysis found that Google's AI Overviews, powered by Gemini, answered only about 85% to 91% of questions correctly on a standard benchmark. Even at that accuracy, Google's search volume implies tens of millions of incorrect responses every day. Google disputes the test's relevance.
The New York Times, working with the startup Oumi, tested AI Overviews using SimpleQA, a benchmark of over 4,000 questions released by OpenAI in 2024. Initial tests with Gemini 2.5 showed 85% accuracy, improving to 91% after the Gemini 3 update. Extrapolated to Google's search volume, that failure rate translates into tens of millions of wrong answers generated each day, or millions per hour.
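The extrapolation is simple arithmetic. As a back-of-envelope sketch: the daily volume of AI Overview responses used below is a hypothetical illustration, not a figure from the analysis; only the 91% accuracy comes from the reported SimpleQA result.

```python
# Back-of-envelope extrapolation of the SimpleQA finding.
# ASSUMPTION: daily response volume is illustrative, not reported.
DAILY_AI_OVERVIEW_RESPONSES = 1_000_000_000  # hypothetical volume
ACCURACY = 0.91  # Gemini 3 accuracy on SimpleQA, per the analysis

wrong_per_day = DAILY_AI_OVERVIEW_RESPONSES * (1 - ACCURACY)
wrong_per_hour = wrong_per_day / 24

print(f"{wrong_per_day:,.0f} wrong answers per day")   # ~90 million
print(f"{wrong_per_hour:,.0f} wrong answers per hour") # ~3.8 million
```

Under that assumed volume, a 9% error rate yields roughly 90 million wrong answers a day, consistent with the "tens of millions daily, millions per hour" framing.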