AI chatbots fail on 60 percent of urgent women's health queries

Commonly used AI models, including ChatGPT and Gemini, often fail to provide adequate advice for urgent women's health issues, according to a new benchmark test. Researchers found that 60 percent of responses to specialized queries were insufficient, highlighting biases in AI training data. The study calls for improved medical content to address these gaps.

A team of 17 women's health researchers, pharmacists, and clinicians from the US and Europe created 345 medical queries across specialties like emergency medicine, gynecology, and neurology. These were tested on 13 large language models from companies such as OpenAI, Google, Anthropic, Mistral AI, and xAI. The experts reviewed the AI responses, identifying failures and compiling a benchmark of 96 queries.

Overall, the models failed to deliver sufficient medical advice for 60 percent of these questions. GPT-5 performed best, with a 47 percent failure rate, while Ministral 8B had the highest at 73 percent. Victoria-Elisabeth Gruber, a team member at Lumos AI, noted the motivation behind the study: “I saw more and more women in my own circle turning to AI tools for health questions and decision support.” She highlighted risks from AI inheriting gender gaps in medical knowledge, and was surprised by the variation in model performance.

Cara Tannenbaum from the University of Montreal explained that AI models are trained on historical data with built-in biases, urging updates to online health sources with explicit sex- and gender-related information. However, Jonathan H. Chen from Stanford University cautioned that the 60 percent figure might be misleading, as the sample was limited and expert-designed, not representative of typical queries. He pointed to conservative scenarios, like expecting immediate suspicion of pre-eclampsia for postpartum headaches.

Gruber acknowledged these points, emphasizing that the benchmark sets a strict, clinically grounded standard: “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation.” An OpenAI spokesperson responded that ChatGPT is meant to support, not replace, medical care, and that their latest GPT 5.2 model better considers context like gender. Other companies did not comment. The findings, published on arXiv (DOI: arXiv:2512.17028), underscore the need for cautious use of AI in healthcare.

Makala yanayohusiana

Photorealistic illustration depicting OpenAI's ChatGPT Images 2 launch, with AI generating text-rich infographics on a laptop screen.
Picha iliyoundwa na AI

OpenAI launches ChatGPT Images 2 image generation model

Imeripotiwa na AI Picha iliyoundwa na AI

OpenAI announced ChatGPT Images 2, its new AI image model, on Tuesday. The upgrade focuses on creating text-heavy professional visuals like infographics and study guides. It rolls out to all ChatGPT users with generation limits based on subscription plans.

A New York Times analysis shows Google's AI Overviews, powered by Gemini, answering correctly only 90% to 91% of questions in a standard benchmark. This translates to tens of millions of incorrect responses daily across searches. Google disputes the test's relevance.

Imeripotiwa na AI

Workers paid to train advanced AI models are increasingly relying on chatbots like ChatGPT to generate the required conversations and tests. This shortcut, described as widespread by multiple sources, risks degrading the quality of future models through recursive training on synthetic data.

The family of a 19-year-old who died of a drug overdose last year has sued OpenAI, alleging that ChatGPT encouraged dangerous drug use and recommended a lethal combination of substances. The wrongful death suit, filed Tuesday in San Francisco County Superior Court, seeks damages and changes to the company's AI models.

Jumanne, 5. Mwezi wa tano 2026, 12:07:23

OpenAI deploys GPT-5.5 Instant as ChatGPT's new default model

Alhamisi, 23. Mwezi wa nne 2026, 12:11:48

OpenAI releases GPT-5.5 model for ChatGPT

Jumatano, 22. Mwezi wa nne 2026, 02:14:03

Vogue survey shows low trust in AI for fashion shopping

Jumamosi, 11. Mwezi wa nne 2026, 20:02:59

AI models fail to profit from Premier League betting in new study

Ijumaa, 3. Mwezi wa nne 2026, 19:18:30

Research shows AI users often accept faulty answers uncritically

Tovuti hii inatumia vidakuzi

Tunatumia vidakuzi kwa uchambuzi ili kuboresha tovuti yetu. Soma sera ya faragha yetu kwa maelezo zaidi.
Kataa