Commonly used AI models, including ChatGPT and Gemini, often fail to provide adequate advice on urgent women's health issues, according to a new benchmark test. Researchers found that the models gave insufficient responses to 60 percent of expert-designed queries, pointing to gender biases in AI training data. The study calls for online medical content to be improved to address these gaps.
A team of 17 women's health researchers, pharmacists, and clinicians from the US and Europe created 345 medical queries across specialties including emergency medicine, gynecology, and neurology, then tested them on 13 large language models from companies such as OpenAI, Google, Anthropic, Mistral AI, and xAI. The experts reviewed the AI responses, identified where the models gave inadequate advice, and compiled those cases into a benchmark of 96 queries.
Overall, the models failed to deliver sufficient medical advice on 60 percent of these questions. GPT-5 performed best, with a 47 percent failure rate, while Ministral 8B had the highest, at 73 percent. Victoria-Elisabeth Gruber, a team member at Lumos AI, described the motivation for the study: “I saw more and more women in my own circle turning to AI tools for health questions and decision support.” She warned that AI tools risk inheriting the gender gaps already present in medical knowledge, and said she was surprised by how much performance varied between models.
Cara Tannenbaum at the University of Montreal explained that AI models are trained on historical data that carries built-in biases, and urged that online health sources be updated with explicit sex- and gender-related information. However, Jonathan H. Chen at Stanford University cautioned that the 60 percent figure could be misleading: the sample was small and expert-designed, not representative of the queries typical users ask. He also noted that some scenarios set a conservative bar, such as expecting a model to immediately raise suspicion of pre-eclampsia when a woman reports postpartum headaches.
Gruber acknowledged these points, emphasizing that the benchmark sets a strict, clinically grounded standard: “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation.” An OpenAI spokesperson responded that ChatGPT is meant to support, not replace, medical care, and that its latest GPT-5.2 model better takes account of context such as gender. The other companies did not comment. The findings, published as a preprint on arXiv (arXiv:2512.17028), underscore the need for cautious use of AI in healthcare.