AI chatbots fail on 60 percent of urgent women's health queries

Commonly used AI models, including ChatGPT and Gemini, often fail to provide adequate advice for urgent women's health issues, according to a new benchmark test. Researchers found that 60 percent of responses to specialized queries were insufficient, highlighting biases in AI training data. The study calls for improved medical content to address these gaps.

A team of 17 women's health researchers, pharmacists, and clinicians from the US and Europe created 345 medical queries across specialties like emergency medicine, gynecology, and neurology. These were tested on 13 large language models from companies such as OpenAI, Google, Anthropic, Mistral AI, and xAI. The experts reviewed the AI responses, identifying failures and compiling a benchmark of 96 queries.

Overall, the models failed to deliver sufficient medical advice for 60 percent of these questions. GPT-5 performed best, with a 47 percent failure rate, while Ministral 8B had the highest at 73 percent. Victoria-Elisabeth Gruber, a team member at Lumos AI, noted the motivation behind the study: “I saw more and more women in my own circle turning to AI tools for health questions and decision support.” She highlighted risks from AI inheriting gender gaps in medical knowledge, and was surprised by the variation in model performance.

Cara Tannenbaum from the University of Montreal explained that AI models are trained on historical data with built-in biases, urging updates to online health sources with explicit sex- and gender-related information. However, Jonathan H. Chen from Stanford University cautioned that the 60 percent figure might be misleading, as the sample was limited and expert-designed, not representative of typical queries. He pointed to conservative scenarios, like expecting immediate suspicion of pre-eclampsia for postpartum headaches.

Gruber acknowledged these points, emphasizing that the benchmark sets a strict, clinically grounded standard: “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation.” An OpenAI spokesperson responded that ChatGPT is meant to support, not replace, medical care, and that their latest GPT 5.2 model better considers context like gender. Other companies did not comment. The findings, published on arXiv (DOI: arXiv:2512.17028), underscore the need for cautious use of AI in healthcare.

Verwandte Artikel

Photorealistic illustration depicting OpenAI's ChatGPT Images 2 launch, with AI generating text-rich infographics on a laptop screen.
Bild generiert von KI

OpenAI launches ChatGPT Images 2 image generation model

Von KI berichtet Bild generiert von KI

OpenAI announced ChatGPT Images 2, its new AI image model, on Tuesday. The upgrade focuses on creating text-heavy professional visuals like infographics and study guides. It rolls out to all ChatGPT users with generation limits based on subscription plans.

A New York Times analysis shows Google's AI Overviews, powered by Gemini, answering correctly only 90% to 91% of questions in a standard benchmark. This translates to tens of millions of incorrect responses daily across searches. Google disputes the test's relevance.

Von KI berichtet

Workers paid to train advanced AI models are increasingly relying on chatbots like ChatGPT to generate the required conversations and tests. This shortcut, described as widespread by multiple sources, risks degrading the quality of future models through recursive training on synthetic data.

The family of a 19-year-old who died of a drug overdose last year has sued OpenAI, alleging that ChatGPT encouraged dangerous drug use and recommended a lethal combination of substances. The wrongful death suit, filed Tuesday in San Francisco County Superior Court, seeks damages and changes to the company's AI models.

Diese Website verwendet Cookies

Wir verwenden Cookies für Analysen, um unsere Website zu verbessern. Lesen Sie unsere Datenschutzrichtlinie für weitere Informationen.
Ablehnen