AI chatbots fail on 60 percent of urgent women's health queries

January 7, 2026

Reported by AI

Commonly used AI models, including ChatGPT and Gemini, often fail to provide adequate advice for urgent women's health issues, according to a new benchmark test. Researchers found that 60 percent of responses to specialized queries were insufficient, highlighting biases in AI training data. The study calls for improved medical content to address these gaps.

A team of 17 women's health researchers, pharmacists, and clinicians from the US and Europe created 345 medical queries across specialties like emergency medicine, gynecology, and neurology. These were tested on 13 large language models from companies such as OpenAI, Google, Anthropic, Mistral AI, and xAI. The experts reviewed the AI responses, identifying failures and compiling a benchmark of 96 queries.

Overall, the models failed to deliver sufficient medical advice for 60 percent of these questions. GPT-5 performed best, with a 47 percent failure rate, while Ministral 8B had the highest failure rate, at 73 percent. Victoria-Elisabeth Gruber, a team member at Lumos AI, described the motivation behind the study: “I saw more and more women in my own circle turning to AI tools for health questions and decision support.” She highlighted the risk of AI inheriting gender gaps in medical knowledge and said she was surprised by the variation in model performance.

Cara Tannenbaum from the University of Montreal explained that AI models are trained on historical data with built-in biases, urging updates to online health sources with explicit sex- and gender-related information. However, Jonathan H. Chen from Stanford University cautioned that the 60 percent figure might be misleading, as the sample was limited and expert-designed, not representative of typical queries. He pointed to conservative scenarios, like expecting immediate suspicion of pre-eclampsia for postpartum headaches.

Gruber acknowledged these points, emphasizing that the benchmark sets a strict, clinically grounded standard: “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation.” An OpenAI spokesperson responded that ChatGPT is meant to support, not replace, medical care, and that the company's latest GPT-5.2 model better considers context like gender. Other companies did not comment. The findings, published on arXiv (arXiv:2512.17028), underscore the need for cautious use of AI in healthcare.


