AI chatbots fail on 60 percent of urgent women's health queries

Commonly used AI models, including ChatGPT and Gemini, often fail to provide adequate advice for urgent women's health issues, according to a new benchmark test. Researchers found that 60 percent of responses to expert-designed benchmark queries were insufficient, pointing to biases in AI training data. The study calls for improved medical content to address these gaps.

A team of 17 women's health researchers, pharmacists, and clinicians from the US and Europe created 345 medical queries across specialties like emergency medicine, gynecology, and neurology. These were tested on 13 large language models from companies such as OpenAI, Google, Anthropic, Mistral AI, and xAI. The experts reviewed the AI responses, identifying failures and compiling a benchmark of 96 queries.

Overall, the models failed to deliver sufficient medical advice for 60 percent of these questions. GPT-5 performed best, with a 47 percent failure rate, while Ministral 8B had the highest at 73 percent. Victoria-Elisabeth Gruber, a team member at Lumos AI, noted the motivation behind the study: “I saw more and more women in my own circle turning to AI tools for health questions and decision support.” She highlighted risks from AI inheriting gender gaps in medical knowledge, and was surprised by the variation in model performance.
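The per-model failure rates reported above amount to a simple aggregation of expert pass/fail judgments over the benchmark queries. A minimal sketch of that tally is below; the model names and judgment data are illustrative placeholders, not the study's actual data or code.

```python
from collections import defaultdict

# Hypothetical expert judgments: (model, query_id, sufficient?) tuples.
# In the study, each of the 96 benchmark queries was judged per model.
judgments = [
    ("gpt-5", "q1", True), ("gpt-5", "q2", False),
    ("ministral-8b", "q1", False), ("ministral-8b", "q2", False),
]

def failure_rates(judgments):
    """Per-model share of responses judged medically insufficient."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for model, _query, sufficient in judgments:
        totals[model] += 1
        if not sufficient:
            failures[model] += 1
    return {model: failures[model] / totals[model] for model in totals}

print(failure_rates(judgments))
# → {'gpt-5': 0.5, 'ministral-8b': 1.0}
```

With real judgment data, the same aggregation would reproduce the headline figures (e.g. a 47 percent failure rate for GPT-5, 73 percent for Ministral 8B).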

Cara Tannenbaum from the University of Montreal explained that AI models are trained on historical data with built-in biases, and urged that online health sources be updated with explicit sex- and gender-related information. However, Jonathan H. Chen from Stanford University cautioned that the 60 percent figure might be misleading: the sample was small and expert-designed, not representative of typical queries. He noted that some benchmark scenarios set a deliberately conservative bar, such as expecting a model to immediately suspect pre-eclampsia when presented with a postpartum headache.

Gruber acknowledged these points, emphasizing that the benchmark sets a strict, clinically grounded standard: “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation.” An OpenAI spokesperson responded that ChatGPT is meant to support, not replace, medical care, and that their latest GPT-5.2 model better considers context like gender. Other companies did not comment. The findings, published on arXiv (DOI: arXiv:2512.17028), underscore the need for cautious use of AI in healthcare.

Related articles


Increased use of AI chatbots among Swedes, but also growing concerns


According to the latest SOM survey from the University of Gothenburg, the share of Swedes who chat with an AI bot weekly rose from 12 to 36 percent between 2024 and 2025. At the same time, skepticism toward AI has grown, with 62 percent seeing it as a greater risk than opportunity for society.

A new study from Brown University identifies significant ethical concerns with using AI chatbots like ChatGPT for mental health advice. Researchers found that these systems often violate professional standards even when prompted to act as therapists. The work calls for better safeguards before deploying such tools in sensitive areas.


Researchers at UC San Francisco and Wayne State University found that generative AI can process complex medical datasets faster than traditional human teams, sometimes yielding stronger results. The study focused on predicting preterm birth using data from over 1,000 pregnant women. This approach reduced analysis time from months to minutes in some cases.

OpenAI intends to launch a text-only adult mode for ChatGPT, enabling adult-themed conversations but not erotic media, despite unanimous opposition from its wellbeing advisers. The company describes the content as 'smut rather than pornography,' according to a spokesperson cited by The Wall Street Journal. Launch has been delayed from early 2026 amid concerns over minors' access and emotional dependence.


OpenAI has launched GPT-5.5, its latest AI model integrated into ChatGPT, seven weeks after GPT-5.4. The update focuses on coding, computer use, and research, with enhanced agentic capabilities for independent task completion. Paying ChatGPT and Codex users can access it now, with API rollout planned soon.

OpenAI has launched GPT-5.4, including variants Thinking and Pro, aimed at improving agentic tasks and knowledge work. The update features enhanced computer-use capabilities and reduced factual errors, amid competition from Anthropic following a US defense deal controversy. The models are available immediately to paid users and developers.


A study published March 24, 2026 in *Radiology* reports that AI-generated “deepfake” X-rays can be convincing enough to mislead radiologists and several multimodal AI systems. In testing, radiologists’ average accuracy rose from 41% when they were not told fakes were included to 75% when they were warned, highlighting potential risks for medical imaging security and clinical decision-making.

