AI chatbots fail on 60 percent of urgent women's health queries

Commonly used AI models, including ChatGPT and Gemini, often fail to provide adequate advice for urgent women's health issues, according to a new benchmark test. Researchers found that 60 percent of responses to specialized queries were insufficient, highlighting biases in AI training data. The study calls for improved medical content to address these gaps.

A team of 17 women's health researchers, pharmacists, and clinicians from the US and Europe created 345 medical queries across specialties like emergency medicine, gynecology, and neurology. These were tested on 13 large language models from companies such as OpenAI, Google, Anthropic, Mistral AI, and xAI. The experts reviewed the AI responses, identifying failures and compiling a benchmark of 96 queries.

Overall, the models failed to deliver sufficient medical advice for 60 percent of these questions. GPT-5 performed best, with a 47 percent failure rate, while Ministral 8B had the highest at 73 percent. Victoria-Elisabeth Gruber, a team member at Lumos AI, noted the motivation behind the study: “I saw more and more women in my own circle turning to AI tools for health questions and decision support.” She highlighted risks from AI inheriting gender gaps in medical knowledge, and was surprised by the variation in model performance.
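The failure rates above come from experts grading each model's answers against a clinical rubric. As a rough illustration of that kind of tally, here is a minimal sketch of computing per-model failure rates from pass/fail gradings; the data values and field names are invented, not taken from the study.

```python
from collections import defaultdict

# Hypothetical gradings: (model, query_id, passed_expert_review).
# Values are illustrative only and do not reflect the study's data.
gradings = [
    ("gpt-5", 1, True), ("gpt-5", 2, False),
    ("ministral-8b", 1, False), ("ministral-8b", 2, False),
]

def failure_rates(gradings):
    """Return the fraction of graded responses each model failed."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for model, _query, passed in gradings:
        totals[model] += 1
        if not passed:
            failures[model] += 1
    return {m: failures[m] / totals[m] for m in totals}

print(failure_rates(gradings))
```

In the study itself the grading was done by clinicians reviewing free-text answers, so the boolean here stands in for a much richer expert judgment.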

Cara Tannenbaum from the University of Montreal explained that AI models are trained on historical data that carries built-in biases, and urged that online health sources be updated with explicit sex- and gender-related information. However, Jonathan H. Chen from Stanford University cautioned that the 60 percent figure might be misleading: the sample was small and expert-designed, not representative of typical user queries. He also noted that some scenarios were graded conservatively, such as expecting a model to immediately suspect pre-eclampsia when a postpartum user reports a headache.

Gruber acknowledged these points, emphasizing that the benchmark sets a strict, clinically grounded standard: “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation.” An OpenAI spokesperson responded that ChatGPT is meant to support, not replace, medical care, and that the latest GPT-5.2 model better accounts for context such as gender. The other companies did not comment. The findings, published on arXiv (arXiv:2512.17028), underscore the need for cautious use of AI in healthcare.
