AI chatbots fail on 60 percent of urgent women's health queries

Commonly used AI models, including ChatGPT and Gemini, often fail to provide adequate advice for urgent women's health issues, according to a new benchmark test. Researchers found that 60 percent of responses to specialized queries were insufficient, highlighting biases in AI training data. The study calls for improved medical content to address these gaps.

A team of 17 women's health researchers, pharmacists, and clinicians from the US and Europe created 345 medical queries across specialties like emergency medicine, gynecology, and neurology. These were tested on 13 large language models from companies such as OpenAI, Google, Anthropic, Mistral AI, and xAI. The experts reviewed the AI responses, identifying failures and compiling a benchmark of 96 queries.

Overall, the models failed to deliver sufficient medical advice for 60 percent of these questions. GPT-5 performed best, with a 47 percent failure rate, while Ministral 8B had the highest failure rate, at 73 percent. Victoria-Elisabeth Gruber, a member of the research team at Lumos AI, described the motivation behind the study: “I saw more and more women in my own circle turning to AI tools for health questions and decision support.” She highlighted the risk of AI inheriting gender gaps in medical knowledge, and said she was surprised by the variation in performance across models.

Cara Tannenbaum from the University of Montreal explained that AI models are trained on historical data with built-in biases, and urged updating online health sources with explicit sex- and gender-related information. However, Jonathan H. Chen from Stanford University cautioned that the 60 percent figure might be misleading, as the sample was small and expert-designed rather than representative of typical user queries. He also described some grading criteria as conservative, such as expecting a model to immediately suspect pre-eclampsia when asked about postpartum headaches.

Gruber acknowledged these points, emphasizing that the benchmark sets a strict, clinically grounded standard: “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation.” An OpenAI spokesperson responded that ChatGPT is meant to support, not replace, medical care, and that the company's latest GPT-5.2 model better accounts for context such as gender. The other companies did not comment. The findings, published on arXiv (arXiv:2512.17028), underscore the need for cautious use of AI in healthcare.
