Researchers warn that major AI models could encourage hazardous science experiments leading to fires, explosions, or poisoning. A new test on 19 advanced models revealed none could reliably identify all safety issues. While improvements are underway, experts stress the need for human oversight in laboratories.
The integration of artificial intelligence into scientific research promises efficiency, but it also introduces significant safety risks, according to a study published in Nature Machine Intelligence. The team, led by Xiangliang Zhang at the University of Notre Dame in Indiana, developed LabSafety Bench, a benchmark of 765 multiple-choice questions and 404 pictorial scenarios designed to evaluate AI's ability to detect laboratory hazards.
Testing 19 large language models and vision language models, the team found that none exceeded 70 percent accuracy across the full benchmark. Performance varied sharply by task: on the multiple-choice questions, Vicuna performed little better than random guessing, while GPT-4o reached 86.55 percent and DeepSeek-R1 84.49 percent; on the image-based tests, models such as InstructBLIP-7B scored below 30 percent.
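To put the "random guessing" comparison in context, here is a minimal sketch of how multiple-choice accuracy is typically scored against a chance baseline (roughly 25 percent for four-option questions). The question schema, field names, and figures in the example are illustrative assumptions, not the published LabSafety Bench format or data.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQuestion:
    prompt: str
    options: list[str]  # answer choices, e.g. four options A-D (assumed format)
    answer: int         # index of the correct option

def accuracy(questions: list[MCQuestion], predictions: list[int]) -> float:
    """Fraction of questions where the predicted option index matches the key."""
    correct = sum(pred == q.answer for q, pred in zip(questions, predictions))
    return correct / len(questions)

def random_baseline(questions: list[MCQuestion]) -> float:
    """Expected accuracy of uniform random guessing, averaged over the set."""
    return sum(1 / len(q.options) for q in questions) / len(questions)

if __name__ == "__main__":
    # 100 hypothetical four-option questions; the real benchmark has 765.
    qs = [
        MCQuestion(f"Hazard question {i}", ["A", "B", "C", "D"], answer=i % 4)
        for i in range(100)
    ]
    guesses = [random.randrange(len(q.options)) for q in qs]  # simulated guessing
    print(f"guessing accuracy: {accuracy(qs, guesses):.2%}")  # ~25% on average
    print(f"chance baseline:   {random_baseline(qs):.2%}")    # exactly 25.00%
```

A model scoring near that baseline, as Vicuna reportedly did, is effectively failing to distinguish safe from hazardous choices.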
These shortcomings are particularly alarming given past lab accidents, such as the 1997 death of chemist Karen Wetterhahn from dimethylmercury exposure, a 2016 explosion that cost a researcher her arm, and a 2014 incident causing partial blindness.
Zhang remains cautious about deploying AI in self-driving labs. "Now? In a lab? I don’t think so," she said. "They were very often trained for general-purpose tasks... They don’t have the domain knowledge about these [laboratory] hazards."
An OpenAI spokesperson acknowledged the study's value but noted it did not include their latest model. "GPT-5.2 is our most capable science model to date, with significantly stronger reasoning, planning, and error-detection," they stated, emphasizing human responsibility for safety.
Experts such as Allan Tucker of Brunel University London advocate using AI as an assistant to human researchers in experiment design, warning against over-reliance. "There is already evidence that humans start to sit back and switch off, letting AI do the hard work but without proper scrutiny," he said.
Craig Merlic from the University of California, Los Angeles, recalled that early AI models gave poor advice on handling acid spills, though they have since improved. He questions direct comparisons to humans, noting AI's rapid evolution: "The numbers within this paper are probably going to be completely invalid in another six months."
The study underscores the urgency of enhancing AI safety protocols before widespread lab adoption.