
Large Language Models Show Promise in Detecting Drug Safety Signals from Clinical Notes
Key Takeaways:
- Large language models can identify immune-related adverse events in clinical notes without task-specific training, offering a potential alternative to labour-intensive manual review
- Performance remains below the threshold required for clinical decision support, with models tending to overpredict adverse events
- Despite limitations, this approach may support large-scale safety monitoring and accelerate research into cancer immunotherapies
The challenge of detecting drug safety signals
Drug safety signals are often embedded within unstructured clinical text, particularly in electronic health records. Identifying these signals has traditionally required either manual chart abstraction, which is resource-intensive, or natural language processing systems tailored to specific drugs and healthcare settings.
This challenge is particularly evident in the case of immune checkpoint inhibitors. These cancer therapies, first introduced in 2011, are associated with a broad range of immune-related adverse events. These events can affect multiple organ systems, including the colon, liver, lungs, heart, nervous system, skin, and endocrine system, making systematic detection complex and time-consuming.
Exploring large language models as a solution
Large language models are increasingly being explored as a way to streamline the identification of drug safety signals within clinical text. A multicentre study, published in eBioMedicine, evaluated whether these models could detect immune-related adverse events associated with immune checkpoint inhibitors.
The study focused on a zero-shot learning approach. In this setting, the model receives a single, detailed prompt without prior examples. The prompt used by the researchers began: “You are a clinical expert in identifying immune-related adverse events caused by immune checkpoint inhibitors …” and included a list of six immune checkpoint inhibitors alongside numerous associated adverse events.
This prompt was applied to clinical notes from multiple sources. These included records from 100 people treated at Vanderbilt Health, 70 people from the University of California, San Francisco, and 272 people enrolled in seven Roche-sponsored clinical trials.
Study design and model performance
The research team evaluated three models: GPT-3.5, GPT-4, and GPT-4o, with GPT-4o demonstrating the strongest overall performance.
To assess accuracy, the investigators used the F1 score, the harmonic mean of precision and recall, which penalises both false positives and false negatives. Scores range from 0 to 100 percent, with values above 90 percent considered excellent. A score of 80 percent or higher may be sufficient for use in automated clinical decision support systems.
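For readers unfamiliar with the metric, the short Python sketch below shows how an F1 score is derived from counts of true positives, false positives, and false negatives. The function name and the counts are hypothetical, chosen only for illustration. They are not taken from the study.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from raw prediction counts.

    tp: adverse events the model flagged that reviewers confirmed
    fp: adverse events the model flagged that reviewers did not confirm
    fn: confirmed adverse events the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (assumed, not study data).
# A model that errs equally in both directions:
balanced = f1_from_counts(tp=40, fp=10, fn=10)

# A model that overpredicts: it misses few events but raises many false alarms.
overpredicting = f1_from_counts(tp=45, fp=50, fn=5)

for name, scores in [("balanced", balanced), ("overpredicting", overpredicting)]:
    print(name, {key: round(value, 2) for key, value in scores.items()})
```

In the second case recall is high but precision is low, so F1 falls well below the 80 percent mark even though the model misses few true events. This is the general pattern to expect from a model that overpredicts.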
At the patient level, GPT-4o achieved average F1 scores of 56 percent for Vanderbilt Health data, 66 percent for University of California, San Francisco data, and 62 percent for Roche clinical trial data. The models showed a consistent tendency to overpredict the presence of immune-related adverse events.
At the level of individual clinical notes, the model achieved an average F1 score of 57 percent across 667 notes from Vanderbilt Health, which were assessed for 17 different adverse events.
Implications for clinical practice and research
The findings suggest that large language models can play a role in identifying drug safety signals, even without task-specific training data.
“Manual patient chart abstraction for monitoring the safety and efficacy of drugs already at market requires tremendous resources and puts a drag on the pace of discovery in precision medicine. And that’s especially true with immune checkpoint inhibitors, where the adverse events are so varied. If zero-shot learning with LLMs could help with these notes, it could significantly reduce time and costs for all concerned,” said the report’s corresponding author, Cosmin Bejan, PhD, assistant professor of Biomedical Informatics at Vanderbilt Health.
However, the current level of performance falls short of what would be required for clinical decision support.
“These results show that zero-shot learning with a powerful LLM is useful for detecting these adverse events,” Bejan said. “This performance does not rise to the level required for clinical decision support, but the method could be valuable for automated irAE extraction across multiple sites, potentially speeding discovery and enhancing the safety and effectiveness of cancer immunotherapies.”
Wider research context
The study involved collaboration among multiple researchers at Vanderbilt Health, including Yaomin Xu, PhD, Eric Mukherjee, MD, PhD, Matthew Krantz, MD, Douglas Johnson, MD, MSCI, Elizabeth Phillips, MD, and Justin Balko, PhD. Funding support was provided in part by the National Institutes of Health.
Related research further highlights safety concerns associated with immune checkpoint inhibitors. In a research letter published in JAMA Oncology, Mukherjee, Phillips, and colleagues used logistic regression analysis of adverse event reports from the Food and Drug Administration. They confirmed that these therapies are independently associated with an increased risk of Stevens-Johnson syndrome and toxic epidermal necrolysis, which are severe and potentially life-threatening skin reactions. The study also found that this risk may be linked to exposure to human leukocyte antigen–restricted drugs.
Conclusion
Large language models represent a promising tool for extracting clinically meaningful insights from unstructured health data. While their current performance limits direct clinical application, their ability to operate across multiple datasets without task-specific training suggests potential for supporting large-scale pharmacovigilance efforts. As these models continue to improve, they may contribute to more efficient and comprehensive monitoring of drug safety in clinical practice.




