Assessing AI-Generated Alerts and Warnings: the start of the benchmarking process
- Jeannette Sutton

- May 2
AI-Generated Alerts and Warnings? Part 4
For the past few weeks, I’ve been writing about AI for Alert and Warning and the need for benchmarks that set standards that LLMs should adhere to. I’ve had a few research groups reach out to me about this and I’m happy to see that there is academic and practitioner interest in the project that I have begun with my colleagues. I promise, and EM1 has promised, that once the benchmarks are developed, they will be made open access. So, hold on … they’re coming!
Today I want to share a bit about how benchmarks are created by drawing from domain-specific examples found in academic research.
As discussed before, benchmarks are useful for establishing performance thresholds for systems and processes, and now for AI activity as well. For LLMs, benchmarks have been instrumental for developers as they select the models best suited to their applications. In fact, if you Google LLM evaluation and benchmarks, you might stumble onto articles like this one, which has compiled a database of more than 250 LLM benchmarks for testing general knowledge and other capabilities: https://www.evidentlyai.com/llm-guide/llm-benchmarks
In contrast with these generally available benchmarks, AI for alert and warning is domain specific. Domain-specific products include those created for law, finance, medicine, and emergency management. They require custom datasets and criteria tailored to the use case (you can read more about that here: https://www.evidentlyai.com/llm-guide/llm-evaluation), as well as tailored benchmarks to assess and evaluate their effectiveness. Here are a few examples of how those benchmarks have been created and applied.
Medical researchers have invested significant time assessing and verifying the use of LLMs for various tasks. For instance, one study evaluated and compared the performance of three LLMs (ChatGPT, Google Gemini, and DeepSeek, which is trained on biomedical research) in “responding to a set of frequently asked questions from parents of children with autism spectrum disorder” (Almulla and Khasawneh, 2025).
The authors identified twenty common questions about autism spectrum disorder (ASD), then worked with a panel of pediatric neurodevelopment specialists to create standardized benchmark answers. Once the benchmarks were created, two pediatric autism experts evaluated the AI-generated responses to those 20 questions on quality (accuracy, completeness, clarity, and educational value), usefulness (understandable and practically helpful), and reliability (factually correct and error-free). This was a labor-intensive effort requiring human coding of the LLM-generated responses; by contrast, many of the general benchmarks for LLMs rely on multiple-choice and true-false questions.
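To make that kind of rubric-based evaluation concrete, here is a minimal sketch in Python of how expert ratings might be recorded and averaged per criterion. The criterion names echo the quality, usefulness, and reliability dimensions described above, but the 1–5 scale, the expert labels, and the example numbers are my own assumptions for illustration, not the instrument used in the study.

```python
from statistics import mean

# Rubric criteria loosely modeled on the study's quality / usefulness /
# reliability dimensions; the 1-5 scale is an assumption for illustration.
CRITERIA = ["quality", "usefulness", "reliability"]

# Each expert rates each AI-generated answer on every criterion.
# The ratings below are invented purely as example data.
expert_ratings = {
    "question_01": {
        "expert_A": {"quality": 4, "usefulness": 5, "reliability": 4},
        "expert_B": {"quality": 3, "usefulness": 4, "reliability": 4},
    },
    "question_02": {
        "expert_A": {"quality": 5, "usefulness": 4, "reliability": 5},
        "expert_B": {"quality": 4, "usefulness": 4, "reliability": 5},
    },
}

def average_scores(ratings):
    """Average each criterion across all experts and all questions."""
    totals = {criterion: [] for criterion in CRITERIA}
    for question in ratings.values():
        for expert in question.values():
            for criterion, score in expert.items():
                totals[criterion].append(score)
    return {criterion: round(mean(scores), 2) for criterion, scores in totals.items()}

print(average_scores(expert_ratings))
# e.g. {'quality': 4.0, 'usefulness': 4.25, 'reliability': 4.5}
```

Even in this toy form, you can see why the effort is labor-intensive: the scores only exist because human experts read and judged every response first.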
Other examples from medical research comparing LLM performance include a cross-platform study of ChatGPT, Claude, Gemini, and Copilot in medical embryology (Bolgova et al. 2025); an evaluation of the reliability, usefulness, quality, and readability of ChatGPT’s responses on scoliosis (Ciracioglu and Dal Erdogan, 2025); an evaluation of the accuracy and reliability of chatbot responses to physician questions (Goodman et al. 2023); an evaluation of the accuracy and reproducibility of ChatGPT responses for breast cancer tumor board patients (Liao et al. 2025); and a benchmark evaluation of DeepSeek large language models in clinical decision-making (Sandmann et al. 2025). The goal of each of these studies is to systematically assess the performance of LLMs across a range of foci, including clinical utility. If you Google medical benchmark LLM leaderboard, you’ll see that evaluations are ongoing, and that there are even assessments of the benchmarks themselves.
How might this benchmarking approach work for AI for alerts and warnings? A similar process would begin with a set of outcomes (acceptability, accuracy, and actionability, for instance) linked to standards for what makes a good and effective warning. Those standards can be drawn from the scientific research that has consistently demonstrated what content an effective warning message should contain and how messages should be structured and styled. We have decades of research that laid the foundation for effective messaging; more recent research, including content analyses, experiments, and surveys, has clarified the importance of key content for additional alert and warning types such as missing-person alerts and post-alert messages.
Once those standards are determined, subject matter experts, such as practitioners, communication scholars, and professionals with expertise in specific hazards (e.g., natural and technological hazards, law enforcement, and missing persons), can assist with developing a repository of messages that correspond to a range of scenarios, from everyday hazards to complex edge cases that escalate over time. Those standardized messages can then be used as test cases to evaluate the effectiveness (accuracy and actionability) of LLMs in generating alerts and warnings. Essentially, each AI system will receive a performance score for each benchmark category, allowing users to assess which AI for Alert and Warning system is acceptable for their use.
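As a rough illustration of what that per-category scorecard could look like, here is a minimal Python sketch. The category names come from the outcomes mentioned above, but the 0–1 scores, the 0.80 pass threshold, the system names, and the pass/fail logic are all assumptions I’ve made for the sake of the example, not a finished benchmark.

```python
# Hypothetical benchmark categories drawn from the outcomes discussed above;
# the scores, threshold, and system names are invented for illustration.
CATEGORIES = ["acceptability", "accuracy", "actionability"]
PASS_THRESHOLD = 0.80  # assumed minimum acceptable score per category

# Imagine each AI system has already been scored against the repository of
# standardized scenario messages, yielding an average score per category.
system_scores = {
    "System_A": {"acceptability": 0.91, "accuracy": 0.84, "actionability": 0.72},
    "System_B": {"acceptability": 0.88, "accuracy": 0.90, "actionability": 0.86},
}

def report(scores, threshold=PASS_THRESHOLD):
    """Print each system's category scores and whether it meets the threshold."""
    for system, by_category in scores.items():
        passed = all(by_category[c] >= threshold for c in CATEGORIES)
        detail = ", ".join(f"{c}: {by_category[c]:.2f}" for c in CATEGORIES)
        verdict = "meets threshold" if passed else "below threshold"
        print(f"{system}: {detail} -> {verdict}")

report(system_scores)
```

A scorecard along these lines is what would let an alert originator compare candidate systems category by category, rather than relying on a vendor’s general claims about performance.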
Notably, benchmarking is not the same as training. Training a domain-specific LLM would require thousands of examples (or more). Unfortunately, for public alerts and warnings, most of the messages found in the wild do not conform to the evidence-based practices recommended by researchers. If a general LLM uses these in-the-wild messages as evidence for how to design effective alerts and warnings, it is going to fail.
I also haven’t suggested drawing standards from what the public says it wants. While there are ongoing conversations about using AI to generate personalized warnings, research has shown that what we want in a warning is not always what works best. For example, in one study I conducted with my colleague Michele M. Wood, participants were shown map-only, text-only, and map + text options for emergency alerts. They said they wanted map + text, but the text-only option resulted in better outcomes. Surely map + text is on the technological horizon, but we should be cautious about assuming that the approach we think we want is the one that best suits our needs.
What’s next? I’m happy to say that EM1 is working on this benchmarking problem right now. I’ll have more to report in the coming weeks. Stay tuned for next week’s article, where I’ll write about the standards I intend to recommend to help shape future LLM evaluations.


