
Why Emergency Managers need benchmarks for technology adoption.

  • Writer: Jeannette Sutton
  • May 2


AI-Generated Alerts and Warnings? Part 2



Why are benchmarks valuable to Emergency Managers and other public safety organizations? Well, without them, you don't have a standardized way of determining if something is “good enough.”


In medicine, researchers conduct clinical trials to determine the effectiveness of a new drug. In public health, researchers conduct experiments to determine the effectiveness of an intervention in promoting new behaviors. In communication, scholars investigate how adapting a message can affect a person’s beliefs and attitudes, leading to behavior change.


In each of these examples, the tests show whether the intervention (the drug, the campaign, the message) works as intended and produces processes and results that are not harmful or detrimental.


This is the same goal of a benchmark for AI-generated Alerts and Warnings. By setting thresholds for accuracy, actionability, and other core variables, we can assess how well an AI system performs when responding to simple, complex, and compound scenarios that require a public emergency alert.


Here’s a thought experiment about AI-generated Alerts and Warnings: Would it be appropriate for an intervention (a message) to be 90% accurate (maybe it tells people to shelter in place, sealing doors and windows, for a blizzard)? To be 80% complete (that is, containing only 4 of the 5 recommended contents)? To include jargon 20% of the time (2 messages in 10 use operational terminology rather than plain language)? To be written in ALL CAPS 10% of the time (1 out of 10 YELLS AT YOU AND MAKES IT DIFFICULT TO READ)? This is the measurement that a benchmark can provide: by evaluating your technology against the standards set by experts, you can determine whether it is “good enough” and acceptable to use.
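Checks like these can be made concrete as automated scoring over a set of messages. Here is a minimal sketch in Python; the five “contents,” the jargon list, and the check logic are illustrative placeholders I chose for the example, not an established standard or an actual benchmark specification.

```python
# Illustrative only: these contents and jargon terms are placeholders,
# not an official list from any alerting standard.
RECOMMENDED_CONTENTS = {"source", "hazard", "location", "time", "guidance"}
JARGON_TERMS = {"LEO", "IC", "EOC"}  # hypothetical operational shorthand


def score_message(text: str, contents_present: set) -> dict:
    """Run simple per-message checks against the benchmark criteria."""
    words = [w.strip(".,!?") for w in text.split()]
    return {
        "complete": RECOMMENDED_CONTENTS <= contents_present,  # all 5 present?
        "has_jargon": any(w in JARGON_TERMS for w in words),
        "all_caps": text.isupper(),
    }


def benchmark(messages: list) -> dict:
    """Aggregate per-message checks into rates across a message set."""
    scores = [score_message(text, contents) for text, contents in messages]
    n = len(scores)
    return {
        "completeness_rate": sum(s["complete"] for s in scores) / n,
        "jargon_rate": sum(s["has_jargon"] for s in scores) / n,
        "all_caps_rate": sum(s["all_caps"] for s in scores) / n,
    }
```

A real benchmark would replace these string checks with expert-validated rubrics (and human or model-assisted coding of message contents), but the shape is the same: per-message checks rolled up into rates that can be compared against agreed thresholds.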


Let’s think a bit about what the average alert looks like today. My research team at the University at Albany conducted a review of all WEAs issued from 2012 to 2022, the first full decade in which they were issued. We found that fewer than 9% of messages contained all 5 contents that researchers have recommended since 1990 (see Mileti and Sorensen, 1990). That is 26 years of evidence that has not been integrated into our most powerful alert and warning tool available.


So, messages in the wild aren’t great; if an AI system produces messages similar in quality, it is no improvement over what is currently done. Furthermore, it may reduce trust in the sending organization (researchers are quickly learning that the public wants authenticity from the officials they rely on in crises).


We can address these quality issues by creating benchmarks that assess and score the output of agentic AI systems such as ChatGPT, Claude, Gemini, and other context-specific technologies before a practitioner chooses which technology to adopt and use. This will ensure that when (if?) a practitioner chooses an AI Alert and Warning system, they know what they are getting and that it meets a threshold of effectiveness.


How many times have you, as an emergency manager, had to ask a peer for advice on which systems to select as you build out your A&W capabilities? You spend time and resources comparing technologies, getting bids, and listening to pitches; then you make your best assessment and dive in. But if a benchmark were in place, you could assess a system before you “buy” a solution.


Another benefit of benchmarking is this: when benchmarks exist, developers may be inclined to improve their own AI Alert and Warning systems. This suggests that a benchmark for AI Alert and Warning can actually help increase system effectiveness across an entire domain. Once benchmarks exist, even ENS companies may find themselves rising to meet the standards.


In sum, there are a few overarching reasons to develop benchmarks for AI Alerts and Warnings:


  • To develop trust in systems and in the messages generated.

  • To establish a safeguard for messages by setting minimum standards that align with research and subject matter expertise.

  • To promote consistent standards across AI Alert and Warning technologies.

Interested in learning more? Stay tuned! Next week we’ll return to the ideas of acceptability, accuracy, and actionability, and why these are the first constructs I have identified as the framework for an AI A&W benchmark.
