What established measures should we include in an AI-generated Alert and Warning benchmark?
- Jeannette Sutton

- May 2
- 3 min read
AI-generated Alerts and Warnings? Part 5
Originally published on LinkedIn on April 18 as part of a series on AI-generated Alerts and Warnings
“If we can’t measure it, we can’t improve.” That’s what Tyler Felous, CEO of EM1, said about benchmarks. And I agree with this 100%. Using a data-driven approach means measuring success and failure in order to identify ways to improve. But what exactly would a benchmark measure?
Last week, I described the ways that medical researchers have been assessing success. One group aligned “success” with the following criteria:
QUALITY – including measures of accuracy, completeness, and clarity
USEFULNESS – including measures of understandability and practical helpfulness
RELIABILITY – being factually correct and error-free.
Each of these criteria can be used for AI-generated Alerts and Warnings. And that’s a really good thing, because it means this benchmark can build upon the work that others have already done. How each is measured will differ, though, and I’d like to talk about a few of those differences here.
Let’s look first at the idea of accuracy. If we apply the idea of accuracy to a warning message, at least two things come to mind. First is accuracy in terms of how well the message conforms to what the evidence says about how a message should be written, including the style of the text and the format of the content. We have research about each of these features, and the evidence clearly shows what a warning should contain and how it should be written. In fact, this is what I write about almost every week – an effective warning should include the source, hazard, impact, location, guidance, and time, plus a link to additional information. But research has also shown that these contents differ based upon the type of hazard. Nearly every hazard in the Warning Lexicon follows the same pattern, but the contents and style for a missing person message will differ. The same holds true for a post-alert message. So, this sort of “internal” accuracy (how the message is communicated) is clearly distinct from a second type of accuracy, which is more closely aligned with the term “reliability” above.
The second type of accuracy is “being factually correct” and “error-free.” These qualities of a message mean the AI-generated warning correctly assesses the facts from the scenario about the threat that is evolving. For instance, it must correctly identify the location(s) directly affected by the threat, the populations that need to take action, and the areas that serve as evacuation gathering points. It must correctly identify the roads to avoid and the direction of travel for evacuees. The AI-generated warning will also need to provide the correct protective actions associated with the threat and the populations at risk. It will need to infer how the hazard can affect people and locations and determine the correct actions to recommend.
I think that the first type of accuracy, which falls under quality of communication, is probably easier to assess and assign a value to. The second type of accuracy, which is demonstrating that the information is reliable, will be more difficult in a real-time situation and will require human intervention to ensure it is correct and error free.
Let’s consider also the idea of completeness. Complete messages will include all of the contents recommended by the Warning Response Model (source, hazard/hazard impact, location, time, and guidance). While this benchmark is simple, researchers have found that fewer than 9% of alerts issued in the first decade of WEA (2012-2022) included all five contents. Hitting this mark alone would be a significant improvement for all messages.
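To make this concrete, here is a minimal sketch of what an automated completeness check might look like. Everything in it is hypothetical: the keyword lists are illustrative stand-ins I invented, not the actual coding scheme used in warning research, and a real benchmark would need human coders or a trained classifier rather than simple string matching.

```python
# Hypothetical sketch: score a warning message for the five recommended
# contents (source, hazard/impact, location, time, guidance).
# Keyword lists are illustrative placeholders only.

CONTENT_KEYWORDS = {
    "source": ["national weather service", "sheriff", "emergency management"],
    "hazard": ["flash flood", "wildfire", "tornado"],
    "location": ["county", "city of", "north of"],
    "time": ["until", "p.m.", "a.m.", "immediately"],
    "guidance": ["evacuate", "shelter", "avoid", "move to higher ground"],
}

def completeness_score(message: str) -> tuple[int, list[str]]:
    """Return (number of contents detected, list of missing contents)."""
    text = message.lower()
    missing = [
        content for content, keywords in CONTENT_KEYWORDS.items()
        if not any(kw in text for kw in keywords)
    ]
    return len(CONTENT_KEYWORDS) - len(missing), missing

msg = ("National Weather Service: Flash Flood Warning for Travis County "
       "until 9 p.m. Move to higher ground immediately.")
score, missing = completeness_score(msg)
print(score, missing)  # → 5 []
```

A checker like this could immediately report which of the five contents an AI-generated draft is missing, which is exactly the gap the sub-9% WEA finding points to.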
And finally, let’s think about how we might evaluate whether an AI-generated warning is understandable. In our experimental studies, understanding is measured by asking respondents if they understand who sent the message, what the threat is, where and when it is happening, and what they should do. But we also ask participants to define specific words that are commonly used in a WEA, such as “evacuation warning” or “evacuation order.” Both represent a type of jargon, or technical language, that requires public education to help people know what each term means, what to do, and how the terms differ from each other. Our research shows that these two phrases aren’t well understood by the public. Another set of jargon terms includes words like watch, warning, emergency, and advisory; each corresponds to a different level of severity and necessary action.
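A first pass at flagging jargon could also be sketched in code. The term list below is drawn from the terms mentioned above, but the function itself is a hypothetical illustration, and the gloss for each term is my own shorthand; measuring actual public understanding requires survey research, not string matching.

```python
# Hypothetical sketch: flag jargon terms in a draft warning so a human
# reviewer knows which phrases may need plain-language substitutes.
# Term list follows the discussion above; glosses are informal shorthand.

JARGON_TERMS = {
    "evacuation order": "leave now",
    "evacuation warning": "prepare to leave",
    "warning": "hazard occurring or imminent",
    "watch": "hazard possible",
    "advisory": "lesser hazard expected",
}

def flag_jargon(message: str) -> list[str]:
    """Return the jargon terms that appear in the message."""
    text = message.lower()
    return [term for term in JARGON_TERMS if term in text]

print(flag_jargon("Evacuation order issued; a flood watch remains in effect."))
# → ['evacuation order', 'watch']
```

A flagged draft wouldn’t be rejected outright; the flags would simply tell the reviewer which terms the audience may not understand without prior public education.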
Once we assign measures to each category, we can then create scores to show how closely each AI-generated Warning message conforms to the benchmarks.
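As a sketch of that final step, here is one hypothetical way to roll category scores into a single benchmark score. The category names follow the criteria above, but the weights are placeholders I invented for illustration, not empirically derived values.

```python
# Hypothetical sketch: combine per-category scores (each normalized to
# the range 0-1) into one weighted benchmark score.
# The weights are illustrative placeholders, not research-derived values.

WEIGHTS = {"quality": 0.4, "usefulness": 0.3, "reliability": 0.3}

def benchmark_score(scores: dict[str, float]) -> float:
    """Weighted average of category scores, each expected in [0, 1]."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

example = {"quality": 0.9, "usefulness": 0.8, "reliability": 1.0}
print(round(benchmark_score(example), 3))  # → 0.9
```

Whether the categories should be weighted equally, or reliability weighted most heavily, is exactly the kind of question the benchmark-development work would need to settle.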
I hope you’re finding this thought experiment as much fun as I am. We’re on our way to developing these benchmarks right now, meaning scores will be coming soon. Stay tuned!


