Benchmarks: One key to assessing acceptability, accuracy, and actionability

  • Writer: Jeannette Sutton
  • May 2
  • 3 min read


AI-Generated Alerts and Warnings, Part 1



I've been thinking for a while about AI-generated alerts and warnings. As AI capabilities increase, it's a reasonable thing to consider. After all, I've seen students in my training classes load my published papers into ChatGPT and prompt it to write a message that follows the guidance I've written. And it's no secret that some AI companies have launched tools that borrow from the rest of my research. It's publicly available, after all. We paid to make the data available because we wanted to see it used beyond the Message Design Dashboard that we built for FEMA-IPAWS. So, it makes sense. But I still have questions.


Not everyone is on the AI Alert and Warning bandwagon. To be honest, I'm not there yet either. Part of it is because I haven't seen a strategy for assessing the acceptability, accuracy, and actionability of the warnings generated by the various chat engines and agents that are now available. But there is one industry approach that I think has legs, and it's worth discussing and promoting: benchmarks.


A benchmark is a type of standard that sets a threshold for evaluation. It's a point of reference used to compare, measure, and evaluate the quality, performance, or value of processes, products, or services. Synonyms for "benchmark" include measure, reference point, gauge, yardstick, criterion, model, and touchstone, as well as my favorite: STANDARDS.


I've seen a lot of published research on benchmarks in the field of medicine, where assessing the accuracy of responses generated by chatbots is more than a matter of trust; it can be a matter of life and death.


For emergency management technology, benchmarks are created by federal agencies, international organizations, research laboratories, and industry partnerships. The key entities include the usual suspects: DHS S&T, CISA, NIST, and national laboratories like ANL and PNNL. The benchmarks they develop are then used by local agencies, which apply them to their own technology, training, and other services as a strategy for evaluating how they stack up against recommended practices and technology adoption.


But here's the catch. For alerts and warnings, there are no federal standards. There is plenty of funded research resulting in recommended practices, but there are no standards. In my work with FEMA IPAWS, it became evident that there are no expectations beyond the independent study courses required to access IPAWS (IS-247.C or IS-251.A) and the regular proficiency demonstrations showing that an agency can competently access the IPAWS system to issue a message.


In fact, it seems that standards are more likely to come from organizations like the National Fire Protection Association (NFPA), which has already invested significant time and effort to integrate research from the fire protection engineering division of NIST into NFPA 72 (the National Fire Alarm and Signaling Code) and NFPA 1660 (Standard for Emergency, Continuity, and Crisis Management: Preparedness, Response, and Recovery). In January, the NFPA 1660 committee launched a working group to address standards around emergency alerts, suggesting that there are some good things coming (in about a year or so).


In the meantime, the creation of an AI Alert and Warning benchmark is likely to remain in the hands of the technology developers who recognize the value of establishing standards, not just for themselves, but so that others can also demonstrate that the messages they generate are trustworthy.


Here's what I think those benchmarks need to represent: the state of the art in warning research and the very best evidence available. Contributors should include researchers, practitioners, and technology designers. And the benchmarks should help establish clear evaluative criteria for acceptability, accuracy, and actionability.
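
To make that concrete, here is a minimal sketch of what a benchmark harness for generated warning messages could look like. Everything in it is an assumption on my part: the criteria fields, the substring checks, and the 360-character budget (borrowed from the longer WEA message length) are placeholders for the expert-built rubrics a real benchmark would need.

```python
# Hypothetical sketch only: the criteria names and checks below are
# illustrative assumptions, not an established standard.
from dataclasses import dataclass

@dataclass
class WarningScore:
    acceptability: float   # 0-1: style, length, credible sourcing
    accuracy: float        # 0-1: hazard, location, timing match ground truth
    actionability: float   # 0-1: clear protective-action guidance present

def evaluate_message(message: str, reference: dict) -> WarningScore:
    """Score one generated warning against a reference scenario.

    `reference` holds assumed ground-truth fields: hazard, location,
    time, and guidance. Substring checks stand in for the expert-built
    rubrics a real benchmark would require.
    """
    text = message.lower()
    accuracy = sum(
        str(reference[key]).lower() in text
        for key in ("hazard", "location", "time")
    ) / 3
    actionability = 1.0 if str(reference["guidance"]).lower() in text else 0.0
    # Acceptability is the hardest to automate; here it is reduced to a
    # length budget (360 characters, the longer WEA limit) as a placeholder
    # for human rating or a validated rubric.
    acceptability = 1.0 if len(message) <= 360 else 0.5
    return WarningScore(acceptability, accuracy, actionability)

# One scenario from a hypothetical benchmark suite
scenario = {
    "hazard": "flash flood",
    "location": "downtown Albany",
    "time": "until 9 PM",
    "guidance": "move to higher ground",
}
msg = ("National Weather Service: Flash flood in downtown Albany "
       "until 9 PM. Move to higher ground now.")
print(evaluate_message(msg, scenario))
```

A real benchmark would swap in validated rubrics and human ratings, but the structure is the part I think matters: a suite of scenarios with ground truth, with every generated message scored along all three dimensions.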


I'll be writing more about this in the coming weeks. So if you're interested in thinking about this along with me, come back next Saturday and be sure to contribute to the conversation in the comments. I'd love to hear what YOU think should be part of an AI Alert and Warning benchmark.


For more about effective alerts and warnings, visit thewarnroom.com
