UNDER CONSTRUCTION: Benchmarks!
- Jeannette Sutton

- May 2

AI-generated Alerts and Warnings? Part 7
Test bank – answer key – grading rubric. Now under construction.
To be honest, it took me quite a while to understand what my EM1 colleagues were interested in doing with an AI-generated Alert and Warning benchmark. It took months of back and forth before I was confident that I understood what they needed to run tests: the test questions, the test answers, and a scoring rubric to assess how close an AI-generated Alert and Warning comes to the test answer.
It was confusing. It was, at times, frustrating (I felt like we didn’t speak the same language; after all, I’m a social scientist with a preference for qualitative research, and they’re computer nerds who, I assume, love to code). It was, at many times, daunting (I incorrectly wondered, how can I possibly generate all the data that is needed to train an LLM? I learned later that a benchmark is not for training; it’s for evaluating). But it was also a new challenge that I was excited to learn about. So, I dug into the peer-reviewed literature within information science and other academic journals and started reading. As a scientist, I have the luxury of doing this kind of exploration and deep dive; it’s also one of the reasons I continue to write these Saturday Morning Posts. You, as future users of these AI tools, deserve to know and understand how they work and how they are being evaluated.
From all of my reading and exchange of ideas with my colleagues, we finally came to a mutual understanding. I learned exactly what is needed to create an AI-generated Alert and Warning benchmark, and I can now report that it is officially UNDER CONSTRUCTION.
What does it include?
First, we selected a set of test cases. In this case, we are building scenarios that capture characteristics of real-world incidents. These scenarios were selected by looking across the Warning Lexicon, reviewing the types of alerts issued most frequently in the first decade of WEA (2012-2022), and creating a matrix that shows how individual hazards can lead to complex hazard impacts. For instance, heavy rain in an area that previously experienced a wildfire can lead to flooding, debris flows, and road damage. This means many different WEA messages with different actions for a single scenario, such as evacuation, shelter in place, and avoiding the area.
These scenarios will become the test bank given to an LLM.
Next come the answers. Every scenario has multiple answers, each of which can be prompted using simple requests such as: write a ‘take action’ message for heavy rain that is no longer than 360 characters.** Answers will be created by my team of researchers and subject matter experts within the EM and law enforcement community. This ensures that the answers conform to the empirical evidence showing how a message should be structured and formatted while also emphasizing the key elements necessary for motivating protective actions. Accuracy represents both the quality of the message and its reliability, both of which are very important for message senders and receivers.
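To make the scenario–prompt–answer structure concrete, here is a rough sketch of what one test-bank entry might look like. The field names, the example scenario, and the message wording are all my own illustration, not the project’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Hypothetical schema for one test-bank entry; the real
    # benchmark's fields and wording may differ.
    scenario: str          # real-world-style incident description
    prompt: str            # the simple request given to the LLM
    reference_answer: str  # expert-written answer message
    max_chars: int = 360   # WEA character limit assumed in the prompt

item = BenchmarkItem(
    scenario="Heavy rain over a recent wildfire burn scar",
    prompt=("Write a 'take action' message for heavy rain "
            "that is no longer than 360 characters."),
    reference_answer=(
        "National Weather Service: Flash flooding and debris flows "
        "expected near the burn scar. Evacuate Zone A now via Route 9. "
        "Do not drive through flooded roads."
    ),
)

# The prompt's length constraint applies to the reference answer, too.
assert len(item.reference_answer) <= item.max_chars
```

The point of pairing each scenario with expert-written answers is that the benchmark then has a ground truth to compare the LLM’s output against.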
The answer messages are one way to assess the accuracy of the AI-generated alert and warning. We know, however, that it is highly unlikely that the messages generated by AI will be a perfect match with those created by the benchmark team. So, we will also be developing a set of requirements that will serve as a scoring rubric. These include constructs that I’ve written about previously: quality, usefulness, and reliability. Importantly, the requirements will differ by message type (natural-technological-public safety; missing persons; and post-alert messages).
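Because an exact string match with a reference answer is unlikely, scoring leans on the rubric. A toy version of such a check might look like the sketch below. The element names loosely mirror key warning elements from the research (source, hazard, location, guidance); the function, its equal weighting, and the keyword lists are purely my illustration, not the actual rubric:

```python
def score_message(message: str, max_chars: int = 360) -> float:
    """Toy rubric: fraction of checks a candidate message passes.

    The real rubric will be far richer, will weigh quality,
    usefulness, and reliability, and will differ by message type.
    """
    text = message.lower()
    checks = [
        len(message) <= max_chars,                                 # fits WEA limit
        any(w in text for w in ("service", "office", "police")),   # names a source
        any(w in text for w in ("flood", "fire", "storm", "debris")),  # names the hazard
        any(w in text for w in ("zone", "area", "near", "county")),    # gives a location
        any(w in text for w in ("evacuate", "shelter", "avoid")),      # protective action
    ]
    return sum(checks) / len(checks)

msg = ("National Weather Service: Flash flooding expected near the "
       "burn scar. Evacuate Zone A now. Avoid flooded roads.")
print(score_message(msg))  # → 1.0 on this toy rubric
```

A rubric like this rewards a message for containing the right elements rather than for matching the reference word-for-word, which is exactly why it is needed alongside the answer key.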
All of this is UNDER CONSTRUCTION now. We’ll keep you posted as things progress. After all, transparency builds trust, and you should be informed about how these benchmarks are being created if you’re going to put your own trust in the output that is generated.
** It’s important that these very simple prompts work to deliver good messages. Sure, you could spend your time writing good prompts, but that means instructing the AI to generate a message that conforms to the evidence. Wouldn’t it be better if it did so without that additional intervention?


