Planning for Natural Language Failures with the AI Playbook


We all know that AI can dramatically improve the user experience of software products. But AI comes with growing pains, and the current process for prototyping AI experiences is not as efficient as we would like. Product teams working on natural language AI products, such as chatbots or writing assistants, face major challenges because they discover unexpected errors only after deploying the product to end users. Based on a systematic characterization of common failure scenarios in natural language-based systems, we developed the AI Playbook to help product teams proactively address AI failures in the early stages of prototyping.

  • Project Date: Apr - Jul 2020
  • Affiliations:  Microsoft Research AI
  • Funding: Microsoft
  • My Role:  Study design, study protocol development, data collection, data analysis, tool development, and paper writeup
  • Collaborators:  Adam Fourney (Principal Researcher), Derek DeBellis (UX Researcher), and Saleema Amershi (Senior Principal Research Manager)


Several factors make it difficult to prototype AI experiences:

  • Many AI solutions are built on probabilistic machine learning models.
  • The stochastic nature of these models makes it difficult to predict a system's failure behaviors.
  • Design practitioners have varying levels of AI expertise, and not all can anticipate and account for every type of AI failure.
  • Most importantly, it is difficult to know how users will react to failures in realistic scenarios.

This leaves design practitioners with no choice but to take a reactive approach to addressing errors during prototyping. But what kinds of failure behaviors are we dealing with? Characterizing a failure behavior is difficult because its meaning can change depending on whether it is interpreted by the end user or by the system developer.

Suppose we are planning a trip to Ontario, Canada, and ask an AI assistant about its weather. Our assistant, however, might think we mean Ontario, California, a city that shares the same spelling and pronunciation but happens to be closer to our current location. In this case, the assistant's response is a failure from the user's perspective, because we intended to ask about the weather in a different location. The developer, on the other hand, might have deliberately written the algorithm to assume that the user is asking about the location closest to them. So how can we help product teams anticipate failures of this nature?


The goal of this research is to understand how we can help product teams working on natural language products address AI failures early enough in the prototyping process.

  • Semi-structured interviews with 12 practitioners (user researchers, designers, and product managers)
  • Development of taxonomy of common AI failures in natural language products
  • User study with 9 practitioners (prior participants)

Challenges Prototyping AI Experiences

Our formative interviews with 12 design and AI practitioners from a large technology company helped us identify key challenges they faced in prototyping natural language AI experiences. Product teams...

  • rarely engaged in any prototyping for new AI experiences because it slows down their pace of development.
  • were focused on the “hero scenarios” that would only happen in ideal circumstances.
  • experienced breakdowns in interdisciplinary communication due to having varying levels of AI expertise.
  • had difficulty anticipating the types of errors that can occur from the end-user's perspective.

Based on consideration of these challenges, we narrowed our project scope to focus on errors that:

  • are either very common, or rare but very costly
  • directly affect the end-user of the system
  • occur within a single session

We expected this would have the most impact in early prototyping because realistically simulating long-term failures is particularly difficult without high-fidelity prototypes.

Our realization that many practitioners were not aware of the range of errors possible in their AI application contexts prompted us to develop a taxonomy of AI failures.

Taxonomy of AI Failure Types and Scenarios in NL Products

Working with 8 experts in NLP research, we developed our taxonomy by:

  1. Capturing a wide range of natural language scenarios from previous work on information retrieval, dialogue management, speech recognition, slot filling, and related areas.
  2. Organizing the scenarios by whether the failure falls under attention, input perception, understanding, or response generation.
  3. Walking through concrete real-world examples of NL tasks for each scenario (e.g., extracting a meeting request from an email, or booking a flight with a voice assistant).

After multiple rounds of iteration, we arrived at 16 different failure scenarios that may occur in the context of natural language (NL) based products.

| Failure Type | Failure Source | Failure Scenario | Example Failure Scenario with NL Products |
| --- | --- | --- | --- |
| Attention | Missed trigger | System fails to detect a valid triggering event. | [Scheduling assistant] fails to detect a meeting request in an email. [Voice assistant] fails to detect a wake word. |
| Attention | Spurious trigger | System triggers in the absence of a valid triggering event (it triggers when not intended). | [Scheduling assistant] mistakes meeting minutes as a new meeting request. [Voice assistant] mistakes background speech as a wake word. |
| Attention | Delayed trigger | System detects a valid triggering event, but responds too late to be useful. | [Scheduling assistant] detects a meeting request, but offers support only after the user has already manually scheduled the meeting. |
| Perception | Truncation | System begins capturing input too late, or stops capturing input too early, and thus acts only on partial input. | [Voice assistant] interprets a pause as the end of an utterance, and answers before the user has finished speaking. |
| Perception | Overcapture | System begins capturing input too early, or stops capturing input too late, and thus acts on spurious data. | [Voice assistant] captures the user's intended query along with a few more words spoken after the query. |
| Perception | Noisy channel | User input is corrupted by spelling errors in written text, or background noise in audio input. | [Search engine] user misspells their search query. [Voice assistant] background noise, such as music, makes speech less discernible. |
| Perception | Transcription | System generates common transcription errors such as homonyms, homophones, plural word forms, etc. | [Automatic captioning service] fails to transcribe certain speech utterances by adding or removing inflected endings (e.g., -ing, -s, -ed). |
| Understanding | No understanding | System fails to map the user's input to any known action or response category. | [Virtual assistant] responds "I'm sorry, I don't know how to answer that question." |
| Understanding | Misunderstanding | System maps the user's input to the wrong category of action or response. | [Shopping assistant] mistakes an item refund request for an item exchange request. |
| Understanding | Partial understanding | System correctly infers some aspects of the user's intent (e.g., the correct action or response category), but gets some details wrong. | [Travel assistant] mistakes the origin for the destination. [Scheduling assistant] fails to consider a location's timezone in a meeting request. |
| Understanding | Ambiguity | There may be several reasonable interpretations of the user's intent; the system fails to correctly resolve the ambiguity. | [Search engine] responds to the query "US Open" by returning results for tennis, when the user intended golf. [Scheduling assistant] sends a meeting request to the wrong "John". |
| Response generation | Action execution | System fails in executing the desired action. | [Travel assistant] attempts to re-book a trip by canceling and reissuing a ticket, but fails partway, leaving the reservation in an unknown state. |
| Response generation | Generative response | System generates an incoherent, inappropriate, factually incorrect, or partially correct response. | [Social chatbot] produces an offensive response to a user's input. |
| Response generation | Ranked list | System produces a ranked list with low precision, recall, or result diversity. | [Presentation design assistant] offers five recommended slide designs, but two designs are duplicates, or are otherwise indiscernible. |
| Response generation | Binary classification | System generates a false negative or false positive response. | [Spam filter] misclassifies an email as unsolicited spam. [Content moderation system] misclassifies a comment as obscene. |
| Response generation | Multi-class classification | System fails to produce correct classifications among close or distant categories. | [Email client] classifies a receipt as a promotional email. |
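A taxonomy like this becomes most useful when a tool can look up the scenarios relevant to a practitioner's design choices. As a minimal sketch (our own illustration, not the Playbook's actual implementation), the taxonomy could be encoded as plain data:

```python
# Illustrative sketch only: one way the failure taxonomy could be encoded as
# data so a tool can surface scenarios relevant to a design choice.
# This is not the Playbook's actual implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureScenario:
    failure_type: str  # "Attention", "Perception", "Understanding", or "Response generation"
    source: str        # e.g., "Missed trigger"
    description: str

TAXONOMY = [
    FailureScenario("Attention", "Missed trigger",
                    "System fails to detect a valid triggering event."),
    FailureScenario("Attention", "Spurious trigger",
                    "System triggers when not intended."),
    FailureScenario("Attention", "Delayed trigger",
                    "System responds too late to be useful."),
    FailureScenario("Perception", "Truncation",
                    "System acts only on partial input."),
    # ...the remaining scenarios follow the same pattern...
]

def scenarios_for(failure_type: str) -> list[FailureScenario]:
    """Look up all scenarios under a given failure type."""
    return [s for s in TAXONOMY if s.failure_type == failure_type]

print([s.source for s in scenarios_for("Attention")])
```

Storing the taxonomy as data rather than hard-coding it into the survey flow makes it straightforward to extend with new failure types later.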

AI Playbook Design

Using our taxonomy as a foundation, and based on the practitioner reported challenges in prototyping AI, we designed the AI Playbook to support the following goals:

  • Provide a means to discover non-ideal error conditions that fall outside the golden-path scenario.
  • Provide actionable and contextually relevant guidance on how to simulate these scenarios.
  • Provide a means to explore a range of options and the consequences of in-the-moment decisions.
  • Provide the above features as efficiently and at as low a cost as possible.

The AI Playbook is an interactive survey that walks practitioners through the product or feature they are designing, prompting them to specify its input modality, trigger, delimiter, and the form of the expected response.

The AI Playbook consists of four main features:

  • Interactive Survey (A) allows users to navigate between different question and answer pairs as they describe their envisioned AI scenario
  • Help Center (B) displays supportive information (e.g., description and examples) tailored to the user's selected answer choice
  • Scenario Builder (C) interactively builds up related scenarios to test, allowing users to efficiently explore the consequences of different design choices
  • Playbook Report (D) outputs a report of recommended test scenarios along with contextualized explanations and examples to further aid in developing and testing prototypes.

Imagine you are a designer working on a voice-based personal assistant. You would start by selecting Conversational AI, then choose speech as the primary input. Your system has a clear way of knowing when to trigger because it listens for a wake word (such as "Alexa" or "Hey Siri"). As you continue answering the questions, the Scenario Builder interactively builds up the related failure scenarios, letting you explore the consequences of different design choices. Finally, the Playbook outputs a full report of test scenarios and actionable guidance on how to simulate them. For example, you can simulate a list with reduced diversity by inserting a duplicate item into the list or adding an article to a list item.
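Guidance of this kind can be mimicked in a low-fidelity or Wizard-of-Oz prototype by deliberately corrupting otherwise-correct output. A minimal sketch of the duplicate-insertion idea (the function name and structure are our own illustration, not part of the Playbook tooling):

```python
import random

def simulate_reduced_diversity(results, seed=0):
    """Simulate a 'ranked list' failure by inserting a duplicate item,
    reducing the diversity of an otherwise-correct result list.
    (Illustrative sketch; not part of the actual Playbook.)"""
    rng = random.Random(seed)  # seeded so a test session is reproducible
    corrupted = list(results)  # leave the original list untouched
    duplicate = rng.choice(corrupted)
    corrupted.insert(rng.randrange(len(corrupted) + 1), duplicate)
    return corrupted

# Example: a presentation design assistant's five slide recommendations
designs = ["Slide A", "Slide B", "Slide C", "Slide D", "Slide E"]
print(simulate_reduced_diversity(designs))
```

A prototype could swap this corrupted list in for the real output to observe how users react to near-duplicate recommendations before any model is built.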

User Study (60 min)

We evaluated the Playbook in think-aloud sessions with 9 of our previous participants, who completed two tasks using the AI Playbook (remotely, through screen sharing).

  • Task I: Use the tool to facilitate prototyping a fictional chatbot system designed to help website visitors search for an apartment.
  • Task II: Apply the tool to a feature or system that their product team was currently in the process of conceptualizing or prototyping.

We asked participants to think aloud while completing both tasks, and we followed up with product-focused questions covering concrete aspects of the tool, its suitability and applicability to their current work practices, its perceived or anticipated value, and opportunities for further development.

Opportunities for Addressing AI Failures With the AI Playbook

Three key insights emerged from our evaluation of the AI Playbook.

Many saw how the AI Playbook could help them step outside idealized interaction scenarios by systematically navigating the error space and standardizing the types of errors included in testing their prototypes.

"This will standardize the error case design [and] help improve the PM spec a lot."

Project Manager

"You know, that’s pretty low investment, right? For a high [value] trade off. Yeah, I really enjoyed this."

UX Design Manager

Another perceived benefit of the tool was its ability to show the full range of options and the consequences of design decisions in just 3-4 minutes.

We also found that the tool could help teams overcome breakdowns in communication by directing their attention to shared terminology and scenarios that designers, machine learning engineers, and data scientists can all relate to.

"I feel like it puts everybody on the same level in terms of being able to use the same terminology to have very clear scenarios laid out."

UX Design Manager

Lessons Learned
  • Errors will never fully go away, but planning for them, rather than reacting to them, is a step in the right direction toward mitigating them.
  • Practitioners see many recommendations and guidelines for best practices, but acting on them is difficult without the time or bandwidth to interpret how they apply to their own localized problems.
  • There are many promising opportunities in building tools that provide tailored, actionable guidance as early as possible in the product development process.

Looking Ahead

This study focused only on failure scenarios that affect the end user and occur within a single interaction session. There are many possibilities for extending the AI Playbook to other scenarios of use. Beyond considering failures that impact other stakeholders or unfold over long-term interaction sessions, we should begin to address failures that are contingent on the end user's personal, cultural, and societal context. Tackling these failures, however, will bring another set of challenges, as it requires access to longitudinal and personal data, which may not yet exist or may pose risks to users' privacy. In light of these considerations, I'm happy to announce that the AI Playbook (rebranded as the HAX Playbook) is now available as an open source project and can be extended to accommodate multiple scenarios of use and the specific needs of product teams.

Presentation & Demo

Microsoft's HAX Playbook (previously the AI Playbook) is an interactive tool for generating interaction scenarios to test when designing user-facing AI systems. An interactive demo is available.

Publications and Project Outcomes
  • New toolkit aims to help teams create responsible human-AI experiences, by Leah Culler, Microsoft AI Blog for Business & Tech.
  • The HAX Toolkit Project. Project Leads: Saleema Amershi, Mihaela Vorvoreanu. Core Team: Andrew Anderson, Quan Ze (Jim) Chen, Adam Fourney, Jason Geiger, Matthew K. Hong, Besmira Nushi, Tobias Schnabel, Jenn Wortman Vaughan, Kathy Walker, Hanna Wallach, Meg Young.
  • Webinar: Create human-centered AI with the Human-AI eXperience (HAX) Toolkit, by Mihaela Vorvoreanu and Saleema Amershi, Microsoft Research Webinar Series.
  • "The Efforts to Make Text-Based AI Less Racist and Terrible", by Khari Johnson, WIRED.
  • Planning for Natural Language Failures with the AI Playbook. Matthew K. Hong, Adam Fourney, Derek DeBellis, and Saleema Amershi. Proceedings of the 39th Annual ACM Conference on Human Factors in Computing Systems (CHI 2021), Yokohama, Japan, 2021 (26.3% acceptance rate). ACM Digital Library.
  • HAX Playbook Open Source Repository (GitHub). Nicholas King, Jingya Chen, Kathleen Walker, Mihaela Vorvoreanu, Xavier Fernandes, Juan Lema, Adam Fourney, Saleema Amershi, Derek DeBellis, Matthew K. Hong.