Conversational interfaces represent the next frontier of human-computer collaboration, where interaction through manipulating visual representations gives way to language-based communication. What does this mean for the field of conversational AI? How do we adapt our human-centered design practices to build conversational assistants that do not only what we say but what we want?
The urge to “communicate” with machines has been one of the main drivers behind developments in computer science since the field's early days. In MIT’s science report aired on TV in 1963, Prof. Steven Coons announces the demonstration of a computer program called Sketchpad as follows:
Sketchpad was the first program in the history of human-computer interaction to use the concept of direct manipulation of visual representations: with a “light pen”, the operator could draw lines and circles, grab them, move them, change their size, and apply constraints. Developed by Ivan Sutherland in his 1963 thesis titled A Man-Machine Graphical Communication System, it is commonly referred to today as the first instance of a graphical user interface that enabled an operator without knowledge of any coding language to “communicate” in real time with a computer.
Around the time of the appearance of Sketchpad and the first graphical user interfaces, a new academic field called cognitive science started gaining ground. At the dawn of the discipline, cognitive scientists studied computers as models for human cognition and believed that by programming the right representations/symbols into the system along with rules to manipulate and transmit them, computers could learn human-level reasoning.
Influenced by the ideas of cognitive science, human-computer interaction (HCI) researchers and designers started working with the concept of mental models. A mental model, they claimed, is a representation living in users' heads of what they believe about the system at hand, based on their interactions with the interface. Designers thought that if they figured out which affordances would create the right conceptual model of how the system works in the users’ minds, users would learn to operate the systems faster.
Thus, in order to facilitate the creation of “correct” mental models (faithful representations of reality), designers of early GUIs based their designs on familiar real-world objects. Files started to look like pieces of paper, file directories like file folders, and arithmetic operations like physical calculators. Good visual design became synonymous with designing skeuomorphic affordances: visual elements that were modeled after real-world objects. This approach did drastically reduce the learning curve of these systems, but not for the reasons HCI researchers theorized at the time.
I clearly remember the moment when, as a child, I realized that the Windows operating system had been modeled after typical workplace objects. I must have been around 8 years old, and I was already fluent in navigating Windows. As an 8-year-old living in a non-Western country, I was not familiar with the domain of white-collar office work, so I couldn’t have used it as a mental model for learning to navigate the system. And yet, I figured out how to do it intuitively, through trial and error, much like the way I learned how to speak: not through cramming meanings of words and grammatical functions, but through interacting with my parents.
The success of these early GUIs lay in their ability to create “ontologically clean” experiences, a result of their being modeled after a very specific real-world domain. The fact that they were modeled after a physical space with an established operational structure helped create a coherent “world” or “language” that was easy to learn even for those who were not familiar with the domain itself.
How Language Actually Works
Sketchpad’s design was based on the inherent assumption that human communication is based on the manipulation of symbolic representations of real-world objects and phenomena. But at its most basic level, human language is not the manipulation of a storehouse of representations through reasoning, but a fundamentally contextual and performative vehicle for mutual orientation between two conversing agents.
Wittgenstein, one of the most prominent 20th-century theorists of language, argued that the meaning of words does not live independently of their context, usage, and grammar. To understand the meaning of a word is not to have some definition in your head, but to be able to use the word correctly in the activities and practices in which it is commonly used.
As Terry Winograd and Fernando Flores write in their work Computers and Cognition:
Or to put it differently, there is no language that exists outside of interaction and there is no interaction that happens in a vacuum, outside of a particular context.
Morphological, syntactic, and semantic correspondences between the world and language emerge through use but do not precede use as a governing superstructure. They are models that humans created in order to structure a system (language) that we have an inherent understanding of. But they are not inherent to the language itself. And this becomes apparent with the success of neural language models.
As neural nets recognize patterns to determine a context-based interpretation that can predict the likelihood of a sequence of words, they uncover patterns that are the most “useful” to the particular context. Thus if we define the particular context of the interaction, we can curate the right training data and have neural nets find the most useful patterns to accommodate it. This has an immediate implication for design: we have to design for context and not for content.
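To make the idea of context-dependent prediction concrete, here is a deliberately tiny sketch. It uses bigram counting as a stand-in for a neural language model (an assumption made for brevity; real neural nets learn far richer patterns), and the two corpora are invented for illustration. The point is only that the same word gets different likely continuations depending on the domain the training data comes from.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Relative frequency of each word observed right after `word`."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Two made-up domain corpora (hypothetical data, illustration only).
BANKING = [
    "please check my balance",
    "my balance is low",
    "transfer my balance today",
]
FITNESS = [
    "improve my balance with yoga",
    "my balance exercises help",
    "check my posture and balance",
]

banking_model = train_bigrams(BANKING)
fitness_model = train_bigrams(FITNESS)

# The same word, "balance", is continued differently in each context.
print(next_word_probs(banking_model, "balance"))
print(next_word_probs(fitness_model, "balance"))
```

Neither model "knows" a definition of balance; each one only reflects how the word was used in the data it saw, which is why curating data for the intended context matters.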
Designing for Context
Most software that fails does so because it was not designed appropriately for the context in which it was to be deployed. And because of the fundamentally contextual nature of language, this is even more true of conversational interfaces. A bad design doesn’t take into account how the situational context, background, and expectations of users shape interpretation, and it forces the user to deal with complexities that belong to the wrong domain.
In order to build systems that can appropriately respond to what users want and not just what they say, we have to have an in-depth understanding of the different situational and linguistic end-user contexts and how they relate to each other and to the overarching business use case.
Many chatbot development projects treat context as exhausted by a few slots for tracking the topic of the conversation and user segments such as authenticated and non-authenticated users. This, however, will inevitably lead to domain mismatch.
Context, at its bare minimum, includes the following elements:
- Where the user is in their journey of interacting with the organization
- The topic of the conversation (where the user is in the dialogue)
- Previous touch points between the company and the user
- What the company knows about the user: end-user personas (linguistic profile, socio-cultural context) or contextual user profiles (CUP)
- The setting of the interaction (time, day, location etc.)
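As a rough sketch, the elements above could be carried through a conversation in a single structure rather than a couple of topic slots. The class name `ConversationContext`, its field names, and the `missing_elements` helper are all hypothetical, chosen only to mirror the list; they are not part of any particular framework.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class ConversationContext:
    """Hypothetical container mirroring the minimum context elements."""
    journey_stage: str                          # where the user is in their journey
    topic: Optional[str] = None                 # where the user is in the dialogue
    previous_touchpoints: List[str] = field(default_factory=list)
    persona: Optional[str] = None               # end-user persona or contextual user profile
    timestamp: Optional[datetime] = None        # setting: time and day
    location: Optional[str] = None              # setting: location

    def missing_elements(self) -> List[str]:
        """Names of context elements that have not been filled in yet."""
        return [
            name
            for name, value in vars(self).items()
            if value is None or value == [] or value == ""
        ]

ctx = ConversationContext(journey_stage="onboarding", topic="billing")
print(ctx.missing_elements())
# ['previous_touchpoints', 'persona', 'timestamp', 'location']
```

A helper like `missing_elements` makes the gaps visible: a bot that only ever fills in `topic` is running on a small fraction of the context listed above.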
Context and situation influence linguistic code choices in such a way that every context is associated with a particular language variety that is considered appropriate for use in that context. If we want our system to perform well, our training data has to be representative of, or fine-tuned for, the particular domain and language variety we are modeling. Otherwise, the bot will not be able to recognise tokens that fall outside of the domain we are building for. With technologies like end-to-end learning and Large Language Models (LLMs) gaining wider ground, decision trees will become more and more obsolete, and designing “bounded contexts” will become the primary means of creating experiences that are able to adapt to unique user needs.
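One simple way to see a mismatch between training data and the language variety users actually write in is to measure how many tokens of an incoming utterance were ever seen in training. The sketch below assumes naive whitespace tokenization and invented example data; production systems typically use subword tokenizers, which soften but do not remove the problem.

```python
def build_vocabulary(training_utterances):
    """Collect every lowercased, whitespace-separated token seen in training."""
    return {tok for utt in training_utterances for tok in utt.lower().split()}

def out_of_vocabulary(utterance, vocabulary):
    """Tokens in the utterance that never appeared in the training data."""
    return [tok for tok in utterance.lower().split() if tok not in vocabulary]

def coverage(utterance, vocabulary):
    """Fraction of the utterance's tokens that the training data covers."""
    tokens = utterance.lower().split()
    if not tokens:
        return 1.0
    return 1 - len(out_of_vocabulary(utterance, vocabulary)) / len(tokens)

# Hypothetical training data written in a formal register.
TRAINING = [
    "i want to check my balance",
    "transfer money to my savings account",
]
vocab = build_vocabulary(TRAINING)

# The same intent, expressed in a more casual language variety.
casual = "yo can u check my balance"
print(out_of_vocabulary(casual, vocab))  # ['yo', 'can', 'u']
print(coverage(casual, vocab))           # 0.5
```

Low coverage on real user messages is a hint that the training data represents a different language variety than the one the bot meets in deployment.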
Bounded context is a term I’m borrowing from Domain Driven Design, a software 1.0 design approach focusing on modeling software to match a domain according to input from that domain's experts. Bounded contexts can be useful to conceptualize designing for software 2.0 systems (neural nets), too: after mapping the domain the model should operate in, different teams should be assigned to different bounded contexts, in order to make sure they can accurately curate and fine-tune data that is representative of the domain.
In the case of LLM-augmented systems, designing the right prompts complements teaching through training data. By understanding the domain and the changing context of the end-user, we will be able to write prompts that steer the conversation toward the most accurate and contextually appropriate output.
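A minimal sketch of what a context-aware prompt might look like follows. The template wording, the `build_prompt` function, and the context keys (`domain`, `journey_stage`, `register`) are assumptions made for illustration; the sketch only assembles a string and does not call any particular LLM API.

```python
PROMPT_TEMPLATE = (
    "You are an assistant operating in this domain: {domain}.\n"
    "The user is currently at this stage of their journey: {journey_stage}.\n"
    "Respond in this register: {register}.\n"
    "User message: {message}\n"
    "Assistant:"
)

# Fallback values used when a context element is unknown (hypothetical defaults).
DEFAULTS = {"domain": "unspecified", "journey_stage": "unknown", "register": "neutral"}

def build_prompt(message, context):
    """State the known context before the user message; fall back to defaults."""
    fields = {key: context.get(key, default) for key, default in DEFAULTS.items()}
    return PROMPT_TEMPLATE.format(message=message, **fields)

prompt = build_prompt(
    "Where is my refund?",
    {"domain": "retail banking", "journey_stage": "complaint"},
)
print(prompt)
```

Because the context is passed in on every turn, the prompt can change as the user's situation changes, instead of being fixed once per user segment.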
At Rasa, we developed the concept of adaptive brand experiences, which takes personalisation a step further, opening up a paradigm where the assistant can not only match a static end-user segment but also adapt to changing end-user needs. Because interpretation is such a big part of creating conversations that feel natural, and interpretation primarily relies on context, we believe that predicting changing end-user contexts is key to developing assistants that do not only what users say but what they actually want, even when they do not articulate it.
In other words, conversational design has to be primarily context-driven, and has to be able to adapt recursively to shifting user needs based not only on user utterances (the content of the conversation) but also on user context.