Since we first published the 5 levels of AI assistants, I've spoken about them at conferences, and the community has commented on them, pulled them apart, and pushed the thinking further. The market and tech have changed, and it's time for an update.
Conversational AI is hard, and it is a huge opportunity. We have the chance to build the most user-friendly applications in the history of software. There's one idea I'd like you to keep in mind when reading this post: end users are already telling us what they want from AI assistants, and to get to level 5 we "just" have to listen.
Why this matters
The internet and smartphones have made travel (and many other things) vastly easier for my generation. But people close to me have been left behind because they lack confidence with the technology. If instead we can use tech the same way we interact with other humans, we have a chance to build software that serves vastly more people.
Making it possible for anyone to use conversational AI is already hard, but it's not enough. We also need to empower all kinds of people to create these applications. There are a million important uses for this technology that big tech companies will never build (check out our showcase for some examples). The only way those applications get built is through a massive developer movement.
If you're reading this, you've probably built conversational AI before and know that it's not easy. You and I and everyone who has experience with this owe it to the next generation of developers to build better tools and abstractions so they don't have to find out how hard it is the hard way.
The 5 levels
For AI assistants to work for everyone, both the end-user and developer experiences have to improve dramatically.
The end user perspective
As we progress through the five levels, assistants become more accommodating of the way humans think, and feel less like an API endpoint.
Let's use the example of a person getting a quote for a mortgage. At each level, we are lowering the burden on the end user to translate what they want (a new place to live) into the language of the bank (a quote for a 30-year fixed-rate mortgage).
Level 1 assistants put all the work on the end user. Static web forms are one example, command line applications are another. Remember that before Windows came around, everyone used a computer this way. Command line apps are great for experts (I'm a huge fan myself) but have a steep learning curve. The command for generating a mortgage quote might look like this:
A command like this requires the end user to know exactly which input fields to provide. Tricky, but still much more convenient than doing the calculations by hand.
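To make the level 1 experience concrete, here is a minimal sketch of such a command line tool in Python. The tool name `mortgage-quote` and its flags are hypothetical, chosen only for illustration; the payment calculation is the standard fixed-rate amortization formula.

```python
import argparse


def monthly_payment(principal: float, annual_rate_pct: float, years: int) -> float:
    """Monthly payment for a fixed-rate loan (standard amortization formula)."""
    n = years * 12                  # number of monthly payments
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / n        # no interest: just split the principal
    return principal * r / (1 - (1 + r) ** -n)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical tool name and flags, for illustration only.
    parser = argparse.ArgumentParser(prog="mortgage-quote")
    parser.add_argument("--amount", type=float, required=True, help="loan amount")
    parser.add_argument("--rate", type=float, required=True, help="annual interest rate in percent")
    parser.add_argument("--years", type=int, required=True, help="loan term in years")
    return parser


def main(argv) -> str:
    # Every field is mandatory: the user has to know all of them up front.
    args = build_parser().parse_args(argv)
    payment = monthly_payment(args.amount, args.rate, args.years)
    return f"Monthly payment: {payment:.2f}"


# Equivalent to running: mortgage-quote --amount 300000 --rate 3.5 --years 30
print(main(["--amount", "300000", "--rate", "3.5", "--years", "30"]))
```

Leaving out any one of the three flags makes the tool exit with a usage error, which is exactly the burden a level 1 interface places on the user.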
Level 2 assistants are basic chatbots. If a user says something similar to "I'm interested in mortgage rates", the chatbot will ask them a series of questions, gathering the same information that we previously provided in the command line.
The frustrating thing about level 2 assistants is that they crash hard when you deviate from the happy path. The user can express their goal ("I want a mortgage") and provide information ("fixed rate please") in natural language, but they have to behave exactly as the assistant expects: provide the information the bot asked for, and do nothing unexpected.
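The happy-path problem can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical bot (the slot names and prompts are assumptions, and it skips language understanding entirely by accepting only exact answers), but it shows the rigid question-by-question structure:

```python
# Hypothetical slots and prompts for a mortgage quote, asked in a fixed order.
QUESTIONS = [
    ("amount", "How much would you like to borrow?"),
    ("years", "Over how many years?"),
    ("rate_type", "Fixed or variable rate?"),
]


class HappyPathBot:
    """Asks its questions in order and accepts only the answer it expects."""

    def __init__(self):
        self.slots = {}  # slot name -> value collected so far
        self.step = 0    # index of the question currently being asked

    def handle(self, message: str) -> str:
        if self.step >= len(QUESTIONS):
            return "Your quote is already on its way."
        slot, prompt = QUESTIONS[self.step]
        text = message.strip().lower()
        if slot in ("amount", "years"):
            valid = text.isdigit()
        else:
            valid = text in ("fixed", "variable")
        if not valid:
            # Anything off the happy path just repeats the same question.
            return "Sorry, I didn't understand that. " + prompt
        self.slots[slot] = text
        self.step += 1
        if self.step < len(QUESTIONS):
            return QUESTIONS[self.step][1]
        return f"Thanks! Preparing a quote for {self.slots}."


bot = HappyPathBot()
print(bot.handle("250000"))
print(bot.handle("how does that compare to renting?"))
```

The second message is a perfectly reasonable question, yet the bot can only repeat itself until the user supplies a number.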
Level 3 assistants are contextual assistants. The user can ask for clarification ("how does that monthly payment compare to renting?", "how long do people usually take to pay that off?"), correct themselves, or change their mind, all without derailing the conversation. The user still has to know exactly what their goal is ("I want a mortgage quote"), but they no longer have to know how to use the assistant to avoid breaking the conversation.
Level 4 assistants are consultative assistants. The user can express their situation in their own terms, e.g. "my kids have gone to college and I want to downsize." A mortgage offer may be the end result, but the user doesn't have to know that. As a developer, it's tempting to think of this as 'ambiguous input', but I don't think that's a helpful mindset. What the person said is perfectly clear, and it's the assistant's job to figure out how it can help.
Level 5 assistants are adaptive assistants. The assistant can suss out the level of detail the user is looking for at this moment. Maybe they just want to know that typical fixed interest rates are around 3-3.5% right now, or maybe they want to calculate a detailed offer. Maybe they have already done a lot of research and need extra clarification. A level 5 assistant picks up cues and adjusts its behaviour.
At each level, we are lowering the burden on the end user to translate what they want into the language of the bank. That's exactly where the value of conversational AI is. But to make that a reality, we need a community of developers with the tools to push the boundary of what's possible.
The developer perspective
There are two unique things about conversational AI that impact the developer experience.
The first is that end users are literally telling us what they want. For other kinds of software, all we can really do is track where people click and guess at what they want. But the information we need to get to level 5 understanding is already there in the conversations people have with our assistants. We "just" need the tools to learn from real conversations and tap into that resource.
The second is that it's not just the developer who can push the assistant to evolve. Because end users can say anything they want, users can also drive changes in AI assistants by changing their behaviour. As I'm writing this, we are experiencing a public health and economic crisis caused by COVID-19. I am sure that any mortgage-related AI assistants running today are getting a flood of messages from newly-unemployed people who are worried about making their monthly payments. The world changes, and sometimes it changes fast. AI has to be able to keep up.
So, the second perspective on the 5 levels is that of the developer. Through the five levels, the developer's experience isn't a straight line of things getting easier; it's an evolution of the role.
Let's not forget what we are trying to do. We want to build software that can converse naturally with humans while completing tasks on their behalf. This is incredibly hard. Going from level 2 to 3 and beyond, we are shifting the burden of 'translation' from the end user to the assistant. We can no longer force the user to think about the problem the way that we do. And that has implications for how we build software.
At level 1, it is easy for a developer to add new functionality. We can release a new version of our command line tool along with new documentation, and it's up to the end user to learn about the new feature. We have split the problem into separate components, and defined rigid rules for how these compose together.
At level 2, it becomes harder and harder to add new functionality as the assistant evolves. Level 2 assistants are built with a fixed set of intents and a set of if statements describing how to respond in different contexts. When you add new capabilities, you have to solve a puzzle: how do I add this new functionality without breaking everything else? By solving every edge case with yet another if statement, you build a house of cards that becomes harder and harder to modify. This problem is so hard that we decided to start Rasa to try and solve it.
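As a rough illustration of that house of cards, here is a hypothetical hand-written dialogue function in Python. The intent names and responses are invented for this sketch and are not taken from any real assistant:

```python
def respond(intent: str, context: dict) -> str:
    """Hand-written dialogue logic: one branch per (intent, context) case."""
    if intent == "ask_mortgage_rates":
        context["topic"] = "mortgage"
        return "Are you interested in a fixed or a variable rate?"
    if intent == "inform_rate_type":
        if context.get("topic") == "mortgage":
            return "Great, how much would you like to borrow?"
        # Edge case patched in later: a rate type with no mortgage topic set.
        return "Which product is that for?"
    if intent == "ask_savings_rates":
        # New capability. Every existing branch that reads context["topic"]
        # now has to be revisited so that it also handles "savings".
        context["topic"] = "savings"
        return "Would you like a fixed or a variable savings rate?"
    return "Sorry, I can't help with that."


context = {}
print(respond("ask_savings_rates", context))
print(respond("inform_rate_type", context))  # falls into the old edge-case branch
```

Adding the savings intent looked like a local change, but the `inform_rate_type` branch was written with only mortgages in mind, so the follow-up answer lands in the wrong place until yet another `if` is added.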
At level 3, we start to accommodate the fact that users don't think about problems the same way developers do, and that not every message can be neatly classified into an intent. As developers, we love splitting larger problems into separate components. But to achieve fluid conversation, we have to accept that users don't respect the boundaries we draw. For example, mortgages and savings accounts both have interest rates associated with them (although these mean very different things). And people also talk about 'saving up' for a mortgage deposit. To a bank, these are separate products, but to a customer they are just means to an end. We have to observe how users are talking to our assistant, and use that information to teach the assistant how to talk. In other words, we have to practice Conversation-Driven Development (CDD). To really move past level 2, many things have to change, including getting rid of intents. This is a tough nut to crack, but advancing through levels 3, 4, and 5 is the reason Rasa exists.
At level 4, we start to automate parts of the CDD process. Automatically finding which conversations were successful and which ones failed (and where) is one piece of the puzzle. When assistants gain the ability to judge conversations as successful or not, they can automatically turn successes into new training data, and point the developer to exactly the mistakes they should fix to improve the assistant further.
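One way to picture this triage step is the sketch below. It assumes a very crude success signal (the conversation reached a goal action and never hit a fallback), and the data structures and action names are hypothetical; a real system would need a far better judge of success.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    user: str        # what the user said
    bot_action: str  # what the assistant did in response


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)


# Hypothetical action names; a real assistant would define its own.
FALLBACK = "action_fallback"
GOAL = "action_send_quote"


def first_failure(conv: Conversation) -> Optional[int]:
    """Index of the first turn where the assistant fell back, or None."""
    for i, turn in enumerate(conv.turns):
        if turn.bot_action == FALLBACK:
            return i
    return None


def triage(conversations):
    """Split conversations into training candidates and items for review."""
    training, review = [], []
    for conv in conversations:
        failed_at = first_failure(conv)
        reached_goal = any(t.bot_action == GOAL for t in conv.turns)
        if failed_at is None and reached_goal:
            training.append(conv)             # success: candidate training data
        else:
            review.append((conv, failed_at))  # point the developer at the turn
    return training, review
```

Conversations in `review` carry the index of the turn where things went wrong (or `None` if the user simply never reached the goal), so a developer can go straight to the problem.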
At level 5, we are technically capable of fully automating CDD. To add new capabilities to an assistant, we provide the business logic for how to complete a task, and with a few practice conversations the assistant can complete the new task as effectively as established tasks, and can switch between them while carrying over context. While the process of learning from real conversations can become largely automated, I believe people should be involved to steer the learning process, setting direction and looking out for problematic behaviour. Software for humans should be created by humans.
We can see the developer's role evolve as we progress past level 2 to levels 3, 4, and 5. Rather than wrestling with edge cases by adding more logic at the micro level, they architect a system that handles the small-scale decisions on its own. We can make an analogy with another field of software engineering that has embraced data and automation: application performance monitoring. Advanced APM infrastructure lets developers monitor complex systems of services and respond more intelligently to changes in the system. With this infrastructure, engineers can build and scale systems that wouldn't be possible to maintain without this automation.
How we'll get there
Meena and BlenderBot are cool, but level 3, 4, or 5 conversational AI will not arrive in a press release from a big tech company. We're not waiting for a single breakthrough, or yet another, bigger model. In the years since we started working on this problem, I've been humbled by how much work it is to build the infrastructure and tooling required to learn from real conversations and move beyond level 2.
We can look to the web for inspiration. You never needed anybody's approval to build a website, or to invent a new use for the internet. If everyone had been stuck with using a WYSIWYG website builder, Web 2.0 would never have happened. To accelerate progress on conversational AI we need the same ingredients: open source tools and the ability to try new things without asking for permission. We need to enable every developer to push what's possible with conversational AI, not just the big tech companies.
Open source code makes a field progress 10x faster. Building level 3 assistants is hard, and it's not like we at Rasa have all the answers. Developers regularly hack things into our open source framework for their own purposes. They don't have to file a feature request and pray that we include it in our next release. Anyone can extend our infrastructure and put their new idea to work.
Rasa as a company is not going to come up with all the ideas we need to move through the 5 levels. We have a friendly, global community of developers and other makers exchanging and building on each other's ideas. The bigger this community gets, the more help new users get and the faster new ideas propagate. This creates a network effect. By building a community, we get feedback on how makers are progressing on the five levels, which things work and which don't.
In the future almost all software will contain some machine learning. ML is no longer a niche field but a foundational piece of software engineering. So we need to make ML usable for developers, and developers need opportunities to learn how ML works and deepen their understanding. This is why we created NLP for developers and the algorithm whiteboard.
There are more NLP papers being published now than ever before. Rasa contributes in two ways: we do our own research on the approaches we believe will get us to level 5, and we look for the most meaningful breakthroughs in the field and find a way to make our community benefit from them. My aspiration is for Rasa to be the best place in the world to do applied research in conversational AI.
Conversational AI needs you
The short answer to 'how will we get to level 5?' is, well, all of you. So I'll end with some practical things you can do to accelerate progress.
- Build something awesome and submit it to the showcase to inspire others.
- Go to the forum and tell us where you get stuck.
- Talk to others about CDD.
- If you disagree with these definitions of the 5 levels, or have another perspective to add, email me!
And thank you to all the Rasa Heroes and Superheroes who gave feedback on this post.