Despite the hype, there is a lot of work to be done before we can build conversational software. For now, it's an ugly marriage of bits of AI which "kind of" work and lots of hand coding. These are some notes about what interesting conversational software would look like, and what techniques we'll need to build it.
What VCs get excited about
It may be obvious, but I feel we have to point out that the giddy excitement around bots stems from there being something new to build, invest in, and write Medium posts about, and not from exciting new technology. For VCs, new platforms mean new opportunities to bundle and unbundle services, and new battlegrounds for the big players (likely leading to acquisitions). So even without real technological breakthroughs, there is at least some money to be made investing in bot startups.
What we ought to be excited about
Alex and I have been thinking a lot about what would make a meaningful, bot-native UX. Truly 'conversational' software can manage a complex, evolving state. It can use knowledge provided a priori by the developer, and can gain new knowledge from users. That's a difficult problem to solve. There's exciting research being published every week which suggests we may be close, and the big tech companies' ability to push research ideas into production software is super impressive. It's not a sure thing, but it's possible that we'll see intelligent conversational agents soon. But it will take a few breakthroughs and lots of graft.
Three Levels of Conversation
It makes sense to think about three different levels of how 'conversational' a bot is. The key component is statefulness:
level 1 - command & response
Stateless bots are essentially command-line apps over HTTP. They are only conversational in the sense that commands can be expressed in natural language, e.g. "find me a cheap flight from Amsterdam to Rome". There is some NLP involved in extracting structured data from those requests, but tools like wit make these simple cases really easy. You can build something like this during a hackathon.
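To make "extracting structured data" concrete, here is a minimal sketch of a stateless handler for the flight example. It uses a hand-written regular expression rather than an NLP service, and the names `FLIGHT_PATTERN` and `parse_flight_query` are made up for illustration. A real bot would need something far more robust.

```python
import re

# Hypothetical pattern covering one narrow phrasing of a flight request.
# City names are assumed to be single capitalised words.
FLIGHT_PATTERN = re.compile(
    r"(?P<cheap>cheap\s+)?flight from (?P<origin>[A-Z][a-z]+) to (?P<destination>[A-Z][a-z]+)"
)


def parse_flight_query(text):
    """Turn one message into a structured command, or None if it doesn't match."""
    match = FLIGHT_PATTERN.search(text)
    if match is None:
        return None
    return {
        "intent": "find_flight",
        "origin": match.group("origin"),
        "destination": match.group("destination"),
        "cheap": match.group("cheap") is not None,
    }
```

Nothing is remembered between calls: each message is parsed on its own, which is exactly what makes this level 1.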
level 2 - hard-coded conversation flows
This is the 'business logic' of most of the bots which currently exist. Users essentially navigate a flow chart defined by the developers. The bot's state corresponds directly to a particular block of code being executed.
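A hard-coded flow like this can be sketched as a small state machine. The states, prompts, and the `run_flow` helper below are invented for illustration and are not taken from any particular bot framework.

```python
# Hypothetical flow chart for a shoe-ordering bot.
# Each state has a prompt and the name of the state that follows it.
FLOW = {
    "ask_size": {"prompt": "What size are you?", "next": "ask_colour"},
    "ask_colour": {"prompt": "Which colour?", "next": "confirm"},
    "confirm": {"prompt": "Shall I place the order?", "next": None},
}


def run_flow(answers, start="ask_size"):
    """Walk the flow chart, storing each user answer under the state that asked for it.

    Returns the state the conversation ended in (None once the flow is finished)
    and the answers collected so far.
    """
    state = start
    collected = {}
    for answer in answers:
        if state is None:
            break  # flow already finished, ignore extra input
        collected[state] = answer
        state = FLOW[state]["next"]
    return state, collected
```

The bot's entire "memory" is the name of the current state, so anything the user says that doesn't fit the current prompt has nowhere to go.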
level 3 - fuzzy/continuous/fluid state
This is what we should be aiming for. Human conversations don't follow a template, even when both parties have an agenda. Hard-coding conversations as flow charts won't work forever, so how do we make the jump?
Before jumping into machine learning topics, how closely can we mimic organic conversation with hand coding? Scripting tools like superscript get you quite far. They let you define complex conversation flows and supply multiple responses for variety. You can make use of large hand-crafted knowledge bases which contain semantic relationships like "basketball is a sport" and "Seoul is in Korea", to provide some 'common sense' knowledge of the world. Evi is an intelligent bot built with "knowledge base" technology.
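As a rough illustration of what a knowledge base of semantic relationships looks like, here is a toy triple store. The relation names, the extra "Korea is in Asia" fact, and the assumption that every relation can be chained transitively are all simplifications made up for this sketch.

```python
# Facts stored as (subject, relation, object) triples.
FACTS = {
    ("basketball", "is_a", "sport"),
    ("Seoul", "is_in", "Korea"),
    ("Korea", "is_in", "Asia"),  # extra fact, added here for illustration
}


def holds(subject, relation, obj, facts=FACTS):
    """True if the fact is stored directly or follows by chaining the same relation."""
    frontier = [subject]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for s, r, o in facts:
            if s == current and r == relation:
                if o == obj:
                    return True
                frontier.append(o)  # follow the chain one step further
    return False
```

With this, `holds("Seoul", "is_in", "Asia")` is true even though nobody typed that fact in, which is a small taste of the 'common sense' such a store can provide.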
A hybrid approach
What are the things we want from a machine learning model that we wouldn't get from a hand-coding approach? For one, letting users teach our bots things we hadn't anticipated. Let's say you're building ShoeBot. Your user asks "show me some black heels I can walk around in". As a developer, you probably thought to include filters for colour and size and brand, but what about comfort?
A lot of the pieces are there to build a bot which could handle this. Semantic embeddings like GloVe, word2vec, and sense2vec can recognise that 'can walk around in' and 'comfort' are related. Topic models like LDA can find shoe reviews which positively describe comfort. Generative sentence-level models can produce a sequence like "Sure! Here are 3 different pairs of comfortable black heels". So provided you can gather enough quality data, you could probably build a decent shopping bot which can handle 'out of scope' requests. But scraping lots of data and cleaning it is a big unsexy effort. For the near future, this is probably how conversational software will get built.
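The embedding step boils down to comparing vectors. The sketch below uses cosine similarity over made-up three-dimensional vectors. Real GloVe or word2vec vectors have hundreds of dimensions and are learned from text, so the numbers and the `TOY_VECTORS` table here are purely illustrative.

```python
import math

# Made-up vectors, for illustration only. Real embeddings are learned from data.
TOY_VECTORS = {
    "comfortable": [0.9, 0.1, 0.2],
    "walkable": [0.8, 0.2, 0.3],
    "black": [0.1, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors=TOY_VECTORS):
    """Return the other word whose vector is most similar to the given word's."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))
```

A bot could use a lookup like this to map an unanticipated phrase onto the closest attribute it does know how to filter on.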
Will deep learning get us to level 3?
To whet the appetite, here's an example from a recent paper. It shows off a neural network which has learned to answer questions based on text input.
input: Jane went to the hallway.
input: Mary walked to the bathroom.
input: Sandra went to the garden.
input: Daniel went back to the garden.
input: Sandra took the milk there.
Q: Where is the milk?
A: garden
Sanity note: this is synthetic data about a very simple topic. Real children's books are much harder.
The interesting thing about deep learning isn't "neural networks with lots of layers", it's the ability to do supervised learning on essentially raw input data. You take a bunch of sample conversations, and optimise the parameters in a big neural net until the model produces similar responses to what it saw in the training data. It simultaneously learns an efficient way to represent the data and a model which uses that representation. Designing neural network architectures which are capable of learning tasks like the above is a hot topic in ML research. Denny Britz's blog goes into more detail on various models if you want to dive in.
Harder than playing Go
I'm even more intrigued by the promise of reinforcement learning. The team at DeepMind have shown impressive results in the past few years. Their Atari and Go-playing systems are interesting because they devise complex strategies from their own experience. A similar approach could let us build conversational agents which ask questions their creators had never thought of - all in pursuit of a particular goal. But unlike the world of games, it'll be much harder to build an objective function to optimise. Going back to our ShoeBot example, if we train our system to optimise for maximum purchase value, we could end up with a very aggressive salesbot, or one which exploits weaknesses in our psychology. Instead, we'll need mathematical functions (and training data!) which capture the brand, personality, and strategy behind our bots. That data will also be much harder to generate: AlphaGo practices by playing itself millions of times. Bootstrapping a training partner for a conversational agent is not as straightforward.
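To see why the choice of objective matters, compare two toy reward functions for ShoeBot. Everything here is hypothetical: the episode fields, the idea of counting "pushy" messages, and the penalty weight are stand-ins for the much harder problem of quantifying brand and personality.

```python
def naive_reward(episode):
    """Reward only the money made in the conversation."""
    return episode["purchase_value"]


def shaped_reward(episode, pushiness_penalty=5.0):
    """Reward the purchase, minus a hypothetical penalty per pushy message."""
    return episode["purchase_value"] - pushiness_penalty * episode["pushy_messages"]


# Two made-up conversation summaries.
aggressive = {"purchase_value": 120.0, "pushy_messages": 6}
polite = {"purchase_value": 100.0, "pushy_messages": 0}
```

Under `naive_reward` the aggressive conversation scores higher, while under `shaped_reward` the polite one does. An agent trained on the first objective would be nudged towards exactly the salesbot we don't want.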
For all its merits, deep learning might just be an application developer's worst nightmare. In the question-answering example above, no one had to manually insert knowledge about milk, or how physical items can be moved around - the model inferred this knowledge directly from the training data.
Unfortunately, that's a double-edged sword. While these systems are flexible and able to learn a great deal on their own, you can't just insert an extra fact or explicitly handle an edge case. It's also difficult to understand why these models behave the way they do. Worse, any insight you get by inspecting a trained model might be completely wrong if you re-train it. So you probably have to build a whole separate system around your deep learning model, to make sure it doesn't do things which are too unexpected. Deep learning and hand-coding are unhappy bedfellows. The big tech companies surely have lots of experience dealing with these issues, but it's not the kind of thing you show off in a publication or an open-source tool.
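A very rough sketch of such a "separate system" is a wrapper that checks hand-coded overrides before calling the model and vets the model's reply afterwards. The function name, its parameters, and the fallback message are assumptions made for illustration. The `model` argument stands for any callable that maps a message to a reply.

```python
def guarded_reply(model, user_message, overrides, is_acceptable,
                  fallback="Sorry, I didn't get that."):
    """Wrap a learned model with hand-coded edge cases and an output check.

    model         -- any callable taking a message and returning a reply
    overrides     -- dict of lower-cased messages to hand-written replies
    is_acceptable -- callable returning True if a model reply is safe to send
    """
    key = user_message.strip().lower()
    if key in overrides:
        # Explicitly handled edge case: the model is never consulted.
        return overrides[key]
    reply = model(user_message)
    return reply if is_acceptable(reply) else fallback
```

This doesn't make the model any easier to understand, but it does give you a place to put the facts and edge cases you can't insert into the network itself.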
Despite all the progress in deep learning, it's not at all easy for a small team of engineers to just pick a domain (like buying shoes) and train an AI chatbot which is ready for release into the wild. I've seen a couple of startups who promise to make this easier, but I won't mention them since I haven't actually seen their tech. Microsoft are pushing hard, and I'm sure we'll see offerings from Facebook and Google as well. For now, at least, startups can build a defensible business by getting their hands dirty and building better conversational software the inelegant way. In a few years, that could become a lot easier, and that will change the landscape all over again.
PS Here's me: https://twitter.com/alanmnichol
Interested in joining us to build more human-like bots? We're hiring!