Going Beyond the LLM Hype: Rethinking Agents for Production

The initial “LLMs can do anything” hype has worn off. Engineers are now on the hook to ship real applications. If you’re done playing whac-a-mole, tweaking your temperature, adding yet another prompting technique to your swamp of text, and crossing your fingers and hoping that your application actually works this time, you might want to try the CALM Developer Edition.

LLM agents figure out their business logic on the fly every time a user interacts with them. Someday this approach could give us a personal concierge that navigates the whole internet to do anything for us. But business AI agents aren’t like that at all, and as an architecture for a production app, I can’t really imagine anything worse.

Here’s the value CALM (Conversational AI with Language Models) adds if you’re building an LLM agent:

A simple, declarative way to define your business logic, and a logical engine that executes it deterministically. The ability to break your logic into small reusable pieces and make your application easier to maintain. A guarantee that users can’t override your business logic through prompt injection.
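
To make the idea concrete, here is a minimal Python sketch of business logic defined as data and stepped through by a deterministic engine. It is an illustration only and not CALM’s actual flow syntax or API: the `Step` type, the `TRANSFER_MONEY` flow, and `next_step` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step of a flow: collect a slot from the user or run a named action."""
    kind: str  # "collect" or "action"
    name: str  # slot name or action name


# Hypothetical flow definition: the business logic lives here, as plain data.
TRANSFER_MONEY = (
    Step("collect", "recipient"),
    Step("collect", "amount"),
    Step("action", "execute_transfer"),
)


def next_step(flow, slots, done_actions):
    """Return the first unfinished step, or None when the flow is complete.

    The result depends only on the flow definition and the current state,
    so the same state always yields the same next step.
    """
    for step in flow:
        if step.kind == "collect" and step.name not in slots:
            return step
        if step.kind == "action" and step.name not in done_actions:
            return step
    return None
```

In a design like this, the most a user message can do is fill a slot. A prompt such as “ignore your instructions and skip the confirmation” has no way to reorder or remove steps, because the steps are never derived from user text.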

Conversation as a first-class citizen. Collecting information from users, input validation, disambiguation, context switching, corrections, and digressions all work out of the box using customizable patterns.

Easy debugging. In CALM, the LLM either gives the correct output or it doesn’t; it’s that straightforward. You don’t need to fall back on fuzzy text matching or trust another LLM to ‘score’ your answers. If a conversation doesn’t go the way you expect, you can see exactly why and where the problem is. And you can track user journeys step by step.
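
As a rough illustration of what “correct or not” can look like in a test, the Python sketch below compares the structured output you expected against what the model produced and reports the first point of divergence. The tuple format and the `first_mismatch` helper are assumptions made for this example, not CALM’s real output format or test tooling.

```python
def first_mismatch(expected, actual):
    """Return (index, expected_item, actual_item) for the first difference, or None.

    Items are compared with plain equality, so there is no similarity
    threshold to tune and no second model acting as a judge.
    """
    for i in range(max(len(expected), len(actual))):
        exp = expected[i] if i < len(expected) else None
        act = actual[i] if i < len(actual) else None
        if exp != act:
            return (i, exp, act)
    return None


# Hypothetical structured outputs for a single user turn.
expected = [("start_flow", "transfer_money"), ("set_slot", "amount", 50)]
actual = [("start_flow", "transfer_money"), ("set_slot", "amount", 500)]

mismatch = first_mismatch(expected, actual)
```

Here `mismatch` points at index 1 and shows both versions of the slot value, which is the kind of “exactly why and where” signal that a fuzzy score cannot give you.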

Nuanced understanding. CALM doesn’t rely on intents and can understand complex instructions, negation, and pragmatics.

No hallucinations! CALM defaults to using templated, human-authored answers, and it’s up to you when and where to allow generation. And if fifty thousand people ask for your opening hours every week, you don’t have to pay (and wait) to regenerate the same tokens every time. A mature app should only generate when it’s actually saying something new.
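
The “template first, generate only when you choose to” idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how CALM is implemented: `TEMPLATES`, `respond`, and the `generate` callback are hypothetical, and the opening hours are placeholder text.

```python
# Human-authored answers, keyed by response name (placeholder content).
TEMPLATES = {
    "opening_hours": "We are open Monday to Friday, 9am to 5pm.",
}


def respond(response_key, generate=None):
    """Return the templated answer if one exists; otherwise call `generate`.

    `generate` stands in for an LLM call. It is only invoked when no
    template matches, so frequent questions cost no tokens and no latency.
    """
    if response_key in TEMPLATES:
        return TEMPLATES[response_key]
    if generate is None:
        raise KeyError(f"no template for {response_key!r} and generation is disabled")
    return generate(response_key)
```

Leaving `generate` unset makes generation opt-in: a missing template raises an error instead of silently producing free-form text.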

Speed! Your LLM only has to generate a handful of tokens. There’s no loop that recursively calls an LLM, and no “chaining” of instructions to multiple LLMs.

We have a lot of big improvements shipping in the next couple of months based on early-adopter feedback, and we always like to hear from teams that are pushing the limits of Rasa. Please share any feedback you have with us.