
January 30th, 2025

Self-improving CALM prompts using DeepSeek R1

  • Alan Nichol

  • Daksh Varshneya

Prompt engineering is a science, an art, and frustratingly rooted in trial and error. It’s also essential to getting the most out of LLMs. Without clear, concise, and complete instructions, an LLM can miss the mark on completing a task accurately. Impressed by the performance of “reasoning” models like DeepSeek R1, we wanted to see if one could optimize prompts for us and teach us about prompt engineering along the way. Here's what we learned.

The Challenge

The command generator is an essential component of a CALM assistant – it interprets what’s happening in a conversation and produces a list of commands that capture how the user would like to progress the conversation.
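To make that concrete, here is a rough sketch of the command generator’s input/output contract. The command names and the dictionary format below are illustrative rather than CALM’s exact schema:

```python
# Illustrative sketch (not the exact CALM schema): given the conversation so
# far and the latest user message, the LLM emits a short list of commands
# describing how the user wants to advance the conversation.
user_message = "I'd like to send $50 to Joe"

predicted_commands = [
    {"command": "start_flow", "flow": "transfer_money"},
    {"command": "set_slot", "slot": "recipient", "value": "Joe"},
    {"command": "set_slot", "slot": "amount", "value": 50},
]
```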

The nice thing about this kind of LLM task is that we can conclusively say whether the LLM produced the right answer, and “hill climb” using a set of test cases. And we noticed that GPT-4o was making mistakes where other LLMs weren’t.

So, what could we do about it?

With a reasoning model like DeepSeek R1 suggesting modifications to our prompt, we have all the ingredients of a self-improving system: an automated way to run experiments, an objective, automated scoring method, and a model that proposes the next change.

In addition, a nice side effect of models like R1 is that the intermediate “reasoning” text they produce gives clues as to how you might improve further.
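Put together, the loop looks roughly like the sketch below. The two helpers (`run_tests` and `revise`) stand in for your own harness: one runs the command generator against the test suite and scores it, the other asks R1 to analyze the failures and propose a revised prompt. Names and the stopping rule are illustrative, not our production setup:

```python
from typing import Callable, List, Tuple

def optimize_prompt(
    prompt: str,
    run_tests: Callable[[str], Tuple[float, List[dict]]],
    revise: Callable[[str, List[dict]], str],
    max_rounds: int = 5,
) -> str:
    """Iteratively let a reasoning model rewrite the prompt, keeping the best version."""
    best_prompt = prompt
    best_score, failures = run_tests(best_prompt)
    for _ in range(max_rounds):
        if not failures:  # every test case already passes, nothing left to fix
            break
        candidate = revise(best_prompt, failures)
        score, candidate_failures = run_tests(candidate)
        if score > best_score:  # only keep revisions that measurably improve the score
            best_prompt, best_score, failures = candidate, score, candidate_failures
    return best_prompt
```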

R1's Analytical Insights

GPT-4o was making errors in handling a money transfer skill. Specifically, the LLM was being overly pessimistic and giving up too quickly. It would predict a “cancel flow” command when the user said “I want to change the recipient”.

Using DeepSeek R1 as a diagnostic tool, we were able to detect subtle issues with the prompt that were potentially causing this:

  1. The prompt lacked explicit instructions about modifying existing slot values during an active flow.
  2. The example actions focused heavily on flow control (starting/canceling) rather than slot modifications.
  3. The conservative guidance ("Don't be overly confident") may have biased the model toward canceling flows.

These issues would have been hard to spot manually. Once R1’s suggestions were incorporated, the assistant handled slot corrections with ease. The bigger lesson? Working with LLMs can be daunting, but reasoning models turn a shaky art into something closer to engineering. They can potentially help bridge the gap between how humans think and how AIs “think”, or at least, how they follow instructions.
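To give a flavor of the fix, the excerpt below paraphrases the kind of guidance R1 suggested adding. It is not the verbatim prompt; the full before/after is in the gist linked below:

```python
# Paraphrased (not verbatim) guidance added to the command generator prompt,
# addressing the three issues above: explicit slot-modification instructions,
# an example that isn't about flow control, and less trigger-happy cancelling.
SLOT_CORRECTION_GUIDANCE = """
If the user wants to change a value they have already provided (for example,
the recipient of a transfer), keep the current flow active and update the
slot with SetSlot(slot_name, new_value). Only predict CancelFlow when the
user clearly wants to abandon the task altogether.
"""
```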

You can see the full prompt and DeepSeek R1’s output in this gist.

The Path Forward

While R1 proved invaluable for this specific debugging case, the challenge now lies in scaling this approach. We're exploring workflows to systematically use R1's reasoning capabilities for analyzing failing conversations in bulk, potentially revolutionizing how we debug and optimize AI conversations.