I’ve been using the phrase “prompt and pray” to describe approaches to conversational AI that claim to make LLMs behave reliably but actually do no such thing.
The architecture of most LLM apps fundamentally looks like this.
Engineers often chuckle when I show this slide because they’ve all felt that way. Small variations in the input prompt can cause wildly different results, and when you make a change to your prompt, the only way to know the impact is to try it and see.
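In code, that architecture is roughly the following. This is a minimal sketch: `call_llm` is a hypothetical stand-in for whatever model API you actually use, and the prompt template is invented for illustration.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM API you actually call."""
    return "whatever the model felt like saying today"

def handle_user_message(message: str, history: list[str]) -> str:
    # Stuff everything we know into one big prompt...
    prompt = (
        "You are a helpful assistant. Follow company policy.\n"
        "Conversation so far:\n" + "\n".join(history) + "\n"
        f"User: {message}\nAssistant:"
    )
    # ...then call the model and pray the output is what we wanted.
    return call_llm(prompt)
```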
Many techniques touted as “guardrails” for Gen AI do not fundamentally change this picture. So if you hear about a new hyped-up approach like “agentic RAG” or “chain-of-thought” or “prompt chaining”, first ask yourself: is this just “prompt and pray” in disguise?
RAG is Great but does not “Solve Hallucinations”
RAG is an incredibly valuable use case for LLMs and a step-change in the quality of “informational” chatbots. But if someone tells you that RAG “solves” the “hallucination problem”, they are telling you they don’t understand how any of this works. Retrieval-augmented generation (RAG) just means pre-fetching some relevant information and adding it to the prompt before calling the LLM.
You only have indirect, limited control over the output this produces. The best that Apple’s engineers could come up with was to add “Do not hallucinate. Do not make up factual information” to their prompts. I’m sure it helped a little.
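A minimal sketch makes the point. Reusing the `call_llm` stub from the earlier sketch and assuming a hypothetical `search_docs` retriever, RAG is one extra step bolted onto the same unconstrained generation:

```python
def search_docs(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical retriever, e.g. a vector-search call; returns relevant passages."""
    return ["passage 1", "passage 2", "passage 3"][:top_k]

def retrieve_then_generate(question: str) -> str:
    # 1. Pre-fetch relevant information.
    passages = search_docs(question)
    # 2. Add it to the prompt, along with some pleading instructions.
    prompt = (
        "Answer using only the context below. Do not make up factual information.\n\n"
        "Context:\n" + "\n---\n".join(passages) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
    # 3. The model is still free to ignore the context; we can only ask nicely.
    return call_llm(prompt)
```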
Prompt and pray with extra steps
Many chatbots do more than return information: we want them to be transactional, interacting with APIs and getting things done on our behalf.
One approach is to let the LLM make all the decisions: make a plan, take a step, critique the output, make a new plan, etc. At each step, we feed intermediate output back into the LLM, and generate more output, amplifying the uncertainty each time. [Some people call this approach “agentic”, but I think to most people “agentic AI” just means “a chatbot that does stuff”.]
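Sketched out (again with a hypothetical `call_llm` stub, plus a made-up `execute_action` helper), the loop looks like this. Notice that after the first turn, every input to the model is itself model output, so there is no point where the uncertainty gets squeezed back out:

```python
def execute_action(action: str) -> str:
    """Hypothetical stand-in for calling a tool or API based on the model's text."""
    return f"result of: {action}"

def run_agent(goal: str, max_steps: int = 5) -> str:
    plan = call_llm(f"Make a step-by-step plan to achieve: {goal}")
    result = ""
    for _ in range(max_steps):
        # The next action is chosen from the previous LLM output...
        action = call_llm(f"Plan:\n{plan}\nResults so far: {result}\nWhat is the next action?")
        result = execute_action(action)
        critique = call_llm(f"Action: {action}\nResult: {result}\nCritique this outcome.")
        # ...and the revised plan is generated from text that may already be wrong.
        plan = call_llm(f"Revise the plan given this critique:\n{critique}")
        if "DONE" in plan:
            break
    return result
```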
But given how fragile this approach is, it’s not surprising that reliability is the #1 reason companies aren’t putting agents in production.
You Don’t Get Brownie Points for LLM Laundering
Why were we asking the LLM to make a plan anyway? If a user needs to reset their password, we know what the steps are. The user’s experience isn’t better because we had an LLM guess the steps. In fact, this approach isn’t just unreliable, it’s also slower and more expensive than simply writing the steps down.
If your business logic is going to change every time a user interacts with your agent, then it might make sense to have an LLM guess the right set of steps. But if your logic is already known, there are only downsides to laundering that information through an LLM.
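For comparison, here is a minimal sketch of just writing the known steps down (plain Python, not any particular framework’s flow syntax; the handlers are hypothetical): the sequence is deterministic, testable, and costs nothing to “discover”.

```python
# The password-reset flow is known business logic; nothing here needs to be guessed.
HANDLERS = {
    "verify_identity": lambda user_id: print(f"Verifying identity of {user_id}"),
    "send_reset_link": lambda user_id: print(f"Emailing reset link to {user_id}"),
    "confirm_reset":   lambda user_id: print(f"Confirming reset for {user_id}"),
}

PASSWORD_RESET_STEPS = ["verify_identity", "send_reset_link", "confirm_reset"]

def reset_password(user_id: str) -> None:
    # Deterministic flow: same steps, same order, every time.
    for step in PASSWORD_RESET_STEPS:
        HANDLERS[step](user_id)
```

An LLM can still handle the conversational surface (understanding that the user wants a reset, phrasing the replies), but the flow itself stays out of the model’s hands.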
If you want an LLM-based chatbot that actually executes tasks reliably, you need a system like CALM.
So, how do you spot “prompt and pray” in disguise?
If the people pitching it:
- Talk about building reliable assistants but only offer a way to observe what’s happening,
- Talk about ‘guardrails’ but actually just offer content filtering on LLM input & output,
- Promise to ‘tackle hallucinations’ by adding RAG,
- Brag about ‘dynamic LLM reasoning’ without explaining why the process should change for every user interaction,
Then it’s just “prompt and pray” all the way down.